[//]: # (title: Create DataFrame)
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.Create-->

This section describes ways to create a [`DataFrame`](DataFrame.md) instance.

### emptyDataFrame

Returns a [`DataFrame`](DataFrame.md) with no rows and no columns.

<!---FUN createEmptyDataFrame-->

```kotlin
val df = emptyDataFrame<Any>()
```

<!---END-->

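As a quick sanity check, the sketch below confirms that the resulting frame has neither rows nor columns. It assumes the `rowsCount()` and `columnsCount()` accessors of the library; the function name `emptyFrameShape` is illustrative only.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper: returns (rows, columns) of an empty frame.
fun emptyFrameShape(): Pair<Int, Int> {
    val df = emptyDataFrame<Any>()
    return df.rowsCount() to df.columnsCount()
}
```
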
### dataFrameOf

<!---FUN createDataFrameOfPairs-->

```kotlin
// DataFrame with 2 columns and 3 rows
val df = dataFrameOf(
    "name" to listOf("Alice", "Bob", "Charlie"),
    "age" to listOf(15, 20, 100),
)
```

<!---END-->

Create a [`DataFrame`](DataFrame.md) with nested columns in place:

<!---FUN createNestedDataFrameInplace-->

```kotlin
// DataFrame with 2 columns and 3 rows
val df = dataFrameOf(
    "name" to columnOf(
        "firstName" to columnOf("Alice", "Bob", "Charlie"),
        "lastName" to columnOf("Cooper", "Dylan", "Daniels"),
    ),
    "age" to columnOf(15, 20, 100),
)
```

<!---END-->

<!---FUN createDataFrameFromColumns-->

```kotlin
// DataFrame with 2 columns
val df = dataFrameOf(
    "name" to columnOf("Alice", "Bob", "Charlie"),
    "age" to columnOf(15, 20, 22)
)
```

<!---END-->

Returns a [`DataFrame`](DataFrame.md) with given column names and values.

<!---FUN createDataFrameOf-->

```kotlin
// DataFrame with 2 columns and 3 rows
val df = dataFrameOf("name", "age")(
    "Alice", 15,
    "Bob", 20,
    "Charlie", 100,
)
```

<!---END-->

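With this overload the values are listed row by row, so their count is expected to be a multiple of the number of column names. A minimal sketch that reads one column back, assuming the `get` operator on a frame and the `values()` accessor on a column; the function name `agesFromRowMajorValues` is made up for illustration:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper: builds the frame from row-major values and
// returns the values of the "age" column.
fun agesFromRowMajorValues(): List<Any?> {
    val df = dataFrameOf("name", "age")(
        "Alice", 15,
        "Bob", 20,
        "Charlie", 100,
    )
    // With two column names, every second value lands in "age".
    return df["age"].values().toList()
}
```
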
### toDataFrame

#### `DataFrame` from `Map<String, List<*>>`:

<!---FUN createDataFrameFromMap-->

```kotlin
val map = mapOf("name" to listOf("Alice", "Bob", "Charlie"), "age" to listOf(15, 20, 22))

// DataFrame with 2 columns
map.toDataFrame()
```

<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.Create.createDataFrameFromMap.html" width="100%"/>
<!---END-->

#### `DataFrame` from random data:

Use `IntRange` to generate rows filled with random values:

<!---FUN createRandomDataFrame-->

```kotlin
val categories = listOf("Electronics", "Books", "Clothing")
// DataFrame with 4 columns and 7 rows
(0 until 7).toDataFrame {
    "productId" from { "P${1000 + it}" }
    "category" from { categories.random() }
    "price" from { Random.nextDouble(10.0, 500.0) }
    "inStock" from { Random.nextInt(0..100) }
}
```

<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.Create.createRandomDataFrame.html" width="100%"/>
<!---END-->

Generate a [`DataFrame`](DataFrame.md) with a nested [`ColumnGroup`](DataColumn.md#columngroup) and [`FrameColumn`](DataColumn.md#framecolumn):

<!---FUN createNestedRandomDataFrame-->

```kotlin
val categories = listOf("Electronics", "Books", "Clothing")
// DataFrame with 5 columns and 7 rows
(0 until 7).toDataFrame {
    "productId" from { "P${1000 + it}" }
    "category" from { categories.random() }
    "price" from { Random.nextDouble(10.0, 500.0) }

    // Column Group
    "manufacturer" {
        "country" from { listOf("USA", "China", "Germany", "Japan").random() }
        "yearEstablished" from { Random.nextInt(1950..2020) }
    }

    // Frame Column
    "reviews" from {
        val reviewCount = Random.nextInt(0..7)
        (0 until reviewCount).toDataFrame {
            val ratings: DataColumn<Int> = expr { Random.nextInt(1..5) }
            val comments = ratings.map {
                when (it) {
                    5 -> listOf("Amazing quality!", "Best purchase ever!", "Highly recommend!", "Absolutely perfect!")
                    4 -> listOf("Great product!", "Very satisfied", "Good value for money", "Would buy again")
                    3 -> listOf("It's okay", "Does the job", "Average quality", "Neither good nor bad")
                    2 -> listOf("Could be better", "Disappointed", "Not what I expected", "Poor quality")
                    else -> listOf("Terrible!", "Not worth the price", "Complete waste of money", "Do not buy!")
                }.random()
            }

            "author" from { "User${Random.nextInt(1000..10000)}" }
            ratings into "rating"
            comments into "comment"
        }
    }
}
```

<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.Create.createNestedRandomDataFrame.html" width="100%"/>
<!---END-->

Use `from` in combination with loops to generate a [`DataFrame`](DataFrame.md):

<!---FUN createDataFrameWithFill-->

```kotlin
// Multiplication table
(1..10).toDataFrame {
    (1..10).forEach { x ->
        "$x" from { x * it }
    }
}
```

<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.Create.createDataFrameWithFill.html" width="100%"/>
<!---END-->

#### `DataFrame` from [`Iterable`](https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.collections/-iterable/) of [basic types](https://kotlinlang.org/docs/basic-types.html) (except arrays):

The return type of these overloads is a typed [`DataFrame`](DataFrame.md).
Its data schema defines the `value` column, which can be used right after the conversion for additional computations.

<!---FUN readDataFrameFromValues-->

```kotlin
val names = listOf("Alice", "Bob", "Charlie")
// TODO fix with plugin???
val df = names.toDataFrame() as DataFrame<ValueProperty<String>>
df.add("length") { value.length }
```

<!---END-->

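The sample above relies on the single generated column being named `value`. A small sketch that makes this visible, assuming a library version in which basic types are wrapped this way (other releases may behave differently); `valueColumnNames` is an illustrative name:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper: lists the column names produced for a list of strings.
fun valueColumnNames(): List<String> {
    val names = listOf("Alice", "Bob", "Charlie")
    return names.toDataFrame().columnNames()
}
```
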
#### [`DataFrame`](DataFrame.md) with one column from [`Iterable<T>`](https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.collections/-iterable/)

This is an easy way to create a [`DataFrame`](DataFrame.md) when you have a list of files, URLs, or other structures
you want to extract data from.

In a notebook,
it can be convenient to start from a column of these values to see the number of rows and their `toString` representation in a table,
and then iteratively add columns with the parts of the data you're interested in.
These could be a file's content, a specific section of an HTML document, some metadata, etc.

<!---FUN toDataFrameColumn-->

```kotlin
val files = listOf(File("data.csv"), File("data1.csv"))
val df = files.toDataFrame(columnName = "data")
```

<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.Create.toDataFrameColumn.html" width="100%"/>
<!---END-->

#### [`DataFrame`](DataFrame.md) from `List<List<T>>`:

This is useful for parsing text files. For example, the `.srt` subtitle format can be parsed like this:

<!---FUN toDataFrameLists-->

```kotlin
val lines = """
    1
    00:00:05,000 --> 00:00:07,500
    This is the first subtitle.

    2
    00:00:08,000 --> 00:00:10,250
    This is the second subtitle.
""".trimIndent().lines()

lines.chunked(4) { it.take(3) }.toDataFrame(header = listOf("n", "timestamp", "text"))
```

<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.Create.toDataFrameLists.html" width="100%"/>
<!---END-->

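To see what is handed to `toDataFrame` here, the following sketch runs only the standard-library part of the pipeline: `chunked(4)` groups the lines into blocks of up to four (three payload lines plus the blank separator), and `take(3)` drops the separator. It uses no DataFrame API; `srtRows` and `sampleSrt` are illustrative names.

```kotlin
// Splits `.srt`-like text into rows of (index, timestamp, text).
fun srtRows(text: String): List<List<String>> =
    text.lines().chunked(4) { it.take(3) }

// The same two entries as in the sample, joined into one string.
val sampleSrt = listOf(
    "1",
    "00:00:05,000 --> 00:00:07,500",
    "This is the first subtitle.",
    "",
    "2",
    "00:00:08,000 --> 00:00:10,250",
    "This is the second subtitle.",
).joinToString("\n")
```

Each inner list becomes one row, and `header` names the three columns. The fixed chunk size assumes that every subtitle has exactly one line of text; entries with multi-line text would need a different splitting rule.
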
#### [`DataFrame`](DataFrame.md) from `Iterable<T>`:

<!---FUN readDataFrameFromObject-->

```kotlin
data class Person(val name: String, val age: Int)

val persons = listOf(Person("Alice", 15), Person("Bob", 20), Person("Charlie", 22))

val df = persons.toDataFrame()
```

<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.Create.readDataFrameFromObject.html" width="100%"/>
<!---END-->

Scans object properties using reflection and creates a [ValueColumn](DataColumn.md#valuecolumn) for every property.
The scope of properties for scanning is defined at compile-time by the formal types of the objects in the [`Iterable`](https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.collections/-iterable/),
so the properties of implementation classes will not be scanned.

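The effect of compile-time scoping can be illustrated with an interface and an implementation class. In this sketch (the `Shape` and `Circle` types and the `shapeColumnNames` function are invented for illustration), the list is declared as `List<Shape>`, so only the properties declared in `Shape` are expected to become columns, and `radius` is not scanned:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

interface Shape {
    val name: String
}

data class Circle(override val name: String, val radius: Double) : Shape

// Hypothetical helper: lists the columns created for a List<Shape>.
fun shapeColumnNames(): List<String> {
    // The formal element type is Shape, not Circle.
    val shapes: List<Shape> = listOf(Circle("small", 1.0), Circle("large", 5.0))
    return shapes.toDataFrame().columnNames()
}
```
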
Specify the `maxDepth` parameter to perform deep object graph traversal
and convert nested objects into [ColumnGroups](DataColumn.md#columngroup) and [FrameColumns](DataColumn.md#framecolumn):

<!---FUN readDataFrameFromDeepObject-->

```kotlin
data class Name(val firstName: String, val lastName: String)

data class Score(val subject: String, val value: Int)

data class Student(val name: Name, val age: Int, val scores: List<Score>)

val students = listOf(
    Student(Name("Alice", "Cooper"), 15, listOf(Score("math", 4), Score("biology", 3))),
    Student(Name("Bob", "Marley"), 20, listOf(Score("music", 5))),
)

val df = students.toDataFrame(maxDepth = 1)
```

<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.Create.readDataFrameFromDeepObject.html" width="100%"/>
<!---END-->

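One way to confirm how nested objects were converted is to look at the kind of each resulting column. The sketch below assumes the `kind()` accessor and the `ColumnKind` enum of the library; the data classes are repeated so that the block is self-contained, and `nestedColumnKinds` is an illustrative name. With `maxDepth = 1`, `name` is expected to be reported as a group, `scores` as a frame, and `age` as a value column.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*
import org.jetbrains.kotlinx.dataframe.columns.ColumnKind

data class Name(val firstName: String, val lastName: String)

data class Score(val subject: String, val value: Int)

data class Student(val name: Name, val age: Int, val scores: List<Score>)

// Hypothetical helper: maps every top-level column name to its kind.
fun nestedColumnKinds(): Map<String, ColumnKind> {
    val students = listOf(
        Student(Name("Alice", "Cooper"), 15, listOf(Score("math", 4), Score("biology", 3))),
        Student(Name("Bob", "Marley"), 20, listOf(Score("music", 5))),
    )
    val df = students.toDataFrame(maxDepth = 1)
    return df.columns().associate { it.name() to it.kind() }
}
```
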
For detailed control over object graph transformations, use the configuration DSL.
It allows you to exclude particular properties or classes from the object graph traversal,
compute additional columns, and configure column grouping.

<!---FUN readDataFrameFromDeepObjectWithExclude-->

```kotlin
val df = students.toDataFrame {
    // add column
    "year of birth" from { 2021 - it.age }

    // scan all properties
    properties(maxDepth = 1) {
        exclude(Score::subject) // `subject` property will be skipped from object graph traversal
        preserve<Name>() // `Name` objects will be stored as-is without transformation into DataFrame
    }

    // add column group
    "summary" {
        "max score" from { it.scores.maxOf { it.value } }
        "min score" from { it.scores.minOf { it.value } }
    }
}
```

<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.Create.readDataFrameFromDeepObjectWithExclude.html" width="100%"/>
<!---END-->

### DynamicDataFrameBuilder

The previously mentioned [`DataFrame`](DataFrame.md) constructors throw an exception when column names are duplicated.
When implementing a custom operation that involves multiple [`DataFrame`](DataFrame.md) objects or computed columns,
or when parsing third-party data,
it might be desirable to disambiguate column names instead of throwing an exception.

<!---FUN duplicatedColumns-->

```kotlin
fun peek(vararg dataframes: AnyFrame): AnyFrame {
    val builder = DynamicDataFrameBuilder()
    for (df in dataframes) {
        df.columns().firstOrNull()?.let { builder.add(it) }
    }
    return builder.toDataFrame()
}

val col by columnOf(1, 2, 3)
peek(dataFrameOf(col), dataFrameOf(col))
```

<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.Create.duplicatedColumns.html" width="100%"/>
<!---END-->

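The same builder can be used to put all columns of several frames side by side. The sketch below is a variation of `peek` (the names `combineColumns` and `combinedColumnNames` are made up for illustration). It assumes that all frames have the same number of rows, and it relies only on the generated names being unique, not on any particular renaming scheme:

```kotlin
import org.jetbrains.kotlinx.dataframe.AnyFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Adds every column of every frame; name clashes are left to the builder.
fun combineColumns(vararg dataframes: AnyFrame): AnyFrame {
    val builder = DynamicDataFrameBuilder()
    for (df in dataframes) {
        df.columns().forEach { builder.add(it) }
    }
    return builder.toDataFrame()
}

// Both frames contain an "id" column, so one of them has to be renamed.
fun combinedColumnNames(): List<String> {
    val left = dataFrameOf("id", "value")(1, "a", 2, "b")
    val right = dataFrameOf("id", "score")(1, 0.5, 2, 0.7)
    return combineColumns(left, right).columnNames()
}
```
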