init research
@@ -0,0 +1,214 @@
[//]: # (title: Extension Properties API)

When working with a [`DataFrame`](DataFrame.md), the most convenient and reliable way
to access its columns, including for operations and retrieving column values
in row expressions, is through *auto-generated extension properties*.
They are generated based on a [dataframe schema](schemas.md),
with the name and type of each property inferred from the name and type of the corresponding column.
This also works for all types of hierarchical dataframes.
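
To make the idea of generated properties more concrete, here is a hand-written sketch of roughly what such accessors could look like for a single `name` column. The marker interface `PersonSketch`, the helper `sampleFrame()`, and both property declarations are hypothetical and written by hand for illustration; the code actually generated in Kotlin Notebook or by the Compiler Plugin may differ in shape and naming. The sketch also assumes a library version that provides the `dataFrameOf("name" to listOf(...))` overload.

```kotlin
import org.jetbrains.kotlinx.dataframe.ColumnsContainer
import org.jetbrains.kotlinx.dataframe.DataColumn
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.DataRow
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical marker type standing in for a generated schema.
interface PersonSketch

// Column accessor: usable on the dataframe itself and inside column selectors.
@Suppress("UNCHECKED_CAST")
val ColumnsContainer<PersonSketch>.name: DataColumn<String>
    get() = this["name"] as DataColumn<String>

// Value accessor: usable inside row expressions such as filter { ... }.
val DataRow<PersonSketch>.name: String
    get() = this["name"] as String

// Hypothetical sample data, cast to the marker type so the accessors apply.
fun sampleFrame(): DataFrame<PersonSketch> =
    dataFrameOf("name" to listOf("Alice", "Bob")).cast<PersonSketch>()

fun main() {
    val df = sampleFrame()
    println(df.name)                            // the whole column
    println(df.filter { name.startsWith("A") }) // `name` is a String here
}
```

The same property name resolves to a column on the dataframe and to a typed value inside a row expression, which is what lets both `df.name` and `filter { name.startsWith("A") }` type-check.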

> The behavior of data schema generation differs between the
> [Compiler Plugin](Compiler-Plugin.md) and [Kotlin Notebook](SetupKotlinNotebook.md).
>
> * In **Kotlin Notebook**, a schema is generated **only after cell execution** for
> `DataFrame` variables defined within that cell.
> * With the **Compiler Plugin**, a new schema is generated **after every operation**,
> but support for all operations is still in progress.
> Retrieving the schema for a `DataFrame` read from a file or URL is **not yet supported** either.
>
> This behavior may change in future releases. See the [example](#example) below that demonstrates these differences.
{style="warning"}

## Example

Consider a simple hierarchical dataframe from
<resource src="example.csv"></resource>.

This table consists of two columns: `name`, which is a `String` column, and `info`,
which is a [**column group**](DataColumn.md#columngroup) containing two nested
[value columns](DataColumn.md#valuecolumn):
`age` of type `Int`, and `height` of type `Double`.

<table width="705">
<thead>
<tr>
<th>name</th>
<th colspan="2">info</th>
</tr>
<tr>
<th></th>
<th>age</th>
<th>height</th>
</tr>
</thead>
<tbody>
<tr>
<td>Alice</td>
<td>23</td>
<td>175.5</td>
</tr>
<tr>
<td>Bob</td>
<td>27</td>
<td>160.2</td>
</tr>
</tbody>
</table>

<tabs>
<tab title="Kotlin Notebook">

Read the [`DataFrame`](DataFrame.md) from the CSV file:

```kotlin
val df = DataFrame.readCsv("example.csv")
```

**After cell execution**, a data schema and extension properties for this `DataFrame` are generated,
so you can use them to access columns
and in operations inside the [Column Selector DSL](ColumnSelectors.md)
and [DataRow API](DataRow.md):

```kotlin
// Get nested column
df.info.age
// Sort by multiple columns
df.sortBy { name and info.height }
// Filter rows using a row condition.
// These extensions express the exact value in the row
// with the corresponding type:
df.filter { name.startsWith("A") && info.age >= 16 }
```

If you change the dataframe's schema by changing a column's [name](rename.md)
or [type](convert.md), or by [adding](add.md) a new column, you first need to
run a cell that declares the new [`DataFrame`](DataFrame.md).
For example, rename the `name` column to `firstName`:

```kotlin
val dfRenamed = df.rename { name }.into("firstName")
```

After running the cell with the code above, you can use `firstName` extensions in the following cells:

```kotlin
dfRenamed.firstName
dfRenamed.rename { firstName }.into("name")
dfRenamed.filter { firstName == "Nikita" }
```

See the [](quickstart.md) in Kotlin Notebook with basic Extension Properties API examples.

</tab>
<tab title="Compiler Plugin">

For now, if you read a [`DataFrame`](DataFrame.md) from a file or URL, you need to define its schema manually.
You can do this quickly with the [`generate..()` methods](DataSchemaGenerationMethods.md).

Define schemas:

```kotlin
@DataSchema
data class PersonInfo(
    val age: Int,
    // `Double`, to match the `height` column type described above
    val height: Double
)

@DataSchema
data class Person(
    val info: PersonInfo,
    val name: String
)
```

Read the [`DataFrame`](DataFrame.md) from the CSV file and specify the schema with
[`convertTo()`](convertTo.md) or [`cast()`](cast.md):

```kotlin
val df = DataFrame.readCsv("example.csv").convertTo<Person>()
```
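
The choice between the two calls matters when the column types in the data do not match the schema. The sketch below uses a small in-memory dataframe instead of `example.csv`; the `PersonFlat` schema, the `rawFrame()` helper, and the sample values are hypothetical. It assumes that `convertTo` converts column values to the types declared in the schema, whereas `cast` is expected to change only the compile-time type of the `DataFrame` without touching its contents.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.annotations.DataSchema
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical flat schema used only for this illustration.
@DataSchema
data class PersonFlat(val name: String, val age: Int)

// Hypothetical input where ages were stored as strings.
fun rawFrame() = dataFrameOf(
    "name" to listOf("Alice", "Bob"),
    "age" to listOf("23", "27"),
)

fun main() {
    val raw = rawFrame()
    // convertTo adjusts the data so that it matches the schema.
    val converted: DataFrame<PersonFlat> = raw.convertTo<PersonFlat>()
    println(raw["age"].type())       // column type before conversion
    println(converted["age"].type()) // column type after conversion
}
```

When the data is already known to match the schema, `cast` avoids the conversion step; otherwise `convertTo` is the safer choice.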

Extension properties for this `DataFrame` are generated automatically by the plugin,
so you can use them to access columns
and in operations inside the [Column Selector DSL](ColumnSelectors.md)
and [DataRow API](DataRow.md).

```kotlin
// Get nested column
df.info.age
// Sort by multiple columns
df.sortBy { name and info.height }
// Filter rows using a row condition.
// These extensions express the exact value in the row
// with the corresponding type:
df.filter { name.startsWith("A") && info.age >= 16 }
```

Moreover, new extensions are generated on the fly after each schema change,
that is, after changing a column's [name](rename.md)
or [type](convert.md), or after [adding](add.md) a new column.
For example, rename the `name` column to `firstName`; the `firstName` extension
can then be used in the following operations:

```kotlin
// Rename "name" column into "firstName"
df.rename { name }.into("firstName")
    // Can use `firstName` extension in the row condition
    // right after renaming
    .filter { firstName == "Nikita" }
```

See the [Compiler Plugin Example](https://github.com/Kotlin/dataframe/tree/plugin_example/examples/kotlin-dataframe-plugin-gradle-example)
IDEA project with basic Extension Properties API examples.
</tab>
</tabs>

## Properties name generation

By default, each extension property is generated with a name equal to the original column name.

```kotlin
val df = dataFrameOf("size_in_inches" to listOf(..))
df.size_in_inches
```

If the original column name cannot be used as a property name (for example, if it contains spaces
or matches a Kotlin keyword),
the property name will be enclosed in backticks.

```kotlin
val df = dataFrameOf("size in inches" to listOf(..))
df.`size in inches`
```

However, sometimes the original column name contains special symbols
that can't be used in a property name even in backticks.
In such cases, the special symbols are replaced in the auto-generated property name.

```kotlin
val df = dataFrameOf("size\nin:inches" to listOf(..))
df.`size in - inches`
```

> In such cases, use [**`rename`**](rename.md) to update column names,
> or [**`renameToCamelCase`**](rename.md#renametocamelcase) to convert all column names
> in a `DataFrame` to `camelCase`, which is the idiomatic and widely preferred naming style in Kotlin.
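
As a rough illustration of the two renaming options, the sketch below renames one awkward column by its original string name and, separately, converts all column names to camelCase. The `awkwardFrame()` helper, the `model_name` column, and the sample values are hypothetical; the exact name that `renameToCamelCase` produces for a column containing special symbols is not asserted here.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical dataframe with column names that make poor property names.
fun awkwardFrame() = dataFrameOf(
    "size\nin:inches" to listOf(5.5, 6.1),
    "model_name" to listOf("A", "B"),
)

fun main() {
    val df = awkwardFrame()

    // Rename a single column, addressing it by its original string name.
    val renamed = df.rename("size\nin:inches").into("sizeInInches")
    println(renamed.columnNames())

    // Convert all column names to camelCase in one call.
    println(df.renameToCamelCase().columnNames())
}
```

Addressing the column by its string name avoids relying on the replaced-symbol accessor just to get rid of it.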

If you don't want to change the actual column name, but you need a convenient accessor for this column,
you can use the `@ColumnName` annotation in a manually declared [data schema](schemas.md).
It allows you to use a property name different
from the original column name without changing the column's actual name:

```kotlin
@DataSchema
interface Info {
    @ColumnName("size\nin:inches")
    val sizeInInches: Double
}
```

```kotlin
val df = dataFrameOf("size\nin:inches" to listOf(..)).cast<Info>()
df.sizeInInches
```