Files
2026-02-08 11:20:43 -10:00

215 lines
6.6 KiB
Markdown
Vendored

[//]: # (title: Extension Properties API)
When working with a [`DataFrame`](DataFrame.md), the most convenient and reliable way
to access its columns — including for operations and retrieving column values
in row expressions — is through *auto-generated extension properties*.
They are generated based on a [dataframe schema](schemas.md),
with the name and type of properties inferred from the name and type of the corresponding columns.
It also works for all types of hierarchical dataframes.
> The behavior of data schema generation differs between the
> [Compiler Plugin](Compiler-Plugin.md) and [Kotlin Notebook](SetupKotlinNotebook.md).
>
> * In **Kotlin Notebook**, a schema is generated **only after cell execution** for
> `DataFrame` variables defined within that cell.
> * With the **Compiler Plugin**, a new schema is generated **after every operation**
> — but support for all operations is still in progress.
> Retrieving the schema for `DataFrame` read from a file or URL is **not yet supported** either.
>
> This behavior may change in future releases. See the [example](#example) below that demonstrates these differences.
{style="warning"}
## Example
Consider a simple hierarchical dataframe from
<resource src="example.csv"></resource>.
This table consists of two columns: `name`, which is a `String` column, and `info`,
which is a [**column group**](DataColumn.md#columngroup) containing two nested
[value columns](DataColumn.md#valuecolumn) —
`age` of type `Int`, and `height` of type `Double`.
<table width="705">
<thead>
<tr>
<th>name</th>
<th colspan="2">info</th>
</tr>
<tr>
<th></th>
<th>age</th>
<th>height</th>
</tr>
</thead>
<tbody>
<tr>
<td>Alice</td>
<td>23</td>
<td>175.5</td>
</tr>
<tr>
<td>Bob</td>
<td>27</td>
<td>160.2</td>
</tr>
</tbody>
</table>
<tabs>
<tab title="Kotlin Notebook">
Read the [`DataFrame`](DataFrame.md) from the CSV file:
```kotlin
val df = DataFrame.readCsv("example.csv")
```
**After cell execution** data schema and extensions for this `DataFrame` will be generated
so you can use extensions for accessing columns,
using it in operations inside the [Column Selector DSL](ColumnSelectors.md)
and [DataRow API](DataRow.md):
```kotlin
// Get nested column
df.info.age
// Sort by multiple columns
df.sortBy { name and info.height }
// Filter rows using a row condition.
// These extensions express the exact value in the row
// with the corresponding type:
df.filter { name.startsWith("A") && info.age >= 16 }
```
If you change the dataframe's schema by changing any column [name](rename.md),
or [type](convert.md) or [add](add.md) a new one, you need to
run a cell with a new [`DataFrame`](DataFrame.md) declaration first.
For example, rename the `name` column into "firstName":
```kotlin
val dfRenamed = df.rename { name }.into("firstName")
```
After running the cell with the code above, you can use `firstName` extensions in the following cells:
```kotlin
dfRenamed.firstName
dfRenamed.rename { firstName }.into("name")
dfRenamed.filter { firstName == "Nikita" }
```
See the [](quickstart.md) in Kotlin Notebook with basic Extension Properties API examples.
</tab>
<tab title="Compiler Plugin">
For now, if you read [`DataFrame`](DataFrame.md) from a file or URL, you need to define its schema manually.
You can do it quickly with [`generate..()` methods](DataSchemaGenerationMethods.md).
Define schemas:
```kotlin
@DataSchema
data class PersonInfo(
val age: Int,
val height: Float
)
@DataSchema
data class Person(
val info: PersonInfo,
val name: String
)
```
Read the [`DataFrame`](DataFrame.md) from the CSV file and specify the schema with
[`.convertTo()`](convertTo.md) or [`cast()`](cast.md):
```kotlin
val df = DataFrame.readCsv("example.csv").convertTo<Person>()
```
Extensions for this `DataFrame` will be generated automatically by the plugin,
so you can use extensions for accessing columns,
using it in operations inside the [Column Selector DSL](ColumnSelectors.md)
and [DataRow API](DataRow.md).
```kotlin
// Get nested column
df.info.age
// Sort by multiple columns
df.sortBy { name and info.height }
// Filter rows using a row condition.
// These extensions express the exact value in the row
// with the corresponding type:
df.filter { name.startsWith("A") && info.age >= 16 }
```
Moreover, new extensions will be generated on-the-fly after each schema change:
by changing any column [name](rename.md),
or [type](convert.md) or [add](add.md) a new one.
For example, rename the `name` column into "firstName" and then we can use `firstName` extensions
in the following operations:
```kotlin
// Rename "name" column into "firstName"
df.rename { name }.into("firstName")
// Can use `firstName` extension in the row condition
// right after renaming
.filter { firstName == "Nikita" }
```
See [Compiler Plugin Example](https://github.com/Kotlin/dataframe/tree/plugin_example/examples/kotlin-dataframe-plugin-gradle-example)
IDEA project with basic Extension Properties API examples.
</tab>
</tabs>
## Properties name generation
By default, each extension property is generated with a name equal to the original column name.
```kotlin
val df = dataFrameOf("size_in_inches" to listOf(..))
df.size_in_inches
```
If the original column name cannot be used as a property name (for example, if it contains spaces
or has a name equal to a keyword in Kotlin),
it will be enclosed in backticks.
```kotlin
val df = dataFrameOf("size in inches" to listOf(..))
df.`size in inches`
```
However, sometimes the original column name contains special symbols
and can't be used as a property name in backticks.
In such cases, special symbols in the auto-generated property name will be replaced.
```kotlin
val df = dataFrameOf("size\nin:inches" to listOf(..))
df.`size in - inches`
```
> In such cases, use [**`rename`**](rename.md) to update column names,
> or [**`renameToCamelCase`**](rename.md#renametocamelcase) to convert all column names
> in a `DataFrame` to `camelCase`, which is the idiomatic and widely preferred naming style in Kotlin.
If you don't want to change the actual column name, but you need a convenient accessor for this column,
you can use the `@ColumnName` annotation in a manually declared [data schema](schemas.md).
It allows you to use a property name different
from the original column name without changing the column's actual name:
```kotlin
@DataSchema
interface Info {
@ColumnName("size\nin:inches")
val sizeInInches: Double
}
```
```kotlin
val df = dataFrameOf("size\nin:inches" to listOf(..)).cast<Info>()
df.sizeInInches
```