[//]: # (title: Data Schemas)
The Kotlin DataFrame library provides typed data access via
[generation of extension properties](extensionPropertiesApi.md) for the type
[`DataFrame<T>`](DataFrame.md) (as well as for [`DataRow<T>`](DataRow.md)), where
`T` is a marker class representing the `DataSchema` of the [`DataFrame`](DataFrame.md).
A *schema* of a [`DataFrame`](DataFrame.md) is a mapping from column names to column types.
This data schema can be expressed as a Kotlin class or interface.
If the DataFrame is hierarchical — contains a [column group](DataColumn.md#columngroup) or a
[column of dataframes](DataColumn.md#framecolumn) — the data schema reflects this structure,
with a separate class representing the schema of each column group or nested `DataFrame`.
For example, consider a simple hierarchical DataFrame from
<resource src="example.csv"></resource>.
This DataFrame consists of two columns:
- `name`, which is a `String` column
- `info`, which is a [column group](DataColumn.md#columngroup) containing two nested [value columns](DataColumn.md#valuecolumn):
    - `age` of type `Int`
    - `height` of type `Double`
<table width="705">
<thead>
<tr>
<th>name</th>
<th colspan="2">info</th>
</tr>
<tr>
<th></th>
<th>age</th>
<th>height</th>
</tr>
</thead>
<tbody>
<tr>
<td>Alice</td>
<td>23</td>
<td>175.5</td>
</tr>
<tr>
<td>Bob</td>
<td>27</td>
<td>160.2</td>
</tr>
</tbody>
</table>
The data schema corresponding to this DataFrame can be represented as:
```kotlin
// Data schema of the "info" column group
@DataSchema
data class Info(
    val age: Int,
    val height: Double
)

// Data schema of the entire DataFrame
@DataSchema
data class Person(
    val info: Info,
    val name: String
)
```
[Extension properties](extensionPropertiesApi.md) for `DataFrame<Person>`
are generated based on this schema and allow accessing columns
or using them in operations:
```kotlin
// Assuming `df` has type `DataFrame<Person>`
// Get "age" column from "info" group
df.info.age
// Select "name" and "height" columns
df.select { name and info.height }
// Filter rows by "age", which is nested inside the "info" group
df.filter { info.age >= 18 }
```
See [](extensionPropertiesApi.md) for more information.
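
To make the hierarchical structure concrete, here is a minimal sketch that builds the same two-row table by hand and prints its schema. It assumes the `org.jetbrains.kotlinx:dataframe` dependency is on the classpath; `buildPeople` is a hypothetical helper name used only for illustration, and the values are typed in rather than read from `example.csv`.

```kotlin
import org.jetbrains.kotlinx.dataframe.AnyFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper: builds the example table in memory.
fun buildPeople(): AnyFrame =
    dataFrameOf("name", "age", "height")(
        "Alice", 23, 175.5,
        "Bob", 27, 160.2,
    )
        // Move "age" and "height" under a column group named "info"
        .group("age", "height").into("info")

fun main() {
    val df = buildPeople()

    // Print column names with their types; nested columns are listed under "info"
    println(df.schema())

    // String-based access works without any generated extension properties
    println(df.getColumnGroup("info")["age"])
}
```

The String API is used here on purpose: typed access such as `df.info.age` additionally requires extension properties for the schema, which this sketch does not generate.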
## Schema Retrieving
Defining a data schema manually can be difficult, especially for dataframes with many columns or deeply nested
structures, and may lead to mistakes in column names or types.
Kotlin DataFrame provides several methods for generating data schemas.
* [**`generate..()` methods**](DataSchemaGenerationMethods.md) are extensions for [`DataFrame`](DataFrame.md)
(or for its [`schema`](schema.md)) that generate a code string representing its `DataSchema`.
* [**Kotlin DataFrame Compiler Plugin**](Compiler-Plugin.md) **cannot automatically infer** a
data schema from external sources such as files or URLs.
However, it **can** infer the schema if you construct the [`DataFrame`](DataFrame.md)
manually — that is, by explicitly declaring the columns using the API.
It will also **automatically update** the schema during operations that modify the structure of the DataFrame.
> For best results when working with the Compiler Plugin, it's recommended to
> generate the initial schema using one of
> the [`generate..()` methods](DataSchemaGenerationMethods.md).
> Once generated, the Compiler Plugin will automatically keep the schema up to date
> after any operations that change the structure of the DataFrame.
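
As a rough illustration of the first option, the sketch below asks the library to produce data class declarations for an in-memory DataFrame. It assumes a library version that ships the `generate..()` methods and that `generateDataClasses` accepts the marker name as its first argument; check the linked page for the exact signature. `personSchemaCode` is a hypothetical helper name.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper: returns generated schema declarations as plain text.
fun personSchemaCode(): String {
    val df = dataFrameOf("name", "age", "height")(
        "Alice", 23, 175.5,
        "Bob", 27, 160.2,
    ).group("age", "height").into("info")

    // Assumption: the result is a code string whose toString() is the source text
    return df.generateDataClasses("Person").toString()
}

fun main() {
    // Copy the printed declarations into your project to use them as the initial schema
    println(personSchemaCode())
}
```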
### Plugins
> The current Gradle plugin is **under consideration for deprecation** and
> may be officially marked as deprecated in future releases.
>
> The KSP plugin is **not compatible with [KSP2](https://github.com/google/ksp?tab=readme-ov-file#ksp2-is-here)**
> and may **not work properly with Kotlin 2.1 or newer**.
>
> At the moment, **[data schema generation is handled via dedicated methods](DataSchemaGenerationMethods.md)** instead of relying on the plugins.
{style="warning"}
* The [Gradle plugin](Gradle-Plugin.md) allows generating a data schema automatically by specifying a source file path in the Gradle build script.
* The KSP plugin allows generating a data schema automatically using
[Kotlin Symbol Processing](https://kotlinlang.org/docs/ksp-overview.html) by specifying
a source file path in your code file.
## Extension Properties Generation
Once you have a data schema, you can generate [extension properties](extensionPropertiesApi.md).
The easiest and most convenient way is to use the [**Kotlin DataFrame Compiler Plugin**](Compiler-Plugin.md),
which generates extension properties on the fly for declared data schemas
and automatically keeps them up to date after operations
that modify the structure of the [`DataFrame`](DataFrame.md).
> Extension properties generation in the Gradle plugin was deprecated in favor of the Compiler Plugin.
{style="warning"}
* When using Kotlin DataFrame inside [Kotlin Notebook](SetupKotlinNotebook.md),
the schema and extension properties
are generated automatically after each cell execution for all `DataFrame` variables declared in that cell.
See [extension properties example in Kotlin Notebook](extensionPropertiesApi.md#example).
> Compiler Plugin is coming to Kotlin Notebook soon.
* If you're not using the Compiler Plugin, you can still generate
[extension properties](extensionPropertiesApi.md) for a [`DataFrame`](DataFrame.md)
manually by calling one of the [`generate..()` methods](DataSchemaGenerationMethods.md)
with the `extensionProperties = true` argument.
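
The manual route from the last item might look like the following sketch. It assumes `generateInterfaces` takes a marker name followed by the `extensionProperties` flag; treat the exact parameter list as an assumption and confirm it against the linked page. `personCodeWithExtensions` and the marker name `"Person"` are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper: generates schema interfaces together with extension properties.
fun personCodeWithExtensions(): String {
    val df = dataFrameOf("name", "age")(
        "Alice", 23,
        "Bob", 27,
    )

    // Assumption: with extensionProperties = true the output also contains
    // property declarations for typed column access
    return df.generateInterfaces("Person", extensionProperties = true).toString()
}

fun main() {
    // Paste the printed code into a source file to get typed access such as `df.name`
    println(personCodeWithExtensions())
}
```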