init research

This commit is contained in:
2026-02-08 11:20:43 -10:00
commit bdf064f54d
3041 changed files with 1592200 additions and 0 deletions
@@ -0,0 +1,180 @@
[//]: # (title: Data Schemas in Kotlin Notebook)
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.Schemas-->
After execution of a cell
<!---FUN createDfNullable-->
```kotlin
val df = dataFrameOf("name", "age")(
"Alice", 15,
"Bob", null,
)
```
<!---END-->
the following actions take place:
1. Columns in `df` are analyzed to extract data schema
2. Empty interface with [`DataSchema`](schema.md) annotation is generated:
```kotlin
@DataSchema
interface DataFrameType
```
3. Extension properties for this [`DataSchema`](schema.md) are generated:
```kotlin
val ColumnsContainer<DataFrameType>.age: DataColumn<Int?> @JvmName("DataFrameType_age") get() = this["age"] as DataColumn<Int?>
val DataRow<DataFrameType>.age: Int? @JvmName("DataFrameType_age") get() = this["age"] as Int?
val ColumnsContainer<DataFrameType>.name: DataColumn<String> @JvmName("DataFrameType_name") get() = this["name"] as DataColumn<String>
val DataRow<DataFrameType>.name: String @JvmName("DataFrameType_name") get() = this["name"] as String
```
Every column produces two extension properties:
* The property for `ColumnsContainer<DataFrameType>` returns the column
* The property for `DataRow<DataFrameType>` returns the cell value
4. `df` variable is typed by schema interface:
```kotlin
val temp = df
```
```kotlin
val df = temp.cast<DataFrameType>()
```
> _Note that the object instance remains the same after casting. See [cast](cast.md)._
To log all these additional code executions, use cell magic
```
%trackExecution -all
```
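The mechanism behind the generated properties can be sketched without the library. The following is a simplified model, not the library's implementation: `MiniFrame`, `MiniRow`, and `PersonMarker` are hypothetical stand-ins for `ColumnsContainer`, `DataRow`, and the generated marker interface. It only illustrates how a marker type parameter combined with extension properties gives typed access to columns and cells.

```kotlin
// Hypothetical, simplified stand-ins for illustration only; these are NOT the library's classes.
class MiniRow<T>(private val values: Map<String, Any?>) {
    operator fun get(name: String): Any? = values[name]
}

class MiniFrame<T>(val rows: List<MiniRow<T>>) {
    // Untyped access: returns all values of the column with the given name.
    operator fun get(name: String): List<Any?> = rows.map { it[name] }
}

// Marker interface: carries no data, it only identifies the schema at compile time.
interface PersonMarker

// One property per column on the container (returns the whole column) ...
@Suppress("UNCHECKED_CAST")
val MiniFrame<PersonMarker>.age: List<Int?> get() = this["age"] as List<Int?>

// ... and one on the row (returns a single cell value).
val MiniRow<PersonMarker>.age: Int? get() = this["age"] as Int?

fun main() {
    val frame = MiniFrame<PersonMarker>(
        listOf(
            MiniRow<PersonMarker>(mapOf("name" to "Alice", "age" to 15)),
            MiniRow<PersonMarker>(mapOf("name" to "Bob", "age" to null)),
        ),
    )
    println(frame.age)         // typed column access
    println(frame.rows[0].age) // typed cell access
}
```

Because the properties are declared only for the `PersonMarker` type argument, a frame typed with a different marker would not expose `age` at compile time.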
## Custom Data Schemas
You can define your own [`DataSchema`](schema.md) interfaces and use them in functions and classes to represent [`DataFrame`](DataFrame.md) with
a specific set of columns:
```kotlin
@DataSchema
interface Person {
val name: String
val age: Int
}
```
After execution of this cell in notebook or annotation processing in IDEA, extension properties for data access will be
generated. Now we can use these properties to create functions for typed [`DataFrame`](DataFrame.md):
```kotlin
fun DataFrame<Person>.splitName() = split { name }.by(",").into("firstName", "lastName")
fun DataFrame<Person>.adults() = filter { age > 18 }
```
In Kotlin Notebook these functions will work automatically for any [`DataFrame`](DataFrame.md) that matches `Person` schema:
<!---FUN extendedDf-->
```kotlin
val df = dataFrameOf("name", "age", "weight")(
"Merton, Alice", 15, 60.0,
"Marley, Bob", 20, 73.5,
)
```
<!---END-->
Schema of `df` is compatible with `Person`, so auto-generated schema interface will inherit from it:
```kotlin
@DataSchema(isOpen = false)
interface DataFrameType : Person
val ColumnsContainer<DataFrameType>.weight: DataColumn<Double> get() = this["weight"] as DataColumn<Double>
val DataRow<DataFrameType>.weight: Double get() = this["weight"] as Double
```
Although `df` has an additional `weight` column, the previously defined functions for `DataFrame<Person>` work for it:
<!---FUN splitNameWorks-->
```kotlin
df.splitName()
```
<!---END-->
```text
firstName lastName age weight
Merton Alice 15 60.000
Marley Bob 20 73.500
```
<!---FUN adultsWorks-->
```kotlin
df.adults()
```
<!---END-->
```text
name age weight
Marley, Bob 20 73.5
```
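Why functions declared for the base schema also apply to a frame with a derived schema can be sketched in plain Kotlin. This is a minimal model under stated assumptions: `Frame` is a hypothetical stand-in that is covariant in its marker type (declared with `out`), and `hasPersonColumns` is an illustrative helper, not a library function.

```kotlin
// Hypothetical stand-in for a typed frame; `out` makes it covariant in its marker type.
class Frame<out T>(val columnNames: List<String>)

interface Person                 // user-defined schema marker
interface DataFrameType : Person // plays the role of the auto-generated marker

// Declared once for the base schema ...
fun Frame<Person>.hasPersonColumns(): Boolean =
    columnNames.containsAll(listOf("name", "age"))

fun main() {
    val df = Frame<DataFrameType>(listOf("name", "age", "weight"))
    // ... and callable on the derived schema, because covariance makes
    // Frame<DataFrameType> a subtype of Frame<Person>.
    println(df.hasPersonColumns())
}
```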
## Use external Data Schemas
Sometimes it is convenient to extract reusable code from a Kotlin Notebook into a Kotlin JVM library.
If this code uses [Custom Data Schemas](#custom-data-schemas), the schema interfaces should be extracted as well.
To enable support for them in Kotlin Notebook, register them in the
library [integration class](https://github.com/Kotlin/kotlin-jupyter/blob/master/docs/libraries.md) with the `useSchema`
function:
```kotlin
@DataSchema
interface Person {
val name: String
val age: Int
}
fun DataFrame<Person>.countAdults() = count { it[Person::age] > 18 }
@JupyterLibrary
internal class Integration : JupyterIntegration() {
override fun Builder.onLoaded() {
onLoaded {
useSchema<Person>()
}
}
}
```
After loading this library into the notebook, schema interfaces for all [`DataFrame`](DataFrame.md) variables that match the `Person`
schema will derive from `Person`.
<!---FUN createDf-->
```kotlin
val df = dataFrameOf("name", "age")(
"Alice", 15,
"Bob", 20,
)
```
<!---END-->
Now `df` is assignable to `DataFrame<Person>` and `countAdults` is available:
```kotlin
df.countAdults()
```
@@ -0,0 +1,193 @@
# Data Schemas Generation From Existing DataFrame
<web-summary>
Generate useful Kotlin definitions based on your DataFrame structure.
</web-summary>
<card-summary>
Generate useful Kotlin definitions based on your DataFrame structure.
</card-summary>
<link-summary>
Generate useful Kotlin definitions based on your DataFrame structure.
</link-summary>
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.Generate-->
Special utility functions that generate code of useful Kotlin definitions (returned as a `String`)
based on the current `DataFrame` schema.
## generateDataClasses
```kotlin
inline fun <reified T> DataFrame<T>.generateDataClasses(
markerName: String? = null,
extensionProperties: Boolean = false,
visibility: MarkerVisibility = MarkerVisibility.IMPLICIT_PUBLIC,
useFqNames: Boolean = false,
nameNormalizer: NameNormalizer = NameNormalizer.default,
): CodeString
```
Generates Kotlin data classes corresponding to the `DataFrame` schema
(including all nested `DataFrame` columns and column groups).
Useful when you want to:
- Work with the data as regular Kotlin data classes.
- Convert a dataframe to instantiated data classes with `df.toListOf<DataClassType>()`.
- Work with data classes serialization.
- Extract structured types for further use in your application.
### Arguments {id="generateDataClasses-arguments"}
* `markerName`: `String?` - The base name to use for generated data classes.
If `null`, the simple name of the `DataFrame` type argument `T` is used.
Default: `null`.
* `extensionProperties`: `Boolean` - Whether to generate [extension properties](extensionPropertiesApi.md)
in addition to `data class` declarations.
Useful if you don't use the [compiler plugin](Compiler-Plugin.md); otherwise they are not needed, because
the [compiler plugin](Compiler-Plugin.md), [notebooks](SetupKotlinNotebook.md),
and the older [Gradle/KSP plugin](schemasGradle.md) generate them automatically.
Default: `false`.
* `visibility`: `MarkerVisibility` - Visibility modifier for the generated declarations.
Default: `MarkerVisibility.IMPLICIT_PUBLIC`.
* `useFqNames`: `Boolean` - If `true`, fully qualified type names are used in the generated code.
Default: `false`.
* `nameNormalizer`: `NameNormalizer` - Strategy for converting column names (with spaces, underscores, etc.) to
Kotlin-style identifiers.
Generated properties still refer to columns by their actual names using the `@ColumnName` annotation.
Default: `NameNormalizer.default`.
### Returns {id="generateDataClasses-returns"}
* `CodeString` - A value class wrapper for `String`, containing
the generated Kotlin code of `data class` declarations and optionally [extension properties](extensionPropertiesApi.md).
### Examples {id="generateDataClasses-examples"}
<!---FUN notebook_test_generate_docs_4-->
```kotlin
df.generateDataClasses("Customer")
```
<!---END-->
Output:
```kotlin
@DataSchema
data class Customer1(
val amount: Double,
val orderId: Int
)
@DataSchema
data class Customer(
val orders: List<Customer1>,
val user: String
)
```
Add these classes to your project and convert the DataFrame to a list of typed objects:
<!---FUN notebook_test_generate_docs_5-->
```kotlin
val customers: List<Customer> = df.cast<Customer>().toList()
```
<!---END-->
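Once the data is a list of ordinary objects, standard collection operations apply. The sketch below uses the `Customer` and `Customer1` shapes from the generated output above, with the `@DataSchema` annotation omitted so that it compiles without the library. The sample values and the `totalsByUser` helper are hypothetical.

```kotlin
// Same shapes as the generated classes above; @DataSchema is omitted here
// so the example has no dependency on the DataFrame library.
data class Customer1(val amount: Double, val orderId: Int)
data class Customer(val orders: List<Customer1>, val user: String)

// Hypothetical helper: sums the order amounts for each user.
fun totalsByUser(customers: List<Customer>): Map<String, Double> =
    customers.associate { customer ->
        customer.user to customer.orders.sumOf { order -> order.amount }
    }

fun main() {
    // Illustrative data; in practice this list would come from the DataFrame conversion.
    val customers = listOf(
        Customer(listOf(Customer1(10.0, 101), Customer1(5.5, 102)), "alice"),
        Customer(listOf(Customer1(7.0, 103)), "bob"),
    )
    println(totalsByUser(customers))
}
```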
## generateInterfaces
```kotlin
inline fun <reified T> DataFrame<T>.generateInterfaces(): CodeString
fun <T> DataFrame<T>.generateInterfaces(markerName: String): CodeString
```
Generates [`@DataSchema`](schemas.md) interfaces for this `DataFrame`
(including all nested `DataFrame` columns and column groups) as Kotlin interfaces.
This is useful when working with the [compiler plugin](Compiler-Plugin.md)
in cases where the schema cannot be inferred automatically from the source.
### Arguments {id="generateInterfaces-arguments"}
* `markerName`: `String?` - The base name to use for generated interfaces.
If `null`, the simple name of the `DataFrame` type argument `T` is used.
Default: `null`.
* `extensionProperties`: `Boolean` - Whether to generate [extension properties](extensionPropertiesApi.md)
in addition to `interface` declarations.
Useful if you don't use the [compiler plugin](Compiler-Plugin.md); otherwise they are not needed, because
the [compiler plugin](Compiler-Plugin.md), [notebooks](SetupKotlinNotebook.md),
and the older [Gradle/KSP plugin](schemasGradle.md) generate them automatically.
Default: `false`.
* `visibility`: `MarkerVisibility` - Visibility modifier for the generated declarations.
Default: `MarkerVisibility.IMPLICIT_PUBLIC`.
* `useFqNames`: `Boolean` - If `true`, fully qualified type names are used in the generated code.
Default: `false`.
* `nameNormalizer`: `NameNormalizer` - Strategy for converting column names (with spaces, underscores, etc.) to
Kotlin-style identifiers.
Generated properties still refer to columns by their actual names using the `@ColumnName` annotation.
Default: `NameNormalizer.default`.
### Returns {id="generateInterfaces-returns"}
* `CodeString` - A value class wrapper for `String`, containing
the generated Kotlin code of `@DataSchema` interfaces without [extension properties](extensionPropertiesApi.md).
### Examples {id="generateInterfaces-examples"}
<!---FUN notebook_test_generate_docs_1-->
```kotlin
df
```
<!---END-->
<inline-frame src="./resources/notebook_test_generate_docs_1.html" width="100%" height="500px"></inline-frame>
<!---FUN notebook_test_generate_docs_2-->
```kotlin
df.generateInterfaces()
```
<!---END-->
Output:
```kotlin
@DataSchema(isOpen = false)
interface _DataFrameType11 {
val amount: kotlin.Double
val orderId: kotlin.Int
}
@DataSchema
interface _DataFrameType1 {
val orders: List<_DataFrameType11>
val user: kotlin.String
}
```
By adding these interfaces to your project with the [compiler plugin](Compiler-Plugin.md) enabled,
you'll gain full support for the [extension properties API](extensionPropertiesApi.md) and type-safe operations.
Use [`cast`](cast.md) to apply the generated schema to a `DataFrame`:
<!---FUN notebook_test_generate_docs_3-->
```kotlin
df.cast<_DataFrameType1>().filter { orders.all { orderId >= 102 } }
```
<!---END-->
<!--inline-frame src="./resources/notebook_test_generate_docs_3.html" width="100%" height="500px"></inline-frame>-->
@@ -0,0 +1,53 @@
# Migration from Gradle/KSP Plugin
Gradle and KSP plugins were useful tools in earlier versions of Kotlin DataFrame.
However, they are now being phased out. This section provides an overview of their current state and migration guidance.
## Gradle Plugin
> Do not confuse this with the [compiler plugin](Compiler-Plugin.md), which is a Kotlin compiler plugin
> and has a different plugin ID.
> {style="note"}
1. **Generation of [data schemas](schemas.md)** from data sources
(files, databases, or external URLs).
- You could copy already generated schemas from `build/generated` into your project sources.
- To generate a `DataSchema` for a [`DataFrame`](DataFrame.md) now, use
the [`generate..()` methods](DataSchemaGenerationMethods.md).
2. **Generation of [extension properties](extensionPropertiesApi.md)** from data schemas
This is now handled by the [compiler plugin](Compiler-Plugin.md), which:
- Generates extension properties for declared data schemas.
- Automatically updates the schema and regenerates properties after structural DataFrame operations.
> The Gradle plugin still works and may be helpful for generating schemas from data sources.
> However, it is planned for deprecation, and **we do not recommend using it going forward**.
> {style="warning"}
If you still choose to use the Gradle plugin, make sure to disable the automatic KSP plugin dependency
to avoid compatibility issues with Kotlin 2.1+ by adding this line to `gradle.properties`:
```properties
kotlin.dataframe.add.ksp=false
```
## KSP Plugin
- **Generation of [data schemas](schemas.md)** from data sources
(files, databases, or external URLs).
- You could copy already generated schemas from `build/generated/ksp` into your project sources.
- To generate a `DataSchema` for a [`DataFrame`](DataFrame.md) now, use the
[`generate..()` methods](DataSchemaGenerationMethods.md) instead.
> The KSP plugin is **not compatible with [KSP2](https://github.com/google/ksp?tab=readme-ov-file#ksp2-is-here)**
> and may **not work properly with Kotlin 2.1 or newer**.
> It is planned for deprecation or major changes, and **we do not recommend using it at this time**.
> {style="warning"}
If you still choose to use the KSP plugin with Kotlin 2.1+,
disable [KSP2](https://github.com/google/ksp?tab=readme-ov-file#ksp2-is-here)
by adding this line to `gradle.properties`:
```properties
ksp.useKSP2=false
```
@@ -0,0 +1,234 @@
[//]: # (title: Gradle Plugin (deprecated))
> The current Gradle plugin is **under consideration for deprecation** and may be officially marked as deprecated in future releases.
>
> At the moment, **[data schema generation is handled via dedicated methods](DataSchemaGenerationMethods.md)** instead of relying on the plugin.
{style="warning"}
This page describes the Gradle plugin that generates `@DataSchema` from data samples.
```kotlin
id("org.jetbrains.kotlinx.dataframe") version "%dataFrameVersion%"
```
It's different from the DataFrame compiler plugin:
```kotlin
kotlin("plugin.dataframe") version "%compilerPluginKotlinVersion%"
```
Gradle plugin by default adds a KSP annotation processor to your build:
```kotlin
ksp("org.jetbrains.kotlinx.dataframe:symbol-processor-all:%dataFrameVersion%")
```
You should disable it if you want to use the Gradle plugin together with the compiler plugin.
Add this to `gradle.properties`:
```properties
kotlin.dataframe.add.ksp=false
```
## Examples
In the simplest case, a schema can be defined like this:
```kotlin
dataframes {
// output: build/generated/dataframe/main/kotlin/org/example/dataframe/JetbrainsRepositories.Generated.kt
schema {
data = "https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv"
}
}
```
Note that the names of the file and the interface are normalized: the source name is split on `_` and spaces, and the parts are joined in CamelCase.
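The normalization rule described above can be sketched in a few lines of plain Kotlin. This is an illustration of the rule as stated, not the plugin's actual code; the function name `normalizeName` and its default delimiter set are assumptions.

```kotlin
// Hypothetical sketch: split on the delimiters, lowercase each part,
// capitalize its first letter, and join the parts without separators.
fun normalizeName(raw: String, delimiters: Set<Char> = setOf('_', ' ')): String =
    raw.split(*delimiters.toCharArray())
        .filter { it.isNotEmpty() }
        .joinToString("") { part ->
            part.lowercase().replaceFirstChar { it.uppercase() }
        }

fun main() {
    println(normalizeName("jetbrains_repositories")) // prints JetbrainsRepositories
}
```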
You can set parsing options for CSV:
```kotlin
dataframes {
// output: build/generated/dataframe/main/kotlin/org/example/dataframe/JetbrainsRepositories.Generated.kt
schema {
data = "https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv"
csvOptions {
delimiter = ','
}
}
}
```
In this case, the output path depends on your directory structure.
For a project with the package `org.example`, the path will be
`build/generated/dataframe/main/kotlin/org/example/dataframe/JetbrainsRepositories.Generated.kt`.
Note that the name of the Kotlin file is derived from the name of the data file with the suffix
`.Generated`, and the package is derived from the directory structure with the child directory `dataframe`.
The name of the **data schema** itself is `JetbrainsRepositories`.
You can specify it explicitly:
```kotlin
schema {
// output: build/generated/dataframe/main/kotlin/org/example/dataframe/MyName.Generated.kt
data = "https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv"
name = "MyName"
}
```
If you want to change the default package for all schemas:
```kotlin
dataframes {
packageName = "org.example"
// Schemas...
}
```
You can also set `packageName` for a single schema:
```kotlin
dataframes {
// output: build/generated/dataframe/main/kotlin/org/example/data/Data.Generated.kt
schema {
packageName = "org.example.data"
data = file("path/to/data.csv")
}
}
```
If you want a non-default name and package, consider using a fully qualified name:
```kotlin
dataframes {
// output: build/generated/dataframe/main/kotlin/org/example/data/OtherName.Generated.kt
schema {
name = "org.example.data.OtherName"
data = file("path/to/data.csv")
}
}
```
By default, the plugin generates output in the `main` source set.
The source set can be specified for all schemas or for a specific schema:
```kotlin
dataframes {
packageName = "org.example"
sourceSet = "test"
// output: build/generated/dataframe/test/kotlin/org/example/Data.Generated.kt
schema {
data = file("path/to/data.csv")
}
// output: build/generated/dataframe/integrationTest/kotlin/org/example/Data.Generated.kt
schema {
sourceSet = "integrationTest"
data = file("path/to/data.csv")
}
}
```
If you need the generated files to be put in another directory, set `src`:
```kotlin
dataframes {
// output: schemas/org/example/test/OtherName.Generated.kt
schema {
data = "https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv"
name = "org.example.test.OtherName"
src = file("schemas")
}
}
```
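The output locations shown in the examples above follow one pattern: a root directory, then the package as a directory path, then the schema name with the `.Generated.kt` suffix. The sketch below reproduces that pattern for illustration only. `outputPath` is a hypothetical helper, not part of the plugin, and it takes the fully qualified schema name as input.

```kotlin
// Hypothetical helper reproducing the path pattern from the examples above.
fun outputPath(qualifiedName: String, sourceSet: String = "main", src: String? = null): String {
    // Default root mirrors the documented default for `src`.
    val root = src ?: "build/generated/dataframe/$sourceSet/kotlin"
    // Package part becomes a directory path; empty if the name has no package.
    val packagePath = qualifiedName.substringBeforeLast('.', "").replace('.', '/')
    val simpleName = qualifiedName.substringAfterLast('.')
    return listOf(root, packagePath, "$simpleName.Generated.kt")
        .filter { it.isNotEmpty() }
        .joinToString("/")
}

fun main() {
    println(outputPath("org.example.data.OtherName"))
    // prints build/generated/dataframe/main/kotlin/org/example/data/OtherName.Generated.kt
    println(outputPath("org.example.test.OtherName", src = "schemas"))
    // prints schemas/org/example/test/OtherName.Generated.kt
}
```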
## Schema Definitions from SQL Databases
To generate a schema for an existing SQL table,
you need to define a few parameters to establish a JDBC connection:
the URL (passed in the `data` field), username, and password.
The `tableName` parameter should also be specified so that the data from the table with that name is converted to a dataframe.
```kotlin
dataframes {
schema {
data = "jdbc:mariadb://localhost:3306/imdb"
name = "org.example.imdb.Actors"
jdbcOptions {
user = "root"
password = "pass"
tableName = "actors"
}
}
}
```
To generate a schema for the result of an SQL query,
you need to define the same connection parameters as before, together with the SQL query itself.
```kotlin
dataframes {
schema {
data = "jdbc:mariadb://localhost:3306/imdb"
name = "org.example.imdb.TarantinoFilms"
jdbcOptions {
user = "root"
password = "pass"
sqlQuery = """
SELECT name, year, rank,
GROUP_CONCAT (genre) as "genres"
FROM movies JOIN movies_directors ON movie_id = movies.id
JOIN directors ON directors.id=director_id LEFT JOIN movies_genres ON movies.id = movies_genres.movie_id
WHERE directors.first_name = "Quentin" AND directors.last_name = "Tarantino"
GROUP BY name, year, rank
ORDER BY year
"""
}
}
}
```
Find full example code [here](https://github.com/zaleslaw/KotlinDataFrame-SQL-Examples/blob/master/src/main/kotlin/Example_3_Import_schema_via_Gradle.kt).
**NOTE:** This is an experimental functionality and, for now,
we only support these databases: MariaDB, MySQL, PostgreSQL, SQLite, MS SQL, and DuckDB.
Additionally, support for JSON and date-time types is limited.
Please take this into consideration when using these functions.
## DSL reference
Inside `dataframes` you can configure parameters that will apply to all schemas.
Configuration inside `schema` will override these defaults for a specific schema.
Here is the full DSL for declaring data schemas:
```kotlin
dataframes {
sourceSet = "mySources" // [optional; default: "main"]
packageName = "org.jetbrains.data" // [optional; default: common package under source set]
visibility = // [optional; default: if explicitApiMode enabled then EXPLICIT_PUBLIC, else IMPLICIT_PUBLIC]
// KOTLIN SCRIPT: DataSchemaVisibility.INTERNAL, DataSchemaVisibility.IMPLICIT_PUBLIC, DataSchemaVisibility.EXPLICIT_PUBLIC
// GROOVY SCRIPT: 'internal', 'implicit_public', 'explicit_public'
withoutDefaultPath() // disable a default path for all schemas
// i.e., plugin won't copy "data" property of the schemas to generated companion objects
// split property names by delimiters (arguments of this method), lowercase parts and join to camel case
// enabled by default
withNormalizationBy('_') // [optional; default: ['\t', '_', ' ']]
withoutNormalization() // disable property names normalization
schema {
sourceSet /* String */ = "" // [optional; override default]
packageName /* String */ = "" // [optional; override default]
visibility /* DataSchemaVisibility */ = "" // [optional; override default]
src /* File */ = file("") // [optional; default: file("build/generated/dataframe/$sourceSet/kotlin")]
data /* URL | File | String */ = "" // Data in JSON or CSV formats
name = "org.jetbrains.data.Person" // [optional; default: from filename]
csvOptions {
delimiter /* Char */ = ';' // [optional; default: ',']
}
// See names normalization
withNormalizationBy('_') // enable property names normalization for this schema and use these delimiters
withoutNormalization() // disable property names normalization for this schema
withoutDefaultPath() // disable the default path for this schema
withDefaultPath() // enable the default path for this schema
}
}
```
@@ -0,0 +1,157 @@
[//]: # (title: Data Schemas in Gradle projects)
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.Schemas-->
> The current Gradle plugin is **under consideration for deprecation** and may be officially marked as deprecated in future releases.
>
> At the moment, **[data schema generation is handled via dedicated methods](DataSchemaGenerationMethods.md)** instead of relying on the plugin.
{style="warning"}
In Gradle projects, the Kotlin DataFrame library provides:
1. Annotation processing for generation of extension properties.
2. Annotation processing for [`DataSchema`](schemas.md) inference from datasets.
3. Gradle task for [`DataSchema`](schemas.md) inference from datasets.
### Configuration
To use the [extension properties API](extensionPropertiesApi.md) in a Gradle project, add the `dataframe` plugin as follows:
<tabs>
<tab title="Kotlin DSL">
```kotlin
plugins {
id("org.jetbrains.kotlinx.dataframe") version "%dataFrameVersion%"
}
dependencies {
implementation("org.jetbrains.kotlinx:dataframe:%dataFrameVersion%")
}
```
</tab>
<tab title="Groovy DSL">
```groovy
plugins {
id("org.jetbrains.kotlinx.dataframe") version "%dataFrameVersion%"
}
dependencies {
implementation 'org.jetbrains.kotlinx:dataframe:%dataFrameVersion%'
}
```
</tab>
</tabs>
### Annotation processing
Declare data schemas in your code and use them to access data in [`DataFrame`](DataFrame.md) objects.
A data schema is a class or interface annotated with [`@DataSchema`](schemas.md):
```kotlin
import org.jetbrains.kotlinx.dataframe.annotations.DataSchema
@DataSchema
interface Person {
val name: String
val age: Int
}
```
#### Execute the `assemble` task to generate type-safe accessors for schemas:
<!---FUN useProperties-->
```kotlin
val df = dataFrameOf("name", "age")(
"Alice", 15,
"Bob", 20,
).cast<Person>()
// age only available after executing `build` or `kspKotlin`!
val teens = df.filter { age in 10..19 }
teens.print()
```
<!---END-->
### Schema inference
Specify the schema using your preferred method and execute the `assemble` task.
<tabs>
<tab title="Method 1. Annotation processing">
The `@ImportDataSchema` annotation must be placed above the package directive.
You can import schemas from a URL or from a relative file path.
By default, relative paths are resolved against the project root directory.
You can configure this by [passing](https://kotlinlang.org/docs/ksp-quickstart.html#pass-options-to-processors) the `dataframe.resolutionDir`
option to the processor.
For example:
```kotlin
ksp {
arg("dataframe.resolutionDir", file("data").absolutePath)
}
```
**Note that due to incremental processing, an imported schema is re-generated only if the source code has changed
since the previous invocation, even by a single character.**
For the following configuration, the file `Repository.Generated.kt` will be generated in the `build/generated/ksp/` folder, in
the same package as the file containing the annotation.
```kotlin
@file:ImportDataSchema(
"Repository",
"https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv",
)
import org.jetbrains.kotlinx.dataframe.annotations.ImportDataSchema
import org.jetbrains.kotlinx.dataframe.api.*
```
See KDocs for `@ImportDataSchema` in IDE
or [GitHub](https://github.com/Kotlin/dataframe/blob/master/core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/annotations/ImportDataSchema.kt)
for more details.
</tab>
<tab title="Method 2. Gradle task">
Put this in `build.gradle` or `build.gradle.kts`.
For the following configuration, the file `Repository.Generated.kt` will be generated
in the `build/generated/dataframe/main/kotlin/org/example` folder.
```kotlin
dataframes {
schema {
data = "https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv"
name = "org.example.Repository"
}
}
```
See [reference](Gradle-Plugin.md) and [examples](Gradle-Plugin.md#examples) for more details.
</tab>
</tabs>
After `assemble`, the following code should compile and run:
<!---FUN useInferredSchema-->
```kotlin
// Repository.readCsv() has argument 'path' with default value https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv
val df = Repository.readCsv()
// Use generated properties to access data in rows
df.maxBy { stargazersCount }.print()
// Or to access columns in dataframe.
print(df.fullName.count { it.contains("kotlin") })
```
<!---END-->
@@ -0,0 +1,75 @@
[//]: # (title: Import OpenAPI Schemas in Gradle project (Experimental))
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.Schemas-->
> The current Gradle plugin is **under consideration for deprecation** and may be officially marked as deprecated in future releases.
>
> At the moment, **[data schema generation is handled via dedicated methods](DataSchemaGenerationMethods.md)** instead of relying on the plugin.
{style="warning"}
<warning>
OpenAPI 3.0.0 schema support is marked as experimental. It might change or be removed in the future.
</warning>
JSON schema inference is great, but it's not perfect. However, more and more APIs offer
[OpenAPI (Swagger)](https://swagger.io/) specifications.
Aside from API endpoints, they also hold
[Data Models](https://swagger.io/docs/specification/data-models/) which include all the information about the types
that can be returned from or supplied to the API.
Why should we reinvent the wheel and write our own schema inference
when we can use the one provided by the API?
Not only will we now get the proper names of the types, but we will also
get enums, correct inheritance and overall better type safety.
First of all, you will need the extra dependency:
```kotlin
implementation("org.jetbrains.kotlinx:dataframe-openapi:$dataframe_version")
```
OpenAPI type schemas can be generated using both methods described above:
```kotlin
@file:ImportDataSchema(
path = "https://petstore3.swagger.io/api/v3/openapi.json",
name = "PetStore",
enableExperimentalOpenApi = true,
)
import org.jetbrains.kotlinx.dataframe.annotations.ImportDataSchema
```
```kotlin
dataframes {
schema {
data = "https://petstore3.swagger.io/api/v3/openapi.json"
name = "PetStore"
}
enableExperimentalOpenApi = true
}
```
The only difference is that the name provided is now irrelevant, since the type names are provided by the OpenAPI spec.
(If you were wondering, yes, the Kotlin DataFrame library can tell the difference between an OpenAPI spec and normal JSON data)
After importing the data schema, you can now start to import any JSON data you like using the generated schemas.
For instance, one of the types in the schema above is `PetStore.Pet` (which can also be
explored [here](https://petstore3.swagger.io/)),
so let's parse some Pets:
```kotlin
val df: DataFrame<PetStore.Pet> =
PetStore.Pet.readJson("https://petstore3.swagger.io/api/v3/pet/findByStatus?status=available")
```
Now you will have a correctly typed [`DataFrame`](DataFrame.md)!
You can also always Ctrl+click on the `PetStore.Pet` type to see all the generated schemas.
If you experience any issues with the OpenAPI support (since there are many gotchas and edge-cases when converting
something as
type-fluid as JSON to a strongly typed language), please open an issue on
the [GitHub repo](https://github.com/Kotlin/dataframe/issues).
@@ -0,0 +1,141 @@
[//]: # (title: Import SQL Metadata as a Schema in Gradle Project)
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.Schemas-->
> The current Gradle plugin is **under consideration for deprecation** and may be officially marked as deprecated in future releases.
>
> At the moment, **[data schema generation is handled via dedicated methods](DataSchemaGenerationMethods.md)** instead of relying on the plugin.
{style="warning"}
Each SQL database contains metadata for all of its tables.
This metadata can be used for schema generation.
**NOTE:** Visit this [page](readSqlDatabases.md) to see how to set up all Gradle dependencies for your project.
### With `@file:ImportDataSchema`
To generate a schema for an existing SQL table,
you need to define a few parameters to establish a JDBC connection:
URL, username, and password.
The `tableName` parameter should also be specified, naming the table to read.
You should also specify the name of the generated Kotlin class
as the first parameter of the `@file:ImportDataSchema` annotation.
```kotlin
@file:ImportDataSchema(
"Directors",
URL,
jdbcOptions = JdbcOptions(USER_NAME, PASSWORD, tableName = TABLE_NAME_DIRECTORS)
)
package org.jetbrains.kotlinx.dataframe.examples.jdbc
import org.jetbrains.kotlinx.dataframe.annotations.ImportDataSchema
```
```kotlin
const val URL = "jdbc:mariadb://localhost:3306/imdb"
const val USER_NAME = "root"
const val PASSWORD = "pass"
const val TABLE_NAME_DIRECTORS = "directors"
```
To generate a schema for the result of an SQL query,
you need to define the SQL query itself
and the same parameters to establish a connection with the database.
You should also specify the name of the generated Kotlin class
as the first parameter of the `@file:ImportDataSchema` annotation.
```kotlin
@file:ImportDataSchema(
"NewActors",
URL,
jdbcOptions = JdbcOptions(USER_NAME, PASSWORD, sqlQuery = ACTORS_IN_LATEST_MOVIES)
)
package org.jetbrains.kotlinx.dataframe.examples.jdbc
import org.jetbrains.kotlinx.dataframe.annotations.ImportDataSchema
```
```kotlin
const val URL = "jdbc:mariadb://localhost:3306/imdb"
const val USER_NAME = "root"
const val PASSWORD = "pass"
const val ACTORS_IN_LATEST_MOVIES = """
SELECT a.first_name, a.last_name, r.role, m.name AS movie_name, m.year
FROM actors a
INNER JOIN roles r ON a.id = r.actor_id
INNER JOIN movies m ON m.id = r.movie_id
WHERE m.year > 2000
"""
```
Find full example code [here](https://github.com/zaleslaw/KotlinDataFrame-SQL-Examples/blob/master/src/main/kotlin/Example_2_Import_schema_annotation.kt).
### With Gradle Task
To generate a schema for an existing SQL table,
you need to define a few parameters to establish a JDBC connection:
the URL (passed in the `data` field), username, and password.
The `tableName` parameter should also be specified so that the data from the table with that name is converted to a [`DataFrame`](DataFrame.md).
```kotlin
dataframes {
schema {
data = "jdbc:mariadb://localhost:3306/imdb"
name = "org.example.imdb.Actors"
jdbcOptions {
user = "root"
password = "pass"
tableName = "actors"
}
}
}
```
To generate a schema for the result of an SQL query,
you need to define the same connection parameters as before, together with the SQL query itself.
```kotlin
dataframes {
schema {
data = "jdbc:mariadb://localhost:3306/imdb"
name = "org.example.imdb.TarantinoFilms"
jdbcOptions {
user = "root"
password = "pass"
sqlQuery = """
SELECT name, year, rank,
GROUP_CONCAT (genre) as "genres"
FROM movies JOIN movies_directors ON movie_id = movies.id
JOIN directors ON directors.id=director_id LEFT JOIN movies_genres ON movies.id = movies_genres.movie_id
WHERE directors.first_name = "Quentin" AND directors.last_name = "Tarantino"
GROUP BY name, year, rank
ORDER BY year
"""
}
}
}
```
Find full example code [here](https://github.com/zaleslaw/KotlinDataFrame-SQL-Examples/blob/master/src/main/kotlin/Example_3_Import_schema_via_Gradle.kt).
After importing the data schema, you can use the generated schemas to load data from an SQL table
or from the result of an SQL query.
Now you will have a correctly typed [`DataFrame`](DataFrame.md)!
If you experience any issues with the SQL databases support (since there are many edge-cases when converting
SQL types from different databases to Kotlin types), please open an issue on
the [GitHub repo](https://github.com/Kotlin/dataframe/issues), specifying the database and the problem.
@@ -0,0 +1,147 @@
[//]: # (title: Data Schemas)
The Kotlin DataFrame library provides typed data access via
[generation of extension properties](extensionPropertiesApi.md) for the type
[`DataFrame<T>`](DataFrame.md) (as well as for [`DataRow<T>`](DataRow.md)), where
`T` is a marker class representing the `DataSchema` of the [`DataFrame`](DataFrame.md).
A *schema* of a [`DataFrame`](DataFrame.md) is a mapping from column names to column types.
This data schema can be expressed as a Kotlin class or interface.
If the DataFrame is hierarchical — contains a [column group](DataColumn.md#columngroup) or a
[column of dataframes](DataColumn.md#framecolumn) — the data schema reflects this structure,
with a separate class representing the schema of each column group or nested `DataFrame`.
For example, consider a simple hierarchical DataFrame from
<resource src="example.csv"></resource>.
This DataFrame consists of two columns:
- `name`, which is a `String` column
- `info`, which is a [column group](DataColumn.md#columngroup) containing two nested [value columns](DataColumn.md#valuecolumn):
- `age` of type `Int`
- `height` of type `Double`
<table width="705">
<thead>
<tr>
<th>name</th>
<th colspan="2">info</th>
</tr>
<tr>
<th></th>
<th>age</th>
<th>height</th>
</tr>
</thead>
<tbody>
<tr>
<td>Alice</td>
<td>23</td>
<td>175.5</td>
</tr>
<tr>
<td>Bob</td>
<td>27</td>
<td>160.2</td>
</tr>
</tbody>
</table>
The data schema corresponding to this DataFrame can be represented as:
```kotlin
// Data schema of the "info" column group
@DataSchema
data class Info(
val age: Int,
val height: Double
)
// Data schema of the entire DataFrame
@DataSchema
data class Person(
val info: Info,
val name: String
)
```
[Extension properties](extensionPropertiesApi.md) for `DataFrame<Person>`
are generated based on this schema and allow accessing columns
or using them in operations:
```kotlin
// Assuming `df` has type `DataFrame<Person>`
// Get "age" column from "info" group
df.info.age
// Select "name" and "height" columns
df.select { name and info.height }
// Filter rows by "age"
df.filter { info.age >= 18 }
```
See [](extensionPropertiesApi.md) for more information.
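To make the example concrete, here is a minimal sketch of how such a hierarchical DataFrame could be built in code instead of being read from the CSV file. It uses the String-based API, so it does not depend on generated extension properties. The variable names are illustrative.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

fun main() {
    // Flat frame with three value columns
    val flat = dataFrameOf("name", "age", "height")(
        "Alice", 23, 175.5,
        "Bob", 27, 160.2,
    )

    // Move "age" and "height" under a new column group named "info"
    val df = flat.group("age", "height").into("info")

    // Print the inferred schema: "name" plus the nested "info" group
    println(df.schema())

    // String API access to the nested "age" column
    println(df.getColumnGroup("info")["age"])
}
```

The printed schema should list `info` as a group containing `age` and `height`, which is the structure the `Info` and `Person` classes describe.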
## Schema Retrieving
Defining a data schema manually can be difficult, especially for dataframes with many columns or deeply nested
structures, and may lead to mistakes in column names or types.
Kotlin DataFrame provides several methods for generating data schemas.
* [**`generate..()` methods**](DataSchemaGenerationMethods.md) are extensions for [`DataFrame`](DataFrame.md)
(or for its [`schema`](schema.md)) that generate a code string representing its `DataSchema`.
* [**Kotlin DataFrame Compiler Plugin**](Compiler-Plugin.md) **cannot automatically infer** a
data schema from external sources such as files or URLs.
However, it **can** infer the schema if you construct the [`DataFrame`](DataFrame.md)
manually — that is, by explicitly declaring the columns using the API.
It will also **automatically update** the schema during operations that modify the structure of the DataFrame.
> For best results when working with the Compiler Plugin, it's recommended to
> generate the initial schema using one of
> the [`generate..()` methods](DataSchemaGenerationMethods.md).
> Once generated, the Compiler Plugin will automatically keep the schema up to date
> after any operations that change the structure of the DataFrame.
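
As an illustration of the first option, the sketch below builds a small DataFrame and prints generated schema code for it. The marker name `Person` is an arbitrary choice. The exact overloads of the `generate..()` methods may differ between library versions, so treat the call as an assumption to check against the [methods' documentation](DataSchemaGenerationMethods.md).

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

fun main() {
    val df = dataFrameOf("name", "age")(
        "Alice", 23,
        "Bob", 27,
    )

    // Returns the @DataSchema interface declaration(s) as a code string
    val code = df.generateInterfaces("Person")

    // The printed declarations can be pasted into the project and used as a schema
    println(code)
}
```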
### Plugins
> The current Gradle plugin is **under consideration for deprecation** and
> may be officially marked as deprecated in future releases.
>
> The KSP plugin is **not compatible with [KSP2](https://github.com/google/ksp?tab=readme-ov-file#ksp2-is-here)**
> and may **not work properly with Kotlin 2.1 or newer**.
>
> At the moment, **[data schema generation is handled via dedicated methods](DataSchemaGenerationMethods.md)** instead of relying on the plugins.
{style="warning"}
* The [Gradle plugin](Gradle-Plugin.md) allows generating a data schema automatically by specifying a source file path in the Gradle build script.
* The KSP plugin allows generating a data schema automatically using
[Kotlin Symbol Processing](https://kotlinlang.org/docs/ksp-overview.html) by specifying
a source file path in your code file.
## Extension Properties Generation
Once you have a data schema, you can generate [extension properties](extensionPropertiesApi.md).
The easiest and most convenient way is to use the [**Kotlin DataFrame Compiler Plugin**](Compiler-Plugin.md),
which generates extension properties on the fly for declared data schemas
and automatically keeps them up to date after operations
that modify the structure of the [`DataFrame`](DataFrame.md).
> Extension properties generation was deprecated from the Gradle plugin in favor of the Compiler Plugin.
{style="warning"}
* When using Kotlin DataFrame inside [Kotlin Notebook](SetupKotlinNotebook.md),
the schema and extension properties
are generated automatically after each cell execution for all `DataFrame` variables declared in that cell.
See [extension properties example in Kotlin Notebook](extensionPropertiesApi.md#example).
> Compiler Plugin is coming to Kotlin Notebook soon.
* If you're not using the Compiler Plugin, you can still generate
[extension properties](extensionPropertiesApi.md) for a [`DataFrame`](DataFrame.md)
manually by calling one of the [`generate..()` methods](DataSchemaGenerationMethods.md)
with the `extensionProperties = true` argument.
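
A minimal sketch of that manual route, assuming `generateDataClasses` accepts the `markerName` and `extensionProperties` parameters as described on the linked methods page. The marker name `Person` is hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

fun main() {
    val df = dataFrameOf("name", "age")(
        "Alice", 23,
        "Bob", 27,
    )

    // Ask for the schema declaration together with its extension properties
    val code = df.generateDataClasses(
        markerName = "Person",
        extensionProperties = true,
    )

    // Copy the printed code into a source file to get typed column access
    println(code)
}
```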
@@ -0,0 +1,33 @@
[//]: # (title: Import Data Schemas, e.g. from OpenAPI, in Kotlin Notebook)
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.Schemas-->
<warning>
OpenAPI 3.0.0 schema support is marked as experimental. It might change or be removed in the future.
</warning>
Similar to [importing OpenAPI Data Schemas in Gradle projects](schemasImportOpenApiGradle.md),
you can also do this in Kotlin Notebook.
This requires enabling the `enableExperimentalOpenApi` setting, like:
```
%use dataframe(..., enableExperimentalOpenApi=true)
```
There is only a slight difference in notation:
Import the schema using any path (`String`), `URL`, or `File`:
```kotlin
val PetStore = importDataSchema("https://petstore3.swagger.io/api/v3/openapi.json")
```
and then from the next cell you run and onwards, you can call, for example:
```kotlin
val df = PetStore.Pet.readJson("https://petstore3.swagger.io/api/v3/pet/findByStatus?status=available")
```
So, very similar indeed!
(Note: The type of `PetStore` will be generated as `PetStoreDataSchema`, but this doesn't affect the way you can use
it.)
@@ -0,0 +1,38 @@
[//]: # (title: Schema inheritance)
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.Schemas-->
To reduce the amount of generated code, previously generated [`DataSchema`](schema.md) interfaces are reused and only new
properties are introduced.
Let's filter out all `null` values from the `age` column and add one more column of type `Boolean`:
```kotlin
val filtered = df.filter { age != null }.add("isAdult") { age!! > 18 }
```
The new schema interface for the `filtered` variable is derived from the previously generated `DataFrameType`:
```kotlin
@DataSchema
interface DataFrameType1 : DataFrameType
```
Extension properties for data access are generated only for new and overridden members of the `DataFrameType1` interface:
```kotlin
val ColumnsContainer<DataFrameType1>.age: DataColumn<Int> get() = this["age"] as DataColumn<Int>
val DataRow<DataFrameType1>.age: Int get() = this["age"] as Int
val ColumnsContainer<DataFrameType1>.isAdult: DataColumn<Boolean> get() = this["isAdult"] as DataColumn<Boolean>
val DataRow<DataFrameType1>.isAdult: Boolean get() = this["isAdult"] as Boolean
```
Then the variable `filtered` is cast to the new interface:
```kotlin
val temp = filtered
```
```kotlin
val filtered = temp.cast<DataFrameType1>()
```
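Why does `age` become non-nullable on `filtered` while `name` keeps working without being regenerated? The following library-free sketch models the mechanism. `Row` is a simplified stand-in for `DataRow`, not the library class. It only illustrates two things: a covariant receiver lets extension properties declared for `DataFrameType` apply to `DataFrameType1`, and the extension with the more specific receiver wins when both match.

```kotlin
// Simplified stand-in for the library's DataRow<T> (assumption, not the real class)
class Row<out T>(private val values: Map<String, Any?>) {
    operator fun get(name: String): Any? = values[name]
}

interface DataFrameType
interface DataFrameType1 : DataFrameType

// Properties generated for the original schema
val Row<DataFrameType>.name: String
    @JvmName("DataFrameType_name") get() = this["name"] as String
val Row<DataFrameType>.age: Int?
    @JvmName("DataFrameType_age") get() = this["age"] as Int?

// Only the overridden and the new property are generated for the derived schema
val Row<DataFrameType1>.age: Int
    @JvmName("DataFrameType1_age") get() = this["age"] as Int
val Row<DataFrameType1>.isAdult: Boolean
    @JvmName("DataFrameType1_isAdult") get() = this["isAdult"] as Boolean

fun main() {
    val row = Row<DataFrameType1>(mapOf("name" to "Alice", "age" to 20, "isAdult" to true))

    val age: Int = row.age        // resolves to the DataFrameType1 property
    val name: String = row.name   // reused from DataFrameType via covariance

    // Viewed through the base schema, age is nullable again
    val base: Row<DataFrameType> = row
    val nullableAge: Int? = base.age

    println("$name $age ${row.isAdult} $nullableAge")
}
```

The `@JvmName` annotations mirror the generated code shown earlier and keep the two `age` getters from sharing a JVM method name.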