init research

This commit is contained in:
2026-02-08 11:20:43 -10:00
commit bdf064f54d
3041 changed files with 1592200 additions and 0 deletions
@@ -0,0 +1,41 @@
[//]: # (title: DataColumn)
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.Create-->
[`DataColumn`](DataColumn.md) represents a column of values.
It can store objects of primitive or reference types,
or other [`DataFrame`](DataFrame.md) objects.
See [how to create columns](createColumn.md).
### Properties
* `name: String` — name of the column; should be unique within containing dataframe
* `path: ColumnPath` — path to the column; depends on the way column was retrieved from dataframe
* `type: KType` — type of elements in the column
* `hasNulls: Boolean` — flag indicating whether column contains `null` values
* `values: Iterable<T>` — column data
* `size: Int` — number of elements in the column
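
As a rough illustration of the properties listed above, the sketch below builds a small column and reads them back. It assumes the Kotlin DataFrame library is on the classpath and uses the function-style accessors (`name()`, `type()`, and so on); the helper `sampleColumn` and its values are made up for this example.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataColumn
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper: a nullable String column built from a plain list.
fun sampleColumn(): DataColumn<String?> = listOf("Alice", null, "Bob").toColumn("name")

fun main() {
    val col = sampleColumn()
    println(col.name())            // column name
    println(col.type())            // KType of the elements
    println(col.hasNulls())        // whether the column contains null values
    println(col.size())            // number of elements
    println(col.values().toList()) // column data
}
```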
### Column kinds
[`DataColumn`](DataColumn.md) instances can be one of three subtypes: `ValueColumn`, [`ColumnGroup`](DataColumn.md#columngroup) or [`FrameColumn`](DataColumn.md#framecolumn).
#### ValueColumn
Represents a sequence of values.
It can store values of primitive (integers, strings, decimals, etc.) or reference types.
Currently, it uses [`List`](https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.collections/-list/) as underlying data storage.
#### ColumnGroup
Container for nested columns. Used to create column hierarchy.
You can create column groups using the group operation or by splitting inward — see [group](group.md) and [split](split.md) for details.
#### FrameColumn
Special case of [`ValueColumn`](#valuecolumn) that stores other [`DataFrame`](DataFrame.md) objects as elements.
[`DataFrame`](DataFrame.md) objects stored in a [`FrameColumn`](DataColumn.md#framecolumn) may have different schemas.
[`FrameColumn`](DataColumn.md#framecolumn) may appear after [reading](read.md) from JSON or other hierarchical data structures, or after grouping operations such as [groupBy](groupBy.md) or [pivot](pivot.md).
@@ -0,0 +1,14 @@
[//]: # (title: DataFrame)
[`DataFrame`](DataFrame.md) represents a list of [`DataColumn`](DataColumn.md).
Columns in [`DataFrame`](DataFrame.md) must have equal size and unique names.
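
A minimal sketch of these two invariants, assuming the Kotlin DataFrame library is available. The helper names `samplePeople` and `columnSizes` and the sample values are illustrative only.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample: two columns, two rows.
fun samplePeople() = dataFrameOf("name", "age")(
    "Alice", 15,
    "Bob", 20,
)

// Sizes of all columns; for a valid dataframe they are all the same.
fun columnSizes(df: DataFrame<*>): List<Int> = df.columns().map { it.size() }

fun main() {
    val df = samplePeople()
    println(df.columnNames())  // column names are unique
    println(columnSizes(df))   // every column has the same size
    println(df.rowsCount())    // that shared size is the number of rows
}
```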
**Learn how to:**
- [Create DataFrame](createDataFrame.md)
- [Read DataFrame](read.md)
- [Get an overview of DataFrame](info.md)
- [Access data in DataFrame](access.md)
- [Modify data in DataFrame](modify.md)
- [Compute statistics for DataFrame](summaryStatistics.md)
- [Combine several DataFrame objects](multipleDataFrames.md)
@@ -0,0 +1,103 @@
[//]: # (title: DataRow)
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.DataRowApi-->
`DataRow` represents a single record, one piece of data within a [`DataFrame`](DataFrame.md).
## Row functions
<snippet id="rowFunctions">
* `index(): Int` — sequential row number in [`DataFrame`](DataFrame.md), starts from 0
* `prev(): DataRow?` — previous row (`null` for the first row)
* `next(): DataRow?` — next row (`null` for the last row)
* `diff(T) { rowExpression }: T / diffOrNull { rowExpression }: T?` — difference between the results of a [row expression](DataRow.md#row-expressions) calculated for current and previous rows
* `explode(columns): DataFrame<T>` — spread lists and [`DataFrame`](DataFrame.md) objects vertically into new rows
* `values(): List<Any?>` — list of all cell values from the current row
* `valuesOf<T>(): List<T>` — list of values of the given type
* `columnsCount(): Int` — number of columns
* `columnNames(): List<String>` — list of all column names
* `columnTypes(): List<KType>` — list of all column types
* `namedValues(): List<NameValuePair<Any?>>` — list of name-value pairs where `name` is a column name and `value` is cell value
* `namedValuesOf<T>(): List<NameValuePair<T>>` — list of name-value pairs where value has given type
* `transpose(): DataFrame<NameValuePair<*>>` — [`DataFrame`](DataFrame.md) of two columns: `name: String` is column names and `value: Any?` is cell values
* `transposeTo<T>(): DataFrame<NameValuePair<T>>` — [`DataFrame`](DataFrame.md) of two columns: `name: String` is column names and `value: T` is cell values
* `getRow(Int): DataRow` — row from [`DataFrame`](DataFrame.md) by row index
* `getRows(Iterable<Int>): DataFrame` — [`DataFrame`](DataFrame.md) with subset of rows selected by absolute row index.
* `relative(Iterable<Int>): DataFrame` — [`DataFrame`](DataFrame.md) with subset of rows selected by relative row index: `relative(-1..1)` will return previous, current and next row. Requested indices will be coerced to the valid range and invalid indices will be skipped
* `getValue<T>(columnName)` — cell value of type `T` by this row and given `columnName`
* `getValueOrNull<T>(columnName)` — cell value of type `T?` by this row and given `columnName` or `null` if there's no such column
* `get(column): T` — cell value by this row and given `column`
* `String.invoke<T>(): T` — cell value of type `T` by this row and given `this` column name
* `ColumnPath.invoke<T>(): T` — cell value of type `T` by this row and given `this` column path
* `ColumnReference.invoke(): T` — cell value of type `T` by this row and given `this` column
* `df()` — [`DataFrame`](DataFrame.md) that current row belongs to
</snippet>
## Row expressions
Row expressions provide a value for every row of [`DataFrame`](DataFrame.md) and are used in [add](add.md), [filter](filter.md), [forEach](iterate.md), [update](update.md) and other operations.
<!---FUN expressions-->
```kotlin
// Row expression computes values for a new column
df.add("fullName") { name.firstName + " " + name.lastName }
// Row expression computes updated values
df.update { weight }.at(1, 3, 4).with { prev()?.weight }
// Row expression computes cell content for values of pivoted column
df.pivot { city }.with { name.lastName.uppercase() }
```
<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.DataRowApi.expressions.html" width="100%"/>
<!---END-->
Row expression signature: ```DataRow.(DataRow) -> T```. Row values can be accessed with or without the ```it``` keyword. The implicit receiver and the explicit argument represent the same `DataRow` object.
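
The shape of this signature can be illustrated without the library. The sketch below uses a hypothetical `Row` class, a `RowExpression` type alias, and an `eval` helper (none of them are part of Kotlin DataFrame) to show that the receiver (`this`) and the argument (`it`) refer to the same object:

```kotlin
// Hypothetical stand-in for DataRow; not the library's class.
class Row(private val cells: Map<String, Any?>) {
    operator fun get(name: String): Any? = cells[name]
}

// Same shape as the row expression signature: the row is both receiver and argument.
typealias RowExpression<T> = Row.(Row) -> T

// Evaluates an expression by passing the row as receiver and as argument.
fun <T> Row.eval(expression: RowExpression<T>): T = expression(this, this)

fun main() {
    val row = Row(mapOf("firstName" to "Alice", "lastName" to "Cooper"))

    // Access through the implicit receiver.
    println(row.eval { "${this["firstName"]} ${this["lastName"]}" }) // Alice Cooper
    // Access through the explicit argument.
    println(row.eval { "${it["firstName"]} ${it["lastName"]}" })     // Alice Cooper
    // Both refer to the same object.
    println(row.eval { this === it })                                // true
}
```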
## Row conditions
Row condition is a special case of [row expression](#row-expressions) that returns `Boolean`.
<!---FUN conditions-->
```kotlin
// Row condition is used to filter rows by index
df.filter { index() % 5 == 0 }
// Row condition is used to drop rows where `age` is the same as in the previous row
df.drop { diffOrNull { age } == 0 }
// Row condition is used to filter rows for value update
df.update { weight }.where { index() > 4 && city != "Paris" }.with { 50 }
```
<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.DataRowApi.conditions.html" width="100%"/>
<!---END-->
Row condition signature: ```DataRow.(DataRow) -> Boolean```
## Row statistics
<snippet id="rowStatistics">
The following [statistics](summaryStatistics.md) are available for `DataRow`:
* `rowSum`
* `rowMean`
* `rowStd`
These statistics will be applied only to values of appropriate types, and incompatible values will be ignored.
For example, if a [dataframe](DataFrame.md) has columns of types `String` and `Int`,
`rowSum()` will compute the sum of the `Int` values in the row and ignore `String` values.
To apply statistics only to values of a particular type use `-Of` versions:
* `rowSumOf<T>`
* `rowMeanOf<T>`
* `rowStdOf<T>`
* `rowMinOf<T>`
* `rowMaxOf<T>`
* `rowMedianOf<T>`
* `rowPercentileOf<T>`
</snippet>
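
The "ignore incompatible values" behavior can be sketched in plain Kotlin. The helpers `sumOfInts` and `maxOfType` below are hypothetical and only mimic the idea on a list of cell values; they are not the library's `rowSum` or `rowMaxOf` implementations.

```kotlin
// Sums only the Int cells and skips everything else (strings, nulls, ...).
fun sumOfInts(rowValues: List<Any?>): Int = rowValues.filterIsInstance<Int>().sum()

// Analogue of an `-Of` variant: considers only cells of the requested type.
inline fun <reified T : Comparable<T>> maxOfType(rowValues: List<Any?>): T? =
    rowValues.filterIsInstance<T>().maxOrNull()

fun main() {
    // Cell values of one row with mixed types.
    val rowValues = listOf<Any?>("Alice", 15, 54, null, "London")

    println(sumOfInts(rowValues))         // 69
    println(maxOfType<Int>(rowValues))    // 54
    println(maxOfType<String>(rowValues)) // London
}
```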
@@ -0,0 +1,117 @@
[//]: # (title: Access APIs)
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.ApiLevels-->
By nature, dataframes are dynamic objects;
column labels depend on the input source and new columns can be added
or deleted while wrangling.
Kotlin, in contrast, is a statically typed language where all types are defined and verified
ahead of execution.
That's why creating a flexible, handy, and, at the same time, safe API to a dataframe is tricky.
In the Kotlin DataFrame library, we provide two different ways to access columns.
## List of Access APIs
Here's a list of all APIs in order of increasing safety.
* **String API** <br/>
Columns are accessed by a `String` representing their name. Both type-checking and name-checking are done at runtime.
* [**Extension Properties API**](extensionPropertiesApi.md) <br/>
Extension access properties are generated based on the dataframe schema. The name and type of properties are inferred
from the name and type of the corresponding columns.
## Example
Here's an example of how the same operations can be performed via different Access APIs:
<note>
In most of the code snippets in this documentation, there's a tab selector that lets you switch between Access APIs.
</note>
<tabs>
<tab title="String API">
<!---FUN strings-->
```kotlin
DataFrame.read("titanic.csv")
    .add("lastName") { "name"<String>().split(",").last() }
    .dropNulls("age")
    .filter {
        "survived"<Boolean>() &&
            "home"<String>().endsWith("NY") &&
            "age"<Int>() in 10..20
    }
```
<!---END-->
</tab>
<tab title = "Extension Properties API">
<!---FUN extensionProperties1-->
```kotlin
val df /* : AnyFrame */ = DataFrame.read("titanic.csv")
```
<!---END-->
<!---FUN extensionProperties2-->
```kotlin
df.add("lastName") { name.split(",").last() }
    .dropNulls { age }
    .filter { survived && home.endsWith("NY") && age in 10..20 }
```
<!---END-->
</tab>
</tabs>
The `titanic.csv` file can be found [here](https://github.com/Kotlin/dataframe/blob/master/data/titanic.csv).
## Comparing APIs
The String API is the simplest and least safe of them all. Its main advantage is that it can be
used at any time, including when accessing new columns in chain calls. So we can write something like:
```kotlin
df.add("weight") { ... } // add a new column `weight`, calculated by some expression
    .sortBy("weight") // sorting dataframe rows by its value
```
In contrast, generated [extension properties](extensionPropertiesApi.md) form the most convenient and the safest API.
Using them, you can always be sure that you work with correct data and types.
However, there's a bottleneck at the moment of generation.
To get new extension properties, you have to run a cell in a notebook,
which could lead to unnecessary variable declarations.
Currently, we are working on a compiler plugin that generates these properties on the fly while typing!
<table>
<tr>
<td> API </td>
<td> Type-checking </td>
<td> Column names checking </td>
<td> Column existence checking </td>
</tr>
<tr>
<td> String API </td>
<td> Runtime </td>
<td> Runtime </td>
<td> Runtime </td>
</tr>
<tr>
<td> Extension Properties API </td>
<td> Generation-time </td>
<td> Generation-time </td>
<td> Generation-time </td>
</tr>
</table>
@@ -0,0 +1,62 @@
# Concepts And Principles
<web-summary>
Learn what Kotlin DataFrame is about — its core concepts, design principles, and usage philosophy.
</web-summary>
<card-summary>
Discover the fundamentals of the library —
understand key concepts, motivation, and the overall structure of the library.
</card-summary>
<link-summary>
Explore the fundamentals of Kotlin DataFrame —
understand key concepts, motivation, and the overall structure of the library.
</link-summary>
<show-structure depth="3"/>
## What is a dataframe
A *dataframe* is an abstraction for working with structured data.
Essentially, it's a 2-dimensional table with labeled columns of potentially different types.
You can think of it like a spreadsheet or SQL table, or a dictionary of series objects.
The handiness of this abstraction is not in the table itself but in a set of operations defined on it.
The Kotlin DataFrame library is an idiomatic Kotlin DSL defining such operations.
Working with dataframes is often called *data wrangling*:
the process of transforming and mapping data from one "raw" form into another format
that is more appropriate for analytics and visualization.
The goal of data wrangling is to ensure that the data is of good quality and useful.
## Main Features and Concepts
* [**Hierarchical**](hierarchical.md) — the Kotlin DataFrame library provides an ability to read and present data from different sources,
including not only plain **CSV** but also **JSON** or **[SQL databases](readSqlDatabases.md)**.
This is why it was designed to be hierarchical and allows nesting of columns and cells.
* **Functional** — the data processing pipeline is organized in a chain of [`DataFrame`](DataFrame.md) transformation operations.
* **Immutable** — every operation returns a new instance of [`DataFrame`](DataFrame.md) reusing underlying storage wherever it's possible.
* **Readable** — data transformation operations are defined in DSL close to natural language.
* **Practical** — provides simple solutions for common problems and the ability to perform complex tasks.
* **Minimalistic** — simple, yet powerful data model of three [column kinds](DataColumn.md#column-kinds).
* [**Interoperable**](collectionsInterop.md) — convertible to and from Kotlin data classes and collections.
This also means conversion to/from other libraries' data structures is usually quite straightforward!
See our [examples](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources)
for some conversions between DataFrame and [Apache Spark](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/spark), [Multik](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/multik), and [JetBrains Exposed](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/exposed).
* **Generic** — can store objects of any type, not only numbers or strings.
* **Typesafe** — the Kotlin DataFrame library provides a mechanism of on-the-fly [**generation of extension properties**](extensionPropertiesApi.md)
that correspond to the columns of a dataframe.
In interactive notebooks like Jupyter or Datalore, the generation runs after each cell execution.
In IntelliJ IDEA, there's a Gradle plugin for generating properties based on a CSV or JSON file.
Also, we're working on a compiler plugin that infers and transforms [`DataFrame`](DataFrame.md) schema while typing.
You can now clone this [project with many examples](https://github.com/koperagen/df-plugin-demo) showcasing how it allows you to reliably use our most convenient extension properties API.
The generated properties ensure you'll never misspell a column name or mix up its type, and of course nullability is also preserved.
* [**Polymorphic**](schemas.md) —
if all columns of a [`DataFrame`](DataFrame.md) instance are present in another dataframe,
then the first one will be seen as a superclass for the latter.
This means you can define a function on an interface with some set of columns
and then execute it safely on any [`DataFrame`](DataFrame.md) which contains this same set of columns.
In notebooks, this works out-of-the-box.
In ordinary projects, this requires casting (for now).
@@ -0,0 +1,20 @@
[//]: # (title: Hierarchical data structures)
[`DataFrame`](DataFrame.md) can represent hierarchical data structures using two special types of columns:
* [`ColumnGroup`](DataColumn.md#columngroup) is a group of [columns](DataColumn.md)
* [`FrameColumn`](DataColumn.md#framecolumn) is a column of [dataframes](DataFrame.md)
You can read a [`DataFrame`](DataFrame.md) [from JSON](read.md#read-from-json) or [from an in-memory object graph](createDataFrame.md#todataframe), preserving the original tree structure.
Hierarchical columns can also appear as a result of some [modification operations](modify.md):
* [group](group.md) produces [`ColumnGroup`](DataColumn.md#columngroup)
* [groupBy](groupBy.md) produces [`FrameColumn`](DataColumn.md#framecolumn)
* [pivot](pivot.md) may produce [`FrameColumn`](DataColumn.md#framecolumn)
* [split](split.md) of [`FrameColumn`](DataColumn.md#framecolumn) will produce several [`ColumnGroup`](DataColumn.md#columngroup)
* [implode](implode.md) converts [`ColumnGroup`](DataColumn.md#columngroup) into [`FrameColumn`](DataColumn.md#framecolumn)
* [explode](explode.md) converts [`FrameColumn`](DataColumn.md#framecolumn) into [`ColumnGroup`](DataColumn.md#columngroup)
* [merge](merge.md) converts [`ColumnGroup`](DataColumn.md#columngroup) into [`FrameColumn`](DataColumn.md#framecolumn)
* etc.
Operations in the navigation tree are grouped such that you can find operations and their respective inverse together, like `group` and `ungroup`. This allows you to quickly find out how to simplify any hierarchical structure you come across.
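
A small sketch of the first two operations in the list above, assuming the Kotlin DataFrame library is on the classpath. The sample data and the helpers `people` and `kindsOf` are made up for illustration; the sketch inspects column kinds rather than relying on any particular generated column name.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import org.jetbrains.kotlinx.dataframe.columns.ColumnKind

// Flat sample data; names and values are hypothetical.
fun people() = dataFrameOf("firstName", "lastName", "city")(
    "Alice", "Cooper", "London",
    "Bob", "Dylan", "Dubai",
    "Charlie", "Byrd", "London",
)

// Kinds of the top-level columns, in column order.
fun kindsOf(df: DataFrame<*>): List<ColumnKind> = df.columns().map { it.kind() }

fun main() {
    val df = people()

    // group: two value columns are nested under a new ColumnGroup called "name".
    val grouped = df.group("firstName", "lastName").into("name")
    println(kindsOf(grouped))

    // groupBy: rows sharing a city are collected into a FrameColumn.
    val byCity = df.groupBy("city").toDataFrame()
    println(kindsOf(byCity))
}
```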
@@ -0,0 +1,24 @@
[//]: # (title: NaN and NA)
Using the Kotlin DataFrame library, you might come across the terms `NaN` and `NA`.
This page explains what they mean and how to work with them.
## NaN
`Float` or `Double` values can be represented as `NaN`,
in cases where a mathematical operation is undefined, such as dividing zero by zero. The
result of such an operation can only be described as "**N**ot **a** **N**umber".
This is different from `null`, which means that a value is missing and, in Kotlin, can only occur
for `Float?` and `Double?` types.
You can use [fillNaNs](fill.md#fillnans) to replace `NaNs` in certain columns with a given value or expression
or [dropNaNs](drop.md#dropnans) to drop rows with `NaNs` in them.
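
A short standard-library sketch (no DataFrame calls) of how `NaN` behaves and how it differs from `null`. The function names are illustrative only.

```kotlin
// 0.0 / 0.0 is undefined and yields NaN.
fun zeroByZero(): Double = 0.0 / 0.0

// A non-zero value divided by zero yields Infinity, which is not NaN.
fun oneByZero(): Double = 1.0 / 0.0

fun main() {
    val undefined = zeroByZero()
    val infinite = oneByZero()
    val missing: Double? = null // a missing value, only possible for nullable types

    println(undefined.isNaN())      // true
    println(infinite.isNaN())       // false
    println(undefined == undefined) // false: NaN is not equal to itself
    println(missing == null)        // true
}
```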
## NA
`NA` in Kotlin DataFrame means either [`NaN`](#nan) or `null`, which is another way of saying that the value
is "**N**ot **A**vailable".
You can use [fillNA](fill.md#fillna) to replace `NAs` in certain columns with a given value or expression
or [dropNA](drop.md#dropna) to drop rows with `NAs` in them.
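
The definition of `NA` can be written down as a small predicate. The sketch below is plain Kotlin with a hypothetical helper `isNotAvailable`; it is not the library's implementation, only an illustration of "either `NaN` or `null`", followed by drop-like and fill-like uses on an ordinary list.

```kotlin
// Hypothetical helper: a value is "not available" if it is null or a floating-point NaN.
fun isNotAvailable(value: Any?): Boolean = when (value) {
    null -> true
    is Double -> value.isNaN()
    is Float -> value.isNaN()
    else -> false
}

fun main() {
    val values = listOf<Any?>(1.5, Double.NaN, null, 3.0, Float.NaN)

    // Drop-like: keep only available values.
    println(values.filterNot(::isNotAvailable))                  // [1.5, 3.0]
    // Fill-like: replace unavailable values with a default.
    println(values.map { if (isNotAvailable(it)) 0.0 else it }) // [1.5, 0.0, 0.0, 3.0, 0.0]
}
```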
@@ -0,0 +1,46 @@
[//]: # (title: Number Unification)
Unifying numbers means converting them to a common number type without losing information.
This is currently an internal part of the library,
but its logic can be encountered in multiple places, such as
[statistics](summaryStatistics.md) and [reading JSON](read.md#read-from-json).
The following graph shows the hierarchy of number types in Kotlin DataFrame.
<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.documentation.UnifyingNumbers.Graph.html" width="100%"/>
The order is top-down from the most complex type to the simplest one.
For each number type in the graph, it holds that a number of that type can be expressed losslessly by
a number of a more complex type (any of its parents).
This is either because the more complex type has a larger range or higher precision (in terms of bits).
Nullability, while not displayed in the graph, is also taken into account.
This means that `Int?` and `Float` will be unified to `Double?`.
`Nothing` is at the bottom of the graph and is the starting point in unification.
This can be interpreted as "no type" and can have no instance, while `Nothing?` can only be `null`.
> There may be parts of the library that "unify" numbers, such as [`readCsv`](read.md#column-type-inference-from-csv),
> or [`readExcel`](read.md#read-from-excel).
> However, because they rely on another library (like [Deephaven CSV](https://github.com/deephaven/deephaven-csv))
> this may behave slightly differently.
### Unified Number Type Options
There are variants of this graph that exclude some types, such as `BigDecimal` and `BigInteger`, or
allow some slightly lossy conversions, like from `Long` to `Double`.
This follows either `UnifiedNumberTypeOptions.PRIMITIVES_ONLY` or
`UnifiedNumberTypeOptions.DEFAULT`.
For `PRIMITIVES_ONLY`, used by [statistics](summaryStatistics.md), big numbers are excluded from the graph.
Additionally, `Double` is considered the most complex type,
meaning `Long`/`ULong` and `Double` can be joined to `Double`,
potentially losing a little precision(!).
For `DEFAULT`, used by [`readJson`](read.md#read-from-json), big numbers can appear.
`BigDecimal` is considered the most complex type, meaning that `Long`/`ULong` and `Double` will be joined
to `BigDecimal` instead.
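
The difference between lossless and slightly lossy conversions can be checked with plain Kotlin round trips. This is a standard-library sketch with hypothetical helper names; it does not call the library's unification logic.

```kotlin
import java.math.BigDecimal

// Round-trip checks: true if the value survives the conversion unchanged.
fun intSurvivesFloat(value: Int): Boolean = value.toFloat().toInt() == value
fun intSurvivesDouble(value: Int): Boolean = value.toDouble().toInt() == value
fun longSurvivesDouble(value: Long): Boolean = value.toDouble().toLong() == value
fun longSurvivesBigDecimal(value: Long): Boolean =
    BigDecimal.valueOf(value).longValueExact() == value

fun main() {
    val bigInt = 16_777_217               // 2^24 + 1, beyond Float's exact integer range
    val bigLong = 9_007_199_254_740_993L  // 2^53 + 1, beyond Double's exact integer range

    println(intSurvivesFloat(bigInt))        // false: Float cannot hold every Int
    println(intSurvivesDouble(bigInt))       // true: Double holds every Int exactly
    println(longSurvivesDouble(bigLong))     // false: the slightly lossy Long-to-Double case
    println(longSurvivesBigDecimal(bigLong)) // true: BigDecimal holds every Long exactly
}
```

The first two checks show why `Int` and `Float` need `Double` as a common type, and the last two show the trade-off between joining `Long` with `Double` into `Double` and joining them into `BigDecimal`.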
@@ -0,0 +1,25 @@
# Spelling Conventions
<web-summary>
Clarifies naming conventions used in Kotlin DataFrame documentation for the library, data format, and Kotlin type.
</web-summary>
<card-summary>
Understand how to distinguish between "Kotlin DataFrame", "dataframe", and `DataFrame` in the documentation.
</card-summary>
<link-summary>
Spelling and naming rules for using "Kotlin DataFrame", "dataframe", and `DataFrame` properly.
</link-summary>
While reading Kotlin DataFrame documentation, you may come across several similar terms referring to different concepts:
* **Kotlin DataFrame** (or just "DataFrame") — the name of the official library.
* *dataframe* — a general term for data in a tabular (frame) format.
* [`DataFrame`](DataFrame.md) — a Kotlin type or its instance that represents a wrapper around a dataframe.
Here's a correct usage example:
```markdown
Kotlin DataFrame allows you to read a dataframe from a CSV file into a `DataFrame`.
```
@@ -0,0 +1,5 @@
[//]: # (title: Data Abstractions)
* [`DataColumn`](DataColumn.md) is a named, typed and ordered collection of elements
* [`DataFrame`](DataFrame.md) consists of one or several [`DataColumns`](DataColumn.md) with unique names and equal size
* [`DataRow`](DataRow.md) is a single row of [`DataFrame`](DataFrame.md) and provides a single value for every [`DataColumn`](DataColumn.md)