init research

2026-02-08 11:20:43 -10:00
commit bdf064f54d
3041 changed files with 1592200 additions and 0 deletions
@@ -0,0 +1,41 @@
+[//]: # (title: DataColumn)
+<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.Create-->
+
+[`DataColumn`](DataColumn.md) represents a column of values.
+It can store objects of primitive or reference types, 
+or other [`DataFrame`](DataFrame.md) objects.
+
+See [how to create columns](createColumn.md)
+
+### Properties
+* `name: String` — name of the column; should be unique within the containing dataframe
+* `path: ColumnPath` — path to the column; depends on the way the column was retrieved from the dataframe
+* `type: KType` — type of elements in the column
+* `hasNulls: Boolean` — flag indicating whether column contains `null` values
+* `values: Iterable<T>` — column data
+* `size: Int` — number of elements in the column
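+
+As a minimal sketch of the properties listed above, the following reads them from a column of a small dataframe.
+The sample data and the column name `age` are made up for illustration, and the accessor functions
+(`name()`, `type()`, `hasNulls()`, `size()`, `values()`) are assumed to be available in the library version you use.
+
+```kotlin
+import org.jetbrains.kotlinx.dataframe.api.*
+
+fun main() {
+    // Hypothetical sample data: one nullable Int column named "age"
+    val df = dataFrameOf("age")(15, 45, null)
+    val age = df["age"]
+
+    println(age.name())            // name of the column
+    println(age.type())            // KType of the elements
+    println(age.hasNulls())        // whether the column contains null values
+    println(age.size())            // number of elements
+    println(age.values().toList()) // column data
+}
+```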
+
+### Column kinds
+[`DataColumn`](DataColumn.md) instances can be one of three subtypes: `ValueColumn`, [`ColumnGroup`](DataColumn.md#columngroup), or [`FrameColumn`](DataColumn.md#framecolumn).
+
+#### ValueColumn
+
+Represents a sequence of values. 
+
+It can store values of primitive (integers, strings, decimals, etc.) or reference types.
+Currently, it uses [`List`](https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.collections/-list/) as underlying data storage.
+
+#### ColumnGroup
+
+Container for nested columns. Used to create column hierarchy.
+
+You can create column groups using the group operation or by splitting inward — see [group](group.md) and [split](split.md) for details.
+
+#### FrameColumn
+
+Special case of [`ValueColumn`](#valuecolumn) that stores other [`DataFrame`](DataFrame.md) objects as elements.
+
+[`DataFrame`](DataFrame.md) objects stored in a [`FrameColumn`](DataColumn.md#framecolumn) may have different schemas.
+
+[`FrameColumn`](DataColumn.md#framecolumn) may appear after [reading](read.md) from JSON or other hierarchical data structures, or after grouping operations such as [groupBy](groupBy.md) or [pivot](pivot.md).
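+
+The three column kinds can be observed with a short sketch like the one below.
+The sample data and column names are hypothetical; `kind()` is assumed to report the kind of each column.
+
+```kotlin
+import org.jetbrains.kotlinx.dataframe.api.*
+
+fun main() {
+    // Hypothetical sample data
+    val df = dataFrameOf("firstName", "lastName", "city")(
+        "Alice", "Cooper", "London",
+        "Bob", "Dylan", "Dubai",
+        "Charlie", "Daniels", "London",
+    )
+
+    // group(...).into(...) nests the two name columns under one ColumnGroup
+    val grouped = df.group("firstName", "lastName").into("name")
+    println(grouped["name"].kind())
+    println(grouped["city"].kind())
+
+    // groupBy(...).toDataFrame() keeps each group as a DataFrame inside a FrameColumn
+    val byCity = df.groupBy("city").toDataFrame()
+    byCity.columns().forEach { println("${it.name()}: ${it.kind()}") }
+}
+```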
+
@@ -0,0 +1,14 @@
+[//]: # (title: DataFrame)
+
+[`DataFrame`](DataFrame.md) represents a list of [`DataColumn`](DataColumn.md).
+
+Columns in [`DataFrame`](DataFrame.md) must have equal size and unique names.
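+
+A minimal sketch of this constraint, using made-up sample data: `dataFrameOf` takes the column names first
+and then the values row by row, so every column ends up with the same number of elements.
+
+```kotlin
+import org.jetbrains.kotlinx.dataframe.api.*
+
+fun main() {
+    // Two uniquely named columns, three rows (hypothetical data)
+    val df = dataFrameOf("name", "age")(
+        "Alice", 15,
+        "Bob", 20,
+        "Charlie", 100,
+    )
+
+    println(df.columnNames())  // names of the columns
+    println(df.columnsCount()) // number of columns
+    println(df.rowsCount())    // number of rows, shared by all columns
+}
+```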
+
+**Learn how to:**
+- [Create DataFrame](createDataFrame.md)
+- [Read DataFrame](read.md)
+- [Get an overview of DataFrame](info.md)
+- [Access data in DataFrame](access.md)
+- [Modify data in DataFrame](modify.md)
+- [Compute statistics for DataFrame](summaryStatistics.md)
+- [Combine several DataFrame objects](multipleDataFrames.md)
@@ -0,0 +1,103 @@
+[//]: # (title: DataRow)
+<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.DataRowApi-->
+
+`DataRow` represents a single record, one piece of data within a [`DataFrame`](DataFrame.md).
+
+## Row functions
+
+<snippet id="rowFunctions">
+
+* `index(): Int` — sequential row number in [`DataFrame`](DataFrame.md), starts from 0
+* `prev(): DataRow?` — previous row (`null` for the first row)
+* `next(): DataRow?` — next row (`null` for the last row)
+* `diff(T) { rowExpression }: T / diffOrNull { rowExpression }: T?` — difference between the results of a [row expression](DataRow.md#row-expressions) calculated for current and previous rows
+* `explode(columns): DataFrame<T>` — spread lists and [`DataFrame`](DataFrame.md) objects vertically into new rows
+* `values(): List<Any?>` — list of all cell values from the current row
+* `valuesOf<T>(): List<T>` — list of values of the given type 
+* `columnsCount(): Int` — number of columns
+* `columnNames(): List<String>` — list of all column names
+* `columnTypes(): List<KType>` — list of all column types 
+* `namedValues(): List<NameValuePair<Any?>>` — list of name-value pairs where `name` is a column name and `value` is cell value
+* `namedValuesOf<T>(): List<NameValuePair<T>>` — list of name-value pairs where value has given type 
+* `transpose(): DataFrame<NameValuePair<*>>` — [`DataFrame`](DataFrame.md) of two columns: `name: String` is column names and `value: Any?` is cell values
+* `transposeTo<T>(): DataFrame<NameValuePair<T>>` — [`DataFrame`](DataFrame.md) of two columns: `name: String` is column names and `value: T` is cell values
+* `getRow(Int): DataRow` — row from [`DataFrame`](DataFrame.md) by row index
+* `getRows(Iterable<Int>): DataFrame` — [`DataFrame`](DataFrame.md) with subset of rows selected by absolute row index. 
+* `relative(Iterable<Int>): DataFrame` — [`DataFrame`](DataFrame.md) with subset of rows selected by relative row index: `relative(-1..1)` will return previous, current and next row. Requested indices will be coerced to the valid range and invalid indices will be skipped
+* `getValue<T>(columnName)` — cell value of type `T` by this row and given `columnName`
+* `getValueOrNull<T>(columnName)` — cell value of type `T?` by this row and given `columnName` or `null` if there's no such column
+* `get(column): T` — cell value by this row and given `column`
+* `String.invoke<T>(): T` — cell value of type `T` by this row and given `this` column name
+* `ColumnPath.invoke<T>(): T` — cell value of type `T` by this row and given `this` column path
+* `ColumnReference.invoke(): T` — cell value of type `T` by this row and given `this` column
+* `df()` — [`DataFrame`](DataFrame.md) that current row belongs to
+
+</snippet>
+
+## Row expressions
+Row expressions provide a value for every row of [`DataFrame`](DataFrame.md) and are used in [add](add.md), [filter](filter.md), [forEach](iterate.md), [update](update.md) and other operations.
+
+<!---FUN expressions-->
+
+```kotlin
+// Row expression computes values for a new column
+df.add("fullName") { name.firstName + " " + name.lastName }
+
+// Row expression computes updated values
+df.update { weight }.at(1, 3, 4).with { prev()?.weight }
+
+// Row expression computes cell content for values of pivoted column
+df.pivot { city }.with { name.lastName.uppercase() }
+```
+
+<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.DataRowApi.expressions.html" width="100%"/>
+<!---END-->
+
+Row expression signature: ```DataRow.(DataRow) -> T```. Row values can be accessed with or without the ```it``` keyword: the implicit receiver and the explicit argument represent the same `DataRow` object.
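+
+The two access styles can be compared in a small sketch. The sample data is hypothetical, and the String API is used
+so that the example does not depend on generated extension properties.
+
+```kotlin
+import org.jetbrains.kotlinx.dataframe.api.*
+
+fun main() {
+    // Hypothetical sample data
+    val df = dataFrameOf("name", "age")("Alice", 15, "Bob", 20, "Charlie", 100)
+
+    // Implicit receiver: the row is `this`, so the column is read directly
+    val viaReceiver = df.filter { "age"<Int>() >= 18 }
+
+    // Explicit argument: the same row is also available as `it`
+    val viaArgument = df.filter { it.getValue<Int>("age") >= 18 }
+
+    println(viaReceiver.rowsCount())
+    println(viaArgument.rowsCount())
+}
+```
+
+Both filters see the same `DataRow`, so they are expected to keep the same rows.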
+
+## Row conditions
+Row condition is a special case of [row expression](#row-expressions) that returns `Boolean`. 
+
+<!---FUN conditions-->
+
+```kotlin
+// Row condition is used to filter rows by index
+df.filter { index() % 5 == 0 }
+
+// Row condition is used to drop rows where `age` is the same as in the previous row
+df.drop { diffOrNull { age } == 0 }
+
+// Row condition is used to filter rows for value update
+df.update { weight }.where { index() > 4 && city != "Paris" }.with { 50 }
+```
+
+<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.DataRowApi.conditions.html" width="100%"/>
+<!---END-->
+
+Row condition signature: ```DataRow.(DataRow) -> Boolean```
+
+
+
+## Row statistics
+
+<snippet id="rowStatistics">
+
+The following [statistics](summaryStatistics.md) are available for `DataRow`:
+* `rowSum`
+* `rowMean`
+* `rowStd`
+
+These statistics will be applied only to values of appropriate types, and incompatible values will be ignored.
+For example, if a [dataframe](DataFrame.md) has columns of types `String` and `Int`,
+`rowSum()` will compute the sum of the `Int` values in the row and ignore `String` values.
+
+To apply statistics only to values of a particular type use `-Of` versions:
+* `rowSumOf<T>`
+* `rowMeanOf<T>`
+* `rowStdOf<T>`
+* `rowMinOf<T>`
+* `rowMaxOf<T>`
+* `rowMedianOf<T>`
+* `rowPercentileOf<T>`
+
+</snippet>
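+
+A short sketch of row statistics used inside row expressions. The data and column names are made up,
+and the exact result types of `rowSum()` and `rowMean()` may differ between library versions.
+
+```kotlin
+import org.jetbrains.kotlinx.dataframe.api.*
+
+fun main() {
+    // Hypothetical sample data: one String column and two Int columns
+    val df = dataFrameOf("name", "math", "physics")(
+        "Alice", 90, 80,
+        "Bob", 70, 60,
+    )
+
+    // The String value in each row is ignored by the statistics
+    println(df.add("total") { rowSum() })
+    println(df.add("mean") { rowMean() })
+
+    // The -Of version restricts the computation to values of the given type
+    println(df.add("intTotal") { rowSumOf<Int>() })
+}
+```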
@@ -0,0 +1,117 @@
+[//]: # (title: Access APIs)
+
+<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.ApiLevels-->
+
+By nature, dataframes are dynamic objects;
+column labels depend on the input source and new columns can be added
+or deleted while wrangling.
+Kotlin, in contrast, is a statically typed language where all types are defined and verified
+ahead of execution.
+
+That's why creating a flexible, handy, and, at the same time, safe API to a dataframe is tricky.
+
+In the Kotlin DataFrame library, we provide two different ways to access columns.
+
+## List of Access APIs
+
+Here's a list of all APIs in order of increasing safety.
+
+* **String API** <br/>
+  Columns are accessed by a `String` representing their name. Both type-checking and name-checking are done at runtime.
+
+* [**Extension Properties API**](extensionPropertiesApi.md) <br/>
+  Extension access properties are generated based on the dataframe schema. The name and type of properties are inferred
+  from the name and type of the corresponding columns.
+
+## Example
+
+Here's an example of how the same operations can be performed via different Access APIs:
+
+<note>
+In most of the code snippets in this documentation, there's a tab selector that allows switching between Access APIs.
+</note>
+
+<tabs>
+
+<tab title="String API">
+
+<!---FUN strings-->
+
+```kotlin
+DataFrame.read("titanic.csv")
+    .add("lastName") { "name"<String>().split(",").last() }
+    .dropNulls("age")
+    .filter {
+        "survived"<Boolean>() &&
+            "home"<String>().endsWith("NY") &&
+            "age"<Int>() in 10..20
+    }
+```
+
+<!---END-->
+
+</tab>
+
+<tab title = "Extension Properties API">
+
+<!---FUN extensionProperties1-->
+
+```kotlin
+val df /* : AnyFrame */ = DataFrame.read("titanic.csv")
+```
+
+<!---END-->
+
+<!---FUN extensionProperties2-->
+
+```kotlin
+df.add("lastName") { name.split(",").last() }
+    .dropNulls { age }
+    .filter { survived && home.endsWith("NY") && age in 10..20 }
+```
+
+<!---END-->
+
+</tab>
+
+</tabs>
+
+The `titanic.csv` file can be found [here](https://github.com/Kotlin/dataframe/blob/master/data/titanic.csv).
+
+## Comparing APIs
+
+The String API is the simplest and the least safe of them all. Its main advantage is that it can be
+used at any time, including when accessing new columns in chain calls. So we can write something like:
+
+```kotlin
+df.add("weight") { ... } // add a new column `weight`, calculated by some expression
+    .sortBy("weight") // sorting dataframe rows by its value
+```
+
+In contrast, generated [extension properties](extensionPropertiesApi.md) form the most convenient and the safest API. 
+Using them, you can always be sure that you work with correct data and types.
+However, there's a bottleneck at the moment of generation.
+To get new extension properties, you have to run a cell in a notebook,
+which could lead to unnecessary variable declarations.
+Currently, we are working on a compiler plugin that generates these properties on the fly while typing!
+
+<table>
+    <tr>
+        <td> API </td>
+        <td> Type-checking </td>
+        <td> Column names checking </td>
+        <td> Column existence checking </td>
+    </tr>
+    <tr>
+        <td> String API </td>
+        <td> Runtime </td>
+        <td> Runtime </td>
+        <td> Runtime </td>
+    </tr>
+    <tr>
+        <td> Extension Properties API </td>
+        <td> Generation-time </td>
+        <td> Generation-time </td>
+        <td> Generation-time </td>
+    </tr>
+</table>
@@ -0,0 +1,62 @@
+# Concepts And Principles
+
+<web-summary>
+Learn what Kotlin DataFrame is about — its core concepts, design principles, and usage philosophy.
+</web-summary>
+
+<card-summary>
+Discover the fundamentals of the library —
+understand key concepts, motivation, and the overall structure of the library.
+</card-summary>
+
+<link-summary>
+Explore the fundamentals of Kotlin DataFrame — 
+understand key concepts, motivation, and the overall structure of the library.
+</link-summary>
+
+
+<show-structure depth="3"/>
+
+
+## What is a dataframe
+
+A *dataframe* is an abstraction for working with structured data. 
+Essentially, it’s a 2-dimensional table with labeled columns of potentially different types. 
+You can think of it like a spreadsheet or SQL table, or a dictionary of series objects.
+
+The handiness of this abstraction is not in the table itself but in a set of operations defined on it. 
+The Kotlin DataFrame library is an idiomatic Kotlin DSL defining such operations. 
+Working with a dataframe is often called *data wrangling*: 
+transforming and mapping data from a "raw" form into another format 
+that is more appropriate for analytics and visualization. 
+The goal of data wrangling is to ensure that the data is useful and of high quality.
+
+## Main Features and Concepts
+
+* [**Hierarchical**](hierarchical.md) — the Kotlin DataFrame library provides an ability to read and present data from different sources, 
+including not only plain **CSV** but also **JSON** or **[SQL databases](readSqlDatabases.md)**.
+This is why it was designed to be hierarchical and allows nesting of columns and cells.
+* **Functional** — the data processing pipeline is organized in a chain of [`DataFrame`](DataFrame.md)  transformation operations.
+* **Immutable** — every operation returns a new instance of [`DataFrame`](DataFrame.md)  reusing underlying storage wherever it's possible.
+* **Readable** — data transformation operations are defined in DSL close to natural language.
+* **Practical** — provides simple solutions for common problems and the ability to perform complex tasks.
+* **Minimalistic** — simple, yet powerful data model of three [column kinds](DataColumn.md#column-kinds).
+* [**Interoperable**](collectionsInterop.md) — convertible to and from Kotlin data classes and collections.
+  This also means conversion to/from other libraries' data structures is usually quite straightforward!
+  See our [examples](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources) 
+  for some conversions between DataFrame and [Apache Spark](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/spark), [Multik](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/multik), and [JetBrains Exposed](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/exposed).
+* **Generic** — can store objects of any type, not only numbers or strings.
+* **Typesafe** — the Kotlin DataFrame library provides a mechanism of on-the-fly [**generation of extension properties**](extensionPropertiesApi.md) 
+that correspond to the columns of a dataframe. 
+In interactive notebooks like Jupyter or Datalore, the generation runs after each cell execution. 
+In IntelliJ IDEA, there's a Gradle plugin that generates properties based on a CSV or JSON file. 
+Also, we’re working on a compiler plugin that infers and transforms [`DataFrame`](DataFrame.md) schema while typing.
+You can now clone this [project with many examples](https://github.com/koperagen/df-plugin-demo) showcasing how it allows you to reliably use our most convenient extension properties API.
+The generated properties ensure that you never misspell a column name or mix up its type, and nullability is preserved as well.
+* [**Polymorphic**](schemas.md) —
+  if all columns of a [`DataFrame`](DataFrame.md) instance are present in another dataframe,
+  then the first one will be seen as a superclass for the latter. 
+This means you can define a function on an interface with some set of columns
+  and then execute it safely on any [`DataFrame`](DataFrame.md) which contains this same set of columns.
+  In notebooks, this works out-of-the-box.
+  In ordinary projects, this requires casting (for now).
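+
+The functional and immutable principles from the list above can be sketched as follows.
+The data and the column name `isAdult` are hypothetical; the String API is used to keep the example independent
+of generated extension properties.
+
+```kotlin
+import org.jetbrains.kotlinx.dataframe.api.*
+
+fun main() {
+    // Hypothetical sample data
+    val original = dataFrameOf("name", "age")("Alice", 15, "Bob", 20, "Charlie", 100)
+
+    // A pipeline is a chain of operations; each one returns a new DataFrame
+    val adults = original
+        .add("isAdult") { "age"<Int>() >= 18 }
+        .filter { "isAdult"<Boolean>() }
+
+    // The original instance is not modified by the chain
+    println(original.columnNames())
+    println(adults.columnNames())
+}
+```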
@@ -0,0 +1,20 @@
+[//]: # (title: Hierarchical data structures)
+
+[`DataFrame`](DataFrame.md) can represent hierarchical data structures using two special types of columns:
+
+* [`ColumnGroup`](DataColumn.md#columngroup) is a group of [columns](DataColumn.md)
+* [`FrameColumn`](DataColumn.md#framecolumn) is a column of [dataframes](DataFrame.md)
+
+You can read [`DataFrame`](DataFrame.md) [from json](read.md#read-from-json) or [from in-memory object graph](createDataFrame.md#todataframe) preserving original tree structure.
+
+Hierarchical columns can also appear as a result of some [modification operations](modify.md):
+* [group](group.md) produces [`ColumnGroup`](DataColumn.md#columngroup) 
+* [groupBy](groupBy.md) produces [`FrameColumn`](DataColumn.md#framecolumn)
+* [pivot](pivot.md) may produce [`FrameColumn`](DataColumn.md#framecolumn)
+* [split](split.md) of [`FrameColumn`](DataColumn.md#framecolumn) will produce several [`ColumnGroup`](DataColumn.md#columngroup)
+* [implode](implode.md) converts [`ColumnGroup`](DataColumn.md#columngroup) into [`FrameColumn`](DataColumn.md#framecolumn)
+* [explode](explode.md) converts [`FrameColumn`](DataColumn.md#framecolumn) into [`ColumnGroup`](DataColumn.md#columngroup)
+* [merge](merge.md) converts [`ColumnGroup`](DataColumn.md#columngroup) into [`FrameColumn`](DataColumn.md#framecolumn)
+* etc.
+
+Operations in the navigation tree are grouped such that you can find operations and their respective inverse together, like `group` and `ungroup`. This allows you to quickly find out how to simplify any hierarchical structure you come across.
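+
+As a sketch of such an operation and its inverse, the example below groups two columns and then ungroups them again.
+The sample data and column names are made up for illustration.
+
+```kotlin
+import org.jetbrains.kotlinx.dataframe.api.*
+
+fun main() {
+    // Hypothetical sample data
+    val df = dataFrameOf("firstName", "lastName", "age")(
+        "Alice", "Cooper", 15,
+        "Bob", "Dylan", 45,
+    )
+
+    // group: nest two columns under a ColumnGroup named "name"
+    val grouped = df.group("firstName", "lastName").into("name")
+    println(grouped.columnNames())
+
+    // ungroup: the inverse operation, flattening the hierarchy again
+    val flat = grouped.ungroup("name")
+    println(flat.columnNames())
+}
+```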
@@ -0,0 +1,24 @@
+[//]: # (title: NaN and NA)
+
+Using the Kotlin DataFrame library, you might come across the terms `NaN` and `NA`. 
+This page explains what they mean and how to work with them.
+
+## NaN
+
+`Float` or `Double` values can be represented as `NaN`,
+in cases where a mathematical operation is undefined, such as dividing zero by zero. The
+result of such an operation can only be described as "**N**ot **a** **N**umber".
+
+This is different from `null`, which means that a value is missing and, in Kotlin, can only occur
+for `Float?` and `Double?` types.
+
+You can use [fillNaNs](fill.md#fillnans) to replace `NaNs` in certain columns with a given value or expression
+or [dropNaNs](drop.md#dropnans) to drop rows with `NaNs` in them.
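+
+The difference between `NaN` and `null` can be seen in plain Kotlin, without the DataFrame library.
+This is only an illustrative sketch; the helper name `zeroByZero` is made up for the example.
+
+```kotlin
+// Dividing zero by zero is undefined, so the result is NaN
+fun zeroByZero(): Double {
+    val zero = 0.0
+    return zero / zero
+}
+
+fun main() {
+    val nan: Double = zeroByZero()
+    val missing: Double? = null // a missing value needs a nullable type
+
+    println(nan.isNaN())       // true
+    println(nan == nan)        // false: NaN is not equal to itself
+    println(missing == null)   // true
+    println((1.0 / zeroByZero()).isNaN()) // true: NaN propagates through arithmetic
+}
+```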
+
+## NA
+
+`NA` in Kotlin DataFrame means either [`NaN`](#nan) or `null`; in other words, the value
+is "**N**ot **A**vailable".
+
+You can use [fillNA](fill.md#fillna) to replace `NAs` in certain columns with a given value or expression
+or [dropNA](drop.md#dropna) to drop rows with `NAs` in them.
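+
+A minimal sketch of how the `NA` and `NaN` operations differ, using a made-up column `score` that contains
+a regular value, a `NaN`, and a `null`. The string-based overloads used here are assumed to be available
+in your library version.
+
+```kotlin
+import org.jetbrains.kotlinx.dataframe.api.*
+
+fun main() {
+    // Hypothetical sample data: "score" holds a value, a NaN, and a null
+    val df = dataFrameOf("id", "score")(
+        1, 0.5,
+        2, Double.NaN,
+        3, null,
+    )
+
+    // fillNA replaces both NaN and null values
+    println(df.fillNA("score").with { 0.0 })
+
+    // fillNaNs replaces only NaN values; null stays in place
+    println(df.fillNaNs("score").with { 0.0 })
+
+    // dropNA removes rows where "score" is NaN or null
+    println(df.dropNA("score"))
+}
+```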
@@ -0,0 +1,46 @@
+[//]: # (title: Number Unification)
+
+Unifying numbers means converting them to a common number type without losing information.
+
+This is currently an internal part of the library, 
+but its logic can be encountered in multiple places, such as
+[statistics](summaryStatistics.md) and [reading JSON](read.md#read-from-json). 
+
+The following graph shows the hierarchy of number types in Kotlin DataFrame.
+
+<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.documentation.UnifyingNumbers.Graph.html" width="100%"/>
+
+The order is top-down from the most complex type to the simplest one.
+
+For each number type in the graph, it holds that a number of that type can be expressed losslessly by
+a number of a more complex type (any of its parents).
+This is either because the more complex type has a larger range or higher precision (in terms of bits).
+
+Nullability, while not displayed in the graph, is also taken into account.
+This means that `Int?` and `Float` will be unified to `Double?`.
+
+`Nothing` is at the bottom of the graph and is the starting point in unification.
+This can be interpreted as "no type" and can have no instance, while `Nothing?` can only be `null`.
+
+> There may be parts of the library that "unify" numbers, such as [`readCsv`](read.md#column-type-inference-from-csv),
+> or [`readExcel`](read.md#read-from-excel).
+> However, because they rely on another library (like [Deephaven CSV](https://github.com/deephaven/deephaven-csv))
+> this may behave slightly differently.
+
+### Unified Number Type Options
+
+There are variants of this graph that exclude some types, such as `BigDecimal` and `BigInteger`, or
+allow some slightly lossy conversions, like from `Long` to `Double`.
+
+This follows either `UnifiedNumberTypeOptions.PRIMITIVES_ONLY` or
+`UnifiedNumberTypeOptions.DEFAULT`.
+
+For `PRIMITIVES_ONLY`, used by [statistics](summaryStatistics.md), big numbers are excluded from the graph.
+Additionally, `Double` is considered the most complex type,
+meaning `Long`/`ULong` and `Double` can be joined to `Double`,
+potentially losing a little precision(!).
+
+For `DEFAULT`, used by [`readJson`](read.md#read-from-json), big numbers can appear.
+`BigDecimal` is considered the most complex type, meaning that `Long`/`ULong` and `Double` will be joined
+to `BigDecimal` instead.
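+
+The difference between lossless and slightly lossy conversions can be illustrated with plain Kotlin.
+This sketch does not call the library's internal unification logic; the helper names are hypothetical
+and only show why `Long` to `Double` may lose precision while `Long` to `BigDecimal` does not.
+
+```kotlin
+import java.math.BigDecimal
+
+// True when converting to Double and back yields the original value
+fun survivesDouble(value: Long): Boolean = value.toDouble().toLong() == value
+
+// True when converting to BigDecimal and back yields the original value
+fun survivesBigDecimal(value: Long): Boolean = BigDecimal(value).longValueExact() == value
+
+fun main() {
+    // Every Int fits into a Double exactly
+    println(Int.MAX_VALUE.toDouble().toInt() == Int.MAX_VALUE) // true
+
+    // 2^53 + 1 needs more precision than the 53-bit significand of a Double
+    val big = (1L shl 53) + 1
+    println(survivesDouble(big))     // false
+    println(survivesBigDecimal(big)) // true
+}
+```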
+
@@ -0,0 +1,25 @@
+# Spelling Conventions
+
+<web-summary>
+Clarifies naming conventions used in Kotlin DataFrame documentation for the library, data format, and Kotlin type.
+</web-summary>
+
+<card-summary>
+Understand how to distinguish between "Kotlin DataFrame", "dataframe", and `DataFrame` in the documentation.
+</card-summary>
+
+<link-summary>
+Spelling and naming rules for using "Kotlin DataFrame", "dataframe", and `DataFrame` properly.
+</link-summary>
+
+While reading Kotlin DataFrame documentation, you may come across several similar terms referring to different concepts:
+
+* **Kotlin DataFrame** (or just "DataFrame") — the name of the official library.
+* *dataframe* — a general term for data in a tabular (frame) format.
+* [`DataFrame`](DataFrame.md) — a Kotlin type or its instance that represents a wrapper around a dataframe.
+
+Here’s a correct usage example:
+
+```markdown
+Kotlin DataFrame allows you to read a dataframe from a CSV file into a `DataFrame`.
+```
@@ -0,0 +1,5 @@
+[//]: # (title: Data Abstractions)
+
+* [`DataColumn`](DataColumn.md) is a named, typed and ordered collection of elements
+* [`DataFrame`](DataFrame.md) consists of one or several [`DataColumns`](DataColumn.md) with unique names and equal size
+* [`DataRow`](DataRow.md) is a single row of [`DataFrame`](DataFrame.md) and provides a single value for every [`DataColumn`](DataColumn.md)
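+
+The relationship between the three abstractions can be sketched with made-up sample data:
+a cell can be reached either through its column or through its row.
+
+```kotlin
+import org.jetbrains.kotlinx.dataframe.api.*
+
+fun main() {
+    // Hypothetical sample data
+    val df = dataFrameOf("name", "age")("Alice", 15, "Bob", 20)
+
+    val ages = df["age"] // a DataColumn: all values of one column
+    val first = df[0]    // a DataRow: one value for every column
+
+    // Both paths lead to the same cell
+    println(ages[0])
+    println(first["age"])
+}
+```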