# Concepts And Principles

<web-summary>
Learn what Kotlin DataFrame is about: its core concepts, design principles, and usage philosophy.
</web-summary>

<card-summary>
Discover the fundamentals of the library:
understand key concepts, motivation, and the overall structure of the library.
</card-summary>

<link-summary>
Explore the fundamentals of Kotlin DataFrame:
understand key concepts, motivation, and the overall structure of the library.
</link-summary>

<show-structure depth="3"/>

## What is a dataframe

A *dataframe* is an abstraction for working with structured data.
Essentially, it’s a 2-dimensional table with labeled columns of potentially different types.
You can think of it as a spreadsheet, an SQL table, or a dictionary of series objects.

The value of this abstraction lies not in the table itself but in the set of operations defined on it.
The Kotlin DataFrame library is an idiomatic Kotlin DSL defining such operations.
Working with a dataframe is often called *data wrangling*:
the process of transforming and mapping data from a "raw" form into another format
that is more appropriate for analytics and visualization.
The goal of data wrangling is to ensure that the data is of good quality and useful.

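
As a minimal sketch of the idea above, the snippet below builds a small table with labeled columns of different types and then applies a short chain of operations to it. The column names and values are made up for illustration, and the snippet assumes the Kotlin DataFrame library is on the classpath. It uses the String-based column access so that it does not depend on generated extension properties.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data: three labeled columns holding String, Int and String values.
fun buildPeople() = dataFrameOf("name", "age", "city")(
    "Alice", 15, "London",
    "Bob", 45, "Dubai",
    "Charlie", 20, "Moscow",
)

// A small wrangling pipeline: keep adults, order them by age, keep two columns.
fun adultsByAge() = buildPeople()
    .filter { "age"<Int>() >= 18 }
    .sortBy("age")
    .select("name", "age")

fun main() {
    println(adultsByAge())
}
```

Each step takes a dataframe and returns a dataframe, which is what lets the operations be chained.
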
## Main Features and Concepts

* [**Hierarchical**](hierarchical.md): the Kotlin DataFrame library can read and present data from different sources,
including not only plain **CSV** but also **JSON** or **[SQL databases](readSqlDatabases.md)**.
This is why it was designed to be hierarchical and allows nesting of columns and cells.
* **Functional**: the data processing pipeline is organized as a chain of [`DataFrame`](DataFrame.md) transformation operations.
* **Immutable**: every operation returns a new instance of [`DataFrame`](DataFrame.md), reusing the underlying storage wherever possible.
* **Readable**: data transformation operations are defined in a DSL that is close to natural language.
* **Practical**: provides simple solutions for common problems and the ability to perform complex tasks.
* **Minimalistic**: a simple yet powerful data model of three [column kinds](DataColumn.md#column-kinds).
* [**Interoperable**](collectionsInterop.md): convertible to and from Kotlin data classes and collections.
This also means that conversion to and from other libraries' data structures is usually quite straightforward.
See our [examples](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources)
for some conversions between DataFrame and [Apache Spark](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/spark), [Multik](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/multik), and [JetBrains Exposed](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/exposed).
* **Generic**: can store objects of any type, not only numbers or strings.
* **Typesafe**: the Kotlin DataFrame library provides a mechanism for on-the-fly [**generation of extension properties**](extensionPropertiesApi.md)
that correspond to the columns of a dataframe.
In interactive notebooks like Jupyter or Datalore, the generation runs after each cell execution.
In IntelliJ IDEA, there's a Gradle plugin that generates properties based on a CSV or JSON file.
Also, we’re working on a compiler plugin that infers and transforms the [`DataFrame`](DataFrame.md) schema while you type.
You can now clone this [project with many examples](https://github.com/koperagen/df-plugin-demo) showcasing how it allows you to reliably use our most convenient extension properties API.
The generated properties ensure that you never misspell a column name or mix up its type, and nullability is preserved as well.
* [**Polymorphic**](schemas.md):
if all columns of a [`DataFrame`](DataFrame.md) instance are present in another dataframe,
the first one is treated as a superclass of the latter.
This means you can define a function on an interface with some set of columns
and then execute it safely on any [`DataFrame`](DataFrame.md) that contains the same set of columns.
In notebooks, this works out of the box.
In ordinary projects, this requires casting (for now).
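
Several of the features listed above can be seen together in a short sketch. The `Person` data class, the `Named` schema interface, and the helper function names are hypothetical and exist only for this illustration. The snippet assumes an ordinary Gradle project with the Kotlin DataFrame library available and no generated extension properties, so columns are accessed by name and the polymorphic call goes through an explicit `cast`.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.annotations.DataSchema
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical data class used only for this illustration.
data class Person(val name: String, val age: Int)

// Hypothetical schema: any dataframe that has a String column called "name".
@DataSchema
interface Named {
    val name: String
}

// Polymorphic: written once against the schema, usable on any dataframe with a "name" column.
fun DataFrame<Named>.upperCaseNames(): List<String> =
    rows().map { (it["name"] as String).uppercase() }

// Interoperable: a list of data class instances becomes a dataframe.
fun samplePeople(): DataFrame<Person> =
    listOf(Person("Alice", 15), Person("Bob", 45)).toDataFrame()

// Immutable: add returns a new dataframe and leaves its receiver unchanged.
fun withAdultFlag(df: DataFrame<Person>): DataFrame<Person> =
    df.add("isAdult") { "age"<Int>() >= 18 }

// Functional: a chain of transformations, each returning a new dataframe.
fun adultsOnly(flagged: DataFrame<Person>): DataFrame<Person> =
    flagged.filter { "isAdult"<Boolean>() }.remove("isAdult")

fun main() {
    val people = samplePeople()
    val flagged = withAdultFlag(people)

    println(people.columnsCount())   // column count of the original dataframe
    println(flagged.columnsCount())  // column count after adding "isAdult"

    // Interoperable in the other direction: rows back to data class instances.
    println(adultsOnly(flagged).toList())

    // In an ordinary project, the schema-based function needs an explicit cast.
    println(flagged.cast<Named>().upperCaseNames())
}
```

The extra `isAdult` column does not prevent `flagged` from being used as a `DataFrame<Named>`, because only the columns declared in the schema are required.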