A data transformation pipeline usually consists of several modification operations, such as filtering, sorting, grouping, pivoting, and adding or removing columns.
The Kotlin DataFrame API is designed in a functional style so that the whole processing pipeline can be represented as a single statement with a sequential chain of operations.
A DataFrame object is immutable, and all operations return a new DataFrame instance, reusing underlying data structures as much as possible.
df.update { age }.where { city == "Paris" }.with { it - 5 }
.filter { isHappy && age > 100 }
.move { name.firstName and name.lastName }.after { isHappy }
.merge { age and weight }.by { "Age: ${it[0]}, weight: ${it[1]}" }.into("info")
.rename { isHappy }.into("isOK")
You can play with the "people" dataset used in this guide here
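The chain above relies on generated extension properties such as age and city. As a rough, self-contained sketch of the same idea, the example below uses string column names instead. The sample data and the helper names samplePeople and adultsByAge are hypothetical, and the code assumes the Kotlin DataFrame library (org.jetbrains.kotlinx:dataframe) is on the classpath.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data; illustrative only, not the guide's "people" dataset.
fun samplePeople(): DataFrame<*> =
    dataFrameOf("name", "age", "city")(
        "Alice", 15, "London",
        "Bob", 45, "Paris",
        "Charlie", 20, "Paris",
    )

// A pipeline written as one chained statement; every step returns a new DataFrame.
fun adultsByAge(df: DataFrame<*>): DataFrame<*> =
    df.filter { "age"<Int>() > 18 }       // keep rows where age is above 18
        .sortBy("age")                     // sort the remaining rows by age
        .rename("city").into("location")   // rename one column

fun main() {
    val df = samplePeople()
    val result = adultsByAge(df)

    // The source frame is untouched: it keeps all rows and the original column name.
    println(df.rowsCount())
    println(df.columnNames())
    println(result.columnNames())
}
```

Because df is never modified, it can safely be reused as the starting point for other pipelines.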
Multiplex operations
Simple operations (such as filter or select) return a new DataFrame immediately, while more complex operations return an intermediate object that is used for further configuration of the operation. Let's call such operations multiplex.
Every multiplex operation configuration consists of:
- column selector that is used to select target columns for the operation
- additional configuration functions
- terminal function that returns the modified DataFrame
Most multiplex operations end with an into or with function. The following naming convention is used:
- into defines column names for storing operation results. Used in move, group, split, merge, gather, groupBy, rename.
- with defines a row-wise data transformation with a row expression. Used in update, convert, replace, pivot.
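To make the three parts of a multiplex operation concrete, here is a minimal sketch with one operation ending in with and one ending in into. The sample data and the function names are hypothetical. Columns are addressed by string name, so the updated value arrives untyped and is cast explicitly; with generated extension properties the cast would not be needed. The code assumes the Kotlin DataFrame library is on the classpath.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data used only for this illustration.
fun samplePeople(): DataFrame<*> =
    dataFrameOf("name", "age", "city")(
        "Alice", 15, "London",
        "Bob", 45, "Paris",
        "Charlie", 20, "Paris",
    )

// A multiplex operation that ends with `with`.
fun rejuvenateParisians(df: DataFrame<*>): DataFrame<*> {
    val pending = df
        .update("age")                            // column selector: target columns
        .where { "city"<String>() == "Paris" }   // additional configuration: restrict rows
    // `pending` is an intermediate object, not a DataFrame, until a terminal function is called.
    return pending.with { (it as Int) - 5 }       // terminal function: row-wise transformation
}

// A multiplex operation that ends with `into`.
fun groupDetails(df: DataFrame<*>): DataFrame<*> =
    df.group("age", "city")   // column selector: columns to group
        .into("details")      // terminal function: name of the resulting column group

fun main() {
    val df = samplePeople()
    println(rejuvenateParisians(df))
    println(groupDetails(df))
}
```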
List of DataRow operations
List of DataRow statistics
List of DataFrame operations
- add — add columns
- addId — add id column
- append — add rows
- columns / columnNames / columnTypes — get list of top-level columns, column names or column types
- columnsCount — number of top-level columns
- concat — union rows from several DataFrame objects
- convert — change column values and/or column types
- corr — pairwise correlation of columns
- count — number of rows that match condition
- countDistinct — number of unique rows
- cumSum — cumulative sum of column values
- describe — basic column statistics
- distinct / distinctBy — remove duplicated rows
- drop / dropLast / dropWhile / dropNulls / dropNA / dropNaNs — remove rows by condition
- duplicate — duplicate rows
- explode — spread lists and DataFrame objects vertically into new rows
- fillNulls / fillNaNs / fillNA — replace missing values
- filter / filterBy — filter rows by condition
- first / firstOrNull — find first row by condition
- flatten — remove column groupings recursively
- forEachRow / forEachColumn — iterate over rows or columns
- format — conditional formatting for cell rendering
- gather — convert pairs of column names and values into new columns
- getColumn / getColumnOrNull / getColumnGroup / getColumns — get one or several columns
- group — group columns into ColumnGroup
- groupBy — group rows by key columns
- head — get first 5 rows of DataFrame
- implode — collapse column values into lists grouping by other columns
- inferType — infer column type from column values
- insert — insert column
- join — join two DataFrame objects by key columns
- joinWith — join two DataFrame objects by an expression that evaluates joined DataRows to Boolean
- last / lastOrNull — find last row by condition
- map — map columns into new DataFrame or DataColumn
- max / maxBy / maxOf / maxFor — max of values
- mean / meanOf / meanFor — average of values
- median / medianOf / medianFor — median of values
- merge — merge several columns into one
- min / minBy / minOf / minFor — min of values
- move — move columns or change column groupings
- parse — try to convert strings into other types
- pivot / pivotCounts / pivotMatches — convert values into new columns
- remove — remove columns
- rename — rename columns
- reorder / reorderColumnsBy / reorderColumnsByName — reorder columns
- replace — replace columns
- reverse — reverse rows
- rows / rowsReversed — get rows in direct or reversed order
- rowsCount — number of rows
- schema — schema of columns: names, types and hierarchy
- select — select subset of columns
- shuffle — reorder rows randomly
- single / singleOrNull — get single row by condition
- sortBy / sortByDesc / sortWith — sort rows
- split — split column values into new rows/columns or inplace into lists
- std / stdOf / stdFor — standard deviation of values
- sum / sumOf / sumFor — sum of values
- take / takeLast / takeWhile — get first/last rows
- toList / toListOf — export DataFrame into a list of data classes
- toMap — export DataFrame into a map from column names to column values
- unfold - unfold objects (normal class instances) in columns according to their properties
- ungroup — remove column groupings
- update — update column values preserving column types
- values — Sequence of values traversed by row or by column
- valueCounts — counts for unique values
- xs — slice DataFrame by given key values
Shortcut operations
Some operations are shortcuts for more general operations:
- rename, group, flatten are special cases of move
- valueCounts is a special case of groupBy
- pivotCounts, pivotMatches are special cases of pivot
- fillNulls, fillNaNs, fillNA are special cases of update
- convert is a special case of replace
You can use these shortcuts to apply the most common DataFrame transformations more easily, but you can always fall back to the general operations if you need more customization.
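As an illustration of one shortcut and its general counterpart, the sketch below fills missing values twice: once with fillNulls and once with update plus an explicit null check. It is only a sketch under assumptions: the sample data and function names are hypothetical, columns are addressed by string name, and the Kotlin DataFrame library is assumed to be on the classpath.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data with one missing age value.
fun sampleWithNulls(): DataFrame<*> =
    dataFrameOf("name", "age")(
        "Alice", 15,
        "Bob", null,
    )

// Shortcut form: fillNulls targets only the null cells of the selected column.
fun viaShortcut(df: DataFrame<*>): DataFrame<*> =
    df.fillNulls("age").with { 0 }

// General form: the same effect spelled out with update and an explicit condition.
fun viaUpdate(df: DataFrame<*>): DataFrame<*> =
    df.update("age").where { it == null }.with { 0 }

fun main() {
    val df = sampleWithNulls()
    println(viaShortcut(df))
    println(viaUpdate(df))
}
```

The general form is the one to reach for when the condition is more specific than "the value is missing", for example when only some rows should be filled.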