df-research/dataframe/docs/StardustDocs/topics/operations.md at main

2026-02-08 11:20:43 -10:00

8.9 KiB

Explore the full range of operations for transforming and analyzing data with Kotlin DataFrame. Navigate a wide set of operations in Kotlin DataFrame, from filtering and grouping to pivoting and merging.

A data transformation pipeline usually consists of several modification operations, such as filtering, sorting, grouping, pivoting, and adding or removing columns. The Kotlin DataFrame API is designed in a functional style, so the whole processing pipeline can be written as a single statement with a sequential chain of operations. A DataFrame object is immutable: every operation returns a new DataFrame instance that reuses the underlying data structures as much as possible.

df.update { age }.where { city == "Paris" }.with { it - 5 }
    .filter { isHappy && age > 100 }
    .move { name.firstName and name.lastName }.after { isHappy }
    .merge { age and weight }.by { "Age: ${it[0]}, weight: ${it[1]}" }.into("info")
    .rename { isHappy }.into("isOK")

You can play with the "people" dataset that is used in this guide here.
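To illustrate the immutability described above, here is a minimal sketch. It assumes the Kotlin DataFrame library is on the classpath and uses string column names instead of generated column accessors. The sample data and the function names `samplePeople` and `adultsByAge` are hypothetical and are not the guide's "people" dataset.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data for illustration only.
fun samplePeople(): DataFrame<*> = dataFrameOf("name", "age", "city")(
    "Alice", 15, "London",
    "Bob", 45, "Paris",
    "Charlie", 20, "Paris",
)

// One statement, three chained operations; each step returns a new DataFrame.
fun adultsByAge(df: DataFrame<*>): DataFrame<*> = df
    .filter { "age"<Int>() >= 18 }   // keep rows where age is at least 18
    .sortBy("age")                   // ascending by age
    .rename("city").into("town")     // rename one column

fun main() {
    val df = samplePeople()
    val adults = adultsByAge(df)

    println(adults)
    // The original frame still has all of its rows and its original column names.
    println(df.rowsCount())
    println(df.columnNames())
}
```

Because `df` is never modified, intermediate results can be kept and reused safely in other chains.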

Multiplex operations

Simple operations (such as filter or select) return a new DataFrame immediately, while more complex operations return an intermediate object that is used to configure the operation further. Let's call such operations multiplex.

Every multiplex operation configuration consists of:

column selector that is used to select target columns for the operation
additional configuration functions
terminal function that returns modified DataFrame
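The three parts listed above can be seen in a single `update` call. This is a sketch under the assumption that columns are addressed by string names; the sample data and the function name `makeParisiansYounger` are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data for illustration only.
fun samplePeople(): DataFrame<*> = dataFrameOf("name", "age", "city")(
    "Alice", 15, "London",
    "Bob", 45, "Paris",
    "Charlie", 20, "Paris",
)

fun makeParisiansYounger(df: DataFrame<*>): DataFrame<*> = df
    .update { "age"<Int>() }                 // 1. column selector: which columns to change
    .where { "city"<String>() == "Paris" }   // 2. additional configuration: which rows
    .with { it - 5 }                         // 3. terminal function: returns the new DataFrame

fun main() {
    println(makeParisiansYounger(samplePeople()))
}
```

Until the terminal `with` is called, the expression is an intermediate configuration object rather than a DataFrame, so it cannot be passed on to operations such as `filter`.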

Most multiplex operations end with an into or with function. The following naming convention is used:

into defines the column names for storing operation results. Used in move, group, split, merge, gather, groupBy, rename.
with defines a row-wise data transformation using a row expression. Used in update, convert, replace, pivot.
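A short sketch of the two endings, assuming string column names. The sample data and the function names `sampleNames`, `withFullName`, and `withAgeInMonths` are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data for illustration only.
fun sampleNames(): DataFrame<*> = dataFrameOf("firstName", "lastName", "age")(
    "Alice", "Cooper", 15,
    "Bob", "Dylan", 45,
)

// merge ... into: `into` names the column that receives the merged values.
fun withFullName(df: DataFrame<*>): DataFrame<*> =
    df.merge("firstName", "lastName").by(" ").into("fullName")

// convert ... with: `with` computes each new value from the old one, row by row.
fun withAgeInMonths(df: DataFrame<*>): DataFrame<*> =
    df.convert { "age"<Int>() }.with { it * 12 }

fun main() {
    println(withFullName(sampleNames()))
    println(withAgeInMonths(sampleNames()))
}
```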

List of DataRow operations

List of DataRow statistics

List of DataFrame operations

add — add columns
addId — add id column
append — add rows
columns / columnNames / columnTypes — get list of top-level columns, column names or column types
columnsCount — number of top-level columns
concat — union rows from several DataFrame objects
convert — change column values and/or column types
corr — pairwise correlation of columns
count — number of rows that match condition
countDistinct — number of unique rows
cumSum — cumulative sum of column values
describe — basic column statistics
distinct / distinctBy — remove duplicated rows
drop / dropLast / dropWhile / dropNulls / dropNA / dropNaNs — remove rows by condition
duplicate — duplicate rows
explode — spread lists and DataFrame objects vertically into new rows
fillNulls / fillNaNs / fillNA — replace missing values
filter / filterBy — filter rows by condition
first / firstOrNull — find first row by condition
flatten — remove column groupings recursively
forEachRow / forEachColumn — iterate over rows or columns
format — conditional formatting for cell rendering
gather — convert pairs of column names and values into new columns
getColumn / getColumnOrNull / getColumnGroup / getColumns — get one or several columns
group — group columns into ColumnGroup
groupBy — group rows by key columns
head — get first 5 rows of DataFrame
implode — collapse column values into lists grouping by other columns
inferType — infer column type from column values
insert — insert column
join — join two DataFrame objects by key columns
joinWith — join two DataFrame objects by an expression that evaluates joined DataRows to Boolean
last / lastOrNull — find last row by condition
map — map columns into new DataFrame or DataColumn
max / maxBy / maxOf / maxFor — max of values
mean / meanOf / meanFor — average of values
median / medianOf / medianFor — median of values
merge — merge several columns into one
min / minBy / minOf / minFor — min of values
move — move columns or change column groupings
parse — try to convert strings into other types
pivot / pivotCounts / pivotMatches — convert values into new columns
remove — remove columns
rename — rename columns
reorder / reorderColumnsBy / reorderColumnsByName — reorder columns
replace — replace columns
reverse — reverse rows
rows / rowsReversed — get rows in direct or reversed order
rowsCount — number of rows
schema — schema of columns: names, types and hierarchy
select — select subset of columns
shuffle — reorder rows randomly
single / singleOrNull — get single row by condition
sortBy / sortByDesc / sortWith — sort rows
split — split column values into new rows/columns or inplace into lists
std / stdOf / stdFor — standard deviation of values
sum / sumOf / sumFor — sum of values
take / takeLast / takeWhile — get first/last rows
toList / toListOf — export DataFrame into a list of data classes
toMap — export DataFrame into a map from column names to column values
unfold — unfold objects (normal class instances) in columns according to their properties
ungroup — remove column groupings
update — update column values preserving column types
values — Sequence of values traversed by row or by column
valueCounts — counts for unique values
xs — slice DataFrame by given key values
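As a quick orientation, the following sketch exercises a few of the operations listed above (add, groupBy with count, sortByDesc, take). It assumes string column names; the sample data and the function names are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data for illustration only.
fun samplePeople(): DataFrame<*> = dataFrameOf("name", "age", "city")(
    "Alice", 15, "London",
    "Bob", 45, "Paris",
    "Charlie", 20, "Paris",
)

// add: append a computed column.
fun withAdultFlag(df: DataFrame<*>): DataFrame<*> =
    df.add("isAdult") { "age"<Int>() >= 18 }

// groupBy + count: number of rows per key value.
fun peoplePerCity(df: DataFrame<*>): DataFrame<*> =
    df.groupBy("city").count()

// sortByDesc + take: the row with the largest age.
fun oldest(df: DataFrame<*>): DataFrame<*> =
    df.sortByDesc("age").take(1)

fun main() {
    val df = samplePeople()
    println(withAdultFlag(df))
    println(peoplePerCity(df))
    println(oldest(df))
}
```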

Shortcut operations

Some operations are shortcuts for more general operations:

rename, group, flatten are special cases of move
valueCounts is a special case of groupBy
pivotCounts, pivotMatches are special cases of pivot
fillNulls, fillNaNs, fillNA are special cases of update
convert is a special case of replace

You can use these shortcuts to apply the most common DataFrame transformations more easily, and you can always fall back to the general operations if you need more customization.
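For example, the relationship between fillNulls and update can be sketched as follows. The sample data and the function names `viaShortcut` and `viaUpdate` are hypothetical, and the equivalence shown is for this simple case only.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data with one missing age.
fun sampleWithGaps(): DataFrame<*> = dataFrameOf("name", "age")(
    "Alice", 15,
    "Bob", null,
    "Charlie", 20,
)

// Shortcut: fillNulls targets only the null cells of the selected column.
fun viaShortcut(df: DataFrame<*>): DataFrame<*> =
    df.fillNulls { "age"<Int?>() }.with { 0 }

// General form: update with an explicit null check in `where`.
fun viaUpdate(df: DataFrame<*>): DataFrame<*> =
    df.update { "age"<Int?>() }.where { it == null }.with { 0 }

fun main() {
    println(viaShortcut(sampleWithGaps()))
    println(viaUpdate(sampleWithGaps()))
}
```

The general form becomes useful when the condition is more than a null check, for instance when the replacement should apply only to rows that match some other column's value.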
