Files
2026-02-08 11:20:43 -10:00

9.3 KiB
Vendored

Joins two DataFrame objects by join columns.

A join creates a new dataframe by combining rows from two input dataframes according to one or more key columns.
Rows are merged when the values in the join columns match.
If there is no match, whether the row is included and how missing values are filled depends on the type of join (e.g., inner, left, right, full).

Returns a new DataFrame that contains the merged rows and columns from both inputs.

join(otherDf, type = JoinType.Inner) [ { joinColumns } ]

joinColumns: JoinDsl.(LeftDataFrame) -> Columns

interface JoinDsl: LeftDataFrame {
    
    val right: RightDataFrame
    
    fun DataColumn.match(rightColumn: DataColumn)
}

joinColumns is a special case of columns selector that defines column mapping for join.

Related operations:

Examples

Join with explicit keys (with different names)

Use the Join DSL when the key column names differ:

  • access the right DataFrame via right;
  • define the join condition with match.
dfAges

dfCities

// INNER JOIN on differently named keys:
// Merge a row when dfAges.firstName == dfCities.name.
// With the given data all 3 names match → all rows merge.
dfAges.join(dfCities) { firstName match right.name }

Join with explicit keys (with the same names)

If mapped columns have the same name, just select join columns (one or several) from the left DataFrame:

dfLeft

dfRight

// INNER JOIN on "name" only:
// Merge when left.name == right.name.
// Duplicate keys produce multiple merged rows (one per pairing).
dfLeft.join(dfRight) { name }

In this example, the "city" columns from the left and right dataframes do not match to each other. After joining, the "city" column from the right dataframe is included in the result dataframe with the name "city1" to avoid a name conflict.
{ style = "note" }

Join with implicit keys (all columns with the same name)

If joinColumns is not specified, columns with the same name from both DataFrame objects will be used as join columns:

dfLeft

dfRight

// INNER JOIN on all same-named columns ("name" and "city"):
// Merge when BOTH name AND city are equal; otherwise the row is dropped.
dfLeft.join(dfRight)

Join types

Supported join types:

  • Inner (default) — only matched rows from left and right DataFrame objects
  • Filter — only matched rows from left DataFrame
  • Left — all rows from left DataFrame, mismatches from right DataFrame filled with null
  • Right — all rows from right DataFrame, mismatches from left DataFrame filled with null
  • Full — all rows from left and right DataFrame objects, any mismatches filled with null
  • Exclude — only mismatched rows from left DataFrame

For every join type there is a shortcut operation:

df.innerJoin(otherDf) [ { joinColumns } ]
df.filterJoin(otherDf) [ { joinColumns } ]
df.leftJoin(otherDf) [ { joinColumns } ]
df.rightJoin(otherDf) [ { joinColumns } ]
df.fullJoin(otherDf) [ { joinColumns } ]
df.excludeJoin(otherDf) [ { joinColumns } ]

Examples

Inner

dfLeft

dfRight

// INNER JOIN:
// Combines columns from the left and right dataframes
// and keep only rows where (name, city) matches on both sides.
dfLeft.innerJoin(dfRight) { name and city }

Filter

dfLeft

dfRight

// FILTER JOIN:
// Keep ONLY left rows that have ANY match on (name, city).
// No right-side columns are added.
dfLeft.filterJoin(dfRight) { name and city }

Left

dfLeft

dfRight

// LEFT JOIN:
// Keep ALL left rows and add columns from the right dataframe.
// If (name, city) matches, attach right columns values from
// the corresponding row in the right dataframe;
// if not (e.g. ("Bob", "Dubai") row), fill them with `null`.
dfLeft.leftJoin(dfRight) { name and city }

Right

dfLeft

dfRight

// RIGHT JOIN:
// Keep ALL right rows and add columns from the left dataframe.
// If (name, city) matches, attach left columns values from
// the corresponding row in the left dataframe;
// if not (e.g. ("Bob", "Tokyo") row), fill them with `null`.
dfLeft.rightJoin(dfRight) { name and city }

Full

dfLeft

dfRight

// FULL JOIN:
// Keep ALL rows from both sides. Where there's no match on (name, city),
// the other side is filled with nulls.
dfLeft.fullJoin(dfRight) { name and city }

Exclude

dfLeft

dfRight

// EXCLUDE JOIN:
// Keep ONLY left rows that have NO match on (name, city).
// Useful to find "unpaired" left rows.
dfLeft.excludeJoin(dfRight) { name and city }