9.3 KiB
Vendored
Joins two DataFrame objects by join columns.
A join creates a new dataframe by combining rows from two input dataframes according to one or more key columns.
Rows are merged when the values in the join columns match.
If there is no match, whether the row is included and how missing values are filled depends on the type of join (e.g., inner, left, right, full).
Returns a new DataFrame that contains the merged rows and columns from both inputs.
join(otherDf, type = JoinType.Inner) [ { joinColumns } ]
joinColumns: JoinDsl.(LeftDataFrame) -> Columns
interface JoinDsl: LeftDataFrame {
val right: RightDataFrame
fun DataColumn.match(rightColumn: DataColumn)
}
joinColumns is a special case of columns selector that defines column mapping for join.
Examples
Join with explicit keys (with different names)
Use the Join DSL when the key column names differ:
- access the right
DataFrameviaright; - define the join condition with
match.
dfAges
dfCities
// INNER JOIN on differently named keys:
// Merge a row when dfAges.firstName == dfCities.name.
// With the given data all 3 names match → all rows merge.
dfAges.join(dfCities) { firstName match right.name }
Join with explicit keys (with the same names)
If mapped columns have the same name, just select join columns (one or several) from the left DataFrame:
dfLeft
dfRight
// INNER JOIN on "name" only:
// Merge when left.name == right.name.
// Duplicate keys produce multiple merged rows (one per pairing).
dfLeft.join(dfRight) { name }
In this example, the "city" columns from the left and right dataframes do not match to each other. After joining, the "city" column from the right dataframe is included in the result dataframe with the name "city1" to avoid a name conflict.
{ style = "note" }
Join with implicit keys (all columns with the same name)
If joinColumns is not specified, columns with the same name from both DataFrame
objects will be used as join columns:
dfLeft
dfRight
// INNER JOIN on all same-named columns ("name" and "city"):
// Merge when BOTH name AND city are equal; otherwise the row is dropped.
dfLeft.join(dfRight)
Join types
Supported join types:
Inner(default) — only matched rows from left and rightDataFrameobjectsFilter— only matched rows from leftDataFrameLeft— all rows from leftDataFrame, mismatches from rightDataFramefilled withnullRight— all rows from rightDataFrame, mismatches from leftDataFramefilled withnullFull— all rows from left and rightDataFrameobjects, any mismatches filled withnullExclude— only mismatched rows from leftDataFrame
For every join type there is a shortcut operation:
df.innerJoin(otherDf) [ { joinColumns } ]
df.filterJoin(otherDf) [ { joinColumns } ]
df.leftJoin(otherDf) [ { joinColumns } ]
df.rightJoin(otherDf) [ { joinColumns } ]
df.fullJoin(otherDf) [ { joinColumns } ]
df.excludeJoin(otherDf) [ { joinColumns } ]
Examples
Inner
dfLeft
dfRight
// INNER JOIN:
// Combines columns from the left and right dataframes
// and keep only rows where (name, city) matches on both sides.
dfLeft.innerJoin(dfRight) { name and city }
Filter
dfLeft
dfRight
// FILTER JOIN:
// Keep ONLY left rows that have ANY match on (name, city).
// No right-side columns are added.
dfLeft.filterJoin(dfRight) { name and city }
Left
dfLeft
dfRight
// LEFT JOIN:
// Keep ALL left rows and add columns from the right dataframe.
// If (name, city) matches, attach right columns values from
// the corresponding row in the right dataframe;
// if not (e.g. ("Bob", "Dubai") row), fill them with `null`.
dfLeft.leftJoin(dfRight) { name and city }
Right
dfLeft
dfRight
// RIGHT JOIN:
// Keep ALL right rows and add columns from the left dataframe.
// If (name, city) matches, attach left columns values from
// the corresponding row in the left dataframe;
// if not (e.g. ("Bob", "Tokyo") row), fill them with `null`.
dfLeft.rightJoin(dfRight) { name and city }
Full
dfLeft
dfRight
// FULL JOIN:
// Keep ALL rows from both sides. Where there's no match on (name, city),
// the other side is filled with nulls.
dfLeft.fullJoin(dfRight) { name and city }
Exclude
dfLeft
dfRight
// EXCLUDE JOIN:
// Keep ONLY left rows that have NO match on (name, city).
// Useful to find "unpaired" left rows.
dfLeft.excludeJoin(dfRight) { name and city }