init research

This commit is contained in:
2026-02-08 11:20:43 -10:00
commit bdf064f54d
3041 changed files with 1592200 additions and 0 deletions
+257
View File
@@ -0,0 +1,257 @@
[//]: # (title: joinWith)
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.JoinWith-->
Joins two [`DataFrame`](DataFrame.md) objects by a join expression.
```kotlin
joinWith(otherDf, type = JoinType.Inner) { joinExpression }
joinExpression: JoinedDataRow.(JoinedDataRow) -> Boolean
interface JoinedDataRow: LeftDataRow {
val right: RightDataRow
}
```
This function is a [join](join.md) variant that lets you match data using any expression that returns a Boolean,
which also gives opportunity to perform operations that require values from both matching rows.
Can be helpful if the data you want to join wasn't designed relational and requires heuristics to tell if rows are matching,
or has relations other than `equals`.
For example, you can match rows based on:
* **Order relations** such as `>`, `<`, `in` for numerical or DateTime values
* **Spatial relations**, like distance within a certain range if your data includes spatial or geographical values
* **String equivalence** using more complex comparison techniques, such as `contains`, regular expressions, Levenshtein Distance or language models.
### Join types with examples
Supported join types:
* `Inner` (default) — only matched rows from left and right [`DataFrame`](DataFrame.md) objects
* `Filter` — only matched rows from left [`DataFrame`](DataFrame.md)
* `Left` — all rows from left [`DataFrame`](DataFrame.md), mismatches from right [`DataFrame`](DataFrame.md) filled with `null`
* `Right` — all rows from right [`DataFrame`](DataFrame.md), mismatches from left [`DataFrame`](DataFrame.md) filled with `null`
* `Full` — all rows from left and right [`DataFrame`](DataFrame.md) objects, any mismatches filled with `null`
* `Exclude` — only mismatched rows from left
For every join type there is a shortcut operation:
#### Inner join
<!---FUN joinWith-->
<tabs>
<tab title="Properties">
```kotlin
campaigns.innerJoinWith(visits) {
right.date in startDate..endDate
}
```
</tab>
<tab title="Strings">
```kotlin
campaigns.innerJoinWith(visits) {
right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>()
}
```
</tab></tabs>
<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.JoinWith.joinWith.html" width="100%"/>
<!---END-->
#### Filter join
Special case of inner join when you only need the data from the left table.
<!---FUN filterJoinWith-->
<tabs>
<tab title="Properties">
```kotlin
campaigns.filterJoinWith(visits) {
right.date in startDate..endDate
}
```
</tab>
<tab title="Strings">
```kotlin
campaigns.filterJoinWith(visits) {
right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>()
}
```
</tab></tabs>
<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.JoinWith.filterJoinWith.html" width="100%"/>
<!---END-->
#### Left join
<!---FUN leftJoinWith-->
<tabs>
<tab title="Properties">
```kotlin
campaigns.leftJoinWith(visits) {
right.date in startDate..endDate
}
```
</tab>
<tab title="Strings">
```kotlin
campaigns.leftJoinWith(visits) {
right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>()
}
```
</tab></tabs>
<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.JoinWith.leftJoinWith.html" width="100%"/>
<!---END-->
#### Right join
<!---FUN rightJoinWith-->
<tabs>
<tab title="Properties">
```kotlin
campaigns.rightJoinWith(visits) {
right.date in startDate..endDate
}
```
</tab>
<tab title="Strings">
```kotlin
campaigns.rightJoinWith(visits) {
right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>()
}
```
</tab></tabs>
<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.JoinWith.rightJoinWith.html" width="100%"/>
<!---END-->
#### Full join
<!---FUN fullJoinWith-->
<tabs>
<tab title="Properties">
```kotlin
campaigns.fullJoinWith(visits) {
right.date in startDate..endDate
}
```
</tab>
<tab title="Strings">
```kotlin
campaigns.fullJoinWith(visits) {
right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>()
}
```
</tab></tabs>
<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.JoinWith.fullJoinWith.html" width="100%"/>
<!---END-->
#### Exclude join
Can be viewed as **filterJoin** with logically opposite predicate
<!---FUN excludeJoinWith-->
<tabs>
<tab title="Properties">
```kotlin
campaigns.excludeJoinWith(visits) {
right.date in startDate..endDate
}
```
</tab>
<tab title="Strings">
```kotlin
campaigns.excludeJoinWith(visits) {
right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>()
}
```
</tab></tabs>
<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.JoinWith.excludeJoinWith.html" width="100%"/>
<!---END-->
#### Cross join
It can also be called cross product of two [`DataFrame`](DataFrame.md) objects.
<!---FUN crossProduct-->
```kotlin
campaigns.joinWith(visits) { true }
```
<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.JoinWith.crossProduct.html" width="100%"/>
<!---END-->
### Difference from join
[join](join.md) tries to take advantage of knowledge that data in matching columns is the same (because `equals` is used) to minimize number of columns in the resulting dataframe.
<!---FUN compareInnerColumns-->
```kotlin
df1.innerJoin(df2, "index", "age")
```
<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.JoinWith.compareInnerColumns.html" width="100%"/>
<!---END-->
Columns that were used in the condition: `index`, `age` - are present only once. Numerical suffix is used to disambiguate columns that are not used in the condition.
Compare it to an equivalent `joinWith`:
<!---FUN compareInnerValues-->
```kotlin
df1.innerJoinWith(df2) { it["index"] == right["index"] && it["age"] == right["age"] }
```
<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.JoinWith.compareInnerValues.html" width="100%"/>
<!---END-->
Here columns from both [`DataFrame`](DataFrame.md) objects are presented as is.
So [join](join.md) is better suited for `equals` relation, and joinWith is for everything else.
Below are two more examples with join types that allow mismatches.
Note the difference in `null` values
<!---FUN compareLeft-->
```kotlin
df1.leftJoin(df2, "index", "age")
df1.leftJoinWith(df2) { it["index"] == right["index"] && it["age"] == right["age"] }
```
<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.JoinWith.compareLeft.html" width="100%"/>
<!---END-->
<!---FUN compareRight-->
```kotlin
df1.rightJoin(df2, "index", "age")
df1.rightJoinWith(df2) { it["index"] == right["index"] && it["age"] == right["age"] }
```
<inline-frame src="resources/org.jetbrains.kotlinx.dataframe.samples.api.JoinWith.compareRight.html" width="100%"/>
<!---END-->