2026-02-08 11:33:14 -10:00
parent bdf064f54d
commit 03f102bc33
+40 -30
@@ -16,17 +16,19 @@ the two ecosystems on the JVM?
### Clojure equivalents exist
| KT DataFrame feature | Clojure equivalent |
|-|-|
| `DataFrame` | `tech.ml.dataset` / `tablecloth` dataset |
| `dataFrameOf(...)` | `(tc/dataset {...})` |
| `.filter { }` | `(tc/select-rows ds pred)` |
| `.groupBy {}.aggregate {}` | `(-> ds (tc/group-by :col) (tc/aggregate ...))` |
| `.add { }` (computed column) | `(tc/add-column ds :name fn)` |
| `@DataSchema` / compiler plugin | `malli` schema |
| Schema inference from data | `malli.provider/provide` |
| Kandy (plotting) | Tableplot, Hanami, Oz |
| Kotlin Notebook | Clay, Clerk |
| Concept | KT DataFrame | Tablecloth | Plain Kotlin (`List<Map>`) | Plain Clojure (seq of maps) |
|-|-|-|-|-|
| Tabular data | `DataFrame<T>` | `tc/dataset` | `List<Map<String, Any?>>` | `[{:a 1} {:a 2}]` |
| Create | `dataFrameOf("a" to listOf(1,2))` | `(tc/dataset {:a [1 2]})` | `listOf(mapOf("a" to 1), mapOf("a" to 2))` | `[{:a 1} {:a 2}]` |
| Filter | `.filter { col > 10 }` | `(tc/select-rows ds pred)` | `.filter { it["col"] as Int > 10 }` | `(filter #(> (:col %) 10) data)` |
| Group + Aggregate | `.groupBy{}.aggregate{}` | `(-> (tc/group-by) (tc/aggregate))` | `.groupBy { it["k"] }.mapValues { (_, vs) -> vs.map { it["x"] as Double }.average() }` | `(->> data (group-by :k) (map (fn [[k vs]] {:k k :avg (/ (reduce + (map :x vs)) (count vs))})))` |
| Computed column | `.add("y") { x * 2 }` | `(tc/add-column ds :y fn)` | `.map { it + ("y" to (it["x"] as Int) * 2) }` | `(map #(assoc % :y (* (:x %) 2)) data)` |
| Sort | `.sortBy { col }` | `(tc/order-by ds :col)` | `.sortedBy { it["col"] as Comparable<*> }` | `(sort-by :col data)` |
| Join | `.join(other) { col match right.col }` | `(tc/left-join ds other :col)` | manual `associateBy` + map | `(clojure.set/join a b)` or `merge` |
| Schema | `@DataSchema` / compiler plugin | `malli` schema | `data class` | spec / none |
| Schema inference | `@ImportDataSchema` | `malli.provider/provide` | — | — |
| Plotting | Kandy | Tableplot, Hanami, Oz | — | — |
| Notebook | Kotlin Notebook | Clay, Clerk | — | — |
### Schema inference from example data (Clojure)
@@ -223,28 +225,36 @@ The bridge is ~20 LOC. Both sides are columnar, so `Map<String, List>` is the na
### Clay + Tableplot (Clojure) vs Kotlin Notebook + Kandy
| Operation | Kotlin DataFrame + Kandy | Tablecloth + Tableplot |
|-----------|--------------------------|------------------------|
| **Create data** | `dataFrameOf("col" to list)` | `(tc/dataset {"col" list})` |
| **Group + Aggregate** | `df.groupBy { col }.aggregate { mean { x } into "y" }` | `(-> ds (tc/group-by :col) (tc/aggregate {:y #(dfn/mean (% :x))}))` |
| **Filter** | `df.filter { price > 100 }` | `(tc/select-rows ds #(> (:price %) 100))` |
| **Add column** | `df.add("revenue") { price * quantity }` | `(tc/map-columns ds "revenue" ["price" "quantity"] *)` |
| **Bar chart** | `df.plot { bars { x(col); y(col) } }` | `(-> ds (plotly/base {:=x :col :=y :col}) (plotly/layer-bar {}))` |
| **Scatter** | `df.plot { points { x(a); y(b); color(c) } }` | `(-> ds (plotly/base {:=x :a :=y :b}) (plotly/layer-point {:=color :c}))` |
| **Histogram** | `df.plot { histogram(x = col) }` | `(-> ds (plotly/base {:=x :col}) (plotly/layer-histogram {}))` |
| **Notebook** | Kotlin Notebook (IntelliJ plugin, `.ipynb`) | Clay (editor-agnostic, renders `.clj` → HTML) |
| Operation | Kotlin DataFrame + Kandy | Tablecloth + Tableplot | Plain Kotlin (stdlib) | Plain Clojure (core) |
|-----------|--------------------------|------------------------|-----------------------|----------------------|
| **Create data** | `dataFrameOf("col" to list)` | `(tc/dataset {"col" list})` | `listOf(mapOf("col" to v, ...))` | `[{:col v ...}]` |
| **Group + Aggregate** | `df.groupBy { col }.aggregate { mean { x } into "y" }` | `(-> ds (tc/group-by :col) (tc/aggregate {:y #(dfn/mean (% :x))}))` | `data.groupBy { it["col"] }.map { (k, vs) -> mapOf("col" to k, "avg" to vs.map { it["x"] as Double }.average()) }` | `(->> data (group-by :col) (map (fn [[k vs]] {:col k :avg (/ (reduce + (map :x vs)) (count vs))})))` |
| **Filter** | `df.filter { price > 100 }` | `(tc/select-rows ds #(> (:price %) 100))` | `data.filter { it["price"] as Int > 100 }` | `(filter #(> (:price %) 100) data)` |
| **Add column** | `df.add("revenue") { price * quantity }` | `(tc/map-columns ds "revenue" ["price" "quantity"] *)` | `data.map { it + ("revenue" to (it["price"] as Int) * (it["quantity"] as Int)) }` | `(map #(assoc % :revenue (* (:price %) (:quantity %))) data)` |
| **Sort** | `df.sortBy { price }` | `(tc/order-by ds :price)` | `data.sortedBy { it["price"] as Int }` | `(sort-by :price data)` |
| **Join** | `df.join(other) { col match right.col }` | `(tc/left-join ds other :col)` | `val idx = other.associateBy { it["col"] }; data.map { it + (idx[it["col"]] ?: emptyMap()) }` | `(clojure.set/join set-a set-b)` |
| **Bar chart** | `df.plot { bars { x(col); y(col) } }` | `(-> ds (plotly/base {:=x :col :=y :col}) (plotly/layer-bar {}))` | — | — |
| **Scatter** | `df.plot { points { x(a); y(b); color(c) } }` | `(-> ds (plotly/base {:=x :a :=y :b}) (plotly/layer-point {:=color :c}))` | — | — |
| **Histogram** | `df.plot { histogram(x = col) }` | `(-> ds (plotly/base {:=x :col}) (plotly/layer-histogram {}))` | — | — |
| **Notebook** | Kotlin Notebook (IntelliJ plugin, `.ipynb`) | Clay (editor-agnostic, renders `.clj` → HTML) | — | — |
**Key differences:**
1. **Type safety**: Kotlin has compile-time column access (`df.price`), Clojure has runtime schema inference
2. **DSL style**: Kotlin uses function builders `plot { bars { } }`, Clojure uses data-driven pipelines
3. **IDE integration**: Kotlin Notebook is IntelliJ-only; Clay is editor-agnostic
4. **Interactivity**: Kandy produces Let's Plot (SVG), Tableplot produces Plotly.js (interactive zoom/pan/hover)
5. **REPL experience**: Clojure's REPL-driven dev is faster for exploration
6. **Composability**: Tableplot layers are independently composable; Kandy's DSL is more monolithic
1. **Type safety**: Kotlin has compile-time column access (`df.price`), Clojure has runtime schema inference. Plain Kotlin collections need casts everywhere (`it["x"] as Int`). Plain Clojure needs neither generated accessors nor casts: keywords are functions, and values are dynamically typed.
2. **Filter/map/sort are ~identical**: For row-level operations, plain collections and DataFrame APIs converge. Clojure's `filter`/`map`/`sort-by` + keywords-as-functions is nearly as concise as tablecloth.
3. **Group+aggregate is close in Clojure, painful in plain Kotlin**: Clojure has `group-by` in core, and aggregation is just `map` + `reduce` over the groups — standard stuff. The tablecloth version is more declarative but not fundamentally simpler. In Kotlin, the plain version needs explicit casts at every step, making it significantly worse.
4. **Joins are built into Clojure**: `clojure.set/join` performs natural joins on sets of maps, matching on shared keys automatically (or with an explicit key mapping). Combined with `merge`, `clojure.set/union`, `clojure.set/difference`, and `clojure.set/intersection`, Clojure has relational algebra in the stdlib. Plain Kotlin has nothing comparable — you build index maps by hand.
5. **Nil punning eliminates missing-value boilerplate**: In Clojure, `(:x row)` on a row without `:x` returns `nil`. `nil` is falsy, `(seq nil)` is `nil`, `(conj nil x)` works, `(remove nil? xs)` is idiomatic. Missing values just flow through without special handling. TMD's RoaringBitmap approach is more principled for columnar statistics, but for row-oriented map traversal, nil punning already covers most cases naturally.
6. **Reducers and transducers close the performance gap**: Clojure's `clojure.core.reducers` (`r/fold`) can parallelize operations over plain vectors of maps using fork/join, and transducers (`(into [] (comp (filter pred) (map f)) data)`) eliminate intermediate seq allocation entirely. These work on existing Clojure data — no special data structure required. `r/fold` on a vector of maps gives you parallel group-by/aggregate without reaching for a DataFrame. That said, columnar storage still wins for numeric-heavy workloads: TMD's `dfn/mean` operates on a contiguous typed buffer with no boxing, which transducers over maps can't match.
7. **Interactivity**: Kandy renders through Lets-Plot (SVG), Tableplot renders through Plotly.js (interactive zoom/pan/hover). No stdlib plotting in either language.
8. **REPL experience**: Plain Clojure data is already inspectable at the REPL — DataFrames add pretty-printing but plain maps are arguably *easier* to inspect (no special printer needed).
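A minimal sketch of points 3 to 5 above, using only `clojure.core` and `clojure.set`. The sample rows, the keys `:id`, `:k`, `:x`, `:label`, and the helper `avg-by` are hypothetical names chosen for illustration; they do not come from the project.

```clojure
(require '[clojure.set :as set])

;; Hypothetical sample relations (sets of maps); all keys are made up.
(def orders #{{:id 1 :k "a" :x 10.0}
              {:id 2 :k "a" :x 20.0}
              {:id 3 :k "b" :x 5.0}})

(def labels #{{:k "a" :label "alpha"}
              {:k "b" :label "beta"}})

;; Natural join: rows are matched on the keys both relations share (:k here),
;; and each matching pair is merged into one map.
(def joined (set/join orders labels))

;; Group + aggregate with core functions only.
;; `avg-by` is an illustrative helper, not a library function.
(defn avg-by [group-key value-key rows]
  (->> rows
       (group-by group-key)
       (map (fn [[g rs]]
              {group-key g
               :avg (/ (reduce + (map value-key rs)) (count rs))}))
       set))

;; Nil punning: a missing key yields nil, which `keep` simply drops.
(def sparse-rows [{:x 1} {:y 2} {:x 3}])
(def present-xs (vec (keep :x sparse-rows)))
```

At the REPL, `(avg-by :k :x orders)` returns one map per group, and `present-xs` holds only the values from rows that actually have `:x`. Note that `set/join` behaves as an inner join, so rows without a match in the other relation are dropped.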
**Verdict**: Both ecosystems are fully capable. Kotlin wins on type safety and IDE ergonomics.
Clojure wins on REPL interactivity, composability, and editor freedom. The bridge makes it possible to
use both in the same project.
**Verdict**: Plain Clojure seq-of-maps is competitive with DataFrame APIs for most operations.
`filter`, `map`, `sort-by`, `group-by`, `clojure.set/join`, transducers, and `r/fold` are all in the
stdlib. Keywords-as-functions, nil punning, and reducers/transducers mean plain Clojure data already
supports expressive querying, missing-value tolerance, and parallel computation — without a library.
DataFrames earn their keep on: (a) contiguous typed column buffers for numeric-heavy workloads,
(b) built-in statistical functions (`dfn/mean`, `dfn/variance`, etc.), and (c) ecosystem integration
(plotting, notebooks, Arrow I/O). For small-to-medium data, plain Clojure maps may be all you need.
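The transducer and `r/fold` claims in the verdict can be sketched as follows. This is an illustration under assumptions, not a benchmark: the data, the keys `:k` and `:x`, and the helper `sum-by` are hypothetical, and whether the parallel fold is actually faster depends on data size and hardware.

```clojure
(require '[clojure.core.reducers :as r])

;; Hypothetical data: 1000 rows in a plain vector of maps.
(def data (vec (for [i (range 1000)] {:k (mod i 3) :x i})))

;; Transducer pipeline: filter and map are fused into one pass over `data`,
;; with no intermediate lazy seq between the two steps.
(def big-doubled
  (into []
        (comp (filter #(> (:x %) 995))
              (map #(update % :x * 2)))
        data))

;; Parallel group-and-sum with r/fold. `sum-by` is an illustrative helper.
;; The combine fn supplies the empty accumulator (zero args) and merges
;; partial results (two args); the reduce fn folds one row into a partial map.
(defn sum-by [group-key value-key rows]
  (r/fold
    (fn
      ([] {})
      ([a b] (merge-with + a b)))
    (fn [acc row]
      (update acc (group-key row) (fnil + 0) (value-key row)))
    rows))

(def sums (sum-by :k :x data))
```

`r/fold` only parallelizes over foldable collections such as vectors, and it reduces sequentially when the input is smaller than its partition size (512 by default), so the vector above is split into chunks before the partial maps are merged.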
---