This commit is contained in:
2026-02-08 11:33:14 -10:00
parent bdf064f54d
commit 03f102bc33
+40 -30
View File
@@ -16,17 +16,19 @@ the two ecosystems on the JVM?
### Clojure equivalents exist ### Clojure equivalents exist
| KT DataFrame feature | Clojure equivalent | | Concept | KT DataFrame | Tablecloth | Plain Kotlin (`List<Map>`) | Plain Clojure (seq of maps) |
|-|-| |-|-|-|-|-|
| `DataFrame` | `tech.ml.dataset` / `tablecloth` dataset | | Tabular data | `DataFrame<T>` | `tc/dataset` | `List<Map<String, Any?>>` | `[{:a 1} {:b 2}]` |
| `dataFrameOf(...)` | `(tc/dataset {...})` | | Create | `dataFrameOf("a" to listOf(1,2))` | `(tc/dataset {:a [1 2]})` | `listOf(mapOf("a" to 1), mapOf("a" to 2))` | `[{:a 1} {:a 2}]` |
| `.filter { }` | `(tc/select-rows ds pred)` | | Filter | `.filter { col > 10 }` | `(tc/select-rows ds pred)` | `.filter { it["col"] as Int > 10 }` | `(filter #(> (:col %) 10) data)` |
| `.groupBy {}.aggregate {}` | `(-> ds (tc/group-by :col) (tc/aggregate ...))` | | Group + Aggregate | `.groupBy{}.aggregate{}` | `(-> (tc/group-by) (tc/aggregate))` | `.groupBy { it["k"] }.mapValues { (_, vs) -> vs.map { it["x"] as Double }.average() }` | `(->> data (group-by :k) (map (fn [[k vs]] {:k k :avg (/ (reduce + (map :x vs)) (count vs))})))` |
| `.add { }` (computed column) | `(tc/add-column ds :name fn)` | | Computed column | `.add("y") { x * 2 }` | `(tc/add-column ds :y fn)` | `.map { it + ("y" to (it["x"] as Int) * 2) }` | `(map #(assoc % :y (* (:x %) 2)) data)` |
| `@DataSchema` / compiler plugin | `malli` schema | | Sort | `.sortBy { col }` | `(tc/order-by ds :col)` | `.sortedBy { it["col"] as Comparable<*> }` | `(sort-by :col data)` |
| Schema inference from data | `malli.provider/provide` | | Join | `.join(other) { col match right.col }` | `(tc/left-join ds other :col)` | manual `associateBy` + map | `(clojure.set/join a b)` or `merge` |
| Kandy (plotting) | Tableplot, Hanami, Oz | | Schema | `@DataSchema` / compiler plugin | `malli` schema | `data class` | spec / none |
| Kotlin Notebook | Clay, Clerk | | Schema inference | `@ImportDataSchema` | `malli.provider/provide` | — | — |
| Plotting | Kandy | Tableplot, Hanami, Oz | — | — |
| Notebook | Kotlin Notebook | Clay, Clerk | — | — |
### Schema inference from example data (Clojure) ### Schema inference from example data (Clojure)
@@ -223,28 +225,36 @@ The bridge is ~20 LOC. Both sides are columnar, so `Map<String, List>` is the na
### Clay + Tableplot (Clojure) vs Kotlin Notebook + Kandy ### Clay + Tableplot (Clojure) vs Kotlin Notebook + Kandy
| Operation | Kotlin DataFrame + Kandy | Tablecloth + Tableplot | | Operation | Kotlin DataFrame + Kandy | Tablecloth + Tableplot | Plain Kotlin (stdlib) | Plain Clojure (core) |
|-----------|--------------------------|------------------------| |-----------|--------------------------|------------------------|-----------------------|----------------------|
| **Create data** | `dataFrameOf("col" to list)` | `(tc/dataset {"col" list})` | | **Create data** | `dataFrameOf("col" to list)` | `(tc/dataset {"col" list})` | `listOf(mapOf("col" to v, ...))` | `[{:col v ...}]` |
| **Group + Aggregate** | `df.groupBy { col }.aggregate { mean { x } into "y" }` | `(-> ds (tc/group-by :col) (tc/aggregate {:y #(dfn/mean (% :x))}))` | | **Group + Aggregate** | `df.groupBy { col }.aggregate { mean { x } into "y" }` | `(-> ds (tc/group-by :col) (tc/aggregate {:y #(dfn/mean (% :x))}))` | `data.groupBy { it["col"] }.map { (k, vs) -> mapOf("col" to k, "avg" to vs.map { it["x"] as Double }.average()) }` | `(->> data (group-by :col) (map (fn [[k vs]] {:col k :avg (/ (reduce + (map :x vs)) (count vs))})))` |
| **Filter** | `df.filter { price > 100 }` | `(tc/select-rows ds #(> (:price %) 100))` | | **Filter** | `df.filter { price > 100 }` | `(tc/select-rows ds #(> (:price %) 100))` | `data.filter { it["price"] as Int > 100 }` | `(filter #(> (:price %) 100) data)` |
| **Add column** | `df.add("revenue") { price * quantity }` | `(tc/map-columns ds "revenue" ["price" "quantity"] *)` | | **Add column** | `df.add("revenue") { price * quantity }` | `(tc/map-columns ds "revenue" ["price" "quantity"] *)` | `data.map { it + ("revenue" to (it["price"] as Int) * (it["qty"] as Int)) }` | `(map #(assoc % :revenue (* (:price %) (:qty %))) data)` |
| **Bar chart** | `df.plot { bars { x(col); y(col) } }` | `(-> ds (plotly/base {:=x :col :=y :col}) (plotly/layer-bar {}))` | | **Sort** | `df.sortBy { price }` | `(tc/order-by ds :price)` | `data.sortedBy { it["price"] as Int }` | `(sort-by :price data)` |
| **Scatter** | `df.plot { points { x(a); y(b); color(c) } }` | `(-> ds (plotly/base {:=x :a :=y :b}) (plotly/layer-point {:=color :c}))` | | **Join** | `df.join(other) { col match right.col }` | `(tc/left-join ds other :col)` | `val idx = other.associateBy { it["col"] }; data.map { it + (idx[it["col"]] ?: emptyMap()) }` | `(clojure.set/join set-a set-b)` |
| **Histogram** | `df.plot { histogram(x = col) }` | `(-> ds (plotly/base {:=x :col}) (plotly/layer-histogram {}))` | | **Bar chart** | `df.plot { bars { x(col); y(col) } }` | `(-> ds (plotly/base {:=x :col :=y :col}) (plotly/layer-bar {}))` | — | — |
| **Notebook** | Kotlin Notebook (IntelliJ plugin, `.ipynb`) | Clay (editor-agnostic, renders `.clj` → HTML) | | **Scatter** | `df.plot { points { x(a); y(b); color(c) } }` | `(-> ds (plotly/base {:=x :a :=y :b}) (plotly/layer-point {:=color :c}))` | — | — |
| **Histogram** | `df.plot { histogram(x = col) }` | `(-> ds (plotly/base {:=x :col}) (plotly/layer-histogram {}))` | — | — |
| **Notebook** | Kotlin Notebook (IntelliJ plugin, `.ipynb`) | Clay (editor-agnostic, renders `.clj` → HTML) | — | — |
**Key differences:** **Key differences:**
1. **Type safety**: Kotlin has compile-time column access (`df.price`), Clojure has runtime schema inference 1. **Type safety**: Kotlin has compile-time column access (`df.price`), Clojure has runtime schema inference. Plain Kotlin collections need casts everywhere (`it["x"] as Int`). Plain Clojure needs neither — keywords are functions, values are dynamically typed.
2. **DSL style**: Kotlin uses function builders `plot { bars { } }`, Clojure uses data-driven pipelines 2. **Filter/map/sort are ~identical**: For row-level operations, plain collections and DataFrame APIs converge. Clojure's `filter`/`map`/`sort-by` + keywords-as-functions is nearly as concise as tablecloth.
3. **IDE integration**: Kotlin Notebook is IntelliJ-only; Clay is editor-agnostic 3. **Group+aggregate is close in Clojure, painful in plain Kotlin**: Clojure has `group-by` in core, and aggregation is just `map` + `reduce` over the groups — standard stuff. The tablecloth version is more declarative but not fundamentally simpler. In Kotlin, the plain version needs explicit casts at every step, making it significantly worse.
4. **Interactivity**: Kandy produces Let's Plot (SVG), Tableplot produces Plotly.js (interactive zoom/pan/hover) 4. **Joins are built into Clojure**: `clojure.set/join` performs natural joins on sets of maps, matching on shared keys automatically (or with an explicit key mapping). Combined with `merge`, `clojure.set/union`, `clojure.set/difference`, and `clojure.set/intersection`, Clojure has relational algebra in the stdlib. Plain Kotlin has nothing comparable — you build index maps by hand.
5. **REPL experience**: Clojure's REPL-driven dev is faster for exploration 5. **Nil punning eliminates missing-value boilerplate**: In Clojure, `(:x row)` on a row without `:x` returns `nil`. `nil` is falsy, `(seq nil)` is `nil`, `(conj nil x)` works, `(remove nil? xs)` is idiomatic. Missing values just flow through without special handling. TMD's RoaringBitmap approach is more principled for columnar statistics, but for row-oriented map traversal, nil punning already covers most cases naturally.
6. **Composability**: Tableplot layers are independently composable; Kandy's DSL is more monolithic 6. **Reducers and transducers close the performance gap**: Clojure's `clojure.core.reducers` (`r/fold`) can parallelize operations over plain vectors of maps using fork/join, and transducers (`(into [] (comp (filter pred) (map f)) data)`) eliminate intermediate seq allocation entirely. These work on existing Clojure data — no special data structure required. `r/fold` on a vector of maps gives you parallel group-by/aggregate without reaching for a DataFrame. That said, columnar storage still wins for numeric-heavy workloads: TMD's `dfn/mean` operates on a contiguous typed buffer with no boxing, which transducers over maps can't match.
7. **Interactivity**: Kandy produces Let's Plot (SVG), Tableplot produces Plotly.js (interactive zoom/pan/hover). No stdlib plotting in either language.
8. **REPL experience**: Plain Clojure data is already inspectable at the REPL — DataFrames add pretty-printing but plain maps are arguably *easier* to inspect (no special printer needed).
**Verdict**: Both ecosystems are fully capable. Kotlin wins on type safety and IDE ergonomics. **Verdict**: Plain Clojure seq-of-maps is competitive with DataFrame APIs for most operations.
Clojure wins on REPL interactivity, composability, and editor freedom. The bridge makes it possible to `filter`, `map`, `sort-by`, `group-by`, `clojure.set/join`, transducers, and `r/fold` are all in the
use both in the same project. stdlib. Keywords-as-functions, nil punning, and reducers/transducers mean plain Clojure data already
supports expressive querying, missing-value tolerance, and parallel computation — without a library.
DataFrames earn their keep on: (a) contiguous typed column buffers for numeric-heavy workloads,
(b) built-in statistical functions (`dfn/mean`, `dfn/variance`, etc.), and (c) ecosystem integration
(plotting, notebooks, Arrow I/O). For small-to-medium data, plain Clojure maps may be all you need.
--- ---