update

2026-02-08 11:33:14 -10:00
parent bdf064f54d
commit 03f102bc33
1 changed files with 40 additions and 30 deletions
@@ -16,17 +16,19 @@ the two ecosystems on the JVM?

 ### Clojure equivalents exist

-| KT DataFrame feature | Clojure equivalent |
-|-|-|
-| `DataFrame` | `tech.ml.dataset` / `tablecloth` dataset |
-| `dataFrameOf(...)` | `(tc/dataset {...})` |
-| `.filter { }` | `(tc/select-rows ds pred)` |
-| `.groupBy {}.aggregate {}` | `(-> ds (tc/group-by :col) (tc/aggregate ...))` |
-| `.add { }` (computed column) | `(tc/add-column ds :name fn)` |
-| `@DataSchema` / compiler plugin | `malli` schema |
-| Schema inference from data | `malli.provider/provide` |
-| Kandy (plotting) | Tableplot, Hanami, Oz |
-| Kotlin Notebook | Clay, Clerk |
+| Concept | KT DataFrame | Tablecloth | Plain Kotlin (`List<Map>`) | Plain Clojure (seq of maps) |
+|-|-|-|-|-|
+| Tabular data | `DataFrame<T>` | `tc/dataset` | `List<Map<String, Any?>>` | `[{:a 1} {:a 2}]` |
+| Create | `dataFrameOf("a" to listOf(1,2))` | `(tc/dataset {:a [1 2]})` | `listOf(mapOf("a" to 1), mapOf("a" to 2))` | `[{:a 1} {:a 2}]` |
+| Filter | `.filter { col > 10 }` | `(tc/select-rows ds pred)` | `.filter { it["col"] as Int > 10 }` | `(filter #(> (:col %) 10) data)` |
+| Group + Aggregate | `.groupBy{}.aggregate{}` | `(-> (tc/group-by) (tc/aggregate))` | `.groupBy { it["k"] }.mapValues { (_, vs) -> vs.map { it["x"] as Double }.average() }` | `(->> data (group-by :k) (map (fn [[k vs]] {:k k :avg (/ (reduce + (map :x vs)) (count vs))})))` |
+| Computed column | `.add("y") { x * 2 }` | `(tc/add-column ds :y fn)` | `.map { it + ("y" to (it["x"] as Int) * 2) }` | `(map #(assoc % :y (* (:x %) 2)) data)` |
+| Sort | `.sortBy { col }` | `(tc/order-by ds :col)` | `.sortedBy { it["col"] as Int }` | `(sort-by :col data)` |
+| Join | `.join(other) { col match right.col }` | `(tc/left-join ds other :col)` | manual `associateBy` + map | `(clojure.set/join a b)` or `merge` |
+| Schema | `@DataSchema` / compiler plugin | `malli` schema | `data class` | spec / none |
+| Schema inference | `@ImportDataSchema` | `malli.provider/provide` | — | — |
+| Plotting | Kandy | Tableplot, Hanami, Oz | — | — |
+| Notebook | Kotlin Notebook | Clay, Clerk | — | — |

 ### Schema inference from example data (Clojure)

@@ -223,28 +225,36 @@ The bridge is ~20 LOC. Both sides are columnar, so `Map<String, List>` is the na

 ### Clay + Tableplot (Clojure) vs Kotlin Notebook + Kandy

-| Operation | Kotlin DataFrame + Kandy | Tablecloth + Tableplot |
-|-----------|--------------------------|------------------------|
-| **Create data** | `dataFrameOf("col" to list)` | `(tc/dataset {"col" list})` |
-| **Group + Aggregate** | `df.groupBy { col }.aggregate { mean { x } into "y" }` | `(-> ds (tc/group-by :col) (tc/aggregate {:y #(dfn/mean (% :x))}))` |
-| **Filter** | `df.filter { price > 100 }` | `(tc/select-rows ds #(> (:price %) 100))` |
-| **Add column** | `df.add("revenue") { price * quantity }` | `(tc/map-columns ds "revenue" ["price" "quantity"] *)` |
-| **Bar chart** | `df.plot { bars { x(col); y(col) } }` | `(-> ds (plotly/base {:=x :col :=y :col}) (plotly/layer-bar {}))` |
-| **Scatter** | `df.plot { points { x(a); y(b); color(c) } }` | `(-> ds (plotly/base {:=x :a :=y :b}) (plotly/layer-point {:=color :c}))` |
-| **Histogram** | `df.plot { histogram(x = col) }` | `(-> ds (plotly/base {:=x :col}) (plotly/layer-histogram {}))` |
-| **Notebook** | Kotlin Notebook (IntelliJ plugin, `.ipynb`) | Clay (editor-agnostic, renders `.clj` → HTML) |
+| Operation | Kotlin DataFrame + Kandy | Tablecloth + Tableplot | Plain Kotlin (stdlib) | Plain Clojure (core) |
+|-----------|--------------------------|------------------------|-----------------------|----------------------|
+| **Create data** | `dataFrameOf("col" to list)` | `(tc/dataset {"col" list})` | `listOf(mapOf("col" to v, ...))` | `[{:col v ...}]` |
+| **Group + Aggregate** | `df.groupBy { col }.aggregate { mean { x } into "y" }` | `(-> ds (tc/group-by :col) (tc/aggregate {:y #(dfn/mean (% :x))}))` | `data.groupBy { it["col"] }.map { (k, vs) -> mapOf("col" to k, "avg" to vs.map { it["x"] as Double }.average()) }` | `(->> data (group-by :col) (map (fn [[k vs]] {:col k :avg (/ (reduce + (map :x vs)) (count vs))})))` |
+| **Filter** | `df.filter { price > 100 }` | `(tc/select-rows ds #(> (:price %) 100))` | `data.filter { it["price"] as Int > 100 }` | `(filter #(> (:price %) 100) data)` |
+| **Add column** | `df.add("revenue") { price * quantity }` | `(tc/map-columns ds "revenue" ["price" "quantity"] *)` | `data.map { it + ("revenue" to (it["price"] as Int) * (it["quantity"] as Int)) }` | `(map #(assoc % :revenue (* (:price %) (:quantity %))) data)` |
+| **Sort** | `df.sortBy { price }` | `(tc/order-by ds :price)` | `data.sortedBy { it["price"] as Int }` | `(sort-by :price data)` |
+| **Join** | `df.join(other) { col match right.col }` | `(tc/left-join ds other :col)` | `val idx = other.associateBy { it["col"] }; data.map { it + (idx[it["col"]] ?: emptyMap()) }` | `(clojure.set/join set-a set-b)` |
+| **Bar chart** | `df.plot { bars { x(col); y(col) } }` | `(-> ds (plotly/base {:=x :col :=y :col}) (plotly/layer-bar {}))` | — | — |
+| **Scatter** | `df.plot { points { x(a); y(b); color(c) } }` | `(-> ds (plotly/base {:=x :a :=y :b}) (plotly/layer-point {:=color :c}))` | — | — |
+| **Histogram** | `df.plot { histogram(x = col) }` | `(-> ds (plotly/base {:=x :col}) (plotly/layer-histogram {}))` | — | — |
+| **Notebook** | Kotlin Notebook (IntelliJ plugin, `.ipynb`) | Clay (editor-agnostic, renders `.clj` → HTML) | — | — |

 **Key differences:**
-1. **Type safety**: Kotlin has compile-time column access (`df.price`), Clojure has runtime schema inference
-2. **DSL style**: Kotlin uses function builders `plot { bars { } }`, Clojure uses data-driven pipelines
-3. **IDE integration**: Kotlin Notebook is IntelliJ-only; Clay is editor-agnostic
-4. **Interactivity**: Kandy produces Let's Plot (SVG), Tableplot produces Plotly.js (interactive zoom/pan/hover)
-5. **REPL experience**: Clojure's REPL-driven dev is faster for exploration
-6. **Composability**: Tableplot layers are independently composable; Kandy's DSL is more monolithic
+1. **Type safety**: Kotlin DataFrame has compile-time column access (`df.price`); Tablecloth relies on runtime schema inference. Plain Kotlin collections need a cast at every access (`it["x"] as Int`). Plain Clojure needs neither a schema nor casts: keywords are functions, and values are dynamically typed.
+2. **Filter/map/sort are nearly identical**: For row-level operations, plain collections and DataFrame APIs converge. Clojure's `filter`/`map`/`sort-by` with keywords-as-functions is nearly as concise as tablecloth.
+3. **Group+aggregate is close in Clojure, painful in plain Kotlin**: Clojure has `group-by` in core, and aggregation is just `map` + `reduce` over the groups — standard stuff. The tablecloth version is more declarative but not fundamentally simpler. In Kotlin, the plain version needs explicit casts at every step, making it significantly worse.
+4. **Joins are built into Clojure**: `clojure.set/join` performs natural joins on sets of maps, matching on shared keys automatically (or with an explicit key mapping). Combined with `merge`, `clojure.set/union`, `clojure.set/difference`, and `clojure.set/intersection`, Clojure has relational algebra in the stdlib. Plain Kotlin has nothing comparable — you build index maps by hand.
+5. **Nil punning eliminates missing-value boilerplate**: In Clojure, `(:x row)` on a row without `:x` returns `nil`. `nil` is falsy, `(seq nil)` is `nil`, `(conj nil x)` works, `(remove nil? xs)` is idiomatic. Missing values flow through lookups and sequence operations without special handling, although arithmetic and comparisons on `nil` (e.g. `(> nil 10)`) still throw, so numeric steps need a `keep` or `remove nil?` first. TMD's RoaringBitmap approach is more principled for columnar statistics, but for row-oriented map traversal, nil punning already covers most cases naturally.
+6. **Reducers and transducers close the performance gap**: Clojure's `clojure.core.reducers` (`r/fold`) can parallelize operations over plain vectors of maps using fork/join, and transducers (`(into [] (comp (filter pred) (map f)) data)`) eliminate intermediate seq allocation entirely. These work on existing Clojure data — no special data structure required. `r/fold` on a vector of maps gives you parallel group-by/aggregate without reaching for a DataFrame. That said, columnar storage still wins for numeric-heavy workloads: TMD's `dfn/mean` operates on a contiguous typed buffer with no boxing, which transducers over maps can't match.
+7. **Interactivity**: Kandy renders through Lets-Plot (SVG), Tableplot produces Plotly.js (interactive zoom/pan/hover). No stdlib plotting in either language.
+8. **REPL experience**: Plain Clojure data is already inspectable at the REPL — DataFrames add pretty-printing but plain maps are arguably *easier* to inspect (no special printer needed).
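
The join and nil-punning points above can be illustrated with a minimal sketch. The sample rows (`orders`, `customers`) and their keys are hypothetical and do not come from the original document; only `clojure.set/join` and `keep` from the standard library are used.

```clojure
(require '[clojure.set :as set])

;; Hypothetical sample data (assumption, not from the original document).
(def orders    #{{:id 1 :cust 10 :price 120}
                 {:id 2 :cust 11 :price 80}
                 {:id 3 :cust 12}})            ; this row has no :price
(def customers #{{:cust 10 :name "Ada"}
                 {:cust 11 :name "Lin"}})

;; Natural join on the shared key :cust.
;; Order 3 has no matching customer, so it is dropped (inner-join behavior).
(def joined (set/join orders customers))

;; Nil punning: (:price row) is nil when the key is absent,
;; and keep discards nil results, so no explicit missing-value check is needed.
(def prices (keep :price orders))
```

One caveat worth knowing: the two-argument `clojure.set/join` picks the join keys from the first map of each relation, so it is most predictable when every row carries the join key, as `:cust` does here.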

-**Verdict**: Both ecosystems are fully capable. Kotlin wins on type safety and IDE ergonomics.
-Clojure wins on REPL interactivity, composability, and editor freedom. The bridge makes it possible to
-use both in the same project.
+**Verdict**: Plain Clojure seq-of-maps is competitive with DataFrame APIs for most operations.
+`filter`, `map`, `sort-by`, `group-by`, `clojure.set/join`, transducers, and `r/fold` are all in the
+stdlib. Keywords-as-functions, nil punning, and reducers/transducers mean plain Clojure data already
+supports expressive querying, missing-value tolerance, and parallel computation — without a library.
+DataFrames earn their keep on: (a) contiguous typed column buffers for numeric-heavy workloads,
+(b) built-in statistical functions (`dfn/mean`, `dfn/variance`, etc.), and (c) ecosystem integration
+(plotting, notebooks, Arrow I/O). For small-to-medium data, plain Clojure maps may be all you need.
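
As a rough sketch of the transducer and `r/fold` claims in the verdict, the following computes a filtered column and a per-group mean over a plain vector of maps. The sample rows and the helper names (`add-row`, `combine`, `mean-by-k`) are hypothetical; this is one possible formulation, not a benchmarked or canonical one.

```clojure
(require '[clojure.core.reducers :as r])

;; Hypothetical sample rows (assumption, not from the original document).
(def data [{:k :a :x 1.0} {:k :b :x 10.0} {:k :a :x 3.0} {:k :b :x 20.0}])

;; Transducer pipeline: filter and project in a single pass into a vector.
(def big-xs (into [] (comp (filter #(> (:x %) 2.0)) (map :x)) data))

;; Reducing step: accumulate [sum count] per group key.
(defn add-row [acc {:keys [k x]}]
  (update acc k (fn [[sum n]] [(+ (or sum 0.0) x) (inc (or n 0))])))

;; Combining step: zero-arity gives the identity, two-arity merges partial results.
(defn combine
  ([] {})
  ([m1 m2] (merge-with (fn [[s1 n1] [s2 n2]] [(+ s1 s2) (+ n1 n2)]) m1 m2)))

;; r/fold splits large vectors across fork/join tasks; small inputs reduce serially.
(defn mean-by-k [rows]
  (into {}
        (map (fn [[k [sum n]]] [k (/ sum n)]))
        (r/fold combine add-row rows)))
```

Calling `(mean-by-k data)` on the rows above gives `{:a 2.0, :b 15.0}`. With only four rows the fold runs as a plain reduce; parallelism applies once the vector exceeds the default partition size of 512 elements.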

 ---