update
@@ -16,17 +16,19 @@ the two ecosystems on the JVM?
 
 ### Clojure equivalents exist
 
-| KT DataFrame feature | Clojure equivalent |
-|-|-|
-| `DataFrame` | `tech.ml.dataset` / `tablecloth` dataset |
-| `dataFrameOf(...)` | `(tc/dataset {...})` |
-| `.filter { }` | `(tc/select-rows ds pred)` |
-| `.groupBy {}.aggregate {}` | `(-> ds (tc/group-by :col) (tc/aggregate ...))` |
-| `.add { }` (computed column) | `(tc/add-column ds :name fn)` |
-| `@DataSchema` / compiler plugin | `malli` schema |
-| Schema inference from data | `malli.provider/provide` |
-| Kandy (plotting) | Tableplot, Hanami, Oz |
-| Kotlin Notebook | Clay, Clerk |
+| Concept | KT DataFrame | Tablecloth | Plain Kotlin (`List<Map>`) | Plain Clojure (seq of maps) |
+|-|-|-|-|-|
+| Tabular data | `DataFrame<T>` | `tc/dataset` | `List<Map<String, Any?>>` | `[{:a 1} {:b 2}]` |
+| Create | `dataFrameOf("a" to listOf(1,2))` | `(tc/dataset {:a [1 2]})` | `listOf(mapOf("a" to 1), mapOf("a" to 2))` | `[{:a 1} {:a 2}]` |
+| Filter | `.filter { col > 10 }` | `(tc/select-rows ds pred)` | `.filter { it["col"] as Int > 10 }` | `(filter #(> (:col %) 10) data)` |
+| Group + Aggregate | `.groupBy{}.aggregate{}` | `(-> (tc/group-by) (tc/aggregate))` | `.groupBy { it["k"] }.mapValues { (_, vs) -> vs.map { it["x"] as Double }.average() }` | `(->> data (group-by :k) (map (fn [[k vs]] {:k k :avg (/ (reduce + (map :x vs)) (count vs))})))` |
+| Computed column | `.add("y") { x * 2 }` | `(tc/add-column ds :y fn)` | `.map { it + ("y" to (it["x"] as Int) * 2) }` | `(map #(assoc % :y (* (:x %) 2)) data)` |
+| Sort | `.sortBy { col }` | `(tc/order-by ds :col)` | `.sortedBy { it["col"] as Comparable<*> }` | `(sort-by :col data)` |
+| Join | `.join(other) { col match right.col }` | `(tc/left-join ds other :col)` | manual `associateBy` + map | `(clojure.set/join a b)` or `merge` |
+| Schema | `@DataSchema` / compiler plugin | `malli` schema | `data class` | spec / none |
+| Schema inference | `@ImportDataSchema` | `malli.provider/provide` | — | — |
+| Plotting | Kandy | Tableplot, Hanami, Oz | — | — |
+| Notebook | Kotlin Notebook | Clay, Clerk | — | — |
 
 ### Schema inference from example data (Clojure)
 
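The plain-Clojure "Group + Aggregate" cell added in the hunk above can be tried at a REPL. The following is a minimal sketch, assuming made-up sample rows; the name `avg-by-key` and the data are hypothetical and not part of the commit, while `:k` and `:x` mirror the placeholder keys used in the table.

```clojure
;; Hypothetical sample rows; :k and :x mirror the placeholder keys in the table.
(def data
  [{:k "a" :x 1.0}
   {:k "a" :x 3.0}
   {:k "b" :x 10.0}])

(defn avg-by-key
  "Group rows by :k and return one map per group holding the mean of :x."
  [rows]
  (->> rows
       (group-by :k)
       (map (fn [[k vs]]
              {:k k :avg (/ (reduce + (map :x vs)) (count vs))}))))

;; Sorted only to make the result order predictable for display.
(sort-by :k (avg-by-key data))
;; => ({:k "a", :avg 2.0} {:k "b", :avg 10.0})
```

One caveat when adapting this: with integer `:x` values, `/` returns a ratio such as `3/2` rather than a double, so wrapping the sum in `double` may be needed if a floating-point mean is expected.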
@@ -223,28 +225,36 @@ The bridge is ~20 LOC. Both sides are columnar, so `Map<String, List>` is the na
 
 ### Clay + Tableplot (Clojure) vs Kotlin Notebook + Kandy
 
-| Operation | Kotlin DataFrame + Kandy | Tablecloth + Tableplot |
-|-----------|--------------------------|------------------------|
-| **Create data** | `dataFrameOf("col" to list)` | `(tc/dataset {"col" list})` |
-| **Group + Aggregate** | `df.groupBy { col }.aggregate { mean { x } into "y" }` | `(-> ds (tc/group-by :col) (tc/aggregate {:y #(dfn/mean (% :x))}))` |
-| **Filter** | `df.filter { price > 100 }` | `(tc/select-rows ds #(> (:price %) 100))` |
-| **Add column** | `df.add("revenue") { price * quantity }` | `(tc/map-columns ds "revenue" ["price" "quantity"] *)` |
-| **Bar chart** | `df.plot { bars { x(col); y(col) } }` | `(-> ds (plotly/base {:=x :col :=y :col}) (plotly/layer-bar {}))` |
-| **Scatter** | `df.plot { points { x(a); y(b); color(c) } }` | `(-> ds (plotly/base {:=x :a :=y :b}) (plotly/layer-point {:=color :c}))` |
-| **Histogram** | `df.plot { histogram(x = col) }` | `(-> ds (plotly/base {:=x :col}) (plotly/layer-histogram {}))` |
-| **Notebook** | Kotlin Notebook (IntelliJ plugin, `.ipynb`) | Clay (editor-agnostic, renders `.clj` → HTML) |
+| Operation | Kotlin DataFrame + Kandy | Tablecloth + Tableplot | Plain Kotlin (stdlib) | Plain Clojure (core) |
+|-----------|--------------------------|------------------------|-----------------------|----------------------|
+| **Create data** | `dataFrameOf("col" to list)` | `(tc/dataset {"col" list})` | `listOf(mapOf("col" to v, ...))` | `[{:col v ...}]` |
+| **Group + Aggregate** | `df.groupBy { col }.aggregate { mean { x } into "y" }` | `(-> ds (tc/group-by :col) (tc/aggregate {:y #(dfn/mean (% :x))}))` | `data.groupBy { it["col"] }.map { (k, vs) -> mapOf("col" to k, "avg" to vs.map { it["x"] as Double }.average()) }` | `(->> data (group-by :col) (map (fn [[k vs]] {:col k :avg (/ (reduce + (map :x vs)) (count vs))})))` |
+| **Filter** | `df.filter { price > 100 }` | `(tc/select-rows ds #(> (:price %) 100))` | `data.filter { it["price"] as Int > 100 }` | `(filter #(> (:price %) 100) data)` |
+| **Add column** | `df.add("revenue") { price * quantity }` | `(tc/map-columns ds "revenue" ["price" "quantity"] *)` | `data.map { it + ("revenue" to (it["price"] as Int) * (it["qty"] as Int)) }` | `(map #(assoc % :revenue (* (:price %) (:qty %))) data)` |
+| **Sort** | `df.sortBy { price }` | `(tc/order-by ds :price)` | `data.sortedBy { it["price"] as Int }` | `(sort-by :price data)` |
+| **Join** | `df.join(other) { col match right.col }` | `(tc/left-join ds other :col)` | `val idx = other.associateBy { it["col"] }; data.map { it + (idx[it["col"]] ?: emptyMap()) }` | `(clojure.set/join set-a set-b)` |
+| **Bar chart** | `df.plot { bars { x(col); y(col) } }` | `(-> ds (plotly/base {:=x :col :=y :col}) (plotly/layer-bar {}))` | — | — |
+| **Scatter** | `df.plot { points { x(a); y(b); color(c) } }` | `(-> ds (plotly/base {:=x :a :=y :b}) (plotly/layer-point {:=color :c}))` | — | — |
+| **Histogram** | `df.plot { histogram(x = col) }` | `(-> ds (plotly/base {:=x :col}) (plotly/layer-histogram {}))` | — | — |
+| **Notebook** | Kotlin Notebook (IntelliJ plugin, `.ipynb`) | Clay (editor-agnostic, renders `.clj` → HTML) | — | — |
 
 **Key differences:**
-1. **Type safety**: Kotlin has compile-time column access (`df.price`), Clojure has runtime schema inference
-2. **DSL style**: Kotlin uses function builders `plot { bars { } }`, Clojure uses data-driven pipelines
-3. **IDE integration**: Kotlin Notebook is IntelliJ-only; Clay is editor-agnostic
-4. **Interactivity**: Kandy produces Let's Plot (SVG), Tableplot produces Plotly.js (interactive zoom/pan/hover)
-5. **REPL experience**: Clojure's REPL-driven dev is faster for exploration
-6. **Composability**: Tableplot layers are independently composable; Kandy's DSL is more monolithic
+1. **Type safety**: Kotlin has compile-time column access (`df.price`), Clojure has runtime schema inference. Plain Kotlin collections need casts everywhere (`it["x"] as Int`). Plain Clojure needs neither — keywords are functions, values are dynamically typed.
+2. **Filter/map/sort are ~identical**: For row-level operations, plain collections and DataFrame APIs converge. Clojure's `filter`/`map`/`sort-by` + keywords-as-functions is nearly as concise as tablecloth.
+3. **Group+aggregate is close in Clojure, painful in plain Kotlin**: Clojure has `group-by` in core, and aggregation is just `map` + `reduce` over the groups — standard stuff. The tablecloth version is more declarative but not fundamentally simpler. In Kotlin, the plain version needs explicit casts at every step, making it significantly worse.
+4. **Joins are built into Clojure**: `clojure.set/join` performs natural joins on sets of maps, matching on shared keys automatically (or with an explicit key mapping). Combined with `merge`, `clojure.set/union`, `clojure.set/difference`, and `clojure.set/intersection`, Clojure has relational algebra in the stdlib. Plain Kotlin has nothing comparable — you build index maps by hand.
+5. **Nil punning eliminates missing-value boilerplate**: In Clojure, `(:x row)` on a row without `:x` returns `nil`. `nil` is falsy, `(seq nil)` is `nil`, `(conj nil x)` works, `(remove nil? xs)` is idiomatic. Missing values just flow through without special handling. TMD's RoaringBitmap approach is more principled for columnar statistics, but for row-oriented map traversal, nil punning already covers most cases naturally.
+6. **Reducers and transducers close the performance gap**: Clojure's `clojure.core.reducers` (`r/fold`) can parallelize operations over plain vectors of maps using fork/join, and transducers (`(into [] (comp (filter pred) (map f)) data)`) eliminate intermediate seq allocation entirely. These work on existing Clojure data — no special data structure required. `r/fold` on a vector of maps gives you parallel group-by/aggregate without reaching for a DataFrame. That said, columnar storage still wins for numeric-heavy workloads: TMD's `dfn/mean` operates on a contiguous typed buffer with no boxing, which transducers over maps can't match.
+7. **Interactivity**: Kandy produces Let's Plot (SVG), Tableplot produces Plotly.js (interactive zoom/pan/hover). No stdlib plotting in either language.
+8. **REPL experience**: Plain Clojure data is already inspectable at the REPL — DataFrames add pretty-printing but plain maps are arguably *easier* to inspect (no special printer needed).
 
-**Verdict**: Both ecosystems are fully capable. Kotlin wins on type safety and IDE ergonomics.
-Clojure wins on REPL interactivity, composability, and editor freedom. The bridge makes it possible to
-use both in the same project.
+**Verdict**: Plain Clojure seq-of-maps is competitive with DataFrame APIs for most operations.
+`filter`, `map`, `sort-by`, `group-by`, `clojure.set/join`, transducers, and `r/fold` are all in the
+stdlib. Keywords-as-functions, nil punning, and reducers/transducers mean plain Clojure data already
+supports expressive querying, missing-value tolerance, and parallel computation — without a library.
+DataFrames earn their keep on: (a) contiguous typed column buffers for numeric-heavy workloads,
+(b) built-in statistical functions (`dfn/mean`, `dfn/variance`, etc.), and (c) ecosystem integration
+(plotting, notebooks, Arrow I/O). For small-to-medium data, plain Clojure maps may be all you need.
 
 ---
 
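Added items 4 and 5 under "Key differences" describe `clojure.set/join` and nil punning. The sketch below illustrates both with hypothetical relations (`orders`, `customers`, `customers-by-cid`) and rows that are not part of the commit. It also shows one thing the table glosses over: `clojure.set/join` behaves like an inner join, so rows without a match are dropped, whereas a left join keeps them.

```clojure
(require '[clojure.set :as set])

;; Hypothetical relations: sets of maps that share the :id key.
(def orders    #{{:id 1 :item "pen"} {:id 2 :item "ink"} {:id 3 :item "pad"}})
(def customers #{{:id 1 :name "Ada"} {:id 2 :name "Bob"}})

;; Natural join on the shared key :id.
;; Order 3 has no matching customer, so it does not appear in the result.
(def joined (set/join orders customers))
;; => #{{:id 1, :item "pen", :name "Ada"} {:id 2, :item "ink", :name "Bob"}}

;; Explicit key mapping: :id in the first relation matches :cid in the second.
(def customers-by-cid #{{:cid 1 :name "Ada"}})
(def joined-mapped (set/join orders customers-by-cid {:id :cid}))
;; => #{{:id 1, :item "pen", :cid 1, :name "Ada"}}

;; Nil punning: a keyword lookup on a row that lacks the key yields nil.
(def rows [{:x 1} {:y 2} {:x 3}])
(map :x rows)               ;; => (1 nil 3)
(remove nil? (map :x rows)) ;; => (1 3)
(reduce + (keep :x rows))   ;; keep drops nil results before summing => 4
```

Note that nil does not flow through arithmetic: `(+ 1 nil)` throws a `NullPointerException`, so numeric aggregation still needs a `keep` or `remove nil?` step like the one shown.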
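Added item 6 mentions transducers and `r/fold` over plain vectors of maps. A minimal sketch of both follows, assuming synthetic data; `data`, `xf`, `transformed`, and `sum-by-key` are hypothetical names chosen for illustration. It shows the shape of the code only and makes no performance claim.

```clojure
(require '[clojure.core.reducers :as r])

;; Synthetic rows: 1000 maps with a group key :k (0, 1 or 2) and a value :x.
(def data (vec (for [i (range 1000)] {:k (mod i 3) :x i})))

;; Transducer pipeline: filter and map are fused into a single pass by `into`,
;; so no intermediate lazy sequences are built between the two steps.
(def xf
  (comp (filter #(even? (:x %)))
        (map #(assoc % :y (* 2 (:x %))))))

(def transformed (into [] xf data))

(defn sum-by-key
  "Sum :x per :k using r/fold, which may process vector partitions in parallel."
  [rows]
  (r/fold
    (fn combine
      ([] {})                      ; identity value for each partition
      ([a b] (merge-with + a b)))  ; merge partial sums from two partitions
    (fn reduce-step [acc row]
      (update acc (:k row) (fnil + 0) (:x row)))
    rows))

(sum-by-key data)
```

`r/fold` only parallelizes foldable collections such as vectors and maps; on a lazy seq it falls back to an ordinary sequential reduce, which is why `data` is realized with `vec` here.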