| Operation | Kotlin DataFrame | Tablecloth | Plain Kotlin (stdlib) | Plain Clojure (core) |
|-----------|------------------|------------|-----------------------|----------------------|
| Tabular data | `DataFrame<T>` | `tc/dataset` | `List<Map<String, Any?>>` | `[{:a 1} {:b 2}]` |
| Create | `dataFrameOf("a" to listOf(1,2))` | `(tc/dataset {:a [1 2]})` | `listOf(mapOf("a" to 1), mapOf("a" to 2))` | `[{:a 1} {:a 2}]` |
| Filter | `.filter { col > 10 }` | `(tc/select-rows ds pred)` | `.filter { it["col"] as Int > 10 }` | `(filter #(> (:col %) 10) data)` |
| Group + Aggregate | `.groupBy{}.aggregate{}` | `(-> (tc/group-by) (tc/aggregate))` | `.groupingBy { it["k"] }.fold(0.0) { acc, r -> acc + r["x"] as Double }` | `(->> data (group-by :k) (map (fn [[k vs]] {:k k :avg (/ (reduce + (map :x vs)) (count vs))})))` |
| Computed column | `.add("y") { x * 2 }` | `(tc/add-column ds :y fn)` | `.map { it + ("y" to (it["x"] as Int) * 2) }` | `(map #(assoc % :y (* (:x %) 2)) data)` |
| Sort | `.sortBy { col }` | `(tc/order-by ds :col)` | `.sortedBy { it["col"] as Comparable<*> }` | `(sort-by :col data)` |
| Join | `.join(other) { col match right.col }` | `(tc/left-join ds other :col)` | manual `associateBy` + map | `(clojure.set/join a b)` or `merge` |
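To make the plain-Clojure column concrete, here is a small runnable sketch over an invented seq of maps (`data` is sample data for illustration; only `clojure.core` is used):

```clojure
;; Invented sample rows; only clojure.core, no libraries.
(def data [{:col 12 :x 1} {:col 8 :x 2} {:col 30 :x 3}])

;; Filter: keep rows where :col > 10
(filter #(> (:col %) 10) data)
;; => ({:col 12, :x 1} {:col 30, :x 3})

;; Computed column: add :y = 2x to every row
(map #(assoc % :y (* (:x %) 2)) data)
;; => ({:col 12, :x 1, :y 2} {:col 8, :x 2, :y 4} {:col 30, :x 3, :y 6})

;; Sort by :col ascending
(sort-by :col data)
;; => ({:col 8, :x 2} {:col 12, :x 1} {:col 30, :x 3})
```

Each step returns a new immutable seq; nothing is mutated in place.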
| Operation | Kotlin DataFrame + Kandy | Tablecloth + Tableplot | Plain Kotlin (stdlib) | Plain Clojure (core) |
|-----------|--------------------------|------------------------|-----------------------|----------------------|
| **Create data** | `dataFrameOf("col" to list)` | `(tc/dataset {"col" list})` | `listOf(mapOf("col" to v, ...))` | `[{:col v ...}]` |
| **Group + Aggregate** | `df.groupBy { col }.aggregate { mean { x } into "y" }` | `(-> ds (tc/group-by :col) (tc/aggregate {:y #(dfn/mean (% :x))}))` | `data.groupingBy { it["col"] }.fold(0.0) { acc, r -> acc + (r["x"] as Double) }` | `(->> data (group-by :col) (map (fn [[k vs]] {:col k :avg (/ (reduce + (map :x vs)) (count vs))})))` |
| **Filter** | `df.filter { price > 100 }` | `(tc/select-rows ds #(> (:price %) 100))` | `data.filter { it["price"] as Int > 100 }` | `(filter #(> (:price %) 100) data)` |
| **Add column** | `df.add("revenue") { price * quantity }` | `(tc/map-columns ds "revenue" ["price" "quantity"] *)` | `data.map { it + ("revenue" to (it["price"] as Int) * (it["qty"] as Int)) }` | `(map #(assoc % :revenue (* (:price %) (:qty %))) data)` |
| **Sort** | `df.sortBy { price }` | `(tc/order-by ds :price)` | `data.sortedBy { it["price"] as Int }` | `(sort-by :price data)` |
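The plain-Clojure cells above, expanded into a runnable sketch (the `sales` rows are invented sample data shaped like the table's running example):

```clojure
;; Invented sample data: price/qty rows grouped by :col.
(def sales [{:col "a" :price 120 :qty 2}
            {:col "a" :price 80  :qty 1}
            {:col "b" :price 200 :qty 5}])

;; Add column: :revenue = price * qty
(def with-revenue (map #(assoc % :revenue (* (:price %) (:qty %))) sales))
;; revenues => (240 80 1000)

;; Filter: rows with price over 100
(filter #(> (:price %) 100) with-revenue)

;; Group + aggregate: average price per :col
(->> sales
     (group-by :col)
     (map (fn [[k vs]]
            {:col k :avg (/ (reduce + (map :price vs)) (count vs))})))
;; => ({:col "a", :avg 100} {:col "b", :avg 200})
```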
**Key differences:**

1. **Type safety**: Kotlin has compile-time column access (`df.price`), Clojure has runtime schema inference. Plain Kotlin collections need casts everywhere (`it["x"] as Int`). Plain Clojure needs neither — keywords are functions, values are dynamically typed.
2. **Filter/map/sort are ~identical**: For row-level operations, plain collections and DataFrame APIs converge. Clojure's `filter`/`map`/`sort-by` + keywords-as-functions is nearly as concise as tablecloth.
3. **Group+aggregate is close in Clojure, improved in Kotlin via `groupingBy`**: Clojure has `group-by` in core, and aggregation is just `map` + `reduce` over the groups. Kotlin has `groupingBy` which avoids materializing intermediate lists — `.groupingBy { }.fold(init) { acc, elem -> }` aggregates in a single pass. Both are reasonable without a DataFrame. Casts still plague the Kotlin version.
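The single-pass shape has a direct Clojure counterpart as well: one `reduce` over the rows with a map accumulator, no intermediate per-group vectors (sample `rows` invented for illustration):

```clojure
;; Invented sample rows keyed by :k.
(def rows [{:k :a :x 1.0} {:k :a :x 3.0} {:k :b :x 10.0}])

;; Per-key sums in one pass; (fnil + 0.0) defaults a missing key to 0.0
(reduce (fn [acc {:keys [k x]}]
          (update acc k (fnil + 0.0) x))
        {}
        rows)
;; => {:a 4.0, :b 10.0}
```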
4. **Joins are built into Clojure**: `clojure.set/join` performs natural inner joins on sets of maps, auto-detecting shared keys (via the first element of each set). Also supports explicit key mapping via 3-arity `(join a b {:left-key :right-key})`. Combined with `merge`, `set/union`, `set/difference`, `set/intersection`, `set/select` (filter), `set/project` (column select), and `set/rename`, Clojure has relational algebra in the stdlib. **Caveats**: inner join only (no left/right/outer), inputs must be sets (not vectors — silent wrong results otherwise), shared-key detection only inspects the first element of each set. Plain Kotlin has nothing comparable.
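A minimal `set/join` sketch showing both arities and the set-coercion caveat (the `users`/`accounts` relations are invented sample data):

```clojure
(require '[clojure.set :as set])

;; Relations must be sets of maps.
(def users    #{{:id 1 :name "ada"} {:id 2 :name "bob"}})
(def accounts #{{:id 1 :plan "pro"} {:id 3 :plan "free"}})

;; Natural inner join on the shared key :id
(set/join users accounts)
;; => #{{:id 1, :name "ada", :plan "pro"}}

;; Explicit key mapping when the join columns are named differently;
;; note the result keeps both key columns.
(set/join users #{{:user-id 1 :plan "pro"}} {:id :user-id})
;; => #{{:id 1, :name "ada", :user-id 1, :plan "pro"}}

;; Per the caveat above: coerce vectors with `set` before joining.
(set/join (set [{:id 2 :name "bob"}]) (set [{:id 2 :plan "free"}]))
;; => #{{:id 2, :name "bob", :plan "free"}}
```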
5. **Nil punning handles most missing-value cases**: `(:x row)` on a row without `:x` returns `nil`. `nil` is falsy, `(seq nil)` → `nil`, `(conj nil x)` works, `(assoc nil :a 1)` → `{:a 1}`, `(count nil)` → `0`. Missing values flow through collection operations naturally. **Caveat**: nil breaks arithmetic — `(+ 1 nil)` and `(> nil 5)` throw NPE. Clojure provides `fnil` (default-substitution wrapper) and `some->` (nil-short-circuiting thread) for these edges. In practice you `(remove nil? ...)` or `(keep ...)` before numeric reduction. TMD's RoaringBitmap approach handles this at the column level without per-row nil logic.
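The nil-punning behaviors and their arithmetic edges, as a runnable sketch (`obs` is invented sample data with a missing value):

```clojure
;; Invented observations; the last row is missing :x.
(def obs [{:x 1.0} {:x 3.0} {}])

(:x {})                      ;; => nil, no exception
;; (+ 1 (:x {}))             ;; would throw NullPointerException

;; keep = map + drop nils, so missing values vanish before reduction
(reduce + (keep :x obs))     ;; => 4.0

;; fnil substitutes a default for a nil argument
((fnil + 0.0) nil 2.0)       ;; => 2.0

;; some-> short-circuits the rest of the pipeline on nil
(some-> {} :x inc)           ;; => nil
```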
6. **Reducers and transducers close the performance gap**: Transducers (`(into [] (comp (filter pred) (map f)) data)`) fuse pipeline stages into a single pass — no intermediate lazy seqs between steps. `r/fold` parallelizes via ForkJoinPool (default partition: 512 elements), splitting work across cores. **Caveat**: `r/fold` only parallelizes vectors and PersistentHashMaps; lists, lazy seqs, and sets silently fall back to sequential reduce. Parallel group-by works via `(r/fold (r/monoid #(merge-with + %1 %2) (constantly {})) rf data-vec)` but requires an associative combining function — not all aggregations (median, percentile) trivially parallelize. Columnar storage still wins for numeric-heavy workloads: TMD's `dfn/mean` operates on a contiguous typed `double[]` buffer with no boxing, which transducers over maps of boxed values can't match.
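Both techniques, sketched on an invented vector of maps (`data-vec` is sample data; the vector matters, since `r/fold` falls back to sequential reduce on other seq types):

```clojure
(require '[clojure.core.reducers :as r])

;; 1000 invented rows: even indices keyed :a, odd keyed :b.
(def data-vec (vec (for [i (range 1000)]
                     {:k (if (even? i) :a :b) :x (double i)})))

;; Transducer: filter + map fused into a single pass, no intermediate seqs
(transduce (comp (filter #(> (:x %) 990.0)) (map :x)) + 0.0 data-vec)
;; => 8955.0  (991 + 992 + ... + 999)

;; Parallel per-key sums: the combine step (merge-with +) is associative,
;; so segment results reduced on separate threads merge correctly.
(r/fold (r/monoid (partial merge-with +) (constantly {}))
        (fn [acc {:keys [k x]}]
          (update acc k (fnil + 0.0) x))
        data-vec)
;; => {:a 249500.0, :b 250000.0}
```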
7. **Plotting**: Both stacks are interactive. Kandy (Lets-Plot) supports tooltips by default and zoom/pan via `ggtb()` since Lets-Plot 4.5.0; renders via Swing, HTML, SVG, or PNG. Tableplot supports Plotly.js (interactive zoom/pan/hover out of the box) and Vega-Lite backends. No stdlib plotting in either language.
8. **REPL experience**: Plain Clojure data is already inspectable at the REPL — DataFrames add pretty-printing but plain maps are arguably *easier* to inspect (no special printer needed).

**Verdict**: Plain Clojure seq-of-maps is competitive with DataFrame APIs for most operations. `filter`, `map`, `sort-by`, `group-by`, `clojure.set/join`, transducers, and `r/fold` are all in the stdlib. Keywords-as-functions and nil punning cover most data-wrangling patterns naturally — with known edges around arithmetic on nil and `set/join` requiring actual sets (not vectors). DataFrames earn their keep on: (a) contiguous typed column buffers for numeric-heavy workloads (no boxing, primitive reduction), (b) built-in statistical functions (`dfn/mean`, `dfn/variance`, etc.), (c) left/right/outer joins (tablecloth), and (d) ecosystem integration (plotting, notebooks, Arrow I/O). For small-to-medium data with simple transformations, plain Clojure maps may genuinely be all you need.

---