# Kotlin DataFrame ↔ Clojure Data Ecosystem Research Kotlin DataFrame (`org.jetbrains.kotlinx.dataframe`) is a JetBrains library for typesafe, columnar, in-memory data processing on the JVM. It pairs with **Kandy** for visualization. The core question: **Is this just a seq of maps with extra steps?** And if so, can we bridge the two ecosystems on the JVM? ## Key Findings ### Kotlin DataFrame is NOT a seq of maps internally - It's **column-oriented**: a list of named typed columns (primitive arrays, string tables) - But it **exposes** row iteration as `DataRow` (essentially `Map`) - It freely converts to/from `List` and `Map>` - So conceptually yes, it represents the same data as a seq of maps, but stored columnar ### Clojure equivalents exist | Concept | KT DataFrame | Tablecloth | Plain Kotlin (`List`) | Plain Clojure (seq of maps) | |-|-|-|-|-| | Tabular data | `DataFrame` | `tc/dataset` | `List>` | `[{:a 1} {:b 2}]` | | Create | `dataFrameOf("a" to listOf(1,2))` | `(tc/dataset {:a [1 2]})` | `listOf(mapOf("a" to 1), mapOf("a" to 2))` | `[{:a 1} {:a 2}]` | | Filter | `.filter { col > 10 }` | `(tc/select-rows ds pred)` | `.filter { it["col"] as Int > 10 }` | `(filter #(> (:col %) 10) data)` | | Group + Aggregate | `.groupBy{}.aggregate{}` | `(-> (tc/group-by) (tc/aggregate))` | `.groupBy { it["k"] }.mapValues { (_, vs) -> vs.map { it["x"] as Double }.average() }` | `(->> data (group-by :k) (map (fn [[k vs]] {:k k :avg (/ (reduce + (map :x vs)) (count vs))})))` | | Computed column | `.add("y") { x * 2 }` | `(tc/add-column ds :y fn)` | `.map { it + ("y" to (it["x"] as Int) * 2) }` | `(map #(assoc % :y (* (:x %) 2)) data)` | | Sort | `.sortBy { col }` | `(tc/order-by ds :col)` | `.sortedBy { it["col"] as Comparable<*> }` | `(sort-by :col data)` | | Join | `.join(other) { col match right.col }` | `(tc/left-join ds other :col)` | manual `associateBy` + map | `(clojure.set/join a b)` or `merge` | | Schema | `@DataSchema` / compiler plugin | `malli` schema | `data class` | spec / none | | Schema inference | `@ImportDataSchema` | `malli.provider/provide` | — | — | | Plotting | Kandy | Tableplot, Hanami, Oz | — | — | | Notebook | Kotlin Notebook | Clay, Clerk | — | — | ### Schema inference from example data (Clojure) **Malli** can infer schemas from example JSON/EDN: ```clojure (require '[malli.provider :as mp]) (mp/provide [{:name "Alice" :age 30 :tags ["admin"]} {:name "Bob" :age 25}]) ;; => [:map ;; [:name string?] ;; [:age number?] ;; [:tags {:optional true} [:vector string?]]] ``` - Detects optional keys automatically - Handles nested maps, vectors, sets - Can decode UUIDs, dates with custom decoders - Can export to JSON Schema via `malli.json-schema/transform` --- ## Deep Research Findings (from source code analysis) ### A. Kotlin DataFrame JVM Interop Surface #### Core Architecture ``` DataFrame (interface) -- container of columns └── DataFrameImpl (internal) -- stores List + nrow DataColumn (interface) -- a single column ├── ValueColumn -- leaf values (backed by List) ├── ColumnGroup -- nested DataFrame (struct column) └── FrameColumn -- column of DataFrames DataRow (interface) -- a single row view └── DataRowImpl (internal) -- index + DataFrame reference ``` #### What's callable from Clojure (non-inline, non-reified) | Operation | Java-callable? | How to call from Clojure | |-----------|---------------|--------------------------| | `Map.toDataFrame()` | **YES** | `(ToDataFrameKt/toDataFrame java-map)` | | `Iterable>.toDataFrame()` | **YES** | `(ToDataFrameKt/toDataFrameMapStringAnyNullable seq-of-maps)` | | `DataFrame.toMap()` | **YES** | `(TypeConversionsKt/toMap df)` → `Map>` | | `DataRow.toMap()` | **YES** | `(TypeConversionsKt/toMap row)` → `Map` | | `DataColumn.createByInference(name, values)` | **YES** | `(DataColumn/createByInference "col" java-list)` | | `dataFrameOf(columns)` | **YES** | `(ConstructorsKt/dataFrameOf column-list)` | | `DataFrame.columns()` | **YES** | `(.columns df)` → `List` | | `DataFrame.get(name)` | **YES** | `(.get df "colname")` → column | | `DataFrame.get(index)` | **YES** | `(.get df 0)` → DataRow | | `DataFrame.iterator()` | **YES** | `(iterator-seq (.iterator df))` | | `DataFrame.rowsCount()` | **YES** | `(.rowsCount df)` | | `Iterable.toDataFrame()` | **NO** | inline+reified, use Map variant instead | | `DataColumn.createValueColumn` | **NO** | Use `createByInference` instead | **Key insight**: DataRow does NOT implement `java.util.Map`. It has `.get(name)` and `.values()` but you need `.toMap()` to get a real Map. #### Column storage - **ValueColumn**: backed by `List` (object list, not primitive arrays) - No primitive specialization — even ints are boxed in `List` - This means no zero-copy to dtype-next primitive buffers ### B. tech.ml.dataset (TMD) / Tablecloth Interop Surface #### TMD Column internals ```clojure (deftype Column [^RoaringBitmap missing ;; missing value bitmap data ;; underlying data (dtype-next buffer, Java array, NIO buffer) ^IPersistentMap metadata ;; column metadata (name, datatype, etc.) ^Buffer buffer]) ;; cached buffer view for fast access ``` - Columns are **dtype-next buffers** — can be Java arrays, NIO ByteBuffers, or native memory - Missing values tracked separately via RoaringBitmap (not sentinel values) - Implements `PToArrayBuffer` — can convert to raw Java arrays if no missing values #### Dataset creation from Java collections ```clojure ;; From a column-oriented map (best for interop): (ds/->dataset {"name" ["Alice" "Bob"] "age" [30 25]}) ;; From a seq of row maps: (ds/->dataset [{:name "Alice" :age 30} {:name "Bob" :age 25}]) ;; Tablecloth wraps the same: (tc/dataset {"name" ["Alice" "Bob"] "age" [30 25]}) ``` Both `ds/->dataset` and `tc/dataset` accept `java.util.Map` directly. ### C. Apache Arrow as Interchange Format Both libraries support Arrow: **Kotlin DataFrame** — separate module `dataframe-arrow`, supports Feather (v1, v2), Arrow IPC (streaming + file), LZ4/ZSTD compression. **tech.ml.dataset** — built-in `tech.v3.libs.arrow`, supports memory-mapped reading for near-zero-copy (`{:open-type :mmap}`). **Verdict:** Arrow is the right choice for **large datasets** or **process boundaries**. For same-process bridging, **direct JVM interop via Map** is simpler. ### D. Malli Schema Inference | Aspect | Malli Provider | Kotlin @ImportDataSchema | |--------|---------------|--------------------------| | When | Runtime | Compile-time | | Input | Any Clojure data | JSON file on disk | | Output | Schema as data (EDN) | Generated typed accessor code | | Type safety | Dynamic validation | Static type checking | | IDE support | Limited | Full autocomplete | | Flexibility | Handles unknown/evolving schemas | Schema fixed at compile time | | Best for | Exploration, dynamic data | Production, stable APIs | --- ## Bridge Design The bridge is ~20 LOC. Both sides are columnar, so `Map` is the natural interchange type. ### Direct JVM interop (simplest, best for small-medium data) ```clojure (ns df-bridge.core (:import [org.jetbrains.kotlinx.dataframe.api ToDataFrameKt TypeConversionsKt] [org.jetbrains.kotlinx.dataframe DataColumn])) ;; KT DataFrame -> Clojure (column-oriented, fast) (defn kt->map [kt-df] (into {} (TypeConversionsKt/toMap kt-df))) ;; KT DataFrame -> TMD dataset (defn kt->dataset [kt-df] (-> (TypeConversionsKt/toMap kt-df) (ds/->dataset))) ;; Clojure -> KT DataFrame (column-oriented, fast) (defn map->kt [col-map] (ToDataFrameKt/toDataFrame col-map)) ;; TMD dataset -> KT DataFrame (defn dataset->kt [ds] (let [col-map (into {} (map (fn [col] [(name (ds-col/column-name col)) (vec col)]) (ds/columns ds)))] (ToDataFrameKt/toDataFrame col-map))) ``` ### Nested Data (ColumnGroups) ```clojure ;; Create KT DataFrame with ColumnGroup from Clojure: (bridge/make-kt-with-groups [["name" ["Alice" "Bob"]] ["address" {"city" ["NYC" "LA"] "zip" [10001 90001]}]]) ;; Convert back to row maps: (bridge/kt->rows kt-df) ;; => [{:name "Alice", :address {:city "NYC", :zip 10001}} ;; {:name "Bob", :address {:city "LA", :zip 90001}}] ``` ### Benchmark Results (3-column dataset: string, int, double) | Rows | Map→KT | KT→Map | KT→TMD | TC→KT | Full RT | |------|--------|--------|--------|-------|---------| | 1K | 0.3ms | 0.005ms | 0.3ms | 0.2ms | 0.4ms | | 100K | 3.3ms | 0.003ms | 5.7ms | 5.5ms | 12.2ms | | 1M | 33ms | 0.004ms | 72ms | 60ms | 134ms | ### Arrow vs Direct Map Comparison (4-column dataset) | Rows | Direct Map KT→TMD | Arrow file KT→TMD | Arrow byte[] KT→TMD | |------|-------------------|--------------------|--------------------| | 10K | 1.5ms | 2.0ms | 1.4ms | | 100K | 11.6ms | 9.1ms | 7.1ms | | 1M | 112ms | 118ms | 92ms | **Key observations:** - `KT→Map` is essentially free (~4µs) — `toMap()` just wraps existing column lists - Full roundtrip at 1M rows: 134ms — fine for interactive use - **Arrow is NOT faster** than direct Map for same-process bridging at any tested size - **Verdict: Direct Map bridge wins for same-process. Arrow only for cross-process.** --- ## Visualization Stack Comparison ### Clay + Tableplot (Clojure) vs Kotlin Notebook + Kandy | Operation | Kotlin DataFrame + Kandy | Tablecloth + Tableplot | Plain Kotlin (stdlib) | Plain Clojure (core) | |-----------|--------------------------|------------------------|-----------------------|----------------------| | **Create data** | `dataFrameOf("col" to list)` | `(tc/dataset {"col" list})` | `listOf(mapOf("col" to v, ...))` | `[{:col v ...}]` | | **Group + Aggregate** | `df.groupBy { col }.aggregate { mean { x } into "y" }` | `(-> ds (tc/group-by :col) (tc/aggregate {:y #(dfn/mean (% :x))}))` | `data.groupBy { it["col"] }.map { (k, vs) -> mapOf("col" to k, "avg" to vs.map { it["x"] as Double }.average()) }` | `(->> data (group-by :col) (map (fn [[k vs]] {:col k :avg (/ (reduce + (map :x vs)) (count vs))})))` | | **Filter** | `df.filter { price > 100 }` | `(tc/select-rows ds #(> (:price %) 100))` | `data.filter { it["price"] as Int > 100 }` | `(filter #(> (:price %) 100) data)` | | **Add column** | `df.add("revenue") { price * quantity }` | `(tc/map-columns ds "revenue" ["price" "quantity"] *)` | `data.map { it + ("revenue" to (it["price"] as Int) * (it["qty"] as Int)) }` | `(map #(assoc % :revenue (* (:price %) (:qty %))) data)` | | **Sort** | `df.sortBy { price }` | `(tc/order-by ds :price)` | `data.sortedBy { it["price"] as Int }` | `(sort-by :price data)` | | **Join** | `df.join(other) { col match right.col }` | `(tc/left-join ds other :col)` | `val idx = other.associateBy { it["col"] }; data.map { it + (idx[it["col"]] ?: emptyMap()) }` | `(clojure.set/join set-a set-b)` | | **Bar chart** | `df.plot { bars { x(col); y(col) } }` | `(-> ds (plotly/base {:=x :col :=y :col}) (plotly/layer-bar {}))` | — | — | | **Scatter** | `df.plot { points { x(a); y(b); color(c) } }` | `(-> ds (plotly/base {:=x :a :=y :b}) (plotly/layer-point {:=color :c}))` | — | — | | **Histogram** | `df.plot { histogram(x = col) }` | `(-> ds (plotly/base {:=x :col}) (plotly/layer-histogram {}))` | — | — | | **Notebook** | Kotlin Notebook (IntelliJ plugin, `.ipynb`) | Clay (editor-agnostic, renders `.clj` → HTML) | — | — | **Key differences:** 1. **Type safety**: Kotlin has compile-time column access (`df.price`), Clojure has runtime schema inference. Plain Kotlin collections need casts everywhere (`it["x"] as Int`). Plain Clojure needs neither — keywords are functions, values are dynamically typed. 2. **Filter/map/sort are ~identical**: For row-level operations, plain collections and DataFrame APIs converge. Clojure's `filter`/`map`/`sort-by` + keywords-as-functions is nearly as concise as tablecloth. 3. **Group+aggregate is close in Clojure, painful in plain Kotlin**: Clojure has `group-by` in core, and aggregation is just `map` + `reduce` over the groups — standard stuff. The tablecloth version is more declarative but not fundamentally simpler. In Kotlin, the plain version needs explicit casts at every step, making it significantly worse. 4. **Joins are built into Clojure**: `clojure.set/join` performs natural joins on sets of maps, matching on shared keys automatically (or with an explicit key mapping). Combined with `merge`, `clojure.set/union`, `clojure.set/difference`, and `clojure.set/intersection`, Clojure has relational algebra in the stdlib. Plain Kotlin has nothing comparable — you build index maps by hand. 5. **Nil punning eliminates missing-value boilerplate**: In Clojure, `(:x row)` on a row without `:x` returns `nil`. `nil` is falsy, `(seq nil)` is `nil`, `(conj nil x)` works, `(remove nil? xs)` is idiomatic. Missing values just flow through without special handling. TMD's RoaringBitmap approach is more principled for columnar statistics, but for row-oriented map traversal, nil punning already covers most cases naturally. 6. **Reducers and transducers close the performance gap**: Clojure's `clojure.core.reducers` (`r/fold`) can parallelize operations over plain vectors of maps using fork/join, and transducers (`(into [] (comp (filter pred) (map f)) data)`) eliminate intermediate seq allocation entirely. These work on existing Clojure data — no special data structure required. `r/fold` on a vector of maps gives you parallel group-by/aggregate without reaching for a DataFrame. That said, columnar storage still wins for numeric-heavy workloads: TMD's `dfn/mean` operates on a contiguous typed buffer with no boxing, which transducers over maps can't match. 7. **Interactivity**: Kandy produces Let's Plot (SVG), Tableplot produces Plotly.js (interactive zoom/pan/hover). No stdlib plotting in either language. 8. **REPL experience**: Plain Clojure data is already inspectable at the REPL — DataFrames add pretty-printing but plain maps are arguably *easier* to inspect (no special printer needed). **Verdict**: Plain Clojure seq-of-maps is competitive with DataFrame APIs for most operations. `filter`, `map`, `sort-by`, `group-by`, `clojure.set/join`, transducers, and `r/fold` are all in the stdlib. Keywords-as-functions, nil punning, and reducers/transducers mean plain Clojure data already supports expressive querying, missing-value tolerance, and parallel computation — without a library. DataFrames earn their keep on: (a) contiguous typed column buffers for numeric-heavy workloads, (b) built-in statistical functions (`dfn/mean`, `dfn/variance`, etc.), and (c) ecosystem integration (plotting, notebooks, Arrow I/O). For small-to-medium data, plain Clojure maps may be all you need. --- ## Project Structure ``` bridge/ — Working Clojure bridge project (deps.edn, benchmarks, notebooks) dataframe/ — Kotlin DataFrame source (reference) tech.ml.dataset/ — TMD source (reference) tablecloth/ — Tablecloth source (reference) malli/ — Malli source (reference) ``` ## Dependencies (bridge project) ```clojure {org.jetbrains.kotlinx/dataframe-core {:mvn/version "1.0.0-Beta4"} org.jetbrains.kotlin/kotlin-reflect {:mvn/version "2.1.10"} scicloj/tablecloth {:mvn/version "7.062"} metosin/malli {:mvn/version "0.17.0"} org.scicloj/tableplot {:mvn/version "1-beta14"} org.scicloj/clay {:mvn/version "2-beta56"}} ```