init research

This commit is contained in:
2026-02-08 11:20:43 -10:00
commit bdf064f54d
3041 changed files with 1592200 additions and 0 deletions
@@ -0,0 +1,232 @@
# tech.ml.dataset And nippy
We are big fans of the [nippy system](https://github.com/ptaoussanis/nippy) for
freezing/thawing data, so we were pleasantly surprised by how well it performs
with datasets and how easy it was to extend the dataset object to support nippy
natively.
## Nippy Hits One Out Of the Park
We start with a decent-sized gzipped tab-delimited file.
```console
chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
total 44M
drwxrwxr-x 2 chrisn chrisn 4.0K Jun 18 13:27 .
drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:27 ..
-rw-rw-r-- 1 chrisn chrisn 44M Jun 18 13:27 2010.tsv.gz
```
```clojure
user> (def ds-2010 (time (ds/->dataset
"nippy-demo/2010.tsv.gz"
{:parser-fn {"date" [:packed-local-date "yyyy-MM-dd"]}})))
"Elapsed time: 8588.080218 msecs"
#'user/ds-2010
user> ;;rename column names so the tables print nicely
user> (def ds-2010
(ds/select-columns ds-2010
(->> (ds/column-names ds-2010)
(map (fn [oldname]
[oldname (.replace ^String oldname "_" "-")]))
(into {}))))
user> ds-2010
nippy-demo/2010.tsv.gz [2769708 12]:
| low | comp-name-2 | high | currency-code | comp-name | m-ticker | ticker | close | volume | exchange | date | open |
|-------:|-------------|-------:|---------------|------------|----------|--------|-------:|---------------:|----------|------------|-------:|
| | | | USD | ALCOA CORP | AA2 | AA | 48.365 | | NYSE | 2010-01-01 | |
| 49.355 | | 51.065 | USD | ALCOA CORP | AA2 | AA | 51.065 | 1.10618840E+07 | NYSE | 2010-01-08 | 49.385 |
| | | | USD | ALCOA CORP | AA2 | AA | 46.895 | | NYSE | 2010-01-18 | |
| 39.904 | | 41.854 | USD | ALCOA CORP | AA2 | AA | 40.624 | 1.46292500E+07 | NYSE | 2010-01-26 | 40.354 |
| 40.294 | | 41.674 | USD | ALCOA CORP | AA2 | AA | 40.474 | 1.20107520E+07 | NYSE | 2010-02-03 | 40.804 |
| 39.304 | | 40.504 | USD | ALCOA CORP | AA2 | AA | 39.844 | 1.46702890E+07 | NYSE | 2010-02-09 | 40.084 |
| 39.574 | | 40.264 | USD | ALCOA CORP | AA2 | AA | 39.844 | 1.53728400E+07 | NYSE | 2010-02-12 | 39.994 |
| 40.324 | | 41.104 | USD | ALCOA CORP | AA2 | AA | 40.624 | 7.72947100E+06 | NYSE | 2010-02-22 | 41.044 |
| 39.664 | | 40.564 | USD | ALCOA CORP | AA2 | AA | 39.724 | 1.08365810E+07 | NYSE | 2010-03-02 | 40.234 |
```
Our 44MB gzipped tsv produced roughly 2.77 million rows and 12 columns.
Let's check the RAM usage:
```clojure
user> (require '[clj-memory-meter.core :as mm])
nil
user> (mm/measure ds-2010)
"121.5 MB"
```
Now, let's save to an uncompressed nippy file:
```clojure
user> (require '[tech.io :as io])
nil
user> (time (io/put-nippy! "nippy-demo/2010.nippy" ds-2010))
"Elapsed time: 1069.781703 msecs"
nil
```
One second, pretty nice :-).
What is the file size?
```console
chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
total 95M
drwxrwxr-x 2 chrisn chrisn 4.0K Jun 18 13:38 .
drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:36 ..
-rw-rw-r-- 1 chrisn chrisn 51M Jun 18 13:38 2010.nippy
-rw-rw-r-- 1 chrisn chrisn 44M Jun 18 13:27 2010.tsv.gz
```
Not bad: at 51MB the nippy file is only about 16% larger than the gzipped tsv.
The load performance, however, is spectacular:
```clojure
user> (def loaded-2010 (time (io/get-nippy "nippy-demo/2010.nippy")))
"Elapsed time: 314.502715 msecs"
#'user/loaded-2010
user> (mm/measure loaded-2010)
"93.9 MB"
user> loaded-2010
nippy-demo/2010.tsv.gz [2769708 12]:
| low | comp-name-2 | high | currency-code | comp-name | m-ticker | ticker | close | volume | exchange | date | open |
|-------:|-------------|-------:|---------------|------------|----------|--------|-------:|---------------:|----------|------------|-------:|
| | | | USD | ALCOA CORP | AA2 | AA | 48.365 | | NYSE | 2010-01-01 | |
| 49.355 | | 51.065 | USD | ALCOA CORP | AA2 | AA | 51.065 | 1.10618840E+07 | NYSE | 2010-01-08 | 49.385 |
| | | | USD | ALCOA CORP | AA2 | AA | 46.895 | | NYSE | 2010-01-18 | |
| 39.904 | | 41.854 | USD | ALCOA CORP | AA2 | AA | 40.624 | 1.46292500E+07 | NYSE | 2010-01-26 | 40.354 |
| 40.294 | | 41.674 | USD | ALCOA CORP | AA2 | AA | 40.474 | 1.20107520E+07 | NYSE | 2010-02-03 | 40.804 |
| 39.304 | | 40.504 | USD | ALCOA CORP | AA2 | AA | 39.844 | 1.46702890E+07 | NYSE | 2010-02-09 | 40.084 |
| 39.574 | | 40.264 | USD | ALCOA CORP | AA2 | AA | 39.844 | 1.53728400E+07 | NYSE | 2010-02-12 | 39.994 |
```
Loading the tsv takes about 8.6 seconds. Loading the nippy takes 315 milliseconds,
roughly 27 times faster! That is great :-).
The resulting dataset is also smaller in memory: 93.9 MB versus 121.5 MB. This is
because when we parse a dataset we append elements to fastutil lists and then return a
dataset that sits directly on top of those lists as its column storage. Those lists
have a bit more capacity than strictly necessary.
When we save the data, we convert it into base java/clojure datastructures
such as primitive arrays. This is what makes things smaller: converting from a list
with a bit of extra capacity allocated to an exact-sized array. This operation is
optimized and hits System/arraycopy under the covers, since fastutil lists use arrays
as their backing store, and `tech.datatype` takes care of the rest.
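The trimming step can be illustrated with plain Java arrays. This is only a sketch of the idea, not the library's code: `backing` is a hypothetical stand-in for the over-allocated array behind a growable list, and `trim-to-size` is a helper name invented for this example.

```clojure
;; Hypothetical stand-in for a growable list's storage: an array with
;; capacity 16 in which only the first 5 slots hold real data.
(def backing
  (let [arr (double-array 16)]
    (dotimes [i 5]
      (aset arr i (* 1.5 i)))
    arr))

(defn trim-to-size
  "Copy the first n elements of a double array into a new array whose
  length is exactly n, using System/arraycopy for the bulk copy."
  [^doubles src n]
  (let [n     (int n)
        exact (double-array n)]
    (System/arraycopy src 0 exact 0 n)
    exact))

(def exact (trim-to-size backing 5))

;; The trimmed array holds the same values without the spare capacity.
(println (count backing) "->" (count exact))
(println (vec exact))
```

The saving per column is just the unused capacity, so it adds up mainly when a dataset has many rows or many columns.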
## Gzipping The Nippy
We can do a bit better. If you are really concerned about dataset size on disk, we
can save out a gzipped nippy:
```clojure
user> (time (io/put-nippy! (io/gzip-output-stream! "nippy-demo/2010.nippy.gz") ds-2010))
"Elapsed time: 7026.500505 msecs"
nil
```
This beats the gzipped tsv on size by about 10% (40MB versus 44MB):
```console
chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
total 134M
drwxrwxr-x 2 chrisn chrisn 4.0K Jun 18 13:47 .
drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:36 ..
-rw-rw-r-- 1 chrisn chrisn 51M Jun 18 13:38 2010.nippy
-rw-rw-r-- 1 chrisn chrisn 40M Jun 18 13:47 2010.nippy.gz
-rw-rw-r-- 1 chrisn chrisn 44M Jun 18 13:27 2010.tsv.gz
```
The trade-off is that it now takes roughly twice as long to load as the uncompressed nippy:
```clojure
user> (def loaded-gzipped-2010 (time (io/get-nippy (io/gzip-input-stream "nippy-demo/2010.nippy.gz"))))
"Elapsed time: 680.165118 msecs"
#'user/loaded-gzipped-2010
user> (mm/measure loaded-gzipped-2010)
"93.9 MB"
```
You can probably live with load times in the 700ms range if you have a strong reason to
keep data compressed on disk.
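To make the size/time trade-off concrete, here is a minimal sketch of a gzip round trip using only the JDK's `java.util.zip` streams. It does not use `tech.io`; the helper names `gzip-bytes` and `gunzip-bytes` and the sample rows are made up for illustration, and the compression ratio on real data will differ.

```clojure
(require '[clojure.java.io :as jio])
(import '[java.io ByteArrayInputStream ByteArrayOutputStream]
        '[java.util.zip GZIPInputStream GZIPOutputStream])

(defn gzip-bytes
  "Compress a byte array by writing it through a GZIPOutputStream."
  [^bytes raw]
  (let [sink (ByteArrayOutputStream.)]
    ;; Closing the gzip stream flushes the trailer into the sink.
    (with-open [gz (GZIPOutputStream. sink)]
      (.write gz raw))
    (.toByteArray sink)))

(defn gunzip-bytes
  "Decompress a gzipped byte array by reading it through a GZIPInputStream."
  [^bytes packed]
  (let [sink (ByteArrayOutputStream.)]
    (with-open [gz (GZIPInputStream. (ByteArrayInputStream. packed))]
      (jio/copy gz sink))
    (.toByteArray sink)))

;; Illustrative, highly repetitive sample rows (not the real dataset).
(def raw (.getBytes (apply str (repeat 1000 "AA\tNYSE\tUSD\n")) "UTF-8"))
(def packed (gzip-bytes raw))

(println (count raw) "bytes raw," (count packed) "bytes gzipped")
(println (= (seq raw) (seq (gunzip-bytes packed))))
```

Every read has to pass through the decompressor, which is where the extra load time in the gzipped case comes from.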
## Intermix With Clojure Data
Another really valuable aspect of nippy is that it can save/load datasets that
are part of arbitrary datastructures. For example, you can save
the result of `group-by-column`:
```clojure
user> (def tickers (ds/group-by-column "ticker" ds-2010))
#'user/tickers
user> (type tickers)
clojure.lang.PersistentHashMap
user> (count tickers)
11532
user> (first tickers)
["RBYCF" RBYCF [261 12]:
| low | comp_name_2 | high | currency_code | comp_name | m_ticker | ticker | close | volume | exchange | date | open |
|--------:|-------------|--------:|---------------|---------------|----------|--------|--------:|---------:|----------|------------|--------:|
| | | | USD | RUBICON MNRLS | RUBI | RBYCF | 759.677 | | OTC | 2010-01-01 | |
| 795.161 | | 827.419 | USD | RUBICON MNRLS | RUBI | RBYCF | 800.000 | 3596.775 | OTC | 2010-01-12 | 816.129 |
| 741.935 | | 779.032 | USD | RUBICON MNRLS | RUBI | RBYCF | 758.064 | 5490.292 | OTC | 2010-01-20 | 779.032 |
| 645.161 | | 688.710 | USD | RUBICON MNRLS | RUBI | RBYCF | 682.258 | 6201.953 | OTC | 2010-01-28 | 669.355 |
| 685.484 | | 725.806 | USD | RUBICON MNRLS | RUBI | RBYCF | 687.097 | 3491.220 | OTC | 2010-02-08 | 714.516 |
| 750.000 | | 783.871 | USD | RUBICON MNRLS | RUBI | RBYCF | 770.968 | 2927.057 | OTC | 2010-02-17 | 780.645 |
...
```
`group-by` and `group-by-column` both return persistent maps of key->dataset.
```clojure
user> (tech.io/put-nippy! "ticker-sorted.nippy" tickers)
nil
user> (def loaded-tickers (tech.io/get-nippy "ticker-sorted.nippy"))
#'user/loaded-tickers
user> (count loaded-tickers)
11532
user> (first loaded-tickers)
["RBYCF" RBYCF [261 12]:
| low | comp_name_2 | high | currency_code | comp_name | m_ticker | ticker | close | volume | exchange | date | open |
|--------:|-------------|--------:|---------------|---------------|----------|--------|--------:|---------:|----------|------------|--------:|
| | | | USD | RUBICON MNRLS | RUBI | RBYCF | 759.677 | | OTC | 2010-01-01 | |
| 795.161 | | 827.419 | USD | RUBICON MNRLS | RUBI | RBYCF | 800.000 | 3596.775 | OTC | 2010-01-12 | 816.129 |
| 741.935 | | 779.032 | USD | RUBICON MNRLS | RUBI | RBYCF | 758.064 | 5490.292 | OTC | 2010-01-20 | 779.032 |
| 645.161 | | 688.710 | USD | RUBICON MNRLS | RUBI | RBYCF | 682.258 | 6201.953 | OTC | 2010-01-28 | 669.355 |
| 685.484 | | 725.806 | USD | RUBICON MNRLS | RUBI | RBYCF | 687.097 | 3491.220 | OTC | 2010-02-08 | 714.516 |
| 750.000 | | 783.871 | USD | RUBICON MNRLS | RUBI | RBYCF | 770.968 | 2927.057 | OTC | 2010-02-17 | 780.645 |
```
Thus datasets can live inside maps, vectors, you name it, and you can save/load those
really complex datastructures. That can be a big help for complex dataflows.
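The same round trip works on in-memory bytes with nippy's own `freeze` and `thaw`. The sketch below assumes `com.taoensso/nippy` is on the classpath and uses plain Clojure values as stand-ins for datasets; the `report` map is an illustrative structure whose values echo the transcript above.

```clojure
(require '[taoensso.nippy :as nippy])

;; A nested structure mixing maps, vectors, strings and numbers.
;; With tech.ml.dataset loaded, dataset values could sit in the same
;; positions as these plain values.
(def report
  {:source  "nippy-demo/2010.tsv.gz"
   :tickers ["AA" "RBYCF"]
   :stats   {"RBYCF" {:rows 261 :exchange "OTC"}}})

;; freeze returns a byte array; thaw rebuilds the original value.
(def frozen (nippy/freeze report))
(def thawed (nippy/thaw frozen))

(println (count frozen) "bytes")
(println (= report thawed))
```

Because the whole structure is frozen in one call, there is no need to walk it and save each piece separately.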
## Simple Implementation
Our implementation of save/load for this pathway goes through two public functions:
* [dataset->data](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/base.clj#L666) - Convert a dataset into a pure
clojure/java datastructure suitable for serialization. Data is in arrays and string
tables have been slightly deconstructed.
* [data->dataset](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/base.clj#L694) - Given a data description of a
  dataset, create a new dataset. This is mostly a zero-copy operation, so it should be
  quite quick.
Near those functions you can see how easy it was to implement direct nippy support for
the dataset object itself. Really nice; Nippy is truly a great library :-).
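A minimal sketch of calling the two functions directly, assuming they are reachable as `tech.v3.dataset/dataset->data` and `tech.v3.dataset/data->dataset` (the links above point into `tech/v3/dataset/base.clj`) and that the library is on the classpath. The tiny dataset reuses a few values from the tables above, and the exact shape of the intermediate data is an implementation detail this sketch does not rely on.

```clojure
(require '[tech.v3.dataset :as ds])

;; A tiny column-major dataset built from a map of column name -> values.
(def small
  (ds/->dataset {:ticker ["AA" "AA" "RBYCF"]
                 :close  [48.365 51.065 759.677]}))

;; Dataset -> plain clojure/java data, then back to a dataset.
(def as-data (ds/dataset->data small))
(def rebuilt (ds/data->dataset as-data))

(println (ds/row-count rebuilt))
(println (= (vec (ds/column small :ticker))
            (vec (ds/column rebuilt :ticker))))
```

Going through plain data like this is what lets a general-purpose serializer handle the dataset without knowing anything about its internals.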