init research

2026-02-08 11:20:43 -10:00
commit bdf064f54d
3041 changed files with 1592200 additions and 0 deletions
@@ -0,0 +1,232 @@
+# tech.ml.dataset And nippy
+
+
+We are big fans of the [nippy system](https://github.com/ptaoussanis/nippy) for
+freezing/thawing data.  So we were pleasantly surprized with how well it performs
+with dataset and how easy it was to extend the dataset object to support nippy
+natively.
+
+
+## Nippy Hits One Out Of the Park
+
+
+We start with a decent size gzipped tabbed-delimited file.
+
+```console
+chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
+total 44M
+drwxrwxr-x  2 chrisn chrisn 4.0K Jun 18 13:27 .
+drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:27 ..
+-rw-rw-r--  1 chrisn chrisn  44M Jun 18 13:27 2010.tsv.gz
+```
+
+
+```clojure
+user> (def ds-2010 (time (ds/->dataset
+                          "nippy-demo/2010.tsv.gz"
+                          {:parser-fn {"date" [:packed-local-date "yyyy-MM-dd"]}})))
+"Elapsed time: 8588.080218 msecs"
+#'user/ds-2010
+user> ;;rename column names so the tables print nicely
+user> (def ds-2010
+        (ds/select-columns ds-2010
+                           (->> (ds/column-names ds-2010)
+                                (map (fn [oldname]
+                                       [oldname (.replace ^String oldname "_" "-")]))
+                                (into {}))))
+user> ds-2010
+nippy-demo/2010.tsv.gz [2769708 12]:
+
+|    low | comp-name-2 |   high | currency-code |  comp-name | m-ticker | ticker |  close |         volume | exchange |       date |   open |
+|-------:|-------------|-------:|---------------|------------|----------|--------|-------:|---------------:|----------|------------|-------:|
+|        |             |        |           USD | ALCOA CORP |      AA2 |     AA | 48.365 |                |     NYSE | 2010-01-01 |        |
+| 49.355 |             | 51.065 |           USD | ALCOA CORP |      AA2 |     AA | 51.065 | 1.10618840E+07 |     NYSE | 2010-01-08 | 49.385 |
+|        |             |        |           USD | ALCOA CORP |      AA2 |     AA | 46.895 |                |     NYSE | 2010-01-18 |        |
+| 39.904 |             | 41.854 |           USD | ALCOA CORP |      AA2 |     AA | 40.624 | 1.46292500E+07 |     NYSE | 2010-01-26 | 40.354 |
+| 40.294 |             | 41.674 |           USD | ALCOA CORP |      AA2 |     AA | 40.474 | 1.20107520E+07 |     NYSE | 2010-02-03 | 40.804 |
+| 39.304 |             | 40.504 |           USD | ALCOA CORP |      AA2 |     AA | 39.844 | 1.46702890E+07 |     NYSE | 2010-02-09 | 40.084 |
+| 39.574 |             | 40.264 |           USD | ALCOA CORP |      AA2 |     AA | 39.844 | 1.53728400E+07 |     NYSE | 2010-02-12 | 39.994 |
+| 40.324 |             | 41.104 |           USD | ALCOA CORP |      AA2 |     AA | 40.624 | 7.72947100E+06 |     NYSE | 2010-02-22 | 41.044 |
+| 39.664 |             | 40.564 |           USD | ALCOA CORP |      AA2 |     AA | 39.724 | 1.08365810E+07 |     NYSE | 2010-03-02 | 40.234 |
+```
+
+
+Our 44MB gzipped tsv produced 2.7 million rows and 12 columns.
+
+Let's check the ram usage:
+```clojure
+user> (require '[clj-memory-meter.core :as mm])
+nil
+user> (mm/measure ds-2010)
+"121.5 MB"
+```
+
+Now, let's save to an uncompressed nippy file:
+
+```clojure
+user> (require '[tech.io :as io])
+nil
+user> (time (tech.io/put-nippy! "test.nippy" ds-2010))
+"Elapsed time: 1069.781703 msecs"
+nil
+```
+
+One second, pretty nice :-).
+
+What is the file size?
+```console
+chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
+total 95M
+drwxrwxr-x  2 chrisn chrisn 4.0K Jun 18 13:38 .
+drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:36 ..
+-rw-rw-r--  1 chrisn chrisn  51M Jun 18 13:38 2010.nippy
+-rw-rw-r--  1 chrisn chrisn  44M Jun 18 13:27 2010.tsv.gz
+```
+
+Not bad, just a slight bit larger.
+
+The load performance, however, is spectacular:
+```clojure
+user> (def loaded-2010 (time (io/get-nippy "nippy-demo/2010.nippy")))
+"Elapsed time: 314.502715 msecs"
+#'user/loaded-2010
+user> (mm/measure loaded-2010)
+"93.9 MB"
+user> loaded-2010
+nippy-demo/2010.tsv.gz [2769708 12]:
+
+|    low | comp-name-2 |   high | currency-code |  comp-name | m-ticker | ticker |  close |         volume | exchange |       date |   open |
+|-------:|-------------|-------:|---------------|------------|----------|--------|-------:|---------------:|----------|------------|-------:|
+|        |             |        |           USD | ALCOA CORP |      AA2 |     AA | 48.365 |                |     NYSE | 2010-01-01 |        |
+| 49.355 |             | 51.065 |           USD | ALCOA CORP |      AA2 |     AA | 51.065 | 1.10618840E+07 |     NYSE | 2010-01-08 | 49.385 |
+|        |             |        |           USD | ALCOA CORP |      AA2 |     AA | 46.895 |                |     NYSE | 2010-01-18 |        |
+| 39.904 |             | 41.854 |           USD | ALCOA CORP |      AA2 |     AA | 40.624 | 1.46292500E+07 |     NYSE | 2010-01-26 | 40.354 |
+| 40.294 |             | 41.674 |           USD | ALCOA CORP |      AA2 |     AA | 40.474 | 1.20107520E+07 |     NYSE | 2010-02-03 | 40.804 |
+| 39.304 |             | 40.504 |           USD | ALCOA CORP |      AA2 |     AA | 39.844 | 1.46702890E+07 |     NYSE | 2010-02-09 | 40.084 |
+| 39.574 |             | 40.264 |           USD | ALCOA CORP |      AA2 |     AA | 39.844 | 1.53728400E+07 |     NYSE | 2010-02-12 | 39.994 |
+```
+
+It takes 8 seconds to load the tsv.  It takes 315 milliseconds to load the nippy!
+That is great :-).
+
+
+The resulting dataset is somewhat smaller in memory.  This is because when we
+parse a dataset we use fastutil lists and append elements to them and then return a
+dataset that sits directly on top of those lists as the column storage mechanism.  Those lists have a bit
+more capacity than absolutely necessary.
+
+When we save the data, we convert the data into base java/clojure datastructures
+such as primitive arrays.  This is what makes things smaller: converting from a list
+with a bit of extra capacity allocated to an exact sized array.  This operation is
+optimized and hits System/arraycopy under the covers as fastutil lists use arrays as
+the backing store and we make sure of the rest with `tech.datatype`.
+
+
+## Gzipping The Nippy
+
+
+We can do a bit better.  If you are really concerned about dataset size on disk, we
+can save out a gzipped nippy:
+
+
+```clojure
+user> (time (io/put-nippy! (io/gzip-output-stream! "nippy-demo/2010.nippy.gz") ds-2010))
+"Elapsed time: 7026.500505 msecs"
+nil
+```
+
+This beats the gzipped tsv in terms of size by 10%:
+```console
+chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
+total 134M
+drwxrwxr-x  2 chrisn chrisn 4.0K Jun 18 13:47 .
+drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:36 ..
+-rw-rw-r--  1 chrisn chrisn  51M Jun 18 13:38 2010.nippy
+-rw-rw-r--  1 chrisn chrisn  40M Jun 18 13:47 2010.nippy.gz
+-rw-rw-r--  1 chrisn chrisn  44M Jun 18 13:27 2010.tsv.gz
+```
+
+And now it takes twice the time to load:
+
+```clojure
+user> (def loaded-gzipped-2010 (time (io/get-nippy (io/gzip-input-stream "nippy-demo/2010.nippy.gz"))))
+"Elapsed time: 680.165118 msecs"
+#'user/loaded-gzipped-2010
+user> (mm/measure loaded-gzipped-2010)
+"93.9 MB"
+```
+
+You can probably handle load times in the 700ms range if you have a strong reason to
+have data compressed on disc.
+
+
+## Intermix With Clojure Data
+
+Another aspect of nippy that is really valuable is that it can save/load datasets that
+are parts of arbitrary datastructures.  So for example you can save
+the result of `group-by-column`:
+
+```clojure
+
+user> (def tickers (ds/group-by-column "ticker" ds-2010))
+#'user/tickers
+user> (type tickers)
+clojure.lang.PersistentHashMap
+user> (count tickers)
+11532
+user> (first tickers)
+["RBYCF" RBYCF [261 12]:
+
+|     low | comp_name_2 |    high | currency_code |     comp_name | m_ticker | ticker |   close |   volume | exchange |       date |    open |
+|--------:|-------------|--------:|---------------|---------------|----------|--------|--------:|---------:|----------|------------|--------:|
+|         |             |         |           USD | RUBICON MNRLS |     RUBI |  RBYCF | 759.677 |          |      OTC | 2010-01-01 |         |
+| 795.161 |             | 827.419 |           USD | RUBICON MNRLS |     RUBI |  RBYCF | 800.000 | 3596.775 |      OTC | 2010-01-12 | 816.129 |
+| 741.935 |             | 779.032 |           USD | RUBICON MNRLS |     RUBI |  RBYCF | 758.064 | 5490.292 |      OTC | 2010-01-20 | 779.032 |
+| 645.161 |             | 688.710 |           USD | RUBICON MNRLS |     RUBI |  RBYCF | 682.258 | 6201.953 |      OTC | 2010-01-28 | 669.355 |
+| 685.484 |             | 725.806 |           USD | RUBICON MNRLS |     RUBI |  RBYCF | 687.097 | 3491.220 |      OTC | 2010-02-08 | 714.516 |
+| 750.000 |             | 783.871 |           USD | RUBICON MNRLS |     RUBI |  RBYCF | 770.968 | 2927.057 |      OTC | 2010-02-17 | 780.645 |
+...
+```
+
+`group-by and `group-by-column` both return persistent maps of key->dataset.
+
+```clojure
+user> (tech.io/put-nippy! "ticker-sorted.nippy" tickers)
+nil
+user> (def loaded-tickers (tech.io/get-nippy "ticker-sorted.nippy"))
+#'user/loaded-tickers
+user> (count loaded-tickers)
+11532
+user> (first loaded-tickers)
+["RBYCF" RBYCF [261 12]:
+
+|     low | comp_name_2 |    high | currency_code |     comp_name | m_ticker | ticker |   close |   volume | exchange |       date |    open |
+|--------:|-------------|--------:|---------------|---------------|----------|--------|--------:|---------:|----------|------------|--------:|
+|         |             |         |           USD | RUBICON MNRLS |     RUBI |  RBYCF | 759.677 |          |      OTC | 2010-01-01 |         |
+| 795.161 |             | 827.419 |           USD | RUBICON MNRLS |     RUBI |  RBYCF | 800.000 | 3596.775 |      OTC | 2010-01-12 | 816.129 |
+| 741.935 |             | 779.032 |           USD | RUBICON MNRLS |     RUBI |  RBYCF | 758.064 | 5490.292 |      OTC | 2010-01-20 | 779.032 |
+| 645.161 |             | 688.710 |           USD | RUBICON MNRLS |     RUBI |  RBYCF | 682.258 | 6201.953 |      OTC | 2010-01-28 | 669.355 |
+| 685.484 |             | 725.806 |           USD | RUBICON MNRLS |     RUBI |  RBYCF | 687.097 | 3491.220 |      OTC | 2010-02-08 | 714.516 |
+| 750.000 |             | 783.871 |           USD | RUBICON MNRLS |     RUBI |  RBYCF | 770.968 | 2927.057 |      OTC | 2010-02-17 | 780.645 |
+```
+
+Thus datasets can be used in maps, vectors, you name it and you can load/save those
+really complex datastructures.  That can be a big help for complex dataflows.
+
+
+## Simple Implementation
+
+
+Our implementation of save/load for this pathway goes through two public functions:
+
+
+* [dataset->data](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/base.clj#L666) - Convert a dataset into a pure
+clojure/java datastructure suitable for serialization.  Data is in arrays and string
+tables have been slightly deconstructed.
+
+* [data->dataset](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/base.clj#L694) - Given a data-description of a
+dataset create a new dataset.  This is mainly a zero copy operation so it should be
+quite quick.
+
+Near those functions you can see how easy it was to implement direct nippy support for
+the dataset object itself.  Really nice, Nippy is truly a great library :-).