# tech.ml.dataset And nippy

We are big fans of the [nippy system](https://github.com/ptaoussanis/nippy) for
freezing/thawing data. So we were pleasantly surprised by how well it performs
with dataset and how easy it was to extend the dataset object to support nippy
natively.

## Nippy Hits One Out Of the Park

We start with a decent-sized gzipped tab-delimited file.

```console
chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
total 44M
drwxrwxr-x  2 chrisn chrisn 4.0K Jun 18 13:27 .
drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:27 ..
-rw-rw-r--  1 chrisn chrisn  44M Jun 18 13:27 2010.tsv.gz
```

```clojure
user> (def ds-2010 (time (ds/->dataset
                           "nippy-demo/2010.tsv.gz"
                           {:parser-fn {"date" [:packed-local-date "yyyy-MM-dd"]}})))
"Elapsed time: 8588.080218 msecs"
#'user/ds-2010
user> ;;rename column names so the tables print nicely
user> (def ds-2010
        (ds/select-columns ds-2010
                           (->> (ds/column-names ds-2010)
                                (map (fn [oldname]
                                       [oldname (.replace ^String oldname "_" "-")]))
                                (into {}))))
user> ds-2010
nippy-demo/2010.tsv.gz [2769708 12]:

|    low | comp-name-2 |   high | currency-code | comp-name  | m-ticker | ticker |  close |         volume | exchange | date       |   open |
|-------:|-------------|-------:|---------------|------------|----------|--------|-------:|---------------:|----------|------------|-------:|
|        |             |        | USD           | ALCOA CORP | AA2      | AA     | 48.365 |                | NYSE     | 2010-01-01 |        |
| 49.355 |             | 51.065 | USD           | ALCOA CORP | AA2      | AA     | 51.065 | 1.10618840E+07 | NYSE     | 2010-01-08 | 49.385 |
|        |             |        | USD           | ALCOA CORP | AA2      | AA     | 46.895 |                | NYSE     | 2010-01-18 |        |
| 39.904 |             | 41.854 | USD           | ALCOA CORP | AA2      | AA     | 40.624 | 1.46292500E+07 | NYSE     | 2010-01-26 | 40.354 |
| 40.294 |             | 41.674 | USD           | ALCOA CORP | AA2      | AA     | 40.474 | 1.20107520E+07 | NYSE     | 2010-02-03 | 40.804 |
| 39.304 |             | 40.504 | USD           | ALCOA CORP | AA2      | AA     | 39.844 | 1.46702890E+07 | NYSE     | 2010-02-09 | 40.084 |
| 39.574 |             | 40.264 | USD           | ALCOA CORP | AA2      | AA     | 39.844 | 1.53728400E+07 | NYSE     | 2010-02-12 | 39.994 |
| 40.324 |             | 41.104 | USD           | ALCOA CORP | AA2      | AA     | 40.624 | 7.72947100E+06 | NYSE     | 2010-02-22 | 41.044 |
| 39.664 |             | 40.564 | USD           | ALCOA CORP | AA2      | AA     | 39.724 | 1.08365810E+07 | NYSE     | 2010-03-02 | 40.234 |
```

Our 44MB gzipped tsv produced roughly 2.77 million rows and 12 columns.

Let's check the RAM usage:
```clojure
user> (require '[clj-memory-meter.core :as mm])
nil
user> (mm/measure ds-2010)
"121.5 MB"
```

Now, let's save to an uncompressed nippy file:

```clojure
user> (require '[tech.io :as io])
nil
user> (time (tech.io/put-nippy! "nippy-demo/2010.nippy" ds-2010))
"Elapsed time: 1069.781703 msecs"
nil
```

One second, pretty nice :-).

What is the file size?
```console
chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
total 95M
drwxrwxr-x  2 chrisn chrisn 4.0K Jun 18 13:38 .
drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:36 ..
-rw-rw-r--  1 chrisn chrisn  51M Jun 18 13:38 2010.nippy
-rw-rw-r--  1 chrisn chrisn  44M Jun 18 13:27 2010.tsv.gz
```

Not bad: at 51MB the nippy file is only slightly larger than the 44MB gzipped tsv.

The load performance, however, is spectacular:
```clojure
user> (def loaded-2010 (time (io/get-nippy "nippy-demo/2010.nippy")))
"Elapsed time: 314.502715 msecs"
#'user/loaded-2010
user> (mm/measure loaded-2010)
"93.9 MB"
user> loaded-2010
nippy-demo/2010.tsv.gz [2769708 12]:

|    low | comp-name-2 |   high | currency-code | comp-name  | m-ticker | ticker |  close |         volume | exchange | date       |   open |
|-------:|-------------|-------:|---------------|------------|----------|--------|-------:|---------------:|----------|------------|-------:|
|        |             |        | USD           | ALCOA CORP | AA2      | AA     | 48.365 |                | NYSE     | 2010-01-01 |        |
| 49.355 |             | 51.065 | USD           | ALCOA CORP | AA2      | AA     | 51.065 | 1.10618840E+07 | NYSE     | 2010-01-08 | 49.385 |
|        |             |        | USD           | ALCOA CORP | AA2      | AA     | 46.895 |                | NYSE     | 2010-01-18 |        |
| 39.904 |             | 41.854 | USD           | ALCOA CORP | AA2      | AA     | 40.624 | 1.46292500E+07 | NYSE     | 2010-01-26 | 40.354 |
| 40.294 |             | 41.674 | USD           | ALCOA CORP | AA2      | AA     | 40.474 | 1.20107520E+07 | NYSE     | 2010-02-03 | 40.804 |
| 39.304 |             | 40.504 | USD           | ALCOA CORP | AA2      | AA     | 39.844 | 1.46702890E+07 | NYSE     | 2010-02-09 | 40.084 |
| 39.574 |             | 40.264 | USD           | ALCOA CORP | AA2      | AA     | 39.844 | 1.53728400E+07 | NYSE     | 2010-02-12 | 39.994 |
```

It takes about 8.6 seconds to load the tsv. It takes 315 milliseconds to load the nippy!
That is great :-).

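To put a number on that difference, here is the arithmetic on the two `Elapsed time` figures from the REPL sessions above. This is just a quick sketch; the var names are made up for illustration.

```clojure
;; Elapsed times copied from the transcripts above, in milliseconds.
(def tsv-load-ms 8588.080218)
(def nippy-load-ms 314.502715)

;; How many times faster the nippy load was than the tsv parse.
(def speedup (/ tsv-load-ms nippy-load-ms))

(println (format "nippy load is roughly %.0fx faster" speedup))
```

Keep in mind these are single `time` measurements from one machine, so treat the ratio as a ballpark figure rather than a benchmark.
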
The resulting dataset is also somewhat smaller in memory: 93.9 MB versus 121.5 MB.
This is because when we parse a dataset we append elements to fastutil lists and then
return a dataset that sits directly on top of those lists as the column storage
mechanism. Those lists have a bit more capacity than absolutely necessary.

When we save the data, we convert it into base Java/Clojure data structures
such as primitive arrays. This is what makes things smaller: converting from a list
with a bit of extra capacity allocated to an exactly sized array. This operation is
optimized and hits System/arraycopy under the covers, as fastutil lists use arrays as
the backing store and we take care of the rest with `tech.datatype`.

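The capacity effect is easy to see with plain JDK arrays. The sketch below is not the dataset or fastutil code; `fill-with-spare-capacity` and `trim-to-size` are hypothetical names, and the doubling growth policy is an assumption chosen only to illustrate the idea of a growable list being trimmed to an exactly sized array.

```clojure
(import '[java.util Arrays])

(defn fill-with-spare-capacity
  "Appends each value to a double array that doubles in size whenever it
  fills up. Returns [backing-array n-used]; the backing array is usually
  longer than n-used. (Hypothetical stand-in for a growable primitive list.)"
  [values]
  (loop [^doubles data (double-array 4)
         n 0
         vs (seq values)]
    (if-not vs
      [data n]
      (let [^doubles data (if (== n (alength data))
                            ;; Full: grow to twice the current capacity.
                            (Arrays/copyOf data (int (* 2 (alength data))))
                            data)]
        (aset data n (double (first vs)))
        (recur data (inc n) (next vs))))))

(defn trim-to-size
  "Bulk-copies only the used prefix into an exactly sized array."
  ^doubles [^doubles data n]
  (let [exact (double-array n)]
    (System/arraycopy data 0 exact 0 (int n))
    exact))

(let [[data n]       (fill-with-spare-capacity (range 1000))
      ^doubles exact (trim-to-size data n)]
  (println "capacity:" (alength ^doubles data)
           "used:" n
           "trimmed:" (alength exact)))
```

With 1000 values the backing array has grown to 1024 slots, while the trimmed copy holds exactly 1000. The same kind of slack, multiplied over millions of rows and a dozen columns, is what disappears when the data goes through a save/load cycle.
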
## Gzipping The Nippy

We can do a bit better. If you are really concerned about dataset size on disk, we
can save out a gzipped nippy:

```clojure
user> (time (io/put-nippy! (io/gzip-output-stream! "nippy-demo/2010.nippy.gz") ds-2010))
"Elapsed time: 7026.500505 msecs"
nil
```

This beats the gzipped tsv in terms of size by about 10%:
```console
chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
total 134M
drwxrwxr-x  2 chrisn chrisn 4.0K Jun 18 13:47 .
drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:36 ..
-rw-rw-r--  1 chrisn chrisn  51M Jun 18 13:38 2010.nippy
-rw-rw-r--  1 chrisn chrisn  40M Jun 18 13:47 2010.nippy.gz
-rw-rw-r--  1 chrisn chrisn  44M Jun 18 13:27 2010.tsv.gz
```

And now it takes roughly twice as long to load:

```clojure
user> (def loaded-gzipped-2010 (time (io/get-nippy (io/gzip-input-stream "nippy-demo/2010.nippy.gz"))))
"Elapsed time: 680.165118 msecs"
#'user/loaded-gzipped-2010
user> (mm/measure loaded-gzipped-2010)
"93.9 MB"
```

You can probably handle load times in the 700ms range if you have a strong reason to
have data compressed on disk.

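The calls above hand nippy a compressing stream instead of a file path. As a rough illustration of that stream-wrapping pattern using only JDK classes, here is a small round trip through `GZIPOutputStream` and `GZIPInputStream`. We are assuming, not asserting, that the `tech.io` gzip helpers do something along these lines; `gzip-bytes` and `gunzip-bytes` are names invented for this sketch.

```clojure
(require '[clojure.java.io :as jio])
(import '[java.io ByteArrayInputStream ByteArrayOutputStream]
        '[java.util Arrays]
        '[java.util.zip GZIPInputStream GZIPOutputStream])

(defn gzip-bytes
  "Compresses a byte array by writing it through a GZIPOutputStream."
  ^bytes [^bytes data]
  (let [sink (ByteArrayOutputStream.)]
    ;; Closing the gzip stream flushes the trailer into the sink.
    (with-open [gz (GZIPOutputStream. sink)]
      (.write gz data))
    (.toByteArray sink)))

(defn gunzip-bytes
  "Reads a gzipped byte array back through a GZIPInputStream."
  ^bytes [^bytes data]
  (let [sink (ByteArrayOutputStream.)]
    (with-open [gz (GZIPInputStream. (ByteArrayInputStream. data))]
      (jio/copy gz sink))
    (.toByteArray sink)))

;; Repetitive bytes, loosely like a column that repeats the same exchange
;; and currency on most rows, compress very well.
(let [raw    (.getBytes ^String (apply str (repeat 10000 "NYSE,USD,")) "UTF-8")
      packed (gzip-bytes raw)]
  (println "raw bytes:" (alength raw) "gzipped bytes:" (alength packed))
  (println "round trip ok:" (Arrays/equals raw (gunzip-bytes packed))))
```

The trade-off is the one measured above: the extra compression pass costs CPU time on both write and read in exchange for a smaller file.
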
## Intermix With Clojure Data

Another aspect of nippy that is really valuable is that it can save/load datasets that
are parts of arbitrary data structures. So, for example, you can save
the result of `group-by-column`:

```clojure
user> (def tickers (ds/group-by-column "ticker" ds-2010))
#'user/tickers
user> (type tickers)
clojure.lang.PersistentHashMap
user> (count tickers)
11532
user> (first tickers)
["RBYCF" RBYCF [261 12]:

|     low | comp_name_2 |    high | currency_code | comp_name     | m_ticker | ticker |   close |   volume | exchange | date       |    open |
|--------:|-------------|--------:|---------------|---------------|----------|--------|--------:|---------:|----------|------------|--------:|
|         |             |         | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 759.677 |          | OTC      | 2010-01-01 |         |
| 795.161 |             | 827.419 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 800.000 | 3596.775 | OTC      | 2010-01-12 | 816.129 |
| 741.935 |             | 779.032 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 758.064 | 5490.292 | OTC      | 2010-01-20 | 779.032 |
| 645.161 |             | 688.710 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 682.258 | 6201.953 | OTC      | 2010-01-28 | 669.355 |
| 685.484 |             | 725.806 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 687.097 | 3491.220 | OTC      | 2010-02-08 | 714.516 |
| 750.000 |             | 783.871 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 770.968 | 2927.057 | OTC      | 2010-02-17 | 780.645 |
...
```

`group-by` and `group-by-column` both return persistent maps of key->dataset.

```clojure
user> (tech.io/put-nippy! "ticker-sorted.nippy" tickers)
nil
user> (def loaded-tickers (tech.io/get-nippy "ticker-sorted.nippy"))
#'user/loaded-tickers
user> (count loaded-tickers)
11532
user> (first loaded-tickers)
["RBYCF" RBYCF [261 12]:

|     low | comp_name_2 |    high | currency_code | comp_name     | m_ticker | ticker |   close |   volume | exchange | date       |    open |
|--------:|-------------|--------:|---------------|---------------|----------|--------|--------:|---------:|----------|------------|--------:|
|         |             |         | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 759.677 |          | OTC      | 2010-01-01 |         |
| 795.161 |             | 827.419 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 800.000 | 3596.775 | OTC      | 2010-01-12 | 816.129 |
| 741.935 |             | 779.032 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 758.064 | 5490.292 | OTC      | 2010-01-20 | 779.032 |
| 645.161 |             | 688.710 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 682.258 | 6201.953 | OTC      | 2010-01-28 | 669.355 |
| 685.484 |             | 725.806 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 687.097 | 3491.220 | OTC      | 2010-02-08 | 714.516 |
| 750.000 |             | 783.871 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 770.968 | 2927.057 | OTC      | 2010-02-17 | 780.645 |
```

Thus datasets can be used in maps, vectors, you name it, and you can load/save those
really complex data structures. That can be a big help for complex dataflows.

## Simple Implementation

Our implementation of save/load for this pathway goes through two public functions:

* [dataset->data](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/base.clj#L666) - Convert a dataset into a pure
  clojure/java datastructure suitable for serialization. Data is in arrays and string
  tables have been slightly deconstructed.

* [data->dataset](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/base.clj#L694) - Given a data-description of a
  dataset create a new dataset. This is mainly a zero copy operation so it should be
  quite quick.

Near those functions you can see how easy it was to implement direct nippy support for
the dataset object itself. Really nice; Nippy is truly a great library :-).
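
For a feel of what that pathway looks like from the outside, here is a minimal round trip through the two functions. It is a sketch under assumptions: we assume both functions are reachable through the `tech.v3.dataset` namespace (the links above point into `tech.v3.dataset.base`), and the tiny dataset and its column names are made up for the example.

```clojure
(require '[tech.v3.dataset :as ds])

;; A tiny stand-in dataset; the columns and values are invented.
(def small-ds (ds/->dataset {:ticker ["AA" "AA" "RBYCF"]
                             :close  [48.365 51.065 759.677]}))

;; Dataset -> plain Clojure/Java data suitable for serialization.
(def as-data (ds/dataset->data small-ds))

;; Plain data -> dataset again.
(def round-tripped (ds/data->dataset as-data))

(println (ds/row-count round-tripped) (ds/column-names round-tripped))
```

Anything that can serialize the intermediate plain data could in principle sit in the middle of this round trip; nippy is simply the serializer used throughout this page.
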