<!DOCTYPE html PUBLIC ""
    "">
<html><head><meta charset="UTF-8" /><title>tech.ml.dataset And nippy</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">8.003</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="000-getting-started.html"><div class="inner"><span>tech.ml.dataset Getting Started</span></div></a></li><li class="depth-1 "><a href="100-walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li><li class="depth-1 "><a href="200-quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 current"><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span 
class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span class="top"></span><span 
class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -641px;"><span class="top" style="height: 650px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.clj-transit.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clj-transit</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="document" id="content"><div class="doc"><div class="markdown"><h1>tech.ml.dataset And nippy</h1>
<p>We are big fans of the <a href="https://github.com/ptaoussanis/nippy">nippy system</a> for
freezing/thawing data. So we were pleasantly surprised by how well it performs
with datasets and how easy it was to extend the dataset object to support nippy
natively.</p>
<h2>Nippy Hits One Out Of the Park</h2>
<p>We start with a decent-sized gzipped tab-delimited file.</p>
<pre><code class="language-console">chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
total 44M
drwxrwxr-x  2 chrisn chrisn 4.0K Jun 18 13:27 .
drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:27 ..
-rw-rw-r--  1 chrisn chrisn  44M Jun 18 13:27 2010.tsv.gz
</code></pre>
<pre><code class="language-clojure">user> (def ds-2010 (time (ds/->dataset
                           "nippy-demo/2010.tsv.gz"
                           {:parser-fn {"date" [:packed-local-date "yyyy-MM-dd"]}})))
"Elapsed time: 8588.080218 msecs"
#'user/ds-2010
user> ;;rename column names so the tables print nicely
user> (def ds-2010
        (ds/select-columns ds-2010
                           (->> (ds/column-names ds-2010)
                                (map (fn [oldname]
                                       [oldname (.replace ^String oldname "_" "-")]))
                                (into {}))))
user> ds-2010
nippy-demo/2010.tsv.gz [2769708 12]:

|    low | comp-name-2 |   high | currency-code | comp-name  | m-ticker | ticker |  close |         volume | exchange | date       |   open |
|-------:|-------------|-------:|---------------|------------|----------|--------|-------:|---------------:|----------|------------|-------:|
|        |             |        | USD           | ALCOA CORP | AA2      | AA     | 48.365 |                | NYSE     | 2010-01-01 |        |
| 49.355 |             | 51.065 | USD           | ALCOA CORP | AA2      | AA     | 51.065 | 1.10618840E+07 | NYSE     | 2010-01-08 | 49.385 |
|        |             |        | USD           | ALCOA CORP | AA2      | AA     | 46.895 |                | NYSE     | 2010-01-18 |        |
| 39.904 |             | 41.854 | USD           | ALCOA CORP | AA2      | AA     | 40.624 | 1.46292500E+07 | NYSE     | 2010-01-26 | 40.354 |
| 40.294 |             | 41.674 | USD           | ALCOA CORP | AA2      | AA     | 40.474 | 1.20107520E+07 | NYSE     | 2010-02-03 | 40.804 |
| 39.304 |             | 40.504 | USD           | ALCOA CORP | AA2      | AA     | 39.844 | 1.46702890E+07 | NYSE     | 2010-02-09 | 40.084 |
| 39.574 |             | 40.264 | USD           | ALCOA CORP | AA2      | AA     | 39.844 | 1.53728400E+07 | NYSE     | 2010-02-12 | 39.994 |
| 40.324 |             | 41.104 | USD           | ALCOA CORP | AA2      | AA     | 40.624 | 7.72947100E+06 | NYSE     | 2010-02-22 | 41.044 |
| 39.664 |             | 40.564 | USD           | ALCOA CORP | AA2      | AA     | 39.724 | 1.08365810E+07 | NYSE     | 2010-03-02 | 40.234 |
</code></pre>
<p>Our 44MB gzipped tsv produced nearly 2.8 million rows and 12 columns.</p>
<p>Let's check the RAM usage:</p>
<pre><code class="language-clojure">user> (require '[clj-memory-meter.core :as mm])
nil
user> (mm/measure ds-2010)
"121.5 MB"
</code></pre>
<p>Now, let's save to an uncompressed nippy file:</p>
<pre><code class="language-clojure">user> (require '[tech.io :as io])
nil
user> (time (tech.io/put-nippy! "nippy-demo/2010.nippy" ds-2010))
"Elapsed time: 1069.781703 msecs"
nil
</code></pre>
<p>One second, pretty nice :-).</p>
<p>What is the file size?</p>
<pre><code class="language-console">chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
total 95M
drwxrwxr-x  2 chrisn chrisn 4.0K Jun 18 13:38 .
drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:36 ..
-rw-rw-r--  1 chrisn chrisn  51M Jun 18 13:38 2010.nippy
-rw-rw-r--  1 chrisn chrisn  44M Jun 18 13:27 2010.tsv.gz
</code></pre>
<p>Not bad: at 51M the nippy file is only slightly larger than the 44M gzipped tsv.</p>
<p>The load performance, however, is spectacular:</p>
<pre><code class="language-clojure">user> (def loaded-2010 (time (io/get-nippy "nippy-demo/2010.nippy")))
"Elapsed time: 314.502715 msecs"
#'user/loaded-2010
user> (mm/measure loaded-2010)
"93.9 MB"
user> loaded-2010
nippy-demo/2010.tsv.gz [2769708 12]:

|    low | comp-name-2 |   high | currency-code | comp-name  | m-ticker | ticker |  close |         volume | exchange | date       |   open |
|-------:|-------------|-------:|---------------|------------|----------|--------|-------:|---------------:|----------|------------|-------:|
|        |             |        | USD           | ALCOA CORP | AA2      | AA     | 48.365 |                | NYSE     | 2010-01-01 |        |
| 49.355 |             | 51.065 | USD           | ALCOA CORP | AA2      | AA     | 51.065 | 1.10618840E+07 | NYSE     | 2010-01-08 | 49.385 |
|        |             |        | USD           | ALCOA CORP | AA2      | AA     | 46.895 |                | NYSE     | 2010-01-18 |        |
| 39.904 |             | 41.854 | USD           | ALCOA CORP | AA2      | AA     | 40.624 | 1.46292500E+07 | NYSE     | 2010-01-26 | 40.354 |
| 40.294 |             | 41.674 | USD           | ALCOA CORP | AA2      | AA     | 40.474 | 1.20107520E+07 | NYSE     | 2010-02-03 | 40.804 |
| 39.304 |             | 40.504 | USD           | ALCOA CORP | AA2      | AA     | 39.844 | 1.46702890E+07 | NYSE     | 2010-02-09 | 40.084 |
| 39.574 |             | 40.264 | USD           | ALCOA CORP | AA2      | AA     | 39.844 | 1.53728400E+07 | NYSE     | 2010-02-12 | 39.994 |
</code></pre>
<p>It takes about 8.6 seconds to load the tsv and about 315 milliseconds to load the
nippy, roughly 27 times faster! That is great :-).</p>
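<p>Given that gap, one natural pattern is to parse the tsv once and thaw the nippy file
on every later run. A minimal sketch, assuming the same <code>ds</code> and <code>io</code>
aliases as the session above; <code>load-cached</code> is a hypothetical helper, not
something either library provides:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.io :as io])

(defn load-cached
  "Hypothetical helper: thaw `nippy-path` when it exists, otherwise parse
  `src-path` with `options`, write the nippy file, and return the dataset."
  [src-path nippy-path options]
  (if (.exists (java.io.File. ^String nippy-path))
    (io/get-nippy nippy-path)
    (let [dataset (ds/->dataset src-path options)]
      (io/put-nippy! nippy-path dataset)
      dataset)))

(comment
  ;; The first call pays the parse cost, later calls only thaw the file.
  (def ds-2010
    (load-cached "nippy-demo/2010.tsv.gz"
                 "nippy-demo/2010.nippy"
                 {:parser-fn {"date" [:packed-local-date "yyyy-MM-dd"]}})))
</code></pre>
<p>The sketch never invalidates the cache, so delete the nippy file whenever the source
file or the parser options change.</p>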
<p>The resulting dataset is somewhat smaller in memory (93.9 MB versus 121.5 MB). This
is because when we parse a dataset we append elements to fastutil lists and then return a
dataset that sits directly on top of those lists as the column storage mechanism. Those
lists have a bit more capacity than absolutely necessary.</p>
<p>When we save the data, we convert it into base java/clojure datastructures
such as primitive arrays. This is what makes things smaller: converting from a list
with a bit of extra capacity allocated to an exact-sized array. This operation is
optimized and hits System/arraycopy under the covers, as fastutil lists use arrays as
the backing store, and we make sure of the rest with <code>tech.datatype</code>.</p>
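<p>The trimming step can be illustrated without any dataset machinery. The following is
a rough sketch in plain Clojure, not the library's actual code path: a growable buffer
doubles its backing array when it fills up, and <code>java.util.Arrays/copyOf</code>
then copies only the used prefix into an exact-sized array. The names
<code>append</code> and <code>trim</code> are hypothetical.</p>
<pre><code class="language-clojure">(defn append
  "Hypothetical growable buffer {:data double-array, :n used-count}.
  Doubles the backing array when it is full, then stores `v`."
  [{:keys [^doubles data n]} v]
  (let [^doubles data (if (= n (alength data))
                        (java.util.Arrays/copyOf data (int (max 1 (* 2 (alength data)))))
                        data)]
    (aset data (int n) (double v))
    {:data data :n (inc n)}))

(defn trim
  "Copy only the used prefix into an exact-sized array."
  [{:keys [^doubles data n]}]
  (java.util.Arrays/copyOf data (int n)))

;; Five appends into a buffer that starts with room for four.
(def buf (reduce append {:data (double-array 4) :n 0} (range 5)))

(alength ^doubles (:data buf)) ;; => 8, spare capacity left over from growth
(alength ^doubles (trim buf))  ;; => 5, exactly the number of elements
</code></pre>
<p>The three unused slots in this toy buffer play the same role as the extra list
capacity described above: they cost memory until the data is copied into an array of
the exact size.</p>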
<h2>Gzipping The Nippy</h2>
<p>We can do a bit better. If you are really concerned about dataset size on disk, we
can save out a gzipped nippy:</p>
<pre><code class="language-clojure">user> (time (io/put-nippy! (io/gzip-output-stream! "nippy-demo/2010.nippy.gz") ds-2010))
"Elapsed time: 7026.500505 msecs"
nil
</code></pre>
<p>This beats the gzipped tsv in terms of size by roughly 10% (40M versus 44M):</p>
<pre><code class="language-console">chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
total 134M
drwxrwxr-x  2 chrisn chrisn 4.0K Jun 18 13:47 .
drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:36 ..
-rw-rw-r--  1 chrisn chrisn  51M Jun 18 13:38 2010.nippy
-rw-rw-r--  1 chrisn chrisn  40M Jun 18 13:47 2010.nippy.gz
-rw-rw-r--  1 chrisn chrisn  44M Jun 18 13:27 2010.tsv.gz
</code></pre>
<p>The trade-off is that it now takes about twice as long to load:</p>
<pre><code class="language-clojure">user> (def loaded-gzipped-2010 (time (io/get-nippy (io/gzip-input-stream "nippy-demo/2010.nippy.gz"))))
"Elapsed time: 680.165118 msecs"
#'user/loaded-gzipped-2010
user> (mm/measure loaded-gzipped-2010)
"93.9 MB"
</code></pre>
<p>You can probably handle load times in the 700ms range if you have a strong reason to
have data compressed on disk.</p>
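<p>If both variants are in play, the choice can be keyed off the file name. A small
sketch that only wraps the four <code>tech.io</code> calls shown above;
<code>save-nippy!</code> and <code>load-nippy</code> are hypothetical helper names:</p>
<pre><code class="language-clojure">(require '[tech.io :as io])

(defn save-nippy!
  "Hypothetical helper: write through a gzip stream when `path` ends in .gz."
  [path data]
  (if (.endsWith ^String path ".gz")
    (io/put-nippy! (io/gzip-output-stream! path) data)
    (io/put-nippy! path data)))

(defn load-nippy
  "Hypothetical helper: read through a gzip stream when `path` ends in .gz."
  [path]
  (if (.endsWith ^String path ".gz")
    (io/get-nippy (io/gzip-input-stream path))
    (io/get-nippy path)))

(comment
  (save-nippy! "nippy-demo/2010.nippy.gz" ds-2010)
  (def loaded-gzipped-2010 (load-nippy "nippy-demo/2010.nippy.gz")))
</code></pre>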
<h2>Intermix With Clojure Data</h2>
<p>Another really valuable aspect of nippy is that it can save/load datasets that
are part of arbitrary datastructures. For example, you can save
the result of <code>group-by-column</code>:</p>
<pre><code class="language-clojure">
user> (def tickers (ds/group-by-column ds-2010 "ticker"))
#'user/tickers
user> (type tickers)
clojure.lang.PersistentHashMap
user> (count tickers)
11532
user> (first tickers)
["RBYCF" RBYCF [261 12]:

|     low | comp_name_2 |    high | currency_code | comp_name     | m_ticker | ticker |   close |   volume | exchange | date       |    open |
|--------:|-------------|--------:|---------------|---------------|----------|--------|--------:|---------:|----------|------------|--------:|
|         |             |         | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 759.677 |          | OTC      | 2010-01-01 |         |
| 795.161 |             | 827.419 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 800.000 | 3596.775 | OTC      | 2010-01-12 | 816.129 |
| 741.935 |             | 779.032 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 758.064 | 5490.292 | OTC      | 2010-01-20 | 779.032 |
| 645.161 |             | 688.710 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 682.258 | 6201.953 | OTC      | 2010-01-28 | 669.355 |
| 685.484 |             | 725.806 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 687.097 | 3491.220 | OTC      | 2010-02-08 | 714.516 |
| 750.000 |             | 783.871 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 770.968 | 2927.057 | OTC      | 2010-02-17 | 780.645 |
...
</code></pre>
<p><code>group-by</code> and <code>group-by-column</code> both return persistent maps of key->dataset.</p>
<pre><code class="language-clojure">user> (tech.io/put-nippy! "ticker-sorted.nippy" tickers)
nil
user> (def loaded-tickers (tech.io/get-nippy "ticker-sorted.nippy"))
#'user/loaded-tickers
user> (count loaded-tickers)
11532
user> (first loaded-tickers)
["RBYCF" RBYCF [261 12]:

|     low | comp_name_2 |    high | currency_code | comp_name     | m_ticker | ticker |   close |   volume | exchange | date       |    open |
|--------:|-------------|--------:|---------------|---------------|----------|--------|--------:|---------:|----------|------------|--------:|
|         |             |         | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 759.677 |          | OTC      | 2010-01-01 |         |
| 795.161 |             | 827.419 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 800.000 | 3596.775 | OTC      | 2010-01-12 | 816.129 |
| 741.935 |             | 779.032 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 758.064 | 5490.292 | OTC      | 2010-01-20 | 779.032 |
| 645.161 |             | 688.710 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 682.258 | 6201.953 | OTC      | 2010-01-28 | 669.355 |
| 685.484 |             | 725.806 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 687.097 | 3491.220 | OTC      | 2010-02-08 | 714.516 |
| 750.000 |             | 783.871 | USD           | RUBICON MNRLS | RUBI     | RBYCF  | 770.968 | 2927.057 | OTC      | 2010-02-17 | 780.645 |
</code></pre>
<p>Thus datasets can live inside maps, vectors, or any other nested structure, and you
can load/save the whole structure in one call. That can be a big help for complex dataflows.</p>
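<p>As a small illustration of that point, the sketch below freezes a map that mixes
plain Clojure values with datasets. The dataset, the <code>report</code> map, and the
file name are made up for the example, and the <code>ds</code> and <code>io</code>
aliases are assumed to match the session above:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.io :as io])

;; A tiny stand-in for the real data.
(def prices
  (ds/->dataset {"ticker" ["AA" "AA" "RBYCF"]
                 "close"  [48.365 51.065 759.677]}))

;; A nested structure mixing plain Clojure data with datasets.
;; `into {}` keeps the grouping a plain persistent map.
(def report
  {:source    "nippy-demo/2010.tsv.gz"
   :row-count (ds/row-count prices)
   :by-ticker (into {} (ds/group-by-column prices "ticker"))
   :samples   [(ds/head prices)]})

(io/put-nippy! "report.nippy" report)
(def loaded-report (io/get-nippy "report.nippy"))

;; Plain values and datasets come back together.
(:row-count loaded-report)
(sort (keys (:by-ticker loaded-report)))
</code></pre>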
<h2>Simple Implementation</h2>
<p>Our implementation of save/load for this pathway goes through two public functions:</p>
<ul>
<li>
<p><a href="https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/base.clj#L666">dataset->data</a> - Convert a dataset into a pure
clojure/java datastructure suitable for serialization. Data is in arrays and string
tables have been slightly deconstructed.</p>
</li>
<li>
<p><a href="https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/base.clj#L694">data->dataset</a> - Given a data-description of a
dataset, create a new dataset. This is mainly a zero-copy operation so it should be
quite quick.</p>
</li>
</ul>
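<p>A round trip through the two functions might look like the sketch below. It assumes
both are reachable through the <code>tech.v3.dataset</code> namespace under the
<code>ds</code> alias and that nippy is on the classpath; the tiny dataset is invented
for the example:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[taoensso.nippy :as nippy])

(def small (ds/->dataset {:a [1 2 3] :b ["x" "y" "z"]}))

;; Explicit pathway: dataset -> plain data -> dataset.
(def as-data (ds/dataset->data small))
(def rebuilt (ds/data->dataset as-data))

;; Direct nippy support: freeze and thaw the dataset object itself.
(def thawed (nippy/thaw (nippy/freeze small)))

;; Both results should carry the same column values as the original.
(= (vec (small :a)) (vec (rebuilt :a)) (vec (thawed :a)))
</code></pre>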
<p>Near those functions you can see how easy it was to implement direct nippy support for
the dataset object itself. Really nice. Nippy is truly a great library :-).</p>
</div></div></div></body></html>