<!DOCTYPE html PUBLIC ""
    "">
<html><head><meta charset="UTF-8" /><title>tech.ml.dataset Columns, Readers, and Datatypes</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">8.003</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="000-getting-started.html"><div class="inner"><span>tech.ml.dataset Getting Started</span></div></a></li><li class="depth-1 "><a href="100-walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li><li class="depth-1 "><a href="200-quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 current"><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span 
class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span class="top"></span><span 
class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -641px;"><span class="top" style="height: 650px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.clj-transit.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clj-transit</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="document" id="content"><div class="doc"><div class="markdown"><h1>tech.ml.dataset Columns, Readers, and Datatypes</h1>
<p>In <code>tech.ml.dataset</code>, columns are composed of three things:
<a href="https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/impl/column.clj#L140">data, metadata, and the missing set</a>.
The column's datatype is the datatype of the <code>data</code> member. The data member can
be anything convertible to a <code>tech.v3.datatype</code> reader of the appropriate type.</p>
<p>Buffers are a <a href="https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/java/tech/v3/datatype/Buffer.java">simple abstraction</a> of typed, random-access, read-only
memory that implements all the interfaces required to be both efficient and easy to use.
You can create a buffer by reifying the appropriately typed interface from
<code>tech.v3.datatype</code>, but the datatype library has
<a href="https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/src/tech/v3/datatype.clj#L102">quick paths</a> to creating these:</p>
<pre><code class="language-clojure">user> (require '[tech.v3.datatype :as dtype])
nil
user> (dtype/make-reader :float32 5 idx)
[0.0 1.0 2.0 3.0 4.0]
user> (dtype/make-reader :float32 5 (* 2 idx))
[0.0 2.0 4.0 6.0 8.0]
</code></pre>
<p>A read-only buffer needs only three methods: <code>elemwiseDatatype</code> (optional), <code>lsize</code>, and
<code>read[X]</code>. <code>read[X]</code> is typed to the datatype, so in the example above
<code>readFloat</code> returns a primitive float. <code>lsize</code> returns a long. Unlike the
similar method <code>get</code> in Java lists, the <code>read[X]</code> methods take a long index. This allows us
to use read methods on storage mechanisms capable of addressing more than 2 (signed int)
or 4 (unsigned int) billion addresses.</p>
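<p>As a rough illustration of the reify route, the sketch below builds a reader by hand. It assumes the <code>tech.v3.datatype.DoubleReader</code> interface with <code>lsize</code> and <code>readDouble</code> methods, which may differ between dtype-next versions; the name <code>half-steps</code> is made up for this example.</p>
<pre><code class="language-clojure">(import '[tech.v3.datatype DoubleReader])

;; Hypothetical hand-written reader: element idx is idx * 0.5.
;; elemwiseDatatype is left out and assumed to default to :float64.
(def half-steps
  (reify DoubleReader
    (lsize [this] 4)
    (readDouble [this idx] (* 0.5 idx))))

;; Values are computed on demand when the reader is traversed.
(vec half-steps) ;; => [0.0 0.5 1.0 1.5]
</code></pre>
<p>The <code>dtype/make-reader</code> macro shown earlier is the shorter way to get the same kind of object; reifying by hand is mostly useful when the reader needs extra state or methods.</p>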
<p>Another way to create a reader is to do a 'map' type translation from one or more other
readers. This is provided in two ways:</p>
<ul>
<li><a href="https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/src/tech/v3/datatype/emap.clj#L97"><code>dtype/emap</code></a> - Missing set ignorant mapping into a typed representation.</li>
<li><a href="https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/column.clj#L174"><code>tech.v3.dataset.column/column-map</code></a> - Missing set aware mapping into a typed representation.</li>
</ul>
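<p>The difference between the two can be sketched as follows. This is a minimal example, assuming both functions take a mapping function, a result datatype, and then the source readers or columns; the dataset <code>toy</code> and the other names are hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.datatype :as dtype]
         '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.column :as ds-col])

;; emap: elementwise map over plain readers, with no notion of missing values.
(def base (dtype/make-reader :float64 3 idx))
(def doubled (dtype/emap (fn [x] (* 2.0 x)) :float64 base))
(vec doubled) ;; => [0.0 2.0 4.0]

;; column-map: the same idea applied to columns; the result is a column
;; that carries a missing set derived from its inputs.
;; Hypothetical toy dataset; the nil entry is parsed as a missing value.
(def toy (ds/->dataset {:a [1.0 nil 3.0]}))
(def doubled-col (ds-col/column-map (fn [x] (* 2.0 x)) :float64 (toy :a)))

;; Inspect which row indexes are marked missing in the result.
(ds-col/missing doubled-col)
</code></pre>
<p>Both results are lazy: the mapping function runs only when an element is read.</p>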
<p>The dataset system in general is smart enough to create columns out of readers in most
situations. So, for instance, if you have a dataset and you want a column of a
particular type, you can call <code>add-or-update-column</code> and pass in a reader that implements what
you want:</p>
<pre><code class="language-clojure">user> (require '[tech.v3.dataset :as ds])
nil
user> (def stocks (ds/->dataset "test/data/stocks.csv"))
#'user/stocks
user> (ds/head stocks)
test/data/stocks.csv [5 3]:

| symbol |       date | price |
|--------+------------+-------|
|   MSFT | 2000-01-01 | 39.81 |
|   MSFT | 2000-02-01 | 36.35 |
|   MSFT | 2000-03-01 | 43.22 |
|   MSFT | 2000-04-01 | 28.37 |
|   MSFT | 2000-05-01 | 25.45 |
user> (ds/head (ds/add-or-update-column stocks "id"
                                        (dtype/make-reader :int64
                                                           (ds/row-count stocks)
                                                           idx)))
test/data/stocks.csv [5 4]:

| symbol |       date | price | id |
|--------+------------+-------+----|
|   MSFT | 2000-01-01 | 39.81 |  0 |
|   MSFT | 2000-02-01 | 36.35 |  1 |
|   MSFT | 2000-03-01 | 43.22 |  2 |
|   MSFT | 2000-04-01 | 28.37 |  3 |
|   MSFT | 2000-05-01 | 25.45 |  4 |
</code></pre>
<p>There are many different datatypes currently used in the datatype system.
The primitive numeric types are:</p>
<ul>
<li><code>:boolean</code> - converts to and from 0 (false) or 1 (true) when used as a number.</li>
<li><code>:int8</code>, <code>:uint8</code> - signed/unsigned bytes.</li>
<li><code>:int16</code>, <code>:uint16</code> - signed/unsigned shorts.</li>
<li><code>:int32</code>, <code>:uint32</code> - signed/unsigned ints.</li>
<li><code>:int64</code> - signed longs (haven't figured out unsigned longs really yet).</li>
<li><code>:float32</code>, <code>:float64</code> - floats, doubles respectively.</li>
</ul>
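<p>To check which of these datatypes a given reader or container reports, <code>dtype/elemwise-datatype</code> can be called on it. The snippet below is a small sketch; the sample values are arbitrary.</p>
<pre><code class="language-clojure">(require '[tech.v3.datatype :as dtype])

;; A lazily computed reader reports the datatype it was declared with.
(dtype/elemwise-datatype (dtype/make-reader :int32 3 idx))        ;; => :int32

;; Java primitive arrays map onto the matching primitive datatype.
(dtype/elemwise-datatype (double-array 3))                        ;; => :float64

;; Unsigned types are available through typed containers.
(dtype/elemwise-datatype (dtype/make-container :uint8 [1 2 255])) ;; => :uint8

;; Generic Clojure collections fall back to :object.
(dtype/elemwise-datatype ["a" "b"])                               ;; => :object
</code></pre>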
<p>There are more types that can be represented by primitives (they 'alias' the primitive
type) but we will leave that for another article.</p>
<p>Outside of the primitive types (and types aliased to primitive types), we have an
unbounded set of object types. Any datatype the system doesn't understand is treated as
type <code>:object</code> during generic operations.</p>
<p>One very important aspect to note is that columns marked as <code>:object</code> datatypes will
use the Clojure numerics stack during mathematical operations. This is
important because the Clojure number tower, similar to the APL number tower,
actively promotes values to the next appropriate size and is thus less error prone
to use if you aren't absolutely certain of your value range and how it interacts with
your arithmetic pathways.</p>
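<p>The promotion behavior referred to here can be seen with plain Clojure arithmetic on boxed numbers. The forms below use only <code>clojure.core</code> and say nothing about how the dataset library dispatches internally; they are an illustration, not a description of its implementation.</p>
<pre><code class="language-clojure">;; Plain Clojure arithmetic on boxed numbers, independent of any dataset code.
(/ 1 3)                          ;; => 1/3, a Ratio rather than a truncated integer
(+ 1 0.5)                        ;; => 1.5, the long is promoted to a double
(+' Long/MAX_VALUE 1)            ;; => 9223372036854775808N, promoted to a BigInt

;; Fixed-width primitive arithmetic wraps silently instead.
(unchecked-add Long/MAX_VALUE 1) ;; => -9223372036854775808
</code></pre>
<p>Note that the default <code>+</code> throws an <code>ArithmeticException</code> on long overflow rather than wrapping, while <code>+'</code> is the variant that promotes to arbitrary precision.</p>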
<pre><code class="language-clojure">user> (require '[tech.v3.dataset :as ds])
nil
user> (def stocks (ds/->dataset "test/data/stocks.csv"))
#'user/stocks
user> (require '[tech.v3.datatype.functional :as dfn])
nil
user> (def stocks-lag
        (assoc stocks "price-lag"
               (let [price-data (dtype/->reader (stocks "price"))]
                 (dtype/make-reader :float64 (.lsize price-data)
                                    (.readDouble price-data
                                                 (max 0 (dec idx)))))))

#'user/stocks-lag
user> (ds/head (assoc stocks-lag "price-lag-diff" (dfn/- (stocks-lag "price")
                                                         (stocks-lag "price-lag"))))
test/data/stocks.csv [5 5]:

| symbol |       date | price | price-lag | price-lag-diff |
|--------+------------+-------+-----------+----------------|
|   MSFT | 2000-01-01 | 39.81 |     39.81 |          0.000 |
|   MSFT | 2000-02-01 | 36.35 |     39.81 |         -3.460 |
|   MSFT | 2000-03-01 | 43.22 |     36.35 |          6.870 |
|   MSFT | 2000-04-01 | 28.37 |     43.22 |         -14.85 |
|   MSFT | 2000-05-01 | 25.45 |     28.37 |         -2.920 |
</code></pre>
<p>All these operations are intrinsically lazy, so values are only calculated when
requested. This is usually fine, but in some cases it may be desirable to force
the calculation of a particular column completely (for instance, when
the calculation is particularly expensive). One way to force the column
efficiently is to clone it:</p>
<pre><code class="language-clojure">user> (ds/head (ds/update-column stocks-lag "price-lag" dtype/clone))
test/data/stocks.csv [5 4]:

| symbol |       date | price | price-lag |
|--------+------------+-------+-----------|
|   MSFT | 2000-01-01 | 39.81 |     39.81 |
|   MSFT | 2000-02-01 | 36.35 |     39.81 |
|   MSFT | 2000-03-01 | 43.22 |     36.35 |
|   MSFT | 2000-04-01 | 28.37 |     43.22 |
|   MSFT | 2000-05-01 | 25.45 |     28.37 |
</code></pre>
<p>If we now get the actual type of the column's data member, we can see that it is
a concrete type.</p>
<pre><code class="language-clojure">user> (-> (ds/update-column stocks-lag "price-lag" dtype/clone)
          (get "price-lag")
          (dtype/as-concrete-buffer))
#array-buffer&lt;float64&gt;[560]
[39.81, 39.81, 36.35, 43.22, 28.37, 25.45, 32.54, 28.40, 28.40, 24.53, 28.02, 23.34, 17.65, 24.84, 24.00, 22.25, 27.56, 28.14, 29.70, 26.93, ...]
</code></pre>
<p>This ability to lazily define a column via interface implementation and still
efficiently operate on that column separates the implementation of
the <code>tech.ml.dataset</code> library from other libraries in this field. This design is likely
to have an interesting and different set of advantages and disadvantages that will
present themselves over time. The dataset library is very loosely bound to the
underlying data representation, allowing it to represent data that is much larger
than can fit in memory and allowing dynamic column definitions to be defined at
program runtime as equations and extensions derived from other sources of data.</p>
</div></div></div></body></html>