<!DOCTYPE html PUBLIC ""
"">
<html><head><meta charset="UTF-8" /><title>tech.ml.dataset Columns, Readers, and Datatypes</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">8.003</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="000-getting-started.html"><div class="inner"><span>tech.ml.dataset Getting Started</span></div></a></li><li class="depth-1 "><a href="100-walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li><li class="depth-1 "><a href="200-quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 current"><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span 
class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span class="top"></span><span 
class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -641px;"><span class="top" style="height: 650px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.clj-transit.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clj-transit</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="document" id="content"><div class="doc"><div class="markdown"><h1>tech.ml.dataset Columns, Readers, and Datatypes</h1>
<p>In <code>tech.ml.dataset</code>, columns are composed of three things:
<a href="https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/impl/column.clj#L140">data, metadata, and the missing set</a>.
The column's datatype is the datatype of the <code>data</code> member. The data member can
be anything convertible to a <code>tech.v3.datatype</code> reader of the appropriate type.</p>
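<p>As a minimal sketch of those three parts, the snippet below builds a small hypothetical
dataset (the name <code>example-ds</code> and its contents are assumptions made for illustration)
and inspects one column's datatype, metadata, and missing set. The exact shape of the metadata map
may differ between library versions.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.column :as ds-col]
         '[tech.v3.datatype :as dtype])

;; Hypothetical three-row dataset; the nil marks a missing value.
(def example-ds (ds/-&gt;dataset {:a [1 nil 3]}))
(def col-a (example-ds :a))

;; Datatype of the column's data member.
(dtype/elemwise-datatype col-a)

;; Column metadata, which includes the column's :name.
(meta col-a)

;; Missing set: a bitmap of the row indexes that have no value.
(ds-col/missing col-a)
</code></pre>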
<p>Buffers are a <a href="https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/java/tech/v3/datatype/Buffer.java">simple abstraction</a> of typed random access read-only
memory that implements all the interfaces required to be both efficient and easy to use.
You can create a buffer by reifying the appropriately typed interface from
<code>tech.v3.datatype</code> but the datatype library has
<a href="https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/src/tech/v3/datatype.clj#L102">quick paths</a> to creating these:</p>
<pre><code class="language-clojure">user&gt; (require '[tech.v3.datatype :as dtype])
nil
user&gt; (dtype/make-reader :float32 5 idx)
[0.0 1.0 2.0 3.0 4.0]
user&gt; (dtype/make-reader :float32 5 (* 2 idx))
[0.0 2.0 4.0 6.0 8.0]
</code></pre>
<p>A read-only buffer needs only three methods - <code>elemwiseDatatype</code> (optional), <code>lsize</code>, and
<code>read[X]</code>. <code>read[X]</code> is typed to the datatype, so in the example above
<code>readFloat</code> returns a primitive float. <code>lsize</code> returns a long. Unlike the
similar method <code>get</code> in Java lists, the <code>read[X]</code> methods take a long index. This allows us
to use the read methods on storage mechanisms capable of addressing more than 2 billion (signed int)
or 4 billion (unsigned int) addresses.</p>
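<p>The methods described above can be called directly through Java interop. The following is
only a sketch: the reader name <code>squares</code> is hypothetical, and the calls are left
unhinted, so they resolve by reflection in the same way as the interop calls elsewhere on this page.</p>
<pre><code class="language-clojure">(require '[tech.v3.datatype :as dtype])

;; Hypothetical reader of five doubles: the square of each index.
(def squares (dtype/make-reader :float64 5 (* idx idx)))

;; lsize returns the element count as a long.
(.lsize squares)                   ;; =&gt; 5

;; readDouble takes a long index and returns a primitive double.
(.readDouble squares 3)            ;; =&gt; 9.0

;; The elementwise datatype is the one given to make-reader.
(dtype/elemwise-datatype squares)  ;; =&gt; :float64
</code></pre>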
<p>Another way to create a reader is to do a 'map' type translation from one or more other
readers. This is provided in two ways:</p>
<ul>
<li><a href="https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/src/tech/v3/datatype/emap.clj#L97"><code>dtype/emap</code></a> - Missing set ignorant mapping into a typed representation.</li>
<li><a href="https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/column.clj#L174"><code>tech.v3.dataset.column/column-map</code></a> - Missing set aware mapping into a typed representation.</li>
</ul>
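<p>A rough sketch of the difference between the two follows. The names <code>base</code>,
<code>doubled</code>, <code>example-ds</code>, and <code>incremented</code> are hypothetical,
and the snippet assumes that <code>column-map</code> by default carries the union of its
arguments' missing sets over to the result.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.column :as ds-col]
         '[tech.v3.datatype :as dtype])

;; emap: elementwise map over plain readers, with no notion of missing values.
(def base (dtype/make-reader :int64 4 idx))
(def doubled (dtype/emap (fn [^long x] (* 2 x)) :int64 base))
(vec doubled) ;; =&gt; [0 2 4 6]

;; column-map: the same idea over columns; the result is itself a column
;; that keeps track of which rows are missing.
(def example-ds (ds/-&gt;dataset {:a [1 nil 3]}))
(def incremented
  (ds-col/column-map (fn [^long x] (inc x)) :int64 (example-ds :a)))

;; Inspect the missing set carried by the mapped column.
(ds-col/missing incremented)
</code></pre>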
<p>The dataset system in general is smart enough to create columns out of readers in most
situations. So for instance, if you have a dataset and you want a column of a
particular type, you can call <code>add-or-update-column</code> and pass in a reader that implements what
you want:</p>
<pre><code class="language-clojure">user&gt; (require '[tech.v3.dataset :as ds])
nil
user&gt; (def stocks (ds/-&gt;dataset "test/data/stocks.csv"))
#'user/stocks
user&gt; (ds/head stocks)
test/data/stocks.csv [5 3]:
| symbol | date | price |
|--------+------------+-------|
| MSFT | 2000-01-01 | 39.81 |
| MSFT | 2000-02-01 | 36.35 |
| MSFT | 2000-03-01 | 43.22 |
| MSFT | 2000-04-01 | 28.37 |
| MSFT | 2000-05-01 | 25.45 |
user&gt; (ds/head (ds/add-or-update-column stocks "id"
                                           (dtype/make-reader :int64
                                                              (ds/row-count stocks)
                                                              idx)))
test/data/stocks.csv [5 4]:
| symbol | date | price | id |
|--------+------------+-------+----|
| MSFT | 2000-01-01 | 39.81 | 0 |
| MSFT | 2000-02-01 | 36.35 | 1 |
| MSFT | 2000-03-01 | 43.22 | 2 |
| MSFT | 2000-04-01 | 28.37 | 3 |
| MSFT | 2000-05-01 | 25.45 | 4 |
</code></pre>
<p>There are many different datatypes currently used in the datatype system -
the primitive numeric types:</p>
<ul>
<li><code>:boolean</code> - convert to and from 0 (false) or 1 (true) when used as a number.</li>
<li><code>:int8</code>,<code>:uint8</code> - signed/unsigned bytes.</li>
<li><code>:int16</code>,<code>:uint16</code> - signed/unsigned shorts.</li>
<li><code>:int32</code>,<code>:uint32</code> - signed/unsigned ints.</li>
<li><code>:int64</code> - signed longs (haven't figured out unsigned longs really yet).</li>
<li><code>:float32</code>, <code>:float64</code> - floats, doubles respectively.</li>
</ul>
<p>There are more types that can be represented by primitives (they 'alias' the primitive
type) but we will leave that for another article.</p>
<p>Outside of the primitive types (and types aliased to primitive types), there is an
unbounded set of object types. Any datatype the system doesn't understand is treated as
type <code>:object</code> during generic operations.</p>
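<p>To make the distinction concrete, the sketch below asks the datatype library for the
elementwise datatype of a few values. The container name <code>unsigned-bytes</code> is
hypothetical; a plain Clojure vector stands in for data the system has no specific type for.</p>
<pre><code class="language-clojure">(require '[tech.v3.datatype :as dtype])

;; A typed container of unsigned bytes; values above 127 survive intact.
(def unsigned-bytes (dtype/make-container :uint8 [0 128 255]))
(dtype/elemwise-datatype unsigned-bytes)     ;; =&gt; :uint8
(vec unsigned-bytes)                         ;; =&gt; [0 128 255]

;; Java primitive arrays map onto the primitive datatypes.
(dtype/elemwise-datatype (double-array 3))   ;; =&gt; :float64

;; Anything else, such as a mixed Clojure vector, is treated as :object.
(dtype/elemwise-datatype ["a" :b 1])         ;; =&gt; :object
</code></pre>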
<p>One very important aspect to note is that columns marked as <code>:object</code> datatypes will
use the Clojure numerics stack during mathematical operations. This is
important because the Clojure number tower, similar to the APL number tower,
actively promotes values to the next appropriate size and is thus less error prone
to use if you aren't absolutely certain of your value range and how it interacts with
your arithmetic pathways.</p>
<pre><code class="language-clojure">user&gt; (require '[tech.v3.dataset :as ds])
nil
user&gt; (def stocks (ds/-&gt;dataset "test/data/stocks.csv"))
#'user/stocks
user&gt; (require '[tech.v3.datatype.functional :as dfn])
nil
user&gt; (def stocks-lag
        (assoc stocks "price-lag"
               (let [price-data (dtype/-&gt;reader (stocks "price"))]
                 (dtype/make-reader :float64 (.lsize price-data)
                                    (.readDouble price-data
                                                 (max 0 (dec idx)))))))
#'user/stocks-lag
user&gt; (ds/head (assoc stocks-lag "price-lag-diff" (dfn/- (stocks-lag "price")
                                                            (stocks-lag "price-lag"))))
test/data/stocks.csv [5 5]:
| symbol | date | price | price-lag | price-lag-diff |
|--------+------------+-------+-----------+----------------|
| MSFT | 2000-01-01 | 39.81 | 39.81 | 0.000 |
| MSFT | 2000-02-01 | 36.35 | 39.81 | -3.460 |
| MSFT | 2000-03-01 | 43.22 | 36.35 | 6.870 |
| MSFT | 2000-04-01 | 28.37 | 43.22 | -14.85 |
| MSFT | 2000-05-01 | 25.45 | 28.37 | -2.920 |
</code></pre>
<p>All these operations are intrinsically lazy, so values are only calculated when
requested. This is usually fine, but in some cases it may be desirable to force
the calculation of a particular column completely (for example, when
the calculation is particularly expensive). One way to force the column
efficiently is to clone it:</p>
<pre><code class="language-clojure">user&gt; (ds/head (ds/update-column stocks-lag "price-lag" dtype/clone))
test/data/stocks.csv [5 4]:
| symbol | date | price | price-lag |
|--------+------------+-------+-----------|
| MSFT | 2000-01-01 | 39.81 | 39.81 |
| MSFT | 2000-02-01 | 36.35 | 39.81 |
| MSFT | 2000-03-01 | 43.22 | 36.35 |
| MSFT | 2000-04-01 | 28.37 | 43.22 |
| MSFT | 2000-05-01 | 25.45 | 28.37 |
</code></pre>
<p>If we now get the actual type of the column's data member, we can see that it is
a concrete type.</p>
<pre><code class="language-clojure">user&gt; (-&gt; (ds/update-column stocks-lag "price-lag" dtype/clone)
          (get "price-lag")
          (dtype/as-concrete-buffer))
#array-buffer&lt;float64&gt;[560]
[39.81, 39.81, 36.35, 43.22, 28.37, 25.45, 32.54, 28.40, 28.40, 24.53, 28.02, 23.34, 17.65, 24.84, 24.00, 22.25, 27.56, 28.14, 29.70, 26.93, ...]
</code></pre>
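<p>One way to observe the laziness, and the effect of cloning, is to count how often a
reader's body runs. This is an illustrative sketch only: the <code>calls</code> counter and the
reader names are hypothetical, and the exact number of evaluations per pass is deliberately not
assumed.</p>
<pre><code class="language-clojure">(require '[tech.v3.datatype :as dtype])

;; Counter incremented every time the reader body is evaluated.
(def calls (atom 0))

(def lazy-reader
  (dtype/make-reader :float64 3 (do (swap! calls inc) (* 2.0 idx))))

;; Defining the reader evaluates nothing.
@calls ;; =&gt; 0

;; Each pass over the lazy reader re-runs the body.
(vec lazy-reader)
(def calls-after-read @calls)

;; Cloning evaluates the body and stores the results in a concrete buffer.
(def realized (dtype/clone lazy-reader))
(def calls-after-clone @calls)

;; Reading the clone does not touch the original body again.
(vec realized)
(= calls-after-clone @calls) ;; =&gt; true
</code></pre>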
<p>This ability to lazily define a column via interface implementation and still
efficiently operate on that column separates the implementation of
the <code>tech.ml.dataset</code> library from other libraries in this field. This is likely
to have an interesting and different set of advantages and disadvantages that will
present themselves over time. The dataset library is only loosely bound to the
underlying data representation, which allows it to represent data that is much larger
than can fit in memory and allows dynamic column definitions to be defined at
program runtime as equations and extensions derived from other sources of data.</p>
</div></div></div></body></html>