<!DOCTYPE html PUBLIC ""
    "">
<html><head><meta charset="UTF-8" /><title>tech.v3.libs.arrow documentation</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">8.003</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="000-getting-started.html"><div class="inner"><span>tech.ml.dataset Getting Started</span></div></a></li><li class="depth-1 "><a href="100-walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li><li class="depth-1 "><a href="200-quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span 
class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span class="top"></span><span 
class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -641px;"><span class="top" style="height: 650px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch current"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.clj-transit.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clj-transit</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="sidebar secondary"><h3><a href="#top"><span class="inner">Public Vars</span></a></h3><ul><li class="depth-1"><a href="tech.v3.libs.arrow.html#var-col-.3Ebuffers"><div class="inner"><span>col-&gt;buffers</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.arrow.html#var-construct-column"><div class="inner"><span>construct-column</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.arrow.html#var-dataset-.3Estream.21"><div class="inner"><span>dataset-&gt;stream!</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.arrow.html#var-dataset-seq-.3Estream.21"><div class="inner"><span>dataset-seq-&gt;stream!</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.arrow.html#var-decimal-column-metadata"><div class="inner"><span>decimal-column-metadata</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.arrow.html#var-stream-.3Edataset"><div class="inner"><span>stream-&gt;dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.arrow.html#var-stream-.3Edataset-iterable"><div class="inner"><span>stream-&gt;dataset-iterable</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.arrow.html#var-validity-.3Eindexes"><div class="inner"><span>validity-&gt;indexes</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.arrow.html#var-validity-.3Emissing"><div class="inner"><span>validity-&gt;missing</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.arrow.html#var-validity-info"><div class="inner"><span>validity-info</span></div></a></li></ul></div><div class="namespace-docs" id="content"><h1 class="anchor" 
id="top">tech.v3.libs.arrow</h1><div class="doc"><div class="markdown"><p>Support for reading/writing apache arrow datasets.  Datasets may be memory mapped
but default to being read via an input stream.</p>
<p>Supported datatypes:</p>
<ul>
<li>All numeric types - <code>:uint8</code>, <code>:int8</code>, <code>:uint16</code>, <code>:int16</code>, <code>:uint32</code>, <code>:int32</code>,
<code>:uint64</code>, <code>:int64</code>, <code>:float32</code>, <code>:float64</code>, <code>:boolean</code>.</li>
<li>String types - <code>:string</code>, <code>:text</code>.  During write you have the option to always write
data as text, which can be more efficient in the memory-mapped read case as it doesn't
require the creation of string tables at load time.</li>
<li>Datetime Types - <code>:local-date</code>, <code>:local-time</code>, <code>:instant</code>.  During read you have the
option to keep these types in their source numeric format e.g. 32 bit <code>:epoch-days</code>
for <code>:local-date</code> datatypes.  This format can make some types of processing, such as
set creation, more efficient.</li>
</ul>
<p>When writing a dataset, an arrow file with a single record set is created.  When
writing a sequence of datasets, downstream schemas must be compatible with the schema
of the initial dataset; for instance, a conversion of int32 to double is fine but
double to int32 is not.</p>
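<p>A minimal sketch of the multi-dataset case, assuming two small in-memory datasets with
identical schemas.  The file name <code>two-batches.arrow</code>, the column names, and the
values are illustrative only:</p>
<pre><code class="language-clojure">  (require '[tech.v3.dataset :as ds]
           '[tech.v3.libs.arrow :as arrow])

  ;; Two hypothetical datasets with the same column names and datatypes.
  (def batch-1 (ds/-&gt;dataset {"id" [1 2 3] "score" [0.5 1.5 2.5]}))
  (def batch-2 (ds/-&gt;dataset {"id" [4 5 6] "score" [3.5 4.5 5.5]}))

  ;; One record set is written per dataset in the sequence.
  (arrow/dataset-seq-&gt;stream! "two-batches.arrow" [batch-1 batch-2])

  ;; Read the record sets back as a sequence of datasets and count rows in each.
  (mapv ds/row-count (arrow/stream-&gt;dataset-iterable "two-batches.arrow"))
</code></pre>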
<p>mmap support on systems running JDK-17 requires the foreign or memory module to be
loaded.  Appropriate JVM arguments can be found
<a href="https://github.com/techascent/tech.ml.dataset/blob/0524ddd5bbcb9421a0f11290ec8a01b7795dcff9/project.clj#L69">here</a>.</p>
<p>Example (with zstd compression):</p>
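<pre><code class="language-clojure">  (require '[tech.v3.libs.arrow :as arrow])
  ;; Writing - ds is an existing dataset, fname is the destination path
  (arrow/dataset-&gt;stream! ds fname {:compression :zstd})
  ;; Reading - the same path is passed back in
  (arrow/stream-&gt;dataset fname)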
</code></pre>
<h2>Required Dependencies</h2>
<p>In order to support both memory mapping and JDK-17, we only rely on the Arrow SDK's
flatbuffer and schema definitions:</p>
<pre><code class="language-clojure">  ;; netty isn't required and will inevitably conflict with some more recent version
  [org.apache.arrow/arrow-vector "6.0.0" :exclusions [netty/netty io.netty/netty-common]]
  [com.cnuernber/jarrow "1.000"]
  [org.apache.commons/commons-compress "1.21"]

  ;;Compression codecs
  [org.lz4/lz4-java "1.8.0"]
  ;;Required for decompressing lz4 streams with dependent blocks.
  [net.java.dev.jna/jna "5.10.0"]
  [com.github.luben/zstd-jni "1.5.4-1"]
</code></pre>
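<p>For projects that use <code>deps.edn</code> rather than Leiningen, the same coordinates could be
written roughly as follows.  This is a direct translation of the vector above, offered as a
sketch; it has not been checked against any particular project:</p>
<pre><code class="language-clojure">  ;; Entries for the :deps map of a deps.edn file, same versions as above
  {org.apache.arrow/arrow-vector {:mvn/version "6.0.0"
                                  :exclusions [netty/netty io.netty/netty-common]}
   com.cnuernber/jarrow {:mvn/version "1.000"}
   org.apache.commons/commons-compress {:mvn/version "1.21"}

   ;; Compression codecs
   org.lz4/lz4-java {:mvn/version "1.8.0"}
   ;; Required for decompressing lz4 streams with dependent blocks.
   net.java.dev.jna/jna {:mvn/version "5.10.0"}
   com.github.luben/zstd-jni {:mvn/version "1.5.4-1"}}
</code></pre>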
<p>The lz4 decompression system will fall back to lz4-java if liblz4 isn't installed or if
jna isn't loaded.  The lz4-java library will fail for arrow files that use dependent
block compression, which the python and R arrow implementations sometimes write.
On current ubuntu, install the lz4 library with:</p>
<pre><code class="language-console">  sudo apt install liblz4-1
</code></pre>
<h2>Performance</h2>
<p>Arrow has hands down the highest performance of any of the formats, although nippy comes very close when
any compression is used.  The highest performance pathway is to save out data with <code>:strings-as-text? true</code> and no
compression, then read it in using mmap - optionally with <code>:text-as-strings?</code> if you never want to see
tech.v3.datatype.Text objects in your dataset.  This avoids the creation of string dictionaries during
deserialization, as these have to be built greedily.  Saving strings as text can dramatically increase many dataset
sizes, but when mmap is used the overall size is irrelevant aside from iteration, which can be heavily parallelized.</p>
<p>Example:</p>
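<pre><code class="language-clojure">  (require '[tech.v3.libs.arrow :as arrow])
  ;; Writing - strings are saved as text, with no compression
  (arrow/dataset-&gt;stream! ds fname {:strings-as-text? true})
  ;; Reading - memory map the same file and convert Text back to strings
  (arrow/stream-&gt;dataset fname {:text-as-strings? true :open-type :mmap})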
</code></pre>
</div></div><div class="public anchor" id="var-col-.3Ebuffers"><h3>col-&gt;buffers</h3><div class="usage"><code>(col-&gt;buffers col col-idx options)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/arrow.clj#L1229">view source</a></div></div><div class="public anchor" id="var-construct-column"><h3>construct-column</h3><div class="usage"><code>(construct-column sparse? node field buffers col-data-fn)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/arrow.clj#L1554">view source</a></div></div><div class="public anchor" id="var-dataset-.3Estream.21"><h3>dataset-&gt;stream!</h3><div class="usage"><code>(dataset-&gt;stream! ds path options)</code><code>(dataset-&gt;stream! ds path)</code></div><div class="doc"><div class="markdown"><p>Write a dataset as an arrow file.  File will contain one record set.
See documentation for <a href="tech.v3.libs.arrow.html#var-dataset-seq-.3Estream.21">dataset-seq-&gt;stream!</a>.</p>
<ul>
<li><code>:strings-as-text?</code> defaults to false.</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/arrow.clj#L2451">view source</a></div></div><div class="public anchor" id="var-dataset-seq-.3Estream.21"><h3>dataset-seq-&gt;stream!</h3><div class="usage"><code>(dataset-seq-&gt;stream! path options ds-seq)</code><code>(dataset-seq-&gt;stream! path ds-seq)</code></div><div class="doc"><div class="markdown"><p>Write a sequence of datasets as an arrow stream file.  File will contain one record set
per dataset.  Datasets in the sequence must have matching schemas or downstream schema
must be able to be safely widened to the first schema.</p>
<p>Options:</p>
<ul>
<li>
<p><code>:strings-as-text?</code> - defaults to true - Save out strings into arrow files without
dictionaries.  This works well if you want to load an arrow file in-place or if
you know the strings in your dataset are either really large or should not be in
string tables.  <strong>Saving multiple datasets with <code>{:strings-as-text? false}</code> requires arrow
7.0.0+ support from your python or R code due to
<a href="https://issues.apache.org/jira/browse/ARROW-13467">Arrow issue 13467</a>.  The conservative
pathway for now is to set <code>:strings-as-text?</code> to true and only save text.</strong></p>
</li>
<li>
<p><code>:format</code> - one of <code>[:file :ipc]</code>,  defaults to <code>:file</code>.</p>
<ul>
<li><code>:file</code> - arrow file format, compatible with pyarrow's <a href="https://arrow.apache.org/docs/python/generated/pyarrow.ipc.open_file.html#pyarrow.ipc.open_file">open_file</a>.  The suggested
suffix is <code>.arrow</code>.</li>
<li><code>:ipc</code> - arrow streaming format, compatible with pyarrow's <a href="https://arrow.apache.org/docs/python/generated/pyarrow.ipc.open_file.html#pyarrow.ipc.open_ipc">open_ipc</a> pathway.  The
suggested suffix is <code>.arrows</code>.</li>
</ul>
</li>
<li>
<p><code>:compression</code> - Either <code>:zstd</code> or <code>:lz4</code>,  defaults to no compression (nil).
Per-column compression of the data can result in some significant size savings
(2x+) and thus some significant time savings when loading over the network.
Using compression makes loading via mmap non-lazy, so if you are going to use
compression, mmap probably doesn't make sense and will most likely result in
slower loading times.</p>
<ul>
<li><code>:lz4</code> - Decent and very fast compression.</li>
<li><code>:zstd</code> - Good compression, somewhat slower than <code>:lz4</code>.  Can also have a
level parameter that ranges from 1-12 in which case compression is specified
in map form: <code>{:compression-type :zstd :level 5}</code>.</li>
</ul>
</li>
</ul>
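<p>The options above might be combined as in the following sketch.  Here <code>ds-seq</code> is
assumed to be an existing sequence of schema-compatible datasets, and the file names are
placeholders:</p>
<pre><code class="language-clojure">  (require '[tech.v3.libs.arrow :as arrow])

  ;; Streaming (ipc) format, zstd compression with an explicit level in map form.
  (arrow/dataset-seq-&gt;stream! "data.arrows"
                              {:format :ipc
                               :compression {:compression-type :zstd :level 5}}
                              ds-seq)

  ;; File format (the default) with lz4 compression given as a bare keyword.
  (arrow/dataset-seq-&gt;stream! "data.arrow"
                              {:format :file
                               :compression :lz4}
                              ds-seq)
</code></pre>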
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/arrow.clj#L2366">view source</a></div></div><div class="public anchor" id="var-decimal-column-metadata"><h3>decimal-column-metadata</h3><div class="usage"><code>(decimal-column-metadata col)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/arrow.clj#L2170">view source</a></div></div><div class="public anchor" id="var-stream-.3Edataset"><h3>stream-&gt;dataset</h3><div class="usage"><code>(stream-&gt;dataset fname options)</code><code>(stream-&gt;dataset fname)</code></div><div class="doc"><div class="markdown"><p>Reads data non-lazily in arrow streaming format expecting to find a single dataset.</p>
<p>Options:</p>
<ul>
<li>
<p><code>:open-type</code> - Either <code>:mmap</code> or <code>:input-stream</code> defaulting to the slower but more robust
<code>:input-stream</code> pathway.  When using <code>:mmap</code> resources will be released when the resource
system dictates - see documentation for <a href="https://techascent.github.io/tech.resource/tech.v3.resource.html">tech.v3.resource</a>.
When using <code>:input-stream</code> the stream will be closed when the lazy sequence is either fully realized or an
exception is thrown.  Memory mapping is not supported on M1 Macs unless you are using JDK-17.</p>
</li>
<li>
<p><code>:close-input-stream?</code> - When using <code>:input-stream</code> <code>:open-type</code>, close the input stream upon
exception or when the stream is fully realized.  Defaults to true.</p>
</li>
<li>
<p><code>:integer-datetime-types?</code> - when true arrow columns in the appropriate packed
datatypes will be represented as their integer types as opposed to their respective
packed types.  For example columns of type <code>:epoch-days</code> will be returned to the user
as datatype <code>:epoch-days</code> as opposed to <code>:packed-local-date</code>.  This means reading values
will return integers as opposed to <code>java.time.LocalDate</code>s.</p>
</li>
</ul>
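<p>As a rough illustration of <code>:integer-datetime-types?</code>, the sketch below writes a small
dataset with a date column and reads it back both ways.  The file name <code>dates.arrow</code> and
the column name <code>"day"</code> are made up for the example, and the resulting datatypes are
inspected rather than asserted:</p>
<pre><code class="language-clojure">  (require '[tech.v3.dataset :as ds]
           '[tech.v3.libs.arrow :as arrow])
  (import '[java.time LocalDate])

  ;; A hypothetical dataset with a single local-date column.
  (def dates (ds/-&gt;dataset {"day" [(LocalDate/of 2024 1 1) (LocalDate/of 2024 1 2)]}))
  (arrow/dataset-&gt;stream! dates "dates.arrow")

  ;; Default read: date columns use their packed datetime types.
  (def as-dates (arrow/stream-&gt;dataset "dates.arrow"))
  ;; Integer read: date columns keep their source integer representation.
  (def as-ints (arrow/stream-&gt;dataset "dates.arrow" {:integer-datetime-types? true}))

  ;; Compare the column datatype reported by each read.
  [(:datatype (meta (ds/column as-dates "day")))
   (:datatype (meta (ds/column as-ints "day")))]
</code></pre>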
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/arrow.clj#L2133">view source</a></div></div><div class="public anchor" id="var-stream-.3Edataset-iterable"><h3>stream-&gt;dataset-iterable</h3><div class="usage"><code>(stream-&gt;dataset-iterable fname &amp; [options])</code></div><div class="doc"><div class="markdown"><p>Loads data up to and including the first data record.  Returns a lazy
sequence of datasets.  Datasets can be loaded using mmapped data and when that is true
realizing the entire sequence is usually safe, even for datasets that are larger than
available RAM.
The default resource management pathway for this is <code>:auto</code>, but you can override this
by explicitly setting the option <code>:resource-type</code>.  See documentation for
<code>tech.v3.datatype.mmap/mmap-file</code>.</p>
<p>Options:</p>
<ul>
<li>
<p><code>:open-type</code> - Either <code>:mmap</code> or <code>:input-stream</code> defaulting to the slower but more robust
<code>:input-stream</code> pathway.  When using <code>:mmap</code> resources will be released when the resource
system dictates - see documentation for <a href="https://techascent.github.io/tech.resource/tech.v3.resource.html">tech.v3.resource</a>.
When using <code>:input-stream</code> the stream will be closed when the lazy sequence is either
fully realized or an exception is thrown.</p>
</li>
<li>
<p><code>:close-input-stream?</code> - When using <code>:input-stream</code> <code>:open-type</code>, close the input stream upon
exception or when the stream is fully realized.  Defaults to true.</p>
</li>
<li>
<p><code>:integer-datetime-types?</code> - when true arrow columns in the appropriate packed
datatypes will be represented as their integer types as opposed to their respective
packed types.  For example columns of type <code>:epoch-days</code> will be returned to the user
as datatype <code>:epoch-days</code> as opposed to <code>:packed-local-date</code>.  This means reading values
will return integers as opposed to <code>java.time.LocalDate</code>s.</p>
</li>
<li>
<p><code>:text-as-strings?</code> - Return strings instead of Text objects.  This breaks automatic round-tripping
as it changes datatypes <em>but</em> can be useful when used with <code>:strings-as-text?</code> when writing data out.
When used like this uncompressed mmap pathways typically have the highest performance - roughly 100x
any other method.</p>
</li>
</ul>
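<p>A short sketch of iterating record sets through the mmap pathway, assuming a multi-record
arrow file named <code>batches.arrow</code> (a placeholder) that was written without compression.
Each element of the returned sequence is a dataset, so ordinary sequence functions apply:</p>
<pre><code class="language-clojure">  (require '[tech.v3.dataset :as ds]
           '[tech.v3.libs.arrow :as arrow])

  ;; Memory map the file and convert Text values to strings on read.
  (def batches
    (arrow/stream-&gt;dataset-iterable "batches.arrow"
                                    {:open-type :mmap
                                     :text-as-strings? true}))

  ;; Total row count across all record sets, one dataset at a time.
  (reduce + 0 (map ds/row-count batches))
</code></pre>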
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/arrow.clj#L2082">view source</a></div></div><div class="public anchor" id="var-validity-.3Eindexes"><h3>validity-&gt;indexes</h3><div class="usage"><code>(validity-&gt;indexes validity n-elems)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/arrow.clj#L1511">view source</a></div></div><div class="public anchor" id="var-validity-.3Emissing"><h3>validity-&gt;missing</h3><div class="usage"><code>(validity-&gt;missing validity n-elems)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/arrow.clj#L1504">view source</a></div></div><div class="public anchor" id="var-validity-info"><h3>validity-info</h3><div class="usage"><code>(validity-info col all-valid-buf)</code></div><div class="doc"><div class="markdown"><p>Returns <code>[byte-array n-missing]</code>.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/arrow.clj#L1206">view source</a></div></div></div></body></html>