<!DOCTYPE html PUBLIC ""
"">
<html><head><meta charset="UTF-8" /><title>tech.ml.dataset Quick Reference</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">7.000-beta-23</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 current"><a href="quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li><li class="depth-1 "><a href="walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div 
class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a 
href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.neanderthal.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>neanderthal</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 
61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -672px;"><span class="top" style="height: 681px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span 
class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>smile</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.smile.data.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>data</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="document" id="content"><div class="doc"><div class="markdown"><h1>tech.ml.dataset Quick Reference</h1>
<p>Functions are linked to their source, but if no namespace is specified they are
also accessible via the <code>tech.v3.dataset</code> namespace.</p>
<p>This is not an exhaustive listing of all functionality; it is just a quick way to find
the functions that we find most useful.</p>
<h2>Loading/Saving</h2>
<ul>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var--.3Edataset">-&gt;dataset, -&gt;&gt;dataset</a> - loads csv, tsv,
sequence-of-maps, map-of-arrays, xlsx, xls, and if their respective namespaces and dependencies are loaded, parquet and arrow.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-write.21">write!</a> - Writes csv, tsv or
nippy with gzipping. Depends on scanning file path string to determine options.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-dataset-.3Edata">dataset-&gt;data</a> - Useful if you want the entire
dataset represented as (mostly) pure Clojure/JVM datastructures. Missing sets are
roaring bitmaps, data is probably in primitive arrays. String tables receive special
treatment.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-data-.3Edataset">data-&gt;dataset</a> - Inverse of dataset-&gt;data.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.io.univocity.html#var-csv-.3Erows">tech.v3.dataset.io.univocity/csv-&gt;rows</a> - Lazily parse a
csv or tsv returning a sequence of string[] rows. This uses a subset of the -&gt;dataset options.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.io.string-row-parser.html#var-rows-.3Edataset">tech.v3.dataset.io.string-row-parser/rows-&gt;dataset</a> - Given
a sequence of string[] rows, parse data into a dataset. Uses a subset of the -&gt;dataset
options.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.libs.parquet.html">parquet support</a></li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.libs.arrow.html">arrow support</a></li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.libs.poi.html">xlsx, xls support</a></li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.libs.fastexcel.html">fast xlsx support</a></li>
</ul>
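<p>As a minimal sketch of the loading and saving functions above: the data, the <code>people</code> var and the file name <code>people.csv.gz</code> are made up for illustration, and the snippet assumes the base library is on the classpath.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Build a small dataset from a sequence of maps.
(def people
  (ds/-&gt;dataset [{:name "Ada" :age 36}
                 {:name "Alan" :age 41}]))

;; The file path is scanned to pick the format; here csv with gzip.
(ds/write! people "people.csv.gz")

;; Column names read from a csv are strings unless a :key-fn is supplied.
(def reloaded (ds/-&gt;dataset "people.csv.gz" {:key-fn keyword}))
</code></pre>
<p>Passing <code>:key-fn keyword</code> on the way back in keeps the column names consistent with the in-memory dataset.</p>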
<h2>Accessing Values</h2>
<ul>
<li>Datasets overload <code>IFn</code> so are functions of their column names. <code>(ds :colname)</code> will
return the column named <code>:colname</code>. Datasets implement <code>IPersistentMap</code> so
<code>(map (comp meta second) ds)</code> or <code>(map meta (vals ds))</code> will return a sequence of column
metadata. <code>keys</code>, <code>vals</code>, <code>contains?</code>, <code>assoc</code>, <code>dissoc</code>, <code>merge</code> and map-style destructuring
all work on datasets. Note that <code>update</code> does not work as update will always return a
persistent map.</li>
<li>Columns are iterable and implement <code>Indexed</code> so you can use them with <code>map</code>, <code>count</code>
and <code>nth</code>. They furthermore overload IFn such that they are functions of their indexes similar
to persistent vectors. Using negative values as indexes will index from the end similar to numpy and pandas.</li>
<li>Typed random access is supported via the <code>(tech.v3.datatype/-&gt;reader col)</code>
transformation. This is guaranteed to return an implementation of <code>java.util.List</code>
and also overloads <code>IFn</code> such that like a persistent vector passing in the index
will return the value - e.g. <code>(col 0)</code> returns the value at index 0. Direct access
to packed datetime columns may be surprising; call <code>tech.v3.datatype.datetime/unpack</code>
on the column prior to calling <code>tech.v3.datatype/-&gt;reader</code> to get to the unpacked
datatype. Using negative values as indexes will index from the end similar to numpy and pandas.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-count">row-count</a> - works on datasets and columns.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-column-count">column-count</a> - number of columns.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rows">rows</a> - get the rows of the
dataset as a <code>java.util.List</code> of persistent-map-like maps. Implemented as a flyweight
implementation of <code>clojure.lang.APersistentMap</code> where data is read out of the underlying dataset on demand. This keeps the
data in the backing store and lazily reads it so you will have relatively more expensive reading of the
data but will not increase your memory working-set size. Using negative values as indexes will index from the end similar to numpy and pandas.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rowvecs">rowvecs</a> - Get the rows of the
dataset as a <code>java.util.List</code> of rows. These rows behave like persistent vectors and are safe to use
in maps. If you are going to use row values as keys in maps, passing in <code>{:copying? true}</code> will be more
efficient as then each hash and equals comparison is using data in the vector and not re-reading the data
out of the source dataset. Using negative values as indexes will index from the end similar to numpy and pandas.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-missing">missing</a> - return a RoaringBitmap of missing indexes. If this is a column, it is the column's specific missing indexes. If this is a dataset, return a union of all the columns' missing indexes.</li>
<li><a href="https://github.com/clojure/clojure/blob/master/src/clj/clojure/core.clj#L202">meta, with-meta, vary-meta</a> - Datasets and columns implement
<code>clojure.lang.IObj</code> so you can get/set metadata on them freely. <code>:name</code> has meaning in the system and setting it
directly on a column is not recommended. Metadata is generally carried forward through most of the operations below.</li>
</ul>
<p><code>rows</code> and <code>rowvecs</code> are lazy and thus <code>(rand-nth (ds/rows ds))</code> is
a relatively efficient pathway (and fun). <code>(ds/rows (ds/sample ds))</code> is also
pretty good for quick scans.</p>
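<p>The access patterns above can be sketched as follows. The <code>stocks</code> dataset and its values are hypothetical, and the comments describe what the text above says each call does rather than exact printed output.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def stocks
  (ds/-&gt;dataset [{:symbol "AAA" :price 10.0}
                 {:symbol "BBB" :price nil}
                 {:symbol "CCC" :price 30.0}]))

(def prices (stocks :price))   ; a dataset is a function of its column names
(prices 0)                     ; a column is a function of its indexes
(prices -1)                    ; negative indexes count from the end
(ds/row-count stocks)          ; number of rows
(keys stocks)                  ; map-style access to the column names
(first (ds/rows stocks))       ; first row as a map-like view
(ds/missing stocks)            ; RoaringBitmap of rows with a missing value
</code></pre>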
<h2>Print Options</h2>
<p>We use these options frequently during exploration to get more/less printing
output. These are used like <code>(vary-meta ds assoc :print-column-max-width 10)</code>.
Often it is useful to print the entire table: <code>(vary-meta ds assoc :print-index-range :all)</code></p>
<ul>
<li><a href="https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/print.clj#L93">print metadata options</a></li>
</ul>
<h2>Dataset Exploration</h2>
<ul>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-head">head</a> - Return dataset consisting of first N rows.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-tail">tail</a> - Return dataset consisting of last N rows.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-sample">sample</a> - Randomly sample N rows of the dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rand-nth">rand-nth</a> - Randomly sample a row of the dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-descriptive-stats">descriptive-stats</a> - return a dataset of
columnwise descriptive statistics.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-brief">brief</a> - Return a sequence of maps of descriptive statistics.</li>
</ul>
<h2>Subrect Selection</h2>
<p>Keeping in mind that assoc, dissoc, and merge all work at the dataset level - here are some other
pathways that are useful for subrect selection.</p>
<ul>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-select">select</a> - can be used for renaming.
Anything iterable can be used for the rows.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-select-columns">select-columns</a> - Select a subset of columns.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-select-rows">select-rows</a> - works on datasets and columns.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-drop-rows">drop-rows</a> - works on datasets and columns.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-drop-missing">drop-missing</a> - Drop rows with missing values from the dataset.</li>
</ul>
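<p>A short sketch of the selection functions listed above, using a made-up <code>trades</code> dataset with one missing value:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def trades
  (ds/-&gt;dataset [{:id 1 :qty 5   :note "a"}
                 {:id 2 :qty nil :note "b"}
                 {:id 3 :qty 7   :note "c"}]))

(def two-cols  (ds/select-columns trades [:id :qty])) ; keep two columns
(def two-rows  (ds/select-rows trades [0 2]))         ; keep rows 0 and 2
(def no-first  (ds/drop-rows trades [0]))             ; remove row 0
(def complete  (ds/drop-missing trades))              ; remove rows with a missing value
(def corner    (ds/select trades [:id :note] [2 0]))  ; columns and rows in one call
</code></pre>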
<h2>Dataset Manipulation</h2>
<p>Several of the functions below come in <code>-&gt;column</code> variants and some additionally
come in <code>-&gt;indexes</code> variants. The <code>-&gt;column</code> variants are going to be faster than the base
versions, and the <code>-&gt;indexes</code> variants simply return indexes and thus skip creating sub-datasets,
so these are faster yet.</p>
<ul>
<li><a href="https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/impl/dataset.clj#L380">new-dataset</a> - Create a new dataset from a sequence of columns. Columns may be actual columns created via <code>tech.v3.dataset.column/new-column</code> or they could be maps containing at least keys <code>#:tech.v3.dataset{:name :data}</code> but also potentially <code>#:tech.v3.dataset{:metadata :missing}</code> in order to create a column with a specific set of missing values and metadata. <code>:force-datatype true</code> will prevent the system
from attempting to scan the data for missing values and e.g. create a float column
from a vector of Float objects. The above also applies to using <code>clojure.core/assoc</code> with a dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-replace-missing">replace-missing</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-replace-missing-value">replace-missing-value</a> - replace missing values in one or more columns.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-group-by-column">group-by-column</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-group-by">group-by</a> - Create a persistent map of value-&gt;dataset. Sub-datasets are created via indexing into the original dataset so data is not copied.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-sort-by-column">sort-by-column</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-sort-by">sort-by</a> - Return a sorted dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-filter-column">filter-column</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-filter">filter</a> - Return a new dataset with only rows that pass the predicate.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-concat-copying">concat-copying</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-concat-inplace">concat-inplace</a> - Given Y datasets produce a new dataset. Copying is generally much more efficient than in-place for a dataset count &gt; 2. <code>(apply ds/concat-copying x-seq)</code> is
<strong>far</strong> more efficient than <code>(reduce ds/concat-copying x-seq)</code>; this also is true for <code>concat-inplace</code>.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-unique-by-column">unique-by-column</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-unique-by">unique-by</a> - Remove duplicate rows. Passing in <code>keep-fn</code> allows
you to choose either first, last, or some other criteria for rows that have the same
values. For <code>unique-by</code>, <code>identity</code> will work just fine.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-map">row-map</a> - In parallel, map a function from map-&gt;map over the dataset. The returned maps will be
used to create new columns in a new dataset and the result merged with the original. Note there are options to return a sequence of datasets as opposed to a single large
final dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-mapcat">row-mapcat</a> - In parallel, map a function from map-&gt;sequence-of-maps over the dataset potentially
expanding or shrinking the result. When multiple maps are returned, row information not included in the original map is efficiently duplicated. Note there are options to
return a sequence of datasets as opposed to a single potentially very large final dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-pmap-ds">pmap-ds</a> - Split dataset into batches and in parallel map a function from ds-&gt;ds across the dataset.
Can return either a new dataset via concat-copying or a sequence of datasets.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.join.html#var-pd-merge">pd-merge</a> - Generalized left, right, inner, outer, and cross joins.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.join.html#var-left-join-asof">left-join-asof</a> - Join-nearest type functionality useful for doing things like finding
the 3, 6, and 12 month prices from a dataset of daily prices where you want the nearest price, as things don't trade on the weekends.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.rolling.html#var-rolling">rolling</a> - fixed and variable rolling window operations.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html#var-group-by-column-agg">group-by-column-agg</a> - Very high performance primitive taking a sequence of
datasets and producing a new dataset that is first grouped by one or more columns and then the per-group data is reduced using a map of reducers. Each key in the map becomes
a column in the result dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.neanderthal.html">neanderthal support</a> - transformations of datasets to/from neanderthal dense native matrixes.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.tensor.html">tensor support</a> - transformations of datasets to/from <a href="https://cnuernber.github.io/dtype-next/tech.v3.tensor.html">tech.v3.tensor</a> objects.</li>
</ul>
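<p>A few of the manipulation functions above, sketched on a hypothetical <code>sales</code> dataset. The var names are illustrative only.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def sales
  (ds/-&gt;dataset [{:region "east" :amount 10}
                 {:region "west" :amount 25}
                 {:region "east" :amount 40}]))

;; Keep rows whose :amount passes the predicate.
(def large (ds/filter-column sales :amount #(&gt; % 20)))

;; Sort rows by the values of a single column.
(def sorted (ds/sort-by-column sales :amount))

;; Map of column value to sub-dataset.
(def by-region (ds/group-by-column sales :region))

;; Concatenate many datasets with one apply call rather than a reduce.
(def combined (apply ds/concat-copying (vals by-region)))
</code></pre>
<p>Using <code>apply</code> here follows the advice in the <code>concat-copying</code> entry: the function sees all datasets at once instead of building intermediate results pairwise.</p>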
<h2>Elementwise Arithmetic</h2>
<p>Functions in <code>tech.v3.datatype.functional</code> apply various elementwise
arithmetic operations to a column, lazily returning a new column. It is highly recommended to
remove all missing values before using elementwise arithmetic as the <code>functional</code> namespace
has no knowledge of missing values. Integer columns with missing values will be upcast
to float or double columns in order to support a missing value indicator.</p>
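<pre><code class="language-clojure">;; Fragment: assumes dtype, dfn and ds alias tech.v3.datatype,
;; tech.v3.datatype.functional and tech.v3.dataset, that the local ds is a dataset,
;; and that filing, investor and form-type are bound in the enclosing scope.
(assoc ds
       ;; cast the numeric columns to 64-bit integers
       :value (dtype/elemwise-cast (ds :value) :int64)
       :shrs-or-prn-amt (dtype/elemwise-cast (ds :shrs-or-prn-amt) :int64)
       ;; constant columns, one value repeated for every row
       :cik (dtype/const-reader (:cik filing) (ds/row-count ds))
       :investor (dtype/const-reader investor (ds/row-count ds))
       :form-type (dtype/const-reader form-type (ds/row-count ds))
       :edgar-id (dtype/const-reader (:edgar-id filing) (ds/row-count ds))
       ;; each value as a fraction of the column total
       :weight (dfn// (ds :value)
                      (double (dfn/reduce-+ (ds :value)))))
</code></pre>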
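<p>For a version that can be pasted into a REPL, here is a self-contained sketch of the same weight calculation. The <code>holdings</code> dataset and its values are invented for illustration.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.functional :as dfn])

(def holdings (ds/-&gt;dataset {:value [100 300 600]}))

;; Divide each value by the column total; dfn// works elementwise on the column.
(def weighted
  (assoc holdings
         :weight (dfn// (holdings :value)
                        (double (dfn/reduce-+ (holdings :value))))))
</code></pre>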
<h2>Forcing Lazy Evaluation</h2>
<p>The dataset system relies on index indirection and laziness quite often. This allows
you to aggregate operations and pay relatively little for them; however, it sometimes
increases the cost of accessing the data by an undesirable amount. Because
of this we use <code>clone</code> quite often to force calculations to complete before
beginning a new stage of data processing. Clone is multithreaded and very efficient,
often boiling down to either parallelized iteration over the data or
<code>System/arraycopy</code> calls.</p>
<p>Additionally, calling <code>clone</code> after loading will reduce the in-memory size of the
dataset by a bit - sometimes 20%. This is because lists that have allocated extra
capacity are copied into arrays that have no extra capacity.</p>
<ul>
<li><a href="https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/src/tech/v3/datatype.clj#L95">tech.v3.datatype/clone</a> - Clones the dataset, realizing lazy operations and copying the data into
java arrays. Will clone datasets or columns.</li>
</ul>
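<p>A minimal sketch of using <code>clone</code> between processing stages. The <code>raw</code> dataset is made up, and the comment about index-based views restates the general behaviour described above rather than a guarantee for every function.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype :as dtype])

(def raw (ds/-&gt;dataset {:x (range 1000)}))

;; Filtering and sorting typically produce index-based views over raw.
(def view (-&gt; raw
              (ds/filter-column :x even?)
              (ds/sort-by-column :x)))

;; clone copies the view into freshly allocated arrays before the next stage.
(def realized (dtype/clone view))
</code></pre>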
</div></div></div></body></html>