<!DOCTYPE html PUBLIC ""
"">
<html><head><meta charset="UTF-8" /><title>tech.v3.dataset documentation</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">8.003</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="000-getting-started.html"><div class="inner"><span>tech.ml.dataset Getting Started</span></div></a></li><li class="depth-1 "><a href="100-walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li><li class="depth-1 "><a href="200-quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span 
class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3 current"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span 
class="top"></span><span class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li 
class="depth-4 branch"><a href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -641px;"><span class="top" style="height: 650px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.clj-transit.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clj-transit</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li 
class="depth-4 branch"><a href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="sidebar secondary"><h3><a href="#top"><span class="inner">Public Vars</span></a></h3><ul><li class="depth-1"><a href="tech.v3.dataset.html#var--.3E.3Edataset"><div class="inner"><span>-&gt;&gt;dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var--.3Edataset"><div class="inner"><span>-&gt;dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-add-column"><div class="inner"><span>add-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-add-or-update-column"><div class="inner"><span>add-or-update-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-all-descriptive-stats-names"><div class="inner"><span>all-descriptive-stats-names</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-append-columns"><div class="inner"><span>append-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-assoc-ds"><div class="inner"><span>assoc-ds</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-assoc-metadata"><div class="inner"><span>assoc-metadata</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-bind-.3E"><div class="inner"><span>bind-&gt;</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-brief"><div class="inner"><span>brief</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-categorical-.3Enumber"><div class="inner"><span>categorical-&gt;number</span></div></a></li><li class="depth-1"><a 
href="tech.v3.dataset.html#var-categorical-.3Eone-hot"><div class="inner"><span>categorical-&gt;one-hot</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-column"><div class="inner"><span>column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-column-.3Edataset"><div class="inner"><span>column-&gt;dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-column-cast"><div class="inner"><span>column-cast</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-column-count"><div class="inner"><span>column-count</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-column-labeled-mapseq"><div class="inner"><span>column-labeled-mapseq</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-column-map"><div class="inner"><span>column-map</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-column-map-m"><div class="inner"><span>column-map-m</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-column-names"><div class="inner"><span>column-names</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-columns"><div class="inner"><span>columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-columns-with-missing-seq"><div class="inner"><span>columns-with-missing-seq</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-columnwise-concat"><div class="inner"><span>columnwise-concat</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-concat"><div class="inner"><span>concat</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-concat-copying"><div class="inner"><span>concat-copying</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-concat-inplace"><div class="inner"><span>concat-inplace</span></div></a></li><li class="depth-1"><a 
href="tech.v3.dataset.html#var-data-.3Edataset"><div class="inner"><span>data-&gt;dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-dataset-.3Edata"><div class="inner"><span>dataset-&gt;data</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-dataset-name"><div class="inner"><span>dataset-name</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-dataset-parser"><div class="inner"><span>dataset-parser</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-dataset.3F"><div class="inner"><span>dataset?</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-descriptive-stats"><div class="inner"><span>descriptive-stats</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-drop-columns"><div class="inner"><span>drop-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-drop-missing"><div class="inner"><span>drop-missing</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-drop-rows"><div class="inner"><span>drop-rows</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-empty-column-names"><div class="inner"><span>empty-column-names</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-empty-dataset"><div class="inner"><span>empty-dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-ensure-array-backed"><div class="inner"><span>ensure-array-backed</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-filter"><div class="inner"><span>filter</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-filter-column"><div class="inner"><span>filter-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-filter-dataset"><div class="inner"><span>filter-dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-group-by"><div 
class="inner"><span>group-by</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-group-by-.3Eindexes"><div class="inner"><span>group-by-&gt;indexes</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-group-by-column"><div class="inner"><span>group-by-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-group-by-column-.3Eindexes"><div class="inner"><span>group-by-column-&gt;indexes</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-group-by-column-consumer"><div class="inner"><span>group-by-column-consumer</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-has-column.3F"><div class="inner"><span>has-column?</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-head"><div class="inner"><span>head</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-induction"><div class="inner"><span>induction</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-major-version"><div class="inner"><span>major-version</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-mapseq-parser"><div class="inner"><span>mapseq-parser</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-mapseq-reader"><div class="inner"><span>mapseq-reader</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-mapseq-rf"><div class="inner"><span>mapseq-rf</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-min-n-by-column"><div class="inner"><span>min-n-by-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-missing"><div class="inner"><span>missing</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-new-column"><div class="inner"><span>new-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-new-dataset"><div 
class="inner"><span>new-dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-order-column-names"><div class="inner"><span>order-column-names</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-pmap-ds"><div class="inner"><span>pmap-ds</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-print-all"><div class="inner"><span>print-all</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-rand-nth"><div class="inner"><span>rand-nth</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-remove-column"><div class="inner"><span>remove-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-remove-columns"><div class="inner"><span>remove-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-remove-empty-columns"><div class="inner"><span>remove-empty-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-remove-rows"><div class="inner"><span>remove-rows</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-rename-columns"><div class="inner"><span>rename-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-replace-missing"><div class="inner"><span>replace-missing</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-replace-missing-value"><div class="inner"><span>replace-missing-value</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-reverse-rows"><div class="inner"><span>reverse-rows</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-row-at"><div class="inner"><span>row-at</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-row-count"><div class="inner"><span>row-count</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-row-map"><div class="inner"><span>row-map</span></div></a></li><li class="depth-1"><a 
href="tech.v3.dataset.html#var-row-mapcat"><div class="inner"><span>row-mapcat</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-rows"><div class="inner"><span>rows</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-rowvec-at"><div class="inner"><span>rowvec-at</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-rowvecs"><div class="inner"><span>rowvecs</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-sample"><div class="inner"><span>sample</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-select"><div class="inner"><span>select</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-select-by-index"><div class="inner"><span>select-by-index</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-select-columns"><div class="inner"><span>select-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-select-columns-by-index"><div class="inner"><span>select-columns-by-index</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-select-missing"><div class="inner"><span>select-missing</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-select-rows"><div class="inner"><span>select-rows</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-set-dataset-name"><div class="inner"><span>set-dataset-name</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-shape"><div class="inner"><span>shape</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-shuffle"><div class="inner"><span>shuffle</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-sort-by"><div class="inner"><span>sort-by</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-sort-by-column"><div class="inner"><span>sort-by-column</span></div></a></li><li class="depth-1"><a 
href="tech.v3.dataset.html#var-tail"><div class="inner"><span>tail</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-take-nth"><div class="inner"><span>take-nth</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-unique-by"><div class="inner"><span>unique-by</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-unique-by-column"><div class="inner"><span>unique-by-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-unordered-select"><div class="inner"><span>unordered-select</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-unroll-column"><div class="inner"><span>unroll-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-update"><div class="inner"><span>update</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-update-column"><div class="inner"><span>update-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-update-columns"><div class="inner"><span>update-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-update-columnwise"><div class="inner"><span>update-columnwise</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-update-elemwise"><div class="inner"><span>update-elemwise</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-value-reader"><div class="inner"><span>value-reader</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-write.21"><div class="inner"><span>write!</span></div></a></li></ul></div><div class="namespace-docs" id="content"><h1 class="anchor" id="top">tech.v3.dataset</h1><div class="doc"><div class="markdown"><p>Column major dataset abstraction for efficiently manipulating
in-memory datasets.</p>
</div></div><div class="public anchor" id="var--.3E.3Edataset"><h3>-&gt;&gt;dataset</h3><div class="usage"><code>(-&gt;&gt;dataset options dataset)</code><code>(-&gt;&gt;dataset dataset)</code></div><div class="doc"><div class="markdown"><p>Please see documentation of -&gt;dataset. Options are the same.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L14">view source</a></div></div><div class="public anchor" id="var--.3Edataset"><h3>-&gt;dataset</h3><div class="usage"><code>(-&gt;dataset dataset options)</code><code>(-&gt;dataset dataset)</code></div><div class="doc"><div class="markdown"><p>Create a dataset from either csv/tsv or a sequence of maps.</p>
<ul>
<li>
<p>A <code>String</code> is interpreted as the path to a file (or gzipped file if it
ends with .gz) of tsv or csv data. The system attempts to autodetect whether the
file is csv or tsv and to detect the column datatypes, all of which can
be overridden.</p>
</li>
<li>
<p>InputStreams have no file type and thus a <code>file-type</code> must be provided in the
options.</p>
</li>
<li>
<p>A sequence of maps may be passed in, in which case the first N maps are scanned in
order to derive the column datatypes before the actual columns are created.</p>
</li>
</ul>
<p>Parquet, xlsx, and xls formats need the appropriate namespace to be required first:
<code>tech.v3.libs.parquet</code> for parquet, <code>tech.v3.libs.fastexcel</code> for xlsx,
and <code>tech.v3.libs.poi</code> for xls.</p>
<p>Arrow support is provided via the <code>tech.v3.libs.arrow</code> namespace, not via a file-type
overload, as the Arrow project currently has 3 different file types and it is not clear
what their final suffix will be or which of the three file types it will indicate.
Please see documentation in the <code>tech.v3.libs.arrow</code> namespace for further information
on Arrow file types.</p>
<p>Options:</p>
<ul>
<li>
<p><code>:dataset-name</code> - set the name of the dataset.</p>
</li>
<li>
<p><code>:file-type</code> - Override the filetype discovery mechanism for strings or force a particular
parser for an input stream. Note that parquet must have paths on disk
and cannot currently load from an input stream. Acceptable file types are:
#{:csv :tsv :xlsx :xls :parquet}.</p>
</li>
<li>
<p><code>:gzipped?</code> - for file formats that support it, override autodetection and force
creation of a gzipped input stream as opposed to a normal input stream.</p>
</li>
<li>
<p><code>:column-allowlist</code> - either a sequence of string column names or a sequence of column
indices of columns to allowlist. This is preferred to <code>:column-whitelist</code>.</p>
</li>
<li>
<p><code>:column-blocklist</code> - either a sequence of string column names or a sequence of column
indices of columns to blocklist. This is preferred to <code>:column-blacklist</code>.</p>
</li>
<li>
<p><code>:num-rows</code> - Number of rows to read</p>
</li>
<li>
<p><code>:header-row?</code> - Defaults to true, indicates the first row is a header.</p>
</li>
<li>
<p><code>:key-fn</code> - function to be applied to column names. Typical use is:
<code>:key-fn keyword</code>.</p>
</li>
<li>
<p><code>:separator</code> - Add a character separator to the list of separators to auto-detect.</p>
</li>
<li>
<p><code>:csv-parser</code> - Implementation of univocity's AbstractParser to use. If not
provided, a default permissive parser is used. This way you can parse anything that
univocity supports (so flat files and such).</p>
</li>
<li>
<p><code>:bad-row-policy</code> - One of three options: :skip, :error, :carry-on. Defaults to
:carry-on. Some csv data has ragged rows and in this case we have several
options. If the option is :carry-on then we either create a new column or add
missing values for columns that had no data for that row.</p>
</li>
<li>
<p><code>:skip-bad-rows?</code> - Legacy option. Use :bad-row-policy.</p>
</li>
<li>
<p><code>:disable-comment-skipping?</code> - By default, the <code>#</code> character is recognised as a
line comment when found at the beginning of a line of text in a CSV file,
and the row will be ignored. Set to <code>true</code> to disable this behavior.</p>
</li>
<li>
<p><code>:max-chars-per-column</code> - Defaults to 4096. Columns with more characters than this
will result in an exception.</p>
</li>
<li>
<p><code>:max-num-columns</code> - Defaults to 8192. CSV and TSV files with more columns than this
will fail to parse. For more information on this option, please visit:
<a href="https://github.com/uniVocity/univocity-parsers/issues/301">https://github.com/uniVocity/univocity-parsers/issues/301</a></p>
</li>
<li>
<p><code>:text-temp-dir</code> - The temporary directory to use for file-backed text. Setting
this value to boolean <code>false</code> turns off file-backed text, which is the default. If a
tech.v3.resource stack context is open, the file is deleted when the context
closes; otherwise it is deleted when the GC cleans up the dataset. A shutdown hook is
added as a last resort to ensure the file is cleaned up.</p>
</li>
<li>
<p><code>:n-initial-skip-rows</code> - Skip N rows initially. This currently may include the
header row. Works across both csv and spreadsheet datasets.</p>
</li>
<li>
<p><code>:parser-type</code> - Default parser to use if no parser-fn is specified for that column.
For csv files, the default parser type is <code>:string</code> which indicates a promotional
string parser. For sequences of maps, the default parser type is :object. It can
be useful in some contexts to use the <code>:string</code> parser with sequences of maps or
maps of columns.</p>
</li>
<li>
<p><code>:parser-fn</code> - one of:</p>
<ul>
<li><code>keyword?</code> - all columns are parsed to this datatype. For example:
<code>{:parser-fn :string}</code></li>
<li><code>map?</code> - <code>{column-name parse-method}</code> parse each column with specified
<code>parse-method</code>.
The <code>parse-method</code> can be:
<ul>
<li><code>keyword?</code> - parse the specified column to this datatype. For example:
<code>{:parser-fn {:answer :boolean :id :int32}}</code></li>
<li>tuple - pair of <code>[datatype parse-data]</code>, in which case a container of type
<code>[datatype]</code> will be created. <code>parse-data</code> can be one of:
<ul>
<li><code>:relaxed?</code> - data will be parsed such that parse failures of the standard
parse functions do not stop the parsing process. :unparsed-values and
:unparsed-indexes are available in the metadata of the column that tell
you the values that failed to parse and their respective indexes.</li>
<li><code>fn?</code> - function from a string to one of <code>:tech.v3.dataset/missing</code>,
<code>:tech.v3.dataset/parse-failure</code>, or the parsed value.
Exceptions here always kill the parse process. :missing marks the index
as missing. :parse-failure also adds the index to missing and updates
the column's :unparsed-values and :unparsed-indexes.</li>
<li><code>string?</code> - for datetime types, this will be turned into a DateTimeFormatter via
DateTimeFormatter/ofPattern. For <code>:text</code> you can specify the backing file
to use.</li>
<li><code>DateTimeFormatter</code> - use with the appropriate temporal parse static function
to parse the value.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>
<p><code>map?</code> - the header-name-or-idx is used to look up the value. If it is not nil, the
value can be any of the above options. Otherwise the default column parser
is used.</p>
</li>
</ul>
<p>Returns a new dataset</p>
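<p>A minimal sketch of two of the input kinds described above, assuming the namespace is aliased as <code>ds</code>. The sample data and the var names <code>from-maps</code> and <code>from-stream</code> are made up for illustration. The parser map is keyed by the original string header names because no <code>:key-fn</code> is given.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Sequence of maps: the leading maps are scanned to derive column datatypes.
(def from-maps
  (ds/-&gt;dataset [{:id 1 :answer true}
                 {:id 2 :answer false}]
                {:dataset-name "answers"}))

;; InputStream: there is no file name to inspect, so :file-type is required.
;; The CSV text below is illustrative sample data.
(def from-stream
  (ds/-&gt;dataset (java.io.ByteArrayInputStream.
                 (.getBytes "id,answer\n1,true\n2,false\n"))
                {:file-type :csv
                 :dataset-name "answers-csv"
                 ;; Override datatype detection per column.
                 :parser-fn {"id" :int32 "answer" :boolean}}))
</code></pre>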
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L22">view source</a></div></div><div class="public anchor" id="var-add-column"><h3>add-column</h3><div class="usage"><code>(add-column dataset column)</code></div><div class="doc"><div class="markdown"><p>Add a new column. Error if name collision</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L130">view source</a></div></div><div class="public anchor" id="var-add-or-update-column"><h3>add-or-update-column</h3><div class="usage"><code>(add-or-update-column dataset colname column)</code><code>(add-or-update-column dataset column)</code></div><div class="doc"><div class="markdown"><p>If column exists, replace. Else append new column.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L136">view source</a></div></div><div class="public anchor" id="var-all-descriptive-stats-names"><h3>all-descriptive-stats-names</h3><div class="usage"><code>(all-descriptive-stats-names)</code></div><div class="doc"><div class="markdown"><p>Returns the names of all descriptive stats in the order they will be returned
in the resulting dataset of descriptive stats. This allows easy filtering
in the form:
<code>(descriptive-stats ds {:stat-names (-&gt;&gt; (all-descriptive-stats-names)
(remove #{:values :num-distinct-values}))})</code></p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L144">view source</a></div></div><div class="public anchor" id="var-append-columns"><h3>append-columns</h3><div class="usage"><code>(append-columns dataset column-seq)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L154">view source</a></div></div><div class="public anchor" id="var-assoc-ds"><h3>assoc-ds</h3><div class="usage"><code>(assoc-ds dataset cname cdata &amp; args)</code></div><div class="doc"><div class="markdown"><p>If dataset is not nil, calls <code>clojure.core/assoc</code>. Else creates a new empty dataset and
then calls <code>clojure.core/assoc</code>. Guaranteed to return a dataset (unlike assoc).</p>
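<p>As a rough illustration, the sketch below passes <code>nil</code> in place of a dataset. The name <code>maybe-ds</code> is hypothetical and stands in for any value that may not have been initialised yet.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical placeholder for a dataset that may be nil.
(def maybe-ds nil)

;; With nil, assoc-ds starts from an empty dataset. Further
;; column-name/column-data pairs follow, as with assoc.
(def result (ds/assoc-ds maybe-ds :a [1 2 3] :b [4 5 6]))

;; The result is a dataset even though the input was nil.
(ds/dataset? result)
</code></pre>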
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L159">view source</a></div></div><div class="public anchor" id="var-assoc-metadata"><h3>assoc-metadata</h3><div class="usage"><code>(assoc-metadata dataset filter-fn-or-ds k v &amp; args)</code></div><div class="doc"><div class="markdown"><p>Set metadata across a set of columns.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L166">view source</a></div></div><div class="public anchor" id="var-bind-.3E"><h3>bind-&gt;</h3><h4 class="type">macro</h4><div class="usage"><code>(bind-&gt; expr name &amp; args)</code></div><div class="doc"><div class="markdown"><p>Threads like <code>-&gt;</code> but binds name to expr like <code>as-&gt;</code>:</p>
<pre><code class="language-clojure">(ds/bind-&gt; (ds/-&gt;dataset "test/data/stocks.csv") ds
(assoc :logprice2 (dfn/log1p (ds "price")))
(assoc :logp3 (dfn/* 2 (ds :logprice2)))
(ds/select-columns ["price" :logprice2 :logp3])
(ds-tens/dataset-&gt;tensor)
(first))
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L172">view source</a></div></div><div class="public anchor" id="var-brief"><h3>brief</h3><div class="usage"><code>(brief ds options)</code><code>(brief ds)</code></div><div class="doc"><div class="markdown"><p>Get a brief description, in mapseq form of a dataset. A brief description is
the mapseq form of descriptive stats.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L188">view source</a></div></div><div class="public anchor" id="var-categorical-.3Enumber"><h3>categorical-&gt;number</h3><div class="usage"><code>(categorical-&gt;number dataset filter-fn-or-ds)</code><code>(categorical-&gt;number dataset filter-fn-or-ds table-args)</code><code>(categorical-&gt;number dataset filter-fn-or-ds table-args result-datatype)</code></div><div class="doc"><div class="markdown"><p>Convert columns into a discrete, numeric representation.
See tech.v3.dataset.categorical/fit-categorical-map.</p>
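<p>A hedged sketch of a typical call, assuming the <code>categorical</code> filter from <code>tech.v3.dataset.column-filters</code> selects the string column here. The dataset <code>fruit-ds</code> is made up for illustration:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.column-filters :as cf])

(def fruit-ds (ds/-&gt;dataset {:fruit ["apple" "pear" "apple"]
                             :n     [1 2 3]}))

;; Encode every categorical column as numbers; other columns are untouched.
(def encoded (ds/categorical-&gt;number fruit-ds cf/categorical))

;; The :fruit column now holds numeric codes.
(vec (encoded :fruit))
</code></pre>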
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L197">view source</a></div></div><div class="public anchor" id="var-categorical-.3Eone-hot"><h3>categorical-&gt;one-hot</h3><div class="usage"><code>(categorical-&gt;one-hot dataset filter-fn-or-ds)</code><code>(categorical-&gt;one-hot dataset filter-fn-or-ds table-args)</code><code>(categorical-&gt;one-hot dataset filter-fn-or-ds table-args result-datatype)</code></div><div class="doc"><div class="markdown"><p>Convert string columns to numeric columns.
See tech.v3.dataset.categorical/fit-one-hot.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L208">view source</a></div></div><div class="public anchor" id="var-column"><h3>column</h3><div class="usage"><code>(column dataset colname)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L219">view source</a></div></div><div class="public anchor" id="var-column-.3Edataset"><h3>column-&gt;dataset</h3><div class="usage"><code>(column-&gt;dataset dataset colname transform-fn options)</code><code>(column-&gt;dataset dataset colname transform-fn)</code></div><div class="doc"><div class="markdown"><p>Transform a column into a sequence of maps using transform-fn.
Return dataset created out of the sequence of maps.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L224">view source</a></div></div><div class="public anchor" id="var-column-cast"><h3>column-cast</h3><div class="usage"><code>(column-cast dataset colname datatype)</code><code>(column-cast dataset colname datatype options)</code></div><div class="doc"><div class="markdown"><p>Cast a column to a new datatype. This is never a lazy operation. If the old
and new datatypes match and no cast-fn is provided then dtype/clone is called
on the column.</p>
<p>colname may be a scalar or a tuple of <code>[src-col dst-col]</code>.</p>
<p>datatype may be a datatype enumeration or a tuple of
<code>[datatype cast-fn]</code> where cast-fn may return either a new value,
<code>:tech.v3.dataset/missing</code>, or <code>:tech.v3.dataset/parse-failure</code>.
Exceptions are propagated to the caller. The new column has at least the
existing missing set; it is exactly the existing set only if no cast attempt
returns <code>:tech.v3.dataset/missing</code> or <code>:tech.v3.dataset/parse-failure</code>.
<code>:tech.v3.dataset/parse-failure</code> means the value gets added to metadata key
<code>:unparsed-data</code> and the index gets added to <code>:unparsed-indexes</code>.</p>
<p>If the existing datatype is string, then tech.v3.dataset.column/parse-column
is called.</p>
<p>Casts between numeric datatypes need no cast-fn but one may be provided.
Casts to string need no cast-fn but one may be provided.
Casts from string to anything will call tech.v3.dataset.column/parse-column.</p>
<p>Options:</p>
<ul>
<li><code>:track-parse-errors</code> - defaults to false. When true extra metadata keys
<code>:unparsed-indexes :unparsed-data</code> will be appended to the metadata. Be aware
these values may not serialize as unparsed indexes is a roaring bitmap.</li>
</ul>
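<p>The argument shapes described above can be sketched as follows. The dataset <code>cast-ds</code> and the halving cast-fn are illustrative assumptions, not taken from the library's tests:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def cast-ds (ds/-&gt;dataset {:a [1 2 3]}))

;; Numeric to numeric: a bare datatype keyword is enough.
(def as-double (ds/column-cast cast-ds :a :float64))

;; [src-col dst-col]: keep :a and write the cast result to :a-str.
(def with-str (ds/column-cast cast-ds [:a :a-str] :string))

;; [datatype cast-fn]: cast-fn produces each new value.
(def halved (ds/column-cast cast-ds :a
                            [:float64 (fn [v] (* 0.5 (double v)))]))
</code></pre>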
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L233">view source</a></div></div><div class="public anchor" id="var-column-count"><h3>column-count</h3><div class="usage"><code>(column-count dataset)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L267">view source</a></div></div><div class="public anchor" id="var-column-labeled-mapseq"><h3>column-labeled-mapseq</h3><div class="usage"><code>(column-labeled-mapseq dataset value-colname-seq)</code></div><div class="doc"><div class="markdown"><p>Given a dataset, return a sequence of maps where several columns are all stored
in a :value key and a :label key contains a column name. Used for quickly creating
timeseries or scatterplot labeled graphs. Returns a lazy sequence, not a reader!</p>
<p>See also <code>columnwise-concat</code></p>
<p>Return a sequence of maps with</p>
<pre><code class="language-clojure"> {... - columns not in colname-seq
:value - value from one of the value columns
:label - name of the column the value came from
}
</code></pre>
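<p>As a rough illustration, with a made-up dataset <code>wide</code> whose <code>:open</code> and <code>:close</code> columns are folded into <code>:value</code>/<code>:label</code> pairs while <code>:day</code> is carried along:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def wide (ds/-&gt;dataset [{:day 1 :open 10.0 :close 11.0}
                         {:day 2 :open 11.0 :close 12.5}]))

;; One map per (row, value column) pair.
(def labeled (ds/column-labeled-mapseq wide [:open :close]))

(count labeled)            ;; 2 rows x 2 value columns
(set (map :label labeled)) ;; the names of the value columns
</code></pre>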
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L272">view source</a></div></div><div class="public anchor" id="var-column-map"><h3>column-map</h3><div class="usage"><code>(column-map dataset result-colname map-fn res-dtype-or-opts filter-fn-or-ds)</code><code>(column-map dataset result-colname map-fn filter-fn-or-ds)</code><code>(column-map dataset result-colname map-fn)</code></div><div class="doc"><div class="markdown"><p>Produce a new (or updated) column as the result of mapping a fn over columns. This
function is never lazy - all results are immediately calculated.</p>
<ul>
<li><code>dataset</code> - dataset.</li>
<li><code>result-colname</code> - Name of new (or existing) column.</li>
<li><code>map-fn</code> - function to map over columns. Same rules as <code>tech.v3.datatype/emap</code>.</li>
<li><code>res-dtype-or-opts</code> - If not given result is scanned to infer missing and datatype.
If using an option map, options are described below.</li>
<li><code>filter-fn-or-ds</code> - A dataset, a sequence of columns, or a <code>tech.v3.dataset.column-filters</code>
column filter function. Defaults to all the columns of the existing dataset.</li>
</ul>
<p>Returns a new dataset with a new or updated column.</p>
<p>Options:</p>
<ul>
<li><code>:datatype</code> - Set the datatype of the result column. If not given result is scanned
to infer result datatype and missing set.</li>
<li><code>:missing-fn</code> - if given, columns are first passed to missing-fn as a sequence and
this dictates the missing set. Else the missing set is found by scanning the results
during the inference process. See <code>tech.v3.dataset.column/union-missing-sets</code> and
<code>tech.v3.dataset.column/intersect-missing-sets</code> for example functions to pass in
here.</li>
</ul>
<p>Examples:</p>
<pre><code class="language-clojure">
;;From the tests --
(let [testds (ds/-&gt;dataset [{:a 1.0 :b 2.0} {:a 3.0 :b 5.0} {:a 4.0 :b nil}])]
;;result scanned for both datatype and missing set
(is (= (vec [3.0 6.0 nil])
(:b2 (ds/column-map testds :b2 #(when % (inc %)) [:b]))))
;;result scanned for missing set only. Result used in-place.
(is (= (vec [3.0 6.0 nil])
(:b2 (ds/column-map testds :b2 #(when % (inc %))
{:datatype :float64} [:b]))))
;;Nothing scanned at all.
(is (= (vec [3.0 6.0 nil])
(:b2 (ds/column-map testds :b2 #(inc %)
{:datatype :float64
:missing-fn ds-col/union-missing-sets} [:b]))))
;;Missing set scanning causes NPE at inc.
(is (thrown? Throwable
(ds/column-map testds :b2 #(inc %)
{:datatype :float64}
[:b]))))
;;Ad-hoc repl --
user&gt; (require '[tech.v3.dataset :as ds])
nil
user&gt; (def ds (ds/-&gt;dataset "test/data/stocks.csv"))
#'user/ds
user&gt; (ds/head ds)
test/data/stocks.csv [5 3]:
| symbol | date | price |
|--------|------------|-------|
| MSFT | 2000-01-01 | 39.81 |
| MSFT | 2000-02-01 | 36.35 |
| MSFT | 2000-03-01 | 43.22 |
| MSFT | 2000-04-01 | 28.37 |
| MSFT | 2000-05-01 | 25.45 |
user&gt; (-&gt; (ds/column-map ds "price^2" #(* % %) ["price"])
(ds/head))
test/data/stocks.csv [5 4]:
| symbol | date | price | price^2 |
|--------|------------|-------|-----------|
| MSFT | 2000-01-01 | 39.81 | 1584.8361 |
| MSFT | 2000-02-01 | 36.35 | 1321.3225 |
| MSFT | 2000-03-01 | 43.22 | 1867.9684 |
| MSFT | 2000-04-01 | 28.37 | 804.8569 |
| MSFT | 2000-05-01 | 25.45 | 647.7025 |
user&gt; (def ds1 (ds/-&gt;dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}]))
#'user/ds1
user&gt; ds1
_unnamed [3 2]:
| :b | :a |
|----:|---:|
| | 1 |
| 2.0 | |
| 3.0 | 2 |
user&gt; (ds/column-map ds1 :c (fn [a b]
(when (and a b)
(+ (double a) (double b))))
[:a :b])
_unnamed [3 3]:
| :b | :a | :c |
|----:|---:|----:|
| | 1 | |
| 2.0 | | |
| 3.0 | 2 | 5.0 |
user&gt; (ds/missing (*1 :c))
{0,1}
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L290">view source</a></div></div><div class="public anchor" id="var-column-map-m"><h3>column-map-m</h3><h4 class="type">macro</h4><div class="usage"><code>(column-map-m ds result-colname src-colnames body)</code></div><div class="doc"><div class="markdown"><p>Map a function across one or more columns via a macro.
The function will have arguments in the order of the src-colnames. Column names of
the form <code>right.id</code> will be bound to variables named <code>right-id</code>.</p>
<p>Example:</p>
<pre><code class="language-clojure">user&gt; (-&gt; (ds/-&gt;dataset [{:a.a 1} {:b 2.0} {:a.a 2 :b 3.0}])
(ds/column-map-m :a [:a.a :b]
(when (and a-a b)
(+ (double a-a) (double b)))))
_unnamed [3 3]:
| :b | :a.a | :a |
|----:|-----:|----:|
| | 1 | |
| 2.0 | | |
| 3.0 | 2 | 5.0 |
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L402">view source</a></div></div><div class="public anchor" id="var-column-names"><h3>column-names</h3><div class="usage"><code>(column-names dataset)</code></div><div class="doc"><div class="markdown"><p>In-order sequence of column names</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L427">view source</a></div></div><div class="public anchor" id="var-columns"><h3>columns</h3><div class="usage"><code>(columns dataset)</code></div><div class="doc"><div class="markdown"><p>Return sequence of all columns in dataset.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L433">view source</a></div></div><div class="public anchor" id="var-columns-with-missing-seq"><h3>columns-with-missing-seq</h3><div class="usage"><code>(columns-with-missing-seq dataset)</code></div><div class="doc"><div class="markdown"><p>Return a sequence of:</p>
<pre><code class="language-clojure"> {:column-name column-name
:missing-count missing-count
}
</code></pre>
<p>or nil if no columns are missing data.</p>
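<p>A small sketch of both outcomes, using illustrative datasets <code>gappy</code> and <code>full</code>:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; :b is absent in the second row and nil in the third.
(def gappy (ds/-&gt;dataset [{:a 1 :b 2.0} {:a 2} {:a 3 :b nil}]))
(def full  (ds/-&gt;dataset {:a [1 2 3]}))

;; One entry per column that has missing values.
(def report (ds/columns-with-missing-seq gappy))

;; No column has missing values, so the result is nil.
(ds/columns-with-missing-seq full)
</code></pre>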
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L439">view source</a></div></div><div class="public anchor" id="var-columnwise-concat"><h3>columnwise-concat</h3><div class="usage"><code>(columnwise-concat dataset colnames options)</code><code>(columnwise-concat dataset colnames)</code></div><div class="doc"><div class="markdown"><p>Given a dataset and a list of columns, produce a new dataset with
the columns concatenated to a new column with a :column column indicating
which column the original value came from. Any columns not mentioned in the
list of columns are duplicated.</p>
<p>Example:</p>
<pre><code class="language-clojure">user&gt; (-&gt; [{:a 1 :b 2 :c 3 :d 1} {:a 4 :b 5 :c 6 :d 2}]
(ds/-&gt;dataset)
(ds/columnwise-concat [:c :a :b]))
null [6 3]:
| :column | :value | :d |
|---------+--------+----|
| :c | 3 | 1 |
| :c | 6 | 2 |
| :a | 1 | 1 |
| :a | 4 | 2 |
| :b | 2 | 1 |
| :b | 5 | 2 |
</code></pre>
<p>Options:</p>
<ul>
<li><code>value-column-name</code> - defaults to <code>:value</code>.</li>
<li><code>colname-column-name</code> - defaults to <code>:column</code>.</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L451">view source</a></div></div><div class="public anchor" id="var-concat"><h3>concat</h3><div class="usage"><code>(concat dataset &amp; args)</code><code>(concat)</code></div><div class="doc"><div class="markdown"><p>Concatenate datasets using a copying-concatenation.
See also <a href="tech.v3.dataset.html#var-concat-inplace">concat-inplace</a> as it may be more efficient for your use case if you have
a small number (like less than 3) of datasets.</p>
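<p>For instance, two datasets with the same columns can be stacked as in this sketch (<code>jan</code> and <code>feb</code> are illustrative):</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def jan (ds/-&gt;dataset {:id [1 2] :score [0.5 0.7]}))
(def feb (ds/-&gt;dataset {:id [3]   :score [0.9]}))

;; Rows of feb follow the rows of jan in the result.
(def combined (ds/concat jan feb))

(ds/row-count combined)
</code></pre>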
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L485">view source</a></div></div><div class="public anchor" id="var-concat-copying"><h3>concat-copying</h3><div class="usage"><code>(concat-copying dataset &amp; args)</code><code>(concat-copying)</code></div><div class="doc"><div class="markdown"><p>Concatenate datasets into a new dataset copying data. Respects missing values.
Datasets must all have the same columns. Result column datatypes will be a widening
cast of the datatypes.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L495">view source</a></div></div><div class="public anchor" id="var-concat-inplace"><h3>concat-inplace</h3><div class="usage"><code>(concat-inplace dataset &amp; args)</code><code>(concat-inplace)</code></div><div class="doc"><div class="markdown"><p>Concatenate datasets in place. Respects missing values. Datasets must all have the
same columns. Result column datatypes will be a widening cast of the datatypes.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L505">view source</a></div></div><div class="public anchor" id="var-data-.3Edataset"><h3>data-&gt;dataset</h3><div class="usage"><code>(data-&gt;dataset input)</code></div><div class="doc"><div class="markdown"><p>Convert a data-ized dataset created via dataset-&gt;data back into a
full dataset</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L514">view source</a></div></div><div class="public anchor" id="var-dataset-.3Edata"><h3>dataset-&gt;data</h3><div class="usage"><code>(dataset-&gt;data ds)</code></div><div class="doc"><div class="markdown"><p>Convert a dataset to a pure clojure datastructure. Returns a map with two keys:
{:metadata :columns}.
:columns is a vector of column definitions appropriate for passing directly back
into new-dataset.
A column definition in this case is a map of {:name :missing :data :metadata}.</p>
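<p>A round trip through <code>dataset-&gt;data</code> and <code>data-&gt;dataset</code> might look like the sketch below. The names <code>original</code>, <code>as-data</code>, and <code>restored</code> are illustrative:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def original (ds/-&gt;dataset {:a [1 2 3] :b ["x" "y" "z"]}))

;; Plain Clojure data: a map with :metadata and :columns.
(def as-data (ds/dataset-&gt;data original))
(keys as-data)

;; Rebuild a full dataset from the data-ized form.
(def restored (ds/data-&gt;dataset as-data))
(ds/column-names restored)
</code></pre>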
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L521">view source</a></div></div><div class="public anchor" id="var-dataset-name"><h3>dataset-name</h3><div class="usage"><code>(dataset-name dataset)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L531">view source</a></div></div><div class="public anchor" id="var-dataset-parser"><h3>dataset-parser</h3><div class="usage"><code>(dataset-parser options)</code><code>(dataset-parser)</code></div><div class="doc"><div class="markdown"><p>Implements protocols/PDatasetParser, Counted, Indexed, IReduceInit, and IDeref (returns the new dataset).
See documentation for <a href="tech.v3.dataset.html#var-mapseq-parser">mapseq-parser</a>.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L536">view source</a></div></div><div class="public anchor" id="var-dataset.3F"><h3>dataset?</h3><div class="usage"><code>(dataset? ds)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L545">view source</a></div></div><div class="public anchor" id="var-descriptive-stats"><h3>descriptive-stats</h3><div class="usage"><code>(descriptive-stats dataset)</code><code>(descriptive-stats dataset options)</code></div><div class="doc"><div class="markdown"><p>Get descriptive statistics across the columns of the dataset.
In addition to the standard stats.
</p>
<p>Options:</p>
<ul>
<li><code>:stat-names</code> - defaults to <code>(remove #{:values :num-distinct-values} (all-descriptive-stats-names))</code>.</li>
<li><code>:n-categorical-values</code> - Number of categorical values to report in the 'values'
field. Defaults to 21.</li>
</ul>
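<p>One way to request only a few statistics is to filter the names returned by <code>all-descriptive-stats-names</code>, which avoids asking for a name the library does not know. The wanted names below (<code>:col-name</code>, <code>:min</code>, <code>:mean</code>, <code>:max</code>) are assumptions about what that list contains:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def prices (ds/-&gt;dataset {:price [39.81 36.35 43.22]
                           :qty   [10 20 30]}))

;; Keep only the wanted names that the library actually reports.
(def wanted (filter #{:col-name :min :mean :max}
                    (ds/all-descriptive-stats-names)))

;; The result is itself a dataset, one row per column of prices.
(def stats (ds/descriptive-stats prices {:stat-names wanted}))
</code></pre>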
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L550">view source</a></div></div><div class="public anchor" id="var-drop-columns"><h3>drop-columns</h3><div class="usage"><code>(drop-columns dataset colname-seq-or-fn)</code></div><div class="doc"><div class="markdown"><p>Same as remove-columns. Remove columns indexed by column name seq or
column filter function.
For example:</p>
<pre><code class="language-clojure">(drop-columns DS [:A :B])
(drop-columns DS cf/categorical)
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L564">view source</a></div></div><div class="public anchor" id="var-drop-missing"><h3>drop-missing</h3><div class="usage"><code>(drop-missing dataset-or-col)</code><code>(drop-missing ds colname)</code></div><div class="doc"><div class="markdown"><p>Remove missing entries by simply selecting out the missing indexes.</p>
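<p>A brief sketch of the two arities, with an illustrative dataset <code>gappy</code> in which only <code>:b</code> has a missing value:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def gappy (ds/-&gt;dataset [{:a 1 :b 2.0} {:a 2} {:a 3 :b 4.0}]))

;; Drop every row that has a missing value in any column.
(def complete-rows (ds/drop-missing gappy))

;; Only consider missing values in column :a, which has none.
(def a-present (ds/drop-missing gappy :a))

(ds/row-count complete-rows)
(ds/row-count a-present)
</code></pre>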
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L577">view source</a></div></div><div class="public anchor" id="var-drop-rows"><h3>drop-rows</h3><div class="usage"><code>(drop-rows dataset-or-col row-indexes)</code></div><div class="doc"><div class="markdown"><p>Drop rows from dataset or column</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L585">view source</a></div></div><div class="public anchor" id="var-empty-column-names"><h3>empty-column-names</h3><div class="usage"><code>(empty-column-names ds)</code></div><div class="doc"><div class="markdown"><p>Return a sequence of column names whose empty set length matches the row count of the dataset.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L591">view source</a></div></div><div class="public anchor" id="var-empty-dataset"><h3>empty-dataset</h3><div class="usage"><code>(empty-dataset)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L597">view source</a></div></div><div class="public anchor" id="var-ensure-array-backed"><h3>ensure-array-backed</h3><div class="usage"><code>(ensure-array-backed ds options)</code><code>(ensure-array-backed ds)</code></div><div class="doc"><div class="markdown"><p>Ensure the column data in the dataset is stored in pure java arrays. This is
sometimes necessary for interop with other libraries and this operation will
force any lazy computations to complete. This also clears the missing set
for each column and writes the missing values to the new arrays.</p>
<p>Columns that are already array backed and that have no missing values are
returned unchanged.</p>
<p>The postcondition is that dtype/-&gt;array will return a java array in the appropriate
datatype for each column.</p>
<p>Options:</p>
<ul>
<li><code>:unpack?</code> - unpack packed datetime types. Defaults to true</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L602">view source</a></div></div><div class="public anchor" id="var-filter"><h3>filter</h3><div class="usage"><code>(filter dataset predicate)</code></div><div class="doc"><div class="markdown"><p>dataset-&gt;dataset transformation. Predicate is passed a map of
colname-&gt;column-value.</p>
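<p>For example, the predicate can look up several columns in the row map. The dataset <code>people</code> is illustrative:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def people (ds/-&gt;dataset {:name ["ann" "bob" "cy"]
                           :age  [31 17 45]}))

;; Each row arrives as a map of column name to value.
(def adults (ds/filter people (fn [row] (&gt;= (:age row) 18))))

(vec (adults :name))
</code></pre>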
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L623">view source</a></div></div><div class="public anchor" id="var-filter-column"><h3>filter-column</h3><div class="usage"><code>(filter-column dataset colname predicate)</code><code>(filter-column dataset colname)</code></div><div class="doc"><div class="markdown"><p>Filter a given column by a predicate. Predicate is passed column values.
If predicate is <em>not</em> an instance of IFn it is treated as a value and will
be used as if the predicate is <code>#(= value %)</code>.</p>
<p>The 2-arity form of this function reads the column as a boolean reader so for
instance numeric 0 values are false in that case as are Double/NaN, Float/NaN. Objects are
only false if nil?.</p>
<p>Returns a dataset.</p>
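<p>The predicate and value forms can be sketched as follows, using an illustrative dataset <code>people</code>:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def people (ds/-&gt;dataset {:name ["ann" "bob" "cy"]
                           :age  [31 17 45]}))

;; Predicate form: called with each value of :age.
(def adults (ds/filter-column people :age #(&gt;= % 18)))

;; Value form: a string is not an IFn, so it is compared with =.
(def bobs (ds/filter-column people :name "bob"))

(ds/row-count adults)
(ds/row-count bobs)
</code></pre>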
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L630">view source</a></div></div><div class="public anchor" id="var-filter-dataset"><h3>filter-dataset</h3><div class="usage"><code>(filter-dataset dataset filter-fn-or-ds)</code></div><div class="doc"><div class="markdown"><p>Filter the columns of the dataset returning a new dataset. This pathway is
designed to work with the tech.v3.dataset.column-filters namespace.</p>
<ul>
<li>If filter-fn-or-ds is a dataset, it is returned.</li>
<li>If filter-fn-or-ds is sequential, then select-columns is called.</li>
<li>If filter-fn-or-ds is :all, all columns are returned</li>
<li>If filter-fn-or-ds is an instance of IFn, the dataset is passed into it.</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L646">view source</a></div></div><div class="public anchor" id="var-group-by"><h3>group-by</h3><div class="usage"><code>(group-by dataset key-fn options)</code><code>(group-by dataset key-fn)</code></div><div class="doc"><div class="markdown"><p>Produce a map of key-fn-value-&gt;dataset. The argument to key-fn
is a map of colname-&gt;column-value representing a row in dataset.
Each dataset in the resulting map contains all and only rows
that produce the same key-fn-value.</p>
<p>Options - options are passed into dtype arggroup:</p>
<ul>
<li><code>:group-by-finalizer</code> - when provided this is run on each dataset immediately after the
rows are selected. This can be used to immediately perform a reduction on each new
dataset which is faster than doing it in a separate run.</li>
</ul>
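<p>A sketch of grouping with and without a finalizer. The dataset <code>orders</code> and the helper <code>size-class</code> are hypothetical, and <code>ds/row-count</code> is used as the finalizer so each group is reduced to its row count:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def orders (ds/-&gt;dataset {:customer ["a" "b" "a" "c"]
                           :total    [10 20 30 40]}))

;; key-fn receives each row as a map of column name to value.
(defn size-class [row]
  (if (&gt; (:total row) 15) :large :small))

;; Map of key to dataset.
(def by-size (ds/group-by orders size-class))

;; Map of key to row count, computed as each group is built.
(def counts (ds/group-by orders size-class
                         {:group-by-finalizer ds/row-count}))
</code></pre>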
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L658">view source</a></div></div><div class="public anchor" id="var-group-by-.3Eindexes"><h3>group-by-&gt;indexes</h3><div class="usage"><code>(group-by-&gt;indexes dataset key-fn options)</code><code>(group-by-&gt;indexes dataset key-fn)</code></div><div class="doc"><div class="markdown"><p>(Non-lazy) - Group a dataset and return a map of key-fn-value-&gt;indexes where indexes
is an in-order contiguous group of indexes.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L675">view source</a></div></div><div class="public anchor" id="var-group-by-column"><h3>group-by-column</h3><div class="usage"><code>(group-by-column dataset colname options)</code><code>(group-by-column dataset colname)</code></div><div class="doc"><div class="markdown"><p>Return a map of column-value-&gt;dataset. Each dataset in the
resulting map contains all and only rows with the same value in
column.</p>
<ul>
<li><code>:group-by-finalizer</code> - when provided this is run on each dataset immediately after the
rows are selected. This can be used to immediately perform a reduction on each new
dataset which is faster than doing it in a separate run.</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L684">view source</a></div></div><div class="public anchor" id="var-group-by-column-.3Eindexes"><h3>group-by-column-&gt;indexes</h3><div class="usage"><code>(group-by-column-&gt;indexes dataset colname options)</code><code>(group-by-column-&gt;indexes dataset colname)</code></div><div class="doc"><div class="markdown"><p>(Non-lazy) - Group a dataset by a column and return a map of column-val-&gt;indexes
where indexes is an in-order contiguous group of indexes.</p>
<p>Options are passed into dtype's arggroup method.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L698">view source</a></div></div><div class="public anchor" id="var-group-by-column-consumer"><h3>group-by-column-consumer</h3><div class="usage"><code>(group-by-column-consumer ds cname)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L709">view source</a></div></div><div class="public anchor" id="var-has-column.3F"><h3>has-column?</h3><div class="usage"><code>(has-column? dataset column-name)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L714">view source</a></div></div><div class="public anchor" id="var-head"><h3>head</h3><div class="usage"><code>(head dataset n)</code><code>(head dataset)</code></div><div class="doc"><div class="markdown"><p>Get the first n rows of a dataset. Equivalent to
<code>(select-rows ds (range n))</code>. Arguments are reversed, however, so this can
be used in -&gt;&gt; operators.</p>
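<p>A short sketch with an illustrative ten-row dataset <code>numbers</code>. The one-argument form returns five rows, as in the <code>column-map</code> REPL session above:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def numbers (ds/-&gt;dataset {:n (vec (range 10))}))

;; Explicit row count.
(def first-three (ds/head numbers 3))

;; Default row count.
(def first-default (ds/head numbers))

(vec (first-three :n))
</code></pre>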
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L719">view source</a></div></div><div class="public anchor" id="var-induction"><h3>induction</h3><div class="usage"><code>(induction ds induct-fn &amp; args)</code></div><div class="doc"><div class="markdown"><p>Given a dataset and a function from dataset-&gt;row produce a new dataset.
The produced row will be merged with the current row and then added to the
dataset.</p>
<p>Options are same as the options used for <a href="tech.v3.dataset.html#var--.3Edataset">-&gt;dataset</a> in order for the
user to control the parsing of the return values of <code>induct-fn</code>.
A new dataset is returned.</p>
<p>Example:</p>
<pre><code class="language-clojure">user&gt; (def ds (ds/-&gt;dataset {:a [0 1 2 3] :b [1 2 3 4]}))
#'user/ds
user&gt; ds
_unnamed [4 2]:
| :a | :b |
|---:|---:|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
user&gt; (ds/induction ds (fn [ds]
{:sum-of-previous-row (dfn/sum (ds/rowvec-at ds -1))
:sum-a (dfn/sum (ds :a))
:sum-b (dfn/sum (ds :b))}))
_unnamed [4 5]:
| :a | :b | :sum-b | :sum-a | :sum-of-previous-row |
|---:|---:|-------:|-------:|---------------------:|
| 0 | 1 | 0.0 | 0.0 | 0.0 |
| 1 | 2 | 1.0 | 0.0 | 1.0 |
| 2 | 3 | 3.0 | 1.0 | 5.0 |
| 3 | 4 | 6.0 | 3.0 | 14.0 |
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L729">view source</a></div></div><div class="public anchor" id="var-major-version"><h3>major-version</h3><div class="usage"></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L769">view source</a></div></div><div class="public anchor" id="var-mapseq-parser"><h3>mapseq-parser</h3><div class="usage"><code>(mapseq-parser options)</code><code>(mapseq-parser)</code></div><div class="doc"><div class="markdown"><p>Return a Clojure function. When called with one argument, that argument must be the next map
to add to the dataset. When called with no arguments, it returns the current dataset. This can be
used to efficiently transform a stream of maps into a dataset while getting intermediate
datasets during the parse operation.</p>
<p>Options are the same as for <a href="tech.v3.dataset.html#var--.3Edataset">-&gt;dataset</a>.</p>
<pre><code class="language-clojure">user&gt; (require '[tech.v3.dataset :as ds])
nil
user&gt; (def pfn (ds/mapseq-parser))
#'user/pfn
user&gt; (pfn {:a 1 :b 2})
nil
user&gt; (pfn {:a 1 :b 2})
nil
user&gt; (pfn {:a 2 :c 3})
nil
user&gt; (pfn)
_unnamed [3 3]:
| :a | :b | :c |
|---:|---:|---:|
| 1 | 2 | |
| 1 | 2 | |
| 2 | | 3 |
user&gt; (pfn {:a 3 :d 4})
nil
user&gt; (pfn {:a 5 :c 6})
nil
user&gt; (pfn)
_unnamed [5 4]:
| :a | :b | :c | :d |
|---:|---:|---:|---:|
| 1 | 2 | | |
| 1 | 2 | | |
| 2 | | 3 | |
| 3 | | | 4 |
| 5 | | 6 | |
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L770">view source</a></div></div><div class="public anchor" id="var-mapseq-reader"><h3>mapseq-reader</h3><div class="usage"><code>(mapseq-reader dataset options)</code><code>(mapseq-reader dataset)</code></div><div class="doc"><div class="markdown"><p>Return a reader that produces a map of column-name-&gt;column-value
upon read.</p>
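<p>Because the result is a reader, rows can be counted and accessed by index as well as traversed in order. A minimal sketch with an illustrative dataset <code>table</code>:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def table (ds/-&gt;dataset {:a [1 2 3] :b ["x" "y" "z"]}))

(def rows (ds/mapseq-reader table))

(count rows)          ;; one map per dataset row
(first rows)          ;; the first row as a map
(:b (nth rows 2))     ;; random access by row index
</code></pre>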
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L818">view source</a></div></div><div class="public anchor" id="var-mapseq-rf"><h3>mapseq-rf</h3><div class="usage"><code>(mapseq-rf)</code><code>(mapseq-rf options)</code></div><div class="doc"><div class="markdown"><p>Create a transduce-compatible rf that reduces a sequence of maps into a dataset.
Same options as <a href="tech.v3.dataset.html#var--.3Edataset">-&gt;dataset</a>.</p>
<pre><code class="language-clojure">user&gt; (transduce (map identity) (ds/mapseq-rf {:dataset-name :transduced}) [{:a 1 :b 2}])
:transduced [1 2]:
| :a | :b |
|---:|---:|
| 1 | 2 |
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L827">view source</a></div></div><div class="public anchor" id="var-min-n-by-column"><h3>min-n-by-column</h3><div class="usage"><code>(min-n-by-column dataset cname N comparator options)</code><code>(min-n-by-column dataset cname N comparator)</code><code>(min-n-by-column dataset cname N)</code></div><div class="doc"><div class="markdown"><p>Find the minimum N entries (unsorted) by column. Resulting data will be indexed in
original order. If you want a sorted order then sort the result.</p>
<p>See options to <a href="tech.v3.dataset.html#var-sort-by-column">sort-by-column</a>.</p>
<p>Example:</p>
<pre><code class="language-clojure">user&gt; (ds/min-n-by-column ds "price" 10 nil nil)
test/data/stocks.csv [10 3]:
| symbol | date | price |
|--------|------------|------:|
| AMZN | 2001-09-01 | 5.97 |
| AMZN | 2001-10-01 | 6.98 |
| AAPL | 2000-12-01 | 7.44 |
| AAPL | 2002-08-01 | 7.38 |
| AAPL | 2002-09-01 | 7.25 |
| AAPL | 2002-12-01 | 7.16 |
| AAPL | 2003-01-01 | 7.18 |
| AAPL | 2003-02-01 | 7.51 |
| AAPL | 2003-03-01 | 7.07 |
| AAPL | 2003-04-01 | 7.11 |
user&gt; (ds/min-n-by-column ds "price" 10 &gt; nil)
test/data/stocks.csv [10 3]:
| symbol | date | price |
|--------|------------|-------:|
| GOOG | 2007-09-01 | 567.27 |
| GOOG | 2007-10-01 | 707.00 |
| GOOG | 2007-11-01 | 693.00 |
| GOOG | 2007-12-01 | 691.48 |
| GOOG | 2008-01-01 | 564.30 |
| GOOG | 2008-04-01 | 574.29 |
| GOOG | 2008-05-01 | 585.80 |
| GOOG | 2009-11-01 | 583.00 |
| GOOG | 2009-12-01 | 619.98 |
| GOOG | 2010-03-01 | 560.19 |
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L845">view source</a></div></div><div class="public anchor" id="var-missing"><h3>missing</h3><div class="usage"><code>(missing dataset-or-col)</code></div><div class="doc"><div class="markdown"><p>Given a dataset or a column, return the missing set as a roaring bitmap</p>
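<p>A small sketch of inspecting missing sets, assuming the <code>ds</code> alias. The dataset <code>sparse-ds</code> is hypothetical and reuses the shape of the <code>rows</code> example on this page.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Row 1 has no :b value and row 2 has no :a value.
(def sparse-ds (ds/-&gt;dataset [{:a 1 :b 2} {:a 2} {:b 3}]))

;; Missing set for the whole dataset.
(ds/missing sparse-ds)

;; Missing set for a single column, converted to a vector of row indexes.
(vec (ds/missing (ds/column sparse-ds :a)))
</code></pre>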
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L893">view source</a></div></div><div class="public anchor" id="var-new-column"><h3>new-column</h3><div class="usage"><code>(new-column name data)</code><code>(new-column name data metadata)</code><code>(new-column name data metadata missing)</code><code>(new-column data-or-data-map)</code></div><div class="doc"><div class="markdown"><p>Create a new column. Data will be scanned for missing values
unless the full 4-argument pathway is used.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L899">view source</a></div></div><div class="public anchor" id="var-new-dataset"><h3>new-dataset</h3><div class="usage"><code>(new-dataset options ds-metadata column-seq)</code><code>(new-dataset options column-seq)</code><code>(new-dataset column-seq)</code></div><div class="doc"><div class="markdown"><p>Create a new dataset from a sequence of columns. Data will be converted
into columns using ds-col-proto/ensure-column-seq. If the column seq is simply a
collection of vectors, for instance, columns will be named ordinally.</p>
<p>Options:</p>
<ul>
<li><code>:dataset-name</code> - Name of the dataset. Defaults to "_unnamed".</li>
<li><code>:key-fn</code> - Key function used on all column names before insertion into dataset.</li>
</ul>
<p>The return value fulfills the dataset protocols.</p>
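<p>A minimal sketch of building a dataset from explicit columns, assuming the <code>ds</code> alias. The column names and the dataset name <code>"example"</code> are illustrative only.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Two columns built with the 2-argument form of new-column,
;; so the data is scanned for missing values.
(def cols [(ds/new-column :a [1 2 3])
           (ds/new-column :b ["x" "y" "z"])])

(def built (ds/new-dataset {:dataset-name "example"} cols))

(ds/column-names built)  ;; column names in insertion order
(ds/row-count built)     ;; number of rows
</code></pre>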
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L912">view source</a></div></div><div class="public anchor" id="var-order-column-names"><h3>order-column-names</h3><div class="usage"><code>(order-column-names dataset colname-seq)</code></div><div class="doc"><div class="markdown"><p>Order a sequence of column names so they match the order in the
original dataset. Missing columns are placed last.</p>
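<p>For instance, a sketch assuming the <code>ds</code> alias and a hypothetical three-column dataset <code>abc</code>:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Columns are created in the order :a, :b, :c.
(def abc (ds/-&gt;dataset {:a [1 2] :b [3 4] :c [5 6]}))

;; The requested names come back in dataset order, so :a precedes :c.
(ds/order-column-names abc [:c :a])
</code></pre>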
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L929">view source</a></div></div><div class="public anchor" id="var-pmap-ds"><h3>pmap-ds</h3><div class="usage"><code>(pmap-ds ds ds-map-fn options)</code><code>(pmap-ds ds ds-map-fn)</code></div><div class="doc"><div class="markdown"><p>Parallelize mapping a function from dataset-&gt;dataset across a single dataset. Results are
coalesced back into a single dataset. The original dataset is simply sliced into n-core
results and ds-map-fn is called n-core times. ds-map-fn must be a function from
dataset-&gt;dataset although it may return nil.</p>
<p>Options:</p>
<ul>
<li><code>:max-batch-size</code> - this is a default for tech.v3.parallel.for/indexed-map-reduce. You
can control how many rows are processed in a given batch - the default is 64000. If your
mapping pathway produces a large expansion in the size of the dataset then it may be
good to reduce the max batch size and use :as-seq to produce a sequence of datasets.</li>
<li><code>:result-type</code>
<ul>
<li><code>:as-seq</code> - Return a sequence of datasets, one for each batch.</li>
<li><code>:as-ds</code> - Return a single dataset with all results in memory (default option).</li>
</ul>
</li>
</ul>
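<p>A sketch of a parallel filter using the options above, assuming the <code>ds</code> alias. The dataset <code>big</code>, the helper <code>keep-even</code>, and the batch size are arbitrary choices for illustration.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical single-column dataset with 10000 rows.
(def big (ds/-&gt;dataset {:x (range 10000)}))

;; dataset -&gt; dataset function applied to each slice.
(defn keep-even [chunk]
  (ds/filter-column chunk :x even?))

;; Default result type: one dataset holding all results.
(def evens (ds/pmap-ds big keep-even {:max-batch-size 1000}))

;; Alternative: a sequence of datasets, one per batch.
(def even-batches
  (ds/pmap-ds big keep-even {:max-batch-size 1000
                             :result-type :as-seq}))

(ds/row-count evens)
</code></pre>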
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L936">view source</a></div></div><div class="public anchor" id="var-print-all"><h3>print-all</h3><div class="usage"><code>(print-all dataset)</code></div><div class="doc"><div class="markdown"><p>Helper function equivalent to <code>(tech.v3.dataset.print/print-range ... :all)</code></p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L957">view source</a></div></div><div class="public anchor" id="var-rand-nth"><h3>rand-nth</h3><div class="usage"><code>(rand-nth dataset)</code></div><div class="doc"><div class="markdown"><p>Return a random row from the dataset in map format</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L963">view source</a></div></div><div class="public anchor" id="var-remove-column"><h3>remove-column</h3><div class="usage"><code>(remove-column dataset col-name)</code></div><div class="doc"><div class="markdown"><p>Same as:</p>
<pre><code class="language-clojure">(dissoc dataset col-name)
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L969">view source</a></div></div><div class="public anchor" id="var-remove-columns"><h3>remove-columns</h3><div class="usage"><code>(remove-columns dataset colname-seq-or-fn)</code></div><div class="doc"><div class="markdown"><p>Remove columns indexed by column name seq or column filter function.
For example:</p>
<pre><code class="language-clojure"> (remove-columns DS [:A :B])
(remove-columns DS cf/categorical)
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L979">view source</a></div></div><div class="public anchor" id="var-remove-empty-columns"><h3>remove-empty-columns</h3><div class="usage"><code>(remove-empty-columns ds)</code></div><div class="doc"><div class="markdown"><p>Remove all columns that have no data - missing set length equals row count.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L991">view source</a></div></div><div class="public anchor" id="var-remove-rows"><h3>remove-rows</h3><div class="usage"><code>(remove-rows dataset-or-col row-indexes)</code></div><div class="doc"><div class="markdown"><p>Same as drop-rows.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L997">view source</a></div></div><div class="public anchor" id="var-rename-columns"><h3>rename-columns</h3><div class="usage"><code>(rename-columns dataset colnames)</code></div><div class="doc"><div class="markdown"><p>Rename columns using a map or vector of column names.</p>
<p>Does not reorder columns; rename is in-place for maps and
positional for vectors.</p>
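<p>Both forms can be sketched as follows, assuming the <code>ds</code> alias and a hypothetical dataset <code>ab</code>:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def ab (ds/-&gt;dataset {:a [1 2] :b [3 4]}))

;; Map form: rename :a to :x and leave :b where it is.
(ds/column-names (ds/rename-columns ab {:a :x}))

;; Vector form: names are applied by position.
(ds/column-names (ds/rename-columns ab [:x :y]))
</code></pre>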
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1003">view source</a></div></div><div class="public anchor" id="var-replace-missing"><h3>replace-missing</h3><div class="usage"><code>(replace-missing ds)</code><code>(replace-missing ds strategy)</code><code>(replace-missing ds columns-selector strategy)</code><code>(replace-missing ds columns-selector strategy value)</code></div><div class="doc"><div class="markdown"><p>Replace missing values in some columns with a given strategy.
The columns selector may be:</p>
<ul>
<li>seq of any legal column names</li>
<li>or a column filter function, such as <code>numeric</code> and <code>categorical</code></li>
</ul>
<p>Strategies may be:</p>
<ul>
<li>
<p><code>:down</code> - take value from previous non-missing row if possible else use provided value.</p>
</li>
<li>
<p><code>:up</code> - take value from next non-missing row if possible else use provided value.</p>
</li>
<li>
<p><code>:downup</code> - take value from previous if possible else use next.</p>
</li>
<li>
<p><code>:updown</code> - take value from next if possible else use previous.</p>
</li>
<li>
<p><code>:nearest</code> - Use nearest of next or previous values. <code>:mid</code> is an alias for <code>:nearest</code>.</p>
</li>
<li>
<p><code>:midpoint</code> - Use midpoint of averaged values between previous and next nonmissing
rows.</p>
</li>
<li>
<p><code>:abb</code> - Impute missing with approximate bayesian bootstrap. See <a href="https://search.r-project.org/CRAN/refmans/LaplacesDemon/html/ABB.html">r's ABB</a>.</p>
</li>
<li>
<p><code>:lerp</code> - Linearly interpolate values between previous and next nonmissing rows.</p>
</li>
<li>
<p><code>:value</code> - Use the provided value - see below.</p>
<p>A value may be provided, which will then be used. The value may be a function, in which
case it will be called on the column with missing values elided and the return value will
be used as the filler.</p>
</li>
</ul>
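<p>A minimal sketch of two of the strategies above, assuming the <code>ds</code> alias. The dataset <code>gaps</code> is hypothetical, and <code>nil</code> entries are assumed to be parsed as missing values.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Rows 1 and 2 of column :a are missing.
(def gaps (ds/-&gt;dataset {:a [1 nil nil 4]}))

;; :down carries the previous non-missing value forward.
(def filled-down (ds/replace-missing gaps [:a] :down))

;; :value fills every missing entry with the supplied constant.
(def filled-zero (ds/replace-missing gaps [:a] :value 0))

(vec (ds/column filled-down :a))
(vec (ds/column filled-zero :a))
</code></pre>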
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1012">view source</a></div></div><div class="public anchor" id="var-replace-missing-value"><h3>replace-missing-value</h3><div class="usage"><code>(replace-missing-value dataset filter-fn-or-ds scalar-value)</code><code>(replace-missing-value dataset scalar-value)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1045">view source</a></div></div><div class="public anchor" id="var-reverse-rows"><h3>reverse-rows</h3><div class="usage"><code>(reverse-rows dataset-or-col)</code></div><div class="doc"><div class="markdown"><p>Reverse the rows in the dataset or column.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1052">view source</a></div></div><div class="public anchor" id="var-row-at"><h3>row-at</h3><div class="usage"><code>(row-at ds idx)</code></div><div class="doc"><div class="markdown"><p>Get the row at an individual index. If indexes are negative then the dataset
is indexed from the end.</p>
<pre><code class="language-clojure">user&gt; (ds/row-at stocks 1)
{"date" #object[java.time.LocalDate 0x534cb03b "2000-02-01"],
"symbol" "MSFT",
"price" 36.35}
user&gt; (ds/row-at stocks -1)
{"date" #object[java.time.LocalDate 0x6bf60ed5 "2010-03-01"],
"symbol" "AAPL",
"price" 223.02}
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1058">view source</a></div></div><div class="public anchor" id="var-row-count"><h3>row-count</h3><div class="usage"><code>(row-count dataset-or-col)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1076">view source</a></div></div><div class="public anchor" id="var-row-map"><h3>row-map</h3><div class="usage"><code>(row-map ds map-fn options)</code><code>(row-map ds map-fn)</code></div><div class="doc"><div class="markdown"><p>Map a function across the rows of the dataset producing a new dataset
that is merged back into the original potentially replacing existing columns.
Options are passed into the <a href="tech.v3.dataset.html#var--.3Edataset">-&gt;dataset</a> function so you can control the resulting
column types by the usual dataset parsing options described there.</p>
<p>Options:</p>
<p>See options for <a href="tech.v3.dataset.html#var-pmap-ds">pmap-ds</a>. In particular, note that you can
produce a sequence of datasets as opposed to a single large dataset.</p>
<p>Speed demons should attempt both <code>{:copying? false}</code> and <code>{:copying? true}</code> in the options
map as that changes rather drastically how data is read from the datasets. If you are
going to read all the data in the dataset, <code>{:copying? true}</code> will most likely be
the faster of the two.</p>
<p>Examples:</p>
<pre><code class="language-clojure">user&gt; (def stocks (ds/-&gt;dataset "test/data/stocks.csv"))
#'user/stocks
user&gt; (ds/head stocks)
test/data/stocks.csv [5 3]:
| symbol | date | price |
|--------|------------|------:|
| MSFT | 2000-01-01 | 39.81 |
| MSFT | 2000-02-01 | 36.35 |
| MSFT | 2000-03-01 | 43.22 |
| MSFT | 2000-04-01 | 28.37 |
| MSFT | 2000-05-01 | 25.45 |
user&gt; (ds/head (ds/row-map stocks (fn [row]
{"symbol" (keyword (row "symbol"))
:price2 (* (row "price")(row "price"))})))
test/data/stocks.csv [5 4]:
| symbol | date | price | :price2 |
|--------|------------|------:|----------:|
| :MSFT | 2000-01-01 | 39.81 | 1584.8361 |
| :MSFT | 2000-02-01 | 36.35 | 1321.3225 |
| :MSFT | 2000-03-01 | 43.22 | 1867.9684 |
| :MSFT | 2000-04-01 | 28.37 | 804.8569 |
| :MSFT | 2000-05-01 | 25.45 | 647.7025 |
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1081">view source</a></div></div><div class="public anchor" id="var-row-mapcat"><h3>row-mapcat</h3><div class="usage"><code>(row-mapcat ds mapcat-fn options)</code><code>(row-mapcat ds mapcat-fn)</code></div><div class="doc"><div class="markdown"><p>Map a function across the rows of the dataset. The function must produce a sequence of
maps and the original dataset rows will be duplicated and then merged into the result
of calling <code>(-&gt;&gt; (apply concat) (-&gt;&gt;dataset options))</code> on the result of <code>mapcat-fn</code>. Options
are the same as <a href="tech.v3.dataset.html#var--.3Edataset">-&gt;dataset</a>.</p>
<p>The smaller the maps returned from mapcat-fn the better, perhaps consider using records.
In the case that a mapcat-fn result map has a key that overlaps a column name the
column will be replaced with the output of mapcat-fn. The returned map will have the
key <code>:_row-id</code> assoc'd onto it so for absolutely minimal gc usage include this
as a member variable in your map.</p>
<p>Options:</p>
<ul>
<li>See options for <a href="tech.v3.dataset.html#var-pmap-ds">pmap-ds</a>. Especially note <code>:max-batch-size</code> and <code>:result-type</code>.
In order to conserve memory it may be much more efficient to return a sequence of datasets
rather than one large dataset. If returning sequences of datasets perhaps consider
a transducing pathway across them or the <a href="tech.v3.dataset.reductions.html">tech.v3.dataset.reductions</a> namespace.</li>
</ul>
<p>Example:</p>
<pre><code class="language-clojure">user&gt; (def ds (ds/-&gt;dataset {:rid (range 10)
:data (repeatedly 10 #(rand-int 3))}))
#'user/ds
user&gt; (ds/head ds)
_unnamed [5 2]:
| :rid | :data |
|-----:|------:|
| 0 | 0 |
| 1 | 2 |
| 2 | 0 |
| 3 | 1 |
| 4 | 2 |
user&gt; (def mapcat-fn (fn [row]
(for [idx (range (row :data))]
{:idx idx})))
#'user/mapcat-fn
user&gt; (mapcat mapcat-fn (ds/rows ds))
({:idx 0} {:idx 1} {:idx 0} {:idx 0} {:idx 1} {:idx 0} {:idx 1} {:idx 0} {:idx 1})
user&gt; (ds/row-mapcat ds mapcat-fn)
_unnamed [9 3]:
| :rid | :data | :idx |
|-----:|------:|-----:|
| 1 | 2 | 0 |
| 1 | 2 | 1 |
| 3 | 1 | 0 |
| 4 | 2 | 0 |
| 4 | 2 | 1 |
| 6 | 2 | 0 |
| 6 | 2 | 1 |
| 8 | 2 | 0 |
| 8 | 2 | 1 |
user&gt;
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1132">view source</a></div></div><div class="public anchor" id="var-rows"><h3>rows</h3><div class="usage"><code>(rows ds options)</code><code>(rows ds)</code></div><div class="doc"><div class="markdown"><p>Get the rows of the dataset as a list of potentially flyweight maps.</p>
<p>Options:</p>
<ul>
<li>copying? - When true the data is copied out of the dataset row by row upon read of that
row. When false the data is only referenced upon each read of a particular key. Copying
is appropriate if you want to use the row values as keys in a map and it is inappropriate if
you are only going to read a very small portion of the row map.</li>
<li>nil-missing? - When true, maps returned have nil values for missing entries as opposed
to eliding the missing keys entirely. It is legacy behavior and slightly faster to
use <code>:nil-missing? true</code>.</li>
</ul>
<pre><code class="language-clojure">user&gt; (take 5 (ds/rows stocks))
({"date" #object[java.time.LocalDate 0x6c433971 "2000-01-01"],
"symbol" "MSFT",
"price" 39.81}
{"date" #object[java.time.LocalDate 0x28f96b14 "2000-02-01"],
"symbol" "MSFT",
"price" 36.35}
{"date" #object[java.time.LocalDate 0x7bdbf0a "2000-03-01"],
"symbol" "MSFT",
"price" 43.22}
{"date" #object[java.time.LocalDate 0x16d3871e "2000-04-01"],
"symbol" "MSFT",
"price" 28.37}
{"date" #object[java.time.LocalDate 0x47094da0 "2000-05-01"],
"symbol" "MSFT",
"price" 25.45})
user&gt; (ds/rows (ds/-&gt;dataset [{:a 1 :b 2} {:a 2} {:b 3}]))
[{:a 1, :b 2} {:a 2} {:b 3}]
user&gt; (ds/rows (ds/-&gt;dataset [{:a 1 :b 2} {:a 2} {:b 3}]) {:nil-missing? true})
[{:a 1, :b 2} {:a 2, :b nil} {:a nil, :b 3}]
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1195">view source</a></div></div><div class="public anchor" id="var-rowvec-at"><h3>rowvec-at</h3><div class="usage"><code>(rowvec-at ds idx)</code></div><div class="doc"><div class="markdown"><p>Return a persistent-vector-like row at a given index. Negative indexes index
from the end.</p>
<pre><code class="language-clojure">user&gt; (ds/rowvec-at stocks 1)
["MSFT" #object[java.time.LocalDate 0x5848b8b3 "2000-02-01"] 36.35]
user&gt; (ds/rowvec-at stocks -1)
["AAPL" #object[java.time.LocalDate 0x4b70b0d5 "2010-03-01"] 223.02]
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1239">view source</a></div></div><div class="public anchor" id="var-rowvecs"><h3>rowvecs</h3><div class="usage"><code>(rowvecs ds options)</code><code>(rowvecs ds)</code></div><div class="doc"><div class="markdown"><p>Return a randomly addressable list of rows in persistent vector-like form.</p>
<p>Options:</p>
<ul>
<li>copying? - When true the data is copied out of the dataset row by row upon read of that
row. When false the data is only referenced upon each read of a particular key. Copying
is appropriate if you want to use the row values as keys in a map and it is inappropriate if
you are only going to read a given key for a given row once.</li>
</ul>
<pre><code class="language-clojure">user&gt; (take 5 (ds/rowvecs stocks))
(["MSFT" #object[java.time.LocalDate 0x5be9e4c8 "2000-01-01"] 39.81]
["MSFT" #object[java.time.LocalDate 0xf758e5 "2000-02-01"] 36.35]
["MSFT" #object[java.time.LocalDate 0x752cc84d "2000-03-01"] 43.22]
["MSFT" #object[java.time.LocalDate 0x7bad4827 "2000-04-01"] 28.37]
["MSFT" #object[java.time.LocalDate 0x3a62c34a "2000-05-01"] 25.45])
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1253">view source</a></div></div><div class="public anchor" id="var-sample"><h3>sample</h3><div class="usage"><code>(sample dataset n options)</code><code>(sample dataset n)</code><code>(sample dataset)</code></div><div class="doc"><div class="markdown"><p>Sample n-rows from a dataset. Defaults to sampling <em>without</em> replacement.</p>
<p>For the definition of seed, see the <a href="https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle">argshuffle documentation</a>.</p>
<p>The returned dataset's metadata is altered by merging in <code>{:print-index-range (range n)}</code> so you
will always see the entire returned dataset. If this isn't desired, <code>vary-meta</code> is a good pathway.</p>
<p>Options:</p>
<ul>
<li><code>:replacement?</code> - Do sampling with replacement. Defaults to false.</li>
<li><code>:seed</code> - Provide a seed as a number or provide a Random implementation.</li>
</ul>
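<p>A short sketch of seeded sampling, assuming the <code>ds</code> alias. The dataset <code>hundred</code> and the seed value are arbitrary.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def hundred (ds/-&gt;dataset {:x (range 100)}))

;; Ten rows without replacement; a numeric seed is supplied
;; so the draw can be repeated.
(def picked (ds/sample hundred 10 {:seed 42}))

;; Ten rows with replacement, so a row may appear more than once.
(def picked-with-dupes (ds/sample hundred 10 {:replacement? true :seed 42}))

(ds/row-count picked)
</code></pre>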
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1277">view source</a></div></div><div class="public anchor" id="var-select"><h3>select</h3><div class="usage"><code>(select dataset colname-seq selection)</code></div><div class="doc"><div class="markdown"><p>Reorder/trim dataset according to this sequence of indexes. Returns a new dataset.
colname-seq - one of:</p>
<ul>
<li>:all - all the columns</li>
<li>sequence of column names - those columns in that order.</li>
<li>implementation of java.util.Map - column order is dictated by map iteration order;
selected columns are subsequently named after the corresponding value in the map.
Similar to <code>rename-columns</code> except this trims the result to be only the columns
in the map.</li>
</ul>
<p>selection - either keyword :all, a list of indexes to select, or a list of booleans where
the index position of each true value indicates an index to select. When providing indices,
duplicates will select the specified index position more than once.</p>
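<p>A sketch of the column-sequence and map forms of <code>colname-seq</code>, assuming the <code>ds</code> alias and a hypothetical dataset <code>abc</code>:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def abc (ds/-&gt;dataset {:a [1 2 3] :b [4 5 6] :c [7 8 9]}))

;; Columns :c then :a, restricted to rows 0 and 2.
(def reordered (ds/select abc [:c :a] [0 2]))

;; Map form: keep only :a and name it :alpha in the result.
(def renamed (ds/select abc {:a :alpha} :all))

(ds/column-names reordered)
(ds/column-names renamed)
</code></pre>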
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1297">view source</a></div></div><div class="public anchor" id="var-select-by-index"><h3>select-by-index</h3><div class="usage"><code>(select-by-index dataset col-index row-index)</code></div><div class="doc"><div class="markdown"><p>Trim dataset according to this sequence of indexes. Returns a new dataset.</p>
<p>col-index and row-index - one of:</p>
<ul>
<li>:all - all the columns or rows</li>
<li>list of indexes. May contain duplicates. Negative values will be counted from
the end of the sequence.</li>
</ul>
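<p>For example, a sketch assuming the <code>ds</code> alias and a hypothetical 3x3 dataset <code>abc</code>, using negative values to count from the end:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def abc (ds/-&gt;dataset {:a [1 2 3] :b [4 5 6] :c [7 8 9]}))

;; First and last column, first and last row.
(def corners (ds/select-by-index abc [0 -1] [0 -1]))

(ds/column-names corners)
(ds/row-count corners)
</code></pre>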
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1314">view source</a></div></div><div class="public anchor" id="var-select-columns"><h3>select-columns</h3><div class="usage"><code>(select-columns dataset colname-seq-or-fn)</code></div><div class="doc"><div class="markdown"><p>Select columns from the dataset by:</p>
<ul>
<li>seq of column names</li>
<li>column selector function</li>
<li><code>:all</code> keyword</li>
</ul>
<p>For example:</p>
<pre><code class="language-clojure">(select-columns DS [:A :B])
(select-columns DS cf/numeric)
(select-columns DS :all)
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1326">view source</a></div></div><div class="public anchor" id="var-select-columns-by-index"><h3>select-columns-by-index</h3><div class="usage"><code>(select-columns-by-index dataset col-index)</code></div><div class="doc"><div class="markdown"><p>Select columns from the dataset by a seq of indexes (negative values allowed) or :all.</p>
<p>See documentation for <code>select-by-index</code>.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1344">view source</a></div></div><div class="public anchor" id="var-select-missing"><h3>select-missing</h3><div class="usage"><code>(select-missing dataset-or-col)</code></div><div class="doc"><div class="markdown"><p>Remove missing entries by simply selecting out the missing indexes</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1352">view source</a></div></div><div class="public anchor" id="var-select-rows"><h3>select-rows</h3><div class="usage"><code>(select-rows dataset-or-col row-indexes options)</code><code>(select-rows dataset-or-col row-indexes)</code></div><div class="doc"><div class="markdown"><p>Select rows from the dataset or column.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1358">view source</a></div></div><div class="public anchor" id="var-set-dataset-name"><h3>set-dataset-name</h3><div class="usage"><code>(set-dataset-name dataset ds-name)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1366">view source</a></div></div><div class="public anchor" id="var-shape"><h3>shape</h3><div class="usage"><code>(shape dataset)</code></div><div class="doc"><div class="markdown"><p>Returns shape in column-major format of <code>[n-columns n-rows]</code>.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1371">view source</a></div></div><div class="public anchor" id="var-shuffle"><h3>shuffle</h3><div class="usage"><code>(shuffle dataset options)</code><code>(shuffle dataset)</code></div><div class="doc"><div class="markdown"><p>Shuffle the rows of the dataset optionally providing a seed.
See <a href="https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle">https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle</a>.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1377">view source</a></div></div><div class="public anchor" id="var-sort-by"><h3>sort-by</h3><div class="usage"><code>(sort-by dataset key-fn compare-fn &amp; args)</code><code>(sort-by dataset key-fn)</code></div><div class="doc"><div class="markdown"><p>Sort a dataset by a key-fn and compare-fn.</p>
<ul>
<li><code>key-fn</code> - function from map to sort value.</li>
<li><code>compare-fn</code> may be one of:
<ul>
<li>a clojure operator like clojure.core/&lt;</li>
<li><code>:tech.numerics/&lt;</code>, <code>:tech.numerics/&gt;</code> for unboxing comparisons of primitive
values.</li>
<li>clojure.core/compare</li>
<li>A custom java.util.Comparator instantiation.</li>
</ul>
</li>
</ul>
<p>Options:</p>
<ul>
<li><code>:nan-strategy</code> - General missing strategy. Options are <code>:first</code>, <code>:last</code>, and
<code>:exception</code>.</li>
<li><code>:parallel?</code> - Uses parallel quicksort when true and regular quicksort when false.</li>
</ul>
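<p>A minimal sketch of sorting with a key function, assuming the <code>ds</code> alias. The dataset <code>people</code> is made up, and the keyword <code>:age</code> serves as the key-fn because each row is passed as a map.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def people (ds/-&gt;dataset [{:name "b" :age 30}
                           {:name "a" :age 25}
                           {:name "c" :age 35}]))

;; Sort by the value the key-fn extracts from each row map,
;; using clojure.core/&gt; for descending order.
(def oldest-first (ds/sort-by people :age &gt;))

(vec (ds/column oldest-first :age))
</code></pre>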
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1386">view source</a></div></div><div class="public anchor" id="var-sort-by-column"><h3>sort-by-column</h3><div class="usage"><code>(sort-by-column dataset colname compare-fn &amp; args)</code><code>(sort-by-column dataset colname)</code></div><div class="doc"><div class="markdown"><p>Sort a dataset by a given column using the given compare fn.</p>
<ul>
<li><code>compare-fn</code> may be one of:
<ul>
<li>a clojure operator like clojure.core/&lt;</li>
<li><code>:tech.numerics/&lt;</code>, <code>:tech.numerics/&gt;</code> for unboxing comparisons of primitive
values.</li>
<li>clojure.core/compare</li>
<li>A custom java.util.Comparator instantiation.</li>
</ul>
</li>
</ul>
<p>Options:</p>
<ul>
<li><code>:nan-strategy</code> - General missing strategy. Options are <code>:first</code>, <code>:last</code>, and
<code>:exception</code>.</li>
<li><code>:parallel?</code> - Uses parallel quicksort when true and regular quicksort when false.</li>
</ul>
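<p>A sketch of a column sort using one of the keyword comparators listed above, assuming the <code>ds</code> alias and a hypothetical dataset <code>people</code>:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def people (ds/-&gt;dataset [{:name "b" :age 30}
                           {:name "a" :age 25}
                           {:name "c" :age 35}]))

;; Descending sort on :age with the unboxed numeric comparator.
(def by-age-desc (ds/sort-by-column people :age :tech.numerics/&gt;))

(vec (ds/column by-age-desc :age))
</code></pre>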
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1408">view source</a></div></div><div class="public anchor" id="var-tail"><h3>tail</h3><div class="usage"><code>(tail dataset n)</code><code>(tail dataset)</code></div><div class="doc"><div class="markdown"><p>Get the last n rows of a dataset. Equivalent to
<code>(select-rows ds (range ...))</code>. Argument order is dataset-first, so this can
be used in -&gt; pipelines.</p>
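<p>For example, assuming the <code>ds</code> alias and a hypothetical 100-row dataset <code>hundred</code>:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def hundred (ds/-&gt;dataset {:x (range 100)}))

;; Keep only the last three rows.
(def last-three (ds/tail hundred 3))

(vec (ds/column last-three :x))
</code></pre>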
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1429">view source</a></div></div><div class="public anchor" id="var-take-nth"><h3>take-nth</h3><div class="usage"><code>(take-nth dataset n-val)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1439">view source</a></div></div><div class="public anchor" id="var-unique-by"><h3>unique-by</h3><div class="usage"><code>(unique-by dataset options map-fn)</code><code>(unique-by dataset map-fn)</code></div><div class="doc"><div class="markdown"><p>Map-fn function gets passed map for each row, rows are grouped by the
return value. Keep-fn is used to decide the index to keep.</p>
<p>:keep-fn - Function from key,idx-seq-&gt;idx. Defaults to #(first %2).</p>
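<p>A sketch of both arities, assuming the <code>ds</code> alias. The dataset <code>trades</code> is hypothetical, and the custom <code>:keep-fn</code> assumes the index sequence it receives can be traversed with <code>last</code>.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def trades (ds/-&gt;dataset [{:sym "A" :px 1}
                           {:sym "B" :px 2}
                           {:sym "A" :px 3}]))

;; Default keep-fn: the first index seen for each :sym is kept.
(def first-per-sym (ds/unique-by trades :sym))

;; Custom keep-fn: keep the last index of each group instead.
(def last-per-sym
  (ds/unique-by trades
                {:keep-fn (fn [_key idx-seq] (last idx-seq))}
                :sym))

(ds/row-count first-per-sym)
</code></pre>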
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1444">view source</a></div></div><div class="public anchor" id="var-unique-by-column"><h3>unique-by-column</h3><div class="usage"><code>(unique-by-column dataset options colname)</code><code>(unique-by-column dataset colname)</code></div><div class="doc"><div class="markdown"><p>Rows are grouped by their value in the column <code>colname</code>.
Keep-fn is used to decide the index to keep.</p>
<p>:keep-fn - Function from key, idx-seq-&gt;idx. Defaults to #(first %2).</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1455">view source</a></div></div><div class="public anchor" id="var-unordered-select"><h3>unordered-select</h3><div class="usage"><code>(unordered-select dataset colname-seq index-seq)</code></div><div class="doc"><div class="markdown"><p>Perform a selection but use the order of the columns in the existing table; do
<em>not</em> reorder the columns based on colname-seq. Useful when doing selection based
on sets or persistent hash maps.</p>
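<p>A sketch contrasting this with an ordered selection, assuming the <code>ds</code> alias and a hypothetical dataset <code>abc</code>:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def abc (ds/-&gt;dataset {:a [1 2 3] :b [4 5 6] :c [7 8 9]}))

;; The requested order [:c :a] is ignored; the dataset's own
;; column order (:a before :c) is kept.
(def picked (ds/unordered-select abc [:c :a] [0 1]))

(ds/column-names picked)
</code></pre>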
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1466">view source</a></div></div><div class="public anchor" id="var-unroll-column"><h3>unroll-column</h3><div class="usage"><code>(unroll-column dataset column-name)</code><code>(unroll-column dataset column-name options)</code></div><div class="doc"><div class="markdown"><p>Unroll a column that has some (or all) sequential data as entries.
Returns a new dataset with same columns but with other columns duplicated
where the unroll happened. Column now contains only scalar data.</p>
<p>Any missing indexes are dropped.</p>
<pre><code class="language-clojure">user&gt; (-&gt; (ds/-&gt;dataset [{:a 1 :b [2 3]}
{:a 2 :b [4 5]}
{:a 3 :b :a}])
(ds/unroll-column :b {:indexes? true}))
_unnamed [5 3]:
| :a | :b | :indexes |
|----+----+----------|
| 1 | 2 | 0 |
| 1 | 3 | 1 |
| 2 | 4 | 0 |
| 2 | 5 | 1 |
| 3 | :a | 0 |
</code></pre>
<p>Options:</p>
<ul>
<li><code>:datatype</code> - datatype of the resulting column if one aside from :object is desired.</li>
<li><code>:indexes?</code> - If true, create a new column that records the indexes of the values from
the original column. Can also be a truthy value (like a keyword) and the column
will be named this.</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1474">view source</a></div></div><div class="public anchor" id="var-update"><h3>update</h3><div class="usage"><code>(update lhs-ds filter-fn-or-ds update-fn &amp; args)</code></div><div class="doc"><div class="markdown"><p>Update this dataset. Filters this dataset into a new dataset,
applies update-fn, then merges the result into the original dataset.</p>
<p>This pathway is designed to work with the tech.v3.dataset.column-filters namespace.</p>
<ul>
<li><code>filter-fn-or-ds</code> is a generalized parameter. May be a function,
a dataset or a sequence of column names.</li>
<li>update-fn must take the dataset as the first argument and must return
a dataset.</li>
</ul>
<pre><code class="language-clojure">(ds/bind-&gt; (ds/-&gt;dataset dataset) ds
  (ds/remove-column "Id")
  (ds/update cf/string ds/replace-missing-value "NA")
  (ds/update-elemwise cf/string #(get {"" "NA"} % %))
  (ds/update cf/numeric ds/replace-missing-value 0)
  (ds/update cf/boolean ds/replace-missing-value false)
  (ds/update-columnwise (cf/union (cf/numeric ds) (cf/boolean ds))
                        #(dtype/elemwise-cast % :float64)))
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1508">view source</a></div></div><div class="public anchor" id="var-update-column"><h3>update-column</h3><div class="usage"><code>(update-column dataset col-name update-fn)</code></div><div class="doc"><div class="markdown"><p>Update a column returning a new dataset. update-fn is a column-&gt;column
transformation. Error if column does not exist.</p>
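<p>A minimal sketch, assuming the <code>dfn</code> alias refers to <code>tech.v3.datatype.functional</code> as in the <code>update-columns</code> example. The dataset <code>scores</code> is hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.functional :as dfn])

;; Hypothetical two-column dataset.
(def scores (ds/-&gt;dataset {:a [1 2 3]
                            :b [10 20 30]}))

;; update-fn receives the :a column and returns its replacement;
;; column :b is left untouched.
(def bumped (ds/update-column scores :a #(dfn/+ % 10)))

(vec (ds/column bumped :a))
</code></pre>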
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1534">view source</a></div></div><div class="public anchor" id="var-update-columns"><h3>update-columns</h3><div class="usage"><code>(update-columns dataset column-name-seq-or-fn update-fn)</code></div><div class="doc"><div class="markdown"><p>Update a sequence of columns selected by column name seq or column selector
function.</p>
<p>For example:</p>
<pre><code class="language-clojure">(update-columns DS [:A :B] #(dfn/+ % 2))
(update-columns DS cf/numeric #(dfn// % 2))
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1541">view source</a></div></div><div class="public anchor" id="var-update-columnwise"><h3>update-columnwise</h3><div class="usage"><code>(update-columnwise dataset filter-fn-or-ds cwise-update-fn &amp; args)</code></div><div class="doc"><div class="markdown"><p>Call update-fn on each column of the dataset. Returns the dataset.
See the arguments to <code>update</code>.</p>
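<p>The sketch below mirrors the cast used in the <code>update</code> example, but selects columns with a plain sequence of column names. The dataset <code>ints</code> is hypothetical, and the <code>dtype</code> alias is assumed to refer to <code>tech.v3.datatype</code>.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype :as dtype])

;; Hypothetical dataset with two integer columns and one string column.
(def ints (ds/-&gt;dataset {:a [1 2 3]
                          :b [4 5 6]
                          :label ["x" "y" "z"]}))

;; Cast only the named columns; :label is not passed to the function.
(def floats
  (ds/update-columnwise ints [:a :b] #(dtype/elemwise-cast % :float64)))

(dtype/elemwise-datatype (ds/column floats :a))
</code></pre>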
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1555">view source</a></div></div><div class="public anchor" id="var-update-elemwise"><h3>update-elemwise</h3><div class="usage"><code>(update-elemwise dataset filter-fn-or-ds map-fn)</code><code>(update-elemwise dataset map-fn)</code></div><div class="doc"><div class="markdown"><p>Replace all elements in selected columns by calling selected function on each
element. filter-fn-or-ds, if provided, follows the same rules as in update; for
example, it may be a sequence of column names. Implicitly clears the missing set, so
the function must deal with type-specific missing values correctly.
Returns a new dataset.</p>
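<p>A small sketch that upper-cases every element of one string column, with the column chosen by a sequence of column names. The dataset <code>people</code> is hypothetical and contains no missing values, so the missing-set caveat above does not come into play.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[clojure.string :as str])

;; Hypothetical dataset without missing values.
(def people (ds/-&gt;dataset {:name ["ann" "bob"]
                            :age  [31 42]}))

;; map-fn is called once per element of the selected column.
(def shouted (ds/update-elemwise people [:name] str/upper-case))

(vec (ds/column shouted :name))
</code></pre>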
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1562">view source</a></div></div><div class="public anchor" id="var-value-reader"><h3>value-reader</h3><div class="usage"><code>(value-reader dataset options)</code><code>(value-reader dataset)</code></div><div class="doc"><div class="markdown"><p>Return a reader that produces a reader of column values per index.
Options:
:copying? - Defaults to false. When true, row values are copied on read.</p>
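<p>A minimal sketch of reading rows as sequences of column values. The dataset <code>small</code> is hypothetical; each row is turned into a vector only to make it easy to inspect.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical two-row dataset.
(def small (ds/-&gt;dataset {:a [1 2]
                           :b ["x" "y"]}))

;; The outer reader is indexed by row; each entry reads that row's
;; values in column order.
(def rows (ds/value-reader small))

(mapv vec rows)
</code></pre>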
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1574">view source</a></div></div><div class="public anchor" id="var-write.21"><h3>write!</h3><div class="usage"><code>(write! dataset output-path options)</code><code>(write! dataset output-path)</code></div><div class="doc"><div class="markdown"><p>Write a dataset out to a file. Supported forms are:</p>
<pre><code class="language-clojure">(ds/write! test-ds "test.csv")
(ds/write! test-ds "test.tsv")
(ds/write! test-ds "test.tsv.gz")
(ds/write! test-ds "test.nippy")
(ds/write! test-ds out-stream)
</code></pre>
<p>Options:</p>
<ul>
<li><code>:max-chars-per-column</code> - csv,tsv specific, defaults to 65536 - values longer than this will
cause an exception during serialization.</li>
<li><code>:max-num-columns</code> - csv,tsv specific, defaults to 8192 - If the dataset has more than this number of
columns an exception will be thrown during serialization.</li>
<li><code>:quoted-columns</code> - csv specific - sequence of column names that you would like to always have quoted.</li>
<li><code>:file-type</code> - Manually specify the file type. This is usually inferred from the filename but if you
pass in an output stream then you will need to specify the file type.</li>
<li><code>:headers?</code> - whether csv headers are written; defaults to true.</li>
</ul>
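<p>The sketch below illustrates the <code>:file-type</code> option when the target is an output stream and there is no filename to infer the format from. The dataset <code>out-ds</code> is hypothetical, and an in-memory stream stands in for a real destination; the bytes are read back with <code>-&gt;dataset</code> only as a sanity check.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])
(import '[java.io ByteArrayInputStream ByteArrayOutputStream])

;; Hypothetical dataset to serialize.
(def out-ds (ds/-&gt;dataset {:a [1 2]
                            :b ["x" "y"]}))

;; In-memory stand-in for any java.io.OutputStream.
(def out (ByteArrayOutputStream.))

;; No filename is available, so the file type is given explicitly.
(ds/write! out-ds out {:file-type :csv})

;; Read the bytes back in, again naming the file type.
(def round-trip
  (ds/-&gt;dataset (ByteArrayInputStream. (.toByteArray out))
                {:file-type :csv}))

(ds/row-count round-trip)
</code></pre>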
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1584">view source</a></div></div></div></body></html>