<!DOCTYPE html PUBLIC ""
    "">
<html><head><meta charset="UTF-8" /><title>tech.v3.dataset documentation</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">8.003</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="000-getting-started.html"><div class="inner"><span>tech.ml.dataset Getting Started</span></div></a></li><li class="depth-1 "><a href="100-walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li><li class="depth-1 "><a href="200-quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span 
class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3 current"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span 
class="top"></span><span class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li 
class="depth-4 branch"><a href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -641px;"><span class="top" style="height: 650px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.clj-transit.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clj-transit</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li 
class="depth-4 branch"><a href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="sidebar secondary"><h3><a href="#top"><span class="inner">Public Vars</span></a></h3><ul><li class="depth-1"><a href="tech.v3.dataset.html#var--.3E.3Edataset"><div class="inner"><span>->>dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var--.3Edataset"><div class="inner"><span>->dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-add-column"><div class="inner"><span>add-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-add-or-update-column"><div class="inner"><span>add-or-update-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-all-descriptive-stats-names"><div class="inner"><span>all-descriptive-stats-names</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-append-columns"><div class="inner"><span>append-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-assoc-ds"><div class="inner"><span>assoc-ds</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-assoc-metadata"><div class="inner"><span>assoc-metadata</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-bind-.3E"><div class="inner"><span>bind-></span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-brief"><div class="inner"><span>brief</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-categorical-.3Enumber"><div class="inner"><span>categorical->number</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-categorical-.3Eone-hot"><div 
class="inner"><span>categorical->one-hot</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-column"><div class="inner"><span>column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-column-.3Edataset"><div class="inner"><span>column->dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-column-cast"><div class="inner"><span>column-cast</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-column-count"><div class="inner"><span>column-count</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-column-labeled-mapseq"><div class="inner"><span>column-labeled-mapseq</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-column-map"><div class="inner"><span>column-map</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-column-map-m"><div class="inner"><span>column-map-m</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-column-names"><div class="inner"><span>column-names</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-columns"><div class="inner"><span>columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-columns-with-missing-seq"><div class="inner"><span>columns-with-missing-seq</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-columnwise-concat"><div class="inner"><span>columnwise-concat</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-concat"><div class="inner"><span>concat</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-concat-copying"><div class="inner"><span>concat-copying</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-concat-inplace"><div class="inner"><span>concat-inplace</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-data-.3Edataset"><div 
class="inner"><span>data->dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-dataset-.3Edata"><div class="inner"><span>dataset->data</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-dataset-name"><div class="inner"><span>dataset-name</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-dataset-parser"><div class="inner"><span>dataset-parser</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-dataset.3F"><div class="inner"><span>dataset?</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-descriptive-stats"><div class="inner"><span>descriptive-stats</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-drop-columns"><div class="inner"><span>drop-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-drop-missing"><div class="inner"><span>drop-missing</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-drop-rows"><div class="inner"><span>drop-rows</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-empty-column-names"><div class="inner"><span>empty-column-names</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-empty-dataset"><div class="inner"><span>empty-dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-ensure-array-backed"><div class="inner"><span>ensure-array-backed</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-filter"><div class="inner"><span>filter</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-filter-column"><div class="inner"><span>filter-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-filter-dataset"><div class="inner"><span>filter-dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-group-by"><div class="inner"><span>group-by</span></div></a></li><li 
class="depth-1"><a href="tech.v3.dataset.html#var-group-by-.3Eindexes"><div class="inner"><span>group-by->indexes</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-group-by-column"><div class="inner"><span>group-by-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-group-by-column-.3Eindexes"><div class="inner"><span>group-by-column->indexes</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-group-by-column-consumer"><div class="inner"><span>group-by-column-consumer</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-has-column.3F"><div class="inner"><span>has-column?</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-head"><div class="inner"><span>head</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-induction"><div class="inner"><span>induction</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-major-version"><div class="inner"><span>major-version</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-mapseq-parser"><div class="inner"><span>mapseq-parser</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-mapseq-reader"><div class="inner"><span>mapseq-reader</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-mapseq-rf"><div class="inner"><span>mapseq-rf</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-min-n-by-column"><div class="inner"><span>min-n-by-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-missing"><div class="inner"><span>missing</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-new-column"><div class="inner"><span>new-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-new-dataset"><div class="inner"><span>new-dataset</span></div></a></li><li class="depth-1"><a 
href="tech.v3.dataset.html#var-order-column-names"><div class="inner"><span>order-column-names</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-pmap-ds"><div class="inner"><span>pmap-ds</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-print-all"><div class="inner"><span>print-all</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-rand-nth"><div class="inner"><span>rand-nth</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-remove-column"><div class="inner"><span>remove-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-remove-columns"><div class="inner"><span>remove-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-remove-empty-columns"><div class="inner"><span>remove-empty-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-remove-rows"><div class="inner"><span>remove-rows</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-rename-columns"><div class="inner"><span>rename-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-replace-missing"><div class="inner"><span>replace-missing</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-replace-missing-value"><div class="inner"><span>replace-missing-value</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-reverse-rows"><div class="inner"><span>reverse-rows</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-row-at"><div class="inner"><span>row-at</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-row-count"><div class="inner"><span>row-count</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-row-map"><div class="inner"><span>row-map</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-row-mapcat"><div 
class="inner"><span>row-mapcat</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-rows"><div class="inner"><span>rows</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-rowvec-at"><div class="inner"><span>rowvec-at</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-rowvecs"><div class="inner"><span>rowvecs</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-sample"><div class="inner"><span>sample</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-select"><div class="inner"><span>select</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-select-by-index"><div class="inner"><span>select-by-index</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-select-columns"><div class="inner"><span>select-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-select-columns-by-index"><div class="inner"><span>select-columns-by-index</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-select-missing"><div class="inner"><span>select-missing</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-select-rows"><div class="inner"><span>select-rows</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-set-dataset-name"><div class="inner"><span>set-dataset-name</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-shape"><div class="inner"><span>shape</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-shuffle"><div class="inner"><span>shuffle</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-sort-by"><div class="inner"><span>sort-by</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-sort-by-column"><div class="inner"><span>sort-by-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-tail"><div 
class="inner"><span>tail</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-take-nth"><div class="inner"><span>take-nth</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-unique-by"><div class="inner"><span>unique-by</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-unique-by-column"><div class="inner"><span>unique-by-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-unordered-select"><div class="inner"><span>unordered-select</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-unroll-column"><div class="inner"><span>unroll-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-update"><div class="inner"><span>update</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-update-column"><div class="inner"><span>update-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-update-columns"><div class="inner"><span>update-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-update-columnwise"><div class="inner"><span>update-columnwise</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-update-elemwise"><div class="inner"><span>update-elemwise</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-value-reader"><div class="inner"><span>value-reader</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.html#var-write.21"><div class="inner"><span>write!</span></div></a></li></ul></div><div class="namespace-docs" id="content"><h1 class="anchor" id="top">tech.v3.dataset</h1><div class="doc"><div class="markdown"><p>Column major dataset abstraction for efficiently manipulating
in-memory datasets.</p>
</div></div><div class="public anchor" id="var--.3E.3Edataset"><h3>->>dataset</h3><div class="usage"><code>(->>dataset options dataset)</code><code>(->>dataset dataset)</code></div><div class="doc"><div class="markdown"><p>Please see documentation of ->dataset. Options are the same.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L14">view source</a></div></div><div class="public anchor" id="var--.3Edataset"><h3>->dataset</h3><div class="usage"><code>(->dataset dataset options)</code><code>(->dataset dataset)</code></div><div class="doc"><div class="markdown"><p>Create a dataset from either csv/tsv or a sequence of maps.</p>
<ul>
<li>
<p>A <code>String</code> will be interpreted as a file (or gzipped file if it
ends with .gz) of tsv or csv data. The system will attempt to autodetect whether this
is csv or tsv and then detect the column datatypes, all of which can
be overridden.</p>
</li>
<li>
<p>InputStreams have no file type and thus a <code>file-type</code> must be provided in the
options.</p>
</li>
<li>
<p>A sequence of maps may be passed in, in which case the first N maps are scanned in
order to derive the column datatypes before the actual columns are created.</p>
</li>
</ul>
<p>Parquet, xlsx, and xls formats require that you first load the appropriate library namespace:
<code>tech.v3.libs.parquet</code> for parquet, <code>tech.v3.libs.fastexcel</code> for xlsx,
and <code>tech.v3.libs.poi</code> for xls.</p>
<p>Arrow support is provided via the <code>tech.v3.libs.arrow</code> namespace, not via a file-type
overload, as the Arrow project currently has 3 different file types and it is not clear
what their final suffix will be or which of the three file types it will indicate.
Please see documentation in the <code>tech.v3.libs.arrow</code> namespace for further information
on Arrow file types.</p>
<p>Options:</p>
<ul>
<li>
<p><code>:dataset-name</code> - set the name of the dataset.</p>
</li>
<li>
<p><code>:file-type</code> - Override filetype discovery mechanism for strings or force a particular
parser for an input stream. Note that parquet must have paths on disk
and cannot currently load from an input stream. Acceptable file types are:
<code>#{:csv :tsv :xlsx :xls :parquet}</code>.</p>
</li>
<li>
<p><code>:gzipped?</code> - for file formats that support it, override autodetection and force
creation of a gzipped input stream as opposed to a normal input stream.</p>
</li>
<li>
<p><code>:column-allowlist</code> - either a sequence of string column names or a sequence of column
indices of columns to allowlist. This is preferred to <code>:column-whitelist</code>.</p>
</li>
<li>
<p><code>:column-blocklist</code> - either a sequence of string column names or a sequence of column
indices of columns to blocklist. This is preferred to <code>:column-blacklist</code>.</p>
</li>
<li>
<p><code>:num-rows</code> - Number of rows to read.</p>
</li>
<li>
<p><code>:header-row?</code> - Defaults to true, indicates the first row is a header.</p>
</li>
<li>
<p><code>:key-fn</code> - function to be applied to column names. Typical use is:
<code>:key-fn keyword</code>.</p>
</li>
<li>
<p><code>:separator</code> - Add a character separator to the list of separators to auto-detect.</p>
</li>
<li>
<p><code>:csv-parser</code> - Implementation of univocity's AbstractParser to use. If not
provided, a default permissive parser is used. This way you can parse anything that
univocity supports (so flat files and such).</p>
</li>
<li>
<p><code>:bad-row-policy</code> - One of three options: <code>:skip</code>, <code>:error</code>, <code>:carry-on</code>. Defaults to
<code>:carry-on</code>. Some csv data has ragged rows and in this case we have several
options. If the option is <code>:carry-on</code> then we either create a new column or add
missing values for columns that had no data for that row.</p>
</li>
<li>
<p><code>:skip-bad-rows?</code> - Legacy option. Use <code>:bad-row-policy</code>.</p>
</li>
<li>
<p><code>:disable-comment-skipping?</code> - By default, the <code>#</code> character is recognised as a
line comment when found at the beginning of a line of text in a CSV file,
and the row will be ignored. Set to <code>true</code> to disable this behavior.</p>
</li>
<li>
<p><code>:max-chars-per-column</code> - Defaults to 4096. Columns with more characters than this
will result in an exception.</p>
</li>
<li>
<p><code>:max-num-columns</code> - Defaults to 8192. CSV and TSV files with more columns than this
will fail to parse. For more information on this option, please visit:
<a href="https://github.com/uniVocity/univocity-parsers/issues/301">https://github.com/uniVocity/univocity-parsers/issues/301</a></p>
</li>
<li>
<p><code>:text-temp-dir</code> - The temporary directory to use for file-backed text. Setting
this value to boolean <code>false</code> turns off file-backed text, which is the default. If a
tech.v3.resource stack context is opened, the file will be deleted when the context
closes; otherwise it will be deleted when the gc cleans up the dataset. A shutdown hook is
added as a last resort to ensure the file is cleaned up.</p>
</li>
<li>
<p><code>:n-initial-skip-rows</code> - Skip N rows initially. This currently may include the
header row. Works across both csv and spreadsheet datasets.</p>
</li>
<li>
<p><code>:parser-type</code> - Default parser to use if no parser-fn is specified for that column.
For csv files, the default parser type is <code>:string</code> which indicates a promotional
string parser. For sequences of maps, the default parser type is <code>:object</code>. It can
be useful in some contexts to use the <code>:string</code> parser with sequences of maps or
maps of columns.</p>
</li>
<li>
<p><code>:parser-fn</code> -</p>
<ul>
<li><code>keyword?</code> - all columns parsed to this datatype. For example:
<code>{:parser-fn :string}</code></li>
<li><code>map?</code> - <code>{column-name parse-method}</code> parse each column with the specified
<code>parse-method</code>.
The <code>parse-method</code> can be:
<ul>
<li><code>keyword?</code> - parse the specified column to this datatype. For example:
<code>{:parser-fn {:answer :boolean :id :int32}}</code></li>
<li>tuple - pair of <code>[datatype parse-data]</code> in which case a container of type
<code>[datatype]</code> will be created. <code>parse-data</code> can be one of:
<ul>
<li><code>:relaxed?</code> - data will be parsed such that parse failures of the standard
parse functions do not stop the parsing process. <code>:unparsed-values</code> and
<code>:unparsed-indexes</code> are available in the metadata of the column and tell
you the values that failed to parse and their respective indexes.</li>
<li><code>fn?</code> - function from string to one of <code>:tech.v3.dataset/missing</code>,
<code>:tech.v3.dataset/parse-failure</code>, or the parsed value.
Exceptions here always kill the parse process. <code>:missing</code> will get marked
in the missing indexes, and <code>:parse-failure</code> will result in the index being
added to missing and the column's <code>:unparsed-values</code> and
<code>:unparsed-indexes</code> being updated.</li>
<li><code>string?</code> - for datetime types, this will be turned into a DateTimeFormatter via
DateTimeFormatter/ofPattern. For <code>:text</code> you can specify the backing file
to use.</li>
<li><code>DateTimeFormatter</code> - use with the appropriate temporal parse static function
to parse the value.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>
<p><code>map?</code> - the header-name-or-idx is used to look up the value. If not nil, then the
value can be any of the above options. Else the default column parser
is used.</p>
</li>
</ul>
<p>Returns a new dataset</p>
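<p>As a rough illustration of the options above, the following sketch parses CSV text from an in-memory <code>InputStream</code>, which is why <code>:file-type</code> has to be supplied. The CSV content, the var names <code>csv-text</code> and <code>parsed</code>, and the chosen column datatypes are assumptions made up for this example; only the option keys are taken from the list above.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical CSV content used only for this sketch.
(def csv-text "id,answer,score\n1,true,3.5\n2,false,4.25\n")

(def parsed
  ;; An InputStream carries no file name, so :file-type is required.
  (with-open [in (java.io.ByteArrayInputStream. (.getBytes csv-text "UTF-8"))]
    (ds/->dataset in {:file-type :csv
                      :dataset-name "answers"
                      ;; No :key-fn is given, so columns are looked up
                      ;; by their string header names.
                      :parser-fn {"id" :int32
                                  "answer" :boolean}})))

;; The "score" column has no entry in :parser-fn, so its datatype is autodetected.
(ds/row-count parsed)
(ds/column-names parsed)
</code></pre>
<p>Passing a path string instead of the stream would let the file type be detected from the name, in which case <code>:file-type</code> could be dropped.</p>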
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L22">view source</a></div></div><div class="public anchor" id="var-add-column"><h3>add-column</h3><div class="usage"><code>(add-column dataset column)</code></div><div class="doc"><div class="markdown"><p>Add a new column. It is an error if the name collides with an existing column.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L130">view source</a></div></div><div class="public anchor" id="var-add-or-update-column"><h3>add-or-update-column</h3><div class="usage"><code>(add-or-update-column dataset colname column)</code><code>(add-or-update-column dataset column)</code></div><div class="doc"><div class="markdown"><p>If column exists, replace. Else append new column.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L136">view source</a></div></div><div class="public anchor" id="var-all-descriptive-stats-names"><h3>all-descriptive-stats-names</h3><div class="usage"><code>(all-descriptive-stats-names)</code></div><div class="doc"><div class="markdown"><p>Returns the names of all descriptive stats in the order they will be returned
in the resulting dataset of descriptive stats. This allows easy filtering
of the form
<code>(descriptive-stats ds {:stat-names (->> (all-descriptive-stats-names)
(remove #{:values :num-distinct-values}))})</code></p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L144">view source</a></div></div><div class="public anchor" id="var-append-columns"><h3>append-columns</h3><div class="usage"><code>(append-columns dataset column-seq)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L154">view source</a></div></div><div class="public anchor" id="var-assoc-ds"><h3>assoc-ds</h3><div class="usage"><code>(assoc-ds dataset cname cdata & args)</code></div><div class="doc"><div class="markdown"><p>If dataset is not nil, calls <code>clojure.core/assoc</code>. Else creates a new empty dataset and
|
|
then calls <code>clojure.core/assoc</code>. Guaranteed to return a dataset (unlike assoc).</p>
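<p>A short sketch of the nil-safe behaviour described above. The column names and data are made up for illustration.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; With a nil dataset, assoc-ds starts from an empty dataset, so the
;; result is still a dataset. clojure.core/assoc on nil would return
;; a plain map instead.
(def from-nil (ds/assoc-ds nil :a [1 2 3] :b [4 5 6]))

(ds/dataset? from-nil)  ;; true
(ds/row-count from-nil) ;; 3
</code></pre>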
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L159">view source</a></div></div><div class="public anchor" id="var-assoc-metadata"><h3>assoc-metadata</h3><div class="usage"><code>(assoc-metadata dataset filter-fn-or-ds k v & args)</code></div><div class="doc"><div class="markdown"><p>Set metadata across a set of columns.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L166">view source</a></div></div><div class="public anchor" id="var-bind-.3E"><h3>bind-></h3><h4 class="type">macro</h4><div class="usage"><code>(bind-> expr name & args)</code></div><div class="doc"><div class="markdown"><p>Threads like <code>-></code> but binds name to expr like <code>as-></code>:</p>
|
|
<pre><code class="language-clojure">(ds/bind-> (ds/->dataset "test/data/stocks.csv") ds
           (assoc :logprice2 (dfn/log1p (ds "price")))
           (assoc :logp3 (dfn/* 2 (ds :logprice2)))
           (ds/select-columns ["price" :logprice2 :logp3])
           (ds-tens/dataset->tensor)
           (first))
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L172">view source</a></div></div><div class="public anchor" id="var-brief"><h3>brief</h3><div class="usage"><code>(brief ds options)</code><code>(brief ds)</code></div><div class="doc"><div class="markdown"><p>Get a brief description of a dataset in mapseq form. A brief description is
the mapseq form of descriptive stats.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L188">view source</a></div></div><div class="public anchor" id="var-categorical-.3Enumber"><h3>categorical->number</h3><div class="usage"><code>(categorical->number dataset filter-fn-or-ds)</code><code>(categorical->number dataset filter-fn-or-ds table-args)</code><code>(categorical->number dataset filter-fn-or-ds table-args result-datatype)</code></div><div class="doc"><div class="markdown"><p>Convert columns into a discrete, numeric representation.
See tech.v3.dataset.categorical/fit-categorical-map.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L197">view source</a></div></div><div class="public anchor" id="var-categorical-.3Eone-hot"><h3>categorical->one-hot</h3><div class="usage"><code>(categorical->one-hot dataset filter-fn-or-ds)</code><code>(categorical->one-hot dataset filter-fn-or-ds table-args)</code><code>(categorical->one-hot dataset filter-fn-or-ds table-args result-datatype)</code></div><div class="doc"><div class="markdown"><p>Convert string columns to numeric columns.
|
|
See tech.v3.dataset.categorical/fit-one-hot</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L208">view source</a></div></div><div class="public anchor" id="var-column"><h3>column</h3><div class="usage"><code>(column dataset colname)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L219">view source</a></div></div><div class="public anchor" id="var-column-.3Edataset"><h3>column->dataset</h3><div class="usage"><code>(column->dataset dataset colname transform-fn options)</code><code>(column->dataset dataset colname transform-fn)</code></div><div class="doc"><div class="markdown"><p>Transform a column into a sequence of maps using transform-fn.
|
|
Return dataset created out of the sequence of maps.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L224">view source</a></div></div><div class="public anchor" id="var-column-cast"><h3>column-cast</h3><div class="usage"><code>(column-cast dataset colname datatype)</code><code>(column-cast dataset colname datatype options)</code></div><div class="doc"><div class="markdown"><p>Cast a column to a new datatype. This is never a lazy operation. If the old
|
|
and new datatypes match and no cast-fn is provided then dtype/clone is called
|
|
on the column.</p>
|
|
<p>colname may be a scalar or a tuple of <code>[src-col dst-col]</code>.</p>
|
|
<p>datatype may be a datatype enumeration or a tuple of
|
|
<code>[datatype cast-fn]</code> where cast-fn may return either a new value,
:tech.v3.dataset/missing, or :tech.v3.dataset/parse-failure.
Exceptions are propagated to the caller. The new column's missing set contains at
least the existing missing set, plus any index where cast-fn returns
:tech.v3.dataset/missing or :tech.v3.dataset/parse-failure.
A parse failure means the value gets added to metadata key :unparsed-data
and the index gets added to :unparsed-indexes.</p>
|
|
<p>If the existing datatype is string, then tech.v3.dataset.column/parse-column
|
|
is called.</p>
|
|
<p>Casts between numeric datatypes need no cast-fn but one may be provided.
|
|
Casts to string need no cast-fn but one may be provided.
|
|
Casts from string to anything will call tech.v3.dataset.column/parse-column.</p>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li><code>:track-parse-errors</code> - defaults to false. When true extra metadata keys
|
|
<code>:unparsed-indexes :unparsed-data</code> will be appended to the metadata. Be aware
that these values may not serialize, because <code>:unparsed-indexes</code> is a roaring bitmap.</li>
|
|
</ul>
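<p>A minimal sketch of the two <code>colname</code> forms described above: a scalar name casts the column in place, while a <code>[src-col dst-col]</code> tuple writes the cast values to a second column. The dataset and the names <code>ints</code>, <code>as-double</code>, <code>with-str</code> and <code>:a-str</code> are assumptions made for illustration.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical integer column.
(def ints (ds/->dataset {:a [1 2 3]}))

;; Scalar colname: :a itself becomes a :float64 column.
(def as-double (ds/column-cast ints :a :float64))
(vec (as-double :a)) ;; [1.0 2.0 3.0]

;; [src-col dst-col] tuple: :a is kept and the string form is written
;; to the new column :a-str. Casts to string need no cast-fn.
(def with-str (ds/column-cast ints [:a :a-str] :string))
(vec (with-str :a-str))
</code></pre>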
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L233">view source</a></div></div><div class="public anchor" id="var-column-count"><h3>column-count</h3><div class="usage"><code>(column-count dataset)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L267">view source</a></div></div><div class="public anchor" id="var-column-labeled-mapseq"><h3>column-labeled-mapseq</h3><div class="usage"><code>(column-labeled-mapseq dataset value-colname-seq)</code></div><div class="doc"><div class="markdown"><p>Given a dataset, return a sequence of maps where several columns are all stored
|
|
in a :value key and a :label key contains a column name. Used for quickly creating
|
|
timeseries or scatterplot labeled graphs. Returns a lazy sequence, not a reader!</p>
|
|
<p>See also <code>columnwise-concat</code></p>
|
|
<p>Return a sequence of maps with</p>
|
|
<pre><code class="language-clojure">  {... - columns not in colname-seq
   :value - value from one of the value columns
   :label - name of the column the value came from
  }
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L272">view source</a></div></div><div class="public anchor" id="var-column-map"><h3>column-map</h3><div class="usage"><code>(column-map dataset result-colname map-fn res-dtype-or-opts filter-fn-or-ds)</code><code>(column-map dataset result-colname map-fn filter-fn-or-ds)</code><code>(column-map dataset result-colname map-fn)</code></div><div class="doc"><div class="markdown"><p>Produce a new (or updated) column as the result of mapping a fn over columns. This
|
|
function is never lazy - all results are immediately calculated.</p>
|
|
<ul>
|
|
<li><code>dataset</code> - dataset.</li>
|
|
<li><code>result-colname</code> - Name of new (or existing) column.</li>
|
|
<li><code>map-fn</code> - function to map over columns. Same rules as <code>tech.v3.datatype/emap</code>.</li>
|
|
<li><code>res-dtype-or-opts</code> - If not given result is scanned to infer missing and datatype.
|
|
If using an option map, options are described below.</li>
|
|
<li><code>filter-fn-or-ds</code> - A dataset, a sequence of columns, or a <code>tech.v3.dataset.column-filters</code>
|
|
column filter function. Defaults to all the columns of the existing dataset.</li>
|
|
</ul>
|
|
<p>Returns a new dataset with a new or updated column.</p>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li><code>:datatype</code> - Set the datatype of the result column. If not given, the result is scanned
|
|
to infer result datatype and missing set.</li>
|
|
<li><code>:missing-fn</code> - if given, columns are first passed to missing-fn as a sequence and
|
|
this dictates the missing set. Otherwise the missing set is found by scanning the results
|
|
during the inference process. See <code>tech.v3.dataset.column/union-missing-sets</code> and
|
|
<code>tech.v3.dataset.column/intersect-missing-sets</code> for example functions to pass in
|
|
here.</li>
|
|
</ul>
|
|
<p>Examples:</p>
|
|
<pre><code class="language-clojure">
;;From the tests --

(let [testds (ds/->dataset [{:a 1.0 :b 2.0} {:a 3.0 :b 5.0} {:a 4.0 :b nil}])]
  ;;result scanned for both datatype and missing set
  (is (= (vec [3.0 6.0 nil])
         (:b2 (ds/column-map testds :b2 #(when % (inc %)) [:b]))))
  ;;result scanned for missing set only. Result used in-place.
  (is (= (vec [3.0 6.0 nil])
         (:b2 (ds/column-map testds :b2 #(when % (inc %))
                             {:datatype :float64} [:b]))))
  ;;Nothing scanned at all.
  (is (= (vec [3.0 6.0 nil])
         (:b2 (ds/column-map testds :b2 #(inc %)
                             {:datatype :float64
                              :missing-fn ds-col/union-missing-sets} [:b]))))
  ;;Missing set scanning causes NPE at inc.
  (is (thrown? Throwable
               (ds/column-map testds :b2 #(inc %)
                              {:datatype :float64}
                              [:b]))))

;;Ad-hoc repl --

user> (require '[tech.v3.dataset :as ds])
nil
user> (def ds (ds/->dataset "test/data/stocks.csv"))
#'user/ds
user> (ds/head ds)
test/data/stocks.csv [5 3]:

| symbol |       date | price |
|--------|------------|-------|
|   MSFT | 2000-01-01 | 39.81 |
|   MSFT | 2000-02-01 | 36.35 |
|   MSFT | 2000-03-01 | 43.22 |
|   MSFT | 2000-04-01 | 28.37 |
|   MSFT | 2000-05-01 | 25.45 |
user> (-> (ds/column-map ds "price^2" #(* % %) ["price"])
          (ds/head))
test/data/stocks.csv [5 4]:

| symbol |       date | price |   price^2 |
|--------|------------|-------|-----------|
|   MSFT | 2000-01-01 | 39.81 | 1584.8361 |
|   MSFT | 2000-02-01 | 36.35 | 1321.3225 |
|   MSFT | 2000-03-01 | 43.22 | 1867.9684 |
|   MSFT | 2000-04-01 | 28.37 |  804.8569 |
|   MSFT | 2000-05-01 | 25.45 |  647.7025 |

user> (def ds1 (ds/->dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}]))
#'user/ds1
user> ds1
_unnamed [3 2]:

|  :b | :a |
|----:|---:|
|     |  1 |
| 2.0 |    |
| 3.0 |  2 |
user> (ds/column-map ds1 :c (fn [a b]
                              (when (and a b)
                                (+ (double a) (double b))))
                     [:a :b])
_unnamed [3 3]:

|  :b | :a |  :c |
|----:|---:|----:|
|     |  1 |     |
| 2.0 |    |     |
| 3.0 |  2 | 5.0 |
user> (ds/missing (*1 :c))
{0,1}
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L290">view source</a></div></div><div class="public anchor" id="var-column-map-m"><h3>column-map-m</h3><h4 class="type">macro</h4><div class="usage"><code>(column-map-m ds result-colname src-colnames body)</code></div><div class="doc"><div class="markdown"><p>Map a function across one or more columns via a macro.
|
|
The function will have arguments in the order of the src-colnames. Column names of
|
|
the form <code>right.id</code> will be bound to variables named <code>right-id</code>.</p>
|
|
<p>Example:</p>
|
|
<pre><code class="language-clojure">user> (-> (ds/->dataset [{:a.a 1} {:b 2.0} {:a.a 2 :b 3.0}])
          (ds/column-map-m :a [:a.a :b]
                           (when (and a-a b)
                             (+ (double a-a) (double b)))))
_unnamed [3 3]:

|  :b | :a.a |  :a |
|----:|-----:|----:|
|     |    1 |     |
| 2.0 |      |     |
| 3.0 |    2 | 5.0 |
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L402">view source</a></div></div><div class="public anchor" id="var-column-names"><h3>column-names</h3><div class="usage"><code>(column-names dataset)</code></div><div class="doc"><div class="markdown"><p>In-order sequence of column names</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L427">view source</a></div></div><div class="public anchor" id="var-columns"><h3>columns</h3><div class="usage"><code>(columns dataset)</code></div><div class="doc"><div class="markdown"><p>Return sequence of all columns in dataset.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L433">view source</a></div></div><div class="public anchor" id="var-columns-with-missing-seq"><h3>columns-with-missing-seq</h3><div class="usage"><code>(columns-with-missing-seq dataset)</code></div><div class="doc"><div class="markdown"><p>Return a sequence of:</p>
|
|
<pre><code class="language-clojure">  {:column-name column-name
   :missing-count missing-count
   }
</code></pre>
|
|
<p>or nil if no columns are missing data.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L439">view source</a></div></div><div class="public anchor" id="var-columnwise-concat"><h3>columnwise-concat</h3><div class="usage"><code>(columnwise-concat dataset colnames options)</code><code>(columnwise-concat dataset colnames)</code></div><div class="doc"><div class="markdown"><p>Given a dataset and a list of columns, produce a new dataset with
|
|
the columns concatenated to a new column with a :column column indicating
|
|
which column the original value came from. Any columns not mentioned in the
|
|
list of columns are duplicated.</p>
|
|
<p>Example:</p>
|
|
<pre><code class="language-clojure">user> (-> [{:a 1 :b 2 :c 3 :d 1} {:a 4 :b 5 :c 6 :d 2}]
          (ds/->dataset)
          (ds/columnwise-concat [:c :a :b]))
null [6 3]:

| :column | :value | :d |
|---------+--------+----|
|      :c |      3 |  1 |
|      :c |      6 |  2 |
|      :a |      1 |  1 |
|      :a |      4 |  2 |
|      :b |      2 |  1 |
|      :b |      5 |  2 |
</code></pre>
|
|
<p>Options:</p>
|
|
<ul>
<li><code>value-column-name</code> - defaults to :value</li>
<li><code>colname-column-name</code> - defaults to :column</li>
</ul>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L451">view source</a></div></div><div class="public anchor" id="var-concat"><h3>concat</h3><div class="usage"><code>(concat dataset & args)</code><code>(concat)</code></div><div class="doc"><div class="markdown"><p>Concatenate datasets using a copying-concatenation.
|
|
See also <a href="tech.v3.dataset.html#var-concat-inplace">concat-inplace</a> as it may be more efficient for your use case if you have
|
|
a small number (like less than 3) of datasets.</p>
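<p>A minimal sketch of concatenating two datasets that share the same columns. The datasets <code>part-a</code> and <code>part-b</code> are made up for illustration.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Two hypothetical datasets with identical column names.
(def part-a (ds/->dataset {:x [1 2] :y ["a" "b"]}))
(def part-b (ds/->dataset {:x [3 4] :y ["c" "d"]}))

;; Rows of part-b follow the rows of part-a in the result.
(def combined (ds/concat part-a part-b))

(ds/row-count combined) ;; 4
(vec (combined :x))     ;; [1 2 3 4]
</code></pre>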
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L485">view source</a></div></div><div class="public anchor" id="var-concat-copying"><h3>concat-copying</h3><div class="usage"><code>(concat-copying dataset & args)</code><code>(concat-copying)</code></div><div class="doc"><div class="markdown"><p>Concatenate datasets into a new dataset copying data. Respects missing values.
|
|
Datasets must all have the same columns. Result column datatypes will be a widening
|
|
cast of the datatypes.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L495">view source</a></div></div><div class="public anchor" id="var-concat-inplace"><h3>concat-inplace</h3><div class="usage"><code>(concat-inplace dataset & args)</code><code>(concat-inplace)</code></div><div class="doc"><div class="markdown"><p>Concatenate datasets in place. Respects missing values. Datasets must all have the
|
|
same columns. Result column datatypes will be a widening cast of the datatypes.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L505">view source</a></div></div><div class="public anchor" id="var-data-.3Edataset"><h3>data->dataset</h3><div class="usage"><code>(data->dataset input)</code></div><div class="doc"><div class="markdown"><p>Convert a data-ized dataset created via dataset->data back into a
|
|
full dataset</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L514">view source</a></div></div><div class="public anchor" id="var-dataset-.3Edata"><h3>dataset->data</h3><div class="usage"><code>(dataset->data ds)</code></div><div class="doc"><div class="markdown"><p>Convert a dataset to a pure clojure datastructure. Returns a map with two keys:
|
|
{:metadata :columns}.
|
|
:columns is a vector of column definitions appropriate for passing directly back
|
|
into new-dataset.
|
|
A column definition in this case is a map of {:name :missing :data :metadata}.</p>
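<p>A sketch of a round trip through the pure-data form, pairing this function with <code>data->dataset</code>. The dataset and the names <code>original</code>, <code>as-data</code> and <code>restored</code> are illustrative assumptions.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset used only for this illustration.
(def original (ds/->dataset {:a [1 2 3] :b ["x" "y" "z"]}))

;; Plain Clojure map with the keys :metadata and :columns.
(def as-data (ds/dataset->data original))
(set (keys as-data))

;; Convert the data form back into a full dataset.
(def restored (ds/data->dataset as-data))
(vec (restored :a)) ;; [1 2 3]
</code></pre>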
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L521">view source</a></div></div><div class="public anchor" id="var-dataset-name"><h3>dataset-name</h3><div class="usage"><code>(dataset-name dataset)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L531">view source</a></div></div><div class="public anchor" id="var-dataset-parser"><h3>dataset-parser</h3><div class="usage"><code>(dataset-parser options)</code><code>(dataset-parser)</code></div><div class="doc"><div class="markdown"><p>Implements protocols/PDatasetParser, Counted, Indexed, IReduceInit, and IDeref (returns the new dataset).
|
|
See documentation for <a href="tech.v3.dataset.html#var-mapseq-parser">mapseq-parser</a>.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L536">view source</a></div></div><div class="public anchor" id="var-dataset.3F"><h3>dataset?</h3><div class="usage"><code>(dataset? ds)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L545">view source</a></div></div><div class="public anchor" id="var-descriptive-stats"><h3>descriptive-stats</h3><div class="usage"><code>(descriptive-stats dataset)</code><code>(descriptive-stats dataset options)</code></div><div class="doc"><div class="markdown"><p>Get descriptive statistics across the columns of the dataset.
|
|
In addition to the standard stats.</p>
<p>Options:</p>
<ul>
<li><code>:stat-names</code> - defaults to <code>(remove #{:values :num-distinct-values} (all-descriptive-stats-names))</code></li>
<li><code>:n-categorical-values</code> - Number of categorical values to report in the 'values'
field. Defaults to 21.</li>
</ul>
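<p>A minimal sketch of restricting the reported statistics with <code>:stat-names</code>. The dataset is made up for illustration. The result is itself a dataset with one row per source column.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset with one numeric and one string column.
(def prices (ds/->dataset {:price [1.0 2.0 3.0] :symbol ["A" "B" "C"]}))

;; Report only a few named statistics.
(def stats (ds/descriptive-stats prices {:stat-names [:col-name :min :mean :max]}))

;; Request every available statistic, including the ones that are
;; removed by default.
(def all-stats (ds/descriptive-stats prices {:stat-names (ds/all-descriptive-stats-names)}))

(ds/row-count stats) ;; 2, one row per column of `prices`
</code></pre>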
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L550">view source</a></div></div><div class="public anchor" id="var-drop-columns"><h3>drop-columns</h3><div class="usage"><code>(drop-columns dataset colname-seq-or-fn)</code></div><div class="doc"><div class="markdown"><p>Same as remove-columns. Remove columns indexed by column name seq or
|
|
column filter function.
|
|
For example:</p>
|
|
<pre><code class="language-clojure">(drop-columns DS [:A :B])
(drop-columns DS cf/categorical)
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L564">view source</a></div></div><div class="public anchor" id="var-drop-missing"><h3>drop-missing</h3><div class="usage"><code>(drop-missing dataset-or-col)</code><code>(drop-missing ds colname)</code></div><div class="doc"><div class="markdown"><p>Remove missing entries by simply selecting out the missing indexes.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L577">view source</a></div></div><div class="public anchor" id="var-drop-rows"><h3>drop-rows</h3><div class="usage"><code>(drop-rows dataset-or-col row-indexes)</code></div><div class="doc"><div class="markdown"><p>Drop rows from dataset or column</p>
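<p>A short sketch contrasting <code>drop-rows</code> with <code>drop-missing</code> on a dataset that has gaps. The dataset <code>sparse</code> is an assumption made for illustration.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset: row 0 lacks :b, row 1 lacks :a, row 2 is complete.
(def sparse (ds/->dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}]))

;; Drop rows by explicit index.
(ds/row-count (ds/drop-rows sparse [0]))    ;; 2

;; Drop every row that has a missing value in any column.
(ds/row-count (ds/drop-missing sparse))     ;; 1

;; Drop only the rows where :a is missing.
(ds/row-count (ds/drop-missing sparse :a))  ;; 2
</code></pre>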
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L585">view source</a></div></div><div class="public anchor" id="var-empty-column-names"><h3>empty-column-names</h3><div class="usage"><code>(empty-column-names ds)</code></div><div class="doc"><div class="markdown"><p>Return a sequence of the names of columns whose missing set length matches the row count of the dataset, i.e. columns that are entirely missing.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L591">view source</a></div></div><div class="public anchor" id="var-empty-dataset"><h3>empty-dataset</h3><div class="usage"><code>(empty-dataset)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L597">view source</a></div></div><div class="public anchor" id="var-ensure-array-backed"><h3>ensure-array-backed</h3><div class="usage"><code>(ensure-array-backed ds options)</code><code>(ensure-array-backed ds)</code></div><div class="doc"><div class="markdown"><p>Ensure the column data in the dataset is stored in pure java arrays. This is
|
|
sometimes necessary for interop with other libraries and this operation will
|
|
force any lazy computations to complete. This also clears the missing set
|
|
for each column and writes the missing values to the new arrays.</p>
|
|
<p>Columns that are already array backed and that have no missing values are
returned unchanged.</p>
|
|
<p>The postcondition is that dtype/->array will return a java array in the appropriate
|
|
datatype for each column.</p>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li><code>:unpack?</code> - unpack packed datetime types. Defaults to true</li>
|
|
</ul>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L602">view source</a></div></div><div class="public anchor" id="var-filter"><h3>filter</h3><div class="usage"><code>(filter dataset predicate)</code></div><div class="doc"><div class="markdown"><p>dataset->dataset transformation. Predicate is passed a map of
|
|
colname->column-value.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L623">view source</a></div></div><div class="public anchor" id="var-filter-column"><h3>filter-column</h3><div class="usage"><code>(filter-column dataset colname predicate)</code><code>(filter-column dataset colname)</code></div><div class="doc"><div class="markdown"><p>Filter a given column by a predicate. Predicate is passed column values.
|
|
If predicate is <em>not</em> an instance of IFn it is treated as a value and will
|
|
be used as if the predicate is #(= value %).</p>
|
|
<p>The 2-arity form of this function reads the column as a boolean reader, so, for
instance, numeric 0 values are false, as are Double/NaN and Float/NaN. Objects are
false only if they are nil.</p>
|
|
<p>Returns a dataset.</p>
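<p>A minimal sketch of the predicate forms described above, alongside row-wise <code>filter</code> for comparison. The dataset <code>letters</code> is illustrative. Note that a keyword is an IFn, so the value-equality shortcut is shown with a string.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset used only for this illustration.
(def letters (ds/->dataset {:a [1 2 3 4] :b ["x" "y" "x" "z"]}))

;; filter: the predicate receives a map of colname->column-value.
(def by-row (ds/filter letters (fn [row] (> (:a row) 2))))

;; filter-column: the predicate receives the values of one column.
(def by-col (ds/filter-column letters :a (fn [v] (> v 2))))

;; A non-IFn predicate is treated as a value to compare against.
(def only-x (ds/filter-column letters :b "x"))

(vec (only-x :a)) ;; [1 3]
</code></pre>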
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L630">view source</a></div></div><div class="public anchor" id="var-filter-dataset"><h3>filter-dataset</h3><div class="usage"><code>(filter-dataset dataset filter-fn-or-ds)</code></div><div class="doc"><div class="markdown"><p>Filter the columns of the dataset returning a new dataset. This pathway is
|
|
designed to work with the tech.v3.dataset.column-filters namespace.</p>
|
|
<ul>
|
|
<li>If filter-fn-or-ds is a dataset, it is returned.</li>
|
|
<li>If filter-fn-or-ds is sequential, then select-columns is called.</li>
|
|
<li>If filter-fn-or-ds is :all, all columns are returned</li>
|
|
<li>If filter-fn-or-ds is an instance of IFn, the dataset is passed into it.</li>
|
|
</ul>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L646">view source</a></div></div><div class="public anchor" id="var-group-by"><h3>group-by</h3><div class="usage"><code>(group-by dataset key-fn options)</code><code>(group-by dataset key-fn)</code></div><div class="doc"><div class="markdown"><p>Produce a map of key-fn-value->dataset. The argument to key-fn
|
|
is a map of colname->column-value representing a row in dataset.
|
|
Each dataset in the resulting map contains all and only rows
|
|
that produce the same key-fn-value.</p>
|
|
<p>Options - options are passed into dtype arggroup:</p>
|
|
<ul>
|
|
<li><code>:group-by-finalizer</code> - when provided this is run on each dataset immediately after the
|
|
rows are selected. This can be used to immediately perform a reduction on each new
|
|
dataset which is faster than doing it in a separate run.</li>
|
|
</ul>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L658">view source</a></div></div><div class="public anchor" id="var-group-by-.3Eindexes"><h3>group-by->indexes</h3><div class="usage"><code>(group-by->indexes dataset key-fn options)</code><code>(group-by->indexes dataset key-fn)</code></div><div class="doc"><div class="markdown"><p>(Non-lazy) - Group a dataset and return a map of key-fn-value->indexes where indexes
|
|
is an in-order contiguous group of indexes.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L675">view source</a></div></div><div class="public anchor" id="var-group-by-column"><h3>group-by-column</h3><div class="usage"><code>(group-by-column dataset colname options)</code><code>(group-by-column dataset colname)</code></div><div class="doc"><div class="markdown"><p>Return a map of column-value->dataset. Each dataset in the
|
|
resulting map contains all and only rows with the same value in
|
|
column.</p>
|
|
<ul>
|
|
<li><code>:group-by-finalizer</code> - when provided this is run on each dataset immediately after the
|
|
rows are selected. This can be used to immediately perform a reduction on each new
|
|
dataset which is faster than doing it in a separate run.</li>
|
|
</ul>
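<p>A sketch of grouping by a column, with and without <code>:group-by-finalizer</code>. The dataset is made up, and using <code>ds/row-count</code> as the finalizer is just one possible reduction chosen for illustration.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset used only for this illustration.
(def letters (ds/->dataset {:a [1 2 3 4] :b ["x" "y" "x" "z"]}))

;; Map of column-value->dataset.
(def by-b (ds/group-by-column letters :b))
(set (keys by-b))          ;; #{"x" "y" "z"}
(ds/row-count (by-b "x"))  ;; 2

;; With a finalizer each group is reduced as soon as its rows are
;; selected, so the map values are row counts instead of datasets.
(def counts (ds/group-by-column letters :b {:group-by-finalizer ds/row-count}))
(counts "x")
</code></pre>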
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L684">view source</a></div></div><div class="public anchor" id="var-group-by-column-.3Eindexes"><h3>group-by-column->indexes</h3><div class="usage"><code>(group-by-column->indexes dataset colname options)</code><code>(group-by-column->indexes dataset colname)</code></div><div class="doc"><div class="markdown"><p>(Non-lazy) - Group a dataset by a column and return a map of column-val->indexes
|
|
where indexes is an in-order contiguous group of indexes.</p>
|
|
<p>Options are passed into dtype's arggroup method.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L698">view source</a></div></div><div class="public anchor" id="var-group-by-column-consumer"><h3>group-by-column-consumer</h3><div class="usage"><code>(group-by-column-consumer ds cname)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L709">view source</a></div></div><div class="public anchor" id="var-has-column.3F"><h3>has-column?</h3><div class="usage"><code>(has-column? dataset column-name)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L714">view source</a></div></div><div class="public anchor" id="var-head"><h3>head</h3><div class="usage"><code>(head dataset n)</code><code>(head dataset)</code></div><div class="doc"><div class="markdown"><p>Get the first n rows of a dataset. Equivalent to
<code>(select-rows ds (range n))</code>. The dataset is the first argument, so this can
be used in <code>-></code> threading forms.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L719">view source</a></div></div><div class="public anchor" id="var-induction"><h3>induction</h3><div class="usage"><code>(induction ds induct-fn & args)</code></div><div class="doc"><div class="markdown"><p>Given a dataset and a function from dataset->row produce a new dataset.
|
|
The produced row will be merged with the current row and then added to the
|
|
dataset.</p>
|
|
<p>Options are same as the options used for <a href="tech.v3.dataset.html#var--.3Edataset">->dataset</a> in order for the
|
|
user to control the parsing of the return values of <code>induct-fn</code>.
|
|
A new dataset is returned.</p>
|
|
<p>Example:</p>
|
|
<pre><code class="language-clojure">user> (def ds (ds/->dataset {:a [0 1 2 3] :b [1 2 3 4]}))
#'user/ds
user> ds
_unnamed [4 2]:

| :a | :b |
|---:|---:|
|  0 |  1 |
|  1 |  2 |
|  2 |  3 |
|  3 |  4 |
user> (ds/induction ds (fn [ds]
                         {:sum-of-previous-row (dfn/sum (ds/rowvec-at ds -1))
                          :sum-a (dfn/sum (ds :a))
                          :sum-b (dfn/sum (ds :b))}))
_unnamed [4 5]:

| :a | :b | :sum-b | :sum-a | :sum-of-previous-row |
|---:|---:|-------:|-------:|---------------------:|
|  0 |  1 |    0.0 |    0.0 |                  0.0 |
|  1 |  2 |    1.0 |    0.0 |                  1.0 |
|  2 |  3 |    3.0 |    1.0 |                  5.0 |
|  3 |  4 |    6.0 |    3.0 |                 14.0 |
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L729">view source</a></div></div><div class="public anchor" id="var-major-version"><h3>major-version</h3><div class="usage"></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L769">view source</a></div></div><div class="public anchor" id="var-mapseq-parser"><h3>mapseq-parser</h3><div class="usage"><code>(mapseq-parser options)</code><code>(mapseq-parser)</code></div><div class="doc"><div class="markdown"><p>Return a Clojure function. When called with one argument, that argument must be the next map
to add to the dataset. When called with no arguments, it returns the current dataset. This can be
used to efficiently transform a stream of maps into a dataset while getting intermediate
datasets during the parse operation.</p>
|
|
<p>Options are the same as for <a href="tech.v3.dataset.html#var--.3Edataset">->dataset</a>.</p>
|
|
<pre><code class="language-clojure">user> (require '[tech.v3.dataset :as ds])
nil
user> (def pfn (ds/mapseq-parser))
#'user/pfn
user> (pfn {:a 1 :b 2})
nil
user> (pfn {:a 1 :b 2})
nil
user> (pfn {:a 2 :c 3})
nil
user> (pfn)
_unnamed [3 3]:

| :a | :b | :c |
|---:|---:|---:|
|  1 |  2 |    |
|  1 |  2 |    |
|  2 |    |  3 |
user> (pfn {:a 3 :d 4})
nil
user> (pfn {:a 5 :c 6})
nil
user> (pfn)
_unnamed [5 4]:

| :a | :b | :c | :d |
|---:|---:|---:|---:|
|  1 |  2 |    |    |
|  1 |  2 |    |    |
|  2 |    |  3 |    |
|  3 |    |    |  4 |
|  5 |    |  6 |    |
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L770">view source</a></div></div><div class="public anchor" id="var-mapseq-reader"><h3>mapseq-reader</h3><div class="usage"><code>(mapseq-reader dataset options)</code><code>(mapseq-reader dataset)</code></div><div class="doc"><div class="markdown"><p>Return a reader that produces a map of column-name->column-value
|
|
upon read.</p>
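<p>A minimal sketch of reading rows through the reader, assuming a small made-up dataset; the names <code>people</code> and <code>reader</code> are illustrative only.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Illustrative data; the column names and values are made up.
(def people (ds/->dataset [{:name "Ada" :age 36}
                           {:name "Alan" :age 41}]))

;; The reader is indexable; each read produces a map for that row.
(def reader (ds/mapseq-reader people))

(count reader)         ;; number of rows in the dataset
(:age (nth reader 1))  ;; the :age value of the second row
</code></pre>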
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L818">view source</a></div></div><div class="public anchor" id="var-mapseq-rf"><h3>mapseq-rf</h3><div class="usage"><code>(mapseq-rf)</code><code>(mapseq-rf options)</code></div><div class="doc"><div class="markdown"><p>Create a transduce-compatible rf that reduces a sequence of maps into a dataset.
|
|
Same options as <a href="tech.v3.dataset.html#var--.3Edataset">->dataset</a>.</p>
|
|
<pre><code class="language-clojure">user> (transduce (map identity) (ds/mapseq-rf {:dataset-name :transduced}) [{:a 1 :b 2}])
|
|
:transduced [1 2]:
|
|
|
|
| :a | :b |
|
|
|---:|---:|
|
|
| 1 | 2 |
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L827">view source</a></div></div><div class="public anchor" id="var-min-n-by-column"><h3>min-n-by-column</h3><div class="usage"><code>(min-n-by-column dataset cname N comparator options)</code><code>(min-n-by-column dataset cname N comparator)</code><code>(min-n-by-column dataset cname N)</code></div><div class="doc"><div class="markdown"><p>Find the minimum N entries (unsorted) by column. Resulting data will be indexed in
|
|
original order. If you want a sorted order then sort the result.</p>
|
|
<p>See options to <a href="tech.v3.dataset.html#var-sort-by-column">sort-by-column</a>.</p>
|
|
<p>Example:</p>
|
|
<pre><code class="language-clojure">user> (ds/min-n-by-column ds "price" 10 nil nil)
|
|
test/data/stocks.csv [10 3]:
|
|
|
|
| symbol | date | price |
|
|
|--------|------------|------:|
|
|
| AMZN | 2001-09-01 | 5.97 |
|
|
| AMZN | 2001-10-01 | 6.98 |
|
|
| AAPL | 2000-12-01 | 7.44 |
|
|
| AAPL | 2002-08-01 | 7.38 |
|
|
| AAPL | 2002-09-01 | 7.25 |
|
|
| AAPL | 2002-12-01 | 7.16 |
|
|
| AAPL | 2003-01-01 | 7.18 |
|
|
| AAPL | 2003-02-01 | 7.51 |
|
|
| AAPL | 2003-03-01 | 7.07 |
|
|
| AAPL | 2003-04-01 | 7.11 |
|
|
user> (ds/min-n-by-column ds "price" 10 > nil)
|
|
test/data/stocks.csv [10 3]:
|
|
|
|
| symbol | date | price |
|
|
|--------|------------|-------:|
|
|
| GOOG | 2007-09-01 | 567.27 |
|
|
| GOOG | 2007-10-01 | 707.00 |
|
|
| GOOG | 2007-11-01 | 693.00 |
|
|
| GOOG | 2007-12-01 | 691.48 |
|
|
| GOOG | 2008-01-01 | 564.30 |
|
|
| GOOG | 2008-04-01 | 574.29 |
|
|
| GOOG | 2008-05-01 | 585.80 |
|
|
| GOOG | 2009-11-01 | 583.00 |
|
|
| GOOG | 2009-12-01 | 619.98 |
|
|
| GOOG | 2010-03-01 | 560.19 |
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L845">view source</a></div></div><div class="public anchor" id="var-missing"><h3>missing</h3><div class="usage"><code>(missing dataset-or-col)</code></div><div class="doc"><div class="markdown"><p>Given a dataset or a column, return the missing set as a roaring bitmap</p>
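<p>As a rough illustration, the missing set can be inspected for a single column or for a whole dataset. The dataset below is made up, and the dataset-level result is assumed here to cover every row that has a missing value in any column.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Row 1 has no :b and row 2 has no :a.
(def sparse (ds/->dataset [{:a 1 :b 2} {:a 2} {:b 3}]))

;; Missing row indexes for one column, copied out of the bitmap.
(def missing-a (vec (ds/missing (ds/column sparse :a))))

;; Missing set for the dataset as a whole (assumed to combine all columns).
(def missing-any (ds/missing sparse))
</code></pre>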
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L893">view source</a></div></div><div class="public anchor" id="var-new-column"><h3>new-column</h3><div class="usage"><code>(new-column name data)</code><code>(new-column name data metadata)</code><code>(new-column name data metadata missing)</code><code>(new-column data-or-data-map)</code></div><div class="doc"><div class="markdown"><p>Create a new column. Data will be scanned for missing values
|
|
unless the full 4-argument pathway is used.</p>
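<p>A short sketch of the two-argument pathway, using a made-up column name and values. The <code>nil</code> entry is expected to be picked up by the scan and reported in the column's missing set.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Two-argument pathway: the data is scanned for missing values.
(def score (ds/new-column :score [10 nil 30]))

(ds/row-count score)  ;; the column still has three entries
(ds/missing score)    ;; bitmap of the indexes found missing by the scan
</code></pre>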
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L899">view source</a></div></div><div class="public anchor" id="var-new-dataset"><h3>new-dataset</h3><div class="usage"><code>(new-dataset options ds-metadata column-seq)</code><code>(new-dataset options column-seq)</code><code>(new-dataset column-seq)</code></div><div class="doc"><div class="markdown"><p>Create a new dataset from a sequence of columns. Data will be converted
|
|
into columns using ds-col-proto/ensure-column-seq. If the column seq is simply a
|
|
collection of vectors, for instance, columns will be named ordinally.
|
|
options map -
|
|
:dataset-name - Name of the dataset. Defaults to "_unnamed".
|
|
:key-fn - Key function used on all column names before insertion into dataset.</p>
|
|
<p>The return value fulfills the dataset protocols.</p>
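<p>For illustration, a dataset can be assembled from explicitly constructed columns. The dataset name and column contents below are made up.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Build a dataset from two explicitly constructed columns.
(def example
  (ds/new-dataset {:dataset-name "example"}
                  [(ds/new-column :a [1 2 3])
                   (ds/new-column :b ["x" "y" "z"])]))

(ds/dataset-name example)  ;; the name given in the options map
(ds/column-names example)  ;; column names in insertion order
(ds/row-count example)     ;; number of rows
</code></pre>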
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L912">view source</a></div></div><div class="public anchor" id="var-order-column-names"><h3>order-column-names</h3><div class="usage"><code>(order-column-names dataset colname-seq)</code></div><div class="doc"><div class="markdown"><p>Order a sequence of column names so they match the order in the
|
|
original dataset. Missing columns are placed last.</p>
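<p>A small sketch of the ordering behavior described above, on a made-up three-column dataset; <code>:zzz</code> is a deliberately nonexistent column name.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def abc (ds/->dataset {:a [1] :b [2] :c [3]}))

;; Names come back in dataset order rather than argument order.
(def ordered (vec (ds/order-column-names abc [:c :a])))

;; A name that is not in the dataset is placed last.
(def with-unknown (vec (ds/order-column-names abc [:zzz :b :a])))
</code></pre>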
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L929">view source</a></div></div><div class="public anchor" id="var-pmap-ds"><h3>pmap-ds</h3><div class="usage"><code>(pmap-ds ds ds-map-fn options)</code><code>(pmap-ds ds ds-map-fn)</code></div><div class="doc"><div class="markdown"><p>Parallelize mapping a function from dataset->dataset across a single dataset. Results are
|
|
coalesced back into a single dataset. The original dataset is simply sliced into n-core
|
|
results and map-fn is called n-core times. ds-map-fn must be a function from
|
|
dataset->dataset although it may return nil.</p>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li><code>:max-batch-size</code> - this is a default for tech.v3.parallel.for/indexed-map-reduce. You
|
|
can control how many rows are processed in a given batch - the default is 64000. If your
|
|
mapping pathway produces a large expansion in the size of the dataset then it may be
|
|
good to reduce the max batch size and use :as-seq to produce a sequence of datasets.</li>
|
|
<li><code>:result-type</code>
|
|
<ul>
|
|
<li><code>:as-seq</code> - Return a sequence of datasets, one for each batch.</li>
|
|
<li><code>:as-ds</code> - Return a single dataset with all results in memory (default option).</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
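<p>A minimal sketch of both result types, assuming a made-up single-column dataset and a hypothetical helper <code>keep-even</code> that filters each slice.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Illustrative data: one column holding the integers 0..999.
(def numbers (ds/->dataset {:x (range 1000)}))

;; ds-map-fn receives a slice of the dataset and returns a dataset.
(defn keep-even [sub-ds]
  (ds/filter-column sub-ds :x even?))

;; Default result type: the per-slice results are merged into one dataset.
(def evens (ds/pmap-ds numbers keep-even))

;; :as-seq returns one dataset per batch instead of a merged dataset.
(def even-batches (ds/pmap-ds numbers keep-even {:result-type :as-seq}))

(ds/row-count evens)
(reduce + (map ds/row-count even-batches))
</code></pre>
<p>The <code>:as-seq</code> form avoids holding one large merged result, which is the case the <code>:max-batch-size</code> note above is concerned with.</p>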
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L936">view source</a></div></div><div class="public anchor" id="var-print-all"><h3>print-all</h3><div class="usage"><code>(print-all dataset)</code></div><div class="doc"><div class="markdown"><p>Helper function equivalent to <code>(tech.v3.dataset.print/print-range ... :all)</code></p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L957">view source</a></div></div><div class="public anchor" id="var-rand-nth"><h3>rand-nth</h3><div class="usage"><code>(rand-nth dataset)</code></div><div class="doc"><div class="markdown"><p>Return a random row from the dataset in map format</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L963">view source</a></div></div><div class="public anchor" id="var-remove-column"><h3>remove-column</h3><div class="usage"><code>(remove-column dataset col-name)</code></div><div class="doc"><div class="markdown"><p>Same as:</p>
|
|
<pre><code class="language-clojure">(dissoc dataset col-name)
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L969">view source</a></div></div><div class="public anchor" id="var-remove-columns"><h3>remove-columns</h3><div class="usage"><code>(remove-columns dataset colname-seq-or-fn)</code></div><div class="doc"><div class="markdown"><p>Remove columns indexed by column name seq or column filter function.
|
|
For example:</p>
|
|
<pre><code class="language-clojure"> (remove-columns DS [:A :B])
|
|
(remove-columns DS cf/categorical)
|
|
</code></pre>
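<p>A slightly fuller sketch of the same calls on a made-up dataset. The <code>cf</code> alias is assumed to refer to <code>tech.v3.dataset.column-filters</code>, and the string column is assumed to count as categorical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.column-filters :as cf])

(def mixed (ds/->dataset {:a [1 2] :b ["x" "y"] :c [1.5 2.5]}))

;; Remove columns by name.
(def only-c (ds/remove-columns mixed [:a :b]))

;; Remove every column selected by a column filter function.
(def no-categorical (ds/remove-columns mixed cf/categorical))

;; Single-column variant.
(def without-c (ds/remove-column mixed :c))

(ds/column-names only-c)
(ds/column-names without-c)
</code></pre>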
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L979">view source</a></div></div><div class="public anchor" id="var-remove-empty-columns"><h3>remove-empty-columns</h3><div class="usage"><code>(remove-empty-columns ds)</code></div><div class="doc"><div class="markdown"><p>Remove all columns that have no data - missing set length equals row count.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L991">view source</a></div></div><div class="public anchor" id="var-remove-rows"><h3>remove-rows</h3><div class="usage"><code>(remove-rows dataset-or-col row-indexes)</code></div><div class="doc"><div class="markdown"><p>Same as drop-rows.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L997">view source</a></div></div><div class="public anchor" id="var-rename-columns"><h3>rename-columns</h3><div class="usage"><code>(rename-columns dataset colnames)</code></div><div class="doc"><div class="markdown"><p>Rename columns using a map or vector of column names.</p>
|
|
<p>Does not reorder columns; rename is in-place for maps and
|
|
positional for vectors.</p>
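<p>Both forms can be sketched on a made-up two-column dataset:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def pair (ds/->dataset {:a [1 2] :b [3 4]}))

;; Map form: only the named column is renamed, and it keeps its position.
(def by-map (ds/rename-columns pair {:b :beta}))

;; Vector form: new names are applied by position.
(def by-position (ds/rename-columns pair [:x :y]))

(ds/column-names by-map)
(ds/column-names by-position)
</code></pre>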
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1003">view source</a></div></div><div class="public anchor" id="var-replace-missing"><h3>replace-missing</h3><div class="usage"><code>(replace-missing ds)</code><code>(replace-missing ds strategy)</code><code>(replace-missing ds columns-selector strategy)</code><code>(replace-missing ds columns-selector strategy value)</code></div><div class="doc"><div class="markdown"><p>Replace missing values in some columns with a given strategy.
|
|
The columns selector may be:</p>
|
|
<ul>
|
|
<li>seq of any legal column names</li>
|
|
<li>or a column filter function, such as <code>numeric</code> and <code>categorical</code></li>
|
|
</ul>
|
|
<p>Strategies may be:</p>
|
|
<ul>
|
|
<li>
|
|
<p><code>:down</code> - take value from previous non-missing row if possible else use provided value.</p>
|
|
</li>
|
|
<li>
|
|
<p><code>:up</code> - take value from next non-missing row if possible else use provided value.</p>
|
|
</li>
|
|
<li>
|
|
<p><code>:downup</code> - take value from previous if possible else use next.</p>
|
|
</li>
|
|
<li>
|
|
<p><code>:updown</code> - take value from next if possible else use previous.</p>
|
|
</li>
|
|
<li>
|
|
<p><code>:nearest</code> - Use nearest of next or previous values. <code>:mid</code> is an alias for <code>:nearest</code>.</p>
|
|
</li>
|
|
<li>
|
|
<p><code>:midpoint</code> - Use midpoint of averaged values between previous and next nonmissing
|
|
rows.</p>
|
|
</li>
|
|
<li>
|
|
<p><code>:abb</code> - Impute missing with approximate Bayesian bootstrap. See <a href="https://search.r-project.org/CRAN/refmans/LaplacesDemon/html/ABB.html">R's ABB</a>.</p>
|
|
</li>
|
|
<li>
|
|
<p><code>:lerp</code> - Linearly interpolate values between previous and next nonmissing rows.</p>
|
|
</li>
|
|
<li>
|
|
<p><code>:value</code> - Value will be provided - see below.</p>
|
|
<p>The provided value is used as the filler. The value may also be a function, in which
case it is called on the column with missing values elided and its return value is
used as the filler.</p>
|
|
</li>
|
|
</ul>
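<p>A brief sketch of three of the strategies above, assuming a made-up integer column with two adjacent missing entries.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; The two nil entries are treated as missing values.
(def gaps (ds/->dataset {:a [1 nil nil 4]}))

;; :down carries the previous non-missing value forward.
(def filled-down
  (vec (ds/column (ds/replace-missing gaps [:a] :down) :a)))

;; :up pulls the next non-missing value backward.
(def filled-up
  (vec (ds/column (ds/replace-missing gaps [:a] :up) :a)))

;; :value substitutes the provided scalar.
(def filled-zero
  (vec (ds/column (ds/replace-missing gaps [:a] :value 0) :a)))
</code></pre>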
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1012">view source</a></div></div><div class="public anchor" id="var-replace-missing-value"><h3>replace-missing-value</h3><div class="usage"><code>(replace-missing-value dataset filter-fn-or-ds scalar-value)</code><code>(replace-missing-value dataset scalar-value)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1045">view source</a></div></div><div class="public anchor" id="var-reverse-rows"><h3>reverse-rows</h3><div class="usage"><code>(reverse-rows dataset-or-col)</code></div><div class="doc"><div class="markdown"><p>Reverse the rows in the dataset or column.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1052">view source</a></div></div><div class="public anchor" id="var-row-at"><h3>row-at</h3><div class="usage"><code>(row-at ds idx)</code></div><div class="doc"><div class="markdown"><p>Get the row at an individual index. If indexes are negative then the dataset
|
|
is indexed from the end.</p>
|
|
<pre><code class="language-clojure">user> (ds/row-at stocks 1)
|
|
{"date" #object[java.time.LocalDate 0x534cb03b "2000-02-01"],
|
|
"symbol" "MSFT",
|
|
"price" 36.35}
|
|
user> (ds/row-at stocks -1)
|
|
{"date" #object[java.time.LocalDate 0x6bf60ed5 "2010-03-01"],
|
|
"symbol" "AAPL",
|
|
"price" 223.02}
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1058">view source</a></div></div><div class="public anchor" id="var-row-count"><h3>row-count</h3><div class="usage"><code>(row-count dataset-or-col)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1076">view source</a></div></div><div class="public anchor" id="var-row-map"><h3>row-map</h3><div class="usage"><code>(row-map ds map-fn options)</code><code>(row-map ds map-fn)</code></div><div class="doc"><div class="markdown"><p>Map a function across the rows of the dataset producing a new dataset
|
|
that is merged back into the original potentially replacing existing columns.
|
|
Options are passed into the <a href="tech.v3.dataset.html#var--.3Edataset">->dataset</a> function so you can control the resulting
|
|
column types by the usual dataset parsing options described there.</p>
|
|
<p>Options:</p>
|
|
<p>See options for <a href="tech.v3.dataset.html#var-pmap-ds">pmap-ds</a>. In particular, note that you can
|
|
produce a sequence of datasets as opposed to a single large dataset.</p>
|
|
<p>Speed demons should attempt both <code>{:copying? false}</code> and <code>{:copying? true}</code> in the options
|
|
map as that changes rather drastically how data is read from the datasets. If you are
|
|
going to read all the data in the dataset, <code>{:copying? true}</code> will most likely be
|
|
the faster of the two.</p>
|
|
<p>Examples:</p>
|
|
<pre><code class="language-clojure">user> (def stocks (ds/->dataset "test/data/stocks.csv"))
|
|
#'user/stocks
|
|
user> (ds/head stocks)
|
|
test/data/stocks.csv [5 3]:
|
|
|
|
| symbol | date | price |
|
|
|--------|------------|------:|
|
|
| MSFT | 2000-01-01 | 39.81 |
|
|
| MSFT | 2000-02-01 | 36.35 |
|
|
| MSFT | 2000-03-01 | 43.22 |
|
|
| MSFT | 2000-04-01 | 28.37 |
|
|
| MSFT | 2000-05-01 | 25.45 |
|
|
user> (ds/head (ds/row-map stocks (fn [row]
|
|
{"symbol" (keyword (row "symbol"))
|
|
:price2 (* (row "price")(row "price"))})))
|
|
test/data/stocks.csv [5 4]:
|
|
|
|
| symbol | date | price | :price2 |
|
|
|--------|------------|------:|----------:|
|
|
| :MSFT | 2000-01-01 | 39.81 | 1584.8361 |
|
|
| :MSFT | 2000-02-01 | 36.35 | 1321.3225 |
|
|
| :MSFT | 2000-03-01 | 43.22 | 1867.9684 |
|
|
| :MSFT | 2000-04-01 | 28.37 | 804.8569 |
|
|
| :MSFT | 2000-05-01 | 25.45 | 647.7025 |
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1081">view source</a></div></div><div class="public anchor" id="var-row-mapcat"><h3>row-mapcat</h3><div class="usage"><code>(row-mapcat ds mapcat-fn options)</code><code>(row-mapcat ds mapcat-fn)</code></div><div class="doc"><div class="markdown"><p>Map a function across the rows of the dataset. The function must produce a sequence of
|
|
maps and the original dataset rows will be duplicated and then merged into the result
|
|
of calling (->> (apply concat) (->>dataset options)) on the result of <code>mapcat-fn</code>. Options
|
|
are the same as <a href="tech.v3.dataset.html#var--.3Edataset">->dataset</a>.</p>
|
|
<p>The smaller the maps returned from mapcat-fn the better, perhaps consider using records.
|
|
In the case that a mapcat-fn result map has a key that overlaps a column name the
|
|
column will be replaced with the output of mapcat-fn. The returned map will have the
|
|
key <code>:_row-id</code> assoc'd onto it so for absolutely minimal gc usage include this
|
|
as a member variable in your map.</p>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li>See options for <a href="tech.v3.dataset.html#var-pmap-ds">pmap-ds</a>. Especially note <code>:max-batch-size</code> and <code>:result-type</code>.
|
|
In order to conserve memory it may be much more efficient to return a sequence of datasets
|
|
rather than one large dataset. If returning sequences of datasets perhaps consider
|
|
a transducing pathway across them or the <a href="tech.v3.dataset.reductions.html">tech.v3.dataset.reductions</a> namespace.</li>
|
|
</ul>
|
|
<p>Example:</p>
|
|
<pre><code class="language-clojure">user> (def ds (ds/->dataset {:rid (range 10)
|
|
:data (repeatedly 10 #(rand-int 3))}))
|
|
#'user/ds
|
|
user> (ds/head ds)
|
|
_unnamed [5 2]:
|
|
|
|
| :rid | :data |
|
|
|-----:|------:|
|
|
| 0 | 0 |
|
|
| 1 | 2 |
|
|
| 2 | 0 |
|
|
| 3 | 1 |
|
|
| 4 | 2 |
|
|
user> (def mapcat-fn (fn [row]
|
|
(for [idx (range (row :data))]
|
|
{:idx idx})))
|
|
#'user/mapcat-fn
|
|
user> (mapcat mapcat-fn (ds/rows ds))
|
|
({:idx 0} {:idx 1} {:idx 0} {:idx 0} {:idx 1} {:idx 0} {:idx 1} {:idx 0} {:idx 1})
|
|
user> (ds/row-mapcat ds mapcat-fn)
|
|
_unnamed [9 3]:
|
|
|
|
| :rid | :data | :idx |
|
|
|-----:|------:|-----:|
|
|
| 1 | 2 | 0 |
|
|
| 1 | 2 | 1 |
|
|
| 3 | 1 | 0 |
|
|
| 4 | 2 | 0 |
|
|
| 4 | 2 | 1 |
|
|
| 6 | 2 | 0 |
|
|
| 6 | 2 | 1 |
|
|
| 8 | 2 | 0 |
|
|
| 8 | 2 | 1 |
|
|
user>
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1132">view source</a></div></div><div class="public anchor" id="var-rows"><h3>rows</h3><div class="usage"><code>(rows ds options)</code><code>(rows ds)</code></div><div class="doc"><div class="markdown"><p>Get the rows of the dataset as a list of potentially flyweight maps.</p>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li>copying? - When true the data is copied out of the dataset row by row upon read of that
|
|
row. When false the data is only referenced upon each read of a particular key. Copying
|
|
is appropriate if you want to use the row values as keys in a map and it is inappropriate if
|
|
you are only going to read a very small portion of the row map.</li>
|
|
<li>nil-missing? - When true, maps returned have nil values for missing entries as opposed
|
|
to eliding the missing keys entirely. It is legacy behavior and slightly faster to
|
|
use <code>:nil-missing? true</code>.</li>
|
|
</ul>
|
|
<pre><code class="language-clojure">user> (take 5 (ds/rows stocks))
|
|
({"date" #object[java.time.LocalDate 0x6c433971 "2000-01-01"],
|
|
"symbol" "MSFT",
|
|
"price" 39.81}
|
|
{"date" #object[java.time.LocalDate 0x28f96b14 "2000-02-01"],
|
|
"symbol" "MSFT",
|
|
"price" 36.35}
|
|
{"date" #object[java.time.LocalDate 0x7bdbf0a "2000-03-01"],
|
|
"symbol" "MSFT",
|
|
"price" 43.22}
|
|
{"date" #object[java.time.LocalDate 0x16d3871e "2000-04-01"],
|
|
"symbol" "MSFT",
|
|
"price" 28.37}
|
|
{"date" #object[java.time.LocalDate 0x47094da0 "2000-05-01"],
|
|
"symbol" "MSFT",
|
|
"price" 25.45})
|
|
|
|
|
|
user> (ds/rows (ds/->dataset [{:a 1 :b 2} {:a 2} {:b 3}]))
|
|
[{:a 1, :b 2} {:a 2} {:b 3}]
|
|
|
|
user> (ds/rows (ds/->dataset [{:a 1 :b 2} {:a 2} {:b 3}]) {:nil-missing? true})
|
|
[{:a 1, :b 2} {:a 2, :b nil} {:a nil, :b 3}]
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1195">view source</a></div></div><div class="public anchor" id="var-rowvec-at"><h3>rowvec-at</h3><div class="usage"><code>(rowvec-at ds idx)</code></div><div class="doc"><div class="markdown"><p>Return a persistent-vector-like row at a given index. Negative indexes index
|
|
from the end.</p>
|
|
<pre><code class="language-clojure">user> (ds/rowvec-at stocks 1)
|
|
["MSFT" #object[java.time.LocalDate 0x5848b8b3 "2000-02-01"] 36.35]
|
|
user> (ds/rowvec-at stocks -1)
|
|
["AAPL" #object[java.time.LocalDate 0x4b70b0d5 "2010-03-01"] 223.02]
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1239">view source</a></div></div><div class="public anchor" id="var-rowvecs"><h3>rowvecs</h3><div class="usage"><code>(rowvecs ds options)</code><code>(rowvecs ds)</code></div><div class="doc"><div class="markdown"><p>Return a randomly addressable list of rows in persistent vector-like form.</p>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li>copying? - When true the data is copied out of the dataset row by row upon read of that
|
|
row. When false the data is only referenced upon each read of a particular key. Copying
|
|
is appropriate if you want to use the row values as keys in a map and it is inappropriate if
|
|
you are only going to read a given key for a given row once.</li>
|
|
</ul>
|
|
<pre><code class="language-clojure">user> (take 5 (ds/rowvecs stocks))
|
|
(["MSFT" #object[java.time.LocalDate 0x5be9e4c8 "2000-01-01"] 39.81]
|
|
["MSFT" #object[java.time.LocalDate 0xf758e5 "2000-02-01"] 36.35]
|
|
["MSFT" #object[java.time.LocalDate 0x752cc84d "2000-03-01"] 43.22]
|
|
["MSFT" #object[java.time.LocalDate 0x7bad4827 "2000-04-01"] 28.37]
|
|
["MSFT" #object[java.time.LocalDate 0x3a62c34a "2000-05-01"] 25.45])
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1253">view source</a></div></div><div class="public anchor" id="var-sample"><h3>sample</h3><div class="usage"><code>(sample dataset n options)</code><code>(sample dataset n)</code><code>(sample dataset)</code></div><div class="doc"><div class="markdown"><p>Sample n-rows from a dataset. Defaults to sampling <em>without</em> replacement.</p>
|
|
<p>For the definition of seed, see the <a href="https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle">argshuffle documentation</a>.</p>
|
|
<p>The returned dataset's metadata is altered by merging in <code>{:print-index-range (range n)}</code> so you
will always see the entire returned dataset. If this isn't desired, <code>vary-meta</code> is a good pathway.</p>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li><code>:replacement?</code> - Do sampling with replacement. Defaults to false.</li>
|
|
<li><code>:seed</code> - Provide a seed as a number or provide a Random implementation.</li>
|
|
</ul>
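<p>A small sketch of both sampling modes on a made-up dataset; the seed value is arbitrary.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def hundred (ds/->dataset {:x (range 100)}))

;; Without replacement (the default): every sampled row is distinct.
(def picked (ds/sample hundred 10 {:seed 42}))

;; With replacement: the same row may be drawn more than once.
(def resampled (ds/sample hundred 10 {:replacement? true :seed 42}))

(ds/row-count picked)
(count (distinct (ds/column picked :x)))
</code></pre>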
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1277">view source</a></div></div><div class="public anchor" id="var-select"><h3>select</h3><div class="usage"><code>(select dataset colname-seq selection)</code></div><div class="doc"><div class="markdown"><p>Reorder/trim dataset according to this sequence of indexes. Returns a new dataset.
|
|
colname-seq - one of:</p>
|
|
<ul>
|
|
<li>:all - all the columns</li>
|
|
<li>sequence of column names - those columns in that order.</li>
|
|
<li>implementation of java.util.Map - column order is dictated by map iteration order;
selected columns are subsequently named after the corresponding value in the map.
Similar to <code>rename-columns</code> except this trims the result to be only the columns
in the map.</li>
</ul>
<p>selection - either the keyword :all, a list of indexes to select, or a list of booleans where
the index position of each true value indicates an index to select. When providing indices,
duplicates will select the specified index position more than once.</p>
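<p>The column and row selection forms can be sketched on a made-up three-column dataset as follows.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def grid (ds/->dataset {:a [1 2 3] :b [4 5 6] :c [7 8 9]}))

;; A sequence of column names reorders the columns; indexes pick the rows.
(def picked (ds/select grid [:c :a] [0 2]))

;; A map selects and renames in one step, dropping all other columns.
(def renamed (ds/select grid {:a :alpha} :all))

;; Duplicate indexes select the same row more than once.
(def repeated (ds/select grid :all [2 2 0]))

(ds/column-names picked)
(vec (ds/column picked :a))
(vec (ds/column repeated :b))
</code></pre>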
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1297">view source</a></div></div><div class="public anchor" id="var-select-by-index"><h3>select-by-index</h3><div class="usage"><code>(select-by-index dataset col-index row-index)</code></div><div class="doc"><div class="markdown"><p>Trim dataset according to this sequence of indexes. Returns a new dataset.</p>
|
|
<p>col-index and row-index - one of:</p>
|
|
<ul>
|
|
<li>:all - all the columns or all the rows</li>
|
|
<li>list of indexes. May contain duplicates. Negative values will be counted from
|
|
the end of the sequence.</li>
|
|
</ul>
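<p>A short sketch of index-based selection with negative indexes, again on a made-up dataset.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def grid (ds/->dataset {:a [1 2 3] :b [4 5 6] :c [7 8 9]}))

;; First and last column, last row only.
(def corner (ds/select-by-index grid [0 -1] [-1]))

;; :all keeps every column while trimming to the first two rows.
(def top (ds/select-by-index grid :all [0 1]))

(ds/column-names corner)
(vec (ds/column corner :c))
(ds/row-count top)
</code></pre>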
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1314">view source</a></div></div><div class="public anchor" id="var-select-columns"><h3>select-columns</h3><div class="usage"><code>(select-columns dataset colname-seq-or-fn)</code></div><div class="doc"><div class="markdown"><p>Select columns from the dataset by:</p>
|
|
<ul>
|
|
<li>seq of column names</li>
|
|
<li>column selector function</li>
|
|
<li><code>:all</code> keyword</li>
|
|
</ul>
|
|
<p>For example:</p>
|
|
<pre><code class="language-clojure">(select-columns DS [:A :B])
|
|
(select-columns DS cf/numeric)
|
|
(select-columns DS :all)
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1326">view source</a></div></div><div class="public anchor" id="var-select-columns-by-index"><h3>select-columns-by-index</h3><div class="usage"><code>(select-columns-by-index dataset col-index)</code></div><div class="doc"><div class="markdown"><p>Select columns from the dataset by a seq of indexes (negative indexes allowed) or :all.</p>
|
|
<p>See documentation for <code>select-by-index</code>.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1344">view source</a></div></div><div class="public anchor" id="var-select-missing"><h3>select-missing</h3><div class="usage"><code>(select-missing dataset-or-col)</code></div><div class="doc"><div class="markdown"><p>Remove missing entries by simply selecting out the missing indexes</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1352">view source</a></div></div><div class="public anchor" id="var-select-rows"><h3>select-rows</h3><div class="usage"><code>(select-rows dataset-or-col row-indexes options)</code><code>(select-rows dataset-or-col row-indexes)</code></div><div class="doc"><div class="markdown"><p>Select rows from the dataset or column.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1358">view source</a></div></div><div class="public anchor" id="var-set-dataset-name"><h3>set-dataset-name</h3><div class="usage"><code>(set-dataset-name dataset ds-name)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1366">view source</a></div></div><div class="public anchor" id="var-shape"><h3>shape</h3><div class="usage"><code>(shape dataset)</code></div><div class="doc"><div class="markdown"><p>Returns shape in column-major format of <code>[n-columns n-rows]</code>.</p>
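<p>For instance, a made-up dataset with two columns of three rows reports the column count first:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Two columns, three rows.
(def small (ds/->dataset {:a [1 2 3] :b [4 5 6]}))

(ds/shape small)  ;; column count first, then row count
</code></pre>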
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1371">view source</a></div></div><div class="public anchor" id="var-shuffle"><h3>shuffle</h3><div class="usage"><code>(shuffle dataset options)</code><code>(shuffle dataset)</code></div><div class="doc"><div class="markdown"><p>Shuffle the rows of the dataset optionally providing a seed.
|
|
See <a href="https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle">https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle</a>.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1377">view source</a></div></div><div class="public anchor" id="var-sort-by"><h3>sort-by</h3><div class="usage"><code>(sort-by dataset key-fn compare-fn & args)</code><code>(sort-by dataset key-fn)</code></div><div class="doc"><div class="markdown"><p>Sort a dataset by a key-fn and compare-fn.</p>
|
|
<ul>
|
|
<li><code>key-fn</code> - function from map to sort value.</li>
|
|
<li><code>compare-fn</code> may be one of:
|
|
<ul>
|
|
<li>a clojure operator like clojure.core/<</li>
|
|
<li><code>:tech.numerics/<</code>, <code>:tech.numerics/></code> for unboxing comparisons of primitive
|
|
values.</li>
|
|
<li>clojure.core/compare</li>
|
|
<li>A custom java.util.Comparator instantiation.</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li><code>:nan-strategy</code> - General missing strategy. Options are <code>:first</code>, <code>:last</code>, and
|
|
<code>:exception</code>.</li>
|
|
<li><code>:parallel?</code> - Uses parallel quicksort when true and regular quicksort when false.</li>
|
|
</ul>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1386">view source</a></div></div><div class="public anchor" id="var-sort-by-column"><h3>sort-by-column</h3><div class="usage"><code>(sort-by-column dataset colname compare-fn & args)</code><code>(sort-by-column dataset colname)</code></div><div class="doc"><div class="markdown"><p>Sort a dataset by a given column using the given compare fn.</p>
|
|
<ul>
|
|
<li><code>compare-fn</code> may be one of:
|
|
<ul>
|
|
<li>a clojure operator like clojure.core/<</li>
|
|
<li><code>:tech.numerics/<</code>, <code>:tech.numerics/></code> for unboxing comparisons of primitive
|
|
values.</li>
|
|
<li>clojure.core/compare</li>
|
|
<li>A custom java.util.Comparator instantiation.</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li><code>:nan-strategy</code> - General missing strategy. Options are <code>:first</code>, <code>:last</code>, and
|
|
<code>:exception</code>.</li>
|
|
<li><code>:parallel?</code> - Uses parallel quicksort when true and regular quicksort when false.</li>
|
|
</ul>
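<p>A minimal sketch of sorting by column and by key function, using a made-up dataset.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def unsorted (ds/->dataset {:sym ["b" "a" "c"] :x [2 3 1]}))

;; With no compare-fn the column is sorted in ascending order.
(def ascending (ds/sort-by-column unsorted :x))

;; A Clojure operator can serve as the compare-fn.
(def descending (ds/sort-by-column unsorted :x >))

;; sort-by takes a key-fn that receives each row as a map.
(def by-symbol (ds/sort-by unsorted (fn [row] (get row :sym))))

(vec (ds/column ascending :x))
(vec (ds/column descending :x))
(vec (ds/column by-symbol :sym))
</code></pre>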
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1408">view source</a></div></div><div class="public anchor" id="var-tail"><h3>tail</h3><div class="usage"><code>(tail dataset n)</code><code>(tail dataset)</code></div><div class="doc"><div class="markdown"><p>Get the last n rows of a dataset. Equivalent to
|
|
<code>(select-rows ds (range ...))</code>. Argument order is dataset-last, however, so this can
|
|
be used in ->> operators.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1429">view source</a></div></div><div class="public anchor" id="var-take-nth"><h3>take-nth</h3><div class="usage"><code>(take-nth dataset n-val)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1439">view source</a></div></div><div class="public anchor" id="var-unique-by"><h3>unique-by</h3><div class="usage"><code>(unique-by dataset options map-fn)</code><code>(unique-by dataset map-fn)</code></div><div class="doc"><div class="markdown"><p>Map-fn function gets passed map for each row, rows are grouped by the
|
|
return value. Keep-fn is used to decide the index to keep.</p>
|
|
<p>:keep-fn - Function from key,idx-seq->idx. Defaults to #(first %2).</p>
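<p>A minimal sketch of both arities, assuming the namespace is aliased as <code>ds</code>. The <code>people</code> dataset and the choice to keep the last index are illustrative assumptions.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset in which two rows share the same :city.
(def people (ds/->dataset [{:name "ann" :city "Oslo"}
                           {:name "bob" :city "Rome"}
                           {:name "cat" :city "Oslo"}]))

;; map-fn receives each row as a map; rows with the same return
;; value form one group. The default keep-fn keeps the first index.
(def first-per-city
  (ds/unique-by people (fn [row] (get row :city))))

;; Supplying :keep-fn to keep the last index of every group instead.
(def last-per-city
  (ds/unique-by people
                {:keep-fn (fn [_key idx-seq] (last idx-seq))}
                (fn [row] (get row :city))))

(ds/row-count first-per-city)
</code></pre>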
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1444">view source</a></div></div><div class="public anchor" id="var-unique-by-column"><h3>unique-by-column</h3><div class="usage"><code>(unique-by-column dataset options colname)</code><code>(unique-by-column dataset colname)</code></div><div class="doc"><div class="markdown"><p>Rows are grouped by their value in the <code>colname</code> column.
Keep-fn is used to decide the index to keep.</p>
|
|
<p>:keep-fn - Function from key, idx-seq->idx. Defaults to #(first %2).</p>
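<p>A brief sketch of the column-based variant, assuming the namespace is aliased as <code>ds</code>. The <code>orders</code> dataset is hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset with a repeated :customer value.
(def orders (ds/->dataset [{:customer "ann" :total 10}
                           {:customer "bob" :total 20}
                           {:customer "ann" :total 30}]))

;; One row is kept per distinct value of the :customer column.
(def one-per-customer (ds/unique-by-column orders :customer))

(ds/row-count one-per-customer)
</code></pre>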
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1455">view source</a></div></div><div class="public anchor" id="var-unordered-select"><h3>unordered-select</h3><div class="usage"><code>(unordered-select dataset colname-seq index-seq)</code></div><div class="doc"><div class="markdown"><p>Perform a selection but use the order of the columns in the existing table; do
|
|
<em>not</em> reorder the columns based on colname-seq. Useful when doing selection based
|
|
on sets or persistent hash maps.</p>
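<p>A small sketch of selecting with a set of column names, assuming the namespace is aliased as <code>ds</code>. The <code>wide</code> dataset and the row indexes are hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset with three columns.
(def wide (ds/->dataset [{:a 1 :b 2 :c 3}
                         {:a 4 :b 5 :c 6}
                         {:a 7 :b 8 :c 9}]))

;; The wanted names come from a set, which carries no meaningful order.
(def wanted #{:c :a})

;; Select those columns for rows 0 and 2. Per the description above,
;; the result follows the column order of `wide`, not of `wanted`.
(def picked (ds/unordered-select wide wanted [0 2]))

(ds/column-names picked)
</code></pre>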
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1466">view source</a></div></div><div class="public anchor" id="var-unroll-column"><h3>unroll-column</h3><div class="usage"><code>(unroll-column dataset column-name)</code><code>(unroll-column dataset column-name options)</code></div><div class="doc"><div class="markdown"><p>Unroll a column that has some (or all) sequential data as entries.
|
|
Returns a new dataset with same columns but with other columns duplicated
|
|
where the unroll happened. Column now contains only scalar data.</p>
|
|
<p>Any missing indexes are dropped.</p>
|
|
<pre><code class="language-clojure">user> (-> (ds/->dataset [{:a 1 :b [2 3]}
                         {:a 2 :b [4 5]}
                         {:a 3 :b :a}])
          (ds/unroll-column :b {:indexes? true}))
_unnamed [5 3]:

| :a | :b | :indexes |
|----+----+----------|
|  1 |  2 |        0 |
|  1 |  3 |        1 |
|  2 |  4 |        0 |
|  2 |  5 |        1 |
|  3 | :a |        0 |
</code></pre>
|
|
<p>Options:</p>
<ul>
<li><code>:datatype</code> - datatype of the resulting column if one aside from <code>:object</code> is desired.</li>
<li><code>:indexes?</code> - If true, create a new column that records the indexes of the values from
the original column. Can also be any other truthy value (like a keyword), in which case
the new column is named with that value.</li>
</ul>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1474">view source</a></div></div><div class="public anchor" id="var-update"><h3>update</h3><div class="usage"><code>(update lhs-ds filter-fn-or-ds update-fn & args)</code></div><div class="doc"><div class="markdown"><p>Update this dataset. Filters this dataset into a new dataset,
|
|
applies update-fn, then merges the result into original dataset.</p>
|
|
<p>This pathway is designed to work with the <code>tech.v3.dataset.column-filters</code> namespace.</p>
|
|
<ul>
|
|
<li><code>filter-fn-or-ds</code> is a generalized parameter. May be a function,
|
|
a dataset or a sequence of column names.</li>
|
|
<li>update-fn must take the dataset as the first argument and must return
|
|
a dataset.</li>
|
|
</ul>
|
|
<pre><code class="language-clojure">(ds/bind-> (ds/->dataset dataset) ds
           (ds/remove-column "Id")
           (ds/update cf/string ds/replace-missing-value "NA")
           (ds/update-elemwise cf/string #(get {"" "NA"} % %))
           (ds/update cf/numeric ds/replace-missing-value 0)
           (ds/update cf/boolean ds/replace-missing-value false)
           (ds/update-columnwise (cf/union (cf/numeric ds) (cf/boolean ds))
                                 #(dtype/elemwise-cast % :float64)))
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1508">view source</a></div></div><div class="public anchor" id="var-update-column"><h3>update-column</h3><div class="usage"><code>(update-column dataset col-name update-fn)</code></div><div class="doc"><div class="markdown"><p>Update a column returning a new dataset. update-fn is a column->column
|
|
transformation. Error if column does not exist.</p>
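<p>A minimal sketch of a column-to-column update, assuming the aliases <code>ds</code> for this namespace and <code>dfn</code> for <code>tech.v3.datatype.functional</code>. The <code>prices</code> dataset is hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.functional :as dfn])

;; Hypothetical single-column dataset.
(def prices (ds/->dataset {:price [1.0 2.0 3.0]}))

;; update-fn receives the existing :price column and returns its
;; replacement, here every value multiplied by two.
(def doubled (ds/update-column prices :price #(dfn/* % 2.0)))

(vec (doubled :price))
</code></pre>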
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1534">view source</a></div></div><div class="public anchor" id="var-update-columns"><h3>update-columns</h3><div class="usage"><code>(update-columns dataset column-name-seq-or-fn update-fn)</code></div><div class="doc"><div class="markdown"><p>Update a sequence of columns selected by column name seq or column selector
|
|
function.</p>
|
|
<p>For example:</p>
|
|
<pre><code class="language-clojure">(update-columns DS [:A :B] #(dfn/+ % 2))
(update-columns DS cf/numeric #(dfn// % 2))
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1541">view source</a></div></div><div class="public anchor" id="var-update-columnwise"><h3>update-columnwise</h3><div class="usage"><code>(update-columnwise dataset filter-fn-or-ds cwise-update-fn & args)</code></div><div class="doc"><div class="markdown"><p>Call update-fn on each column of the dataset. Returns the dataset.
|
|
See <code>update</code> for a description of the arguments.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1555">view source</a></div></div><div class="public anchor" id="var-update-elemwise"><h3>update-elemwise</h3><div class="usage"><code>(update-elemwise dataset filter-fn-or-ds map-fn)</code><code>(update-elemwise dataset map-fn)</code></div><div class="doc"><div class="markdown"><p>Replace all elements in selected columns by calling selected function on each
|
|
element. <code>filter-fn-or-ds</code> follows the same rules as in <code>update</code>; for example,
it may be a sequence of column names. Implicitly clears the missing set, so
<code>map-fn</code> must deal with type-specific missing values correctly.
Returns a new dataset.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1562">view source</a></div></div><div class="public anchor" id="var-value-reader"><h3>value-reader</h3><div class="usage"><code>(value-reader dataset options)</code><code>(value-reader dataset)</code></div><div class="doc"><div class="markdown"><p>Return a reader that produces a reader of column values per index.
|
|
Options:
|
|
:copying? - Default to false - When true row values are copied on read.</p>
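<p>A brief sketch of reading rows through the returned reader, assuming the namespace is aliased as <code>ds</code>. The two-row dataset is hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical two-row, two-column dataset.
(def small (ds/->dataset [{:a 1 :b "x"}
                          {:a 2 :b "y"}]))

;; Each element of `rows` is itself a reader over one row's values.
(def rows (ds/value-reader small))

;; Ask for copied row values, as described under :copying? above.
(def copied-rows (ds/value-reader small {:copying? true}))

(vec (first rows))
</code></pre>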
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1574">view source</a></div></div><div class="public anchor" id="var-write.21"><h3>write!</h3><div class="usage"><code>(write! dataset output-path options)</code><code>(write! dataset output-path)</code></div><div class="doc"><div class="markdown"><p>Write a dataset out to a file. Supported forms are:</p>
|
|
<pre><code class="language-clojure">(ds/write! test-ds "test.csv")
(ds/write! test-ds "test.tsv")
(ds/write! test-ds "test.tsv.gz")
(ds/write! test-ds "test.nippy")
(ds/write! test-ds out-stream)
</code></pre>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li><code>:max-chars-per-column</code> - csv,tsv specific, defaults to 65536 - values longer than this will
|
|
cause an exception during serialization.</li>
|
|
<li><code>:max-num-columns</code> - csv,tsv specific, defaults to 8192 - If the dataset has more than this number of
|
|
columns an exception will be thrown during serialization.</li>
|
|
<li><code>:quoted-columns</code> - csv specific - sequence of column names that you would like to always have quoted.</li>
|
|
<li><code>:file-type</code> - Manually specify the file type. This is usually inferred from the filename but if you
|
|
pass in an output stream then you will need to specify the file type.</li>
|
|
<li><code>:headers?</code> - csv specific - whether the header row is written; defaults to true.</li>
|
|
</ul>
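<p>A minimal sketch of writing to an output stream, where <code>:file-type</code> has to be given because there is no filename to infer it from. It assumes the namespace is aliased as <code>ds</code>. The in-memory stream and the <code>test-ds</code> contents are illustrative assumptions.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])
(import '[java.io ByteArrayOutputStream])

;; Hypothetical dataset to serialize.
(def test-ds (ds/->dataset [{:a 1 :b "x"}
                            {:a 2 :b "y"}]))

;; An in-memory stream stands in for any java.io.OutputStream.
(def out-stream (ByteArrayOutputStream.))

;; There is no file extension here, so the format is stated explicitly.
(ds/write! test-ds out-stream {:file-type :csv})

;; Inspect what was written.
(def csv-text (String. (.toByteArray out-stream) "UTF-8"))
</code></pre>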
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L1584">view source</a></div></div></div></body></html> |