<!DOCTYPE html PUBLIC ""
"">
<html><head><meta charset="UTF-8" /><title>tech.v3.dataset.metamorph documentation</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">8.003</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="000-getting-started.html"><div class="inner"><span>tech.ml.dataset Getting Started</span></div></a></li><li class="depth-1 "><a href="100-walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li><li class="depth-1 "><a href="200-quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span 
class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span class="top"></span><span 
class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch current"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li class="depth-4 
branch"><a href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -641px;"><span class="top" style="height: 650px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.clj-transit.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clj-transit</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="sidebar secondary"><h3><a href="#top"><span class="inner">Public Vars</span></a></h3><ul><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-add-column"><div class="inner"><span>add-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-add-or-update-column"><div class="inner"><span>add-or-update-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-append-columns"><div class="inner"><span>append-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-assoc-ds"><div class="inner"><span>assoc-ds</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-assoc-metadata"><div class="inner"><span>assoc-metadata</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-brief"><div class="inner"><span>brief</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-build-pipelined-function"><div class="inner"><span>build-pipelined-function</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-categorical-.3Enumber"><div class="inner"><span>categorical->number</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-categorical-.3Eone-hot"><div class="inner"><span>categorical->one-hot</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-column"><div class="inner"><span>column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-column-.3Edataset"><div 
class="inner"><span>column->dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-column-cast"><div class="inner"><span>column-cast</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-column-count"><div class="inner"><span>column-count</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-column-labeled-mapseq"><div class="inner"><span>column-labeled-mapseq</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-column-map"><div class="inner"><span>column-map</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-column-names"><div class="inner"><span>column-names</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-column-values-.3Ecategorical"><div class="inner"><span>column-values->categorical</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-columns"><div class="inner"><span>columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-columns-with-missing-seq"><div class="inner"><span>columns-with-missing-seq</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-columnwise-concat"><div class="inner"><span>columnwise-concat</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-concat"><div class="inner"><span>concat</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-concat-copying"><div class="inner"><span>concat-copying</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-concat-inplace"><div class="inner"><span>concat-inplace</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-data-.3Edataset"><div class="inner"><span>data->dataset</span></div></a></li><li class="depth-1"><a 
href="tech.v3.dataset.metamorph.html#var-dataset-.3Ecategorical-xforms"><div class="inner"><span>dataset->categorical-xforms</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-dataset-.3Edata"><div class="inner"><span>dataset->data</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-dataset-name"><div class="inner"><span>dataset-name</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-dataset.3F"><div class="inner"><span>dataset?</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-descriptive-stats"><div class="inner"><span>descriptive-stats</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-drop-columns"><div class="inner"><span>drop-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-drop-missing"><div class="inner"><span>drop-missing</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-drop-rows"><div class="inner"><span>drop-rows</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-empty-column-names"><div class="inner"><span>empty-column-names</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-empty-dataset"><div class="inner"><span>empty-dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-ensure-array-backed"><div class="inner"><span>ensure-array-backed</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-feature-ecount"><div class="inner"><span>feature-ecount</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-filter"><div class="inner"><span>filter</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-filter-column"><div class="inner"><span>filter-column</span></div></a></li><li class="depth-1"><a 
href="tech.v3.dataset.metamorph.html#var-filter-dataset"><div class="inner"><span>filter-dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-group-by"><div class="inner"><span>group-by</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-group-by-.3Eindexes"><div class="inner"><span>group-by->indexes</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-group-by-column"><div class="inner"><span>group-by-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-group-by-column-.3Eindexes"><div class="inner"><span>group-by-column->indexes</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-group-by-column-consumer"><div class="inner"><span>group-by-column-consumer</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-has-column.3F"><div class="inner"><span>has-column?</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-head"><div class="inner"><span>head</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-induction"><div class="inner"><span>induction</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-inference-column.3F"><div class="inner"><span>inference-column?</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-inference-target-column-names"><div class="inner"><span>inference-target-column-names</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-inference-target-ds"><div class="inner"><span>inference-target-ds</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-inference-target-label-inverse-map"><div class="inner"><span>inference-target-label-inverse-map</span></div></a></li><li class="depth-1"><a 
href="tech.v3.dataset.metamorph.html#var-inference-target-label-map"><div class="inner"><span>inference-target-label-map</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-k-fold-datasets"><div class="inner"><span>k-fold-datasets</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-labels"><div class="inner"><span>labels</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-mapseq-reader"><div class="inner"><span>mapseq-reader</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-min-n-by-column"><div class="inner"><span>min-n-by-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-missing"><div class="inner"><span>missing</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-model-type"><div class="inner"><span>model-type</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-new-column"><div class="inner"><span>new-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-new-dataset"><div class="inner"><span>new-dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-num-inference-classes"><div class="inner"><span>num-inference-classes</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-order-column-names"><div class="inner"><span>order-column-names</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-pmap-ds"><div class="inner"><span>pmap-ds</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-print-all"><div class="inner"><span>print-all</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-probability-distributions-.3Elabel-column"><div class="inner"><span>probability-distributions->label-column</span></div></a></li><li class="depth-1"><a 
href="tech.v3.dataset.metamorph.html#var-rand-nth"><div class="inner"><span>rand-nth</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-remove-column"><div class="inner"><span>remove-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-remove-columns"><div class="inner"><span>remove-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-remove-empty-columns"><div class="inner"><span>remove-empty-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-remove-rows"><div class="inner"><span>remove-rows</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-rename-columns"><div class="inner"><span>rename-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-replace-missing"><div class="inner"><span>replace-missing</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-replace-missing-value"><div class="inner"><span>replace-missing-value</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-reverse-rows"><div class="inner"><span>reverse-rows</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-row-at"><div class="inner"><span>row-at</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-row-count"><div class="inner"><span>row-count</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-row-map"><div class="inner"><span>row-map</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-row-mapcat"><div class="inner"><span>row-mapcat</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-rows"><div class="inner"><span>rows</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-rowvec-at"><div 
class="inner"><span>rowvec-at</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-rowvecs"><div class="inner"><span>rowvecs</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-sample"><div class="inner"><span>sample</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-select"><div class="inner"><span>select</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-select-by-index"><div class="inner"><span>select-by-index</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-select-columns"><div class="inner"><span>select-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-select-columns-by-index"><div class="inner"><span>select-columns-by-index</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-select-missing"><div class="inner"><span>select-missing</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-select-rows"><div class="inner"><span>select-rows</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-set-dataset-name"><div class="inner"><span>set-dataset-name</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-set-inference-target"><div class="inner"><span>set-inference-target</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-shape"><div class="inner"><span>shape</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-shuffle"><div class="inner"><span>shuffle</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-sort-by"><div class="inner"><span>sort-by</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-sort-by-column"><div class="inner"><span>sort-by-column</span></div></a></li><li class="depth-1"><a 
href="tech.v3.dataset.metamorph.html#var-tail"><div class="inner"><span>tail</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-take-nth"><div class="inner"><span>take-nth</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-train-test-split"><div class="inner"><span>train-test-split</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-unique-by"><div class="inner"><span>unique-by</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-unique-by-column"><div class="inner"><span>unique-by-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-unordered-select"><div class="inner"><span>unordered-select</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-unroll-column"><div class="inner"><span>unroll-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-update"><div class="inner"><span>update</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-update-column"><div class="inner"><span>update-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-update-columns"><div class="inner"><span>update-columns</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-update-columnwise"><div class="inner"><span>update-columnwise</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-update-elemwise"><div class="inner"><span>update-elemwise</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-value-reader"><div class="inner"><span>value-reader</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.metamorph.html#var-write.21"><div class="inner"><span>write!</span></div></a></li></ul></div><div class="namespace-docs" id="content"><h1 class="anchor" 
id="top">tech.v3.dataset.metamorph</h1><div class="doc"><div class="markdown"><p>This is an auto-generated API. It scans the namespaces and changes the first
argument of each function to be metamorph-compliant, which means transforming an argument that is just a dataset into
an argument that is a metamorph context: a map of <code>{:metamorph/data ds}</code>. The functions also return
their result as a metamorph context.</p>
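<p>As a rough illustration of that convention (a minimal sketch, not the library's implementation), the wrapping can be pictured in plain Clojure. The helper name <code>lift-to-metamorph</code> is hypothetical, and an ordinary map stands in for a dataset so the snippet runs without any dependencies:</p>
<pre><code class="language-clojure">;; Hypothetical helper, for illustration only: turn a dataset-first function f
;; into a function that reads and writes the :metamorph/data key of a context.
(defn lift-to-metamorph
  [f & args]
  (fn [ctx]
    (update ctx :metamorph/data (fn [ds] (apply f ds args)))))

;; A plain map stands in for a dataset here (an assumption of this sketch).
(def ctx {:metamorph/data {:a [1 2 3] :b [4 5 6]}})

;; dissoc plays the role of a dataset-first function such as remove-column.
(def drop-b (lift-to-metamorph dissoc :b))

(drop-b ctx)
;; => {:metamorph/data {:a [1 2 3]}}
</code></pre>
<p>Because the sketch uses <code>update</code>, any other keys in the context map pass through unchanged, which is what lets such steps be chained.</p>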
</div></div><div class="public anchor" id="var-add-column"><h3>add-column</h3><div class="usage"><code>(add-column column)</code></div><div class="doc"><div class="markdown"><p>Add a new column. It is an error if the name collides with an existing column.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L10">view source</a></div></div><div class="public anchor" id="var-add-or-update-column"><h3>add-or-update-column</h3><div class="usage"><code>(add-or-update-column colname column)</code><code>(add-or-update-column column)</code></div><div class="doc"><div class="markdown"><p>If column exists, replace. Else append new column.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L16">view source</a></div></div><div class="public anchor" id="var-append-columns"><h3>append-columns</h3><div class="usage"><code>(append-columns column-seq)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L24">view source</a></div></div><div class="public anchor" id="var-assoc-ds"><h3>assoc-ds</h3><div class="usage"><code>(assoc-ds cname cdata & args)</code></div><div class="doc"><div class="markdown"><p>If dataset is not nil, calls <code>clojure.core/assoc</code>. Else creates a new empty dataset and
then calls <code>clojure.core/assoc</code>. Guaranteed to return a dataset (unlike assoc).</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L29">view source</a></div></div><div class="public anchor" id="var-assoc-metadata"><h3>assoc-metadata</h3><div class="usage"><code>(assoc-metadata filter-fn-or-ds k v & args)</code></div><div class="doc"><div class="markdown"><p>Set metadata across a set of columns.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L36">view source</a></div></div><div class="public anchor" id="var-brief"><h3>brief</h3><div class="usage"><code>(brief options)</code><code>(brief)</code></div><div class="doc"><div class="markdown"><p>Get a brief description of a dataset in mapseq form. A brief description is
the mapseq form of descriptive stats.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L42">view source</a></div></div><div class="public anchor" id="var-build-pipelined-function"><h3>build-pipelined-function</h3><h4 class="type">macro</h4><div class="usage"><code>(build-pipelined-function f m)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L51">view source</a></div></div><div class="public anchor" id="var-categorical-.3Enumber"><h3>categorical->number</h3><div class="usage"><code>(categorical->number filter-fn-or-ds)</code><code>(categorical->number filter-fn-or-ds table-args)</code><code>(categorical->number filter-fn-or-ds table-args result-datatype)</code></div><div class="doc"><div class="markdown"><p>Convert columns into a discrete, numeric representation.
See tech.v3.dataset.categorical/fit-categorical-map.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L56">view source</a></div></div><div class="public anchor" id="var-categorical-.3Eone-hot"><h3>categorical->one-hot</h3><div class="usage"><code>(categorical->one-hot filter-fn-or-ds)</code><code>(categorical->one-hot filter-fn-or-ds table-args)</code><code>(categorical->one-hot filter-fn-or-ds table-args result-datatype)</code></div><div class="doc"><div class="markdown"><p>Convert string columns to numeric columns.
See tech.v3.dataset.categorical/fit-one-hot.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L67">view source</a></div></div><div class="public anchor" id="var-column"><h3>column</h3><div class="usage"><code>(column colname)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L78">view source</a></div></div><div class="public anchor" id="var-column-.3Edataset"><h3>column->dataset</h3><div class="usage"><code>(column->dataset colname transform-fn options)</code><code>(column->dataset colname transform-fn)</code></div><div class="doc"><div class="markdown"><p>Transform a column into a sequence of maps using transform-fn.
Returns a dataset created from the sequence of maps.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L83">view source</a></div></div><div class="public anchor" id="var-column-cast"><h3>column-cast</h3><div class="usage"><code>(column-cast colname datatype)</code><code>(column-cast colname datatype options)</code></div><div class="doc"><div class="markdown"><p>Cast a column to a new datatype. This is never a lazy operation. If the old
and new datatypes match and no cast-fn is provided then dtype/clone is called
on the column.</p>
<p>colname may be a scalar or a tuple of <code>[src-col dst-col]</code>.</p>
<p>datatype may be a datatype enumeration or a tuple of
<code>[datatype cast-fn]</code> where cast-fn may return either a new value,
:tech.v3.dataset/missing, or :tech.v3.dataset/parse-failure.
Exceptions are propagated to the caller. The new column has at least the
existing missing set (if no attempt returns :missing or :cast-failure).
:cast-failure means the value gets added to metadata key :unparsed-data
and the index gets added to :unparsed-indexes.</p>
<p>If the existing datatype is string, then tech.v3.dataset.column/parse-column
is called.</p>
<p>Casts between numeric datatypes need no cast-fn but one may be provided.
Casts to string need no cast-fn but one may be provided.
Casts from string to anything will call tech.v3.dataset.column/parse-column.</p>
<p>Options:</p>
<ul>
<li><code>:track-parse-errors</code> - defaults to false. When true, the extra metadata keys
<code>:unparsed-indexes</code> and <code>:unparsed-data</code> will be added to the metadata. Be aware that
these values may not serialize, as the unparsed indexes are stored in a roaring bitmap.</li>
</ul>
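<p>A hedged usage sketch, assuming the library is on the classpath and that the metamorph version behaves as the namespace description above states (no dataset argument, context in, context out). The context map is built by hand, and the column name <code>:x</code> and its values are invented for illustration:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.metamorph :as ds-mm])

;; Hand-built metamorph context holding a dataset with one string column.
(def ctx {:metamorph/data (ds/->dataset {:x ["1" "2" "3"]})})

;; No dataset argument is passed; the call is assumed to return a function
;; that takes the context and returns an updated context.
(def cast-x (ds-mm/column-cast :x :int64))

;; The cast dataset is expected under :metamorph/data of the returned context.
(def result-ds (:metamorph/data (cast-x ctx)))

;; Inspect the parsed column.
(get result-ds :x)
</code></pre>
<p>Since the source column is a string column, this cast goes through the string parsing path described above rather than a plain numeric conversion.</p>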
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L92">view source</a></div></div><div class="public anchor" id="var-column-count"><h3>column-count</h3><div class="usage"><code>(column-count)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L126">view source</a></div></div><div class="public anchor" id="var-column-labeled-mapseq"><h3>column-labeled-mapseq</h3><div class="usage"><code>(column-labeled-mapseq value-colname-seq)</code></div><div class="doc"><div class="markdown"><p>Given a dataset, return a sequence of maps where several columns are all stored
in a :value key and a :label key contains a column name. Used for quickly creating
timeseries or scatterplot labeled graphs. Returns a lazy sequence, not a reader!</p>
<p>See also <code>columnwise-concat</code>.</p>
<p>Returns a sequence of maps of the form:</p>
<pre><code class="language-clojure"> {... - columns not in value-colname-seq
  :value - value from one of the value columns
  :label - name of the column the value came from
 }
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L131">view source</a></div></div><div class="public anchor" id="var-column-map"><h3>column-map</h3><div class="usage"><code>(column-map result-colname map-fn res-dtype-or-opts filter-fn-or-ds)</code><code>(column-map result-colname map-fn filter-fn-or-ds)</code><code>(column-map result-colname map-fn)</code></div><div class="doc"><div class="markdown"><p>Produce a new (or updated) column as the result of mapping a fn over columns. This
function is never lazy - all results are immediately calculated.</p>
<ul>
<li><code>dataset</code> - dataset.</li>
<li><code>result-colname</code> - Name of new (or existing) column.</li>
<li><code>map-fn</code> - function to map over columns. Same rules as <code>tech.v3.datatype/emap</code>.</li>
<li><code>res-dtype-or-opts</code> - If not given result is scanned to infer missing and datatype.
If using an option map, options are described below.</li>
<li><code>filter-fn-or-ds</code> - A dataset, a sequence of columns, or a <code>tech.v3.dataset.column-filters</code>
column filter function. Defaults to all the columns of the existing dataset.</li>
</ul>
<p>Returns a new dataset with a new or updated column.</p>
<p>Options:</p>
<ul>
<li><code>:datatype</code> - Set the datatype of the result column. If not given result is scanned
to infer result datatype and missing set.</li>
<li><code>:missing-fn</code> - if given, columns are first passed to missing-fn as a sequence and
this dictates the missing set. Otherwise the missing set is found by scanning the results
during the inference process. See <code>tech.v3.dataset.column/union-missing-sets</code> and
<code>tech.v3.dataset.column/intersect-missing-sets</code> for example functions to pass in
here.</li>
</ul>
<p>Examples:</p>
<pre><code class="language-clojure">
;;From the tests --
;;ds is tech.v3.dataset, ds-col is tech.v3.dataset.column,
;;`is` and `thrown?` come from clojure.test.

(let [testds (ds/->dataset [{:a 1.0 :b 2.0} {:a 3.0 :b 5.0} {:a 4.0 :b nil}])]
  ;;result scanned for both datatype and missing set
  (is (= (vec [3.0 6.0 nil])
         (:b2 (ds/column-map testds :b2 #(when % (inc %)) [:b]))))
  ;;result scanned for missing set only.  Result used in-place.
  (is (= (vec [3.0 6.0 nil])
         (:b2 (ds/column-map testds :b2 #(when % (inc %))
                             {:datatype :float64} [:b]))))
  ;;Nothing scanned at all.
  (is (= (vec [3.0 6.0 nil])
         (:b2 (ds/column-map testds :b2 #(inc %)
                             {:datatype :float64
                              :missing-fn ds-col/union-missing-sets} [:b]))))
  ;;Missing set scanning causes NPE at inc.
  (is (thrown? Throwable
               (ds/column-map testds :b2 #(inc %)
                              {:datatype :float64}
                              [:b]))))

;;Ad-hoc repl --

user> (require '[tech.v3.dataset :as ds])
nil
user> (def ds (ds/->dataset "test/data/stocks.csv"))
#'user/ds
user> (ds/head ds)
test/data/stocks.csv [5 3]:

| symbol |       date | price |
|--------|------------|-------|
|   MSFT | 2000-01-01 | 39.81 |
|   MSFT | 2000-02-01 | 36.35 |
|   MSFT | 2000-03-01 | 43.22 |
|   MSFT | 2000-04-01 | 28.37 |
|   MSFT | 2000-05-01 | 25.45 |
user> (-> (ds/column-map ds "price^2" #(* % %) ["price"])
          (ds/head))
test/data/stocks.csv [5 4]:

| symbol |       date | price |   price^2 |
|--------|------------|-------|-----------|
|   MSFT | 2000-01-01 | 39.81 | 1584.8361 |
|   MSFT | 2000-02-01 | 36.35 | 1321.3225 |
|   MSFT | 2000-03-01 | 43.22 | 1867.9684 |
|   MSFT | 2000-04-01 | 28.37 |  804.8569 |
|   MSFT | 2000-05-01 | 25.45 |  647.7025 |

user> (def ds1 (ds/->dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}]))
#'user/ds1
user> ds1
_unnamed [3 2]:

|  :b | :a |
|----:|---:|
|     |  1 |
| 2.0 |    |
| 3.0 |  2 |
user> (ds/column-map ds1 :c (fn [a b]
                              (when (and a b)
                                (+ (double a) (double b))))
                     [:a :b])
_unnamed [3 3]:

|  :b | :a |  :c |
|----:|---:|----:|
|     |  1 |     |
| 2.0 |    |     |
| 3.0 |  2 | 5.0 |
user> (ds/missing (*1 :c))
{0,1}
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L149">view source</a></div></div><div class="public anchor" id="var-column-names"><h3>column-names</h3><div class="usage"><code>(column-names)</code></div><div class="doc"><div class="markdown"><p>In-order sequence of column names</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L261">view source</a></div></div><div class="public anchor" id="var-column-values-.3Ecategorical"><h3>column-values->categorical</h3><div class="usage"><code>(column-values->categorical src-column)</code></div><div class="doc"><div class="markdown"><p>Given a column encoded via either string->number or one-hot, reverse
map to a sequence of the original string column values.
In the case of one-hot mappings, src-column must be the original
column name before the one-hot map.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L267">view source</a></div></div><div class="public anchor" id="var-columns"><h3>columns</h3><div class="usage"><code>(columns)</code></div><div class="doc"><div class="markdown"><p>Return sequence of all columns in dataset.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L276">view source</a></div></div><div class="public anchor" id="var-columns-with-missing-seq"><h3>columns-with-missing-seq</h3><div class="usage"><code>(columns-with-missing-seq)</code></div><div class="doc"><div class="markdown"><p>Return a sequence of:</p>
<pre><code class="language-clojure"> {:column-name column-name
  :missing-count missing-count
 }
</code></pre>
<p>or nil if no columns are missing data.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L282">view source</a></div></div><div class="public anchor" id="var-columnwise-concat"><h3>columnwise-concat</h3><div class="usage"><code>(columnwise-concat colnames options)</code><code>(columnwise-concat colnames)</code></div><div class="doc"><div class="markdown"><p>Given a dataset and a list of columns, produce a new dataset with
the columns concatenated to a new column with a :column column indicating
which column the original value came from. Any columns not mentioned in the
list of columns are duplicated.</p>
<p>Example:</p>
<pre><code class="language-clojure">user> (-> [{:a 1 :b 2 :c 3 :d 1} {:a 4 :b 5 :c 6 :d 2}]
          (ds/->dataset)
          (ds/columnwise-concat [:c :a :b]))
null [6 3]:

| :column | :value | :d |
|---------+--------+----|
| :c      |      3 |  1 |
| :c      |      6 |  2 |
| :a      |      1 |  1 |
| :a      |      4 |  2 |
| :b      |      2 |  1 |
| :b      |      5 |  2 |
</code></pre>
<p>Options:</p>
<ul>
<li><code>:value-column-name</code> - defaults to :value</li>
<li><code>:colname-column-name</code> - defaults to :column</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L294">view source</a></div></div><div class="public anchor" id="var-concat"><h3>concat</h3><div class="usage"><code>(concat & args)</code><code>(concat)</code></div><div class="doc"><div class="markdown"><p>Concatenate datasets using a copying-concatenation.
See also <a href="tech.v3.dataset.html#var-concat-inplace">concat-inplace</a>, which may be more efficient for your use case if you have
a small number (fewer than 3, say) of datasets.</p>
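<p>A minimal sketch of a copying concatenation, written against <code>tech.v3.dataset</code> (aliased <code>ds</code>), where the datasets are passed explicitly; the usage signature above differs. The datasets <code>d1</code> and <code>d2</code> are made up for illustration and share the same columns.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Two hypothetical datasets with identical column names.
(def d1 (ds/->dataset {:a [1 2] :b [1.0 2.0]}))
(def d2 (ds/->dataset {:a [3] :b [3.0]}))

;; Rows of d2 are appended after the rows of d1.
(def both (ds/concat d1 d2))

(ds/row-count both)
;; => 3
(vec (both :a))
;; => [1 2 3]
</code></pre>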
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L328">view source</a></div></div><div class="public anchor" id="var-concat-copying"><h3>concat-copying</h3><div class="usage"><code>(concat-copying & args)</code><code>(concat-copying)</code></div><div class="doc"><div class="markdown"><p>Concatenate datasets into a new dataset copying data. Respects missing values.
Datasets must all have the same columns. Result column datatypes will be a widening
cast of the datatypes.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L338">view source</a></div></div><div class="public anchor" id="var-concat-inplace"><h3>concat-inplace</h3><div class="usage"><code>(concat-inplace & args)</code><code>(concat-inplace)</code></div><div class="doc"><div class="markdown"><p>Concatenate datasets in place. Respects missing values. Datasets must all have the
same columns. Result column datatypes will be a widening cast of the datatypes.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L348">view source</a></div></div><div class="public anchor" id="var-data-.3Edataset"><h3>data->dataset</h3><div class="usage"><code>(data->dataset)</code></div><div class="doc"><div class="markdown"><p>Convert a data-ized dataset created via dataset->data back into a
full dataset.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L357">view source</a></div></div><div class="public anchor" id="var-dataset-.3Ecategorical-xforms"><h3>dataset->categorical-xforms</h3><div class="usage"><code>(dataset->categorical-xforms)</code></div><div class="doc"><div class="markdown"><p>Given a dataset, return a map of column-name->xform information.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L364">view source</a></div></div><div class="public anchor" id="var-dataset-.3Edata"><h3>dataset->data</h3><div class="usage"><code>(dataset->data)</code></div><div class="doc"><div class="markdown"><p>Convert a dataset to a pure clojure datastructure. Returns a map with two keys:
{:metadata :columns}.
:columns is a vector of column definitions appropriate for passing directly back
into new-dataset.
A column definition in this case is a map of {:name :missing :data :metadata}.</p>
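<p>A minimal round-trip sketch pairing this function with <code>data->dataset</code>, written against <code>tech.v3.dataset</code> (aliased <code>ds</code>) with the dataset passed explicitly. The dataset <code>d</code> is made up for illustration, and the exact contents of the intermediate map are not asserted here.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset used only for illustration.
(def d (ds/->dataset {:a [1 2 3] :b ["x" "y" "z"]}))

;; Plain Clojure data, suitable for handing to a serializer.
(def as-data (ds/dataset->data d))
(map :name (:columns as-data))

;; Rebuild a dataset from the data-ized form.
(def restored (ds/data->dataset as-data))
(ds/row-count restored)
;; => 3
(vec (restored :a))
;; => [1 2 3]
</code></pre>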
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L370">view source</a></div></div><div class="public anchor" id="var-dataset-name"><h3>dataset-name</h3><div class="usage"><code>(dataset-name)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L380">view source</a></div></div><div class="public anchor" id="var-dataset.3F"><h3>dataset?</h3><div class="usage"><code>(dataset?)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L385">view source</a></div></div><div class="public anchor" id="var-descriptive-stats"><h3>descriptive-stats</h3><div class="usage"><code>(descriptive-stats)</code><code>(descriptive-stats options)</code></div><div class="doc"><div class="markdown"><p>Get descriptive statistics across the columns of the dataset.
In addition to the standard stats.</p>
<p>Options:</p>
<ul>
<li><code>:stat-names</code> - defaults to <code>(remove #{:values :num-distinct-values} (all-descriptive-stats-names))</code></li>
<li><code>:n-categorical-values</code> - Number of categorical values to report in the 'values'
field. Defaults to 21.</li>
</ul>
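<p>A minimal sketch of narrowing the reported statistics with <code>:stat-names</code>, written against <code>tech.v3.dataset</code> (aliased <code>ds</code>) with the dataset passed explicitly. The dataset <code>d</code> is made up for illustration, and the particular stat names chosen here (<code>:col-name</code>, <code>:min</code>, <code>:mean</code>, <code>:max</code>) are assumed to be among those returned by <code>all-descriptive-stats-names</code>.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset used only for illustration.
(def d (ds/->dataset {:a [1 2 3 4] :b [10.0 20.0 30.0 40.0]}))

;; The result is itself a dataset with one row per column of d.
(def stats
  (ds/descriptive-stats d {:stat-names [:col-name :min :mean :max]}))

(ds/row-count stats)
;; => 2
</code></pre>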
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L390">view source</a></div></div><div class="public anchor" id="var-drop-columns"><h3>drop-columns</h3><div class="usage"><code>(drop-columns colname-seq-or-fn)</code></div><div class="doc"><div class="markdown"><p>Same as remove-columns. Remove columns indexed by column name seq or
column filter function.
For example:</p>
<pre><code class="language-clojure">(drop-columns DS [:A :B])
(drop-columns DS cf/categorical)
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L404">view source</a></div></div><div class="public anchor" id="var-drop-missing"><h3>drop-missing</h3><div class="usage"><code>(drop-missing)</code><code>(drop-missing colname)</code></div><div class="doc"><div class="markdown"><p>Remove missing entries by simply selecting out the missing indexes.</p>
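<p>A minimal sketch of both arities, written against <code>tech.v3.dataset</code> (aliased <code>ds</code>) with the dataset passed explicitly. The dataset <code>d</code> is made up for illustration; rows built from maps that lack a key get a missing value in that column.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Row 0 is complete, row 1 lacks :b, row 2 lacks :a.
(def d (ds/->dataset [{:a 1 :b 2.0} {:a 2} {:b 3.0}]))

;; Drop every row that has a missing value in any column.
(ds/row-count (ds/drop-missing d))
;; => 1

;; Drop only the rows where :a is missing.
(ds/row-count (ds/drop-missing d :a))
;; => 2
</code></pre>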
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L417">view source</a></div></div><div class="public anchor" id="var-drop-rows"><h3>drop-rows</h3><div class="usage"><code>(drop-rows row-indexes)</code></div><div class="doc"><div class="markdown"><p>Drop rows from dataset or column</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L425">view source</a></div></div><div class="public anchor" id="var-empty-column-names"><h3>empty-column-names</h3><div class="usage"><code>(empty-column-names)</code></div><div class="doc"><div class="markdown"><p>Return a sequence of column names whose empty set length matches the row count of the dataset.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L431">view source</a></div></div><div class="public anchor" id="var-empty-dataset"><h3>empty-dataset</h3><div class="usage"><code>(empty-dataset)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L437">view source</a></div></div><div class="public anchor" id="var-ensure-array-backed"><h3>ensure-array-backed</h3><div class="usage"><code>(ensure-array-backed options)</code><code>(ensure-array-backed)</code></div><div class="doc"><div class="markdown"><p>Ensure the column data in the dataset is stored in pure java arrays. This is
sometimes necessary for interop with other libraries and this operation will
force any lazy computations to complete. This also clears the missing set
for each column and writes the missing values to the new arrays.</p>
<p>Columns that are already array backed and that have no missing values are
returned unchanged.</p>
<p>The postcondition is that dtype/->array will return a java array in the appropriate
datatype for each column.</p>
<p>Options:</p>
<ul>
<li><code>:unpack?</code> - unpack packed datetime types. Defaults to true.</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L442">view source</a></div></div><div class="public anchor" id="var-feature-ecount"><h3>feature-ecount</h3><div class="usage"><code>(feature-ecount)</code></div><div class="doc"><div class="markdown"><p>Number of feature columns. Feature columns are columns that are not
inference targets.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L463">view source</a></div></div><div class="public anchor" id="var-filter"><h3>filter</h3><div class="usage"><code>(filter predicate)</code></div><div class="doc"><div class="markdown"><p>dataset->dataset transformation. Predicate is passed a map of
colname->column-value.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L470">view source</a></div></div><div class="public anchor" id="var-filter-column"><h3>filter-column</h3><div class="usage"><code>(filter-column colname predicate)</code><code>(filter-column colname)</code></div><div class="doc"><div class="markdown"><p>Filter a given column by a predicate. Predicate is passed column values.
If predicate is <em>not</em> an instance of IFn it is treated as a value and will
be used as if the predicate is #(= value %).</p>
<p>The 2-arity form of this function reads the column as a boolean reader so for
instance numeric 0 values are false in that case as are Double/NaN, Float/NaN. Objects are
only false if nil?.</p>
<p>Returns a dataset.</p>
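<p>A minimal sketch of the predicate, value, and boolean-reader forms, written against <code>tech.v3.dataset</code> (aliased <code>ds</code>) with the dataset passed explicitly. The dataset <code>d</code> and its column names are made up for illustration.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset used only for illustration.
(def d (ds/->dataset {:sym ["A" "B" "A"] :price [1.0 0.0 3.0]}))

;; Predicate form: keep rows whose :price is above 1.5.
(ds/row-count (ds/filter-column d :price #(> % 1.5)))
;; => 1

;; Value form: a string is not an IFn, so it is compared with =.
(ds/row-count (ds/filter-column d :sym "A"))
;; => 2

;; No predicate: the column is read as booleans, so per the
;; description above the 0.0 row should be dropped.
(ds/filter-column d :price)
</code></pre>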
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L477">view source</a></div></div><div class="public anchor" id="var-filter-dataset"><h3>filter-dataset</h3><div class="usage"><code>(filter-dataset filter-fn-or-ds)</code></div><div class="doc"><div class="markdown"><p>Filter the columns of the dataset returning a new dataset. This pathway is
designed to work with the tech.v3.dataset.column-filters namespace.</p>
<ul>
<li>If filter-fn-or-ds is a dataset, it is returned.</li>
<li>If filter-fn-or-ds is sequential, then select-columns is called.</li>
<li>If filter-fn-or-ds is :all, all columns are returned.</li>
<li>If filter-fn-or-ds is an instance of IFn, the dataset is passed into it.</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L493">view source</a></div></div><div class="public anchor" id="var-group-by"><h3>group-by</h3><div class="usage"><code>(group-by key-fn options)</code><code>(group-by key-fn)</code></div><div class="doc"><div class="markdown"><p>Produce a map of key-fn-value->dataset. The argument to key-fn
is a map of colname->column-value representing a row in dataset.
Each dataset in the resulting map contains all and only rows
that produce the same key-fn-value.</p>
<p>Options - options are passed into dtype arggroup:</p>
<ul>
<li><code>:group-by-finalizer</code> - when provided this is run on each dataset immediately after the
rows are selected. This can be used to immediately perform a reduction on each new
dataset which is faster than doing it in a separate run.</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L505">view source</a></div></div><div class="public anchor" id="var-group-by-.3Eindexes"><h3>group-by->indexes</h3><div class="usage"><code>(group-by->indexes key-fn options)</code><code>(group-by->indexes key-fn)</code></div><div class="doc"><div class="markdown"><p>(Non-lazy) - Group a dataset and return a map of key-fn-value->indexes where indexes
is an in-order contiguous group of indexes.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L522">view source</a></div></div><div class="public anchor" id="var-group-by-column"><h3>group-by-column</h3><div class="usage"><code>(group-by-column colname options)</code><code>(group-by-column colname)</code></div><div class="doc"><div class="markdown"><p>Return a map of column-value->dataset. Each dataset in the
resulting map contains all and only rows with the same value in
column.</p>
<p>Options:</p>
<ul>
<li><code>:group-by-finalizer</code> - when provided this is run on each dataset immediately after the
rows are selected. This can be used to immediately perform a reduction on each new
dataset which is faster than doing it in a separate run.</li>
</ul>
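<p>A minimal sketch of grouping with and without a finalizer, written against <code>tech.v3.dataset</code> (aliased <code>ds</code>) with the dataset passed explicitly. The dataset <code>d</code> is made up for illustration, and using <code>ds/row-count</code> as the finalizer is just one possible reduction.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset used only for illustration.
(def d (ds/->dataset {:sym ["A" "B" "A"] :price [1.0 2.0 3.0]}))

;; Map of column value to the dataset of matching rows.
(def groups (ds/group-by-column d :sym))
(ds/row-count (get groups "A"))
;; => 2

;; With a finalizer each group is reduced as soon as it is selected,
;; here to its row count instead of a full dataset.
(def counts (ds/group-by-column d :sym {:group-by-finalizer ds/row-count}))
(get counts "A")
</code></pre>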
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L531">view source</a></div></div><div class="public anchor" id="var-group-by-column-.3Eindexes"><h3>group-by-column->indexes</h3><div class="usage"><code>(group-by-column->indexes colname options)</code><code>(group-by-column->indexes colname)</code></div><div class="doc"><div class="markdown"><p>(Non-lazy) - Group a dataset by a column and return a map of column-val->indexes
where indexes is an in-order contiguous group of indexes.</p>
<p>Options are passed into dtype's arggroup method.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L545">view source</a></div></div><div class="public anchor" id="var-group-by-column-consumer"><h3>group-by-column-consumer</h3><div class="usage"><code>(group-by-column-consumer cname)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L556">view source</a></div></div><div class="public anchor" id="var-has-column.3F"><h3>has-column?</h3><div class="usage"><code>(has-column? column-name)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L561">view source</a></div></div><div class="public anchor" id="var-head"><h3>head</h3><div class="usage"><code>(head n)</code><code>(head)</code></div><div class="doc"><div class="markdown"><p>Get the first n rows of a dataset. Equivalent to
<code>(select-rows ds (range n))</code>. Arguments are reversed, however, so this can
be used in ->> operators.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L566">view source</a></div></div><div class="public anchor" id="var-induction"><h3>induction</h3><div class="usage"><code>(induction induct-fn & args)</code></div><div class="doc"><div class="markdown"><p>Given a dataset and a function from dataset->row produce a new dataset.
The produced row will be merged with the current row and then added to the
dataset.</p>
<p>Options are same as the options used for <a href="tech.v3.dataset.html#var--.3Edataset">->dataset</a> in order for the
user to control the parsing of the return values of <code>induct-fn</code>.
A new dataset is returned.</p>
<p>Example:</p>
<pre><code class="language-clojure">user> (def ds (ds/->dataset {:a [0 1 2 3] :b [1 2 3 4]}))
#'user/ds
user> ds
_unnamed [4 2]:

| :a | :b |
|---:|---:|
|  0 |  1 |
|  1 |  2 |
|  2 |  3 |
|  3 |  4 |
user> (ds/induction ds (fn [ds]
                         {:sum-of-previous-row (dfn/sum (ds/rowvec-at ds -1))
                          :sum-a (dfn/sum (ds :a))
                          :sum-b (dfn/sum (ds :b))}))
_unnamed [4 5]:

| :a | :b | :sum-b | :sum-a | :sum-of-previous-row |
|---:|---:|-------:|-------:|---------------------:|
|  0 |  1 |    0.0 |    0.0 |                  0.0 |
|  1 |  2 |    1.0 |    0.0 |                  1.0 |
|  2 |  3 |    3.0 |    1.0 |                  5.0 |
|  3 |  4 |    6.0 |    3.0 |                 14.0 |
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L576">view source</a></div></div><div class="public anchor" id="var-inference-column.3F"><h3>inference-column?</h3><div class="usage"><code>(inference-column?)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L616">view source</a></div></div><div class="public anchor" id="var-inference-target-column-names"><h3>inference-target-column-names</h3><div class="usage"><code>(inference-target-column-names)</code></div><div class="doc"><div class="markdown"><p>Return the names of the columns that are inference targets.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L621">view source</a></div></div><div class="public anchor" id="var-inference-target-ds"><h3>inference-target-ds</h3><div class="usage"><code>(inference-target-ds)</code></div><div class="doc"><div class="markdown"><p>Given a dataset return reverse-mapped inference target columns or nil
in the case where there are no inference targets.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L627">view source</a></div></div><div class="public anchor" id="var-inference-target-label-inverse-map"><h3>inference-target-label-inverse-map</h3><div class="usage"><code>(inference-target-label-inverse-map & args)</code></div><div class="doc"><div class="markdown"><p>Given options generated during ETL operations and annotated with :label-columns
sequence containing 1 label column, generate a reverse map that maps from a dataset
value back to the label that generated that value.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L634">view source</a></div></div><div class="public anchor" id="var-inference-target-label-map"><h3>inference-target-label-map</h3><div class="usage"><code>(inference-target-label-map & args)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L642">view source</a></div></div><div class="public anchor" id="var-k-fold-datasets"><h3>k-fold-datasets</h3><div class="usage"><code>(k-fold-datasets k options)</code><code>(k-fold-datasets k)</code></div><div class="doc"><div class="markdown"><p>Given 1 dataset, prepare K datasets using the k-fold algorithm.
<code>:randomize-dataset?</code> defaults to true, which will realize the entire dataset,
so use with care if you have large datasets.</p>
<p>Returns a sequence of {:test-ds :train-ds}</p>
<p>Options:</p>
<ul>
<li><code>:randomize-dataset?</code> - When true, shuffle the dataset. In that case <code>:seed</code> may be
provided. Defaults to true.</li>
<li><code>:seed</code> - when <code>:randomize-dataset?</code> is true then this can either be an
implementation of java.util.Random or an integer seed which will be used to
construct java.util.Random.</li>
</ul>
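<p>A minimal sketch of a seeded 3-fold split, written against <code>tech.v3.dataset</code> (aliased <code>ds</code>) with the dataset passed explicitly. The ten-row dataset <code>d</code> and the seed value are made up for illustration; the exact size of each fold is not asserted, only that every fold partitions the original rows.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical ten-row dataset used only for illustration.
(def d (ds/->dataset {:x (range 10)}))

;; An integer seed makes the shuffle repeatable.
(def folds (ds/k-fold-datasets d 3 {:seed 42}))

(count folds)
;; => 3

;; Train and test rows of each fold together cover the whole dataset.
(map (fn [{:keys [train-ds test-ds]}]
       (+ (ds/row-count train-ds) (ds/row-count test-ds)))
     folds)
;; => (10 10 10)
</code></pre>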
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L647">view source</a></div></div><div class="public anchor" id="var-labels"><h3>labels</h3><div class="usage"><code>(labels)</code></div><div class="doc"><div class="markdown"><p>Return the labels. The labels sequence is the reverse mapped inference
column. This returns a single column of data or errors out.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L667">view source</a></div></div><div class="public anchor" id="var-mapseq-reader"><h3>mapseq-reader</h3><div class="usage"><code>(mapseq-reader options)</code><code>(mapseq-reader)</code></div><div class="doc"><div class="markdown"><p>Return a reader that produces a map of column-name->column-value
upon read.</p>
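<p>A minimal sketch of reading rows as maps, written against <code>tech.v3.dataset</code> (aliased <code>ds</code>) with the dataset passed explicitly. The dataset <code>d</code> is made up for illustration.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset used only for illustration.
(def d (ds/->dataset {:a [1 2] :b ["x" "y"]}))

;; One map per row; the reader supports count and indexed access.
(def rows (ds/mapseq-reader d))

(count rows)
;; => 2
(get (first rows) :a)
;; => 1
(get (nth rows 1) :b)
;; => "y"
</code></pre>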
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L674">view source</a></div></div><div class="public anchor" id="var-min-n-by-column"><h3>min-n-by-column</h3><div class="usage"><code>(min-n-by-column cname N comparator options)</code><code>(min-n-by-column cname N comparator)</code><code>(min-n-by-column cname N)</code></div><div class="doc"><div class="markdown"><p>Find the minimum N entries (unsorted) by column. Resulting data will be indexed in
original order. If you want a sorted order then sort the result.</p>
<p>See options to <a href="tech.v3.dataset.html#var-sort-by-column">sort-by-column</a>.</p>
<p>Example:</p>
<pre><code class="language-clojure">user> (ds/min-n-by-column ds "price" 10 nil nil)
test/data/stocks.csv [10 3]:

| symbol |       date | price |
|--------|------------|------:|
|   AMZN | 2001-09-01 |  5.97 |
|   AMZN | 2001-10-01 |  6.98 |
|   AAPL | 2000-12-01 |  7.44 |
|   AAPL | 2002-08-01 |  7.38 |
|   AAPL | 2002-09-01 |  7.25 |
|   AAPL | 2002-12-01 |  7.16 |
|   AAPL | 2003-01-01 |  7.18 |
|   AAPL | 2003-02-01 |  7.51 |
|   AAPL | 2003-03-01 |  7.07 |
|   AAPL | 2003-04-01 |  7.11 |
user> (ds/min-n-by-column ds "price" 10 > nil)
test/data/stocks.csv [10 3]:

| symbol |       date |  price |
|--------|------------|-------:|
|   GOOG | 2007-09-01 | 567.27 |
|   GOOG | 2007-10-01 | 707.00 |
|   GOOG | 2007-11-01 | 693.00 |
|   GOOG | 2007-12-01 | 691.48 |
|   GOOG | 2008-01-01 | 564.30 |
|   GOOG | 2008-04-01 | 574.29 |
|   GOOG | 2008-05-01 | 585.80 |
|   GOOG | 2009-11-01 | 583.00 |
|   GOOG | 2009-12-01 | 619.98 |
|   GOOG | 2010-03-01 | 560.19 |
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L683">view source</a></div></div><div class="public anchor" id="var-missing"><h3>missing</h3><div class="usage"><code>(missing)</code></div><div class="doc"><div class="markdown"><p>Given a dataset or a column, return the missing set as a roaring bitmap</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L731">view source</a></div></div><div class="public anchor" id="var-model-type"><h3>model-type</h3><div class="usage"><code>(model-type & args)</code></div><div class="doc"><div class="markdown"><p>Check the label column after dataset processing.
Returns either <code>:regression</code> or <code>:classification</code>.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L737">view source</a></div></div><div class="public anchor" id="var-new-column"><h3>new-column</h3><div class="usage"><code>(new-column data)</code><code>(new-column data metadata)</code><code>(new-column data metadata missing)</code><code>(new-column)</code></div><div class="doc"><div class="markdown"><p>Create a new column. Data will be scanned for missing values
unless the full 4-argument pathway is used.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L746">view source</a></div></div><div class="public anchor" id="var-new-dataset"><h3>new-dataset</h3><div class="usage"><code>(new-dataset ds-metadata column-seq)</code><code>(new-dataset column-seq)</code><code>(new-dataset)</code></div><div class="doc"><div class="markdown"><p>Create a new dataset from a sequence of columns. Data will be converted
into columns using ds-col-proto/ensure-column-seq. If the column seq is simply a
collection of vectors, for instance, columns will be named ordinally.</p>
<p>Options map:</p>
<ul>
<li><code>:dataset-name</code> - Name of the dataset. Defaults to "_unnamed".</li>
<li><code>:key-fn</code> - Key function used on all column names before insertion into dataset.</li>
</ul>
<p>The return value fulfills the dataset protocols.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L759">view source</a></div></div><div class="public anchor" id="var-num-inference-classes"><h3>num-inference-classes</h3><div class="usage"><code>(num-inference-classes)</code></div><div class="doc"><div class="markdown"><p>Given a dataset and correctly built options from pipeline operations,
|
|
return the number of classes used for the label. Error if not classification
|
|
dataset.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L776">view source</a></div></div><div class="public anchor" id="var-order-column-names"><h3>order-column-names</h3><div class="usage"><code>(order-column-names colname-seq)</code></div><div class="doc"><div class="markdown"><p>Order a sequence of columns names so they match the order in the
|
|
original dataset. Missing columns are placed last.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L784">view source</a></div></div><div class="public anchor" id="var-pmap-ds"><h3>pmap-ds</h3><div class="usage"><code>(pmap-ds ds-map-fn options)</code><code>(pmap-ds ds-map-fn)</code></div><div class="doc"><div class="markdown"><p>Parallelize mapping a function from dataset->dataset across a single dataset. Results are
|
|
coalesced back into a single dataset. The original dataset is simply sliced into n-core
results and ds-map-fn is called n-core times. ds-map-fn must be a function from
|
|
dataset->dataset although it may return nil.</p>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li><code>:max-batch-size</code> - this is a default for tech.v3.parallel.for/indexed-map-reduce. You
|
|
can control how many rows are processed in a given batch - the default is 64000. If your
|
|
mapping pathway produces a large expansion in the size of the dataset then it may be
|
|
good to reduce the max batch size and use :as-seq to produce a sequence of datasets.</li>
|
|
<li><code>:result-type</code>
|
|
<ul>
|
|
<li><code>:as-seq</code> - Return a sequence of datasets, one for each batch.</li>
|
|
<li><code>:as-ds</code> - Return a single dataset with all results in memory (default option).</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
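<p>A minimal sketch of the options above, assuming the underlying <code>tech.v3.dataset/pmap-ds</code>
function (aliased <code>ds</code>, as in the other examples on this page), which takes the dataset as its
first argument. The data and the <code>keep-even</code> helper are hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical data: 100,000 rows in a single :x column.
(def big (ds/->dataset {:x (range 100000)}))

;; A dataset->dataset function; each batch is filtered independently.
(defn keep-even [batch]
  (ds/filter-column batch :x even?))

;; Default :result-type (:as-ds) coalesces the batches into one dataset.
(def evens (ds/pmap-ds big keep-even))

;; Smaller batches, returned as a sequence of datasets instead.
(def even-batches
  (ds/pmap-ds big keep-even {:max-batch-size 10000
                             :result-type :as-seq}))
</code></pre>
<p>The <code>:as-seq</code> form is the one to reach for when the mapped result is much larger than the input.</p>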
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L791">view source</a></div></div><div class="public anchor" id="var-print-all"><h3>print-all</h3><div class="usage"><code>(print-all)</code></div><div class="doc"><div class="markdown"><p>Helper function equivalent to <code>(tech.v3.dataset.print/print-range ... :all)</code></p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L812">view source</a></div></div><div class="public anchor" id="var-probability-distributions-.3Elabel-column"><h3>probability-distributions->label-column</h3><div class="usage"><code>(probability-distributions->label-column dst-colname label-column-datatype)</code><code>(probability-distributions->label-column dst-colname)</code></div><div class="doc"><div class="markdown"><p>Given a dataset that has columns in which the column names describe labels and the
|
|
rows describe a probability distribution, create a label column by taking the max
|
|
value in each row and assigning that row the name of the column that holds it.
Creates a categorical label column which has a categorical map in its metadata.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L818">view source</a></div></div><div class="public anchor" id="var-rand-nth"><h3>rand-nth</h3><div class="usage"><code>(rand-nth)</code></div><div class="doc"><div class="markdown"><p>Return a random row from the dataset in map format</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L830">view source</a></div></div><div class="public anchor" id="var-remove-column"><h3>remove-column</h3><div class="usage"><code>(remove-column col-name)</code></div><div class="doc"><div class="markdown"><p>Same as:</p>
|
|
<pre><code class="language-clojure">(dissoc dataset col-name)
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L836">view source</a></div></div><div class="public anchor" id="var-remove-columns"><h3>remove-columns</h3><div class="usage"><code>(remove-columns colname-seq-or-fn)</code></div><div class="doc"><div class="markdown"><p>Remove columns indexed by column name seq or column filter function.
|
|
For example:</p>
|
|
<pre><code class="language-clojure"> (remove-columns DS [:A :B])
|
|
(remove-columns DS cf/categorical)
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L846">view source</a></div></div><div class="public anchor" id="var-remove-empty-columns"><h3>remove-empty-columns</h3><div class="usage"><code>(remove-empty-columns)</code></div><div class="doc"><div class="markdown"><p>Remove all columns that have no data - missing set length equals row count.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L858">view source</a></div></div><div class="public anchor" id="var-remove-rows"><h3>remove-rows</h3><div class="usage"><code>(remove-rows row-indexes)</code></div><div class="doc"><div class="markdown"><p>Same as drop-rows.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L864">view source</a></div></div><div class="public anchor" id="var-rename-columns"><h3>rename-columns</h3><div class="usage"><code>(rename-columns colnames)</code></div><div class="doc"><div class="markdown"><p>Rename columns using a map or vector of column names.</p>
|
|
<p>Does not reorder columns; rename is in-place for maps and
|
|
positional for vectors.</p>
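<p>A short sketch of the two forms, assuming the underlying <code>tech.v3.dataset/rename-columns</code>
function (aliased <code>ds</code>), which takes the dataset as its first argument. The dataset is hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def d (ds/->dataset {:a [1 2] :b [3 4]}))

;; Map form: only the named column is renamed, in place.
(def by-map (ds/rename-columns d {:a :alpha}))

;; Vector form: new names are applied positionally.
(def by-position (ds/rename-columns d [:x :y]))

(ds/column-names by-map)
(ds/column-names by-position)
</code></pre>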
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L870">view source</a></div></div><div class="public anchor" id="var-replace-missing"><h3>replace-missing</h3><div class="usage"><code>(replace-missing)</code><code>(replace-missing strategy)</code><code>(replace-missing columns-selector strategy)</code><code>(replace-missing columns-selector strategy value)</code></div><div class="doc"><div class="markdown"><p>Replace missing values in some columns with a given strategy.
|
|
The columns selector may be:</p>
|
|
<ul>
|
|
<li>seq of any legal column names</li>
|
|
<li>or a column filter function, such as <code>numeric</code> and <code>categorical</code></li>
|
|
</ul>
|
|
<p>Strategies may be:</p>
|
|
<ul>
|
|
<li>
|
|
<p><code>:down</code> - take value from previous non-missing row if possible else use provided value.</p>
|
|
</li>
|
|
<li>
|
|
<p><code>:up</code> - take value from next non-missing row if possible else use provided value.</p>
|
|
</li>
|
|
<li>
|
|
<p><code>:downup</code> - take value from previous if possible else use next.</p>
|
|
</li>
|
|
<li>
|
|
<p><code>:updown</code> - take value from next if possible else use previous.</p>
|
|
</li>
|
|
<li>
|
|
<p><code>:nearest</code> - Use nearest of next or previous values. <code>:mid</code> is an alias for <code>:nearest</code>.</p>
|
|
</li>
|
|
<li>
|
|
<p><code>:midpoint</code> - Use midpoint of averaged values between previous and next nonmissing
|
|
rows.</p>
|
|
</li>
|
|
<li>
|
|
<p><code>:abb</code> - Impute missing with approximate bayesian bootstrap. See <a href="https://search.r-project.org/CRAN/refmans/LaplacesDemon/html/ABB.html">R's ABB</a>.</p>
|
|
</li>
|
|
<li>
|
|
<p><code>:lerp</code> - Linearly interpolate values between previous and next nonmissing rows.</p>
|
|
</li>
|
|
<li>
|
|
<p><code>:value</code> - Value will be provided - see below.</p>
|
|
<p>A value may be provided, which will then be used. The value may be a function, in which
case it is called on the column with missing values elided and its return value is
used as the filler.</p>
|
|
</li>
|
|
</ul>
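<p>A minimal sketch of a few of the strategies above, assuming the underlying
<code>tech.v3.dataset/replace-missing</code> function (aliased <code>ds</code>), which takes the dataset as its
first argument. The <code>gaps</code> dataset is hypothetical, and <code>dfn/mean</code> from
<code>tech.v3.datatype.functional</code> is used here only as one example of a filler function.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.functional :as dfn])

;; nil entries are recorded as missing values.
(def gaps (ds/->dataset {:a [1.0 nil nil 4.0]
                         :b ["x" nil "z" nil]}))

;; Linearly interpolate :a between its neighbouring non-missing rows.
(def lerped (ds/replace-missing gaps [:a] :lerp))

;; Carry the previous non-missing value of :a downward.
(def filled-down (ds/replace-missing gaps [:a] :down))

;; Fill :b with a constant.
(def with-na (ds/replace-missing gaps [:b] :value "NA"))

;; Fill :a with the result of a function called on the non-missing values.
(def with-mean (ds/replace-missing gaps [:a] :value dfn/mean))
</code></pre>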
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L879">view source</a></div></div><div class="public anchor" id="var-replace-missing-value"><h3>replace-missing-value</h3><div class="usage"><code>(replace-missing-value filter-fn-or-ds scalar-value)</code><code>(replace-missing-value scalar-value)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L912">view source</a></div></div><div class="public anchor" id="var-reverse-rows"><h3>reverse-rows</h3><div class="usage"><code>(reverse-rows)</code></div><div class="doc"><div class="markdown"><p>Reverse the rows in the dataset or column.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L919">view source</a></div></div><div class="public anchor" id="var-row-at"><h3>row-at</h3><div class="usage"><code>(row-at idx)</code></div><div class="doc"><div class="markdown"><p>Get the row at an individual index. If indexes are negative then the dataset
|
|
is indexed from the end.</p>
|
|
<pre><code class="language-clojure">user> (ds/row-at stocks 1)
|
|
{"date" #object[java.time.LocalDate 0x534cb03b "2000-02-01"],
|
|
"symbol" "MSFT",
|
|
"price" 36.35}
|
|
user> (ds/row-at stocks -1)
|
|
{"date" #object[java.time.LocalDate 0x6bf60ed5 "2010-03-01"],
|
|
"symbol" "AAPL",
|
|
"price" 223.02}
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L925">view source</a></div></div><div class="public anchor" id="var-row-count"><h3>row-count</h3><div class="usage"><code>(row-count)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L943">view source</a></div></div><div class="public anchor" id="var-row-map"><h3>row-map</h3><div class="usage"><code>(row-map map-fn options)</code><code>(row-map map-fn)</code></div><div class="doc"><div class="markdown"><p>Map a function across the rows of the dataset producing a new dataset
|
|
that is merged back into the original potentially replacing existing columns.
|
|
Options are passed into the <a href="tech.v3.dataset.html#var--.3Edataset">->dataset</a> function so you can control the resulting
|
|
column types by the usual dataset parsing options described there.</p>
|
|
<p>Options:</p>
|
|
<p>See options for <a href="tech.v3.dataset.html#var-pmap-ds">pmap-ds</a>. In particular, note that you can
|
|
produce a sequence of datasets as opposed to a single large dataset.</p>
|
|
<p>Speed demons should attempt both <code>{:copying? false}</code> and <code>{:copying? true}</code> in the options
|
|
map as that changes rather drastically how data is read from the datasets. If you are
|
|
going to read all the data in the dataset, <code>{:copying? true}</code> will most likely be
|
|
the faster of the two.</p>
|
|
<p>Examples:</p>
|
|
<pre><code class="language-clojure">user> (def stocks (ds/->dataset "test/data/stocks.csv"))
|
|
#'user/stocks
|
|
user> (ds/head stocks)
|
|
test/data/stocks.csv [5 3]:
|
|
|
|
| symbol | date | price |
|
|
|--------|------------|------:|
|
|
| MSFT | 2000-01-01 | 39.81 |
|
|
| MSFT | 2000-02-01 | 36.35 |
|
|
| MSFT | 2000-03-01 | 43.22 |
|
|
| MSFT | 2000-04-01 | 28.37 |
|
|
| MSFT | 2000-05-01 | 25.45 |
|
|
user> (ds/head (ds/row-map stocks (fn [row]
|
|
{"symbol" (keyword (row "symbol"))
|
|
:price2 (* (row "price")(row "price"))})))
|
|
test/data/stocks.csv [5 4]:
|
|
|
|
| symbol | date | price | :price2 |
|
|
|--------|------------|------:|----------:|
|
|
| :MSFT | 2000-01-01 | 39.81 | 1584.8361 |
|
|
| :MSFT | 2000-02-01 | 36.35 | 1321.3225 |
|
|
| :MSFT | 2000-03-01 | 43.22 | 1867.9684 |
|
|
| :MSFT | 2000-04-01 | 28.37 | 804.8569 |
|
|
| :MSFT | 2000-05-01 | 25.45 | 647.7025 |
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L948">view source</a></div></div><div class="public anchor" id="var-row-mapcat"><h3>row-mapcat</h3><div class="usage"><code>(row-mapcat mapcat-fn options)</code><code>(row-mapcat mapcat-fn)</code></div><div class="doc"><div class="markdown"><p>Map a function across the rows of the dataset. The function must produce a sequence of
|
|
maps and the original dataset rows will be duplicated and then merged into the result
|
|
of calling <code>(->> (apply concat) (->>dataset options))</code> on the result of <code>mapcat-fn</code>. Options
|
|
are the same as <a href="tech.v3.dataset.html#var--.3Edataset">->dataset</a>.</p>
|
|
<p>The smaller the maps returned from mapcat-fn the better, perhaps consider using records.
|
|
In the case that a mapcat-fn result map has a key that overlaps a column name the
|
|
column will be replaced with the output of mapcat-fn. The returned map will have the
|
|
key <code>:_row-id</code> assoc'd onto it so for absolutely minimal gc usage include this
|
|
as a member variable in your map.</p>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li>See options for <a href="tech.v3.dataset.html#var-pmap-ds">pmap-ds</a>. Especially note <code>:max-batch-size</code> and <code>:result-type</code>.
|
|
In order to conserve memory it may be much more efficient to return a sequence of datasets
|
|
rather than one large dataset. If returning sequences of datasets perhaps consider
|
|
a transducing pathway across them or the <a href="tech.v3.dataset.reductions.html">tech.v3.dataset.reductions</a> namespace.</li>
|
|
</ul>
|
|
<p>Example:</p>
|
|
<pre><code class="language-clojure">user> (def ds (ds/->dataset {:rid (range 10)
|
|
:data (repeatedly 10 #(rand-int 3))}))
|
|
#'user/ds
|
|
user> (ds/head ds)
|
|
_unnamed [5 2]:
|
|
|
|
| :rid | :data |
|
|
|-----:|------:|
|
|
| 0 | 0 |
|
|
| 1 | 2 |
|
|
| 2 | 0 |
|
|
| 3 | 1 |
|
|
| 4 | 2 |
|
|
user> (def mapcat-fn (fn [row]
|
|
(for [idx (range (row :data))]
|
|
{:idx idx})))
|
|
#'user/mapcat-fn
|
|
user> (mapcat mapcat-fn (ds/rows ds))
|
|
({:idx 0} {:idx 1} {:idx 0} {:idx 0} {:idx 1} {:idx 0} {:idx 1} {:idx 0} {:idx 1})
|
|
user> (ds/row-mapcat ds mapcat-fn)
|
|
_unnamed [9 3]:
|
|
|
|
| :rid | :data | :idx |
|
|
|-----:|------:|-----:|
|
|
| 1 | 2 | 0 |
|
|
| 1 | 2 | 1 |
|
|
| 3 | 1 | 0 |
|
|
| 4 | 2 | 0 |
|
|
| 4 | 2 | 1 |
|
|
| 6 | 2 | 0 |
|
|
| 6 | 2 | 1 |
|
|
| 8 | 2 | 0 |
|
|
| 8 | 2 | 1 |
|
|
user>
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L999">view source</a></div></div><div class="public anchor" id="var-rows"><h3>rows</h3><div class="usage"><code>(rows options)</code><code>(rows)</code></div><div class="doc"><div class="markdown"><p>Get the rows of the dataset as a list of potentially flyweight maps.</p>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li>copying? - When true the data is copied out of the dataset row by row upon read of that
|
|
row. When false the data is only referenced upon each read of a particular key. Copying
|
|
is appropriate if you want to use the row values as keys in a map and it is inappropriate if
|
|
you are only going to read a very small portion of the row map.</li>
|
|
<li>nil-missing? - When true, maps returned have nil values for missing entries as opposed
|
|
to eliding the missing keys entirely. It is legacy behavior and slightly faster to
|
|
use <code>:nil-missing? true</code>.</li>
|
|
</ul>
|
|
<pre><code class="language-clojure">user> (take 5 (ds/rows stocks))
|
|
({"date" #object[java.time.LocalDate 0x6c433971 "2000-01-01"],
|
|
"symbol" "MSFT",
|
|
"price" 39.81}
|
|
{"date" #object[java.time.LocalDate 0x28f96b14 "2000-02-01"],
|
|
"symbol" "MSFT",
|
|
"price" 36.35}
|
|
{"date" #object[java.time.LocalDate 0x7bdbf0a "2000-03-01"],
|
|
"symbol" "MSFT",
|
|
"price" 43.22}
|
|
{"date" #object[java.time.LocalDate 0x16d3871e "2000-04-01"],
|
|
"symbol" "MSFT",
|
|
"price" 28.37}
|
|
{"date" #object[java.time.LocalDate 0x47094da0 "2000-05-01"],
|
|
"symbol" "MSFT",
|
|
"price" 25.45})
|
|
|
|
|
|
user> (ds/rows (ds/->dataset [{:a 1 :b 2} {:a 2} {:b 3}]))
|
|
[{:a 1, :b 2} {:a 2} {:b 3}]
|
|
|
|
user> (ds/rows (ds/->dataset [{:a 1 :b 2} {:a 2} {:b 3}]) {:nil-missing? true})
|
|
[{:a 1, :b 2} {:a 2, :b nil} {:a nil, :b 3}]
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1062">view source</a></div></div><div class="public anchor" id="var-rowvec-at"><h3>rowvec-at</h3><div class="usage"><code>(rowvec-at idx)</code></div><div class="doc"><div class="markdown"><p>Return a persistent-vector-like row at a given index. Negative indexes index
|
|
from the end.</p>
|
|
<pre><code class="language-clojure">user> (ds/rowvec-at stocks 1)
|
|
["MSFT" #object[java.time.LocalDate 0x5848b8b3 "2000-02-01"] 36.35]
|
|
user> (ds/rowvec-at stocks -1)
|
|
["AAPL" #object[java.time.LocalDate 0x4b70b0d5 "2010-03-01"] 223.02]
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1106">view source</a></div></div><div class="public anchor" id="var-rowvecs"><h3>rowvecs</h3><div class="usage"><code>(rowvecs options)</code><code>(rowvecs)</code></div><div class="doc"><div class="markdown"><p>Return a randomly addressable list of rows in persistent vector-like form.</p>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li>copying? - When true the data is copied out of the dataset row by row upon read of that
|
|
row. When false the data is only referenced upon each read of a particular key. Copying
|
|
is appropriate if you want to use the row values as keys in a map and it is inappropriate if
|
|
you are only going to read a given key for a given row once.</li>
|
|
</ul>
|
|
<pre><code class="language-clojure">user> (take 5 (ds/rowvecs stocks))
|
|
(["MSFT" #object[java.time.LocalDate 0x5be9e4c8 "2000-01-01"] 39.81]
|
|
["MSFT" #object[java.time.LocalDate 0xf758e5 "2000-02-01"] 36.35]
|
|
["MSFT" #object[java.time.LocalDate 0x752cc84d "2000-03-01"] 43.22]
|
|
["MSFT" #object[java.time.LocalDate 0x7bad4827 "2000-04-01"] 28.37]
|
|
["MSFT" #object[java.time.LocalDate 0x3a62c34a "2000-05-01"] 25.45])
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1120">view source</a></div></div><div class="public anchor" id="var-sample"><h3>sample</h3><div class="usage"><code>(sample n options)</code><code>(sample n)</code><code>(sample)</code></div><div class="doc"><div class="markdown"><p>Sample n-rows from a dataset. Defaults to sampling <em>without</em> replacement.</p>
|
|
<p>For the definition of seed, see the <a href="https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle">argshuffle documentation</a>.</p>
|
|
<p>The returned dataset's metadata is altered by merging <code>{:print-index-range (range n)}</code> into it, so you
will always see the entire returned dataset. If this isn't desired, <code>vary-meta</code> is a good pathway.</p>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li><code>:replacement?</code> - Do sampling with replacement. Defaults to false.</li>
|
|
<li><code>:seed</code> - Provide a seed as a number or provide a Random implementation.</li>
|
|
</ul>
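<p>A short sketch of both sampling modes, assuming the underlying <code>tech.v3.dataset/sample</code>
function (aliased <code>ds</code>), which takes the dataset as its first argument. The dataset and seed
value are hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def d (ds/->dataset {:x (range 10)}))

;; Three distinct rows, chosen with a fixed seed.
(def without-replacement (ds/sample d 3 {:seed 42}))

;; Five rows where the same source row may appear more than once.
(def with-replacement (ds/sample d 5 {:replacement? true :seed 42}))
</code></pre>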
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1144">view source</a></div></div><div class="public anchor" id="var-select"><h3>select</h3><div class="usage"><code>(select colname-seq selection)</code></div><div class="doc"><div class="markdown"><p>Reorder/trim dataset according to this sequence of indexes. Returns a new dataset.
|
|
colname-seq - one of:</p>
|
|
<ul>
|
|
<li>:all - all the columns</li>
|
|
<li>sequence of column names - those columns in that order.</li>
|
|
<li>implementation of java.util.Map - column order is dictated by map iteration order;
selected columns are subsequently named after the corresponding value in the map.
Similar to <code>rename-columns</code> except this trims the result to be only the columns
in the map.</li>
</ul>
<p>selection - one of:</p>
<ul>
<li>keyword :all</li>
<li>a list of indexes to select. When providing indices,
duplicates will select the specified index position more than once.</li>
<li>a list of booleans where the index position of each true value indicates an index to select.</li>
|
|
</ul>
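<p>A minimal sketch of the column and row arguments, assuming the underlying
<code>tech.v3.dataset/select</code> function (aliased <code>ds</code>), which takes the dataset as its first
argument. The dataset is hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def d (ds/->dataset {:a [1 2 3] :b [4 5 6] :c [7 8 9]}))

;; Columns :c then :a, rows 0 and 2.
(def reordered (ds/select d [:c :a] [0 2]))

;; Map form: keep only :a, renamed to :alpha, with every row.
(def renamed (ds/select d {:a :alpha} :all))

;; A duplicated index selects that row twice.
(def doubled (ds/select d :all [1 1]))
</code></pre>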
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1164">view source</a></div></div><div class="public anchor" id="var-select-by-index"><h3>select-by-index</h3><div class="usage"><code>(select-by-index col-index row-index)</code></div><div class="doc"><div class="markdown"><p>Trim dataset according to this sequence of indexes. Returns a new dataset.</p>
|
|
<p>col-index and row-index - one of:</p>
|
|
<ul>
|
|
<li>:all - all the columns</li>
|
|
<li>list of indexes. May contain duplicates. Negative values will be counted from
|
|
the end of the sequence.</li>
|
|
</ul>
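<p>A short sketch of positional selection with negative indexes, assuming the underlying
<code>tech.v3.dataset/select-by-index</code> function (aliased <code>ds</code>), which takes the dataset as its
first argument. The dataset is hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def d (ds/->dataset {:a [1 2 3] :b [4 5 6] :c [7 8 9]}))

;; First and last columns, last row only.
(def corners (ds/select-by-index d [0 -1] [-1]))

;; Every column, first two rows.
(def top (ds/select-by-index d :all [0 1]))
</code></pre>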
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1181">view source</a></div></div><div class="public anchor" id="var-select-columns"><h3>select-columns</h3><div class="usage"><code>(select-columns colname-seq-or-fn)</code></div><div class="doc"><div class="markdown"><p>Select columns from the dataset by:</p>
|
|
<ul>
|
|
<li>seq of column names</li>
|
|
<li>column selector function</li>
|
|
<li><code>:all</code> keyword</li>
|
|
</ul>
|
|
<p>For example:</p>
|
|
<pre><code class="language-clojure">(select-columns DS [:A :B])
|
|
(select-columns DS cf/numeric)
|
|
(select-columns DS :all)
|
|
</code></pre>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1193">view source</a></div></div><div class="public anchor" id="var-select-columns-by-index"><h3>select-columns-by-index</h3><div class="usage"><code>(select-columns-by-index col-index)</code></div><div class="doc"><div class="markdown"><p>Select columns from the dataset by a seq of indexes (negative indexes included) or <code>:all</code>.</p>
|
|
<p>See documentation for <code>select-by-index</code>.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1211">view source</a></div></div><div class="public anchor" id="var-select-missing"><h3>select-missing</h3><div class="usage"><code>(select-missing)</code></div><div class="doc"><div class="markdown"><p>Remove missing entries by simply selecting out the missing indexes</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1219">view source</a></div></div><div class="public anchor" id="var-select-rows"><h3>select-rows</h3><div class="usage"><code>(select-rows row-indexes options)</code><code>(select-rows row-indexes)</code></div><div class="doc"><div class="markdown"><p>Select rows from the dataset or column.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1225">view source</a></div></div><div class="public anchor" id="var-set-dataset-name"><h3>set-dataset-name</h3><div class="usage"><code>(set-dataset-name ds-name)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1233">view source</a></div></div><div class="public anchor" id="var-set-inference-target"><h3>set-inference-target</h3><div class="usage"><code>(set-inference-target target-name-or-target-name-seq)</code></div><div class="doc"><div class="markdown"><p>Set the inference target on the column. This sets the :column-type member
|
|
of the column metadata to :inference-target?.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1238">view source</a></div></div><div class="public anchor" id="var-shape"><h3>shape</h3><div class="usage"><code>(shape)</code></div><div class="doc"><div class="markdown"><p>Returns shape in column-major format of <code>[n-columns n-rows]</code>.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1245">view source</a></div></div><div class="public anchor" id="var-shuffle"><h3>shuffle</h3><div class="usage"><code>(shuffle options)</code><code>(shuffle)</code></div><div class="doc"><div class="markdown"><p>Shuffle the rows of the dataset optionally providing a seed.
|
|
See <a href="https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle">https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle</a>.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1251">view source</a></div></div><div class="public anchor" id="var-sort-by"><h3>sort-by</h3><div class="usage"><code>(sort-by key-fn compare-fn & args)</code><code>(sort-by key-fn)</code></div><div class="doc"><div class="markdown"><p>Sort a dataset by a key-fn and compare-fn.</p>
|
|
<ul>
|
|
<li><code>key-fn</code> - function from map to sort value.</li>
|
|
<li><code>compare-fn</code> may be one of:
|
|
<ul>
|
|
<li>a clojure operator like clojure.core/<</li>
|
|
<li><code>:tech.numerics/<</code>, <code>:tech.numerics/></code> for unboxing comparisons of primitive
|
|
values.</li>
|
|
<li>clojure.core/compare</li>
|
|
<li>A custom java.util.Comparator instantiation.</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li><code>:nan-strategy</code> - General missing strategy. Options are <code>:first</code>, <code>:last</code>, and
|
|
<code>:exception</code>.</li>
|
|
<li><code>:parallel?</code> - Uses parallel quicksort when true and regular quicksort when false.</li>
|
|
</ul>
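<p>A minimal sketch of sorting with a key function, assuming the underlying
<code>tech.v3.dataset/sort-by</code> function (aliased <code>ds</code>), which takes the dataset as its first
argument. The dataset and the <code>by-price</code> helper are hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def d (ds/->dataset {:sym ["a" "b" "c"]
                      :price [3.0 1.0 2.0]}))

;; key-fn receives each row as a map and returns the sort value.
(defn by-price [row] (:price row))

;; Ascending by default.
(def ascending (ds/sort-by d by-price))

;; Descending using a clojure comparison operator.
(def descending (ds/sort-by d by-price >))
</code></pre>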
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1260">view source</a></div></div><div class="public anchor" id="var-sort-by-column"><h3>sort-by-column</h3><div class="usage"><code>(sort-by-column colname compare-fn & args)</code><code>(sort-by-column colname)</code></div><div class="doc"><div class="markdown"><p>Sort a dataset by a given column using the given compare fn.</p>
|
|
<ul>
|
|
<li><code>compare-fn</code> may be one of:
|
|
<ul>
|
|
<li>a clojure operator like clojure.core/<</li>
|
|
<li><code>:tech.numerics/<</code>, <code>:tech.numerics/></code> for unboxing comparisons of primitive
|
|
values.</li>
|
|
<li>clojure.core/compare</li>
|
|
<li>A custom java.util.Comparator instantiation.</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li><code>:nan-strategy</code> - General missing strategy. Options are <code>:first</code>, <code>:last</code>, and
|
|
<code>:exception</code>.</li>
|
|
<li><code>:parallel?</code> - Uses parallel quicksort when true and regular quicksort when false.</li>
|
|
</ul>
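<p>A short sketch of the compare-fn choices listed above, assuming the underlying
<code>tech.v3.dataset/sort-by-column</code> function (aliased <code>ds</code>), which takes the dataset as its
first argument. The dataset is hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def d (ds/->dataset {:sym ["b" "c" "a"]
                      :price [3.0 1.0 2.0]}))

;; Ascending on a numeric column.
(def cheapest-first (ds/sort-by-column d :price))

;; Descending with the unboxed primitive comparison keyword.
(def priciest-first (ds/sort-by-column d :price :tech.numerics/>))

;; clojure.core/compare works for non-numeric columns such as strings.
(def alphabetical (ds/sort-by-column d :sym compare))
</code></pre>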
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1282">view source</a></div></div><div class="public anchor" id="var-tail"><h3>tail</h3><div class="usage"><code>(tail n)</code><code>(tail)</code></div><div class="doc"><div class="markdown"><p>Get the last n rows of a dataset. Equivalent to
|
|
<code>(select-rows ds (range ...))</code>. Argument order is dataset-last, however, so this can
|
|
be used in ->> operators.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1303">view source</a></div></div><div class="public anchor" id="var-take-nth"><h3>take-nth</h3><div class="usage"><code>(take-nth n-val)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1313">view source</a></div></div><div class="public anchor" id="var-train-test-split"><h3>train-test-split</h3><div class="usage"><code>(train-test-split options)</code><code>(train-test-split)</code></div><div class="doc"><div class="markdown"><p>Probabilistically split the dataset returning a map of <code>{:train-ds :test-ds}</code>.</p>
|
|
<p>Options:</p>
|
|
<ul>
|
|
<li><code>:randomize-dataset?</code> - When true, shuffle the dataset. In that case 'seed' may be
|
|
provided. Defaults to true.</li>
|
|
<li><code>:seed</code> - when <code>:randomize-dataset?</code> is true then this can either be an
|
|
implementation of java.util.Random or an integer seed which will be used to
|
|
construct java.util.Random.</li>
|
|
<li><code>:train-fraction</code> - Fraction of the dataset to use as training set. Defaults to
|
|
0.7.</li>
|
|
</ul>
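<p>A minimal sketch of a seeded split, assuming the underlying
<code>tech.v3.dataset/train-test-split</code> function (aliased <code>ds</code>), which takes the dataset as its
first argument. The dataset, fraction, and seed are hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def d (ds/->dataset {:x (range 100)}))

;; Shuffle with a fixed seed and keep roughly 80% of the rows for training.
(def split (ds/train-test-split d {:train-fraction 0.8
                                   :seed 42}))

(ds/row-count (:train-ds split))
(ds/row-count (:test-ds split))
</code></pre>
<p>Every source row lands in exactly one of the two datasets, so the two row counts add up to the original row count.</p>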
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1318">view source</a></div></div><div class="public anchor" id="var-unique-by"><h3>unique-by</h3><div class="usage"><code>(unique-by options map-fn)</code><code>(unique-by map-fn)</code></div><div class="doc"><div class="markdown"><p>Map-fn function gets passed map for each row, rows are grouped by the
|
|
return value. Keep-fn is used to decide the index to keep.</p>
|
|
<p>:keep-fn - Function from key,idx-seq->idx. Defaults to #(first %2).</p>
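<p>A short sketch of the default and a custom <code>:keep-fn</code>, assuming the underlying
<code>tech.v3.dataset/unique-by</code> function (aliased <code>ds</code>), which takes the dataset as its first
argument followed by the optional options map. The dataset and the <code>by-sym</code> helper are hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def d (ds/->dataset {:sym ["a" "b" "a" "b"]
                      :price [1 2 3 4]}))

;; map-fn receives each row as a map; rows are grouped by its return value.
(defn by-sym [row] (:sym row))

;; Default keep-fn keeps the first row index of each group.
(def first-of-each (ds/unique-by d by-sym))

;; A custom keep-fn that keeps the last row index of each group instead.
(def last-of-each
  (ds/unique-by d {:keep-fn (fn [_key idx-seq] (last idx-seq))} by-sym))
</code></pre>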
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1336">view source</a></div></div><div class="public anchor" id="var-unique-by-column"><h3>unique-by-column</h3><div class="usage"><code>(unique-by-column options colname)</code><code>(unique-by-column colname)</code></div><div class="doc"><div class="markdown"><p>Rows are grouped by the value in the given column.
Keep-fn is used to decide the index to keep.</p>
|
|
<p>:keep-fn - Function from key, idx-seq->idx. Defaults to #(first %2).</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1347">view source</a></div></div><div class="public anchor" id="var-unordered-select"><h3>unordered-select</h3><div class="usage"><code>(unordered-select colname-seq index-seq)</code></div><div class="doc"><div class="markdown"><p>Perform a selection but use the order of the columns in the existing table; do
|
|
<em>not</em> reorder the columns based on colname-seq. Useful when doing selection based
|
|
on sets or persistent hash maps.</p>
|
|
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1358">view source</a></div></div><div class="public anchor" id="var-unroll-column"><h3>unroll-column</h3><div class="usage"><code>(unroll-column column-name)</code><code>(unroll-column column-name options)</code></div><div class="doc"><div class="markdown"><p>Unroll a column that has some (or all) sequential data as entries.
Returns a new dataset with the same columns, with the values of the other columns
duplicated for each unrolled element. The unrolled column then contains only scalar data.</p>
<p>Any missing indexes are dropped.</p>
<pre><code class="language-clojure">user> (-> (ds/->dataset [{:a 1 :b [2 3]}
                         {:a 2 :b [4 5]}
                         {:a 3 :b :a}])
          (ds/unroll-column :b {:indexes? true}))
_unnamed [5 3]:

| :a | :b | :indexes |
|----+----+----------|
|  1 |  2 |        0 |
|  1 |  3 |        1 |
|  2 |  4 |        0 |
|  2 |  5 |        1 |
|  3 | :a |        0 |
</code></pre>
<p>Options:</p>
<ul>
<li><code>:datatype</code> - datatype of the resulting column if one aside from :object is desired.</li>
<li><code>:indexes?</code> - If true, create a new column that records the indexes of the values from
the original column. Can also be a truthy value (like a keyword), in which case the new column
is named after that value.</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1366">view source</a></div></div><div class="public anchor" id="var-update"><h3>update</h3><div class="usage"><code>(update filter-fn-or-ds update-fn & args)</code></div><div class="doc"><div class="markdown"><p>Update this dataset. Filters this dataset into a new dataset,
applies update-fn, then merges the result into the original dataset.</p>
<p>This pathway is designed to work with the tech.v3.dataset.column-filters namespace.</p>
<ul>
<li><code>filter-fn-or-ds</code> is a generalized parameter. May be a function,
a dataset or a sequence of column names.</li>
<li><code>update-fn</code> must take the dataset as the first argument and must return
a dataset.</li>
</ul>
<pre><code class="language-clojure">(ds/bind-> (ds/->dataset dataset) ds
           (ds/remove-column "Id")
           (ds/update cf/string ds/replace-missing-value "NA")
           (ds/update-elemwise cf/string #(get {"" "NA"} % %))
           (ds/update cf/numeric ds/replace-missing-value 0)
           (ds/update cf/boolean ds/replace-missing-value false)
           (ds/update-columnwise (cf/union (cf/numeric ds) (cf/boolean ds))
                                 #(dtype/elemwise-cast % :float64)))
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1400">view source</a></div></div><div class="public anchor" id="var-update-column"><h3>update-column</h3><div class="usage"><code>(update-column col-name update-fn)</code></div><div class="doc"><div class="markdown"><p>Update a column returning a new dataset. update-fn is a column->column
transformation. Throws an error if the column does not exist.</p>
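<p>A minimal sketch of a column->column update, assuming <code>ds</code> aliases <code>tech.v3.dataset</code> and <code>dfn</code> aliases <code>tech.v3.datatype.functional</code>, as in the neighbouring examples. The data is hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.functional :as dfn])

;; Hypothetical price list.
(def prices
  (ds/->dataset [{:item "pen" :price 2}
                 {:item "ink" :price 5}]))

;; update-fn receives the existing :price column and returns its replacement.
;; The original dataset is left unchanged; a new dataset is returned.
(def doubled (ds/update-column prices :price #(dfn/* % 2)))
</code></pre>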
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1426">view source</a></div></div><div class="public anchor" id="var-update-columns"><h3>update-columns</h3><div class="usage"><code>(update-columns column-name-seq-or-fn update-fn)</code></div><div class="doc"><div class="markdown"><p>Update a sequence of columns selected by column name seq or column selector
function.</p>
<p>For example:</p>
<pre><code class="language-clojure">(update-columns DS [:A :B] #(dfn/+ % 2))
(update-columns DS cf/numeric #(dfn// % 2))
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1433">view source</a></div></div><div class="public anchor" id="var-update-columnwise"><h3>update-columnwise</h3><div class="usage"><code>(update-columnwise filter-fn-or-ds cwise-update-fn & args)</code></div><div class="doc"><div class="markdown"><p>Call update-fn on each column of the dataset. Returns the dataset.
See arguments to update.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1447">view source</a></div></div><div class="public anchor" id="var-update-elemwise"><h3>update-elemwise</h3><div class="usage"><code>(update-elemwise filter-fn-or-ds map-fn)</code><code>(update-elemwise map-fn)</code></div><div class="doc"><div class="markdown"><p>Replace all elements in selected columns by calling selected function on each
element. If column names are provided they must be a sequence of column names;
filter-fn-or-ds follows the same rules as in update. Implicitly clears the missing set, so
the function must handle type-specific missing values correctly.
Returns a new dataset.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1454">view source</a></div></div><div class="public anchor" id="var-value-reader"><h3>value-reader</h3><div class="usage"><code>(value-reader options)</code><code>(value-reader)</code></div><div class="doc"><div class="markdown"><p>Return a reader that produces a reader of column values per index.
Options:
:copying? - Defaults to false. When true, row values are copied on read.</p>
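<p>A small sketch of reading rows as sequences of column values, assuming <code>ds</code> aliases <code>tech.v3.dataset</code> and using its dataset-first function of the same name. The data is hypothetical.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical two-column table.
(def table
  (ds/->dataset [{:a 1 :b "x"}
                 {:a 2 :b "y"}]))

;; Each element of the outer reader is itself a reader over one row's values,
;; in column order.
(def rows (ds/value-reader table))

;; With :copying? true the row values are copied when read.
(def copied-rows (ds/value-reader table {:copying? true}))

;; Realize the rows as plain vectors for inspection.
(mapv vec rows)
</code></pre>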
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1466">view source</a></div></div><div class="public anchor" id="var-write.21"><h3>write!</h3><div class="usage"><code>(write! output-path options)</code><code>(write! output-path)</code></div><div class="doc"><div class="markdown"><p>Write a dataset out to a file. Supported forms are:</p>
<pre><code class="language-clojure">(ds/write! test-ds "test.csv")
(ds/write! test-ds "test.tsv")
(ds/write! test-ds "test.tsv.gz")
(ds/write! test-ds "test.nippy")
(ds/write! test-ds out-stream)
</code></pre>
<p>Options:</p>
<ul>
<li><code>:max-chars-per-column</code> - csv,tsv specific, defaults to 65536 - values longer than this will
cause an exception during serialization.</li>
<li><code>:max-num-columns</code> - csv,tsv specific, defaults to 8192 - If the dataset has more than this number of
columns an exception will be thrown during serialization.</li>
<li><code>:quoted-columns</code> - csv specific - sequence of column names that you would like to always have quoted.</li>
<li><code>:file-type</code> - Manually specify the file type. This is usually inferred from the filename but if you
pass in an output stream then you will need to specify the file type.</li>
<li><code>:headers?</code> - whether csv headers are written, defaults to true.</li>
</ul>
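<p>A minimal round-trip sketch, assuming <code>ds</code> aliases <code>tech.v3.dataset</code>. The temp-file name and the data are hypothetical; the file type is inferred from the <code>.csv</code> extension, and <code>:key-fn keyword</code> is used on read so the column names come back as keywords.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical two-column table.
(def table
  (ds/->dataset [{:a 1 :b "x"}
                 {:a 2 :b "y"}]))

;; Write to a temporary file; the .csv suffix selects the csv writer.
(def out-file (java.io.File/createTempFile "tmd-example" ".csv"))
(ds/write! table (.getPath out-file))

;; Read the file back to check what was written.
(def round-trip (ds/->dataset (.getPath out-file) {:key-fn keyword}))
</code></pre>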
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/metamorph.clj#L1476">view source</a></div></div></div></body></html>