<!DOCTYPE html PUBLIC ""
    "">
<html><head><meta charset="UTF-8" /><title>tech.v3.dataset.modelling documentation</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">8.003</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="000-getting-started.html"><div class="inner"><span>tech.ml.dataset Getting Started</span></div></a></li><li class="depth-1 "><a href="100-walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li><li class="depth-1 "><a href="200-quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span 
class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span class="top"></span><span 
class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch current"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li class="depth-4 
branch"><a href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -641px;"><span class="top" style="height: 650px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.clj-transit.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clj-transit</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="sidebar secondary"><h3><a href="#top"><span class="inner">Public Vars</span></a></h3><ul><li class="depth-1"><a href="tech.v3.dataset.modelling.html#var-column-values-.3Ecategorical"><div class="inner"><span>column-values->categorical</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.modelling.html#var-dataset-.3Ecategorical-xforms"><div class="inner"><span>dataset->categorical-xforms</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.modelling.html#var-feature-ecount"><div class="inner"><span>feature-ecount</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.modelling.html#var-inference-column.3F"><div class="inner"><span>inference-column?</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.modelling.html#var-inference-target-column-names"><div class="inner"><span>inference-target-column-names</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.modelling.html#var-inference-target-ds"><div class="inner"><span>inference-target-ds</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.modelling.html#var-inference-target-label-inverse-map"><div class="inner"><span>inference-target-label-inverse-map</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.modelling.html#var-inference-target-label-map"><div class="inner"><span>inference-target-label-map</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.modelling.html#var-k-fold-datasets"><div class="inner"><span>k-fold-datasets</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.modelling.html#var-labels"><div 
class="inner"><span>labels</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.modelling.html#var-model-type"><div class="inner"><span>model-type</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.modelling.html#var-num-inference-classes"><div class="inner"><span>num-inference-classes</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.modelling.html#var-probability-distributions-.3Elabel-column"><div class="inner"><span>probability-distributions->label-column</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.modelling.html#var-set-inference-target"><div class="inner"><span>set-inference-target</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.modelling.html#var-train-test-split"><div class="inner"><span>train-test-split</span></div></a></li></ul></div><div class="namespace-docs" id="content"><h1 class="anchor" id="top">tech.v3.dataset.modelling</h1><div class="doc"><div class="markdown"><p>Methods related specifically to machine learning such as setting the inference
target. This file integrates tightly with tech.v3.dataset.categorical which provides
categorical -> number and one-hot transformation pathways.</p>
<p>The functions in this namespace manipulate the metadata on the columns of the dataset, which can be inspected via <code>clojure.core/meta</code>.</p>
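<p>A minimal sketch of that inspection, assuming the <code>ds</code> and <code>ds-mod</code> aliases and a small hypothetical dataset (<code>toy-ds</code> and its column names are illustrative only, not part of the library):</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.modelling :as ds-mod])

;; Hypothetical toy dataset; the column names are illustrative only.
(def toy-ds
  (ds/->dataset [{:sepal-length 5.1 :species "setosa"}
                 {:sepal-length 7.0 :species "versicolor"}
                 {:sepal-length 6.3 :species "virginica"}]))

;; Mark :species as the inference target.
(def marked-ds (ds-mod/set-inference-target toy-ds :species))

;; Inspect the metadata that now lives on the column itself.
(println (meta (ds/column marked-ds :species)))

;; Query the same information through this namespace.
(println (ds-mod/inference-target-column-names marked-ds))
(println (ds-mod/feature-ecount marked-ds))
</code></pre>
<p>Printing the metadata map, rather than assuming its keys, is a quick way to check what a given library version actually stores on the column.</p>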
</div></div><div class="public anchor" id="var-column-values-.3Ecategorical"><h3>column-values->categorical</h3><div class="usage"><code>(column-values->categorical dataset src-column)</code></div><div class="doc"><div class="markdown"><p>Given a column encoded via either string->number or one-hot, reverse
map to a sequence of the original string column values.
In the case of one-hot mappings, src-column must be the original
column name before the one-hot map.</p>
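<p>A hedged sketch of the round trip for the string->number case. It assumes that <code>tech.v3.dataset/categorical->number</code> accepts the <code>tech.v3.dataset.column-filters/categorical</code> filter as its column selector; <code>toy-ds</code> and its columns are hypothetical:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.column-filters :as cf]
         '[tech.v3.dataset.modelling :as ds-mod])

;; Hypothetical toy dataset; names and values are illustrative only.
(def toy-ds
  (ds/->dataset {:species ["setosa" "versicolor" "setosa"]
                 :sepal-length [5.1 7.0 4.9]}))

;; Encode every categorical (string) column as numbers.
(def encoded-ds (ds/categorical->number toy-ds cf/categorical))

;; Reverse-map the encoded column back to the original string values.
(println (ds-mod/column-values->categorical encoded-ds :species))
</code></pre>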
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/modelling.clj#L109">view source</a></div></div><div class="public anchor" id="var-dataset-.3Ecategorical-xforms"><h3>dataset->categorical-xforms</h3><div class="usage"><code>(dataset->categorical-xforms ds)</code></div><div class="doc"><div class="markdown"><p>Given a dataset, return a map of column-name->xform information.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/modelling.clj#L66">view source</a></div></div><div class="public anchor" id="var-feature-ecount"><h3>feature-ecount</h3><div class="usage"><code>(feature-ecount dataset)</code></div><div class="doc"><div class="markdown"><p>Number of feature columns. Feature columns are columns that are not
inference targets.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/modelling.clj#L83">view source</a></div></div><div class="public anchor" id="var-inference-column.3F"><h3>inference-column?</h3><div class="usage"><code>(inference-column? col)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/modelling.clj#L21">view source</a></div></div><div class="public anchor" id="var-inference-target-column-names"><h3>inference-target-column-names</h3><div class="usage"><code>(inference-target-column-names ds)</code></div><div class="doc"><div class="markdown"><p>Return the names of the columns that are inference targets.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/modelling.clj#L37">view source</a></div></div><div class="public anchor" id="var-inference-target-ds"><h3>inference-target-ds</h3><div class="usage"><code>(inference-target-ds dataset)</code></div><div class="doc"><div class="markdown"><p>Given a dataset return reverse-mapped inference target columns or nil
in the case where there are no inference targets.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/modelling.clj#L203">view source</a></div></div><div class="public anchor" id="var-inference-target-label-inverse-map"><h3>inference-target-label-inverse-map</h3><div class="usage"><code>(inference-target-label-inverse-map dataset & [label-columns])</code></div><div class="doc"><div class="markdown"><p>Given options generated during ETL operations and annotated with :label-columns
sequence containing 1 label column, generate a reverse map that maps from a dataset
value back to the label that generated that value.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/modelling.clj#L57">view source</a></div></div><div class="public anchor" id="var-inference-target-label-map"><h3>inference-target-label-map</h3><div class="usage"><code>(inference-target-label-map dataset & [label-columns])</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/modelling.clj#L46">view source</a></div></div><div class="public anchor" id="var-k-fold-datasets"><h3>k-fold-datasets</h3><div class="usage"><code>(k-fold-datasets dataset k options)</code><code>(k-fold-datasets dataset k)</code></div><div class="doc"><div class="markdown"><p>Given 1 dataset, prepare K datasets using the k-fold algorithm.
<code>:randomize-dataset?</code> defaults to true, which realizes the entire dataset,
so use with care on large datasets.</p>
<p>Returns a sequence of maps with the keys <code>:test-ds</code> and <code>:train-ds</code>.</p>
<p>Options:</p>
<ul>
<li><code>:randomize-dataset?</code> - When true, shuffle the dataset. In that case <code>:seed</code> may be
provided. Defaults to true.</li>
<li><code>:seed</code> - when <code>:randomize-dataset?</code> is true then this can either be an
implementation of java.util.Random or an integer seed which will be used to
construct java.util.Random.</li>
</ul>
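<p>A minimal usage sketch of the call and options described above. The dataset <code>toy-ds</code>, the fold count, and the seed value are hypothetical choices for illustration:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.modelling :as ds-mod])

;; Hypothetical 10-row dataset; column names are illustrative only.
(def toy-ds
  (ds/->dataset {:x (range 10)
                 :y (map #(* 2 %) (range 10))}))

;; Five folds, shuffled with a fixed integer seed so runs are repeatable.
(def folds (ds-mod/k-fold-datasets toy-ds 5 {:seed 42}))

;; Each element is a map holding a :train-ds and a :test-ds.
(doseq [{:keys [train-ds test-ds]} folds]
  (println "train rows:" (ds/row-count train-ds)
           "test rows:" (ds/row-count test-ds)))
</code></pre>
<p>Within each fold, the train and test row counts should add up to the row count of the original dataset.</p>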
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/modelling.clj#L139">view source</a></div></div><div class="public anchor" id="var-labels"><h3>labels</h3><div class="usage"><code>(labels dataset)</code></div><div class="doc"><div class="markdown"><p>Return the labels. The labels sequence is the reverse mapped inference
column. This returns a single column of data or errors out.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/modelling.clj#L212">view source</a></div></div><div class="public anchor" id="var-model-type"><h3>model-type</h3><div class="usage"><code>(model-type dataset & [column-name-seq])</code></div><div class="doc"><div class="markdown"><p>Check the label column after dataset processing.
Returns either <code>:regression</code> or <code>:classification</code>.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/modelling.clj#L91">view source</a></div></div><div class="public anchor" id="var-num-inference-classes"><h3>num-inference-classes</h3><div class="usage"><code>(num-inference-classes dataset)</code></div><div class="doc"><div class="markdown"><p>Given a dataset and correctly built options from pipeline operations,
return the number of classes used for the label. Errors if the dataset is not a
classification dataset.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/modelling.clj#L75">view source</a></div></div><div class="public anchor" id="var-probability-distributions-.3Elabel-column"><h3>probability-distributions->label-column</h3><div class="usage"><code>(probability-distributions->label-column prob-ds dst-colname label-column-datatype)</code><code>(probability-distributions->label-column prob-ds dst-colname)</code></div><div class="doc"><div class="markdown"><p>Given a dataset that has columns in which the column names describe labels and the
rows describe a probability distribution, create a label column by taking the max
value in each row and assigning the name of the column holding that value to the row.
Creates a categorical label column which has a categorical map in its meta.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/modelling.clj#L225">view source</a></div></div><div class="public anchor" id="var-set-inference-target"><h3>set-inference-target</h3><div class="usage"><code>(set-inference-target dataset target-name-or-target-name-seq)</code></div><div class="doc"><div class="markdown"><p>Set the inference target on the column. This sets the :column-type member
of the column metadata to :inference-target?.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/modelling.clj#L26">view source</a></div></div><div class="public anchor" id="var-train-test-split"><h3>train-test-split</h3><div class="usage"><code>(train-test-split dataset {:keys [train-fraction], :or {train-fraction 0.7}, :as options})</code><code>(train-test-split dataset)</code></div><div class="doc"><div class="markdown"><p>Probabilistically split the dataset returning a map of <code>{:train-ds :test-ds}</code>.</p>
<p>Options:</p>
<ul>
<li><code>:randomize-dataset?</code> - When true, shuffle the dataset. In that case <code>:seed</code> may be
provided. Defaults to true.</li>
<li><code>:seed</code> - when <code>:randomize-dataset?</code> is true then this can either be an
implementation of java.util.Random or an integer seed which will be used to
construct java.util.Random.</li>
<li><code>:train-fraction</code> - Fraction of the dataset to use as training set. Defaults to
0.7.</li>
</ul>
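<p>A minimal usage sketch of the split with the options listed above. The dataset <code>toy-ds</code>, the seed, and the 0.8 fraction are hypothetical values chosen for illustration:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.modelling :as ds-mod])

;; Hypothetical 10-row dataset; column names are illustrative only.
(def toy-ds
  (ds/->dataset {:x (range 10)
                 :y (map #(* 2 %) (range 10))}))

;; Shuffle with a fixed integer seed and keep roughly 80% of rows for training.
(def split (ds-mod/train-test-split toy-ds {:seed 42 :train-fraction 0.8}))

(println "train rows:" (ds/row-count (:train-ds split)))
(println "test rows:" (ds/row-count (:test-ds split)))
</code></pre>
<p>Passing an integer <code>:seed</code> keeps the shuffle repeatable, which helps when comparing models trained on the same split.</p>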
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/modelling.clj#L178">view source</a></div></div></div></body></html>