<!DOCTYPE html PUBLIC ""
    "">
<html><head><meta charset="UTF-8" /><title>tech.ml.dataset Quick Reference</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());

gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">7.000-beta-23</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 current"><a href="quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li><li class="depth-1 "><a href="walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div 
class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a 
href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.neanderthal.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>neanderthal</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 
61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -672px;"><span class="top" style="height: 681px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span 
class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>smile</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.smile.data.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>data</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="document" id="content"><div class="doc"><div class="markdown"><h1>tech.ml.dataset Quick Reference</h1>
<p>Functions are linked to their source, but if no namespace is specified they are
also accessible via the <code>tech.v3.dataset</code> namespace.</p>
<p>This is not an exhaustive listing of all functionality, just a quick way to find
the functions that we find most useful.</p>
<h2>Loading/Saving</h2>
<ul>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var--.3Edataset">->dataset, ->>dataset</a> - loads csv, tsv,
sequence-of-maps, map-of-arrays, xlsx, xls, and if their respective namespaces and dependencies are loaded, parquet and arrow.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-write.21">write!</a> - Writes csv, tsv, or
nippy, with gzipping. The options are determined by scanning the file path string.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-dataset-.3Edata">dataset->data</a> - Useful if you want the entire
dataset represented as (mostly) pure Clojure/JVM datastructures. Missing sets are
roaring bitmaps, data is probably in primitive arrays. String tables receive special
treatment.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-data-.3Edataset">data->dataset</a> - Inverse of dataset->data.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.io.univocity.html#var-csv-.3Erows">tech.v3.dataset.io.univocity/csv->rows</a> - Lazily parse a
csv or tsv, returning a sequence of string[] rows. This uses a subset of the ->dataset options.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.io.string-row-parser.html#var-rows-.3Edataset">tech.v3.dataset.io.string-row-parser/rows->dataset</a> - Given
a sequence of string[] rows, parse the data into a dataset. Uses a subset of the ->dataset
options.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.libs.parquet.html">parquet support</a></li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.libs.arrow.html">arrow support</a></li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.libs.poi.html">xlsx, xls support</a></li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.libs.fastexcel.html">fast xlsx support</a></li>
</ul>
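<p>As a minimal sketch of the loading and saving functions above, the following builds a dataset from a sequence of maps, writes it to csv, reads it back, and round-trips it through plain data. The file name <code>quick-ref-example.csv</code>, the var names, and the column names are illustrative assumptions, not part of the library.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Build a dataset from a sequence of maps (illustrative data).
(def prices (ds/->dataset [{:symbol "A" :price 10.0}
                           {:symbol "B" :price 12.5}]))

;; The file extension in the path tells write! to emit csv.
(ds/write! prices "quick-ref-example.csv")

;; Csv headers are parsed as strings; :key-fn converts them here to keywords.
(def reloaded (ds/->dataset "quick-ref-example.csv" {:key-fn keyword}))

;; Round trip through (mostly) plain Clojure/JVM data structures.
(def round-trip (ds/data->dataset (ds/dataset->data prices)))
</code></pre>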
<h2>Accessing Values</h2>
<ul>
<li>Datasets overload <code>IFn</code> so they are functions of their column names. <code>(ds :colname)</code> will
return the column named <code>:colname</code>. Datasets implement <code>IPersistentMap</code> so
<code>(map (comp meta second) ds)</code> or <code>(map meta (vals ds))</code> will return a sequence of column
metadata. <code>keys</code>, <code>vals</code>, <code>contains?</code>, <code>assoc</code>, <code>dissoc</code>, <code>merge</code> and map-style destructuring
all work on datasets. Note that <code>update</code> does not work, as <code>update</code> will always return a
persistent map.</li>
<li>Columns are iterable and implement indexed so you can use them with <code>map</code>, <code>count</code>
and <code>nth</code>. They furthermore overload IFn such that they are functions of their indexes similar
to persistent vectors. Using negative values as indexes will index from the end similar to numpy and pandas.</li>
<li>Typed random access is supported via the <code>(tech.v3.datatype/->reader col)</code>
transformation. This is guaranteed to return an implementation of <code>java.util.List</code>
and also overloads <code>IFn</code> such that, like a persistent vector, passing in the index
will return the value - e.g. <code>(col 0)</code> returns the value at index 0. Direct access
to packed datetime columns may be surprising; call <code>tech.v3.datatype.datetime/unpack</code>
on the column prior to calling <code>tech.v3.datatype/->reader</code> to get to the unpacked
datatype. Using negative values as indexes will index from the end similar to numpy and pandas.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-count">row-count</a> - works on datasets and columns.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-column-count">column-count</a> - number of columns.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rows">rows</a> - get the rows of the
dataset as a <code>java.util.List</code> of persistent-map-like maps. Implemented as a flyweight
implementation of <code>clojure.lang.APersistentMap</code> where data is read out of the underlying dataset on demand. This keeps the
data in the backing store and lazily reads it so you will have relatively more expensive reading of the
data but will not increase your memory working-set size. Using negative values as indexes will index from the end similar to numpy and pandas.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rowvecs">rowvecs</a> - Get the rows of the
dataset as a <code>java.util.List</code> of rows. These rows behave like persistent vectors and are safe to use
in maps. If you are going to use row values as keys in maps, passing in <code>{:copying? true}</code> will be more
efficient as then each hash and equals comparison is using data in the vector and not re-reading the data
out of the source dataset. Using negative values as indexes will index from the end similar to numpy and pandas.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-missing">missing</a> - return a RoaringBitmap of missing indexes. If this is a column, it is the column's specific missing indexes. If this is a dataset, return a union of all the columns' missing indexes.</li>
<li><a href="https://github.com/clojure/clojure/blob/master/src/clj/clojure/core.clj#L202">meta, with-meta, vary-meta</a> - Datasets and columns implement
<code>clojure.lang.IObj</code> so you can get/set metadata on them freely. <code>:name</code> has meaning in the system and setting it
directly on a column is not recommended. Metadata is generally carried forward through most of the operations below.</li>
</ul>
<p><code>rows</code> and <code>rowvecs</code> are lazy and thus <code>(rand-nth (ds/rows ds))</code> is
a relatively efficient pathway (and fun). <code>(ds/rows (ds/sample ds))</code> is also
pretty good for quick scans.</p>
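<p>The access pathways described in this section can be sketched roughly as follows. The dataset contents and the var names (<code>example</code>, <code>col-a</code>, <code>reader-a</code>) are made up for illustration; the third row deliberately omits <code>:b</code> so that <code>missing</code> has something to report.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype :as dtype])

;; Illustrative data; the last row has no :b value.
(def example (ds/->dataset [{:a 1 :b "x"} {:a 2 :b "y"} {:a 3}]))

(def col-a (example :a))              ; datasets are functions of their column names
(keys example)                        ; column names via the map interface
(col-a 0)                             ; columns are functions of their indexes
(nth col-a 2)                         ; columns also support nth
(col-a -1)                            ; per the note above, negative indexes count from the end
(def reader-a (dtype/->reader col-a)) ; typed random access as a java.util.List
(ds/row-count example)
(ds/column-count example)
(first (ds/rows example))             ; map-like view of the first row
(first (ds/rowvecs example))          ; vector-like view of the first row
(ds/missing example)                  ; bitmap of row indexes with a missing value
</code></pre>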
<h2>Print Options</h2>
<p>We use these options frequently during exploration to get more/less printing
output. These are used like <code>(vary-meta ds assoc :print-column-max-width 10)</code>.
Often it is useful to print the entire table: <code>(vary-meta ds assoc :print-index-range :all)</code></p>
<ul>
<li><a href="https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/print.clj#L93">print metadata options</a></li>
</ul>
<h2>Dataset Exploration</h2>
<ul>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-head">head</a> - Return dataset consisting of first N rows.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-tail">tail</a> - Return dataset consisting of last N rows.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-sample">sample</a> - Randomly sample N rows of the dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rand-nth">rand-nth</a> - Randomly sample a row of the dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-descriptive-stats">descriptive-stats</a> - return a dataset of
columnwise descriptive statistics.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-brief">brief</a> - Return a sequence of maps of descriptive statistics.</li>
</ul>
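<p>A short sketch of the exploration functions listed above, run against a small made-up dataset. The var name <code>example</code> and its columns are assumptions chosen for illustration.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Illustrative data: ten rows, two numeric columns.
(def example (ds/->dataset {:id    (range 10)
                            :score (map #(* 1.5 %) (range 10))}))

(ds/head example 3)            ; dataset of the first 3 rows
(ds/tail example 3)            ; dataset of the last 3 rows
(ds/sample example 4)          ; dataset of 4 randomly chosen rows
(ds/rand-nth example)          ; a single random row
(ds/descriptive-stats example) ; dataset with one row of statistics per column
(ds/brief example)             ; similar statistics as a sequence of maps
</code></pre>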
<h2>Subrect Selection</h2>
<p>Keeping in mind that assoc, dissoc, and merge all work at the dataset level - here are some other
pathways that are useful for subrect selection.</p>
<ul>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-select">select</a> - can be used for renaming.
Anything iterable can be used for the rows.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-select-columns">select-columns</a> - Select a subset of columns.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-select-rows">select-rows</a> - works on datasets and columns.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-drop-rows">drop-rows</a> - works on datasets and columns.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-drop-missing">drop-missing</a> - Drop rows with missing values from the dataset.</li>
</ul>
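<p>The selection functions above might be combined along these lines. This is a sketch over invented data; the var names are hypothetical and the middle row omits <code>:b</code> so that <code>drop-missing</code> removes it.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Illustrative data; the second row has no :b value.
(def example (ds/->dataset [{:a 1 :b 10 :c "x"}
                            {:a 2 :c "y"}
                            {:a 3 :b 30 :c "z"}]))

(def a-and-c   (ds/select-columns example [:a :c])) ; keep two columns
(def first-two (ds/select-rows example [0 1]))      ; keep rows 0 and 1
(def no-middle (ds/drop-rows example [1]))          ; drop row 1
(def complete  (ds/drop-missing example))           ; drop rows with any missing value
;; select takes the column names and the row indexes together.
(def corner    (ds/select example [:a :b] [0 2]))
;; Map-style operations also work at the dataset level.
(def without-c (dissoc example :c))
</code></pre>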
<h2>Dataset Manipulation</h2>
<p>Several of the functions below come in <code>-column</code> variants and some additionally come
in <code>->indexes</code> variants. The <code>-column</code> variants are going to be faster than the base
versions, and the <code>->indexes</code> variants simply return indexes and thus skip creating sub-datasets,
so these are faster yet.</p>
<ul>
<li><a href="https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/impl/dataset.clj#L380">new-dataset</a> - Create a new dataset from a sequence of columns. Columns may be actual columns created via <code>tech.v3.dataset.column/new-column</code> or they could be maps containing at least keys <code>#:tech.v3.dataset{:name :data}</code> but also potentially <code>#:tech.v3.dataset{:metadata :missing}</code> in order to create a column with a specific set of missing values and metadata. <code>:force-datatype true</code> will stop the system
from attempting to scan the data for missing values and from, for example, creating a float column
from a vector of Float objects. The above also applies to using <code>clojure.core/assoc</code> with a dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-replace-missing">replace-missing</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-replace-missing-value">replace-missing-value</a> - replace missing values in one or more columns.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-group-by-column">group-by-column</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-group-by">group-by</a> - Create a persistent map of value->dataset. Sub-datasets are created via indexing into the original dataset so data is not copied.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-sort-by-column">sort-by-column</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-sort-by">sort-by</a> - Return a sorted dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-filter-column">filter-column</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-filter">filter</a> - Return a new dataset with only rows that pass the predicate.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-concat-copying">concat-copying</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-concat-inplace">concat-inplace</a> - Given Y datasets produce a new dataset. Copying is generally much more efficient than in-place for a dataset count > 2. <code>(apply ds/concat-copying x-seq)</code> is
<strong>far</strong> more efficient than <code>(reduce ds/concat-copying x-seq)</code>; this also is true for <code>concat-inplace</code>.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-unique-by-column">unique-by-column</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-unique-by">unique-by</a> - Remove duplicate rows. Passing in <code>keep-fn</code> allows
you to choose either first, last, or some other criteria for rows that have the same
values. For <code>unique-by</code>, <code>identity</code> will work just fine.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-map">row-map</a> - In parallel, map a function from map->map over the dataset. The returned maps will be
used to create new columns in a new dataset and the result merged with the original. Note there are options to return a sequence of datasets as opposed to a single large
final dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-mapcat">row-mapcat</a> - In parallel, map a function from map->sequence-of-maps over the dataset potentially
expanding or shrinking the result. When multiple maps are returned, row information not included in the original map is efficiently duplicated. Note there are options to
return a sequence of datasets as opposed to a single potentially very large final dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-pmap-ds">pmap-ds</a> - Split dataset into batches and in parallel map a function from ds->ds across the dataset.
Can return either a new dataset via concat-copying or a sequence of datasets.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.join.html#var-pd-merge">pd-merge</a> - Generalized left,right,inner,outer, and cross joins.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.join.html#var-left-join-asof">left-join-asof</a> - Join-nearest type functionality useful for doing things like finding
the 3, 6, and 12 month prices from a dataset of daily prices where you want the nearest price, as things don't trade on the weekends.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.rolling.html#var-rolling">rolling</a> - fixed and variable rolling window operations.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html#var-group-by-column-agg">group-by-column-agg</a> - Very high performance primitive taking a sequence of
datasets and producing a new dataset that is first grouped by one or more columns and then the per-group data is reduced using a map of reducers. Each key in the map becomes
a column in the result dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.neanderthal.html">neanderthal support</a> - transformations of datasets to/from neanderthal dense native matrices.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.tensor.html">tensor support</a> - transformations of datasets to/from <a href="https://cnuernber.github.io/dtype-next/tech.v3.tensor.html">tech.v3.tensor</a> objects.</li>
</ul>
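<p>A few of the manipulation functions above, sketched against a tiny invented <code>trades</code> dataset. The data, var names, and result column names (<code>:price-sum</code>, <code>:n-rows</code>) are assumptions for illustration; the <code>ds-reduce</code> alias for <code>tech.v3.dataset.reductions</code> is likewise just a convention chosen here.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce])

;; Illustrative data.
(def trades (ds/->dataset [{:symbol "A" :price 10.0}
                           {:symbol "B" :price 20.0}
                           {:symbol "A" :price 30.0}
                           {:symbol "B" :price 20.0}]))

(def expensive (ds/filter-column trades :price #(> % 15.0))) ; rows passing the predicate
(def by-price  (ds/sort-by-column trades :price))            ; ascending by :price
(def by-symbol (ds/group-by-column trades :symbol))          ; map of value->dataset
(def one-each  (ds/unique-by-column trades :symbol))         ; one row per symbol
;; Prefer apply over reduce when concatenating many datasets.
(def doubled   (apply ds/concat-copying [trades trades]))
;; Group by :symbol and reduce each group with a map of reducers;
;; each key in the map becomes a column in the result.
(def totals    (ds-reduce/group-by-column-agg
                :symbol
                {:price-sum (ds-reduce/sum :price)
                 :n-rows    (ds-reduce/row-count)}
                [trades]))
</code></pre>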
<h2>Elementwise Arithmetic</h2>
<p>Functions in <code>tech.v3.datatype.functional</code> apply various elementwise
arithmetic operations to a column, lazily returning a new column. It is highly recommended to
remove all missing values before using elementwise arithmetic as the <code>functional</code> namespace
has no knowledge of missing values. Integer columns with missing values will be upcast
to float or double columns in order to support a missing value indicator.</p>
<pre><code class="language-clojure">  ;; Fragment: assumes the aliases dtype (tech.v3.datatype), dfn (tech.v3.datatype.functional)
  ;; and ds (tech.v3.dataset), a dataset also bound to ds, and that filing, investor and
  ;; form-type are bound by the surrounding code.
  (assoc ds :value (dtype/elemwise-cast (ds :value) :int64)
         :shrs-or-prn-amt (dtype/elemwise-cast (ds :shrs-or-prn-amt) :int64)
         :cik (dtype/const-reader (:cik filing) (ds/row-count ds))
         :investor (dtype/const-reader investor (ds/row-count ds))
         :form-type (dtype/const-reader form-type (ds/row-count ds))
         :edgar-id (dtype/const-reader (:edgar-id filing) (ds/row-count ds))
         :weight (dfn// (ds :value)
                        (double (dfn/reduce-+ (ds :value)))))
</code></pre>
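<p>For readers who want something to paste into a REPL, here is a self-contained variation on the fragment above. The <code>holdings</code> dataset, the <code>"example-investor"</code> value, and the <code>:value-f64</code> column are invented for illustration; only the library calls are taken from the fragment.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype :as dtype]
         '[tech.v3.datatype.functional :as dfn])

;; Illustrative data with no missing values.
(def holdings (ds/->dataset {:value [1 2 3 4]}))

(def weighted
  (assoc holdings
         ;; Cast each element of the column to double.
         :value-f64 (dtype/elemwise-cast (holdings :value) :float64)
         ;; The same value repeated for every row.
         :investor (dtype/const-reader "example-investor" (ds/row-count holdings))
         ;; Each value divided by the column total.
         :weight (dfn// (holdings :value)
                        (double (dfn/reduce-+ (holdings :value))))))
</code></pre>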
<h2>Forcing Lazy Evaluation</h2>
<p>The dataset system relies on index indirection and laziness quite often. This allows
you to aggregate up operations and pay relatively little for them; however, sometimes
it increases the accessing costs of the data by an undesirable amount. Because
of this we use <code>clone</code> quite often to force calculations to complete before
beginning a new stage of data processing. Clone is multithreaded and very efficient,
often boiling down into either parallelized iteration over the data or
<code>System/arraycopy</code> calls.</p>
<p>Additionally, calling <code>clone</code> after loading will reduce the in-memory size of the
dataset by a bit - sometimes 20%. This is because lists that have allocated extra
capacity are copied into arrays that have no extra capacity.</p>
<ul>
<li><a href="https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/src/tech/v3/datatype.clj#L95">tech.v3.datatype/clone</a> - Clones the dataset, realizing lazy operations and copying the data into
java arrays. Will clone datasets or columns.</li>
</ul>
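<p>A rough sketch of where <code>clone</code> might sit between processing stages. The data and var names (<code>source</code>, <code>staged</code>, <code>realized</code>) are illustrative assumptions; how much laziness a given pipeline actually accumulates depends on the operations used.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype :as dtype]
         '[tech.v3.datatype.functional :as dfn])

;; Illustrative data.
(def source (ds/->dataset {:a (range 1000)}))

;; Stage one: filtering and elementwise arithmetic may leave index
;; indirection and lazy readers behind.
(def staged (let [evens (ds/filter-column source :a even?)]
              (assoc evens :a-doubled (dfn/* (evens :a) 2))))

;; Force the pending work into freshly allocated arrays before the next stage.
(def realized (dtype/clone staged))
</code></pre>
<p>The clone can then be read repeatedly in later stages without re-paying the indirection cost of the earlier ones.</p>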
</div></div></div></body></html>