<!DOCTYPE html PUBLIC ""
"">
<html><head><meta charset="UTF-8" /><title>tech.ml.dataset Quick Reference</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">8.003</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="000-getting-started.html"><div class="inner"><span>tech.ml.dataset Getting Started</span></div></a></li><li class="depth-1 "><a href="100-walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li><li class="depth-1 current"><a href="200-quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span 
class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span class="top"></span><span 
class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -641px;"><span class="top" style="height: 650px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.clj-transit.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clj-transit</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="document" id="content"><div class="doc"><div class="markdown"><h1>tech.ml.dataset Quick Reference</h1>
<p>This topic summarizes many of the most frequently used TMD functions, together with some quick notes about their use. Functions here are linked to further documentation, or their source. Unless a namespace is specified, each function is accessible via the <code>tech.v3.dataset</code> namespace.</p>
<p>For a more thorough treatment, the <a href="https://techascent.github.io/tech.ml.dataset/index.html">API docs</a> list every available function.</p>
<h3>Table of Contents</h3>
<ol>
<li><a href="#LoadingSaving">Loading/Saving</a></li>
<li><a href="#AccessingValues">Accessing Values</a></li>
<li><a href="#PrintOptions">REPL Friendly Printing</a></li>
<li><a href="#ExploringDatasets">Exploring Datasets</a></li>
<li><a href="#SelectingSubrects">Selecting Subrects</a></li>
<li><a href="#ManipulatingDatasets">Manipulating Datasets</a></li>
<li><a href="#ElementwiseArithmetic">Elementwise Arithmetic</a></li>
<li><a href="#ForcingLazyEvaluation">Forcing Lazy Evaluation</a></li>
</ol>
<hr />
<div id="LoadingSaving"></div>
<h2>Loading/Saving</h2>
<ul>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var--.3Edataset">-&gt;dataset, -&gt;&gt;dataset</a> - obtains datasets from files or streams of csv/tsv, sequence-of-maps, map-of-arrays, xlsx, xls, and other typical formats. If their respective namespaces and dependencies are loaded, this function can also load parquet and arrow. <a href="https://github.com/techascent/tech.ml.dataset.sql">SQL</a> and <a href="https://github.com/cnuernber/tmdjs">ClojureScript</a> support is provided by separate libraries.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-write.21">write!</a> - writes csv, tsv, nippy (or a variety of other formats) with optional gzipping. The output format and compression are inferred from the file path string.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.libs.parquet.html">parquet support</a></li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.libs.poi.html">xlsx, xls support</a></li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.libs.fastexcel.html">fast xlsx support</a></li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.libs.arrow.html">arrow support</a></li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-dataset-.3Edata">dataset-&gt;data</a> - Useful if you want an entire dataset represented as Clojure/JVM datastructures. Primitive arrays save space, roaring bitmaps represent missing sets, and string tables receive special treatment.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-data-.3Edataset">data-&gt;dataset</a> - Inverse of dataset-&gt;data.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.io.univocity.html#var-csv-.3Erows">tech.v3.dataset.io.univocity/csv-&gt;rows</a> - lower-level support for lazily parsing a csv or tsv as a sequence of <code>string[]</code> rows. Offers a subset of the <code>-&gt;dataset</code> options.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.io.string-row-parser.html#var-rows-.3Edataset">tech.v3.dataset.io.string-row-parser/rows-&gt;dataset</a> - lower-level support for obtaining a dataset from a sequence of <code>string[]</code> rows. Offers a subset of the <code>-&gt;dataset</code> options.</li>
</ul>
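<p>As a minimal sketch of the loading and saving functions above (the temp-file path and the column names are illustrative assumptions, not part of TMD), datasets can be built from in-memory data and round-tripped through a gzipped csv:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Sequence-of-maps: each map is one row.
(def from-rows (ds/-&gt;dataset [{:a 1 :b "x"} {:a 2 :b "y"}]))

;; Map-of-arrays: each key is one column.
(def from-cols (ds/-&gt;dataset {:a [1 2 3] :b [4.0 5.0 6.0]}))

;; write! inspects the path string; here ".csv.gz" asks for gzipped csv.
;; The temp file exists only for this illustration.
(def path (.getPath (java.io.File/createTempFile "tmd-example" ".csv.gz")))
(ds/write! from-cols path)

;; :key-fn is applied to each column name read from the file header.
(def reloaded (ds/-&gt;dataset path {:key-fn keyword}))

(ds/row-count reloaded)
</code></pre>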
<hr />
<div id="AccessingValues"></div>
<h2>Accessing Values</h2>
<ul>
<li>Datasets are logically maps of column name to column, and to this end implement <code>IPersistentMap</code>. So, <code>(map meta (vals ds))</code> will return a sequence of column metadata. Moreover, datasets implement <code>IFn</code>, and so are functions of their column names. Thus, <code>(ds :colname)</code> will return the column named <code>:colname</code>. Functions like <code>keys</code>, <code>vals</code>, <code>contains?</code>, <code>assoc</code>, <code>dissoc</code>, <code>merge</code>, and map-style destructuring all work on datasets. Notably, <code>update</code> does not work, as <code>update</code> always returns a persistent map.</li>
<li>Columns are iterable and implement indexed (random access), so they work with <code>map</code>, <code>count</code> and <code>nth</code>. Columns also implement <code>IFn</code>, analogous to persistent vectors. Helpfully, negative indexes read from the end, similar to numpy and pandas.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-count">row-count</a> - count dataset and column rows.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rows">rows</a> - get the rows of the dataset as a <code>java.util.List</code> of persistent-map-like maps. Accomplished by a flyweight implementation of <code>clojure.lang.APersistentMap</code> where data is read out of the underlying dataset on demand. This keeps the data in the backing store for lazy access; reading becomes marginally more expensive, but the call does not increase memory working-set size. Indexing the returned rows with negative values indexes from the end, similar to numpy and pandas.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rowvecs">rowvecs</a> - get the rows of the dataset as a <code>java.util.List</code> of persistent-vector-like entries. These rows are safe to use in maps. When using row values as keys in maps, the <code>{:copying? true}</code> option can help with performance, because each hash and equals comparison is using data located in the vector, not re-reading the data out of the source dataset. Negative values index from the end similar to numpy and pandas.</li>
<li><code>rows</code> and <code>rowvecs</code> are lazy and thus <code>(rand-nth (ds/rows ds))</code> is a relatively efficient pathway (and fun). <code>(ds/rows (ds/sample ds))</code> is also good for a quick scan.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-column-count">column-count</a> - count columns.</li>
<li>Typed random access is supported via the <code>(tech.v3.datatype/-&gt;reader col)</code> transformation. This is guaranteed to return an implementation of <code>java.util.List</code> storing typed values. These implement <code>IFn</code> like a column or persistent vector. Direct access to packed datetime columns may produce surprising results; call <code>tech.v3.datatype.datetime/unpack</code> on the column prior to calling <code>tech.v3.datatype/-&gt;reader</code> to get to the unpacked datatype. Negative indexes on readers index from the end.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-missing">missing</a> - return a RoaringBitmap of missing indexes. For columns, returns the column's specific missing indexes. For datasets, returns a union of all the columns' missing indexes.</li>
<li><a href="https://github.com/clojure/clojure/blob/master/src/clj/clojure/core.clj#L202">meta, with-meta, vary-meta</a> - both datasets and columns implement <code>clojure.lang.IObj</code> so metadata works. The key <code>:name</code> has meaning in the system and setting it directly on a column is not recommended. In general, operations preserve metadata.</li>
</ul>
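<p>The access patterns above can be sketched as follows. The dataset <code>ds-ex</code> and its column names are made up for illustration, and the second row deliberately omits <code>:b</code> so that <code>missing</code> has something to report:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype :as dtype])

(def ds-ex (ds/-&gt;dataset [{:a 1 :b "x"} {:a 2} {:a 3 :b "z"}]))

(keys ds-ex)                ; column names, as with any map
(def col-a (ds-ex :a))      ; a dataset is a function of its column names
(col-a 0)                   ; a column is a function of the row index
(col-a -1)                  ; negative indexes read from the end
(nth col-a 1)

(ds/row-count ds-ex)
(ds/column-count ds-ex)

(first (ds/rows ds-ex))     ; map-like view of the first row
(first (ds/rowvecs ds-ex))  ; vector-like view of the first row

(ds/missing ds-ex)          ; RoaringBitmap of row indexes with a missing value
(dtype/-&gt;reader col-a)      ; typed java.util.List view of the column
</code></pre>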
<hr />
<div id="PrintOptions"></div>
<h2>REPL Friendly Printing</h2>
<p>REPL workflows are an important part of TMD, and so controlling what is printed (especially for larger datasets) is critical. Many options are provided, by metadata, to get the right information on the screen for perusal.</p>
<p>By default, printing is abbreviated; the helpful <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-print-all">print-all</a> function overrides this behavior to enable printing all rows.</p>
<p>In general, any option can be set like <code>(vary-meta ds assoc :print-column-max-width 10)</code>.</p>
<ul>
<li>Summary of <a href="https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/print.clj#L93">print metadata options</a></li>
</ul>
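<p>A small sketch of both mechanisms, assuming a made-up 100-row dataset named <code>wide</code>: print options travel with the dataset as metadata, so they are set with <code>vary-meta</code> and survive until the dataset is printed.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def wide (ds/-&gt;dataset {:id   (vec (range 100))
                         :note (vec (repeat 100 "a fairly long string value"))}))

;; Limit the printed width of each column to 10 characters.
(def narrow (vary-meta wide assoc :print-column-max-width 10))

;; Returns a dataset whose printed form includes every row.
(def full (ds/print-all wide))

(println narrow)
</code></pre>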
<hr />
<div id="ExploringDatasets"></div>
<h2>Exploring Datasets</h2>
<ul>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-head">head</a> - obtains a dataset consisting of the first N rows of the input dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-tail">tail</a> - obtains a dataset consisting of the last N rows of the input dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-sample">sample</a> - samples N rows, randomly, as a dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rand-nth">rand-nth</a> - samples a single row of the dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-descriptive-stats">descriptive-stats</a> - produces a dataset of columnwise descriptive statistics.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-brief">brief</a> - get descriptive statistics as Clojure data (an edn sequence of maps, one for each column).</li>
</ul>
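<p>A quick REPL session using these functions might look like the sketch below. The <code>prices</code> dataset is invented for the example; each call returns a new dataset (or row, or edn data) and leaves the input untouched.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def prices (ds/-&gt;dataset {:sym   ["a" "b" "c" "d"]
                           :price [1.0 2.5 4.0 8.0]}))

(ds/head prices 2)             ; first 2 rows
(ds/tail prices 2)             ; last 2 rows
(ds/sample prices 2)           ; 2 randomly chosen rows
(ds/rand-nth prices)           ; one random row
(ds/descriptive-stats prices)  ; per-column statistics as a dataset
(ds/brief prices)              ; the same kind of summary as plain Clojure data
</code></pre>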
<hr />
<div id="SelectingSubrects"></div>
<h2>Selecting Subrects</h2>
<p>Recall that since datasets are maps, <code>assoc</code>, <code>dissoc</code>, and <code>merge</code> all work at the dataset level - beyond that, consider these helpful subrect selection functions.</p>
<ul>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-select-columns">select-columns</a> - select a subset of columns. Notably, this also controls column <em>order</em> for downstream printing and serialization.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-select-rows">select-rows</a> - get a specific subset of rows from a dataset or column.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-select">select</a> - get a specific set of rows and columns in a single call, can be used for renaming.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-drop-rows">drop-rows</a> - drop rows by index from a dataset or column.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-drop-missing">drop-missing</a> - drop any rows with missing values from the dataset.</li>
</ul>
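<p>The selection functions can be sketched on a small invented dataset <code>tbl</code>, whose second row has no <code>:b</code> value:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def tbl (ds/-&gt;dataset [{:a 1 :b 10 :c "x"}
                        {:a 2       :c "y"}
                        {:a 3 :b 30 :c "z"}]))

(ds/select-columns tbl [:c :a])  ; keeps :c and :a, in that order
(ds/select-rows tbl [0 2])       ; rows by index
(ds/select tbl [:a :b] [0 1])    ; columns and rows in one call
(ds/drop-rows tbl [0])           ; everything except row 0
(ds/drop-missing tbl)            ; drops the row that lacks :b
</code></pre>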
<hr />
<div id="ManipulatingDatasets"></div>
<h2>Manipulating Datasets</h2>
<ul>
<li><a href="https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/impl/dataset.clj#L380">new-dataset</a> - Create a new dataset from a sequence of columns. Columns may be actual columns created via <code>tech.v3.dataset.column/new-column</code> or they could be maps containing at least keys <code>#:tech.v3.dataset{:name :data}</code> but also potentially <code>#:tech.v3.dataset{:metadata :missing}</code> in order to create a column with a specific set of missing values and metadata. <code>:force-datatype true</code> will prevent the system from attempting to scan the data for missing values and e.g. create a float column from a vector of Float objects. The above also applies to using <code>clojure.core/assoc</code> with a dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-map">row-map</a> ⭐ - maps a function from map-&gt;map in parallel over the dataset. The returned maps will be used to create or update columns in the output dataset, merging with the original. Note there are options to return a sequence of datasets as opposed to a single large final dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-mapcat">row-mapcat</a> - maps a function from map-&gt;sequence-of-maps over the dataset in parallel, potentially expanding or shrinking the result (in terms of row count). When expanding, row information not included in the original map is efficiently duplicated. Note there are options to return a sequence of datasets as opposed to a single potentially very large final dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.join.html#var-pd-merge">pd-merge</a> - implements generalized left, right, inner, outer, and cross joins. Allows combining datasets in a way familiar to users of traditional databases.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-replace-missing">replace-missing</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-replace-missing-value">replace-missing-value</a> - replace missing values in one or more columns.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-group-by-column">group-by-column</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-group-by">group-by</a> - creates a map of value to dataset with that value. These datasets are created via indexing into the original dataset for efficiency, so no data is copied.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-sort-by-column">sort-by-column</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-sort-by">sort-by</a> - sorts the dataset by column values.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-filter-column">filter-column</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-filter">filter</a> - produces a new dataset with only rows that pass a predicate.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-concat-copying">concat-copying</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-concat-inplace">concat-inplace</a> - produces a new dataset as a concatenation of supplied datasets. Copying can be more efficient than in-place, but uses more memory - <code>(apply ds/concat-copying x-seq)</code> is <strong>far</strong> more efficient than <code>(reduce ds/concat-copying x-seq)</code>; this also is true for <code>concat-inplace</code>.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-unique-by-column">unique-by-column</a>, <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-unique-by">unique-by</a> - removes duplicate rows. Passing in <code>keep-fn</code> allows you to choose either first, last, or some other criteria for rows that have the same values. For <code>unique-by</code>, <code>identity</code> will work just fine (rows have sane equality semantics).</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-pmap-ds">pmap-ds</a> - maps a function of ds-&gt;ds in parallel over batches of data in the dataset. Can return either a new dataset via concat-copying or a sequence of datasets.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.join.html#var-left-join-asof">left-join-asof</a> - specialized join-nearest functionality useful for doing things like finding the nearest values in time in irregularly sampled data.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.rolling.html#var-rolling">rolling</a> - fixed and variable rolling window operations.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html#var-group-by-column-agg">group-by-column-agg</a> - unusually high performance primitive that logically combines <code>group-by</code> and <code>reduce</code> operations. Each key in the supplied map of reducers becomes a column in the output dataset.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.neanderthal.html">neanderthal support</a> - transformations of datasets to/from neanderthal dense native matrixes.</li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.tensor.html">tensor support</a> - transformations of datasets to/from <a href="https://cnuernber.github.io/dtype-next/tech.v3.tensor.html">tech.v3.tensor</a> objects.</li>
</ul>
<p>Many of the functions above come in <code>-&gt;column</code> variants, which can be faster by avoiding the creation of fully-realized output datasets with superfluous data. Moreover, some of these functions come in <code>-&gt;indexes</code> variants, which simply return indexes and thus skip creating sub-datasets. Operating in index space as such can be <em>very</em> efficient.</p>
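<p>To make a few of these concrete, here is a hedged sketch on invented <code>trades</code> and <code>names</code> datasets. The aliases <code>ds-join</code> and <code>ds-reduce</code> are this example's own choices; the <code>:on</code> and <code>:how</code> options to <code>pd-merge</code> follow the pandas-style naming and should be checked against the join namespace docs.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join]
         '[tech.v3.dataset.reductions :as ds-reduce])

(def trades (ds/-&gt;dataset [{:sym "a" :qty 2 :price 10.0}
                           {:sym "b" :qty 1 :price 20.0}
                           {:sym "a" :qty 5 :price 11.0}]))

;; row-map: keys of the returned map become new (or updated) columns.
(def with-total
  (ds/row-map trades (fn [{:keys [qty price]}] {:total (* qty price)})))

(ds/filter-column with-total :total (fn [total] (&gt; total 20.0)))
(ds/sort-by-column with-total :total)
(def by-sym (ds/group-by-column with-total :sym))  ; map of :sym value to dataset
(ds/unique-by-column with-total :sym)

;; Concatenate many datasets with apply rather than reduce.
(def tripled (apply ds/concat-copying [trades trades trades]))

;; Group and reduce in one pass; each reducer key becomes an output column.
(ds-reduce/group-by-column-agg
 :sym
 {:sym     (ds-reduce/first-value :sym)
  :qty-sum (ds-reduce/sum :qty)}
 [with-total])

;; Database-style join on a shared column.
(def names (ds/-&gt;dataset [{:sym "a" :name "Alpha"} {:sym "b" :name "Beta"}]))
(ds-join/pd-merge trades names {:on :sym :how :inner})
</code></pre>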
<hr />
<div id="ElementwiseArithmetic"></div>
<h2>Elementwise Arithmetic</h2>
<p>Functions in the <code>tech.v3.datatype.functional</code> namespace operate elementwise on a column, lazily returning a new column. It is highly recommended to remove all missing values before using element-wise arithmetic as the <code>functional</code> namespace has no knowledge of missing values. Integer columns with missing values will be upcast to float or double columns in order to support a missing value indicator.</p>
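<p>Note the use of <code>dfn</code> from <code>(require '[tech.v3.datatype.functional :as dfn])</code>:</p>
<pre><code class="language-clojure">;; Aliases assumed: ds/ = tech.v3.dataset, dtype/ = tech.v3.datatype,
;; dfn/ = tech.v3.datatype.functional. The local `ds` is the dataset being
;; updated; `filing`, `investor`, and `form-type` are assumed bound elsewhere.
(assoc ds
       ;; Cast numeric columns to 64-bit integers.
       :value (dtype/elemwise-cast (ds :value) :int64)
       :shrs-or-prn-amt (dtype/elemwise-cast (ds :shrs-or-prn-amt) :int64)
       ;; Repeat a constant value across every row.
       :cik (dtype/const-reader (:cik filing) (ds/row-count ds))
       :investor (dtype/const-reader investor (ds/row-count ds))
       :form-type (dtype/const-reader form-type (ds/row-count ds))
       :edgar-id (dtype/const-reader (:edgar-id filing) (ds/row-count ds))
       ;; Each value as a fraction of the column total.
       :weight (dfn// (ds :value)
                      (double (dfn/reduce-+ (ds :value)))))
</code></pre>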
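<p>For a version that can be pasted into a REPL as-is, the sketch below applies the same weighting idea to a tiny invented dataset named <code>holdings</code>. It has no missing values, in line with the recommendation above.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.functional :as dfn])

(def holdings (ds/-&gt;dataset {:value [10 30 60]}))

(def weighted
  (assoc holdings
         ;; Column-by-scalar arithmetic applies to every element.
         :doubled (dfn/* (holdings :value) 2)
         ;; Divide each value by the column sum to get a weight.
         :weight  (dfn// (holdings :value)
                         (double (dfn/reduce-+ (holdings :value))))))

(vec (weighted :weight))
</code></pre>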
<hr />
<div id="ForcingLazyEvaluation"></div>
<h2>Forcing Lazy Evaluation</h2>
<p>In pandas or with R's <code>data.table</code>s one frequently needs to consider making a copy of some data before operating on it. Making too many copies uses too much memory; making too few leads to confusing non-local overwrites of data. There is untold lossage in these nonfunctional notions of dataset processing. Unlike these, TMD's datasets are functional.</p>
<p>TMD's functional datasets rely on index indirection, laziness, and structural sharing to simplify the mental model necessary to reason about their operation. This allows low-cost aggregation of operations, and eliminates most wondering about whether making a copy is necessary or not (it's generally not). However, these indirections sometimes increase read costs.</p>
<p>At any time, <code>clone</code> can be used to make a clean copy of the dataset that relies on no indirect computation, and stores the data separately, so there is no chance of accidental overwrites. Clone is multithreaded and very efficient, boiling down to parallelized iteration over the data and <code>System/arraycopy</code> calls. Moreover, calling <code>clone</code> can reduce the in-memory size of the dataset by a bit - sometimes 20%, by converting <code>List</code>s that have some overhead into arrays that have no extra capacity.</p>
<ul>
<li><a href="https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/src/tech/v3/datatype.clj#L95">tech.v3.datatype/clone</a> - clones the dataset realizing lazy operations and copying the data into java arrays. Operates on datasets and columns.</li>
</ul>
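<p>As an illustration (the dataset <code>big</code> is invented), a filtered and sorted dataset reads through indexes into the original data until <code>clone</code> copies it into fresh arrays:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype :as dtype])

(def big (ds/-&gt;dataset {:x (vec (range 1000))}))

;; Filtering and sorting select rows by index rather than copying them.
(def view (-&gt; big
              (ds/filter-column :x even?)
              (ds/sort-by-column :x)))

;; clone realizes pending work and stores the result separately.
(def realized (dtype/clone view))

(ds/row-count realized)
</code></pre>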
<hr />
<h2>Additional Selling Points</h2>
<p>Sophisticated support for <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.libs.arrow.html">Apache Arrow</a>, including mmap support for JDK-8 through JDK-17, although on an M1 Mac you will need to use JDK-17. Also, with arrow, per-column compression (LZ4, ZSTD) is available across all supported platforms. At the time of writing, the official Arrow SDK does not support mmap, or JDK-17, and has no user-accessible way to save a compressed streaming format file.</p>
<p>Support is provided for operating on <em>sequences</em> of datasets, enabling work on larger, potentially out-of-memory workloads. This is consistent with the design of the parquet and arrow data storage systems. Aggregation operations for sequences of datasets are efficiently implemented in the
<a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html">tech.v3.dataset.reductions</a> namespace.</p>
<p>Preliminary support for algorithms from the <a href="https://datasketches.apache.org/">Apache Data Sketches</a> system can be found in the <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.apache-data-sketch.html">apache-data-sketch</a> namespace. Summations/means in this area are implemented using the
<a href="https://en.wikipedia.org/wiki/Kahan_summation_algorithm">Kahan compensated summation</a> algorithm.</p>
<h3>Efficient Rowwise Operations</h3>
<p>TMD uses efficient parallelized mechanisms to operate on data for rowwise map and mapcat operations. Argument functions are passed maps that lazily read only the required data from the underlying dataset (huge savings over reading all the data). TMD scans the returned maps from the argument function for datatype and missing information. Columns derived from the mapping operation overwrite columns in the original dataset - the powerful <code>row-map</code> function works this way.</p>
<p>The mapping operations are run in parallel using a primitive named <code>pmap-ds</code> and the resulting datasets can either be returned in a sequence or combined into a single larger dataset.</p>
<ul>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-map">row-map</a></li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-mapcat">row-mapcat</a></li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rows">rows</a></li>
<li><a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rowvecs">rowvecs</a></li>
</ul>
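<p>A minimal sketch of the expanding case, using an invented <code>orders</code> dataset: the function returns one map per output row, and the original row's other columns are carried along for each of them.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

(def orders (ds/-&gt;dataset [{:id 1 :n 2} {:id 2 :n 3}]))

;; Emit :n output rows for each input row, tagging each with its position.
(def expanded
  (ds/row-mapcat orders
                 (fn [row]
                   (for [i (range (:n row))]
                     {:item-idx i}))))

(ds/row-count expanded)
</code></pre>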
<hr />
<h2>📚 Additional Documentation 📚</h2>
<p>The best place to start is the "Getting Started" topic in the documentation: <a href="https://techascent.github.io/tech.ml.dataset/000-getting-started.html">https://techascent.github.io/tech.ml.dataset/000-getting-started.html</a></p>
<p>The "Walkthrough" topic provides long-form examples of processing real data: <a href="https://techascent.github.io/tech.ml.dataset/100-walkthrough.html">https://techascent.github.io/tech.ml.dataset/100-walkthrough.html</a></p>
<p>The API docs document every available function: <a href="https://techascent.github.io/tech.ml.dataset/">https://techascent.github.io/tech.ml.dataset/</a></p>
<p>The provided Java API (<a href="https://techascent.github.io/tech.ml.dataset/javadoc/tech/v3/TMD.html">javadoc</a> / <a href="https://techascent.github.io/tech.ml.dataset/javadoc/index.html">with frames</a>) and sample program (<a href="java_test/java/jtest/TMDDemo.java">source</a>) show how to use TMD from Java.</p>
</div></div></div></body></html>