<!DOCTYPE html PUBLIC ""
"">
<html><head><meta charset="UTF-8" /><title>tech.v3.dataset.reductions documentation</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">8.003</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="000-getting-started.html"><div class="inner"><span>tech.ml.dataset Getting Started</span></div></a></li><li class="depth-1 "><a href="100-walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li><li class="depth-1 "><a href="200-quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span 
class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span class="top"></span><span 
class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4 current"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li class="depth-4 
branch"><a href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -641px;"><span class="top" style="height: 650px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.clj-transit.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clj-transit</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="sidebar secondary"><h3><a href="#top"><span class="inner">Public Vars</span></a></h3><ul><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-aggregate"><div class="inner"><span>aggregate</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-count-distinct"><div class="inner"><span>count-distinct</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-distinct"><div class="inner"><span>distinct</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-distinct-int32"><div class="inner"><span>distinct-int32</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-first-value"><div class="inner"><span>first-value</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-group-by-column-agg"><div class="inner"><span>group-by-column-agg</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-group-by-column-agg-rf"><div class="inner"><span>group-by-column-agg-rf</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-maximum"><div class="inner"><span>maximum</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-maximum-rf"><div class="inner"><span>maximum-rf</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-mean"><div class="inner"><span>mean</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-prob-cdf"><div class="inner"><span>prob-cdf</span></div></a></li><li class="depth-1"><a 
href="tech.v3.dataset.reductions.html#var-prob-interquartile-range"><div class="inner"><span>prob-interquartile-range</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-prob-median"><div class="inner"><span>prob-median</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-prob-quantile"><div class="inner"><span>prob-quantile</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-prob-set-cardinality"><div class="inner"><span>prob-set-cardinality</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-reducer"><div class="inner"><span>reducer</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-reducer-.3Ecolumn-reducer"><div class="inner"><span>reducer-&gt;column-reducer</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-reservoir-dataset"><div class="inner"><span>reservoir-dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-reservoir-desc-stat"><div class="inner"><span>reservoir-desc-stat</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-row-count"><div class="inner"><span>row-count</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.html#var-sum"><div class="inner"><span>sum</span></div></a></li></ul></div><div class="namespace-docs" id="content"><h1 class="anchor" id="top">tech.v3.dataset.reductions</h1><div class="doc"><div class="markdown"><p>Specific high-performance reductions intended to be performed over a sequence
of datasets. This allows aggregations to be done in situations where the dataset is
larger than what will fit in memory on a normal machine. For this reason, summation
is implemented using the Kahan algorithm, and various statistical methods use
statistical estimation techniques and are therefore prefixed with <code>prob-</code>, which is short
for <code>probabilistic</code>.</p>
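<p>To illustrate the compensated summation mentioned above, here is a minimal pure-Clojure sketch of Kahan summation. The name <code>kahan-sum</code> is hypothetical and not part of this namespace, and the library's own implementation may differ in detail.</p>
<pre><code class="language-clojure">(defn kahan-sum
  "Sum a sequence of numbers using Kahan compensated summation."
  [xs]
  (loop [xs  (seq xs)
         sum 0.0    ;; running total
         c   0.0]   ;; compensation for low-order bits lost so far
    (if xs
      (let [y (- (double (first xs)) c)
            t (+ sum y)]
        ;; (t - sum) - y recovers the rounding error of this addition
        (recur (next xs) t (- (- t sum) y)))
      sum)))

;; Compare a naive sum with the compensated sum over many small doubles.
(def small-values (repeat 1000000 0.1))
(println (reduce + 0.0 small-values))
(println (kahan-sum small-values))
</code></pre>
<p>The compensation term matters most when many values are accumulated across a long sequence of datasets, which is the situation this namespace targets.</p>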
<ul>
<li><code>aggregate</code> - Perform a multi-dataset aggregation. Returns a dataset with a single row.</li>
<li><code>group-by-column-agg</code> - Perform a multi-dataset group-by followed by
an aggregation. Returns a dataset with one row per key.</li>
</ul>
<p>Examples:</p>
<pre><code class="language-clojure">user&gt; (require '[tech.v3.dataset :as ds])
nil
user&gt; (require '[tech.v3.datatype.datetime :as dtype-dt])
nil
user&gt; (def stocks (-&gt; (ds/-&gt;dataset "test/data/stocks.csv" {:key-fn keyword})
                      (ds/update-column :date #(dtype-dt/datetime-&gt;epoch :epoch-days %))))
#'user/stocks
user&gt; (require '[tech.v3.dataset.reductions :as ds-reduce])
nil
user&gt; (ds-reduce/group-by-column-agg
       :symbol
       {:symbol (ds-reduce/first-value :symbol)
        :price-avg (ds-reduce/mean :price)
        :price-sum (ds-reduce/sum :price)
        :price-med (ds-reduce/prob-median :price)}
       (repeat 3 stocks))
:symbol-aggregation [5 4]:
| :symbol |   :price-avg | :price-sum |   :price-med |
|---------|--------------|------------|--------------|
| IBM     |  91.26121951 |   33675.39 |  88.70468750 |
| AAPL    |  64.73048780 |   23885.55 |  37.05281250 |
| MSFT    |  24.73674797 |    9127.86 |  24.07277778 |
| AMZN    |  47.98707317 |   17707.23 |  41.35142361 |
| GOOG    | 415.87044118 |   84837.57 | 422.69722222 |
</code></pre>
<ul>
<li><a href="https://github.com/zero-one-group/geni-performance-benchmark/blob/da4d02e54de25a72214f72c4864ebd3d307520f8/dataset/src/dataset/optimised_by_chris.clj">zero-one benchmark winner</a></li>
</ul>
</div></div><div class="public anchor" id="var-aggregate"><h3>aggregate</h3><div class="usage"><code>(aggregate agg-map options ds-seq)</code><code>(aggregate agg-map ds-seq)</code></div><div class="doc"><div class="markdown"><p>Create a set of aggregate statistics over a sequence of datasets. Returns a
dataset with a single row and uses the same interface as <code>group-by-column-agg</code>.</p>
<p>Example:</p>
<pre><code class="language-clojure">(ds-reduce/aggregate
 {:n-elems (ds-reduce/row-count)
  :price-avg (ds-reduce/mean :price)
  :price-sum (ds-reduce/sum :price)
  :price-med (ds-reduce/prob-median :price)
  :price-iqr (ds-reduce/prob-interquartile-range :price)
  :n-dates (ds-reduce/count-distinct :date :int32)}
 [ds-seq])
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L593">view source</a></div></div><div class="public anchor" id="var-count-distinct"><h3>count-distinct</h3><div class="usage"><code>(count-distinct colname op-space)</code><code>(count-distinct colname)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L202">view source</a></div></div><div class="public anchor" id="var-distinct"><h3>distinct</h3><div class="usage"><code>(distinct colname finalizer)</code><code>(distinct colname)</code></div><div class="doc"><div class="markdown"><p>Create a reducer that will return a set of values.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L190">view source</a></div></div><div class="public anchor" id="var-distinct-int32"><h3>distinct-int32</h3><div class="usage"><code>(distinct-int32 colname finalizer)</code><code>(distinct-int32 colname)</code></div><div class="doc"><div class="markdown"><p>Get the set of distinct items given you know the space is no larger than int32
space. The optional finalizer allows you to post-process the data.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L164">view source</a></div></div><div class="public anchor" id="var-first-value"><h3>first-value</h3><div class="usage"><code>(first-value colname)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L91">view source</a></div></div><div class="public anchor" id="var-group-by-column-agg"><h3>group-by-column-agg</h3><div class="usage"><code>(group-by-column-agg colname agg-map options ds-seq)</code><code>(group-by-column-agg colname agg-map ds-seq)</code></div><div class="doc"><div class="markdown"><p>Group a sequence of datasets by a column and aggregate down into a new dataset.</p>
<ul>
<li>
<p>colname - Either a single scalar column name or a vector of column names to group by.</p>
</li>
<li>
<p>agg-map - Map of result column name to reducer. All values in the agg map must be
functions from dataset to hamf (non-parallel) reducers. Note that transducer-compatible
rfs, such as kixi.mean, are valid hamf reducers.</p>
</li>
<li>
<p>ds-seq - Either a single dataset or sequence of datasets.</p>
</li>
</ul>
<p>See also <a href="tech.v3.dataset.reductions.html#var-group-by-column-agg-rf">group-by-column-agg-rf</a>.</p>
<p>Options:</p>
<ul>
<li><code>:map-initial-capacity</code> - initial hashmap capacity. Resizing hash-maps is expensive so we
would like to set this to something reasonable. Defaults to 10000.</li>
<li><code>:index-filter</code> - A function that given a dataset produces a function from long index
to boolean, ideally either nil or a java.util.function.LongPredicate. Only indexes for
which the index-filter returns true will be added to the aggregation. For very large
datasets, this is a bit faster than using filter before the aggregation.</li>
</ul>
<p>Example:</p>
<pre><code class="language-clojure">
user&gt; (require '[tech.v3.dataset :as ds])
nil
user&gt; (require '[tech.v3.dataset.reductions :as ds-reduce])
nil
user&gt; (def ds (ds/-&gt;dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv"
                            {:key-fn keyword}))
#'user/ds
user&gt; (ds-reduce/group-by-column-agg
       :symbol
       {:price-avg (ds-reduce/mean :price)
        :price-sum (ds-reduce/sum :price)}
       ds)
_unnamed [5 3]:
| :symbol |   :price-avg | :price-sum |
|---------|-------------:|-----------:|
| MSFT    |  24.73674797 |    3042.62 |
| AAPL    |  64.73048780 |    7961.85 |
| IBM     |  91.26121951 |   11225.13 |
| AMZN    |  47.98707317 |    5902.41 |
| GOOG    | 415.87044118 |   28279.19 |
user&gt; (def testds (ds/-&gt;dataset {:a ["a" "a" "a" "b" "b" "b" "c" "d" "e"]
                                 :b [22 21 22 44 42 44 77 88 99]}))
#'user/testds
user&gt; (ds-reduce/group-by-column-agg
       [:a :b] {:c (ds-reduce/row-count)}
       testds)
_unnamed [7 3]:
| :a | :b | :c |
|----|---:|---:|
| e  | 99 |  1 |
| a  | 21 |  1 |
| c  | 77 |  1 |
| d  | 88 |  1 |
| b  | 44 |  2 |
| b  | 42 |  1 |
| a  | 22 |  2 |
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L517">view source</a></div></div><div class="public anchor" id="var-group-by-column-agg-rf"><h3>group-by-column-agg-rf</h3><div class="usage"><code>(group-by-column-agg-rf colname agg-map)</code><code>(group-by-column-agg-rf colname agg-map options)</code></div><div class="doc"><div class="markdown"><p>Produce a transduce-compatible rf that will perform the group-by-column-agg pathway.
See documentation for <a href="tech.v3.dataset.reductions.html#var-group-by-column-agg">group-by-column-agg</a>.</p>
<pre><code class="language-clojure">tech.v3.dataset.reductions-test&gt; (def stocks (ds/-&gt;dataset "test/data/stocks.csv" {:key-fn keyword}))
#'tech.v3.dataset.reductions-test/stocks
tech.v3.dataset.reductions-test&gt; (transduce (map identity)
                                            (ds-reduce/group-by-column-agg-rf
                                             :symbol
                                             {:n-elems (ds-reduce/row-count)
                                              :price-avg (ds-reduce/mean :price)
                                              :price-sum (ds-reduce/sum :price)
                                              :symbol (ds-reduce/first-value :symbol)
                                              :n-dates (ds-reduce/count-distinct :date :int32)}
                                             {:index-filter (fn [dataset]
                                                              (let [rdr (dtype/-&gt;reader (dataset :price))]
                                                                (hamf/long-predicate
                                                                 idx (&gt; (.readDouble rdr idx) 100.0))))})
                                            [stocks stocks stocks])
_unnamed [4 5]:
| :symbol | :n-elems |   :price-avg | :price-sum | :n-dates |
|---------|---------:|-------------:|-----------:|---------:|
| AAPL    |       93 | 160.19096774 |   14897.76 |       31 |
| IBM     |      120 | 111.03775000 |   13324.53 |       40 |
| AMZN    |       18 | 126.97833333 |    2285.61 |        6 |
| GOOG    |      204 | 415.87044118 |   84837.57 |       68 |
</code></pre>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L377">view source</a></div></div><div class="public anchor" id="var-maximum"><h3>maximum</h3><div class="usage"><code>(maximum colname)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L147">view source</a></div></div><div class="public anchor" id="var-maximum-rf"><h3>maximum-rf</h3><div class="usage"><code>(maximum-rf)</code><code>(maximum-rf eax v)</code><code>(maximum-rf eax)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L137">view source</a></div></div><div class="public anchor" id="var-mean"><h3>mean</h3><div class="usage"><code>(mean colname)</code></div><div class="doc"><div class="markdown"><p>Create a double consumer which will produce a mean of the column.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L116">view source</a></div></div><div class="public anchor" id="var-prob-cdf"><h3>prob-cdf</h3><div class="usage"><code>(prob-cdf colname cdf)</code><code>(prob-cdf colname cdf k)</code></div><div class="doc"><div class="markdown"><p>See docs for <a href="tech.v3.dataset.reductions.apache-data-sketch.html#var-prob-cdf">tech.v3.dataset.reductions.apache-data-sketch/prob-cdf</a></p>
<ul>
<li>k - defaults to 128. This produces a normalized rank error of about 1.7%</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L226">view source</a></div></div><div class="public anchor" id="var-prob-interquartile-range"><h3>prob-interquartile-range</h3><div class="usage"><code>(prob-interquartile-range colname k)</code><code>(prob-interquartile-range colname)</code></div><div class="doc"><div class="markdown"><p>See docs for <code>tech.v3.dataset.reductions.apache-data-sketch/prob-interquartile-range</code>.</p>
<ul>
<li>k - defaults to 128. This produces a normalized rank error of about 1.7%</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L247">view source</a></div></div><div class="public anchor" id="var-prob-median"><h3>prob-median</h3><div class="usage"><code>(prob-median colname)</code><code>(prob-median colname k)</code></div><div class="doc"><div class="markdown"><p>See docs for <a href="tech.v3.dataset.reductions.apache-data-sketch.html#var-prob-median">tech.v3.dataset.reductions.apache-data-sketch/prob-median</a></p>
<ul>
<li>k - defaults to 128. This produces a normalized rank error of about 1.7%</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L240">view source</a></div></div><div class="public anchor" id="var-prob-quantile"><h3>prob-quantile</h3><div class="usage"><code>(prob-quantile colname quantile)</code><code>(prob-quantile colname quantile k)</code></div><div class="doc"><div class="markdown"><p>See docs for <a href="tech.v3.dataset.reductions.apache-data-sketch.html#var-prob-quantile">tech.v3.dataset.reductions.apache-data-sketch/prob-quantile</a></p>
<ul>
<li>k - defaults to 128. This produces a normalized rank error of about 1.7%</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L233">view source</a></div></div><div class="public anchor" id="var-prob-set-cardinality"><h3>prob-set-cardinality</h3><div class="usage"><code>(prob-set-cardinality colname options)</code><code>(prob-set-cardinality colname)</code></div><div class="doc"><div class="markdown"><p>See docs for <a href="tech.v3.dataset.reductions.apache-data-sketch.html#var-prob-set-cardinality">tech.v3.dataset.reductions.apache-data-sketch/prob-set-cardinality</a>.</p>
<p>Options:</p>
<ul>
<li><code>:hll-lgk</code> - defaults to 12, this is log-base2 of k, so k = 4096. lgK can be
from 4 to 21.</li>
<li><code>:hll-type</code> - One of #{4,6,8}, defaults to 8. The HLL_4, HLL_6 and HLL_8
represent different levels of compression of the final HLL array where the
4, 6 and 8 refer to the number of bits each bucket of the HLL array is
compressed down to. The HLL_4 is the most compressed but generally slightly
slower than the other two, especially during union operations.</li>
<li><code>:datatype</code> - One of :float64, :int64, :string</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L210">view source</a></div></div><div class="public anchor" id="var-reducer"><h3>reducer</h3><div class="usage"><code>(reducer column-name init-val-fn rfn merge-fn finalize-fn)</code><code>(reducer column-name rfn)</code></div><div class="doc"><div class="markdown"><p>Make a group-by-agg reducer.</p>
<ul>
<li><code>column-name</code> - Single column name or multiple columns.</li>
<li><code>init-val-fn</code> - Function to produce initial accumulators</li>
<li><code>rfn</code> - Function that takes the accumulator and each column's data as
further arguments. For a single-column pathway this looks like a normal Clojure
reduction function, but for two columns it gets extra arguments.</li>
<li><code>merge-fn</code> - Function that takes two accumulators and merges them. Merge is not required
for <a href="tech.v3.dataset.reductions.html#var-group-by-column-agg">group-by-column-agg</a> but it <em>is</em> required for <a href="tech.v3.dataset.reductions.html#var-aggregate">aggregate</a>.</li>
<li><code>finalize-fn</code> - Finalize the result after aggregation. Optional; will be replaced
with identity if not provided.</li>
</ul>
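<p>As a hedged sketch of how the five arguments described above might fit together, the following builds a custom reducer that tracks the largest value in a column. The dataset <code>testds</code>, its column names, and the name <code>max-b</code> are illustrative assumptions, and the sketch assumes <code>init-val-fn</code> is called with no arguments. A <code>merge-fn</code> is supplied because the reducer is used with <code>aggregate</code>.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce])

;; Small illustrative dataset (hypothetical data).
(def testds (ds/-&gt;dataset {:a ["a" "a" "b"]
                           :b [1.0 2.0 5.0]}))

;; Hypothetical custom reducer over the single column :b.
(def max-b
  (ds-reduce/reducer
   :b
   (fn [] Double/NEGATIVE_INFINITY)            ;; init-val-fn: initial accumulator
   (fn [acc v] (max (double acc) (double v)))  ;; rfn: accumulator plus one column value
   (fn [l r] (max (double l) (double r)))      ;; merge-fn: combine two accumulators
   identity))                                  ;; finalize-fn: no post-processing

;; Returns a dataset with one row and a :max-b column.
(ds-reduce/aggregate {:max-b max-b} [testds])
</code></pre>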
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L306">view source</a></div></div><div class="public anchor" id="var-reducer-.3Ecolumn-reducer"><h3>reducer-&gt;column-reducer</h3><div class="usage"><code>(reducer-&gt;column-reducer reducer cname)</code><code>(reducer-&gt;column-reducer reducer op-space cname)</code></div><div class="doc"><div class="markdown"><p>Given a hamf parallel reducer and a column name, return a dataset reducer of one column.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L83">view source</a></div></div><div class="public anchor" id="var-reservoir-dataset"><h3>reservoir-dataset</h3><div class="usage"><code>(reservoir-dataset reservoir-size)</code><code>(reservoir-dataset reservoir-size options)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L283">view source</a></div></div><div class="public anchor" id="var-reservoir-desc-stat"><h3>reservoir-desc-stat</h3><div class="usage"><code>(reservoir-desc-stat colname reservoir-size stat-name options)</code><code>(reservoir-desc-stat colname reservoir-size stat-name)</code></div><div class="doc"><div class="markdown"><p>Calculate a descriptive statistic using reservoir sampling. A list of statistic
names is found in <code>tech.v3.datatype.statistics/all-descriptive-stats-names</code>.
Options are those accepted by
<a href="https://cnuernber.github.io/dtype-next/tech.v3.datatype.sampling.html#var-double-reservoir">reservoir-sampler</a>.</p>
<p>Note that this method will <em>not</em> convert datetime objects to milliseconds for you as
in descriptive-stats.</p>
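<p>For readers unfamiliar with the technique, here is a minimal pure-Clojure sketch of single-pass reservoir sampling (the classic "Algorithm R"). The function name <code>reservoir-sample</code> is hypothetical. This only illustrates the general idea and is not the sampler that the library uses.</p>
<pre><code class="language-clojure">(defn reservoir-sample
  "Keep a uniform random sample of at most n items from xs in one pass."
  [n xs]
  (let [^java.util.Random rng (java.util.Random.)]
    (first
     (reduce (fn [[sample seen] x]
               (let [seen (inc seen)]
                 (if (&lt;= seen n)
                   ;; Fill the reservoir with the first n items.
                   [(conj sample x) seen]
                   ;; Later items replace a random slot with probability n / seen.
                   (let [j (.nextInt rng (int seen))]
                     [(if (&lt; j n) (assoc sample j x) sample) seen]))))
             [[] 0]
             xs))))

;; Sample 5 values from a stream of 1000 without holding the stream in memory.
(reservoir-sample 5 (range 1000))
</code></pre>
<p>A statistic computed over such a fixed-size sample is an estimate of the statistic over the full stream, which is why larger reservoir sizes generally trade memory for accuracy.</p>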
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L254">view source</a></div></div><div class="public anchor" id="var-row-count"><h3>row-count</h3><div class="usage"><code>(row-count)</code></div><div class="doc"><div class="markdown"><p>Create a simple reducer that returns the number of times reduceIndex was called.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L127">view source</a></div></div><div class="public anchor" id="var-sum"><h3>sum</h3><div class="usage"><code>(sum colname)</code></div><div class="doc"><div class="markdown"><p>Create a double consumer which will sum the values.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions.clj#L107">view source</a></div></div></div></body></html>