<!DOCTYPE html PUBLIC ""
    "">
<html><head><meta charset="UTF-8" /><title>tech.v3.dataset.reductions.apache-data-sketch documentation</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">8.003</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="000-getting-started.html"><div class="inner"><span>tech.ml.dataset Getting Started</span></div></a></li><li class="depth-1 "><a href="100-walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li><li class="depth-1 "><a href="200-quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span 
class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span class="top"></span><span 
class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5 current"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li class="depth-4 
branch"><a href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -641px;"><span class="top" style="height: 650px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.clj-transit.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clj-transit</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="sidebar secondary"><h3><a href="#top"><span class="inner">Public Vars</span></a></h3><ul><li class="depth-1"><a href="tech.v3.dataset.reductions.apache-data-sketch.html#var-doubles-sketch-reducer"><div class="inner"><span>doubles-sketch-reducer</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.apache-data-sketch.html#var-hll-reducer"><div class="inner"><span>hll-reducer</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.apache-data-sketch.html#var-prob-cdf"><div class="inner"><span>prob-cdf</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.apache-data-sketch.html#var-prob-cdfs"><div class="inner"><span>prob-cdfs</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.apache-data-sketch.html#var-prob-interquartile-range"><div class="inner"><span>prob-interquartile-range</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.apache-data-sketch.html#var-prob-median"><div class="inner"><span>prob-median</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.apache-data-sketch.html#var-prob-pmfs"><div class="inner"><span>prob-pmfs</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.apache-data-sketch.html#var-prob-quantile"><div class="inner"><span>prob-quantile</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.reductions.apache-data-sketch.html#var-prob-quantiles"><div class="inner"><span>prob-quantiles</span></div></a></li><li class="depth-1"><a 
href="tech.v3.dataset.reductions.apache-data-sketch.html#var-prob-set-cardinality"><div class="inner"><span>prob-set-cardinality</span></div></a></li></ul></div><div class="namespace-docs" id="content"><h1 class="anchor" id="top">tech.v3.dataset.reductions.apache-data-sketch</h1><div class="doc"><div class="markdown"><p>Reducers based on the Apache DataSketches family of algorithms.</p>
<ul>
<li><a href="https://datasketches.apache.org/">apache data sketches</a></li>
</ul>
<p>Algorithms included here are:</p>
<h3>Set Cardinality</h3>
<ul>
<li><a href="https://datasketches.apache.org/docs/HLL/HLL.html">hyper-log-log</a></li>
<li><a href="https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html">theta</a></li>
<li><a href="https://datasketches.apache.org/docs/CPC/CPC.html">cpc</a></li>
</ul>
<h3>Quantiles</h3>
<ul>
<li><a href="https://datasketches.apache.org/api/java/snapshot/apidocs/index.html">doubles</a></li>
</ul>
<p>Example:</p>
<pre><code class="language-clojure">user> (require '[tech.v3.dataset :as ds])
nil
user> (require '[tech.v3.dataset.reductions :as ds-reduce])
nil
user> (require '[tech.v3.dataset.reductions.apache-data-sketch :as ds-sketch])
nil
user> (def stocks (ds/->dataset "test/data/stocks.csv" {:key-fn keyword}))
#'user/stocks
user> (ds-reduce/group-by-column-agg
       :symbol
       {:symbol (ds-reduce/first-value :symbol)
        :price-quantiles (ds-sketch/prob-quantiles :price [0.25 0.5 0.75])
        :price-cdfs (ds-sketch/prob-cdfs :price [25 50 75])}
       [stocks stocks stocks])
:symbol-aggregation [5 3]:

| :symbol |      :price-quantiles |              :price-cdfs |
|---------|-----------------------|--------------------------|
|    AAPL | [11.03, 36.81, 105.1] | [0.4065, 0.5528, 0.6423] |
|     IBM | [77.26, 88.70, 102.4] |   [0.000, 0.000, 0.1382] |
|    AMZN | [30.12, 41.50, 67.00] | [0.2249, 0.6396, 0.8103] |
|    MSFT | [21.75, 24.11, 27.34] |   [0.5772, 1.000, 1.000] |
|    GOOG | [338.5, 421.6, 510.0] |    [0.000, 0.000, 0.000] |
</code></pre>
</div></div><div class="public anchor" id="var-doubles-sketch-reducer"><h3>doubles-sketch-reducer</h3><div class="usage"><code>(doubles-sketch-reducer k finalize-fn)</code></div><div class="doc"><div class="markdown"><p>Return a doubles updater. The sketch is the reservoir and k is the reservoir size. From
the reservoir we can then derive various statistical quantities.</p>
<p>A k of 128 results in a normalized rank error of about 1.7% in the returned quantities.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions/apache_data_sketch.clj#L157">view source</a></div></div><div class="public anchor" id="var-hll-reducer"><h3>hll-reducer</h3><div class="usage"><code>(hll-reducer {:keys [hll-lgk hll-type datatype], :or {hll-lgk 12, hll-type 8, datatype :float64}})</code></div><div class="doc"><div class="markdown"><p>Return a hamf parallel reducer that produces a hyper-log-log-based set cardinality.</p>
<p>At any point you can get an estimate from the reduced value - call sketch-estimate.</p>
<p>Options:</p>
<ul>
<li><code>:hll-lgk</code> - defaults to 12, this is log-base2 of k, so k = 4096. lgK can be
from 4 to 21.</li>
<li><code>:hll-type</code> - One of #{4,6,8}, defaults to 8. The HLL_4, HLL_6 and HLL_8
represent different levels of compression of the final HLL array where the
4, 6 and 8 refer to the number of bits each bucket of the HLL array is
compressed down to. The HLL_4 is the most compressed but generally slightly
slower than the other two, especially during union operations.</li>
<li><code>:datatype</code> - One of :float64, :int64, :string; defaults to :float64.</li>
</ul>
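<p>The options above are described as being shared with <code>prob-set-cardinality</code>, so a minimal sketch of passing them through an aggregation might look like the following. The <code>visits</code> dataset and its <code>:user</code> column are hypothetical, and the example assumes the Apache DataSketches Java library is on the classpath and that the aggregation result is a one-row dataset whose value is the estimate.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce]
         '[tech.v3.dataset.reductions.apache-data-sketch :as ds-sketch])

;; Hypothetical in-memory data: six visits by three distinct users.
(def visits (ds/->dataset {:user ["ann" "bob" "ann" "cat" "bob" "ann"]}))

;; Estimate the number of distinct :user values over a sequence of datasets.
;; Passing the same dataset twice adds no new distinct values.
(def cardinality-result
  (ds-reduce/aggregate
   {:n-users (ds-sketch/prob-set-cardinality :user {:hll-lgk 12
                                                    :hll-type 8
                                                    :datatype :string})}
   [visits visits]))

;; Assumption: the result has a single row holding the estimate.
(def n-users (first (cardinality-result :n-users)))
</code></pre>
<p>The value is an estimate rather than an exact count, so compare it with a tolerance instead of testing for equality.</p>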
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions/apache_data_sketch.clj#L90">view source</a></div></div><div class="public anchor" id="var-prob-cdf"><h3>prob-cdf</h3><div class="usage"><code>(prob-cdf colname cdf k)</code><code>(prob-cdf colname cdf)</code></div><div class="doc"><div class="markdown"><p>Probabilistic cdf of the single value passed in. See <a href="https://datasketches.apache.org/api/java/snapshot/apidocs/index.html">DoublesSketch</a>.
See prob-quantiles for k.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions/apache_data_sketch.clj#L259">view source</a></div></div><div class="public anchor" id="var-prob-cdfs"><h3>prob-cdfs</h3><div class="usage"><code>(prob-cdfs colname cdfs k)</code><code>(prob-cdfs colname cdfs)</code></div><div class="doc"><div class="markdown"><p>Probabilistic cdfs, one for each value passed in. See <a href="https://datasketches.apache.org/api/java/snapshot/apidocs/index.html">DoublesSketch</a>.
See prob-quantiles for k.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions/apache_data_sketch.clj#L245">view source</a></div></div><div class="public anchor" id="var-prob-interquartile-range"><h3>prob-interquartile-range</h3><div class="usage"><code>(prob-interquartile-range colname k)</code><code>(prob-interquartile-range colname)</code></div><div class="doc"><div class="markdown"><p>Probabilistic interquartile range - <a href="https://datasketches.apache.org/api/java/snapshot/apidocs/index.html">DoublesSketch</a>.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions/apache_data_sketch.clj#L234">view source</a></div></div><div class="public anchor" id="var-prob-median"><h3>prob-median</h3><div class="usage"><code>(prob-median colname k)</code><code>(prob-median colname)</code></div><div class="doc"><div class="markdown"><p>Probabilistic median - <a href="https://datasketches.apache.org/api/java/snapshot/apidocs/index.html">DoublesSketch</a>.</p>
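<p>As a rough illustration, the median and interquartile range reducers can be combined in one aggregation. This is a sketch under stated assumptions: the <code>samples</code> dataset is hypothetical, the Apache DataSketches Java library is assumed to be on the classpath, and each result column is assumed to hold a single number.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce]
         '[tech.v3.dataset.reductions.apache-data-sketch :as ds-sketch])

;; Hypothetical data: the values 1.0 through 100.0 in a single column.
(def samples (ds/->dataset {:x (mapv double (range 1 101))}))

;; Both reducers use the default k when only the column name is given.
(def spread-result
  (ds-reduce/aggregate
   {:median (ds-sketch/prob-median :x)
    :iqr    (ds-sketch/prob-interquartile-range :x)}
   [samples]))

;; Assumption: one row per aggregation, one number per column.
(def median-estimate (first (spread-result :median)))
(def iqr-estimate    (first (spread-result :iqr)))
</code></pre>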
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions/apache_data_sketch.clj#L226">view source</a></div></div><div class="public anchor" id="var-prob-pmfs"><h3>prob-pmfs</h3><div class="usage"><code>(prob-pmfs colname pmfs k)</code><code>(prob-pmfs colname pmfs)</code></div><div class="doc"><div class="markdown"><p>Returns an approximation to the Probability Mass Function (PMF) of the input stream
given a set of splitPoints (values). See <a href="https://datasketches.apache.org/api/java/snapshot/apidocs/index.html">DoublesSketch</a>.</p>
<p>See prob-quantiles for k.</p>
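<p>To make the role of the split points concrete: for m split points, the underlying DoublesSketch PMF has m + 1 entries, one per interval, and the entries sum to 1. The following is a minimal sketch assuming a hypothetical <code>samples</code> dataset, the Apache DataSketches Java library on the classpath, and a result value that can be converted with <code>vec</code>.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce]
         '[tech.v3.dataset.reductions.apache-data-sketch :as ds-sketch])

;; Hypothetical data: the values 1.0 through 100.0 in a single column.
(def samples (ds/->dataset {:x (mapv double (range 1 101))}))

;; Three split points divide the value range into four intervals.
(def pmf-result
  (ds-reduce/aggregate
   {:pmf (ds-sketch/prob-pmfs :x [25.0 50.0 75.0])}
   [samples]))

;; Assumption: the single result value is a sequence or array of doubles.
(def pmf (vec (first (pmf-result :pmf))))
</code></pre>
<p>Each entry approximates the fraction of the input that falls in the corresponding interval.</p>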
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions/apache_data_sketch.clj#L271">view source</a></div></div><div class="public anchor" id="var-prob-quantile"><h3>prob-quantile</h3><div class="usage"><code>(prob-quantile colname quantile k)</code><code>(prob-quantile colname quantile)</code></div><div class="doc"><div class="markdown"><p>Probabilistic quantile estimation - see <a href="https://datasketches.apache.org/api/java/snapshot/apidocs/index.html">DoublesSketch</a>.</p>
<ul>
<li>k - defaults to 128. This produces a normalized rank error of about 1.7%</li>
</ul>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions/apache_data_sketch.clj#L200">view source</a></div></div><div class="public anchor" id="var-prob-quantiles"><h3>prob-quantiles</h3><div class="usage"><code>(prob-quantiles colname quantiles k)</code><code>(prob-quantiles colname quantiles)</code></div><div class="doc"><div class="markdown"><p>Probabilistic quantile estimation - see <a href="https://datasketches.apache.org/api/java/snapshot/apidocs/index.html">DoublesSketch</a>.</p>
<ul>
<li>quantiles - sequence of quantiles.</li>
<li>k - defaults to 128. This produces a normalized rank error of about 1.7%</li>
</ul>
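<p>The optional k argument trades memory for accuracy. A minimal sketch of overriding it follows, assuming a hypothetical <code>samples</code> dataset and the Apache DataSketches Java library on the classpath. DoublesSketch requires k to be a power of 2, so 256 is used here.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce]
         '[tech.v3.dataset.reductions.apache-data-sketch :as ds-sketch])

;; Hypothetical data: the values 1.0 through 1000.0 in a single column.
(def samples (ds/->dataset {:x (mapv double (range 1 1001))}))

(def quantile-result
  (ds-reduce/aggregate
   {;; Default k of 128.
    :default-k (ds-sketch/prob-quantiles :x [0.25 0.5 0.75])
    ;; Larger k: lower rank error, larger sketch.
    :larger-k  (ds-sketch/prob-quantiles :x [0.25 0.5 0.75] 256)}
   [samples]))

;; Assumption: each result value is a sequence or array of three doubles.
(def default-quartiles (vec (first (quantile-result :default-k))))
(def larger-quartiles  (vec (first (quantile-result :larger-k))))
</code></pre>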
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions/apache_data_sketch.clj#L210">view source</a></div></div><div class="public anchor" id="var-prob-set-cardinality"><h3>prob-set-cardinality</h3><div class="usage"><code>(prob-set-cardinality colname options)</code><code>(prob-set-cardinality colname)</code></div><div class="doc"><div class="markdown"><p>Get the probabilistic set cardinality using hyper-log-log. See <a href="tech.v3.dataset.reductions.apache-data-sketch.html#var-hll-reducer">hll-reducer</a>.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/reductions/apache_data_sketch.clj#L134">view source</a></div></div></div></body></html>