<!DOCTYPE html PUBLIC ""
"">
<html><head><meta charset="UTF-8" /><title>tech.v3.libs.parquet documentation</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());

gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">8.003</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="000-getting-started.html"><div class="inner"><span>tech.ml.dataset Getting Started</span></div></a></li><li class="depth-1 "><a href="100-walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li><li class="depth-1 "><a href="200-quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span 
class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span class="top"></span><span 
class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -641px;"><span class="top" style="height: 650px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.clj-transit.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clj-transit</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch current"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="sidebar secondary"><h3><a href="#top"><span class="inner">Public Vars</span></a></h3><ul><li class="depth-1"><a href="tech.v3.libs.parquet.html#var--.3Erow-group-supplier"><div class="inner"><span>->row-group-supplier</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.parquet.html#var-ds-.3Eparquet"><div class="inner"><span>ds->parquet</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.parquet.html#var-ds-seq-.3Eparquet"><div class="inner"><span>ds-seq->parquet</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.parquet.html#var-parquet-.3Eds"><div class="inner"><span>parquet->ds</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.parquet.html#var-parquet-.3Eds-seq"><div class="inner"><span>parquet->ds-seq</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.parquet.html#var-parquet-.3Emetadata-seq"><div class="inner"><span>parquet->metadata-seq</span></div></a></li></ul></div><div class="namespace-docs" id="content"><h1 class="anchor" id="top">tech.v3.libs.parquet</h1><div class="doc"><div class="markdown"><p>Support for reading Parquet files. You must require this namespace to
enable parquet read/write support.</p>
<p>Supported datatypes:</p>
<ul>
<li>all numeric types</li>
<li>strings</li>
<li>java.time LocalDate, Instant</li>
<li>UUIDs (read and written as strings, consistent with R's write_parquet function)</li>
</ul>
<p>Options for parsing parquet files include the more general io/->dataset options:</p>
<ul>
<li><code>:key-fn</code></li>
<li><code>:column-allowlist</code> in preference to <code>:column-whitelist</code></li>
<li><code>:column-blocklist</code> in preference to <code>:column-blacklist</code></li>
<li><code>:parser-fn</code></li>
</ul>
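<p>As a minimal sketch of how these options combine, a read restricted to two columns might look like the following. The file name <code>data.parquet</code> and the column names are hypothetical, and it is assumed here that allowlist entries match the column names as stored in the file.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         ;; Requiring this namespace enables parquet support in ->dataset.
         '[tech.v3.libs.parquet])

;; "data.parquet", "id", and "price" are placeholder names.
(def prices
  (ds/->dataset "data.parquet"
                {:key-fn keyword                     ; column names become keywords
                 :column-allowlist ["id" "price"]})) ; load only these columns
</code></pre>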
<p>Please include these dependencies in your project and be sure to read the notes
later in this document. Note that these exclusions are carefully chosen
to avoid the many <a href="https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/Arrow.3A.20dataset-.3Estream!.20.26.20metadata/near/300962808">serious CVE issues</a>
associated with hadoop:</p>
<pre><code class="language-clojure">org.apache.parquet/parquet-hadoop {:mvn/version "1.12.0"
                                   :exclusions [org.slf4j/slf4j-log4j12]}
org.apache.hadoop/hadoop-common {:mvn/version "3.3.0"
                                 :exclusions [com.sun.jersey/jersey-core
                                              com.sun.jersey/jersey-json
                                              com.sun.jersey/jersey-server
                                              com.sun.jersey/jersey-servlet

                                              dnsjava/dnsjava

                                              org.eclipse.jetty/jetty-server
                                              org.eclipse.jetty/jetty-servlet
                                              org.eclipse.jetty/jetty-util
                                              org.eclipse.jetty/jetty-webapp

                                              javax.activation/javax.activation-api
                                              javax.servlet.jsp/jsp-api
                                              javax.servlet/javax.servlet-api

                                              io.netty/netty-codec
                                              io.netty/netty-handler
                                              io.netty/netty-transport
                                              io.netty/netty-transport-native-epoll

                                              org.codehaus.jettison/jettison

                                              org.apache.zookeeper/zookeeper

                                              org.apache.curator/curator-recipes
                                              org.apache.curator/curator-client
                                              org.apache.htrace/htrace-core4

                                              org.apache.hadoop.thirdparty/hadoop-shaded-protobuf_3_7
                                              org.apache.hadoop/hadoop-auth

                                              org.apache.kerby/kerb-core

                                              commons-cli/commons-cli
                                              commons-net/commons-net
                                              org.apache.commons/commons-lang3
                                              org.apache.commons/commons-text
                                              org.apache.commons/commons-configuration2

                                              com.google.re2j/re2j
                                              com.google.code.findbugs/jsr305

                                              com.jcraft/jsch

                                              log4j/log4j
                                              org.slf4j/slf4j-log4j12]
                                 }
;; We literally need this for 1 POJO formatting object.
org.apache.hadoop/hadoop-mapreduce-client-core {:mvn/version "3.3.0"
                                                :exclusions [org.slf4j/slf4j-log4j12
                                                             org.apache.avro/avro
                                                             org.apache.hadoop/hadoop-yarn-client
                                                             org.apache.hadoop/hadoop-yarn-common
                                                             org.apache.hadoop/hadoop-annotations
                                                             org.apache.hadoop/hadoop-hdfs-client
                                                             io.netty/netty
                                                             com.google.inject.extensions/guice-servlet]}
;; M-1 mac support for snappy
org.xerial.snappy/snappy-java {:mvn/version "1.1.8.4"}
</code></pre>
<h4>Logging</h4>
<p>When writing parquet files you may notice an excessive amount of logging and/or
extremely slow write speeds. If you are using the default <code>tech.ml.dataset</code>
implementation with logback-classic as the concrete logger, the solution is to
disable debug logging by placing a file named <code>logback.xml</code>
on the classpath whose root node has a log level above debug. The logback.xml
file that 'tmd' uses by default during development is located in
<a href="https://github.com/techascent/tech.ml.dataset/blob/45a032768f25b1493a83e6baaff34832a184f8ab/dev-resources/logback.xml">dev-resources</a> and is
enabled via a profile in <a href="https://github.com/techascent/tech.ml.dataset/blob/45a032768f25b1493a83e6baaff34832a184f8ab/project.clj#L69">project.clj</a>.</p>
<h4>Large-ish Datasets</h4>
<p>The parquet writer will automatically split your dataset into multiple parquet
row groups, so writing one large dataset can produce a parquet file that reads
back as multiple datasets. This is perhaps
confusing but it is a side effect of the hadoop architecture. The simplest solution
is to load parquet files with <code>parquet->ds-seq</code> and then apply a final
concat-copying operation to produce one dataset. <code>->dataset</code> will do
this for you, but it emits a warning when doing so as this may
lead to OOM situations with some parquet files. To disable this warning, set the
option <code>:disable-parquet-warn-on-multiple-datasets</code> to a truthy value.</p>
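<p>The two approaches described above can be sketched roughly as follows. The path <code>large.parquet</code> is a placeholder, and the use of <code>concat-copying</code> from <code>tech.v3.dataset</code> is one reasonable reading of the "final concat-copying operation", not the only way to combine the pieces.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

;; "large.parquet" is a placeholder path.
(def combined
  ;; One dataset per row group, copied into a single dataset.
  (apply ds/concat-copying (parquet/parquet->ds-seq "large.parquet")))

;; Alternatively, let ->dataset concatenate and silence the warning.
(def combined-quiet
  (ds/->dataset "large.parquet"
                {:disable-parquet-warn-on-multiple-datasets true}))
</code></pre>
<p>Both forms hold the whole file in memory at once, so the OOM caveat applies to either.</p>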
</div></div><div class="public anchor" id="var--.3Erow-group-supplier"><h3>->row-group-supplier</h3><div class="usage"><code>(->row-group-supplier path)</code></div><div class="doc"><div class="markdown"><p>Recommended way of low-level reading the file. The metadata of the supplier contains a
<code>:row-group</code> member that contains a vector of row group metadata.
The supplier implements java.util.function.Supplier, java.lang.Iterable, and clojure.lang.IReduce.<br />
Each time it is called it returns a tuple of <code>[ParquetFileReader, PageReadStore, row-group-metadata]</code>.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L833">view source</a></div></div><div class="public anchor" id="var-ds-.3Eparquet"><h3>ds->parquet</h3><div class="usage"><code>(ds->parquet ds path options)</code><code>(ds->parquet ds path)</code></div><div class="doc"><div class="markdown"><p>Write a dataset to a parquet file. Many parquet options are possible;
these can also be passed in via ds/write!</p>
<p>Options are the same as ds-seq->parquet.</p>
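<p>A minimal usage sketch of the two-argument form; the dataset contents and the output path <code>example.parquet</code> are made up for illustration:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

;; A small illustrative dataset with string column names.
(def example-ds
  (ds/->dataset {"id" [1 2 3]
                 "label" ["a" "b" "c"]}))

;; Writes with default options; "example.parquet" is a placeholder path.
(parquet/ds->parquet example-ds "example.parquet")
</code></pre>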
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L1192">view source</a></div></div><div class="public anchor" id="var-ds-seq-.3Eparquet"><h3>ds-seq->parquet</h3><div class="usage"><code>(ds-seq->parquet path options ds-seq)</code><code>(ds-seq->parquet path ds-seq)</code></div><div class="doc"><div class="markdown"><p>Write a sequence of datasets to a parquet file. Parquet will break the data
stream up according to parquet file properties. Path may be a string path or
a java.io.OutputStream.</p>
<p>Options:</p>
<ul>
<li><code>:hadoop-configuration</code> - Either nil or an instance of
<code>org.apache.hadoop.conf.Configuration</code>.</li>
<li><code>:compression-codec</code> - keyword describing compression codec. Options are
<code>[:brotli :gzip :lz4 :lzo :snappy :uncompressed :zstd]</code>. Defaults to
<code>:snappy</code>.</li>
<li><code>:block-size</code> - Defaults to <code>ParquetWriter/DEFAULT_BLOCK_SIZE</code>.</li>
<li><code>:page-size</code> - Defaults to <code>ParquetWriter/DEFAULT_PAGE_SIZE</code>.</li>
<li><code>:dictionary-page-size</code> - Defaults to <code>ParquetWriter/DEFAULT_PAGE_SIZE</code>.</li>
<li><code>:dictionary-enabled?</code> - Defaults to
<code>ParquetWriter/DEFAULT_IS_DICTIONARY_ENABLED</code>.</li>
<li><code>:validating?</code> - Defaults to <code>ParquetWriter/DEFAULT_IS_VALIDATING_ENABLED</code>.</li>
<li><code>:writer-version</code> - Defaults to <code>ParquetWriter/DEFAULT_WRITER_VERSION</code>.</li>
</ul>
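<p>As an illustration of the three-argument form with an options map, the sketch below writes two small batches with gzip compression. The batch contents and the path <code>batches.parquet</code> are hypothetical; the batches are given the same columns on the assumption that they must share a schema.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

;; Two illustrative batches with identical columns.
(def batch-1 (ds/->dataset {"id" [1 2] "label" ["a" "b"]}))
(def batch-2 (ds/->dataset {"id" [3 4] "label" ["c" "d"]}))

;; Argument order is path, options, then the dataset sequence.
(parquet/ds-seq->parquet "batches.parquet"
                         {:compression-codec :gzip
                          :dictionary-enabled? true}
                         [batch-1 batch-2])
</code></pre>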
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L1126">view source</a></div></div><div class="public anchor" id="var-parquet-.3Eds"><h3>parquet->ds</h3><div class="usage"><code>(parquet->ds input options)</code><code>(parquet->ds input)</code></div><div class="doc"><div class="markdown"><p>Load a parquet file. Input must be a file on disk.</p>
<p>Options are a subset of the options used for loading datasets;
<code>:column-allowlist</code> and <code>:column-blocklist</code> are
particularly useful here. The parquet metadata ends up as metadata on the
datasets. <code>:column-whitelist</code> and <code>:column-blacklist</code> are available
but not preferred.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L859">view source</a></div></div><div class="public anchor" id="var-parquet-.3Eds-seq"><h3>parquet->ds-seq</h3><div class="usage"><code>(parquet->ds-seq path options)</code><code>(parquet->ds-seq path)</code></div><div class="doc"><div class="markdown"><p>Given a string, hadoop path, or a parquet InputFile, return a sequence of datasets.
Columns will have parquet metadata merged into their normal metadata.
The reader will be closed upon termination of the sequence.
The return value can be efficiently reduced over and iterated without leaking memory.<br />
See ham-fisted's lazy noncaching namespace for help.</p>
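<p>For instance, a reduction that only needs an aggregate can consume the sequence one dataset at a time instead of concatenating everything first. This is a sketch with a placeholder path, using <code>row-count</code> from <code>tech.v3.dataset</code>:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

;; Sum the row counts across all datasets in the file.
;; "large.parquet" is a placeholder path.
(def total-rows
  (reduce (fn [total dataset] (+ total (ds/row-count dataset)))
          0
          (parquet/parquet->ds-seq "large.parquet")))
</code></pre>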
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L844">view source</a></div></div><div class="public anchor" id="var-parquet-.3Emetadata-seq"><h3>parquet->metadata-seq</h3><div class="usage"><code>(parquet->metadata-seq path)</code></div><div class="doc"><div class="markdown"><p>Given a local parquet file, return a sequence of metadata, one for each row-group.
A row-group maps directly to a dataset.</p>
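<p>Because each row-group corresponds to one dataset, counting the metadata entries gives a cheap estimate of how many datasets a read would produce. A small sketch, with a placeholder path:</p>
<pre><code class="language-clojure">(require '[tech.v3.libs.parquet :as parquet])

;; Realize the per-row-group metadata; "large.parquet" is a placeholder path.
(def row-group-meta
  (vec (parquet/parquet->metadata-seq "large.parquet")))

;; Number of row groups, and so the number of datasets to expect.
(count row-group-meta)
</code></pre>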
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L826">view source</a></div></div></div></body></html>