<!DOCTYPE html>
<html><head><meta charset="UTF-8" /><title>tech.v3.libs.parquet documentation</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">8.003</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="000-getting-started.html"><div class="inner"><span>tech.ml.dataset Getting Started</span></div></a></li><li class="depth-1 "><a href="100-walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li><li class="depth-1 "><a href="200-quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span 
class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span class="top"></span><span 
class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -641px;"><span class="top" style="height: 650px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.clj-transit.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clj-transit</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch current"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li class="depth-4 branch"><a 
href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="sidebar secondary"><h3><a href="#top"><span class="inner">Public Vars</span></a></h3><ul><li class="depth-1"><a href="tech.v3.libs.parquet.html#var--.3Erow-group-supplier"><div class="inner"><span>-&gt;row-group-supplier</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.parquet.html#var-ds-.3Eparquet"><div class="inner"><span>ds-&gt;parquet</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.parquet.html#var-ds-seq-.3Eparquet"><div class="inner"><span>ds-seq-&gt;parquet</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.parquet.html#var-parquet-.3Eds"><div class="inner"><span>parquet-&gt;ds</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.parquet.html#var-parquet-.3Eds-seq"><div class="inner"><span>parquet-&gt;ds-seq</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.parquet.html#var-parquet-.3Emetadata-seq"><div class="inner"><span>parquet-&gt;metadata-seq</span></div></a></li></ul></div><div class="namespace-docs" id="content"><h1 class="anchor" id="top">tech.v3.libs.parquet</h1><div class="doc"><div class="markdown"><p>Support for reading Parquet files. You must require this namespace to
enable parquet read/write support.</p>
<p>Supported datatypes:</p>
<ul>
<li>all numeric types</li>
<li>strings</li>
<li>java.time LocalDate, Instant</li>
<li>UUIDs (read/written as strings, in accordance with R's write_parquet function)</li>
</ul>
<p>Parquet parsing options include the more general io/-&gt;dataset options:</p>
<ul>
<li><code>:key-fn</code></li>
<li><code>:column-allowlist</code> in preference to <code>:column-whitelist</code></li>
<li><code>:column-blocklist</code> in preference to <code>:column-blacklist</code></li>
<li><code>:parser-fn</code></li>
</ul>
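<p>For example, a file could be loaded with keywordized column names and a column allowlist. This is a sketch; the file name and column names are assumptions:</p>
<pre><code class="language-clojure">(require '[tech.v3.libs.parquet :as parquet])

;; Load only the "id" and "price" columns, keywordizing their names.
(parquet/parquet-&gt;ds "data.parquet"
                     {:key-fn keyword
                      :column-allowlist ["id" "price"]})
</code></pre>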
<p>Please include these dependencies in your project, and be sure to read the notes
later in this document. The exclusions below are carefully chosen
to avoid the myriad <a href="https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/Arrow.3A.20dataset-.3Estream!.20.26.20metadata/near/300962808">serious CVE issues</a>
associated with hadoop:</p>
<pre><code class="language-clojure">org.apache.parquet/parquet-hadoop {:mvn/version "1.12.0"
                                   :exclusions [org.slf4j/slf4j-log4j12]}
org.apache.hadoop/hadoop-common
{:mvn/version "3.3.0"
 :exclusions [com.sun.jersey/jersey-core
              com.sun.jersey/jersey-json
              com.sun.jersey/jersey-server
              com.sun.jersey/jersey-servlet
              dnsjava/dnsjava
              org.eclipse.jetty/jetty-server
              org.eclipse.jetty/jetty-servlet
              org.eclipse.jetty/jetty-util
              org.eclipse.jetty/jetty-webapp
              javax.activation/javax.activation-api
              javax.servlet.jsp/jsp-api
              javax.servlet/javax.servlet-api
              io.netty/netty-codec
              io.netty/netty-handler
              io.netty/netty-transport
              io.netty/netty-transport-native-epoll
              org.codehaus.jettison/jettison
              org.apache.zookeeper/zookeeper
              org.apache.curator/curator-recipes
              org.apache.curator/curator-client
              org.apache.htrace/htrace-core4
              org.apache.hadoop.thirdparty/hadoop-shaded-protobuf_3_7
              org.apache.hadoop/hadoop-auth
              org.apache.kerby/kerb-core
              commons-cli/commons-cli
              commons-net/commons-net
              org.apache.commons/commons-lang3
              org.apache.commons/commons-text
              org.apache.commons/commons-configuration2
              com.google.re2j/re2j
              com.google.code.findbugs/jsr305
              com.jcraft/jsch
              log4j/log4j
              org.slf4j/slf4j-log4j12]}
;; We literally need this for 1 POJO formatting object.
org.apache.hadoop/hadoop-mapreduce-client-core
{:mvn/version "3.3.0"
 :exclusions [org.slf4j/slf4j-log4j12
              org.apache.avro/avro
              org.apache.hadoop/hadoop-yarn-client
              org.apache.hadoop/hadoop-yarn-common
              org.apache.hadoop/hadoop-annotations
              org.apache.hadoop/hadoop-hdfs-client
              io.netty/netty
              com.google.inject.extensions/guice-servlet]}
;; M-1 mac support for snappy
org.xerial.snappy/snappy-java {:mvn/version "1.1.8.4"}
</code></pre>
<h4>Logging</h4>
<p>When writing parquet files you may notice truly excessive logging and/or
extremely slow write speeds. If you are using the default <code>tech.ml.dataset</code>
implementation with logback-classic as the concrete logger, the solution is to
disable debug logging by placing a file named <code>logback.xml</code>
on the classpath whose root node has a log level above debug. The logback.xml
file that 'tmd' uses by default during development is located in
<a href="https://github.com/techascent/tech.ml.dataset/blob/45a032768f25b1493a83e6baaff34832a184f8ab/dev-resources/logback.xml">dev-resources</a> and is
enabled via a profile in <a href="https://github.com/techascent/tech.ml.dataset/blob/45a032768f25b1493a83e6baaff34832a184f8ab/project.clj#L69">project.clj</a>.</p>
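<p>A minimal <code>logback.xml</code> along these lines is enough to silence the debug output. This is a sketch, not the exact file tmd ships:</p>
<pre><code class="language-xml">&lt;configuration&gt;
  &lt;appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender"&gt;
    &lt;encoder&gt;
      &lt;pattern&gt;%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n&lt;/pattern&gt;
    &lt;/encoder&gt;
  &lt;/appender&gt;
  &lt;!-- Root level "info" (or higher) suppresses the per-record debug spam. --&gt;
  &lt;root level="info"&gt;
    &lt;appender-ref ref="STDOUT"/&gt;
  &lt;/root&gt;
&lt;/configuration&gt;
</code></pre>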
<h4>Large-ish Datasets</h4>
<p>The parquet writer automatically splits a dataset into multiple row groups,
so writing one large dataset may produce a parquet file that reads back as
multiple datasets. This is perhaps confusing, but it is a side effect of the
hadoop architecture. The simplest remedy when loading such files is to use
parquet-&gt;ds-seq followed by a final concat-copying operation to produce one
dataset. <code>-&gt;dataset</code> will do this for you, but it emits a warning
when doing so because the copy may lead to OOM situations with some parquet
files. To disable the warning, set the option
<code>:disable-parquet-warn-on-multiple-datasets</code> to a truthy value.</p>
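<p>A sketch of the explicit load-and-concatenate approach, using <code>concat-copying</code> from tech.v3.dataset (the file name is an assumption):</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

;; Read every row group as its own dataset, then copy them into one.
(apply ds/concat-copying (parquet/parquet-&gt;ds-seq "large.parquet"))
</code></pre>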
</div></div><div class="public anchor" id="var--.3Erow-group-supplier"><h3>-&gt;row-group-supplier</h3><div class="usage"><code>(-&gt;row-group-supplier path)</code></div><div class="doc"><div class="markdown"><p>The recommended way to read the file at a low level. The supplier's metadata contains a
<code>:row-group</code> member holding a vector of row-group metadata.
The supplier implements java.util.function.Supplier, java.util.Iterable, and clojure.lang.IReduce.<br />
Each call returns a tuple of [ParquetFileReader, PageReadStore, row-group-metadata].</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L833">view source</a></div></div><div class="public anchor" id="var-ds-.3Eparquet"><h3>ds-&gt;parquet</h3><div class="usage"><code>(ds-&gt;parquet ds path options)</code><code>(ds-&gt;parquet ds path)</code></div><div class="doc"><div class="markdown"><p>Write a dataset to a parquet file. Many parquet options are possible;
these can also be passed in via ds/write!</p>
<p>Options are the same as ds-seq-&gt;parquet.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L1192">view source</a></div></div><div class="public anchor" id="var-ds-seq-.3Eparquet"><h3>ds-seq-&gt;parquet</h3><div class="usage"><code>(ds-seq-&gt;parquet path options ds-seq)</code><code>(ds-seq-&gt;parquet path ds-seq)</code></div><div class="doc"><div class="markdown"><p>Write a sequence of datasets to a parquet file. Parquet will break the data
stream up according to parquet file properties. Path may be a string path or
a java.io.OutputStream.</p>
<p>Options:</p>
<ul>
<li><code>:hadoop-configuration</code> - Either nil or an instance of
<code>org.apache.hadoop.conf.Configuration</code>.</li>
<li><code>:compression-codec</code> - keyword describing compression codec. Options are
<code>[:brotli :gzip :lz4 :lzo :snappy :uncompressed :zstd]</code>. Defaults to
<code>:snappy</code>.</li>
<li><code>:block-size</code> - Defaults to <code>ParquetWriter/DEFAULT_BLOCK_SIZE</code>.</li>
<li><code>:page-size</code> - Defaults to <code>ParquetWriter/DEFAULT_PAGE_SIZE</code>.</li>
<li><code>:dictionary-page-size</code> - Defaults to <code>ParquetWriter/DEFAULT_PAGE_SIZE</code>.</li>
<li><code>:dictionary-enabled?</code> - Defaults to
<code>ParquetWriter/DEFAULT_IS_DICTIONARY_ENABLED</code>.</li>
<li><code>:validating?</code> - Defaults to <code>ParquetWriter/DEFAULT_IS_VALIDATING_ENABLED</code>.</li>
<li><code>:writer-version</code> - Defaults to <code>ParquetWriter/DEFAULT_WRITER_VERSION</code>.</li>
</ul>
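<p>For example, a sequence of datasets could be written with zstd compression. This is a sketch; the output path and the dataset contents are assumptions:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

;; Each dataset in the sequence becomes (at least) one row group.
(parquet/ds-seq-&gt;parquet "out.parquet"
                         {:compression-codec :zstd}
                         [(ds/-&gt;dataset {:a [1 2 3] :b ["x" "y" "z"]})])
</code></pre>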
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L1126">view source</a></div></div><div class="public anchor" id="var-parquet-.3Eds"><h3>parquet-&gt;ds</h3><div class="usage"><code>(parquet-&gt;ds input options)</code><code>(parquet-&gt;ds input)</code></div><div class="doc"><div class="markdown"><p>Load a parquet file. Input must be a file on disk.</p>
<p>Options are a subset of the options used for loading datasets -
specifically <code>:column-allowlist</code> and <code>:column-blocklist</code> can be
useful here. The parquet metadata ends up as metadata on the
datasets. <code>:column-whitelist</code> and <code>:column-blacklist</code> are available
but not preferred.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L859">view source</a></div></div><div class="public anchor" id="var-parquet-.3Eds-seq"><h3>parquet-&gt;ds-seq</h3><div class="usage"><code>(parquet-&gt;ds-seq path options)</code><code>(parquet-&gt;ds-seq path)</code></div><div class="doc"><div class="markdown"><p>Given a string, hadoop path, or a parquet InputFile, return a sequence of datasets.
Columns will have parquet metadata merged into their normal metadata.
Reader will be closed upon termination of the sequence.
The return value can be efficiently reduced over and iterated without leaking memory.<br />
See ham-fisted's lazy noncaching namespace for help.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L844">view source</a></div></div><div class="public anchor" id="var-parquet-.3Emetadata-seq"><h3>parquet-&gt;metadata-seq</h3><div class="usage"><code>(parquet-&gt;metadata-seq path)</code></div><div class="doc"><div class="markdown"><p>Given a local parquet file, return a sequence of metadata, one for each row-group.
A row-group maps directly to a dataset.</p>
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L826">view source</a></div></div></div></body></html>