<!DOCTYPE html>
<html><head><meta charset="UTF-8" /><title>tech.v3.dataset.io documentation</title><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">tech.ml.dataset</span> <span class="project-version">5.0.0-SNAPSHOT</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="csv-space-operations.html"><div class="inner"><span>CSV Space Operations</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>Nippy Rocks!</span></div></a></li><li class="depth-1 "><a href="quick-reference.html"><div class="inner"><span>Quick Reference - Core API</span></div></a></li><li class="depth-1 "><a href="walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div 
class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4 current"><a href="tech.v3.dataset.io.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li></ul></div><div class="sidebar secondary"><h3><a href="#top"><span class="inner">Public Vars</span></a></h3><ul><li class="depth-1"><a href="tech.v3.dataset.io.html#var--.3E.3Edataset"><div class="inner"><span>-&gt;&gt;dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.io.html#var--.3Edataset"><div class="inner"><span>-&gt;dataset</span></div></a></li><li class="depth-1"><a 
href="tech.v3.dataset.io.html#var-data-.3Edataset"><div class="inner"><span>data-&gt;dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.io.html#var-dataset-.3Edata.21"><div class="inner"><span>dataset-&gt;data!</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.io.html#var-str-.3Efile-info"><div class="inner"><span>str-&gt;file-info</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.io.html#var-wrap-stream-fn"><div class="inner"><span>wrap-stream-fn</span></div></a></li><li class="depth-1"><a href="tech.v3.dataset.io.html#var-write.21"><div class="inner"><span>write!</span></div></a></li></ul></div><div class="namespace-docs" id="content"><h1 class="anchor" id="top">tech.v3.dataset.io</h1><div class="doc"><div class="markdown"></div></div><div class="public anchor" id="var--.3E.3Edataset"><h3>-&gt;&gt;dataset</h3><div class="usage"><code>(-&gt;&gt;dataset options dataset)</code><code>(-&gt;&gt;dataset dataset)</code></div><div class="doc"><div class="markdown"><p>Please see documentation of -&gt;dataset. Options are the same.</p></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/dtype-next/src/tech/v3/dataset/io.clj#L168">view source</a></div></div><div class="public anchor" id="var--.3Edataset"><h3>-&gt;dataset</h3><div class="usage"><code>(-&gt;dataset dataset {:keys [table-name dataset-name], :as options})</code><code>(-&gt;dataset dataset)</code></div><div class="doc"><div class="markdown"><p>Create a dataset from either csv/tsv or a sequence of maps.</p>
<ul>
<li>
<p>A <code>String</code> or <code>InputStream</code> will be interpreted as a file (or a gzipped file if the name ends with .gz) of tsv or csv data. The system will attempt to autodetect whether the data is csv or tsv, and will likewise attempt to detect the column datatypes; all of this can be overridden via the options below.</p></li>
<li>
<p>A sequence of maps may be passed in, in which case the first N maps are scanned to derive the column datatypes before the actual columns are created.</p></li>
</ul>
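<p>As an illustrative sketch of the two input styles above (the file name, map data, and <code>ds-io</code> namespace alias are hypothetical):</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset.io :as ds-io])

;; From a csv/tsv file path; csv-vs-tsv and column datatypes
;; are autodetected:
(def ds1 (ds-io/-&gt;dataset "example.csv"))

;; From a sequence of maps; column datatypes are derived by
;; scanning the first N maps:
(def ds2 (ds-io/-&gt;dataset [{:a 1 :b "x"} {:a 2 :b "y"}]))
</code></pre>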
<p>Options:</p>
<ul>
<li><code>:dataset-name</code> - set the name of the dataset.</li>
<li>
<p><code>:file-type</code> - Override the file-type discovery mechanism for strings, or force a particular parser for an input stream. Note that arrow and parquet files must have paths on disk and cannot currently be loaded from an input stream. Acceptable file types are: <code>#{:csv :tsv :xlsx :xls :arrow :parquet}</code>.</p></li>
<li><code>:gzipped?</code> - for file formats that support it, override autodetection and force creation of a gzipped input stream as opposed to a normal input stream.</li>
<li><code>:column-whitelist</code> - either a sequence of string column names or a sequence of column indices naming the columns to include.</li>
<li><code>:column-blacklist</code> - either a sequence of string column names or a sequence of column indices naming the columns to exclude.</li>
<li><code>:num-rows</code> - Number of rows to read.</li>
<li><code>:header-row?</code> - Defaults to true, indicates the first row is a header.</li>
<li><code>:key-fn</code> - function to be applied to column names. Typical use is: <code>:key-fn keyword</code>.</li>
<li><code>:separator</code> - Add a character separator to the list of separators to auto-detect.</li>
<li><code>:csv-parser</code> - Implementation of univocity's AbstractParser to use. If not provided, a default permissive parser is used. This way you can parse anything that univocity supports (flat files and the like).</li>
<li><code>:bad-row-policy</code> - One of three options: :skip, :error, :carry-on. Defaults to :carry-on. Some csv data has ragged rows, and in this case we have several options. With :carry-on we either create a new column or add missing values for columns that had no data for that row.</li>
<li><code>:skip-bad-rows?</code> - Legacy option. Use :bad-row-policy.</li>
<li><code>:max-chars-per-column</code> - Defaults to 4096. Columns with more characters than this will result in an exception.</li>
<li><code>:max-num-columns</code> - Defaults to 8192. CSV/TSV files with more columns than this will fail to parse. For more information on this option, see <a href="https://github.com/uniVocity/univocity-parsers/issues/301">https://github.com/uniVocity/univocity-parsers/issues/301</a>.</li>
<li><code>:n-initial-skip-rows</code> - Skip N rows initially. This currently may include the header row. Works across both csv and spreadsheet datasets.</li>
<li><code>:parser-fn</code> - Controls how columns are parsed. May be one of:</li>
<li><code>keyword?</code> - all columns are parsed to this datatype.</li>
<li>tuple - a pair of [datatype <code>parse-data</code>], in which case a container of type [datatype] will be created. <code>parse-data</code> can be one of:
<ul>
<li><code>:relaxed?</code> - data will be parsed such that parse failures of the standard parse functions do not stop the parsing process. :unparsed-values and :unparsed-indexes are available in the metadata of the column that tell you the values that failed to parse and their respective indexes.</li>
<li><code>fn?</code> - a function from string to one of <code>:tech.ml.dataset.parser/missing</code>, <code>:tech.ml.dataset.parser/parse-failure</code>, or the parsed value. Exceptions here always kill the parse process. :missing will get marked in the missing indexes, and :parse-failure will result in the index being added to the missing indexes and the column's :unparsed-values and :unparsed-indexes being updated.</li>
<li><code>string?</code> - for datetime types, this will be turned into a DateTimeFormatter via DateTimeFormatter/ofPattern. For encoded-text, this has to be a valid argument to Charset/forName.</li>
<li><code>DateTimeFormatter</code> - use with the appropriate temporal parse static function to parse the value.</li>
</ul>
</li>
<li><code>map?</code> - the header-name-or-idx is used to look up the value. If the value found is not nil, it can be any of the above options; otherwise the default column parser is used.</li>
</ul>
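<p>A hedged sketch of the <code>:parser-fn</code> forms described above (the file and column names are hypothetical, and an alias such as <code>ds-io</code> for this namespace is assumed):</p>
<pre><code class="language-clojure">;; keyword - parse every column as :float32:
(ds-io/-&gt;dataset "data.csv" {:parser-fn :float32})

;; map - per-column overrides looked up by header name; "date" is
;; parsed as a :local-date via a DateTimeFormatter pattern string,
;; "id" as :int64, and remaining columns use the default parser:
(ds-io/-&gt;dataset "data.csv"
                 {:parser-fn {"date" [:local-date "yyyy-MM-dd"]
                              "id"   :int64}})
</code></pre>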
<p>Returns a new dataset</p></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/dtype-next/src/tech/v3/dataset/io.clj#L60">view source</a></div></div><div class="public anchor" id="var-data-.3Edataset"><h3>data-&gt;dataset</h3><h4 class="type">multimethod</h4><div class="usage"></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/dtype-next/src/tech/v3/dataset/io.clj#L37">view source</a></div></div><div class="public anchor" id="var-dataset-.3Edata.21"><h3>dataset-&gt;data!</h3><h4 class="type">multimethod</h4><div class="usage"></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/dtype-next/src/tech/v3/dataset/io.clj#L48">view source</a></div></div><div class="public anchor" id="var-str-.3Efile-info"><h3>str-&gt;file-info</h3><div class="usage"><code>(str-&gt;file-info file-str)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/dtype-next/src/tech/v3/dataset/io.clj#L12">view source</a></div></div><div class="public anchor" id="var-wrap-stream-fn"><h3>wrap-stream-fn</h3><div class="usage"><code>(wrap-stream-fn dataset gzipped? open-fn)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/dtype-next/src/tech/v3/dataset/io.clj#L27">view source</a></div></div><div class="public anchor" id="var-write.21"><h3>write!</h3><div class="usage"><code>(write! dataset output-path options)</code><code>(write! dataset output-path)</code></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/dtype-next/src/tech/v3/dataset/io.clj#L176">view source</a></div></div></div></body></html>