<!DOCTYPE html PUBLIC "" "">
<html><head><meta charset="UTF-8" /><title>CSV Space Operations</title><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">tech.ml.dataset</span> <span class="project-version">5.0.0-SNAPSHOT</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 current"><a href="csv-space-operations.html"><div class="inner"><span>CSV Space Operations</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>Nippy Rocks!</span></div></a></li><li class="depth-1 "><a href="quick-reference.html"><div class="inner"><span>Quick Reference - Core API</span></div></a></li><li class="depth-1 "><a href="walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li></ul><h3 class="no-link"><span class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div 
class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li></ul></div><div class="document" id="content"><div 
class="doc"><div class="markdown"><h1><a href="#csv-space-operations" name="csv-space-operations"></a>CSV Space Operations</h1>
<p>For very large datasets it can be useful to filter, manipulate, or sample data in CSV space before parsing it into columnar format. The library supports this by splitting loading into two separate steps: transforming the CSV into a sequence of string arrays, and parsing those string-array rows into a dataset.</p>
<h3><a href="#input-data-string-array" name="input-data-string-array"></a>Input Data -> String Array</h3>
<pre><code class="clojure">user> (def rows (parse/csv->rows "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv"))
#'user/rows
user> (take 10 rows)
(["symbol", "date", "price"]
 ["MSFT", "Jan 1 2000", "39.81"]
 ["MSFT", "Feb 1 2000", "36.35"]
 ["MSFT", "Mar 1 2000", "43.22"]
 ["MSFT", "Apr 1 2000", "28.37"]
 ["MSFT", "May 1 2000", "25.45"]
 ["MSFT", "Jun 1 2000", "32.54"]
 ["MSFT", "Jul 1 2000", "28.4"]
 ["MSFT", "Aug 1 2000", "28.4"]
 ["MSFT", "Sep 1 2000", "24.53"])
</code></pre>
<p>Note that the header row is included in these rows. You can now filter, sample, and so on in CSV space, meaning any transformation that takes a sequence of string arrays and returns a sequence of string arrays.</p>
<h3><a href="#various-manipulations" name="various-manipulations"></a>Various Manipulations</h3>
<p>First we save off the header row, since we probably don’t want to manipulate it, and then transform the remaining rows. Here we keep every fourth row:</p>
<pre><code class="clojure">user> (def header-row (first rows))
#'user/header-row
user> (def rows (->> (rest rows)
                     (take-nth 4)))
#'user/rows
</code></pre>
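<p>Other CSV-space manipulations follow the same pattern. As a minimal sketch, the example below keeps only the rows for one symbol using plain Clojure sequence functions. It assumes rows behave like the string arrays shown above. <code>sample-rows</code>, <code>column-index</code>, and <code>filter-rows-by-column</code> are hypothetical names used for illustration and are not part of the library, and the sample data is a small stand-in rather than the real file.</p>
<pre><code class="clojure">;; Hypothetical sample data standing in for the output of csv->rows.
;; Each row is built as a Java String array to mirror the shape shown above.
(def sample-rows
  (map #(into-array String %)
       [["symbol" "date" "price"]
        ["MSFT" "Jan 1 2000" "39.81"]
        ["AAPL" "Jan 1 2000" "25.94"]
        ["MSFT" "Feb 1 2000" "36.35"]
        ["AAPL" "Feb 1 2000" "28.66"]]))

(defn column-index
  "Position of column-name in the header row, or -1 when it is absent."
  [header-row column-name]
  (.indexOf (vec header-row) column-name))

(defn filter-rows-by-column
  "Keeps the data rows whose value in column-name equals value.
  Takes and returns a sequence of string arrays with the header row first."
  [rows column-name value]
  (let [header-row (first rows)
        idx        (column-index header-row column-name)]
    (cons header-row
          (filter (fn [row] (= value (nth row idx)))
                  (rest rows)))))

(def msft-rows (filter-rows-by-column sample-rows "symbol" "MSFT"))

;; Convert the arrays to vectors so they print readably at the REPL.
(map vec msft-rows)
;; => (["symbol" "date" "price"]
;;     ["MSFT" "Jan 1 2000" "39.81"]
;;     ["MSFT" "Feb 1 2000" "36.35"])
</code></pre>
<p>Because the function returns the header row followed by the surviving data rows, its result has the same shape as the original row sequence and can be passed to the parsing step unchanged.</p>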
<h3><a href="#string-array-sequence-dataset" name="string-array-sequence-dataset"></a>String Array Sequence -> Dataset</h3>
<p>Now we produce the final dataset from the header row concatenated with the remaining string arrays. The example below sets a column datatype and converts the column names to keywords, to show that the parser options for <code>->dataset</code> also apply here.</p>
<pre><code class="clojure">user> (def stocks (parse/rows->dataset {:dataset-name "stocks"
                                        :key-fn keyword
                                        :parser-fn {"date" :packed-local-date}}
                                       (concat [header-row]
                                               rows)))
#'user/stocks
user> stocks
stocks [140 3]:

| :symbol | :date      | :price |
|---------|------------|-------:|
| MSFT    | 2000-01-01 |  39.81 |
| MSFT    | 2000-05-01 |  25.45 |
| MSFT    | 2000-09-01 |  24.53 |
| MSFT    | 2001-01-01 |  24.84 |
...
</code></pre>
<p>In this way you can do various manipulations in pure string space and still leave the parsing to the library. If you decide that you want to parse the data yourself, it is probably best to convert your parsed data into a sequence of maps and call <code>tech.ml.dataset/->dataset</code>.</p></div></div></div></body></html>