init research
+176
@@ -0,0 +1,176 @@
# tech.ml.dataset Getting Started

## What kind of data?

TMD processes _tabular_ data, that is, data logically arranged in rows and columns. Similar to a spreadsheet (but handling much larger datasets) or a database (but much more convenient), TMD accelerates exploring, cleaning, and processing data tables. TMD inherits Clojure's data-orientation and flexible dynamic typing, without compromising on being _functional_; thereby extending the language's reach to new problems and domains.

```clojure
> (ds/->dataset "lucy.csv")
lucy.csv [3 3]:

| name  | age | likes |
|-------|----:|-------|
| fred  |  42 | pizza |
| ethel |  42 | sushi |
| sally |  21 | opera |
```

## Reading and writing datasets

TMD can read datasets from many common formats (e.g., csv, tsv, xls, xlsx, json, parquet, arrow, ...). When given a file path, the [->dataset](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var--.3Edataset) function can often detect the format automatically by the file extension and obtain the dataset. The same function can make datasets from other sources, such as sequences of Clojure maps in memory, or (again with broad format support) data downloaded from the internet.

For output, the [rows](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rows) function gives the dataset as a sequence of maps, and the [write!](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-write.21) function can be used to serialize any dataset into any supported format.

```clojure
> (ds/->dataset [{:name "fred"
                  :age 42
                  :likes "pizza"}
                 {:name "ethel"
                  :age 42
                  :likes "sushi"}
                 {:name "sally"
                  :age 21
                  :likes "opera"}])
_unnamed [3 3]:

| :name | :age | :likes |
|-------|-----:|--------|
| fred  |   42 | pizza  |
| ethel |   42 | sushi  |
| sally |   21 | opera  |
```

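The output side described in this section can be sketched as follows. This is a minimal sketch rather than a canonical recipe: the file name `people-out.csv` is a hypothetical path, and the `:key-fn keyword` option is used on the assumption that keyword column names are wanted after the round trip.

```clojure
(require '[tech.v3.dataset :as ds])

(def people
  (ds/->dataset [{:name "fred"  :age 42 :likes "pizza"}
                 {:name "ethel" :age 42 :likes "sushi"}
                 {:name "sally" :age 21 :likes "opera"}]))

;; Each row is a map-like value keyed by column name.
(def first-row (first (ds/rows people)))

;; Hypothetical output path; the format is inferred from the file extension.
(ds/write! people "people-out.csv")

;; Read the file back, converting the header strings to keywords.
(def round-trip (ds/->dataset "people-out.csv" {:key-fn keyword}))
```
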
## Filtering data

TMD datasets are logically _maps_ of column name to column data; this means that (for example) Clojure's `dissoc` can be used to remove a column. Datasets can also be filtered row-wise, by predicates of a single column, or of entire rows - this is similar to Clojure's `filter` function, but can operate much more efficiently by exploiting tabular structure.

```clojure
> (-> (ds/->dataset "lucy.csv")
      (dissoc "likes"))
lucy.csv [3 2]:

| name  | age |
|-------|----:|
| fred  |  42 |
| ethel |  42 |
| sally |  21 |
```

```clojure
> (-> (ds/->dataset "lucy.csv")
      (ds/filter-column "age" #(> % 30)))
lucy.csv [2 3]:

| name  | age | likes |
|-------|----:|-------|
| fred  |  42 | pizza |
| ethel |  42 | sushi |
```

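Filtering on entire rows, as opposed to a single column, might look like the sketch below. It assumes an in-memory dataset with keyword column names (rather than `lucy.csv`), and the predicate itself is only an illustration.

```clojure
(require '[tech.v3.dataset :as ds])

(def people
  (ds/->dataset [{:name "fred"  :age 42 :likes "pizza"}
                 {:name "ethel" :age 42 :likes "sushi"}
                 {:name "sally" :age 21 :likes "opera"}]))

;; ds/filter passes each row to the predicate as a map of column name to value,
;; so the predicate can look at more than one column at once.
(def pizza-fans
  (ds/filter people
             (fn [row]
               (and (> (:age row) 30)
                    (= "pizza" (:likes row))))))
```
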
## Adding data

The powerful [row-map](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-map) function can be used to create or update columns that derive from data already in the dataset. Adding rows is typically accomplished by concatenating two (or more) datasets. The [functional](https://cnuernber.github.io/dtype-next/tech.v3.datatype.functional.html) namespace provides convenient functions for operating on scalar, element-wise, or columnar data.

```clojure
> (-> (ds/->dataset "lucy.csv")
      (ds/row-map (fn [{:strs [age]}]
                    {"half-age" (/ age 2.0)})))
lucy.csv [3 4]:

| name  | age | likes | half-age |
|-------|----:|-------|---------:|
| fred  |  42 | pizza |     21.0 |
| ethel |  42 | sushi |     21.0 |
| sally |  21 | opera |     10.5 |
```

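The other two techniques mentioned in this section, columnar arithmetic and adding rows by concatenation, can be sketched roughly as follows. The column name `:age-months` and the extra row for "ricky" are made up for illustration.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.functional :as dfn])

(def people
  (ds/->dataset [{:name "fred"  :age 42 :likes "pizza"}
                 {:name "ethel" :age 42 :likes "sushi"}
                 {:name "sally" :age 21 :likes "opera"}]))

;; dfn/* multiplies every element of the :age column by 12 in one columnar call;
;; assoc attaches the result as a new (hypothetical) :age-months column.
(def with-months
  (assoc people :age-months (dfn/* (people :age) 12)))

;; Adding rows: concatenate with a second dataset that has the same columns.
(def everyone
  (ds/concat people
             (ds/->dataset [{:name "ricky" :age 30 :likes "jazz"}])))
```
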
## Statistics

TMD has tools for calculating summary statistics on datasets. The [descriptive-stats](https://techascent.github.io/tech.ml.dataset/docs/tech.v3.dataset.html#var-descriptive-stats) function produces a dataset of summary statistics for each column in the input dataset - perfect for initial exploration, or further meta analysis or operation. Broad support for further columnar statistical analysis is provided by the [statistics](https://cnuernber.github.io/dtype-next/tech.v3.datatype.statistics.html) namespace.

```clojure
> (-> (ds/->dataset "lucy.csv")
      (ds/row-map (fn [{:strs [age]}]
                    {"half-age" (/ age 2.0)}))
      (ds/descriptive-stats {:stat-names [:col-name :datatype :min :mean :max :standard-deviation]}))
lucy.csv: descriptive-stats [4 6]:

| :col-name | :datatype | :min | :mean | :max | :standard-deviation |
|-----------|-----------|-----:|------:|-----:|--------------------:|
| name      | :string   |      |       |      |                     |
| age       | :int16    | 21.0 |  35.0 | 42.0 |         12.12435565 |
| likes     | :string   |      |       |      |                     |
| half-age  | :float64  | 10.5 |  17.5 | 21.0 |          6.06217783 |
```

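For a single column, the `statistics` namespace can be called directly. The sketch below assumes the same three-row data as an in-memory dataset with keyword column names; the results should agree with the `age` row of the table above.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.statistics :as stats])

(def people
  (ds/->dataset [{:name "fred"  :age 42 :likes "pizza"}
                 {:name "ethel" :age 42 :likes "sushi"}
                 {:name "sally" :age 21 :likes "opera"}]))

;; Columns can be passed straight to the statistics functions.
(def age-mean (stats/mean (people :age)))
(def age-sd   (stats/standard-deviation (people :age)))
```
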
## Grouping

Like a Clojure sequence, a dataset can be grouped into a _map_ of value to dataset with that value. The [group-by](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-group-by) function accomplishes this. The related [group-by->indexes](https://techascent.github.io/tech.ml.dataset/docs/tech.v3.dataset.html#var-group-by-.3Eindexes) function produces maps of value to row-indexes of the input dataset - working with indexes can be more efficient than constructing concrete grouped datasets.

```clojure
> (-> (ds/->dataset "lucy.csv")
      (ds/group-by #(if (> (get % "age") 30) :old :not-old)))
{:old lucy.csv [2 3]:

| name  | age | likes |
|-------|----:|-------|
| fred  |  42 | pizza |
| ethel |  42 | sushi |
, :not-old lucy.csv [1 3]:

| name  | age | likes |
|-------|----:|-------|
| sally |  21 | opera |
}
```

```clojure
> (-> (ds/->dataset "lucy.csv")
      (ds/group-by->indexes #(if (> (get % "age") 30) :old :not-old)))
{:old [0 1], :not-old [2]}
```

## Combining datasets

Because datasets are _maps_ of column name to column data, they can be combined column-wise using Clojure's `merge` function. The [concat](https://techascent.github.io/tech.ml.dataset/docs/tech.v3.dataset.html#var-concat) function can be used for row-wise combination of two or more datasets. The [join](https://techascent.github.io/tech.ml.dataset/docs/tech.v3.dataset.join.html) namespace provides database-like joins for aligning data from multiple datasets.

```clojure
> (merge (ds/->dataset (for [i (range 3)] {"index" i}))
         (ds/->dataset "lucy.csv"))
_unnamed [3 4]:

| index | name  | age | likes |
|------:|-------|----:|-------|
|     0 | fred  |  42 | pizza |
|     1 | ethel |  42 | sushi |
|     2 | sally |  21 | opera |
```

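A database-style join from the `join` namespace might be sketched like this, using `pd-merge`. The `cities` table is hypothetical, and the `:on` / `:how` options are shown on the assumption that an inner join on a shared `:name` column is what is wanted.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

(def people
  (ds/->dataset [{:name "fred"  :age 42 :likes "pizza"}
                 {:name "ethel" :age 42 :likes "sushi"}
                 {:name "sally" :age 21 :likes "opera"}]))

;; Hypothetical second table that shares the :name column.
(def cities
  (ds/->dataset [{:name "fred"  :city "Boston"}
                 {:name "sally" :city "Denver"}]))

;; Inner join: only names present in both datasets survive.
(def joined
  (ds-join/pd-merge people cities {:on :name :how :inner}))
```
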
## Date, time, and other datatypes

TMD knows about dates, times, instants, and many other types from the comprehensive `java.time` library. Working with these types can be much more convenient than dealing with them as strings, and datatypes are preserved throughout operations, so downstream tooling can avoid dealing with these data as strings as well.

In addition to `java.time` types, all Clojure types (e.g., keywords), UUIDs, as well as a comprehensive set of signed and unsigned numeric types of different widths are also transparently supported.

```clojure
> (def ds (ds/->dataset [{:date "1981-03-10"}
                         {:date "1999-12-31"}]
                        {:parser-fn {:date :local-date}}))
#'ds
> (.until (first (:date ds))
          (last (:date ds)))
#object[java.time.Period 0x2d9a2c24 "P18Y9M21D"]
```

---

## Further reading

- The [README](https://github.com/techascent/tech.ml.dataset#techmldataset) on GitHub has information about installing, and first steps with TMD.

- The [walkthrough](https://techascent.github.io/tech.ml.dataset/100-walkthrough.html) topic has long-form examples of processing real data with TMD.

- The [quick reference](https://techascent.github.io/tech.ml.dataset/200-quick-reference.html) summarizes many of the most frequently used functions with hints about their use.

- The [API docs](https://techascent.github.io/tech.ml.dataset/index.html) list every function available in TMD.

+178
@@ -0,0 +1,178 @@
# tech.ml.dataset Quick Reference

This topic summarizes many of the most frequently used TMD functions, together with some quick notes about their use. Functions here are linked to further documentation, or their source. Note, unless a namespace is specified, each function is accessible via the `tech.v3.dataset` namespace.

For a more thorough treatment, the [API docs](https://techascent.github.io/tech.ml.dataset/index.html) list every available function.

### Table of Contents
1. [Loading/Saving](#LoadingSaving)
1. [Accessing Values](#AccessingValues)
1. [REPL Friendly Printing](#PrintOptions)
1. [Exploring Datasets](#ExploringDatasets)
1. [Selecting Subrects](#SelectingSubrects)
1. [Manipulating Datasets](#ManipulatingDatasets)
1. [Elementwise Arithmetic](#ElementwiseArithmetic)
1. [Forcing Lazy Evaluation](#ForcingLazyEvaluation)

-----
<div id="LoadingSaving"></div>

## Loading/Saving

* [->dataset, ->>dataset](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var--.3Edataset) - obtains datasets from files or streams of csv/tsv, sequence-of-maps, map-of-arrays, xlsx, xls, and other typical formats. If their respective namespaces and dependencies are loaded, this function can also load parquet and arrow. [SQL](https://github.com/techascent/tech.ml.dataset.sql) and [ClojureScript](https://github.com/cnuernber/tmdjs) support is provided by separate libraries.
* [write!](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-write.21) - Writes csv, tsv, nippy (or a variety of other formats) with optional gzipping. Options are determined by scanning the file path string.
* [parquet support](https://techascent.github.io/tech.ml.dataset/tech.v3.libs.parquet.html)
* [xlsx, xls support](https://techascent.github.io/tech.ml.dataset/tech.v3.libs.poi.html)
* [fast xlsx support](https://techascent.github.io/tech.ml.dataset/tech.v3.libs.fastexcel.html)
* [arrow support](https://techascent.github.io/tech.ml.dataset/tech.v3.libs.arrow.html)
* [dataset->data](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-dataset-.3Edata) - Useful if you want an entire dataset represented as Clojure/JVM datastructures. Primitive arrays save space, roaring bitmaps represent missing sets, and string tables receive special treatment.
* [data->dataset](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-data-.3Edataset) - Inverse of dataset->data.
* [tech.v3.dataset.io.univocity/csv->rows](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.io.univocity.html#var-csv-.3Erows) - lower-level support for lazily parsing a csv or tsv as a sequence of `string[]` rows. Offers a subset of the `->dataset` options.
* [tech.v3.dataset.io.string-row-parser/rows->dataset](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.io.string-row-parser.html#var-rows-.3Edataset) - lower-level support for obtaining a dataset from a sequence of `string[]` rows. Offers a subset of the `->dataset` options.


-----
<div id="AccessingValues"></div>

## Accessing Values

* Datasets are logically maps of column name to column, and to this end implement `IPersistentMap`. So, `(map meta (vals ds))` will return a sequence of column metadata. Moreover, datasets implement `IFn`, and so are functions of their column names. Thus, `(ds :colname)` will return the column named `:colname`. Functions like `keys`, `vals`, `contains?`, `assoc`, `dissoc`, `merge`, and map-style destructuring all work on datasets. Notably, `update` does not work, as `update` always returns a persistent map.
* Columns are iterable and implement indexed (random access) so they work with `map`, `count` and `nth`. Columns also implement `IFn` analogous to persistent vectors. Helpfully, using negative values as indexes reads from the end similar to numpy and pandas.
* [row-count](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-count) - count dataset and column rows.
* [rows](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rows) - get the rows of the dataset as a `java.util.List` of persistent-map-like maps. Accomplished by a flyweight implementation of `clojure.lang.APersistentMap` where data is read out of the underlying dataset on demand. This keeps the data in the backing store for lazy access; this makes reading marginally more expensive, but allows this call not to increase memory working-set size. Indexing rows returned like this with negative values indexes from the end similar to numpy and pandas.
* [rowvecs](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rowvecs) - get the rows of the dataset as a `java.util.List` of persistent-vector-like entries. These rows are safe to use in maps. When using row values as keys in maps, the `{:copying? true}` option can help with performance, because each hash and equals comparison is using data located in the vector, not re-reading the data out of the source dataset. Negative values index from the end similar to numpy and pandas.
* `rows` and `rowvecs` are lazy and thus `(rand-nth (ds/rows ds))` is a relatively efficient pathway (and fun). `(ds/rows (ds/sample ds))` is also good for a quick scan.
* [column-count](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-column-count) - count columns.
* Typed random access is supported via the `(tech.v3.datatype/->reader col)` transformation. This is guaranteed to return an implementation of `java.util.List` storing typed values. These implement `IFn` like a column or persistent vector. Direct access to packed datetime columns may produce surprising results; call `tech.v3.datatype.datetime/unpack` on the column prior to calling `tech.v3.datatype/->reader` to get to the unpacked datatype. Negative indexes on readers index from the end.
* [missing](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-missing) - return a RoaringBitmap of missing indexes. For columns, returns the column's specific missing indexes. For datasets, returns a union of all the columns' missing indexes.
* [meta, with-meta, vary-meta](https://github.com/clojure/clojure/blob/master/src/clj/clojure/core.clj#L202) - both datasets and columns implement `clojure.lang.IObj` so metadata works. The key `:name` has meaning in the system and setting it directly on a column is not recommended. In general, operations preserve metadata.

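The access patterns listed above can be tried at the REPL along these lines. This is only a sketch over a small, made-up in-memory dataset; the var names are illustrative.

```clojure
(require '[tech.v3.dataset :as ds])

(def people
  (ds/->dataset [{:name "fred"  :age 42 :likes "pizza"}
                 {:name "ethel" :age 42 :likes "sushi"}
                 {:name "sally" :age 21 :likes "opera"}]))

;; Datasets behave like maps of column name to column...
(def col-names (set (keys people)))
(def has-likes? (contains? people :likes))

;; ...and like functions of their column names.
(def ages (people :age))

;; Columns are indexed and callable, like persistent vectors.
(def first-age (ages 0))
(def second-age (nth ages 1))

;; Rows are map-like and read from the dataset on demand.
(def third-row (nth (ds/rows people) 2))
```
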

-----
<div id="PrintOptions"></div>

## REPL Friendly Printing

REPL workflows are an important part of TMD, and so controlling what is printed (especially for larger datasets) is critical. Many options are provided, by metadata, to get the right information on the screen for perusal.

By default, printing is abbreviated; the helpful [print-all](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-print-all) function overrides this behavior to enable printing all rows.

In general, any option can be set like `(vary-meta ds assoc :print-column-max-width 10)`.

* Summary of [print metadata options](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/print.clj#L93)

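As a rough sketch of the two approaches above, the following sets one print option through metadata and uses `print-all` for the full listing. The dataset and the width of 10 are illustrative only.

```clojure
(require '[tech.v3.dataset :as ds])

(def people
  (ds/->dataset [{:name "fred"  :age 42 :likes "pizza"}
                 {:name "ethel" :age 42 :likes "sushi"}
                 {:name "sally" :age 21 :likes "opera"}]))

;; Print options live in the dataset's metadata, so vary-meta can set them.
(def narrow (vary-meta people assoc :print-column-max-width 10))
(println narrow)

;; print-all marks the dataset so that every row is printed.
(println (ds/print-all people))
```
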

-----
<div id="ExploringDatasets"></div>

## Exploring Datasets

* [head](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-head) - obtains a dataset consisting of the first N rows of the input dataset.
* [tail](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-tail) - obtains a dataset consisting of the last N rows of the input dataset.
* [sample](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-sample) - samples N rows, randomly, as a dataset.
* [rand-nth](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rand-nth) - samples a single row of the dataset.
* [descriptive-stats](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-descriptive-stats) - produces a dataset of columnwise descriptive statistics.
* [brief](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-brief) - get descriptive statistics as Clojure data (an edn sequence of maps, one for each column).

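A quick exploratory pass with these functions could look like the sketch below, again over a small made-up dataset.

```clojure
(require '[tech.v3.dataset :as ds])

(def people
  (ds/->dataset [{:name "fred"  :age 42 :likes "pizza"}
                 {:name "ethel" :age 42 :likes "sushi"}
                 {:name "sally" :age 21 :likes "opera"}]))

(def first-two (ds/head people 2))   ; first 2 rows, as a dataset
(def last-one  (ds/tail people 1))   ; last row, as a dataset
(def any-two   (ds/sample people 2)) ; 2 randomly chosen rows

;; One row of summary statistics per input column.
(def summary (ds/descriptive-stats people))

;; The same kind of information as plain Clojure data.
(def summary-data (ds/brief people))
```
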

-----
<div id="SelectingSubrects"></div>

## Selecting Subrects

Recall that since datasets are maps, `assoc`, `dissoc`, and `merge` all work at the dataset level - beyond that, consider these helpful subrect selection functions.

* [select-columns](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-select-columns) - select a subset of columns. Notably, this also controls column _order_ for downstream printing and serialization.
* [select-rows](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-select-rows) - get a specific subset of rows from a dataset or column.
* [select](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-select) - get a specific set of rows and columns in a single call, can be used for renaming.
* [drop-rows](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-drop-rows) - drop rows by index from a dataset or column.
* [drop-missing](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-drop-missing) - drop any rows with missing values from the dataset.

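The selection functions above can be sketched as follows. The `sparse` dataset is hypothetical; its second row deliberately lacks a `:b` value so that `drop-missing` has something to drop.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical data: row 1 has no :b value, so :b is missing there.
(def sparse
  (ds/->dataset [{:a 1 :b "x"}
                 {:a 2}
                 {:a 3 :b "z"}]))

;; Select (and reorder) columns.
(def reordered (ds/select-columns sparse [:b :a]))

;; Keep rows 0 and 2 by index.
(def picked (ds/select-rows sparse [0 2]))

;; Columns and rows in a single call.
(def corner (ds/select sparse [:a] [0 1]))

;; Drop row 1 by index, or drop every row with a missing value.
(def without-row-1 (ds/drop-rows sparse [1]))
(def complete      (ds/drop-missing sparse))
```
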

-----
<div id="ManipulatingDatasets"></div>

## Manipulating Datasets

* [new-dataset](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/impl/dataset.clj#L380) - Create a new dataset from a sequence of columns. Columns may be actual columns created via `tech.v3.dataset.column/new-column` or they could be maps containing at least keys `#:tech.v3.dataset{:name :data}` but also potentially `#:tech.v3.dataset{:metadata :missing}` in order to create a column with a specific set of missing values and metadata. `:force-datatype true` will stop the system from scanning the data for missing values and, e.g., creating a float column from a vector of Float objects. The above also applies to using `clojure.core/assoc` with a dataset.
* ⭐ [row-map](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-map) ⭐ - maps a function from map->map in parallel over the dataset. The returned maps will be used to create or update columns in the output dataset, merging with the original. Note there are options to return a sequence of datasets as opposed to a single large final dataset.
* [row-mapcat](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-mapcat) - maps a function from map->sequence-of-maps over the dataset in parallel, potentially expanding or shrinking the result (in terms of row count). When expanding, row information not included in the original map is efficiently duplicated. Note there are options to return a sequence of datasets as opposed to a single potentially very large final dataset.
* [pd-merge](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.join.html#var-pd-merge) - implements generalized left, right, inner, outer, and cross joins. Allows combining datasets in a way familiar to users of traditional databases.
* [replace-missing](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-replace-missing), [replace-missing-value](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-replace-missing-value) - replace missing values in one or more columns.
* [group-by-column](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-group-by-column), [group-by](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-group-by) - creates a map of value to dataset with that value. These datasets are created via indexing into the original dataset for efficiency, so no data is copied.
* [sort-by-column](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-sort-by-column), [sort-by](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-sort-by) - sorts the dataset by column values.
* [filter-column](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-filter-column), [filter](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-filter) - produces a new dataset with only rows that pass a predicate.
* [concat-copying](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-concat-copying), [concat-inplace](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-concat-inplace) - produces a new dataset as a concatenation of supplied datasets. Copying can be more efficient than in-place, but uses more memory - `(apply ds/concat-copying x-seq)` is **far** more efficient than `(reduce ds/concat-copying x-seq)`; this also is true for `concat-inplace`.
* [unique-by-column](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-unique-by-column), [unique-by](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-unique-by) - removes duplicate rows. Passing in `keep-fn` allows you to choose either first, last, or some other criteria for rows that have the same values. For `unique-by`, `identity` will work just fine (rows have sane equality semantics).
* [pmap-ds](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-pmap-ds) - maps a function of ds->ds in parallel over batches of data in the dataset. Can return either a new dataset via concat-copying or a sequence of datasets.
* [left-join-asof](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.join.html#var-left-join-asof) - specialized join-nearest functionality useful for doing things like finding the nearest values in time in irregularly sampled data.
* [rolling](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.rolling.html#var-rolling) - fixed and variable rolling window operations.
* [group-by-column-agg](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html#var-group-by-column-agg) - unusually high performance primitive that logically combines `group-by` and `reduce` operations. Each key in the supplied map of reducers becomes a column in the output dataset.
* [neanderthal support](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.neanderthal.html) - transformations of datasets to/from neanderthal dense native matrixes.
* [tensor support](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.tensor.html) - transformations of datasets to/from [tech.v3.tensor](https://cnuernber.github.io/dtype-next/tech.v3.tensor.html) objects.

Many of the functions above come in `->column` variants, which can be faster by avoiding the creation of fully-realized output datasets with superfluous data. Moreover, some of these functions come in `->indexes` variants, which simply return indexes and thus skip creating sub-datasets. Operating in index space as such can be _very_ efficient.

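A few of the functions above, exercised on a small made-up dataset, might look like the following sketch. It covers only sorting, grouping, de-duplication, and concatenation; the var names are illustrative.

```clojure
(require '[tech.v3.dataset :as ds])

(def people
  (ds/->dataset [{:name "fred"  :age 42 :likes "pizza"}
                 {:name "ethel" :age 42 :likes "sushi"}
                 {:name "sally" :age 21 :likes "opera"}]))

;; Sort rows by the values of one column (ascending).
(def sorted (ds/sort-by-column people :age))

;; Map of column value to the dataset of rows having that value.
(def by-age (ds/group-by-column people :age))

;; Keep one row per distinct :age.
(def unique-ages (ds/unique-by-column people :age))

;; Concatenate many datasets with a single apply rather than a reduce.
(def tripled (apply ds/concat-copying [people people people]))
```
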

-----
<div id="ElementwiseArithmetic"></div>

## Elementwise Arithmetic

Functions in the `tech.v3.datatype.functional` namespace operate elementwise on a column, lazily returning a new column. It is highly recommended to remove all missing values before using element-wise arithmetic as the `functional` namespace has no knowledge of missing values. Integer columns with missing values will be upcast to float or double columns in order to support a missing value indicator.

Note the use of `dfn` from `(require '[tech.v3.datatype.functional :as dfn])`:

```clojure
;; Fragment: also assumes (require '[tech.v3.datatype :as dtype]) and that
;; `filing`, `investor`, and `form-type` are bound in the enclosing scope.
(assoc ds :value (dtype/elemwise-cast (ds :value) :int64)
          :shrs-or-prn-amt (dtype/elemwise-cast (ds :shrs-or-prn-amt) :int64)
          :cik (dtype/const-reader (:cik filing) (ds/row-count ds))
          :investor (dtype/const-reader investor (ds/row-count ds))
          :form-type (dtype/const-reader form-type (ds/row-count ds))
          :edgar-id (dtype/const-reader (:edgar-id filing) (ds/row-count ds))
          ;; Each value divided by the column total.
          :weight (dfn// (ds :value)
                         (double (dfn/reduce-+ (ds :value)))))
```

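A self-contained version of the `:weight` calculation can be sketched as below. The `holdings` dataset and its values are made up for illustration; only the `dfn//` and `dfn/reduce-+` calls are taken from the fragment above.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.functional :as dfn])

;; Hypothetical holdings with no missing values.
(def holdings
  (ds/->dataset [{:ticker "A" :value 50}
                 {:ticker "B" :value 150}]))

;; Divide every value by the column total to get each row's share.
(def weighted
  (assoc holdings
         :weight (dfn// (holdings :value)
                        (double (dfn/reduce-+ (holdings :value))))))
```
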

-----
<div id="ForcingLazyEvaluation"></div>

## Forcing Lazy Evaluation

In pandas or with R's `data.table`s one frequently needs to consider making a copy of some data before operating on it. Making too many copies uses too much memory, making too few copies leads to confusing non-local overwrites of data. There is untold lossage in these nonfunctional notions of dataset processing. Unlike these, TMD's datasets are functional.

TMD's functional datasets rely on index indirection, laziness, and structural sharing to simplify the mental model necessary to reason about their operation. This allows low-cost aggregation of operations, and eliminates most wondering about whether making a copy is necessary or not (it's generally not). However, these indirections sometimes increase read costs.

At any time, `clone` can be used to make a clean copy of the dataset that relies on no indirect computation, and stores the data separately, so there is no chance of accidental overwrites. Clone is multithreaded and very efficient, boiling down to parallelized iteration over the data and `System/arraycopy` calls. Moreover, calling `clone` can reduce the in-memory size of the dataset by a bit - sometimes 20%, by converting `List`s that have some overhead into arrays that have no extra capacity.

* [tech.v3.datatype/clone](https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/src/tech/v3/datatype.clj#L95) - clones the dataset realizing lazy operations and copying the data into java arrays. Operates on datasets and columns.

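Realizing a lazily computed column with `clone` might look like the sketch below. The `:age-next-year` column is a made-up example of a lazy, derived column.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype :as dtype]
         '[tech.v3.datatype.functional :as dfn])

(def people
  (ds/->dataset [{:name "fred"  :age 42}
                 {:name "ethel" :age 42}
                 {:name "sally" :age 21}]))

;; dfn/+ returns a lazy result: values are computed each time they are read.
(def lazy-ds
  (assoc people :age-next-year (dfn/+ (people :age) 1)))

;; clone realizes every column into its own concrete storage.
(def concrete (dtype/clone lazy-ds))
```
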

---

## Additional Selling Points

Sophisticated support for [Apache Arrow](https://techascent.github.io/tech.ml.dataset/tech.v3.libs.arrow.html), including mmap support on JDK 8 through JDK 17 (on an M1 Mac you will need JDK 17). With Arrow, per-column compression (LZ4, ZSTD) is also available across all supported platforms. At the time of writing, the official Arrow SDK supports neither mmap nor JDK 17, and has no user-accessible way to save a compressed streaming format file.

Support is provided for operating on _sequences_ of datasets, enabling working on larger, potentially out-of-memory workloads. This is consistent with the design of the parquet and arrow data storage systems, and aggregation operations for sequences of datasets are efficiently implemented in the [tech.v3.dataset.reductions](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html) namespace.

Preliminary support for algorithms from the [Apache Data Sketches](https://datasketches.apache.org/) system can be found in the [apache-data-sketch](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.apache-data-sketch.html) namespace. Summations/means in this area are implemented using the [Kahan compensated summation](https://en.wikipedia.org/wiki/Kahan_summation_algorithm) algorithm.

### Efficient Rowwise Operations

TMD uses efficient parallelized mechanisms to operate on data for rowwise map and mapcat operations. Argument functions are passed maps that lazily read only the required data from the underlying dataset (huge savings over reading all the data). TMD scans the returned maps from the argument function for datatype and missing information. Columns derived from the mapping operation overwrite columns in the original dataset - the powerful `row-map` function works this way.

The mapping operations are run in parallel using a primitive named `pmap-ds` and the resulting datasets can either be returned in a sequence or combined into a single larger dataset.

* [row-map](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-map)
* [row-mapcat](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-mapcat)
* [rows](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rows)
* [rowvecs](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rowvecs)


---

## 📚 Additional Documentation 📚

The best place to start is the "Getting Started" topic in the documentation: [https://techascent.github.io/tech.ml.dataset/000-getting-started.html](https://techascent.github.io/tech.ml.dataset/000-getting-started.html)

The "Walkthrough" topic provides long-form examples of processing real data: [https://techascent.github.io/tech.ml.dataset/100-walkthrough.html](https://techascent.github.io/tech.ml.dataset/100-walkthrough.html)

The API docs document every available function: [https://techascent.github.io/tech.ml.dataset/](https://techascent.github.io/tech.ml.dataset/)

The provided Java API ([javadoc](https://techascent.github.io/tech.ml.dataset/javadoc/tech/v3/TMD.html) / [with frames](https://techascent.github.io/tech.ml.dataset/javadoc/index.html)) and sample program ([source](java_test/java/jtest/TMDDemo.java)) show how to use TMD from Java.

@@ -0,0 +1,170 @@
# tech.ml.dataset Columns, Readers, and Datatypes

In `tech.ml.dataset`, columns are composed of three things:
[data, metadata, and the missing set](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/impl/column.clj#L140).
The column's datatype is the datatype of the `data` member. The data member can
be anything convertible to a `tech.v3.datatype` reader of the appropriate type.

Buffers are a [simple abstraction](https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/java/tech/v3/datatype/Buffer.java) of typed random access read-only
memory that implement all the interfaces required to be both efficient and easy to use.
You can create a buffer by reifying the appropriately typed interface from
`tech.v3.datatype` but the datatype library has
[quick paths](https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/src/tech/v3/datatype.clj#L102) to creating these:

```clojure
user> (require '[tech.v3.datatype :as dtype])
nil
user> (dtype/make-reader :float32 5 idx)
[0.0 1.0 2.0 3.0 4.0]
user> (dtype/make-reader :float32 5 (* 2 idx))
[0.0 2.0 4.0 6.0 8.0]
```

|
||||
|
||||
|
||||
|
||||
A read-only buffer only needs three methods - `elemwiseDatatype` (optional), `lsize`, and
|
||||
`read[X]`. `read[X]` is typed to the datatype so for instance in the example above,
|
||||
`readFloat` returns a primitive float. `lsize` returns a long. Unlike the
|
||||
similar `get` method in Java lists, the `read[X]` methods take a long. This allows us
|
||||
to use read methods on storage mechanisms capable of addressing more than 2 (signed int)
|
||||
or 4 (unsigned int) billion addresses.
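As a rough sketch of those methods in use (assuming `tech.v3.datatype` is on the classpath; the name `squares` is only illustrative), a reader built with `make-reader` can be queried through `lsize` and its typed read method directly:

```clojure
(require '[tech.v3.datatype :as dtype])

;; A lazy :int64 reader; `idx` is the implicit long index bound by make-reader.
(def squares (dtype/make-reader :int64 5 (* idx idx)))

;; lsize reports the element count as a long.
(.lsize squares)        ;; => 5

;; readLong is the typed read method for :int64 and takes a long index.
(.readLong squares 3)   ;; => 9

;; The reader can also be poured into an ordinary persistent vector.
(vec squares)           ;; => [0 1 4 9 16]
```

Nothing is stored until a value is read; each call to `readLong` evaluates the body expression for that index.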
|
||||
|
||||
|
||||
Another way to create a reader is to do a 'map' type translation from one or more other
|
||||
readers. This is provided in two ways:
|
||||
|
||||
* [`dtype/emap`](https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/src/tech/v3/datatype/emap.clj#L97) - Missing set ignorant mapping into a typed representation.
|
||||
* [`tech.v3.dataset.column/column-map`](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/column.clj#L174) - Missing set aware mapping into a typed representation.
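A minimal sketch of the first option, assuming `dtype/emap` takes the mapping function, the result datatype, and then one or more source containers (the names `prices` and `doubled` are hypothetical):

```clojure
(require '[tech.v3.datatype :as dtype])

;; Source data held in a primitive double array.
(def prices (double-array [39.81 36.35 43.22]))

;; Element-wise map into a :float64 representation.
;; emap knows nothing about missing values, so none are tracked here.
(def doubled (dtype/emap (fn [p] (* 2.0 p)) :float64 prices))

;; The result reports the datatype that was requested.
(dtype/elemwise-datatype doubled)

;; One output element per input element.
(count (vec doubled))
```

When the source is a dataset column that may contain missing values, the missing-set-aware `column-map` variant is the safer choice.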
|
||||
|
||||
|
||||
The dataset system in general is smart enough to create columns out of readers in most
|
||||
situations. So for instance if you have a dataset and you want a column of a
|
||||
particular type, you can add-or-update-column and pass in a reader that implements what
|
||||
you want:
|
||||
|
||||
```clojure
|
||||
user> (def stocks (ds/->dataset "test/data/stocks.csv"))
|
||||
#'user/stocks
|
||||
user> (ds/head stocks)
|
||||
test/data/stocks.csv [5 3]:
|
||||
|
||||
| symbol | date | price |
|
||||
|--------+------------+-------|
|
||||
| MSFT | 2000-01-01 | 39.81 |
|
||||
| MSFT | 2000-02-01 | 36.35 |
|
||||
| MSFT | 2000-03-01 | 43.22 |
|
||||
| MSFT | 2000-04-01 | 28.37 |
|
||||
| MSFT | 2000-05-01 | 25.45 |
|
||||
user> (ds/head (ds/add-or-update-column stocks "id"
|
||||
(dtype/make-reader :int64
|
||||
(ds/row-count stocks)
|
||||
idx)))
|
||||
test/data/stocks.csv [5 4]:
|
||||
|
||||
| symbol | date | price | id |
|
||||
|--------+------------+-------+----|
|
||||
| MSFT | 2000-01-01 | 39.81 | 0 |
|
||||
| MSFT | 2000-02-01 | 36.35 | 1 |
|
||||
| MSFT | 2000-03-01 | 43.22 | 2 |
|
||||
| MSFT | 2000-04-01 | 28.37 | 3 |
|
||||
| MSFT | 2000-05-01 | 25.45 | 4 |
|
||||
```
|
||||
|
||||
|
||||
There are many different datatypes currently used in the datatype system -
|
||||
the primitive numeric types:
|
||||
|
||||
|
||||
* `:boolean` - convert to and from 0 (false) or 1 (true) when used as a number.
|
||||
* `:int8`,`:uint8` - signed/unsigned bytes.
|
||||
* `:int16`,`:uint16` - signed/unsigned shorts.
|
||||
* `:int32`,`:uint32` - signed/unsigned ints.
|
||||
* `:int64` - signed longs (haven't figured out unsigned longs really yet).
|
||||
* `:float32`, `:float64` - floats, doubles respectively.
|
||||
|
||||
|
||||
There are more types that can be represented by primitives (they 'alias' the primitive
|
||||
type) but we will leave that for another article.
|
||||
|
||||
Outside of the primitive types (and types aliased to primitive types), we have an
|
||||
unlimited number of object types. Any datatype the system doesn't understand is treated as
|
||||
type `:object` during generic operations.
|
||||
|
||||
|
||||
One very important aspect to note is that columns marked as `:object` datatypes will
|
||||
use the Clojure numerics stack during mathematical operations. This is
|
||||
important because the Clojure number tower, similar to the APL number tower,
|
||||
actively promotes values to the next appropriate size and is thus less error prone
|
||||
to use if you aren't absolutely certain of your value range and how it interacts with
|
||||
your arithmetic pathways.
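The difference between primitive and Clojure arithmetic can be seen in plain Clojure, without any dataset involved. This is only an illustration of the language-level behavior; it does not show which arithmetic function a particular dataset operation calls for `:object` columns.

```clojure
;; Unchecked primitive long arithmetic silently wraps around.
(unchecked-add Long/MAX_VALUE 1)            ;; => -9223372036854775808

;; Clojure's default `+` refuses to wrap and throws on long overflow.
(try (+ Long/MAX_VALUE 1)
     (catch ArithmeticException _ :overflow)) ;; => :overflow

;; The promoting variant `+'` moves up to an arbitrary precision BigInt.
(+' Long/MAX_VALUE 1)                       ;; => 9223372036854775808N

;; Integer division yields an exact ratio instead of truncating.
(/ 1 3)                                     ;; => 1/3

;; Mixing a long and a double promotes the result to a double.
(+ 1 2.5)                                   ;; => 3.5
```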
|
||||
|
||||
|
||||
```clojure
|
||||
user> (require '[tech.v3.dataset :as ds])
|
||||
nil
|
||||
user> (def stocks (ds/->dataset "test/data/stocks.csv"))
|
||||
#'user/stocks
|
||||
user> (require '[tech.v3.datatype.functional :as dfn])
|
||||
nil
|
||||
user> (def stocks-lag
|
||||
(assoc stocks "price-lag"
|
||||
(let [price-data (dtype/->reader (stocks "price"))]
|
||||
(dtype/make-reader :float64 (.lsize price-data)
|
||||
(.readDouble price-data
|
||||
(max 0 (dec idx)))))))
|
||||
|
||||
#'user/stocks-lag
|
||||
user> (ds/head (assoc stocks-lag "price-lag-diff" (dfn/- (stocks-lag "price")
|
||||
(stocks-lag "price-lag"))))
|
||||
test/data/stocks.csv [5 5]:
|
||||
|
||||
| symbol | date | price | price-lag | price-lag-diff |
|
||||
|--------+------------+-------+-----------+----------------|
|
||||
| MSFT | 2000-01-01 | 39.81 | 39.81 | 0.000 |
|
||||
| MSFT | 2000-02-01 | 36.35 | 39.81 | -3.460 |
|
||||
| MSFT | 2000-03-01 | 43.22 | 36.35 | 6.870 |
|
||||
| MSFT | 2000-04-01 | 28.37 | 43.22 | -14.85 |
|
||||
| MSFT | 2000-05-01 | 25.45 | 28.37 | -2.920 |
|
||||
```
|
||||
|
||||
All these operations are intrinsically lazy, so values are only calculated when
|
||||
requested. This is usually fine but in some cases it may be desired to force
|
||||
the calculation of a particular column completely (like in the instance where
|
||||
the calculation is particularly expensive). One way to force the column
|
||||
efficiently is to clone it:
|
||||
|
||||
```clojure
|
||||
user> (ds/head (ds/update-column stocks-lag "price-lag" dtype/clone))
|
||||
test/data/stocks.csv [5 4]:
|
||||
|
||||
| symbol | date | price | price-lag |
|
||||
|--------+------------+-------+-----------|
|
||||
| MSFT | 2000-01-01 | 39.81 | 39.81 |
|
||||
| MSFT | 2000-02-01 | 36.35 | 39.81 |
|
||||
| MSFT | 2000-03-01 | 43.22 | 36.35 |
|
||||
| MSFT | 2000-04-01 | 28.37 | 43.22 |
|
||||
| MSFT | 2000-05-01 | 25.45 | 28.37 |
|
||||
```
|
||||
|
||||
If we now get the actual type of the column's data member, we can see that it is
|
||||
a concrete type.
|
||||
|
||||
```clojure
|
||||
user> (-> (ds/update-column stocks-lag "price-lag" dtype/clone)
|
||||
(get "price-lag")
|
||||
(dtype/as-concrete-buffer))
|
||||
#array-buffer<float64>[560]
|
||||
[39.81, 39.81, 36.35, 43.22, 28.37, 25.45, 32.54, 28.40, 28.40, 24.53, 28.02, 23.34, 17.65, 24.84, 24.00, 22.25, 27.56, 28.14, 29.70, 26.93, ...]
|
||||
```
|
||||
|
||||
|
||||
This ability - to lazily define a column via interface implementation and still
|
||||
efficiently operate on that column - separates the implementation of
|
||||
the `tech.ml.dataset` library from other libraries in this field. This is likely
|
||||
to have an interesting and different set of advantages and disadvantages that will
|
||||
present themselves over time. The dataset library is very loosely bound to the
|
||||
underlying data representation allowing it to represent data that is much larger
|
||||
than can fit in memory and allowing dynamic column definitions to be defined at
|
||||
program runtime as equations and extensions derived from other sources of data.
|
||||
+232
@@ -0,0 +1,232 @@
|
||||
# tech.ml.dataset And nippy
|
||||
|
||||
|
||||
We are big fans of the [nippy system](https://github.com/ptaoussanis/nippy) for
|
||||
freezing/thawing data. So we were pleasantly surprised with how well it performs
|
||||
with dataset and how easy it was to extend the dataset object to support nippy
|
||||
natively.
|
||||
|
||||
|
||||
## Nippy Hits One Out Of the Park
|
||||
|
||||
|
||||
We start with a decent-sized gzipped tab-delimited file.
|
||||
|
||||
```console
|
||||
chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
|
||||
total 44M
|
||||
drwxrwxr-x 2 chrisn chrisn 4.0K Jun 18 13:27 .
|
||||
drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:27 ..
|
||||
-rw-rw-r-- 1 chrisn chrisn 44M Jun 18 13:27 2010.tsv.gz
|
||||
```
|
||||
|
||||
|
||||
```clojure
|
||||
user> (def ds-2010 (time (ds/->dataset
|
||||
"nippy-demo/2010.tsv.gz"
|
||||
{:parser-fn {"date" [:packed-local-date "yyyy-MM-dd"]}})))
|
||||
"Elapsed time: 8588.080218 msecs"
|
||||
#'user/ds-2010
|
||||
user> ;;rename column names so the tables print nicely
|
||||
user> (def ds-2010
|
||||
(ds/select-columns ds-2010
|
||||
(->> (ds/column-names ds-2010)
|
||||
(map (fn [oldname]
|
||||
[oldname (.replace ^String oldname "_" "-")]))
|
||||
(into {}))))
|
||||
user> ds-2010
|
||||
nippy-demo/2010.tsv.gz [2769708 12]:
|
||||
|
||||
| low | comp-name-2 | high | currency-code | comp-name | m-ticker | ticker | close | volume | exchange | date | open |
|
||||
|-------:|-------------|-------:|---------------|------------|----------|--------|-------:|---------------:|----------|------------|-------:|
|
||||
| | | | USD | ALCOA CORP | AA2 | AA | 48.365 | | NYSE | 2010-01-01 | |
|
||||
| 49.355 | | 51.065 | USD | ALCOA CORP | AA2 | AA | 51.065 | 1.10618840E+07 | NYSE | 2010-01-08 | 49.385 |
|
||||
| | | | USD | ALCOA CORP | AA2 | AA | 46.895 | | NYSE | 2010-01-18 | |
|
||||
| 39.904 | | 41.854 | USD | ALCOA CORP | AA2 | AA | 40.624 | 1.46292500E+07 | NYSE | 2010-01-26 | 40.354 |
|
||||
| 40.294 | | 41.674 | USD | ALCOA CORP | AA2 | AA | 40.474 | 1.20107520E+07 | NYSE | 2010-02-03 | 40.804 |
|
||||
| 39.304 | | 40.504 | USD | ALCOA CORP | AA2 | AA | 39.844 | 1.46702890E+07 | NYSE | 2010-02-09 | 40.084 |
|
||||
| 39.574 | | 40.264 | USD | ALCOA CORP | AA2 | AA | 39.844 | 1.53728400E+07 | NYSE | 2010-02-12 | 39.994 |
|
||||
| 40.324 | | 41.104 | USD | ALCOA CORP | AA2 | AA | 40.624 | 7.72947100E+06 | NYSE | 2010-02-22 | 41.044 |
|
||||
| 39.664 | | 40.564 | USD | ALCOA CORP | AA2 | AA | 39.724 | 1.08365810E+07 | NYSE | 2010-03-02 | 40.234 |
|
||||
```
|
||||
|
||||
|
||||
Our 44MB gzipped tsv produced 2.7 million rows and 12 columns.
|
||||
|
||||
Let's check the ram usage:
|
||||
```clojure
|
||||
user> (require '[clj-memory-meter.core :as mm])
|
||||
nil
|
||||
user> (mm/measure ds-2010)
|
||||
"121.5 MB"
|
||||
```
|
||||
|
||||
Now, let's save to an uncompressed nippy file:
|
||||
|
||||
```clojure
|
||||
user> (require '[tech.io :as io])
|
||||
nil
|
||||
user> (time (io/put-nippy! "nippy-demo/2010.nippy" ds-2010))
|
||||
"Elapsed time: 1069.781703 msecs"
|
||||
nil
|
||||
```
|
||||
|
||||
One second, pretty nice :-).
|
||||
|
||||
What is the file size?
|
||||
```console
|
||||
chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
|
||||
total 95M
|
||||
drwxrwxr-x 2 chrisn chrisn 4.0K Jun 18 13:38 .
|
||||
drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:36 ..
|
||||
-rw-rw-r-- 1 chrisn chrisn 51M Jun 18 13:38 2010.nippy
|
||||
-rw-rw-r-- 1 chrisn chrisn 44M Jun 18 13:27 2010.tsv.gz
|
||||
```
|
||||
|
||||
Not bad, just a slight bit larger.
|
||||
|
||||
The load performance, however, is spectacular:
|
||||
```clojure
|
||||
user> (def loaded-2010 (time (io/get-nippy "nippy-demo/2010.nippy")))
|
||||
"Elapsed time: 314.502715 msecs"
|
||||
#'user/loaded-2010
|
||||
user> (mm/measure loaded-2010)
|
||||
"93.9 MB"
|
||||
user> loaded-2010
|
||||
nippy-demo/2010.tsv.gz [2769708 12]:
|
||||
|
||||
| low | comp-name-2 | high | currency-code | comp-name | m-ticker | ticker | close | volume | exchange | date | open |
|
||||
|-------:|-------------|-------:|---------------|------------|----------|--------|-------:|---------------:|----------|------------|-------:|
|
||||
| | | | USD | ALCOA CORP | AA2 | AA | 48.365 | | NYSE | 2010-01-01 | |
|
||||
| 49.355 | | 51.065 | USD | ALCOA CORP | AA2 | AA | 51.065 | 1.10618840E+07 | NYSE | 2010-01-08 | 49.385 |
|
||||
| | | | USD | ALCOA CORP | AA2 | AA | 46.895 | | NYSE | 2010-01-18 | |
|
||||
| 39.904 | | 41.854 | USD | ALCOA CORP | AA2 | AA | 40.624 | 1.46292500E+07 | NYSE | 2010-01-26 | 40.354 |
|
||||
| 40.294 | | 41.674 | USD | ALCOA CORP | AA2 | AA | 40.474 | 1.20107520E+07 | NYSE | 2010-02-03 | 40.804 |
|
||||
| 39.304 | | 40.504 | USD | ALCOA CORP | AA2 | AA | 39.844 | 1.46702890E+07 | NYSE | 2010-02-09 | 40.084 |
|
||||
| 39.574 | | 40.264 | USD | ALCOA CORP | AA2 | AA | 39.844 | 1.53728400E+07 | NYSE | 2010-02-12 | 39.994 |
|
||||
```
|
||||
|
||||
It takes 8 seconds to load the tsv. It takes 315 milliseconds to load the nippy!
|
||||
That is great :-).
|
||||
|
||||
|
||||
The resulting dataset is somewhat smaller in memory. This is because when we
|
||||
parse a dataset we use fastutil lists and append elements to them and then return a
|
||||
dataset that sits directly on top of those lists as the column storage mechanism. Those lists have a bit
|
||||
more capacity than absolutely necessary.
|
||||
|
||||
When we save the data, we convert the data into base java/clojure datastructures
|
||||
such as primitive arrays. This is what makes things smaller: converting from a list
|
||||
with a bit of extra capacity allocated to an exact sized array. This operation is
|
||||
optimized and hits System/arraycopy under the covers as fastutil lists use arrays as
|
||||
the backing store and we make sure of the rest with `tech.datatype`.
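The size difference can be pictured with plain JVM arrays. This is a simplified sketch of the idea, not the library's actual code; `backing`, `n-elems`, and `trimmed` are hypothetical names, and `java.util.Arrays/copyOf` stands in for whatever copy routine the library uses.

```clojure
;; A growable list typically over-allocates: here 16 slots hold 10 values.
(def backing (double-array 16))
(def n-elems 10)

;; Fill only the used prefix, as an appending parser would.
(dotimes [i n-elems]
  (aset ^doubles backing i (double i)))

;; Copy the used prefix into an exactly sized array.
;; Arrays/copyOf delegates to System/arraycopy internally.
(def trimmed (java.util.Arrays/copyOf ^doubles backing (int n-elems)))

(alength ^doubles backing)  ;; => 16
(alength ^doubles trimmed)  ;; => 10
```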
|
||||
|
||||
|
||||
## Gzipping The Nippy
|
||||
|
||||
|
||||
We can do a bit better. If you are really concerned about dataset size on disk, we
|
||||
can save out a gzipped nippy:
|
||||
|
||||
|
||||
```clojure
|
||||
user> (time (io/put-nippy! (io/gzip-output-stream! "nippy-demo/2010.nippy.gz") ds-2010))
|
||||
"Elapsed time: 7026.500505 msecs"
|
||||
nil
|
||||
```
|
||||
|
||||
This beats the gzipped tsv in terms of size by 10%:
|
||||
```console
|
||||
chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
|
||||
total 134M
|
||||
drwxrwxr-x 2 chrisn chrisn 4.0K Jun 18 13:47 .
|
||||
drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:36 ..
|
||||
-rw-rw-r-- 1 chrisn chrisn 51M Jun 18 13:38 2010.nippy
|
||||
-rw-rw-r-- 1 chrisn chrisn 40M Jun 18 13:47 2010.nippy.gz
|
||||
-rw-rw-r-- 1 chrisn chrisn 44M Jun 18 13:27 2010.tsv.gz
|
||||
```
|
||||
|
||||
And now it takes twice the time to load:
|
||||
|
||||
```clojure
|
||||
user> (def loaded-gzipped-2010 (time (io/get-nippy (io/gzip-input-stream "nippy-demo/2010.nippy.gz"))))
|
||||
"Elapsed time: 680.165118 msecs"
|
||||
#'user/loaded-gzipped-2010
|
||||
user> (mm/measure loaded-gzipped-2010)
|
||||
"93.9 MB"
|
||||
```
|
||||
|
||||
You can probably handle load times in the 700ms range if you have a strong reason to
|
||||
have data compressed on disk.
|
||||
|
||||
|
||||
## Intermix With Clojure Data
|
||||
|
||||
Another aspect of nippy that is really valuable is that it can save/load datasets that
|
||||
are parts of arbitrary datastructures. So for example you can save
|
||||
the result of `group-by-column`:
|
||||
|
||||
```clojure
|
||||
|
||||
user> (def tickers (ds/group-by-column ds-2010 "ticker"))
|
||||
#'user/tickers
|
||||
user> (type tickers)
|
||||
clojure.lang.PersistentHashMap
|
||||
user> (count tickers)
|
||||
11532
|
||||
user> (first tickers)
|
||||
["RBYCF" RBYCF [261 12]:
|
||||
|
||||
| low | comp_name_2 | high | currency_code | comp_name | m_ticker | ticker | close | volume | exchange | date | open |
|
||||
|--------:|-------------|--------:|---------------|---------------|----------|--------|--------:|---------:|----------|------------|--------:|
|
||||
| | | | USD | RUBICON MNRLS | RUBI | RBYCF | 759.677 | | OTC | 2010-01-01 | |
|
||||
| 795.161 | | 827.419 | USD | RUBICON MNRLS | RUBI | RBYCF | 800.000 | 3596.775 | OTC | 2010-01-12 | 816.129 |
|
||||
| 741.935 | | 779.032 | USD | RUBICON MNRLS | RUBI | RBYCF | 758.064 | 5490.292 | OTC | 2010-01-20 | 779.032 |
|
||||
| 645.161 | | 688.710 | USD | RUBICON MNRLS | RUBI | RBYCF | 682.258 | 6201.953 | OTC | 2010-01-28 | 669.355 |
|
||||
| 685.484 | | 725.806 | USD | RUBICON MNRLS | RUBI | RBYCF | 687.097 | 3491.220 | OTC | 2010-02-08 | 714.516 |
|
||||
| 750.000 | | 783.871 | USD | RUBICON MNRLS | RUBI | RBYCF | 770.968 | 2927.057 | OTC | 2010-02-17 | 780.645 |
|
||||
...
|
||||
```
|
||||
|
||||
`group-by` and `group-by-column` both return persistent maps of key->dataset.
|
||||
|
||||
```clojure
|
||||
user> (tech.io/put-nippy! "ticker-sorted.nippy" tickers)
|
||||
nil
|
||||
user> (def loaded-tickers (tech.io/get-nippy "ticker-sorted.nippy"))
|
||||
#'user/loaded-tickers
|
||||
user> (count loaded-tickers)
|
||||
11532
|
||||
user> (first loaded-tickers)
|
||||
["RBYCF" RBYCF [261 12]:
|
||||
|
||||
| low | comp_name_2 | high | currency_code | comp_name | m_ticker | ticker | close | volume | exchange | date | open |
|
||||
|--------:|-------------|--------:|---------------|---------------|----------|--------|--------:|---------:|----------|------------|--------:|
|
||||
| | | | USD | RUBICON MNRLS | RUBI | RBYCF | 759.677 | | OTC | 2010-01-01 | |
|
||||
| 795.161 | | 827.419 | USD | RUBICON MNRLS | RUBI | RBYCF | 800.000 | 3596.775 | OTC | 2010-01-12 | 816.129 |
|
||||
| 741.935 | | 779.032 | USD | RUBICON MNRLS | RUBI | RBYCF | 758.064 | 5490.292 | OTC | 2010-01-20 | 779.032 |
|
||||
| 645.161 | | 688.710 | USD | RUBICON MNRLS | RUBI | RBYCF | 682.258 | 6201.953 | OTC | 2010-01-28 | 669.355 |
|
||||
| 685.484 | | 725.806 | USD | RUBICON MNRLS | RUBI | RBYCF | 687.097 | 3491.220 | OTC | 2010-02-08 | 714.516 |
|
||||
| 750.000 | | 783.871 | USD | RUBICON MNRLS | RUBI | RBYCF | 770.968 | 2927.057 | OTC | 2010-02-17 | 780.645 |
|
||||
```
|
||||
|
||||
Thus datasets can be used in maps, vectors, you name it and you can load/save those
|
||||
really complex datastructures. That can be a big help for complex dataflows.
|
||||
|
||||
|
||||
## Simple Implementation
|
||||
|
||||
|
||||
Our implementation of save/load for this pathway goes through two public functions:
|
||||
|
||||
|
||||
* [dataset->data](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/base.clj#L666) - Convert a dataset into a pure
|
||||
clojure/java datastructure suitable for serialization. Data is in arrays and string
|
||||
tables have been slightly deconstructed.
|
||||
|
||||
* [data->dataset](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/base.clj#L694) - Given a data-description of a
|
||||
dataset create a new dataset. This is mainly a zero copy operation so it should be
|
||||
quite quick.
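A minimal round trip through these two functions might look like the following. It assumes both are also exposed from the `tech.v3.dataset` namespace under the same names (the links above point at `base.clj`); the names `small`, `as-data`, and `round-trip` are hypothetical.

```clojure
(require '[tech.v3.dataset :as ds])

;; A tiny dataset with one numeric and one string column.
(def small (ds/->dataset [{:a 1 :b "x"} {:a 2 :b "y"}]))

;; Plain Clojure/Java data suitable for handing to a serializer.
(def as-data (ds/dataset->data small))

;; Rebuild a dataset from that description.
(def round-trip (ds/data->dataset as-data))

;; Shape and column names survive the trip.
(= (ds/row-count small) (ds/row-count round-trip))
(= (set (ds/column-names small)) (set (ds/column-names round-trip)))
```

Because the intermediate value is ordinary data, it can be nested inside other maps or vectors before being frozen.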
|
||||
|
||||
Near those functions you can see how easy it was to implement direct nippy support for
|
||||
the dataset object itself. Really nice, Nippy is truly a great library :-).
|
||||
+309
@@ -0,0 +1,309 @@
|
||||
# tech.ml.dataset Supported Datatypes
|
||||
|
||||
|
||||
`tech.ml.dataset` supports a wide range of datatypes and has a system for expanding
|
||||
the supported datatype set, aliasing new names to existing datatypes, and packing
|
||||
object datatypes into primitive containers. Let's walk through each of these topics
|
||||
and finally see how they relate to actually getting data into and out of a dataset.
|
||||
|
||||
|
||||
## Typesystem Fundamentals
|
||||
|
||||
|
||||
### Base Concepts
|
||||
|
||||
|
||||
There are two fundamental namespaces that describe the entire type system for
|
||||
`dtype-next` derived projects. The first is the [casting
|
||||
namespace](https://github.com/cnuernber/dtype-next/blob/master/src/tech/v3/datatype/casting.clj)
|
||||
- this registers the various datatypes and has maps describing the current set of
|
||||
datatypes.  `dtype-next` has a simple typesystem in order to support primitive
|
||||
unsigned types which are completely unsupported on the JVM otherwise.
|
||||
|
||||
|
||||
If we just load the casting namespace we see the base dtype-next datatypes:
|
||||
|
||||
|
||||
```clojure
|
||||
user> (require '[tech.v3.datatype.casting :as casting])
|
||||
nil
|
||||
user> @casting/valid-datatype-set
|
||||
#{:byte :int8 :float32 :long :bool :int32 :int :object :float64 :string :uint64 :uint16 :boolean :short :double :char :keyword :uint8 :uuid :uint32 :int16 :float :int64}
|
||||
```
|
||||
|
||||
Now if we load the dtype-next namespace we see quite a few more datatypes registered:
|
||||
|
||||
|
||||
```clojure
|
||||
user> (require '[tech.v3.datatype :as dtype])
|
||||
nil
|
||||
user> @casting/valid-datatype-set
|
||||
#{:byte :int8 :float32 :char-array :int :object-array :float64 :list :uint64 :uint16 :char :int64-array :uint8 :int32-array :boolean-array :persistent-map :persistent-vector :persistent-set :float :long :bool :int32 :object :int16-array :string :boolean :short :float64-array :double :float32-array :keyword :uuid :int8-array :native-buffer :uint32 :array-buffer :int16 :int64}
|
||||
```
|
||||
|
||||
Right away you can perhaps tell that there is a dynamic mechanism for registering more datatypes
|
||||
- we will get to that later. This set ties into the dtype-next [datatype api](https://cnuernber.github.io/dtype-next/tech.v3.datatype.html#var-datatype):
|
||||
|
||||
|
||||
```clojure
|
||||
user> (dtype/datatype (java.util.UUID/randomUUID))
|
||||
:uuid
|
||||
user> (dtype/datatype (int 10))
|
||||
:int32
|
||||
user> (dtype/datatype (float 10))
|
||||
:float32
|
||||
user> (dtype/datatype (double 10))
|
||||
:float64
|
||||
```
|
||||
|
||||
|
||||
If we have a container of data one important question we have is what type of data is in
|
||||
the container. This is where the [elemwise-datatype api](https://cnuernber.github.io/dtype-next/tech.v3.datatype.html#var-elemwise-datatype) comes in:
|
||||
|
||||
|
||||
```clojure
|
||||
user> (dtype/elemwise-datatype (float-array 10))
|
||||
:float32
|
||||
user> (dtype/elemwise-datatype (int-array 10))
|
||||
:int32
|
||||
```
|
||||
|
||||
Given two (or more) numeric datatypes we can ask the typesystem which datatype a combined
|
||||
operation, such as `+`, should operate in:
|
||||
|
||||
```clojure
|
||||
user> (casting/widest-datatype :float32 :int32)
|
||||
:float64
|
||||
```
|
||||
|
||||
The root of our type system is the object datatype. All types can be represented by the object
|
||||
datatype albeit at some cost and generic containers such as persistent vectors or java
|
||||
`ArrayList`s and generic sequences produced by operations such as `map` do not have any
|
||||
information about the type of data they contain and thus they have the datatype of `:object`:
|
||||
|
||||
```clojure
|
||||
user> (dtype/elemwise-datatype (range 10))
|
||||
:int64
|
||||
user> (dtype/elemwise-datatype (vec (range 10)))
|
||||
:object
|
||||
```
|
||||
|
||||
If we include the dataset api then we see the typesystem is extended to include support for
|
||||
various datetime types:
|
||||
|
||||
```clojure
|
||||
user> (require '[tech.v3.dataset :as ds])
|
||||
nil
|
||||
user> @casting/valid-datatype-set
|
||||
#{:byte :int8 :local-date-time :float32 :char-array :int :object-array :epoch-milliseconds :uint64 :char :packed-instant :uint8 :bitmap :int32-array :boolean-array :persistent-map :persistent-vector :days :tensor :persistent-set :seconds :long :microseconds :int32 :boolean :short :double :epoch-days :float32-array :instant :zoned-date-time :keyword :dataset :text :native-buffer :array-buffer :years :int64 :epoch-microseconds :milliseconds :float64 :list :uint16 :int64-array :nanoseconds :duration :packed-duration :float :bool :object :int16-array :string :hours :float64-array :epoch-seconds :packed-local-date :epoch-hours :uuid :weeks :local-date :int8-array :uint32 :int16}
|
||||
```
|
||||
|
||||
|
||||
Given a container with a specific datatype we can create a new read-only representation
|
||||
of a datatype that we desire with [make-reader](https://cnuernber.github.io/dtype-next/tech.v3.datatype.html#var-make-reader):
|
||||
|
||||
|
||||
```clojure
|
||||
user> (def generic-data (vec (range 10)))
|
||||
#'user/generic-data
|
||||
user> generic-data
|
||||
[0 1 2 3 4 5 6 7 8 9]
|
||||
user> (dtype/make-reader :float32 (count generic-data) (float (generic-data idx)))
|
||||
[0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0]
|
||||
```
|
||||
|
||||
The default definition of all datetime datatypes is in [datetime/base.clj](https://github.com/cnuernber/dtype-next/blob/master/src/tech/v3/datatype/datetime/base.clj#L398).
|
||||
|
||||
|
||||
### Packing
|
||||
|
||||
|
||||
The second fundamental concept to the typesystem is the concept of packing which is storing
|
||||
a java object in a primitive datatype. This allows us to use `:int64` data to represent
|
||||
`java.time.Instant` objects and `:int32` data to represent `java.time.LocalDate` objects.
|
||||
This compression has both speed and size benefits especially when it comes to serializing
|
||||
the data. It also allows us to support parquet and apache arrow file formats more
|
||||
transparently because they represent, e.g. `LocalDate` objects as epoch days. Currently
|
||||
only datetime objects are packed.
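The idea behind a packed local date can be shown with plain `java.time` interop, independent of the library: a `LocalDate` maps to a count of days since 1970-01-01 and back again. The names `date`, `packed-value`, and `unpacked-value` are illustrative only.

```clojure
;; A date object as the JVM represents it.
(def date (java.time.LocalDate/of 2021 12 15))

;; The packed form is just the number of days since the epoch.
(def packed-value (.toEpochDay ^java.time.LocalDate date))
packed-value                    ;; => 18976

;; Unpacking rebuilds an equal LocalDate from that integer.
(def unpacked-value (java.time.LocalDate/ofEpochDay packed-value))
(= date unpacked-value)         ;; => true

;; The value comfortably fits in a 32-bit integer.
(<= Integer/MIN_VALUE packed-value Integer/MAX_VALUE) ;; => true
```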
|
||||
|
||||
|
||||
Packing has generic support in the underlying buffer system so that it works in an integrated
|
||||
fashion throughout the system.
|
||||
|
||||
```clojure
|
||||
user> (dtype/make-container :packed-local-date (repeat 10 (java.time.LocalDate/now)))
|
||||
#array-buffer<packed-local-date>[10]
|
||||
[2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15]
|
||||
user> (def packed *1)
|
||||
#'user/packed
|
||||
user> (def unpacked (dtype/make-container :local-date (repeat 10 (java.time.LocalDate/now))))
|
||||
#'user/unpacked
|
||||
user> unpacked
|
||||
#array-buffer<local-date>[10]
|
||||
[2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15]
|
||||
user> (.readLong (dtype/->reader packed) 0)
|
||||
18976
|
||||
user> (.readObject (dtype/->reader packed) 0)
|
||||
#object[java.time.LocalDate 0x2d867250 "2021-12-15"]
|
||||
user> (.toEpochDay *1)
|
||||
18976
|
||||
user> (.readLong (dtype/->reader unpacked) 0)
|
||||
Execution error at tech.v3.datatype.NumericConversions/numberCast (NumericConversions.java:22).
|
||||
Invalid argument
|
||||
user> (.readObject (dtype/->reader unpacked) 0)
|
||||
#object[java.time.LocalDate 0x2f13a18c "2021-12-15"]
|
||||
```
|
||||
|
||||
Packing is defined in the namespace [tech.v3.datatype.packing](https://github.com/cnuernber/dtype-next/blob/master/src/tech/v3/datatype/packing.clj). We can add new packed datatypes
|
||||
but I strongly suggest avoiding this in general. While it certainly works well it is usually
|
||||
unnecessary and less clear than simply defining an alias and conversion methods to/from
|
||||
the alias.
|
||||
|
||||
|
||||
The best example of using the packing system is the definition of the [datetime packed
|
||||
datatypes](https://github.com/cnuernber/dtype-next/blob/master/src/tech/v3/datatype/datetime/packing.clj).
|
||||
|
||||
|
||||
### Aliasing Datatypes
|
||||
|
||||
|
||||
C/C++ contain the concept of datatype aliasing in the `typedef` keyword. For our use cases
|
||||
it is useful, especially when dealing with datetime types to alias some datatypes to integers
|
||||
of various sizes so you can have a container of `:milliseconds` and such. You can see several
|
||||
examples in the aforementioned [datetime/base.clj](https://github.com/cnuernber/dtype-next/blob/master/src/tech/v3/datatype/datetime/base.clj#L398).
|
||||
|
||||
|
||||
```clojure
|
||||
user> (casting/alias-datatype! :foobar :float32)
|
||||
#{:byte :int8 :local-date-time :float32 :char-array :int :object-array :epoch-milliseconds :uint64 :char :packed-instant :uint8 :bitmap :int32-array :boolean-array :persistent-map :persistent-vector :days :tensor :persistent-set :seconds :long :microseconds :int32 :boolean :short :double :epoch-days :float32-array :instant :zoned-date-time :keyword :dataset :text :native-buffer :array-buffer :years :int64 :epoch-microseconds :milliseconds :float64 :list :uint16 :int64-array :nanoseconds :duration :packed-duration :float :bool :foobar :object :int16-array :string :hours :float64-array :epoch-seconds :packed-local-date :epoch-hours :uuid :weeks :local-date :int8-array :uint32 :int16}
|
||||
user> (dtype/make-container :foobar (range 10))
|
||||
#array-buffer<foobar>[10]
|
||||
[0.000, 1.000, 2.000, 3.000, 4.000, 5.000, 6.000, 7.000, 8.000, 9.000]
|
||||
```
|
||||
|
||||
|
||||
In general, because this is all done at runtime I ask that people refrain from aliasing new
|
||||
datatypes, defining new datatypes, and packing new datatypes. This doesn't mean it is an
|
||||
error if someone does it but it does mean that every new datatype definition, packing
|
||||
definition, and alias definition slightly slows down the system.
|
||||
|
||||
|
||||
|
||||
## Supported Meaningful Datatypes
|
||||
|
||||
|
||||
For dataset processing, the currently supported meaningful datatypes are:
|
||||
|
||||
* `[:int8 :uint8 :int16 :uint16 :int32 :uint32 :int64 :uint64 :float32 :float64
|
||||
:string :keyword :uuid
|
||||
:local-date :packed-local-date :instant :packed-instant :duration :packed-duration
|
||||
:local-date-time]`
|
||||
|
||||
|
||||
There are more datatypes but for general purpose dataset processing these are a reasonable
|
||||
subset.
|
||||
|
||||
|
||||
When parsing data into the dataset system we can define both the container of the data
|
||||
and the parser of the data:
|
||||
|
||||
|
||||
```clojure
|
||||
user> (def data-maps (for [idx (range 10)]
|
||||
{:a idx
|
||||
:b (str (.plusDays (java.time.LocalDate/now) idx))}))
|
||||
#'user/data-maps
|
||||
user> data-maps
|
||||
({:a 0, :b "2021-12-15"}
|
||||
{:a 1, :b "2021-12-16"}
|
||||
{:a 2, :b "2021-12-17"}
|
||||
{:a 3, :b "2021-12-18"}
|
||||
{:a 4, :b "2021-12-19"}
|
||||
{:a 5, :b "2021-12-20"}
|
||||
{:a 6, :b "2021-12-21"}
|
||||
{:a 7, :b "2021-12-22"}
|
||||
{:a 8, :b "2021-12-23"}
|
||||
{:a 9, :b "2021-12-24"})
|
||||
user> (:b (ds/->dataset data-maps))
|
||||
#tech.v3.dataset.column<string>[10]
|
||||
:b
|
||||
[2021-12-15, 2021-12-16, 2021-12-17, 2021-12-18, 2021-12-19, 2021-12-20, 2021-12-21, 2021-12-22, 2021-12-23, 2021-12-24]
|
||||
user> (:b (ds/->dataset data-maps {:parser-fn {:b :local-date}}))
|
||||
#tech.v3.dataset.column<local-date>[10]
|
||||
:b
|
||||
[2021-12-15, 2021-12-16, 2021-12-17, 2021-12-18, 2021-12-19, 2021-12-20, 2021-12-21, 2021-12-22, 2021-12-23, 2021-12-24]
|
||||
user> (:b (ds/->dataset data-maps {:parser-fn {:b :packed-local-date}}))
|
||||
#tech.v3.dataset.column<packed-local-date>[10]
|
||||
:b
|
||||
[2021-12-15, 2021-12-16, 2021-12-17, 2021-12-18, 2021-12-19, 2021-12-20, 2021-12-21, 2021-12-22, 2021-12-23, 2021-12-24]
|
||||
user> (:b (ds/->dataset data-maps {:parser-fn {:b [:packed-local-date
|
||||
(fn [data]
|
||||
(java.time.LocalDate/parse (str data)))]}}))
|
||||
#tech.v3.dataset.column<packed-local-date>[10]
|
||||
:b
|
||||
[2021-12-15, 2021-12-16, 2021-12-17, 2021-12-18, 2021-12-19, 2021-12-20, 2021-12-21, 2021-12-22, 2021-12-23, 2021-12-24]
|
||||
```
|
||||
|
||||
|
||||
## Extending Datatype System
|
||||
|
||||
|
||||
Let's say we have tons of data in which only year-months are relevant.  We can have
|
||||
escalating levels of support depending on how much it really matters. The first level is to
|
||||
convert the data into a type the system already understands, in this case a LocalDate:
|
||||
|
||||
|
||||
```clojure
|
||||
user> (:b (ds/->dataset
|
||||
data-maps
|
||||
{:parser-fn {:b [:packed-local-date
|
||||
(fn [data]
|
||||
(let [ym (java.time.YearMonth/parse (str data))]
|
||||
(java.time.LocalDate/of (.getYear ym) (.getMonth ym) 1)))]
|
||||
}}))
|
||||
#tech.v3.dataset.column<packed-local-date>[10]
|
||||
:b
|
||||
[2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01]
|
||||
```
|
||||
|
||||
|
||||
The second is to parse to year-months and accept our column type will just be `:object` -
|
||||
|
||||
|
||||
```clojure
|
||||
user> (:b (ds/->dataset
|
||||
data-maps
|
||||
{:parser-fn {:b [:object
|
||||
(fn [data]
|
||||
(let [ym (java.time.YearMonth/parse (str data))]
|
||||
ym))]
|
||||
}}))
|
||||
#tech.v3.dataset.column<object>[10]
|
||||
:b
|
||||
[2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12]
|
||||
```
|
||||
|
||||
Third, we can extend the type system to support year-months as object datatypes.  This is
|
||||
only slightly better than base object support but it does allow us to ensure we can
|
||||
make containers with only `YearMonth` or nil objects in them:
|
||||
|
||||
|
||||
```clojure
|
||||
user> (casting/add-object-datatype! :year-month java.time.YearMonth)
|
||||
:ok
|
||||
user> (:b (ds/->dataset
|
||||
data-maps
|
||||
{:parser-fn {:b [:year-month
|
||||
(fn [data]
|
||||
(let [ym (java.time.YearMonth/parse (str data))]
|
||||
ym))]
|
||||
}}))
|
||||
#tech.v3.dataset.column<year-month>[10]
|
||||
:b
|
||||
[2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12]
|
||||
```
|
||||
|
||||
And finally we could implement packing for this type. This means we could store year-month as
|
||||
perhaps a 32-bit integer count of epoch-months or something like that.  We won't demonstrate this as
|
||||
it is tedious but the example in [datetime/packing.clj](https://github.com/cnuernber/dtype-next/blob/master/src/tech/v3/datatype/datetime/packing.clj)
|
||||
may be sufficient to show how to do this - if not let us know on Zulip or drop me an email.
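The conversion half of such a packing is straightforward to sketch in plain Clojure. The two functions below are hypothetical helpers, not part of `dtype-next`, and registering them with the packing system is deliberately left out.

```clojure
;; Hypothetical helper: months elapsed since 1970-01, as a 32-bit int.
(defn year-month->epoch-months
  [^java.time.YearMonth ym]
  (int (+ (* 12 (- (.getYear ym) 1970))
          (dec (.getMonthValue ym)))))

;; Hypothetical helper: inverse of the function above.
(defn epoch-months->year-month
  [months]
  (.plusMonths (java.time.YearMonth/of 1970 1) (long months)))

(year-month->epoch-months (java.time.YearMonth/of 2021 12)) ;; => 623
(epoch-months->year-month 623)  ;; a YearMonth equal to 2021-12
```

A pair like this is also all that the simpler alias-plus-conversion approach needs.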
|
||||