init research

This commit is contained in:
2026-02-08 11:20:43 -10:00
commit bdf064f54d
3041 changed files with 1592200 additions and 0 deletions
+176
@@ -0,0 +1,176 @@
# tech.ml.dataset Getting Started
## What kind of data?
TMD processes _tabular_ data, that is, data logically arranged in rows and columns. Similar to a spreadsheet (but handling much larger datasets) or a database (but much more convenient), TMD accelerates exploring, cleaning, and processing data tables. TMD inherits Clojure's data-orientation and flexible dynamic typing, without compromising on being _functional_; thereby extending the language's reach to new problems and domains.
```clojure
> (ds/->dataset "lucy.csv")
lucy.csv [3 3]:
| name | age | likes |
|-------|----:|-------|
| fred | 42 | pizza |
| ethel | 42 | sushi |
| sally | 21 | opera |
```
## Reading and writing datasets
TMD can read datasets from many common formats (e.g., csv, tsv, xls, xlsx, json, parquet, arrow, ...). When given a file path, the [->dataset](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var--.3Edataset) function can often detect the format automatically by the file extension and obtain the dataset. The same function can make datasets from other sources, such as sequences of Clojure maps in memory, or (again with broad format support) data downloaded from the internet.
For output, the [rows](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rows) function gives the dataset as a sequence of maps, and the [write!](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-write.21) function can be used to serialize any dataset into any supported format.
```clojure
> (ds/->dataset [{:name "fred"
:age 42
:likes "pizza"}
{:name "ethel"
:age 42
:likes "sushi"}
{:name "sally"
:age 21
:likes "opera"}])
_unnamed [3 3]:
| :name | :age | :likes |
|-------|-----:|--------|
| fred | 42 | pizza |
| ethel | 42 | sushi |
| sally | 21 | opera |
```
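The output side described above can be sketched as follows. This is a minimal example, assuming the `ds` alias refers to `tech.v3.dataset`; the data and the `lucy-out.csv` path are placeholders chosen for illustration.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical in-memory data mirroring lucy.csv.
(def people
  (ds/->dataset [{:name "fred"  :age 42 :likes "pizza"}
                 {:name "ethel" :age 42 :likes "sushi"}
                 {:name "sally" :age 21 :likes "opera"}]))

;; Dataset -> sequence of maps, one map per row.
(def people-rows (ds/rows people))
(:name (first people-rows))

;; Serialize to csv; the format is chosen from the file extension.
;; "lucy-out.csv" is a placeholder path.
(ds/write! people "lucy-out.csv")

;; Read the file back in. Column names read from csv are strings.
(def round-trip (ds/->dataset "lucy-out.csv"))
```

Note that a csv round trip does not preserve keyword column names, which is why the csv-backed examples in this topic use string names such as `"age"`.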
## Filtering data
TMD datasets are logically _maps_ of column name to column data; this means that (for example) Clojure's `dissoc` can be used to remove a column. Datasets can also be filtered row-wise, by predicates of a single column, or of entire rows - this is similar to Clojure's `filter` function, but can operate much more efficiently by exploiting tabular structure.
```clojure
> (-> (ds/->dataset "lucy.csv")
(dissoc "likes"))
lucy.csv [3 2]:
| name | age |
|-------|----:|
| fred | 42 |
| ethel | 42 |
| sally | 21 |
```
```clojure
> (-> (ds/->dataset "lucy.csv")
(ds/filter-column "age" #(> % 30)))
lucy.csv [2 3]:
| name | age | likes |
|-------|----:|-------|
| fred | 42 | pizza |
| ethel | 42 | sushi |
```
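Filtering on entire rows, as opposed to a single column, might look like the sketch below. It assumes `ds/filter` passes each row to the predicate as a map of column name to value; the data is hypothetical.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical in-memory data mirroring lucy.csv.
(def people
  (ds/->dataset [{:name "fred"  :age 42 :likes "pizza"}
                 {:name "ethel" :age 42 :likes "sushi"}
                 {:name "sally" :age 21 :likes "opera"}]))

;; The predicate sees a whole row, so it can combine several columns.
(def older-sushi-fans
  (ds/filter people
             (fn [row]
               (and (> (:age row) 30)
                    (= "sushi" (:likes row))))))

(ds/row-count older-sushi-fans)
```

When the predicate only needs one column, `filter-column` is the more direct choice because it avoids constructing row maps.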
## Adding data
The powerful [row-map](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-map) function can be used to create or update columns that derive from data already in the dataset. Adding rows is typically accomplished by concatenating two (or more) datasets. The [functional](https://cnuernber.github.io/dtype-next/tech.v3.datatype.functional.html) namespace provides convenient functions for operating on scalar, element-wise, or columnar data.
```clojure
> (-> (ds/->dataset "lucy.csv")
(ds/row-map (fn [{:strs [age]}]
{"half-age" (/ age 2.0)})))
lucy.csv [3 4]:
| name | age | likes | half-age |
|-------|----:|-------|---------:|
| fred | 42 | pizza | 21.0 |
| ethel | 42 | sushi | 21.0 |
| sally | 21 | opera | 10.5 |
```
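Adding rows by concatenation, and adding a column with the `functional` namespace, can be sketched as below. The datasets are hypothetical, and the `dfn` alias is an assumption of this example.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.functional :as dfn])

;; Two hypothetical datasets with the same columns.
(def people    (ds/->dataset [{:name "fred"  :age 42}
                              {:name "ethel" :age 42}]))
(def newcomers (ds/->dataset [{:name "sally" :age 21}]))

;; Adding rows: concatenate datasets that share column names.
(def everyone (ds/concat people newcomers))

;; Adding a column: dfn functions work element-wise on a whole column,
;; and a scalar argument is applied to every element.
(def with-half-age
  (assoc everyone :half-age (dfn// (everyone :age) 2.0)))
```

This computes the same `half-age` values as the `row-map` example, but works on the column directly instead of on per-row maps.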
## Statistics
TMD has tools for calculating summary statistics on datasets. The [descriptive-stats](https://techascent.github.io/tech.ml.dataset/docs/tech.v3.dataset.html#var-descriptive-stats) function produces a dataset of summary statistics for each column in the input dataset - perfect for initial exploration, or further meta analysis or operation. Broad support for further columnar statistical analysis is provided by the [statistics](https://cnuernber.github.io/dtype-next/tech.v3.datatype.statistics.html) namespace.
```clojure
> (-> (ds/->dataset "lucy.csv")
(ds/row-map (fn [{:strs [age]}]
{"half-age" (/ age 2.0)}))
(ds/descriptive-stats {:stat-names [:col-name :datatype :min :mean :max :standard-deviation]}))
lucy.csv: descriptive-stats [4 6]:
| :col-name | :datatype | :min | :mean | :max | :standard-deviation |
|-----------|-----------|-----:|------:|-----:|--------------------:|
| name | :string | | | | |
| age | :int16 | 21.0 | 35.0 | 42.0 | 12.12435565 |
| likes | :string | | | | |
| half-age | :float64 | 10.5 | 17.5 | 21.0 | 6.06217783 |
```
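For statistics on a single column, the `statistics` namespace can be called directly. A minimal sketch, assuming the `stats` alias and hypothetical data:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.statistics :as stats])

;; Hypothetical data with the same ages as lucy.csv.
(def people
  (ds/->dataset [{:name "fred"  :age 42}
                 {:name "ethel" :age 42}
                 {:name "sally" :age 21}]))

(def ages (people :age))

(stats/mean ages)               ; arithmetic mean of the column
(stats/median ages)             ; middle value
(stats/standard-deviation ages) ; matches the :standard-deviation row above
```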
## Grouping
Like a Clojure sequence, a dataset can be grouped into a _map_ of value to dataset with that value. The [group-by](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-group-by) function accomplishes this. The related [group-by->indexes](https://techascent.github.io/tech.ml.dataset/docs/tech.v3.dataset.html#var-group-by-.3Eindexes) function produces maps of value to row-indexes of the input dataset - working with indexes can be more efficient than constructing concrete grouped datasets.
```clojure
> (-> (ds/->dataset "lucy.csv")
(ds/group-by #(if (> (get % "age") 30) :old :not-old)))
{:old lucy.csv [2 3]:
| name | age | likes |
|-------|----:|-------|
| fred | 42 | pizza |
| ethel | 42 | sushi |
, :not-old lucy.csv [1 3]:
| name | age | likes |
|-------|----:|-------|
| sally | 21 | opera |
}
```
```clojure
> (-> (ds/->dataset "lucy.csv")
(ds/group-by->indexes #(if (> (get % "age") 30) :old :not-old)))
{:old [0 1], :not-old [2]}
```
## Combining datasets
Because datasets are _maps_ of column name to column data, they can be combined column-wise using Clojure's `merge` function. The [concat](https://techascent.github.io/tech.ml.dataset/docs/tech.v3.dataset.html#var-concat) function can be used for row-wise combination of two or more datasets. The [join](https://techascent.github.io/tech.ml.dataset/docs/tech.v3.dataset.join.html) namespace provides database-like joins for aligning data from multiple datasets.
```clojure
> (merge (ds/->dataset (for [i (range 3)] {"index" i}))
(ds/->dataset "lucy.csv"))
_unnamed [3 4]:
| index | name | age | likes |
|------:|-------|----:|-------|
| 0 | fred | 42 | pizza |
| 1 | ethel | 42 | sushi |
| 2 | sally | 21 | opera |
```
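A database-style join might look like the following sketch. The `cities` dataset and the `ds-join` alias are assumptions made for illustration; only `inner-join` and `left-join` from the `join` namespace are used.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

;; Hypothetical datasets sharing a :name column.
(def people (ds/->dataset [{:name "fred"  :age 42}
                           {:name "ethel" :age 42}
                           {:name "sally" :age 21}]))
(def cities (ds/->dataset [{:name "fred"  :city "Boulder"}
                           {:name "ethel" :city "Denver"}]))

;; Inner join: keep only names present in both datasets.
(def located (ds-join/inner-join :name people cities))

;; Left join: keep every row of people; :city is missing for sally.
(def everyone (ds-join/left-join :name people cities))
```

Unlike `merge`, which simply lines columns up by position, a join aligns rows by the values in the key column.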
## Date, time, and other datatypes
TMD knows about dates, times, instants, and many other types from the comprehensive `java.time` library. Working with these types can be much more convenient than dealing with them as strings, and datatypes are preserved throughout operations, so downstream tooling can avoid dealing with these data as strings as well.
In addition to `java.time` types, all Clojure types (e.g., keywords), UUIDs, as well as a comprehensive set of signed and unsigned numeric types of different widths are also transparently supported.
```clojure
> (def ds (ds/->dataset [{:date "1981-03-10"}
{:date "1999-12-31"}]
{:parser-fn {:date :local-date}}))
#'ds
> (.until (first (:date ds))
(last (:date ds)))
#object[java.time.Period 0x2d9a2c24 "P18Y9M21D"]
```
---
## Further reading
- The [README](https://github.com/techascent/tech.ml.dataset#techmldataset) on GitHub has information about installing, and first steps with TMD.
- The [walkthrough](https://techascent.github.io/tech.ml.dataset/100-walkthrough.html) topic has long-form examples of processing real data with TMD.
- The [quick reference](https://techascent.github.io/tech.ml.dataset/200-quick-reference.html) summarizes many of the most frequently used functions with hints about their use.
- The [API docs](https://techascent.github.io/tech.ml.dataset/index.html) list every function available in TMD.
File diff suppressed because it is too large Load Diff
+178
@@ -0,0 +1,178 @@
# tech.ml.dataset Quick Reference
This topic summarizes many of the most frequently used TMD functions, together with some quick notes about their use. Functions here are linked to further documentation, or their source. Unless a namespace is specified, each function is accessible via the `tech.v3.dataset` namespace.
For a more thorough treatment, the [API docs](https://techascent.github.io/tech.ml.dataset/index.html) list every available function.
### Table of Contents
1. [Loading/Saving](#LoadingSaving)
1. [Accessing Values](#AccessingValues)
1. [REPL Friendly Printing](#PrintOptions)
1. [Exploring Datasets](#ExploringDatasets)
1. [Selecting Subrects](#SelectingSubrects)
1. [Manipulating Datasets](#ManipulatingDatasets)
1. [Elementwise Arithmetic](#ElementwiseArithmetic)
1. [Forcing Lazy Evaluation](#ForcingLazyEvaluation)
-----
<div id="LoadingSaving"></div>
## Loading/Saving
* [->dataset, ->>dataset](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var--.3Edataset) - obtains datasets from files or streams of csv/tsv, sequence-of-maps, map-of-arrays, xlsx, xls, and other typical formats. If their respective namespaces and dependencies are loaded, this function can also load parquet and arrow. [SQL](https://github.com/techascent/tech.ml.dataset.sql) and [ClojureScript](https://github.com/cnuernber/tmdjs) support is provided by separate libraries.
* [write!](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-write.21) - Writes csv, tsv, nippy (or a variety of other formats) with optional gzipping. Depends on scanning file path string to determine options.
* [parquet support](https://techascent.github.io/tech.ml.dataset/tech.v3.libs.parquet.html)
* [xlsx, xls support](https://techascent.github.io/tech.ml.dataset/tech.v3.libs.poi.html)
* [fast xlsx support](https://techascent.github.io/tech.ml.dataset/tech.v3.libs.fastexcel.html)
* [arrow support](https://techascent.github.io/tech.ml.dataset/tech.v3.libs.arrow.html)
* [dataset->data](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-dataset-.3Edata) - Useful if you want an entire dataset represented as Clojure/JVM datastructures. Primitive arrays save space, roaring bitmaps represent missing sets, and string tables receive special treatment.
* [data->dataset](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-data-.3Edataset) - Inverse of dataset->data.
* [tech.v3.dataset.io.univocity/csv->rows](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.io.univocity.html#var-csv-.3Erows) - lower-level support for lazily parsing a csv or tsv as a sequence of `string[]` rows. Offers a subset of the `->dataset` options.
* [tech.v3.dataset.io.string-row-parser/rows->dataset](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.io.string-row-parser.html#var-rows-.3Edataset) - lower-level support for obtaining a dataset from a sequence of `string[]` rows. Offers a subset of the `->dataset` options.
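
A round trip through `dataset->data` and `data->dataset` can be sketched as below, using hypothetical data. The exact shape of the intermediate value is not asserted here; the example only relies on the two functions being inverses.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical data.
(def people
  (ds/->dataset [{:name "fred"  :age 42}
                 {:name "sally" :age 21}]))

;; Dataset -> plain Clojure/JVM data structures.
(def as-data (ds/dataset->data people))

;; ...and back to a dataset again.
(def restored (ds/data->dataset as-data))
```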
-----
<div id="AccessingValues"></div>
## Accessing Values
* Datasets are logically maps of column name to column, and to this end implement `IPersistentMap`. So, `(map meta (vals ds))` will return a sequence of column metadata. Moreover, datasets implement `IFn`, and so are functions of their column names. Thus, `(ds :colname)` will return the column named `:colname`. Functions like `keys`, `vals`, `contains?`, `assoc`, `dissoc`, `merge`, and map-style destructuring all work on datasets. Notably, `update` does not work, because `update` always returns a persistent map.
* Columns are iterable and implement indexed (random) access, so they work with `map`, `count`, and `nth`. Columns also implement `IFn`, analogous to persistent vectors. Helpfully, negative indexes read from the end, similar to numpy and pandas.
* [row-count](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-count) - count dataset and column rows.
* [rows](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rows) - get the rows of the dataset as a `java.util.List` of persistent-map-like maps. Accomplished by a flyweight implementation of `clojure.lang.APersistentMap` where data is read out of the underlying dataset on demand. This keeps the data in the backing store for lazy access; reading becomes marginally more expensive, but the call does not increase memory working-set size. Indexing rows returned like this with negative values indexes from the end, similar to numpy and pandas.
* [rowvecs](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rowvecs) - get the rows of the dataset as a `java.util.List` of persistent-vector-like entries. These rows are safe to use in maps. When using row values as keys in maps, the `{:copying? true}` option can help with performance, because each hash and equals comparison is using data located in the vector, not re-reading the data out of the source dataset. Negative values index from the end similar to numpy and pandas.
* `rows` and `rowvecs` are lazy and thus `(rand-nth (ds/rows ds))` is a relatively efficient pathway (and fun). `(ds/rows (ds/sample ds))` is also good for a quick scan.
* [column-count](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-column-count) - count columns.
* Typed random access is supported through the `(tech.v3.datatype/->reader col)` transformation. This is guaranteed to return an implementation of `java.util.List` storing typed values. These implement `IFn` like a column or persistent vector. Direct access to packed datetime columns may produce surprising results; call `tech.v3.datatype.datetime/unpack` on the column prior to calling `tech.v3.datatype/->reader` to get to the unpacked datatype. Negative indexes on readers index from the end.
* [missing](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-missing) - return a RoaringBitmap of missing indexes. For columns, returns the column's specific missing indexes. For datasets, returns a union of all the columns' missing indexes.
* [meta, with-meta, vary-meta](https://github.com/clojure/clojure/blob/master/src/clj/clojure/core.clj#L202) - both datasets and columns implement `clojure.lang.IObj` so metadata works. The key `:name` has meaning in the system and setting it directly on a column is not recommended. In general, operations preserve metadata.
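
The map-like and function-like access described in this list can be sketched as follows, using hypothetical data.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical data.
(def people
  (ds/->dataset [{:name "fred"  :age 42 :likes "pizza"}
                 {:name "ethel" :age 42 :likes "sushi"}
                 {:name "sally" :age 21 :likes "opera"}]))

(keys people)               ; column names, as for any map
(contains? people :likes)   ; map-style membership test

(def ages (people :age))    ; a dataset is a function of column name
(ages 0)                    ; a column is a function of row index
(nth ages 2)                ; columns also support nth

(ds/row-count people)
(ds/column-count people)

;; dissoc returns a new dataset without the column.
(def without-likes (dissoc people :likes))
```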
-----
<div id="PrintOptions"></div>
## REPL Friendly Printing
REPL workflows are an important part of TMD, and so controlling what is printed (especially for larger datasets) is critical. Many options are provided, by metadata, to get the right information on the screen for perusal.
By default, printing is abbreviated; the helpful [print-all](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-print-all) function overrides this behavior to enable printing all rows.
In general, any option can be set like `(vary-meta ds assoc :print-column-max-width 10)`.
* Summary of [print metadata options](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/print.clj#L93)
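
Because print options live in metadata, they travel with the dataset value. A minimal sketch with hypothetical data:

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical data with one long string value.
(def people
  (ds/->dataset [{:name "fred"  :likes "pizza with extra anchovies"}
                 {:name "sally" :likes "opera"}]))

;; Set a print option through metadata; the data itself is unchanged.
(def narrow (vary-meta people assoc :print-column-max-width 10))
(println narrow)

;; print-all returns a dataset that prints every row.
(println (ds/print-all people))
```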
-----
<div id="ExploringDatasets"></div>
## Exploring Datasets
* [head](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-head) - obtains a dataset consisting of the first N rows of the input dataset.
* [tail](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-tail) - obtains a dataset consisting of the last N rows of the input dataset.
* [sample](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-sample) - samples N rows, randomly, as a dataset.
* [rand-nth](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rand-nth) - samples a single row of the dataset.
* [descriptive-stats](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-descriptive-stats) - produces a dataset of columnwise descriptive statistics.
* [brief](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-brief) - get descriptive statistics as Clojure data (an edn sequence of maps, one for each column).
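
A quick exploration session using the functions above might look like this sketch (hypothetical data; `sample` and `rand-nth` are random, so their results vary between runs).

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical data.
(def people
  (ds/->dataset [{:name "fred"  :age 42}
                 {:name "ethel" :age 42}
                 {:name "sally" :age 21}]))

(def first-two (ds/head people 2)) ; first 2 rows, as a dataset
(def last-one  (ds/tail people 1)) ; last row, as a dataset

(ds/sample people 2)               ; 2 randomly chosen rows
(ds/rand-nth people)               ; one randomly chosen row
(ds/brief people)                  ; per-column statistics as Clojure data
```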
-----
<div id="SelectingSubrects"></div>
## Selecting Subrects
Recall that since datasets are maps, `assoc`, `dissoc`, and `merge` all work at the dataset level - beyond that, consider these helpful subrect selection functions.
* [select-columns](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-select-columns) - select a subset of columns. Notably, this also controls column _order_ for downstream printing and serialization.
* [select-rows](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-select-rows) - get a specific subset of rows from a dataset or column.
* [select](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-select) - get a specific set of rows and columns in a single call, can be used for renaming.
* [drop-rows](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-drop-rows) - drop rows by index from a dataset or column.
* [drop-missing](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-drop-missing) - drop any rows with missing values from the dataset.
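
The selection functions above can be sketched as follows. The data is hypothetical; `sparse` is built from maps with a missing key so that one value is missing.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical data.
(def people
  (ds/->dataset [{:name "fred"  :age 42 :likes "pizza"}
                 {:name "ethel" :age 42 :likes "sushi"}
                 {:name "sally" :age 21 :likes "opera"}]))

;; Keep two columns, in the order given.
(def likes-first (ds/select-columns people [:likes :name]))

;; Keep rows 0 and 2; drop row 1.
(def picked  (ds/select-rows people [0 2]))
(def dropped (ds/drop-rows people [1]))

;; A row lacking a key produces a missing value in that column.
(def sparse   (ds/->dataset [{:a 1 :b 2} {:b 3}]))
(def complete (ds/drop-missing sparse))
```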
-----
<div id="ManipulatingDatasets"></div>
## Manipulating Datasets
* [new-dataset](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/impl/dataset.clj#L380) - Create a new dataset from a sequence of columns. Columns may be actual columns created via `tech.v3.dataset.column/new-column` or they could be maps containing at least keys `#:tech.v3.dataset{:name :data}` but also potentially `#:tech.v3.dataset{:metadata :missing}` in order to create a column with a specific set of missing values and metadata. `:force-datatype true` prevents the system from attempting to scan the data for missing values and will, e.g., create a float column from a vector of Float objects. The above also applies to using `clojure.core/assoc` with a dataset.
* ⭐ [row-map](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-map) ⭐ - maps a function from map->map in parallel over the dataset. The returned maps will be used to create or update columns in the output dataset, merging with the original. Note there are options to return a sequence of datasets as opposed to a single large final dataset.
* [row-mapcat](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-mapcat) - maps a function from map->sequence-of-maps over the dataset in parallel, potentially expanding or shrinking the result (in terms of row count). When expanding, row information not included in the original map is efficiently duplicated. Note there are options to return a sequence of datasets as opposed to a single potentially very large final dataset.
* [pd-merge](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.join.html#var-pd-merge) - implements generalized left, right, inner, outer, and cross joins. Allows combining datasets in a way familiar to users of traditional databases.
* [replace-missing](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-replace-missing), [replace-missing-value](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-replace-missing-value) - replace missing values in one or more columns.
* [group-by-column](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-group-by-column), [group-by](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-group-by) - creates a map of value to dataset with that value. These datasets are created via indexing into the original dataset for efficiency, so no data is copied.
* [sort-by-column](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-sort-by-column), [sort-by](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-sort-by) - sorts the dataset by column values.
* [filter-column](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-filter-column), [filter](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-filter) - produces a new dataset with only rows that pass a predicate.
* [concat-copying](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-concat-copying), [concat-inplace](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-concat-inplace) - produces a new dataset as a concatenation of supplied datasets. Copying can be more efficient than in-place, but uses more memory - `(apply ds/concat-copying x-seq)` is **far** more efficient than `(reduce ds/concat-copying x-seq)`; this also is true for `concat-inplace`.
* [unique-by-column](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-unique-by-column), [unique-by](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-unique-by) - removes duplicate rows. Passing in `keep-fn` allows you to choose either first, last, or some other criteria for rows that have the same values. For `unique-by`, `identity` will work just fine (rows have sane equality semantics).
* [pmap-ds](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-pmap-ds) - maps a function of ds->ds in parallel over batches of data in the dataset. Can return either a new dataset via concat-copying or a sequence of datasets.
* [left-join-asof](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.join.html#var-left-join-asof) - specialized join-nearest functionality useful for doing things like finding the nearest values in time in irregularly sampled data.
* [rolling](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.rolling.html#var-rolling) - fixed and variable rolling window operations.
* [group-by-column-agg](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html#var-group-by-column-agg) - unusually high performance primitive that logically combines `group-by` and `reduce` operations. Each key in the supplied map of reducers becomes a column in the output dataset.
* [neanderthal support](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.neanderthal.html) - transformations of datasets to/from neanderthal dense native matrixes.
* [tensor support](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.tensor.html) - transformations of datasets to/from [tech.v3.tensor](https://cnuernber.github.io/dtype-next/tech.v3.tensor.html) objects.
Many of the functions above come in `->column` variants, which can be faster by avoiding the creation of fully-realized output datasets with superfluous data. Moreover, some of these functions come in `->indexes` variants, which simply return indexes and thus skip creating sub-datasets. Operating in index space as such can be _very_ efficient.
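
A few of the functions above, sketched on hypothetical sales data. The `ds-reduce` alias is an assumption of this example, and the dataset is passed to `group-by-column-agg` inside a vector because that function is designed around sequences of datasets.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce])

;; Hypothetical data.
(def sales
  (ds/->dataset [{:region "west" :amount 10.0}
                 {:region "west" :amount 30.0}
                 {:region "east" :amount 5.0}]))

;; Sort rows by one column (ascending).
(def sorted (ds/sort-by-column sales :amount))

;; Keep one row per distinct region.
(def one-per-region (ds/unique-by-column sales :region))

;; Group and reduce in one pass; each key in the reducer map
;; becomes a column of the result.
(def totals
  (ds-reduce/group-by-column-agg
   :region
   {:total (ds-reduce/sum :amount)
    :n     (ds-reduce/row-count)}
   [sales]))
```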
-----
<div id="ElementwiseArithmetic"></div>
## Elementwise Arithmetic
Functions in the `tech.v3.datatype.functional` namespace operate elementwise on a column, lazily returning a new column. It is highly recommended to remove all missing values before using element-wise arithmetic as the `functional` namespace has no knowledge of missing values. Integer columns with missing values will be upcast to float or double columns in order to support a missing value indicator.
Note the use of `dfn` from `(require '[tech.v3.datatype.functional :as dfn])`:
```clojure
(assoc ds :value (dtype/elemwise-cast (ds :value) :int64)
:shrs-or-prn-amt (dtype/elemwise-cast (ds :shrs-or-prn-amt) :int64)
:cik (dtype/const-reader (:cik filing) (ds/row-count ds))
:investor (dtype/const-reader investor (ds/row-count ds))
:form-type (dtype/const-reader form-type (ds/row-count ds))
:edgar-id (dtype/const-reader (:edgar-id filing) (ds/row-count ds))
:weight (dfn// (ds :value)
(double (dfn/reduce-+ (ds :value)))))
```
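The excerpt above refers to values (`filing`, `investor`, `form-type`) defined elsewhere. A self-contained sketch of the same `:weight` calculation, on hypothetical holdings data, might look like this:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.functional :as dfn])

;; Hypothetical data; there are no missing values in :value.
(def holdings
  (ds/->dataset [{:ticker "A" :value 1}
                 {:ticker "B" :value 1}
                 {:ticker "C" :value 2}]))

;; Each value divided by the column total, computed element-wise.
(def weighted
  (assoc holdings
         :weight (dfn// (holdings :value)
                        (double (dfn/reduce-+ (holdings :value))))))

(vec (weighted :weight))
```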
-----
<div id="ForcingLazyEvaluation"></div>
## Forcing Lazy Evaluation
In pandas or with R's `data.table`s one frequently needs to consider making a copy of some data before operating on it. Making too many copies uses too much memory, making too few copies leads to confusing non-local overwrites of data. There is untold lossage in these nonfunctional notions of dataset processing. Unlike these, TMD's datasets are functional.
TMD's functional datasets rely on index indirection, laziness, and structural sharing to simplify the mental model necessary to reason about their operation. This allows low-cost aggregation of operations, and eliminates most wondering about whether making a copy is necessary or not (it's generally not). However, these indirections sometimes increase read costs.
At any time, `clone` can be used to make a clean copy of the dataset that relies on no indirect computation, and stores the data separately, so there is no chance of accidental overwrites. Clone is multithreaded and very efficient, boiling down to parallelized iteration over the data and `System/arraycopy` calls. Moreover, calling `clone` can reduce the in-memory size of the dataset by a bit - sometimes 20%, by converting `List`s that have some overhead into arrays that have no extra capacity.
* [tech.v3.datatype/clone](https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/src/tech/v3/datatype.clj#L95) - clones the dataset realizing lazy operations and copying the data into java arrays. Operates on datasets and columns.
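
Cloning after a chain of operations can be sketched as below (hypothetical data). The clone holds the same values as the filtered dataset, but in freshly copied storage.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype :as dtype])

;; Hypothetical data.
(def people
  (ds/->dataset [{:name "fred"  :age 42}
                 {:name "ethel" :age 42}
                 {:name "sally" :age 21}]))

;; The filtered dataset may refer back to the original through indexes.
(def over-30 (ds/filter-column people :age #(> % 30)))

;; clone realizes any pending indirection into concrete storage.
(def over-30-copy (dtype/clone over-30))
```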
---
## Additional Selling Points
Sophisticated support for [Apache Arrow](https://techascent.github.io/tech.ml.dataset/tech.v3.libs.arrow.html), including mmap support for JDK-8 through JDK-17 (on an M1 Mac you will need JDK-17). With arrow, per-column compression (LZ4, ZSTD) is available across all supported platforms. At the time of writing, the official Arrow SDK does not support mmap or JDK-17, and has no user-accessible way to save a compressed streaming format file.
Support is provided for operating on _sequences_ of datasets, enabling working on larger, potentially out-of-memory workloads. This is consistent with the design of the parquet and arrow data storage systems, and aggregation operations for sequences of datasets are efficiently implemented in the
[tech.v3.dataset.reductions](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html) namespace.
Preliminary support for algorithms from the [Apache Data Sketches](https://datasketches.apache.org/) system can be found in the [apache-data-sketch](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.apache-data-sketch.html) namespace. Summations/means in this area are implemented using the
[Kahan compensated summation](https://en.wikipedia.org/wiki/Kahan_summation_algorithm) algorithm.
### Efficient Rowwise Operations
TMD uses efficient parallelized mechanisms to operate on data for rowwise map and mapcat operations. Argument functions are passed maps that lazily read only the required data from the underlying dataset (huge savings over reading all the data). TMD scans the returned maps from the argument function for datatype and missing information. Columns derived from the mapping operation overwrite columns in the original dataset - the powerful `row-map` function works this way.
The mapping operations are run in parallel using a primitive named `pmap-ds` and the resulting datasets can either be returned in a sequence or combined into a single larger dataset.
* [row-map](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-map)
* [row-mapcat](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-mapcat)
* [rows](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rows)
* [rowvecs](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rowvecs)
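
A rowwise expansion with `row-mapcat` might look like the sketch below. The comma-separated `:likes` values are hypothetical; each input row yields one output row per entry, and the `:name` value is carried over from the original row.

```clojure
(require '[tech.v3.dataset :as ds]
         '[clojure.string :as str])

;; Hypothetical data with several values packed into one string.
(def people
  (ds/->dataset [{:name "fred"  :likes "pizza,sushi"}
                 {:name "sally" :likes "opera"}]))

;; Return a sequence of maps per row; keys in the returned maps
;; overwrite the corresponding columns in the output.
(def one-like-per-row
  (ds/row-mapcat people
                 (fn [row]
                   (for [like (str/split (:likes row) #",")]
                     {:likes like}))))
```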
---
## 📚 Additional Documentation 📚
The best place to start is the "Getting Started" topic in the documentation: [https://techascent.github.io/tech.ml.dataset/000-getting-started.html](https://techascent.github.io/tech.ml.dataset/000-getting-started.html)
The "Walkthrough" topic provides long-form examples of processing real data: [https://techascent.github.io/tech.ml.dataset/100-walkthrough.html](https://techascent.github.io/tech.ml.dataset/100-walkthrough.html)
The API docs document every available function: [https://techascent.github.io/tech.ml.dataset/](https://techascent.github.io/tech.ml.dataset/)
The provided Java API ([javadoc](https://techascent.github.io/tech.ml.dataset/javadoc/tech/v3/TMD.html) / [with frames](https://techascent.github.io/tech.ml.dataset/javadoc/index.html)) and sample program ([source](java_test/java/jtest/TMDDemo.java)) show how to use TMD from Java.
+170
@@ -0,0 +1,170 @@
# tech.ml.dataset Columns, Readers, and Datatypes
In `tech.ml.dataset`, columns are composed of three things:
[data, metadata, and the missing set](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/impl/column.clj#L140).
The column's datatype is the datatype of the `data` member. The data member can
be anything convertible to a `tech.v3.datatype` reader of the appropriate type.
Buffers are a [simple abstraction](https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/java/tech/v3/datatype/Buffer.java) of typed random access read-only
memory that implement all the interfaces required to be both efficient and easy to use.
You can create a buffer by reifying the appropriately typed interface from
`tech.v3.datatype` but the datatype library has
[quick paths](https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/src/tech/v3/datatype.clj#L102) to creating these:
```clojure
user> (require '[tech.v3.datatype :as dtype])
nil
user> (dtype/make-reader :float32 5 idx)
[0.0 1.0 2.0 3.0 4.0]
user> (dtype/make-reader :float32 5 (* 2 idx))
[0.0 2.0 4.0 6.0 8.0]
```
A read-only buffer only needs three methods - `elemwiseDatatype` (optional), `lsize`, and
`read[X]`. `read[X]` is typed to the datatype, so for instance in the example above,
`readFloat` returns a primitive float. `lsize` returns a long. Unlike the
similar method `get` in Java lists, the `read[X]` methods take a long. This allows us
to use read methods on storage mechanisms capable of addressing more than 2 (signed int)
or 4 (unsigned int) billion addresses.
Another way to create a reader is to do a 'map' type translation from one or more other
readers. This is provided in two ways:
* [`dtype/emap`](https://github.com/cnuernber/dtype-next/blob/152f09f925041d41782e05009bbf84d7d6cfdbc6/src/tech/v3/datatype/emap.clj#L97) - Missing set ignorant mapping into a typed representation.
* [`tech.v3.dataset.column/column-map`](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/column.clj#L174) - Missing set aware mapping into a typed representation.
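
The two mapping pathways can be sketched as follows. The data is hypothetical, and the `ds-col` alias is an assumption of this example; the second row of `prices` lacks a `:price` key, so that value is missing.

```clojure
(require '[tech.v3.datatype :as dtype]
         '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.column :as ds-col])

;; emap: lazily maps a function over one or more readers,
;; with no knowledge of missing values.
(def base    (dtype/make-reader :float64 3 idx))
(def doubled (dtype/emap (fn [x] (* 2.0 x)) :float64 base))

;; column-map: same idea for columns, but the missing set
;; of the input is carried over to the result.
(def prices
  (ds/->dataset [{:id 0 :price 10.0}
                 {:id 1}
                 {:id 2 :price 30.0}]))

(def scaled
  (ds-col/column-map (fn [p] (* 2.0 p)) :float64 (prices :price)))

(ds/missing scaled) ; indexes of missing values in the result
```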
The dataset system in general is smart enough to create columns out of readers in most
situations. So for instance if you have a dataset and you want a column of a
particular type, you can add-or-update-column and pass in a reader that implements what
you want:
```clojure
user> (def stocks (ds/->dataset "test/data/stocks.csv"))
#'user/stocks
user> (ds/head stocks)
test/data/stocks.csv [5 3]:
| symbol | date | price |
|--------+------------+-------|
| MSFT | 2000-01-01 | 39.81 |
| MSFT | 2000-02-01 | 36.35 |
| MSFT | 2000-03-01 | 43.22 |
| MSFT | 2000-04-01 | 28.37 |
| MSFT | 2000-05-01 | 25.45 |
user> (ds/head (ds/add-or-update-column stocks "id"
(dtype/make-reader :int64
(ds/row-count stocks)
idx)))
test/data/stocks.csv [5 4]:
| symbol | date | price | id |
|--------+------------+-------+----|
| MSFT | 2000-01-01 | 39.81 | 0 |
| MSFT | 2000-02-01 | 36.35 | 1 |
| MSFT | 2000-03-01 | 43.22 | 2 |
| MSFT | 2000-04-01 | 28.37 | 3 |
| MSFT | 2000-05-01 | 25.45 | 4 |
```
There are many different datatypes currently used in the datatype system -
the primitive numeric types:
* `:boolean` - convert to and from 0 (false) or 1 (true) when used as a number.
* `:int8`,`:uint8` - signed/unsigned bytes.
* `:int16`,`:uint16` - signed/unsigned shorts.
* `:int32`,`:uint32` - signed/unsigned ints.
* `:int64` - signed longs (haven't figured out unsigned longs really yet).
* `:float32`, `:float64` - floats, doubles respectively.
There are more types that can be represented by primitives (they 'alias' the primitive
type) but we will leave that for another article.
Outside of the primitive types (and types aliased to primitive types), we have an
unbounded set of object types. Any datatype the system doesn't understand is treated
as type `:object` during generic operations.
One very important aspect to note is that columns marked as `:object` datatypes will
use the Clojure numerics stack during mathematical operations. This is
important because the Clojure number tower, similar to the APL number tower,
actively promotes values to the next appropriate size and is thus less error prone
to use if you aren't absolutely certain of your value range and how it interacts with
your arithmetic pathways.
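To make the difference concrete, here is a small sketch in plain Clojure with no dataset code involved. It only shows how the Clojure numerics stack behaves next to fixed-width primitive arithmetic; it does not claim anything about which path a particular dataset operation takes. Note that plain `+` on longs throws on overflow rather than wrapping, and the `+'` family is the one that promotes to BigInt.

```clojure
;; Integer division keeps an exact ratio instead of truncating.
(/ 1 3)                          ;; => 1/3

;; Mixing a long with a double promotes the result to a double.
(+ 1 2.5)                        ;; => 3.5

;; The auto-promoting operators widen to BigInt on long overflow.
(+' Long/MAX_VALUE 1)            ;; => 9223372036854775808N

;; Unchecked primitive arithmetic silently wraps around instead.
(unchecked-add Long/MAX_VALUE 1) ;; => -9223372036854775808
```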
```clojure
user> (require '[tech.v3.dataset :as ds])
nil
user> (require '[tech.v3.datatype :as dtype])
nil
user> (def stocks (ds/->dataset "test/data/stocks.csv"))
#'user/stocks
user> (require '[tech.v3.datatype.functional :as dfn])
nil
user> (def stocks-lag
(assoc stocks "price-lag"
(let [price-data (dtype/->reader (stocks "price"))]
(dtype/make-reader :float64 (.lsize price-data)
(.readDouble price-data
(max 0 (dec idx)))))))
#'user/stocks-lag
user> (ds/head (assoc stocks-lag "price-lag-diff" (dfn/- (stocks-lag "price")
(stocks-lag "price-lag"))))
test/data/stocks.csv [5 5]:
| symbol | date | price | price-lag | price-lag-diff |
|--------+------------+-------+-----------+----------------|
| MSFT | 2000-01-01 | 39.81 | 39.81 | 0.000 |
| MSFT | 2000-02-01 | 36.35 | 39.81 | -3.460 |
| MSFT | 2000-03-01 | 43.22 | 36.35 | 6.870 |
| MSFT | 2000-04-01 | 28.37 | 43.22 | -14.85 |
| MSFT | 2000-05-01 | 25.45 | 28.37 | -2.920 |
```
All these operations are intrinsically lazy, so values are only calculated when
requested. This is usually fine but in some cases it may be desired to force
the calculation of a particular column completely (like in the instance where
the calculation is particularly expensive). One way to force the column
efficiently is to clone it:
```clojure
user> (ds/head (ds/update-column stocks-lag "price-lag" dtype/clone))
test/data/stocks.csv [5 4]:
| symbol | date | price | price-lag |
|--------+------------+-------+-----------|
| MSFT | 2000-01-01 | 39.81 | 39.81 |
| MSFT | 2000-02-01 | 36.35 | 39.81 |
| MSFT | 2000-03-01 | 43.22 | 36.35 |
| MSFT | 2000-04-01 | 28.37 | 43.22 |
| MSFT | 2000-05-01 | 25.45 | 28.37 |
```
If we now get the actual type of the column's data member, we can see that it is
a concrete type.
```clojure
user> (-> (ds/update-column stocks-lag "price-lag" dtype/clone)
(get "price-lag")
(dtype/as-concrete-buffer))
#array-buffer<float64>[560]
[39.81, 39.81, 36.35, 43.22, 28.37, 25.45, 32.54, 28.40, 28.40, 24.53, 28.02, 23.34, 17.65, 24.84, 24.00, 22.25, 27.56, 28.14, 29.70, 26.93, ...]
```
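The same idea can be seen without a dataset at all. The sketch below is illustrative only: `squares` is a made-up lazy reader, and the comments about `as-concrete-buffer` describe what we would expect (nothing concrete behind a purely lazy reader, an array-backed buffer after cloning) rather than verified output.

```clojure
(require '[tech.v3.datatype :as dtype])

;; A lazy reader: every read recomputes (* idx idx).
(def squares (dtype/make-reader :int64 5 (* idx idx)))

;; clone copies the values into a concrete container of the same datatype.
(def realized (dtype/clone squares))

(vec squares)  ;; => [0 1 4 9 16]
(vec realized) ;; => [0 1 4 9 16]

;; Expected: nil for the lazy reader, a concrete buffer for the clone.
(dtype/as-concrete-buffer squares)
(dtype/as-concrete-buffer realized)
```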
This ability to lazily define a column via an interface implementation and still
operate on that column efficiently separates the implementation of
the `tech.ml.dataset` library from other libraries in this field. This is likely
to bring a different set of advantages and disadvantages that will
present themselves over time. The dataset library is very loosely bound to the
underlying data representation, allowing it to represent data that is much larger
than can fit in memory and allowing dynamic column definitions to be defined at
program runtime as equations and extensions derived from other sources of data.
+232
View File
@@ -0,0 +1,232 @@
# tech.ml.dataset And nippy
We are big fans of the [nippy system](https://github.com/ptaoussanis/nippy) for
freezing/thawing data. So we were pleasantly surprised by how well it performs
with datasets and how easy it was to extend the dataset object to support nippy
natively.
## Nippy Hits One Out Of the Park
We start with a decent size gzipped tab-delimited file.
```console
chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
total 44M
drwxrwxr-x 2 chrisn chrisn 4.0K Jun 18 13:27 .
drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:27 ..
-rw-rw-r-- 1 chrisn chrisn 44M Jun 18 13:27 2010.tsv.gz
```
```clojure
user> (def ds-2010 (time (ds/->dataset
"nippy-demo/2010.tsv.gz"
{:parser-fn {"date" [:packed-local-date "yyyy-MM-dd"]}})))
"Elapsed time: 8588.080218 msecs"
#'user/ds-2010
user> ;;rename column names so the tables print nicely
user> (def ds-2010
(ds/select-columns ds-2010
(->> (ds/column-names ds-2010)
(map (fn [oldname]
[oldname (.replace ^String oldname "_" "-")]))
(into {}))))
user> ds-2010
nippy-demo/2010.tsv.gz [2769708 12]:
| low | comp-name-2 | high | currency-code | comp-name | m-ticker | ticker | close | volume | exchange | date | open |
|-------:|-------------|-------:|---------------|------------|----------|--------|-------:|---------------:|----------|------------|-------:|
| | | | USD | ALCOA CORP | AA2 | AA | 48.365 | | NYSE | 2010-01-01 | |
| 49.355 | | 51.065 | USD | ALCOA CORP | AA2 | AA | 51.065 | 1.10618840E+07 | NYSE | 2010-01-08 | 49.385 |
| | | | USD | ALCOA CORP | AA2 | AA | 46.895 | | NYSE | 2010-01-18 | |
| 39.904 | | 41.854 | USD | ALCOA CORP | AA2 | AA | 40.624 | 1.46292500E+07 | NYSE | 2010-01-26 | 40.354 |
| 40.294 | | 41.674 | USD | ALCOA CORP | AA2 | AA | 40.474 | 1.20107520E+07 | NYSE | 2010-02-03 | 40.804 |
| 39.304 | | 40.504 | USD | ALCOA CORP | AA2 | AA | 39.844 | 1.46702890E+07 | NYSE | 2010-02-09 | 40.084 |
| 39.574 | | 40.264 | USD | ALCOA CORP | AA2 | AA | 39.844 | 1.53728400E+07 | NYSE | 2010-02-12 | 39.994 |
| 40.324 | | 41.104 | USD | ALCOA CORP | AA2 | AA | 40.624 | 7.72947100E+06 | NYSE | 2010-02-22 | 41.044 |
| 39.664 | | 40.564 | USD | ALCOA CORP | AA2 | AA | 39.724 | 1.08365810E+07 | NYSE | 2010-03-02 | 40.234 |
```
Our 44MB gzipped tsv produced about 2.77 million rows and 12 columns.
Let's check the ram usage:
```clojure
user> (require '[clj-memory-meter.core :as mm])
nil
user> (mm/measure ds-2010)
"121.5 MB"
```
Now, let's save to an uncompressed nippy file:
```clojure
user> (require '[tech.io :as io])
nil
user> (time (io/put-nippy! "nippy-demo/2010.nippy" ds-2010))
"Elapsed time: 1069.781703 msecs"
nil
```
One second, pretty nice :-).
What is the file size?
```console
chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
total 95M
drwxrwxr-x 2 chrisn chrisn 4.0K Jun 18 13:38 .
drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:36 ..
-rw-rw-r-- 1 chrisn chrisn 51M Jun 18 13:38 2010.nippy
-rw-rw-r-- 1 chrisn chrisn 44M Jun 18 13:27 2010.tsv.gz
```
Not bad: the 51MB uncompressed nippy file is only slightly larger than the 44MB gzipped tsv.
The load performance, however, is spectacular:
```clojure
user> (def loaded-2010 (time (io/get-nippy "nippy-demo/2010.nippy")))
"Elapsed time: 314.502715 msecs"
#'user/loaded-2010
user> (mm/measure loaded-2010)
"93.9 MB"
user> loaded-2010
nippy-demo/2010.tsv.gz [2769708 12]:
| low | comp-name-2 | high | currency-code | comp-name | m-ticker | ticker | close | volume | exchange | date | open |
|-------:|-------------|-------:|---------------|------------|----------|--------|-------:|---------------:|----------|------------|-------:|
| | | | USD | ALCOA CORP | AA2 | AA | 48.365 | | NYSE | 2010-01-01 | |
| 49.355 | | 51.065 | USD | ALCOA CORP | AA2 | AA | 51.065 | 1.10618840E+07 | NYSE | 2010-01-08 | 49.385 |
| | | | USD | ALCOA CORP | AA2 | AA | 46.895 | | NYSE | 2010-01-18 | |
| 39.904 | | 41.854 | USD | ALCOA CORP | AA2 | AA | 40.624 | 1.46292500E+07 | NYSE | 2010-01-26 | 40.354 |
| 40.294 | | 41.674 | USD | ALCOA CORP | AA2 | AA | 40.474 | 1.20107520E+07 | NYSE | 2010-02-03 | 40.804 |
| 39.304 | | 40.504 | USD | ALCOA CORP | AA2 | AA | 39.844 | 1.46702890E+07 | NYSE | 2010-02-09 | 40.084 |
| 39.574 | | 40.264 | USD | ALCOA CORP | AA2 | AA | 39.844 | 1.53728400E+07 | NYSE | 2010-02-12 | 39.994 |
```
It takes about 8.6 seconds to load the tsv. It takes 315 milliseconds to load the nippy!
That is great :-).
The resulting dataset is somewhat smaller in memory. This is because when we
parse a dataset we use fastutil lists and append elements to them and then return a
dataset that sits directly on top of those lists as the column storage mechanism. Those lists have a bit
more capacity than absolutely necessary.
When we save the data, we convert the data into base java/clojure datastructures
such as primitive arrays. This is what makes things smaller: converting from a list
with a bit of extra capacity allocated to an exact sized array. This operation is
optimized and hits System/arraycopy under the covers as fastutil lists use arrays as
the backing store and we make sure of the rest with `tech.datatype`.
## Gzipping The Nippy
We can do a bit better. If you are really concerned about dataset size on disk, we
can save out a gzipped nippy:
```clojure
user> (time (io/put-nippy! (io/gzip-output-stream! "nippy-demo/2010.nippy.gz") ds-2010))
"Elapsed time: 7026.500505 msecs"
nil
```
This beats the gzipped tsv in terms of size by roughly 10% (40MB versus 44MB):
```console
chrisn@chrisn-lt-01:~/dev/tech.all/tech.ml.dataset/nippy-demo$ ls -alh
total 134M
drwxrwxr-x 2 chrisn chrisn 4.0K Jun 18 13:47 .
drwxr-xr-x 13 chrisn chrisn 4.0K Jun 18 13:36 ..
-rw-rw-r-- 1 chrisn chrisn 51M Jun 18 13:38 2010.nippy
-rw-rw-r-- 1 chrisn chrisn 40M Jun 18 13:47 2010.nippy.gz
-rw-rw-r-- 1 chrisn chrisn 44M Jun 18 13:27 2010.tsv.gz
```
And now it takes twice the time to load:
```clojure
user> (def loaded-gzipped-2010 (time (io/get-nippy (io/gzip-input-stream "nippy-demo/2010.nippy.gz"))))
"Elapsed time: 680.165118 msecs"
#'user/loaded-gzipped-2010
user> (mm/measure loaded-gzipped-2010)
"93.9 MB"
```
You can probably handle load times in the 700ms range if you have a strong reason to
have data compressed on disk.
## Intermix With Clojure Data
Another aspect of nippy that is really valuable is that it can save/load datasets that
are parts of arbitrary datastructures. So for example you can save
the result of `group-by-column`:
```clojure
user> (def tickers (ds/group-by-column ds-2010 "ticker"))
#'user/tickers
user> (type tickers)
clojure.lang.PersistentHashMap
user> (count tickers)
11532
user> (first tickers)
["RBYCF" RBYCF [261 12]:
| low | comp_name_2 | high | currency_code | comp_name | m_ticker | ticker | close | volume | exchange | date | open |
|--------:|-------------|--------:|---------------|---------------|----------|--------|--------:|---------:|----------|------------|--------:|
| | | | USD | RUBICON MNRLS | RUBI | RBYCF | 759.677 | | OTC | 2010-01-01 | |
| 795.161 | | 827.419 | USD | RUBICON MNRLS | RUBI | RBYCF | 800.000 | 3596.775 | OTC | 2010-01-12 | 816.129 |
| 741.935 | | 779.032 | USD | RUBICON MNRLS | RUBI | RBYCF | 758.064 | 5490.292 | OTC | 2010-01-20 | 779.032 |
| 645.161 | | 688.710 | USD | RUBICON MNRLS | RUBI | RBYCF | 682.258 | 6201.953 | OTC | 2010-01-28 | 669.355 |
| 685.484 | | 725.806 | USD | RUBICON MNRLS | RUBI | RBYCF | 687.097 | 3491.220 | OTC | 2010-02-08 | 714.516 |
| 750.000 | | 783.871 | USD | RUBICON MNRLS | RUBI | RBYCF | 770.968 | 2927.057 | OTC | 2010-02-17 | 780.645 |
...
```
`group-by` and `group-by-column` both return persistent maps of key->dataset.
```clojure
user> (tech.io/put-nippy! "ticker-sorted.nippy" tickers)
nil
user> (def loaded-tickers (tech.io/get-nippy "ticker-sorted.nippy"))
#'user/loaded-tickers
user> (count loaded-tickers)
11532
user> (first loaded-tickers)
["RBYCF" RBYCF [261 12]:
| low | comp_name_2 | high | currency_code | comp_name | m_ticker | ticker | close | volume | exchange | date | open |
|--------:|-------------|--------:|---------------|---------------|----------|--------|--------:|---------:|----------|------------|--------:|
| | | | USD | RUBICON MNRLS | RUBI | RBYCF | 759.677 | | OTC | 2010-01-01 | |
| 795.161 | | 827.419 | USD | RUBICON MNRLS | RUBI | RBYCF | 800.000 | 3596.775 | OTC | 2010-01-12 | 816.129 |
| 741.935 | | 779.032 | USD | RUBICON MNRLS | RUBI | RBYCF | 758.064 | 5490.292 | OTC | 2010-01-20 | 779.032 |
| 645.161 | | 688.710 | USD | RUBICON MNRLS | RUBI | RBYCF | 682.258 | 6201.953 | OTC | 2010-01-28 | 669.355 |
| 685.484 | | 725.806 | USD | RUBICON MNRLS | RUBI | RBYCF | 687.097 | 3491.220 | OTC | 2010-02-08 | 714.516 |
| 750.000 | | 783.871 | USD | RUBICON MNRLS | RUBI | RBYCF | 770.968 | 2927.057 | OTC | 2010-02-17 | 780.645 |
```
Thus datasets can be used in maps, vectors, you name it and you can load/save those
really complex datastructures. That can be a big help for complex dataflows.
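A smaller, in-memory version of the same idea can be sketched with nippy directly. This assumes `taoensso.nippy` is on the classpath and that loading `tech.v3.dataset` is enough to register the dataset freeze/thaw support described here; the `small` dataset and the `payload` map are made up for illustration.

```clojure
(require '[tech.v3.dataset :as ds]
         '[taoensso.nippy :as nippy])

;; Illustrative data only.
(def small (ds/->dataset [{:ticker "AA" :close 48.365}
                          {:ticker "AA" :close 51.065}
                          {:ticker "BB" :close 12.5}]))

;; A plain Clojure map holding a dataset next to ordinary values.
(def payload {:raw small
              :note "demo payload"
              :tags [:prices :y2010]})

;; Freeze the whole structure to a byte array, then thaw it back.
(def restored (nippy/thaw (nippy/freeze payload)))

(ds/row-count (:raw restored))     ;; => 3
(vec ((:raw restored) :close))     ;; => [48.365 51.065 12.5]
(:note restored)                   ;; => "demo payload"
```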
## Simple Implementation
Our implementation of save/load for this pathway goes through two public functions:
* [dataset->data](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/base.clj#L666) - Convert a dataset into a pure
clojure/java datastructure suitable for serialization. Data is in arrays and string
tables have been slightly deconstructed.
* [data->dataset](https://github.com/techascent/tech.ml.dataset/blob/7c8c7514e0e35995050c1e326122a1826cc18273/src/tech/v3/dataset/base.clj#L694) - Given a data-description of a
dataset create a new dataset. This is mainly a zero copy operation so it should be
quite quick.
Near those functions you can see how easy it was to implement direct nippy support for
the dataset object itself. Really nice, Nippy is truly a great library :-).
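As a rough sketch of that round trip, assuming both functions are also exposed from the `tech.v3.dataset` namespace as `ds/dataset->data` and `ds/data->dataset` (the `small` dataset is made up for illustration, and we deliberately don't show the exact shape of the intermediate data):

```clojure
(require '[tech.v3.dataset :as ds])

;; Illustrative data only.
(def small (ds/->dataset [{:a 1 :b "x"}
                          {:a 2 :b "y"}]))

;; Convert to plain clojure/java data suitable for serialization.
(def as-data (ds/dataset->data small))

;; Rebuild a dataset from that description.
(def restored (ds/data->dataset as-data))

(= (ds/column-names small) (ds/column-names restored)) ;; => true
(vec (restored :a))                                    ;; => [1 2]
(vec (restored :b))                                    ;; => ["x" "y"]
```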
+309
View File
@@ -0,0 +1,309 @@
# tech.ml.dataset Supported Datatypes
`tech.ml.dataset` supports a wide range of datatypes and has a system for expanding
the supported datatype set, aliasing new names to existing datatypes, and packing
object datatypes into primitive containers. Let's walk through each of these topics
and finally see how they relate to actually getting data into and out of a dataset.
## Typesystem Fundamentals
### Base Concepts
There are two fundamental namespaces that describe the entire type system for
`dtype-next` derived projects. The first is the [casting
namespace](https://github.com/cnuernber/dtype-next/blob/master/src/tech/v3/datatype/casting.clj),
which registers the various datatypes and has maps describing the current set of
datatypes. `dtype-next` has a simple typesystem in order to support primitive
unsigned types which are completely unsupported on the JVM otherwise.
If we just load the casting namespace we see the base dtype-next datatypes:
```clojure
user> (require '[tech.v3.datatype.casting :as casting])
nil
user> @casting/valid-datatype-set
#{:byte :int8 :float32 :long :bool :int32 :int :object :float64 :string :uint64 :uint16 :boolean :short :double :char :keyword :uint8 :uuid :uint32 :int16 :float :int64}
```
Now if we load the dtype-next namespace we see quite a few more datatypes registered:
```clojure
user> (require '[tech.v3.datatype :as dtype])
nil
user> @casting/valid-datatype-set
#{:byte :int8 :float32 :char-array :int :object-array :float64 :list :uint64 :uint16 :char :int64-array :uint8 :int32-array :boolean-array :persistent-map :persistent-vector :persistent-set :float :long :bool :int32 :object :int16-array :string :boolean :short :float64-array :double :float32-array :keyword :uuid :int8-array :native-buffer :uint32 :array-buffer :int16 :int64}
```
Right away you can perhaps tell that there is a dynamic mechanism for registering more datatypes;
we will get to that later. This set ties into the dtype-next [datatype api](https://cnuernber.github.io/dtype-next/tech.v3.datatype.html#var-datatype):
```clojure
user> (dtype/datatype (java.util.UUID/randomUUID))
:uuid
user> (dtype/datatype (int 10))
:int32
user> (dtype/datatype (float 10))
:float32
user> (dtype/datatype (double 10))
:float64
```
If we have a container of data one important question we have is what type of data is in
the container. This is where the [elemwise-datatype api](https://cnuernber.github.io/dtype-next/tech.v3.datatype.html#var-elemwise-datatype) comes in:
```clojure
user> (dtype/elemwise-datatype (float-array 10))
:float32
user> (dtype/elemwise-datatype (int-array 10))
:int32
```
Given two (or more) numeric datatypes we can ask the typesystem which datatype a combined
operation, such as `+`, should operate in:
```clojure
user> (casting/widest-datatype :float32 :int32)
:float64
```
The root of our type system is the object datatype. All types can be represented by the object
datatype, albeit at some cost. Generic containers such as persistent vectors or java
`ArrayList`s, and generic sequences produced by operations such as `map`, do not have any
information about the type of data they contain and thus they have the datatype of `:object`:
```clojure
user> (dtype/elemwise-datatype (range 10))
:int64
user> (dtype/elemwise-datatype (vec (range 10)))
:object
```
If we include the dataset api then we see the typesystem is extended to include support for
various datetime types:
```clojure
user> (require '[tech.v3.dataset :as ds])
nil
user> @casting/valid-datatype-set
#{:byte :int8 :local-date-time :float32 :char-array :int :object-array :epoch-milliseconds :uint64 :char :packed-instant :uint8 :bitmap :int32-array :boolean-array :persistent-map :persistent-vector :days :tensor :persistent-set :seconds :long :microseconds :int32 :boolean :short :double :epoch-days :float32-array :instant :zoned-date-time :keyword :dataset :text :native-buffer :array-buffer :years :int64 :epoch-microseconds :milliseconds :float64 :list :uint16 :int64-array :nanoseconds :duration :packed-duration :float :bool :object :int16-array :string :hours :float64-array :epoch-seconds :packed-local-date :epoch-hours :uuid :weeks :local-date :int8-array :uint32 :int16}
```
Given a container with a specific datatype we can create a new read-only representation
with the datatype that we desire using [make-reader](https://cnuernber.github.io/dtype-next/tech.v3.datatype.html#var-make-reader):
```clojure
user> (def generic-data (vec (range 10)))
#'user/generic-data
user> generic-data
[0 1 2 3 4 5 6 7 8 9]
user> (dtype/make-reader :float32 (count generic-data) (float (generic-data idx)))
[0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0]
```
The default definitions of the datetime datatypes are in [datetime/base.clj](https://github.com/cnuernber/dtype-next/blob/master/src/tech/v3/datatype/datetime/base.clj#L398).
### Packing
The second fundamental concept to the typesystem is the concept of packing which is storing
a java object in a primitive datatype. This allows us to use `:int64` data to represent
`java.time.Instant` objects and `:int32` data to represent `java.time.LocalDate` objects.
This compression has both speed and size benefits especially when it comes to serializing
the data. It also allows us to support parquet and apache arrow file formats more
transparently because they represent, e.g. `LocalDate` objects as epoch days. Currently
only datetime objects are packed.
Packing has generic support in the underlying buffer system so that it works in an integrated
fashion throughout the system.
```clojure
user> (dtype/make-container :packed-local-date (repeat 10 (java.time.LocalDate/now)))
#array-buffer<packed-local-date>[10]
[2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15]
user> (def packed *1)
#'user/packed
user> (def unpacked (dtype/make-container :local-date (repeat 10 (java.time.LocalDate/now))))
#'user/unpacked
user> unpacked
#array-buffer<local-date>[10]
[2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15]
user> (.readLong (dtype/->reader packed) 0)
18976
user> (.readObject (dtype/->reader packed) 0)
#object[java.time.LocalDate 0x2d867250 "2021-12-15"]
user> (.toEpochDay *1)
18976
user> (.readLong (dtype/->reader unpacked) 0)
Execution error at tech.v3.datatype.NumericConversions/numberCast (NumericConversions.java:22).
Invalid argument
user> (.readObject (dtype/->reader unpacked) 0)
#object[java.time.LocalDate 0x2f13a18c "2021-12-15"]
```
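The idea behind a packed local date can be sketched with nothing but `java.time`. This is not the library's packing code, just the underlying conversion between a `LocalDate` and its epoch-day integer that the transcript above reads back with `.readLong`:

```clojure
(import '[java.time LocalDate])

(def ^LocalDate date (LocalDate/of 2021 12 15))

;; "Pack": a LocalDate becomes its count of days since 1970-01-01.
(def packed-value (.toEpochDay date)) ;; => 18976

;; "Unpack": the integer becomes a LocalDate again.
(def unpacked (LocalDate/ofEpochDay packed-value))

(= date unpacked) ;; => true
```

Storing the integer instead of the object is what lets a whole column live in a primitive array.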
Packing is defined in the namespace [tech.v3.datatype.packing](https://github.com/cnuernber/dtype-next/blob/master/src/tech/v3/datatype/packing.clj). We can add new packed datatypes
but I strongly suggest avoiding this in general. While it certainly works well it is usually
unnecessary and less clear than simply defining an alias and conversion methods to/from
the alias.
The best example of using the packing system is the definition of the [datetime packed
datatypes](https://github.com/cnuernber/dtype-next/blob/master/src/tech/v3/datatype/datetime/packing.clj).
### Aliasing Datatypes
C/C++ contain the concept of datatype aliasing in the `typedef` keyword. For our use cases
it is useful, especially when dealing with datetime types, to alias some datatypes to integers
of various sizes so you can have a container of `:milliseconds` and such. You can see several
examples in the aforementioned [datetime/base.clj](https://github.com/cnuernber/dtype-next/blob/master/src/tech/v3/datatype/datetime/base.clj#L398).
```clojure
user> (casting/alias-datatype! :foobar :float32)
#{:byte :int8 :local-date-time :float32 :char-array :int :object-array :epoch-milliseconds :uint64 :char :packed-instant :uint8 :bitmap :int32-array :boolean-array :persistent-map :persistent-vector :days :tensor :persistent-set :seconds :long :microseconds :int32 :boolean :short :double :epoch-days :float32-array :instant :zoned-date-time :keyword :dataset :text :native-buffer :array-buffer :years :int64 :epoch-microseconds :milliseconds :float64 :list :uint16 :int64-array :nanoseconds :duration :packed-duration :float :bool :foobar :object :int16-array :string :hours :float64-array :epoch-seconds :packed-local-date :epoch-hours :uuid :weeks :local-date :int8-array :uint32 :int16}
user> (dtype/make-container :foobar (range 10))
#array-buffer<foobar>[10]
[0.000, 1.000, 2.000, 3.000, 4.000, 5.000, 6.000, 7.000, 8.000, 9.000]
```
In general, because this is all done at runtime, I ask that people refrain from aliasing new
datatypes, defining new datatypes, and packing new datatypes. This doesn't mean it is an
error if someone does it but it does mean that every new datatype definition, packing
definition, and alias definition slightly slows down the system.
## Supported Meaningful Datatypes
For dataset processing, the currently supported meaningful datatypes are:
* `[:int8 :uint8 :int16 :uint16 :int32 :uint32 :int64 :uint64 :float32 :float64
:string :keyword :uuid
:local-date :packed-local-date :instant :packed-instant :duration :packed-duration
:local-date-time]`
There are more datatypes but for general purpose dataset processing these are a reasonable
subset.
When parsing data into the dataset system we can define both the container of the data
and the parser of the data:
```clojure
user> (def data-maps (for [idx (range 10)]
{:a idx
:b (str (.plusDays (java.time.LocalDate/now) idx))}))
#'user/data-maps
user> data-maps
({:a 0, :b "2021-12-15"}
{:a 1, :b "2021-12-16"}
{:a 2, :b "2021-12-17"}
{:a 3, :b "2021-12-18"}
{:a 4, :b "2021-12-19"}
{:a 5, :b "2021-12-20"}
{:a 6, :b "2021-12-21"}
{:a 7, :b "2021-12-22"}
{:a 8, :b "2021-12-23"}
{:a 9, :b "2021-12-24"})
user> (:b (ds/->dataset data-maps))
#tech.v3.dataset.column<string>[10]
:b
[2021-12-15, 2021-12-16, 2021-12-17, 2021-12-18, 2021-12-19, 2021-12-20, 2021-12-21, 2021-12-22, 2021-12-23, 2021-12-24]
user> (:b (ds/->dataset data-maps {:parser-fn {:b :local-date}}))
#tech.v3.dataset.column<local-date>[10]
:b
[2021-12-15, 2021-12-16, 2021-12-17, 2021-12-18, 2021-12-19, 2021-12-20, 2021-12-21, 2021-12-22, 2021-12-23, 2021-12-24]
user> (:b (ds/->dataset data-maps {:parser-fn {:b :packed-local-date}}))
#tech.v3.dataset.column<packed-local-date>[10]
:b
[2021-12-15, 2021-12-16, 2021-12-17, 2021-12-18, 2021-12-19, 2021-12-20, 2021-12-21, 2021-12-22, 2021-12-23, 2021-12-24]
user> (:b (ds/->dataset data-maps {:parser-fn {:b [:packed-local-date
(fn [data]
(java.time.LocalDate/parse (str data)))]}}))
#tech.v3.dataset.column<packed-local-date>[10]
:b
[2021-12-15, 2021-12-16, 2021-12-17, 2021-12-18, 2021-12-19, 2021-12-20, 2021-12-21, 2021-12-22, 2021-12-23, 2021-12-24]
```
## Extending Datatype System
Let's say we have tons of data in which only year-months are relevant. We can have
escalating levels of support depending on how much it really matters. The first level is to
convert the data into a type the system already understands, in this case a `LocalDate`:
```clojure
user> (:b (ds/->dataset
data-maps
{:parser-fn {:b [:packed-local-date
(fn [data]
(let [ym (java.time.YearMonth/from (java.time.LocalDate/parse (str data)))]
(java.time.LocalDate/of (.getYear ym) (.getMonth ym) 1)))]
}}))
#tech.v3.dataset.column<packed-local-date>[10]
:b
[2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01]
```
The second is to parse to year-months and accept that our column type will just be `:object`:
```clojure
user> (:b (ds/->dataset
data-maps
{:parser-fn {:b [:object
(fn [data]
(let [ym (java.time.YearMonth/from (java.time.LocalDate/parse (str data)))]
ym))]
}}))
#tech.v3.dataset.column<object>[10]
:b
[2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12]
```
Third, we can extend the type system to support year-months as object datatypes. This is
only slightly better than base object support but it does allow us to ensure we can
make containers with only `YearMonth` or nil objects in them:
```clojure
user> (casting/add-object-datatype! :year-month java.time.YearMonth)
:ok
user> (:b (ds/->dataset
data-maps
{:parser-fn {:b [:year-month
(fn [data]
(let [ym (java.time.YearMonth/from (java.time.LocalDate/parse (str data)))]
ym))]
}}))
#tech.v3.dataset.column<year-month>[10]
:b
[2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12]
```
And finally we could implement packing for this type. This means we could store year-month as
perhaps 32-bit integer epoch-months or something like that. We won't demonstrate this as
it is tedious but the example in [datetime/packing.clj](https://github.com/cnuernber/dtype-next/blob/master/src/tech/v3/datatype/datetime/packing.clj)
may be sufficient to show how to do this - if not let us know on Zulip or drop me an email.
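For a sense of what the conversion half of that work might look like, here is a sketch of the epoch-months idea using only `java.time`. Both helper functions are hypothetical names invented for this example, and nothing here registers anything with the packing system; it only shows that a `YearMonth` round-trips through a 32-bit integer.

```clojure
(import '[java.time YearMonth])

;; Month zero of our hypothetical epoch-months scale.
(def ^YearMonth epoch-month (YearMonth/of 1970 1))

(defn year-month->epoch-months
  "Hypothetical helper: months elapsed since 1970-01, as an int."
  [^YearMonth ym]
  (int (+ (* 12 (- (.getYear ym) 1970))
          (dec (.getMonthValue ym)))))

(defn epoch-months->year-month
  "Hypothetical helper: inverse of year-month->epoch-months."
  ^YearMonth [months]
  (.plusMonths epoch-month (long months)))

(year-month->epoch-months (YearMonth/of 2021 12))   ;; => 623
(str (epoch-months->year-month 623))                ;; => "2021-12"
```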