# tech.ml.dataset Walkthrough
`tech.ml.dataset` (TMD) is a Clojure library designed to ease working with tabular data, similar to `data.table` in R or Python's Pandas. TMD takes inspiration from the design of those tools, but does not aim to copy their functionality. Instead, TMD is a building block that increases Clojure's already considerable data processing power.
## High Level Design
In TMD, a dataset is logically a map of column name to column data. Column data is typed (e.g., a column of 16 bit integers, or a column of 64 bit floating point numbers), similar to a database. Column names may be any Java object - keywords and strings are typical - and column values may be any Java primitive type, or type supported by `tech.datatype`, datetimes, or arbitrary objects. Column data is stored contiguously in JVM arrays, and missing values are indicated with bitsets.
The outline of the walkthrough follows typical Clojure workflows, showing how those ideas are expanded by TMD:
1. [Dataset creation](#DatasetCreation) from typical formats (csv, tsv, etc...), or sequences of maps, or maps of column name to column values
1. [REPL-friendly printing](#REPLPrinting) of datasets
1. [Access to dataset (column) values](#AccessValues), and working with missing values
1. [Selecting](#Selecting) rows or columns of a dataset, or both at once
1. [Adding, removing, and updating](#AddingRemovingUpdating) columns
1. [Statistical analysis](#StatisticalAnalysis) of dataset data
1. [Functions familiar from `clojure.core`](#SortFilterGroup) (e.g., `sort-by`, `filter`, `group-by`, etc...) that operate on datasets
1. [Efficient elementwise operations](#ElementwiseOperations) and their use in adding or updating columns derived from data in the dataset
1. [Getting data back out](#GettingDataBackOut) of datasets for downstream consumers
-----
## Dataset Creation
### `->dataset`, `->>dataset`
Creating a dataset from `.edn` (sequence of maps):
```clojure
user> (require '[tech.v3.dataset :as ds])
nil
user> (ds/->dataset [{:a 1 :b 2} {:a 2 :c 3}])
_unnamed [2 3]:
| :a | :b | :c |
|----|----|----|
| 1 | 2 | |
| 2 | | 3 |
```
Passing options (as the first argument or the last, whichever is convenient):
```clojure
user> (ds/->dataset [{:a 1 :b 2} {:a 2 :c 3}] {:dataset-name "simple"})
simple [2 3]:
| :a | :b | :c |
|---:|---:|---:|
| 1 | 2 | |
| 2 | | 3 |
user> (ds/->>dataset {:dataset-name "simple"} [{:a 1 :b 2} {:a 2 :c 3}])
simple [2 3]:
| :a | :b | :c |
|---:|---:|---:|
| 1 | 2 | |
| 2 | | 3 |
```
Creating a dataset from a file:
```clojure
user> (ds/->dataset "test/data/stocks.csv")
test/data/stocks.csv [560 3]:
| symbol | date | price |
|--------|------------|-------:|
| MSFT | 2000-01-01 | 39.81 |
| MSFT | 2000-02-01 | 36.35 |
| MSFT | 2000-03-01 | 43.22 |
| MSFT | 2000-04-01 | 28.37 |
| MSFT | 2000-05-01 | 25.45 |
| MSFT | 2000-06-01 | 32.54 |
| MSFT | 2000-07-01 | 28.40 |
| MSFT | 2000-08-01 | 28.40 |
| MSFT | 2000-09-01 | 24.53 |
| MSFT | 2000-10-01 | 28.02 |
| ... | ... | ... |
| AAPL | 2009-05-01 | 135.81 |
| AAPL | 2009-06-01 | 142.43 |
| AAPL | 2009-07-01 | 163.39 |
| AAPL | 2009-08-01 | 168.21 |
| AAPL | 2009-09-01 | 185.35 |
| AAPL | 2009-10-01 | 188.50 |
| AAPL | 2009-11-01 | 199.91 |
| AAPL | 2009-12-01 | 210.73 |
| AAPL | 2010-01-01 | 192.06 |
| AAPL | 2010-02-01 | 204.62 |
| AAPL | 2010-03-01 | 223.02 |
```
`->dataset` and `->>dataset` interpret a `String` arguments as a file path (URL), and the format is determined automatically on load. TMD has built-in broad support for typical formats.
For text-based formats such as csv and tsv, zipped (e.g., `*.csv.zip`) and gzipped (e.g., `*.tsv.gz`) inputs are uncompressed automatically on load.
`InputStream`s also work instead of a `String`, this can be convenient when loading from private cloud storage.
While loading, TMD guesses at column datatypes, and generally does an ok job. If the guesses turn out to be sub-optimal, they can be overridden.
The same functions also load datasets from the internet:
```clojure
user> (def ds (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz"))
#'user/ds
user> (type ds)
tech.v3.dataset.impl.dataset.Dataset
user> (ds/row-count ds)
1460
```
#### Parsing Options
The `->dataset` docstring documents the many options available for parsing:
* https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var--.3Edataset
Limiting rows and columns at parse time:
```clojure
user> (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz"
{:column-whitelist ["SalePrice" "1stFlrSF" "2ndFlrSF"]
:n-records 5})
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz [5 3]:
| SalePrice | 1stFlrSF | 2ndFlrSF |
|-----------|----------|----------|
| 208500 | 856 | 854 |
| 181500 | 1262 | 0 |
| 223500 | 920 | 866 |
| 140000 | 961 | 756 |
| 250000 | 1145 | 1053 |
```
Forcing a datatype for a column before parsing:
```clojure
user> (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz"
{:column-whitelist ["SalePrice" "1stFlrSF" "2ndFlrSF"]
:n-records 5
:parser-fn {"SalePrice" :float32}})
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz [4 3]:
| SalePrice | 1stFlrSF | 2ndFlrSF |
|-----------|----------|----------|
| 208500.0 | 856 | 854 |
| 181500.0 | 1262 | 0 |
| 223500.0 | 920 | 866 |
| 140000.0 | 961 | 756 |
| 250000.0 | 1145 | 1053 |
```
Loading from `.xlsx`:
```clojure
user> (require '[tech.v3.libs.fastexcel]) ;; for xlsx support
nil
user> (def ds (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLSX_1000.xlsx"))
#'user/ds
user> ds
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLSX_1000.xlsx [1000 8]:
| column-0 | First Name | Last Name | Gender | Country | Age | Date | Id |
|---------:|------------|------------|--------|---------------|-----:|------------|-------:|
| 1.0 | Dulce | Abril | Female | United States | 32.0 | 15/10/2017 | 1562.0 |
| 2.0 | Mara | Hashimoto | Female | Great Britain | 25.0 | 16/08/2016 | 1582.0 |
| 3.0 | Philip | Gent | Male | France | 36.0 | 21/05/2015 | 2587.0 |
| 4.0 | Kathleen | Hanner | Female | United States | 25.0 | 15/10/2017 | 3549.0 |
| 5.0 | Nereida | Magwood | Female | United States | 58.0 | 16/08/2016 | 2468.0 |
| 6.0 | Gaston | Brumm | Male | United States | 24.0 | 21/05/2015 | 2554.0 |
| 7.0 | Etta | Hurn | Female | Great Britain | 56.0 | 15/10/2017 | 3598.0 |
| 8.0 | Earlean | Melgar | Female | United States | 27.0 | 16/08/2016 | 2456.0 |
| 9.0 | Vincenza | Weiland | Female | United States | 40.0 | 21/05/2015 | 6548.0 |
| 10.0 | Fallon | Winward | Female | Great Britain | 28.0 | 16/08/2016 | 5486.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 990.0 | Chase | Karner | Male | United States | 37.0 | 15/10/2017 | 2154.0 |
| 991.0 | Tommie | Underdahl | Male | United States | 26.0 | 16/08/2016 | 3265.0 |
| 992.0 | Dorcas | Darity | Female | United States | 37.0 | 21/05/2015 | 8765.0 |
| 993.0 | Angel | Sanor | Male | France | 24.0 | 15/10/2017 | 3259.0 |
| 994.0 | Willodean | Harn | Female | United States | 39.0 | 16/08/2016 | 3567.0 |
| 995.0 | Weston | Martina | Male | United States | 26.0 | 21/05/2015 | 6540.0 |
| 996.0 | Roma | Lafollette | Female | United States | 34.0 | 15/10/2017 | 2654.0 |
| 997.0 | Felisa | Cail | Female | United States | 28.0 | 16/08/2016 | 6525.0 |
| 998.0 | Demetria | Abbey | Female | United States | 32.0 | 21/05/2015 | 3265.0 |
| 999.0 | Jeromy | Danz | Male | United States | 39.0 | 15/10/2017 | 3265.0 |
| 1000.0 | Rasheeda | Alkire | Female | United States | 29.0 | 16/08/2016 | 6125.0 |
```
Notably, here, the "Date" column ended up as a `:string` due to the funny `dd/MM/yyyy` format:
```clojure
user> (meta (get ds "Date"))
{:categorical? true, :name "Date", :datatype :string, :n-elems 1000}
```
This can be overridden at parse time to get more robust types:
```clojure
user> (def ds (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLSX_1000.xlsx"
{:parser-fn {"Date" [:local-date "dd/MM/yyyy"]}}))
#'user/ds
user> ds
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLSX_1000.xlsx [1000 8]:
| column-0 | First Name | Last Name | Gender | Country | Age | Date | Id |
|---------:|------------|------------|--------|---------------|-----:|------------|-------:|
| 1.0 | Dulce | Abril | Female | United States | 32.0 | 2017-10-15 | 1562.0 |
| 2.0 | Mara | Hashimoto | Female | Great Britain | 25.0 | 2016-08-16 | 1582.0 |
| 3.0 | Philip | Gent | Male | France | 36.0 | 2015-05-21 | 2587.0 |
| 4.0 | Kathleen | Hanner | Female | United States | 25.0 | 2017-10-15 | 3549.0 |
| 5.0 | Nereida | Magwood | Female | United States | 58.0 | 2016-08-16 | 2468.0 |
| 6.0 | Gaston | Brumm | Male | United States | 24.0 | 2015-05-21 | 2554.0 |
| 7.0 | Etta | Hurn | Female | Great Britain | 56.0 | 2017-10-15 | 3598.0 |
| 8.0 | Earlean | Melgar | Female | United States | 27.0 | 2016-08-16 | 2456.0 |
| 9.0 | Vincenza | Weiland | Female | United States | 40.0 | 2015-05-21 | 6548.0 |
| 10.0 | Fallon | Winward | Female | Great Britain | 28.0 | 2016-08-16 | 5486.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 990.0 | Chase | Karner | Male | United States | 37.0 | 2017-10-15 | 2154.0 |
| 991.0 | Tommie | Underdahl | Male | United States | 26.0 | 2016-08-16 | 3265.0 |
| 992.0 | Dorcas | Darity | Female | United States | 37.0 | 2015-05-21 | 8765.0 |
| 993.0 | Angel | Sanor | Male | France | 24.0 | 2017-10-15 | 3259.0 |
| 994.0 | Willodean | Harn | Female | United States | 39.0 | 2016-08-16 | 3567.0 |
| 995.0 | Weston | Martina | Male | United States | 26.0 | 2015-05-21 | 6540.0 |
| 996.0 | Roma | Lafollette | Female | United States | 34.0 | 2017-10-15 | 2654.0 |
| 997.0 | Felisa | Cail | Female | United States | 28.0 | 2016-08-16 | 6525.0 |
| 998.0 | Demetria | Abbey | Female | United States | 32.0 | 2015-05-21 | 3265.0 |
| 999.0 | Jeromy | Danz | Male | United States | 39.0 | 2017-10-15 | 3265.0 |
| 1000.0 | Rasheeda | Alkire | Female | United States | 29.0 | 2016-08-16 | 6125.0 |
user> (meta (get ds "Date"))
{:name "Date", :datatype :local-date, :n-elems 1000}
user> (first (get ds "Date"))
#object[java.time.LocalDate 0x36b45eff "2017-10-15"]
```
In general, supplying tuples of `[datatype parse-fn]` allows overriding. As shown above, for datetime types, `parse-fn` can additionally be a DateTimeFormat format string or a DateTimeFormat object.
A more complete example (necessary, because excel is lax with serialized types):
```clojure
user> (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLSX_1000.xlsx"
{:parser-fn {"Date" [:local-date "dd/MM/yyyy"]
"Id" :int32
"column-0" :int32
"Age" :int16}})
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLSX_1000.xlsx [1000 8]:
| column-0 | First Name | Last Name | Gender | Country | Age | Date | Id |
|---------:|------------|------------|--------|---------------|----:|------------|-----:|
| 1 | Dulce | Abril | Female | United States | 32 | 2017-10-15 | 1562 |
| 2 | Mara | Hashimoto | Female | Great Britain | 25 | 2016-08-16 | 1582 |
| 3 | Philip | Gent | Male | France | 36 | 2015-05-21 | 2587 |
| 4 | Kathleen | Hanner | Female | United States | 25 | 2017-10-15 | 3549 |
| 5 | Nereida | Magwood | Female | United States | 58 | 2016-08-16 | 2468 |
| 6 | Gaston | Brumm | Male | United States | 24 | 2015-05-21 | 2554 |
| 7 | Etta | Hurn | Female | Great Britain | 56 | 2017-10-15 | 3598 |
| 8 | Earlean | Melgar | Female | United States | 27 | 2016-08-16 | 2456 |
| 9 | Vincenza | Weiland | Female | United States | 40 | 2015-05-21 | 6548 |
| 10 | Fallon | Winward | Female | Great Britain | 28 | 2016-08-16 | 5486 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 990 | Chase | Karner | Male | United States | 37 | 2017-10-15 | 2154 |
| 991 | Tommie | Underdahl | Male | United States | 26 | 2016-08-16 | 3265 |
| 992 | Dorcas | Darity | Female | United States | 37 | 2015-05-21 | 8765 |
| 993 | Angel | Sanor | Male | France | 24 | 2017-10-15 | 3259 |
| 994 | Willodean | Harn | Female | United States | 39 | 2016-08-16 | 3567 |
| 995 | Weston | Martina | Male | United States | 26 | 2015-05-21 | 6540 |
| 996 | Roma | Lafollette | Female | United States | 34 | 2017-10-15 | 2654 |
| 997 | Felisa | Cail | Female | United States | 28 | 2016-08-16 | 6525 |
| 998 | Demetria | Abbey | Female | United States | 32 | 2015-05-21 | 3265 |
| 999 | Jeromy | Danz | Male | United States | 39 | 2017-10-15 | 3265 |
| 1000 | Rasheeda | Alkire | Female | United States | 29 | 2016-08-16 | 6125 |
```
#### Map of Columns Format
Creating a dataset from a map of column name to column data:
```clojure
user> (ds/->dataset {:name ["fred" "ethel" "sally"]
:age [42 42 21]
:likes ["pizza" "sushi" "opera"]})
_unnamed [3 3]:
| :name | :age | :likes |
|-------|-----:|--------|
| fred | 42 | pizza |
| ethel | 42 | sushi |
| sally | 21 | opera |
```
TMD interprets untyped column data (like a persistent vector of objects for example) as either `string` or `double` depending on the first entry of the column data sequence. In contrast, typed column data (e.g., a Java array of some primitive type, or a typed `tech.v3.datatype` container) undergoes no interpretation.
-----
## REPL Friendly Printing
In general, it is safe to print out very large datasets or columns. By default, TMD will only print out around twenty values, eliding with ellipses as necessary. This turns out to be super-useful, for example when getting a feel for the data in a particular dataset or column.
By default, datasets with more than 20 rows print abbreviated:
```clojure
user> (ds/->dataset (for [i (range 21)] {:a (rand)}))
_unnamed [21 1]:
| :a |
|-----------:|
| 0.61372362 |
| 0.17076240 |
| 0.99000549 |
| 0.75449856 |
| 0.95344183 |
| 0.06785647 |
| 0.85751960 |
| 0.29387977 |
| 0.12027818 |
| 0.31456890 |
| ... |
| 0.08843911 |
| 0.56787410 |
| 0.79682279 |
| 0.56327258 |
| 0.32338803 |
| 0.95067985 |
| 0.74832399 |
| 0.10802058 |
| 0.80990261 |
| 0.81393921 |
| 0.42608164 |
```
The helpful `print-all` function overrides this so every row is printed:
```clojure
user> (-> (ds/->dataset (for [i (range 21)] {:a (rand)}))
(ds/print-all))
_unnamed [21 1]:
| :a |
|-----------:|
| 0.42679967 |
| 0.21603676 |
| 0.62454379 |
| 0.20501915 |
| 0.01175086 |
| 0.33983128 |
| 0.25639060 |
| 0.08791288 |
| 0.35557448 |
| 0.65832242 |
| 0.70054068 |
| 0.23177467 |
| 0.84374634 |
| 0.52497340 |
| 0.13117490 |
| 0.87235367 |
| 0.69944234 |
| 0.71521474 |
| 0.13842164 |
| 0.41571045 |
| 0.36734680 |
```
Note that `print-all` does not actually print anything, it merely changes the metadata on the dataset to override the default (abbreviated) printing behavior.
-----
Multiline printing:
```clojure
user> (require '[tech.v3.tensor :as dtt]) ;; Has some types that print nicely on multiple lines
nil
user> (def t (dtt/->tensor (partition 3 (range 9))))
#'user/t
user> (ds/->dataset [{:a 1 :b t}{:a 2 :b t}])
_unnamed [2 2]:
| :a | :b |
|---:|------------------------------|
| 1 | #tech.v3.tensor