1285 lines
50 KiB
Markdown
Vendored
1285 lines
50 KiB
Markdown
Vendored
# tech.ml.dataset Walkthrough
|
|
|
|
`tech.ml.dataset` (TMD) is a Clojure library designed to ease working with tabular data, similar to `data.table` in R or Python's Pandas. TMD takes inspiration from the design of those tools, but does not aim to copy their functionality. Instead, TMD is a building block that increases Clojure's already considerable data processing power.
|
|
|
|
## High Level Design
|
|
|
|
In TMD, a dataset is logically a map of column name to column data. Column data is typed (e.g., a column of 16 bit integers, or a column of 64 bit floating point numbers), similar to a database. Column names may be any Java object - keywords and strings are typical - and column values may be any Java primitive type, or type supported by `tech.datatype`, datetimes, or arbitrary objects. Column data is stored contiguously in JVM arrays, and missing values are indicated with bitsets.
|
|
|
|
The outline of the walkthrough follows typical Clojure workflows, showing how those ideas are expanded by TMD:
|
|
|
|
1. [Dataset creation](#DatasetCreation) from typical formats (csv, tsv, etc...), or sequences of maps, or maps of column name to column values
|
|
1. [REPL-friendly printing](#REPLPrinting) of datasets
|
|
1. [Access to dataset (column) values](#AccessValues), and working with missing values
|
|
1. [Selecting](#Selecting) rows or columns of a dataset, or both at once
|
|
1. [Adding, removing, and updating](#AddingRemovingUpdating) columns
|
|
1. [Statistical analysis](#StatisticalAnalysis) of dataset data
|
|
1. [Functions familiar from `clojure.core`](#SortFilterGroup) (e.g., `sort-by`, `filter`, `group-by`, etc...) that operate on datasets
|
|
1. [Efficient elementwise operations](#ElementwiseOperations) and their use in adding or updating columns derived from data in the dataset
|
|
1. [Getting data back out](#GettingDataBackOut) of datasets for downstream consumers
|
|
|
|
-----
|
|
|
|
<div id="DatasetCreation"></div>
|
|
|
|
## Dataset Creation
|
|
|
|
### `->dataset`, `->>dataset`
|
|
|
|
Creating a dataset from `.edn` (sequence of maps):
|
|
|
|
```clojure
|
|
user> (require '[tech.v3.dataset :as ds])
|
|
nil
|
|
user> (ds/->dataset [{:a 1 :b 2} {:a 2 :c 3}])
|
|
_unnamed [2 3]:
|
|
|
|
| :a | :b | :c |
|
|
|----|----|----|
|
|
| 1 | 2 | |
|
|
| 2 | | 3 |
|
|
```
|
|
|
|
Passing options (as the first argument or the last, whichever is convenient):
|
|
|
|
```clojure
|
|
user> (ds/->dataset [{:a 1 :b 2} {:a 2 :c 3}] {:dataset-name "simple"})
|
|
simple [2 3]:
|
|
|
|
| :a | :b | :c |
|
|
|---:|---:|---:|
|
|
| 1 | 2 | |
|
|
| 2 | | 3 |
|
|
user> (ds/->>dataset {:dataset-name "simple"} [{:a 1 :b 2} {:a 2 :c 3}])
|
|
simple [2 3]:
|
|
|
|
| :a | :b | :c |
|
|
|---:|---:|---:|
|
|
| 1 | 2 | |
|
|
| 2 | | 3 |
|
|
```
|
|
|
|
Creating a dataset from a file:
|
|
|
|
```clojure
|
|
user> (ds/->dataset "test/data/stocks.csv")
|
|
test/data/stocks.csv [560 3]:
|
|
|
|
| symbol | date | price |
|
|
|--------|------------|-------:|
|
|
| MSFT | 2000-01-01 | 39.81 |
|
|
| MSFT | 2000-02-01 | 36.35 |
|
|
| MSFT | 2000-03-01 | 43.22 |
|
|
| MSFT | 2000-04-01 | 28.37 |
|
|
| MSFT | 2000-05-01 | 25.45 |
|
|
| MSFT | 2000-06-01 | 32.54 |
|
|
| MSFT | 2000-07-01 | 28.40 |
|
|
| MSFT | 2000-08-01 | 28.40 |
|
|
| MSFT | 2000-09-01 | 24.53 |
|
|
| MSFT | 2000-10-01 | 28.02 |
|
|
| ... | ... | ... |
|
|
| AAPL | 2009-05-01 | 135.81 |
|
|
| AAPL | 2009-06-01 | 142.43 |
|
|
| AAPL | 2009-07-01 | 163.39 |
|
|
| AAPL | 2009-08-01 | 168.21 |
|
|
| AAPL | 2009-09-01 | 185.35 |
|
|
| AAPL | 2009-10-01 | 188.50 |
|
|
| AAPL | 2009-11-01 | 199.91 |
|
|
| AAPL | 2009-12-01 | 210.73 |
|
|
| AAPL | 2010-01-01 | 192.06 |
|
|
| AAPL | 2010-02-01 | 204.62 |
|
|
| AAPL | 2010-03-01 | 223.02 |
|
|
```
|
|
|
|
`->dataset` and `->>dataset` interpret a `String` arguments as a file path (URL), and the format is determined automatically on load. TMD has built-in broad support for typical formats.
|
|
|
|
For text-based formats such as csv and tsv, zipped (e.g., `*.csv.zip`) and gzipped (e.g., `*.tsv.gz`) inputs are uncompressed automatically on load.
|
|
|
|
`InputStream`s also work instead of a `String`, this can be convenient when loading from private cloud storage.
|
|
|
|
While loading, TMD guesses at column datatypes, and generally does an ok job. If the guesses turn out to be sub-optimal, they can be overridden.
|
|
|
|
The same functions also load datasets from the internet:
|
|
|
|
```clojure
|
|
user> (def ds (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz"))
|
|
#'user/ds
|
|
user> (type ds)
|
|
tech.v3.dataset.impl.dataset.Dataset
|
|
user> (ds/row-count ds)
|
|
1460
|
|
```
|
|
|
|
#### Parsing Options
|
|
|
|
The `->dataset` docstring documents the many options available for parsing:
|
|
|
|
* https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var--.3Edataset
|
|
|
|
Limiting rows and columns at parse time:
|
|
|
|
```clojure
|
|
user> (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz"
|
|
{:column-whitelist ["SalePrice" "1stFlrSF" "2ndFlrSF"]
|
|
:n-records 5})
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz [5 3]:
|
|
|
|
| SalePrice | 1stFlrSF | 2ndFlrSF |
|
|
|-----------|----------|----------|
|
|
| 208500 | 856 | 854 |
|
|
| 181500 | 1262 | 0 |
|
|
| 223500 | 920 | 866 |
|
|
| 140000 | 961 | 756 |
|
|
| 250000 | 1145 | 1053 |
|
|
```
|
|
|
|
Forcing a datatype for a column before parsing:
|
|
|
|
```clojure
|
|
user> (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz"
|
|
{:column-whitelist ["SalePrice" "1stFlrSF" "2ndFlrSF"]
|
|
:n-records 5
|
|
:parser-fn {"SalePrice" :float32}})
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz [4 3]:
|
|
|
|
| SalePrice | 1stFlrSF | 2ndFlrSF |
|
|
|-----------|----------|----------|
|
|
| 208500.0 | 856 | 854 |
|
|
| 181500.0 | 1262 | 0 |
|
|
| 223500.0 | 920 | 866 |
|
|
| 140000.0 | 961 | 756 |
|
|
| 250000.0 | 1145 | 1053 |
|
|
```
|
|
|
|
Loading from `.xlsx`:
|
|
|
|
```clojure
|
|
user> (require '[tech.v3.libs.fastexcel]) ;; for xlsx support
|
|
nil
|
|
user> (def ds (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLSX_1000.xlsx"))
|
|
#'user/ds
|
|
user> ds
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLSX_1000.xlsx [1000 8]:
|
|
|
|
| column-0 | First Name | Last Name | Gender | Country | Age | Date | Id |
|
|
|---------:|------------|------------|--------|---------------|-----:|------------|-------:|
|
|
| 1.0 | Dulce | Abril | Female | United States | 32.0 | 15/10/2017 | 1562.0 |
|
|
| 2.0 | Mara | Hashimoto | Female | Great Britain | 25.0 | 16/08/2016 | 1582.0 |
|
|
| 3.0 | Philip | Gent | Male | France | 36.0 | 21/05/2015 | 2587.0 |
|
|
| 4.0 | Kathleen | Hanner | Female | United States | 25.0 | 15/10/2017 | 3549.0 |
|
|
| 5.0 | Nereida | Magwood | Female | United States | 58.0 | 16/08/2016 | 2468.0 |
|
|
| 6.0 | Gaston | Brumm | Male | United States | 24.0 | 21/05/2015 | 2554.0 |
|
|
| 7.0 | Etta | Hurn | Female | Great Britain | 56.0 | 15/10/2017 | 3598.0 |
|
|
| 8.0 | Earlean | Melgar | Female | United States | 27.0 | 16/08/2016 | 2456.0 |
|
|
| 9.0 | Vincenza | Weiland | Female | United States | 40.0 | 21/05/2015 | 6548.0 |
|
|
| 10.0 | Fallon | Winward | Female | Great Britain | 28.0 | 16/08/2016 | 5486.0 |
|
|
| ... | ... | ... | ... | ... | ... | ... | ... |
|
|
| 990.0 | Chase | Karner | Male | United States | 37.0 | 15/10/2017 | 2154.0 |
|
|
| 991.0 | Tommie | Underdahl | Male | United States | 26.0 | 16/08/2016 | 3265.0 |
|
|
| 992.0 | Dorcas | Darity | Female | United States | 37.0 | 21/05/2015 | 8765.0 |
|
|
| 993.0 | Angel | Sanor | Male | France | 24.0 | 15/10/2017 | 3259.0 |
|
|
| 994.0 | Willodean | Harn | Female | United States | 39.0 | 16/08/2016 | 3567.0 |
|
|
| 995.0 | Weston | Martina | Male | United States | 26.0 | 21/05/2015 | 6540.0 |
|
|
| 996.0 | Roma | Lafollette | Female | United States | 34.0 | 15/10/2017 | 2654.0 |
|
|
| 997.0 | Felisa | Cail | Female | United States | 28.0 | 16/08/2016 | 6525.0 |
|
|
| 998.0 | Demetria | Abbey | Female | United States | 32.0 | 21/05/2015 | 3265.0 |
|
|
| 999.0 | Jeromy | Danz | Male | United States | 39.0 | 15/10/2017 | 3265.0 |
|
|
| 1000.0 | Rasheeda | Alkire | Female | United States | 29.0 | 16/08/2016 | 6125.0 |
|
|
```
|
|
|
|
Notably, here, the "Date" column ended up as a `:string` due to the funny `dd/MM/yyyy` format:
|
|
|
|
```clojure
|
|
user> (meta (get ds "Date"))
|
|
{:categorical? true, :name "Date", :datatype :string, :n-elems 1000}
|
|
```
|
|
|
|
This can be overridden at parse time to get more robust types:
|
|
|
|
```clojure
|
|
user> (def ds (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLSX_1000.xlsx"
|
|
{:parser-fn {"Date" [:local-date "dd/MM/yyyy"]}}))
|
|
#'user/ds
|
|
user> ds
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLSX_1000.xlsx [1000 8]:
|
|
|
|
| column-0 | First Name | Last Name | Gender | Country | Age | Date | Id |
|
|
|---------:|------------|------------|--------|---------------|-----:|------------|-------:|
|
|
| 1.0 | Dulce | Abril | Female | United States | 32.0 | 2017-10-15 | 1562.0 |
|
|
| 2.0 | Mara | Hashimoto | Female | Great Britain | 25.0 | 2016-08-16 | 1582.0 |
|
|
| 3.0 | Philip | Gent | Male | France | 36.0 | 2015-05-21 | 2587.0 |
|
|
| 4.0 | Kathleen | Hanner | Female | United States | 25.0 | 2017-10-15 | 3549.0 |
|
|
| 5.0 | Nereida | Magwood | Female | United States | 58.0 | 2016-08-16 | 2468.0 |
|
|
| 6.0 | Gaston | Brumm | Male | United States | 24.0 | 2015-05-21 | 2554.0 |
|
|
| 7.0 | Etta | Hurn | Female | Great Britain | 56.0 | 2017-10-15 | 3598.0 |
|
|
| 8.0 | Earlean | Melgar | Female | United States | 27.0 | 2016-08-16 | 2456.0 |
|
|
| 9.0 | Vincenza | Weiland | Female | United States | 40.0 | 2015-05-21 | 6548.0 |
|
|
| 10.0 | Fallon | Winward | Female | Great Britain | 28.0 | 2016-08-16 | 5486.0 |
|
|
| ... | ... | ... | ... | ... | ... | ... | ... |
|
|
| 990.0 | Chase | Karner | Male | United States | 37.0 | 2017-10-15 | 2154.0 |
|
|
| 991.0 | Tommie | Underdahl | Male | United States | 26.0 | 2016-08-16 | 3265.0 |
|
|
| 992.0 | Dorcas | Darity | Female | United States | 37.0 | 2015-05-21 | 8765.0 |
|
|
| 993.0 | Angel | Sanor | Male | France | 24.0 | 2017-10-15 | 3259.0 |
|
|
| 994.0 | Willodean | Harn | Female | United States | 39.0 | 2016-08-16 | 3567.0 |
|
|
| 995.0 | Weston | Martina | Male | United States | 26.0 | 2015-05-21 | 6540.0 |
|
|
| 996.0 | Roma | Lafollette | Female | United States | 34.0 | 2017-10-15 | 2654.0 |
|
|
| 997.0 | Felisa | Cail | Female | United States | 28.0 | 2016-08-16 | 6525.0 |
|
|
| 998.0 | Demetria | Abbey | Female | United States | 32.0 | 2015-05-21 | 3265.0 |
|
|
| 999.0 | Jeromy | Danz | Male | United States | 39.0 | 2017-10-15 | 3265.0 |
|
|
| 1000.0 | Rasheeda | Alkire | Female | United States | 29.0 | 2016-08-16 | 6125.0 |
|
|
user> (meta (get ds "Date"))
|
|
{:name "Date", :datatype :local-date, :n-elems 1000}
|
|
user> (first (get ds "Date"))
|
|
#object[java.time.LocalDate 0x36b45eff "2017-10-15"]
|
|
```
|
|
|
|
|
|
In general, supplying tuples of `[datatype parse-fn]` allows overriding. As shown above, for datetime types, `parse-fn` can additionally be a DateTimeFormat format string or a DateTimeFormat object.
|
|
|
|
A more complete example (necessary, because excel is lax with serialized types):
|
|
|
|
```clojure
|
|
user> (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLSX_1000.xlsx"
|
|
{:parser-fn {"Date" [:local-date "dd/MM/yyyy"]
|
|
"Id" :int32
|
|
"column-0" :int32
|
|
"Age" :int16}})
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLSX_1000.xlsx [1000 8]:
|
|
|
|
| column-0 | First Name | Last Name | Gender | Country | Age | Date | Id |
|
|
|---------:|------------|------------|--------|---------------|----:|------------|-----:|
|
|
| 1 | Dulce | Abril | Female | United States | 32 | 2017-10-15 | 1562 |
|
|
| 2 | Mara | Hashimoto | Female | Great Britain | 25 | 2016-08-16 | 1582 |
|
|
| 3 | Philip | Gent | Male | France | 36 | 2015-05-21 | 2587 |
|
|
| 4 | Kathleen | Hanner | Female | United States | 25 | 2017-10-15 | 3549 |
|
|
| 5 | Nereida | Magwood | Female | United States | 58 | 2016-08-16 | 2468 |
|
|
| 6 | Gaston | Brumm | Male | United States | 24 | 2015-05-21 | 2554 |
|
|
| 7 | Etta | Hurn | Female | Great Britain | 56 | 2017-10-15 | 3598 |
|
|
| 8 | Earlean | Melgar | Female | United States | 27 | 2016-08-16 | 2456 |
|
|
| 9 | Vincenza | Weiland | Female | United States | 40 | 2015-05-21 | 6548 |
|
|
| 10 | Fallon | Winward | Female | Great Britain | 28 | 2016-08-16 | 5486 |
|
|
| ... | ... | ... | ... | ... | ... | ... | ... |
|
|
| 990 | Chase | Karner | Male | United States | 37 | 2017-10-15 | 2154 |
|
|
| 991 | Tommie | Underdahl | Male | United States | 26 | 2016-08-16 | 3265 |
|
|
| 992 | Dorcas | Darity | Female | United States | 37 | 2015-05-21 | 8765 |
|
|
| 993 | Angel | Sanor | Male | France | 24 | 2017-10-15 | 3259 |
|
|
| 994 | Willodean | Harn | Female | United States | 39 | 2016-08-16 | 3567 |
|
|
| 995 | Weston | Martina | Male | United States | 26 | 2015-05-21 | 6540 |
|
|
| 996 | Roma | Lafollette | Female | United States | 34 | 2017-10-15 | 2654 |
|
|
| 997 | Felisa | Cail | Female | United States | 28 | 2016-08-16 | 6525 |
|
|
| 998 | Demetria | Abbey | Female | United States | 32 | 2015-05-21 | 3265 |
|
|
| 999 | Jeromy | Danz | Male | United States | 39 | 2017-10-15 | 3265 |
|
|
| 1000 | Rasheeda | Alkire | Female | United States | 29 | 2016-08-16 | 6125 |
|
|
```
|
|
|
|
#### Map of Columns Format
|
|
|
|
Creating a dataset from a map of column name to column data:
|
|
|
|
```clojure
|
|
user> (ds/->dataset {:name ["fred" "ethel" "sally"]
|
|
:age [42 42 21]
|
|
:likes ["pizza" "sushi" "opera"]})
|
|
_unnamed [3 3]:
|
|
|
|
| :name | :age | :likes |
|
|
|-------|-----:|--------|
|
|
| fred | 42 | pizza |
|
|
| ethel | 42 | sushi |
|
|
| sally | 21 | opera |
|
|
```
|
|
|
|
TMD interprets untyped column data (like a persistent vector of objects for example) as either `string` or `double` depending on the first entry of the column data sequence. In contrast, typed column data (e.g., a Java array of some primitive type, or a typed `tech.v3.datatype` container) undergoes no interpretation.
|
|
|
|
-----
|
|
|
|
<div id="REPLPrinting"></div>
|
|
|
|
## REPL Friendly Printing
|
|
|
|
In general, it is safe to print out very large datasets or columns. By default, TMD will only print out around twenty values, eliding with ellipses as necessary. This turns out to be super-useful, for example when getting a feel for the data in a particular dataset or column.
|
|
|
|
By default, datasets with more than 20 rows print abbreviated:
|
|
|
|
```clojure
|
|
user> (ds/->dataset (for [i (range 21)] {:a (rand)}))
|
|
_unnamed [21 1]:
|
|
|
|
| :a |
|
|
|-----------:|
|
|
| 0.61372362 |
|
|
| 0.17076240 |
|
|
| 0.99000549 |
|
|
| 0.75449856 |
|
|
| 0.95344183 |
|
|
| 0.06785647 |
|
|
| 0.85751960 |
|
|
| 0.29387977 |
|
|
| 0.12027818 |
|
|
| 0.31456890 |
|
|
| ... |
|
|
| 0.08843911 |
|
|
| 0.56787410 |
|
|
| 0.79682279 |
|
|
| 0.56327258 |
|
|
| 0.32338803 |
|
|
| 0.95067985 |
|
|
| 0.74832399 |
|
|
| 0.10802058 |
|
|
| 0.80990261 |
|
|
| 0.81393921 |
|
|
| 0.42608164 |
|
|
```
|
|
|
|
The helpful `print-all` function overrides this so every row is printed:
|
|
|
|
```clojure
|
|
user> (-> (ds/->dataset (for [i (range 21)] {:a (rand)}))
|
|
(ds/print-all))
|
|
_unnamed [21 1]:
|
|
|
|
| :a |
|
|
|-----------:|
|
|
| 0.42679967 |
|
|
| 0.21603676 |
|
|
| 0.62454379 |
|
|
| 0.20501915 |
|
|
| 0.01175086 |
|
|
| 0.33983128 |
|
|
| 0.25639060 |
|
|
| 0.08791288 |
|
|
| 0.35557448 |
|
|
| 0.65832242 |
|
|
| 0.70054068 |
|
|
| 0.23177467 |
|
|
| 0.84374634 |
|
|
| 0.52497340 |
|
|
| 0.13117490 |
|
|
| 0.87235367 |
|
|
| 0.69944234 |
|
|
| 0.71521474 |
|
|
| 0.13842164 |
|
|
| 0.41571045 |
|
|
| 0.36734680 |
|
|
```
|
|
|
|
Note that `print-all` does not actually print anything, it merely changes the metadata on the dataset to override the default (abbreviated) printing behavior.
|
|
|
|
-----
|
|
|
|
Multiline printing:
|
|
|
|
```clojure
|
|
user> (require '[tech.v3.tensor :as dtt]) ;; Has some types that print nicely on multiple lines
|
|
nil
|
|
user> (def t (dtt/->tensor (partition 3 (range 9))))
|
|
#'user/t
|
|
user> (ds/->dataset [{:a 1 :b t}{:a 2 :b t}])
|
|
_unnamed [2 2]:
|
|
|
|
| :a | :b |
|
|
|---:|------------------------------|
|
|
| 1 | #tech.v3.tensor<object>[3 3] |
|
|
| | [[0 1 2] |
|
|
| | [3 4 5] |
|
|
| | [6 7 8]] |
|
|
| 2 | #tech.v3.tensor<object>[3 3] |
|
|
| | [[0 1 2] |
|
|
| | [3 4 5] |
|
|
| | [6 7 8]] |
|
|
```
|
|
|
|
Overriding this via metadata:
|
|
|
|
```clojure
|
|
user> (def ds *1)
|
|
#'user/ds
|
|
user> (with-meta ds
|
|
(assoc (meta ds) :print-line-policy :single))
|
|
_unnamed [2 2]:
|
|
|
|
| :a | :b |
|
|
|---:|------------------------------|
|
|
| 1 | #tech.v3.tensor<object>[3 3] |
|
|
| 2 | #tech.v3.tensor<object>[3 3] |
|
|
```
|
|
|
|
Especially useful when dealing with datasets that may have large amounts of per-column data:
|
|
|
|
```clojure
|
|
user> (def events-ds (-> (ds/->dataset "https://api.github.com/events"
|
|
{:key-fn keyword
|
|
:file-type :json})
|
|
(vary-meta assoc
|
|
:print-line-policy :single
|
|
:print-column-max-width 25)))
|
|
#'user/events-ds
|
|
user> (ds/head events-ds)
|
|
https://api.github.com/events [5 8]:
|
|
|
|
| :id | :type | :actor | :repo | :payload | :public | :created_at | :org |
|
|
|-------------|------------------------|----------------|-----------------|-----------------------|---------|----------------------|----------------|
|
|
| 13911736787 | PushEvent | {:id 29139614, | {:id 253259114, | {:push_id 5888739305, | true | 2020-10-20T17:49:36Z | |
|
|
| 13911736794 | IssuesEvent | {:id 47793873, | {:id 240806054, | {:action "opened", | true | 2020-10-20T17:49:36Z | {:id 61098177, |
|
|
| 13911736759 | PushEvent | {:id 71535163, | {:id 304746399, | {:push_id 5888739282, | true | 2020-10-20T17:49:36Z | |
|
|
| 13911736795 | PullRequestReviewEvent | {:id 47063667, | {:id 305218173, | {:action "created", | true | 2020-10-20T17:49:36Z | |
|
|
| 13911736760 | PushEvent | {:id 22623307, | {:id 287289752, | {:push_id 5888739280, | true | 2020-10-20T17:49:36Z | |
|
|
```
|
|
|
|
### Markdown
|
|
|
|
Output is printed as markdown tables:
|
|
|
|
So, this:
|
|
|
|
```
|
|
| :name | :age | :likes |
|
|
|-------|-----:|--------|
|
|
| fred | 42 | pizza |
|
|
| ethel | 42 | sushi |
|
|
| sally | 21 | opera |
|
|
```
|
|
|
|
Renders nicely under markdown processing:
|
|
|
|
| :name | :age | :likes |
|
|
|-------|-----:|--------|
|
|
| fred | 42 | pizza |
|
|
| ethel | 42 | sushi |
|
|
| sally | 21 | opera |
|
|
|
|
The full list of possible printing options is provided in the documentation for [dataset-data->str](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.print.html).
|
|
|
|
-----
|
|
|
|
<div id="AccessValues"></div>
|
|
|
|
## Access To Dataset (Column) Values
|
|
|
|
### Datasets are Maps
|
|
|
|
Every Dataset implements `clojure.lang.IPersistentMap`. However, unlike Clojure's persistent maps, dataset columns are ordered.
|
|
|
|
Investigating a dataset as a map, and locating missing values:
|
|
|
|
```clojure
|
|
user> (def ds (ds/->dataset [{:a 1 :b 2} {:a 2 :c 3}]))
|
|
#'user/ds
|
|
user> ds
|
|
_unnamed [2 3]:
|
|
|
|
| :a | :b | :c |
|
|
|---:|---:|---:|
|
|
| 1 | 2 | |
|
|
| 2 | | 3 |
|
|
user> (first ds)
|
|
[:a #tech.v3.dataset.column<int64>[2]
|
|
:a
|
|
[1, 2]]
|
|
user> (second ds)
|
|
[:b #tech.v3.dataset.column<int64>[2]
|
|
:b
|
|
[2, ]]
|
|
user> (ds :b)
|
|
#tech.v3.dataset.column<int64>[2]
|
|
:b
|
|
[2, ]
|
|
user> (ds/missing (ds :b))
|
|
{1}
|
|
user> (ds/column-names ds)
|
|
(:a :b :c)
|
|
user> (for [k (ds/column-names ds)]
|
|
[k (ds/missing (ds k))])
|
|
([:a {}] [:b {1}] [:c {0}])
|
|
```
|
|
|
|
### Columns are Indexed
|
|
|
|
Columns implement `clojure.lang.Indexed` (and so provide nth) and also implement `clojure.lang.IFn` just like Clojure's persistent vectors.
|
|
|
|
Getting values is easy:
|
|
|
|
```clojure
|
|
user> (def lucy-ds (ds/->dataset {:name ["fred" "ethel" "sally"]
|
|
:age [42 42 21]
|
|
:likes ["pizza" "sushi" "opera"]}))
|
|
#'user/lucy-ds
|
|
user> (lucy-ds :name)
|
|
#tech.v3.dataset.column<string>[3]
|
|
:name
|
|
[fred, ethel, sally]
|
|
user> (nth (lucy-ds :name) 0)
|
|
"fred"
|
|
user> ((lucy-ds :name) 0)
|
|
"fred"
|
|
user> ((lucy-ds :name) 2)
|
|
"sally"
|
|
```
|
|
|
|
### As Clojure Data
|
|
|
|
Accessing a dataset's rows as a sequence of maps (with one key per column):
|
|
|
|
```clojure
|
|
user> (ds/rows lucy-ds)
|
|
[{:name "fred", :age 42, :likes "pizza"}
|
|
{:name "ethel", :age 42, :likes "sushi"}
|
|
{:name "sally", :age 21, :likes "opera"}]
|
|
```
|
|
|
|
Accessing a dataset's rows as a sequence of vectors (with one entry per column):
|
|
|
|
```clojure
|
|
user> (ds/rowvecs lucy-ds)
|
|
[["fred" 42 "pizza"]
|
|
["ethel" 42 "sushi"]
|
|
["sally" 21 "opera"]]
|
|
```
|
|
|
|
These are just _views_ on the dataset, and so accessing these does not double RAM usage, and a Clojureist's intuition about immutability, persistence, and structural sharing are respected (when possible).
|
|
|
|
-----
|
|
|
|
<div id="Selecting"></div>
|
|
|
|
## Selecting Dataset Rows or Columns - (Subrect Selection)
|
|
|
|
A dataset with many columns:
|
|
|
|
```clojure
|
|
user> (def ames-ds (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz"))
|
|
#'user/ames-ds
|
|
user> (ds/column-count ames-ds)
|
|
81
|
|
user> (ds/column-names ames-ds)
|
|
("Id"
|
|
"MSSubClass"
|
|
"MSZoning"
|
|
...)
|
|
user> (get ames-ds "OverallQual")
|
|
#tech.v3.dataset.column<int16>[1460]
|
|
OverallQual
|
|
[7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5, 5...]
|
|
```
|
|
|
|
Selecting - and implicitly reordering (!) - columns:
|
|
|
|
```clojure
|
|
user> (ds/select-columns ames-ds ["Id" "OverallQual" "SalePrice"])
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz [1460 3]:
|
|
|
|
| Id | OverallQual | SalePrice |
|
|
|-----:|------------:|----------:|
|
|
| 1 | 7 | 208500 |
|
|
| 2 | 6 | 181500 |
|
|
| 3 | 7 | 223500 |
|
|
| 4 | 7 | 140000 |
|
|
| 5 | 8 | 250000 |
|
|
| 6 | 5 | 143000 |
|
|
| 7 | 8 | 307000 |
|
|
| 8 | 7 | 200000 |
|
|
| 9 | 7 | 129900 |
|
|
| 10 | 5 | 118000 |
|
|
| ... | ... | ... |
|
|
| 1450 | 5 | 92000 |
|
|
| 1451 | 5 | 136000 |
|
|
| 1452 | 8 | 287090 |
|
|
| 1453 | 5 | 145000 |
|
|
| 1454 | 5 | 84500 |
|
|
| 1455 | 7 | 185000 |
|
|
| 1456 | 6 | 175000 |
|
|
| 1457 | 6 | 210000 |
|
|
| 1458 | 7 | 266500 |
|
|
| 1459 | 5 | 142125 |
|
|
| 1460 | 5 | 147500 |
|
|
```
|
|
|
|
Interest in just a few specific rows:
|
|
|
|
```clojure
|
|
user> (-> (ds/select-columns ames-ds ["Id" "OverallQual" "SalePrice"])
|
|
(ds/select-rows [0 1 3 5 17 42]))
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz [6 3]:
|
|
|
|
| Id | OverallQual | SalePrice |
|
|
|---:|------------:|----------:|
|
|
| 1 | 7 | 208500 |
|
|
| 2 | 6 | 181500 |
|
|
| 4 | 7 | 140000 |
|
|
| 6 | 5 | 143000 |
|
|
| 18 | 4 | 90000 |
|
|
| 43 | 5 | 144000 |
|
|
```
|
|
|
|
Or the same as a one-liner:
|
|
|
|
```clojure
|
|
user> (ds/select ames-ds ["Id" "OverallQual" "SalePrice"] [0 1 3 5 17 42])
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz [6 3]:
|
|
|
|
| Id | OverallQual | SalePrice |
|
|
|---:|------------:|----------:|
|
|
| 1 | 7 | 208500 |
|
|
| 2 | 6 | 181500 |
|
|
| 4 | 7 | 140000 |
|
|
| 6 | 5 | 143000 |
|
|
| 18 | 4 | 90000 |
|
|
| 43 | 5 | 144000 |
|
|
```
|
|
|
|
-----
|
|
|
|
<div id="AddingRemovingUpdating"></div>
|
|
|
|
## Adding, Removing, and Updating Columns
|
|
|
|
Because datasets are maps (of column name to column data), adding a sequence as a column is trivial:
|
|
|
|
```clojure
|
|
user> (assoc lucy-ds :foo (repeatedly #(rand)))
|
|
_unnamed [3 4]:
|
|
|
|
| :name | :age | :likes | :foo |
|
|
|-------|-----:|--------|-----------:|
|
|
| fred | 42 | pizza | 0.21302376 |
|
|
| ethel | 42 | sushi | 0.40027647 |
|
|
| sally | 21 | opera | 0.04117065 |
|
|
```
|
|
|
|
Similarly, removing columns is just `dissoc`:
|
|
|
|
```clojure
|
|
user> (dissoc lucy-ds :age)
|
|
_unnamed [3 2]:
|
|
|
|
| :name | :likes |
|
|
|-------|--------|
|
|
| fred | pizza |
|
|
| ethel | sushi |
|
|
| sally | opera |
|
|
```
|
|
|
|
Look at these last two examples again, they subtly indicate that TMD datasets are _functional_... Neither operation changed `lucy-ds`.
|
|
|
|
🕉
|
|
|
|
-----
|
|
|
|
The powerful `row-map` function is often the best way to add (or update) columns derived from data already in the dataset:
|
|
|
|
```clojure
|
|
user> (-> (ds/select ames-ds ["Id" "OverallQual" "SalePrice"] [0 1 3 5 17 42])
|
|
(ds/row-map (fn [{:strs [OverallQual SalePrice]}]
|
|
{"PricePerQual" (quot SalePrice OverallQual)
|
|
"BetterPrice" (* 2 SalePrice)})))
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz [6 5]:
|
|
|
|
| Id | OverallQual | SalePrice | PricePerQual | BetterPrice |
|
|
|---:|------------:|----------:|-------------:|------------:|
|
|
| 1 | 7 | 208500 | 29785 | 417000 |
|
|
| 2 | 6 | 181500 | 30250 | 363000 |
|
|
| 4 | 7 | 140000 | 20000 | 280000 |
|
|
| 6 | 5 | 143000 | 28600 | 286000 |
|
|
| 18 | 4 | 90000 | 22500 | 180000 |
|
|
| 43 | 5 | 144000 | 28800 | 288000 |
|
|
```
|
|
|
|
Another example of adding a column:
|
|
|
|
```clojure
|
|
user> (require '[tech.v3.datatype.functional :as dfn]) ;; for functional column operation
|
|
nil
|
|
user> lucy-ds
|
|
_unnamed [3 3]:
|
|
|
|
| :name | :age | :likes |
|
|
|-------|-----:|--------|
|
|
| fred | 42 | pizza |
|
|
| ethel | 42 | sushi |
|
|
| sally | 21 | opera |
|
|
user> (dfn/log (:age lucy-ds))
|
|
[3.7376696182833684 3.7376696182833684 3.044522437723423]
|
|
user> (assoc lucy-ds :log-age (dfn/log (:age lucy-ds)))
|
|
_unnamed [3 4]:
|
|
|
|
| :name | :age | :likes | :log-age |
|
|
|-------|-----:|--------|-----------:|
|
|
| fred | 42 | pizza | 3.73766962 |
|
|
| ethel | 42 | sushi | 3.73766962 |
|
|
| sally | 21 | opera | 3.04452244 |
|
|
```
|
|
|
|
Constants are broadcast to each row:
|
|
|
|
```clojure
|
|
user> (assoc lucy-ds :spring-chicken false)
|
|
_unnamed [3 4]:
|
|
|
|
| :name | :age | :likes | :spring-chicken |
|
|
|-------|-----:|--------|-----------------|
|
|
| fred | 42 | pizza | false |
|
|
| ethel | 42 | sushi | false |
|
|
| sally | 21 | opera | false |
|
|
```
|
|
|
|
-----
|
|
|
|
<div id="StatisticalAnalysis"></div>
|
|
|
|
## Statistical Analysis
|
|
|
|
The `tech.v3.datatype.statistics` namespace from the underlying `dtype-next` system provides many useful statistical functions:
|
|
|
|
```clojure
|
|
user> (require '[tech.v3.datatype.statistics :as stats])
|
|
nil
|
|
user> (def stock-ds (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv"
|
|
{:key-fn keyword}))
|
|
#'user/stock-ds
|
|
user> (ds/filter-column stock-ds :symbol #{"MSFT"})
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv [123 3]:
|
|
|
|
| :symbol | :date | :price |
|
|
|---------|------------|-------:|
|
|
| MSFT | 2000-01-01 | 39.81 |
|
|
| MSFT | 2000-02-01 | 36.35 |
|
|
| MSFT | 2000-03-01 | 43.22 |
|
|
| MSFT | 2000-04-01 | 28.37 |
|
|
| MSFT | 2000-05-01 | 25.45 |
|
|
| MSFT | 2000-06-01 | 32.54 |
|
|
| MSFT | 2000-07-01 | 28.40 |
|
|
| MSFT | 2000-08-01 | 28.40 |
|
|
| MSFT | 2000-09-01 | 24.53 |
|
|
| MSFT | 2000-10-01 | 28.02 |
|
|
| ... | ... | ... |
|
|
| MSFT | 2009-05-01 | 20.59 |
|
|
| MSFT | 2009-06-01 | 23.42 |
|
|
| MSFT | 2009-07-01 | 23.18 |
|
|
| MSFT | 2009-08-01 | 24.43 |
|
|
| MSFT | 2009-09-01 | 25.49 |
|
|
| MSFT | 2009-10-01 | 27.48 |
|
|
| MSFT | 2009-11-01 | 29.27 |
|
|
| MSFT | 2009-12-01 | 30.34 |
|
|
| MSFT | 2010-01-01 | 28.05 |
|
|
| MSFT | 2010-02-01 | 28.67 |
|
|
| MSFT | 2010-03-01 | 28.80 |
|
|
user> (def prices (:price (ds/filter-column stock-ds :symbol #{"MSFT"})))
|
|
#'user/prices
|
|
user> (stats/min prices)
|
|
15.81
|
|
user> (stats/mean prices)
|
|
24.736747967479673
|
|
user> (stats/max prices)
|
|
43.22
|
|
```
|
|
|
|
The `descriptive-stats` function calculates many statistics at once, returning the result as a dataset:
|
|
|
|
```clojure
|
|
user> (ds/descriptive-stats (ds/filter-column stock-ds :symbol #{"MSFT"}))
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv: descriptive-stats [3 12]:
|
|
|
|
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last |
|
|
|-----------|--------------------|---------:|-----------:|------------|------------|-------|------------|--------------------:|-----------:|------------|------------|
|
|
| :symbol | :string | 123 | 0 | | | MSFT | | | | MSFT | MSFT |
|
|
| :date | :packed-local-date | 123 | 0 | 2000-01-01 | 2005-01-30 | | 2010-03-01 | 9.37554538E+10 | 0.00025335 | 2000-01-01 | 2010-03-01 |
|
|
| :price | :float64 | 123 | 0 | 15.81 | 24.74 | | 43.22 | 4.30395786E+00 | 1.16559225 | 39.81 | 28.80 |
|
|
```
|
|
|
|
The same data is available as a sequence of maps (with one map per column) from `brief`:
|
|
|
|
```clojure
|
|
user> (ds/brief (ds/filter -column stock-ds :symbol #{"MSFT"}))
|
|
({:n-missing 0,
|
|
:col-name :symbol,
|
|
:histogram (["MSFT" 123]),
|
|
:datatype :string,
|
|
:mode "MSFT",
|
|
:n-valid 123,
|
|
:values ["MSFT"],
|
|
:first "MSFT",
|
|
:last "MSFT",
|
|
:n-values 1}
|
|
{:min #object[java.time.LocalDate 0x3059f98a "2000-01-01"],
|
|
:n-missing 0,
|
|
:col-name :date,
|
|
:mean #object[java.time.LocalDate 0x78193b42 "2005-01-30"],
|
|
:datatype :packed-local-date,
|
|
:skew 2.5335340939019964E-4,
|
|
:standard-deviation 9.375545376336331E10,
|
|
:quartile-3 #object[java.time.LocalDate 0x2a73389f "2007-09-01"],
|
|
:n-valid 123,
|
|
:quartile-1 #object[java.time.LocalDate 0x5818d8e "2002-07-01"],
|
|
:median #object[java.time.LocalDate 0x3eaaea8d "2005-02-01"],
|
|
:max #object[java.time.LocalDate 0x4aeeb094 "2010-03-01"],
|
|
:first #object[java.time.LocalDate 0x16291c62 "2000-01-01"],
|
|
:last #object[java.time.LocalDate 0x56533361 "2010-03-01"]}
|
|
{:min 15.81,
|
|
:n-missing 0,
|
|
:col-name :price,
|
|
:mean 24.736747967479673,
|
|
:datatype :float64,
|
|
:skew 1.165592245374292,
|
|
:standard-deviation 4.303957861320731,
|
|
:quartile-3 27.34,
|
|
:n-valid 123,
|
|
:quartile-1 21.75,
|
|
:median 24.11,
|
|
:max 43.22,
|
|
:first 39.81,
|
|
:last 28.8})
|
|
```
|
|
|
|
Supremely useful for developing generalized systems that work on arbitrary datasets.
|
|
|
|
-----
|
|
|
|
<div id="SortFilterGroup"></div>
|
|
|
|
## Sort-by, Filter, Group-by
|
|
|
|
While conceptually similar to the ones in `clojure.core`, these functions are aware of the tabular/columnar dataset structure, and so can be much more efficient (especially when using the single-column versions).
|
|
|
|
When using the whole-dataset versions, TMD passes each row (as a map) into the argument function, this matches the expectation of Clojure programmers:
|
|
|
|
```clojure
|
|
user> (-> ames-ds
|
|
(ds/filter #(< 30000 (get % "SalePrice")))
|
|
(ds/select ["OverallQual" "SalePrice"] (range 5)))
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz [5 2]:
|
|
|
|
| OverallQual | SalePrice |
|
|
|------------:|----------:|
|
|
| 7 | 208500 |
|
|
| 6 | 181500 |
|
|
| 7 | 223500 |
|
|
| 7 | 140000 |
|
|
| 8 | 250000 |
|
|
```
|
|
|
|
Operating on a single column works the same, and is faster, so do that when possible:
|
|
|
|
```clojure
|
|
user> (-> ames-ds
|
|
(ds/filter-column "SalePrice" #(< 30000 %))
|
|
(ds/select ["OverallQual" "SalePrice"] (range 5)))
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz [5 2]:
|
|
|
|
| OverallQual | SalePrice |
|
|
|------------:|----------:|
|
|
| 7 | 208500 |
|
|
| 6 | 181500 |
|
|
| 7 | 223500 |
|
|
| 7 | 140000 |
|
|
| 8 | 250000 |
|
|
```
|
|
|
|
### Sorting
|
|
|
|
Sorting (again, like in Clojure):
|
|
|
|
```clojure
|
|
user> (-> (ds/sort-by-column ames-ds "SalePrice" >)
|
|
(ds/select-columns ["Id" "OverallQual" "SalePrice"]))
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz [1460 3]:
|
|
|
|
| Id | OverallQual | SalePrice |
|
|
|-----:|------------:|----------:|
|
|
| 692 | 10 | 755000 |
|
|
| 1183 | 10 | 745000 |
|
|
| 1170 | 10 | 625000 |
|
|
| 899 | 9 | 611657 |
|
|
| 804 | 9 | 582933 |
|
|
| 1047 | 9 | 556581 |
|
|
| 441 | 10 | 555000 |
|
|
| 770 | 8 | 538000 |
|
|
| 179 | 9 | 501837 |
|
|
| 799 | 9 | 485000 |
|
|
| ... | ... | ... |
|
|
| 1381 | 3 | 58500 |
|
|
| 813 | 5 | 55993 |
|
|
| 706 | 4 | 55000 |
|
|
| 1326 | 4 | 55000 |
|
|
| 1338 | 4 | 52500 |
|
|
| 711 | 3 | 52000 |
|
|
| 31 | 4 | 40000 |
|
|
| 534 | 1 | 39300 |
|
|
| 969 | 3 | 37900 |
|
|
| 917 | 2 | 35311 |
|
|
| 496 | 4 | 34900 |
|
|
```
|
|
|
|
### Grouping
|
|
|
|
Grouping produces a map of column values to _datasets_ with only rows with that column value:
|
|
|
|
```clojure
|
|
user> lucy-ds
|
|
_unnamed [3 3]:
|
|
|
|
| :name | :age | :likes |
|
|
|-------|-----:|--------|
|
|
| fred | 42 | pizza |
|
|
| ethel | 42 | sushi |
|
|
| sally | 21 | opera |
|
|
user> (ds/group-by-column lucy-ds :age)
|
|
{42 _unnamed [2 3]:
|
|
|
|
| :name | :age | :likes |
|
|
|-------|-----:|--------|
|
|
| fred | 42 | pizza |
|
|
| ethel | 42 | sushi |
|
|
, 21 _unnamed [1 3]:
|
|
|
|
| :name | :age | :likes |
|
|
|-------|-----:|--------|
|
|
| sally | 21 | opera |
|
|
}
|
|
```
|
|
|
|
This structure (a map of data values to data with that value) will be familiar to even the most casual `clojure.core/group-by` enjoyer.
|
|
|
|
### Descriptive Stats and Group-By and DateTime Types
|
|
|
|
Automatically parsing dates in a CSV dataset:
|
|
|
|
```clojure
|
|
user> (def stock-ds (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv"))
|
|
#'user/stock-ds
|
|
user> (ds/head stock-ds)
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv [5 3]:
|
|
|
|
| symbol | date | price |
|
|
|--------|------------|------:|
|
|
| MSFT | 2000-01-01 | 39.81 |
|
|
| MSFT | 2000-02-01 | 36.35 |
|
|
| MSFT | 2000-03-01 | 43.22 |
|
|
| MSFT | 2000-04-01 | 28.37 |
|
|
| MSFT | 2000-05-01 | 25.45 |
|
|
user> (get stock-ds "date")
|
|
#tech.v3.dataset.column<packed-local-date>[560]
|
|
date
|
|
[2000-01-01, 2000-02-01, 2000-03-01, 2000-04-01, 2000-05-01, 2000-06-01, 2000-07-01, 2000-08-01, 2000-09-01, 2000-10-01, 2000-11-01, 2000-12-01, 2001-01-01, 2001-02-01, 2001-03-01, 2001-04-01, 2001-05-01, 2001-06-01, 2001-07-01, 2001-08-01...]
|
|
user> (meta (get stock-ds "date"))
|
|
{:name "date", :datatype :packed-local-date, :n-elems 560}
|
|
```
|
|
|
|
Those dates are not strings (!)
|
|
|
|
So, math on them works as expected:
|
|
|
|
```clojure
|
|
user> (->> (ds/group-by-column stock-ds "symbol")
|
|
(map (fn [[k v]] (ds/descriptive-stats v))))
|
|
(https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv: descriptive-stats [3 12]:
|
|
|
|
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last |
|
|
|-----------|--------------------|---------:|-----------:|------------|------------|-------|------------|--------------------:|-----------:|------------|------------|
|
|
| symbol | :string | 123 | 0 | | | MSFT | | | | MSFT | MSFT |
|
|
| date | :packed-local-date | 123 | 0 | 2000-01-01 | 2005-01-30 | | 2010-03-01 | 9.37554538E+10 | 0.00025335 | 2000-01-01 | 2010-03-01 |
|
|
| price | :float64 | 123 | 0 | 15.81 | 24.74 | | 43.22 | 4.30395786E+00 | 1.16559225 | 39.81 | 28.80 |
|
|
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv: descriptive-stats [3 12]:
|
|
|
|
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last |
|
|
|-----------|--------------------|---------:|-----------:|------------|------------|-------|------------|--------------------:|-----------:|------------|------------|
|
|
| symbol | :string | 123 | 0 | | | AMZN | | | | AMZN | AMZN |
|
|
| date | :packed-local-date | 123 | 0 | 2000-01-01 | 2005-01-30 | | 2010-03-01 | 9.37554538E+10 | 0.00025335 | 2000-01-01 | 2010-03-01 |
|
|
| price | :float64 | 123 | 0 | 5.970 | 47.99 | | 135.9 | 2.88913206E+01 | 0.98217538 | 64.56 | 128.8 |
|
|
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv: descriptive-stats [3 12]:
|
|
|
|
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last |
|
|
|-----------|--------------------|---------:|-----------:|------------|------------|-------|------------|--------------------:|-----------:|------------|------------|
|
|
| symbol | :string | 123 | 0 | | | IBM | | | | IBM | IBM |
|
|
| date | :packed-local-date | 123 | 0 | 2000-01-01 | 2005-01-30 | | 2010-03-01 | 9.37554538E+10 | 0.00025335 | 2000-01-01 | 2010-03-01 |
|
|
| price | :float64 | 123 | 0 | 53.01 | 91.26 | | 130.3 | 1.65133647E+01 | 0.44446266 | 100.5 | 125.6 |
|
|
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv: descriptive-stats [3 12]:
|
|
|
|
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last |
|
|
|-----------|--------------------|---------:|-----------:|------------|------------|-------|------------|--------------------:|------------:|------------|------------|
|
|
| symbol | :string | 68 | 0 | | | GOOG | | | | GOOG | GOOG |
|
|
| date | :packed-local-date | 68 | 0 | 2004-08-01 | 2007-05-17 | | 2010-03-01 | 5.20003989E+10 | 0.00094625 | 2004-08-01 | 2010-03-01 |
|
|
| price | :float64 | 68 | 0 | 102.4 | 415.9 | | 707.0 | 1.35069851E+02 | -0.22776524 | 102.4 | 560.2 |
|
|
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv: descriptive-stats [3 12]:
|
|
|
|
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last |
|
|
|-----------|--------------------|---------:|-----------:|------------|------------|-------|------------|--------------------:|-----------:|------------|------------|
|
|
| symbol | :string | 123 | 0 | | | AAPL | | | | AAPL | AAPL |
|
|
| date | :packed-local-date | 123 | 0 | 2000-01-01 | 2005-01-30 | | 2010-03-01 | 9.37554538E+10 | 0.00025335 | 2000-01-01 | 2010-03-01 |
|
|
| price | :float64 | 123 | 0 | 7.070 | 64.73 | | 223.0 | 6.31237823E+01 | 0.93215285 | 25.94 | 223.0 |
|
|
)
|
|
```
|
|
|
|
Note:
|
|
- `group-by-column` returned a map of stock symbols to datasets of stocks with that symbol
|
|
- On the next line, the function argument to `map` destructured that as tuples with each `v` a dataset
|
|
- Finally `descriptive-stats` operates on those individually, notably doing useful min/mean/max calculations on the dates
|
|
|
|
-----
|
|
|
|
<div id="ElementwiseOperations"></div>
|
|
|
|
## Elementwise Operations
|
|
|
|
The underlying dtype-next system includes a mathematical abstraction designed to work with things like columns.
|
|
|
|
Using this to create new columns as lazily evaluated combinations of other columns:
|
|
|
|
```clojure
|
|
user> (def updated-ames
|
|
(assoc ames-ds
|
|
"TotalBath"
|
|
(dfn/+ (ames-ds "BsmtFullBath")
|
|
(dfn/* 0.5 (ames-ds "BsmtHalfBath"))
|
|
(ames-ds "FullBath")
|
|
(dfn/* 0.5 (ames-ds "HalfBath")))))
|
|
#'user/updated-ames
|
|
user> (ds/head (ds/select-columns updated-ames ["BsmtFullBath" "BsmtHalfBath" "FullBath" "HalfBath" "TotalBath"]))
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz [5 5]:
|
|
|
|
| BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | TotalBath |
|
|
|-------------:|-------------:|---------:|---------:|----------:|
|
|
| 1 | 0 | 2 | 1 | 3.5 |
|
|
| 0 | 1 | 2 | 0 | 2.5 |
|
|
| 1 | 0 | 2 | 1 | 3.5 |
|
|
| 1 | 0 | 1 | 0 | 2.0 |
|
|
| 1 | 0 | 2 | 1 | 3.5 |
|
|
```
|
|
|
|
`dtype-next`'s `emap` enables expressing arbitrary conversions from columns into columns:
|
|
|
|
```clojure
|
|
user> (require '[tech.v3.datatype :as dtype])
|
|
nil
|
|
user> (def named-baths
|
|
(assoc updated-ames "NamedBath" (dtype/emap #(let [tbaths (double %)]
|
|
(cond
|
|
(< tbaths 1.0) "almost none"
|
|
(< tbaths 2.0) "somewhat doable"
|
|
(< tbaths 3.0) "getting somewhere"
|
|
:else "living in style"))
|
|
:string
|
|
(updated-ames "TotalBath"))))
|
|
#'user/named-baths
|
|
user> (ds/head (ds/select-columns named-baths ["TotalBath" "NamedBath"]))
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz [5 2]:
|
|
|
|
| TotalBath | NamedBath |
|
|
|----------:|-------------------|
|
|
| 3.5 | living in style |
|
|
| 2.5 | getting somewhere |
|
|
| 3.5 | living in style |
|
|
| 2.0 | getting somewhere |
|
|
| 3.5 | living in style |
|
|
```
|
|
|
|
Expensive houses have more bathrooms:
|
|
|
|
```clojure
|
|
user> (-> (ds/select-columns named-baths ["TotalBath" "NamedBath" "SalePrice"])
|
|
(ds/sort-by-column "SalePrice" >))
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz [1460 3]:
|
|
|
|
| TotalBath | NamedBath | SalePrice |
|
|
|----------:|-------------------|----------:|
|
|
| 4.0 | living in style | 755000 |
|
|
| 4.5 | living in style | 745000 |
|
|
| 4.5 | living in style | 625000 |
|
|
| 3.5 | living in style | 611657 |
|
|
| 3.5 | living in style | 582933 |
|
|
| 3.5 | living in style | 556581 |
|
|
| 3.0 | living in style | 555000 |
|
|
| 4.5 | living in style | 538000 |
|
|
| 3.0 | living in style | 501837 |
|
|
| 3.5 | living in style | 485000 |
|
|
| ... | ... | ... |
|
|
| 2.0 | getting somewhere | 58500 |
|
|
| 1.0 | somewhat doable | 55993 |
|
|
| 2.0 | getting somewhere | 55000 |
|
|
| 1.0 | somewhat doable | 55000 |
|
|
| 1.0 | somewhat doable | 52500 |
|
|
| 1.0 | somewhat doable | 52000 |
|
|
| 1.0 | somewhat doable | 40000 |
|
|
| 1.0 | somewhat doable | 39300 |
|
|
| 1.0 | somewhat doable | 37900 |
|
|
| 1.0 | somewhat doable | 35311 |
|
|
| 1.0 | somewhat doable | 34900 |
|
|
```
|
|
|
|
### DateTime Types
|
|
|
|
The underlying `dtype-next` system supports sophisticated [datetime](https://cnuernber.github.io/dtype-next/tech.v3.datatype.datetime.html) processing.
|
|
|
|
A more real example of summarizing stock prices by year:
|
|
|
|
```clojure
|
|
user> (def stock-ds (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv"
|
|
{:key-fn keyword}))
|
|
#'user/stock-ds
|
|
user> (ds/head stock-ds)
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv [5 3]:
|
|
|
|
| :symbol | :date | :price |
|
|
|---------|------------|-------:|
|
|
| MSFT | 2000-01-01 | 39.81 |
|
|
| MSFT | 2000-02-01 | 36.35 |
|
|
| MSFT | 2000-03-01 | 43.22 |
|
|
| MSFT | 2000-04-01 | 28.37 |
|
|
| MSFT | 2000-05-01 | 25.45 |
|
|
user> (meta (stock-ds :date))
|
|
{:name :date, :datatype :packed-local-date, :n-elems 560}
|
|
user> (require '[tech.v3.datatype.datetime :as dtype-dt]) ;; for datetime processing
|
|
nil
|
|
user> (require '[tech.v3.datatype.functional :as dfn]) ;; for functional column processing
|
|
nil
|
|
user> (assoc stock-ds :year (dtype-dt/long-temporal-field :years (stock-ds :date)))
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv [560 4]:
|
|
|
|
| :symbol | :date | :price | :year |
|
|
|---------|------------|-------:|------:|
|
|
| MSFT | 2000-01-01 | 39.81 | 2000 |
|
|
| MSFT | 2000-02-01 | 36.35 | 2000 |
|
|
| MSFT | 2000-03-01 | 43.22 | 2000 |
|
|
| MSFT | 2000-04-01 | 28.37 | 2000 |
|
|
| MSFT | 2000-05-01 | 25.45 | 2000 |
|
|
| MSFT | 2000-06-01 | 32.54 | 2000 |
|
|
| MSFT | 2000-07-01 | 28.40 | 2000 |
|
|
| MSFT | 2000-08-01 | 28.40 | 2000 |
|
|
| MSFT | 2000-09-01 | 24.53 | 2000 |
|
|
| MSFT | 2000-10-01 | 28.02 | 2000 |
|
|
| ... | ... | ... | ... |
|
|
| AAPL | 2009-05-01 | 135.81 | 2009 |
|
|
| AAPL | 2009-06-01 | 142.43 | 2009 |
|
|
| AAPL | 2009-07-01 | 163.39 | 2009 |
|
|
| AAPL | 2009-08-01 | 168.21 | 2009 |
|
|
| AAPL | 2009-09-01 | 185.35 | 2009 |
|
|
| AAPL | 2009-10-01 | 188.50 | 2009 |
|
|
| AAPL | 2009-11-01 | 199.91 | 2009 |
|
|
| AAPL | 2009-12-01 | 210.73 | 2009 |
|
|
| AAPL | 2010-01-01 | 192.06 | 2010 |
|
|
| AAPL | 2010-02-01 | 204.62 | 2010 |
|
|
| AAPL | 2010-03-01 | 223.02 | 2010 |
|
|
user> (->> (-> (assoc stock-ds :year (dtype-dt/long-temporal-field :years (stock-ds :date)))
|
|
(ds/group-by (juxt :symbol :year)))
|
|
;; Now operating on the map of pairs to ds returned by `group-by`
|
|
(map (fn [[[symbol year] ds]]
|
|
{:symbol symbol
|
|
:year year
|
|
:mean-price (dfn/mean (ds :price))}))
|
|
(sort-by (juxt :symbol :year))
|
|
(ds/->>dataset {:dataset-name "Mean Stock Price by Year"})
|
|
(ds/print-all))
|
|
Mean Stock Price by Year [51 3]:
|
|
|
|
| :symbol | :year | :mean-price |
|
|
|---------|------:|-------------:|
|
|
| AAPL | 2000 | 21.74833333 |
|
|
| AAPL | 2001 | 10.17583333 |
|
|
| AAPL | 2002 | 9.40833333 |
|
|
| AAPL | 2003 | 9.34750000 |
|
|
| AAPL | 2004 | 18.72333333 |
|
|
| AAPL | 2005 | 48.17166667 |
|
|
| AAPL | 2006 | 72.04333333 |
|
|
| AAPL | 2007 | 133.35333333 |
|
|
| AAPL | 2008 | 138.48083333 |
|
|
| AAPL | 2009 | 150.39333333 |
|
|
| AAPL | 2010 | 206.56666667 |
|
|
| AMZN | 2000 | 43.93083333 |
|
|
| AMZN | 2001 | 11.73916667 |
|
|
| AMZN | 2002 | 16.72333333 |
|
|
| AMZN | 2003 | 39.01666667 |
|
|
| AMZN | 2004 | 43.26750000 |
|
|
| AMZN | 2005 | 40.18750000 |
|
|
| AMZN | 2006 | 36.25166667 |
|
|
| AMZN | 2007 | 69.95250000 |
|
|
| AMZN | 2008 | 69.01500000 |
|
|
| AMZN | 2009 | 90.73083333 |
|
|
| AMZN | 2010 | 124.21000000 |
|
|
| GOOG | 2004 | 159.47600000 |
|
|
| GOOG | 2005 | 286.47250000 |
|
|
| GOOG | 2006 | 415.25666667 |
|
|
| GOOG | 2007 | 548.75833333 |
|
|
| GOOG | 2008 | 454.99916667 |
|
|
| GOOG | 2009 | 449.92000000 |
|
|
| GOOG | 2010 | 538.97666667 |
|
|
| IBM | 2000 | 96.91416667 |
|
|
| IBM | 2001 | 96.96833333 |
|
|
| IBM | 2002 | 75.12500000 |
|
|
| IBM | 2003 | 77.30750000 |
|
|
| IBM | 2004 | 83.88583333 |
|
|
| IBM | 2005 | 77.49750000 |
|
|
| IBM | 2006 | 78.71750000 |
|
|
| IBM | 2007 | 101.27666667 |
|
|
| IBM | 2008 | 107.22500000 |
|
|
| IBM | 2009 | 109.29666667 |
|
|
| IBM | 2010 | 124.85333333 |
|
|
| MSFT | 2000 | 29.67333333 |
|
|
| MSFT | 2001 | 25.34750000 |
|
|
| MSFT | 2002 | 21.82666667 |
|
|
| MSFT | 2003 | 20.93416667 |
|
|
| MSFT | 2004 | 22.67416667 |
|
|
| MSFT | 2005 | 23.84583333 |
|
|
| MSFT | 2006 | 24.75833333 |
|
|
| MSFT | 2007 | 29.28416667 |
|
|
| MSFT | 2008 | 25.20833333 |
|
|
| MSFT | 2009 | 22.87250000 |
|
|
| MSFT | 2010 | 28.50666667 |
|
|
```
|
|
|
|
-----
|
|
|
|
<div id="GettingDataBackOut"></div>
|
|
|
|
## Writing A Dataset Out
|
|
|
|
The `write!` function puts a dataset in a file or on a stream:
|
|
|
|
```clojure
|
|
(ds/write! test-ds "test.csv")
|
|
(ds/write! test-ds out-stream)
|
|
(ds/write! test-ds "test.nippy")
|
|
```
|
|
|
|
The file format is derived from the extension.
|
|
|
|
The excellent [nippy](nippy-serialization-rocks.md) format is a good choice if downstream consumers are other TMD users, in which case datasets work just like any other edn data. While CSV stores numbers as ascii text, nippy stores them binary, saving some space, but the real win is 10X-100X loading performance improvement when compared to csv/tsv.
|
|
|
|
Moreover, because nippy knows Clojure, you can write out heterogeneous data structures that contain datasets and other things such as the result of a group-by:
|
|
|
|
```clojure
|
|
user> (require '[taoensso.nippy :as nippy])
|
|
nil
|
|
user> (def byte-data (nippy/freeze (ds/group-by stock-ds :symbol)))
|
|
#'user/byte-data
|
|
user> (type byte-data)
|
|
[B
|
|
user> (keys (nippy/thaw byte-data))
|
|
("MSFT" "GOOG" "AAPL" "IBM" "AMZN")
|
|
user> (first (nippy/thaw byte-data))
|
|
["MSFT"
|
|
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv [123 3]:
|
|
|
|
| :symbol | :date | :price |
|
|
|---------|------------|--------|
|
|
| MSFT | 2000-01-01 | 39.81 |
|
|
| MSFT | 2000-02-01 | 36.35 |
|
|
| MSFT | 2000-03-01 | 43.22 |
|
|
| MSFT | 2000-04-01 | 28.37 |
|
|
| MSFT | 2000-05-01 | 25.45 |
|
|
```
|
|
|
|
For Apache arrow support, see the [tech.v3.libs.arrow](https://techascent.github.io/tech.ml.dataset/tech.v3.libs.arrow.html) documentation.
|
|
|
|
For Parquet, see the [tech.v3.libs.parquet](https://techascent.github.io/tech.ml.dataset/tech.v3.libs.parquet.html) documentation.
|
|
|
|
-----
|
|
|
|
## Further Reading
|
|
|
|
- The [quick reference](https://techascent.github.io/tech.ml.dataset/200-quick-reference.html) summarizes many of the most frequently used functions with hints about their use.
|
|
|
|
- The [API docs](https://techascent.github.io/tech.ml.dataset/index.html) list every function available in TMD.
|