``` clojure
tech-ml-version
```
“7.021”
``` clojure
tablecloth-version
```
“7.021”
## Introduction
[tech.ml.dataset](https://github.com/techascent/tech.ml.dataset) is a
great and fast library which brings columnar dataset to the Clojure.
Chris Nuernberger has been working on this library for last year as a
part of bigger `tech.ml` stack.
I’ve started to test the library and help to fix uncovered bugs. My main
goal was to compare functionalities with the other standards from other
platforms. I focused on R solutions:
[dplyr](https://dplyr.tidyverse.org/),
[tidyr](https://tidyr.tidyverse.org/) and
[data.table](https://rdatatable.gitlab.io/data.table/).
During conversions of the examples I’ve come up how to reorganized
existing `tech.ml.dataset` functions into simple to use API. The main
goals were:
- Focus on dataset manipulation functionality, leaving other parts of
`tech.ml` like pipelines, datatypes, readers, ML, etc.
- Single entry point for common operations - one function dispatching on
given arguments.
- `group-by` results with special kind of dataset - a dataset containing
subsets created after grouping as a column.
- Most operations recognize regular dataset and grouped dataset and
process data accordingly.
- One function form to enable thread-first on dataset.
If you want to know more about `tech.ml.dataset` and `dtype-next` please
refer their documentation:
- [tech.ml.dataset
walkthrough](https://techascent.github.io/tech.ml.dataset/walkthrough.html)
- [dtype-next
overview](https://cnuernber.github.io/dtype-next/overview.html)
- [dtype-next
cheatsheet](https://cnuernber.github.io/dtype-next/cheatsheet.html)
[SOURCE CODE](https://github.com/scicloj/tablecloth)
Join the discussion on
[Zulip](https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/api)
Let’s require main namespace and define dataset used in most examples:
``` clojure
(require '[tablecloth.api :as tc]
'[tech.v3.datatype.functional :as dfn])
(def DS (tc/dataset {:V1 (take 9 (cycle [1 2]))
:V2 (range 1 10)
:V3 (take 9 (cycle [0.5 1.0 1.5]))
:V4 (take 9 (cycle ["A" "B" "C"]))}))
```
``` clojure
DS
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
## Functionality
### Dataset
Dataset is a special type which can be considered as a map of columns
implemented around `tech.ml.dataset` library. Each column can be
considered as named sequence of typed data. Supported types include
integers, floats, string, boolean, date/time, objects etc.
#### Dataset creation
Dataset can be created from various of types of Clojure structures and
files:
- single values
- sequence of maps
- map of sequences or values
- sequence of columns (taken from other dataset or created manually)
- sequence of pairs: `[string column-data]` or `[keyword column-data]`
- array of any arrays
- file types: raw/gzipped csv/tsv, json, xls(x) taken from local file
system or URL
- input stream
`tc/dataset` accepts:
- data
- options (see documentation of `tech.ml.dataset/->dataset` function for
full list):
- `:dataset-name` - name of the dataset
- `:num-rows` - number of rows to read from file
- `:header-row?` - indication if first row in file is a header
- `:key-fn` - function applied to column names (eg. `keyword`, to
convert column names to keywords)
- `:separator` - column separator
- `:single-value-column-name` - name of the column when single value
is provided
- `:column-names` - in case you want to name columns - only works for
sequential input (arrays) or empty dataset
- `:layout` - for numerical, native array of arrays - treat entries
`:as-rows` or `:as-columns` (default)
`tc/let-dataset` accepts bindings `symbol`-`column-data` to simulate R’s
`tibble` function. Each binding is converted into a column. You can
refer previous columns to in further bindings (as in `let`).
------------------------------------------------------------------------
Empty dataset.
``` clojure
(tc/dataset)
```
_unnamed [0 0]
------------------------------------------------------------------------
Empty dataset with column names
``` clojure
(tc/dataset nil {:column-names [:a :b]})
```
_unnamed [0 2]:
| :a | :b |
|----|----|
------------------------------------------------------------------------
Sequence of pairs (first = column name, second = value(s)).
``` clojure
(tc/dataset [[:A 33] [:B 5] [:C :a]])
```
\_unnamed \[1 3\]:
| :A | :B | :C |
|----:|----:|-----|
| 33 | 5 | :a |
------------------------------------------------------------------------
Not sequential values are repeated row-count number of times.
``` clojure
(tc/dataset [[:A [1 2 3 4 5 6]] [:B "X"] [:C :a]])
```
\_unnamed \[6 3\]:
| :A | :B | :C |
|----:|-----|-----|
| 1 | X | :a |
| 2 | X | :a |
| 3 | X | :a |
| 4 | X | :a |
| 5 | X | :a |
| 6 | X | :a |
------------------------------------------------------------------------
Dataset created from map (keys = column names, vals = value(s)). Works
the same as sequence of pairs.
``` clojure
(tc/dataset {:A 33})
(tc/dataset {:A [1 2 3]})
(tc/dataset {:A [3 4 5] :B "X"})
```
\_unnamed \[1 1\]:
| :A |
|----:|
| 33 |
\_unnamed \[3 1\]:
| :A |
|----:|
| 1 |
| 2 |
| 3 |
\_unnamed \[3 2\]:
| :A | :B |
|----:|-----|
| 3 | X |
| 4 | X |
| 5 | X |
------------------------------------------------------------------------
You can put any value inside a column
``` clojure
(tc/dataset {:A [[3 4 5] [:a :b]] :B "X"})
```
\_unnamed \[2 2\]:
| :A | :B |
|-----------|-----|
| \[3 4 5\] | X |
| \[:a :b\] | X |
------------------------------------------------------------------------
Sequence of maps
``` clojure
(tc/dataset [{:a 1 :b 3} {:b 2 :a 99}])
(tc/dataset [{:a 1 :b [1 2 3]} {:a 2 :b [3 4]}])
```
\_unnamed \[2 2\]:
| :a | :b |
|----:|----:|
| 1 | 3 |
| 99 | 2 |
\_unnamed \[2 2\]:
| :a | :b |
|----:|-----------|
| 1 | \[1 2 3\] |
| 2 | \[3 4\] |
------------------------------------------------------------------------
Missing values are marked by `nil`
``` clojure
(tc/dataset [{:a nil :b 1} {:a 3 :b 4} {:a 11}])
```
\_unnamed \[3 2\]:
| :a | :b |
|----:|----:|
| | 1 |
| 3 | 4 |
| 11 | |
------------------------------------------------------------------------
Reading from arrays, by default `:as-rows`
``` clojure
(-> (map int-array [[1 2] [3 4] [5 6]])
(into-array)
(tc/dataset))
```
:\_unnamed \[3 2\]:
| 0 | 1 |
|----:|----:|
| 1 | 2 |
| 3 | 4 |
| 5 | 6 |
`:as-columns`
``` clojure
(-> (map int-array [[1 2] [3 4] [5 6]])
(into-array)
(tc/dataset {:layout :as-columns}))
```
:\_unnamed \[2 3\]:
| 0 | 1 | 2 |
|----:|----:|----:|
| 1 | 3 | 5 |
| 2 | 4 | 6 |
`:as-rows` with names
``` clojure
(-> (map int-array [[1 2] [3 4] [5 6]])
(into-array)
(tc/dataset {:layout :as-rows
:column-names [:a :b]}))
```
:\_unnamed \[3 2\]:
| :a | :b |
|----:|----:|
| 1 | 2 |
| 3 | 4 |
| 5 | 6 |
Any objects
``` clojure
(-> (map to-array [[:a :z] ["ee" "ww"] [9 10]])
(into-array)
(tc/dataset {:column-names [:a :b :c]
:layout :as-columns}))
```
:\_unnamed \[2 3\]:
| :a | :b | :c |
|-----|-----|----:|
| :a | ee | 9 |
| :z | ww | 10 |
------------------------------------------------------------------------
Create dataset using macro `let-dataset` to simulate R `tibble`
function. Each binding is converted into a column.
``` clojure
(tc/let-dataset [x (range 1 6)
y 1
z (dfn/+ x y)])
```
\_unnamed \[5 3\]:
| :x | :y | :z |
|----:|----:|----:|
| 1 | 1 | 2 |
| 2 | 1 | 3 |
| 3 | 1 | 4 |
| 4 | 1 | 5 |
| 5 | 1 | 6 |
------------------------------------------------------------------------
Import CSV file
``` clojure
(tc/dataset "data/family.csv")
```
data/family.csv \[5 5\]:
| family | dob_child1 | dob_child2 | gender_child1 | gender_child2 |
|-------:|------------|------------|--------------:|--------------:|
| 1 | 1998-11-26 | 2000-01-29 | 1 | 2 |
| 2 | 1996-06-22 | | 2 | |
| 3 | 2002-07-11 | 2004-04-05 | 2 | 2 |
| 4 | 2004-10-10 | 2009-08-27 | 1 | 1 |
| 5 | 2000-12-05 | 2005-02-28 | 2 | 1 |
------------------------------------------------------------------------
Import from URL
``` clojure
(defonce ds (tc/dataset "https://vega.github.io/vega-lite/examples/data/seattle-weather.csv"))
```
``` clojure
ds
```
\[1461 6\]:
| date | precipitation | temp_max | temp_min | wind | weather |
|------------|--------------:|---------:|---------:|-----:|---------|
| 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle |
| 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain |
| 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain |
| 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain |
| 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain |
| 2012-01-06 | 2.5 | 4.4 | 2.2 | 2.2 | rain |
| 2012-01-07 | 0.0 | 7.2 | 2.8 | 2.3 | rain |
| 2012-01-08 | 0.0 | 10.0 | 2.8 | 2.0 | sun |
| 2012-01-09 | 4.3 | 9.4 | 5.0 | 3.4 | rain |
| 2012-01-10 | 1.0 | 6.1 | 0.6 | 3.4 | rain |
| … | … | … | … | … | … |
| 2015-12-21 | 27.4 | 5.6 | 2.8 | 4.3 | rain |
| 2015-12-22 | 4.6 | 7.8 | 2.8 | 5.0 | rain |
| 2015-12-23 | 6.1 | 5.0 | 2.8 | 7.6 | rain |
| 2015-12-24 | 2.5 | 5.6 | 2.2 | 4.3 | rain |
| 2015-12-25 | 5.8 | 5.0 | 2.2 | 1.5 | rain |
| 2015-12-26 | 0.0 | 4.4 | 0.0 | 2.5 | sun |
| 2015-12-27 | 8.6 | 4.4 | 1.7 | 2.9 | rain |
| 2015-12-28 | 1.5 | 5.0 | 1.7 | 1.3 | rain |
| 2015-12-29 | 0.0 | 7.2 | 0.6 | 2.6 | fog |
| 2015-12-30 | 0.0 | 5.6 | -1.0 | 3.4 | sun |
| 2015-12-31 | 0.0 | 5.6 | -2.1 | 3.5 | sun |
------------------------------------------------------------------------
When none of above works, singleton dataset is created. Along with the
error message from the exception thrown by `tech.ml.dataset`
``` clojure
(tc/dataset 999)
```
\_unnamed \[1 2\]:
| :$value | :$error | |
|------------------:|----------------------------------------------------|
| 999 | Don’t know how to create ISeq from: java.lang.Long |
To see the stack trace, turn it on by setting `:stack-trace?` to `true`.
------------------------------------------------------------------------
Set column name for single value. Also set the dataset name and turn off
creating error message column.
``` clojure
(tc/dataset 999 {:single-value-column-name "my-single-value"
:error-column? false})
(tc/dataset 999 {:single-value-column-name ""
:dataset-name "Single value"
:error-column? false})
```
\_unnamed \[1 1\]:
| my-single-value |
|----------------:|
| 999 |
Single value \[1 1\]:
| 0 |
|----:|
| 999 |
#### Saving
Export dataset to a file or output stream can be done by calling
`tc/write!`. Function accepts:
- dataset
- file name with one of the extensions: `.csv`, `.tsv`, `.csv.gz` and
`.tsv.gz` or output stream
- options:
- `:separator` - string or separator char.
``` clojure
(tc/write! ds "output.tsv.gz")
(.exists (clojure.java.io/file "output.tsv.gz"))
```
1462
true
##### Nippy
``` clojure
(tc/write! DS "output.nippy.gz")
```
nil
``` clojure
(tc/dataset "output.nippy.gz")
```
output.nippy.gz \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
#### Dataset related functions
Summary functions about the dataset like number of rows, columns and
basic stats.
------------------------------------------------------------------------
Number of rows
``` clojure
(tc/row-count ds)
```
1461
------------------------------------------------------------------------
Number of columns
``` clojure
(tc/column-count ds)
```
6
------------------------------------------------------------------------
Shape of the dataset, \[row count, column count\]
``` clojure
(tc/shape ds)
```
[1461 6]
------------------------------------------------------------------------
General info about dataset. There are three variants:
- default - containing information about columns with basic statistics
- `:basic` - just name, row and column count and information if dataset
is a result of `group-by` operation
- `:columns` - columns’ metadata
``` clojure
(tc/info ds)
(tc/info ds :basic)
(tc/info ds :columns)
```
:
descriptive-stats \[6 12\]:
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last |
|---------------|--------------------|---------:|-----------:|------------|------------|-------|------------|--------------------:|----------------:|------------|------------|
| date | :packed-local-date | 1461 | 0 | 2012-01-01 | 2013-12-31 | | 2015-12-31 | 3.64520463E+10 | 1.98971418E-17 | 2012-01-01 | 2015-12-31 |
| precipitation | :float64 | 1461 | 0 | 0.000 | 3.029 | | 55.90 | 6.68019432E+00 | 3.50564372E+00 | 0.000 | 0.000 |
| temp_max | :float64 | 1461 | 0 | -1.600 | 16.44 | | 35.60 | 7.34975810E+00 | 2.80929992E-01 | 12.80 | 5.600 |
| temp_min | :float64 | 1461 | 0 | -7.100 | 8.235 | | 18.30 | 5.02300418E+00 | -2.49458552E-01 | 5.000 | -2.100 |
| wind | :float64 | 1461 | 0 | 0.4000 | 3.241 | | 9.500 | 1.43782506E+00 | 8.91667519E-01 | 4.700 | 3.500 |
| weather | :string | 1461 | 0 | | | rain | | | | drizzle | sun |
:basic info \[1 4\]:
| :name | :grouped? | :rows | :columns |
|----------------------------------------------------------------------|-----------|------:|---------:|
| | false | 1461 | 6 |
:column info \[6 4\]:
| :name | :datatype | :n-elems | :categorical? |
|---------------|--------------------|---------:|---------------|
| date | :packed-local-date | 1461 | |
| precipitation | :float64 | 1461 | |
| temp_max | :float64 | 1461 | |
| temp_min | :float64 | 1461 | |
| wind | :float64 | 1461 | |
| weather | :string | 1461 | true |
------------------------------------------------------------------------
Getting a dataset name
``` clojure
(tc/dataset-name ds)
```
"https://vega.github.io/vega-lite/examples/data/seattle-weather.csv"
------------------------------------------------------------------------
Setting a dataset name (operation is immutable).
``` clojure
(->> "seattle-weather"
(tc/set-dataset-name ds)
(tc/dataset-name))
```
"seattle-weather"
#### Columns and rows
Get columns and rows as sequences. `column`, `columns` and `rows` treat
grouped dataset as regular one. See `Groups` to read more about grouped
datasets.
Possible result types:
- `:as-seq` or `:as-seqs` - sequence of seqences (default)
- `:as-maps` - sequence of maps (rows)
- `:as-map` - map of sequences (columns)
- `:as-double-arrays` - array of double arrays
- `:as-vecs` - sequence of vectors (rows)
For `rows` setting `:nil-missing?` option to `false` will elide keys for
nil values.
------------------------------------------------------------------------
Select column.
``` clojure
(ds "wind")
(tc/column ds "date")
```
#tech.v3.dataset.column[1461]
wind
[4.700, 4.500, 2.300, 4.700, 6.100, 2.200, 2.300, 2.000, 3.400, 3.400, 5.100, 1.900, 1.300, 5.300, 3.200, 5.000, 5.600, 5.000, 1.600, 2.300...]
#tech.v3.dataset.column[1461]
date
[2012-01-01, 2012-01-02, 2012-01-03, 2012-01-04, 2012-01-05, 2012-01-06, 2012-01-07, 2012-01-08, 2012-01-09, 2012-01-10, 2012-01-11, 2012-01-12, 2012-01-13, 2012-01-14, 2012-01-15, 2012-01-16, 2012-01-17, 2012-01-18, 2012-01-19, 2012-01-20...]
------------------------------------------------------------------------
Columns as sequence
``` clojure
(take 2 (tc/columns ds))
```
(#tech.v3.dataset.column[1461]
date
[2012-01-01, 2012-01-02, 2012-01-03, 2012-01-04, 2012-01-05, 2012-01-06, 2012-01-07, 2012-01-08, 2012-01-09, 2012-01-10, 2012-01-11, 2012-01-12, 2012-01-13, 2012-01-14, 2012-01-15, 2012-01-16, 2012-01-17, 2012-01-18, 2012-01-19, 2012-01-20...] #tech.v3.dataset.column[1461]
precipitation
[0.000, 10.90, 0.8000, 20.30, 1.300, 2.500, 0.000, 0.000, 4.300, 1.000, 0.000, 0.000, 0.000, 4.100, 5.300, 2.500, 8.100, 19.80, 15.20, 13.50...])
------------------------------------------------------------------------
Columns as map
``` clojure
(keys (tc/columns ds :as-map))
```
("date" "precipitation" "temp_max" "temp_min" "wind" "weather")
------------------------------------------------------------------------
Rows as sequence of sequences
``` clojure
(take 2 (tc/rows ds))
```
([#object[java.time.LocalDate 0x6fdedb57 "2012-01-01"] 0.0 12.8 5.0 4.7 "drizzle"] [#object[java.time.LocalDate 0x31a1da23 "2012-01-02"] 10.9 10.6 2.8 4.5 "rain"])
------------------------------------------------------------------------
Select rows/columns as double-double-array
``` clojure
(-> ds
(tc/select-columns :type/numerical)
(tc/head)
(tc/rows :as-double-arrays))
```
#object["[[D" 0x620232bc "[[D@620232bc"]
``` clojure
(-> ds
(tc/select-columns :type/numerical)
(tc/head)
(tc/columns :as-double-arrays))
```
#object["[[D" 0x73a86a0d "[[D@73a86a0d"]
------------------------------------------------------------------------
Rows as sequence of maps
``` clojure
(clojure.pprint/pprint (take 2 (tc/rows ds :as-maps)))
```
({"date" #object[java.time.LocalDate 0x451a2065 "2012-01-01"],
"precipitation" 0.0,
"temp_max" 12.8,
"temp_min" 5.0,
"wind" 4.7,
"weather" "drizzle"}
{"date" #object[java.time.LocalDate 0x38ca78fb "2012-01-02"],
"precipitation" 10.9,
"temp_max" 10.6,
"temp_min" 2.8,
"wind" 4.5,
"weather" "rain"})
------------------------------------------------------------------------
Rows with missing values
``` clojure
(-> {:a [1 nil 2]
:b [3 4 nil]}
(tc/dataset)
(tc/rows :as-maps))
```
[{:a 1, :b 3} {:a nil, :b 4} {:a 2, :b nil}]
Rows with elided missing values
``` clojure
(-> {:a [1 nil 2]
:b [3 4 nil]}
(tc/dataset)
(tc/rows :as-maps {:nil-missing? false}))
```
[{:a 1, :b 3} {:b 4} {:a 2}]
#### Single entry
Get single value from the table using `get-in` from Clojure API or
`get-entry`. First argument is column name, second is row number.
``` clojure
(get-in ds ["wind" 2])
```
2.3
``` clojure
(tc/get-entry ds "wind" 2)
```
2.3
#### Printing
Dataset is printed using `dataset->str` or `print-dataset` functions.
Options are the same as in `tech.ml.dataset/dataset-data->str`. Most
important is `:print-line-policy` which can be one of the: `:single`,
`:repl` or `:markdown`.
``` clojure
(tc/print-dataset (tc/group-by DS :V1) {:print-line-policy :markdown})
```
_unnamed [2 3]:
| :name | :group-id | :data |
|------:|----------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1 | 0 | Group: 1 [5 4]:
\| :V1 \| :V2 \| :V3 \| :V4 \|
\|----:\|----:\|----:\|-----\|
\| 1 \| 1 \| 0.5 \| A \|
\| 1 \| 3 \| 1.5 \| C \|
\| 1 \| 5 \| 1.0 \| B \|
\| 1 \| 7 \| 0.5 \| A \|
\| 1 \| 9 \| 1.5 \| C \| |
| 2 | 1 | Group: 2 [4 4]:
\| :V1 \| :V2 \| :V3 \| :V4 \|
\|----:\|----:\|----:\|-----\|
\| 2 \| 2 \| 1.0 \| B \|
\| 2 \| 4 \| 0.5 \| A \|
\| 2 \| 6 \| 1.5 \| C \|
\| 2 \| 8 \| 1.0 \| B \| |
``` clojure
(tc/print-dataset (tc/group-by DS :V1) {:print-line-policy :repl})
```
_unnamed [2 3]:
| :name | :group-id | :data |
|------:|----------:|--------------------------------|
| 1 | 0 | Group: 1 [5 4]: |
| | | |
| | | \| :V1 \| :V2 \| :V3 \| :V4 \| |
| | | \|----:\|----:\|----:\|-----\| |
| | | \| 1 \| 1 \| 0.5 \| A \| |
| | | \| 1 \| 3 \| 1.5 \| C \| |
| | | \| 1 \| 5 \| 1.0 \| B \| |
| | | \| 1 \| 7 \| 0.5 \| A \| |
| | | \| 1 \| 9 \| 1.5 \| C \| |
| 2 | 1 | Group: 2 [4 4]: |
| | | |
| | | \| :V1 \| :V2 \| :V3 \| :V4 \| |
| | | \|----:\|----:\|----:\|-----\| |
| | | \| 2 \| 2 \| 1.0 \| B \| |
| | | \| 2 \| 4 \| 0.5 \| A \| |
| | | \| 2 \| 6 \| 1.5 \| C \| |
| | | \| 2 \| 8 \| 1.0 \| B \| |
``` clojure
(tc/print-dataset (tc/group-by DS :V1) {:print-line-policy :single})
```
_unnamed [2 3]:
| :name | :group-id | :data |
|------:|----------:|-----------------|
| 1 | 0 | Group: 1 [5 4]: |
| 2 | 1 | Group: 2 [4 4]: |
### Group-by
Grouping by is an operation which splits dataset into subdatasets and
packs it into new special type of… dataset. I distinguish two types of
dataset: regular dataset and grouped dataset. The latter is the result
of grouping.
Operations that perform a transformation on a regular dataset, generally
apply that same transformation to individual sub-datasets in a grouped
dataset. For example,
``` clojure
(tc/select-rows DS [0 1 2])
```
_unnamed [3 4]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
returns a dataset containing only the first three rows of `DS`, while
``` clojure
(-> DS
(tc/group-by :V1)
(tc/select-rows [0 1 2]))
```
_unnamed [2 3]:
| :name | :group-id | :data |
|------:|----------:|-----------------|
| 1 | 0 | Group: 1 [3 4]: |
| 2 | 1 | Group: 2 [3 4]: |
returns a grouped dataset, in which each sub-dataset contains only the
first three rows of the sub-datasets in the grouped dataset created by
`(tc/group-by DS :V1)`.
Almost all functions recognize type of the dataset (grouped or not) and
operate accordingly.
However, you can’t apply reshaping or join/concat functions on grouped
datasets.
Grouped dataset is annotated by the `:grouped?` meta tag and consists of
the following columns:
- `:name` - group name or structure
- `:group-id` - integer assigned to the group
- `:data` - groups as datasets
#### Grouping
Grouping is done by calling `group-by` function with arguments:
- `ds` - dataset
- `grouping-selector` - what to use for grouping
- options:
- `:result-type` - what to return:
- `:as-dataset` (default) - return grouped dataset
- `:as-indexes` - return rows ids (row number from original dataset)
- `:as-map` - return map with group names as keys and subdataset as
values
- `:as-seq` - return sequens of subdatasets
- `:select-keys` - list of the columns passed to a grouping selector
function
All subdatasets (groups) have set name as the group name, additionally
`group-id` is in meta.
Grouping can be done by:
- single column name
- seq of column names
- value returned by function taking row as map (limited to
`:select-keys`)
- map of keys (arbitrary group names) to sequences of row indexes
In the case of the first three of these methods, each sub-dataset
contains all and only rows from the original data set that share the
same grouping value:
- the value of the row in a specified single column
- a map from column names to corresponding values found in the row
- the value returned by the function taking row as map
In the case of the map from group names to sequences of indexes, each
sub-dataset will contain all and only rows with the indexes listed in
the sequence for a given group name (a key).
Note: currently dataset inside dataset is printed recursively so it
renders poorly from markdown. So I will use `:as-seq` result type to
show just group names and groups.
------------------------------------------------------------------------
List of columns in grouped dataset
``` clojure
(-> DS
(tc/group-by :V1)
(tc/column-names))
```
(:V1 :V2 :V3 :V4)
------------------------------------------------------------------------
List of columns in grouped dataset treated as regular dataset
``` clojure
(-> DS
(tc/group-by :V1)
(tc/as-regular-dataset)
(tc/column-names))
```
(:name :group-id :data)
------------------------------------------------------------------------
Content of the grouped dataset
``` clojure
(tc/columns (tc/group-by DS :V1) :as-map)
```
{:name #tech.v3.dataset.column[2]
:name
[1, 2], :group-id #tech.v3.dataset.column[2]
:group-id
[0, 1], :data #tech.v3.dataset.column[2]
:data
[Group: 1 [5 4]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 1 | 3 | 1.5 | C |
| 1 | 5 | 1.0 | B |
| 1 | 7 | 0.5 | A |
| 1 | 9 | 1.5 | C |
, Group: 2 [4 4]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 2 | 1.0 | B |
| 2 | 4 | 0.5 | A |
| 2 | 6 | 1.5 | C |
| 2 | 8 | 1.0 | B |
]}
------------------------------------------------------------------------
Grouped dataset as map
``` clojure
(keys (tc/group-by DS :V1 {:result-type :as-map}))
```
(1 2)
``` clojure
(vals (tc/group-by DS :V1 {:result-type :as-map}))
```
(Group: 1 \[5 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 1 | 3 | 1.5 | C |
| 1 | 5 | 1.0 | B |
| 1 | 7 | 0.5 | A |
| 1 | 9 | 1.5 | C |
Group: 2 \[4 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 2 | 1.0 | B |
| 2 | 4 | 0.5 | A |
| 2 | 6 | 1.5 | C |
| 2 | 8 | 1.0 | B |
)
------------------------------------------------------------------------
Group dataset as map of indexes (row ids)
``` clojure
(tc/group-by DS :V1 {:result-type :as-indexes})
```
{1 [0 2 4 6 8], 2 [1 3 5 7]}
------------------------------------------------------------------------
Grouped datasets are printed as follows by default.
``` clojure
(tc/group-by DS :V1)
```
\_unnamed \[2 3\]:
| :name | :group-id | :data |
|------:|----------:|-------------------|
| 1 | 0 | Group: 1 \[5 4\]: |
| 2 | 1 | Group: 2 \[4 4\]: |
------------------------------------------------------------------------
To get groups as sequence or a map can be done from grouped dataset
using `groups->seq` and `groups->map` functions.
Groups as seq can be obtained by just accessing `:data` column.
I will use temporary dataset here.
``` clojure
(let [ds (-> {"a" [1 1 2 2]
"b" ["a" "b" "c" "d"]}
(tc/dataset)
(tc/group-by "a"))]
(seq (ds :data))) ;; seq is not necessary but Markdown treats `:data` as command here
```
(Group: 1 \[2 2\]:
| a | b |
|----:|-----|
| 1 | a |
| 1 | b |
Group: 2 \[2 2\]:
| a | b |
|----:|-----|
| 2 | c |
| 2 | d |
)
``` clojure
(-> {"a" [1 1 2 2]
"b" ["a" "b" "c" "d"]}
(tc/dataset)
(tc/group-by "a")
(tc/groups->seq))
```
(Group: 1 \[2 2\]:
| a | b |
|----:|-----|
| 1 | a |
| 1 | b |
Group: 2 \[2 2\]:
| a | b |
|----:|-----|
| 2 | c |
| 2 | d |
)
------------------------------------------------------------------------
Groups as map
``` clojure
(-> {"a" [1 1 2 2]
"b" ["a" "b" "c" "d"]}
(tc/dataset)
(tc/group-by "a")
(tc/groups->map))
```
{1 Group: 1 \[2 2\]:
| a | b |
|----:|-----|
| 1 | a |
| 1 | b |
, 2 Group: 2 \[2 2\]:
| a | b |
|----:|-----|
| 2 | c |
| 2 | d |
}
------------------------------------------------------------------------
Grouping by more than one column. You can see that group names are maps.
When ungrouping is done these maps are used to restore column names.
``` clojure
(tc/group-by DS [:V1 :V3] {:result-type :as-seq})
```
(Group: {:V1 1, :V3 0.5} \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 1 | 7 | 0.5 | A |
Group: {:V1 2, :V3 1.0} \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 2 | 1.0 | B |
| 2 | 8 | 1.0 | B |
Group: {:V1 1, :V3 1.5} \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 3 | 1.5 | C |
| 1 | 9 | 1.5 | C |
Group: {:V1 2, :V3 0.5} \[1 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 4 | 0.5 | A |
Group: {:V1 1, :V3 1.0} \[1 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 5 | 1.0 | B |
Group: {:V1 2, :V3 1.5} \[1 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 6 | 1.5 | C |
)
------------------------------------------------------------------------
Grouping can be done by providing just row indexes. This way you can
assign the same row to more than one group.
``` clojure
(tc/group-by DS {"group-a" [1 2 1 2]
"group-b" [5 5 5 1]} {:result-type :as-seq})
```
(Group: group-a \[4 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
Group: group-b \[4 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 6 | 1.5 | C |
| 2 | 6 | 1.5 | C |
| 2 | 6 | 1.5 | C |
| 2 | 2 | 1.0 | B |
)
------------------------------------------------------------------------
You can group by a result of grouping function which gets row as map and
should return group name. When map is used as a group name, ungrouping
restore original column names.
``` clojure
(tc/group-by DS (fn [row] (* (:V1 row)
(:V3 row))) {:result-type :as-seq})
```
(Group: 0.5 \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 1 | 7 | 0.5 | A |
Group: 2.0 \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 2 | 1.0 | B |
| 2 | 8 | 1.0 | B |
Group: 1.5 \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 3 | 1.5 | C |
| 1 | 9 | 1.5 | C |
Group: 1.0 \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
Group: 3.0 \[1 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 6 | 1.5 | C |
)
------------------------------------------------------------------------
You can use any predicate on column to split dataset into two groups.
``` clojure
(tc/group-by DS (comp #(< % 1.0) :V3) {:result-type :as-seq})
```
(Group: true \[3 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 4 | 0.5 | A |
| 1 | 7 | 0.5 | A |
Group: false \[6 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
)
------------------------------------------------------------------------
`juxt` is also helpful
``` clojure
(tc/group-by DS (juxt :V1 :V3) {:result-type :as-seq})
```
(Group: \[1 0.5\] \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 1 | 7 | 0.5 | A |
Group: \[2 1.0\] \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 2 | 1.0 | B |
| 2 | 8 | 1.0 | B |
Group: \[1 1.5\] \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 3 | 1.5 | C |
| 1 | 9 | 1.5 | C |
Group: \[2 0.5\] \[1 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 4 | 0.5 | A |
Group: \[1 1.0\] \[1 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 5 | 1.0 | B |
Group: \[2 1.5\] \[1 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 6 | 1.5 | C |
)
------------------------------------------------------------------------
`tech.ml.dataset` provides an option to limit columns which are passed
to grouping functions. It’s done for performance purposes.
``` clojure
(tc/group-by DS identity {:result-type :as-seq
:select-keys [:V1]})
```
(Group: {:V1 1} \[5 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 1 | 3 | 1.5 | C |
| 1 | 5 | 1.0 | B |
| 1 | 7 | 0.5 | A |
| 1 | 9 | 1.5 | C |
Group: {:V1 2} \[4 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 2 | 1.0 | B |
| 2 | 4 | 0.5 | A |
| 2 | 6 | 1.5 | C |
| 2 | 8 | 1.0 | B |
)
#### Ungrouping
Ungrouping simply concats all the groups into the dataset. Following
options are possible
- `:order?` - order groups according to the group name ascending order.
Default: `false`
- `:add-group-as-column` - should group name become a column? If yes
column is created with provided name (or `:$group-name` if argument is
`true`). Default: `nil`.
- `:add-group-id-as-column` - should group id become a column? If yes
column is created with provided name (or `:$group-id` if argument is
`true`). Default: `nil`.
- `:dataset-name` - to name resulting dataset. Default: `nil`
(\_unnamed)
If group name is a map, it will be splitted into separate columns. Be
sure that groups (subdatasets) doesn’t contain the same columns already.
If group name is a vector, it will be splitted into separate columns. If
you want to name them, set vector of target column names as
`:add-group-as-column` argument.
After ungrouping, order of the rows is kept within the groups but groups
are ordered according to the internal storage.
------------------------------------------------------------------------
Grouping and ungrouping.
``` clojure
(-> DS
(tc/group-by :V3)
(tc/ungroup))
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 4 | 0.5 | A |
| 1 | 7 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 5 | 1.0 | B |
| 2 | 8 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 6 | 1.5 | C |
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
Groups sorted by group name and named.
``` clojure
(-> DS
(tc/group-by :V3)
(tc/ungroup {:order? true
:dataset-name "Ordered by V3"}))
```
Ordered by V3 \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 4 | 0.5 | A |
| 1 | 7 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 5 | 1.0 | B |
| 2 | 8 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 6 | 1.5 | C |
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
Groups sorted descending by group name and named.
``` clojure
(-> DS
(tc/group-by :V3)
(tc/ungroup {:order? :desc
:dataset-name "Ordered by V3 descending"}))
```
Ordered by V3 descending \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 3 | 1.5 | C |
| 2 | 6 | 1.5 | C |
| 1 | 9 | 1.5 | C |
| 2 | 2 | 1.0 | B |
| 1 | 5 | 1.0 | B |
| 2 | 8 | 1.0 | B |
| 1 | 1 | 0.5 | A |
| 2 | 4 | 0.5 | A |
| 1 | 7 | 0.5 | A |
------------------------------------------------------------------------
Let’s add group name and id as additional columns
``` clojure
(-> DS
(tc/group-by (comp #(< % 4) :V2))
(tc/ungroup {:add-group-as-column true
:add-group-id-as-column true}))
```
\_unnamed \[9 6\]:
| :$group-name | :$group-id | :V1 | :V2 | :V3 | :V4 | |
|---------------------------|----:|----:|----:|----:|-----|
| true | 0 | 1 | 1 | 0.5 | A |
| true | 0 | 2 | 2 | 1.0 | B |
| true | 0 | 1 | 3 | 1.5 | C |
| false | 1 | 2 | 4 | 0.5 | A |
| false | 1 | 1 | 5 | 1.0 | B |
| false | 1 | 2 | 6 | 1.5 | C |
| false | 1 | 1 | 7 | 0.5 | A |
| false | 1 | 2 | 8 | 1.0 | B |
| false | 1 | 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
Let’s assign different column names
``` clojure
(-> DS
(tc/group-by (comp #(< % 4) :V2))
(tc/ungroup {:add-group-as-column "Is V2 less than 4?"
:add-group-id-as-column "group id"}))
```
\_unnamed \[9 6\]:
| Is V2 less than 4? | group id | :V1 | :V2 | :V3 | :V4 |
|--------------------|---------:|----:|----:|----:|-----|
| true | 0 | 1 | 1 | 0.5 | A |
| true | 0 | 2 | 2 | 1.0 | B |
| true | 0 | 1 | 3 | 1.5 | C |
| false | 1 | 2 | 4 | 0.5 | A |
| false | 1 | 1 | 5 | 1.0 | B |
| false | 1 | 2 | 6 | 1.5 | C |
| false | 1 | 1 | 7 | 0.5 | A |
| false | 1 | 2 | 8 | 1.0 | B |
| false | 1 | 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
If we group by map, we can automatically create new columns out of group
names.
``` clojure
(-> DS
(tc/group-by (fn [row] {"V1 and V3 multiplied" (* (:V1 row)
(:V3 row))
"V4 as lowercase" (clojure.string/lower-case (:V4 row))}))
(tc/ungroup {:add-group-as-column true}))
```
\_unnamed \[9 6\]:
| V1 and V3 multiplied | V4 as lowercase | :V1 | :V2 | :V3 | :V4 |
|---------------------:|-----------------|----:|----:|----:|-----|
| 0.5 | a | 1 | 1 | 0.5 | A |
| 0.5 | a | 1 | 7 | 0.5 | A |
| 2.0 | b | 2 | 2 | 1.0 | B |
| 2.0 | b | 2 | 8 | 1.0 | B |
| 1.5 | c | 1 | 3 | 1.5 | C |
| 1.5 | c | 1 | 9 | 1.5 | C |
| 1.0 | a | 2 | 4 | 0.5 | A |
| 1.0 | b | 1 | 5 | 1.0 | B |
| 3.0 | c | 2 | 6 | 1.5 | C |
------------------------------------------------------------------------
We can add group names without separation
``` clojure
(-> DS
(tc/group-by (fn [row] {"V1 and V3 multiplied" (* (:V1 row)
(:V3 row))
"V4 as lowercase" (clojure.string/lower-case (:V4 row))}))
(tc/ungroup {:add-group-as-column "just map"
:separate? false}))
```
\_unnamed \[9 5\]:
| just map | :V1 | :V2 | :V3 | :V4 |
|-----------------------------------------------------|----:|----:|----:|-----|
| {“V1 and V3 multiplied” 0.5, “V4 as lowercase” “a”} | 1 | 1 | 0.5 | A |
| {“V1 and V3 multiplied” 0.5, “V4 as lowercase” “a”} | 1 | 7 | 0.5 | A |
| {“V1 and V3 multiplied” 2.0, “V4 as lowercase” “b”} | 2 | 2 | 1.0 | B |
| {“V1 and V3 multiplied” 2.0, “V4 as lowercase” “b”} | 2 | 8 | 1.0 | B |
| {“V1 and V3 multiplied” 1.5, “V4 as lowercase” “c”} | 1 | 3 | 1.5 | C |
| {“V1 and V3 multiplied” 1.5, “V4 as lowercase” “c”} | 1 | 9 | 1.5 | C |
| {“V1 and V3 multiplied” 1.0, “V4 as lowercase” “a”} | 2 | 4 | 0.5 | A |
| {“V1 and V3 multiplied” 1.0, “V4 as lowercase” “b”} | 1 | 5 | 1.0 | B |
| {“V1 and V3 multiplied” 3.0, “V4 as lowercase” “c”} | 2 | 6 | 1.5 | C |
------------------------------------------------------------------------
The same applies to group names as sequences
``` clojure
(-> DS
(tc/group-by (juxt :V1 :V3))
(tc/ungroup {:add-group-as-column "abc"}))
```
\_unnamed \[9 6\]:
| :abc-0 | :abc-1 | :V1 | :V2 | :V3 | :V4 |
|-------:|-------:|----:|----:|----:|-----|
| 1 | 0.5 | 1 | 1 | 0.5 | A |
| 1 | 0.5 | 1 | 7 | 0.5 | A |
| 2 | 1.0 | 2 | 2 | 1.0 | B |
| 2 | 1.0 | 2 | 8 | 1.0 | B |
| 1 | 1.5 | 1 | 3 | 1.5 | C |
| 1 | 1.5 | 1 | 9 | 1.5 | C |
| 2 | 0.5 | 2 | 4 | 0.5 | A |
| 1 | 1.0 | 1 | 5 | 1.0 | B |
| 2 | 1.5 | 2 | 6 | 1.5 | C |
------------------------------------------------------------------------
Let’s provide column names
``` clojure
(-> DS
(tc/group-by (juxt :V1 :V3))
(tc/ungroup {:add-group-as-column ["v1" "v3"]}))
```
\_unnamed \[9 6\]:
| v1 | v3 | :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|----:|----:|-----|
| 1 | 0.5 | 1 | 1 | 0.5 | A |
| 1 | 0.5 | 1 | 7 | 0.5 | A |
| 2 | 1.0 | 2 | 2 | 1.0 | B |
| 2 | 1.0 | 2 | 8 | 1.0 | B |
| 1 | 1.5 | 1 | 3 | 1.5 | C |
| 1 | 1.5 | 1 | 9 | 1.5 | C |
| 2 | 0.5 | 2 | 4 | 0.5 | A |
| 1 | 1.0 | 1 | 5 | 1.0 | B |
| 2 | 1.5 | 2 | 6 | 1.5 | C |
------------------------------------------------------------------------
Also we can supress separation
``` clojure
(-> DS
(tc/group-by (juxt :V1 :V3))
(tc/ungroup {:separate? false
:add-group-as-column true}))
;; => _unnamed [9 5]:
```
\_unnamed \[9 5\]:
| :\$group-name | :V1 | :V2 | :V3 | :V4 |
|---------------|----:|----:|----:|-----|
| \[1 0.5\] | 1 | 1 | 0.5 | A |
| \[1 0.5\] | 1 | 7 | 0.5 | A |
| \[2 1.0\] | 2 | 2 | 1.0 | B |
| \[2 1.0\] | 2 | 8 | 1.0 | B |
| \[1 1.5\] | 1 | 3 | 1.5 | C |
| \[1 1.5\] | 1 | 9 | 1.5 | C |
| \[2 0.5\] | 2 | 4 | 0.5 | A |
| \[1 1.0\] | 1 | 5 | 1.0 | B |
| \[2 1.5\] | 2 | 6 | 1.5 | C |
#### Other functions
To check if dataset is grouped or not just use `grouped?` function.
``` clojure
(tc/grouped? DS)
```
nil
``` clojure
(tc/grouped? (tc/group-by DS :V1))
```
true
------------------------------------------------------------------------
If you want to remove grouping annotation (to make all the functions
work as with regular dataset) you can use `unmark-group` or
`as-regular-dataset` (alias) functions.
It can be important when you want to remove some groups (rows) from
grouped dataset using `drop-rows` or something like that.
``` clojure
(-> DS
(tc/group-by :V1)
(tc/as-regular-dataset)
(tc/grouped?))
```
nil
You can also operate on grouped dataset as a regular one in case you
want to access its columns using `without-grouping->` threading macro.
``` clojure
(-> DS
(tc/group-by [:V4 :V1])
(tc/without-grouping->
(tc/order-by (comp (juxt :V4 :V1) :name))))
```
\_unnamed \[6 3\]:
| :name | :group-id | :data |
|------------------|----------:|----------------------------------|
| {:V4 “A”, :V1 1} | 0 | Group: {:V4 “A”, :V1 1} \[2 4\]: |
| {:V4 “A”, :V1 2} | 3 | Group: {:V4 “A”, :V1 2} \[1 4\]: |
| {:V4 “B”, :V1 1} | 4 | Group: {:V4 “B”, :V1 1} \[1 4\]: |
| {:V4 “B”, :V1 2} | 1 | Group: {:V4 “B”, :V1 2} \[2 4\]: |
| {:V4 “C”, :V1 1} | 2 | Group: {:V4 “C”, :V1 1} \[2 4\]: |
| {:V4 “C”, :V1 2} | 5 | Group: {:V4 “C”, :V1 2} \[1 4\]: |
------------------------------------------------------------------------
This is considered internal.
If you want to implement your own mapping function on grouped dataset
you can call `process-group-data` and pass function operating on
datasets. Result should be a dataset to have ungrouping working.
``` clojure
(-> DS
(tc/group-by :V1)
(tc/process-group-data #(str "Shape: " (vector (tc/row-count %) (tc/column-count %))))
(tc/as-regular-dataset))
```
\_unnamed \[2 3\]:
| :name | :group-id | :data |
|------:|----------:|----------------|
| 1 | 0 | Shape: \[5 4\] |
| 2 | 1 | Shape: \[4 4\] |
### Columns
Column is a special `tech.ml.dataset` structure. For our purposes we cat
treat columns as typed and named sequence bound to particular dataset.
Type of the data is inferred from a sequence during column creation.
#### Names
To select dataset columns or column names `columns-selector` is used.
`columns-selector` can be one of the following:
- `:all` keyword - selects all columns
- column name - for single column
- sequence of column names - for collection of columns
- regex - to apply pattern on column names or datatype
- filter predicate - to filter column names or datatype
- `type` namespaced keyword for specific datatype or group of datatypes
Column name can be anything.
`column-names` function returns names according to `columns-selector`
and optional `meta-field`. `meta-field` is one of the following:
- `:name` (default) - to operate on column names
- `:datatype` - to operated on column types
- `:all` - if you want to process all metadata
Datatype groups are:
- `:type/numerical` - any numerical type
- `:type/float` - floating point number (`:float32` and `:float64`)
- `:type/integer` - any integer
- `:type/datetime` - any datetime type
If qualified keyword starts with `:!type`, complement set is used.
------------------------------------------------------------------------
To select all column names you can use `column-names` function.
``` clojure
(tc/column-names DS)
```
(:V1 :V2 :V3 :V4)
or
``` clojure
(tc/column-names DS :all)
```
(:V1 :V2 :V3 :V4)
In case you want to select column which has name `:all` (or is sequence
or map), put it into a vector. Below code returns empty sequence since
there is no such column in the dataset.
``` clojure
(tc/column-names DS [:all])
```
()
------------------------------------------------------------------------
Obviously selecting single name returns it’s name if available
``` clojure
(tc/column-names DS :V1)
(tc/column-names DS "no such column")
```
(:V1)
()
------------------------------------------------------------------------
Select sequence of column names.
``` clojure
(tc/column-names DS [:V1 "V2" :V3 :V4 :V5])
```
(:V1 :V3 :V4)
------------------------------------------------------------------------
Select names based on regex, columns ends with `1` or `4`
``` clojure
(tc/column-names DS #".*[14]")
```
(:V1 :V4)
------------------------------------------------------------------------
Select names based on regex operating on type of the column (to check
what are the column types, call `(tc/info DS :columns)`. Here we want to
get integer columns only.
``` clojure
(tc/column-names DS #"^:int.*" :datatype)
```
(:V1 :V2)
or
``` clojure
(tc/column-names DS :type/integer)
```
(:V1 :V2)
------------------------------------------------------------------------
And finally we can use predicate to select names. Let’s select double
precision columns.
``` clojure
(tc/column-names DS #{:float64} :datatype)
```
(:V3)
or
``` clojure
(tc/column-names DS :type/float64)
```
(:V3)
------------------------------------------------------------------------
If you want to select all columns but given, use `complement` function.
Works only on a predicate.
``` clojure
(tc/column-names DS (complement #{:V1}))
(tc/column-names DS (complement #{:float64}) :datatype)
(tc/column-names DS :!type/float64)
```
(:V2 :V3 :V4)
(:V1 :V2 :V4)
(:V1 :V2 :V4)
------------------------------------------------------------------------
You can select column names based on all column metadata at once by
using `:all` metadata selector. Below we want to select column names
ending with `1` which have `long` datatype.
``` clojure
(tc/column-names DS (fn [meta]
(and (= :int64 (:datatype meta))
(clojure.string/ends-with? (:name meta) "1"))) :all)
```
(:V1)
#### Select
`select-columns` creates dataset with columns selected by
`columns-selector` as described above. Function works on regular and
grouped dataset.
------------------------------------------------------------------------
Select only float64 columns
``` clojure
(tc/select-columns DS #(= :float64 %) :datatype)
```
\_unnamed \[9 1\]:
| :V3 |
|----:|
| 0.5 |
| 1.0 |
| 1.5 |
| 0.5 |
| 1.0 |
| 1.5 |
| 0.5 |
| 1.0 |
| 1.5 |
or
``` clojure
(tc/select-columns DS :type/float64)
```
\_unnamed \[9 1\]:
| :V3 |
|----:|
| 0.5 |
| 1.0 |
| 1.5 |
| 0.5 |
| 1.0 |
| 1.5 |
| 0.5 |
| 1.0 |
| 1.5 |
------------------------------------------------------------------------
Select all but `:V1` columns
``` clojure
(tc/select-columns DS (complement #{:V1}))
```
\_unnamed \[9 3\]:
| :V2 | :V3 | :V4 |
|----:|----:|-----|
| 1 | 0.5 | A |
| 2 | 1.0 | B |
| 3 | 1.5 | C |
| 4 | 0.5 | A |
| 5 | 1.0 | B |
| 6 | 1.5 | C |
| 7 | 0.5 | A |
| 8 | 1.0 | B |
| 9 | 1.5 | C |
------------------------------------------------------------------------
If we have grouped data set, column selection is applied to every group
separately.
``` clojure
(-> DS
(tc/group-by :V1)
(tc/select-columns [:V2 :V3])
(tc/groups->map))
```
{1 Group: 1 \[5 2\]:
| :V2 | :V3 |
|----:|----:|
| 1 | 0.5 |
| 3 | 1.5 |
| 5 | 1.0 |
| 7 | 0.5 |
| 9 | 1.5 |
, 2 Group: 2 \[4 2\]:
| :V2 | :V3 |
|----:|----:|
| 2 | 1.0 |
| 4 | 0.5 |
| 6 | 1.5 |
| 8 | 1.0 |
}
#### Drop
`drop-columns` creates dataset with removed columns.
------------------------------------------------------------------------
Drop float64 columns
``` clojure
(tc/drop-columns DS #(= :float64 %) :datatype)
```
\_unnamed \[9 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 1 | 1 | A |
| 2 | 2 | B |
| 1 | 3 | C |
| 2 | 4 | A |
| 1 | 5 | B |
| 2 | 6 | C |
| 1 | 7 | A |
| 2 | 8 | B |
| 1 | 9 | C |
or
``` clojure
(tc/drop-columns DS :type/float64)
```
\_unnamed \[9 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 1 | 1 | A |
| 2 | 2 | B |
| 1 | 3 | C |
| 2 | 4 | A |
| 1 | 5 | B |
| 2 | 6 | C |
| 1 | 7 | A |
| 2 | 8 | B |
| 1 | 9 | C |
------------------------------------------------------------------------
Drop all columns but `:V1` and `:V2`
``` clojure
(tc/drop-columns DS (complement #{:V1 :V2}))
```
\_unnamed \[9 2\]:
| :V1 | :V2 |
|----:|----:|
| 1 | 1 |
| 2 | 2 |
| 1 | 3 |
| 2 | 4 |
| 1 | 5 |
| 2 | 6 |
| 1 | 7 |
| 2 | 8 |
| 1 | 9 |
------------------------------------------------------------------------
If we have grouped data set, column selection is applied to every group
separately. Selected columns are dropped.
``` clojure
(-> DS
(tc/group-by :V1)
(tc/drop-columns [:V2 :V3])
(tc/groups->map))
```
{1 Group: 1 \[5 2\]:
| :V1 | :V4 |
|----:|-----|
| 1 | A |
| 1 | C |
| 1 | B |
| 1 | A |
| 1 | C |
, 2 Group: 2 \[4 2\]:
| :V1 | :V4 |
|----:|-----|
| 2 | B |
| 2 | A |
| 2 | C |
| 2 | B |
}
#### Rename
If you want to rename colums use `rename-columns` and pass map where
keys are old names, values new ones.
You can also pass mapping function with optional columns-selector
``` clojure
(tc/rename-columns DS {:V1 "v1"
:V2 "v2"
:V3 [1 2 3]
:V4 (Object.)})
```
\_unnamed \[9 4\]:
| v1 | v2 | \[1 2 3\] | |
|----:|----:|----------:|-----------------------------|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
Map all names with function
``` clojure
(tc/rename-columns DS (comp str second name))
```
\_unnamed \[9 4\]:
| 1 | 2 | 3 | 4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
Map selected names with function
``` clojure
(tc/rename-columns DS [:V1 :V3] (comp str second name))
```
\_unnamed \[9 4\]:
| 1 | :V2 | 3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
Function works on grouped dataset
``` clojure
(-> DS
(tc/group-by :V1)
(tc/rename-columns {:V1 "v1"
:V2 "v2"
:V3 [1 2 3]
:V4 (Object.)})
(tc/groups->map))
```
{1 Group: 1 \[5 4\]:
| v1 | v2 | \[1 2 3\] | |
|----:|----:|----------:|-----------------------------|
| 1 | 1 | 0.5 | A |
| 1 | 3 | 1.5 | C |
| 1 | 5 | 1.0 | B |
| 1 | 7 | 0.5 | A |
| 1 | 9 | 1.5 | C |
, 2 Group: 2 \[4 4\]:
| v1 | v2 | \[1 2 3\] | |
|----:|----:|----------:|-----------------------------|
| 2 | 2 | 1.0 | B |
| 2 | 4 | 0.5 | A |
| 2 | 6 | 1.5 | C |
| 2 | 8 | 1.0 | B |
}
#### Add or update
To add (or replace existing) column call `add-column` function. Function
accepts:
- `ds` - a dataset
- `column-name` - if it’s existing column name, column will be replaced
- `column` - can be column (from other dataset), sequence, single value
or function. Too big columns are always trimmed. Too small are cycled
or extended with missing values (according to `size-strategy`
argument)
- `size-strategy` (optional) - when new column is shorter than dataset
row count, following strategies are applied:
- `:cycle` - repeat data
- `:na` - append missing values
- `:strict` - (default) throws an exception when sizes mismatch
Function works on grouped dataset.
------------------------------------------------------------------------
Add single value as column
``` clojure
(tc/add-column DS :V5 "X")
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :V3 | :V4 | :V5 |
|----:|----:|----:|-----|-----|
| 1 | 1 | 0.5 | A | X |
| 2 | 2 | 1.0 | B | X |
| 1 | 3 | 1.5 | C | X |
| 2 | 4 | 0.5 | A | X |
| 1 | 5 | 1.0 | B | X |
| 2 | 6 | 1.5 | C | X |
| 1 | 7 | 0.5 | A | X |
| 2 | 8 | 1.0 | B | X |
| 1 | 9 | 1.5 | C | X |
------------------------------------------------------------------------
Replace one column (column is trimmed)
``` clojure
(tc/add-column DS :V1 (repeatedly rand))
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|-----------:|----:|----:|-----|
| 0.28302117 | 1 | 0.5 | A |
| 0.55639847 | 2 | 1.0 | B |
| 0.41235340 | 3 | 1.5 | C |
| 0.04513037 | 4 | 0.5 | A |
| 0.27800569 | 5 | 1.0 | B |
| 0.18520365 | 6 | 1.5 | C |
| 0.14477020 | 7 | 0.5 | A |
| 0.16335125 | 8 | 1.0 | B |
| 0.53224105 | 9 | 1.5 | C |
------------------------------------------------------------------------
Copy column
``` clojure
(tc/add-column DS :V5 (DS :V1))
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :V3 | :V4 | :V5 |
|----:|----:|----:|-----|----:|
| 1 | 1 | 0.5 | A | 1 |
| 2 | 2 | 1.0 | B | 2 |
| 1 | 3 | 1.5 | C | 1 |
| 2 | 4 | 0.5 | A | 2 |
| 1 | 5 | 1.0 | B | 1 |
| 2 | 6 | 1.5 | C | 2 |
| 1 | 7 | 0.5 | A | 1 |
| 2 | 8 | 1.0 | B | 2 |
| 1 | 9 | 1.5 | C | 1 |
------------------------------------------------------------------------
When function is used, argument is whole dataset and the result should
be column, sequence or single value
``` clojure
(tc/add-column DS :row-count tc/row-count)
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :V3 | :V4 | :row-count |
|----:|----:|----:|-----|-----------:|
| 1 | 1 | 0.5 | A | 9 |
| 2 | 2 | 1.0 | B | 9 |
| 1 | 3 | 1.5 | C | 9 |
| 2 | 4 | 0.5 | A | 9 |
| 1 | 5 | 1.0 | B | 9 |
| 2 | 6 | 1.5 | C | 9 |
| 1 | 7 | 0.5 | A | 9 |
| 2 | 8 | 1.0 | B | 9 |
| 1 | 9 | 1.5 | C | 9 |
------------------------------------------------------------------------
Above example run on grouped dataset, applies function on each group
separately.
``` clojure
(-> DS
(tc/group-by :V1)
(tc/add-column :row-count tc/row-count)
(tc/ungroup))
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :V3 | :V4 | :row-count |
|----:|----:|----:|-----|-----------:|
| 1 | 1 | 0.5 | A | 5 |
| 1 | 3 | 1.5 | C | 5 |
| 1 | 5 | 1.0 | B | 5 |
| 1 | 7 | 0.5 | A | 5 |
| 1 | 9 | 1.5 | C | 5 |
| 2 | 2 | 1.0 | B | 4 |
| 2 | 4 | 0.5 | A | 4 |
| 2 | 6 | 1.5 | C | 4 |
| 2 | 8 | 1.0 | B | 4 |
------------------------------------------------------------------------
When column which is added is longer than row count in dataset, column
is trimmed. When column is shorter, it’s cycled or missing values are
appended.
``` clojure
(tc/add-column DS :V5 [:r :b] :cycle)
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :V3 | :V4 | :V5 |
|----:|----:|----:|-----|-----|
| 1 | 1 | 0.5 | A | :r |
| 2 | 2 | 1.0 | B | :b |
| 1 | 3 | 1.5 | C | :r |
| 2 | 4 | 0.5 | A | :b |
| 1 | 5 | 1.0 | B | :r |
| 2 | 6 | 1.5 | C | :b |
| 1 | 7 | 0.5 | A | :r |
| 2 | 8 | 1.0 | B | :b |
| 1 | 9 | 1.5 | C | :r |
``` clojure
(tc/add-column DS :V5 [:r :b] :na)
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :V3 | :V4 | :V5 |
|----:|----:|----:|-----|-----|
| 1 | 1 | 0.5 | A | :r |
| 2 | 2 | 1.0 | B | :b |
| 1 | 3 | 1.5 | C | |
| 2 | 4 | 0.5 | A | |
| 1 | 5 | 1.0 | B | |
| 2 | 6 | 1.5 | C | |
| 1 | 7 | 0.5 | A | |
| 2 | 8 | 1.0 | B | |
| 1 | 9 | 1.5 | C | |
Exception is thrown when `:strict` (default) strategy is used and column
size is not equal row count
``` clojure
(try
(tc/add-column DS :V5 [:r :b])
(catch Exception e (str "Exception caught: "(ex-message e))))
```
"Exception caught: Column size (2) should be exactly the same as dataset row count (9). Consider `:cycle` or `:na` strategy."
------------------------------------------------------------------------
Tha same applies for grouped dataset
``` clojure
(-> DS
(tc/group-by :V3)
(tc/add-column :V5 [:r :b] :na)
(tc/ungroup))
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :V3 | :V4 | :V5 |
|----:|----:|----:|-----|-----|
| 1 | 1 | 0.5 | A | :r |
| 2 | 4 | 0.5 | A | :b |
| 1 | 7 | 0.5 | A | |
| 2 | 2 | 1.0 | B | :r |
| 1 | 5 | 1.0 | B | :b |
| 2 | 8 | 1.0 | B | |
| 1 | 3 | 1.5 | C | :r |
| 2 | 6 | 1.5 | C | :b |
| 1 | 9 | 1.5 | C | |
------------------------------------------------------------------------
Let’s use other column to fill groups
``` clojure
(-> DS
(tc/group-by :V3)
(tc/add-column :V5 (DS :V2) :cycle)
(tc/ungroup))
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :V3 | :V4 | :V5 |
|----:|----:|----:|-----|----:|
| 1 | 1 | 0.5 | A | 1 |
| 2 | 4 | 0.5 | A | 2 |
| 1 | 7 | 0.5 | A | 3 |
| 2 | 2 | 1.0 | B | 1 |
| 1 | 5 | 1.0 | B | 2 |
| 2 | 8 | 1.0 | B | 3 |
| 1 | 3 | 1.5 | C | 1 |
| 2 | 6 | 1.5 | C | 2 |
| 1 | 9 | 1.5 | C | 3 |
------------------------------------------------------------------------
In case you want to add or update several columns you can call
`add-columns` and provide map where keys are column names, vals are
columns.
``` clojure
(tc/add-columns DS {:V1 #(map inc (% :V1))
:V5 #(map (comp keyword str) (% :V4))
:V6 11})
```
\_unnamed \[9 6\]:
| :V1 | :V2 | :V3 | :V4 | :V5 | :V6 |
|----:|----:|----:|-----|-----|----:|
| 2 | 1 | 0.5 | A | :A | 11 |
| 3 | 2 | 1.0 | B | :B | 11 |
| 2 | 3 | 1.5 | C | :C | 11 |
| 3 | 4 | 0.5 | A | :A | 11 |
| 2 | 5 | 1.0 | B | :B | 11 |
| 3 | 6 | 1.5 | C | :C | 11 |
| 2 | 7 | 0.5 | A | :A | 11 |
| 3 | 8 | 1.0 | B | :B | 11 |
| 2 | 9 | 1.5 | C | :C | 11 |
#### Update
If you want to modify specific column(s) you can call `update-columns`.
Arguments:
- dataset
- one of:
- `columns-selector` and function (or sequence of functions)
- map where keys are column names and vals are function
Functions accept column and have to return column or sequence
------------------------------------------------------------------------
Reverse of columns
``` clojure
(tc/update-columns DS :all reverse)
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 9 | 1.5 | C |
| 2 | 8 | 1.0 | B |
| 1 | 7 | 0.5 | A |
| 2 | 6 | 1.5 | C |
| 1 | 5 | 1.0 | B |
| 2 | 4 | 0.5 | A |
| 1 | 3 | 1.5 | C |
| 2 | 2 | 1.0 | B |
| 1 | 1 | 0.5 | A |
------------------------------------------------------------------------
Apply dec/inc on numerical columns
``` clojure
(tc/update-columns DS :type/numerical [(partial map dec)
(partial map inc)])
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|-----:|-----|
| 0 | 2 | -0.5 | A |
| 1 | 3 | 0.0 | B |
| 0 | 4 | 0.5 | C |
| 1 | 5 | -0.5 | A |
| 0 | 6 | 0.0 | B |
| 1 | 7 | 0.5 | C |
| 0 | 8 | -0.5 | A |
| 1 | 9 | 0.0 | B |
| 0 | 10 | 0.5 | C |
------------------------------------------------------------------------
You can also assign a function to a column by packing operations into
the map.
``` clojure
(tc/update-columns DS {:V1 reverse
:V2 (comp shuffle seq)})
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 6 | 0.5 | A |
| 2 | 9 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 5 | 0.5 | A |
| 1 | 4 | 1.0 | B |
| 2 | 8 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 2 | 1 | 1.0 | B |
| 1 | 2 | 1.5 | C |
#### Map
The other way of creating or updating column is to map rows as regular
`map` function. The arity of mapping function should be the same as
number of selected columns.
Arguments:
- `ds` - dataset
- `column-name` - target column name
- `columns-selector` - columns selected
- `map-fn` - mapping function
------------------------------------------------------------------------
Let’s add numerical columns together
``` clojure
(tc/map-columns DS
:sum-of-numbers
(tc/column-names DS #{:int64 :float64} :datatype)
(fn [& rows]
(reduce + rows)))
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :V3 | :V4 | :sum-of-numbers |
|----:|----:|----:|-----|----------------:|
| 1 | 1 | 0.5 | A | 2.5 |
| 2 | 2 | 1.0 | B | 5.0 |
| 1 | 3 | 1.5 | C | 5.5 |
| 2 | 4 | 0.5 | A | 6.5 |
| 1 | 5 | 1.0 | B | 7.0 |
| 2 | 6 | 1.5 | C | 9.5 |
| 1 | 7 | 0.5 | A | 8.5 |
| 2 | 8 | 1.0 | B | 11.0 |
| 1 | 9 | 1.5 | C | 11.5 |
The same works on grouped dataset
``` clojure
(-> DS
(tc/group-by :V4)
(tc/map-columns :sum-of-numbers
(tc/column-names DS #{:int64 :float64} :datatype)
(fn [& rows]
(reduce + rows)))
(tc/ungroup))
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :V3 | :V4 | :sum-of-numbers |
|----:|----:|----:|-----|----------------:|
| 1 | 1 | 0.5 | A | 2.5 |
| 2 | 4 | 0.5 | A | 6.5 |
| 1 | 7 | 0.5 | A | 8.5 |
| 2 | 2 | 1.0 | B | 5.0 |
| 1 | 5 | 1.0 | B | 7.0 |
| 2 | 8 | 1.0 | B | 11.0 |
| 1 | 3 | 1.5 | C | 5.5 |
| 2 | 6 | 1.5 | C | 9.5 |
| 1 | 9 | 1.5 | C | 11.5 |
#### Reorder
To reorder columns use columns selectors to choose what columns go
first. The unseleted columns are appended to the end.
``` clojure
(tc/reorder-columns DS :V4 [:V3 :V2])
```
\_unnamed \[9 4\]:
| :V4 | :V3 | :V2 | :V1 |
|-----|----:|----:|----:|
| A | 0.5 | 1 | 1 |
| B | 1.0 | 2 | 2 |
| C | 1.5 | 3 | 1 |
| A | 0.5 | 4 | 2 |
| B | 1.0 | 5 | 1 |
| C | 1.5 | 6 | 2 |
| A | 0.5 | 7 | 1 |
| B | 1.0 | 8 | 2 |
| C | 1.5 | 9 | 1 |
------------------------------------------------------------------------
This function doesn’t let you select meta field, so you have to call
`column-names` in such case. Below we want to add integer columns at the
end.
``` clojure
(tc/reorder-columns DS (tc/column-names DS (complement #{:int64}) :datatype))
```
\_unnamed \[9 4\]:
| :V3 | :V4 | :V1 | :V2 |
|----:|-----|----:|----:|
| 0.5 | A | 1 | 1 |
| 1.0 | B | 2 | 2 |
| 1.5 | C | 1 | 3 |
| 0.5 | A | 2 | 4 |
| 1.0 | B | 1 | 5 |
| 1.5 | C | 2 | 6 |
| 0.5 | A | 1 | 7 |
| 1.0 | B | 2 | 8 |
| 1.5 | C | 1 | 9 |
#### Type conversion
To convert column into given datatype can be done using `convert-types`
function. Not all the types can be converted automatically also some
types require slow parsing (every conversion from string). In case where
conversion is not possible you can pass conversion function.
Arguments:
- `ds` - dataset
- Two options:
- `coltype-map` in case when you want to convert several columns, keys
are column names, vals are new types
- `column-selector` and `new-types` - column name and new datatype (or
datatypes as sequence)
`new-types` can be:
- a type like `:int64` or `:string` or sequence of types
- or sequence of pair of datetype and conversion function
After conversion additional infomation is given on problematic values.
The other conversion is casting column into java array (`->array`) of
the type column or provided as argument. Grouped dataset returns
sequence of arrays.
------------------------------------------------------------------------
Basic conversion
``` clojure
(-> DS
(tc/convert-types :V1 :float64)
(tc/info :columns))
```
\_unnamed :column info \[4 6\]:
| :name | :datatype | :n-elems | :unparsed-indexes | :unparsed-data | :categorical? |
|-------|-----------|---------:|-------------------|----------------|---------------|
| :V1 | :float64 | 9 | {} | \[\] | |
| :V2 | :int64 | 9 | | | |
| :V3 | :float64 | 9 | | | |
| :V4 | :string | 9 | | | true |
------------------------------------------------------------------------
Using custom converter. Let’s treat `:V4` as haxadecimal values. See
that this way we can map column to any value.
``` clojure
(-> DS
(tc/convert-types :V4 [[:int16 #(Integer/parseInt % 16)]]))
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|----:|
| 1 | 1 | 0.5 | 10 |
| 2 | 2 | 1.0 | 11 |
| 1 | 3 | 1.5 | 12 |
| 2 | 4 | 0.5 | 10 |
| 1 | 5 | 1.0 | 11 |
| 2 | 6 | 1.5 | 12 |
| 1 | 7 | 0.5 | 10 |
| 2 | 8 | 1.0 | 11 |
| 1 | 9 | 1.5 | 12 |
------------------------------------------------------------------------
You can process several columns at once
``` clojure
(-> DS
(tc/convert-types {:V1 :float64
:V2 :object
:V3 [:boolean #(< % 1.0)]
:V4 :object})
(tc/info :columns))
```
\_unnamed :column info \[4 6\]:
| :name | :datatype | :n-elems | :unparsed-indexes | :unparsed-data | :categorical? |
|-------|-----------|---------:|-------------------|----------------|---------------|
| :V1 | :float64 | 9 | {} | \[\] | |
| :V2 | :object | 9 | {} | \[\] | true |
| :V3 | :boolean | 9 | {} | \[\] | |
| :V4 | :object | 9 | | | true |
------------------------------------------------------------------------
Convert one type into another
``` clojure
(-> DS
(tc/convert-types :type/numerical :int16)
(tc/info :columns))
```
\_unnamed :column info \[4 6\]:
| :name | :datatype | :n-elems | :unparsed-indexes | :unparsed-data | :categorical? |
|-------|-----------|---------:|-------------------|----------------|---------------|
| :V1 | :int16 | 9 | {} | \[\] | |
| :V2 | :int16 | 9 | {} | \[\] | |
| :V3 | :int16 | 9 | {} | \[\] | |
| :V4 | :string | 9 | | | true |
------------------------------------------------------------------------
Function works on the grouped dataset
``` clojure
(-> DS
(tc/group-by :V1)
(tc/convert-types :V1 :float32)
(tc/ungroup)
(tc/info :columns))
```
\_unnamed :column info \[4 6\]:
| :name | :datatype | :n-elems | :unparsed-indexes | :unparsed-data | :categorical? |
|-------|-----------|---------:|-------------------|----------------|---------------|
| :V1 | :float32 | 9 | {} | \[\] | |
| :V2 | :int64 | 9 | | | |
| :V3 | :float64 | 9 | | | |
| :V4 | :string | 9 | | | true |
------------------------------------------------------------------------
Double array conversion.
``` clojure
(tc/->array DS :V1)
```
#object["[J" 0x5b89e23 "[J@5b89e23"]
------------------------------------------------------------------------
Function also works on grouped dataset
``` clojure
(-> DS
(tc/group-by :V3)
(tc/->array :V2))
```
(#object["[J" 0x7db5ee70 "[J@7db5ee70"] #object["[J" 0x7bcd9eb8 "[J@7bcd9eb8"] #object["[J" 0x2c9c99f2 "[J@2c9c99f2"])
------------------------------------------------------------------------
You can also cast the type to the other one (if casting is possible):
``` clojure
(tc/->array DS :V4 :string)
(tc/->array DS :V1 :float32)
```
#object["[Ljava.lang.String;" 0x4b448561 "[Ljava.lang.String;@4b448561"]
#object["[F" 0x7a0e5ace "[F@7a0e5ace"]
### Rows
Rows can be selected or dropped using various selectors:
- row id(s) - row index as number or seqence of numbers (first row has
index `0`, second `1` and so on)
- sequence of true/false values
- filter by predicate (argument is row as a map)
When predicate is used you may want to limit columns passed to the
function (`select-keys` option).
Additionally you may want to precalculate some values which will be
visible for predicate as additional columns. It’s done internally by
calling `add-columns` on a dataset. `:pre` is used as a column
definitions.
#### Select
Select fifth row
``` clojure
(tc/select-rows DS 4)
```
\_unnamed \[1 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 5 | 1.0 | B |
------------------------------------------------------------------------
Select 3 rows
``` clojure
(tc/select-rows DS [1 4 5])
```
\_unnamed \[3 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 2 | 1.0 | B |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
------------------------------------------------------------------------
Select rows using sequence of true/false values
``` clojure
(tc/select-rows DS [true nil nil true])
```
\_unnamed \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 4 | 0.5 | A |
------------------------------------------------------------------------
Select rows using predicate
``` clojure
(tc/select-rows DS (comp #(< % 1) :V3))
```
\_unnamed \[3 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 4 | 0.5 | A |
| 1 | 7 | 0.5 | A |
------------------------------------------------------------------------
The same works on grouped dataset, let’s select first row from every
group.
``` clojure
(-> DS
(tc/group-by :V1)
(tc/select-rows 0)
(tc/ungroup))
```
\_unnamed \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
------------------------------------------------------------------------
If you want to select `:V2` values which are lower than or equal mean in
grouped dataset you have to precalculate it using `:pre`.
``` clojure
(-> DS
(tc/group-by :V4)
(tc/select-rows (fn [row] (<= (:V2 row) (:mean row)))
{:pre {:mean #(tech.v3.datatype.functional/mean (% :V2))}})
(tc/ungroup))
```
\_unnamed \[6 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 4 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 5 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 6 | 1.5 | C |
#### Drop
`drop-rows` removes rows, and accepts exactly the same parameters as
`select-rows`
------------------------------------------------------------------------
Drop values lower than or equal `:V2` column mean in grouped dataset.
``` clojure
(-> DS
(tc/group-by :V4)
(tc/drop-rows (fn [row] (<= (:V2 row) (:mean row)))
{:pre {:mean #(tech.v3.datatype.functional/mean (% :V2))}})
(tc/ungroup))
```
\_unnamed \[3 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 7 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
#### Map rows
Call a mapping function for every row. Mapping function should return a
map, where keys are column names (new or old) and values are column
values.
Works on grouped dataset too.
``` clojure
(tc/map-rows DS (fn [{:keys [V1 V2]}] {:V1 0
:V5 (/ (+ V1 V2) (double V2))}))
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :V3 | :V4 | :V5 |
|----:|----:|----:|-----|-----------:|
| 0 | 1 | 0.5 | A | 2.00000000 |
| 0 | 2 | 1.0 | B | 2.00000000 |
| 0 | 3 | 1.5 | C | 1.33333333 |
| 0 | 4 | 0.5 | A | 1.50000000 |
| 0 | 5 | 1.0 | B | 1.20000000 |
| 0 | 6 | 1.5 | C | 1.33333333 |
| 0 | 7 | 0.5 | A | 1.14285714 |
| 0 | 8 | 1.0 | B | 1.25000000 |
| 0 | 9 | 1.5 | C | 1.11111111 |
#### Other
There are several function to select first, last, random rows, or
display head, tail of the dataset. All functions work on grouped
dataset.
All random functions accept `:seed` as an option if you want to fix
returned result.
------------------------------------------------------------------------
First row
``` clojure
(tc/first DS)
```
\_unnamed \[1 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
------------------------------------------------------------------------
Last row
``` clojure
(tc/last DS)
```
\_unnamed \[1 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
Random row (single)
``` clojure
(tc/rand-nth DS)
```
\_unnamed \[1 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 8 | 1.0 | B |
------------------------------------------------------------------------
Random row (single) with seed
``` clojure
(tc/rand-nth DS {:seed 42})
```
\_unnamed \[1 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 6 | 1.5 | C |
------------------------------------------------------------------------
Random `n` (default: row count) rows with repetition.
``` clojure
(tc/random DS)
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 7 | 0.5 | A |
| 2 | 4 | 0.5 | A |
| 2 | 6 | 1.5 | C |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 2 | 6 | 1.5 | C |
| 1 | 3 | 1.5 | C |
| 1 | 3 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 2 | 6 | 1.5 | C |
| 2 | 8 | 1.0 | B |
| 2 | 2 | 1.0 | B |
------------------------------------------------------------------------
Five random rows with repetition
``` clojure
(tc/random DS 5)
```
\_unnamed \[5 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 4 | 0.5 | A |
| 1 | 9 | 1.5 | C |
| 2 | 2 | 1.0 | B |
| 1 | 7 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 1 | 0.5 | A |
------------------------------------------------------------------------
Five random, non-repeating rows
``` clojure
(tc/random DS 5 {:repeat? false})
```
\_unnamed \[5 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 3 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 2 | 4 | 0.5 | A |
| 1 | 9 | 1.5 | C |
| 2 | 2 | 1.0 | B |
------------------------------------------------------------------------
Five random, with seed
``` clojure
(tc/random DS 5 {:seed 42})
```
\_unnamed \[5 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 6 | 1.5 | C |
| 1 | 5 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 1 | 1 | 0.5 | A |
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
Shuffle dataset
``` clojure
(tc/shuffle DS)
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 5 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 8 | 1.0 | B |
| 2 | 4 | 0.5 | A |
| 2 | 6 | 1.5 | C |
| 2 | 2 | 1.0 | B |
| 1 | 9 | 1.5 | C |
| 1 | 1 | 0.5 | A |
| 1 | 7 | 0.5 | A |
------------------------------------------------------------------------
Shuffle with seed
``` clojure
(tc/shuffle DS {:seed 42})
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 5 | 1.0 | B |
| 2 | 2 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 1 | 1 | 0.5 | A |
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
First `n` rows (default 5)
``` clojure
(tc/head DS)
```
\_unnamed \[5 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
------------------------------------------------------------------------
Last `n` rows (default 5)
``` clojure
(tc/tail DS)
```
\_unnamed \[5 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
`by-rank` calculates rank on column(s). It’s base on [R
rank()](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/rank)
with addition of `:dense` (default) tie strategy which give consecutive
rank numbering.
`:desc?` options (default: `true`) sorts input with descending order,
giving top values under `0` value.
`rank` is zero based and is defined at `tablecloth.api.utils` namespace.
------------------------------------------------------------------------
``` clojure
(tc/by-rank DS :V3 zero?) ;; most V3 values
```
\_unnamed \[3 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 3 | 1.5 | C |
| 2 | 6 | 1.5 | C |
| 1 | 9 | 1.5 | C |
``` clojure
(tc/by-rank DS :V3 zero? {:desc? false}) ;; least V3 values
```
\_unnamed \[3 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 4 | 0.5 | A |
| 1 | 7 | 0.5 | A |
------------------------------------------------------------------------
Rank also works on multiple columns
``` clojure
(tc/by-rank DS [:V1 :V3] zero? {:desc? false})
```
\_unnamed \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 1 | 7 | 0.5 | A |
------------------------------------------------------------------------
Select 5 random rows from each group
``` clojure
(-> DS
(tc/group-by :V4)
(tc/random 5)
(tc/ungroup))
```
\_unnamed \[15 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 7 | 0.5 | A |
| 1 | 1 | 0.5 | A |
| 1 | 1 | 0.5 | A |
| 2 | 4 | 0.5 | A |
| 1 | 7 | 0.5 | A |
| 1 | 7 | 0.5 | A |
| 1 | 1 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 2 | 8 | 1.0 | B |
| 2 | 8 | 1.0 | B |
| 1 | 5 | 1.0 | B |
| 2 | 2 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 9 | 1.5 | C |
| 1 | 3 | 1.5 | C |
| 1 | 9 | 1.5 | C |
| 2 | 6 | 1.5 | C |
### Aggregate
Aggregating is a function which produces single row out of dataset.
Aggregator is a function or sequence or map of functions which accept
dataset as an argument and result single value, sequence of values or
map.
Where map is given as an input or result, keys are treated as column
names.
Grouped dataset is ungrouped after aggregation. This can be turned off
by setting `:ungroup?` to false. In case you want to pass additional
ungrouping parameters add them to the options.
By default resulting column names are prefixed with `summary` prefix
(set it with `:default-column-name-prefix` option).
------------------------------------------------------------------------
Let’s calculate mean of some columns
``` clojure
(tc/aggregate DS #(reduce + (% :V2)))
```
\_unnamed \[1 1\]:
| summary |
|--------:|
| 45 |
------------------------------------------------------------------------
Let’s give resulting column a name.
``` clojure
(tc/aggregate DS {:sum-of-V2 #(reduce + (% :V2))})
```
\_unnamed \[1 1\]:
| :sum-of-V2 |
|-----------:|
| 45 |
------------------------------------------------------------------------
Sequential result is spread into separate columns
``` clojure
(tc/aggregate DS #(take 5(% :V2)))
```
\_unnamed \[1 5\]:
| :summary-0 | :summary-1 | :summary-2 | :summary-3 | :summary-4 |
|-----------:|-----------:|-----------:|-----------:|-----------:|
| 1 | 2 | 3 | 4 | 5 |
------------------------------------------------------------------------
You can combine all variants and rename default prefix
``` clojure
(tc/aggregate DS [#(take 3 (% :V2))
(fn [ds] {:sum-v1 (reduce + (ds :V1))
:prod-v3 (reduce * (ds :V3))})] {:default-column-name-prefix "V2-value"})
```
\_unnamed \[1 5\]:
| :V2-value-0-0 | :V2-value-0-1 | :V2-value-0-2 | :V2-value-1-sum-v1 | :V2-value-1-prod-v3 |
|--------------:|--------------:|--------------:|-------------------:|--------------------:|
| 1 | 2 | 3 | 13 | 0.421875 |
------------------------------------------------------------------------
Processing grouped dataset
``` clojure
(-> DS
(tc/group-by [:V4])
(tc/aggregate [#(take 3 (% :V2))
(fn [ds] {:sum-v1 (reduce + (ds :V1))
:prod-v3 (reduce * (ds :V3))})] {:default-column-name-prefix "V2-value"}))
```
\_unnamed \[3 6\]:
| :V4 | :V2-value-0-0 | :V2-value-0-1 | :V2-value-0-2 | :V2-value-1-sum-v1 | :V2-value-1-prod-v3 |
|-----|--------------:|--------------:|--------------:|-------------------:|--------------------:|
| A | 1 | 4 | 7 | 4 | 0.125 |
| B | 2 | 5 | 8 | 5 | 1.000 |
| C | 3 | 6 | 9 | 4 | 3.375 |
Result of aggregating is automatically ungrouped, you can skip this step
by stetting `:ungroup?` option to `false`.
``` clojure
(-> DS
(tc/group-by [:V3])
(tc/aggregate [#(take 3 (% :V2))
(fn [ds] {:sum-v1 (reduce + (ds :V1))
:prod-v3 (reduce * (ds :V3))})] {:default-column-name-prefix "V2-value"
:ungroup? false}))
```
\_unnamed \[3 3\]:
| :name | :group-id | :data |
|-----------|----------:|--------------------|
| {:V3 0.5} | 0 | \_unnamed \[1 5\]: |
| {:V3 1.0} | 1 | \_unnamed \[1 5\]: |
| {:V3 1.5} | 2 | \_unnamed \[1 5\]: |
#### Column
You can perform columnar aggreagation also. `aggregate-columns` selects
columns and apply aggregating function (or sequence of functions) for
each column separately.
``` clojure
(tc/aggregate-columns DS [:V1 :V2 :V3] #(reduce + %))
```
\_unnamed \[1 3\]:
| :V1 | :V2 | :V3 |
|----:|----:|----:|
| 13 | 45 | 9.0 |
------------------------------------------------------------------------
``` clojure
(tc/aggregate-columns DS [:V1 :V2 :V3] [#(reduce + %)
#(reduce max %)
#(reduce * %)])
```
\_unnamed \[1 3\]:
| :V1 | :V2 | :V3 |
|----:|----:|---------:|
| 13 | 9 | 0.421875 |
------------------------------------------------------------------------
``` clojure
(-> DS
(tc/group-by [:V4])
(tc/aggregate-columns [:V1 :V2 :V3] #(reduce + %)))
```
\_unnamed \[3 4\]:
| :V4 | :V1 | :V2 | :V3 |
|-----|----:|----:|----:|
| A | 4 | 12 | 1.5 |
| B | 5 | 15 | 3.0 |
| C | 4 | 18 | 4.5 |
You can also aggregate whole dataset
``` clojure
(-> DS
(tc/drop-columns :V4)
(tc/aggregate-columns #(reduce + %)))
```
\_unnamed \[1 3\]:
| :V1 | :V2 | :V3 |
|----:|----:|----:|
| 13 | 45 | 9.0 |
#### Crosstab
Cross tabulation built from two sets of columns. First rows and cols are
used to construct grouped dataset, then aggregation function is applied
for each pair. By default it counts rows from each group.
Options are:
- `:aggregator` - function which aggregates values of grouped dataset,
default it’s `row-count`
- `:marginal-rows` and `:marginal-cols` - if true, sum of rows and cols
are added as an additional columns and row. May be custom function
which accepts pure row and col as a seq.
- `:replace-missing?` - should missing values be replaced (default:
true) with `:missing-value` (default: 0)
- `:pivot?` - if false, flat aggregation result is returned (default:
false)
``` clojure
(def ctds (tc/dataset {:a [:foo :foo :bar :bar :foo :foo]
:b [:one :one :two :one :two :one]
:c [:dull :dull :shiny :dull :dull :shiny]}))
```
\#’user/ctds
``` clojure
ctds
```
_unnamed [6 3]:
| :a | :b | :c |
|------|------|--------|
| :foo | :one | :dull |
| :foo | :one | :dull |
| :bar | :two | :shiny |
| :bar | :one | :dull |
| :foo | :two | :dull |
| :foo | :one | :shiny |
------------------------------------------------------------------------
``` clojure
(tc/crosstab ctds :a [:b :c])
```
\_unnamed \[2 5\]:
| rows/cols | \[:one :dull\] | \[:two :shiny\] | \[:two :dull\] | \[:one :shiny\] |
|-----------|---------------:|----------------:|---------------:|----------------:|
| :foo | 2 | 0 | 1 | 1 |
| :bar | 1 | 1 | 0 | 0 |
------------------------------------------------------------------------
With marginals
``` clojure
(tc/crosstab ctds :a [:b :c] {:marginal-rows true :marginal-cols true})
```
\_unnamed \[3 6\]:
| rows/cols | \[:one :dull\] | \[:two :shiny\] | \[:two :dull\] | \[:one :shiny\] | :summary |
|-----------|---------------:|----------------:|---------------:|----------------:|---------:|
| :foo | 2 | 0 | 1 | 1 | 4 |
| :bar | 1 | 1 | 0 | 0 | 2 |
| :summary | 3 | 1 | 1 | 1 | 6 |
------------------------------------------------------------------------
Set missing value to `-1`
``` clojure
(tc/crosstab ctds :a [:b :c] {:missing-value -1})
```
\_unnamed \[2 5\]:
| rows/cols | \[:one :dull\] | \[:two :shiny\] | \[:two :dull\] | \[:one :shiny\] |
|-----------|---------------:|----------------:|---------------:|----------------:|
| :foo | 2 | -1 | 1 | 1 |
| :bar | 1 | 1 | -1 | -1 |
------------------------------------------------------------------------
Turn off pivoting
``` clojure
(tc/crosstab ctds :a [:b :c] {:pivot? false})
```
\_unnamed \[5 3\]:
| :rows | :cols | summary |
|-------|-----------------|--------:|
| :foo | \[:one :dull\] | 2 |
| :bar | \[:two :shiny\] | 1 |
| :bar | \[:one :dull\] | 1 |
| :foo | \[:two :dull\] | 1 |
| :foo | \[:one :shiny\] | 1 |
### Order
Ordering can be done by column(s) or any function operating on row.
Possible order can be:
- `:asc` for ascending order (default)
- `:desc` for descending order
- custom comparator
`:select-keys` limits row map provided to ordering functions.
------------------------------------------------------------------------
Order by single column, ascending
``` clojure
(tc/order-by DS :V1)
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 1 | 3 | 1.5 | C |
| 1 | 5 | 1.0 | B |
| 1 | 7 | 0.5 | A |
| 1 | 9 | 1.5 | C |
| 2 | 6 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 2 | 2 | 1.0 | B |
------------------------------------------------------------------------
Descending order
``` clojure
(tc/order-by DS :V1 :desc)
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 2 | 1.0 | B |
| 2 | 4 | 0.5 | A |
| 2 | 6 | 1.5 | C |
| 2 | 8 | 1.0 | B |
| 1 | 5 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 1 | 1 | 0.5 | A |
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
Order by two columns
``` clojure
(tc/order-by DS [:V1 :V2])
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 1 | 3 | 1.5 | C |
| 1 | 5 | 1.0 | B |
| 1 | 7 | 0.5 | A |
| 1 | 9 | 1.5 | C |
| 2 | 2 | 1.0 | B |
| 2 | 4 | 0.5 | A |
| 2 | 6 | 1.5 | C |
| 2 | 8 | 1.0 | B |
------------------------------------------------------------------------
Use different orders for columns
``` clojure
(tc/order-by DS [:V1 :V2] [:asc :desc])
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 9 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 1 | 1 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 2 | 2 | 1.0 | B |
``` clojure
(tc/order-by DS [:V1 :V2] [:desc :desc])
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 8 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 9 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 1 | 1 | 0.5 | A |
``` clojure
(tc/order-by DS [:V1 :V3] [:desc :asc])
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 4 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 2 | 8 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 1 | 0.5 | A |
| 1 | 7 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
Custom function can be used to provided ordering key. Here order by
`:V4` descending, then by product of other columns ascending.
``` clojure
(tc/order-by DS [:V4 (fn [row] (* (:V1 row)
(:V2 row)
(:V3 row)))] [:desc :asc])
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 3 | 1.5 | C |
| 1 | 9 | 1.5 | C |
| 2 | 6 | 1.5 | C |
| 2 | 2 | 1.0 | B |
| 1 | 5 | 1.0 | B |
| 2 | 8 | 1.0 | B |
| 1 | 1 | 0.5 | A |
| 1 | 7 | 0.5 | A |
| 2 | 4 | 0.5 | A |
------------------------------------------------------------------------
Custom comparator also can be used in case objects are not comparable by
default. Let’s define artificial one: if Euclidean distance is lower
than 2, compare along `z` else along `x` and `y`. We use first three
columns for that.
``` clojure
(defn dist
[v1 v2]
(->> v2
(map - v1)
(map #(* % %))
(reduce +)
(Math/sqrt)))
```
#'user/dist
``` clojure
(tc/order-by DS [:V1 :V2 :V3] (fn [[x1 y1 z1 :as v1] [x2 y2 z2 :as v2]]
(let [d (dist v1 v2)]
(if (< d 2.0)
(compare z1 z2)
(compare [x1 y1] [x2 y2])))))
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 1 | 7 | 0.5 | A |
| 1 | 9 | 1.5 | C |
| 2 | 2 | 1.0 | B |
| 2 | 4 | 0.5 | A |
| 1 | 3 | 1.5 | C |
| 2 | 6 | 1.5 | C |
| 2 | 8 | 1.0 | B |
### Unique
Remove rows which contains the same data. By default `unique-by` removes
duplicates from whole dataset. You can also pass list of columns or
functions (similar as in `group-by`) to remove duplicates limited by
them. Default strategy is to keep the first row. More strategies below.
`unique-by` works on groups
------------------------------------------------------------------------
Remove duplicates from whole dataset
``` clojure
(tc/unique-by DS)
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
Remove duplicates from each group selected by column.
``` clojure
(tc/unique-by DS :V1)
```
\_unnamed \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
------------------------------------------------------------------------
Pair of columns
``` clojure
(tc/unique-by DS [:V1 :V3])
```
\_unnamed \[6 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
------------------------------------------------------------------------
Also function can be used, split dataset by modulo 3 on columns `:V2`
``` clojure
(tc/unique-by DS (fn [m] (mod (:V2 m) 3)))
```
\_unnamed \[3 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
------------------------------------------------------------------------
The same can be achived with `group-by`
``` clojure
(-> DS
(tc/group-by (fn [m] (mod (:V2 m) 3)))
(tc/first)
(tc/ungroup))
```
\_unnamed \[3 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
------------------------------------------------------------------------
Grouped dataset
``` clojure
(-> DS
(tc/group-by :V4)
(tc/unique-by :V1)
(tc/ungroup))
```
\_unnamed \[6 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 4 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 5 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 6 | 1.5 | C |
#### Strategies
There are 4 strategies defined:
- `:first` - select first row (default)
- `:last` - select last row
- `:random` - select random row
- any function - apply function to a columns which are subject of
uniqueness
------------------------------------------------------------------------
Last
``` clojure
(tc/unique-by DS :V1 {:strategy :last})
```
\_unnamed \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
Random
``` clojure
(tc/unique-by DS :V1 {:strategy :random})
```
\_unnamed \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
------------------------------------------------------------------------
Pack columns into vector
``` clojure
(tc/unique-by DS :V4 {:strategy vec})
```
\_unnamed \[3 3\]:
| :V1 | :V2 | :V3 |
|-----------|-----------|-----------------|
| \[1 2 1\] | \[1 4 7\] | \[0.5 0.5 0.5\] |
| \[2 1 2\] | \[2 5 8\] | \[1.0 1.0 1.0\] |
| \[1 2 1\] | \[3 6 9\] | \[1.5 1.5 1.5\] |
------------------------------------------------------------------------
Sum columns
``` clojure
(tc/unique-by DS :V4 {:strategy (partial reduce +)})
```
\_unnamed \[3 3\]:
| :V1 | :V2 | :V3 |
|----:|----:|----:|
| 4 | 12 | 1.5 |
| 5 | 15 | 3.0 |
| 4 | 18 | 4.5 |
------------------------------------------------------------------------
Group by function and apply functions
``` clojure
(tc/unique-by DS (fn [m] (mod (:V2 m) 3)) {:strategy vec})
```
\_unnamed \[3 4\]:
| :V1 | :V2 | :V3 | :V4 |
|-----------|-----------|-----------------|-----------------|
| \[1 2 1\] | \[1 4 7\] | \[0.5 0.5 0.5\] | \[“A” “A” “A”\] |
| \[2 1 2\] | \[2 5 8\] | \[1.0 1.0 1.0\] | \[“B” “B” “B”\] |
| \[1 2 1\] | \[3 6 9\] | \[1.5 1.5 1.5\] | \[“C” “C” “C”\] |
------------------------------------------------------------------------
Grouped dataset
``` clojure
(-> DS
(tc/group-by :V1)
(tc/unique-by (fn [m] (mod (:V2 m) 3)) {:strategy vec})
(tc/ungroup {:add-group-as-column :from-V1}))
```
\_unnamed \[6 5\]:
| :from-V1 | :V1 | :V2 | :V3 | :V4 |
|---------:|---------|---------|-------------|-------------|
| 1 | \[1 1\] | \[1 7\] | \[0.5 0.5\] | \[“A” “A”\] |
| 1 | \[1 1\] | \[3 9\] | \[1.5 1.5\] | \[“C” “C”\] |
| 1 | \[1\] | \[5\] | \[1.0\] | \[“B”\] |
| 2 | \[2 2\] | \[2 8\] | \[1.0 1.0\] | \[“B” “B”\] |
| 2 | \[2\] | \[4\] | \[0.5\] | \[“A”\] |
| 2 | \[2\] | \[6\] | \[1.5\] | \[“C”\] |
### Missing
When dataset contains missing values you can select or drop rows with
missing values or replace them using some strategy.
`column-selector` can be used to limit considered columns
Let’s define dataset which contains missing values
``` clojure
(def DSm (tc/dataset {:V1 (take 9 (cycle [1 2 nil]))
:V2 (range 1 10)
:V3 (take 9 (cycle [0.5 1.0 nil 1.5]))
:V4 (take 9 (cycle ["A" "B" "C"]))}))
```
``` clojure
DSm
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| | 3 | | C |
| 1 | 4 | 1.5 | A |
| 2 | 5 | 0.5 | B |
| | 6 | 1.0 | C |
| 1 | 7 | | A |
| 2 | 8 | 1.5 | B |
| | 9 | 0.5 | C |
#### Select
Select rows with missing values
``` clojure
(tc/select-missing DSm)
```
\_unnamed \[4 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| | 3 | | C |
| | 6 | 1.0 | C |
| 1 | 7 | | A |
| | 9 | 0.5 | C |
------------------------------------------------------------------------
Select rows with missing values in `:V1`
``` clojure
(tc/select-missing DSm :V1)
```
\_unnamed \[3 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| | 3 | | C |
| | 6 | 1.0 | C |
| | 9 | 0.5 | C |
------------------------------------------------------------------------
The same with grouped dataset
``` clojure
(-> DSm
(tc/group-by :V4)
(tc/select-missing :V3)
(tc/ungroup))
```
\_unnamed \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|-----|-----|
| 1 | 7 | | A |
| | 3 | | C |
#### Drop
Drop rows with missing values
``` clojure
(tc/drop-missing DSm)
```
\_unnamed \[5 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 4 | 1.5 | A |
| 2 | 5 | 0.5 | B |
| 2 | 8 | 1.5 | B |
------------------------------------------------------------------------
Drop rows with missing values in `:V1`
``` clojure
(tc/drop-missing DSm :V1)
```
\_unnamed \[6 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 4 | 1.5 | A |
| 2 | 5 | 0.5 | B |
| 1 | 7 | | A |
| 2 | 8 | 1.5 | B |
------------------------------------------------------------------------
The same with grouped dataset
``` clojure
(-> DSm
(tc/group-by :V4)
(tc/drop-missing :V1)
(tc/ungroup))
```
\_unnamed \[6 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 1 | 4 | 1.5 | A |
| 1 | 7 | | A |
| 2 | 2 | 1.0 | B |
| 2 | 5 | 0.5 | B |
| 2 | 8 | 1.5 | B |
#### Replace
Missing values can be replaced using several strategies.
`replace-missing` accepts:
- dataset
- column selector, default: `:all`
- strategy, default: `:nearest`
- value (optional)
- single value
- sequence of values (cycled)
- function, applied on column(s) with stripped missings
- map with \[index,value\] pairs
Strategies are:
- `:value` - replace with given value
- `:up` - copy values up
- `:down` - copy values down
- `:updown` - copy values up and then down for missing values at the end
- `:downup` - copy values down and then up for missing values at the
beginning
- `:mid` or `:nearest` - copy values around known values
- `:midpoint` - use average value from previous and next non-missing
- `:lerp` - trying to lineary approximate values, works for numbers and
datetime, otherwise applies `:nearest`. For numbers always results in
`float` datatype.
Let’s define special dataset here:
``` clojure
(def DSm2 (tc/dataset {:a [nil nil nil 1.0 2 nil nil nil nil nil 4 nil 11 nil nil]
:b [2 2 2 nil nil nil nil nil nil 13 nil 3 4 5 5]}))
```
``` clojure
DSm2
```
\_unnamed \[15 2\]:
| :a | :b |
|-----:|----:|
| | 2 |
| | 2 |
| | 2 |
| 1.0 | |
| 2.0 | |
| | |
| | |
| | |
| | |
| | 13 |
| 4.0 | |
| | 3 |
| 11.0 | 4 |
| | 5 |
| | 5 |
------------------------------------------------------------------------
Replace missing with default strategy for all columns
``` clojure
(tc/replace-missing DSm2)
```
\_unnamed \[15 2\]:
| :a | :b |
|-----:|----:|
| 1.0 | 2 |
| 1.0 | 2 |
| 1.0 | 2 |
| 1.0 | 2 |
| 2.0 | 2 |
| 2.0 | 2 |
| 2.0 | 13 |
| 2.0 | 13 |
| 4.0 | 13 |
| 4.0 | 13 |
| 4.0 | 13 |
| 4.0 | 3 |
| 11.0 | 4 |
| 11.0 | 5 |
| 11.0 | 5 |
------------------------------------------------------------------------
Replace missing with single value in whole dataset
``` clojure
(tc/replace-missing DSm2 :all :value 999)
```
\_unnamed \[15 2\]:
| :a | :b |
|------:|----:|
| 999.0 | 2 |
| 999.0 | 2 |
| 999.0 | 2 |
| 1.0 | 999 |
| 2.0 | 999 |
| 999.0 | 999 |
| 999.0 | 999 |
| 999.0 | 999 |
| 999.0 | 999 |
| 999.0 | 13 |
| 4.0 | 999 |
| 999.0 | 3 |
| 11.0 | 4 |
| 999.0 | 5 |
| 999.0 | 5 |
------------------------------------------------------------------------
Replace missing with single value in `:a` column
``` clojure
(tc/replace-missing DSm2 :a :value 999)
```
\_unnamed \[15 2\]:
| :a | :b |
|------:|----:|
| 999.0 | 2 |
| 999.0 | 2 |
| 999.0 | 2 |
| 1.0 | |
| 2.0 | |
| 999.0 | |
| 999.0 | |
| 999.0 | |
| 999.0 | |
| 999.0 | 13 |
| 4.0 | |
| 999.0 | 3 |
| 11.0 | 4 |
| 999.0 | 5 |
| 999.0 | 5 |
------------------------------------------------------------------------
Replace missing with sequence in `:a` column
``` clojure
(tc/replace-missing DSm2 :a :value [-999 -998 -997])
```
\_unnamed \[15 2\]:
| :a | :b |
|-------:|----:|
| -999.0 | 2 |
| -998.0 | 2 |
| -997.0 | 2 |
| 1.0 | |
| 2.0 | |
| -999.0 | |
| -998.0 | |
| -997.0 | |
| -999.0 | |
| -998.0 | 13 |
| 4.0 | |
| -997.0 | 3 |
| 11.0 | 4 |
| -999.0 | 5 |
| -998.0 | 5 |
------------------------------------------------------------------------
Replace missing with a function (mean)
``` clojure
(tc/replace-missing DSm2 :a :value tech.v3.datatype.functional/mean)
```
\_unnamed \[15 2\]:
| :a | :b |
|-----:|----:|
| 4.5 | 2 |
| 4.5 | 2 |
| 4.5 | 2 |
| 1.0 | |
| 2.0 | |
| 4.5 | |
| 4.5 | |
| 4.5 | |
| 4.5 | |
| 4.5 | 13 |
| 4.0 | |
| 4.5 | 3 |
| 11.0 | 4 |
| 4.5 | 5 |
| 4.5 | 5 |
------------------------------------------------------------------------
Replace missing some missing values with a map
``` clojure
(tc/replace-missing DSm2 :a :value {0 100 1 -100 14 -1000})
```
\_unnamed \[15 2\]:
| :a | :b |
|--------:|----:|
| 100.0 | 2 |
| -100.0 | 2 |
| | 2 |
| 1.0 | |
| 2.0 | |
| | |
| | |
| | |
| | |
| | 13 |
| 4.0 | |
| | 3 |
| 11.0 | 4 |
| | 5 |
| -1000.0 | 5 |
------------------------------------------------------------------------
Using `:down` strategy, fills gaps with values from above. You can see
that if missings are at the beginning, the are filled with first value
``` clojure
(tc/replace-missing DSm2 [:a :b] :downup)
```
\_unnamed \[15 2\]:
| :a | :b |
|-----:|----:|
| 1.0 | 2 |
| 1.0 | 2 |
| 1.0 | 2 |
| 1.0 | 2 |
| 2.0 | 2 |
| 2.0 | 2 |
| 2.0 | 2 |
| 2.0 | 2 |
| 2.0 | 2 |
| 2.0 | 13 |
| 4.0 | 13 |
| 4.0 | 3 |
| 11.0 | 4 |
| 11.0 | 5 |
| 11.0 | 5 |
------------------------------------------------------------------------
To fix above issue you can provide value
``` clojure
(tc/replace-missing DSm2 [:a :b] :down 999)
```
\_unnamed \[15 2\]:
| :a | :b |
|------:|----:|
| 999.0 | 2 |
| 999.0 | 2 |
| 999.0 | 2 |
| 1.0 | 2 |
| 2.0 | 2 |
| 2.0 | 2 |
| 2.0 | 2 |
| 2.0 | 2 |
| 2.0 | 2 |
| 2.0 | 13 |
| 4.0 | 13 |
| 4.0 | 3 |
| 11.0 | 4 |
| 11.0 | 5 |
| 11.0 | 5 |
------------------------------------------------------------------------
The same applies for `:up` strategy which is opposite direction.
``` clojure
(tc/replace-missing DSm2 [:a :b] :up)
```
\_unnamed \[15 2\]:
| :a | :b |
|-----:|----:|
| 1.0 | 2 |
| 1.0 | 2 |
| 1.0 | 2 |
| 1.0 | 13 |
| 2.0 | 13 |
| 4.0 | 13 |
| 4.0 | 13 |
| 4.0 | 13 |
| 4.0 | 13 |
| 4.0 | 13 |
| 4.0 | 3 |
| 11.0 | 3 |
| 11.0 | 4 |
| | 5 |
| | 5 |
------------------------------------------------------------------------
``` clojure
(tc/replace-missing DSm2 [:a :b] :updown)
```
\_unnamed \[15 2\]:
| :a | :b |
|-----:|----:|
| 1.0 | 2 |
| 1.0 | 2 |
| 1.0 | 2 |
| 1.0 | 13 |
| 2.0 | 13 |
| 4.0 | 13 |
| 4.0 | 13 |
| 4.0 | 13 |
| 4.0 | 13 |
| 4.0 | 13 |
| 4.0 | 3 |
| 11.0 | 3 |
| 11.0 | 4 |
| 11.0 | 5 |
| 11.0 | 5 |
------------------------------------------------------------------------
The same applies for `:up` strategy which is opposite direction.
``` clojure
(tc/replace-missing DSm2 [:a :b] :midpoint)
```
\_unnamed \[15 2\]:
| :a | :b |
|-----:|-----:|
| 1.0 | 2.0 |
| 1.0 | 2.0 |
| 1.0 | 2.0 |
| 1.0 | 7.5 |
| 2.0 | 7.5 |
| 3.0 | 7.5 |
| 3.0 | 7.5 |
| 3.0 | 7.5 |
| 3.0 | 7.5 |
| 3.0 | 13.0 |
| 4.0 | 8.0 |
| 7.5 | 3.0 |
| 11.0 | 4.0 |
| 11.0 | 5.0 |
| 11.0 | 5.0 |
------------------------------------------------------------------------
We can use a function which is applied after applying `:up` or `:down`
``` clojure
(tc/replace-missing DSm2 [:a :b] :down tech.v3.datatype.functional/mean)
```
\_unnamed \[15 2\]:
| :a | :b |
|-----:|----:|
| 4.5 | 2 |
| 4.5 | 2 |
| 4.5 | 2 |
| 1.0 | 2 |
| 2.0 | 2 |
| 2.0 | 2 |
| 2.0 | 2 |
| 2.0 | 2 |
| 2.0 | 2 |
| 2.0 | 13 |
| 4.0 | 13 |
| 4.0 | 3 |
| 11.0 | 4 |
| 11.0 | 5 |
| 11.0 | 5 |
------------------------------------------------------------------------
Lerp tries to apply linear interpolation of the values
``` clojure
(tc/replace-missing DSm2 [:a :b] :lerp)
```
\_unnamed \[15 2\]:
| :a | :b |
|------------:|------------:|
| 1.00000000 | 2.00000000 |
| 1.00000000 | 2.00000000 |
| 1.00000000 | 2.00000000 |
| 1.00000000 | 3.57142857 |
| 2.00000000 | 5.14285714 |
| 2.33333333 | 6.71428571 |
| 2.66666667 | 8.28571429 |
| 3.00000000 | 9.85714286 |
| 3.33333333 | 11.42857143 |
| 3.66666667 | 13.00000000 |
| 4.00000000 | 8.00000000 |
| 7.50000000 | 3.00000000 |
| 11.00000000 | 4.00000000 |
| 11.00000000 | 5.00000000 |
| 11.00000000 | 5.00000000 |
------------------------------------------------------------------------
Lerp works also on dates
``` clojure
(-> (tc/dataset {:dt [(java.time.LocalDateTime/of 2020 1 1 11 22 33)
nil nil nil nil nil nil nil
(java.time.LocalDateTime/of 2020 10 1 1 1 1)]})
(tc/replace-missing :lerp))
```
\_unnamed \[9 1\]:
| :dt |
|-------------------------|
| 2020-01-01T11:22:33 |
| 2020-02-04T16:04:51.500 |
| 2020-03-09T20:47:10 |
| 2020-04-13T01:29:28.500 |
| 2020-05-17T06:11:47 |
| 2020-06-20T10:54:05.500 |
| 2020-07-24T15:36:24 |
| 2020-08-27T20:18:42.500 |
| 2020-10-01T01:01:01 |
#### Inject
When your column contains not continuous data range you can fill up with
lacking values. Arguments:
- dataset
- column name
- expected step (`max-span`, milliseconds in case of datetime column)
- (optional) `missing-strategy` - how to replace missing, default
`:down` (set to `nil` if none)
- (optional) `missing-value` - optional value for replace missing
------------------------------------------------------------------------
``` clojure
(-> (tc/dataset {:a [1 2 9]
:b [:a :b :c]})
(tc/fill-range-replace :a 1))
```
\_unnamed \[9 2\]:
| :a | :b |
|----:|-----|
| 1.0 | :a |
| 2.0 | :b |
| 3.0 | :b |
| 4.0 | :b |
| 5.0 | :b |
| 6.0 | :b |
| 7.0 | :b |
| 8.0 | :b |
| 9.0 | :c |
### Join/Separate Columns
Joining or separating columns are operations which can help to tidy
messy dataset.
- `join-columns` joins content of the columns (as string concatenation
or other structure) and stores it in new column
- `separate-column` splits content of the columns into set of new
columns
#### Join
`join-columns` accepts:
- dataset
- column selector (as in `select-columns`)
- options
- `:separator` (default `"-"`)
- `:drop-columns?` - whether to drop source columns or not (default
`true`)
- `:result-type`
- `:map` - packs data into map
- `:seq` - packs data into sequence
- `:string` - join strings with separator (default)
- or custom function which gets row as a vector
- `:missing-subst` - substitution for missing value
------------------------------------------------------------------------
Default usage. Create `:joined` column out of other columns.
``` clojure
(tc/join-columns DSm :joined [:V1 :V2 :V4])
```
\_unnamed \[9 2\]:
| :V3 | :joined |
|----:|---------|
| 0.5 | 1-1-A |
| 1.0 | 2-2-B |
| | 3-C |
| 1.5 | 1-4-A |
| 0.5 | 2-5-B |
| 1.0 | 6-C |
| | 1-7-A |
| 1.5 | 2-8-B |
| 0.5 | 9-C |
------------------------------------------------------------------------
Without dropping source columns.
``` clojure
(tc/join-columns DSm :joined [:V1 :V2 :V4] {:drop-columns? false})
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :V3 | :V4 | :joined |
|----:|----:|----:|-----|---------|
| 1 | 1 | 0.5 | A | 1-1-A |
| 2 | 2 | 1.0 | B | 2-2-B |
| | 3 | | C | 3-C |
| 1 | 4 | 1.5 | A | 1-4-A |
| 2 | 5 | 0.5 | B | 2-5-B |
| | 6 | 1.0 | C | 6-C |
| 1 | 7 | | A | 1-7-A |
| 2 | 8 | 1.5 | B | 2-8-B |
| | 9 | 0.5 | C | 9-C |
------------------------------------------------------------------------
Let’s replace missing value with “NA” string.
``` clojure
(tc/join-columns DSm :joined [:V1 :V2 :V4] {:missing-subst "NA"})
```
\_unnamed \[9 2\]:
| :V3 | :joined |
|----:|---------|
| 0.5 | 1-1-A |
| 1.0 | 2-2-B |
| | NA-3-C |
| 1.5 | 1-4-A |
| 0.5 | 2-5-B |
| 1.0 | NA-6-C |
| | 1-7-A |
| 1.5 | 2-8-B |
| 0.5 | NA-9-C |
------------------------------------------------------------------------
We can use custom separator.
``` clojure
(tc/join-columns DSm :joined [:V1 :V2 :V4] {:separator "/"
:missing-subst "."})
```
\_unnamed \[9 2\]:
| :V3 | :joined |
|----:|---------|
| 0.5 | 1/1/A |
| 1.0 | 2/2/B |
| | ./3/C |
| 1.5 | 1/4/A |
| 0.5 | 2/5/B |
| 1.0 | ./6/C |
| | 1/7/A |
| 1.5 | 2/8/B |
| 0.5 | ./9/C |
------------------------------------------------------------------------
Or even sequence of separators.
``` clojure
(tc/join-columns DSm :joined [:V1 :V2 :V4] {:separator ["-" "/"]
:missing-subst "."})
```
\_unnamed \[9 2\]:
| :V3 | :joined |
|----:|---------|
| 0.5 | 1-1/A |
| 1.0 | 2-2/B |
| | .-3/C |
| 1.5 | 1-4/A |
| 0.5 | 2-5/B |
| 1.0 | .-6/C |
| | 1-7/A |
| 1.5 | 2-8/B |
| 0.5 | .-9/C |
------------------------------------------------------------------------
The other types of results, map:
``` clojure
(tc/join-columns DSm :joined [:V1 :V2 :V4] {:result-type :map})
```
\_unnamed \[9 2\]:
| :V3 | :joined |
|----:|---------------------------|
| 0.5 | {:V1 1, :V2 1, :V4 “A”} |
| 1.0 | {:V1 2, :V2 2, :V4 “B”} |
| | {:V1 nil, :V2 3, :V4 “C”} |
| 1.5 | {:V1 1, :V2 4, :V4 “A”} |
| 0.5 | {:V1 2, :V2 5, :V4 “B”} |
| 1.0 | {:V1 nil, :V2 6, :V4 “C”} |
| | {:V1 1, :V2 7, :V4 “A”} |
| 1.5 | {:V1 2, :V2 8, :V4 “B”} |
| 0.5 | {:V1 nil, :V2 9, :V4 “C”} |
------------------------------------------------------------------------
Sequence
``` clojure
(tc/join-columns DSm :joined [:V1 :V2 :V4] {:result-type :seq})
```
\_unnamed \[9 2\]:
| :V3 | :joined |
|----:|-------------|
| 0.5 | (1 1 “A”) |
| 1.0 | (2 2 “B”) |
| | (nil 3 “C”) |
| 1.5 | (1 4 “A”) |
| 0.5 | (2 5 “B”) |
| 1.0 | (nil 6 “C”) |
| | (1 7 “A”) |
| 1.5 | (2 8 “B”) |
| 0.5 | (nil 9 “C”) |
------------------------------------------------------------------------
Custom function, calculate hash
``` clojure
(tc/join-columns DSm :joined [:V1 :V2 :V4] {:result-type hash})
```
\_unnamed \[9 2\]:
| :V3 | :joined |
|----:|------------:|
| 0.5 | 535226087 |
| 1.0 | 1128801549 |
| | -1842240303 |
| 1.5 | 2022347171 |
| 0.5 | 1884312041 |
| 1.0 | -1555412370 |
| | 1640237355 |
| 1.5 | -967279152 |
| 0.5 | 1128367958 |
------------------------------------------------------------------------
Grouped dataset
``` clojure
(-> DSm
(tc/group-by :V4)
(tc/join-columns :joined [:V1 :V2 :V4])
(tc/ungroup))
```
\_unnamed \[9 2\]:
| :V3 | :joined |
|----:|---------|
| 0.5 | 1-1-A |
| 1.5 | 1-4-A |
| | 1-7-A |
| 1.0 | 2-2-B |
| 0.5 | 2-5-B |
| 1.5 | 2-8-B |
| | 3-C |
| 1.0 | 6-C |
| 0.5 | 9-C |
------------------------------------------------------------------------
##### Tidyr examples
[source](https://tidyr.tidyverse.org/reference/unite.html)
``` clojure
(def df (tc/dataset {:x ["a" "a" nil nil]
:y ["b" nil "b" nil]}))
```
#'user/df
``` clojure
df
```
\_unnamed \[4 2\]:
| :x | :y |
|-----|-----|
| a | b |
| a | |
| | b |
| | |
------------------------------------------------------------------------
``` clojure
(tc/join-columns df "z" [:x :y] {:drop-columns? false
:missing-subst "NA"
:separator "_"})
```
\_unnamed \[4 3\]:
| :x | :y | z |
|-----|-----|-------|
| a | b | a_b |
| a | | a_NA |
| | b | NA_b |
| | | NA_NA |
------------------------------------------------------------------------
``` clojure
(tc/join-columns df "z" [:x :y] {:drop-columns? false
:separator "_"})
```
\_unnamed \[4 3\]:
| :x | :y | z |
|-----|-----|-----|
| a | b | a_b |
| a | | a |
| | b | b |
| | | |
#### Separate
Column can be also separated into several other columns using string as
separator, regex or custom function. Arguments:
- dataset
- source column
- target columns - can be `nil` or `:infer` to automatically create
columns
- separator as:
- string - it’s converted to regular expression and passed to
`clojure.string/split` function
- regex
- or custom function (default: identity)
- options
- `:drop-column?` - whether drop source column(s) or not (default:
`true`). Set to `:all` to keep only separation result.
- `:missing-subst` - values which should be treated as missing, can be
set, sequence, value or function (default: `""`)
Custom function (as separator) should return seqence of values for given
value or a sequence of map.
------------------------------------------------------------------------
Separate float into integer and factional values
``` clojure
(tc/separate-column DS :V3 [:int-part :frac-part] (fn [^double v]
[(int (quot v 1.0))
(mod v 1.0)]))
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :int-part | :frac-part | :V4 |
|----:|----:|----------:|-----------:|-----|
| 1 | 1 | 0 | 0.5 | A |
| 2 | 2 | 1 | 0.0 | B |
| 1 | 3 | 1 | 0.5 | C |
| 2 | 4 | 0 | 0.5 | A |
| 1 | 5 | 1 | 0.0 | B |
| 2 | 6 | 1 | 0.5 | C |
| 1 | 7 | 0 | 0.5 | A |
| 2 | 8 | 1 | 0.0 | B |
| 1 | 9 | 1 | 0.5 | C |
------------------------------------------------------------------------
Source column can be kept
``` clojure
(tc/separate-column DS :V3 [:int-part :frac-part] (fn [^double v]
[(int (quot v 1.0))
(mod v 1.0)]) {:drop-column? false})
```
\_unnamed \[9 6\]:
| :V1 | :V2 | :V3 | :int-part | :frac-part | :V4 |
|----:|----:|----:|----------:|-----------:|-----|
| 1 | 1 | 0.5 | 0 | 0.5 | A |
| 2 | 2 | 1.0 | 1 | 0.0 | B |
| 1 | 3 | 1.5 | 1 | 0.5 | C |
| 2 | 4 | 0.5 | 0 | 0.5 | A |
| 1 | 5 | 1.0 | 1 | 0.0 | B |
| 2 | 6 | 1.5 | 1 | 0.5 | C |
| 1 | 7 | 0.5 | 0 | 0.5 | A |
| 2 | 8 | 1.0 | 1 | 0.0 | B |
| 1 | 9 | 1.5 | 1 | 0.5 | C |
------------------------------------------------------------------------
We can treat `0` or `0.0` as missing value
``` clojure
(tc/separate-column DS :V3 [:int-part :frac-part] (fn [^double v]
[(int (quot v 1.0))
(mod v 1.0)]) {:missing-subst [0 0.0]})
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :int-part | :frac-part | :V4 |
|----:|----:|----------:|-----------:|-----|
| 1 | 1 | | 0.5 | A |
| 2 | 2 | 1 | | B |
| 1 | 3 | 1 | 0.5 | C |
| 2 | 4 | | 0.5 | A |
| 1 | 5 | 1 | | B |
| 2 | 6 | 1 | 0.5 | C |
| 1 | 7 | | 0.5 | A |
| 2 | 8 | 1 | | B |
| 1 | 9 | 1 | 0.5 | C |
------------------------------------------------------------------------
Works on grouped dataset
``` clojure
(-> DS
(tc/group-by :V4)
(tc/separate-column :V3 [:int-part :fract-part] (fn [^double v]
[(int (quot v 1.0))
(mod v 1.0)]))
(tc/ungroup))
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :int-part | :fract-part | :V4 |
|----:|----:|----------:|------------:|-----|
| 1 | 1 | 0 | 0.5 | A |
| 2 | 4 | 0 | 0.5 | A |
| 1 | 7 | 0 | 0.5 | A |
| 2 | 2 | 1 | 0.0 | B |
| 1 | 5 | 1 | 0.0 | B |
| 2 | 8 | 1 | 0.0 | B |
| 1 | 3 | 1 | 0.5 | C |
| 2 | 6 | 1 | 0.5 | C |
| 1 | 9 | 1 | 0.5 | C |
------------------------------------------------------------------------
Separate using separator returning sequence of maps.
``` clojure
(tc/separate-column DS :V3 (fn [^double v]
{:int-part (int (quot v 1.0))
:fract-part (mod v 1.0)}))
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :int-part | :fract-part | :V4 |
|----:|----:|----------:|------------:|-----|
| 1 | 1 | 0 | 0.5 | A |
| 2 | 2 | 1 | 0.0 | B |
| 1 | 3 | 1 | 0.5 | C |
| 2 | 4 | 0 | 0.5 | A |
| 1 | 5 | 1 | 0.0 | B |
| 2 | 6 | 1 | 0.5 | C |
| 1 | 7 | 0 | 0.5 | A |
| 2 | 8 | 1 | 0.0 | B |
| 1 | 9 | 1 | 0.5 | C |
Keeping all columns
``` clojure
(tc/separate-column DS :V3 nil (fn [^double v]
{:int-part (int (quot v 1.0))
:fract-part (mod v 1.0)}) {:drop-column? false})
```
\_unnamed \[9 6\]:
| :V1 | :V2 | :V3 | :int-part | :fract-part | :V4 |
|----:|----:|----:|----------:|------------:|-----|
| 1 | 1 | 0.5 | 0 | 0.5 | A |
| 2 | 2 | 1.0 | 1 | 0.0 | B |
| 1 | 3 | 1.5 | 1 | 0.5 | C |
| 2 | 4 | 0.5 | 0 | 0.5 | A |
| 1 | 5 | 1.0 | 1 | 0.0 | B |
| 2 | 6 | 1.5 | 1 | 0.5 | C |
| 1 | 7 | 0.5 | 0 | 0.5 | A |
| 2 | 8 | 1.0 | 1 | 0.0 | B |
| 1 | 9 | 1.5 | 1 | 0.5 | C |
Droping all colums but separated
``` clojure
(tc/separate-column DS :V3 nil (fn [^double v]
{:int-part (int (quot v 1.0))
:fract-part (mod v 1.0)}) {:drop-column? :all})
```
\_unnamed \[9 2\]:
| :int-part | :fract-part |
|----------:|------------:|
| 0 | 0.5 |
| 1 | 0.0 |
| 1 | 0.5 |
| 0 | 0.5 |
| 1 | 0.0 |
| 1 | 0.5 |
| 0 | 0.5 |
| 1 | 0.0 |
| 1 | 0.5 |
Infering column names
``` clojure
(tc/separate-column DS :V3 (fn [^double v]
[(int (quot v 1.0)) (mod v 1.0)]))
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :V3-0 | :V3-1 | :V4 |
|----:|----:|------:|------:|-----|
| 1 | 1 | 0 | 0.5 | A |
| 2 | 2 | 1 | 0.0 | B |
| 1 | 3 | 1 | 0.5 | C |
| 2 | 4 | 0 | 0.5 | A |
| 1 | 5 | 1 | 0.0 | B |
| 2 | 6 | 1 | 0.5 | C |
| 1 | 7 | 0 | 0.5 | A |
| 2 | 8 | 1 | 0.0 | B |
| 1 | 9 | 1 | 0.5 | C |
------------------------------------------------------------------------
Join and separate together.
``` clojure
(-> DSm
(tc/join-columns :joined [:V1 :V2 :V4] {:result-type :map})
(tc/separate-column :joined [:v1 :v2 :v4] (juxt :V1 :V2 :V4)))
```
\_unnamed \[9 4\]:
| :V3 | :v1 | :v2 | :v4 |
|----:|----:|----:|-----|
| 0.5 | 1 | 1 | A |
| 1.0 | 2 | 2 | B |
| | | 3 | C |
| 1.5 | 1 | 4 | A |
| 0.5 | 2 | 5 | B |
| 1.0 | | 6 | C |
| | 1 | 7 | A |
| 1.5 | 2 | 8 | B |
| 0.5 | | 9 | C |
``` clojure
(-> DSm
(tc/join-columns :joined [:V1 :V2 :V4] {:result-type :seq})
(tc/separate-column :joined [:v1 :v2 :v4] identity))
```
\_unnamed \[9 4\]:
| :V3 | :v1 | :v2 | :v4 |
|----:|----:|----:|-----|
| 0.5 | 1 | 1 | A |
| 1.0 | 2 | 2 | B |
| | | 3 | C |
| 1.5 | 1 | 4 | A |
| 0.5 | 2 | 5 | B |
| 1.0 | | 6 | C |
| | 1 | 7 | A |
| 1.5 | 2 | 8 | B |
| 0.5 | | 9 | C |
##### Tidyr examples
[separate source](https://tidyr.tidyverse.org/reference/separate.html)
[extract source](https://tidyr.tidyverse.org/reference/extract.html)
``` clojure
(def df-separate (tc/dataset {:x [nil "a.b" "a.d" "b.c"]}))
(def df-separate2 (tc/dataset {:x ["a" "a b" nil "a b c"]}))
(def df-separate3 (tc/dataset {:x ["a?b" nil "a.b" "b:c"]}))
(def df-extract (tc/dataset {:x [nil "a-b" "a-d" "b-c" "d-e"]}))
```
#'user/df-separate
#'user/df-separate2
#'user/df-separate3
#'user/df-extract
``` clojure
df-separate
```
\_unnamed \[4 1\]:
| :x |
|-----|
| |
| a.b |
| a.d |
| b.c |
``` clojure
df-separate2
```
\_unnamed \[4 1\]:
| :x |
|-------|
| a |
| a b |
| |
| a b c |
``` clojure
df-separate3
```
\_unnamed \[4 1\]:
| :x |
|-----|
| a?b |
| |
| a.b |
| b:c |
``` clojure
df-extract
```
\_unnamed \[5 1\]:
| :x |
|-----|
| |
| a-b |
| a-d |
| b-c |
| d-e |
------------------------------------------------------------------------
``` clojure
(tc/separate-column df-separate :x [:A :B] "\\.")
```
\_unnamed \[4 2\]:
| :A | :B |
|-----|-----|
| | |
| a | b |
| a | d |
| b | c |
------------------------------------------------------------------------
You can drop columns after separation by setting `nil` as a name. We
need second value here.
``` clojure
(tc/separate-column df-separate :x [nil :B] "\\.")
```
\_unnamed \[4 1\]:
| :B |
|-----|
| |
| b |
| d |
| c |
------------------------------------------------------------------------
Extra data is dropped
``` clojure
(tc/separate-column df-separate2 :x ["a" "b"] " ")
```
\_unnamed \[4 2\]:
| a | b |
|-----|-----|
| a | |
| a | b |
| | |
| a | b |
------------------------------------------------------------------------
Split with regular expression
``` clojure
(tc/separate-column df-separate3 :x ["a" "b"] "[?\\.:]")
```
\_unnamed \[4 2\]:
| a | b |
|-----|-----|
| a | b |
| | |
| a | b |
| b | c |
------------------------------------------------------------------------
Or just regular expression to extract values
``` clojure
(tc/separate-column df-separate3 :x ["a" "b"] #"(.).(.)")
```
\_unnamed \[4 2\]:
| a | b |
|-----|-----|
| a | b |
| | |
| a | b |
| b | c |
------------------------------------------------------------------------
Extract first value only
``` clojure
(tc/separate-column df-extract :x ["A"] "-")
```
\_unnamed \[5 1\]:
| A |
|-----|
| |
| a |
| a |
| b |
| d |
------------------------------------------------------------------------
Split with regex
``` clojure
(tc/separate-column df-extract :x ["A" "B"] #"(\p{Alnum})-(\p{Alnum})")
```
\_unnamed \[5 2\]:
| A | B |
|-----|-----|
| | |
| a | b |
| a | d |
| b | c |
| d | e |
------------------------------------------------------------------------
Only `a,b,c,d` strings
``` clojure
(tc/separate-column df-extract :x ["A" "B"] #"([a-d]+)-([a-d]+)")
```
\_unnamed \[5 2\]:
| A | B |
|-----|-----|
| | |
| a | b |
| a | d |
| b | c |
| | |
#### Array column conversion
A dataset can have as well columns of type java array. We can convert
from normal columns to a single array column and back like this:
``` clojure
(-> (tc/dataset {:x [(double-array [1 2 3])
(double-array [4 5 6])]
:y [:a :b]})
(tc/array-column->columns :x))
```
_unnamed [2 4]:
| :y | 0 | 1 | 2 |
|----|----:|----:|----:|
| :a | 1.0 | 2.0 | 3.0 |
| :b | 4.0 | 5.0 | 6.0 |
and the other way around:
``` clojure
(-> (tc/dataset {0 [0.0 1 2]
1 [3.0 4 5]
:x [:a :b :c]})
(tc/columns->array-column [0 1] :y))
```
_unnamed [3 2]:
| :x | :y |
|----|-------------|
| :a | [D@28ce010b |
| :b | [D@2bf74fe4 |
| :c | [D@37138b9c |
### Fold/Unroll Rows
To pack or unpack the data into single value you can use `fold-by` and
`unroll` functions.
`fold-by` groups dataset and packs columns data from each group
separately into desired datastructure (like vector or sequence).
`unroll` does the opposite.
#### Fold-by
Group-by and pack columns into vector
``` clojure
(tc/fold-by DS [:V3 :V4 :V1])
```
\_unnamed \[6 4\]:
| :V3 | :V4 | :V1 | :V2 |
|----:|-----|----:|---------|
| 0.5 | A | 1 | \[1 7\] |
| 1.0 | B | 2 | \[2 8\] |
| 1.5 | C | 1 | \[3 9\] |
| 0.5 | A | 2 | \[4\] |
| 1.0 | B | 1 | \[5\] |
| 1.5 | C | 2 | \[6\] |
------------------------------------------------------------------------
You can pack several columns at once.
``` clojure
(tc/fold-by DS [:V4])
```
\_unnamed \[3 4\]:
| :V4 | :V1 | :V2 | :V3 |
|-----|-----------|-----------|-----------------|
| A | \[1 2 1\] | \[1 4 7\] | \[0.5 0.5 0.5\] |
| B | \[2 1 2\] | \[2 5 8\] | \[1.0 1.0 1.0\] |
| C | \[1 2 1\] | \[3 6 9\] | \[1.5 1.5 1.5\] |
------------------------------------------------------------------------
You can use custom packing function
``` clojure
(tc/fold-by DS [:V4] seq)
```
\_unnamed \[3 4\]:
| :V4 | :V1 | :V2 | :V3 |
|-----|---------|---------|---------------|
| A | (1 2 1) | (1 4 7) | (0.5 0.5 0.5) |
| B | (2 1 2) | (2 5 8) | (1.0 1.0 1.0) |
| C | (1 2 1) | (3 6 9) | (1.5 1.5 1.5) |
or
``` clojure
(tc/fold-by DS [:V4] set)
```
\_unnamed \[3 4\]:
| :V4 | :V1 | :V2 | :V3 |
|-----|---------|-----------|---------|
| A | \#{1 2} | \#{7 1 4} | \#{0.5} |
| B | \#{1 2} | \#{2 5 8} | \#{1.0} |
| C | \#{1 2} | \#{6 3 9} | \#{1.5} |
------------------------------------------------------------------------
This works also on grouped dataset
``` clojure
(-> DS
(tc/group-by :V1)
(tc/fold-by :V4)
(tc/ungroup))
```
\_unnamed \[6 4\]:
| :V4 | :V1 | :V2 | :V3 |
|-----|---------|---------|-------------|
| A | \[1 1\] | \[1 7\] | \[0.5 0.5\] |
| C | \[1 1\] | \[3 9\] | \[1.5 1.5\] |
| B | \[1\] | \[5\] | \[1.0\] |
| B | \[2 2\] | \[2 8\] | \[1.0 1.0\] |
| A | \[2\] | \[4\] | \[0.5\] |
| C | \[2\] | \[6\] | \[1.5\] |
#### Unroll
`unroll` unfolds sequences stored in data, multiplying other ones when
necessary. You can unroll more than one column at once (folded data
should have the same size!).
Options:
- `:indexes?` if true (or column name), information about index of
unrolled sequence is added.
- `:datatypes` list of datatypes which should be applied to restored
columns, a map
------------------------------------------------------------------------
Unroll one column
``` clojure
(tc/unroll (tc/fold-by DS [:V4]) [:V1])
```
\_unnamed \[9 4\]:
| :V4 | :V2 | :V3 | :V1 |
|-----|-----------|-----------------|----:|
| A | \[1 4 7\] | \[0.5 0.5 0.5\] | 1 |
| A | \[1 4 7\] | \[0.5 0.5 0.5\] | 2 |
| A | \[1 4 7\] | \[0.5 0.5 0.5\] | 1 |
| B | \[2 5 8\] | \[1.0 1.0 1.0\] | 2 |
| B | \[2 5 8\] | \[1.0 1.0 1.0\] | 1 |
| B | \[2 5 8\] | \[1.0 1.0 1.0\] | 2 |
| C | \[3 6 9\] | \[1.5 1.5 1.5\] | 1 |
| C | \[3 6 9\] | \[1.5 1.5 1.5\] | 2 |
| C | \[3 6 9\] | \[1.5 1.5 1.5\] | 1 |
------------------------------------------------------------------------
Unroll all folded columns
``` clojure
(tc/unroll (tc/fold-by DS [:V4]) [:V1 :V2 :V3])
```
\_unnamed \[9 4\]:
| :V4 | :V1 | :V2 | :V3 |
|-----|----:|----:|----:|
| A | 1 | 1 | 0.5 |
| A | 2 | 4 | 0.5 |
| A | 1 | 7 | 0.5 |
| B | 2 | 2 | 1.0 |
| B | 1 | 5 | 1.0 |
| B | 2 | 8 | 1.0 |
| C | 1 | 3 | 1.5 |
| C | 2 | 6 | 1.5 |
| C | 1 | 9 | 1.5 |
------------------------------------------------------------------------
Unroll one by one leads to cartesian product
``` clojure
(-> DS
(tc/fold-by [:V4 :V1])
(tc/unroll [:V2])
(tc/unroll [:V3]))
```
\_unnamed \[15 4\]:
| :V4 | :V1 | :V2 | :V3 |
|-----|----:|----:|----:|
| A | 1 | 1 | 0.5 |
| A | 1 | 1 | 0.5 |
| A | 1 | 7 | 0.5 |
| A | 1 | 7 | 0.5 |
| B | 2 | 2 | 1.0 |
| B | 2 | 2 | 1.0 |
| B | 2 | 8 | 1.0 |
| B | 2 | 8 | 1.0 |
| C | 1 | 3 | 1.5 |
| C | 1 | 3 | 1.5 |
| C | 1 | 9 | 1.5 |
| C | 1 | 9 | 1.5 |
| A | 2 | 4 | 0.5 |
| B | 1 | 5 | 1.0 |
| C | 2 | 6 | 1.5 |
------------------------------------------------------------------------
You can add indexes
``` clojure
(tc/unroll (tc/fold-by DS [:V1]) [:V4 :V2 :V3] {:indexes? true})
```
\_unnamed \[9 5\]:
| :V1 | :indexes | :V4 | :V2 | :V3 |
|----:|---------:|-----|----:|----:|
| 1 | 0 | A | 1 | 0.5 |
| 1 | 1 | C | 3 | 1.5 |
| 1 | 2 | B | 5 | 1.0 |
| 1 | 3 | A | 7 | 0.5 |
| 1 | 4 | C | 9 | 1.5 |
| 2 | 0 | B | 2 | 1.0 |
| 2 | 1 | A | 4 | 0.5 |
| 2 | 2 | C | 6 | 1.5 |
| 2 | 3 | B | 8 | 1.0 |
``` clojure
(tc/unroll (tc/fold-by DS [:V1]) [:V4 :V2 :V3] {:indexes? "vector idx"})
```
\_unnamed \[9 5\]:
| :V1 | vector idx | :V4 | :V2 | :V3 |
|----:|-----------:|-----|----:|----:|
| 1 | 0 | A | 1 | 0.5 |
| 1 | 1 | C | 3 | 1.5 |
| 1 | 2 | B | 5 | 1.0 |
| 1 | 3 | A | 7 | 0.5 |
| 1 | 4 | C | 9 | 1.5 |
| 2 | 0 | B | 2 | 1.0 |
| 2 | 1 | A | 4 | 0.5 |
| 2 | 2 | C | 6 | 1.5 |
| 2 | 3 | B | 8 | 1.0 |
------------------------------------------------------------------------
You can also force datatypes
``` clojure
(-> DS
(tc/fold-by [:V1])
(tc/unroll [:V4 :V2 :V3] {:datatypes {:V4 :string
:V2 :int16
:V3 :float32}})
(tc/info :columns))
```
\_unnamed :column info \[4 4\]:
| :name | :datatype | :n-elems | :categorical? |
|-------|-----------|---------:|---------------|
| :V1 | :int64 | 9 | |
| :V4 | :string | 9 | true |
| :V2 | :int16 | 9 | |
| :V3 | :float32 | 9 | |
------------------------------------------------------------------------
This works also on grouped dataset
``` clojure
(-> DS
(tc/group-by :V1)
(tc/fold-by [:V1 :V4])
(tc/unroll :V3 {:indexes? true})
(tc/ungroup))
```
\_unnamed \[9 5\]:
| :V1 | :V4 | :V2 | :indexes | :V3 |
|----:|-----|---------|---------:|----:|
| 1 | A | \[1 7\] | 0 | 0.5 |
| 1 | A | \[1 7\] | 1 | 0.5 |
| 1 | C | \[3 9\] | 0 | 1.5 |
| 1 | C | \[3 9\] | 1 | 1.5 |
| 1 | B | \[5\] | 0 | 1.0 |
| 2 | B | \[2 8\] | 0 | 1.0 |
| 2 | B | \[2 8\] | 1 | 1.0 |
| 2 | A | \[4\] | 0 | 0.5 |
| 2 | C | \[6\] | 0 | 1.5 |
### Reshape
Reshaping data provides two types of operations:
- `pivot->longer` - converting columns to rows
- `pivot->wider` - converting rows to columns
Both functions are inspired on
[tidyr](https://tidyr.tidyverse.org/articles/pivot.html) R package and
provide almost the same functionality.
All examples are taken from mentioned above documentation.
Both functions work only on regular dataset.
#### Longer
`pivot->longer` converts columns to rows. Column names are treated as
data.
Arguments:
- dataset
- columns selector
- options:
- `:target-columns` - names of the columns created or columns pattern
(see below) (default: `:$column`)
- `:value-column-name` - name of the column for values (default:
`:$value`)
- `:splitter` - string, regular expression or function which splits
source column names into data
- `:drop-missing?` - remove rows with missing? (default: `true`)
- `:datatypes` - map of target columns data types
- `:coerce-to-number` - try to convert extracted values to numbers if
possible (default: true)
`:target-columns` - can be:
- column name - source columns names are put there as a data
- column names as seqence - source columns names after split are put
separately into `:target-columns` as data
- pattern - is a sequence of names, where some of the names are `nil`.
`nil` is replaced by a name taken from splitter and such column is
used for values.
------------------------------------------------------------------------
Create rows from all columns but `"religion"`.
``` clojure
(def relig-income (tc/dataset "data/relig_income.csv"))
```
``` clojure
relig-income
```
data/relig_income.csv \[18 11\]:
| religion | \<\$10k | \$10-20k | \$20-30k | \$30-40k | \$40-50k | \$50-75k | \$75-100k | \$100-150k | \>150k | Don’t know/refused |
|-------------------------|--------:|---------:|---------:|---------:|---------:|---------:|----------:|-----------:|-------:|-------------------:|
| Agnostic | 27 | 34 | 60 | 81 | 76 | 137 | 122 | 109 | 84 | 96 |
| Atheist | 12 | 27 | 37 | 52 | 35 | 70 | 73 | 59 | 74 | 76 |
| Buddhist | 27 | 21 | 30 | 34 | 33 | 58 | 62 | 39 | 53 | 54 |
| Catholic | 418 | 617 | 732 | 670 | 638 | 1116 | 949 | 792 | 633 | 1489 |
| Don’t know/refused | 15 | 14 | 15 | 11 | 10 | 35 | 21 | 17 | 18 | 116 |
| Evangelical Prot | 575 | 869 | 1064 | 982 | 881 | 1486 | 949 | 723 | 414 | 1529 |
| Hindu | 1 | 9 | 7 | 9 | 11 | 34 | 47 | 48 | 54 | 37 |
| Historically Black Prot | 228 | 244 | 236 | 238 | 197 | 223 | 131 | 81 | 78 | 339 |
| Jehovah’s Witness | 20 | 27 | 24 | 24 | 21 | 30 | 15 | 11 | 6 | 37 |
| Jewish | 19 | 19 | 25 | 25 | 30 | 95 | 69 | 87 | 151 | 162 |
| Mainline Prot | 289 | 495 | 619 | 655 | 651 | 1107 | 939 | 753 | 634 | 1328 |
| Mormon | 29 | 40 | 48 | 51 | 56 | 112 | 85 | 49 | 42 | 69 |
| Muslim | 6 | 7 | 9 | 10 | 9 | 23 | 16 | 8 | 6 | 22 |
| Orthodox | 13 | 17 | 23 | 32 | 32 | 47 | 38 | 42 | 46 | 73 |
| Other Christian | 9 | 7 | 11 | 13 | 13 | 14 | 18 | 14 | 12 | 18 |
| Other Faiths | 20 | 33 | 40 | 46 | 49 | 63 | 46 | 40 | 41 | 71 |
| Other World Religions | 5 | 2 | 3 | 4 | 2 | 7 | 3 | 4 | 4 | 8 |
| Unaffiliated | 217 | 299 | 374 | 365 | 341 | 528 | 407 | 321 | 258 | 597 |
``` clojure
(tc/pivot->longer relig-income (complement #{"religion"}))
```
data/relig_income.csv \[180 3\]:
| religion | :$column | :$value | |
|-------------------------|--------------------|----:|
| Agnostic | \<\$10k | 27 |
| Atheist | \<\$10k | 12 |
| Buddhist | \<\$10k | 27 |
| Catholic | \<\$10k | 418 |
| Don’t know/refused | \<\$10k | 15 |
| Evangelical Prot | \<\$10k | 575 |
| Hindu | \<\$10k | 1 |
| Historically Black Prot | \<\$10k | 228 |
| Jehovah’s Witness | \<\$10k | 20 |
| Jewish | \<\$10k | 19 |
| … | … | … |
| Historically Black Prot | \>150k | 78 |
| Jehovah’s Witness | \>150k | 6 |
| Jewish | \>150k | 151 |
| Mainline Prot | \>150k | 634 |
| Mormon | \>150k | 42 |
| Muslim | \>150k | 6 |
| Orthodox | \>150k | 46 |
| Other Christian | \>150k | 12 |
| Other Faiths | \>150k | 41 |
| Other World Religions | \>150k | 4 |
| Unaffiliated | \>150k | 258 |
------------------------------------------------------------------------
Convert only columns starting with `"wk"` and pack them into `:week`
column, values go to `:rank` column
``` clojure
(def bilboard (-> (tc/dataset "data/billboard.csv.gz")
(tc/drop-columns :type/boolean))) ;; drop some boolean columns, tidyr just skips them
```
``` clojure
(->> bilboard
(tc/column-names)
(take 13)
(tc/select-columns bilboard))
```
data/billboard.csv.gz \[317 13\]:
| artist | track | date.entered | wk1 | wk2 | wk3 | wk4 | wk5 | wk6 | wk7 | wk8 | wk9 | wk10 |
|------------------|-----------------------|--------------|----:|----:|----:|----:|----:|----:|----:|----:|----:|-----:|
| 2 Pac | Baby Don’t Cry (Keep… | 2000-02-26 | 87 | 82 | 72 | 77 | 87 | 94 | 99 | | | |
| 2Ge+her | The Hardest Part Of … | 2000-09-02 | 91 | 87 | 92 | | | | | | | |
| 3 Doors Down | Kryptonite | 2000-04-08 | 81 | 70 | 68 | 67 | 66 | 57 | 54 | 53 | 51 | 51 |
| 3 Doors Down | Loser | 2000-10-21 | 76 | 76 | 72 | 69 | 67 | 65 | 55 | 59 | 62 | 61 |
| 504 Boyz | Wobble Wobble | 2000-04-15 | 57 | 34 | 25 | 17 | 17 | 31 | 36 | 49 | 53 | 57 |
| 98^0 | Give Me Just One Nig… | 2000-08-19 | 51 | 39 | 34 | 26 | 26 | 19 | 2 | 2 | 3 | 6 |
| A\*Teens | Dancing Queen | 2000-07-08 | 97 | 97 | 96 | 95 | 100 | | | | | |
| Aaliyah | I Don’t Wanna | 2000-01-29 | 84 | 62 | 51 | 41 | 38 | 35 | 35 | 38 | 38 | 36 |
| Aaliyah | Try Again | 2000-03-18 | 59 | 53 | 38 | 28 | 21 | 18 | 16 | 14 | 12 | 10 |
| Adams, Yolanda | Open My Heart | 2000-08-26 | 76 | 76 | 74 | 69 | 68 | 67 | 61 | 58 | 57 | 59 |
| … | … | … | … | … | … | … | … | … | … | … | … | … |
| Wallflowers, The | Sleepwalker | 2000-10-28 | 73 | 73 | 74 | 80 | 90 | 96 | | | | |
| Westlife | Swear It Again | 2000-04-01 | 96 | 82 | 66 | 55 | 55 | 46 | 44 | 44 | 37 | 35 |
| Williams, Robbie | Angels | 1999-11-20 | 85 | 77 | 69 | 69 | 62 | 56 | 56 | 64 | 54 | 53 |
| Wills, Mark | Back At One | 2000-01-15 | 89 | 55 | 51 | 43 | 37 | 37 | 36 | 39 | 42 | 46 |
| Worley, Darryl | When You Need My Lov… | 2000-06-17 | 98 | 88 | 93 | 92 | 85 | 85 | 84 | 80 | 80 | 80 |
| Wright, Chely | It Was | 2000-03-04 | 86 | 78 | 75 | 72 | 71 | 69 | 64 | 75 | 85 | 98 |
| Yankee Grey | Another Nine Minutes | 2000-04-29 | 86 | 83 | 77 | 74 | 83 | 79 | 88 | 95 | | |
| Yearwood, Trisha | Real Live Woman | 2000-04-01 | 85 | 83 | 83 | 82 | 81 | 91 | | | | |
| Ying Yang Twins | Whistle While You Tw… | 2000-03-18 | 95 | 94 | 91 | 85 | 84 | 78 | 74 | 78 | 85 | 89 |
| Zombie Nation | Kernkraft 400 | 2000-09-02 | 99 | 99 | | | | | | | | |
| matchbox twenty | Bent | 2000-04-29 | 60 | 37 | 29 | 24 | 22 | 21 | 18 | 16 | 13 | 12 |
``` clojure
(tc/pivot->longer bilboard #(clojure.string/starts-with? % "wk") {:target-columns :week
:value-column-name :rank})
```
data/billboard.csv.gz \[5307 5\]:
| artist | track | date.entered | :week | :rank |
|------------------|-----------------------|--------------|-------|------:|
| 3 Doors Down | Kryptonite | 2000-04-08 | wk35 | 4 |
| Braxton, Toni | He Wasn’t Man Enough | 2000-03-18 | wk35 | 34 |
| Creed | Higher | 1999-09-11 | wk35 | 22 |
| Creed | With Arms Wide Open | 2000-05-13 | wk35 | 5 |
| Hill, Faith | Breathe | 1999-11-06 | wk35 | 8 |
| Joe | I Wanna Know | 2000-01-01 | wk35 | 5 |
| Lonestar | Amazed | 1999-06-05 | wk35 | 14 |
| Vertical Horizon | Everything You Want | 2000-01-22 | wk35 | 27 |
| matchbox twenty | Bent | 2000-04-29 | wk35 | 33 |
| Creed | Higher | 1999-09-11 | wk55 | 21 |
| … | … | … | … | … |
| Savage Garden | I Knew I Loved You | 1999-10-23 | wk24 | 12 |
| Sisqo | Incomplete | 2000-06-24 | wk24 | 31 |
| Sisqo | Thong Song | 2000-01-29 | wk24 | 17 |
| Smash Mouth | Then The Morning Com… | 1999-10-30 | wk24 | 35 |
| Son By Four | A Puro Dolor (Purest… | 2000-04-08 | wk24 | 32 |
| Sonique | It Feels So Good | 2000-01-22 | wk24 | 49 |
| SoulDecision | Faded | 2000-07-08 | wk24 | 50 |
| Sting | Desert Rose | 2000-05-13 | wk24 | 45 |
| Train | Meet Virginia | 1999-10-09 | wk24 | 42 |
| Vertical Horizon | Everything You Want | 2000-01-22 | wk24 | 6 |
| matchbox twenty | Bent | 2000-04-29 | wk24 | 9 |
------------------------------------------------------------------------
We can create numerical column out of column names
``` clojure
(tc/pivot->longer bilboard #(clojure.string/starts-with? % "wk") {:target-columns :week
:value-column-name :rank
:splitter #"wk(.*)"
:datatypes {:week :int16}})
```
data/billboard.csv.gz \[5307 5\]:
| artist | track | date.entered | :week | :rank |
|------------------|-------------------------|--------------|------:|------:|
| 3 Doors Down | Kryptonite | 2000-04-08 | 46 | 21 |
| Creed | Higher | 1999-09-11 | 46 | 7 |
| Creed | With Arms Wide Open | 2000-05-13 | 46 | 37 |
| Hill, Faith | Breathe | 1999-11-06 | 46 | 31 |
| Lonestar | Amazed | 1999-06-05 | 46 | 5 |
| 3 Doors Down | Kryptonite | 2000-04-08 | 51 | 42 |
| Creed | Higher | 1999-09-11 | 51 | 14 |
| Hill, Faith | Breathe | 1999-11-06 | 51 | 49 |
| Lonestar | Amazed | 1999-06-05 | 51 | 12 |
| 2 Pac | Baby Don’t Cry (Keep… | 2000-02-26 | 6 | 94 |
| … | … | … | … | … |
| matchbox twenty | Bent | 2000-04-29 | 5 | 22 |
| 3 Doors Down | Kryptonite | 2000-04-08 | 34 | 3 |
| Braxton, Toni | He Wasn’t Man Enough | 2000-03-18 | 34 | 33 |
| Creed | Higher | 1999-09-11 | 34 | 23 |
| Creed | With Arms Wide Open | 2000-05-13 | 34 | 5 |
| Hill, Faith | Breathe | 1999-11-06 | 34 | 5 |
| Joe | I Wanna Know | 2000-01-01 | 34 | 8 |
| Lonestar | Amazed | 1999-06-05 | 34 | 17 |
| Nelly | (Hot S\*\*t) Country G… | 2000-04-29 | 34 | 49 |
| Vertical Horizon | Everything You Want | 2000-01-22 | 34 | 20 |
| matchbox twenty | Bent | 2000-04-29 | 34 | 30 |
------------------------------------------------------------------------
When column names contain observation data, such column names can be
splitted and data can be restored into separate columns.
``` clojure
(def who (tc/dataset "data/who.csv.gz"))
```
``` clojure
(->> who
(tc/column-names)
(take 10)
(tc/select-columns who))
```
data/who.csv.gz \[7240 10\]:
| country | iso2 | iso3 | year | new_sp_m014 | new_sp_m1524 | new_sp_m2534 | new_sp_m3544 | new_sp_m4554 | new_sp_m5564 |
|-------------|------|------|-----:|------------:|-------------:|-------------:|-------------:|-------------:|-------------:|
| Afghanistan | AF | AFG | 1980 | | | | | | |
| Afghanistan | AF | AFG | 1981 | | | | | | |
| Afghanistan | AF | AFG | 1982 | | | | | | |
| Afghanistan | AF | AFG | 1983 | | | | | | |
| Afghanistan | AF | AFG | 1984 | | | | | | |
| Afghanistan | AF | AFG | 1985 | | | | | | |
| Afghanistan | AF | AFG | 1986 | | | | | | |
| Afghanistan | AF | AFG | 1987 | | | | | | |
| Afghanistan | AF | AFG | 1988 | | | | | | |
| Afghanistan | AF | AFG | 1989 | | | | | | |
| … | … | … | … | … | … | … | … | … | … |
| Zimbabwe | ZW | ZWE | 2003 | 133 | 874 | 3048 | 2228 | 981 | 367 |
| Zimbabwe | ZW | ZWE | 2004 | 187 | 833 | 2908 | 2298 | 1056 | 366 |
| Zimbabwe | ZW | ZWE | 2005 | 210 | 837 | 2264 | 1855 | 762 | 295 |
| Zimbabwe | ZW | ZWE | 2006 | 215 | 736 | 2391 | 1939 | 896 | 348 |
| Zimbabwe | ZW | ZWE | 2007 | 138 | 500 | 3693 | 0 | 716 | 292 |
| Zimbabwe | ZW | ZWE | 2008 | 127 | 614 | 0 | 3316 | 704 | 263 |
| Zimbabwe | ZW | ZWE | 2009 | 125 | 578 | | 3471 | 681 | 293 |
| Zimbabwe | ZW | ZWE | 2010 | 150 | 710 | 2208 | 1682 | 761 | 350 |
| Zimbabwe | ZW | ZWE | 2011 | 152 | 784 | 2467 | 2071 | 780 | 377 |
| Zimbabwe | ZW | ZWE | 2012 | 120 | 783 | 2421 | 2086 | 796 | 360 |
| Zimbabwe | ZW | ZWE | 2013 | | | | | | |
``` clojure
(tc/pivot->longer who #(clojure.string/starts-with? % "new") {:target-columns [:diagnosis :gender :age]
:splitter #"new_?(.*)_(.)(.*)"
:value-column-name :count})
```
data/who.csv.gz \[76046 8\]:
| country | iso2 | iso3 | year | :diagnosis | :gender | :age | :count |
|------------------------------------------------------|------|------|-----:|------------|---------|-----:|-------:|
| Albania | AL | ALB | 2013 | rel | m | 1524 | 60 |
| Algeria | DZ | DZA | 2013 | rel | m | 1524 | 1021 |
| Andorra | AD | AND | 2013 | rel | m | 1524 | 0 |
| Angola | AO | AGO | 2013 | rel | m | 1524 | 2992 |
| Anguilla | AI | AIA | 2013 | rel | m | 1524 | 0 |
| Antigua and Barbuda | AG | ATG | 2013 | rel | m | 1524 | 1 |
| Argentina | AR | ARG | 2013 | rel | m | 1524 | 1124 |
| Armenia | AM | ARM | 2013 | rel | m | 1524 | 116 |
| Australia | AU | AUS | 2013 | rel | m | 1524 | 105 |
| Austria | AT | AUT | 2013 | rel | m | 1524 | 44 |
| … | … | … | … | … | … | … | … |
| United Arab Emirates | AE | ARE | 2013 | rel | m | 2534 | 9 |
| United Kingdom of Great Britain and Northern Ireland | GB | GBR | 2013 | rel | m | 2534 | 1158 |
| United States of America | US | USA | 2013 | rel | m | 2534 | 829 |
| Uruguay | UY | URY | 2013 | rel | m | 2534 | 142 |
| Uzbekistan | UZ | UZB | 2013 | rel | m | 2534 | 2371 |
| Vanuatu | VU | VUT | 2013 | rel | m | 2534 | 9 |
| Venezuela (Bolivarian Republic of) | VE | VEN | 2013 | rel | m | 2534 | 739 |
| Viet Nam | VN | VNM | 2013 | rel | m | 2534 | 6302 |
| Yemen | YE | YEM | 2013 | rel | m | 2534 | 1113 |
| Zambia | ZM | ZMB | 2013 | rel | m | 2534 | 7808 |
| Zimbabwe | ZW | ZWE | 2013 | rel | m | 2534 | 5331 |
------------------------------------------------------------------------
When data contains multiple observations per row, we can use splitter
and pattern for target columns to create new columns and put values
there. In following dataset we have two obseravations `dob` and `gender`
for two childs. We want to put child infomation into the column and
leave dob and gender for values.
``` clojure
(def family (tc/dataset "data/family.csv"))
```
``` clojure
family
```
data/family.csv \[5 5\]:
| family | dob_child1 | dob_child2 | gender_child1 | gender_child2 |
|-------:|------------|------------|--------------:|--------------:|
| 1 | 1998-11-26 | 2000-01-29 | 1 | 2 |
| 2 | 1996-06-22 | | 2 | |
| 3 | 2002-07-11 | 2004-04-05 | 2 | 2 |
| 4 | 2004-10-10 | 2009-08-27 | 1 | 1 |
| 5 | 2000-12-05 | 2005-02-28 | 2 | 1 |
``` clojure
(tc/pivot->longer family (complement #{"family"}) {:target-columns [nil :child]
:splitter "_"
:datatypes {"gender" :int16}})
```
data/family.csv \[9 4\]:
| family | :child | dob | gender |
|-------:|--------|------------|-------:|
| 1 | child1 | 1998-11-26 | 1 |
| 2 | child1 | 1996-06-22 | 2 |
| 3 | child1 | 2002-07-11 | 2 |
| 4 | child1 | 2004-10-10 | 1 |
| 5 | child1 | 2000-12-05 | 2 |
| 1 | child2 | 2000-01-29 | 2 |
| 3 | child2 | 2004-04-05 | 2 |
| 4 | child2 | 2009-08-27 | 1 |
| 5 | child2 | 2005-02-28 | 1 |
------------------------------------------------------------------------
Similar here, we have two observations: `x` and `y` in four groups.
``` clojure
(def anscombe (tc/dataset "data/anscombe.csv"))
```
``` clojure
anscombe
```
data/anscombe.csv \[11 8\]:
| x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|----:|----:|----:|----:|------:|-----:|------:|------:|
| 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 14 | 14 | 14 | 8 | 9.96 | 8.10 | 8.84 | 7.04 |
| 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 4 | 4 | 4 | 19 | 4.26 | 3.10 | 5.39 | 12.50 |
| 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
| 5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |
``` clojure
(tc/pivot->longer anscombe :all {:splitter #"(.)(.)"
:target-columns [nil :set]})
```
data/anscombe.csv \[44 3\]:
| :set | x | y |
|-----:|----:|------:|
| 1 | 10 | 8.04 |
| 1 | 8 | 6.95 |
| 1 | 13 | 7.58 |
| 1 | 9 | 8.81 |
| 1 | 11 | 8.33 |
| 1 | 14 | 9.96 |
| 1 | 6 | 7.24 |
| 1 | 4 | 4.26 |
| 1 | 12 | 10.84 |
| 1 | 7 | 4.82 |
| … | … | … |
| 4 | 8 | 6.58 |
| 4 | 8 | 5.76 |
| 4 | 8 | 7.71 |
| 4 | 8 | 8.84 |
| 4 | 8 | 8.47 |
| 4 | 8 | 7.04 |
| 4 | 8 | 5.25 |
| 4 | 19 | 12.50 |
| 4 | 8 | 5.56 |
| 4 | 8 | 7.91 |
| 4 | 8 | 6.89 |
------------------------------------------------------------------------
``` clojure
(def pnl (tc/dataset {:x [1 2 3 4]
:a [1 1 0 0]
:b [0 1 1 1]
:y1 (repeatedly 4 rand)
:y2 (repeatedly 4 rand)
:z1 [3 3 3 3]
:z2 [-2 -2 -2 -2]}))
```
``` clojure
pnl
```
\_unnamed \[4 7\]:
| :x | :a | :b | :y1 | :y2 | :z1 | :z2 |
|----:|----:|----:|-----------:|-----------:|----:|----:|
| 1 | 1 | 0 | 0.61019537 | 0.80765514 | 3 | -2 |
| 2 | 1 | 1 | 0.72364622 | 0.09452177 | 3 | -2 |
| 3 | 0 | 1 | 0.18097336 | 0.01396882 | 3 | -2 |
| 4 | 0 | 1 | 0.23958321 | 0.78543603 | 3 | -2 |
``` clojure
(tc/pivot->longer pnl [:y1 :y2 :z1 :z2] {:target-columns [nil :times]
:splitter #":(.)(.)"})
```
\_unnamed \[8 6\]:
| :x | :a | :b | :times | y | z |
|----:|----:|----:|-------:|-----------:|----:|
| 1 | 1 | 0 | 1 | 0.61019537 | 3 |
| 2 | 1 | 1 | 1 | 0.72364622 | 3 |
| 3 | 0 | 1 | 1 | 0.18097336 | 3 |
| 4 | 0 | 1 | 1 | 0.23958321 | 3 |
| 1 | 1 | 0 | 2 | 0.80765514 | -2 |
| 2 | 1 | 1 | 2 | 0.09452177 | -2 |
| 3 | 0 | 1 | 2 | 0.01396882 | -2 |
| 4 | 0 | 1 | 2 | 0.78543603 | -2 |
#### Wider
`pivot->wider` converts rows to columns.
Arguments:
- dataset
- `columns-selector` - values from selected columns are converted to new
columns
- `value-columns` - what are values
When multiple columns are used as columns selector, names are joined
using `:concat-columns-with` option. `:concat-columns-with` can be a
string or function (default: “\_“). Function accepts sequence of names.
When `columns-selector` creates non unique set of values, they are
folded using `:fold-fn` (default: `vec`) option.
When `value-columns` is a sequence, multiple observations as columns are
created appending value column names into new columns. Column names are
joined using `:concat-value-with` option. `:concat-value-with` can be a
string or function (default: “-”). Function accepts current column name
and value.
------------------------------------------------------------------------
Use `station` as a name source for columns and `seen` for values
``` clojure
(def fish (tc/dataset "data/fish_encounters.csv"))
```
``` clojure
fish
```
data/fish_encounters.csv \[114 3\]:
| fish | station | seen |
|-----:|---------|-----:|
| 4842 | Release | 1 |
| 4842 | I80_1 | 1 |
| 4842 | Lisbon | 1 |
| 4842 | Rstr | 1 |
| 4842 | Base_TD | 1 |
| 4842 | BCE | 1 |
| 4842 | BCW | 1 |
| 4842 | BCE2 | 1 |
| 4842 | BCW2 | 1 |
| 4842 | MAE | 1 |
| … | … | … |
| 4862 | BCE | 1 |
| 4862 | BCW | 1 |
| 4862 | BCE2 | 1 |
| 4862 | BCW2 | 1 |
| 4863 | Release | 1 |
| 4863 | I80_1 | 1 |
| 4864 | Release | 1 |
| 4864 | I80_1 | 1 |
| 4865 | Release | 1 |
| 4865 | I80_1 | 1 |
| 4865 | Lisbon | 1 |
``` clojure
(tc/pivot->wider fish "station" "seen" {:drop-missing? false})
```
data/fish_encounters.csv \[19 12\]:
| fish | Release | I80_1 | Lisbon | Rstr | Base_TD | BCE | BCW | BCE2 | BCW2 | MAE | MAW |
|-----:|--------:|------:|-------:|-----:|--------:|----:|----:|-----:|-----:|----:|----:|
| 4842 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4843 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4844 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4858 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4861 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4857 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | | |
| 4862 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | | |
| 4850 | 1 | 1 | | 1 | 1 | 1 | 1 | | | | |
| 4845 | 1 | 1 | 1 | 1 | 1 | | | | | | |
| 4855 | 1 | 1 | 1 | 1 | 1 | | | | | | |
| 4859 | 1 | 1 | 1 | 1 | 1 | | | | | | |
| 4848 | 1 | 1 | 1 | 1 | | | | | | | |
| 4847 | 1 | 1 | 1 | | | | | | | | |
| 4865 | 1 | 1 | 1 | | | | | | | | |
| 4849 | 1 | 1 | | | | | | | | | |
| 4851 | 1 | 1 | | | | | | | | | |
| 4854 | 1 | 1 | | | | | | | | | |
| 4863 | 1 | 1 | | | | | | | | | |
| 4864 | 1 | 1 | | | | | | | | | |
------------------------------------------------------------------------
If selected columns contain multiple values, such values should be
folded.
``` clojure
(def warpbreaks (tc/dataset "data/warpbreaks.csv"))
```
``` clojure
warpbreaks
```
data/warpbreaks.csv \[54 3\]:
| breaks | wool | tension |
|-------:|------|---------|
| 26 | A | L |
| 30 | A | L |
| 54 | A | L |
| 25 | A | L |
| 70 | A | L |
| 52 | A | L |
| 51 | A | L |
| 26 | A | L |
| 67 | A | L |
| 18 | A | M |
| … | … | … |
| 39 | B | M |
| 29 | B | M |
| 20 | B | H |
| 21 | B | H |
| 24 | B | H |
| 17 | B | H |
| 13 | B | H |
| 15 | B | H |
| 15 | B | H |
| 16 | B | H |
| 28 | B | H |
Let’s see how many values are for each type of `wool` and `tension`
groups
``` clojure
(-> warpbreaks
(tc/group-by ["wool" "tension"])
(tc/aggregate {:n tc/row-count}))
```
\_unnamed \[6 3\]:
| wool | tension | :n |
|------|---------|----:|
| A | L | 9 |
| A | M | 9 |
| A | H | 9 |
| B | L | 9 |
| B | M | 9 |
| B | H | 9 |
``` clojure
(-> warpbreaks
(tc/reorder-columns ["wool" "tension" "breaks"])
(tc/pivot->wider "wool" "breaks" {:fold-fn vec}))
```
data/warpbreaks.csv \[3 3\]:
| tension | A | B |
|---------|--------------------------------|--------------------------------|
| L | \[26 30 54 25 70 52 51 26 67\] | \[27 14 29 19 29 31 41 20 44\] |
| M | \[18 21 29 17 12 18 35 30 36\] | \[42 26 19 16 39 28 21 39 29\] |
| H | \[36 21 24 18 10 43 28 15 26\] | \[20 21 24 17 13 15 15 16 28\] |
We can also calculate mean (aggreate values)
``` clojure
(-> warpbreaks
(tc/reorder-columns ["wool" "tension" "breaks"])
(tc/pivot->wider "wool" "breaks" {:fold-fn tech.v3.datatype.functional/mean}))
```
data/warpbreaks.csv \[3 3\]:
| tension | A | B |
|---------|------------:|------------:|
| L | 44.55555556 | 28.22222222 |
| M | 24.00000000 | 28.77777778 |
| H | 24.55555556 | 18.77777778 |
------------------------------------------------------------------------
Multiple source columns, joined with default separator.
``` clojure
(def production (tc/dataset "data/production.csv"))
```
``` clojure
production
```
data/production.csv \[45 4\]:
| product | country | year | production |
|---------|---------|-----:|------------:|
| A | AI | 2000 | 1.63727158 |
| A | AI | 2001 | 0.15870784 |
| A | AI | 2002 | -1.56797745 |
| A | AI | 2003 | -0.44455509 |
| A | AI | 2004 | -0.07133701 |
| A | AI | 2005 | 1.61183090 |
| A | AI | 2006 | -0.70434682 |
| A | AI | 2007 | -1.53550542 |
| A | AI | 2008 | 0.83907155 |
| A | AI | 2009 | -0.37424110 |
| … | … | … | … |
| B | EI | 2004 | 0.62564999 |
| B | EI | 2005 | -1.34530299 |
| B | EI | 2006 | -0.97184975 |
| B | EI | 2007 | -1.69715821 |
| B | EI | 2008 | 0.04556128 |
| B | EI | 2009 | 1.19315043 |
| B | EI | 2010 | -1.60557503 |
| B | EI | 2011 | -0.77235497 |
| B | EI | 2012 | -2.50262738 |
| B | EI | 2013 | -1.62753769 |
| B | EI | 2014 | 0.03329645 |
``` clojure
(tc/pivot->wider production ["product" "country"] "production")
```
data/production.csv \[15 4\]:
| year | A_AI | B_AI | B_EI |
|-----:|------------:|------------:|------------:|
| 2000 | 1.63727158 | -0.02617661 | 1.40470848 |
| 2001 | 0.15870784 | -0.68863576 | -0.59618369 |
| 2002 | -1.56797745 | 0.06248741 | -0.26568579 |
| 2003 | -0.44455509 | -0.72339686 | 0.65257808 |
| 2004 | -0.07133701 | 0.47248952 | 0.62564999 |
| 2005 | 1.61183090 | -0.94173861 | -1.34530299 |
| 2006 | -0.70434682 | -0.34782108 | -0.97184975 |
| 2007 | -1.53550542 | 0.52425284 | -1.69715821 |
| 2008 | 0.83907155 | 1.83230937 | 0.04556128 |
| 2009 | -0.37424110 | 0.10706491 | 1.19315043 |
| 2010 | -0.71158926 | -0.32903664 | -1.60557503 |
| 2011 | 1.12805634 | -1.78319121 | -0.77235497 |
| 2012 | 1.45718247 | 0.61125798 | -2.50262738 |
| 2013 | -1.55934101 | -0.78526092 | -1.62753769 |
| 2014 | -0.11695838 | 0.97843635 | 0.03329645 |
Joined with custom function
``` clojure
(tc/pivot->wider production ["product" "country"] "production" {:concat-columns-with vec})
```
data/production.csv \[15 4\]:
| year | \[“A” “AI”\] | \[“B” “AI”\] | \[“B” “EI”\] |
|-----:|-------------:|-------------:|-------------:|
| 2000 | 1.63727158 | -0.02617661 | 1.40470848 |
| 2001 | 0.15870784 | -0.68863576 | -0.59618369 |
| 2002 | -1.56797745 | 0.06248741 | -0.26568579 |
| 2003 | -0.44455509 | -0.72339686 | 0.65257808 |
| 2004 | -0.07133701 | 0.47248952 | 0.62564999 |
| 2005 | 1.61183090 | -0.94173861 | -1.34530299 |
| 2006 | -0.70434682 | -0.34782108 | -0.97184975 |
| 2007 | -1.53550542 | 0.52425284 | -1.69715821 |
| 2008 | 0.83907155 | 1.83230937 | 0.04556128 |
| 2009 | -0.37424110 | 0.10706491 | 1.19315043 |
| 2010 | -0.71158926 | -0.32903664 | -1.60557503 |
| 2011 | 1.12805634 | -1.78319121 | -0.77235497 |
| 2012 | 1.45718247 | 0.61125798 | -2.50262738 |
| 2013 | -1.55934101 | -0.78526092 | -1.62753769 |
| 2014 | -0.11695838 | 0.97843635 | 0.03329645 |
------------------------------------------------------------------------
Multiple value columns
``` clojure
(def income (tc/dataset "data/us_rent_income.csv"))
```
``` clojure
income
```
data/us_rent_income.csv \[104 5\]:
| GEOID | NAME | variable | estimate | moe |
|------:|---------------|----------|---------:|----:|
| 1 | Alabama | income | 24476 | 136 |
| 1 | Alabama | rent | 747 | 3 |
| 2 | Alaska | income | 32940 | 508 |
| 2 | Alaska | rent | 1200 | 13 |
| 4 | Arizona | income | 27517 | 148 |
| 4 | Arizona | rent | 972 | 4 |
| 5 | Arkansas | income | 23789 | 165 |
| 5 | Arkansas | rent | 709 | 5 |
| 6 | California | income | 29454 | 109 |
| 6 | California | rent | 1358 | 3 |
| … | … | … | … | … |
| 51 | Virginia | rent | 1166 | 5 |
| 53 | Washington | income | 32318 | 113 |
| 53 | Washington | rent | 1120 | 4 |
| 54 | West Virginia | income | 23707 | 203 |
| 54 | West Virginia | rent | 681 | 6 |
| 55 | Wisconsin | income | 29868 | 135 |
| 55 | Wisconsin | rent | 813 | 3 |
| 56 | Wyoming | income | 30854 | 342 |
| 56 | Wyoming | rent | 828 | 11 |
| 72 | Puerto Rico | income | | |
| 72 | Puerto Rico | rent | 464 | 6 |
``` clojure
(tc/pivot->wider income "variable" ["estimate" "moe"] {:drop-missing? false})
```
data/us_rent_income.csv \[52 6\]:
| GEOID | NAME | income-estimate | income-moe | rent-estimate | rent-moe |
|------:|----------------------|----------------:|-----------:|--------------:|---------:|
| 1 | Alabama | 24476 | 136 | 747 | 3 |
| 2 | Alaska | 32940 | 508 | 1200 | 13 |
| 4 | Arizona | 27517 | 148 | 972 | 4 |
| 5 | Arkansas | 23789 | 165 | 709 | 5 |
| 6 | California | 29454 | 109 | 1358 | 3 |
| 8 | Colorado | 32401 | 109 | 1125 | 5 |
| 9 | Connecticut | 35326 | 195 | 1123 | 5 |
| 10 | Delaware | 31560 | 247 | 1076 | 10 |
| 11 | District of Columbia | 43198 | 681 | 1424 | 17 |
| 12 | Florida | 25952 | 70 | 1077 | 3 |
| … | … | … | … | … | … |
| 46 | South Dakota | 28821 | 276 | 696 | 7 |
| 47 | Tennessee | 25453 | 102 | 808 | 4 |
| 48 | Texas | 28063 | 110 | 952 | 2 |
| 49 | Utah | 27928 | 239 | 948 | 6 |
| 50 | Vermont | 29351 | 361 | 945 | 11 |
| 51 | Virginia | 32545 | 202 | 1166 | 5 |
| 53 | Washington | 32318 | 113 | 1120 | 4 |
| 54 | West Virginia | 23707 | 203 | 681 | 6 |
| 55 | Wisconsin | 29868 | 135 | 813 | 3 |
| 56 | Wyoming | 30854 | 342 | 828 | 11 |
| 72 | Puerto Rico | | | 464 | 6 |
Value concatenated by custom function
``` clojure
(tc/pivot->wider income "variable" ["estimate" "moe"] {:concat-columns-with vec
:concat-value-with vector
:drop-missing? false})
```
data/us_rent_income.csv \[52 6\]:
| GEOID | NAME | \[“income” “estimate”\] | \[“income” “moe”\] | \[“rent” “estimate”\] | \[“rent” “moe”\] |
|------:|----------------------|------------------------:|-------------------:|----------------------:|-----------------:|
| 1 | Alabama | 24476 | 136 | 747 | 3 |
| 2 | Alaska | 32940 | 508 | 1200 | 13 |
| 4 | Arizona | 27517 | 148 | 972 | 4 |
| 5 | Arkansas | 23789 | 165 | 709 | 5 |
| 6 | California | 29454 | 109 | 1358 | 3 |
| 8 | Colorado | 32401 | 109 | 1125 | 5 |
| 9 | Connecticut | 35326 | 195 | 1123 | 5 |
| 10 | Delaware | 31560 | 247 | 1076 | 10 |
| 11 | District of Columbia | 43198 | 681 | 1424 | 17 |
| 12 | Florida | 25952 | 70 | 1077 | 3 |
| … | … | … | … | … | … |
| 46 | South Dakota | 28821 | 276 | 696 | 7 |
| 47 | Tennessee | 25453 | 102 | 808 | 4 |
| 48 | Texas | 28063 | 110 | 952 | 2 |
| 49 | Utah | 27928 | 239 | 948 | 6 |
| 50 | Vermont | 29351 | 361 | 945 | 11 |
| 51 | Virginia | 32545 | 202 | 1166 | 5 |
| 53 | Washington | 32318 | 113 | 1120 | 4 |
| 54 | West Virginia | 23707 | 203 | 681 | 6 |
| 55 | Wisconsin | 29868 | 135 | 813 | 3 |
| 56 | Wyoming | 30854 | 342 | 828 | 11 |
| 72 | Puerto Rico | | | 464 | 6 |
------------------------------------------------------------------------
Reshape contact data
``` clojure
(def contacts (tc/dataset "data/contacts.csv"))
```
``` clojure
contacts
```
data/contacts.csv \[6 3\]:
| field | value | person_id |
|---------|-------------------|----------:|
| name | Jiena McLellan | 1 |
| company | Toyota | 1 |
| name | John Smith | 2 |
| company | google | 2 |
| email | | 2 |
| name | Huxley Ratcliffe | 3 |
``` clojure
(tc/pivot->wider contacts "field" "value" {:drop-missing? false})
```
data/contacts.csv \[3 4\]:
| person_id | name | company | email |
|----------:|------------------|---------|-------------------|
| 2 | John Smith | google | |
| 1 | Jiena McLellan | Toyota | |
| 3 | Huxley Ratcliffe | | |
#### Reshaping
A couple of `tidyr` examples of more complex reshaping.
------------------------------------------------------------------------
[World bank](https://tidyr.tidyverse.org/articles/pivot.html#world-bank)
``` clojure
(def world-bank-pop (tc/dataset "data/world_bank_pop.csv.gz"))
```
``` clojure
(->> world-bank-pop
(tc/column-names)
(take 8)
(tc/select-columns world-bank-pop))
```
data/world_bank_pop.csv.gz \[1056 8\]:
| country | indicator | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 |
|---------|-------------|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|
| ABW | SP.URB.TOTL | 4.24440000E+04 | 4.30480000E+04 | 4.36700000E+04 | 4.42460000E+04 | 4.46690000E+04 | 4.48890000E+04 |
| ABW | SP.URB.GROW | 1.18263237E+00 | 1.41302122E+00 | 1.43455953E+00 | 1.31036044E+00 | 9.51477684E-01 | 4.91302715E-01 |
| ABW | SP.POP.TOTL | 9.08530000E+04 | 9.28980000E+04 | 9.49920000E+04 | 9.70170000E+04 | 9.87370000E+04 | 1.00031000E+05 |
| ABW | SP.POP.GROW | 2.05502678E+00 | 2.22593013E+00 | 2.22905605E+00 | 2.10935434E+00 | 1.75735287E+00 | 1.30203884E+00 |
| AFG | SP.URB.TOTL | 4.43629900E+06 | 4.64805500E+06 | 4.89295100E+06 | 5.15568600E+06 | 5.42677000E+06 | 5.69182300E+06 |
| AFG | SP.URB.GROW | 3.91222846E+00 | 4.66283822E+00 | 5.13467454E+00 | 5.23045853E+00 | 5.12439302E+00 | 4.76864700E+00 |
| AFG | SP.POP.TOTL | 2.00937560E+07 | 2.09664630E+07 | 2.19799230E+07 | 2.30648510E+07 | 2.41189790E+07 | 2.50707980E+07 |
| AFG | SP.POP.GROW | 3.49465874E+00 | 4.25150411E+00 | 4.72052846E+00 | 4.81804112E+00 | 4.46891840E+00 | 3.87047016E+00 |
| AGO | SP.URB.TOTL | 8.23476600E+06 | 8.70800000E+06 | 9.21878700E+06 | 9.76519700E+06 | 1.03435060E+07 | 1.09494240E+07 |
| AGO | SP.URB.GROW | 5.43749411E+00 | 5.58771954E+00 | 5.70013237E+00 | 5.75812711E+00 | 5.75341450E+00 | 5.69279690E+00 |
| … | … | … | … | … | … | … | … |
| ZAF | SP.URB.GROW | 2.32229180E+00 | 2.26080492E+00 | 2.29242659E+00 | 2.25719919E+00 | 2.18014731E+00 | 2.09725981E+00 |
| ZAF | SP.POP.TOTL | 4.57283150E+07 | 4.63850060E+07 | 4.70261730E+07 | 4.76487270E+07 | 4.82473950E+07 | 4.88205860E+07 |
| ZAF | SP.POP.GROW | 1.47499416E+00 | 1.42585702E+00 | 1.37280586E+00 | 1.31515951E+00 | 1.24859226E+00 | 1.18102315E+00 |
| ZMB | SP.URB.TOTL | 3.66507600E+06 | 3.78866000E+06 | 3.94496500E+06 | 4.10631700E+06 | 4.27387500E+06 | 4.44857100E+06 |
| ZMB | SP.URB.GROW | 1.50532147E+00 | 3.31633227E+00 | 4.04276877E+00 | 4.00864374E+00 | 3.99943902E+00 | 4.00620111E+00 |
| ZMB | SP.POP.TOTL | 1.05312210E+07 | 1.08241250E+07 | 1.11204090E+07 | 1.14219840E+07 | 1.17317460E+07 | 1.20521560E+07 |
| ZMB | SP.POP.GROW | 2.80705843E+00 | 2.74331654E+00 | 2.70046295E+00 | 2.67578507E+00 | 2.67585813E+00 | 2.69450644E+00 |
| ZWE | SP.URB.TOTL | 4.12598700E+06 | 4.22551900E+06 | 4.32330700E+06 | 4.35604100E+06 | 4.38192000E+06 | 4.41384500E+06 |
| ZWE | SP.URB.GROW | 2.52373518E+00 | 2.38368296E+00 | 2.28785252E+00 | 7.54299867E-01 | 5.92336717E-01 | 7.25920717E-01 |
| ZWE | SP.POP.TOTL | 1.22222510E+07 | 1.23661650E+07 | 1.25005250E+07 | 1.26338970E+07 | 1.27775110E+07 | 1.29400320E+07 |
| ZWE | SP.POP.GROW | 1.29878201E+00 | 1.17059711E+00 | 1.08065293E+00 | 1.06127964E+00 | 1.13032327E+00 | 1.26390895E+00 |
Step 1 - convert years column into values
``` clojure
(def pop2 (tc/pivot->longer world-bank-pop (map str (range 2000 2018)) {:drop-missing? false
:target-columns ["year"]
:value-column-name "value"}))
```
``` clojure
pop2
```
data/world_bank_pop.csv.gz \[19008 4\]:
| country | indicator | year | value |
|---------|-------------|-----:|---------------:|
| ABW | SP.URB.TOTL | 2013 | 4.43600000E+04 |
| ABW | SP.URB.GROW | 2013 | 6.69503994E-01 |
| ABW | SP.POP.TOTL | 2013 | 1.03187000E+05 |
| ABW | SP.POP.GROW | 2013 | 5.92914005E-01 |
| AFG | SP.URB.TOTL | 2013 | 7.73396400E+06 |
| AFG | SP.URB.GROW | 2013 | 4.19297967E+00 |
| AFG | SP.POP.TOTL | 2013 | 3.17316880E+07 |
| AFG | SP.POP.GROW | 2013 | 3.31522413E+00 |
| AGO | SP.URB.TOTL | 2013 | 1.61194910E+07 |
| AGO | SP.URB.GROW | 2013 | 4.72272270E+00 |
| … | … | … | … |
| ZAF | SP.URB.GROW | 2012 | 2.23077040E+00 |
| ZAF | SP.POP.TOTL | 2012 | 5.29982130E+07 |
| ZAF | SP.POP.GROW | 2012 | 1.39596592E+00 |
| ZMB | SP.URB.TOTL | 2012 | 5.93201300E+06 |
| ZMB | SP.URB.GROW | 2012 | 4.25944078E+00 |
| ZMB | SP.POP.TOTL | 2012 | 1.46999370E+07 |
| ZMB | SP.POP.GROW | 2012 | 3.00513283E+00 |
| ZWE | SP.URB.TOTL | 2012 | 4.83015300E+06 |
| ZWE | SP.URB.GROW | 2012 | 1.67857380E+00 |
| ZWE | SP.POP.TOTL | 2012 | 1.47108260E+07 |
| ZWE | SP.POP.GROW | 2012 | 2.22830616E+00 |
Step 2 - separate `"indicate"` column
``` clojure
(def pop3 (tc/separate-column pop2
"indicator" ["area" "variable"]
#(rest (clojure.string/split % #"\."))))
```
``` clojure
pop3
```
data/world_bank_pop.csv.gz \[19008 5\]:
| country | area | variable | year | value |
|---------|------|----------|-----:|---------------:|
| ABW | URB | TOTL | 2013 | 4.43600000E+04 |
| ABW | URB | GROW | 2013 | 6.69503994E-01 |
| ABW | POP | TOTL | 2013 | 1.03187000E+05 |
| ABW | POP | GROW | 2013 | 5.92914005E-01 |
| AFG | URB | TOTL | 2013 | 7.73396400E+06 |
| AFG | URB | GROW | 2013 | 4.19297967E+00 |
| AFG | POP | TOTL | 2013 | 3.17316880E+07 |
| AFG | POP | GROW | 2013 | 3.31522413E+00 |
| AGO | URB | TOTL | 2013 | 1.61194910E+07 |
| AGO | URB | GROW | 2013 | 4.72272270E+00 |
| … | … | … | … | … |
| ZAF | URB | GROW | 2012 | 2.23077040E+00 |
| ZAF | POP | TOTL | 2012 | 5.29982130E+07 |
| ZAF | POP | GROW | 2012 | 1.39596592E+00 |
| ZMB | URB | TOTL | 2012 | 5.93201300E+06 |
| ZMB | URB | GROW | 2012 | 4.25944078E+00 |
| ZMB | POP | TOTL | 2012 | 1.46999370E+07 |
| ZMB | POP | GROW | 2012 | 3.00513283E+00 |
| ZWE | URB | TOTL | 2012 | 4.83015300E+06 |
| ZWE | URB | GROW | 2012 | 1.67857380E+00 |
| ZWE | POP | TOTL | 2012 | 1.47108260E+07 |
| ZWE | POP | GROW | 2012 | 2.22830616E+00 |
Step 3 - Make columns based on `"variable"` values.
``` clojure
(tc/pivot->wider pop3 "variable" "value" {:drop-missing? false})
```
data/world_bank_pop.csv.gz \[9504 5\]:
| country | area | year | TOTL | GROW |
|---------|------|-----:|---------------:|------------:|
| ABW | URB | 2013 | 4.43600000E+04 | 0.66950399 |
| ABW | POP | 2013 | 1.03187000E+05 | 0.59291401 |
| AFG | URB | 2013 | 7.73396400E+06 | 4.19297967 |
| AFG | POP | 2013 | 3.17316880E+07 | 3.31522413 |
| AGO | URB | 2013 | 1.61194910E+07 | 4.72272270 |
| AGO | POP | 2013 | 2.59983400E+07 | 3.53182419 |
| ALB | URB | 2013 | 1.60350500E+06 | 1.74363937 |
| ALB | POP | 2013 | 2.89509200E+06 | -0.18321138 |
| AND | URB | 2013 | 7.15270000E+04 | -2.11923331 |
| AND | POP | 2013 | 8.07880000E+04 | -2.01331401 |
| … | … | … | … | … |
| WSM | POP | 2012 | 1.89194000E+05 | 0.81144852 |
| XKX | URB | 2012 | | |
| XKX | POP | 2012 | 1.80520000E+06 | 0.78972659 |
| YEM | URB | 2012 | 8.20982800E+06 | 4.49478765 |
| YEM | POP | 2012 | 2.49099690E+07 | 2.67605025 |
| ZAF | URB | 2012 | 3.35330290E+07 | 2.23077040 |
| ZAF | POP | 2012 | 5.29982130E+07 | 1.39596592 |
| ZMB | URB | 2012 | 5.93201300E+06 | 4.25944078 |
| ZMB | POP | 2012 | 1.46999370E+07 | 3.00513283 |
| ZWE | URB | 2012 | 4.83015300E+06 | 1.67857380 |
| ZWE | POP | 2012 | 1.47108260E+07 | 2.22830616 |
------------------------------------------------------------------------
------------------------------------------------------------------------
[Multi-choice](https://tidyr.tidyverse.org/articles/pivot.html#multi-choice)
``` clojure
(def multi (tc/dataset {:id [1 2 3 4]
:choice1 ["A" "C" "D" "B"]
:choice2 ["B" "B" nil "D"]
:choice3 ["C" nil nil nil]}))
```
``` clojure
multi
```
\_unnamed \[4 4\]:
| :id | :choice1 | :choice2 | :choice3 |
|----:|----------|----------|----------|
| 1 | A | B | C |
| 2 | C | B | |
| 3 | D | | |
| 4 | B | D | |
Step 1 - convert all choices into rows and add artificial column to all
values which are not missing.
``` clojure
(def multi2 (-> multi
(tc/pivot->longer (complement #{:id}))
(tc/add-column :checked true)))
```
``` clojure
multi2
```
\_unnamed \[8 4\]:
| :id | :$column | :$value | :checked | |
|----:|--------------------|----------|------|
| 1 | :choice1 | A | true |
| 2 | :choice1 | C | true |
| 3 | :choice1 | D | true |
| 4 | :choice1 | B | true |
| 1 | :choice2 | B | true |
| 2 | :choice2 | B | true |
| 4 | :choice2 | D | true |
| 1 | :choice3 | C | true |
Step 2 - Convert back to wide form with actual choices as columns
``` clojure
(-> multi2
(tc/drop-columns :$column)
(tc/pivot->wider :$value :checked {:drop-missing? false})
(tc/order-by :id))
```
\_unnamed \[4 5\]:
| :id | A | C | D | B |
|----:|------|------|------|------|
| 1 | true | true | | true |
| 2 | | true | | true |
| 3 | | | true | |
| 4 | | | true | true |
------------------------------------------------------------------------
------------------------------------------------------------------------
[Construction](https://tidyr.tidyverse.org/articles/pivot.html#by-hand)
``` clojure
(def construction (tc/dataset "data/construction.csv"))
(def construction-unit-map {"1 unit" "1"
"2 to 4 units" "2-4"
"5 units or more" "5+"})
```
``` clojure
construction
```
data/construction.csv \[9 9\]:
| Year | Month | 1 unit | 2 to 4 units | 5 units or more | Northeast | Midwest | South | West |
|-----:|-----------|-------:|--------------|----------------:|----------:|--------:|------:|-----:|
| 2018 | January | 859 | | 348 | 114 | 169 | 596 | 339 |
| 2018 | February | 882 | | 400 | 138 | 160 | 655 | 336 |
| 2018 | March | 862 | | 356 | 150 | 154 | 595 | 330 |
| 2018 | April | 797 | | 447 | 144 | 196 | 613 | 304 |
| 2018 | May | 875 | | 364 | 90 | 169 | 673 | 319 |
| 2018 | June | 867 | | 342 | 76 | 170 | 610 | 360 |
| 2018 | July | 829 | | 360 | 108 | 183 | 594 | 310 |
| 2018 | August | 939 | | 286 | 90 | 205 | 649 | 286 |
| 2018 | September | 835 | | 304 | 117 | 175 | 560 | 296 |
Conversion 1 - Group two column types
``` clojure
(-> construction
(tc/pivot->longer #"^[125NWS].*|Midwest" {:target-columns [:units :region]
:splitter (fn [col-name]
(if (re-matches #"^[125].*" col-name)
[(construction-unit-map col-name) nil]
[nil col-name]))
:value-column-name :n
:drop-missing? false}))
```
data/construction.csv \[63 5\]:
| Year | Month | :units | :region | :n |
|-----:|-----------|--------|---------|----:|
| 2018 | January | 1 | | 859 |
| 2018 | February | 1 | | 882 |
| 2018 | March | 1 | | 862 |
| 2018 | April | 1 | | 797 |
| 2018 | May | 1 | | 875 |
| 2018 | June | 1 | | 867 |
| 2018 | July | 1 | | 829 |
| 2018 | August | 1 | | 939 |
| 2018 | September | 1 | | 835 |
| 2018 | January | 2-4 | | |
| … | … | … | … | … |
| 2018 | August | | South | 649 |
| 2018 | September | | South | 560 |
| 2018 | January | | West | 339 |
| 2018 | February | | West | 336 |
| 2018 | March | | West | 330 |
| 2018 | April | | West | 304 |
| 2018 | May | | West | 319 |
| 2018 | June | | West | 360 |
| 2018 | July | | West | 310 |
| 2018 | August | | West | 286 |
| 2018 | September | | West | 296 |
Conversion 2 - Convert to longer form and back and rename columns
``` clojure
(-> construction
(tc/pivot->longer #"^[125NWS].*|Midwest" {:target-columns [:units :region]
:splitter (fn [col-name]
(if (re-matches #"^[125].*" col-name)
[(construction-unit-map col-name) nil]
[nil col-name]))
:value-column-name :n
:drop-missing? false})
(tc/pivot->wider [:units :region] :n {:drop-missing? false})
(tc/rename-columns (zipmap (vals construction-unit-map)
(keys construction-unit-map))))
```
data/construction.csv \[9 9\]:
| Year | Month | 1 | 2 to 4 units | 5 units or more | Northeast | Midwest | South | West |
|-----:|-----------|----:|-------------:|----------------:|----------:|--------:|------:|-----:|
| 2018 | January | 859 | | 348 | 114 | 169 | 596 | 339 |
| 2018 | February | 882 | | 400 | 138 | 160 | 655 | 336 |
| 2018 | March | 862 | | 356 | 150 | 154 | 595 | 330 |
| 2018 | April | 797 | | 447 | 144 | 196 | 613 | 304 |
| 2018 | May | 875 | | 364 | 90 | 169 | 673 | 319 |
| 2018 | June | 867 | | 342 | 76 | 170 | 610 | 360 |
| 2018 | July | 829 | | 360 | 108 | 183 | 594 | 310 |
| 2018 | August | 939 | | 286 | 90 | 205 | 649 | 286 |
| 2018 | September | 835 | | 304 | 117 | 175 | 560 | 296 |
------------------------------------------------------------------------
Various operations on stocks, examples taken from
[gather](https://tidyr.tidyverse.org/reference/gather.html) and
[spread](https://tidyr.tidyverse.org/reference/spread.html) manuals.
``` clojure
(def stocks-tidyr (tc/dataset "data/stockstidyr.csv"))
```
``` clojure
stocks-tidyr
```
data/stockstidyr.csv \[10 4\]:
| time | X | Y | Z |
|------------|------------:|------------:|------------:|
| 2009-01-01 | 1.30989806 | -1.89040193 | -1.77946880 |
| 2009-01-02 | -0.29993804 | -1.82473090 | 2.39892513 |
| 2009-01-03 | 0.53647501 | -1.03606860 | -3.98697977 |
| 2009-01-04 | -1.88390802 | -0.52178390 | -2.83065490 |
| 2009-01-05 | -0.96052361 | -2.21683349 | 1.43715171 |
| 2009-01-06 | -1.18528966 | -2.89350924 | 3.39784140 |
| 2009-01-07 | -0.85207056 | -2.16794818 | -1.20108258 |
| 2009-01-08 | 0.25234172 | -0.32854117 | -1.53160473 |
| 2009-01-09 | 0.40257136 | 1.96407898 | -6.80878830 |
| 2009-01-10 | -0.64383500 | 2.68618382 | -2.55909321 |
Convert to longer form
``` clojure
(def stocks-long (tc/pivot->longer stocks-tidyr ["X" "Y" "Z"] {:value-column-name :price
:target-columns :stocks}))
```
``` clojure
stocks-long
```
data/stockstidyr.csv \[30 3\]:
| time | :stocks | :price |
|------------|---------|------------:|
| 2009-01-01 | X | 1.30989806 |
| 2009-01-02 | X | -0.29993804 |
| 2009-01-03 | X | 0.53647501 |
| 2009-01-04 | X | -1.88390802 |
| 2009-01-05 | X | -0.96052361 |
| 2009-01-06 | X | -1.18528966 |
| 2009-01-07 | X | -0.85207056 |
| 2009-01-08 | X | 0.25234172 |
| 2009-01-09 | X | 0.40257136 |
| 2009-01-10 | X | -0.64383500 |
| … | … | … |
| 2009-01-10 | Y | 2.68618382 |
| 2009-01-01 | Z | -1.77946880 |
| 2009-01-02 | Z | 2.39892513 |
| 2009-01-03 | Z | -3.98697977 |
| 2009-01-04 | Z | -2.83065490 |
| 2009-01-05 | Z | 1.43715171 |
| 2009-01-06 | Z | 3.39784140 |
| 2009-01-07 | Z | -1.20108258 |
| 2009-01-08 | Z | -1.53160473 |
| 2009-01-09 | Z | -6.80878830 |
| 2009-01-10 | Z | -2.55909321 |
Convert back to wide form
``` clojure
(tc/pivot->wider stocks-long :stocks :price)
```
data/stockstidyr.csv \[10 4\]:
| time | X | Y | Z |
|------------|------------:|------------:|------------:|
| 2009-01-01 | 1.30989806 | -1.89040193 | -1.77946880 |
| 2009-01-02 | -0.29993804 | -1.82473090 | 2.39892513 |
| 2009-01-03 | 0.53647501 | -1.03606860 | -3.98697977 |
| 2009-01-04 | -1.88390802 | -0.52178390 | -2.83065490 |
| 2009-01-05 | -0.96052361 | -2.21683349 | 1.43715171 |
| 2009-01-06 | -1.18528966 | -2.89350924 | 3.39784140 |
| 2009-01-07 | -0.85207056 | -2.16794818 | -1.20108258 |
| 2009-01-08 | 0.25234172 | -0.32854117 | -1.53160473 |
| 2009-01-09 | 0.40257136 | 1.96407898 | -6.80878830 |
| 2009-01-10 | -0.64383500 | 2.68618382 | -2.55909321 |
Convert to wide form on time column (let’s limit values to a couple of
rows)
``` clojure
(-> stocks-long
(tc/select-rows (range 0 30 4))
(tc/pivot->wider "time" :price {:drop-missing? false}))
```
data/stockstidyr.csv \[3 6\]:
| :stocks | 2009-01-01 | 2009-01-05 | 2009-01-09 | 2009-01-03 | 2009-01-07 |
|---------|------------:|------------:|------------:|-----------:|------------:|
| Y | | | | -1.0360686 | -2.16794818 |
| X | 1.30989806 | -0.96052361 | 0.40257136 | | |
| Z | -1.77946880 | 1.43715171 | -6.80878830 | | |
### Join/Concat Datasets
Dataset join and concatenation functions.
Joins accept left-side and right-side datasets and columns selector.
Options are the same as in `tech.ml.dataset` functions.
A column selector can be a map with `:left` and `:right` keys to specify
column names separate for left and right dataset.
The difference between `tech.ml.dataset` join functions are: arguments
order (first datasets) and possibility to join on multiple columns.
Multiple columns joins create temporary index column from column
selection. The method for creating index is based on `:hashing` option
and defaults to `identity`. Prior to `7.000-beta-50` `hash` function was
used, which caused hash collision for certain cases.
Additionally set operations are defined: `intersect` and `difference`.
To concat two datasets rowwise you can choose:
- `concat` - concats rows for matching columns, the number of columns
should be equal.
- `union` - like concat but returns unique values
- `bind` - concats rows add missing, empty columns
To add two datasets columnwise use `bind`. The number of rows should be
equal.
Datasets used in examples:
``` clojure
(def ds1 (tc/dataset {:a [1 2 1 2 3 4 nil nil 4]
:b (range 101 110)
:c (map str "abs tract")}))
(def ds2 (tc/dataset {:a [nil 1 2 5 4 3 2 1 nil]
:b (range 110 101 -1)
:c (map str "datatable")
:d (symbol "X")
:e [3 4 5 6 7 nil 8 1 1]}))
```
``` clojure
ds1
ds2
```
\_unnamed \[9 3\]:
| :a | :b | :c |
|----:|----:|-----|
| 1 | 101 | a |
| 2 | 102 | b |
| 1 | 103 | s |
| 2 | 104 | |
| 3 | 105 | t |
| 4 | 106 | r |
| | 107 | a |
| | 108 | c |
| 4 | 109 | t |
\_unnamed \[9 5\]:
| :a | :b | :c | :d | :e |
|----:|----:|-----|-----|----:|
| | 110 | d | X | 3 |
| 1 | 109 | a | X | 4 |
| 2 | 108 | t | X | 5 |
| 5 | 107 | a | X | 6 |
| 4 | 106 | t | X | 7 |
| 3 | 105 | a | X | |
| 2 | 104 | b | X | 8 |
| 1 | 103 | l | X | 1 |
| | 102 | e | X | 1 |
#### Left
``` clojure
(tc/left-join ds1 ds2 :b)
```
left-outer-join \[9 8\]:
| :b | :a | :c | :right.b | :right.a | :right.c | :d | :e |
|----:|----:|-----|---------:|---------:|----------|-----|----:|
| 109 | 4 | t | 109 | 1 | a | X | 4 |
| 108 | | c | 108 | 2 | t | X | 5 |
| 107 | | a | 107 | 5 | a | X | 6 |
| 106 | 4 | r | 106 | 4 | t | X | 7 |
| 105 | 3 | t | 105 | 3 | a | X | |
| 104 | 2 | | 104 | 2 | b | X | 8 |
| 103 | 1 | s | 103 | 1 | l | X | 1 |
| 102 | 2 | b | 102 | | e | X | 1 |
| 101 | 1 | a | | | | | |
------------------------------------------------------------------------
``` clojure
(tc/left-join ds2 ds1 :b)
```
left-outer-join \[9 8\]:
| :b | :a | :c | :d | :e | :right.b | :right.a | :right.c |
|----:|----:|-----|-----|----:|---------:|---------:|----------|
| 102 | | e | X | 1 | 102 | 2 | b |
| 103 | 1 | l | X | 1 | 103 | 1 | s |
| 104 | 2 | b | X | 8 | 104 | 2 | |
| 105 | 3 | a | X | | 105 | 3 | t |
| 106 | 4 | t | X | 7 | 106 | 4 | r |
| 107 | 5 | a | X | 6 | 107 | | a |
| 108 | 2 | t | X | 5 | 108 | | c |
| 109 | 1 | a | X | 4 | 109 | 4 | t |
| 110 | | d | X | 3 | | | |
------------------------------------------------------------------------
``` clojure
(tc/left-join ds1 ds2 [:a :b])
```
left-outer-join \[9 8\]:
| :a | :b | :c | :right.a | :right.b | :right.c | :d | :e |
|----:|----:|-----|---------:|---------:|----------|-----|----:|
| 4 | 106 | r | 4 | 106 | t | X | 7 |
| 3 | 105 | t | 3 | 105 | a | X | |
| 2 | 104 | | 2 | 104 | b | X | 8 |
| 1 | 103 | s | 1 | 103 | l | X | 1 |
| 1 | 101 | a | | | | | |
| 2 | 102 | b | | | | | |
| | 107 | a | | | | | |
| | 108 | c | | | | | |
| 4 | 109 | t | | | | | |
------------------------------------------------------------------------
``` clojure
(tc/left-join ds2 ds1 [:a :b])
```
left-outer-join \[9 8\]:
| :a | :b | :c | :d | :e | :right.a | :right.b | :right.c |
|----:|----:|-----|-----|----:|---------:|---------:|----------|
| 1 | 103 | l | X | 1 | 1 | 103 | s |
| 2 | 104 | b | X | 8 | 2 | 104 | |
| 3 | 105 | a | X | | 3 | 105 | t |
| 4 | 106 | t | X | 7 | 4 | 106 | r |
| | 110 | d | X | 3 | | | |
| 1 | 109 | a | X | 4 | | | |
| 2 | 108 | t | X | 5 | | | |
| 5 | 107 | a | X | 6 | | | |
| | 102 | e | X | 1 | | | |
------------------------------------------------------------------------
``` clojure
(tc/left-join ds1 ds2 {:left :a :right :e})
```
left-outer-join \[11 8\]:
| :a | :b | :c | :e | :right.a | :right.b | :right.c | :d |
|----:|----:|-----|----:|---------:|---------:|----------|-----|
| 3 | 105 | t | 3 | | 110 | d | X |
| 4 | 106 | r | 4 | 1 | 109 | a | X |
| 4 | 109 | t | 4 | 1 | 109 | a | X |
| 1 | 101 | a | 1 | 1 | 103 | l | X |
| 1 | 103 | s | 1 | 1 | 103 | l | X |
| 1 | 101 | a | 1 | | 102 | e | X |
| 1 | 103 | s | 1 | | 102 | e | X |
| 2 | 102 | b | | | | | |
| 2 | 104 | | | | | | |
| | 107 | a | | | | | |
| | 108 | c | | | | | |
------------------------------------------------------------------------
``` clojure
(tc/left-join ds2 ds1 {:left :e :right :a})
```
left-outer-join \[12 8\]:
| :e | :a | :b | :c | :d | :right.a | :right.b | :right.c |
|----:|----:|----:|-----|-----|---------:|---------:|----------|
| 1 | 1 | 103 | l | X | 1 | 101 | a |
| 1 | | 102 | e | X | 1 | 101 | a |
| 1 | 1 | 103 | l | X | 1 | 103 | s |
| 1 | | 102 | e | X | 1 | 103 | s |
| 3 | | 110 | d | X | 3 | 105 | t |
| 4 | 1 | 109 | a | X | 4 | 106 | r |
| 4 | 1 | 109 | a | X | 4 | 109 | t |
| 5 | 2 | 108 | t | X | | | |
| 6 | 5 | 107 | a | X | | | |
| 7 | 4 | 106 | t | X | | | |
| | 3 | 105 | a | X | | | |
| 8 | 2 | 104 | b | X | | | |
#### Right
``` clojure
(tc/right-join ds1 ds2 :b)
```
right-outer-join \[9 8\]:
| :b | :a | :c | :right.b | :right.a | :right.c | :d | :e |
|----:|----:|-----|---------:|---------:|----------|-----|----:|
| 109 | 4 | t | 109 | 1 | a | X | 4 |
| 108 | | c | 108 | 2 | t | X | 5 |
| 107 | | a | 107 | 5 | a | X | 6 |
| 106 | 4 | r | 106 | 4 | t | X | 7 |
| 105 | 3 | t | 105 | 3 | a | X | |
| 104 | 2 | | 104 | 2 | b | X | 8 |
| 103 | 1 | s | 103 | 1 | l | X | 1 |
| 102 | 2 | b | 102 | | e | X | 1 |
| | | | 110 | | d | X | 3 |
------------------------------------------------------------------------
``` clojure
(tc/right-join ds2 ds1 :b)
```
right-outer-join \[9 8\]:
| :b | :a | :c | :d | :e | :right.b | :right.a | :right.c |
|----:|----:|-----|-----|----:|---------:|---------:|----------|
| 102 | | e | X | 1 | 102 | 2 | b |
| 103 | 1 | l | X | 1 | 103 | 1 | s |
| 104 | 2 | b | X | 8 | 104 | 2 | |
| 105 | 3 | a | X | | 105 | 3 | t |
| 106 | 4 | t | X | 7 | 106 | 4 | r |
| 107 | 5 | a | X | 6 | 107 | | a |
| 108 | 2 | t | X | 5 | 108 | | c |
| 109 | 1 | a | X | 4 | 109 | 4 | t |
| | | | | | 101 | 1 | a |
------------------------------------------------------------------------
``` clojure
(tc/right-join ds1 ds2 [:a :b])
```
right-outer-join \[9 8\]:
| :a | :b | :c | :right.a | :right.b | :right.c | :d | :e |
|----:|----:|-----|---------:|---------:|----------|-----|----:|
| 4 | 106 | r | 4 | 106 | t | X | 7 |
| 3 | 105 | t | 3 | 105 | a | X | |
| 2 | 104 | | 2 | 104 | b | X | 8 |
| 1 | 103 | s | 1 | 103 | l | X | 1 |
| | | | | 110 | d | X | 3 |
| | | | 1 | 109 | a | X | 4 |
| | | | 2 | 108 | t | X | 5 |
| | | | 5 | 107 | a | X | 6 |
| | | | | 102 | e | X | 1 |
------------------------------------------------------------------------
``` clojure
(tc/right-join ds2 ds1 [:a :b])
```
right-outer-join \[9 8\]:
| :a | :b | :c | :d | :e | :right.a | :right.b | :right.c |
|----:|----:|-----|-----|----:|---------:|---------:|----------|
| 1 | 103 | l | X | 1 | 1 | 103 | s |
| 2 | 104 | b | X | 8 | 2 | 104 | |
| 3 | 105 | a | X | | 3 | 105 | t |
| 4 | 106 | t | X | 7 | 4 | 106 | r |
| | | | | | 1 | 101 | a |
| | | | | | 2 | 102 | b |
| | | | | | | 107 | a |
| | | | | | | 108 | c |
| | | | | | 4 | 109 | t |
------------------------------------------------------------------------
``` clojure
(tc/right-join ds1 ds2 {:left :a :right :e})
```
right-outer-join \[12 8\]:
| :a | :b | :c | :e | :right.a | :right.b | :right.c | :d |
|----:|----:|-----|----:|---------:|---------:|----------|-----|
| 3 | 105 | t | 3 | | 110 | d | X |
| 4 | 106 | r | 4 | 1 | 109 | a | X |
| 4 | 109 | t | 4 | 1 | 109 | a | X |
| 1 | 101 | a | 1 | 1 | 103 | l | X |
| 1 | 103 | s | 1 | 1 | 103 | l | X |
| 1 | 101 | a | 1 | | 102 | e | X |
| 1 | 103 | s | 1 | | 102 | e | X |
| | | | 5 | 2 | 108 | t | X |
| | | | 6 | 5 | 107 | a | X |
| | | | 7 | 4 | 106 | t | X |
| | | | | 3 | 105 | a | X |
| | | | 8 | 2 | 104 | b | X |
------------------------------------------------------------------------
``` clojure
(tc/right-join ds2 ds1 {:left :e :right :a})
```
right-outer-join \[11 8\]:
| :e | :a | :b | :c | :d | :right.a | :right.b | :right.c |
|----:|----:|----:|-----|-----|---------:|---------:|----------|
| 1 | 1 | 103 | l | X | 1 | 101 | a |
| 1 | | 102 | e | X | 1 | 101 | a |
| 1 | 1 | 103 | l | X | 1 | 103 | s |
| 1 | | 102 | e | X | 1 | 103 | s |
| 3 | | 110 | d | X | 3 | 105 | t |
| 4 | 1 | 109 | a | X | 4 | 106 | r |
| 4 | 1 | 109 | a | X | 4 | 109 | t |
| | | | | | 2 | 102 | b |
| | | | | | 2 | 104 | |
| | | | | | | 107 | a |
| | | | | | | 108 | c |
#### Inner
``` clojure
(tc/inner-join ds1 ds2 :b)
```
inner-join \[8 7\]:
| :b | :a | :c | :right.a | :right.c | :d | :e |
|----:|----:|-----|---------:|----------|-----|----:|
| 109 | 4 | t | 1 | a | X | 4 |
| 108 | | c | 2 | t | X | 5 |
| 107 | | a | 5 | a | X | 6 |
| 106 | 4 | r | 4 | t | X | 7 |
| 105 | 3 | t | 3 | a | X | |
| 104 | 2 | | 2 | b | X | 8 |
| 103 | 1 | s | 1 | l | X | 1 |
| 102 | 2 | b | | e | X | 1 |
------------------------------------------------------------------------
``` clojure
(tc/inner-join ds2 ds1 :b)
```
inner-join \[8 7\]:
| :b | :a | :c | :d | :e | :right.a | :right.c |
|----:|----:|-----|-----|----:|---------:|----------|
| 102 | | e | X | 1 | 2 | b |
| 103 | 1 | l | X | 1 | 1 | s |
| 104 | 2 | b | X | 8 | 2 | |
| 105 | 3 | a | X | | 3 | t |
| 106 | 4 | t | X | 7 | 4 | r |
| 107 | 5 | a | X | 6 | | a |
| 108 | 2 | t | X | 5 | | c |
| 109 | 1 | a | X | 4 | 4 | t |
------------------------------------------------------------------------
``` clojure
(tc/inner-join ds1 ds2 [:a :b])
```
inner-join \[4 8\]:
| :a | :b | :c | :right.a | :right.b | :right.c | :d | :e |
|----:|----:|-----|---------:|---------:|----------|-----|----:|
| 4 | 106 | r | 4 | 106 | t | X | 7 |
| 3 | 105 | t | 3 | 105 | a | X | |
| 2 | 104 | | 2 | 104 | b | X | 8 |
| 1 | 103 | s | 1 | 103 | l | X | 1 |
------------------------------------------------------------------------
``` clojure
(tc/inner-join ds2 ds1 [:a :b])
```
inner-join \[4 8\]:
| :a | :b | :c | :d | :e | :right.a | :right.b | :right.c |
|----:|----:|-----|-----|----:|---------:|---------:|----------|
| 1 | 103 | l | X | 1 | 1 | 103 | s |
| 2 | 104 | b | X | 8 | 2 | 104 | |
| 3 | 105 | a | X | | 3 | 105 | t |
| 4 | 106 | t | X | 7 | 4 | 106 | r |
------------------------------------------------------------------------
``` clojure
(tc/inner-join ds1 ds2 {:left :a :right :e})
```
inner-join \[7 7\]:
| :a | :b | :c | :right.a | :right.b | :right.c | :d |
|----:|----:|-----|---------:|---------:|----------|-----|
| 3 | 105 | t | | 110 | d | X |
| 4 | 106 | r | 1 | 109 | a | X |
| 4 | 109 | t | 1 | 109 | a | X |
| 1 | 101 | a | 1 | 103 | l | X |
| 1 | 103 | s | 1 | 103 | l | X |
| 1 | 101 | a | | 102 | e | X |
| 1 | 103 | s | | 102 | e | X |
------------------------------------------------------------------------
``` clojure
(tc/inner-join ds2 ds1 {:left :e :right :a})
```
inner-join \[7 7\]:
| :e | :a | :b | :c | :d | :right.b | :right.c |
|----:|----:|----:|-----|-----|---------:|----------|
| 1 | 1 | 103 | l | X | 101 | a |
| 1 | | 102 | e | X | 101 | a |
| 1 | 1 | 103 | l | X | 103 | s |
| 1 | | 102 | e | X | 103 | s |
| 3 | | 110 | d | X | 105 | t |
| 4 | 1 | 109 | a | X | 106 | r |
| 4 | 1 | 109 | a | X | 109 | t |
#### Full
Join keeping all rows
``` clojure
(tc/full-join ds1 ds2 :b)
```
full-join \[10 8\]:
| :b | :a | :c | :right.b | :right.a | :right.c | :d | :e |
|----:|----:|-----|---------:|---------:|----------|-----|----:|
| 109 | 4 | t | 109 | 1 | a | X | 4 |
| 108 | | c | 108 | 2 | t | X | 5 |
| 107 | | a | 107 | 5 | a | X | 6 |
| 106 | 4 | r | 106 | 4 | t | X | 7 |
| 105 | 3 | t | 105 | 3 | a | X | |
| 104 | 2 | | 104 | 2 | b | X | 8 |
| 103 | 1 | s | 103 | 1 | l | X | 1 |
| 102 | 2 | b | 102 | | e | X | 1 |
| 101 | 1 | a | | | | | |
| | | | 110 | | d | X | 3 |
------------------------------------------------------------------------
``` clojure
(tc/full-join ds2 ds1 :b)
```
full-join \[10 8\]:
| :b | :a | :c | :d | :e | :right.b | :right.a | :right.c |
|----:|----:|-----|-----|----:|---------:|---------:|----------|
| 102 | | e | X | 1 | 102 | 2 | b |
| 103 | 1 | l | X | 1 | 103 | 1 | s |
| 104 | 2 | b | X | 8 | 104 | 2 | |
| 105 | 3 | a | X | | 105 | 3 | t |
| 106 | 4 | t | X | 7 | 106 | 4 | r |
| 107 | 5 | a | X | 6 | 107 | | a |
| 108 | 2 | t | X | 5 | 108 | | c |
| 109 | 1 | a | X | 4 | 109 | 4 | t |
| 110 | | d | X | 3 | | | |
| | | | | | 101 | 1 | a |
------------------------------------------------------------------------
``` clojure
(tc/full-join ds1 ds2 [:a :b])
```
full-join \[14 8\]:
| :a | :b | :c | :right.a | :right.b | :right.c | :d | :e |
|----:|----:|-----|---------:|---------:|----------|-----|----:|
| 4 | 106 | r | 4 | 106 | t | X | 7 |
| 3 | 105 | t | 3 | 105 | a | X | |
| 2 | 104 | | 2 | 104 | b | X | 8 |
| 1 | 103 | s | 1 | 103 | l | X | 1 |
| 1 | 101 | a | | | | | |
| 2 | 102 | b | | | | | |
| | 107 | a | | | | | |
| | 108 | c | | | | | |
| 4 | 109 | t | | | | | |
| | | | | 110 | d | X | 3 |
| | | | 1 | 109 | a | X | 4 |
| | | | 2 | 108 | t | X | 5 |
| | | | 5 | 107 | a | X | 6 |
| | | | | 102 | e | X | 1 |
------------------------------------------------------------------------
``` clojure
(tc/full-join ds2 ds1 [:a :b])
```
full-join \[14 8\]:
| :a | :b | :c | :d | :e | :right.a | :right.b | :right.c |
|----:|----:|-----|-----|----:|---------:|---------:|----------|
| 1 | 103 | l | X | 1 | 1 | 103 | s |
| 2 | 104 | b | X | 8 | 2 | 104 | |
| 3 | 105 | a | X | | 3 | 105 | t |
| 4 | 106 | t | X | 7 | 4 | 106 | r |
| | 110 | d | X | 3 | | | |
| 1 | 109 | a | X | 4 | | | |
| 2 | 108 | t | X | 5 | | | |
| 5 | 107 | a | X | 6 | | | |
| | 102 | e | X | 1 | | | |
| | | | | | 1 | 101 | a |
| | | | | | 2 | 102 | b |
| | | | | | | 107 | a |
| | | | | | | 108 | c |
| | | | | | 4 | 109 | t |
------------------------------------------------------------------------
``` clojure
(tc/full-join ds1 ds2 {:left :a :right :e})
```
full-join \[16 8\]:
| :a | :b | :c | :e | :right.a | :right.b | :right.c | :d |
|----:|----:|-----|----:|---------:|---------:|----------|-----|
| 3 | 105 | t | 3 | | 110 | d | X |
| 4 | 106 | r | 4 | 1 | 109 | a | X |
| 4 | 109 | t | 4 | 1 | 109 | a | X |
| 1 | 101 | a | 1 | 1 | 103 | l | X |
| 1 | 103 | s | 1 | 1 | 103 | l | X |
| 1 | 101 | a | 1 | | 102 | e | X |
| 1 | 103 | s | 1 | | 102 | e | X |
| 2 | 102 | b | | | | | |
| 2 | 104 | | | | | | |
| | 107 | a | | | | | |
| | 108 | c | | | | | |
| | | | 5 | 2 | 108 | t | X |
| | | | 6 | 5 | 107 | a | X |
| | | | 7 | 4 | 106 | t | X |
| | | | | 3 | 105 | a | X |
| | | | 8 | 2 | 104 | b | X |
------------------------------------------------------------------------
``` clojure
(tc/full-join ds2 ds1 {:left :e :right :a})
```
full-join \[16 8\]:
| :e | :a | :b | :c | :d | :right.a | :right.b | :right.c |
|----:|----:|----:|-----|-----|---------:|---------:|----------|
| 1 | 1 | 103 | l | X | 1 | 101 | a |
| 1 | | 102 | e | X | 1 | 101 | a |
| 1 | 1 | 103 | l | X | 1 | 103 | s |
| 1 | | 102 | e | X | 1 | 103 | s |
| 3 | | 110 | d | X | 3 | 105 | t |
| 4 | 1 | 109 | a | X | 4 | 106 | r |
| 4 | 1 | 109 | a | X | 4 | 109 | t |
| 5 | 2 | 108 | t | X | | | |
| 6 | 5 | 107 | a | X | | | |
| 7 | 4 | 106 | t | X | | | |
| | 3 | 105 | a | X | | | |
| 8 | 2 | 104 | b | X | | | |
| | | | | | 2 | 102 | b |
| | | | | | 2 | 104 | |
| | | | | | | 107 | a |
| | | | | | | 108 | c |
#### Semi
Return rows from ds1 matching ds2
``` clojure
(tc/semi-join ds1 ds2 :b)
```
semi-join \[8 3\]:
| :b | :a | :c |
|----:|----:|-----|
| 109 | 4 | t |
| 108 | | c |
| 107 | | a |
| 106 | 4 | r |
| 105 | 3 | t |
| 104 | 2 | |
| 103 | 1 | s |
| 102 | 2 | b |
------------------------------------------------------------------------
``` clojure
(tc/semi-join ds2 ds1 :b)
```
semi-join \[8 5\]:
| :b | :a | :c | :d | :e |
|----:|----:|-----|-----|----:|
| 102 | | e | X | 1 |
| 103 | 1 | l | X | 1 |
| 104 | 2 | b | X | 8 |
| 105 | 3 | a | X | |
| 106 | 4 | t | X | 7 |
| 107 | 5 | a | X | 6 |
| 108 | 2 | t | X | 5 |
| 109 | 1 | a | X | 4 |
------------------------------------------------------------------------
``` clojure
(tc/semi-join ds1 ds2 [:a :b])
```
semi-join \[4 3\]:
| :a | :b | :c |
|----:|----:|-----|
| 4 | 106 | r |
| 3 | 105 | t |
| 2 | 104 | |
| 1 | 103 | s |
------------------------------------------------------------------------
``` clojure
(tc/semi-join ds2 ds1 [:a :b])
```
semi-join \[4 5\]:
| :a | :b | :c | :d | :e |
|----:|----:|-----|-----|----:|
| 1 | 103 | l | X | 1 |
| 2 | 104 | b | X | 8 |
| 3 | 105 | a | X | |
| 4 | 106 | t | X | 7 |
------------------------------------------------------------------------
``` clojure
(tc/semi-join ds1 ds2 {:left :a :right :e})
```
semi-join \[7 3\]:
| :a | :b | :c |
|----:|----:|-----|
| 3 | 105 | t |
| 4 | 106 | r |
| 4 | 109 | t |
| 1 | 101 | a |
| 1 | 103 | s |
| | 107 | a |
| | 108 | c |
------------------------------------------------------------------------
``` clojure
(tc/semi-join ds2 ds1 {:left :e :right :a})
```
semi-join \[5 5\]:
| :e | :a | :b | :c | :d |
|----:|----:|----:|-----|-----|
| 1 | 1 | 103 | l | X |
| 1 | | 102 | e | X |
| 3 | | 110 | d | X |
| 4 | 1 | 109 | a | X |
| | 3 | 105 | a | X |
#### Anti
Return rows from ds1 not matching ds2
``` clojure
(tc/anti-join ds1 ds2 :b)
```
anti-join \[1 3\]:
| :b | :a | :c |
|----:|----:|-----|
| 101 | 1 | a |
------------------------------------------------------------------------
``` clojure
(tc/anti-join ds2 ds1 :b)
```
anti-join \[1 5\]:
| :b | :a | :c | :d | :e |
|----:|----:|-----|-----|----:|
| 110 | | d | X | 3 |
------------------------------------------------------------------------
``` clojure
(tc/anti-join ds1 ds2 [:a :b])
```
anti-join \[5 3\]:
| :a | :b | :c |
|----:|----:|-----|
| 1 | 101 | a |
| 2 | 102 | b |
| | 107 | a |
| | 108 | c |
| 4 | 109 | t |
------------------------------------------------------------------------
``` clojure
(tc/anti-join ds1 ds2 {:left :a :right :e})
```
anti-join \[2 3\]:
| :a | :b | :c |
|----:|----:|-----|
| 2 | 102 | b |
| 2 | 104 | |
------------------------------------------------------------------------
``` clojure
(tc/anti-join ds2 ds1 {:left :e :right :a})
```
anti-join \[4 5\]:
| :e | :a | :b | :c | :d |
|----:|----:|----:|-----|-----|
| 5 | 2 | 108 | t | X |
| 6 | 5 | 107 | a | X |
| 7 | 4 | 106 | t | X |
| 8 | 2 | 104 | b | X |
#### Hashing
When `:hashing` option is used, data from join columns are preprocessed
by applying `join-columns` funtion with `:result-type` set to the value
of `:hashing`. This helps to create custom joining behaviour. Function
used for hashing will get vector of row values from join columns.
In the following example we will join columns on value modulo 5.
``` clojure
(tc/left-join ds1 ds2 :b {:hashing (fn [[v]] (mod v 5))})
```
left-outer-join \[16 8\]:
| :a | :b | :c | :right.a | :right.b | :right.c | :d | :e |
|----:|----:|-----|---------:|---------:|----------|-----|----:|
| 3 | 105 | t | | 110 | d | X | 3 |
| 2 | 104 | | 1 | 109 | a | X | 4 |
| 4 | 109 | t | 1 | 109 | a | X | 4 |
| 1 | 103 | s | 2 | 108 | t | X | 5 |
| | 108 | c | 2 | 108 | t | X | 5 |
| 2 | 102 | b | 5 | 107 | a | X | 6 |
| | 107 | a | 5 | 107 | a | X | 6 |
| 1 | 101 | a | 4 | 106 | t | X | 7 |
| 4 | 106 | r | 4 | 106 | t | X | 7 |
| 3 | 105 | t | 3 | 105 | a | X | |
| 2 | 104 | | 2 | 104 | b | X | 8 |
| 4 | 109 | t | 2 | 104 | b | X | 8 |
| 1 | 103 | s | 1 | 103 | l | X | 1 |
| | 108 | c | 1 | 103 | l | X | 1 |
| 2 | 102 | b | | 102 | e | X | 1 |
| | 107 | a | | 102 | e | X | 1 |
#### Cross
Cross product from selected columns
``` clojure
(tc/cross-join ds1 ds2 [:a :b])
```
cross-join \[81 4\]:
| :a | :b | :right.a | :right.b |
|----:|----:|---------:|---------:|
| 1 | 101 | | 110 |
| 1 | 101 | 1 | 109 |
| 1 | 101 | 2 | 108 |
| 1 | 101 | 5 | 107 |
| 1 | 101 | 4 | 106 |
| 1 | 101 | 3 | 105 |
| 1 | 101 | 2 | 104 |
| 1 | 101 | 1 | 103 |
| 1 | 101 | | 102 |
| 2 | 102 | | 110 |
| … | … | … | … |
| | 108 | 1 | 103 |
| | 108 | | 102 |
| 4 | 109 | | 110 |
| 4 | 109 | 1 | 109 |
| 4 | 109 | 2 | 108 |
| 4 | 109 | 5 | 107 |
| 4 | 109 | 4 | 106 |
| 4 | 109 | 3 | 105 |
| 4 | 109 | 2 | 104 |
| 4 | 109 | 1 | 103 |
| 4 | 109 | | 102 |
------------------------------------------------------------------------
``` clojure
(tc/cross-join ds1 ds2 {:left [:a :b] :right :e})
```
cross-join \[81 3\]:
| :a | :b | :e |
|----:|----:|----:|
| 1 | 101 | 3 |
| 1 | 101 | 4 |
| 1 | 101 | 5 |
| 1 | 101 | 6 |
| 1 | 101 | 7 |
| 1 | 101 | |
| 1 | 101 | 8 |
| 1 | 101 | 1 |
| 1 | 101 | 1 |
| 2 | 102 | 3 |
| … | … | … |
| | 108 | 1 |
| | 108 | 1 |
| 4 | 109 | 3 |
| 4 | 109 | 4 |
| 4 | 109 | 5 |
| 4 | 109 | 6 |
| 4 | 109 | 7 |
| 4 | 109 | |
| 4 | 109 | 8 |
| 4 | 109 | 1 |
| 4 | 109 | 1 |
#### Expand
Similar to cross product but works on a single dataset.
``` clojure
(tc/expand ds2 :a :c :d)
```
cross-join \[36 3\]:
| :a | :c | :d |
|----:|-----|-----|
| | d | X |
| | a | X |
| | t | X |
| | b | X |
| | l | X |
| | e | X |
| 1 | d | X |
| 1 | a | X |
| 1 | t | X |
| 1 | b | X |
| … | … | … |
| 4 | a | X |
| 4 | t | X |
| 4 | b | X |
| 4 | l | X |
| 4 | e | X |
| 3 | d | X |
| 3 | a | X |
| 3 | t | X |
| 3 | b | X |
| 3 | l | X |
| 3 | e | X |
------------------------------------------------------------------------
Columns can be also bundled (nested) in tuples which are treated as a
single entity during cross product.
``` clojure
(tc/expand ds2 [:a :c] [:e :b])
```
cross-join \[81 4\]:
| :a | :c | :e | :b |
|----:|-----|----:|----:|
| | d | 3 | 110 |
| | d | 4 | 109 |
| | d | 5 | 108 |
| | d | 6 | 107 |
| | d | 7 | 106 |
| | d | | 105 |
| | d | 8 | 104 |
| | d | 1 | 103 |
| | d | 1 | 102 |
| 1 | a | 3 | 110 |
| … | … | … | … |
| 1 | l | 1 | 103 |
| 1 | l | 1 | 102 |
| | e | 3 | 110 |
| | e | 4 | 109 |
| | e | 5 | 108 |
| | e | 6 | 107 |
| | e | 7 | 106 |
| | e | | 105 |
| | e | 8 | 104 |
| | e | 1 | 103 |
| | e | 1 | 102 |
#### Complete
Same as expand with all other columns preserved (filled with missing
values if necessary).
``` clojure
(tc/complete ds2 :a :c :d)
```
left-outer-join \[36 5\]:
| :a | :c | :d | :b | :e |
|----:|-----|-----|----:|----:|
| | d | X | 110 | 3 |
| 1 | a | X | 109 | 4 |
| 2 | t | X | 108 | 5 |
| 5 | a | X | 107 | 6 |
| 4 | t | X | 106 | 7 |
| 3 | a | X | 105 | |
| 2 | b | X | 104 | 8 |
| 1 | l | X | 103 | 1 |
| | e | X | 102 | 1 |
| | a | X | | |
| … | … | … | … | … |
| 5 | e | X | | |
| 4 | d | X | | |
| 4 | a | X | | |
| 4 | b | X | | |
| 4 | l | X | | |
| 4 | e | X | | |
| 3 | d | X | | |
| 3 | t | X | | |
| 3 | b | X | | |
| 3 | l | X | | |
| 3 | e | X | | |
------------------------------------------------------------------------
``` clojure
(tc/complete ds2 [:a :c] [:e :b])
```
left-outer-join \[81 5\]:
| :a | :c | :e | :b | :d |
|----:|-----|----:|----:|-----|
| | d | 3 | 110 | X |
| 1 | a | 4 | 109 | X |
| 2 | t | 5 | 108 | X |
| 5 | a | 6 | 107 | X |
| 4 | t | 7 | 106 | X |
| 3 | a | | 105 | X |
| 2 | b | 8 | 104 | X |
| 1 | l | 1 | 103 | X |
| | e | 1 | 102 | X |
| | d | 4 | 109 | |
| … | … | … | … | … |
| 1 | l | | 105 | |
| 1 | l | 8 | 104 | |
| 1 | l | 1 | 102 | |
| | e | 3 | 110 | |
| | e | 4 | 109 | |
| | e | 5 | 108 | |
| | e | 6 | 107 | |
| | e | 7 | 106 | |
| | e | | 105 | |
| | e | 8 | 104 | |
| | e | 1 | 103 | |
#### asof
``` clojure
(def left-ds (tc/dataset {:a [1 5 10]
:left-val ["a" "b" "c"]}))
(def right-ds (tc/dataset {:a [1 2 3 6 7]
:right-val [:a :b :c :d :e]}))
```
``` clojure
left-ds
right-ds
```
\_unnamed \[3 2\]:
| :a | :left-val |
|----:|-----------|
| 1 | a |
| 5 | b |
| 10 | c |
\_unnamed \[5 2\]:
| :a | :right-val |
|----:|------------|
| 1 | :a |
| 2 | :b |
| 3 | :c |
| 6 | :d |
| 7 | :e |
``` clojure
(tc/asof-join left-ds right-ds :a)
```
asof-\<= \[3 4\]:
| :a | :left-val | :right.a | :right-val |
|----:|-----------|---------:|------------|
| 1 | a | 1 | :a |
| 5 | b | 6 | :d |
| 10 | c | | |
``` clojure
(tc/asof-join left-ds right-ds :a {:asof-op :nearest})
```
asof-nearest \[3 4\]:
| :a | :left-val | :right.a | :right-val |
|----:|-----------|---------:|------------|
| 1 | a | 1 | :a |
| 5 | b | 6 | :d |
| 10 | c | 7 | :e |
``` clojure
(tc/asof-join left-ds right-ds :a {:asof-op :>=})
```
asof-\>= \[3 4\]:
| :a | :left-val | :right.a | :right-val |
|----:|-----------|---------:|------------|
| 1 | a | 1 | :a |
| 5 | b | 3 | :c |
| 10 | c | 7 | :e |
#### Concat
`contact` joins rows from other datasets
``` clojure
(tc/concat ds1)
```
\_unnamed \[9 3\]:
| :a | :b | :c |
|----:|----:|-----|
| 1 | 101 | a |
| 2 | 102 | b |
| 1 | 103 | s |
| 2 | 104 | |
| 3 | 105 | t |
| 4 | 106 | r |
| | 107 | a |
| | 108 | c |
| 4 | 109 | t |
------------------------------------------------------------------------
`concat-copying` ensures all readers are evaluated.
``` clojure
(tc/concat-copying ds1)
```
\_unnamed \[9 3\]:
| :a | :b | :c |
|----:|----:|-----|
| 1 | 101 | a |
| 2 | 102 | b |
| 1 | 103 | s |
| 2 | 104 | |
| 3 | 105 | t |
| 4 | 106 | r |
| | 107 | a |
| | 108 | c |
| 4 | 109 | t |
------------------------------------------------------------------------
``` clojure
(tc/concat ds1 (tc/drop-columns ds2 :d))
```
\_unnamed \[18 4\]:
| :a | :b | :c | :e |
|----:|----:|-----|----:|
| 1 | 101 | a | |
| 2 | 102 | b | |
| 1 | 103 | s | |
| 2 | 104 | | |
| 3 | 105 | t | |
| 4 | 106 | r | |
| | 107 | a | |
| | 108 | c | |
| 4 | 109 | t | |
| | 110 | d | 3 |
| 1 | 109 | a | 4 |
| 2 | 108 | t | 5 |
| 5 | 107 | a | 6 |
| 4 | 106 | t | 7 |
| 3 | 105 | a | |
| 2 | 104 | b | 8 |
| 1 | 103 | l | 1 |
| | 102 | e | 1 |
------------------------------------------------------------------------
``` clojure
(apply tc/concat (repeatedly 3 #(tc/random DS)))
```
\_unnamed \[27 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 6 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 2 | 4 | 0.5 | A |
| 1 | 1 | 0.5 | A |
| 1 | 9 | 1.5 | C |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 1 | 1 | 0.5 | A |
| 1 | 3 | 1.5 | C |
| … | … | … | … |
| 2 | 8 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 3 | 1.5 | C |
| 1 | 1 | 0.5 | A |
| 1 | 7 | 0.5 | A |
| 1 | 1 | 0.5 | A |
| 1 | 7 | 0.5 | A |
| 2 | 6 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 2 | 2 | 1.0 | B |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
| 2 | 2 | 1.0 | B |
| 2 | 8 | 1.0 | B |
| 2 | 4 | 0.5 | A |
| 1 | 9 | 1.5 | C |
##### Concat grouped dataset
Concatenation of grouped datasets results also in grouped dataset.
``` clojure
(tc/concat (tc/group-by DS [:V3])
(tc/group-by DS [:V4]))
```
\_unnamed \[6 3\]:
| :name | :group-id | :data |
|-----------|----------:|---------------------------|
| {:V3 0.5} | 0 | Group: {:V3 0.5} \[3 4\]: |
| {:V3 1.0} | 1 | Group: {:V3 1.0} \[3 4\]: |
| {:V3 1.5} | 2 | Group: {:V3 1.5} \[3 4\]: |
| {:V4 “A”} | 3 | Group: {:V4 “A”} \[3 4\]: |
| {:V4 “B”} | 4 | Group: {:V4 “B”} \[3 4\]: |
| {:V4 “C”} | 5 | Group: {:V4 “C”} \[3 4\]: |
#### Union
The same as `concat` but returns unique rows
``` clojure
(apply tc/union (tc/drop-columns ds2 :d) (repeat 10 ds1))
```
union \[18 4\]:
| :a | :b | :c | :e |
|----:|----:|-----|----:|
| | 110 | d | 3 |
| 1 | 109 | a | 4 |
| 2 | 108 | t | 5 |
| 5 | 107 | a | 6 |
| 4 | 106 | t | 7 |
| 3 | 105 | a | |
| 2 | 104 | b | 8 |
| 1 | 103 | l | 1 |
| | 102 | e | 1 |
| 1 | 101 | a | |
| 2 | 102 | b | |
| 1 | 103 | s | |
| 2 | 104 | | |
| 3 | 105 | t | |
| 4 | 106 | r | |
| | 107 | a | |
| | 108 | c | |
| 4 | 109 | t | |
------------------------------------------------------------------------
``` clojure
(apply tc/union (repeatedly 10 #(tc/random DS)))
```
union \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 7 | 0.5 | A |
| 2 | 4 | 0.5 | A |
| 1 | 1 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 1 | 9 | 1.5 | C |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 8 | 1.0 | B |
| 2 | 6 | 1.5 | C |
#### Bind
`bind` adds empty columns during concat
``` clojure
(tc/bind ds1 ds2)
```
\_unnamed \[18 5\]:
| :a | :b | :c | :e | :d |
|----:|----:|-----|----:|-----|
| 1 | 101 | a | | |
| 2 | 102 | b | | |
| 1 | 103 | s | | |
| 2 | 104 | | | |
| 3 | 105 | t | | |
| 4 | 106 | r | | |
| | 107 | a | | |
| | 108 | c | | |
| 4 | 109 | t | | |
| | 110 | d | 3 | X |
| 1 | 109 | a | 4 | X |
| 2 | 108 | t | 5 | X |
| 5 | 107 | a | 6 | X |
| 4 | 106 | t | 7 | X |
| 3 | 105 | a | | X |
| 2 | 104 | b | 8 | X |
| 1 | 103 | l | 1 | X |
| | 102 | e | 1 | X |
------------------------------------------------------------------------
``` clojure
(tc/bind ds2 ds1)
```
\_unnamed \[18 5\]:
| :a | :b | :c | :d | :e |
|----:|----:|-----|-----|----:|
| | 110 | d | X | 3 |
| 1 | 109 | a | X | 4 |
| 2 | 108 | t | X | 5 |
| 5 | 107 | a | X | 6 |
| 4 | 106 | t | X | 7 |
| 3 | 105 | a | X | |
| 2 | 104 | b | X | 8 |
| 1 | 103 | l | X | 1 |
| | 102 | e | X | 1 |
| 1 | 101 | a | | |
| 2 | 102 | b | | |
| 1 | 103 | s | | |
| 2 | 104 | | | |
| 3 | 105 | t | | |
| 4 | 106 | r | | |
| | 107 | a | | |
| | 108 | c | | |
| 4 | 109 | t | | |
#### Append
`append` concats columns
``` clojure
(tc/append ds1 ds2)
```
\_unnamed \[9 8\]:
| :a | :b | :c | :a | :b | :c | :d | :e |
|----:|----:|-----|----:|----:|-----|-----|----:|
| 1 | 101 | a | | 110 | d | X | 3 |
| 2 | 102 | b | 1 | 109 | a | X | 4 |
| 1 | 103 | s | 2 | 108 | t | X | 5 |
| 2 | 104 | | 5 | 107 | a | X | 6 |
| 3 | 105 | t | 4 | 106 | t | X | 7 |
| 4 | 106 | r | 3 | 105 | a | X | |
| | 107 | a | 2 | 104 | b | X | 8 |
| | 108 | c | 1 | 103 | l | X | 1 |
| 4 | 109 | t | | 102 | e | X | 1 |
#### Intersection
``` clojure
(tc/intersect (tc/select-columns ds1 :b)
(tc/select-columns ds2 :b))
```
intersection \[8 1\]:
| :b |
|----:|
| 109 |
| 108 |
| 107 |
| 106 |
| 105 |
| 104 |
| 103 |
| 102 |
#### Difference
``` clojure
(tc/difference (tc/select-columns ds1 :b)
(tc/select-columns ds2 :b))
```
difference \[1 1\]:
| :b |
|----:|
| 101 |
------------------------------------------------------------------------
``` clojure
(tc/difference (tc/select-columns ds2 :b)
(tc/select-columns ds1 :b))
```
difference \[1 1\]:
| :b |
|----:|
| 110 |
### Split into train/test
In ML world very often you need to test given model and prepare
collection of train and test datasets. `split` creates new dataset with
two additional columns:
- `:$split-name` - with `:train`, `:test`, `:split-2`, … values
- `:$split-id` - id of splitted group (for k-fold and repeating)
`split-type` can be one of the following:
- `:kfold` (default) - k-fold strategy, `:k` defines number of folds
(defaults to `5`), produces `k` splits
- `:bootstrap` - `:ratio` defines ratio of observations put into result
(defaults to `1.0`), produces `1` split
- `:holdout` - split into two or more parts with given ratio(s)
(defaults to `2/3`), produces `1` split
- `:holdouts` - splits into two parts for ascending ratio. Range of
rations is given by `steps` option
- `:loo` - leave one out, produces the same number of splits as number
of observations
`:holdout` can accept also probabilites or ratios and can split to more
than 2 subdatasets
Additionally you can provide:
- `:seed` - for random number generator
- `:shuffle?` - turn on/off shuffle of the rows (default: `true`)
- `:repeats` - repeat procedure `:repeats` times
- `:partition-selector` - same as in `group-by` for stratified splitting
to reflect dataset structure in splits.
- `:split-names` names of subdatasets different than default, ie.
`[:train :test :split-2 ...]`
- `:split-col-name` - a column where name of split is stored, either
`:train` or `:test` values (default: `:$split-name`)
- `:split-id-col-name` - a column where id of the train/test pair is
stored (default: `:$split-id`)
In case of grouped dataset each group is processed separately.
See
[more](https://www.mitpressjournals.org/doi/pdf/10.1162/EVCO_a_00069)
``` clojure
(def for-splitting (tc/dataset (map-indexed (fn [id v] {:id id
:partition v
:group (rand-nth [:g1 :g2 :g3])})
(concat (repeat 20 :a) (repeat 5 :b)))))
```
``` clojure
for-splitting
```
\_unnamed \[25 3\]:
| :id | :partition | :group |
|----:|------------|--------|
| 0 | :a | :g3 |
| 1 | :a | :g3 |
| 2 | :a | :g3 |
| 3 | :a | :g2 |
| 4 | :a | :g3 |
| 5 | :a | :g3 |
| 6 | :a | :g2 |
| 7 | :a | :g1 |
| 8 | :a | :g3 |
| 9 | :a | :g1 |
| … | … | … |
| 14 | :a | :g3 |
| 15 | :a | :g1 |
| 16 | :a | :g3 |
| 17 | :a | :g1 |
| 18 | :a | :g2 |
| 19 | :a | :g1 |
| 20 | :b | :g2 |
| 21 | :b | :g1 |
| 22 | :b | :g2 |
| 23 | :b | :g2 |
| 24 | :b | :g2 |
#### k-Fold
Returns `k=5` maps
``` clojure
(-> for-splitting
(tc/split)
(tc/head 30))
```
\_unnamed, (splitted) \[30 5\]:
| :id | :partition | :group | :*s**p**l**i**t* − *n**a**m**e*\|:split-id | |
|----:|------------|--------|--------------------------------------------|----:|
| 23 | :b | :g2 | :train | 0 |
| 0 | :a | :g3 | :train | 0 |
| 3 | :a | :g2 | :train | 0 |
| 13 | :a | :g3 | :train | 0 |
| 24 | :b | :g2 | :train | 0 |
| 18 | :a | :g2 | :train | 0 |
| 10 | :a | :g2 | :train | 0 |
| 22 | :b | :g2 | :train | 0 |
| 1 | :a | :g3 | :train | 0 |
| 20 | :b | :g2 | :train | 0 |
| 5 | :a | :g3 | :train | 0 |
| 17 | :a | :g1 | :train | 0 |
| 14 | :a | :g3 | :train | 0 |
| 4 | :a | :g3 | :train | 0 |
| 2 | :a | :g3 | :train | 0 |
| 8 | :a | :g3 | :train | 0 |
| 19 | :a | :g1 | :train | 0 |
| 6 | :a | :g2 | :train | 0 |
| 12 | :a | :g3 | :train | 0 |
| 16 | :a | :g3 | :train | 0 |
| 11 | :a | :g2 | :test | 0 |
| 7 | :a | :g1 | :test | 0 |
| 15 | :a | :g1 | :test | 0 |
| 21 | :b | :g1 | :test | 0 |
| 9 | :a | :g1 | :test | 0 |
| 11 | :a | :g2 | :train | 1 |
| 7 | :a | :g1 | :train | 1 |
| 15 | :a | :g1 | :train | 1 |
| 21 | :b | :g1 | :train | 1 |
| 9 | :a | :g1 | :train | 1 |
Partition according to `:k` column to reflect it’s distribution
``` clojure
(-> for-splitting
(tc/split :kfold {:partition-selector :partition})
(tc/head 30))
```
\_unnamed, (splitted) \[30 5\]:
| :id | :partition | :group | :*s**p**l**i**t* − *n**a**m**e*\|:split-id | |
|----:|------------|--------|--------------------------------------------|----:|
| 11 | :a | :g2 | :train | 0 |
| 12 | :a | :g3 | :train | 0 |
| 8 | :a | :g3 | :train | 0 |
| 15 | :a | :g1 | :train | 0 |
| 5 | :a | :g3 | :train | 0 |
| 18 | :a | :g2 | :train | 0 |
| 2 | :a | :g3 | :train | 0 |
| 14 | :a | :g3 | :train | 0 |
| 9 | :a | :g1 | :train | 0 |
| 4 | :a | :g3 | :train | 0 |
| 19 | :a | :g1 | :train | 0 |
| 1 | :a | :g3 | :train | 0 |
| 6 | :a | :g2 | :train | 0 |
| 13 | :a | :g3 | :train | 0 |
| 16 | :a | :g3 | :train | 0 |
| 17 | :a | :g1 | :train | 0 |
| 10 | :a | :g2 | :test | 0 |
| 0 | :a | :g3 | :test | 0 |
| 3 | :a | :g2 | :test | 0 |
| 7 | :a | :g1 | :test | 0 |
| 10 | :a | :g2 | :train | 1 |
| 0 | :a | :g3 | :train | 1 |
| 3 | :a | :g2 | :train | 1 |
| 7 | :a | :g1 | :train | 1 |
| 5 | :a | :g3 | :train | 1 |
| 18 | :a | :g2 | :train | 1 |
| 2 | :a | :g3 | :train | 1 |
| 14 | :a | :g3 | :train | 1 |
| 9 | :a | :g1 | :train | 1 |
| 4 | :a | :g3 | :train | 1 |
#### Bootstrap
``` clojure
(tc/split for-splitting :bootstrap)
```
\_unnamed, (splitted) \[33 5\]:
| :id | :partition | :group | :*s**p**l**i**t* − *n**a**m**e*\|:split-id | |
|----:|------------|--------|--------------------------------------------|----:|
| 16 | :a | :g3 | :train | 0 |
| 16 | :a | :g3 | :train | 0 |
| 21 | :b | :g1 | :train | 0 |
| 20 | :b | :g2 | :train | 0 |
| 15 | :a | :g1 | :train | 0 |
| 1 | :a | :g3 | :train | 0 |
| 5 | :a | :g3 | :train | 0 |
| 18 | :a | :g2 | :train | 0 |
| 4 | :a | :g3 | :train | 0 |
| 21 | :b | :g1 | :train | 0 |
| … | … | … | … | … |
| 20 | :b | :g2 | :train | 0 |
| 22 | :b | :g2 | :train | 0 |
| 21 | :b | :g1 | :train | 0 |
| 0 | :a | :g3 | :test | 0 |
| 2 | :a | :g3 | :test | 0 |
| 3 | :a | :g2 | :test | 0 |
| 7 | :a | :g1 | :test | 0 |
| 8 | :a | :g3 | :test | 0 |
| 9 | :a | :g1 | :test | 0 |
| 10 | :a | :g2 | :test | 0 |
| 12 | :a | :g3 | :test | 0 |
with repeats, to get 100 splits
``` clojure
(-> for-splitting
(tc/split :bootstrap {:repeats 100})
(:$split-id)
(distinct)
(count))
```
100
#### Holdout
with small ratio
``` clojure
(tc/split for-splitting :holdout {:ratio 0.2})
```
\_unnamed, (splitted) \[25 5\]:
| :id | :partition | :group | :*s**p**l**i**t* − *n**a**m**e*\|:split-id | |
|----:|------------|--------|--------------------------------------------|----:|
| 19 | :a | :g1 | :train | 0 |
| 4 | :a | :g3 | :train | 0 |
| 16 | :a | :g3 | :train | 0 |
| 15 | :a | :g1 | :train | 0 |
| 22 | :b | :g2 | :train | 0 |
| 7 | :a | :g1 | :test | 0 |
| 11 | :a | :g2 | :test | 0 |
| 2 | :a | :g3 | :test | 0 |
| 20 | :b | :g2 | :test | 0 |
| 6 | :a | :g2 | :test | 0 |
| … | … | … | … | … |
| 14 | :a | :g3 | :test | 0 |
| 21 | :b | :g1 | :test | 0 |
| 3 | :a | :g2 | :test | 0 |
| 12 | :a | :g3 | :test | 0 |
| 9 | :a | :g1 | :test | 0 |
| 13 | :a | :g3 | :test | 0 |
| 0 | :a | :g3 | :test | 0 |
| 5 | :a | :g3 | :test | 0 |
| 18 | :a | :g2 | :test | 0 |
| 23 | :b | :g2 | :test | 0 |
| 1 | :a | :g3 | :test | 0 |
you can split to more than two subdatasets with holdout
``` clojure
(tc/split for-splitting :holdout {:ratio [0.1 0.2 0.3 0.15 0.25]})
```
\_unnamed, (splitted) \[25 5\]:
| :id | :partition | :group | :*s**p**l**i**t* − *n**a**m**e*\|:split-id | |
|----:|------------|--------|--------------------------------------------|----:|
| 22 | :b | :g2 | :train | 0 |
| 17 | :a | :g1 | :train | 0 |
| 23 | :b | :g2 | :test | 0 |
| 5 | :a | :g3 | :test | 0 |
| 15 | :a | :g1 | :test | 0 |
| 18 | :a | :g2 | :test | 0 |
| 20 | :b | :g2 | :test | 0 |
| 0 | :a | :g3 | :split-2 | 0 |
| 4 | :a | :g3 | :split-2 | 0 |
| 2 | :a | :g3 | :split-2 | 0 |
| … | … | … | … | … |
| 13 | :a | :g3 | :split-3 | 0 |
| 16 | :a | :g3 | :split-3 | 0 |
| 21 | :b | :g1 | :split-3 | 0 |
| 7 | :a | :g1 | :split-4 | 0 |
| 6 | :a | :g2 | :split-4 | 0 |
| 9 | :a | :g1 | :split-4 | 0 |
| 10 | :a | :g2 | :split-4 | 0 |
| 19 | :a | :g1 | :split-4 | 0 |
| 14 | :a | :g3 | :split-4 | 0 |
| 3 | :a | :g2 | :split-4 | 0 |
| 1 | :a | :g3 | :split-4 | 0 |
you can use also proportions with custom names
``` clojure
(tc/split for-splitting :holdout {:ratio [5 3 11 2]
:split-names ["small" "smaller" "big" "the rest"]})
```
\_unnamed, (splitted) \[25 5\]:
| :id | :partition | :group | :*s**p**l**i**t* − *n**a**m**e*\|:split-id | |
|----:|------------|--------|--------------------------------------------|----:|
| 10 | :a | :g2 | small | 0 |
| 2 | :a | :g3 | small | 0 |
| 14 | :a | :g3 | small | 0 |
| 13 | :a | :g3 | small | 0 |
| 15 | :a | :g1 | small | 0 |
| 17 | :a | :g1 | smaller | 0 |
| 9 | :a | :g1 | smaller | 0 |
| 21 | :b | :g1 | smaller | 0 |
| 5 | :a | :g3 | big | 0 |
| 24 | :b | :g2 | big | 0 |
| … | … | … | … | … |
| 20 | :b | :g2 | big | 0 |
| 0 | :a | :g3 | big | 0 |
| 19 | :a | :g1 | big | 0 |
| 1 | :a | :g3 | big | 0 |
| 6 | :a | :g2 | big | 0 |
| 23 | :b | :g2 | big | 0 |
| 8 | :a | :g3 | big | 0 |
| 3 | :a | :g2 | the rest | 0 |
| 16 | :a | :g3 | the rest | 0 |
| 7 | :a | :g1 | the rest | 0 |
| 11 | :a | :g2 | the rest | 0 |
#### Holdouts
With ratios from 5% to 95% of the dataset with step 1.5 generates 15
splits with ascending rows in train dataset.
``` clojure
(-> (tc/split for-splitting :holdouts {:steps [0.05 0.95 1.5]
:shuffle? false})
(tc/group-by [:$split-id :$split-name]))
```
\_unnamed \[30 3\]:
| :name | :group-id | :data |
|--------------------------------------|----------:|-------------------------------------------------------|
| {:$split-id 0, :$split-name :train} | 0 | Group: {:$split-id 0, :$split-name :train} \[1 5\]: |
| {:$split-id 0, :$split-name :test} | 1 | Group: {:$split-id 0, :$split-name :test} \[24 5\]: |
| {:$split-id 1, :$split-name :train} | 2 | Group: {:$split-id 1, :$split-name :train} \[2 5\]: |
| {:$split-id 1, :$split-name :test} | 3 | Group: {:$split-id 1, :$split-name :test} \[23 5\]: |
| {:$split-id 2, :$split-name :train} | 4 | Group: {:$split-id 2, :$split-name :train} \[4 5\]: |
| {:$split-id 2, :$split-name :test} | 5 | Group: {:$split-id 2, :$split-name :test} \[21 5\]: |
| {:$split-id 3, :$split-name :train} | 6 | Group: {:$split-id 3, :$split-name :train} \[5 5\]: |
| {:$split-id 3, :$split-name :test} | 7 | Group: {:$split-id 3, :$split-name :test} \[20 5\]: |
| {:$split-id 4, :$split-name :train} | 8 | Group: {:$split-id 4, :$split-name :train} \[7 5\]: |
| {:$split-id 4, :$split-name :test} | 9 | Group: {:$split-id 4, :$split-name :test} \[18 5\]: |
| … | … | … |
| {:$split-id 9, :$split-name :test} | 19 | Group: {:$split-id 9, :$split-name :test} \[11 5\]: |
| {:$split-id 10, :$split-name :train} | 20 | Group: {:$split-id 10, :$split-name :train} \[16 5\]: |
| {:$split-id 10, :$split-name :test} | 21 | Group: {:$split-id 10, :$split-name :test} \[9 5\]: |
| {:$split-id 11, :$split-name :train} | 22 | Group: {:$split-id 11, :$split-name :train} \[17 5\]: |
| {:$split-id 11, :$split-name :test} | 23 | Group: {:$split-id 11, :$split-name :test} \[8 5\]: |
| {:$split-id 12, :$split-name :train} | 24 | Group: {:$split-id 12, :$split-name :train} \[19 5\]: |
| {:$split-id 12, :$split-name :test} | 25 | Group: {:$split-id 12, :$split-name :test} \[6 5\]: |
| {:$split-id 13, :$split-name :train} | 26 | Group: {:$split-id 13, :$split-name :train} \[20 5\]: |
| {:$split-id 13, :$split-name :test} | 27 | Group: {:$split-id 13, :$split-name :test} \[5 5\]: |
| {:$split-id 14, :$split-name :train} | 28 | Group: {:$split-id 14, :$split-name :train} \[22 5\]: |
| {:$split-id 14, :$split-name :test} | 29 | Group: {:$split-id 14, :$split-name :test} \[3 5\]: |
#### Leave One Out
``` clojure
(-> for-splitting
(tc/split :loo)
(tc/head 30))
```
\_unnamed, (splitted) \[30 5\]:
| :id | :partition | :group | :*s**p**l**i**t* − *n**a**m**e*\|:split-id | |
|----:|------------|--------|--------------------------------------------|----:|
| 23 | :b | :g2 | :train | 0 |
| 3 | :a | :g2 | :train | 0 |
| 20 | :b | :g2 | :train | 0 |
| 2 | :a | :g3 | :train | 0 |
| 7 | :a | :g1 | :train | 0 |
| 24 | :b | :g2 | :train | 0 |
| 21 | :b | :g1 | :train | 0 |
| 14 | :a | :g3 | :train | 0 |
| 13 | :a | :g3 | :train | 0 |
| 0 | :a | :g3 | :train | 0 |
| 8 | :a | :g3 | :train | 0 |
| 19 | :a | :g1 | :train | 0 |
| 17 | :a | :g1 | :train | 0 |
| 11 | :a | :g2 | :train | 0 |
| 5 | :a | :g3 | :train | 0 |
| 12 | :a | :g3 | :train | 0 |
| 16 | :a | :g3 | :train | 0 |
| 1 | :a | :g3 | :train | 0 |
| 22 | :b | :g2 | :train | 0 |
| 18 | :a | :g2 | :train | 0 |
| 15 | :a | :g1 | :train | 0 |
| 10 | :a | :g2 | :train | 0 |
| 6 | :a | :g2 | :train | 0 |
| 9 | :a | :g1 | :train | 0 |
| 4 | :a | :g3 | :test | 0 |
| 4 | :a | :g3 | :train | 1 |
| 3 | :a | :g2 | :train | 1 |
| 20 | :b | :g2 | :train | 1 |
| 2 | :a | :g3 | :train | 1 |
| 7 | :a | :g1 | :train | 1 |
``` clojure
(-> for-splitting
(tc/split :loo)
(tc/row-count))
```
625
#### Grouped dataset with partitioning
``` clojure
(-> for-splitting
(tc/group-by :group)
(tc/split :bootstrap {:partition-selector :partition :seed 11 :ratio 0.8}))
```
\_unnamed \[3 3\]:
| :name | :group-id | :data |
|-------|----------:|----------------------------------|
| :g3 | 0 | Group: :g3, (splitted) \[13 5\]: |
| :g2 | 1 | Group: :g2, (splitted) \[12 5\]: |
| :g1 | 2 | Group: :g1, (splitted) \[8 5\]: |
#### Split as a sequence
To get a sequence of pairs, use `split->seq` function
``` clojure
(-> for-splitting
(tc/split->seq :kfold {:partition-selector :partition})
(first))
```
{:train Group: 0 \[20 3\]:
| :id | :partition | :group |
|----:|------------|--------|
| 4 | :a | :g3 |
| 5 | :a | :g3 |
| 9 | :a | :g1 |
| 2 | :a | :g3 |
| 12 | :a | :g3 |
| 14 | :a | :g3 |
| 10 | :a | :g2 |
| 1 | :a | :g3 |
| 7 | :a | :g1 |
| 13 | :a | :g3 |
| 15 | :a | :g1 |
| 16 | :a | :g3 |
| 3 | :a | :g2 |
| 19 | :a | :g1 |
| 11 | :a | :g2 |
| 18 | :a | :g2 |
| 23 | :b | :g2 |
| 21 | :b | :g1 |
| 24 | :b | :g2 |
| 22 | :b | :g2 |
, :test Group: 0 \[5 3\]:
| :id | :partition | :group |
|----:|------------|--------|
| 8 | :a | :g3 |
| 0 | :a | :g3 |
| 6 | :a | :g2 |
| 17 | :a | :g1 |
| 20 | :b | :g2 |
}
``` clojure
(-> for-splitting
(tc/group-by :group)
(tc/split->seq :bootstrap {:partition-selector :partition :seed 11 :ratio 0.8 :repeats 2})
(first))
```
\[:g3 ({:train Group: 0 \[8 3\]:\]
## Pipeline
`tablecloth.pipeline` exports special versions of API which create
functions operating only on dataset. This creates the possibility to
chain operations and compose them easily.
There are two ways to create pipelines:
- functional, as a composition of functions
- declarative, separating task declarations and concrete
parametrization.
Pipeline operations are prepared to work with
[metamorph](https://github.com/scicloj/metamorph) library. That means
that result of the pipeline is wrapped into a map and dataset is stored
under `:metamorph/data` key.
> **Warning: Duplicated `metamorph` pipeline functions are removed from
> `tablecloth.pipeline` namespace.**
## Functions
This API doesn’t provide any statistical, numerical or date/time
functions. Use below namespaces:
| Namespace | functions |
|-------------------------------|-------------------------------------------|
| `tech.v3.datatype.functional` | primitive oprations, reducers, statistics |
| `tech.v3.datatype.datetime` | date/time converters and operations |
## Other examples
### Stocks
``` clojure
(defonce stocks (tc/dataset "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv" {:key-fn keyword}))
```
``` clojure
stocks
```
\[560 3\]:
| :symbol | :date | :price |
|---------|------------|-------:|
| MSFT | 2000-01-01 | 39.81 |
| MSFT | 2000-02-01 | 36.35 |
| MSFT | 2000-03-01 | 43.22 |
| MSFT | 2000-04-01 | 28.37 |
| MSFT | 2000-05-01 | 25.45 |
| MSFT | 2000-06-01 | 32.54 |
| MSFT | 2000-07-01 | 28.40 |
| MSFT | 2000-08-01 | 28.40 |
| MSFT | 2000-09-01 | 24.53 |
| MSFT | 2000-10-01 | 28.02 |
| … | … | … |
| AAPL | 2009-05-01 | 135.81 |
| AAPL | 2009-06-01 | 142.43 |
| AAPL | 2009-07-01 | 163.39 |
| AAPL | 2009-08-01 | 168.21 |
| AAPL | 2009-09-01 | 185.35 |
| AAPL | 2009-10-01 | 188.50 |
| AAPL | 2009-11-01 | 199.91 |
| AAPL | 2009-12-01 | 210.73 |
| AAPL | 2010-01-01 | 192.06 |
| AAPL | 2010-02-01 | 204.62 |
| AAPL | 2010-03-01 | 223.02 |
``` clojure
(-> stocks
(tc/group-by (fn [row]
{:symbol (:symbol row)
:year (tech.v3.datatype.datetime/long-temporal-field :years (:date row))}))
(tc/aggregate #(tech.v3.datatype.functional/mean (% :price)))
(tc/order-by [:symbol :year]))
```
\_unnamed \[51 3\]:
| :symbol | :year | summary |
|---------|------:|-------------:|
| AAPL | 2000 | 21.74833333 |
| AAPL | 2001 | 10.17583333 |
| AAPL | 2002 | 9.40833333 |
| AAPL | 2003 | 9.34750000 |
| AAPL | 2004 | 18.72333333 |
| AAPL | 2005 | 48.17166667 |
| AAPL | 2006 | 72.04333333 |
| AAPL | 2007 | 133.35333333 |
| AAPL | 2008 | 138.48083333 |
| AAPL | 2009 | 150.39333333 |
| … | … | … |
| MSFT | 2000 | 29.67333333 |
| MSFT | 2001 | 25.34750000 |
| MSFT | 2002 | 21.82666667 |
| MSFT | 2003 | 20.93416667 |
| MSFT | 2004 | 22.67416667 |
| MSFT | 2005 | 23.84583333 |
| MSFT | 2006 | 24.75833333 |
| MSFT | 2007 | 29.28416667 |
| MSFT | 2008 | 25.20833333 |
| MSFT | 2009 | 22.87250000 |
| MSFT | 2010 | 28.50666667 |
``` clojure
(-> stocks
(tc/group-by (juxt :symbol #(tech.v3.datatype.datetime/long-temporal-field :years (% :date))))
(tc/aggregate #(tech.v3.datatype.functional/mean (% :price)))
(tc/rename-columns {:$group-name-0 :symbol
:$group-name-1 :year}))
```
\_unnamed \[51 3\]:
| :symbol | :year | summary |
|---------|------:|-------------:|
| MSFT | 2000 | 29.67333333 |
| MSFT | 2001 | 25.34750000 |
| MSFT | 2002 | 21.82666667 |
| MSFT | 2003 | 20.93416667 |
| MSFT | 2004 | 22.67416667 |
| MSFT | 2005 | 23.84583333 |
| MSFT | 2006 | 24.75833333 |
| MSFT | 2007 | 29.28416667 |
| MSFT | 2008 | 25.20833333 |
| MSFT | 2009 | 22.87250000 |
| … | … | … |
| AAPL | 2000 | 21.74833333 |
| AAPL | 2001 | 10.17583333 |
| AAPL | 2002 | 9.40833333 |
| AAPL | 2003 | 9.34750000 |
| AAPL | 2004 | 18.72333333 |
| AAPL | 2005 | 48.17166667 |
| AAPL | 2006 | 72.04333333 |
| AAPL | 2007 | 133.35333333 |
| AAPL | 2008 | 138.48083333 |
| AAPL | 2009 | 150.39333333 |
| AAPL | 2010 | 206.56666667 |
### data.table
Below you can find comparizon between functionality of `data.table` and
Clojure dataset API. I leave it without comments, please refer original
document explaining details:
[Introduction to
`data.table`](https://rdatatable.gitlab.io/data.table/articles/datatable-intro.html)
R
``` r
library(data.table)
library(knitr)
flights <- fread("https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv")
kable(head(flights))
```
| year | month | day | dep_delay | arr_delay | carrier | origin | dest | air_time | distance | hour |
|-----:|------:|----:|----------:|----------:|:--------|:-------|:-----|---------:|---------:|-----:|
| 2014 | 1 | 1 | 14 | 13 | AA | JFK | LAX | 359 | 2475 | 9 |
| 2014 | 1 | 1 | -3 | 13 | AA | JFK | LAX | 363 | 2475 | 11 |
| 2014 | 1 | 1 | 2 | 9 | AA | JFK | LAX | 351 | 2475 | 19 |
| 2014 | 1 | 1 | -8 | -26 | AA | LGA | PBI | 157 | 1035 | 7 |
| 2014 | 1 | 1 | 2 | 1 | AA | JFK | LAX | 350 | 2475 | 13 |
| 2014 | 1 | 1 | 4 | 0 | AA | EWR | LAX | 339 | 2454 | 18 |
------------------------------------------------------------------------
Clojure
``` clojure
(require '[tech.v3.datatype.functional :as dfn]
'[tech.v3.datatype.argops :as aops]
'[tech.v3.datatype :as dtype])
(defonce flights (tc/dataset "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"))
```
``` clojure
(tc/head flights 6)
```
\[6 11\]:
| year | month | day | dep_delay | arr_delay | carrier | origin | dest | air_time | distance | hour |
|-----:|------:|----:|----------:|----------:|---------|--------|------|---------:|---------:|-----:|
| 2014 | 1 | 1 | 14 | 13 | AA | JFK | LAX | 359 | 2475 | 9 |
| 2014 | 1 | 1 | -3 | 13 | AA | JFK | LAX | 363 | 2475 | 11 |
| 2014 | 1 | 1 | 2 | 9 | AA | JFK | LAX | 351 | 2475 | 19 |
| 2014 | 1 | 1 | -8 | -26 | AA | LGA | PBI | 157 | 1035 | 7 |
| 2014 | 1 | 1 | 2 | 1 | AA | JFK | LAX | 350 | 2475 | 13 |
| 2014 | 1 | 1 | 4 | 0 | AA | EWR | LAX | 339 | 2454 | 18 |
#### Basics
##### Shape of loaded data
R
``` r
dim(flights)
```
[1] 253316 11
------------------------------------------------------------------------
Clojure
``` clojure
(tc/shape flights)
```
[253316 11]
##### What is `data.table`?
R
``` r
DT = data.table(
ID = c("b","b","b","a","a","c"),
a = 1:6,
b = 7:12,
c = 13:18
)
kable(DT)
```
| ID | a | b | c |
|:----|----:|----:|----:|
| b | 1 | 7 | 13 |
| b | 2 | 8 | 14 |
| b | 3 | 9 | 15 |
| a | 4 | 10 | 16 |
| a | 5 | 11 | 17 |
| c | 6 | 12 | 18 |
``` r
class(DT$ID)
```
[1] "character"
------------------------------------------------------------------------
Clojure
``` clojure
(def DT (tc/dataset {:ID ["b" "b" "b" "a" "a" "c"]
:a (range 1 7)
:b (range 7 13)
:c (range 13 19)}))
```
``` clojure
DT
```
\_unnamed \[6 4\]:
| :ID | :a | :b | :c |
|-----|----:|----:|----:|
| b | 1 | 7 | 13 |
| b | 2 | 8 | 14 |
| b | 3 | 9 | 15 |
| a | 4 | 10 | 16 |
| a | 5 | 11 | 17 |
| c | 6 | 12 | 18 |
``` clojure
(-> :ID DT meta :datatype)
```
:string
##### Get all the flights with “JFK” as the origin airport in the month of June.
R
``` r
ans <- flights[origin == "JFK" & month == 6L]
kable(head(ans))
```
| year | month | day | dep_delay | arr_delay | carrier | origin | dest | air_time | distance | hour |
|-----:|------:|----:|----------:|----------:|:--------|:-------|:-----|---------:|---------:|-----:|
| 2014 | 6 | 1 | -9 | -5 | AA | JFK | LAX | 324 | 2475 | 8 |
| 2014 | 6 | 1 | -10 | -13 | AA | JFK | LAX | 329 | 2475 | 12 |
| 2014 | 6 | 1 | 18 | -1 | AA | JFK | LAX | 326 | 2475 | 7 |
| 2014 | 6 | 1 | -6 | -16 | AA | JFK | LAX | 320 | 2475 | 10 |
| 2014 | 6 | 1 | -4 | -45 | AA | JFK | LAX | 326 | 2475 | 18 |
| 2014 | 6 | 1 | -6 | -23 | AA | JFK | LAX | 329 | 2475 | 14 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> flights
(tc/select-rows (fn [row] (and (= (get row "origin") "JFK")
(= (get row "month") 6))))
(tc/head 6))
```
\[6 11\]:
| year | month | day | dep_delay | arr_delay | carrier | origin | dest | air_time | distance | hour |
|-----:|------:|----:|----------:|----------:|---------|--------|------|---------:|---------:|-----:|
| 2014 | 6 | 1 | -9 | -5 | AA | JFK | LAX | 324 | 2475 | 8 |
| 2014 | 6 | 1 | -10 | -13 | AA | JFK | LAX | 329 | 2475 | 12 |
| 2014 | 6 | 1 | 18 | -1 | AA | JFK | LAX | 326 | 2475 | 7 |
| 2014 | 6 | 1 | -6 | -16 | AA | JFK | LAX | 320 | 2475 | 10 |
| 2014 | 6 | 1 | -4 | -45 | AA | JFK | LAX | 326 | 2475 | 18 |
| 2014 | 6 | 1 | -6 | -23 | AA | JFK | LAX | 329 | 2475 | 14 |
##### Get the first two rows from `flights`.
R
``` r
ans <- flights[1:2]
kable(ans)
```
| year | month | day | dep_delay | arr_delay | carrier | origin | dest | air_time | distance | hour |
|-----:|------:|----:|----------:|----------:|:--------|:-------|:-----|---------:|---------:|-----:|
| 2014 | 1 | 1 | 14 | 13 | AA | JFK | LAX | 359 | 2475 | 9 |
| 2014 | 1 | 1 | -3 | 13 | AA | JFK | LAX | 363 | 2475 | 11 |
------------------------------------------------------------------------
Clojure
``` clojure
(tc/select-rows flights (range 2))
```
\[2 11\]:
| year | month | day | dep_delay | arr_delay | carrier | origin | dest | air_time | distance | hour |
|-----:|------:|----:|----------:|----------:|---------|--------|------|---------:|---------:|-----:|
| 2014 | 1 | 1 | 14 | 13 | AA | JFK | LAX | 359 | 2475 | 9 |
| 2014 | 1 | 1 | -3 | 13 | AA | JFK | LAX | 363 | 2475 | 11 |
##### Sort `flights` first by column `origin` in ascending order, and then by `dest` in descending order
R
``` r
ans <- flights[order(origin, -dest)]
kable(head(ans))
```
| year | month | day | dep_delay | arr_delay | carrier | origin | dest | air_time | distance | hour |
|-----:|------:|----:|----------:|----------:|:--------|:-------|:-----|---------:|---------:|-----:|
| 2014 | 1 | 5 | 6 | 49 | EV | EWR | XNA | 195 | 1131 | 8 |
| 2014 | 1 | 6 | 7 | 13 | EV | EWR | XNA | 190 | 1131 | 8 |
| 2014 | 1 | 7 | -6 | -13 | EV | EWR | XNA | 179 | 1131 | 8 |
| 2014 | 1 | 8 | -7 | -12 | EV | EWR | XNA | 184 | 1131 | 8 |
| 2014 | 1 | 9 | 16 | 7 | EV | EWR | XNA | 181 | 1131 | 8 |
| 2014 | 1 | 13 | 66 | 66 | EV | EWR | XNA | 188 | 1131 | 9 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> flights
(tc/order-by ["origin" "dest"] [:asc :desc])
(tc/head 6))
```
\[6 11\]:
| year | month | day | dep_delay | arr_delay | carrier | origin | dest | air_time | distance | hour |
|-----:|------:|----:|----------:|----------:|---------|--------|------|---------:|---------:|-----:|
| 2014 | 6 | 3 | -6 | -38 | EV | EWR | XNA | 154 | 1131 | 6 |
| 2014 | 1 | 20 | -9 | -17 | EV | EWR | XNA | 177 | 1131 | 8 |
| 2014 | 3 | 19 | -6 | 10 | EV | EWR | XNA | 201 | 1131 | 6 |
| 2014 | 2 | 3 | 231 | 268 | EV | EWR | XNA | 184 | 1131 | 12 |
| 2014 | 4 | 25 | -8 | -32 | EV | EWR | XNA | 159 | 1131 | 6 |
| 2014 | 2 | 19 | 21 | 10 | EV | EWR | XNA | 176 | 1131 | 8 |
##### Select `arr_delay` column, but return it as a vector
R
``` r
ans <- flights[, arr_delay]
head(ans)
```
[1] 13 13 9 -26 1 0
------------------------------------------------------------------------
Clojure
``` clojure
(take 6 (flights "arr_delay"))
```
(13 13 9 -26 1 0)
##### Select `arr_delay` column, but return as a data.table instead
R
``` r
ans <- flights[, list(arr_delay)]
kable(head(ans))
```
| arr_delay |
|----------:|
| 13 |
| 13 |
| 9 |
| -26 |
| 1 |
| 0 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> flights
(tc/select-columns "arr_delay")
(tc/head 6))
```
\[6 1\]:
| arr_delay |
|----------:|
| 13 |
| 13 |
| 9 |
| -26 |
| 1 |
| 0 |
##### Select both `arr_delay` and `dep_delay` columns
R
``` r
ans <- flights[, .(arr_delay, dep_delay)]
kable(head(ans))
```
| arr_delay | dep_delay |
|----------:|----------:|
| 13 | 14 |
| 13 | -3 |
| 9 | 2 |
| -26 | -8 |
| 1 | 2 |
| 0 | 4 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> flights
(tc/select-columns ["arr_delay" "dep_delay"])
(tc/head 6))
```
\[6 2\]:
| arr_delay | dep_delay |
|----------:|----------:|
| 13 | 14 |
| 13 | -3 |
| 9 | 2 |
| -26 | -8 |
| 1 | 2 |
| 0 | 4 |
##### Select both `arr_delay` and `dep_delay` columns and rename them to `delay_arr` and `delay_dep`
R
``` r
ans <- flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]
kable(head(ans))
```
| delay_arr | delay_dep |
|----------:|----------:|
| 13 | 14 |
| 13 | -3 |
| 9 | 2 |
| -26 | -8 |
| 1 | 2 |
| 0 | 4 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> flights
(tc/select-columns {"arr_delay" "delay_arr"
"dep_delay" "delay_arr"})
(tc/head 6))
```
\[6 2\]:
| delay_arr | delay_arr |
|----------:|----------:|
| 13 | 14 |
| 13 | -3 |
| 9 | 2 |
| -26 | -8 |
| 1 | 2 |
| 0 | 4 |
##### How many trips have had total delay \< 0?
R
``` r
ans <- flights[, sum( (arr_delay + dep_delay) < 0 )]
ans
```
[1] 141814
------------------------------------------------------------------------
Clojure
``` clojure
(->> (dfn/+ (flights "arr_delay") (flights "dep_delay"))
(aops/argfilter #(< % 0.0))
(dtype/ecount))
```
141814
or pure Clojure functions (much, much slower)
``` clojure
(->> (map + (flights "arr_delay") (flights "dep_delay"))
(filter neg?)
(count))
```
141814
##### Calculate the average arrival and departure delay for all flights with “JFK” as the origin airport in the month of June
R
``` r
ans <- flights[origin == "JFK" & month == 6L,
.(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
kable(ans)
```
| m_arr | m_dep |
|---------:|---------:|
| 5.839349 | 9.807884 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> flights
(tc/select-rows (fn [row] (and (= (get row "origin") "JFK")
(= (get row "month") 6))))
(tc/aggregate {:m_arr #(dfn/mean (% "arr_delay"))
:m_dep #(dfn/mean (% "dep_delay"))}))
```
\_unnamed \[1 2\]:
| :m_arr | :m_dep |
|-----------:|-----------:|
| 5.83934932 | 9.80788411 |
##### How many trips have been made in 2014 from “JFK” airport in the month of June?
R
``` r
ans <- flights[origin == "JFK" & month == 6L, length(dest)]
ans
```
[1] 8422
or
``` r
ans <- flights[origin == "JFK" & month == 6L, .N]
ans
```
[1] 8422
------------------------------------------------------------------------
Clojure
``` clojure
(-> flights
(tc/select-rows (fn [row] (and (= (get row "origin") "JFK")
(= (get row "month") 6))))
(tc/row-count))
```
8422
##### deselect columns using - or !
R
``` r
ans <- flights[, !c("arr_delay", "dep_delay")]
kable(head(ans))
```
| year | month | day | carrier | origin | dest | air_time | distance | hour |
|-----:|------:|----:|:--------|:-------|:-----|---------:|---------:|-----:|
| 2014 | 1 | 1 | AA | JFK | LAX | 359 | 2475 | 9 |
| 2014 | 1 | 1 | AA | JFK | LAX | 363 | 2475 | 11 |
| 2014 | 1 | 1 | AA | JFK | LAX | 351 | 2475 | 19 |
| 2014 | 1 | 1 | AA | LGA | PBI | 157 | 1035 | 7 |
| 2014 | 1 | 1 | AA | JFK | LAX | 350 | 2475 | 13 |
| 2014 | 1 | 1 | AA | EWR | LAX | 339 | 2454 | 18 |
or
``` r
ans <- flights[, -c("arr_delay", "dep_delay")]
kable(head(ans))
```
| year | month | day | carrier | origin | dest | air_time | distance | hour |
|-----:|------:|----:|:--------|:-------|:-----|---------:|---------:|-----:|
| 2014 | 1 | 1 | AA | JFK | LAX | 359 | 2475 | 9 |
| 2014 | 1 | 1 | AA | JFK | LAX | 363 | 2475 | 11 |
| 2014 | 1 | 1 | AA | JFK | LAX | 351 | 2475 | 19 |
| 2014 | 1 | 1 | AA | LGA | PBI | 157 | 1035 | 7 |
| 2014 | 1 | 1 | AA | JFK | LAX | 350 | 2475 | 13 |
| 2014 | 1 | 1 | AA | EWR | LAX | 339 | 2454 | 18 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> flights
(tc/select-columns (complement #{"arr_delay" "dep_delay"}))
(tc/head 6))
```
\[6 9\]:
| year | month | day | carrier | origin | dest | air_time | distance | hour |
|-----:|------:|----:|---------|--------|------|---------:|---------:|-----:|
| 2014 | 1 | 1 | AA | JFK | LAX | 359 | 2475 | 9 |
| 2014 | 1 | 1 | AA | JFK | LAX | 363 | 2475 | 11 |
| 2014 | 1 | 1 | AA | JFK | LAX | 351 | 2475 | 19 |
| 2014 | 1 | 1 | AA | LGA | PBI | 157 | 1035 | 7 |
| 2014 | 1 | 1 | AA | JFK | LAX | 350 | 2475 | 13 |
| 2014 | 1 | 1 | AA | EWR | LAX | 339 | 2454 | 18 |
#### Aggregations
##### How can we get the number of trips corresponding to each origin airport?
R
``` r
ans <- flights[, .(.N), by = .(origin)]
kable(ans)
```
| origin | N |
|:-------|------:|
| JFK | 81483 |
| LGA | 84433 |
| EWR | 87400 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> flights
(tc/group-by ["origin"])
(tc/aggregate {:N tc/row-count}))
```
\_unnamed \[3 2\]:
| origin | :N |
|--------|------:|
| JFK | 81483 |
| LGA | 84433 |
| EWR | 87400 |
##### How can we calculate the number of trips for each origin airport for carrier code “AA”?
R
``` r
ans <- flights[carrier == "AA", .N, by = origin]
kable(ans)
```
| origin | N |
|:-------|------:|
| JFK | 11923 |
| LGA | 11730 |
| EWR | 2649 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> flights
(tc/select-rows #(= (get % "carrier") "AA"))
(tc/group-by ["origin"])
(tc/aggregate {:N tc/row-count}))
```
\_unnamed \[3 2\]:
| origin | :N |
|--------|------:|
| JFK | 11923 |
| LGA | 11730 |
| EWR | 2649 |
##### How can we get the total number of trips for each `origin`, `dest` pair for carrier code “AA”?
R
``` r
ans <- flights[carrier == "AA", .N, by = .(origin, dest)]
kable(head(ans))
```
| origin | dest | N |
|:-------|:-----|-----:|
| JFK | LAX | 3387 |
| LGA | PBI | 245 |
| EWR | LAX | 62 |
| JFK | MIA | 1876 |
| JFK | SEA | 298 |
| EWR | MIA | 848 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> flights
(tc/select-rows #(= (get % "carrier") "AA"))
(tc/group-by ["origin" "dest"])
(tc/aggregate {:N tc/row-count})
(tc/head 6))
```
\_unnamed \[6 3\]:
| origin | dest | :N |
|--------|------|-----:|
| JFK | LAX | 3387 |
| LGA | PBI | 245 |
| EWR | LAX | 62 |
| JFK | MIA | 1876 |
| JFK | SEA | 298 |
| EWR | MIA | 848 |
##### How can we get the average arrival and departure delay for each `orig`,`dest` pair for each month for carrier code “AA”?
R
``` r
ans <- flights[carrier == "AA",
.(mean(arr_delay), mean(dep_delay)),
by = .(origin, dest, month)]
kable(head(ans,10))
```
| origin | dest | month | V1 | V2 |
|:-------|:-----|------:|----------:|-----------:|
| JFK | LAX | 1 | 6.590361 | 14.2289157 |
| LGA | PBI | 1 | -7.758621 | 0.3103448 |
| EWR | LAX | 1 | 1.366667 | 7.5000000 |
| JFK | MIA | 1 | 15.720670 | 18.7430168 |
| JFK | SEA | 1 | 14.357143 | 30.7500000 |
| EWR | MIA | 1 | 11.011236 | 12.1235955 |
| JFK | SFO | 1 | 19.252252 | 28.6396396 |
| JFK | BOS | 1 | 12.919643 | 15.2142857 |
| JFK | ORD | 1 | 31.586207 | 40.1724138 |
| JFK | IAH | 1 | 28.857143 | 14.2857143 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> flights
(tc/select-rows #(= (get % "carrier") "AA"))
(tc/group-by ["origin" "dest" "month"])
(tc/aggregate [#(dfn/mean (% "arr_delay"))
#(dfn/mean (% "dep_delay"))])
(tc/head 10))
```
\_unnamed \[10 5\]:
| origin | dest | month | :summary-0 | :summary-1 |
|--------|------|------:|------------:|------------:|
| EWR | DFW | 2 | 10.53676471 | 11.34558824 |
| LGA | ORD | 2 | 5.28526646 | 8.78996865 |
| JFK | BOS | 2 | 11.18000000 | 11.76000000 |
| JFK | LAX | 2 | 22.75200000 | 15.08000000 |
| EWR | LAX | 2 | 10.33333333 | 4.11111111 |
| JFK | SJU | 2 | 10.35000000 | 10.81250000 |
| LGA | MIA | 2 | 9.62500000 | 10.70000000 |
| JFK | STT | 2 | 15.11538462 | 20.96153846 |
| LGA | DFW | 2 | 5.37133550 | 7.47882736 |
| EWR | MIA | 2 | 1.56410256 | 4.75641026 |
##### So how can we directly order by all the grouping variables?
R
``` r
ans <- flights[carrier == "AA",
.(mean(arr_delay), mean(dep_delay)),
keyby = .(origin, dest, month)]
kable(head(ans,10))
```
| origin | dest | month | V1 | V2 |
|:-------|:-----|------:|----------:|----------:|
| EWR | DFW | 1 | 6.427673 | 10.012579 |
| EWR | DFW | 2 | 10.536765 | 11.345588 |
| EWR | DFW | 3 | 12.865031 | 8.079755 |
| EWR | DFW | 4 | 17.792683 | 12.920732 |
| EWR | DFW | 5 | 18.487805 | 18.682927 |
| EWR | DFW | 6 | 37.005952 | 38.744048 |
| EWR | DFW | 7 | 20.250000 | 21.154762 |
| EWR | DFW | 8 | 16.936046 | 22.069767 |
| EWR | DFW | 9 | 5.865031 | 13.055215 |
| EWR | DFW | 10 | 18.813665 | 18.894410 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> flights
(tc/select-rows #(= (get % "carrier") "AA"))
(tc/group-by ["origin" "dest" "month"])
(tc/aggregate [#(dfn/mean (% "arr_delay"))
#(dfn/mean (% "dep_delay"))])
(tc/order-by ["origin" "dest" "month"])
(tc/head 10))
```
\_unnamed \[10 5\]:
| origin | dest | month | :summary-0 | :summary-1 |
|--------|------|------:|------------:|------------:|
| EWR | DFW | 1 | 6.42767296 | 10.01257862 |
| EWR | DFW | 2 | 10.53676471 | 11.34558824 |
| EWR | DFW | 3 | 12.86503067 | 8.07975460 |
| EWR | DFW | 4 | 17.79268293 | 12.92073171 |
| EWR | DFW | 5 | 18.48780488 | 18.68292683 |
| EWR | DFW | 6 | 37.00595238 | 38.74404762 |
| EWR | DFW | 7 | 20.25000000 | 21.15476190 |
| EWR | DFW | 8 | 16.93604651 | 22.06976744 |
| EWR | DFW | 9 | 5.86503067 | 13.05521472 |
| EWR | DFW | 10 | 18.81366460 | 18.89440994 |
##### Can `by` accept expressions as well or does it just take columns?
R
``` r
ans <- flights[, .N, .(dep_delay>0, arr_delay>0)]
kable(ans)
```
| dep_delay | arr_delay | N |
|:----------|:----------|-------:|
| TRUE | TRUE | 72836 |
| FALSE | TRUE | 34583 |
| FALSE | FALSE | 119304 |
| TRUE | FALSE | 26593 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> flights
(tc/group-by (fn [row]
{:dep_delay (pos? (get row "dep_delay"))
:arr_delay (pos? (get row "arr_delay"))}))
(tc/aggregate {:N tc/row-count}))
```
\_unnamed \[4 3\]:
| :dep_delay | :arr_delay | :N |
|------------|------------|-------:|
| true | true | 72836 |
| false | true | 34583 |
| false | false | 119304 |
| true | false | 26593 |
##### Do we have to compute `mean()` for each column individually?
R
``` r
kable(DT)
```
| ID | a | b | c |
|:----|----:|----:|----:|
| b | 1 | 7 | 13 |
| b | 2 | 8 | 14 |
| b | 3 | 9 | 15 |
| a | 4 | 10 | 16 |
| a | 5 | 11 | 17 |
| c | 6 | 12 | 18 |
``` r
DT[, print(.SD), by = ID]
```
a b c
1: 1 7 13
2: 2 8 14
3: 3 9 15
a b c
1: 4 10 16
2: 5 11 17
a b c
1: 6 12 18
Empty data.table (0 rows and 1 cols): ID
``` r
kable(DT[, lapply(.SD, mean), by = ID])
```
| ID | a | b | c |
|:----|----:|-----:|-----:|
| b | 2.0 | 8.0 | 14.0 |
| a | 4.5 | 10.5 | 16.5 |
| c | 6.0 | 12.0 | 18.0 |
------------------------------------------------------------------------
Clojure
``` clojure
DT
(tc/group-by DT :ID {:result-type :as-map})
```
\_unnamed \[6 4\]:
| :ID | :a | :b | :c |
|-----|----:|----:|----:|
| b | 1 | 7 | 13 |
| b | 2 | 8 | 14 |
| b | 3 | 9 | 15 |
| a | 4 | 10 | 16 |
| a | 5 | 11 | 17 |
| c | 6 | 12 | 18 |
{“b” Group: b \[3 4\]:
| :ID | :a | :b | :c |
|-----|----:|----:|----:|
| b | 1 | 7 | 13 |
| b | 2 | 8 | 14 |
| b | 3 | 9 | 15 |
, “a” Group: a \[2 4\]:
| :ID | :a | :b | :c |
|-----|----:|----:|----:|
| a | 4 | 10 | 16 |
| a | 5 | 11 | 17 |
, “c” Group: c \[1 4\]:
| :ID | :a | :b | :c |
|-----|----:|----:|----:|
| c | 6 | 12 | 18 |
}
``` clojure
(-> DT
(tc/group-by [:ID])
(tc/aggregate-columns (complement #{:ID}) dfn/mean))
```
\_unnamed \[3 4\]:
| :ID | :a | :b | :c |
|-----|----:|-----:|-----:|
| b | 2.0 | 8.0 | 14.0 |
| a | 4.5 | 10.5 | 16.5 |
| c | 6.0 | 12.0 | 18.0 |
##### How can we specify just the columns we would like to compute the `mean()` on?
R
``` r
kable(head(flights[carrier == "AA", ## Only on trips with carrier "AA"
lapply(.SD, mean), ## compute the mean
by = .(origin, dest, month), ## for every 'origin,dest,month'
.SDcols = c("arr_delay", "dep_delay")])) ## for just those specified in .SDcols
```
| origin | dest | month | arr_delay | dep_delay |
|:-------|:-----|------:|----------:|-----------:|
| JFK | LAX | 1 | 6.590361 | 14.2289157 |
| LGA | PBI | 1 | -7.758621 | 0.3103448 |
| EWR | LAX | 1 | 1.366667 | 7.5000000 |
| JFK | MIA | 1 | 15.720670 | 18.7430168 |
| JFK | SEA | 1 | 14.357143 | 30.7500000 |
| EWR | MIA | 1 | 11.011236 | 12.1235955 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> flights
(tc/select-rows #(= (get % "carrier") "AA"))
(tc/group-by ["origin" "dest" "month"])
(tc/aggregate-columns ["arr_delay" "dep_delay"] dfn/mean)
(tc/head 6))
```
\_unnamed \[6 5\]:
| origin | dest | month | arr_delay | dep_delay |
|--------|------|------:|------------:|------------:|
| EWR | DFW | 2 | 10.53676471 | 11.34558824 |
| LGA | ORD | 2 | 5.28526646 | 8.78996865 |
| JFK | BOS | 2 | 11.18000000 | 11.76000000 |
| JFK | LAX | 2 | 22.75200000 | 15.08000000 |
| EWR | LAX | 2 | 10.33333333 | 4.11111111 |
| JFK | SJU | 2 | 10.35000000 | 10.81250000 |
##### How can we return the first two rows for each month?
R
``` r
ans <- flights[, head(.SD, 2), by = month]
kable(head(ans))
```
| month | year | day | dep_delay | arr_delay | carrier | origin | dest | air_time | distance | hour |
|------:|-----:|----:|----------:|----------:|:--------|:-------|:-----|---------:|---------:|-----:|
| 1 | 2014 | 1 | 14 | 13 | AA | JFK | LAX | 359 | 2475 | 9 |
| 1 | 2014 | 1 | -3 | 13 | AA | JFK | LAX | 363 | 2475 | 11 |
| 2 | 2014 | 1 | -1 | 1 | AA | JFK | LAX | 358 | 2475 | 8 |
| 2 | 2014 | 1 | -5 | 3 | AA | JFK | LAX | 358 | 2475 | 11 |
| 3 | 2014 | 1 | -11 | 36 | AA | JFK | LAX | 375 | 2475 | 8 |
| 3 | 2014 | 1 | -3 | 14 | AA | JFK | LAX | 368 | 2475 | 11 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> flights
(tc/group-by ["month"])
(tc/head 2) ;; head applied on each group
(tc/ungroup)
(tc/head 6))
```
\_unnamed \[6 11\]:
| year | month | day | dep_delay | arr_delay | carrier | origin | dest | air_time | distance | hour |
|-----:|------:|----:|----------:|----------:|---------|--------|------|---------:|---------:|-----:|
| 2014 | 2 | 20 | 4 | -13 | DL | LGA | ATL | 112 | 762 | 20 |
| 2014 | 2 | 20 | 8 | 3 | DL | LGA | PBI | 141 | 1035 | 10 |
| 2014 | 3 | 1 | -11 | 36 | AA | JFK | LAX | 375 | 2475 | 8 |
| 2014 | 3 | 1 | -3 | 14 | AA | JFK | LAX | 368 | 2475 | 11 |
| 2014 | 4 | 1 | -8 | -23 | MQ | LGA | BNA | 113 | 764 | 18 |
| 2014 | 4 | 1 | -8 | -11 | MQ | LGA | RDU | 71 | 431 | 18 |
##### How can we concatenate columns a and b for each group in ID?
R
``` r
kable(DT[, .(val = c(a,b)), by = ID])
```
| ID | val |
|:----|----:|
| b | 1 |
| b | 2 |
| b | 3 |
| b | 7 |
| b | 8 |
| b | 9 |
| a | 4 |
| a | 5 |
| a | 10 |
| a | 11 |
| c | 6 |
| c | 12 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> DT
(tc/pivot->longer [:a :b] {:value-column-name :val})
(tc/drop-columns [:$column :c]))
```
\_unnamed \[12 2\]:
| :ID | :val |
|-----|-----:|
| b | 1 |
| b | 2 |
| b | 3 |
| a | 4 |
| a | 5 |
| c | 6 |
| b | 7 |
| b | 8 |
| b | 9 |
| a | 10 |
| a | 11 |
| c | 12 |
##### What if we would like to have all the values of column `a` and `b` concatenated, but returned as a list column?
R
``` r
kable(DT[, .(val = list(c(a,b))), by = ID])
```
| ID | val |
|:----|:-----------------|
| b | 1, 2, 3, 7, 8, 9 |
| a | 4, 5, 10, 11 |
| c | 6, 12 |
------------------------------------------------------------------------
Clojure
``` clojure
(-> DT
(tc/pivot->longer [:a :b] {:value-column-name :val})
(tc/drop-columns [:$column :c])
(tc/fold-by :ID))
```
\_unnamed \[3 2\]:
| :ID | :val |
|-----|-----------------|
| b | \[1 2 3 7 8 9\] |
| a | \[4 5 10 11\] |
| c | \[6 12\] |
### API tour
Below snippets are taken from [A data.table and dplyr
tour](https://atrebas.github.io/post/2019-03-03-datatable-dplyr/)
written by Atrebas (permission granted).
I keep structure and subtitles but I skip `data.table` and `dplyr`
examples.
Example data
``` clojure
(def DS (tc/dataset {:V1 (take 9 (cycle [1 2]))
:V2 (range 1 10)
:V3 (take 9 (cycle [0.5 1.0 1.5]))
:V4 (take 9 (cycle ["A" "B" "C"]))}))
```
``` clojure
(tc/dataset? DS)
(class DS)
```
true
tech.v3.dataset.impl.dataset.Dataset
``` clojure
DS
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
#### Basic Operations
##### Filter rows
Filter rows using indices
``` clojure
(tc/select-rows DS [2 3])
```
\_unnamed \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
------------------------------------------------------------------------
Discard rows using negative indices
In Clojure API we have separate function for that: `drop-rows`.
``` clojure
(tc/drop-rows DS (range 2 7))
```
\_unnamed \[4 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
Filter rows using a logical expression
``` clojure
(tc/select-rows DS (comp #(> % 5) :V2))
```
\_unnamed \[4 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 6 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
``` clojure
(tc/select-rows DS (comp #{"A" "C"} :V4))
```
\_unnamed \[6 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 2 | 6 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
Filter rows using multiple conditions
``` clojure
(tc/select-rows DS #(and (= (:V1 %) 1)
(= (:V4 %) "A")))
```
\_unnamed \[2 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 1 | 7 | 0.5 | A |
------------------------------------------------------------------------
Filter unique rows
``` clojure
(tc/unique-by DS)
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
``` clojure
(tc/unique-by DS [:V1 :V4])
```
\_unnamed \[6 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
------------------------------------------------------------------------
Discard rows with missing values
``` clojure
(tc/drop-missing DS)
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
Other filters
``` clojure
(tc/random DS 3) ;; 3 random rows
```
\_unnamed \[3 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 3 | 1.5 | C |
| 1 | 1 | 0.5 | A |
| 1 | 7 | 0.5 | A |
``` clojure
(tc/random DS (/ (tc/row-count DS) 2)) ;; fraction of random rows
```
\_unnamed \[5 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 8 | 1.0 | B |
| 1 | 5 | 1.0 | B |
| 1 | 7 | 0.5 | A |
| 1 | 1 | 0.5 | A |
| 2 | 4 | 0.5 | A |
``` clojure
(tc/by-rank DS :V1 zero?) ;; take top n entries
```
\_unnamed \[4 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 2 | 1.0 | B |
| 2 | 4 | 0.5 | A |
| 2 | 6 | 1.5 | C |
| 2 | 8 | 1.0 | B |
------------------------------------------------------------------------
Convenience functions
``` clojure
(tc/select-rows DS (comp (partial re-matches #"^B") str :V4))
```
\_unnamed \[3 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 2 | 1.0 | B |
| 1 | 5 | 1.0 | B |
| 2 | 8 | 1.0 | B |
``` clojure
(tc/select-rows DS (comp #(<= 3 % 5) :V2))
```
\_unnamed \[3 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
``` clojure
(tc/select-rows DS (comp #(< 3 % 5) :V2))
```
\_unnamed \[1 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 4 | 0.5 | A |
``` clojure
(tc/select-rows DS (comp #(<= 3 % 5) :V2))
```
\_unnamed \[3 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
Last example skipped.
##### Sort rows
Sort rows by column
``` clojure
(tc/order-by DS :V3)
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 4 | 0.5 | A |
| 1 | 7 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 5 | 1.0 | B |
| 2 | 8 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 6 | 1.5 | C |
| 1 | 9 | 1.5 | C |
------------------------------------------------------------------------
Sort rows in decreasing order
``` clojure
(tc/order-by DS :V3 :desc)
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 3 | 1.5 | C |
| 2 | 6 | 1.5 | C |
| 1 | 9 | 1.5 | C |
| 1 | 5 | 1.0 | B |
| 2 | 2 | 1.0 | B |
| 2 | 8 | 1.0 | B |
| 1 | 7 | 0.5 | A |
| 2 | 4 | 0.5 | A |
| 1 | 1 | 0.5 | A |
------------------------------------------------------------------------
Sort rows based on several columns
``` clojure
(tc/order-by DS [:V1 :V2] [:asc :desc])
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 9 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 1 | 1 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 2 | 2 | 1.0 | B |
##### Select columns
Select one column using an index (not recommended)
``` clojure
(nth (tc/columns DS :as-seq) 2) ;; as column (iterable)
```
#tech.v3.dataset.column[9]
:V3
[0.5000, 1.000, 1.500, 0.5000, 1.000, 1.500, 0.5000, 1.000, 1.500]
``` clojure
(tc/dataset [(nth (tc/columns DS :as-seq) 2)])
```
\_unnamed \[9 1\]:
| :V3 |
|----:|
| 0.5 |
| 1.0 |
| 1.5 |
| 0.5 |
| 1.0 |
| 1.5 |
| 0.5 |
| 1.0 |
| 1.5 |
------------------------------------------------------------------------
Select one column using column name
``` clojure
(tc/select-columns DS :V2) ;; as dataset
```
\_unnamed \[9 1\]:
| :V2 |
|----:|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
``` clojure
(tc/select-columns DS [:V2]) ;; as dataset
```
\_unnamed \[9 1\]:
| :V2 |
|----:|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
``` clojure
(DS :V2) ;; as column (iterable)
```
#tech.v3.dataset.column[9]
:V2
[1, 2, 3, 4, 5, 6, 7, 8, 9]
------------------------------------------------------------------------
Select several columns
``` clojure
(tc/select-columns DS [:V2 :V3 :V4])
```
\_unnamed \[9 3\]:
| :V2 | :V3 | :V4 |
|----:|----:|-----|
| 1 | 0.5 | A |
| 2 | 1.0 | B |
| 3 | 1.5 | C |
| 4 | 0.5 | A |
| 5 | 1.0 | B |
| 6 | 1.5 | C |
| 7 | 0.5 | A |
| 8 | 1.0 | B |
| 9 | 1.5 | C |
------------------------------------------------------------------------
Exclude columns
``` clojure
(tc/select-columns DS (complement #{:V2 :V3 :V4}))
```
\_unnamed \[9 1\]:
| :V1 |
|----:|
| 1 |
| 2 |
| 1 |
| 2 |
| 1 |
| 2 |
| 1 |
| 2 |
| 1 |
``` clojure
(tc/drop-columns DS [:V2 :V3 :V4])
```
\_unnamed \[9 1\]:
| :V1 |
|----:|
| 1 |
| 2 |
| 1 |
| 2 |
| 1 |
| 2 |
| 1 |
| 2 |
| 1 |
------------------------------------------------------------------------
Other seletions
``` clojure
(->> (range 1 3)
(map (comp keyword (partial format "V%d")))
(tc/select-columns DS))
```
\_unnamed \[9 2\]:
| :V1 | :V2 |
|----:|----:|
| 1 | 1 |
| 2 | 2 |
| 1 | 3 |
| 2 | 4 |
| 1 | 5 |
| 2 | 6 |
| 1 | 7 |
| 2 | 8 |
| 1 | 9 |
``` clojure
(tc/reorder-columns DS :V4)
```
\_unnamed \[9 4\]:
| :V4 | :V1 | :V2 | :V3 |
|-----|----:|----:|----:|
| A | 1 | 1 | 0.5 |
| B | 2 | 2 | 1.0 |
| C | 1 | 3 | 1.5 |
| A | 2 | 4 | 0.5 |
| B | 1 | 5 | 1.0 |
| C | 2 | 6 | 1.5 |
| A | 1 | 7 | 0.5 |
| B | 2 | 8 | 1.0 |
| C | 1 | 9 | 1.5 |
``` clojure
(tc/select-columns DS #(clojure.string/starts-with? (name %) "V"))
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |
``` clojure
(tc/select-columns DS #(clojure.string/ends-with? (name %) "3"))
```
\_unnamed \[9 1\]:
| :V3 |
|----:|
| 0.5 |
| 1.0 |
| 1.5 |
| 0.5 |
| 1.0 |
| 1.5 |
| 0.5 |
| 1.0 |
| 1.5 |
``` clojure
(tc/select-columns DS #"..2") ;; regex converts to string using `str` function
```
\_unnamed \[9 1\]:
| :V2 |
|----:|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
``` clojure
(tc/select-columns DS #{:V1 "X"})
```
\_unnamed \[9 1\]:
| :V1 |
|----:|
| 1 |
| 2 |
| 1 |
| 2 |
| 1 |
| 2 |
| 1 |
| 2 |
| 1 |
``` clojure
(tc/select-columns DS #(not (clojure.string/starts-with? (name %) "V2")))
```
\_unnamed \[9 3\]:
| :V1 | :V3 | :V4 |
|----:|----:|-----|
| 1 | 0.5 | A |
| 2 | 1.0 | B |
| 1 | 1.5 | C |
| 2 | 0.5 | A |
| 1 | 1.0 | B |
| 2 | 1.5 | C |
| 1 | 0.5 | A |
| 2 | 1.0 | B |
| 1 | 1.5 | C |
##### Summarise data
Summarise one column
``` clojure
(reduce + (DS :V1)) ;; using pure Clojure, as value
```
13
``` clojure
(tc/aggregate-columns DS :V1 dfn/sum) ;; as dataset
```
\_unnamed \[1 1\]:
| :V1 |
|-----:|
| 13.0 |
``` clojure
(tc/aggregate DS {:sumV1 #(dfn/sum (% :V1))})
```
\_unnamed \[1 1\]:
| :sumV1 |
|-------:|
| 13.0 |
------------------------------------------------------------------------
Summarize several columns
``` clojure
(tc/aggregate DS [#(dfn/sum (% :V1))
#(dfn/standard-deviation (% :V3))])
```
\_unnamed \[1 2\]:
| :summary-0 | :summary-1 |
|-----------:|-----------:|
| 13.0 | 0.4330127 |
``` clojure
(tc/aggregate-columns DS [:V1 :V3] [dfn/sum
dfn/standard-deviation])
```
\_unnamed \[1 2\]:
| :V1 | :V3 |
|-----:|----------:|
| 13.0 | 0.4330127 |
------------------------------------------------------------------------
Summarise several columns and assign column names
``` clojure
(tc/aggregate DS {:sumv1 #(dfn/sum (% :V1))
:sdv3 #(dfn/standard-deviation (% :V3))})
```
\_unnamed \[1 2\]:
| :sumv1 | :sdv3 |
|-------:|----------:|
| 13.0 | 0.4330127 |
------------------------------------------------------------------------
Summarise a subset of rows
``` clojure
(-> DS
(tc/select-rows (range 4))
(tc/aggregate-columns :V1 dfn/sum))
```
\_unnamed \[1 1\]:
| :V1 |
|----:|
| 6.0 |
##### Additional helpers
``` clojure
(-> DS
(tc/first)
(tc/select-columns :V3)) ;; select first row from `:V3` column
```
\_unnamed \[1 1\]:
| :V3 |
|----:|
| 0.5 |
``` clojure
(-> DS
(tc/last)
(tc/select-columns :V3)) ;; select last row from `:V3` column
```
\_unnamed \[1 1\]:
| :V3 |
|----:|
| 1.5 |
``` clojure
(-> DS
(tc/select-rows 4)
(tc/select-columns :V3)) ;; select forth row from `:V3` column
```
\_unnamed \[1 1\]:
| :V3 |
|----:|
| 1.0 |
``` clojure
(-> DS
(tc/select :V3 4)) ;; select forth row from `:V3` column
```
\_unnamed \[1 1\]:
| :V3 |
|----:|
| 1.0 |
``` clojure
(-> DS
(tc/unique-by :V4)
(tc/aggregate tc/row-count)) ;; number of unique rows in `:V4` column, as dataset
```
\_unnamed \[1 1\]:
| summary |
|--------:|
| 3 |
``` clojure
(-> DS
(tc/unique-by :V4)
(tc/row-count)) ;; number of unique rows in `:V4` column, as value
```
3
``` clojure
(-> DS
(tc/unique-by)
(tc/row-count)) ;; number of unique rows in dataset, as value
```
9
##### Add/update/delete columns
Modify a column
``` clojure
(tc/map-columns DS :V1 [:V1] #(dfn/pow % 2))
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1.0 | 1 | 0.5 | A |
| 4.0 | 2 | 1.0 | B |
| 1.0 | 3 | 1.5 | C |
| 4.0 | 4 | 0.5 | A |
| 1.0 | 5 | 1.0 | B |
| 4.0 | 6 | 1.5 | C |
| 1.0 | 7 | 0.5 | A |
| 4.0 | 8 | 1.0 | B |
| 1.0 | 9 | 1.5 | C |
``` clojure
(def DS (tc/add-column DS :V1 (dfn/pow (DS :V1) 2)))
```
``` clojure
DS
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1.0 | 1 | 0.5 | A |
| 4.0 | 2 | 1.0 | B |
| 1.0 | 3 | 1.5 | C |
| 4.0 | 4 | 0.5 | A |
| 1.0 | 5 | 1.0 | B |
| 4.0 | 6 | 1.5 | C |
| 1.0 | 7 | 0.5 | A |
| 4.0 | 8 | 1.0 | B |
| 1.0 | 9 | 1.5 | C |
------------------------------------------------------------------------
Add one column
``` clojure
(tc/map-columns DS :v5 [:V1] dfn/log)
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :V3 | :V4 | :v5 |
|----:|----:|----:|-----|-----------:|
| 1.0 | 1 | 0.5 | A | 0.00000000 |
| 4.0 | 2 | 1.0 | B | 1.38629436 |
| 1.0 | 3 | 1.5 | C | 0.00000000 |
| 4.0 | 4 | 0.5 | A | 1.38629436 |
| 1.0 | 5 | 1.0 | B | 0.00000000 |
| 4.0 | 6 | 1.5 | C | 1.38629436 |
| 1.0 | 7 | 0.5 | A | 0.00000000 |
| 4.0 | 8 | 1.0 | B | 1.38629436 |
| 1.0 | 9 | 1.5 | C | 0.00000000 |
``` clojure
(def DS (tc/add-column DS :v5 (dfn/log (DS :V1))))
```
``` clojure
DS
```
\_unnamed \[9 5\]:
| :V1 | :V2 | :V3 | :V4 | :v5 |
|----:|----:|----:|-----|-----------:|
| 1.0 | 1 | 0.5 | A | 0.00000000 |
| 4.0 | 2 | 1.0 | B | 1.38629436 |
| 1.0 | 3 | 1.5 | C | 0.00000000 |
| 4.0 | 4 | 0.5 | A | 1.38629436 |
| 1.0 | 5 | 1.0 | B | 0.00000000 |
| 4.0 | 6 | 1.5 | C | 1.38629436 |
| 1.0 | 7 | 0.5 | A | 0.00000000 |
| 4.0 | 8 | 1.0 | B | 1.38629436 |
| 1.0 | 9 | 1.5 | C | 0.00000000 |
------------------------------------------------------------------------
Add several columns
``` clojure
(def DS (tc/add-columns DS {:v6 (dfn/sqrt (DS :V1))
:v7 "X"}))
```
``` clojure
DS
```
\_unnamed \[9 7\]:
| :V1 | :V2 | :V3 | :V4 | :v5 | :v6 | :v7 |
|----:|----:|----:|-----|-----------:|----:|-----|
| 1.0 | 1 | 0.5 | A | 0.00000000 | 1.0 | X |
| 4.0 | 2 | 1.0 | B | 1.38629436 | 2.0 | X |
| 1.0 | 3 | 1.5 | C | 0.00000000 | 1.0 | X |
| 4.0 | 4 | 0.5 | A | 1.38629436 | 2.0 | X |
| 1.0 | 5 | 1.0 | B | 0.00000000 | 1.0 | X |
| 4.0 | 6 | 1.5 | C | 1.38629436 | 2.0 | X |
| 1.0 | 7 | 0.5 | A | 0.00000000 | 1.0 | X |
| 4.0 | 8 | 1.0 | B | 1.38629436 | 2.0 | X |
| 1.0 | 9 | 1.5 | C | 0.00000000 | 1.0 | X |
------------------------------------------------------------------------
Create one column and remove the others
``` clojure
(tc/dataset {:v8 (dfn/+ (DS :V3) 1)})
```
\_unnamed \[9 1\]:
| :v8 |
|----:|
| 1.5 |
| 2.0 |
| 2.5 |
| 1.5 |
| 2.0 |
| 2.5 |
| 1.5 |
| 2.0 |
| 2.5 |
------------------------------------------------------------------------
Remove one column
``` clojure
(def DS (tc/drop-columns DS :v5))
```
``` clojure
DS
```
\_unnamed \[9 6\]:
| :V1 | :V2 | :V3 | :V4 | :v6 | :v7 |
|----:|----:|----:|-----|----:|-----|
| 1.0 | 1 | 0.5 | A | 1.0 | X |
| 4.0 | 2 | 1.0 | B | 2.0 | X |
| 1.0 | 3 | 1.5 | C | 1.0 | X |
| 4.0 | 4 | 0.5 | A | 2.0 | X |
| 1.0 | 5 | 1.0 | B | 1.0 | X |
| 4.0 | 6 | 1.5 | C | 2.0 | X |
| 1.0 | 7 | 0.5 | A | 1.0 | X |
| 4.0 | 8 | 1.0 | B | 2.0 | X |
| 1.0 | 9 | 1.5 | C | 1.0 | X |
------------------------------------------------------------------------
Remove several columns
``` clojure
(def DS (tc/drop-columns DS [:v6 :v7]))
```
``` clojure
DS
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1.0 | 1 | 0.5 | A |
| 4.0 | 2 | 1.0 | B |
| 1.0 | 3 | 1.5 | C |
| 4.0 | 4 | 0.5 | A |
| 1.0 | 5 | 1.0 | B |
| 4.0 | 6 | 1.5 | C |
| 1.0 | 7 | 0.5 | A |
| 4.0 | 8 | 1.0 | B |
| 1.0 | 9 | 1.5 | C |
------------------------------------------------------------------------
Remove columns using a vector of colnames
We use set here.
``` clojure
(def DS (tc/select-columns DS (complement #{:V3})))
```
``` clojure
DS
```
\_unnamed \[9 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 1.0 | 1 | A |
| 4.0 | 2 | B |
| 1.0 | 3 | C |
| 4.0 | 4 | A |
| 1.0 | 5 | B |
| 4.0 | 6 | C |
| 1.0 | 7 | A |
| 4.0 | 8 | B |
| 1.0 | 9 | C |
------------------------------------------------------------------------
Replace values for rows matching a condition
``` clojure
(def DS (tc/map-columns DS :V2 [:V2] #(if (< % 4.0) 0.0 %)))
```
``` clojure
DS
```
\_unnamed \[9 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 1.0 | 0.0 | A |
| 4.0 | 0.0 | B |
| 1.0 | 0.0 | C |
| 4.0 | 4.0 | A |
| 1.0 | 5.0 | B |
| 4.0 | 6.0 | C |
| 1.0 | 7.0 | A |
| 4.0 | 8.0 | B |
| 1.0 | 9.0 | C |
##### by
By group
``` clojure
(-> DS
(tc/group-by [:V4])
(tc/aggregate {:sumV2 #(dfn/sum (% :V2))}))
```
\_unnamed \[3 2\]:
| :V4 | :sumV2 |
|-----|-------:|
| A | 11.0 |
| B | 13.0 |
| C | 15.0 |
------------------------------------------------------------------------
By several groups
``` clojure
(-> DS
(tc/group-by [:V4 :V1])
(tc/aggregate {:sumV2 #(dfn/sum (% :V2))}))
```
\_unnamed \[6 3\]:
| :V4 | :V1 | :sumV2 |
|-----|----:|-------:|
| A | 1.0 | 7.0 |
| B | 4.0 | 8.0 |
| C | 1.0 | 9.0 |
| A | 4.0 | 4.0 |
| B | 1.0 | 5.0 |
| C | 4.0 | 6.0 |
------------------------------------------------------------------------
Calling function in by
``` clojure
(-> DS
(tc/group-by (fn [row]
(clojure.string/lower-case (:V4 row))))
(tc/aggregate {:sumV1 #(dfn/sum (% :V1))}))
```
\_unnamed \[3 2\]:
| :\$group-name | :sumV1 |
|---------------|-------:|
| a | 6.0 |
| b | 9.0 |
| c | 6.0 |
------------------------------------------------------------------------
Assigning column name in by
``` clojure
(-> DS
(tc/group-by (fn [row]
{:abc (clojure.string/lower-case (:V4 row))}))
(tc/aggregate {:sumV1 #(dfn/sum (% :V1))}))
```
\_unnamed \[3 2\]:
| :abc | :sumV1 |
|------|-------:|
| a | 6.0 |
| b | 9.0 |
| c | 6.0 |
``` clojure
(-> DS
(tc/group-by (fn [row]
(clojure.string/lower-case (:V4 row))))
(tc/aggregate {:sumV1 #(dfn/sum (% :V1))} {:add-group-as-column :abc}))
```
\_unnamed \[3 2\]:
| :\$group-name | :sumV1 |
|---------------|-------:|
| a | 6.0 |
| b | 9.0 |
| c | 6.0 |
------------------------------------------------------------------------
Using a condition in by
``` clojure
(-> DS
(tc/group-by #(= (:V4 %) "A"))
(tc/aggregate #(dfn/sum (% :V1))))
```
\_unnamed \[2 2\]:
| :\$group-name | summary |
|---------------|--------:|
| true | 6.0 |
| false | 15.0 |
------------------------------------------------------------------------
By on a subset of rows
``` clojure
(-> DS
(tc/select-rows (range 5))
(tc/group-by :V4)
(tc/aggregate {:sumV1 #(dfn/sum (% :V1))}))
```
\_unnamed \[3 2\]:
| :\$group-name | :sumV1 |
|---------------|-------:|
| A | 5.0 |
| B | 5.0 |
| C | 1.0 |
------------------------------------------------------------------------
Count number of observations for each group
``` clojure
(-> DS
(tc/group-by :V4)
(tc/aggregate tc/row-count))
```
\_unnamed \[3 2\]:
| :\$group-name | summary |
|---------------|--------:|
| A | 3 |
| B | 3 |
| C | 3 |
------------------------------------------------------------------------
Add a column with number of observations for each group
``` clojure
(-> DS
(tc/group-by [:V1])
(tc/add-column :n tc/row-count)
(tc/ungroup))
```
\_unnamed \[9 4\]:
| :V1 | :V2 | :V4 | :n |
|----:|----:|-----|----:|
| 1.0 | 0.0 | A | 5 |
| 1.0 | 0.0 | C | 5 |
| 1.0 | 5.0 | B | 5 |
| 1.0 | 7.0 | A | 5 |
| 1.0 | 9.0 | C | 5 |
| 4.0 | 0.0 | B | 4 |
| 4.0 | 4.0 | A | 4 |
| 4.0 | 6.0 | C | 4 |
| 4.0 | 8.0 | B | 4 |
------------------------------------------------------------------------
Retrieve the first/last/nth observation for each group
``` clojure
(-> DS
(tc/group-by [:V4])
(tc/aggregate-columns :V2 first))
```
\_unnamed \[3 2\]:
| :V4 | :V2 |
|-----|----:|
| A | 0.0 |
| B | 0.0 |
| C | 0.0 |
``` clojure
(-> DS
(tc/group-by [:V4])
(tc/aggregate-columns :V2 last))
```
\_unnamed \[3 2\]:
| :V4 | :V2 |
|-----|----:|
| A | 7.0 |
| B | 8.0 |
| C | 9.0 |
``` clojure
(-> DS
(tc/group-by [:V4])
(tc/aggregate-columns :V2 #(nth % 1)))
```
\_unnamed \[3 2\]:
| :V4 | :V2 |
|-----|----:|
| A | 4.0 |
| B | 5.0 |
| C | 6.0 |
#### Going further
##### Advanced columns manipulation
Summarise all the columns
``` clojure
;; custom max function which works on every type
(tc/aggregate-columns DS :all (fn [col] (first (sort #(compare %2 %1) col))))
```
\_unnamed \[1 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 4.0 | 9.0 | C |
------------------------------------------------------------------------
Summarise several columns
``` clojure
(tc/aggregate-columns DS [:V1 :V2] dfn/mean)
```
\_unnamed \[1 2\]:
| :V1 | :V2 |
|-----------:|-----------:|
| 2.33333333 | 4.33333333 |
------------------------------------------------------------------------
Summarise several columns by group
``` clojure
(-> DS
(tc/group-by [:V4])
(tc/aggregate-columns [:V1 :V2] dfn/mean))
```
\_unnamed \[3 3\]:
| :V4 | :V1 | :V2 |
|-----|----:|-----------:|
| A | 2.0 | 3.66666667 |
| B | 3.0 | 4.33333333 |
| C | 2.0 | 5.00000000 |
------------------------------------------------------------------------
Summarise with more than one function by group
``` clojure
(-> DS
(tc/group-by [:V4])
(tc/aggregate-columns [:V1 :V2] (fn [col]
{:sum (dfn/sum col)
:mean (dfn/mean col)})))
```
\_unnamed \[3 5\]:
| :V4 | :V1-sum | :V1-mean | :V2-sum | :V2-mean |
|-----|--------:|---------:|--------:|-----------:|
| A | 6.0 | 2.0 | 11.0 | 3.66666667 |
| B | 9.0 | 3.0 | 13.0 | 4.33333333 |
| C | 6.0 | 2.0 | 15.0 | 5.00000000 |
Summarise using a condition
``` clojure
(-> DS
(tc/select-columns :type/numerical)
(tc/aggregate-columns :all dfn/mean))
```
\_unnamed \[1 2\]:
| :V1 | :V2 |
|-----------:|-----------:|
| 2.33333333 | 4.33333333 |
------------------------------------------------------------------------
Modify all the columns
``` clojure
(tc/update-columns DS :all reverse)
```
\_unnamed \[9 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 1.0 | 9.0 | C |
| 4.0 | 8.0 | B |
| 1.0 | 7.0 | A |
| 4.0 | 6.0 | C |
| 1.0 | 5.0 | B |
| 4.0 | 4.0 | A |
| 1.0 | 0.0 | C |
| 4.0 | 0.0 | B |
| 1.0 | 0.0 | A |
------------------------------------------------------------------------
Modify several columns (dropping the others)
``` clojure
(-> DS
(tc/select-columns [:V1 :V2])
(tc/update-columns :all dfn/sqrt))
```
\_unnamed \[9 2\]:
| :V1 | :V2 |
|----:|-----------:|
| 1.0 | 0.00000000 |
| 2.0 | 0.00000000 |
| 1.0 | 0.00000000 |
| 2.0 | 2.00000000 |
| 1.0 | 2.23606798 |
| 2.0 | 2.44948974 |
| 1.0 | 2.64575131 |
| 2.0 | 2.82842712 |
| 1.0 | 3.00000000 |
``` clojure
(-> DS
(tc/select-columns (complement #{:V4}))
(tc/update-columns :all dfn/exp))
```
\_unnamed \[9 2\]:
| :V1 | :V2 |
|------------:|--------------:|
| 2.71828183 | 1.00000000 |
| 54.59815003 | 1.00000000 |
| 2.71828183 | 1.00000000 |
| 54.59815003 | 54.59815003 |
| 2.71828183 | 148.41315910 |
| 54.59815003 | 403.42879349 |
| 2.71828183 | 1096.63315843 |
| 54.59815003 | 2980.95798704 |
| 2.71828183 | 8103.08392758 |
------------------------------------------------------------------------
Modify several columns (keeping the others)
``` clojure
(def DS (tc/update-columns DS [:V1 :V2] dfn/sqrt))
```
``` clojure
DS
```
\_unnamed \[9 3\]:
| :V1 | :V2 | :V4 |
|----:|-----------:|-----|
| 1.0 | 0.00000000 | A |
| 2.0 | 0.00000000 | B |
| 1.0 | 0.00000000 | C |
| 2.0 | 2.00000000 | A |
| 1.0 | 2.23606798 | B |
| 2.0 | 2.44948974 | C |
| 1.0 | 2.64575131 | A |
| 2.0 | 2.82842712 | B |
| 1.0 | 3.00000000 | C |
``` clojure
(def DS (tc/update-columns DS (complement #{:V4}) #(dfn/pow % 2)))
```
``` clojure
DS
```
\_unnamed \[9 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 1.0 | 0.0 | A |
| 4.0 | 0.0 | B |
| 1.0 | 0.0 | C |
| 4.0 | 4.0 | A |
| 1.0 | 5.0 | B |
| 4.0 | 6.0 | C |
| 1.0 | 7.0 | A |
| 4.0 | 8.0 | B |
| 1.0 | 9.0 | C |
------------------------------------------------------------------------
Modify columns using a condition (dropping the others)
``` clojure
(-> DS
(tc/select-columns :type/numerical)
(tc/update-columns :all #(dfn/- % 1)))
```
\_unnamed \[9 2\]:
| :V1 | :V2 |
|----:|-----:|
| 0.0 | -1.0 |
| 3.0 | -1.0 |
| 0.0 | -1.0 |
| 3.0 | 3.0 |
| 0.0 | 4.0 |
| 3.0 | 5.0 |
| 0.0 | 6.0 |
| 3.0 | 7.0 |
| 0.0 | 8.0 |
------------------------------------------------------------------------
Modify columns using a condition (keeping the others)
``` clojure
(def DS (tc/convert-types DS :type/numerical :int32))
```
``` clojure
DS
```
\_unnamed \[9 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 1 | 0 | A |
| 4 | 0 | B |
| 1 | 0 | C |
| 4 | 4 | A |
| 1 | 5 | B |
| 4 | 5 | C |
| 1 | 7 | A |
| 4 | 8 | B |
| 1 | 9 | C |
------------------------------------------------------------------------
Use a complex expression
``` clojure
(-> DS
(tc/group-by [:V4])
(tc/head 2)
(tc/add-column :V2 "X")
(tc/ungroup))
```
\_unnamed \[6 3\]:
| :V1 | :V2 | :V4 |
|----:|-----|-----|
| 1 | X | A |
| 4 | X | A |
| 4 | X | B |
| 1 | X | B |
| 1 | X | C |
| 4 | X | C |
------------------------------------------------------------------------
Use multiple expressions
``` clojure
(tc/dataset (let [x (dfn/+ (DS :V1) (dfn/sum (DS :V2)))]
(println (seq (DS :V1)))
(println (tc/info (tc/select-columns DS :V1)))
{:A (range 1 (inc (tc/row-count DS)))
:B x}))
```
(1 4 1 4 1 4 1 4 1) \_unnamed: descriptive-stats \[1 11\]:
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :max | :standard-deviation | :skew | :first | :last |
|-----------|-----------|---------:|-----------:|-----:|-----------:|-----:|--------------------:|-----------:|-------:|------:|
| :V1 | :int32 | 9 | 0 | 1.0 | 2.33333333 | 4.0 | 1.58113883 | 0.27105237 | 1 | 1 |
\_unnamed \[9 2\]:
| :A | :B |
|----:|-----:|
| 1 | 39.0 |
| 2 | 42.0 |
| 3 | 39.0 |
| 4 | 42.0 |
| 5 | 39.0 |
| 6 | 42.0 |
| 7 | 39.0 |
| 8 | 42.0 |
| 9 | 39.0 |
##### Chain expressions
Expression chaining using \>
``` clojure
(-> DS
(tc/group-by [:V4])
(tc/aggregate {:V1sum #(dfn/sum (% :V1))})
(tc/select-rows #(>= (:V1sum %) 5)))
```
\_unnamed \[3 2\]:
| :V4 | :V1sum |
|-----|-------:|
| A | 6.0 |
| B | 9.0 |
| C | 6.0 |
``` clojure
(-> DS
(tc/group-by [:V4])
(tc/aggregate {:V1sum #(dfn/sum (% :V1))})
(tc/order-by :V1sum :desc))
```
\_unnamed \[3 2\]:
| :V4 | :V1sum |
|-----|-------:|
| B | 9.0 |
| A | 6.0 |
| C | 6.0 |
##### Indexing and Keys
Set the key/index (order)
``` clojure
(def DS (tc/order-by DS :V4))
```
``` clojure
DS
```
\_unnamed \[9 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 1 | 0 | A |
| 4 | 4 | A |
| 1 | 7 | A |
| 4 | 0 | B |
| 1 | 5 | B |
| 4 | 8 | B |
| 1 | 0 | C |
| 4 | 5 | C |
| 1 | 9 | C |
Select the matching rows
``` clojure
(tc/select-rows DS #(= (:V4 %) "A"))
```
\_unnamed \[3 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 1 | 0 | A |
| 4 | 4 | A |
| 1 | 7 | A |
``` clojure
(tc/select-rows DS (comp #{"A" "C"} :V4))
```
\_unnamed \[6 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 1 | 0 | A |
| 4 | 4 | A |
| 1 | 7 | A |
| 1 | 0 | C |
| 4 | 5 | C |
| 1 | 9 | C |
------------------------------------------------------------------------
Select the first matching row
``` clojure
(-> DS
(tc/select-rows #(= (:V4 %) "B"))
(tc/first))
```
\_unnamed \[1 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 4 | 0 | B |
``` clojure
(-> DS
(tc/unique-by :V4)
(tc/select-rows (comp #{"B" "C"} :V4)))
```
\_unnamed \[2 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 4 | 0 | B |
| 1 | 0 | C |
------------------------------------------------------------------------
Select the last matching row
``` clojure
(-> DS
(tc/select-rows #(= (:V4 %) "A"))
(tc/last))
```
\_unnamed \[1 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 1 | 7 | A |
------------------------------------------------------------------------
Nomatch argument
``` clojure
(tc/select-rows DS (comp #{"A" "D"} :V4))
```
\_unnamed \[3 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 1 | 0 | A |
| 4 | 4 | A |
| 1 | 7 | A |
------------------------------------------------------------------------
Apply a function on the matching rows
``` clojure
(-> DS
(tc/select-rows (comp #{"A" "C"} :V4))
(tc/aggregate-columns :V1 (fn [col]
{:sum (dfn/sum col)})))
```
\_unnamed \[1 1\]:
| :V1-sum |
|--------:|
| 12.0 |
------------------------------------------------------------------------
Modify values for matching rows
``` clojure
(def DS (-> DS
(tc/map-columns :V1 [:V1 :V4] #(if (= %2 "A") 0 %1))
(tc/order-by :V4)))
```
``` clojure
DS
```
\_unnamed \[9 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 0 | 0 | A |
| 0 | 4 | A |
| 0 | 7 | A |
| 4 | 0 | B |
| 1 | 5 | B |
| 4 | 8 | B |
| 1 | 0 | C |
| 4 | 5 | C |
| 1 | 9 | C |
------------------------------------------------------------------------
Use keys in by
``` clojure
(-> DS
(tc/select-rows (comp (complement #{"B"}) :V4))
(tc/group-by [:V4])
(tc/aggregate-columns :V1 dfn/sum))
```
\_unnamed \[2 2\]:
| :V4 | :V1 |
|-----|----:|
| A | 0.0 |
| C | 6.0 |
------------------------------------------------------------------------
Set keys/indices for multiple columns (ordered)
``` clojure
(tc/order-by DS [:V4 :V1])
```
\_unnamed \[9 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 0 | 0 | A |
| 0 | 4 | A |
| 0 | 7 | A |
| 1 | 5 | B |
| 4 | 0 | B |
| 4 | 8 | B |
| 1 | 0 | C |
| 1 | 9 | C |
| 4 | 5 | C |
------------------------------------------------------------------------
Subset using multiple keys/indices
``` clojure
(-> DS
(tc/select-rows #(and (= (:V1 %) 1)
(= (:V4 %) "C"))))
```
\_unnamed \[2 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 1 | 0 | C |
| 1 | 9 | C |
``` clojure
(-> DS
(tc/select-rows #(and (= (:V1 %) 1)
(#{"B" "C"} (:V4 %)))))
```
\_unnamed \[3 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 1 | 5 | B |
| 1 | 0 | C |
| 1 | 9 | C |
``` clojure
(-> DS
(tc/select-rows #(and (= (:V1 %) 1)
(#{"B" "C"} (:V4 %))) {:result-type :as-indexes}))
```
(4 6 8)
##### set\*() modifications
Replace values
There is no mutating operations `tech.ml.dataset` or easy way to set
value.
``` clojure
(def DS (tc/update-columns DS :V2 #(map-indexed (fn [idx v]
(if (zero? idx) 3 v)) %)))
```
``` clojure
DS
```
\_unnamed \[9 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 0 | 3 | A |
| 0 | 4 | A |
| 0 | 7 | A |
| 4 | 0 | B |
| 1 | 5 | B |
| 4 | 8 | B |
| 1 | 0 | C |
| 4 | 5 | C |
| 1 | 9 | C |
------------------------------------------------------------------------
Reorder rows
``` clojure
(def DS (tc/order-by DS [:V4 :V1] [:asc :desc]))
```
``` clojure
DS
```
\_unnamed \[9 3\]:
| :V1 | :V2 | :V4 |
|----:|----:|-----|
| 0 | 3 | A |
| 0 | 4 | A |
| 0 | 7 | A |
| 4 | 0 | B |
| 4 | 8 | B |
| 1 | 5 | B |
| 4 | 5 | C |
| 1 | 0 | C |
| 1 | 9 | C |
------------------------------------------------------------------------
Modify colnames
``` clojure
(def DS (tc/rename-columns DS {:V2 "v2"}))
```
``` clojure
DS
```
\_unnamed \[9 3\]:
| :V1 | v2 | :V4 |
|----:|----:|-----|
| 0 | 3 | A |
| 0 | 4 | A |
| 0 | 7 | A |
| 4 | 0 | B |
| 4 | 8 | B |
| 1 | 5 | B |
| 4 | 5 | C |
| 1 | 0 | C |
| 1 | 9 | C |
``` clojure
(def DS (tc/rename-columns DS {"v2" :V2})) ;; revert back
```
------------------------------------------------------------------------
Reorder columns
``` clojure
(def DS (tc/reorder-columns DS :V4 :V1 :V2))
```
``` clojure
DS
```
\_unnamed \[9 3\]:
| :V4 | :V1 | :V2 |
|-----|----:|----:|
| A | 0 | 3 |
| A | 0 | 4 |
| A | 0 | 7 |
| B | 4 | 0 |
| B | 4 | 8 |
| B | 1 | 5 |
| C | 4 | 5 |
| C | 1 | 0 |
| C | 1 | 9 |
##### Advanced use of by
Select first/last/… row by group
``` clojure
(-> DS
(tc/group-by :V4)
(tc/first)
(tc/ungroup))
```
\_unnamed \[3 3\]:
| :V4 | :V1 | :V2 |
|-----|----:|----:|
| A | 0 | 3 |
| B | 4 | 0 |
| C | 4 | 5 |
``` clojure
(-> DS
(tc/group-by :V4)
(tc/select-rows [0 2])
(tc/ungroup))
```
\_unnamed \[6 3\]:
| :V4 | :V1 | :V2 |
|-----|----:|----:|
| A | 0 | 3 |
| A | 0 | 7 |
| B | 4 | 0 |
| B | 1 | 5 |
| C | 4 | 5 |
| C | 1 | 9 |
``` clojure
(-> DS
(tc/group-by :V4)
(tc/tail 2)
(tc/ungroup))
```
\_unnamed \[6 3\]:
| :V4 | :V1 | :V2 |
|-----|----:|----:|
| A | 0 | 4 |
| A | 0 | 7 |
| B | 4 | 8 |
| B | 1 | 5 |
| C | 1 | 0 |
| C | 1 | 9 |
------------------------------------------------------------------------
Select rows using a nested query
``` clojure
(-> DS
(tc/group-by :V4)
(tc/order-by :V2)
(tc/first)
(tc/ungroup))
```
\_unnamed \[3 3\]:
| :V4 | :V1 | :V2 |
|-----|----:|----:|
| A | 0 | 3 |
| B | 4 | 0 |
| C | 1 | 0 |
Add a group counter column
``` clojure
(-> DS
(tc/group-by [:V4 :V1])
(tc/ungroup {:add-group-id-as-column :Grp}))
```
\_unnamed \[9 4\]:
| :Grp | :V4 | :V1 | :V2 |
|-----:|-----|----:|----:|
| 0 | A | 0 | 3 |
| 0 | A | 0 | 4 |
| 0 | A | 0 | 7 |
| 1 | B | 4 | 0 |
| 1 | B | 4 | 8 |
| 2 | B | 1 | 5 |
| 3 | C | 4 | 5 |
| 4 | C | 1 | 0 |
| 4 | C | 1 | 9 |
------------------------------------------------------------------------
Get row number of first (and last) observation by group
``` clojure
(-> DS
(tc/add-column :row-id (range))
(tc/select-columns [:V4 :row-id])
(tc/group-by :V4)
(tc/ungroup))
```
\_unnamed \[9 2\]:
| :V4 | :row-id |
|-----|--------:|
| A | 0 |
| A | 1 |
| A | 2 |
| B | 3 |
| B | 4 |
| B | 5 |
| C | 6 |
| C | 7 |
| C | 8 |
``` clojure
(-> DS
(tc/add-column :row-id (range))
(tc/select-columns [:V4 :row-id])
(tc/group-by :V4)
(tc/first)
(tc/ungroup))
```
\_unnamed \[3 2\]:
| :V4 | :row-id |
|-----|--------:|
| A | 0 |
| B | 3 |
| C | 6 |
``` clojure
(-> DS
(tc/add-column :row-id (range))
(tc/select-columns [:V4 :row-id])
(tc/group-by :V4)
(tc/select-rows [0 2])
(tc/ungroup))
```
\_unnamed \[6 2\]:
| :V4 | :row-id |
|-----|--------:|
| A | 0 |
| A | 2 |
| B | 3 |
| B | 5 |
| C | 6 |
| C | 8 |
------------------------------------------------------------------------
Handle list-columns by group
``` clojure
(-> DS
(tc/select-columns [:V1 :V4])
(tc/fold-by :V4))
```
\_unnamed \[3 2\]:
| :V4 | :V1 |
|-----|-----------|
| A | \[0 0 0\] |
| B | \[4 4 1\] |
| C | \[4 1 1\] |
``` clojure
(-> DS
(tc/group-by :V4)
(tc/unmark-group))
```
\_unnamed \[3 3\]:
| :name | :group-id | :data |
|-------|----------:|-------------------|
| A | 0 | Group: A \[3 3\]: |
| B | 1 | Group: B \[3 3\]: |
| C | 2 | Group: C \[3 3\]: |
------------------------------------------------------------------------
Grouping sets (multiple by at once)
Not available.
#### Miscellaneous
##### Read / Write data
Write data to a csv file
``` clojure
(tc/write! DS "DF.csv")
```
10
------------------------------------------------------------------------
Write data to a tab-delimited file
``` clojure
(tc/write! DS "DF.txt" {:separator \tab})
```
10
or
``` clojure
(tc/write! DS "DF.tsv")
```
10
------------------------------------------------------------------------
Read a csv / tab-delimited file
``` clojure
(tc/dataset "DF.csv" {:key-fn keyword})
```
DF.csv \[9 3\]:
| :V4 | :V1 | :V2 |
|-----|----:|----:|
| A | 0 | 3 |
| A | 0 | 4 |
| A | 0 | 7 |
| B | 4 | 0 |
| B | 4 | 8 |
| B | 1 | 5 |
| C | 4 | 5 |
| C | 1 | 0 |
| C | 1 | 9 |
``` clojure
(tc/dataset "DF.txt" {:key-fn keyword})
```
DF.txt \[9 1\]:
| :V4 V1 V2 |
|-----------|
| A 0 3 |
| A 0 4 |
| A 0 7 |
| B 4 0 |
| B 4 8 |
| B 1 5 |
| C 4 5 |
| C 1 0 |
| C 1 9 |
``` clojure
(tc/dataset "DF.tsv" {:key-fn keyword})
```
DF.tsv \[9 3\]:
| :V4 | :V1 | :V2 |
|-----|----:|----:|
| A | 0 | 3 |
| A | 0 | 4 |
| A | 0 | 7 |
| B | 4 | 0 |
| B | 4 | 8 |
| B | 1 | 5 |
| C | 4 | 5 |
| C | 1 | 0 |
| C | 1 | 9 |
------------------------------------------------------------------------
Read a csv file selecting / droping columns
``` clojure
(tc/dataset "DF.csv" {:key-fn keyword
:column-whitelist ["V1" "V4"]})
```
DF.csv \[9 2\]:
| :V4 | :V1 |
|-----|----:|
| A | 0 |
| A | 0 |
| A | 0 |
| B | 4 |
| B | 4 |
| B | 1 |
| C | 4 |
| C | 1 |
| C | 1 |
``` clojure
(tc/dataset "DF.csv" {:key-fn keyword
:column-blacklist ["V4"]})
```
DF.csv \[9 2\]:
| :V1 | :V2 |
|----:|----:|
| 0 | 3 |
| 0 | 4 |
| 0 | 7 |
| 4 | 0 |
| 4 | 8 |
| 1 | 5 |
| 4 | 5 |
| 1 | 0 |
| 1 | 9 |
------------------------------------------------------------------------
Read and rbind several files
``` clojure
(apply tc/concat (map tc/dataset ["DF.csv" "DF.csv"]))
```
DF.csv \[18 3\]:
| V4 | V1 | V2 |
|-----|----:|----:|
| A | 0 | 3 |
| A | 0 | 4 |
| A | 0 | 7 |
| B | 4 | 0 |
| B | 4 | 8 |
| B | 1 | 5 |
| C | 4 | 5 |
| C | 1 | 0 |
| C | 1 | 9 |
| A | 0 | 3 |
| A | 0 | 4 |
| A | 0 | 7 |
| B | 4 | 0 |
| B | 4 | 8 |
| B | 1 | 5 |
| C | 4 | 5 |
| C | 1 | 0 |
| C | 1 | 9 |
##### Reshape data
Melt data (from wide to long)
``` clojure
(def mDS (tc/pivot->longer DS [:V1 :V2] {:target-columns :variable
:value-column-name :value}))
```
``` clojure
mDS
```
\_unnamed \[18 3\]:
| :V4 | :variable | :value |
|-----|-----------|-------:|
| A | :V1 | 0 |
| A | :V1 | 0 |
| A | :V1 | 0 |
| B | :V1 | 4 |
| B | :V1 | 4 |
| B | :V1 | 1 |
| C | :V1 | 4 |
| C | :V1 | 1 |
| C | :V1 | 1 |
| A | :V2 | 3 |
| A | :V2 | 4 |
| A | :V2 | 7 |
| B | :V2 | 0 |
| B | :V2 | 8 |
| B | :V2 | 5 |
| C | :V2 | 5 |
| C | :V2 | 0 |
| C | :V2 | 9 |
------------------------------------------------------------------------
Cast data (from long to wide)
``` clojure
(-> mDS
(tc/pivot->wider :variable :value {:fold-fn vec})
(tc/update-columns ["V1" "V2"] (partial map count)))
```
\_unnamed \[3 3\]:
| :V4 | :V1 | :V2 |
|-----|-----------|-----------|
| A | \[0 0 0\] | \[3 4 7\] |
| B | \[4 4 1\] | \[0 8 5\] |
| C | \[4 1 1\] | \[5 0 9\] |
``` clojure
(-> mDS
(tc/pivot->wider :variable :value {:fold-fn vec})
(tc/update-columns ["V1" "V2"] (partial map dfn/sum)))
```
\_unnamed \[3 3\]:
| :V4 | :V1 | :V2 |
|-----|-----------|-----------|
| A | \[0 0 0\] | \[3 4 7\] |
| B | \[4 4 1\] | \[0 8 5\] |
| C | \[4 1 1\] | \[5 0 9\] |
``` clojure
(-> mDS
(tc/map-columns :value #(str (> % 5))) ;; coerce to strings
(tc/pivot->wider :value :variable {:fold-fn vec})
(tc/update-columns ["true" "false"] (partial map #(if (sequential? %) (count %) 1))))
```
\_unnamed \[3 3\]:
| :V4 | false | true |
|-----|------:|-----:|
| A | 5 | 1 |
| B | 5 | 1 |
| C | 5 | 1 |
------------------------------------------------------------------------
Split
``` clojure
(tc/group-by DS :V4 {:result-type :as-map})
```
{“A” Group: A \[3 3\]:
| :V4 | :V1 | :V2 |
|-----|----:|----:|
| A | 0 | 3 |
| A | 0 | 4 |
| A | 0 | 7 |
, “B” Group: B \[3 3\]:
| :V4 | :V1 | :V2 |
|-----|----:|----:|
| B | 4 | 0 |
| B | 4 | 8 |
| B | 1 | 5 |
, “C” Group: C \[3 3\]:
| :V4 | :V1 | :V2 |
|-----|----:|----:|
| C | 4 | 5 |
| C | 1 | 0 |
| C | 1 | 9 |
}
------------------------------------------------------------------------
Split and transpose a vector/column
``` clojure
(-> {:a ["A:a" "B:b" "C:c"]}
(tc/dataset)
(tc/separate-column :a [:V1 :V2] ":"))
```
\_unnamed \[3 2\]:
| :V1 | :V2 |
|-----|-----|
| A | a |
| B | b |
| C | c |
##### Other
Skipped
#### Join/Bind data sets
``` clojure
(def x (tc/dataset {"Id" ["A" "B" "C" "C"]
"X1" [1 3 5 7]
"XY" ["x2" "x4" "x6" "x8"]}))
(def y (tc/dataset {"Id" ["A" "B" "B" "D"]
"Y1" [1 3 5 7]
"XY" ["y1" "y3" "y5" "y7"]}))
```
``` clojure
x y
```
\_unnamed \[4 3\]:
| Id | X1 | XY |
|-----|----:|-----|
| A | 1 | x2 |
| B | 3 | x4 |
| C | 5 | x6 |
| C | 7 | x8 |
\_unnamed \[4 3\]:
| Id | Y1 | XY |
|-----|----:|-----|
| A | 1 | y1 |
| B | 3 | y3 |
| B | 5 | y5 |
| D | 7 | y7 |
##### Join
Join matching rows from y to x
``` clojure
(tc/left-join x y "Id")
```
left-outer-join \[5 6\]:
| Id | X1 | XY | right.Id | Y1 | right.XY |
|-----|----:|-----|----------|----:|----------|
| A | 1 | x2 | A | 1 | y1 |
| B | 3 | x4 | B | 3 | y3 |
| B | 3 | x4 | B | 5 | y5 |
| C | 5 | x6 | | | |
| C | 7 | x8 | | | |
------------------------------------------------------------------------
Join matching rows from x to y
``` clojure
(tc/right-join x y "Id")
```
right-outer-join \[4 6\]:
| Id | X1 | XY | right.Id | Y1 | right.XY |
|-----|----:|-----|----------|----:|----------|
| A | 1 | x2 | A | 1 | y1 |
| B | 3 | x4 | B | 3 | y3 |
| B | 3 | x4 | B | 5 | y5 |
| | | | D | 7 | y7 |
------------------------------------------------------------------------
Join matching rows from both x and y
``` clojure
(tc/inner-join x y "Id")
```
inner-join \[3 5\]:
| Id | X1 | XY | Y1 | right.XY |
|-----|----:|-----|----:|----------|
| A | 1 | x2 | 1 | y1 |
| B | 3 | x4 | 3 | y3 |
| B | 3 | x4 | 5 | y5 |
------------------------------------------------------------------------
Join keeping all the rows
``` clojure
(tc/full-join x y "Id")
```
full-join \[6 6\]:
| Id | X1 | XY | right.Id | Y1 | right.XY |
|-----|----:|-----|----------|----:|----------|
| A | 1 | x2 | A | 1 | y1 |
| B | 3 | x4 | B | 3 | y3 |
| B | 3 | x4 | B | 5 | y5 |
| C | 5 | x6 | | | |
| C | 7 | x8 | | | |
| | | | D | 7 | y7 |
------------------------------------------------------------------------
Return rows from x matching y
``` clojure
(tc/semi-join x y "Id")
```
semi-join \[2 3\]:
| Id | X1 | XY |
|-----|----:|-----|
| A | 1 | x2 |
| B | 3 | x4 |
------------------------------------------------------------------------
Return rows from x not matching y
``` clojure
(tc/anti-join x y "Id")
```
anti-join \[2 3\]:
| Id | X1 | XY |
|-----|----:|-----|
| C | 5 | x6 |
| C | 7 | x8 |
##### More joins
Select columns while joining
``` clojure
(tc/right-join (tc/select-columns x ["Id" "X1"])
(tc/select-columns y ["Id" "XY"])
"Id")
```
right-outer-join \[4 4\]:
| Id | X1 | right.Id | XY |
|-----|----:|----------|-----|
| A | 1 | A | y1 |
| B | 3 | B | y3 |
| B | 3 | B | y5 |
| | | D | y7 |
``` clojure
(tc/right-join (tc/select-columns x ["Id" "XY"])
(tc/select-columns y ["Id" "XY"])
"Id")
```
right-outer-join \[4 4\]:
| Id | XY | right.Id | right.XY |
|-----|-----|----------|----------|
| A | x2 | A | y1 |
| B | x4 | B | y3 |
| B | x4 | B | y5 |
| | | D | y7 |
Aggregate columns while joining
``` clojure
(-> y
(tc/group-by ["Id"])
(tc/aggregate {"sumY1" #(dfn/sum (% "Y1"))})
(tc/right-join x "Id")
(tc/add-column "X1Y1" (fn [ds] (dfn/* (ds "sumY1")
(ds "X1"))))
(tc/select-columns ["right.Id" "X1Y1"]))
```
right-outer-join \[4 2\]:
| right.Id | X1Y1 |
|----------|-----:|
| A | 1.0 |
| B | 24.0 |
| C | |
| C | |
Update columns while joining
``` clojure
(-> x
(tc/select-columns ["Id" "X1"])
(tc/map-columns "SqX1" "X1" (fn [x] (* x x)))
(tc/right-join y "Id")
(tc/drop-columns ["X1" "Id"]))
```
right-outer-join \[4 4\]:
| SqX1 | right.Id | Y1 | XY |
|-----:|----------|----:|-----|
| 1 | A | 1 | y1 |
| 9 | B | 3 | y3 |
| 9 | B | 5 | y5 |
| | D | 7 | y7 |
------------------------------------------------------------------------
Adds a list column with rows from y matching x (nest-join)
``` clojure
(-> (tc/left-join x y "Id")
(tc/drop-columns ["right.Id"])
(tc/fold-by (tc/column-names x)))
```
\_unnamed \[4 5\]:
| Id | X1 | XY | Y1 | right.XY |
|-----|----:|-----|---------|---------------|
| A | 1 | x2 | \[1\] | \[“y1”\] |
| B | 3 | x4 | \[3 5\] | \[“y3” “y5”\] |
| C | 5 | x6 | \[\] | \[\] |
| C | 7 | x8 | \[\] | \[\] |
------------------------------------------------------------------------
Some joins are skipped
------------------------------------------------------------------------
Cross join
``` clojure
(def cjds (tc/dataset {:V1 [[2 1 1]]
:V2 [[3 2]]}))
```
``` clojure
cjds
```
\_unnamed \[1 2\]:
| :V1 | :V2 |
|-----------|---------|
| \[2 1 1\] | \[3 2\] |
``` clojure
(reduce #(tc/unroll %1 %2) cjds (tc/column-names cjds))
```
\_unnamed \[6 2\]:
| :V1 | :V2 |
|----:|----:|
| 2 | 3 |
| 2 | 2 |
| 1 | 3 |
| 1 | 2 |
| 1 | 3 |
| 1 | 2 |
``` clojure
(-> (reduce #(tc/unroll %1 %2) cjds (tc/column-names cjds))
(tc/unique-by))
```
\_unnamed \[4 2\]:
| :V1 | :V2 |
|----:|----:|
| 2 | 3 |
| 2 | 2 |
| 1 | 3 |
| 1 | 2 |
##### Bind
``` clojure
(def x (tc/dataset {:V1 [1 2 3]}))
(def y (tc/dataset {:V1 [4 5 6]}))
(def z (tc/dataset {:V1 [7 8 9]
:V2 [0 0 0]}))
```
``` clojure
x y z
```
\_unnamed \[3 1\]:
| :V1 |
|----:|
| 1 |
| 2 |
| 3 |
\_unnamed \[3 1\]:
| :V1 |
|----:|
| 4 |
| 5 |
| 6 |
\_unnamed \[3 2\]:
| :V1 | :V2 |
|----:|----:|
| 7 | 0 |
| 8 | 0 |
| 9 | 0 |
------------------------------------------------------------------------
Bind rows
``` clojure
(tc/bind x y)
```
\_unnamed \[6 1\]:
| :V1 |
|----:|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
``` clojure
(tc/bind x z)
```
\_unnamed \[6 2\]:
| :V1 | :V2 |
|----:|----:|
| 1 | |
| 2 | |
| 3 | |
| 7 | 0 |
| 8 | 0 |
| 9 | 0 |
------------------------------------------------------------------------
Bind rows using a list
``` clojure
(->> [x y]
(map-indexed #(tc/add-column %2 :id (repeat %1)))
(apply tc/bind))
```
\_unnamed \[6 2\]:
| :V1 | :id |
|----:|----:|
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 1 |
| 5 | 1 |
| 6 | 1 |
------------------------------------------------------------------------
Bind columns
``` clojure
(tc/append x y)
```
\_unnamed \[3 2\]:
| :V1 | :V1 |
|----:|----:|
| 1 | 4 |
| 2 | 5 |
| 3 | 6 |
##### Set operations
``` clojure
(def x (tc/dataset {:V1 [1 2 2 3 3]}))
(def y (tc/dataset {:V1 [2 2 3 4 4]}))
```
``` clojure
x y
```
\_unnamed \[5 1\]:
| :V1 |
|----:|
| 1 |
| 2 |
| 2 |
| 3 |
| 3 |
\_unnamed \[5 1\]:
| :V1 |
|----:|
| 2 |
| 2 |
| 3 |
| 4 |
| 4 |
------------------------------------------------------------------------
Intersection
``` clojure
(tc/intersect x y)
```
intersection \[2 1\]:
| :V1 |
|----:|
| 2 |
| 3 |
------------------------------------------------------------------------
Difference
``` clojure
(tc/difference x y)
```
difference \[1 1\]:
| :V1 |
|----:|
| 1 |
------------------------------------------------------------------------
Union
``` clojure
(tc/union x y)
```
union \[4 1\]:
| :V1 |
|----:|
| 1 |
| 2 |
| 3 |
| 4 |
``` clojure
(tc/concat x y)
```
\_unnamed \[10 1\]:
| :V1 |
|----:|
| 1 |
| 2 |
| 2 |
| 3 |
| 3 |
| 2 |
| 2 |
| 3 |
| 4 |
| 4 |
------------------------------------------------------------------------
Equality not implemented