Column API

A column in Tablecloth is a named sequence of typed data. This special type is defined in tech.ml.dataset. It is roughly comparable to an R vector.

Column Creation

Empty column

(tcc/column)
#tech.v3.dataset.column<boolean>[0]
null
[]

Column from a vector or a sequence

(tcc/column [1 2 3 4 5])
#tech.v3.dataset.column<int64>[5]
null
[1, 2, 3, 4, 5]
(tcc/column '(1 2 3 4 5))
#tech.v3.dataset.column<int64>[5]
null
[1, 2, 3, 4, 5]

Ones & Zeros

You can also quickly create columns of ones or zeros:

(tcc/ones 10)
#tech.v3.dataset.column<int64>[10]
null
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
(tcc/zeros 10)
#tech.v3.dataset.column<int64>[10]
null
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Column?

Finally, you can use the column? function to check if an item is a column:

(tcc/column? [1 2 3 4 5])
false
(tcc/column? (tcc/column))
true

Tablecloth’s datasets, of course, consist of columns:

(tcc/column? (-> (tc/dataset {:a [1 2 3 4 5]})
                 :a))
true

Types and Type detection

The default set of types for a column is defined in the underlying “tech ml” system. We can see the set here:

(tech.v3.datatype.casting/all-datatypes)
(:int32
 :int16
 :float32
 :float64
 :int64
 :uint64
 :string
 :uint16
 :int8
 :uint32
 :keyword
 :decimal
 :uuid
 :boolean
 :object
 :char
 :uint8)

Typeof & Typeof?

When you create a column, the underlying system will try to autodetect its type. We can see that here using the tcc/typeof function to check the type of a column:

(-> (tcc/column [1 2 3 4 5])
    (tcc/typeof))
:int64
(-> (tcc/column [:a :b :c :d :e])
    (tcc/typeof))
:keyword

Columns containing heterogeneous data will receive type :object:

(-> (tcc/column [1 :b 3 :c 5])
    (tcc/typeof))
:object

You can also use the tcc/typeof? function to assert that a column has a given type:

(-> (tcc/column [1 2 3 4 6])
    (tcc/typeof? :boolean))
false
(-> (tcc/column [1 2 3 4 6])
    (tcc/typeof? :int64))
true

Tablecloth has a concept of “concrete” and “general” types. A general type is a broad category of types, and a concrete type is the actual type in memory. For example, the concrete type :int64 (a 64-bit integer) belongs to the general type :integer. The typeof? function supports checking both:

(-> (tcc/column [1 2 3 4 6])
    (tcc/typeof? :int64))
true
(-> (tcc/column [1 2 3 4 6])
    (tcc/typeof? :integer))
true
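
Because typeof? accepts general types, it can serve as a guard before numeric work. The following is a minimal sketch, assuming tablecloth.column.api is aliased as tcc as in the examples above. integer-column? is a hypothetical helper for illustration, not part of Tablecloth:

```clojure
(require '[tablecloth.column.api :as tcc])

;; Hypothetical helper (not part of Tablecloth): true when x is a column
;; whose elements belong to the general :integer type.
(defn integer-column? [x]
  (and (tcc/column? x)
       (tcc/typeof? x :integer)))

(integer-column? (tcc/column [1 2 3 4 6])) ;; int64 column, so the :integer check passes
(integer-column? [1 2 3 4 6])              ;; plain vector: column? fails first
```

Checking column? first means the helper never calls typeof? on something that is not a column.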

Column Access & Manipulation

Column Access

The method for accessing a particular index position in a column is the same as for Clojure vectors:

(-> (tcc/column [1 2 3 4 5])
    (get 3))
4
(-> (tcc/column [1 2 3 4 5])
    (nth 3))
4

Slice

You can also slice a column. The start and end indices are both inclusive, and an optional third argument sets the step:

(-> (tcc/column (range 10))
    (tcc/slice 5))
#tech.v3.dataset.column<int64>[5]
null
[5, 6, 7, 8, 9]
(-> (tcc/column (range 10))
    (tcc/slice 1 4))
#tech.v3.dataset.column<int64>[4]
null
[1, 2, 3, 4]
(-> (tcc/column (range 10))
    (tcc/slice 0 9 2))
#tech.v3.dataset.column<int64>[5]
null
[0, 2, 4, 6, 8]

For clarity, the slice function also accepts the :start and :end keywords in place of indices:

(-> (tcc/column (range 10))
    (tcc/slice :start :end 2))
#tech.v3.dataset.column<int64>[5]
null
[0, 2, 4, 6, 8]
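
Reading the outputs above, the end index is included in the result and the third argument is a step. Here is a small sketch of how the two combine, assuming the tcc alias for tablecloth.column.api and that vec can realize a column into a plain Clojure vector:

```clojure
(require '[tablecloth.column.api :as tcc])

(def digits (tcc/column (range 10)))

;; Every second element from index 2 through index 6 (both ends included).
(def evens-2-to-6 (tcc/slice digits 2 6 2))

;; Realize the column as a plain Clojure vector for inspection.
(vec evens-2-to-6) ;; => [2 4 6]
```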

Select

If you need a discontinuous subset of the column, use the select function. It accepts a sequence of index positions or a sequence of booleans. With a boolean selection, the values at the positions holding true are kept.

Select the values at index positions 1 and 9:

(-> (tcc/column (range 10))
    (tcc/select [1 9]))
#tech.v3.dataset.column<int64>[2]
null
[1, 9]

Select the values at index positions 0 and 2 using boolean select:

(-> (tcc/column (range 10))
    (tcc/select (tcc/column [true false true])))
#tech.v3.dataset.column<int64>[2]
null
[0, 2]
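
A boolean selection does not have to be written by hand. The sketch below assumes the tcc alias for tablecloth.column.api and builds the mask with the tcc/< comparison described under Column Operations:

```clojure
(require '[tablecloth.column.api :as tcc])

(def prices (tcc/column [20 30 40 50]))

;; Element-wise comparison yields a boolean column: [true, true, false, false]
(def cheap? (tcc/< prices 35))

;; Keep only the positions where the mask is true.
(def cheap-prices (tcc/select prices cheap?))

(vec cheap-prices) ;; => [20 30]
```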

Sort

Use sort-column to sort a column. The default sort order is ascending:

(-> (tcc/column [:c :z :a :f])
    (tcc/sort-column))
#tech.v3.dataset.column<keyword>[4]
null
[:a, :c, :f, :z]

You can provide the :desc and :asc keywords to change the default behavior:

(-> (tcc/column [:c :z :a :f])
    (tcc/sort-column :desc))
#tech.v3.dataset.column<keyword>[4]
null
[:z, :f, :c, :a]

You can also provide a comparator fn:

(-> (tcc/column [{:position 2
                  :text "and then stopped"}
                 {:position 1
                  :text "I ran fast"}])
    (tcc/sort-column (fn [a b] (< (:position a) (:position b)))))
#tech.v3.dataset.column<persistent-map>[2]
null
[{:position 1, :text "I ran fast"}, {:position 2, :text "and then stopped"}]
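
Sorting composes with slicing. Here is a sketch that keeps the three largest values, assuming the tcc alias for tablecloth.column.api and relying on the inclusive end index shown in the Slice section:

```clojure
(require '[tablecloth.column.api :as tcc])

(def scores (tcc/column [72 95 88 60 91]))

;; Sort descending, then keep index positions 0 through 2.
(def top-three
  (-> scores
      (tcc/sort-column :desc)
      (tcc/slice 0 2)))

(vec top-three) ;; => [95 91 88]
```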

Column Operations

The Column API contains a large number of operations. Each takes one or more columns as arguments and returns either a scalar value or a new column, depending on the operation. As with all functions in Tablecloth, these operations generally take a column as the first argument, so they are easy to use with the thread-first macro ->.

(def a (tcc/column [20 30 40 50]))
(def b (tcc/column (range 4)))
(tcc/- a b)
#tech.v3.dataset.column<int64>[4]
null
[20, 29, 38, 47]
(tcc/pow a 2)
#tech.v3.dataset.column<float64>[4]
null
[400.0, 900.0, 1600, 2500]
(tcc/* 10 (tcc/sin a))
#tech.v3.dataset.column<float64>[4]
null
[9.129, -9.880, 7.451, -2.624]
(tcc/< a 35)
#tech.v3.dataset.column<boolean>[4]
null
[true, true, false, false]
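
The examples above all return columns. For the scalar case, here is a sketch that assumes tcc/sum and tcc/mean are among the reductions exposed by tablecloth.column.api. Neither is shown elsewhere in this document, so check them against your version:

```clojure
(require '[tablecloth.column.api :as tcc])

(def a (tcc/column [20 30 40 50]))

;; Assumed reductions: each collapses the column to a single number.
(tcc/sum a)   ;; total of the elements (20 + 30 + 40 + 50 = 140)
(tcc/mean a)  ;; arithmetic mean (140 / 4 = 35)
```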

Operations that take a column as their first argument and return a column can be chained easily:

(-> a
    (tcc/* b)
    (tcc/< 70))
#tech.v3.dataset.column<boolean>[4]
null
[true, true, false, false]
source: notebooks/column_api.clj