{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": [
"# 40 kotlin-dataframe puzzles\n",
"inspired by [100 pandas puzzles](https://github.com/ajcr/100-pandas-puzzles)"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Importing kotlin-dataframe\n",
"### Getting started\n",
"Difficulty: easy\n",
"\n",
"**1.** Import kotlin-dataframe"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:41.542631050Z",
"start_time": "2025-12-18T11:19:40.843063224Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_3_jupyter",
"Line_4_jupyter",
"Line_5_jupyter",
"Line_6_jupyter",
"Line_7_jupyter",
"Line_8_jupyter",
"Line_9_jupyter",
"Line_10_jupyter"
]
}
},
"cell_type": "code",
"source": [
"%useLatestDescriptors\n",
"%use dataframe"
],
"outputs": [],
"execution_count": 1
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## DataFrame Basics\n",
"### A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames\n",
"Difficulty: easy\n",
"\n",
"Consider the following columns:\n",
"```[kotlin]\n",
"columnOf(\"cat\", \"cat\", \"snake\", \"dog\", \"dog\", \"cat\", \"snake\", \"cat\", \"dog\", \"dog\")\n",
"columnOf(2.5, 3.0, 0.5, Double.NaN, 5.0, 2.0, 4.5, Double.NaN, 7, 3)\n",
"columnOf(1, 3, 2, 3, 2, 3, 1, 1, 2, 1)\n",
"columnOf(\"yes\", \"yes\", \"no\", \"yes\", \"no\", \"no\", \"no\", \"yes\", \"no\", \"no\")\n",
"```\n",
"**2.** Create a DataFrame df from this columns."
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:42.256463391Z",
"start_time": "2025-12-18T11:19:41.547295069Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_11_jupyter",
"Line_12_jupyter",
"Line_13_jupyter"
]
}
},
"cell_type": "code",
"source": [
"val df = dataFrameOf(\n",
" \"animal\" to columnOf(\"cat\", \"cat\", \"snake\", \"dog\", \"dog\", \"cat\", \"snake\", \"cat\", \"dog\", \"dog\"),\n",
" \"age\" to columnOf(2.5, 3.0, 0.5, Double.NaN, 5.0, 2.0, 4.5, Double.NaN, 7.0, 3.0),\n",
" \"visits\" to columnOf(1, 3, 2, 3, 2, 3, 1, 1, 2, 1),\n",
" \"priority\" to columnOf(\"yes\", \"yes\", \"no\", \"yes\", \"no\", \"no\", \"no\", \"yes\", \"no\", \"no\"),\n",
")\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
"
\n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| snake | 0.500000 | 2 | no |
| dog | NaN | 3 | yes |
| dog | 5.000000 | 2 | no |
| cat | 2.000000 | 3 | no |
| snake | 4.500000 | 1 | no |
| cat | NaN | 1 | yes |
| dog | 7.000000 | 2 | no |
| dog | 3.000000 | 1 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":10,\"ncol\":4,\"is_formatted\":false},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":5.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":7.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"}]}"
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 2
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**3.** Display a summary of the basic information about this DataFrame and its data."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:42.668067585Z",
"start_time": "2025-12-18T11:19:42.493753232Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_15_jupyter"
]
}
},
"cell_type": "code",
"source": "df.schema()",
"outputs": [
{
"data": {
"text/plain": [
"animal: String\n",
"age: Double\n",
"visits: Int\n",
"priority: String"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 3
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:43.024905933Z",
"start_time": "2025-12-18T11:19:42.687874922Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_16_jupyter"
]
}
},
"cell_type": "code",
"source": "df.describe()",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | name | type | count | unique | nulls | top | freq | mean | std | min | p25 | median | p75 | max |
|---|
| animal | String | 10 | 3 | 0 | cat | 4 | null | null | cat | cat | dog | dog | snake |
| age | Double | 10 | 8 | 0 | 3.000000 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| visits | Int | 10 | 3 | 0 | 1 | 4 | 1.900000 | 0.875595 | 1 | 1.000000 | 2.000000 | 3.000000 | 3 |
| priority | String | 10 | 2 | 0 | no | 6 | null | null | no | no | no | yes | yes |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"name\",\"type\",\"count\",\"unique\",\"nulls\",\"top\",\"freq\",\"mean\",\"std\",\"min\",\"p25\",\"median\",\"p75\",\"max\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Comparable<*>\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Comparable<*>\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Comparable<*>\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Comparable<*>\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Comparable<*>\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Comparable<*>\"}],\"nrow\":4,\"ncol\":14,\"is_formatted\":false},\"kotlin_dataframe\":[{\"name\":\"animal\",\"type\":\"String\",\"count\":10,\"unique\":3,\"nulls\":0,\"top\":\"cat\",\"freq\":4,\"mean\":null,\"std\":null,\"min\":\"cat\",\"p25\":\"cat\",\"median\":\"dog\",\"p75\":\"dog\",\"max\":\"snake\"},{\"name\":\"age\",\"type\":\"Double\",\"count\":10,\"unique\":8,\"nulls\":0,\"top\":\"3.0\",\"freq\":2,\"mean\":NaN,\"std\":NaN,\"min\":\"NaN\",\"p25\":\"NaN\",\"median\":\"NaN\",\"p75\":\"NaN\",\"max\":\"NaN\"},{\"name\":\"visits\",\"type\":\"Int\",\"count\":10,\"unique\":3,\"nulls\":0,\"top\":\"1\",\"freq\":4,\"mean\":1.9,\"std\":0.8755950357709131,\"min\":\"1\",\"p25\":\"1.0\",\"median\":\"2.0\",\"p75\":\"3.0\",\"max\":\"3\"},{\"name\":\"priority\",\"type\":\"String\",\"count\":10,\"unique\":2,\"nulls\":0,\"top\":\"no\",\"freq\":6,\"mean\":null,\"std\":null,\"min\":\"no\",\"p25\":\"no\",\"median\":\"no\",\"p75\":\"yes\",\"max\":\"yes\"}]}"
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 4
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**4.** Return the first 3 rows of the DataFrame df."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:43.282299347Z",
"start_time": "2025-12-18T11:19:43.094411261Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_18_jupyter"
]
}
},
"cell_type": "code",
"source": [
"df[0 ..< 3] // df[0..2]\n",
"\n",
"// or equivalently\n",
"\n",
"df.head(3)\n",
"\n",
"// or\n",
"\n",
"df.take(3)"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| snake | 0.500000 | 2 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":3,\"ncol\":4,\"is_formatted\":false},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"}]}"
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 5
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**5.** Select \"animal\" and \"age\" columns from the DataFrame df."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:43.547372659Z",
"start_time": "2025-12-18T11:19:43.339126731Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_20_jupyter"
]
}
},
"cell_type": "code",
"source": "df.select { animal and age }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age |
|---|
| cat | 2.500000 |
| cat | 3.000000 |
| snake | 0.500000 |
| dog | NaN |
| dog | 5.000000 |
| cat | 2.000000 |
| snake | 4.500000 |
| cat | NaN |
| dog | 7.000000 |
| dog | 3.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"animal\",\"age\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":10,\"ncol\":2,\"is_formatted\":false},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5},{\"animal\":\"cat\",\"age\":3.0},{\"animal\":\"snake\",\"age\":0.5},{\"animal\":\"dog\",\"age\":NaN},{\"animal\":\"dog\",\"age\":5.0},{\"animal\":\"cat\",\"age\":2.0},{\"animal\":\"snake\",\"age\":4.5},{\"animal\":\"cat\",\"age\":NaN},{\"animal\":\"dog\",\"age\":7.0},{\"animal\":\"dog\",\"age\":3.0}]}"
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 6
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**6.** Select the data in rows [3, 4, 8] and in columns [\"animal\", \"age\"]."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:43.840786386Z",
"start_time": "2025-12-18T11:19:43.614032321Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_22_jupyter"
]
}
},
"cell_type": "code",
"source": "df[3, 4, 8].select { animal and age }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age |
|---|
| dog | NaN |
| dog | 5.000000 |
| dog | 7.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"animal\",\"age\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":3,\"ncol\":2,\"is_formatted\":false},\"kotlin_dataframe\":[{\"animal\":\"dog\",\"age\":NaN},{\"animal\":\"dog\",\"age\":5.0},{\"animal\":\"dog\",\"age\":7.0}]}"
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 7
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**7.** Select only the rows where the number of visits is greater than 2."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:44.130666181Z",
"start_time": "2025-12-18T11:19:43.890826650Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_24_jupyter"
]
}
},
"cell_type": "code",
"source": "df.filter { visits > 2 }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 3.000000 | 3 | yes |
| dog | NaN | 3 | yes |
| cat | 2.000000 | 3 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":3,\"ncol\":4,\"is_formatted\":false},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"}]}"
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 8
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**8.** Select the rows where the age is missing, i.e. it is NaN."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:44.432826220Z",
"start_time": "2025-12-18T11:19:44.195451619Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_26_jupyter"
]
}
},
"cell_type": "code",
"source": "df.filter { age.isNaN() }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| dog | NaN | 3 | yes |
| cat | NaN | 1 | yes |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":2,\"ncol\":4,\"is_formatted\":false},\"kotlin_dataframe\":[{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"}]}"
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 9
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**9.** Select the rows where the animal is a cat and the age is less than 3."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:44.731860002Z",
"start_time": "2025-12-18T11:19:44.516203139Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_28_jupyter"
]
}
},
"cell_type": "code",
"source": "df.filter { animal == \"cat\" && age < 3 }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 2.000000 | 3 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":2,\"ncol\":4,\"is_formatted\":false},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"}]}"
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 10
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**10.** Select the rows where age is between 2 and 4 (inclusive)."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:45.067952922Z",
"start_time": "2025-12-18T11:19:44.828539799Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_30_jupyter"
]
}
},
"cell_type": "code",
"source": "df.filter { age in 2.0..4.0 }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| cat | 2.000000 | 3 | no |
| dog | 3.000000 | 1 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":4,\"ncol\":4,\"is_formatted\":false},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"}]}"
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 11
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**11.** Change the age in row 5 to 1.5"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:45.321374555Z",
"start_time": "2025-12-18T11:19:45.119529698Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_32_jupyter"
]
}
},
"cell_type": "code",
"source": "df.update { age }.at(5).with { 1.5 }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| snake | 0.500000 | 2 | no |
| dog | NaN | 3 | yes |
| dog | 5.000000 | 2 | no |
| cat | 1.500000 | 3 | no |
| snake | 4.500000 | 1 | no |
| cat | NaN | 1 | yes |
| dog | 7.000000 | 2 | no |
| dog | 3.000000 | 1 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":10,\"ncol\":4,\"is_formatted\":false},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":5.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":1.5,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":7.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"}]}"
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 12
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**12.** Calculate the sum of all visits in df (i.e. the total number of visits)."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:45.509286538Z",
"start_time": "2025-12-18T11:19:45.377472164Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_34_jupyter"
]
}
},
"cell_type": "code",
"source": "df.visits.sum()",
"outputs": [
{
"data": {
"text/plain": [
"19"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 13
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**13.** Calculate the mean age for each different animal in df."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:45.733964446Z",
"start_time": "2025-12-18T11:19:45.522355616Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_35_jupyter"
]
}
},
"cell_type": "code",
"source": "df.groupBy { animal }.mean { age }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age |
|---|
| cat | NaN |
| snake | 2.500000 |
| dog | NaN |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"animal\",\"age\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":3,\"ncol\":2,\"is_formatted\":false},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":NaN},{\"animal\":\"snake\",\"age\":2.5},{\"animal\":\"dog\",\"age\":NaN}]}"
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 14
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**14.** Append a new row to df with your choice of values for each column. Then delete that row to return the original DataFrame."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:45.965591312Z",
"start_time": "2025-12-18T11:19:45.805912316Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_37_jupyter"
]
}
},
"cell_type": "code",
"source": [
"val modifiedDf = df.append(\"dog\", 5.5, 2, \"no\")\n",
"modifiedDf.dropLast()"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| snake | 0.500000 | 2 | no |
| dog | NaN | 3 | yes |
| dog | 5.000000 | 2 | no |
| cat | 2.000000 | 3 | no |
| snake | 4.500000 | 1 | no |
| cat | NaN | 1 | yes |
| dog | 7.000000 | 2 | no |
| dog | 3.000000 | 1 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":10,\"ncol\":4,\"is_formatted\":false},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":5.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":7.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"}]}"
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 15
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**15.** Count the number of each type of animal in df."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:46.367888372Z",
"start_time": "2025-12-18T11:19:46.058158419Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_39_jupyter"
]
}
},
"cell_type": "code",
"source": "df.groupBy { animal }.count()",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | count |
|---|
| cat | 4 |
| snake | 2 |
| dog | 4 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"animal\",\"count\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":3,\"ncol\":2,\"is_formatted\":false},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"count\":4},{\"animal\":\"snake\",\"count\":2},{\"animal\":\"dog\",\"count\":4}]}"
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 16
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**16.** Sort df first by the values in the 'age' in descending order, then by the value in the 'visits' column in ascending order."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:46.660623189Z",
"start_time": "2025-12-18T11:19:46.449941128Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_41_jupyter"
]
}
},
"cell_type": "code",
"source": "df.sortBy { age.desc() and visits }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | NaN | 1 | yes |
| dog | NaN | 3 | yes |
| dog | 7.000000 | 2 | no |
| dog | 5.000000 | 2 | no |
| snake | 4.500000 | 1 | no |
| dog | 3.000000 | 1 | no |
| cat | 3.000000 | 3 | yes |
| cat | 2.500000 | 1 | yes |
| cat | 2.000000 | 3 | no |
| snake | 0.500000 | 2 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":10,\"ncol\":4,\"is_formatted\":false},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":7.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":5.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"}]}"
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 17
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**17.** The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be True and 'no' should be False."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:46.937073806Z",
"start_time": "2025-12-18T11:19:46.722579483Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_43_jupyter"
]
}
},
"cell_type": "code",
"source": "df.convert { priority }.with { it == \"yes\" }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | true |
| cat | 3.000000 | 3 | true |
| snake | 0.500000 | 2 | false |
| dog | NaN | 3 | true |
| dog | 5.000000 | 2 | false |
| cat | 2.000000 | 3 | false |
| snake | 4.500000 | 1 | false |
| cat | NaN | 1 | true |
| dog | 7.000000 | 2 | false |
| dog | 3.000000 | 1 | false |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Boolean\"}],\"nrow\":10,\"ncol\":4,\"is_formatted\":false},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":true},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":true},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":false},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":true},{\"animal\":\"dog\",\"age\":5.0,\"visits\":2,\"priority\":false},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":false},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":false},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":true},{\"animal\":\"dog\",\"age\":7.0,\"visits\":2,\"priority\":false},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":false}]}"
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 18
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**18.** In the 'animal' column, change the 'dog' entries to 'corgi'."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:47.183097405Z",
"start_time": "2025-12-18T11:19:46.992589282Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_45_jupyter"
]
}
},
"cell_type": "code",
"source": "df.update { animal }.where { it == \"dog\" }.with { \"corgi\" }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| snake | 0.500000 | 2 | no |
| corgi | NaN | 3 | yes |
| corgi | 5.000000 | 2 | no |
| cat | 2.000000 | 3 | no |
| snake | 4.500000 | 1 | no |
| cat | NaN | 1 | yes |
| corgi | 7.000000 | 2 | no |
| corgi | 3.000000 | 1 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":10,\"ncol\":4,\"is_formatted\":false},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"corgi\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"corgi\",\"age\":5.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"corgi\",\"age\":7.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"corgi\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"}]}"
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 19
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**19.** For each animal type and each number of visits, find the mean age.\n",
"\n",
"In other words, each row should be an animal, there should be a column for each of the number of visits and the values should be the mean ages."
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:47.538227593Z",
"start_time": "2025-12-18T11:19:47.243994037Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_47_jupyter"
]
}
},
"cell_type": "code",
"source": "df.pivot { visits }.groupBy { animal }.mean(skipNaN = true) { age }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | visits | | |
|---|
| 1 | 3 | 2 |
|---|
| cat | 2.500000 | 2.500000 | null |
| snake | 4.500000 | null | 0.500000 |
| dog | 3.000000 | NaN | 6.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"animal\",\"visits\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ColumnGroup\"}],\"nrow\":3,\"ncol\":2,\"is_formatted\":false},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"visits\":{\"data\":{\"1\":2.5,\"3\":2.5,\"2\":null},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"1\",\"3\",\"2\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"}]}}},{\"animal\":\"snake\",\"visits\":{\"data\":{\"1\":4.5,\"3\":null,\"2\":0.5},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"1\",\"3\",\"2\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"}]}}},{\"animal\":\"dog\",\"visits\":{\"data\":{\"1\":3.0,\"3\":NaN,\"2\":6.0},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"1\",\"3\",\"2\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"}]}}}]}"
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 20
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## DataFrame: beyond the basics\n",
"### Slightly trickier: you may need to combine two or more methods to get the right answer\n",
"Difficulty: medium\n",
"\n",
"The previous section was tour through some basic but essential DataFrame operations.\n",
"Below are some ways that you might need to cut your data, but for which there is no single \"out-of-the-box\" method."
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**20.** You have a DataFrame df with a column 'A' of integers. For example:\n",
"```kotlin\n",
"val df = dataFrameOf(\"A\" to columnOf(1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7))\n",
"```\n",
"How do you filter out rows which contain the same integer as the row immediately above?\n",
"\n",
"You should be left with a column containing the following values:\n",
"```\n",
"1, 2, 3, 4, 5, 6, 7\n",
"```"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:47.812973387Z",
"start_time": "2025-12-18T11:19:47.592009662Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_49_jupyter",
"Line_50_jupyter",
"Line_51_jupyter"
]
}
},
"cell_type": "code",
"source": [
"val df = dataFrameOf(\"A\" to columnOf(1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7))\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"A\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":11,\"ncol\":1,\"is_formatted\":false},\"kotlin_dataframe\":[{\"A\":1},{\"A\":2},{\"A\":2},{\"A\":3},{\"A\":4},{\"A\":5},{\"A\":5},{\"A\":5},{\"A\":6},{\"A\":7},{\"A\":7}]}"
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 21
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:48.106029432Z",
"start_time": "2025-12-18T11:19:47.883015099Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_53_jupyter"
]
}
},
"cell_type": "code",
"source": "df.filter { prev()?.A != A }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"A\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":7,\"ncol\":1,\"is_formatted\":false},\"kotlin_dataframe\":[{\"A\":1},{\"A\":2},{\"A\":3},{\"A\":4},{\"A\":5},{\"A\":6},{\"A\":7}]}"
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 22
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:48.354220694Z",
"start_time": "2025-12-18T11:19:48.153924502Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_55_jupyter"
]
}
},
"cell_type": "code",
"source": "df.filter { diffOrNull { A } != 0 }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"A\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":7,\"ncol\":1,\"is_formatted\":false},\"kotlin_dataframe\":[{\"A\":1},{\"A\":2},{\"A\":3},{\"A\":4},{\"A\":5},{\"A\":6},{\"A\":7}]}"
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 23
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We could use `distinct()` here but it won't work as desired if A is [1, 1, 2, 2, 1, 1] for example."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:48.524516883Z",
"start_time": "2025-12-18T11:19:48.428439626Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_57_jupyter"
]
}
},
"cell_type": "code",
"source": "df.distinct()",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"A\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":7,\"ncol\":1,\"is_formatted\":false},\"kotlin_dataframe\":[{\"A\":1},{\"A\":2},{\"A\":3},{\"A\":4},{\"A\":5},{\"A\":6},{\"A\":7}]}"
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 24
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**21.** Given a DataFrame of random numeric values:\n",
"```kotlin\n",
"val df = dataFrameOf(\"a\", \"b\", \"c\").randomDouble(5) // this is a 5x3 DataFrame of double values\n",
"```\n",
"\n",
"how do you subtract the row mean from each element in the row?"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:48.976011050Z",
"start_time": "2025-12-18T11:19:48.636984362Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_58_jupyter",
"Line_59_jupyter",
"Line_60_jupyter"
]
}
},
"cell_type": "code",
"source": [
"val df = dataFrameOf(\"a\", \"b\", \"c\").randomDouble(5)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | a | b | c |
|---|
| 0.389647 | 0.552994 | 0.853043 |
| 0.788586 | 0.127921 | 0.833489 |
| 0.858807 | 0.673938 | 0.781276 |
| 0.883197 | 0.765814 | 0.929804 |
| 0.931211 | 0.402562 | 0.438368 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"a\",\"b\",\"c\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":5,\"ncol\":3,\"is_formatted\":false},\"kotlin_dataframe\":[{\"a\":0.38964725344768747,\"b\":0.552993764027322,\"c\":0.8530430376111371},{\"a\":0.7885858820136266,\"b\":0.12792147763728656,\"c\":0.8334889732578966},{\"a\":0.8588073780857777,\"b\":0.673938093076138,\"c\":0.7812759099616297},{\"a\":0.8831974604892261,\"b\":0.765814199023807,\"c\":0.9298044650001936},{\"a\":0.931211129866274,\"b\":0.40256161724758266,\"c\":0.4383683880175443}]}"
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 25
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:49.490759745Z",
"start_time": "2025-12-18T11:19:49.100396054Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_63_jupyter"
]
}
},
"cell_type": "code",
"source": [
"df.update { colsOf() }\n",
" .with { it - rowMean() }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | a | b | c |
|---|
| -0.208914 | -0.045568 | 0.254482 |
| 0.205254 | -0.455411 | 0.250157 |
| 0.087467 | -0.097402 | 0.009935 |
| 0.023592 | -0.093791 | 0.070199 |
| 0.340497 | -0.188152 | -0.152345 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"a\",\"b\",\"c\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":5,\"ncol\":3,\"is_formatted\":false},\"kotlin_dataframe\":[{\"a\":-0.2089140982476947,\"b\":-0.04556758766806013,\"c\":0.25448168591575493},{\"a\":0.20525377104402331,\"b\":-0.4554106333323167,\"c\":0.25015686228829337},{\"a\":0.08746691771126269,\"b\":-0.09740236729837703,\"c\":0.009935449587114675},{\"a\":0.02359208565148385,\"b\":-0.09379117581393526,\"c\":0.0701990901624514},{\"a\":0.3404974181558069,\"b\":-0.1881520944628844,\"c\":-0.15234532369292275}]}"
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 26
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**22.** Suppose you have a DataFrame with 10 columns of real numbers, for example:\n",
"```kotlin\n",
"val names = ('a'..'j').map { it.toString() }\n",
"val df = dataFrameOf(names).randomDouble(5)\n",
"```\n",
"\n",
"Which column of numbers has the smallest sum? Return that column's label."
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:49.906927775Z",
"start_time": "2025-12-18T11:19:49.567367152Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_65_jupyter",
"Line_66_jupyter",
"Line_67_jupyter"
]
}
},
"cell_type": "code",
"source": [
"val names = ('a'..'j').map { it.toString() }\n",
"val df = dataFrameOf(names).randomDouble(5)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | a | b | c | d | e | f | g | h | i | j |
|---|
| 0.841911 | 0.282187 | 0.504844 | 0.316049 | 0.490350 | 0.593206 | 0.550719 | 0.814340 | 0.089081 | 0.113149 |
| 0.498208 | 0.847661 | 0.227987 | 0.518230 | 0.126070 | 0.616324 | 0.147116 | 0.262463 | 0.323552 | 0.737167 |
| 0.788916 | 0.162458 | 0.183311 | 0.926585 | 0.429727 | 0.740767 | 0.930618 | 0.971054 | 0.802308 | 0.230486 |
| 0.553347 | 0.482594 | 0.405461 | 0.369849 | 0.431532 | 0.177826 | 0.465287 | 0.262496 | 0.201674 | 0.430214 |
| 0.874932 | 0.162043 | 0.477525 | 0.884059 | 0.292586 | 0.854866 | 0.456486 | 0.589303 | 0.140426 | 0.911116 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"a\",\"b\",\"c\",\"d\",\"e\",\"f\",\"g\",\"h\",\"i\",\"j\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":5,\"ncol\":10,\"is_formatted\":false},\"kotlin_dataframe\":[{\"a\":0.8419114160124052,\"b\":0.282186920782716,\"c\":0.5048443177170018,\"d\":0.31604901017535125,\"e\":0.4903503697680932,\"f\":0.5932062822034694,\"g\":0.5507189032521874,\"h\":0.8143400642944842,\"i\":0.0890806635407565,\"j\":0.11314865188165824},{\"a\":0.49820844533063235,\"b\":0.8476612083057321,\"c\":0.22798742360150803,\"d\":0.5182296202669303,\"e\":0.12607036482076805,\"f\":0.6163241622452882,\"g\":0.1471161686192296,\"h\":0.26246326600812764,\"i\":0.32355220762629744,\"j\":0.7371668074893234},{\"a\":0.788915937670967,\"b\":0.16245843051774012,\"c\":0.18331082327972492,\"d\":0.9265848824100162,\"e\":0.4297273634305062,\"f\":0.7407665914938036,\"g\":0.9306177608421328,\"h\":0.9710544319125589,\"i\":0.802308385674185,\"j\":0.23048565019517087},{\"a\":0.553347476141766,\"b\":0.48259411048976075,\"c\":0.40546089768988547,\"d\":0.3698490765337128,\"e\":0.43153183270372275,\"f\":0.17782580998707898,\"g\":0.465287247289943,\"h\":0.2624963066894521,\"i\":0.20167375143620148,\"j\":0.43021369129837617},{\"a\":0.8749319981380302,\"b\":0.16204312466903636,\"c\":0.4775248119116894,\"d\":0.8840586021809772,\"e\":0.2925857979551013,\"f\":0.8548662347706445,\"g\":0.4564858905910003,\"h\":0.5893027524048042,\"i\":0.1404261330157901,\"j\":0.911116058678073}]}"
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 27
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:50.235988942Z",
"start_time": "2025-12-18T11:19:49.987687771Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_69_jupyter"
]
}
},
"cell_type": "code",
"source": "df.sum().transpose().minBy(\"value\")[\"name\"]",
"outputs": [
{
"data": {
"text/plain": [
"i"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 28
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**23.** How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:50.387907271Z",
"start_time": "2025-12-18T11:19:50.238960093Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_70_jupyter",
"Line_71_jupyter",
"Line_72_jupyter"
]
}
},
"cell_type": "code",
"source": [
"val df = dataFrameOf(\"a\", \"b\", \"c\").randomInt(30, 0..2)\n",
"df.distinct().count()"
],
"outputs": [
{
"data": {
"text/plain": [
"19"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 29
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**24.** In the cell below, you have a DataFrame `df` that consists of 10 columns of floating-point numbers. Exactly 5 entries in each row are NaN values.\n",
"\n",
"For each row of the DataFrame, find the *column* which contains the *third* NaN value.\n",
"\n",
"You should return a column of column labels: `e, c, d, h, d`"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:50.641509850Z",
"start_time": "2025-12-18T11:19:50.391210050Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_73_jupyter",
"Line_74_jupyter",
"Line_75_jupyter"
]
}
},
"cell_type": "code",
"source": [
"val nan = Double.NaN\n",
"val names = ('a'..'j').map { it.toString() }\n",
"val data = listOf(\n",
" 0.04, nan, nan, 0.25, nan, 0.43, 0.71, 0.51, nan, nan,\n",
" nan, nan, nan, 0.04, 0.76, nan, nan, 0.67, 0.76, 0.16,\n",
" nan, nan, 0.5, nan, 0.31, 0.4, nan, nan, 0.24, 0.01,\n",
" 0.49, nan, nan, 0.62, 0.73, 0.26, 0.85, nan, nan, nan,\n",
" nan, nan, 0.41, nan, 0.05, nan, 0.61, nan, 0.48, 0.68,\n",
")\n",
"val df = dataFrameOf(names)(*data.toTypedArray())\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | a | b | c | d | e | f | g | h | i | j |
|---|
| 0.040000 | NaN | NaN | 0.250000 | NaN | 0.430000 | 0.710000 | 0.510000 | NaN | NaN |
| NaN | NaN | NaN | 0.040000 | 0.760000 | NaN | NaN | 0.670000 | 0.760000 | 0.160000 |
| NaN | NaN | 0.500000 | NaN | 0.310000 | 0.400000 | NaN | NaN | 0.240000 | 0.010000 |
| 0.490000 | NaN | NaN | 0.620000 | 0.730000 | 0.260000 | 0.850000 | NaN | NaN | NaN |
| NaN | NaN | 0.410000 | NaN | 0.050000 | NaN | 0.610000 | NaN | 0.480000 | 0.680000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"a\",\"b\",\"c\",\"d\",\"e\",\"f\",\"g\",\"h\",\"i\",\"j\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":5,\"ncol\":10,\"is_formatted\":false},\"kotlin_dataframe\":[{\"a\":0.04,\"b\":NaN,\"c\":NaN,\"d\":0.25,\"e\":NaN,\"f\":0.43,\"g\":0.71,\"h\":0.51,\"i\":NaN,\"j\":NaN},{\"a\":NaN,\"b\":NaN,\"c\":NaN,\"d\":0.04,\"e\":0.76,\"f\":NaN,\"g\":NaN,\"h\":0.67,\"i\":0.76,\"j\":0.16},{\"a\":NaN,\"b\":NaN,\"c\":0.5,\"d\":NaN,\"e\":0.31,\"f\":0.4,\"g\":NaN,\"h\":NaN,\"i\":0.24,\"j\":0.01},{\"a\":0.49,\"b\":NaN,\"c\":NaN,\"d\":0.62,\"e\":0.73,\"f\":0.26,\"g\":0.85,\"h\":NaN,\"i\":NaN,\"j\":NaN},{\"a\":NaN,\"b\":NaN,\"c\":0.41,\"d\":NaN,\"e\":0.05,\"f\":NaN,\"g\":0.61,\"h\":NaN,\"i\":0.48,\"j\":0.68}]}"
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 30
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:51.054313756Z",
"start_time": "2025-12-18T11:19:50.738227798Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_77_jupyter"
]
}
},
"cell_type": "code",
"source": [
"df.mapToColumn(\"res\") { \n",
" namedValuesOf()\n",
" .filter { it.value.isNaN() }.drop(2)\n",
" .firstOrNull()?.name \n",
"}"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"res\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":1,\"is_formatted\":false},\"kotlin_dataframe\":[{\"res\":\"e\"},{\"res\":\"c\"},{\"res\":\"d\"},{\"res\":\"h\"},{\"res\":\"d\"}]}"
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 31
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**25.** A DataFrame has a column of groups 'grps' and a column of integer values 'vals':\n",
"```kotlin\n",
"val df = dataFrameOf(\n",
" \"grps\" to columnOf(\"a\", \"a\", \"a\", \"b\", \"b\", \"c\", \"a\", \"a\", \"b\", \"c\", \"c\", \"c\", \"b\", \"b\", \"c\"),\n",
" \"vals\" to columnOf(12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87),\n",
")\n",
"```\n",
"\n",
"For each group, find the sum of the three greatest values. You should end up with the answer as follows:\n",
"```\n",
"grps\n",
"a 409\n",
"b 156\n",
"c 345\n",
"```"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:51.467157852Z",
"start_time": "2025-12-18T11:19:51.142193506Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_79_jupyter",
"Line_80_jupyter",
"Line_81_jupyter"
]
}
},
"cell_type": "code",
"source": [
"val df = dataFrameOf(\n",
" \"grps\" to columnOf(\"a\", \"a\", \"a\", \"b\", \"b\", \"c\", \"a\", \"a\", \"b\", \"c\", \"c\", \"c\", \"b\", \"b\", \"c\"),\n",
" \"vals\" to columnOf(12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87),\n",
")\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | grps | vals |
|---|
| a | 12 |
| a | 345 |
| a | 3 |
| b | 1 |
| b | 45 |
| c | 14 |
| a | 4 |
| a | 52 |
| b | 54 |
| c | 23 |
| c | 235 |
| c | 21 |
| b | 57 |
| b | 3 |
| c | 87 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"grps\",\"vals\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":15,\"ncol\":2,\"is_formatted\":false},\"kotlin_dataframe\":[{\"grps\":\"a\",\"vals\":12},{\"grps\":\"a\",\"vals\":345},{\"grps\":\"a\",\"vals\":3},{\"grps\":\"b\",\"vals\":1},{\"grps\":\"b\",\"vals\":45},{\"grps\":\"c\",\"vals\":14},{\"grps\":\"a\",\"vals\":4},{\"grps\":\"a\",\"vals\":52},{\"grps\":\"b\",\"vals\":54},{\"grps\":\"c\",\"vals\":23},{\"grps\":\"c\",\"vals\":235},{\"grps\":\"c\",\"vals\":21},{\"grps\":\"b\",\"vals\":57},{\"grps\":\"b\",\"vals\":3},{\"grps\":\"c\",\"vals\":87}]}"
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 32
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:51.874156449Z",
"start_time": "2025-12-18T11:19:51.536795229Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_83_jupyter"
]
}
},
"cell_type": "code",
"source": [
"df.groupBy { grps }.aggregate { \n",
" vals.sortDesc().take(3).sum() into \"res\"\n",
"}"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"grps\",\"res\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":3,\"ncol\":2,\"is_formatted\":false},\"kotlin_dataframe\":[{\"grps\":\"a\",\"res\":409},{\"grps\":\"b\",\"res\":156},{\"grps\":\"c\",\"res\":345}]}"
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 33
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**26.** The DataFrame `df` constructed below has two integer columns 'A' and 'B'. The values in 'A' are between 1 and 100 (inclusive).\n",
"\n",
"For each group of 10 consecutive integers in 'A' (i.e. `(0, 10]`, `(10, 20]`, ...), calculate the sum of the corresponding values in column 'B'.\n",
"\n",
"The answer is as follows:\n",
"\n",
"```\n",
"A\n",
"(0, 10] 635\n",
"(10, 20] 360\n",
"(20, 30] 315\n",
"(30, 40] 306\n",
"(40, 50] 750\n",
"(50, 60] 284\n",
"(60, 70] 424\n",
"(70, 80] 526\n",
"(80, 90] 835\n",
"(90, 100] 852\n",
"```"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:52.289287480Z",
"start_time": "2025-12-18T11:19:51.941954661Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_85_jupyter",
"Line_86_jupyter",
"Line_87_jupyter"
]
}
},
"cell_type": "code",
"source": [
"import kotlin.random.Random\n",
"\n",
"val random = Random(42)\n",
"val list = List(200) { random.nextInt(1, 101) }\n",
"val df = dataFrameOf(\"A\", \"B\")(*list.toTypedArray())\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | A | B |
|---|
| 34 | 41 |
| 42 | 3 |
| 42 | 33 |
| 22 | 41 |
| 70 | 88 |
| 53 | 68 |
| 80 | 4 |
| 59 | 59 |
| 45 | 1 |
| 27 | 14 |
| 70 | 8 |
| 11 | 52 |
| 51 | 60 |
| 46 | 43 |
| 17 | 17 |
| 17 | 42 |
| 56 | 29 |
| 58 | 49 |
| 48 | 7 |
| 73 | 52 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"A\",\"B\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":100,\"ncol\":2,\"is_formatted\":false},\"kotlin_dataframe\":[{\"A\":34,\"B\":41},{\"A\":42,\"B\":3},{\"A\":42,\"B\":33},{\"A\":22,\"B\":41},{\"A\":70,\"B\":88},{\"A\":53,\"B\":68},{\"A\":80,\"B\":4},{\"A\":59,\"B\":59},{\"A\":45,\"B\":1},{\"A\":27,\"B\":14},{\"A\":70,\"B\":8},{\"A\":11,\"B\":52},{\"A\":51,\"B\":60},{\"A\":46,\"B\":43},{\"A\":17,\"B\":17},{\"A\":17,\"B\":42},{\"A\":56,\"B\":29},{\"A\":58,\"B\":49},{\"A\":48,\"B\":7},{\"A\":73,\"B\":52}]}"
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 34
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:52.799510118Z",
"start_time": "2025-12-18T11:19:52.372865385Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_89_jupyter"
]
}
},
"cell_type": "code",
"source": [
"df.groupBy { A.map { (it - 1) / 10 } }.sum { B }\n",
" .sortBy { A }\n",
" .convert { A }.with { \"(${it * 10}, ${it * 10 + 10}]\" }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | A | B |
|---|
| (0, 10] | 353 |
| (10, 20] | 873 |
| (20, 30] | 321 |
| (30, 40] | 322 |
| (40, 50] | 432 |
| (50, 60] | 754 |
| (60, 70] | 405 |
| (70, 80] | 561 |
| (80, 90] | 657 |
| (90, 100] | 527 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"A\",\"B\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":10,\"ncol\":2,\"is_formatted\":false},\"kotlin_dataframe\":[{\"A\":\"(0, 10]\",\"B\":353},{\"A\":\"(10, 20]\",\"B\":873},{\"A\":\"(20, 30]\",\"B\":321},{\"A\":\"(30, 40]\",\"B\":322},{\"A\":\"(40, 50]\",\"B\":432},{\"A\":\"(50, 60]\",\"B\":754},{\"A\":\"(60, 70]\",\"B\":405},{\"A\":\"(70, 80]\",\"B\":561},{\"A\":\"(80, 90]\",\"B\":657},{\"A\":\"(90, 100]\",\"B\":527}]}"
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 35
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## DataFrames: harder problems\n",
"\n",
"### These might require a bit of thinking outside the box...\n",
"\n",
"Difficulty: hard"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**27.** Consider a DataFrame `df` where there is an integer column 'X':\n",
"```kotlin\n",
"val df = dataFrameOf(\"X\" to columnOf(7, 2, 0, 3, 4, 2, 5, 0, 3, 4))\n",
"```\n",
"For each value, count the difference back to the previous zero (or the start of the column, whichever is closer). These values should therefore be\n",
"\n",
"```\n",
"[1, 2, 0, 1, 2, 3, 4, 0, 1, 2]\n",
"```\n",
"\n",
"Make this a new column 'Y'."
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:53.262677781Z",
"start_time": "2025-12-18T11:19:52.881011560Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_91_jupyter",
"Line_92_jupyter",
"Line_93_jupyter"
]
}
},
"cell_type": "code",
"source": [
"val df = dataFrameOf(\"X\" to columnOf(7, 2, 0, 3, 4, 2, 5, 0, 3, 4))\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"X\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":10,\"ncol\":1,\"is_formatted\":false},\"kotlin_dataframe\":[{\"X\":7},{\"X\":2},{\"X\":0},{\"X\":3},{\"X\":4},{\"X\":2},{\"X\":5},{\"X\":0},{\"X\":3},{\"X\":4}]}"
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 36
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:53.552737696Z",
"start_time": "2025-12-18T11:19:53.347104241Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_95_jupyter"
]
}
},
"cell_type": "code",
"source": [
"df.mapToColumn(\"Y\") {\n",
" if (it.X == 0) 0 else (prev()?.newValue() ?: 0) + 1\n",
"}"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"Y\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":10,\"ncol\":1,\"is_formatted\":false},\"kotlin_dataframe\":[{\"Y\":1},{\"Y\":2},{\"Y\":0},{\"Y\":1},{\"Y\":2},{\"Y\":3},{\"Y\":4},{\"Y\":0},{\"Y\":1},{\"Y\":2}]}"
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 37
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**28.** Consider the DataFrame constructed below, which contains rows and columns of numerical data.\n",
"\n",
"Create a list of the column-row index locations of the three largest values in this DataFrame.\n",
"\n",
"In this case, the answer should be:\n",
"```\n",
"[(0, d), (2, c), (3, f)]\n",
"```"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:54.071017006Z",
"start_time": "2025-12-18T11:19:53.641692715Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_97_jupyter",
"Line_98_jupyter",
"Line_99_jupyter"
]
}
},
"cell_type": "code",
"source": [
"val names = ('a'..'h').map { it.toString() } // val names = (0..7).map { it.toString() }\n",
"val random = Random(30)\n",
"val list = List(64) { random.nextInt(1, 101) }\n",
"val df = dataFrameOf(names)(*list.toTypedArray())\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | a | b | c | d | e | f | g | h |
|---|
| 43 | 88 | 66 | 100 | 9 | 59 | 74 | 23 |
| 6 | 63 | 43 | 58 | 4 | 85 | 9 | 25 |
| 49 | 59 | 100 | 52 | 28 | 1 | 19 | 81 |
| 92 | 41 | 13 | 57 | 28 | 97 | 63 | 39 |
| 4 | 59 | 72 | 65 | 50 | 35 | 14 | 31 |
| 55 | 74 | 33 | 66 | 17 | 39 | 80 | 38 |
| 18 | 64 | 91 | 39 | 80 | 55 | 65 | 2 |
| 19 | 76 | 75 | 18 | 32 | 97 | 1 | 32 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"a\",\"b\",\"c\",\"d\",\"e\",\"f\",\"g\",\"h\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":8,\"ncol\":8,\"is_formatted\":false},\"kotlin_dataframe\":[{\"a\":43,\"b\":88,\"c\":66,\"d\":100,\"e\":9,\"f\":59,\"g\":74,\"h\":23},{\"a\":6,\"b\":63,\"c\":43,\"d\":58,\"e\":4,\"f\":85,\"g\":9,\"h\":25},{\"a\":49,\"b\":59,\"c\":100,\"d\":52,\"e\":28,\"f\":1,\"g\":19,\"h\":81},{\"a\":92,\"b\":41,\"c\":13,\"d\":57,\"e\":28,\"f\":97,\"g\":63,\"h\":39},{\"a\":4,\"b\":59,\"c\":72,\"d\":65,\"e\":50,\"f\":35,\"g\":14,\"h\":31},{\"a\":55,\"b\":74,\"c\":33,\"d\":66,\"e\":17,\"f\":39,\"g\":80,\"h\":38},{\"a\":18,\"b\":64,\"c\":91,\"d\":39,\"e\":80,\"f\":55,\"g\":65,\"h\":2},{\"a\":19,\"b\":76,\"c\":75,\"d\":18,\"e\":32,\"f\":97,\"g\":1,\"h\":32}]}"
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 38
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:54.475257403Z",
"start_time": "2025-12-18T11:19:54.139804923Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_101_jupyter"
]
}
},
"cell_type": "code",
"source": [
"df.add(\"index\") { index() }\n",
" .gather { dropLast() }.into(\"name\", \"vals\")\n",
" .sortByDesc(\"vals\").take(3)[\"index\", \"name\"]"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"index\",\"name\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":3,\"ncol\":2,\"is_formatted\":false},\"kotlin_dataframe\":[{\"index\":0,\"name\":\"d\"},{\"index\":2,\"name\":\"c\"},{\"index\":3,\"name\":\"f\"}]}"
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 39
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**29.** You are given the DataFrame below with a column of group IDs, 'grps', and a column of corresponding integer values 'vals'.\n",
"\n",
"```kotlin\n",
"val random = Random(31)\n",
"val lab = listOf(\"A\", \"B\")\n",
"\n",
"val vals by columnOf(List(15) { random.nextInt(-30, 30) })\n",
"val grps by columnOf(List(15) { lab[random.nextInt(0, 2)] })\n",
"\n",
"val df = dataFrameOf(vals, grps)\n",
"```\n",
"\n",
"Create a new column 'patched_values' which contains the same values as the 'vals' any negative values in 'vals' with the group mean:\n",
"\n",
"```\n",
"vals grps patched_vals\n",
" -17 B 21.0\n",
" -7 B 21.0\n",
" 28 B 28.0\n",
" 16 B 16.0\n",
" -21 B 21.0\n",
" 19 B 19.0\n",
" -2 B 21.0\n",
" -19 B 21.0\n",
" 16 A 16.0\n",
" 9 A 9.0\n",
" -14 A 16.0\n",
" -19 A 16.0\n",
" -22 A 16.0\n",
" -1 A 16.0\n",
" 23 A 23.0\n",
"```"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:55.074975184Z",
"start_time": "2025-12-18T11:19:54.581439706Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_103_jupyter",
"Line_104_jupyter",
"Line_105_jupyter"
]
}
},
"cell_type": "code",
"source": [
"val random = Random(31)\n",
"val lab = listOf(\"A\", \"B\")\n",
"\n",
"val vals by columnOf(*Array(15) { random.nextInt(-30, 30) })\n",
"val grps by columnOf(*Array(15) { lab[random.nextInt(0, 2)] })\n",
"\n",
"val df = dataFrameOf(vals, grps)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | vals | grps |
|---|
| -17 | B |
| -7 | B |
| 16 | A |
| 28 | B |
| 9 | A |
| 16 | B |
| -21 | B |
| -14 | A |
| -19 | A |
| -22 | A |
| 19 | B |
| -2 | B |
| -1 | A |
| -19 | B |
| 23 | A |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"vals\",\"grps\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":15,\"ncol\":2,\"is_formatted\":false},\"kotlin_dataframe\":[{\"vals\":-17,\"grps\":\"B\"},{\"vals\":-7,\"grps\":\"B\"},{\"vals\":16,\"grps\":\"A\"},{\"vals\":28,\"grps\":\"B\"},{\"vals\":9,\"grps\":\"A\"},{\"vals\":16,\"grps\":\"B\"},{\"vals\":-21,\"grps\":\"B\"},{\"vals\":-14,\"grps\":\"A\"},{\"vals\":-19,\"grps\":\"A\"},{\"vals\":-22,\"grps\":\"A\"},{\"vals\":19,\"grps\":\"B\"},{\"vals\":-2,\"grps\":\"B\"},{\"vals\":-1,\"grps\":\"A\"},{\"vals\":-19,\"grps\":\"B\"},{\"vals\":23,\"grps\":\"A\"}]}"
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 40
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:55.629440336Z",
"start_time": "2025-12-18T11:19:55.172126724Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_107_jupyter",
"Line_108_jupyter",
"Line_109_jupyter"
]
}
},
"cell_type": "code",
"source": [
"val means = df.filter { vals >= 0 }\n",
" .groupBy { grps }.mean()\n",
" .pivot { grps }.values { vals }\n",
"\n",
"df.add(\"patched_values\") {\n",
" if (vals < 0) means[grps] else vals.toDouble()\n",
"}"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | vals | grps | patched_values |
|---|
| -17 | B | 21.000000 |
| -7 | B | 21.000000 |
| 16 | A | 16.000000 |
| 28 | B | 28.000000 |
| 9 | A | 9.000000 |
| 16 | B | 16.000000 |
| -21 | B | 21.000000 |
| -14 | A | 16.000000 |
| -19 | A | 16.000000 |
| -22 | A | 16.000000 |
| 19 | B | 19.000000 |
| -2 | B | 21.000000 |
| -1 | A | 16.000000 |
| -19 | B | 21.000000 |
| 23 | A | 23.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"vals\",\"grps\",\"patched_values\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Any\"}],\"nrow\":15,\"ncol\":3,\"is_formatted\":false},\"kotlin_dataframe\":[{\"vals\":-17,\"grps\":\"B\",\"patched_values\":\"21.0\"},{\"vals\":-7,\"grps\":\"B\",\"patched_values\":\"21.0\"},{\"vals\":16,\"grps\":\"A\",\"patched_values\":\"16.0\"},{\"vals\":28,\"grps\":\"B\",\"patched_values\":\"28.0\"},{\"vals\":9,\"grps\":\"A\",\"patched_values\":\"9.0\"},{\"vals\":16,\"grps\":\"B\",\"patched_values\":\"16.0\"},{\"vals\":-21,\"grps\":\"B\",\"patched_values\":\"21.0\"},{\"vals\":-14,\"grps\":\"A\",\"patched_values\":\"16.0\"},{\"vals\":-19,\"grps\":\"A\",\"patched_values\":\"16.0\"},{\"vals\":-22,\"grps\":\"A\",\"patched_values\":\"16.0\"},{\"vals\":19,\"grps\":\"B\",\"patched_values\":\"19.0\"},{\"vals\":-2,\"grps\":\"B\",\"patched_values\":\"21.0\"},{\"vals\":-1,\"grps\":\"A\",\"patched_values\":\"16.0\"},{\"vals\":-19,\"grps\":\"B\",\"patched_values\":\"21.0\"},{\"vals\":23,\"grps\":\"A\",\"patched_values\":\"23.0\"}]}"
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 41
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**30.** Implement a rolling mean over groups with window size 3, which ignores NaN value. For example, consider the following DataFrame:\n",
"```kotlin\n",
"val df = dataFrameOf(\n",
" \"groups\" to columnOf(\"a\", \"a\", \"b\", \"b\", \"a\", \"b\", \"b\", \"b\", \"a\", \"b\", \"a\", \"b\"),\n",
" \"value\" to columnOf(1.0, 2.0, 3.0, Double.NaN, 2.0, 3.0, Double.NaN, 1.0, 7.0, 3.0, Double.NaN, 8.0),\n",
")\n",
"df\n",
"\n",
"group value\n",
"a 1.0\n",
"a 2.0\n",
"b 3.0\n",
"b NaN\n",
"a 2.0\n",
"b 3.0\n",
"b NaN\n",
"b 1.0\n",
"a 7.0\n",
"b 3.0\n",
"a NaN\n",
"b 8.0\n",
"```\n",
"The goal is:\n",
"```\n",
"1.000000\n",
"1.500000\n",
"3.000000\n",
"3.000000\n",
"1.666667\n",
"3.000000\n",
"3.000000\n",
"2.000000\n",
"3.666667\n",
"2.000000\n",
"4.500000\n",
"4.000000\n",
"```\n",
"E.g., the first window of size three for group 'b' has values 3.0, NaN and 3.0 and occurs at row index 5.\n",
"Instead of being NaN, the value in the new column at this row index should be 3.0 (just the two non-NaN values are used to compute the mean (3+3)/2)"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:56.047538523Z",
"start_time": "2025-12-18T11:19:55.719618410Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_111_jupyter",
"Line_112_jupyter",
"Line_113_jupyter"
]
}
},
"cell_type": "code",
"source": [
"val df = dataFrameOf(\n",
" \"groups\" to columnOf(\"a\", \"a\", \"b\", \"b\", \"a\", \"b\", \"b\", \"b\", \"a\", \"b\", \"a\", \"b\"),\n",
" \"value\" to columnOf(1.0, 2.0, 3.0, Double.NaN, 2.0, 3.0, Double.NaN, 1.0, 7.0, 3.0, Double.NaN, 8.0),\n",
")\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | groups | value |
|---|
| a | 1.000000 |
| a | 2.000000 |
| b | 3.000000 |
| b | NaN |
| a | 2.000000 |
| b | 3.000000 |
| b | NaN |
| b | 1.000000 |
| a | 7.000000 |
| b | 3.000000 |
| a | NaN |
| b | 8.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"groups\",\"value\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":12,\"ncol\":2,\"is_formatted\":false},\"kotlin_dataframe\":[{\"groups\":\"a\",\"value\":1.0},{\"groups\":\"a\",\"value\":2.0},{\"groups\":\"b\",\"value\":3.0},{\"groups\":\"b\",\"value\":NaN},{\"groups\":\"a\",\"value\":2.0},{\"groups\":\"b\",\"value\":3.0},{\"groups\":\"b\",\"value\":NaN},{\"groups\":\"b\",\"value\":1.0},{\"groups\":\"a\",\"value\":7.0},{\"groups\":\"b\",\"value\":3.0},{\"groups\":\"a\",\"value\":NaN},{\"groups\":\"b\",\"value\":8.0}]}"
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 42
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:56.601931010Z",
"start_time": "2025-12-18T11:19:56.148703247Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_115_jupyter"
]
}
},
"cell_type": "code",
"source": [
"df.add(\"id\") { index() }\n",
" .groupBy { groups }.add(\"res\") {\n",
" relative(-2..0).value.filter { !it.isNaN() }.mean()\n",
" }.concat()\n",
" .sortBy(\"id\")\n",
" .remove(\"id\")"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | groups | value | res |
|---|
| a | 1.000000 | 1.000000 |
| a | 2.000000 | 1.500000 |
| b | 3.000000 | 3.000000 |
| b | NaN | 3.000000 |
| a | 2.000000 | 1.666667 |
| b | 3.000000 | 3.000000 |
| b | NaN | 3.000000 |
| b | 1.000000 | 2.000000 |
| a | 7.000000 | 3.666667 |
| b | 3.000000 | 2.000000 |
| a | NaN | 4.500000 |
| b | 8.000000 | 4.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"groups\",\"value\",\"res\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":12,\"ncol\":3,\"is_formatted\":false},\"kotlin_dataframe\":[{\"groups\":\"a\",\"value\":1.0,\"res\":1.0},{\"groups\":\"a\",\"value\":2.0,\"res\":1.5},{\"groups\":\"b\",\"value\":3.0,\"res\":3.0},{\"groups\":\"b\",\"value\":NaN,\"res\":3.0},{\"groups\":\"a\",\"value\":2.0,\"res\":1.6666666666666667},{\"groups\":\"b\",\"value\":3.0,\"res\":3.0},{\"groups\":\"b\",\"value\":NaN,\"res\":3.0},{\"groups\":\"b\",\"value\":1.0,\"res\":2.0},{\"groups\":\"a\",\"value\":7.0,\"res\":3.6666666666666665},{\"groups\":\"b\",\"value\":3.0,\"res\":2.0},{\"groups\":\"a\",\"value\":NaN,\"res\":4.5},{\"groups\":\"b\",\"value\":8.0,\"res\":4.0}]}"
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 43
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Date\n",
"Difficulty: easy/medium"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**31.** Create a `LocalDate` column that contains each day of 2015 and a column of random numbers."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:56.754450953Z",
"start_time": "2025-12-18T11:19:56.683121372Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_117_jupyter"
]
}
},
"cell_type": "code",
"source": "import kotlinx.datetime.*",
"outputs": [],
"execution_count": 44
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:57.126047058Z",
"start_time": "2025-12-18T11:19:56.791536294Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_118_jupyter"
]
}
},
"cell_type": "code",
"source": [
"class DateRangeIterator(first: LocalDate, last: LocalDate, val step: Int) : Iterator {\n",
" private val finalElement: LocalDate = last\n",
" private var hasNext: Boolean = if (step > 0) first <= last else first >= last\n",
" private var next: LocalDate = if (hasNext) first else finalElement\n",
"\n",
" override fun hasNext(): Boolean = hasNext\n",
"\n",
" override fun next(): LocalDate {\n",
" val value = next\n",
" if (value == finalElement) {\n",
" if (!hasNext) throw kotlin.NoSuchElementException()\n",
" hasNext = false\n",
" } else {\n",
" next = next.plus(step, DateTimeUnit.DayBased(1))\n",
" }\n",
" return value\n",
" }\n",
"}\n",
"\n",
"operator fun ClosedRange.iterator() = DateRangeIterator(this.start, this.endInclusive, 1)\n",
"\n",
"fun ClosedRange.toList(): List {\n",
" return when (val size = this.start.daysUntil(this.endInclusive)) {\n",
" 0 -> emptyList()\n",
" 1 -> listOf(iterator().next())\n",
" else -> {\n",
" val dest = ArrayList(size)\n",
" for (item in this) {\n",
" dest.add(item)\n",
" }\n",
" dest\n",
" }\n",
" }\n",
"}"
],
"outputs": [],
"execution_count": 45
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:57.565043672Z",
"start_time": "2025-12-18T11:19:57.140265113Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_119_jupyter",
"Line_120_jupyter",
"Line_121_jupyter"
]
}
},
"cell_type": "code",
"source": [
"val start = LocalDate(2015, 1, 1)\n",
"val end = LocalDate(2016, 1, 1)\n",
"\n",
"val days = (start..end).toList()\n",
"\n",
"val df = dataFrameOf(\n",
" \"dti\" to days.toColumn(),\n",
" \"s\" to List(days.size) { Random.nextDouble() }.toColumn()\n",
")\n",
"df.head()"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | dti | s |
|---|
| 2015-01-01 | 0.799701 |
| 2015-01-02 | 0.542949 |
| 2015-01-03 | 0.813556 |
| 2015-01-04 | 0.898062 |
| 2015-01-05 | 0.539191 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"dti\",\"s\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlinx.datetime.LocalDate\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":5,\"ncol\":2,\"is_formatted\":false},\"kotlin_dataframe\":[{\"dti\":\"2015-01-01\",\"s\":0.7997012481451563},{\"dti\":\"2015-01-02\",\"s\":0.542948925540349},{\"dti\":\"2015-01-03\",\"s\":0.8135564640710455},{\"dti\":\"2015-01-04\",\"s\":0.8980617499274379},{\"dti\":\"2015-01-05\",\"s\":0.5391906092303334}]}"
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 46
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**32.** Find the sum of the values in `s` for every Wednesday."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:57.887949175Z",
"start_time": "2025-12-18T11:19:57.627409636Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_123_jupyter"
]
}
},
"cell_type": "code",
"source": "df.filter { dti.dayOfWeek == DayOfWeek.WEDNESDAY }.sum { s }",
"outputs": [
{
"data": {
"text/plain": [
"25.543176247723252"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 47
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**33.** For each calendar month in `s`, find the mean of values."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:58.114859437Z",
"start_time": "2025-12-18T11:19:57.914142752Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_124_jupyter"
]
}
},
"cell_type": "code",
"source": "df.groupBy { dti.map { it.month } named \"month\" }.mean()",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | month | s |
|---|
| JANUARY | 0.537513 |
| FEBRUARY | 0.524215 |
| MARCH | 0.459084 |
| APRIL | 0.539626 |
| MAY | 0.425648 |
| JUNE | 0.489793 |
| JULY | 0.460490 |
| AUGUST | 0.433911 |
| SEPTEMBER | 0.475016 |
| OCTOBER | 0.556229 |
| NOVEMBER | 0.396961 |
| DECEMBER | 0.521459 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"month\",\"s\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlinx.datetime.Month\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":12,\"ncol\":2,\"is_formatted\":false},\"kotlin_dataframe\":[{\"month\":\"JANUARY\",\"s\":0.5375128507100408},{\"month\":\"FEBRUARY\",\"s\":0.5242150349941221},{\"month\":\"MARCH\",\"s\":0.4590837094316621},{\"month\":\"APRIL\",\"s\":0.5396262610274204},{\"month\":\"MAY\",\"s\":0.42564847901454406},{\"month\":\"JUNE\",\"s\":0.48979324892116555},{\"month\":\"JULY\",\"s\":0.4604903807995807},{\"month\":\"AUGUST\",\"s\":0.4339111213149513},{\"month\":\"SEPTEMBER\",\"s\":0.4750157287398121},{\"month\":\"OCTOBER\",\"s\":0.5562291918749319},{\"month\":\"NOVEMBER\",\"s\":0.3969606206218127},{\"month\":\"DECEMBER\",\"s\":0.5214586359388804}]}"
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 48
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**34.** For each group of four consecutive calendar months in `s`, find the date on which the highest value occurred."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:58.536299106Z",
"start_time": "2025-12-18T11:19:58.230471525Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_126_jupyter"
]
}
},
"cell_type": "code",
"source": [
"df.add(\"month4\") {\n",
" when (dti.monthNumber) {\n",
" in 1..4 -> 1\n",
" in 5..8 -> 2\n",
" else -> 3\n",
" }\n",
"}.groupBy(\"month4\").aggregate { maxBy { s } into \"max\" }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | month4 | max | | |
|---|
| dti | s | month4 |
|---|
| 1 | 2015-03-22 | 0.998280 | 1 |
| 2 | 2015-05-14 | 0.991455 | 2 |
| 3 | 2015-12-16 | 0.999891 | 3 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"month4\",\"max\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ColumnGroup\"}],\"nrow\":3,\"ncol\":2,\"is_formatted\":false},\"kotlin_dataframe\":[{\"month4\":1,\"max\":{\"data\":{\"dti\":\"2015-03-22\",\"s\":0.9982800125376523,\"month4\":1},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"dti\",\"s\",\"month4\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlinx.datetime.LocalDate\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}]}}},{\"month4\":2,\"max\":{\"data\":{\"dti\":\"2015-05-14\",\"s\":0.9914549696180743,\"month4\":2},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"dti\",\"s\",\"month4\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlinx.datetime.LocalDate\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}]}}},{\"month4\":3,\"max\":{\"data\":{\"dti\":\"2015-12-16\",\"s\":0.9998914835120076,\"month4\":3},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"dti\",\"s\",\"month4\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlinx.datetime.LocalDate\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}]}}}]}"
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 49
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**35.** Create a column consisting of the third Thursday in each month for the years 2015 and 2016."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:58.720573737Z",
"start_time": "2025-12-18T11:19:58.641394272Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_128_jupyter"
]
}
},
"cell_type": "code",
"source": [
"import java.time.temporal.WeekFields\n",
"import java.util.*"
],
"outputs": [],
"execution_count": 50
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:59.016056372Z",
"start_time": "2025-12-18T11:19:58.722834006Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_129_jupyter"
]
}
},
"cell_type": "code",
"source": [
"val start = LocalDate(2015, 1, 1)\n",
"val end = LocalDate(2016, 12, 31)\n",
"\n",
"(start..end).toList().toColumn(\"thirdThursday\").filter {\n",
" it.toJavaLocalDate()[WeekFields.of(Locale.ENGLISH).weekOfMonth()] == 3\n",
" && it.dayOfWeek == DayOfWeek.THURSDAY\n",
"}"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | thirdThursday |
|---|
| 2015-01-15 |
| 2015-02-19 |
| 2015-03-19 |
| 2015-04-16 |
| 2015-05-14 |
| 2015-06-18 |
| 2015-07-16 |
| 2015-08-13 |
| 2015-09-17 |
| 2015-10-15 |
| 2015-11-19 |
| 2015-12-17 |
| 2016-01-14 |
| 2016-02-18 |
| 2016-03-17 |
| 2016-04-14 |
| 2016-05-19 |
| 2016-06-16 |
| 2016-07-14 |
| 2016-08-18 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"thirdThursday\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlinx.datetime.LocalDate\"}],\"nrow\":24,\"ncol\":1,\"is_formatted\":false},\"kotlin_dataframe\":[{\"thirdThursday\":\"2015-01-15\"},{\"thirdThursday\":\"2015-02-19\"},{\"thirdThursday\":\"2015-03-19\"},{\"thirdThursday\":\"2015-04-16\"},{\"thirdThursday\":\"2015-05-14\"},{\"thirdThursday\":\"2015-06-18\"},{\"thirdThursday\":\"2015-07-16\"},{\"thirdThursday\":\"2015-08-13\"},{\"thirdThursday\":\"2015-09-17\"},{\"thirdThursday\":\"2015-10-15\"},{\"thirdThursday\":\"2015-11-19\"},{\"thirdThursday\":\"2015-12-17\"},{\"thirdThursday\":\"2016-01-14\"},{\"thirdThursday\":\"2016-02-18\"},{\"thirdThursday\":\"2016-03-17\"},{\"thirdThursday\":\"2016-04-14\"},{\"thirdThursday\":\"2016-05-19\"},{\"thirdThursday\":\"2016-06-16\"},{\"thirdThursday\":\"2016-07-14\"},{\"thirdThursday\":\"2016-08-18\"}]}"
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 51
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Cleaning Data\n",
"### Making a dataframe easier to work with\n",
"Difficulty: *easy/medium*\n",
"\n",
"It happens all the time: someone gives you data containing malformed strings, lists and missing data. How do you tidy it up so you can get on with the analysis?\n",
"\n",
"Take this monstrosity of a dataframe to use in the following puzzles:\n",
"```kotlin\n",
"var df = dataFrameOf(\n",
" \"From_To\" to columnOf(\"LoNDon_paris\", \"MAdrid_miLAN\", \"londON_StockhOlm\", \"Budapest_PaRis\", \"Brussels_londOn\"),\n",
" \"FlightNumber\" to columnOf(10045.0, Double.NaN, 10065.0, Double.NaN, 10085.0),\n",
" \"RecentDelays\" to columnOf(listOf(23, 47), listOf(), listOf(24, 43, 87), listOf(13), listOf(67, 32)),\n",
" \"Airline\" to columnOf(\"KLM(!)\", \"{Air France} (12)\", \"(British Airways. )\", \"12. Air France\", \"'Swiss Air'\"),\n",
")\n",
"```\n",
"\n",
"It looks like this:\n",
"```\n",
"From_To FlightNumber RecentDelays Airline\n",
"LoNDon_paris 10045.000000 [23, 47] KLM(!)\n",
"MAdrid_miLAN NaN [] {Air France} (12)\n",
"londON_StockhOlm 10065.000000 [24, 43, 87] (British Airways. )\n",
"Budapest_PaRis NaN [13] 12. Air France\n",
"Brussels_londOn 10085.000000 [67, 32] 'Swiss Air'\n",
"```"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:19:59.743092247Z",
"start_time": "2025-12-18T11:19:59.127837817Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_131_jupyter",
"Line_132_jupyter",
"Line_133_jupyter"
]
}
},
"cell_type": "code",
"source": [
"var df = dataFrameOf(\n",
" \"From_To\" to columnOf(\"LoNDon_paris\", \"MAdrid_miLAN\", \"londON_StockhOlm\", \"Budapest_PaRis\", \"Brussels_londOn\"),\n",
" \"FlightNumber\" to columnOf(10045.0, Double.NaN, 10065.0, Double.NaN, 10085.0),\n",
" \"RecentDelays\" to columnOf(listOf(23, 47), listOf(), listOf(24, 43, 87), listOf(13), listOf(67, 32)),\n",
" \"Airline\" to columnOf(\"KLM(!)\", \"{Air France} (12)\", \"(British Airways. )\", \"12. Air France\", \"'Swiss Air'\"),\n",
")\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From_To | FlightNumber | RecentDelays | Airline |
|---|
| LoNDon_paris | 10045.000000 | [23, 47] | KLM(!) |
| MAdrid_miLAN | NaN | [ ] | {Air France} (12) |
| londON_StockhOlm | 10065.000000 | [24, 43, 87] | (British Airways. ) |
| Budapest_PaRis | NaN | [13] | 12. Air France |
| Brussels_londOn | 10085.000000 | [67, 32] | 'Swiss Air' |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"From_To\",\"FlightNumber\",\"RecentDelays\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.collections.List\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":4,\"is_formatted\":false},\"kotlin_dataframe\":[{\"From_To\":\"LoNDon_paris\",\"FlightNumber\":10045.0,\"RecentDelays\":[23,47],\"Airline\":\"KLM(!)\"},{\"From_To\":\"MAdrid_miLAN\",\"FlightNumber\":NaN,\"RecentDelays\":[],\"Airline\":\"{Air France} (12)\"},{\"From_To\":\"londON_StockhOlm\",\"FlightNumber\":10065.0,\"RecentDelays\":[24,43,87],\"Airline\":\"(British Airways. )\"},{\"From_To\":\"Budapest_PaRis\",\"FlightNumber\":NaN,\"RecentDelays\":[13],\"Airline\":\"12. Air France\"},{\"From_To\":\"Brussels_londOn\",\"FlightNumber\":10085.0,\"RecentDelays\":[67,32],\"Airline\":\"'Swiss Air'\"}]}"
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 52
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**36.** Some values in the `FlightNumber` column are missing (they are NaN).\n",
"These numbers are meant to increase by 10 with each row, so 10,055 and 10,075 need to be put in the right place.\n",
"Modify `df` to fill in these missing numbers and make the column an integer column (instead of a float column)."
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:20:00.173798942Z",
"start_time": "2025-12-18T11:19:59.836065273Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_135_jupyter"
]
}
},
"cell_type": "code",
"source": [
"df = df.fillNaNs { FlightNumber }\n",
" .with { prev()!!.FlightNumber + (next()!!.FlightNumber - prev()!!.FlightNumber) / 2 }\n",
" .convert { FlightNumber }.toInt()\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From_To | FlightNumber | RecentDelays | Airline |
|---|
| LoNDon_paris | 10045 | [23, 47] | KLM(!) |
| MAdrid_miLAN | 10055 | [ ] | {Air France} (12) |
| londON_StockhOlm | 10065 | [24, 43, 87] | (British Airways. ) |
| Budapest_PaRis | 10075 | [13] | 12. Air France |
| Brussels_londOn | 10085 | [67, 32] | 'Swiss Air' |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"From_To\",\"FlightNumber\",\"RecentDelays\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.collections.List\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":4,\"is_formatted\":false},\"kotlin_dataframe\":[{\"From_To\":\"LoNDon_paris\",\"FlightNumber\":10045,\"RecentDelays\":[23,47],\"Airline\":\"KLM(!)\"},{\"From_To\":\"MAdrid_miLAN\",\"FlightNumber\":10055,\"RecentDelays\":[],\"Airline\":\"{Air France} (12)\"},{\"From_To\":\"londON_StockhOlm\",\"FlightNumber\":10065,\"RecentDelays\":[24,43,87],\"Airline\":\"(British Airways. )\"},{\"From_To\":\"Budapest_PaRis\",\"FlightNumber\":10075,\"RecentDelays\":[13],\"Airline\":\"12. Air France\"},{\"From_To\":\"Brussels_londOn\",\"FlightNumber\":10085,\"RecentDelays\":[67,32],\"Airline\":\"'Swiss Air'\"}]}"
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 53
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**37.** The **From_To** column can better be two separate columns!\n",
"\n",
"Split each string by the underscore delimiter **_**.\n",
"Assign the correct names 'From' and 'To' to these columns."
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:20:00.606941273Z",
"start_time": "2025-12-18T11:20:00.245568490Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_137_jupyter",
"Line_138_jupyter",
"Line_139_jupyter"
]
}
},
"cell_type": "code",
"source": [
"var df2 = df.split { From_To }.by(\"_\").into(\"From\", \"To\")\n",
"df2"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From | To | FlightNumber | RecentDelays | Airline |
|---|
| LoNDon | paris | 10045 | [23, 47] | KLM(!) |
| MAdrid | miLAN | 10055 | [ ] | {Air France} (12) |
| londON | StockhOlm | 10065 | [24, 43, 87] | (British Airways. ) |
| Budapest | PaRis | 10075 | [13] | 12. Air France |
| Brussels | londOn | 10085 | [67, 32] | 'Swiss Air' |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"From\",\"To\",\"FlightNumber\",\"RecentDelays\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.collections.List\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":5,\"is_formatted\":false},\"kotlin_dataframe\":[{\"From\":\"LoNDon\",\"To\":\"paris\",\"FlightNumber\":10045,\"RecentDelays\":[23,47],\"Airline\":\"KLM(!)\"},{\"From\":\"MAdrid\",\"To\":\"miLAN\",\"FlightNumber\":10055,\"RecentDelays\":[],\"Airline\":\"{Air France} (12)\"},{\"From\":\"londON\",\"To\":\"StockhOlm\",\"FlightNumber\":10065,\"RecentDelays\":[24,43,87],\"Airline\":\"(British Airways. )\"},{\"From\":\"Budapest\",\"To\":\"PaRis\",\"FlightNumber\":10075,\"RecentDelays\":[13],\"Airline\":\"12. Air France\"},{\"From\":\"Brussels\",\"To\":\"londOn\",\"FlightNumber\":10085,\"RecentDelays\":[67,32],\"Airline\":\"'Swiss Air'\"}]}"
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 54
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**38.** Notice how the capitalization of the city names is all mixed up in this temporary DataFrame 'temp'.\n",
"Standardize the strings so that only the first letter is uppercase (e.g. \"londON\" should become \"London\".)"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:20:00.993293322Z",
"start_time": "2025-12-18T11:20:00.698864044Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_141_jupyter"
]
}
},
"cell_type": "code",
"source": [
"df2 = df2.update { From and To }.with { it.lowercase().replaceFirstChar { it.uppercase() } }\n",
"df2"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From | To | FlightNumber | RecentDelays | Airline |
|---|
| London | Paris | 10045 | [23, 47] | KLM(!) |
| Madrid | Milan | 10055 | [ ] | {Air France} (12) |
| London | Stockholm | 10065 | [24, 43, 87] | (British Airways. ) |
| Budapest | Paris | 10075 | [13] | 12. Air France |
| Brussels | London | 10085 | [67, 32] | 'Swiss Air' |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"From\",\"To\",\"FlightNumber\",\"RecentDelays\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.collections.List\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":5,\"is_formatted\":false},\"kotlin_dataframe\":[{\"From\":\"London\",\"To\":\"Paris\",\"FlightNumber\":10045,\"RecentDelays\":[23,47],\"Airline\":\"KLM(!)\"},{\"From\":\"Madrid\",\"To\":\"Milan\",\"FlightNumber\":10055,\"RecentDelays\":[],\"Airline\":\"{Air France} (12)\"},{\"From\":\"London\",\"To\":\"Stockholm\",\"FlightNumber\":10065,\"RecentDelays\":[24,43,87],\"Airline\":\"(British Airways. )\"},{\"From\":\"Budapest\",\"To\":\"Paris\",\"FlightNumber\":10075,\"RecentDelays\":[13],\"Airline\":\"12. Air France\"},{\"From\":\"Brussels\",\"To\":\"London\",\"FlightNumber\":10085,\"RecentDelays\":[67,32],\"Airline\":\"'Swiss Air'\"}]}"
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 55
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**39.** In the **Airline** column, you can see some extra punctuation and symbols have appeared around the airline names.\n",
"Pull out just the airline name. E.g. `'(British Airways. )'` should become `'British Airways'`."
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:20:01.328403705Z",
"start_time": "2025-12-18T11:20:01.074557822Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_143_jupyter"
]
}
},
"cell_type": "code",
"source": [
"df2 = df2.update { Airline }.with {\n",
" \"([a-zA-Z\\\\s]+)\".toRegex().find(it)?.value ?: \"\"\n",
"}\n",
"df2"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From | To | FlightNumber | RecentDelays | Airline |
|---|
| London | Paris | 10045 | [23, 47] | KLM |
| Madrid | Milan | 10055 | [ ] | Air France |
| London | Stockholm | 10065 | [24, 43, 87] | British Airways |
| Budapest | Paris | 10075 | [13] | Air France |
| Brussels | London | 10085 | [67, 32] | Swiss Air |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"From\",\"To\",\"FlightNumber\",\"RecentDelays\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.collections.List\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":5,\"is_formatted\":false},\"kotlin_dataframe\":[{\"From\":\"London\",\"To\":\"Paris\",\"FlightNumber\":10045,\"RecentDelays\":[23,47],\"Airline\":\"KLM\"},{\"From\":\"Madrid\",\"To\":\"Milan\",\"FlightNumber\":10055,\"RecentDelays\":[],\"Airline\":\"Air France\"},{\"From\":\"London\",\"To\":\"Stockholm\",\"FlightNumber\":10065,\"RecentDelays\":[24,43,87],\"Airline\":\"British Airways\"},{\"From\":\"Budapest\",\"To\":\"Paris\",\"FlightNumber\":10075,\"RecentDelays\":[13],\"Airline\":\" Air France\"},{\"From\":\"Brussels\",\"To\":\"London\",\"FlightNumber\":10085,\"RecentDelays\":[67,32],\"Airline\":\"Swiss Air\"}]}"
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 56
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**40.** In the **RecentDelays** column, the values have been entered into the DataFrame as a list.\n",
"We would like each first value to be in its own column, each second value in its own column, and so on.\n",
"If a certain value is missing, the value should be `null`.\n",
"\n",
"Expand the column of lists into columns named 'delays_' and replace the unwanted `RecentDelays` column in `df` with 'delays'."
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:20:01.740439057Z",
"start_time": "2025-12-18T11:20:01.448232137Z"
},
"executionRelatedData": {
"compiledClasses": [
"Line_145_jupyter",
"Line_146_jupyter",
"Line_147_jupyter"
]
}
},
"cell_type": "code",
"source": [
"val cleanDf = df2.split { RecentDelays }.into { \"delay_$it\" }\n",
"cleanDf"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From | To | FlightNumber | delay_1 | delay_2 | delay_3 | Airline |
|---|
| London | Paris | 10045 | 23 | 47 | null | KLM |
| Madrid | Milan | 10055 | null | null | null | Air France |
| London | Stockholm | 10065 | 24 | 43 | 87 | British Airways |
| Budapest | Paris | 10075 | 13 | null | null | Air France |
| Brussels | London | 10085 | 67 | 32 | null | Swiss Air |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"From\",\"To\",\"FlightNumber\",\"delay_1\",\"delay_2\",\"delay_3\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":7,\"is_formatted\":false},\"kotlin_dataframe\":[{\"From\":\"London\",\"To\":\"Paris\",\"FlightNumber\":10045,\"delay_1\":23,\"delay_2\":47,\"delay_3\":null,\"Airline\":\"KLM\"},{\"From\":\"Madrid\",\"To\":\"Milan\",\"FlightNumber\":10055,\"delay_1\":null,\"delay_2\":null,\"delay_3\":null,\"Airline\":\"Air France\"},{\"From\":\"London\",\"To\":\"Stockholm\",\"FlightNumber\":10065,\"delay_1\":24,\"delay_2\":43,\"delay_3\":87,\"Airline\":\"British Airways\"},{\"From\":\"Budapest\",\"To\":\"Paris\",\"FlightNumber\":10075,\"delay_1\":13,\"delay_2\":null,\"delay_3\":null,\"Airline\":\" Air France\"},{\"From\":\"Brussels\",\"To\":\"London\",\"FlightNumber\":10085,\"delay_1\":67,\"delay_2\":32,\"delay_3\":null,\"Airline\":\"Swiss Air\"}]}"
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 57
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Data looks much better now! Now, add a finishing `.renameToCamelCase` to get Kotlin-style identifiers.\n"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-18T11:20:02.082661779Z",
"start_time": "2025-12-18T11:20:01.832916739Z"
}
},
"cell_type": "code",
"source": "cleanDf.renameToCamelCase()",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | from | to | flightNumber | delay1 | delay2 | delay3 | airline |
|---|
| London | Paris | 10045 | 23 | 47 | null | KLM |
| Madrid | Milan | 10055 | null | null | null | Air France |
| London | Stockholm | 10065 | 24 | 43 | 87 | British Airways |
| Budapest | Paris | 10075 | 13 | null | null | Air France |
| Brussels | London | 10085 | 67 | 32 | null | Swiss Air |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.2.0\",\"metadata\":{\"columns\":[\"from\",\"to\",\"flightNumber\",\"delay1\",\"delay2\",\"delay3\",\"airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":7,\"is_formatted\":false},\"kotlin_dataframe\":[{\"from\":\"London\",\"to\":\"Paris\",\"flightNumber\":10045,\"delay1\":23,\"delay2\":47,\"delay3\":null,\"airline\":\"KLM\"},{\"from\":\"Madrid\",\"to\":\"Milan\",\"flightNumber\":10055,\"delay1\":null,\"delay2\":null,\"delay3\":null,\"airline\":\"Air France\"},{\"from\":\"London\",\"to\":\"Stockholm\",\"flightNumber\":10065,\"delay1\":24,\"delay2\":43,\"delay3\":87,\"airline\":\"British Airways\"},{\"from\":\"Budapest\",\"to\":\"Paris\",\"flightNumber\":10075,\"delay1\":13,\"delay2\":null,\"delay3\":null,\"airline\":\" Air France\"},{\"from\":\"Brussels\",\"to\":\"London\",\"flightNumber\":10085,\"delay1\":67,\"delay2\":32,\"delay3\":null,\"airline\":\"Swiss Air\"}]}"
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 58
}
],
"metadata": {
"kernelspec": {
"display_name": "Kotlin",
"language": "kotlin",
"name": "kotlin"
},
"language_info": {
"codemirror_mode": "text/x-kotlin",
"file_extension": ".kt",
"mimetype": "text/x-kotlin",
"name": "kotlin",
"nbconvert_exporter": "",
"pygments_lexer": "kotlin",
"version": "1.8.0-dev-707"
},
"ktnbPluginMetadata": {
"projectLibraries": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}