init research

This commit is contained in:
2026-02-08 11:20:43 -10:00
commit bdf064f54d
3041 changed files with 1592200 additions and 0 deletions
@@ -0,0 +1,179 @@
---
title: "Dataset (data frame) manipulation API for the tech.ml.dataset library"
output:
md_document:
variant: gfm
---
```{r setup, include=FALSE}
find_nrepl_port_up <- function() {
    wd <- getwd()
    while(wd != dirname(wd)) {
        f <- paste0(wd,"/.nrepl-port")
        if(file.exists(f)) return(paste0("@",f))
        wd <- dirname(wd)
        f <- NULL
    }
}
port_file <- find_nrepl_port_up()
if(is.null(port_file)) stop("nREPL port not found")
library(knitr)
knitr_one_string <- knitr:::one_string
nrepl_cmd <- "rep"
opts_chunk$set(comment=NA, highlight=TRUE)
knit_engines$set(clojure = function(options) {
    rep_params <- if(isTRUE(options$stdout_only)) {
                      "--print 'out,1,%{out}' --print 'value,1,' -p"
                  } else {
                      "-p"
                  }
    code <- paste(rep_params, port_file, shQuote(knitr_one_string(options$code)))
    out <- if (options$eval) {
               if (options$message) message('running: ', nrepl_cmd, ' ', code)
               tryCatch(
                   system2(nrepl_cmd, code, stdout = TRUE, stderr = TRUE, env = options$engine.env),
                   error = function(e) {
                       if (!options$error) stop(e)
                       paste('Error in running command', nrepl_cmd)
                   }
               )
           } else ''
    if (!options$error && !is.null(attr(out, 'status'))) stop(knitr_one_string(out))
    engine_output(options, options$code, out)})
```
[![](https://img.shields.io/clojars/v/scicloj/tablecloth)](https://clojars.org/scicloj/tablecloth)
[![](https://api.travis-ci.org/scicloj/tablecloth.svg?branch=master)](https://travis-ci.org/github/scicloj/tablecloth)
[![](https://img.shields.io/badge/zulip-discussion-yellowgreen)](https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/api)
## Versions
### tech.ml.dataset 7.x (master branch)
[![](https://img.shields.io/clojars/v/scicloj/tablecloth)](https://clojars.org/scicloj/tablecloth)
### tech.ml.dataset 4.x (4.0 branch)
`[scicloj/tablecloth "4.04"]`
## Introduction
[tech.ml.dataset](https://github.com/techascent/tech.ml.dataset) is a great and fast library which brings columnar datasets to Clojure. Chris Nuernberger has been working on this library for the last year as a part of the bigger `tech.ml` stack.
I started testing the library and helping to fix the bugs I uncovered. My main goal was to compare its functionality with established standards from other platforms. I focused on R solutions: [dplyr](https://dplyr.tidyverse.org/), [tidyr](https://tidyr.tidyverse.org/) and [data.table](https://rdatatable.gitlab.io/data.table/).
While converting the examples, I worked out how to reorganize existing `tech.ml.dataset` functions into a simple-to-use API. The main goals were:
* Focus on dataset manipulation functionality, leaving out other parts of `tech.ml` like pipelines, datatypes, readers, ML, etc.
* A single entry point for common operations: one function dispatching on the given arguments.
* `group-by` results in a special kind of dataset: a dataset containing the subsets created by grouping as a column.
* Most operations recognize regular and grouped datasets and process the data accordingly.
* One function form, so that operations can be chained with thread-first (`->`) on a dataset.
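
As a rough illustration of these goals, the sketch below builds a tiny made-up dataset (the column names and values are assumptions chosen only for illustration), calls `tc/select-rows` with three kinds of argument, and then groups and aggregates in a single thread-first chain. It is a minimal example under those assumptions, not a tour of the API.

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype.functional :as dfn])

;; Made-up data, for illustration only.
(def prices
  (tc/dataset {:symbol ["A" "A" "B"]
               :price  [1.0 3.0 5.0]}))

;; One entry point, dispatching on the argument: a single index,
;; a sequence of indices, or a row predicate.
(tc/select-rows prices 0)
(tc/select-rows prices [0 2])
(tc/select-rows prices (fn [row] (> (:price row) 2.0)))

;; group-by returns a grouped dataset; aggregate recognizes it and
;; computes one result row per group.
(def grouped (tc/group-by prices :symbol))
(tc/grouped? grouped)

(def mean-prices
  (-> prices
      (tc/group-by :symbol)
      (tc/aggregate {:mean-price #(dfn/mean (% :price))})))
```

Because every function takes the dataset first, each step slots into `->` without reordering arguments.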
Important! This library is neither a replacement for `tech.ml.dataset` nor a separate library. It should be considered an addition on top of `tech.ml.dataset`.
If you want to know more about `tech.ml.dataset` and `dtype-next`, please refer to their documentation:
* [tech.ml.dataset walkthrough](https://techascent.github.io/tech.ml.dataset/walkthrough.html)
* [dtype-next overview](https://cnuernber.github.io/dtype-next/overview.html)
* [dtype-next cheatsheet](https://cnuernber.github.io/dtype-next/cheatsheet.html)
Join the discussion on [Zulip](https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/api)
## Documentation
Please refer to the [detailed documentation with examples](https://scicloj.github.io/tablecloth/index.html).
## Usage example
```{clojure results="hide"}
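;; The example below uses fully qualified names from these namespaces,
;; so load them explicitly rather than relying on transitive loading.
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype.datetime]
         '[tech.v3.datatype.functional])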
```
```{clojure results="asis"}
(-> "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv"
    (tc/dataset {:key-fn keyword})
    (tc/group-by (fn [row]
                   {:symbol (:symbol row)
                    :year (tech.v3.datatype.datetime/long-temporal-field :years (:date row))}))
    (tc/aggregate #(tech.v3.datatype.functional/mean (% :price)))
    (tc/order-by [:symbol :year])
    (tc/head 10))
```
## Contributing
`Tablecloth` is open to contributions. The best way to start is a discussion on [Zulip](https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/api).
### Development tools for documentation
Documentation is written in RMarkdown, which means that you need R to create html/md/pdf files.
The documentation contains around 600 code snippets which are run during the build. There are two files:
* `README.Rmd`
* `docs/index.Rmd`
Prepare the following software:
1. Install [R](https://www.r-project.org/)
2. Install [rep](https://github.com/eraserhd/rep), an nREPL client
3. Install `pandoc`
4. Run nREPL
5. Run R and install R packages: `install.packages(c("rmarkdown","knitr"), dependencies=T)`
6. Load rmarkdown: `library(rmarkdown)`
7. Render readme: `render("README.Rmd","md_document")`
8. Render documentation: `render("docs/index.Rmd","all")`
### API file generation
The `tablecloth.api` namespace is generated from `api-template`; please run the following before building the documentation:
```{clojure eval=FALSE}
(exporter/write-api! 'tablecloth.api.api-template
                     'tablecloth.api
                     "src/tablecloth/api.clj"
                     '[group-by drop concat rand-nth first last shuffle])
```
### Guideline
1. Before committing changes please run the tests. I usually do: `lein do clean, check, test` and build the documentation as described above (which also tests the whole library).
2. Keep the API as simple as possible:
    - the first argument should be a dataset
    - if parametrization is complex, the last argument should accept a map of optional function arguments
    - avoid variadic associative destructuring for function arguments
    - usually a function should work on a grouped dataset as well; accept a `parallel?` argument then (if applicable).
3. Follow the `potemkin` pattern and import functions into the API namespace using the `tech.v3.datatype.export-symbols/export-symbols` function.
4. Functions which are composed of API functions to cover specific case(s) should go to the `tablecloth.utils` namespace.
5. Always update `README.Rmd`, `CHANGELOG.md` and `docs/index.Rmd`; tests and function docs are highly welcome.
6. Always discuss changes and PRs first.
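
The argument conventions in point 2 can be sketched in plain Clojure. The function `top-rows` and its option keys are hypothetical and not part of Tablecloth; a "dataset" here is just a vector of row maps so the sketch runs with `clojure.core` alone.

```clojure
;; Hypothetical helper, for illustration only (not a Tablecloth function).
;; Dataset first, a single map of optional arguments last.
;; The form to avoid would be variadic keyword arguments:
;;   (defn top-rows [ds & {:keys [n order-by]}] ...)
(defn top-rows
  "Returns up to `n` rows of `ds`, optionally sorted by `:order-by`."
  ([ds] (top-rows ds nil))
  ([ds {:keys [n order-by descending?]
        :or   {n 5 descending? false}}]
   (let [sorted (if order-by (sort-by order-by ds) ds)
         sorted (if descending? (reverse sorted) sorted)]
     (vec (take n sorted)))))

;; Made-up rows, for illustration only.
(def rows [{:symbol "A" :price 3.0}
           {:symbol "B" :price 1.0}
           {:symbol "C" :price 2.0}])

;; Dataset-first signature composes with thread-first.
(-> rows
    (top-rows {:n 2 :order-by :price :descending? true}))
;; => [{:symbol "A", :price 3.0} {:symbol "C", :price 2.0}]
```

Passing options as one map keeps them easy to build, merge and forward from a calling function, which is awkward with variadic keyword arguments.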
## TODO
* tests
* tutorials
## New experimental dev workflow
In this branch, we develop a new proposed dev workflow for Tablecloth:
- namespace-as-a-notebook documentation using [Kindly](https://scicloj.github.io/kindly) and [Clay](https://scicloj.github.io/clay)
- testing the documentation using [note-to-test](https://github.com/scicloj/note-to-test) - coming soon
### Relevant files
- [notebooks/draft.clj](notebooks/draft.clj) - the tutorial as a Kindly notebook (developed with Clay)
- [dev/conversion.clj](dev/conversion.clj) - the script used to generate the notebook from the original `Rmarkdown` tutorial (up to a few additional manual edits)
- [docs/draft.html](docs/draft.html) - the tutorial rendered using Clay and [Quarto](https://quarto.org/)
### Actions
- to render the notebook using Clay (assuming you have the Quarto CLI [installed](https://quarto.org/docs/get-started/)):
```clj
(require '[scicloj.clay.v2.api :as clay])
(clay/make! {:format [:quarto :html]
             :source-path "notebooks/draft.clj"})
```
## Licence
Copyright (c) 2020 Scicloj
The MIT Licence