
# Changelog

# 8.003
 * hamf fixes, charred fixes, and 30% less require times when run from repl.

# 8.002
 * small dtype-next and hamf upgrades.

# 8.001
 * Moved to the new hamf and dtype-next, which use the hamf protocols.  See hamf's defprotocol namespace for reasons.  If you are
   extending dataset protocols via extend, extend-type or extend-protocol you need to use defprotocol's drop-in
   replacements as opposed to clojure's default implementations.

# 7.067
 * :disable-na-as-missing for fixed types too
 * new `maximum` reducer for reductions namespace

# 7.066
 * hamf bugfix

# 7.065
 * hamf persistent vector upgrade
 * promotional object parser fix for na-as-missing? option
 * Switch from object locking to reentrant locks in group-by-column-agg

# 7.064
 * Small perf optimizations around sparse columns and arrow.

# 7.063
 * Hamf upgrade - fast merge iterators, better sort operator
 * Bugfix in tech.v3.dataset.rolling - options map was ignored - now all options should be respected
 * Major arrow upgrade - support for sparse columns save/load in arrow.  See [test/tech/v3/libs/arrow_tests.clj]

# 7.062
 * hamf bugfix in apply-concat.

# 7.061
 * Upgrade to hamf to fix pmap with custom pool issue and initial cut at sparse columns.  There is no serialization
   yet as that requires significant changes to arrow to work for our intended use case.

# 7.060
 * Fixes [issue 458](https://github.com/techascent/tech.ml.dataset/issues/458) - replace missing with a value
   works correctly when a column contains all missing values.

# 7.059
 * dtype-next upgrade to fix clone-after-filter issue.

# 7.058
 * faster single column reduction when you have large columns and many missing -- avoids per-idx binary search of missing set.

# 7.057
 * Slightly faster arrow compressed writes.
 * column-cast no longer appends roaring bitmaps to metadata unless requested.

# 7.056
 * Arrow support for UUID and bigdecimal types.

# 7.055
 * Upgrade dtype-next to [version 10.136](https://github.com/cnuernber/dtype-next/blob/master/CHANGELOG.md#10136).

# 7.053
 * Column parsers are more rigorous in promoting their datatypes after clear op.

# 7.052
 * Fixing update-values - found untested (on mac) pathway that failed when run in cloud.

# 7.051
 * Much faster string table clone and much faster arrow write of string tables.

# 7.050
 * fix bug in stringtable clone.

# 7.049
 * Optimizations to string table clone, string table create and arrow serialization.

# 7.047
 * hamf bugfix for update-values.

# 7.046
 * dataset parsers return something that is not a dataset when the internal datasets have no columns.

# 7.045
 * Bulk add-constant! method used for adding missing values.

# 7.044
 * initial support for clearing dataset parsers - resets their row count but does not reset the schema.  Use tech.v3.dataset.protocols/ds-clear.

# 7.043
 * Legacy smile -- 2.6.0 -- support was removed.  Support for later smile versions has moved to the [scicloj system](https://github.com/scicloj/scicloj.ml.smile) and operations like PCA are best implemented at this time using neanderthal.

# 7.042
 * Upgrade hamf to get new api methods - lines and re-matches.

# 7.041
 * Slightly faster promotional object parser.

# 7.040
 * Fix for [issue 450](https://github.com/techascent/tech.ml.dataset/issues/450) - emapped columns could reduce as
   a different type than declared in the emap declaration.
 * Small perf improvements for unique-by.

# 7.039
 * Fix error in dtype-next/native-buffer/native-buffer->byte-array

# 7.038
 * Upgrade to hamf 2.020.
 * Fix for [issue 447](https://github.com/techascent/tech.ml.dataset/issues/447) - filter column by keyword.

# 7.037
 * Nippy loading is about 2x faster in the case of large string tables.
 * Arrow read pathways support :text-as-strings? to mirror :strings-as-text? on the write side so you can save out uncompressed data in the fastest-to-read format.

# 7.036
 * Major optimization (>9x!) loading of arrow files when large string tables/dictionaries are used.

# 7.035
 * Latest dtype-next (10.124) - contains upgrades to ham-fisted which allow pmap et al. to accept arbitrary executor services.
 * Fix for [issue 438](https://github.com/techascent/tech.ml.dataset/issues/438) - keyword dataset names in tribuo.
 * Fix for [issue 435](https://github.com/techascent/tech.ml.dataset/issues/435) - pd-merge's outer must accept empty datasets.
 * Fix for issues 432 and 371 - select-row-type operations don't remove `:print-index-range :all` metadata.


# 7.034
 * Reverted transit encoding of instant back to milliseconds since epoch as js api doesn't support microseconds since epoch.

# 7.033
 * [issue-434](https://github.com/techascent/tech.ml.dataset/issues/434) - bad transit encoding - packed instants are microseconds since epoch and have been for a while - not milliseconds since epoch.

# 7.031
 * [issue-413](https://github.com/techascent/tech.ml.dataset/issues/413) - reduce with packed columns.
 * [issue-414](https://github.com/techascent/tech.ml.dataset/issues/414) - categorical maps are now integers.
 * [issue-410](https://github.com/techascent/tech.ml.dataset/issues/410) - json parsing fails if parser-fn is provided.

# 7.030
 * [issue-408](https://github.com/techascent/tech.ml.dataset/issues/408) - xlsx files with numeric column names now load.
 * dtype-next upgrade fixing a few issues, most notably [issue-99](https://github.com/cnuernber/dtype-next/issues/99).

# 7.029
 * large parquet files now load - slowly as loading can't be parallelized - without holding onto more memory than they should.

# 7.028
 * [issue 400](https://github.com/techascent/tech.ml.dataset/issues/400) - CSV parser issue and upgrade.
 * [issue 401](https://github.com/techascent/tech.ml.dataset/issues/401) - parquet file failed to parse - missing columns.

# 7.027
 * Moved transit bindings from tmdjs into tech.v3.libs.clj-transit.

# 7.026
 * column sub-buffer failed to offset roaring bitmap missing indexes.

# 7.025
 * New option - `:disable-na-as-missing?` - to [disable treating NA as missing](https://github.com/techascent/tech.ml.dataset/pull/399).
 * Pathway to generically get a [tribuo trainer](https://github.com/techascent/tech.ml.dataset/pull/393).
 * Fix for tribuo changing [predicted column datatypes](https://github.com/techascent/tech.ml.dataset/pull/397).

# 7.024
 * Faster group-by-column-agg when a large (500+ entries) agg map is passed in.
 * Small optimizations to the categorical one-hot-encoding pathway.

# 7.023
 * [Issue 387](https://github.com/techascent/tech.ml.dataset/issues/387) - select now respects persistent vectors of booleans.

# 7.022
 * Issue with pd-merge where exception is thrown if all columns are used for join.
 * Allow system to load duplicate headers - [PR 386](https://github.com/techascent/tech.ml.dataset/pull/386) - thanks ezrand.
 * Bump fastexcel version and expose [StableID](https://github.com/techascent/tech.ml.dataset/pull/385) - thanks again ezrand.

# 7.021
 * hamf typed-nth operations (dnth, fnth, etc) that are efficient
   when input is the analogous primitive array.

# 7.020
 * hamf perf upgrades.
 * big perf upgrade for parsing sequences of maps.

# 7.019
 * hamf perf upgrades.

# 7.018
 * hamf perf upgrades.

# 7.017
 * hamf perf upgrades.
 * Fix for [critical CVE-2021-40531](https://nvd.nist.gov/vuln/detail/CVE-2021-40531).

# 7.016
 * hamf perf upgrades.

# 7.015
 * hamf fix for compose-reducers.

# 7.013
 * Fixes join regression.  Join algorithm refactored to use hamf primitives.

# 7.012
 * hamf-2.0 upgrade.
 * Fix for [issue-377](https://github.com/techascent/tech.ml.dataset/issues/377).
 * Moved from broken Travis CI to Github CI thanks to @iperdomo.
 * Documentation fix thanks to @mars0i.

# 7.011
 * Moved to custom linkedhashmap implementation in hamf that has optimized union *and*
   has equiv semantics for the keys.  This is not a persistent map of any sort but is
   at least a step closer.  Potential fix for issue 372.

# 7.010
 * Fix for serious error in fastruct equiv pathway.
 * minimal, much faster pathways for column (tech.v3.dataset.impl.column/construct-column)
   and dataset (tech.v3.dataset.impl.dataset/construct-dataset) construction.  These pathways
   will not detect errors nor will they detect missing values so use them with care.


# 7.009
 * Small optimizations to bring back some performance for group-by-column-agg that was lost
   between 6 and 7.

# 7.007
 * row-at defaults to a copying operation so you can safely use this with datasets
   that may be zero-copied from other sources.
 * Fix for [issue 367](https://github.com/techascent/tech.ml.dataset/issues/367)

# 7.006
 * dtype-next optimization for native-buffer->string

# 7.005
 * various dtype-next upgrades.

# 7.002
 * dtype-next upgrade.

# 7.001
 * Big dtype-next upgrade bringing pass-by-value and return-by-value and fixing some API issues
   in tech.v3.datatype.functional.

# 7.000-beta-55
 * dataset->csv-seq now uses charred's batch loading system and is parallelized.  See docs.

# 7.000-beta-53
 * Fixes [issue-363](https://github.com/techascent/tech.ml.dataset/issues/363) - ds/rows will now elide keys
   for missing values.  This imposes a small perf hit during read but allows things like the JSON serialization
   system to work better and is a bit more idiomatic.

# 7.000-beta-51
 * Huge dtype-next upgrade - we fixed a lot of argument order issues which unfortunately means
   existing projects will have issues with latest version if they used the changed apis.  Please
   check out the dtype-next changelog.

# 7.000-beta-38
 * Writing long sequences of datasets into a single arrow file no longer causes
   stack overflow issues (clojure.core/concat is not used any more).

# 7.000-beta-37
 * Major fix to hamf hashtables fixing subtle issue resizing transient hashmaps.

# 7.000-beta-34
 * Major hamf refactoring in preparation for bringing system out of beta.

# 7.000-beta-33
 * Adding `elide-header?` option for printing:
```clojure
user> (vary-meta (ds/head ds) merge {:maximum-precision 3 :elide-header? true})
|    :a |
|------:|
| 0.197 |
| 0.463 |
| 0.765 |
| 0.546 |
| 0.076 |
```

# 7.000-beta-32
 * Arrow's binary datatype is supported so we can read images and such via arrow.
 * Minor ham-fisted update.
 * Minor charred write-json update.
 * [issue 352](https://github.com/techascent/tech.ml.dataset/issues/352) - maximum-precision is supported to control dataset-wide maximum precision
   when formatting doubles.

# 7.000-beta-31
 * large hamf upgrade for faster maps and faster map boolean operations.
 * charred upgrade for faster json parsing when using `key-fn keyword`.
 * [issue 349](https://github.com/techascent/tech.ml.dataset/issues/349) - list types for arrow.

# 7.000-beta-30
 * Higher performance mapseq parsing and dataset creation for more efficient creation of smaller datasets
   via transduce, mapseq-parser, ->dataset and the various csv parsing pathways.

# 7.000-beta-29
 * parquet supports streaming data into output streams.

# 7.000-beta-28
 * m-1 mac support upgraded - arrow lz4 compression, zstd compression and snappy
   support all tested.  dtype-next upgrade required for lz4, dependency upgrade required
   for zstd, snappy.

# 7.000-beta-27
 * New charred with faster json parsing.
 * Updated ham-fisted - maps and vectors derive from the base Clojure classes.

# 7.000-beta-24
 * NEW DOCS!!
 * Faster hamf map creation and mapv-type operations.


# 7.000-beta-23
 * Categorical map producing NAN regression.
 * Column inference regression.

# 7.000-beta-20
 * Really optimized row-mapcat - required some new primitives from ham-fisted.

# 7.000-beta-19
 * row-mapcat has a few simple optimizations.
 * Fixed [issue 346](https://github.com/techascent/tech.ml.dataset/issues/346) - print-all was broken in 7.X.

# 7.000-beta-17
 * dataset group-by operations must respect the initial order of keys in the grouping criteria.
 * group-by-column, group-by are heavily optimized and quite a bit faster for large datasets.

# 7.000-beta-16
 * Latest dtype-next - support for jdk-19 and fix for arggroup.

# 7.000-beta-14
 * Fix for [issue 342](https://github.com/techascent/tech.ml.dataset/issues/341) - join on date columns.

# 7.000-beta-12
 * Large hamf update.
 * fixed a set of issues - descriptive-stats col order, nippy after arrow,
   column correlation type.

# 7.000-beta-10
 * added normal set functions - union, intersection, and difference - to
   `tech.v3.dataset.set`.
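
As a minimal sketch of the set functions named above, assuming each takes two datasets with the same columns (the sample data and the names `left` and `right` are illustrative only):

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.set :as ds-set])

;; Two small datasets that share the rows with :id 2 and 3.
(def left  (ds/->dataset {:id [1 2 3] :tag ["a" "b" "c"]}))
(def right (ds/->dataset {:id [2 3 4] :tag ["b" "c" "d"]}))

;; Rows that appear in either dataset.
(def all-rows    (ds-set/union left right))
;; Rows that appear in both datasets.
(def shared-rows (ds-set/intersection left right))
;; Rows of `left` that do not appear in `right`.
(def left-only   (ds-set/difference left right))
```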

# 7.000-beta-9
 * Lots of small dtype-next updates
 * slightly better resource tracking
 * first deps.edn-based release.
 * Smile, poi are no longer auto-included dependencies.  You have to include them manually.
 * pca removed from math, now only in tech.v3.dataset.neanderthal.

# 7.000-beta-6
 * Added a transduce-compatible rf pathway for sequence of maps - `ds/mapseq-rf`.
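
A small sketch of how `ds/mapseq-rf` might be used with `transduce`.  Calling it with no arguments is an assumption here, and the input maps are made up for illustration:

```clojure
(require '[tech.v3.dataset :as ds])

;; Transform each map on the way in, then reduce the maps into a dataset.
(def result
  (transduce (map #(update % :a inc))
             (ds/mapseq-rf)
             [{:a 1 :b "x"} {:a 2 :b "y"}]))
```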

# 7.000-beta-5
 * Integration of ham-fisted deeply into dtype-next, tmd, and tablecloth

# 6.103
 * issues [329](https://github.com/techascent/tech.ml.dataset/issues/329) and [328](https://github.com/techascent/tech.ml.dataset/issues/328) - CVE related dependency upgrades.

# 6.102
 * [issue 325](https://github.com/techascent/tech.ml.dataset/issues/325) - Usage of tmd delays shutdown requiring `shutdown-agents`.

# 6.101
 * [issue 324](https://github.com/techascent/tech.ml.dataset/issues/324) - ILookup for columns
 * update to charred to fix `unread too far` issues.

# 6.100
 * [issue 323](https://github.com/techascent/tech.ml.dataset/issues/323) - Null schema type in arrow files.

# 6.099
 * [issue 322](https://github.com/techascent/tech.ml.dataset/issues/322) - Categorical maps must be integers.

# 6.098
 * Update to clojure 1.11 for development.
 * Upgrade to dtype-next for unary min,max and to get rid of 1.11 warnings.
 * Upgrade tech.io for updated nippy again to get rid of 1.11 warnings.

# 6.096
 * Latest tech.io - pulls in important charred csv fix.
 * [issue 320](https://github.com/techascent/tech.ml.dataset/issues/320) - specify encoding for files.
 * drop-rows, select-rows can take negative indexes.
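
For illustration, a short sketch of negative indexes with `select-rows` and `drop-rows`, assuming `-1` addresses the last row (the sample data is made up):

```clojure
(require '[tech.v3.dataset :as ds])

(def data (ds/->dataset {:a [10 20 30 40]}))

;; Keep only the last row.
(def last-row (ds/select-rows data [-1]))
;; Keep everything except the last row.
(def without-last (ds/drop-rows data [-1]))
```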

# 6.095
 * Better parallelization for column-map.
 * Fix for [issue 307](https://github.com/techascent/tech.ml.dataset/issues/307) - Bugs in categorical mapping.


# 6.094
 * dtype-next upgrade.
 * Fix for round-tripping arrow files with compression.

# 6.093
 * Fixes for issue [259](https://github.com/techascent/tech.ml.dataset/issues/259/), which is the same as new issue 311.  `:key-fn` should only be applied once per column and does not have to
   be idempotent.

# 6.092
 * Fixes for issues [312](https://github.com/techascent/tech.ml.dataset/issues/312/), [315](https://github.com/techascent/tech.ml.dataset/issues/315/), [316](https://github.com/techascent/tech.ml.dataset/issues/316/)

# 6.091
 * Upgrade to latest charred - no user visible change expected.

# 6.090
 * replace-missing when provided both a direction and a default value will fill in missing items
   after the direction is applied with the missing value.
 * Added `:updown` and `:downup` options to replace previous behavior when desired.
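
A hedged sketch of the direction-plus-default behavior described above.  It assumes the argument order `(replace-missing ds columns strategy value)`; the column name and data are illustrative:

```clojure
(require '[tech.v3.dataset :as ds])

;; The first value has nothing before it, so :down alone cannot fill it.
(def data (ds/->dataset {:a [nil 1 nil 3 nil]}))

;; Fill downward first, then use 0 for anything still missing.
(def filled (ds/replace-missing data [:a] :down 0))
```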

# 6.089
 * CSV parsing now supports `:comment-char` that defaults to #.  Lines that begin with this character are ignored.
 * Fix for [issue 304](https://github.com/techascent/tech.ml.dataset/issues/304/) - n-initial-skip-rows not respected when parsing a csv file.
 * Experimental fix for [issue 305](https://github.com/techascent/tech.ml.dataset/issues/305/) - replace-missing with `:down` or `:up` should leave values missing when the initial replacement fails instead of trying the opposite direction.  This may leave datasets with some missing values.

# 6.087
 * Fix pd-merge `:outer` conditional - [issue 302](https://github.com/techascent/tech.ml.dataset/issues/302/)

# 6.086
 * Update dtype-next as [issue 57](https://github.com/cnuernber/dtype-next/issues/57) was blocking tmd deployment on some macs.
 * Fix for [issue 298](https://github.com/techascent/tech.ml.dataset/issues/298) - nippy'd columns fail thaw.

# 6.085
 * Update to dtype-next.
 * Support for decimal types from parquet - thanks to chrysophylax.

# 6.084
 * Various updates to the charred library.
 * column whitelist/blacklists can be column indexes again.

# 6.081
 * json/csv read/write is going through the [charred library](https://github.com/cnuernber/charred).

# 6.080
 * Update json parsing to dtype-next's json parser.

# 6.079
 * Optimized new csv processing system.  Slightly faster and uses less memory.

# 6.078
 * Switched to the new csv processing system in dtype-next for parsing csv's.  This eliminates
   a source of more or less unfixable issues regarding univocity and it should be nearly
   identical in performance while using less memory.
 * Additionally there is a new interface for csv -
   [tech.v3.dataset.io.csv](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.io.csv.html)
   - processing that efficiently allows you to load a CSV into a sequence of datasets based
   on row-counts.
 * The univocity-based processing system will still be kept around as there may be files
   that load significantly faster or that load correctly with the univocity processing
   system.

# 6.077
 * Upgrade to dtype-next to make `(ds/filter-column ds col identity)` consistent w/r/t missing
   values across numeric and object datatypes.
 * `drop-missing` has a 2-arg variant that takes a dataset and column name.  This is a much
   faster pathway than `(ds/filter-column ds col identity)` for dropping missing values.
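
A minimal sketch contrasting the one-argument and two-argument forms of `drop-missing` (sample data is made up):

```clojure
(require '[tech.v3.dataset :as ds])

(def data (ds/->dataset {:a [1 nil 3] :b ["x" "y" nil]}))

;; Drop only the rows where column :a is missing.
(def a-present (ds/drop-missing data :a))
;; Drop rows where any column is missing.
(def all-present (ds/drop-missing data))
```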


# 6.076
 * New print options and bug fix for [issue 266](https://github.com/techascent/tech.ml.dataset/issues/266) - printing
   first style of `first ... last` is the default as I think it is generally more useful than just first or last.
   Skipped a version due to bug in this system.

# 6.074
 * [issue 295](https://github.com/techascent/tech.ml.dataset/issues/295) - new-column exported from api had
   incorrect signature.
 * [issue 294](https://github.com/techascent/tech.ml.dataset/issues/294) - arrow files with lz4 dependent-block
   encoding fail for the jpountz decoder.  The only sane resolution here is to use the C lz4 library decoding system
   while we work through these issues upstream.

# 6.072
 * Support for reading/writing csv, tsv, edn, json bzip2 and zip files.  Zip files
   are only read when there is a single zipentry in them.  bzip2 requires the user
   to require `tech.v3.dataset.bzip2` in order to work.  See namespace documentation.

# 6.071
 * Initial tribuo support - see docs for tech.v3.libs.tribuo.
 * Upgrade dtype-next - claypoole now comes by default.

# 6.069
 * The public java api docs are updated and it has a rowMap overload that supports
   options to pass to pmapDs.

# 6.068
 * `column-map` is no longer lazy when an explicit datatype is provided.  The result
   is now generated immediately in parallel.  Laziness can be achieved via the
   dtype-next emap api along with `assoc`.

# 6.067
 * Defaulting `:strings-as-text?` to false for the multiple dataset pathway as
   support for delta dictionaries was only recently solidified in the [Arrow SDK
   itself](https://issues.apache.org/jira/browse/ARROW-13467).

# 6.066
 * Major rework of arrow support to include support for all known arrow file formats
   and tested files in various formats across latest (7.0.0) pyarrow.
 * Fix for [issue 289](https://github.com/techascent/tech.ml.dataset/issues/289) - cross
   pd-merge produced incorrect result.

# 6.065
 * Fixing [issue 287](https://github.com/techascent/tech.ml.dataset/issues/287) - dataset corrupt after
   nippy serialization.  This had of course nothing to do with nippy but was caused by a bug in
   dataset->data pathway.

# 6.064
 * upgrade dtype-next to eliminate some superfluous logging and to enable
   window positioning on variable rolling windows.
 * Removed neanderthal as a required dependency.  It is now lazily loaded upon call
   of PCA.

# 6.062
 * Upgrade dtype-next which removed an experimental fast list creation pathway.

## 6.061
 * Construct a dataset with sequences of java.util.HashMap now works.

## 6.060
 * Neanderthal is preferred but is not a required dependency.

## 6.059
 * Lazily load neanderthal specifically for PCA when necessary.

## 6.058
 * Upgrade datatype to provide faster map/vector constructors.
 * Upgrade rowvecs pathway so the copying option is considerably faster for dataset
   of columns or less.

## 6.057
 * Typed out faststruct's constructor.

## 6.056
 * JSON read/write upgraded to support gzip - ".json.gz".

## 6.055
 * Fixes [issue 284](https://github.com/techascent/tech.ml.dataset/issues/284) - unroll column fails on single column dataset.

## 6.054
 * Removed fastexcel as an automatic dependency due to [issue 283](https://github.com/techascent/tech.ml.dataset/issues/283).  The api documentation now
   indicates the known working fastexcel version.
 * Added Reductions namespace with example to Java API.

## 6.053
 * Java API, Documentation and Sample.
 * Non-backward-compatible Fixes to rolling API's `:comp-fn` optional argument - the
   parameters to the function are reversed so that things like `clojure.core/-` work.

## 6.052
 * `tech.v3.dataset.neanderthal/dataset->dense` supports float32 datatypes.

## 6.051
 * Replace-missing on packed datatypes now works - you have to pass in the unpacked
   value (which most users will anyway).

## 6.050
 * Arrow compression is now supported.  See documentation in the libs/arrow namespace.
 * Major dtype-next upgrade - now with a Java (tm) API :-).

## 6.049
 * Arrow has been rebuilt to be minimally dependent on the official arrow SDK
   and support JDK-17.

## 6.048
 * Small upgrade to dtype-next with a more flexible new-array-of-structs definition
 and [documentation](https://cnuernber.github.io/dtype-next/tech.v3.datatype.struct.html#var-new-array-of-structs).
 See [unit tests](https://github.com/techascent/tech.ml.dataset/blob/a7ad63c6082f6731b143ae47d2b8f71456888acb/test/tech/v3/dataset_test.clj#L1381)
 for how to convert an array of structs into a dataset.

## 6.047
 * Support for making datasets out of arrays of structs.

## 6.046
 * Disable automatic file-backed-text because mmap is broken on m-1 macs.  The fix for
   this is moving to JDK-17, btw, where it is a normal API call that works fine.  Don't
   expect this to come back.

## 6.044
 * col-parsers/make-fixed-parser allows you to make fixed type parsers for custom
   datatypes.


## 6.043
 * Text works correctly when mmap pathways are disabled - larray fails on m-1 mac.

## 6.042
 * Intermediate versions before this - Lots of micro optimizations to row-mapcat and
   some micro-opts to group-by-column-agg.
 * Support for LocalTime datatype.  Parquet and Arrow support this conversion.  Arrow files
   will read localtime back in as the datatype `:time-microseconds`.  Users can use
   `:local-time` as a parser datatype and there is support for parsing some simple
   variations of local-time data.


## 6.036

 * [pmap-ds](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-pmap-ds)
 * [row-mapcat](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-mapcat)
 Row mapcat has an option to produce a sequence of datasets.  This flows naturally into
 group-by-column-agg.  Keep this in mind as it keeps the size of the working set in memory
 fairly low.

## 6.035
 * Main api namespaces are code-generated to ensure discoverability.  Namespaces affected are
   tech.v3.datatype, tech.v3.datatype.functional, tech.v3.datatype.datetime, tech.v3.dataset,
   tech.v3.dataset.metamorph.
 * clj-kondo bindings and mostly clean linting pass failing only on 2 places on my dev machine.
   Working with borkdude to deal with small number of current failings.

## 6.031
 * Upgrade to latest dtype-next - fix for ternary <,<=,>,>= in dfn namespace.
 * dtype-next's main api now includes efficient in-place reverse.
 * reverse-rows - reverse the order of the rows of the dataset.
 * select-missing - select only rows where one of the columns has a missing value.
 * The high performance aggregations in the reduce namespace now support a specialized
   filter argument to filter out a row index very late in the process.

## 6.030
 * [issue 275](https://github.com/techascent/tech.ml.dataset/issues/275) - pokemon.csv failed
 to parse correctly due to quote issues.

## 6.029
 * [issue 267](https://github.com/techascent/tech.ml.dataset/issues/267) - converting between
   probability distributions back to labels quietly ignored NAN values leading to errors.
 * [issue 273](https://github.com/techascent/tech.ml.dataset/issues/273) - Approximate Bayesian Bootstrap
   implemented for replace-missing.
 * [induction](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-induction) - create a new
   dataset via induction over each row of a previous dataset.

## 6.027
 * [issue 274](https://github.com/techascent/tech.ml.dataset/issues/274) - replace-missing drops metadata.
 * Latest dtype-next - specifically new functions in jvm-map namespace.

## 6.026
 * Minor upgrade - ds/concat returns nil if all arguments are nil or nothing is passed in.
 This matches the behavior of concat.

## 6.025
 * [locker](https://cnuernber.github.io/dtype-next/tech.v3.datatype.locker.html) - an efficient
   and threadsafe way to manipulate global variables.
 * [dtype-next issue 42] - elemwise cast on column failed with specific example.

## 6.024
 * nth is now correct on columns with negative indexes - see unit test.
 * nth is now correct on many more dtype types with negative indexes - see dtype-next unit tests.
 * [shift](https://cnuernber.github.io/dtype-next/tech.v3.datatype.functional.html#var-shift) functionality is included in dtype-next.


## 6.023
 * [issue 270](https://github.com/techascent/tech.ml.dataset/issues/270) - join with double columns was failing
   due to set-constant issue in dtype-next.

## 6.022
 * Moved ztellman's primitive-math library into datatype under specific namespace
   prefix due to dtype-next [issue 41](https://github.com/cnuernber/dtype-next/issues/41).

## 6.021
 * tech.v3.dataset.neanderthal/dense->dataset - creates a dataset with integer column names.

## 6.020
 * Fix for missing datetime values causing descriptive stats to fail.

## 6.019
 * [min-n-by-column](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-min-n-by-column) - Find the minimum N rows by column - uses guava minmaxheap under the covers.
   Sorting the result of this is an efficient way to get a sorted top-N-type operation.
 * Changed the default concatenation pathway to be copying by default.  This often times just
   works better and results in much faster processing pipelines.
 * Fixed a few issues with packed datatypes.
 * Changed extend-column-with-empty so that it copies data.  I am less sure about this
   change but it fixed an issue with packed datatypes at the cost that joins are often
   no longer in place.  So if you get OOM errors now doing certain joins this change
   is the culprit and we should back off and set it back to what it was.

## 6.016
 * Fixed dtype/writer? queries to accurately reflect actual situations.

## 6.015
 * row-map - map a function across the rows of the dataset (represented as maps).  The result
 should itself be a map and the dataset created from these maps will be merged back into
 the original ds.
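
A short sketch of `row-map` as described above, where each row arrives as a map and the returned map becomes new columns (the `:sum` column name is illustrative):

```clojure
(require '[tech.v3.dataset :as ds])

(def data (ds/->dataset {:a [1 2 3] :b [10 20 30]}))

;; Each returned map is merged back into the dataset as new column values.
(def with-sum
  (ds/row-map data (fn [row] {:sum (+ (get row :a) (get row :b))})))
```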

## 6.014
 * tech.v3.dataset.reductions/group-by-column-agg can take a tuple of column names in addition
   to a single column name.  In the case of a tuple the grouping will be the vector of column
   values evaluated in object space (so missing will be nil).
 * [issue 260 - numeric, sorting of missing values](https://github.com/techascent/tech.ml.dataset/issues/260)
 * [issue 262 - positional renaming of columns](https://github.com/techascent/tech.ml.dataset/issues/262)
 * [issue 257 - implement pandas merge functionality](https://github.com/techascent/tech.ml.dataset/issues/257)
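
The tuple grouping for `group-by-column-agg` can be sketched roughly as follows.  The reducers `sum` and `row-count`, and passing the datasets as a sequence, are assumptions about the reductions namespace; the data is made up:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce])

(def sales
  (ds/->dataset {:region ["e" "e" "w" "w"]
                 :year   [2020 2020 2020 2021]
                 :amount [1.0 2.0 3.0 4.0]}))

;; Group by the pair of columns and aggregate each group.
(def agg
  (ds-reduce/group-by-column-agg
   [:region :year]
   {:total (ds-reduce/sum :amount)
    :n     (ds-reduce/row-count)}
   [sales]))
```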

## 6.011
 * Upgrade tech.io so ls, metadata works for files similarly to how it works for aws.

## 6.010
 * More dtype-next datetime functions - `local-date->epoch-months`, `epoch-months->epoch-days`
   `epoch-days->epoch-months`, `epoch-months->local-date`.
 * Clean way to create new reducers for the reductions namespace - `tech.v3.dataset.reductions/index-reducer`.

## 6.009
 * latest dtype-next.
 * filter-by-column, sort-by-column use unpacked datatypes.
 * New parser argument, 'parse-type', that allows you to turn on string parsing when
   working with a sequence of maps.
 * `->dataset` is more robust to sequences of maps that may contain nil values.
 * `concat` can take datasets with different subsets of columns, just like concat of
   a sequence of maps does.
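
As a minimal sketch of `concat` with differing column sets, assuming the absent column is filled with missing values (data is illustrative):

```clojure
(require '[tech.v3.dataset :as ds])

(def d1 (ds/->dataset {:a [1 2] :b [3 4]}))
;; d2 has no :b column.
(def d2 (ds/->dataset {:a [5 6]}))

(def combined (ds/concat d1 d2))
```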


## 6.006
 * latest dtype-next.  perf fixes for continuous wavelet transform, linear-regression, some
   issue fixes.
 * [issue 257](https://github.com/techascent/tech.ml.dataset/issues/257) - implement pandas
   merge functionality.
 * [issue 255](https://github.com/techascent/tech.ml.dataset/issues/255) - Surprising behavior
   if dataset has no categorical columns (nil vs empty dataset).
 * Upgrade Apache Poi to version 5.0.0.

## 6.005
 * Fix in k-fold; it could fail for certain sizes of datasets.

## 6.004
 * major fix for odd? even? etc. in tech.v3.datatype.functional.
 * head,tail can accept numbers larger than row-count.
 * dtype-next tech.v3.datatype.functional namespace now has vectorized versions of
   sum, dot-product, magnitude-squared, and distance that it will use if the input
   is backed by a double array and if jdk.incubator.vector module is enabled.

## 6.003
 * New accessors - rows, row-at - both work in sequence-of-maps space.  -1 indexes for
   row-at return data indexed from the end so (row-at ds -1) returns the last dataset
   row.
 * When accessing columns via ifn interface - `(col idx)`, negative numbers index from
   the end so for instance -1 retrieves the last value in the column.
 * Large and potentially destabilizing optimization in some cases where argops/argfilter can
   return a range if the filtered region is contiguous and
   then new columns are sitting on sub-buffers of other columns as opposed to indexed-buffers.
   A sub-buffer doesn't pay the same indexing costs and is still capable of accessing the
   underlying data (such as a double array) whereas an indexed buffer cannot faithfully return
   the underlying data.  This can dramatically reduce indexing costs for certain operations
   and allows System/arraycopy and friends to be used for further operations.
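
A brief sketch of the accessors and negative indexing described in this release (sample data is made up):

```clojure
(require '[tech.v3.dataset :as ds])

(def data (ds/->dataset {:a [1 2 3] :b ["x" "y" "z"]}))

;; Last row of the dataset, as a map.
(def last-row (ds/row-at data -1))
;; Last value of column :a via the column's ifn interface.
(def last-a ((data :a) -1))
;; All rows as a sequence of maps.
(def first-row (first (ds/rows data)))
```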

## 6.002
 * [issue 254] - Unexpected behaviour of rolling with LazySeq and ChunkedCons.
 * [issue 252] - Nippy serialization of tech.v3.dataset.impl.column.Column.

## 6.001
 * Moved to 3 digit change qualifier.  Hopefully we get a new major version before we hit
   99 bugfixes but no guarantees.
 * [issue 250] - Columns of persistent vectors failed to save/restore from nippy.

## 6.00
 * [issue 247](https://github.com/techascent/tech.ml.dataset/issues/247) - certain pathways would load gzipped as binary.
 * [issue 248](https://github.com/techascent/tech.ml.dataset/issues/248) - Reflection in index code.
 * [issue 249](https://github.com/techascent/tech.ml.dataset/issues/249) - Failure for dataset->data for string columns with missing data.
 * update dtype-next for much more efficient cumulative summation type operations.

## 6.00-beta-15
 * Upgrade to dtype-next for fft-based convolutions.

## 6.00-beta-14
 * New rolling namespace for a high level pandas-style rolling api.
 * Lots of datetime improvements.
 * tech.v3.datatype improvements (new conv1d, diff, gradient functionality).

## 6.00-beta-13
 * Fix for parquet failing to load local files in windows.  Thanks hadoop, that is
   3 hours of my life that will never come back ;-).


## 6.00-beta-12

 * Parquet documentation to address logging slowdown.  If writing parquet files is
   unreasonably slow then please read the documentation on logging.  The java
   parquet implementation logs so much it slows things down 5x-10x.
 * Fix to nippy to be backward compatible.

## 6.00-beta-11
 * [issue 244](https://github.com/techascent/tech.ml.dataset/issues/244) - NPE with packed column.
 * [issue 243](https://github.com/techascent/tech.ml.dataset/issues/243) - xls, xlsx parsing documentation.
 * [issue 208](https://github.com/techascent/tech.ml.dataset/issues/208) - k-fold, train-test-split both now take random seed
   similar to shuffle.


## 6.00-beta-10
 * [issue 242](https://github.com/techascent/tech.ml.dataset/issues/242) - drop-columns
   reorders columns.

## 6.00-beta-9
* [issue 240](https://github.com/techascent/tech.ml.dataset/issues/240) - poor remove columns perf.
* Thorough fix to make `column-map` be a bit more predictable and forgiving.


## 6.00-beta-8
 * Fix for [issue 238](https://github.com/techascent/tech.ml.dataset/issues/238)

## 6.00-beta-7
 * Final upgrade for geni benchmark and moved a utility function to make
   it more generally accessible.

## 6.00-beta-5
 * Perf upgrades - ensuring geni benchmark speed stays constant.

## 6.00-beta-4
 * [`column-map-m`](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-column-map-m)

## 6.00-beta-2
 * Small bugfix so missing sets are correctly set in `column-map`.

## 6.00-beta-1

 Data types and missing values are much more aggressively inferred (which is
 `O(n-rows)`) throughout the api. There is a new API to disable the inference: either
 pass something that is already a column or pass in a map with keys:
```clojure
  #:tech.v3.dataset{:data ... :missing ... :metadata ... :force-datatype? true}
```
  Put another way, if the input to `#{assoc ds/update-column ds/add-column
  ds/add-or-update-column}` is already a column
  (see `tech.v3.dataset.column/new-column`) or if `:tech.v3.dataset/force-datatype?` is
  true **and** `:tech.v3.dataset/data` is convertible to a reader then the data will not
  be scanned for datatype or missing values.  If the input data is a primitive-typed
  container then it will be scanned for missing values alone and anything else is
  passed through the object parsing system which is what is used for sequences of maps,
  maps of sequences and spreadsheets.

  In this way in general the system will do more work than before - more scans of the
  result of things like transducer pathways and persistent vectors but in return the
  dataset's column datatypes should match the user's expectations.  If too much time is
  being taken up via attempting to infer datatypes and missing sets then the user has
  the option to pass in explicitly constructed columns or column data representations
  both of which will disable the scanning.  Once the data is typed, elementwise
  mathematical operations of the type in `tech.v3.datatype.functional` **will not**
  result in further scans of the data.


Itemized Changes:

 * `assoc, ds/add-column, ds/update-column, ds/add-or-update-column` type operations all
    upgraded such that datatype and missing are inferred much more frequently.
 * `:tech.ml.dataset.parse/missing`, `:tech.ml.dataset.parse/parse-failure` -> `:tech.v3.dataset/missing`, `:tech.v3.dataset/parse-failure`.
 * `column-map` - Now scans results to infer datatype if not provided as opposed to assuming result is the
    widest of the input column types.  Also users can provide their own function that calculates missing sets as opposed to
	the default behavior being the union of the input columns' missing sets.


## 5.21
 * [Issue 233](https://github.com/techascent/tech.ml.dataset/issues/233) - Poi xlsx parser can now autodetect dates.  Note that fastexcel is the default
   xlsx parser so in order to parse xlsx files using poi use `tech.v3.libs.poi/workbook->datasets`.
 * [PR 232](https://github.com/techascent/tech.ml.dataset/pull/232) - Option - `:disable-comment-skipping?` - to disable comment skipping in csv files.


## 5.20
 * Return an Iterable from csv->rows as opposed to a seq.  Iterator-seq has nontrivial overhead.
 * Fixes for issues [229](https://github.com/techascent/tech.ml.dataset/issues/229),
   [230](https://github.com/techascent/tech.ml.dataset/issues/230), and [231](https://github.com/techascent/tech.ml.dataset/issues/231).

## 5.19
 * Using builder model for parquet both for forward compatibility and so we can set an output stream
   as opposed to a file path.  This allows a graal native pathway to work with parquet.

## 5.18
 * Graal-native friendly mmap pathways (no requiring resolve, you have to explicitly set the implementation in your main.clj file).
 * Parquet write pathway update to make more standard and more likely to work with future versions of parquet.  This means, however, that there will
   no longer be a direct correlation between number of datasets and number of record batches in a parquet file as the standard pathway takes care
   of writing out record batches when a memory constraint is triggered.  So if you save a dataset you may get a parquet file back that contains
   a sequence of datasets.  There are many parquet options, see the documentation for
   [ds-seq->parquet](https://techascent.github.io/tech.ml.dataset/tech.v3.libs.parquet.html#var-ds-seq-.3Eparquet).

## 5.17
 * [Issue 225](https://github.com/techascent/tech.ml.dataset/issues/225) - column/row selection should return empty datasets when no columns are selected.
 * nil headers now print fine - thanks to DavidVujic.

## 5.15
 * [Issue 224](https://github.com/techascent/tech.ml.dataset/issues/224) - dataset creating fails in map case when all vals are seqs.

## 5.14
 * Another set of smaller upgrades to csv parsing.
 * [Reservoir sampling](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html#var-reservoir-dataset) is supported for large aggregations.
 * tech.io (and thus nippy) is upgraded.
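
As a rough illustration of the reservoir sampling idea mentioned above, the sketch below implements the classic single-pass Algorithm R using only `clojure.core` and `java.util.Random`. The function name `reservoir-sample` and its argument order are hypothetical; this is not the library's `reservoir-dataset` implementation, only a sketch of the general technique.

```clojure
(defn reservoir-sample
  "Return a vector of up to n items drawn uniformly at random from xs in a
  single pass (Algorithm R).  Hypothetical helper, not part of the library."
  [n ^java.util.Random rng xs]
  (let [n (long n)]
    (first
     (reduce (fn [[reservoir seen] x]
               (let [seen (long seen)]
                 (if (< seen n)
                   ;; Fill phase: keep the first n items.
                   [(conj reservoir x) (inc seen)]
                   ;; Replace phase: the incoming item survives with
                   ;; probability n / (inc seen).
                   (let [j (.nextInt rng (int (inc seen)))]
                     [(if (< j n) (assoc reservoir j x) reservoir)
                      (inc seen)]))))
             [[] 0]
             xs))))

;; Draw 5 items from a stream of 1000 without holding the stream in memory.
(reservoir-sample 5 (java.util.Random. 42) (range 1000))
```

Memory use is bounded by `n` regardless of how long the input is, which is what makes the technique attractive for large aggregations.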

## 5.13
 * Various optimizations to csv parsing making it a bit (2x) faster.

## 5.12
 * All statistical/reduction summations now use [Kahan's compensated summation](https://en.wikipedia.org/wiki/Kahan_summation_algorithm).  This makes summation
   much more accurate for very large streams of data.
 * [Issue 220](https://github.com/techascent/tech.ml.dataset/issues/220) - confusing behavior on dataset creation.  This may result in different
   behavior than was expected previously when using maps of columns as dataset constructors.
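
To make the accuracy claim for compensated summation concrete, here is a minimal sketch of Kahan's algorithm in plain Clojure. The name `kahan-sum` is hypothetical and this is not the library's internal code; it only shows why carrying a correction term helps when many small values are added to a large one.

```clojure
(defn kahan-sum
  "Sum a sequence of numbers as doubles using Kahan's compensated summation.
  Hypothetical helper for illustration only."
  ^double [xs]
  (loop [xs  (seq xs)
         sum 0.0
         c   0.0]                       ; running compensation for lost low-order bits
    (if xs
      (let [y (- (double (first xs)) c) ; apply the correction to the next term
            t (+ sum y)]                ; low-order bits of y may be lost here
        (recur (next xs) t (- (- t sum) y)))
      sum)))

;; Ten increments of 1e-16 are each too small to change 1.0 on their own,
;; so plain addition never moves away from 1.0.
(def xs (cons 1.0 (repeat 10 1e-16)))

(reduce + xs)   ; naive sum stays at 1.0
(kahan-sum xs)  ; compensated sum recovers roughly 1.000000000000001
```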

## 5.11
 * Many more algorithms exposed and documentation updated for the
   [apache-data-sketch](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.apache-data-sketch.html)
   namespace.

## 5.10
 * apache data sketch set-cardinality algorithms hyper-log-log and theta.

## 5.07
 * [tech.v3.dataset.reductions](https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html)
  namespace now includes direct aggregations
  including group-by aggregations and also Ted Dunning's t-digest algorithm for
  probabilistic cdf and quantile estimation.

## 5.06
 * Bugs introduced by t-digest version 3.2.  Ignore this release.

## 5.05
 * Ragged per-row arrays are partially supported for parquet.

## 5.04
 * `headers?` now also works when writing csv/tsv documents (thanks to behrica).

## 5.03
 * group-by is now done with a `LinkedHashMap`, so the keys are ordered by where they
   are first found in the data.  This is useful for operations such as re-indexing
   a previously sorted dataset, as the existing order is preserved.
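
The ordering behavior described above can be sketched with plain Java interop. The function `ordered-group-by` below is a hypothetical stand-in, not the library's `group-by`; it only demonstrates that a `java.util.LinkedHashMap` iterates its keys in first-insertion order.

```clojure
(import '[java.util LinkedHashMap])

(defn ordered-group-by
  "Group coll by (f x), returning a LinkedHashMap whose keys iterate in the
  order they were first encountered.  Hypothetical helper for illustration."
  [f coll]
  (let [groups (LinkedHashMap.)]
    (doseq [x coll]
      (let [k (f x)]
        (.put groups k (conj (or (.get groups k) []) x))))
    groups))

;; Keys come back in first-seen order (2, then 0, then 1), not sorted order.
(vec (.keySet (ordered-group-by #(mod % 3) [5 3 4 8 6])))
```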

## 5.02
 * Fix for fastexcel files containing formulas.
 * Codox org.ow2.asm fix from dtype-next.

## 5.00
 * Large upgrade to [`dtype-next`](https://github.com/cnuernber/dtype-next/).
 * All public functions are dataset-first.  This breaks compatibility with sort-by,
   filter, etc.
 * Namespace changes to be `tech.v3` across the board as opposed to `tech.v2` along
   with `tech.ml`.
 * All smile dataframe functionality is in tech.v3.libs.smile.data.

## 4.03
 * Fix for #136

## 4.02
 * Optimized conversion from a dataset to and from a neanderthal dense matrix is
   now supported -- see tech.ml.dataset.neanderthal.

## 4.01
 * Major cleanup of dependencies.  Logging works now, for better or definitely
   at times for worse.  To silence annoying, loud logging call:
   ```clojure
   (tech.ml.dataset.utils/set-slf4j-log-level :info)
   ```
 * Added a large-dataset reduction namespace: `tech.ml.dataset.reductions`.  Currently
   very beta but in general large reductions will reduce to java streams as these have
   parallelization possibilities that sequences do not have; for instance you can
   get a parallel stream out of a hash map.

## 4.00
 * Upgrade to smile 2.5.0

## 3.11
 * Major fix to tech.v2.datatype.mmap/mmap-file - resource types weren't being set in
   default.
 * `tech.ml.dataset/csv->dataset-seq` - Fixed input stream closing before sequence is
   completely consumed.

## 3.10
 * `tech.libs.arrow/write-dataset-seq-to-stream!` - Given a sequence of datasets, write
    an arrow stream with one record-batch for each dataset.
 * `tech.libs.arrow/stream->dataset-seq-copying` - Given an arrow stream, return a
    sequence of datasets, one for each arrow data record.
 * `tech.libs.arrow/stream->dataset-seq-inplace` - Given an arrow stream, return a
    sequence of datasets constructed in-place on memory mapped data.  Expects to be
	used within a `tech.resource/stack-resource-context` but accepts options for
	`tech.v2.datatype.mmap/mmap-file`.
 * `tech.libs.arrow/visualize-arrow-stream` - memory-maps a file and returns the arrow
    structure in a way that prints nicely to the REPL.  Useful for exploring an arrow
	file and quickly seeing the low level structure.
 * `tech.ml.dataset/csv->dataset-seq` - Given a potentially large csv, parse it into
    a sequence of datasets.  These datasets are guaranteed to share a schema and so
	an efficient form of writing really large arrow files is to use this function
	along with `tech.libs.arrow/write-dataset-seq-to-stream!`.


## 3.08
#### Arrow Support
 * Proper arrow support.  In-place or accelerated copy pathway into the jvm.
 * `tech.libs.arrow` exposes a few functions to dive through arrow files and
   produce datasets.  Right now only stream file format is supported.
   Copying is supported via their blessed API.  In-place is supported by
   a more or less clean room implementation using memory mapped files.  There
   will be a blog post on this soon.
 * tech.datatype has a new namespace, tech.v2.datatype.mmap that supports memory
   mapping files and direct memory access for address spaces (and files) larger
   than the java nio 2GB limit for memory mapping and nio buffers.

## 3.07
 * Issue 122 - Datasets with columns of datasets did not serialize to nippy.

## 3.06
 * Issue 118 - sample, head, tail, all set :print-index-range so the entire ds prints.
 * Issue 119 - median in descriptive stats
 * Issue 117 - filter-column allows you to pass in a value for exact matches.
 * Updated README thanks to joinr.
 * Updated walkthrough.

## 3.05
 * Rebuilt with java8 so class files are java8 compatible.

## 3.04 - Bad release, built with java 11
 * Issue 116 - `tech.ml.dataset/fill-range-replace` - Given a numeric or date column,
   interpolate column such that differences between successive values are smaller
   than a given cutoff.  Use replace-missing functionality on all other columns
   to fill in values for generated rows.
 * Issue 115 - `tech.ml.dataset/replace-missing` Subset of replace-missing from
   tablecloth implemented.

## 3.03
 * Bugfix - Some string tables saved out with version 2.X would not load correctly.

## 3.02
 * Issue 98 - Reading csv/xlsx files sometimes produce numbers which breaks setting
   colnames to keywords.
 * Issue 113 - NPE when doing hasheq on empty dataset
 * Issue 114 - Columns now have full hasheq implementation - They are
   `IPersistentCollection`s.

## 3.01
 * Datasets implement IPersistentMap.  This changes the meaning of `(seq dataset)`:
   where it used to return columns, it now returns a sequence of map entries.
   It does mean, however, that you can destructure datasets in let statements to
   get the columns back and use clojure.core/[assoc,dissoc], contains? etc.
   Some of the core Clojure functions, such as select-keys, will change your dataset
   into a normal clojure persistent map so beware.


## 2.15
* fix nippy save/load for string tables.
* string tables now have arraylists for their int->str mapping.
* saving encoded-text columns is now possible with their encoding object.

## 2.14 - BAD RELEASE, NIPPY SAVE/LOAD BROKEN
 * There is a new parse type: `:encoded-text`.  When read, this will appear to be a
   string column however the user has a choice of encodings and utf-8 is the default.
   This is useful when you need a particular encoding for a column.  It is roughly
   twice as efficient by default as a normal string encoding (utf-8 vs. utf-16).

## 2.13
 * `nth`, `map` on packed datetime columns (or using them as functions)
   returns datetime objects as opposed to their packed values.  This means that if you
   ask a packed datetime column for an object reader you get back an unpacked value.

## 2.12
 * Better support of `nth`.  Columns cache the generic reader used for nth queries
   and all tech.v2.datatype readers support nth and count natively in base java
   interface implementations.
 * New namespace - `tech.ml.dataset.text.bag-of-words` that contains code to convert
  a dataset with a text field into a dataset with document ids and a
  document-id->token-idx dataset.
 * [Quick Reference](docs/quick-reference.md)

## 2.11
 * After several tries got docs up on cljdoc.  Need to have provided deps cleaned
   up a bit better.

## 2.09
 - Include logback-classic as a dependency as smile.math brings in slf4j and this causes
   an error if some implementation of slf4j isn't included thus breaking things like
   cljdoc.
 - Experimental options for parsing text (:encoded-text) when dealing with
   large text fields.


## 2.08
 * Major datatype datetime upgrade - [Issue 31](https://github.com/techascent/tech.datatype/issues/31)


## 2.07
 * Bugfix - string tables that required integer storage were written out
   incorrectly to nippy files.


## 2.06
 * `left-join-asof` - Implementation of algorithms from pandas'
    [`merge_asof`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html).


## 2.05
 * Bugfix release - We now do not ever parse to float32 numbers by default.  This was
   silently causing data loss.  The cost of this is that files are somewhat larger and
   potentially we need to have an option to set the default sequence of datatypes
   attempted during data parsing.
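
The data loss referred to above is easy to reproduce at the REPL with plain Clojure. The value below is made up for illustration; the point is only that a float32 carries about 7 significant decimal digits, so round-tripping a double through it silently changes the number.

```clojure
;; A made-up price with more significant digits than a float32 can hold.
(def price 1234567.89)

;; Narrow to float32 and widen back to a double.
(def roundtrip (double (float price)))

roundtrip            ; => 1234567.875
(== price roundtrip) ; => false
```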


## 2.04
 * Added `concat-copying`.  This is much faster when you want to concatenate many
   things at the cost of copying the data and thus potentially increasing the working
   set size in memory.


## 2.03
 * Saving to nippy is much faster because there is a new function to efficiently
   construct a string table from a reader of strings:
   `tech.ml.dataset.string-table/string-table-from-strings`.

## 2.02
 * **breaking change** - Remove date/uuid inference pathway for strings from
   mapseq/spreadsheet pathways.
 * nippy freeze/thaw is efficiently supported :-).


## 2.01
 * Issue-94 - Ragged csv data loads automatically now.
 * Issue-87 - Printing double numbers is much better.
 * Fixed saving tsv files - was writing out csv files.
 * Fixed writing packed datatypes - was writing integers.
 * Added parallelized loading of csv - helps a bit but only if parsing
   is really expensive, so only when lots of datetime types or something
   of that nature.

## 2.0
 * No changes from beta-59

## 2.0-beta-59
 * `tech.datatype` now supports persistent vectors made via `clojure.core.vector-of`.
   `vector-of` is a nice middle ground between raw persistent vectors and java arrays
   and may be a simple path for many users into typed storage and datasets.
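
For readers unfamiliar with `vector-of`, the sketch below shows the `clojure.core` side only: a persistent vector backed by primitive storage. How `tech.datatype` consumes such a vector is not shown here and is not assumed.

```clojure
;; A persistent vector whose elements are stored as primitive doubles.
(def prices (vector-of :double 39.81 36.35 43.22))

;; It behaves like any other persistent vector.
(conj prices 28.37)
(nth prices 0)

;; Values are coerced to the element type on the way in.
(def coerced (into (vector-of :double) [1 2 3]))
```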


## 2.0-beta-58
 - mistake release...nothing to see here....


## 2.0-beta-57
 * Issue-92 - ->dataset failed for map-style datasets
 * Issue-93 - use smile.io to load arrow/parquet files via `->dataset`
 * Issue-91 - apply doesn't work with columns or readers

## 2.0-beta-56
**breaking changes**
 * Upgraded smile to latest version (2.4.0).  This is a very new API so if
   you are relying transitively on smile via dataset this may have broken your
   systems.  Smile 1.4.X and smile 2.X are very different interfaces so this
   is important to get in before releasing a 2.0 version of dataset.

 There is now an efficient conversion to/from smile dataframes.

### New Functions
   * `->dataset` converts a smile dataframe to a dataset.
   * `dataset->smile-dataframe` converts a dataset to a smile dataframe.
     Columns that are reader based will be copied into java arrays.  To enable
	 predictable behavior a new function was added.
   * `ensure-array-backed` - ensure each column in the dataset has a zerocopy
   conversion to a java array enabled by `tech.v2.datatype/->array`.
   * `invert-string->number` - The pipeline function `string->number` stores a string
     table in the column metadata.  Using this metadata, invert the string->number
	 operation returning the column back to its original state.  This metadata is
	 :label-map which is a map from column-data to number.


## 2.0-beta-55
 * Issue-89 - column iterables operate in object space meaning missing
   values are nil as opposed to the datatype's missing value indicator.

## 2.0-beta-XX
 * Datatype readers now support typed java stream creation (typedStream method).

## 2.0-beta-54
 * Issue-88 - rename column fails on false name

## 2.0-beta-53
 * Issue-86 - joins on datasets with not typical names
 * Issue-85 - select-rows can take a scalar.

## 2.0-beta-52
 * `dtype/clone` works correctly for arrays.

## 2.0-beta-51
 * profiled group-by-column quite a bit.  Found/fixed several issues,
   about 10X faster if table is wide as compared to long.
 * Fixed printing in a few edge cases.
 * Issue-84 - `tech.ml.dataset.column/scan-data-for-missing` fix.

## 2.0-beta-50
 * Memory optimization related to roaring bitmap usage.
 * Issue-82 - Empty datasets no longer printed.

## 2.0-beta-49
 * Small fix to printing to make pandoc work better.

## 2.0-beta-48
 * much better printing.  Datasets now correctly print multiline column data and there
   are a set of options to control the printing.  See `dataset->str`.

## 2.0-beta-47
 * UUIDs are now supported as datatypes.  This includes parsing them from strings
   out of csv and xlsx files and as a fully supported object in mapseq pathways.

## 2.0-beta-46
 * Conversion of instant->zoned-date-time uses UTC zone instead of system zone.
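
The practical difference between the two zone choices can be seen with `java.time` alone. This is only an interop sketch of the underlying behavior, not the library's conversion function.

```clojure
(import '[java.time Instant ZoneId ZoneOffset])

(def inst (Instant/parse "2020-01-01T00:00:00Z"))

;; Depends on the time zone configured on the machine running the code.
(def system-zdt (.atZone inst (ZoneId/systemDefault)))

;; Gives the same wall-clock fields on every machine.
(def utc-zdt (.atZone inst ZoneOffset/UTC))

(.getHour utc-zdt) ; => 0
```

Both values represent the same instant; only the wall-clock fields (hour, day and so on) differ when the system zone is not UTC.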

## 2.0-beta-45
 * issue-76 - quartiles for datetime types in descriptive stats
 * issue-71 - Added shape function to main dataset api.  Returns shape in row major
   format.
 * issue-69 - Columns elide missing values during print operation.

## 2.0-beta-44
 * drop-rows on an empty set is a noop.
 * `tech.ml.dataset.column/stats` was wrong for columns with missing
   values.


## 2.0-beta-43
 * `tech.v2.datatype` was causing double-read on boolean readers.
 * issue-72 - added `max-num-columns` because csv and tsv files with more than 512
   columns were failing to parse.  New default is 8192.
 * issue-70 - The results of any join have two maps in their metadata -
   :left-column-names - map of original left-column-name->new-column-name.
   :right-column-names - map of original right-column-name->new-column-name.


## 2.0-beta-42
 * `n-initial-skip-rows` works with xlsx spreadsheets.
 * `assoc`, `dissoc` implemented in the main dataset namespace.

## 2.0-beta-41
 * issue-67 - Various `tech.v2.datatype.functional` functions are updated to be
 more permissive about their inputs and cast the result to the appropriate
 datatype.

## 2.0-beta-40
 * issue-65 - datetimes in mapseqs were partially broken.
 * `tech.v2.datatype.functional` will now change the datatype appropriately on a
    lot of unary math operations.  So for instance calling sin, cos, log, or log1p
	on an integer reader will now return a floating point reader.  These methods used
	to throw.
 * subtle bug in the ->reader method defined for object arrays meant that sometimes
   attempting math on object columns would fail.
 * `tech.ml.dataset/column-cast` - Changes the column datatype via an optionally
   provided cast function.  This function is powerful - it will correctly convert
   packed types to their string representation, it will use the parsing system on
   string columns and it uses the same complex datatype argument as
   `tech.ml.dataset.column/parse-column`:
```clojure
user> (doc ds/column-cast)
-------------------------
tech.ml.dataset/column-cast
([dataset colname datatype])
  Cast a column to a new datatype.  This is never a lazy operation.  If the old
  and new datatypes match and no cast-fn is provided then dtype/clone is called
  on the column.

  colname may be a scalar or a tuple of [src-col dst-col].

  datatype may be a datatype enumeration or a tuple of
  [datatype cast-fn] where cast-fn may return either a new value,
  the :tech.ml.dataset.parse/missing, or :tech.ml.dataset.parse/parse-failure.
  Exceptions are propagated to the caller.  The new column has at least the
  existing missing set if no attempt returns :missing or :cast-failure.
  :cast-failure means the value gets added to metadata key :unparsed-data
  and the index gets added to :unparsed-indexes.


  If the existing datatype is string, then tech.ml.datatype.column/parse-column
  is called.

  Casts between numeric datatypes need no cast-fn but one may be provided.
  Casts to string need no cast-fn but one may be provided.
  Casts from string to anything will call tech.ml.dataset.column/parse-column.
user> (def stocks (ds/->dataset "test/data/stocks.csv" {:key-fn keyword}))

#'user/stocks
user> (ds/head stocks)
test/data/stocks.csv [5 3]:

| :symbol |      :date | :price |
|---------+------------+--------|
|    MSFT | 2000-01-01 |  39.81 |
|    MSFT | 2000-02-01 |  36.35 |
|    MSFT | 2000-03-01 |  43.22 |
|    MSFT | 2000-04-01 |  28.37 |
|    MSFT | 2000-05-01 |  25.45 |
user> (take 5 (stocks :price))
(39.81 36.35 43.22 28.37 25.45)
user> (take 5 ((ds/column-cast stocks :price :string) :price))
("39.81" "36.35" "43.22" "28.37" "25.45")
user> (take 5 ((ds/column-cast stocks :price [:int32 #(Math/round (double %))]) :price))
(40 36 43 28 25)
user>
```


## 2.0-beta-39
 * renamed 'column-map' to 'column-name->column-map'.  This is a public interface change
   and we do apologize!
 * added 'column-map' which maps a function over one or more columns.  The result column
   has a missing set that is the union of the input columns' missing sets:
```clojure
user> (-> (ds/->dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}])
          (ds/column-map
           :summed
           (fn ^double [^double lhs ^double rhs]
             (+ lhs rhs))
           :a :b))
_unnamed [3 3]:

| :a |    :b | :summed |
|----+-------+---------|
|  1 |       |         |
|    | 2.000 |         |
|  2 | 3.000 |   5.000 |
user> (tech.ml.dataset.column/missing
       (*1 :summed))
#{0,1}
```

## 2.0-beta-38
 * [issue-64] - more tests revealed more problems with concat with different column
   types.
 * added `tech.v2.datatype/typed-reader-map` where the result datatype is derived
   from the input datatypes of the input readers.  The result of map-fn is
   unceremoniously coerced to this datatype -
```clojure
user> (-> (ds/->dataset [{:a 1.0} {:a 2.0}])
               (ds/update-column
                :a
                #(dtype/typed-reader-map (fn ^double [^double in]
                                           (if (< in 2.0) (- in) in))
                                         %)))
_unnamed [2 1]:

|     :a |
|--------|
| -1.000 |
|  2.000 |
```
 * Cleaned up the tech.datatype widen datatype code so it models a property type graph
   with clear unification rules (where the parents are equal else :object).

## 2.0-beta-37
 * [issue-64] - concat columns with different datatypes does a widening.  In addition,
   there are tested pathways to change the datatype of a column without changing the
   missing set.
 * `unroll-column` takes an optional argument `:indexes?` that will record the source
   index in the entry the unrolled data came from.

## 2.0-beta-36
 * generic column data lists now support `.addAll`

## 2.0-beta-35
 * `tech.datatype` - all readers are marked as sequential.
 * `unroll-column` - Given a column that may contain either iterable or scalar data,
    unroll it so it only contains scalar data duplicating rows.
 * Issue 61 - Empty bitsets caused exceptions.
 * Issue 62 - IP addresses parsed as durations.

## 2.0-beta-34
 * Major speed (100x+) improvements to `tech.ml.dataset.column/unique` and especially
   `tech.ml.dataset.pipeline/string->number`.

## 2.0-beta-33
 * `tech.v2.datatype` namespace has a new function - [make-reader](https://github.com/techascent/tech.datatype/blob/d735507fe6155e4e112e5640df4c211213f0deba/src/tech/v2/datatype.clj#L458) - that reifies
   a reader of the appropriate type.  This allows you to make new columns that have
   nontrivial translations and datatypes much easier than before.
 * `tech.v2.datatype` namespace has a new function - [->typed-reader](https://github.com/techascent/tech.datatype/blob/5b4745f728a2773ae542fac9613ffd1c482b9750/src/tech/v2/datatype.clj#L557) - that typecasts the incoming object into a reader of the appropriate datatype.
   This means that .read calls will be strongly typed and is useful for building up a set
   of typed variables before using `make-reader` above.
 * Some documentation on the implications of
   [columns, readers, and datatypes](docs/columns-readers-and-datatypes.md).

## 2.0-beta-32
 * Issue 52 - CSV columns with empty column names get named after their index.  Before they would cause
   an exception.
 * `tech.datatype` added a [method](https://github.com/techascent/tech.datatype/blob/bcffe8abe81a53022a5e5d24eae2577c58287bb7/src/tech/v2/datatype.clj#L519)
   to transform a reader into a  persistent-vector-like object that derives from
   `clojure.lang.APersistentVector` and thus gains benefit from the excellent equality
   and hash semantics of persistent vectors.

## 2.0-beta-31
 * Fixed #38 - set-missing/remove-rows can take infinite seqs - they are trimmed to
   dataset length.
 * Fixed #47 - Added [`columnwise-concat`](https://github.com/techascent/tech.ml.dataset/blob/bb3f3dbad78a04d81c08d6ae8f1507c6f4e26ed9/src/tech/ml/dataset.clj#L177)
   which is a far simpler version of tidyr's
   [`pivot_longer`](https://tidyr.tidyverse.org/reference/pivot_longer.html).  This is implemented
   efficiently in terms of indexed reader concatenation and as such should work
   on tables of any size.
 * Fixed #57 - BREAKING PUBLIC API CHANGES - We are getting more strict on the API - if
   a function is dataset-last (thus appropriate for `->>`) then any options must be
   passed before the dataset.  Same is true for the set of functions that are dataset
   first.  We will be more strict about this from now on.

## 2.0-beta-30
 * Parsing datetime types now works if the column starts with missing values.
 * An efficient formulation of `java.util.Map` is introduced for when you have
   a bitmap of keys and a single value:
   `tech.v2.datatype.bitmap/bitmap-value->bitmap-map`.  This is used for
   replace-missing type operations.

## 2.0-beta-29
 * `brief` now does not return missing values.  Double or float NaN or INF values
   from a mapseq result in maps with fewer keys.
 * Set of columns used for default descriptive stats is reduced to original set as
   this fits on a small repl nicely.  Possible to override.  `brief` overrides this
   to provide defaults to get more information.
 * `unique-by` returns indexes in order.
 * Fixed #51 - mapseq parsing now follows proper number tower.

## 2.0-beta-28
 * Fixed #36 - use key-fn uniformly across all loaded datatypes
 * Fixed #45 - select can take a map.  This does a selection and
     a projection to new column names.
 * Fixed #41 - boolean columns failed to convert to doubles.
 * Fixed #44 - head,tail,shuffle,rand-nth,sample all implemented in format
     appropriate for `->>` operators.

## 2.0-beta-27
 * Update `tech.datatype` with upgraded and fewer dependencies.
   - asm 7.1 (was 7.0)
   - org.clojure/math.combinatorics 1.6 (was 1.2)
   - org.clojure/test.check 1.0.0

## 2.0-beta-25
 * Optimized filter.  Record of optimization is on
   [zulip](https://clojurians.zulipchat.com/#narrow/stream/151924-data-science/topic/tech.2Eml.2Edataset.20-.20filter).
   Synopsis is a speedup of like 10-20X depending on how much work you want to do :-).
   The base filter pathway has a speedup of around 2-4X.

## 2.0-beta-23
 * Updated descriptive stats to provide list of distinct elements for categorical
   columns of length less than 21.
 * Updated mapseq system to provide nil values for missing data as opposed to the
   specific column datatype's missing value indicator.  This can be overridden
   by passing in `:missing-nil?` false as an option.
 * Added `brief` function to main namespace so you can get a nice brief description
   of your dataset when working from the REPL.  This prints out better than
   `descriptive-stats`.

## 2.0-beta-21
 * loading json files found issues with packing.
 * optimized conversion to/from maps.

## 2.0-beta-20
 * sort-by works with generic comparison fns.

## 2.0-beta-19
 * descriptive stats works with mixed column name types
 * argsort is now used for all sort functions
 * `->` versions of sort added so you can sort in -> pathways
 * instants and such can be used for sorting

#### Added Functions
 - `column->dataset` - map a transform function over a column and return a new
   dataset from the result.  It is expected the transform function returns a map.
 - `drop-rows`, `select-rows`, `drop-columns` - more granular select calls.
 - `append-columns` - append a list of columns to a dataset.  Used with column->dataset.
 - `column-labeled-mapseq` - Create a sequence of maps with :value and :label members.
   This flattens the dataset by producing Y maps per row instead of 1 map per row
   where the maps themselves are labeled with the value in their :value member.  This
   is useful for building vega charts.
 - `->distinct-by-column` - take the first row where a given key is present.  The arrow
   form of this indicates the dataset is the first argument.
 - `->sort-by`, `->sort-by-column` - Forms of these functions for using in `(->)`
    dataflows.
 - `interpolate-loess` - Produce a new column from a given pair of columns using loess
    interpolation to create the column.  The interpolator is saved as metadata on the
	new column.


## 2.0-beta-16
* Missing a datetime datatype for parse-str and add-to-container! means
  a compile time error.  Packed durations can now be read from mapseqs.

## 2.0-beta-15
* Descriptive stats now works with instants.

## 2.0-beta-14
* Descriptive stats now works with datetime types.

## 2.0-beta-12
* Support for parsing and working with durations.  Strings that look like times -
   "00:00:12" will be parsed into hh:mm:ss durations.  The value can have a negative
   sign in front.  This is in addition to the duration's native serialization string
   type.
* Added short test for tensors in datasets.  This means that the venerable print-table
  is no longer enough as it doesn't account for multiline strings and thus datasets
  with really complex things will not print correctly for a time.
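
The hh:mm:ss duration format described above can be sketched as follows. The function `parse-hms` is hypothetical and is not the library's parser; it only shows one way such strings, including the optional leading minus sign, could map onto `java.time.Duration`.

```clojure
(require '[clojure.string :as str])
(import '[java.time Duration])

(defn parse-hms
  "Parse strings such as \"00:00:12\" or \"-01:02:03\" into a Duration.
  Hypothetical helper for illustration only."
  [^String s]
  (let [negative? (str/starts-with? s "-")
        body      (if negative? (subs s 1) s)
        [h m sec] (map #(Long/parseLong %) (str/split body #":"))
        d         (-> (Duration/ofHours h)
                      (.plusMinutes m)
                      (.plusSeconds sec))]
    (if negative? (.negated d) d)))

(parse-hms "00:00:12")  ; a 12 second duration
(parse-hms "-01:02:03") ; minus 1 hour, 2 minutes, 3 seconds
```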

## 2.0-beta-11
* Various fixes related to parsing and working with open data.
* `tech.ml.dataset.column/parse-column` - given a string column that failed to parse for
  some reason, you can force the system to attempt to parse it using, for instance,
  relaxed parsing semantics where failures simply record the failure in metadata.
* relaxed parsing in general is supported across all input types.

## 0.26
### Added
* rolling (rolling windows of computed functions - math operation)
* dataset/dssort-by
* dataset/ds-take-nth

## 0.22
### Added
* PCA