df-research/dataframe/docs/StardustDocs/topics/parse.md at main

ajet bdf064f54d init research

2026-02-08 11:20:43 -10:00


Returns a DataFrame in which the given String and Char columns are parsed into other types.

This is a special case of the convert operation.

This parsing operation is sometimes executed implicitly, for example, when reading from CSV or when converting String/Char columns to other types. You can recognize this by the locale or parserOptions arguments in these functions.

Related operations:

df.parse()

When no columns are specified, all String and Char columns are parsed, even those inside column groups and inside frame columns.

To parse only particular columns, use a column selector:

df.parse { age and weight }

Parsing Order

parse tries to parse every String/Char column into one of the supported types in the following order:

Int
Long
Instant (kotlin.time) (requires parseExperimentalInstant = true, enabled by default in DataFrame 1.0.0-Beta5)
Instant (kotlinx.datetime and java.time) (requires parseExperimentalInstant = false)
LocalDateTime (kotlinx.datetime and java.time)
LocalDate (kotlinx.datetime and java.time)
Duration (kotlin.time and java.time)
LocalTime (java.time)
URL (java.net)
Double (with optional locale settings)
Boolean
Uuid (kotlin.uuid.Uuid) (requires parseExperimentalUuid = true)
BigDecimal
JSON (arrays and objects) (requires the org.jetbrains.kotlinx:dataframe-json dependency)
Char
String
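The "first matching type wins" idea behind this order can be sketched with plain Kotlin. This is only an illustration, not DataFrame's implementation: it covers four of the types above, the names `Candidate`, `candidates`, and `inferColumnType` are hypothetical, and it assumes a column is converted only when every value parses as the candidate type.

```kotlin
// Illustrative subset of the parser chain; NOT DataFrame's actual implementation.
// Each candidate returns null when a value cannot be parsed as its type.
class Candidate(val typeName: String, val parse: (String) -> Any?)

// Candidates are listed in the same relative order as in the documentation above.
val candidates: List<Candidate> = listOf(
    Candidate("Int") { it.toIntOrNull() },
    Candidate("Long") { it.toLongOrNull() },
    Candidate("Double") { it.toDoubleOrNull() },
    Candidate("Boolean") { it.lowercase().toBooleanStrictOrNull() },
)

// Hypothetical helper: returns the name of the first type that accepts every value,
// or "String" when no candidate matches (the column would be left unparsed).
fun inferColumnType(values: List<String>): String =
    candidates.firstOrNull { candidate -> values.all { candidate.parse(it) != null } }
        ?.typeName
        ?: "String"
```

Because Int is tried before Long and Long before Double, a column is only widened to a later type when an earlier one rejects at least one value.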

When .parse() is called on a single column and the output type would be the same as the input type (String/Char), meaning the column cannot be parsed any further, an IllegalStateException is thrown. To avoid this, use col.tryParse() instead.

Parser Options

DataFrame supports multiple parser options that can be used to customize the parsing behavior. These can be supplied to the parse function (or any other function that can implicitly parse Strings) as an argument.

For each option you don't supply (or supply as null), DataFrame takes the value from the Global Parser Options.

Available parser options:

locale: Locale is used to parse doubles
- Global default locale is Locale.getDefault()
dateTimePattern: String is used to parse date and time
- Global default supports ISO (local) date-time
dateTimeFormatter: DateTimeFormatter is used to parse date and time
- Is derived from dateTimePattern and/or locale if null
nullStrings: List<String> is used to treat particular strings as null value
- Global default null strings are "null" and "NULL"
- When reading from CSV, more defaults are included, such as "" and "NA". See the KDocs there for the exact details
skipTypes: Set<KType> specifies types that should be skipped during parsing
- Empty set by global default; parsing can result in any supported type
useFastDoubleParser: Boolean is used to enable or disable the new fast double parser
- Enabled by global default
parseExperimentalUuid: Boolean is used to enable or disable parsing to the experimental kotlin.uuid.Uuid class.
- Disabled by global default
parseExperimentalInstant: Boolean is used to enable or disable parsing to the kotlin.time.Instant class, available from Kotlin 2.1+. Will parse to kotlinx.datetime.Instant if false.
- Disabled by global default in earlier versions; enabled by default in DataFrame 1.0.0-Beta5.

df.parse(options = ParserOptions(locale = Locale.CHINA, dateTimeFormatter = DateTimeFormatter.ISO_WEEK_DATE))

Global Parser Options

As mentioned before, you can change the default global parser options that will be used by read, convert, and other parse operations. Whenever you don't explicitly provide parser options to a function call, DataFrame will use these global options instead.

For example, to change the locale to French and add a custom date-time pattern for all following DataFrame calls, do:

DataFrame.parser.locale = Locale.FRANCE
DataFrame.parser.addDateTimePattern("dd.MM.uuuu HH:mm:ss")

For locale, this means that the one being used by the parser is defined as:

↪ The locale given as function argument directly, or in parserOptions, if it is not null, else

↪ The locale set by DataFrame.parser.locale = ..., if it is not null, else

↪ Locale.getDefault(), which is the system's default locale that can be changed with Locale.setDefault().
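The precedence above amounts to a chain of null checks. A minimal sketch, assuming both the explicit argument and the global setting are modelled as nullable values (`resolveLocale` is a hypothetical helper, not part of the DataFrame API):

```kotlin
import java.util.Locale

// Hypothetical helper mirroring the documented precedence:
// explicit argument first, then the global parser setting, then the system default.
fun resolveLocale(argumentLocale: Locale?, globalLocale: Locale?): Locale =
    argumentLocale ?: globalLocale ?: Locale.getDefault()
```

For example, `resolveLocale(null, Locale.FRANCE)` yields `Locale.FRANCE`, while `resolveLocale(null, null)` falls through to `Locale.getDefault()`.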

Parsing Doubles

DataFrame has a new fast and powerful double parser enabled by default. It is based on the FastDoubleParser library for its high performance and configurability (in the future, we might expand this support to Float, BigDecimal, and BigInteger as well).

The parser is locale-aware; it will use the locale set by the (global) parser options to parse the doubles. It also has a fallback mechanism built in, meaning it can recognize characters from all other locales (and some from Wikipedia) and parse them correctly as long as they don't conflict with the current locale.

For example, if your locale uses ',' as decimal separator, it will not recognize ',' as thousands separator, but it will recognize ''', ' ', '٬', '_', ' ', etc. as such. The same holds for characters like "e", "inf", "×10^", "NaN", etc. (ignoring case).

This means you can safely parse "123'456 789,012.345×10^6" with a US locale but not "1.234,5".

Aside from this, DataFrame also explicitly recognizes "∞", "inf", "infinity", and "infty" as Double.POSITIVE_INFINITY (as well as their negative counterparts), and "nan", "na", and "n/a" as Double.NaN. All forms of whitespace are treated equally.
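To make the special spellings concrete, here is a small stand-alone lookup in plain Kotlin. It is a sketch of the rule only, not DataFrame's code: `parseSpecialDouble` is a hypothetical name, and treating a leading "-" as the negative counterpart is an assumption about how those counterparts are written.

```kotlin
// Illustrative lookup for the special spellings listed above; not DataFrame's actual code.
// Returns null when the text is not one of the special values.
fun parseSpecialDouble(text: String): Double? {
    val normalized = text.trim().lowercase()
    // Assumption: the negative counterparts are written with a leading '-'.
    val negative = normalized.startsWith("-")
    val unsigned = normalized.removePrefix("-")
    return when (unsigned) {
        "∞", "inf", "infinity", "infty" ->
            if (negative) Double.NEGATIVE_INFINITY else Double.POSITIVE_INFINITY
        "nan", "na", "n/a" -> Double.NaN
        else -> null
    }
}
```

Ordinary numbers such as "12.5" return null here and would be handled by the regular double parser.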

If FastDoubleParser fails to parse a String as Double, DataFrame will try to parse it using the standard NumberFormat.parse() function as a last resort.

If you experience any issues with the new parser, you can turn it off by setting useFastDoubleParser = false, which will use the old NumberFormat.parse() function instead.
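For reference, locale-aware parsing with the standard java.text.NumberFormat looks roughly like the sketch below. It shows the JDK API only, not how DataFrame calls it: `parseDoubleWithNumberFormat` is a hypothetical helper, and rejecting partially consumed input is a design choice made here for illustration.

```kotlin
import java.text.NumberFormat
import java.text.ParsePosition
import java.util.Locale

// Hypothetical helper: parses text as a Double using the separators of the given locale.
// Returns null when the text is not a number or is only partially numeric.
fun parseDoubleWithNumberFormat(text: String, locale: Locale): Double? {
    val position = ParsePosition(0)
    val number = NumberFormat.getInstance(locale).parse(text, position)
    // NumberFormat stops at the first character it cannot use, so "12abc" would
    // otherwise be accepted as 12; require that the whole string was consumed.
    return if (number != null && position.index == text.length) number.toDouble() else null
}
```

With `Locale.GERMANY`, "1.234,5" is read as 1234.5 because that locale uses '.' for grouping and ',' as the decimal separator.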

Please report any issues you encounter.
