225 lines
6.9 KiB
Markdown
Vendored
225 lines
6.9 KiB
Markdown
Vendored
[//]: # (title: Write)
|
|
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.Write-->
|
|
|
|
[`DataFrame`](DataFrame.md) instances can be saved in the following formats: CSV, TSV, JSON, XLS(X) and Apache Arrow.
|
|
|
|
### Writing to CSV
|
|
|
|
You can write [`DataFrame`](DataFrame.md) in CSV format to file, to `String` or to `Appendable`
|
|
(i.e. to `Writer`).
|
|
|
|
Values of [`ColumnGroup`](DataColumn.md#columngroup), [`FrameColumn`](DataColumn.md#framecolumn), i.e. AnyRow, AnyFrame will be serialized as JSON objects.
|
|
|
|
<!---FUN writeCsv-->
|
|
|
|
```kotlin
|
|
df.writeCsv(file)
|
|
```
|
|
|
|
<!---END-->
|
|
|
|
<!---FUN writeCsvStr-->
|
|
|
|
```kotlin
|
|
val csvStr = df.toCsvStr(delimiter = ';', recordSeparator = System.lineSeparator())
|
|
```
|
|
|
|
<!---END-->
|
|
|
|
[`ColumnGroup`](DataColumn.md#columngroup) and [`FrameColumn`](DataColumn.md#framecolumn) values will be serialized as JSON strings.
|
|
|
|
### Writing to JSON
|
|
|
|
You can write your [`DataFrame`](DataFrame.md) in JSON format to file, to string or to `Appendable`
|
|
(i.e. to `Writer`).
|
|
|
|
<!---FUN writeJson-->
|
|
|
|
```kotlin
|
|
df.writeJson(file)
|
|
```
|
|
|
|
<!---END-->
|
|
|
|
<!---FUN writeJsonStr-->
|
|
|
|
```kotlin
|
|
val jsonStr = df.toJson(prettyPrint = true)
|
|
```
|
|
|
|
<!---END-->
|
|
|
|
### Write to Excel spreadsheet
|
|
|
|
Add dependency:
|
|
|
|
```kotlin
|
|
implementation("org.jetbrains.kotlinx:dataframe-excel:$dataframe_version")
|
|
```
|
|
|
|
You can write your [`DataFrame`](DataFrame.md) in XLS, XLSX format to a file, `OutputStream` or Workbook object.
|
|
|
|
<!---FUN writeXls-->
|
|
|
|
```kotlin
|
|
df.writeExcel(file)
|
|
```
|
|
|
|
<!---END-->
|
|
|
|
Values of [`ColumnGroup`](DataColumn.md#columngroup), [`FrameColumn`](DataColumn.md#framecolumn), i.e. AnyRow, AnyFrame will be serialized as JSON objects.
|
|
|
|
If you work directly with Apache POI, you can use created Workbook and Sheets in your code:
|
|
|
|
<!---FUN writeXlsAppendAndPostProcessing-->
|
|
|
|
```kotlin
|
|
/**
|
|
* Do something with generated sheets. Here we set bold style for headers and italic style for first data column
|
|
*/
|
|
fun setStyles(sheet: Sheet) {
|
|
val headerFont = sheet.workbook.createFont()
|
|
headerFont.bold = true
|
|
val headerStyle = sheet.workbook.createCellStyle()
|
|
headerStyle.setFont(headerFont)
|
|
|
|
val indexFont = sheet.workbook.createFont()
|
|
indexFont.italic = true
|
|
val indexStyle = sheet.workbook.createCellStyle()
|
|
indexStyle.setFont(indexFont)
|
|
|
|
sheet.forEachIndexed { index, row ->
|
|
if (index == 0) {
|
|
for (cell in row) {
|
|
cell.cellStyle = headerStyle
|
|
}
|
|
} else {
|
|
row.first().cellStyle = indexStyle
|
|
}
|
|
}
|
|
}
|
|
|
|
// Create a workbook (or use existing)
|
|
val wb = WorkbookFactory.create(true)
|
|
|
|
// Create different sheets from different dataframes in the workbook
|
|
val allPersonsSheet = df.writeExcel(wb, sheetName = "allPersons")
|
|
val happyPersonsSheet =
|
|
df.filter { person -> person.isHappy }.remove("isHappy").writeExcel(wb, sheetName = "happyPersons")
|
|
val unhappyPersonsSheet =
|
|
df.filter { person -> !person.isHappy }.remove("isHappy").writeExcel(wb, sheetName = "unhappyPersons")
|
|
|
|
// Do anything you want by POI
|
|
listOf(happyPersonsSheet, unhappyPersonsSheet).forEach { setStyles(it) }
|
|
|
|
// Save the result
|
|
file.outputStream().use { wb.write(it) }
|
|
wb.close()
|
|
```
|
|
|
|
<!---END-->
|
|
|
|
Add new sheets without using Apache POI directly by using a parameter to keep using the same file if it already exists
|
|
|
|
<!---FUN writeXlsWithMultipleSheets-->
|
|
|
|
```kotlin
|
|
// Create a new Excel workbook with a single sheet called "allPersons", replacing the file if it already exists -> Current sheets: allPersons
|
|
df.writeExcel(file, sheetName = "allPersons")
|
|
// Add a new sheet to the previous file without replacing it, by setting keepFile = true -> Current sheets: allPersons, happyPersons
|
|
df.filter { person -> person.isHappy }.remove("isHappy")
|
|
.writeExcel(file, sheetName = "happyPersons", keepFile = true)
|
|
// Add a new sheet to the previous file without replacing it, by setting keepFile = true -> Current sheets: allPersons, happyPersons, unhappyPersons
|
|
df.filter { person -> !person.isHappy }.remove("isHappy")
|
|
.writeExcel(file, sheetName = "unhappyPersons", keepFile = true)
|
|
```
|
|
|
|
<!---END-->
|
|
|
|
### Writing to Apache Arrow formats
|
|
|
|
Add dependency:
|
|
|
|
```kotlin
|
|
implementation("org.jetbrains.kotlinx:dataframe-arrow:$dataframe_version")
|
|
```
|
|
|
|
> Make sure to follow an
|
|
> [Apache Arrow Java compatibility](https://arrow.apache.org/docs/java/install.html#java-compatibility)
|
|
> guide when using Java 9+
|
|
> {style = "warning"}
|
|
|
|
[`DataFrame`](DataFrame.md) supports writing [Arrow interprocess streaming format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-streaming-format)
|
|
and [Arrow random access format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-random-access-files)
|
|
to raw WritableByteChannel, OutputStream, File or ByteArray.
|
|
|
|
Data may be saved "as is" (like exporting to new Excel file) or converted to match some target [Schema](https://arrow.apache.org/docs/java/reference/org/apache/arrow/vector/types/pojo/Schema.html)
|
|
if you have it (like inserting into existing SQL table).
|
|
|
|
The first approach is quite easy:
|
|
<!---FUN writeArrowFile-->
|
|
|
|
```kotlin
|
|
df.writeArrowIPC(file)
|
|
// or
|
|
df.writeArrowFeather(file)
|
|
```
|
|
|
|
<!---END-->
|
|
(writing to file, opened stream or channel),
|
|
<!---FUN writeArrowByteArray-->
|
|
|
|
```kotlin
|
|
val ipcByteArray: ByteArray = df.saveArrowIPCToByteArray()
|
|
// or
|
|
val featherByteArray: ByteArray = df.saveArrowFeatherToByteArray()
|
|
```
|
|
|
|
<!---END-->
|
|
(creating byte array). Nested frames and columns with mixed or unsupported types will be saved as String.
|
|
|
|
The second approach is a bit more tricky. You have to specify schema itself and casting behavior mode as `ArrowWriter` parameters.
|
|
Behavior `Mode` has four independent switchers: `restrictWidening`, `restrictNarrowing`, `strictType`, `strictNullable`.
|
|
You can use `Mode.STRICT` (this is default), `Mode.LOYAL` or any combination you want.
|
|
The `ArrowWriter` object should be closed after using because Arrow uses random access buffers not managed by Java GC.
|
|
Finally, you can specify a callback to be invoked if some data is lost or can not be saved according to your schema.
|
|
|
|
Here is full example:
|
|
<!---FUN writeArrowPerSchema-->
|
|
|
|
```kotlin
|
|
// Get schema from anywhere you want. It can be deserialized from JSON, generated from another dataset
|
|
// (including the DataFrame.columns().toArrowSchema() method), created manually, and so on.
|
|
val schema = Schema.fromJSON(schemaJson)
|
|
|
|
df.arrowWriter(
|
|
|
|
// Specify your schema
|
|
targetSchema = schema,
|
|
|
|
// Specify desired behavior mode
|
|
mode = ArrowWriter.Mode(
|
|
restrictWidening = true,
|
|
restrictNarrowing = true,
|
|
strictType = true,
|
|
strictNullable = false,
|
|
),
|
|
|
|
// Specify mismatch subscriber
|
|
mismatchSubscriber = writeMismatchMessage,
|
|
|
|
).use { writer: ArrowWriter ->
|
|
|
|
// Save to any format and sink, like in the previous example
|
|
writer.writeArrowFeather(file)
|
|
}
|
|
```
|
|
|
|
<!---END-->
|
|
On executing you should get two warnings:
|
|
>Column "city" contains nulls but expected not nullable
|
|
|
|
and
|
|
|
|
> Column "isHappy" is not described in the target schema and was ignored
|