init research
This commit is contained in:
@@ -0,0 +1,159 @@
|
||||
# Parquet
|
||||
|
||||
<web-summary>
|
||||
Read Parquet files via Apache Arrow in Kotlin DataFrame — high‑performance columnar storage for analytics.
|
||||
</web-summary>
|
||||
|
||||
<card-summary>
|
||||
Use Kotlin DataFrame to read Parquet datasets using Apache Arrow for fast, typed, columnar I/O.
|
||||
</card-summary>
|
||||
|
||||
<link-summary>
|
||||
Kotlin DataFrame can read Parquet files through Apache Arrow’s Dataset API. Learn how and when to use it.
|
||||
</link-summary>
|
||||
|
||||
Kotlin DataFrame supports reading [Apache Parquet](https://parquet.apache.org/) files through the Apache Arrow integration.
|
||||
|
||||
Requires the [`dataframe-arrow` module](Modules.md#dataframe-arrow), which is included by default in the general [`dataframe`](Modules.md#dataframe-general) artifact and in and when using `%use dataframe` for Kotlin Notebook.
|
||||
|
||||
> We currently only support READING Parquet via Apache Arrow; writing Parquet is not supported in Kotlin DataFrame.
|
||||
> {style="note"}
|
||||
|
||||
> Apache Arrow is not supported on Android, so reading Parquet files on Android is not available.
|
||||
> {style="warning"}
|
||||
|
||||
> Structured (nested) Arrow types such as Struct are not supported yet in Kotlin DataFrame.
|
||||
> See the issue: [Add inner / Struct type support in Arrow](https://github.com/Kotlin/dataframe/issues/536)
|
||||
> {style="warning"}
|
||||
|
||||
## Reading Parquet Files
|
||||
|
||||
Kotlin DataFrame provides four `readParquet()` methods that can read from different source types.
|
||||
All overloads accept optional `nullability` inference settings and `batchSize` for Arrow scanning.
|
||||
|
||||
```kotlin
|
||||
// 1) URLs
|
||||
public fun DataFrame.Companion.readParquet(
|
||||
vararg urls: URL,
|
||||
nullability: NullabilityOptions = NullabilityOptions.Infer,
|
||||
batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
|
||||
): AnyFrame
|
||||
|
||||
// 2) Strings (interpreted as file paths or URLs, e.g., "data/file.parquet", "file://", or "http(s)://")
|
||||
public fun DataFrame.Companion.readParquet(
|
||||
vararg strUrls: String,
|
||||
nullability: NullabilityOptions = NullabilityOptions.Infer,
|
||||
batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
|
||||
): AnyFrame
|
||||
|
||||
// 3) Paths
|
||||
public fun DataFrame.Companion.readParquet(
|
||||
vararg paths: Path,
|
||||
nullability: NullabilityOptions = NullabilityOptions.Infer,
|
||||
batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
|
||||
): AnyFrame
|
||||
|
||||
// 4) Files
|
||||
public fun DataFrame.Companion.readParquet(
|
||||
vararg files: File,
|
||||
nullability: NullabilityOptions = NullabilityOptions.Infer,
|
||||
batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
|
||||
): AnyFrame
|
||||
```
|
||||
|
||||
These overloads are defined in the `dataframe-arrow` module and internally use `FileFormat.PARQUET` from Apache Arrow’s
|
||||
Dataset API to scan the data and materialize it as a Kotlin `DataFrame`.
|
||||
|
||||
### Examples
|
||||
|
||||
```kotlin
|
||||
// Read from file paths (as strings)
|
||||
val df = DataFrame.readParquet("data/sales.parquet")
|
||||
```
|
||||
|
||||
<!---FUN readParquetFilePath-->
|
||||
|
||||
```kotlin
|
||||
// Read from Path objects
|
||||
val path = Paths.get("data/sales.parquet")
|
||||
val df = DataFrame.readParquet(path)
|
||||
```
|
||||
|
||||
<!---END-->
|
||||
|
||||
<!---FUN readParquetURL-->
|
||||
|
||||
```kotlin
|
||||
// Read from URLs
|
||||
val df = DataFrame.readParquet(url)
|
||||
```
|
||||
|
||||
<!---END-->
|
||||
|
||||
<!---FUN readParquetFile-->
|
||||
|
||||
```kotlin
|
||||
// Read from File objects
|
||||
val file = File("data/sales.parquet")
|
||||
val df = DataFrame.readParquet(file)
|
||||
```
|
||||
|
||||
<!---END-->
|
||||
|
||||
|
||||
<!---FUN readParquetFileWithParameters-->
|
||||
|
||||
```kotlin
|
||||
// Read from File objects
|
||||
val file = File("data/sales.parquet")
|
||||
|
||||
val df = DataFrame.readParquet(
|
||||
file,
|
||||
nullability = NullabilityOptions.Infer,
|
||||
batchSize = 64L * 1024
|
||||
)
|
||||
```
|
||||
|
||||
<!---END-->
|
||||
|
||||
|
||||
If you want to see a complete, realistic data‑engineering example using Spark and Parquet with Kotlin DataFrame,
|
||||
check out the [example project](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/spark-parquet-dataframe).
|
||||
|
||||
### Multiple Files
|
||||
|
||||
It's possible to read multiple Parquet files:
|
||||
|
||||
<!---FUN readMultipleParquetFiles-->
|
||||
|
||||
```kotlin
|
||||
val file = File("data/sales.parquet")
|
||||
val file1 = File("data/sales1.parquet")
|
||||
val file2 = File("data/sales2.parquet")
|
||||
|
||||
val df = DataFrame.readParquet(file, file1, file2)
|
||||
```
|
||||
|
||||
<!---END-->
|
||||
|
||||
**Requirements:**
|
||||
|
||||
- All files must have compatible schemas
|
||||
- Files are vertically concatenated (union of rows)
|
||||
- Column types must match exactly
|
||||
- Missing columns in some files will result in null values
|
||||
|
||||
### Performance tips
|
||||
|
||||
- **Column selection**: Because the `readParquet` method reads all columns, use DataFrame operations like `select()` immediately after reading to reduce memory usage in later operations
|
||||
- **Predicate pushdown**: Currently not supported—filtering happens after data is loaded into memory
|
||||
- Use Arrow‑compatible JVMs as documented in
|
||||
[Apache Arrow Java compatibility](https://arrow.apache.org/docs/java/install.html#java-compatibility).
|
||||
- Adjust `batchSize` if you read huge files and need to tune throughput vs. memory.
|
||||
|
||||
### See also
|
||||
|
||||
- [](ApacheArrow.md) — reading/writing Arrow IPC formats.
|
||||
- [Parquet official site](https://parquet.apache.org/).
|
||||
- Example: [Spark + Parquet + Kotlin DataFrame](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/spark-parquet-dataframe)
|
||||
- [](Data-Sources.md) — Overview of all supported formats
|
||||
Reference in New Issue
Block a user