# Parquet
<web-summary>
Read Parquet files via Apache Arrow in Kotlin DataFrame — high-performance columnar storage for analytics.
</web-summary>
<card-summary>
Use Kotlin DataFrame to read Parquet datasets using Apache Arrow for fast, typed, columnar I/O.
</card-summary>
<link-summary>
Kotlin DataFrame can read Parquet files through Apache Arrow's Dataset API. Learn how and when to use it.
</link-summary>
Kotlin DataFrame supports reading [Apache Parquet](https://parquet.apache.org/) files through the Apache Arrow integration.
Requires the [`dataframe-arrow` module](Modules.md#dataframe-arrow), which is included by default in the general [`dataframe`](Modules.md#dataframe-general) artifact and when using `%use dataframe` in Kotlin Notebook.
> Kotlin DataFrame currently supports only *reading* Parquet via Apache Arrow; writing Parquet is not supported.
> {style="note"}
> Apache Arrow is not supported on Android, so reading Parquet files on Android is not available.
> {style="warning"}
> Structured (nested) Arrow types such as Struct are not yet supported in Kotlin DataFrame.
> See the issue: [Add inner / Struct type support in Arrow](https://github.com/Kotlin/dataframe/issues/536)
> {style="warning"}
## Reading Parquet Files
Kotlin DataFrame provides four `readParquet()` methods that can read from different source types.
All overloads accept optional `nullability` inference settings and `batchSize` for Arrow scanning.
```kotlin
// 1) URLs
public fun DataFrame.Companion.readParquet(
    vararg urls: URL,
    nullability: NullabilityOptions = NullabilityOptions.Infer,
    batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
): AnyFrame

// 2) Strings (interpreted as file paths or URLs, e.g., "data/file.parquet", "file://", or "http(s)://")
public fun DataFrame.Companion.readParquet(
    vararg strUrls: String,
    nullability: NullabilityOptions = NullabilityOptions.Infer,
    batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
): AnyFrame

// 3) Paths
public fun DataFrame.Companion.readParquet(
    vararg paths: Path,
    nullability: NullabilityOptions = NullabilityOptions.Infer,
    batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
): AnyFrame

// 4) Files
public fun DataFrame.Companion.readParquet(
    vararg files: File,
    nullability: NullabilityOptions = NullabilityOptions.Infer,
    batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
): AnyFrame
```
These overloads are defined in the `dataframe-arrow` module and internally use `FileFormat.PARQUET` from Apache Arrow's
Dataset API to scan the data and materialize it as a Kotlin `DataFrame`.
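For context, here is a minimal sketch of such a scan written directly against the Arrow Java Dataset API. This is a simplified illustration of what happens under the hood, not the actual `dataframe-arrow` internals; the `scanParquet` helper and its output are ours:

```kotlin
import org.apache.arrow.dataset.file.FileFormat
import org.apache.arrow.dataset.file.FileSystemDatasetFactory
import org.apache.arrow.dataset.jni.NativeMemoryPool
import org.apache.arrow.dataset.scanner.ScanOptions
import org.apache.arrow.memory.RootAllocator

// Hypothetical helper: scan a Parquet file batch by batch, roughly as the
// integration does before converting each batch's Arrow vectors into
// DataFrame columns.
fun scanParquet(uri: String, batchSize: Long = 32_768L) {
    RootAllocator().use { allocator ->
        FileSystemDatasetFactory(
            allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri,
        ).use { factory ->
            factory.finish().use { dataset ->
                dataset.newScan(ScanOptions(batchSize)).use { scanner ->
                    scanner.scanBatches().use { reader ->
                        while (reader.loadNextBatch()) {
                            // Each batch holds at most `batchSize` rows.
                            val root = reader.vectorSchemaRoot
                            println("Batch: ${root.rowCount} rows")
                        }
                    }
                }
            }
        }
    }
}
```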
### Examples
```kotlin
// Read from file paths (as strings)
val df = DataFrame.readParquet("data/sales.parquet")
```
<!---FUN readParquetFilePath-->
```kotlin
// Read from Path objects
val path = Paths.get("data/sales.parquet")
val df = DataFrame.readParquet(path)
```
<!---END-->
<!---FUN readParquetURL-->
```kotlin
// Read from URLs
val df = DataFrame.readParquet(url)
```
<!---END-->
<!---FUN readParquetFile-->
```kotlin
// Read from File objects
val file = File("data/sales.parquet")
val df = DataFrame.readParquet(file)
```
<!---END-->
<!---FUN readParquetFileWithParameters-->
```kotlin
// Read with explicit nullability inference and Arrow batch size settings
val file = File("data/sales.parquet")
val df = DataFrame.readParquet(
    file,
    nullability = NullabilityOptions.Infer,
    batchSize = 64L * 1024,
)
```
<!---END-->
If you want to see a complete, realistic data-engineering example using Spark and Parquet with Kotlin DataFrame,
check out the [example project](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/spark-parquet-dataframe).
### Multiple Files
It's possible to read multiple Parquet files:
<!---FUN readMultipleParquetFiles-->
```kotlin
val file = File("data/sales.parquet")
val file1 = File("data/sales1.parquet")
val file2 = File("data/sales2.parquet")
val df = DataFrame.readParquet(file, file1, file2)
```
<!---END-->
**Requirements:**
- All files must have compatible schemas.
- Files are vertically concatenated (a union of rows).
- Column types must match exactly.
- Columns missing from some files are filled with `null` values (see the sketch below).
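For example, a sketch of the null-filling behavior, with hypothetical files: `sales1.parquet` containing columns `id` and `amount`, and `sales2.parquet` containing an additional `region` column:

```kotlin
val df = DataFrame.readParquet(
    File("data/sales1.parquet"),
    File("data/sales2.parquet"),
)

df.rowsCount() // rows from both files, vertically concatenated
df.schema()    // `region` is inferred as nullable: rows that came
               // from sales1.parquet have null in that column
```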
### Performance tips
- **Column selection**: `readParquet` always reads all columns, so use DataFrame operations like `select()` immediately after reading to reduce memory usage in later operations (see the sketch after this list).
- **Predicate pushdown**: currently not supported; filtering happens after data is loaded into memory.
- Use Arrow-compatible JVMs as documented in
  [Apache Arrow Java compatibility](https://arrow.apache.org/docs/java/install.html#java-compatibility).
- Adjust `batchSize` when reading huge files to tune throughput vs. memory.
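For instance, a sketch of the column-selection and filtering tips; the column names `product` and `amount` are hypothetical:

```kotlin
// Narrow the frame right after reading: all columns are loaded,
// but later operations then work on a much smaller frame.
val sales = DataFrame.readParquet("data/sales.parquet")
    .select("product", "amount")

// There is no predicate pushdown, so filters also run in memory;
// apply them immediately after reading for the same reason.
val largeSales = sales.filter { "amount"<Double>() > 1_000.0 }
```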
### See also
- [](ApacheArrow.md) — reading and writing Arrow IPC formats.
- [Parquet official site](https://parquet.apache.org/).
- Example: [Spark + Parquet + Kotlin DataFrame](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/spark-parquet-dataframe).
- [](Data-Sources.md) — overview of all supported formats.