Parquet large file write performance degradation #208

Open
arthurgailes opened this issue Aug 3, 2024 · 2 comments
Labels
reprex needs a minimal reproducible example

Comments


arthurgailes commented Aug 3, 2024

Hello,

duckdb and arrow seem to write Parquet files at roughly the same speed until the data reaches about 10 GB, at which point duckdb becomes roughly an order of magnitude slower.

This example is very large, but the slowdown also affects partitioned writes and parquet-to-parquet COPY workflows, sketched below.
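
For context, the other affected workflows boil down to plain COPY statements like these (a rough sketch only; out_dir, x.parquet, and y.parquet are placeholder paths):

-- partitioned write of the same table (hive-style directories, one per category1 value)
COPY df TO 'out_dir' (FORMAT 'parquet', PARTITION_BY (category1));

-- parquet-to-parquet rewrite of an existing file
COPY (SELECT * FROM 'x.parquet') TO 'y.parquet' (FORMAT 'parquet');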

library(duckdb)
#> Loading required package: DBI
#> Warning: package 'DBI' was built under R version 4.3.3
library(arrow)
#> Warning: package 'arrow' was built under R version 4.3.3
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.3.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(microbenchmark)
library(glue)
#> Warning: package 'glue' was built under R version 4.3.3

# Generate a large DataFrame
set.seed(123)
n <- 1e9  # Number of rows
df <- data.frame(
  id = 1:n,
  value = rnorm(n),
  category1 = sample(LETTERS, n, replace = TRUE),
  category2 = sample(letters, n, replace = TRUE),
  description = replicate(n, paste(sample(c(letters, LETTERS), 10, replace = TRUE), collapse = ""))
)

df_size <- object.size(df)
df_size_gb <- df_size / (1024^3)
cat("DataFrame size: ", df_size_gb, "GB\n")
#> DataFrame size:  93.13226 GB

# Connect to DuckDB
con <- dbConnect(duckdb::duckdb())

# Copy DataFrame to DuckDB
dbWriteTable(con, "df", df, overwrite = TRUE)

# Create a temporary file
temp_file <- tempfile(fileext = ".parquet")

# Time writing the data frame to Parquet via duckdb (COPY) and via arrow
write_time <- microbenchmark(
  duck = dbExecute(con, glue("COPY (SELECT * FROM df) TO '{temp_file}' (FORMAT 'parquet')")),
  arrow = write_parquet(df, temp_file),
  times = 1
)

print(write_time)
#> Unit: seconds
#>   expr       min        lq      mean    median        uq       max neval
#>   duck 2096.6953 2096.6953 2096.6953 2096.6953 2096.6953 2096.6953     1
#>  arrow  298.3979  298.3979  298.3979  298.3979  298.3979  298.3979     1
cat("Parquet file written to: ", temp_file, "\n")
#> Parquet file written to:  Z:/RTemp\Rtmpo7Y8u1\file63602fc514b3.parquet

# Clean up
dbDisconnect(con, shutdown = TRUE)

Created on 2024-08-03 with reprex v2.1.1

krlmlr (Collaborator) commented Aug 17, 2024

Thanks. What timings do you see when you run this directly from the duckdb CLI? I don't currently have access to a machine that large.
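
Something roughly like this in the CLI would already help (a sketch only, with a smaller row count and a placeholder output path; adjust to whatever fits on your machine):

.timer on
CREATE TABLE df AS
  SELECT i AS id, random() AS value
  FROM range(100000000) t(i);
COPY df TO 'out.parquet' (FORMAT 'parquet');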

krlmlr added the reprex (needs a minimal reproducible example) label on Aug 17, 2024
arthurgailes (Author) commented

Sorry, I can't run this directly from the CLI, but I posted a Python version here; it seems to be faster: duckdb/duckdb#13423 (comment)
