Hello,

duckdb and arrow seem to write parquet files at roughly the same speed until the data reaches about 10+ GB, at which point duckdb is about an order of magnitude slower. This reprex is very large, but the issue also impacts partitioned writes and COPY x.parquet TO y.parquet workflows.
library(duckdb)
#> Loading required package: DBI
#> Warning: package 'DBI' was built under R version 4.3.3
library(arrow)
#> Warning: package 'arrow' was built under R version 4.3.3
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.3.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(microbenchmark)
library(glue)
#> Warning: package 'glue' was built under R version 4.3.3

# Generate a large DataFrame
set.seed(123)
n <- 1e9  # Number of rows
df <- data.frame(
  id = 1:n,
  value = rnorm(n),
  category1 = sample(LETTERS, n, replace = TRUE),
  category2 = sample(letters, n, replace = TRUE),
  description = replicate(n, paste(sample(c(letters, LETTERS), 10, replace = TRUE), collapse = ""))
)
df_size <- object.size(df)
df_size_gb <- df_size / (1024^3)
cat("DataFrame size: ", df_size_gb, "GB\n")
#> DataFrame size:  93.13226 GB

# Connect to DuckDB
con <- dbConnect(duckdb::duckdb())
# Copy DataFrame to DuckDB
dbWriteTable(con, "df", df, overwrite = TRUE)
# Create a temporary file
temp_file <- tempfile(fileext = ".parquet")
# Measure the time to write the DataFrame to Parquet using SQL
write_time <- microbenchmark(
  duck = dbExecute(con, glue("COPY (SELECT * FROM df) TO '{temp_file}' (FORMAT 'parquet')")),
  arrow = write_parquet(df, temp_file),
  times = 1
)
print(write_time)
#> Unit: seconds
#>   expr       min        lq      mean    median        uq       max neval
#>   duck 2096.6953 2096.6953 2096.6953 2096.6953 2096.6953 2096.6953     1
#>  arrow  298.3979  298.3979  298.3979  298.3979  298.3979  298.3979     1
cat("Parquet file written to: ", temp_file, "\n")
#> Parquet file written to:  Z:/RTemp\Rtmpo7Y8u1\file63602fc514b3.parquet

# Clean up
dbDisconnect(con, shutdown=TRUE)
Created on 2024-08-03 with reprex v2.1.1
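For reference, the other two affected workflows look roughly like this. This is a minimal sketch using a small stand-in table and placeholder temp paths rather than the 1e9-row data frame above; at this size it runs quickly, and the slowdown only shows up around the 10+ GB scale.

library(duckdb)
library(glue)

con <- dbConnect(duckdb::duckdb())

# Small stand-in table (the reprex above uses 1e9 rows)
dbWriteTable(con, "df", data.frame(
  id = 1:1e5,
  value = rnorm(1e5),
  category1 = sample(LETTERS, 1e5, replace = TRUE)
))

in_file  <- tempfile(fileext = ".parquet")
out_file <- tempfile(fileext = ".parquet")
out_dir  <- tempfile()  # target directory for the partitioned write

# Plain parquet write (same pattern as the reprex above)
dbExecute(con, glue("COPY df TO '{in_file}' (FORMAT 'parquet')"))

# Hive-partitioned write
dbExecute(con, glue("COPY df TO '{out_dir}' (FORMAT 'parquet', PARTITION_BY (category1))"))

# parquet -> parquet rewrite (the COPY x.parquet TO y.parquet case)
dbExecute(con, glue("COPY (SELECT * FROM read_parquet('{in_file}')) TO '{out_file}' (FORMAT 'parquet')"))

dbDisconnect(con, shutdown = TRUE)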