
What is the canonical way to write parquet to disk using duckdb and dbplyr without collecting first? #207

Open

nicki-dese opened this issue Aug 1, 2024 · 3 comments

@nicki-dese commented Aug 1, 2024

The scenario: working with very large (many-GB) CSV files and wanting to save them as parquet files after processing, without reading/collecting them entirely into memory.

I know how to do this with arrow:

library(arrow)

open_dataset("path/to/csv", format = "csv") |> 
  do_something() |>
  write_dataset("path/to/export.parquet", format = "parquet") 

I also know how to do this with duckdb and SQL:

COPY
    (SELECT * FROM 'path/to/csv.csv')
    TO  'path/to/export.parquet'
    (FORMAT 'parquet')

But I would like to figure out how to do this with duckdb and dbplyr. The context is that members of my team are unfamiliar with SQL but comfortable with dplyr syntax.

The closest I've got so far is:

library(arrow)
library(duckdb)
library(DBI)
library(dplyr)

con <- dbConnect(duckdb())

tbl(con, "read_csv('path/to/csv.csv')") |>  # see note below
  do_something() |>
  to_arrow() |>
  write_dataset("path/to/export", format = "parquet")

Note: in my testing you have to pass tbl() the read function as well as the file path as a text string, even if the file path contains the csv or parquet extension. Is this where tbl_file() comes in? If so, how do you pass parameters to the read_csv function, such as which columns are dates? (As mentioned in #159, it's unclear from the duckdb documentation how to use its dbplyr-specific functions.)
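One possibility, as a sketch only (it assumes the DuckDB version in use supports the types and dateformat arguments to read_csv, and that tbl() accepts the table-function string as above; start_date is a hypothetical column name), is to put the read options inside that string:

library(duckdb)
library(DBI)
library(dplyr)

con <- dbConnect(duckdb())

# read_csv options go inside the table-function string; types and dateformat
# are DuckDB read_csv arguments (check the docs for your DuckDB version)
csv_src <- "read_csv('path/to/csv.csv', types = {'start_date': 'DATE'}, dateformat = '%d/%m/%Y')"

tbl(con, csv_src) |>
  do_something()

# If your duckdb version exports duckdb::tbl_function(), the same string can be
# passed there instead of tbl(); tbl_file() takes a bare path, so read options
# can't be supplied through it.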

I'm a bit nervous about my workaround of using to_arrow(), because one of the reasons we're starting to use duckdb in preference to arrow as a team is that arrow's auto-detection of schemas from CSVs is nowhere near as good as duckdb's, and arrow is very fussy and slow at parsing dates. I've also noticed that arrow interprets empty strings differently from duckdb (arrow leaves them as "", duckdb makes them NA). And I'm cautious about introducing a step that might complicate matters and make it unclear what type casting has occurred in the transfer between the two libraries.

I've also tried various versions of copy_to() and db_copy_to() with temporary = FALSE. I've managed to create an in-memory table called " 'path/to/export.parquet' (FORMAT 'parquet')" (!) but not actually save anything to disk.
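A minimal sketch of why that happens (copy_to() and compute() with temporary = FALSE materialise the result as a table inside the DuckDB database, so the path string just becomes a table name and no file is ever written; the table name below is illustrative):

# compute() (and copy_to()) with temporary = FALSE create a named table
# *inside* the DuckDB database; nothing is written to disk
tbl(con, "read_csv('path/to/csv.csv')") |>
  do_something() |>
  compute(name = "processed", temporary = FALSE)

dbListTables(con)  # "processed" is now a table in the database, not a file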

If I should ask this question elsewhere, please let me know.

@rplsmn commented Aug 6, 2024

Hello,

I am interested in the official answer to this question!
On my side, I've been doing the following: convert my dbplyr operations to SQL using sql_render(), then pipe into a function that wraps COPY ... TO.

E.g.:

tbl(con, "read_csv('path/to/csv.csv')") |>
  do_something() |>
  duckdb_write_parquet(path = "path/to/export.parquet", con = con)

Where duckdb_write_parquet() is simply something like:

duckdb_write_parquet <- function(.tbl_sql, path, con) {

  # render the lazy dbplyr pipeline to SQL
  sql_tbl <- dbplyr::sql_render(.tbl_sql)

  # wrap it in DuckDB's COPY ... TO statement
  sql_call <- glue::glue("COPY ({sql_tbl}) TO '{path}' (FORMAT 'parquet')")

  DBI::dbExecute(con, sql_call)

}

Looking forward to seeing other ways of doing this!
Cheers

@nicki-dese (Author)

@rplsmn - thank you for that approach, clever :)

I'd thought of rendering the SQL (I initially tried show_query(), but remote_query() is what actually returns the SQL string) and using glue_sql():

path <- "path/to/export.parquet"

query <- tbl(con, "read_csv('path/to/csv.csv')") |>
  do_something() |>
  dbplyr::remote_query()  # show_query() only prints; remote_query() returns the SQL

copy_query <- glue::glue_sql("COPY ({query}) TO {path} (FORMAT 'parquet')", .con = con)

dbExecute(con, copy_query)

I like your way of wrapping it up in a function more.

@krlmlr (Collaborator) commented Aug 16, 2024

Thanks. There is duckplyr::df_to_parquet(), but nothing comparable in this package. We could certainly implement a tbl_to_parquet() -- the con can be retrieved via dbplyr::remote_con(), and we probably want to pass along options (hive partitioning, ...). Happy to review an implementation sketch here or in a PR.
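For illustration, a rough sketch of what a tbl_to_parquet() along those lines could look like (everything here is hypothetical, not package API: the name, the options interface, and the option rendering; it relies on dbplyr::remote_con(), dbplyr::remote_query(), and DuckDB's COPY ... TO options such as PARTITION_BY for hive partitioning):

# hypothetical sketch, not part of the package
tbl_to_parquet <- function(tbl, path, options = list(FORMAT = "parquet")) {
  con <- dbplyr::remote_con(tbl)      # connection behind the lazy tbl
  query <- dbplyr::remote_query(tbl)  # SQL for the pipeline so far

  # render e.g. list(FORMAT = "parquet", PARTITION_BY = c("year", "month"))
  # as "FORMAT 'parquet', PARTITION_BY (year, month)"
  opts <- vapply(names(options), function(nm) {
    val <- options[[nm]]
    if (identical(nm, "PARTITION_BY")) {
      sprintf("%s (%s)", nm, paste(val, collapse = ", "))
    } else {
      sprintf("%s '%s'", nm, val)
    }
  }, character(1))
  opts_sql <- paste(opts, collapse = ", ")

  DBI::dbExecute(con, glue::glue("COPY ({query}) TO '{path}' ({opts_sql})"))
}

# usage sketch:
# tbl(con, "read_csv('path/to/csv.csv')") |>
#   do_something() |>
#   tbl_to_parquet("path/to/export",
#                  options = list(FORMAT = "parquet", PARTITION_BY = "year"))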

krlmlr added the feature (a feature request or enhancement) and help wanted ❤️ (we'd love your help!) labels on Aug 16, 2024