Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add temporary parameter to duckdb_read_csv() #142

Closed
ThomasSoeiro opened this issue Apr 22, 2024 · 3 comments
Closed

add temporary parameter to duckdb_read_csv() #142

ThomasSoeiro opened this issue Apr 22, 2024 · 3 comments
Labels
feature a feature request or enhancement help wanted ❤️ we'd love your help!

Comments

@ThomasSoeiro
Copy link
Contributor

It is currently not possible to write a temporary table using duckdb_read_csv() directly.

To fix this, can we expose the temporary parameter in duckdb_read_csv():

duckdb-r/R/csv.R

Lines 32 to 33 in b13d405

duckdb_read_csv <- function(conn, name, files, header = TRUE, na.strings = "", nrow.check = 500,
delim = ",", quote = "\"", col.names = NULL, lower.case.names = FALSE, sep = delim, transaction = TRUE, ...) {

and pass it to the call to dbWriteTable()?

dbWriteTable(conn, tablename, headers[[1]][FALSE, , drop = FALSE])


For tables that fit in memory, we can use:

mytable <- read.csv2("mytable.csv")
dbWriteTable(con, "mytable", mytable, temporary = TRUE)

But we are stuck for tables that don't fit in memory.

@krlmlr
Copy link
Collaborator

krlmlr commented Apr 24, 2024

Thanks.

  • The dbWriteTable() you mentioned really should be a dbCreateTable(), this had confused me a bit
  • Can you confirm that running dbExecute("COPY ... FROM ...") does what you expect it to for temporary tables? What happens with name clashes (same table exists as permanent and as temporary table)?
  • For now, you could work around:
    • Import the CSV data into a temporary database that you connect to your main database with ATTACH
    • Use something like CREATE TEMPORARY TABLE ... AS read_csv_auto('...') (not tested)

@ThomasSoeiro
Copy link
Contributor Author

Prepare files and db for testing
library(DBI)

write.csv(warpbreaks, "warpbreaks.csv", row.names = FALSE)
write.csv(cars, "cars.csv", row.names = FALSE)

con <- dbConnect(duckdb::duckdb(), dbdir = "test.duckdb")
duckdb::duckdb_read_csv(con, "warpbreaks1", "warpbreaks.csv")
dbListTables(con)
# [1] "warpbreaks1"
dbDisconnect(con, shutdown = TRUE)
  • Can you confirm that running dbExecute("COPY ... FROM ...") does what you expect it to for temporary tables? (...)

Yes, it does:

con <- dbConnect(duckdb::duckdb(), dbdir = "test.duckdb", read_only = TRUE)
dbListTables(con)
# [1] "warpbreaks1"
dbExecute(con, "CREATE TEMPORARY TABLE warpbreaks2 (breaks NUMERIC, wool VARCHAR, tension VARCHAR);")
# [1] 0
dbExecute(con, "COPY warpbreaks2 FROM 'warpbreaks.csv' (HEADER);")
# [1] 54
dbListTables(con)
# [1] "warpbreaks1" "warpbreaks2"
dbDisconnect(con, shutdown = TRUE)

# check that the temporary table is gone
con <- dbConnect(duckdb::duckdb(), dbdir = "test.duckdb")
dbListTables(con)
# [1] "warpbreaks1"
dbDisconnect(con, shutdown = TRUE)
  • (...) What happens with name clashes (same table exists as permanent and as temporary table)?

Not sure what is expected here. Here is a test:

con <- dbConnect(duckdb::duckdb(), dbdir = "test.duckdb", read_only = TRUE)
dbListTables(con)
# [1] "warpbreaks1"
dbExecute(con, "CREATE TEMPORARY TABLE warpbreaks1 (speed NUMERIC, dist NUMERIC);")
# [1] 0
dbExecute(con, "COPY warpbreaks1 FROM 'cars.csv' (HEADER);")
# [1] 50
dbListTables(con)
# [1] "warpbreaks1" "warpbreaks1"
dbReadTable(con, "warpbreaks1") |> head()
#   speed dist
# 1     4    2
# 2     4   10
# 3     7    4
# 4     7   22
# 5     8   16
# 6     9   10
dbDisconnect(con, shutdown = TRUE)
  • For now, you could work around:

    • Import the CSV data into a temporary database that you connect to your main database with ATTACH
    • Use something like CREATE TEMPORARY TABLE ... AS read_csv_auto('...') (not tested)

It works, thanks! (documentation here)

con <- dbConnect(duckdb::duckdb(), dbdir = "test.duckdb", read_only = TRUE)
dbListTables(con)
# [1] "warpbreaks1"
dbExecute(con, "CREATE TEMPORARY TABLE warpbreaks3 AS FROM 'warpbreaks.csv';")
# [1] 54
dbListTables(con)
# [1] "warpbreaks1" "warpbreaks3"
dbDisconnect(con, shutdown = TRUE)

@krlmlr krlmlr added feature a feature request or enhancement and removed enhancement labels Aug 16, 2024
@krlmlr
Copy link
Collaborator

krlmlr commented Aug 17, 2024

Temporary tables take precedence indeed if they exist:

library(duckdb)
#> Loading required package: DBI
con <- dbConnect(duckdb())

writeLines("a\n1", "x.csv")

dbExecute(con, "CREATE TEMPORARY TABLE x (a INTEGER)")
#> [1] 0

dbExecute(con, "CREATE TABLE x (y VARCHAR)")
#> [1] 0

dbExecute(con, "COPY x FROM 'x.csv' (HEADER)")
#> [1] 1

dbReadTable(con, "x")
#>   a
#> 1 1

dbRemoveTable(con, "x", temporary = TRUE)

dbReadTable(con, "x")
#> [1] y
#> <0 rows> (or 0-length row.names)

Created on 2024-08-17 with reprex v2.1.0

We want to do the following here:

  • In duckdb_read_csv(), change dbWriteTable() to dbCreateTable()
  • Add an ellipsis after files to make sure arguments are named
  • Add a temporary = FALSE argument and modify the SQL as needed

My preferred workflow is to use dbplyr with tbl_function(), but I'm open to reviewing a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement help wanted ❤️ we'd love your help!
Projects
None yet
Development

No branches or pull requests

2 participants