duckdbfs - duckplyr comparing notes #59
Comments
Thanks for reaching out! From quickly glancing over it, duckdbfs works today, via dbplyr, and it might take us a while to achieve feature parity here, in particular regarding spatial data frames. That said, at some point, duckplyr should be capable of doing everything that duckdbfs can do. Happy to review issues that document where this is not (yet) achieved.
Thanks Kirill for the reply, and that sounds awesome! Right -- I was kinda surprised that duckplyr was not using dbplyr. Just curious what the motivation for avoiding that route was? Is the idea that duckplyr will generally have more optimized behavior than what we get by just letting dbplyr translate dplyr to SQL? Re places where feature parity may not be there yet -- does or will …
Thanks. We have … We translate dplyr to an intermediate representation dubbed the "relational API", closer to Codd's relational algebra -- no SQL involved. The aim is to achieve full dplyr compatibility regarding data types, functions, and verbs.
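(For illustration, not from the thread: a minimal sketch of what this looks like from the user's side, assuming the `as_duckplyr_df()` entry point from early duckplyr releases -- check the duckplyr docs for the current name. The verbs are ordinary dplyr; duckplyr forwards them to duckdb's relational API instead of generating SQL.)

```r
# Minimal sketch, assuming duckplyr's as_duckplyr_df() entry point.
# Each dplyr verb below is captured and handed to duckdb's relational API;
# no SQL string is ever generated.
library(duckplyr)
library(dplyr)

mtcars |>
  as_duckplyr_df() |>
  filter(cyl > 4) |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg))
```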
Cool! Am I correct in assuming that duckplyr would still work entirely outside of RAM? (e.g. the gbif example above is a few hundred GBs.) Also, does that mean that users will also be able to run functions like …? Looks like …
duckplyr uses duckdb under the hood and inherits all the goodness. We'll need to see what operations we can support without having to … If you can share raw SQL syntax for reading S3, it might help to translate this into an equivalent …
Sure thing, here are some quick examples. (In R, not raw SQL, but close enough to the DBI layer that the SQL is obvious, right 😄)

This one is simple since it uses the default AWS endpoint and default region, and the bucket is public, so it doesn't require auth keys/tokens. Normally we just have to set all of those by executing a bunch of `SET` statements first (as in the second example below).

```r
library(duckdb)
library(glue)
library(dplyr)

conn <- DBI::dbConnect(duckdb(), ":memory:")
DBI::dbExecute(conn, "INSTALL 'httpfs';")
DBI::dbExecute(conn, "LOAD 'httpfs';")

## note the explicit recursive glob, `**`. Arrow (or duckdbfs) does this implicitly
public_aws <- "s3://gbif-open-data-us-east-1/occurrence/2023-06-01/occurrence.parquet/**"
view_query <- glue::glue("CREATE VIEW 'gbif' ",
                         "AS SELECT * FROM parquet_scan('{public_aws}');")
DBI::dbSendQuery(conn, view_query)
df <- tbl(conn, "gbif")
```

A common test case: counting occurrences by lat/lon grid. (duckdb handles this case fine, even though the data are large enough that it would be quite difficult to run this entirely in RAM on most machines.)

```r
df |>
  mutate(latitude = round(decimallatitude, 2),
         longitude = round(decimallongitude, 2)) |>
  count(longitude, latitude) |>
  mutate(n = log(n))
```

Here is a second example that is still public data, but uses an alternative endpoint, which must be set explicitly in vanilla duckdb:

```r
endpoint <- "data.ecoforecast.org"
DBI::dbExecute(conn, glue("SET s3_endpoint='{endpoint}';"))
DBI::dbExecute(conn, glue("SET s3_url_style='path';"))
tblname <- "scores"
parquet <- "s3://neon4cast-scores/parquet/aquatics/**"
view_query <- glue("CREATE VIEW '{tblname}' ",
"AS SELECT * FROM parquet_scan('{parquet}');")
DBI::dbSendQuery(conn, view_query)
tbl(conn, tblname)
```
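(For comparison, a rough sketch of the same read through duckdbfs -- not part of the original comment. `duckdb_s3_config()` and its argument names are assumptions about the duckdbfs API, so check the package docs.)

```r
# Sketch only: duckdbfs equivalent of the second example above.
# duckdb_s3_config() and its argument names are assumed; see the duckdbfs docs.
library(duckdbfs)
library(dplyr)

duckdb_s3_config(s3_endpoint = "data.ecoforecast.org", s3_url_style = "path")

# no explicit "**" needed: duckdbfs applies the recursive glob implicitly
scores <- open_dataset("s3://neon4cast-scores/parquet/aquatics/")
scores |> head() |> collect()
```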
Thanks, this is helpful. I'll review when I next work on duckplyr.
Howdy friends! Just saw this (from the Posit Conf schedule!), looks amazing (though still wrapping my head around scope etc).
I've been playing around with some possibly similar ideas in a very small wrapper package, duckdbfs, because I didn't know about the efforts here. If it makes sense, I'd be happy to merge features into here instead and archive duckdbfs. Alternatively, I'd welcome your feedback on duckdbfs.
My core goal with `duckdbfs` was to have `open_dataset()` / `write_dataset()` functions that operate like they do in arrow (i.e. supporting local and S3 URIs), while also supporting arbitrary https URLs. (Yes, I know we can do things like `arrow::open_dataset() |> to_duckdb()`, but obviously that doesn't support https URLs and adds the overhead of the arrow parser, which we found could be substantially slower than duckdb's native httpfs mechanism.)

e.g. S3 access, with necessary config (as per #39):
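(A sketch only of that config step: `duckdb_s3_config()` and its argument names are assumptions about the duckdbfs API, and the key/secret/bucket values are placeholders.)

```r
# Sketch only: explicit S3 config via duckdbfs before opening a dataset.
# duckdb_s3_config() and its argument names are assumed; values are placeholders.
library(duckdbfs)

duckdb_s3_config(
  s3_access_key_id     = "YOUR_KEY",
  s3_secret_access_key = "YOUR_SECRET",
  s3_endpoint          = "s3.amazonaws.com",
  s3_region            = "us-east-1"
)
ds <- open_dataset("s3://my-bucket/my-data/")   # hypothetical bucket/prefix
```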
https URIs work the same way of course.
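(e.g., with an illustrative URL, not from the thread:)

```r
# hypothetical URL; any https-hosted parquet file works the same way via httpfs
ds <- duckdbfs::open_dataset("https://example.com/data/file.parquet")
```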
`duckdbfs` handles installing the httpfs extension when necessary. (Yes, it's tragic that the httpfs extension still doesn't work on Windows owing to how `duckdb` is building those binaries!)

`duckdbfs` seeks to make the spatial extension immediately visible to R users in the same way, e.g. the sketch below.
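(A rough sketch under assumptions: `load_spatial()`, `to_sf()`, and the `crs` argument are recalled from the duckdbfs API and may differ in current releases; the URL and column names are placeholders.)

```r
# Sketch only: lazy spatial ops via duckdbfs, collected into an sf object.
# load_spatial(), to_sf(), and crs are assumptions; URL and columns are placeholders.
library(duckdbfs)
library(dplyr)

load_spatial()   # installs/loads the duckdb spatial extension

pts <- open_dataset("https://example.com/points.parquet")  # hypothetical dataset

pts |>
  filter(year == 2023) |>
  to_sf(crs = 4326)    # executed lazily in duckdb, returned as an sf object
```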
Note we use `dplyr`/`dbplyr` to do lazy spatial ops, and parse the result into R as an `sf` object.