
duckdbfs - duckplyr comparing notes #59

Open · cboettig opened this issue Sep 19, 2023 · 7 comments

@cboettig commented Sep 19, 2023

Howdy friends! Just saw this (from the Posit Conf schedule!), looks amazing (though still wrapping my head around scope etc).

I've been playing around with some possibly similar ideas in a very small wrapper package, duckdbfs, because I didn't know about the efforts here. If it makes sense, I'd be happy to merge features into here instead and archive duckdbfs. Alternatively, I'd welcome your feedback on duckdbfs.

My core goal with duckdbfs was to have open_dataset() / write_dataset() functions that operate like they do in arrow (i.e. supporting local and S3 URIs), while also supporting arbitrary https URLs. (Yes, I know we can do things like arrow::open_dataset() |> to_duckdb(), but that doesn't support https URLs and adds the overhead of the arrow parser, which we found could be substantially slower than duckdb's native httpfs mechanism.)

e.g. S3 access, with necessary config (as per #39):

parquet <- "s3://gbif-open-data-us-east-1/occurrence/2023-06-01/occurrence.parquet"
gbif <- duckdbfs::open_dataset(parquet, anonymous = TRUE, s3_region="us-east-1")

https URLs work the same way, of course. duckdbfs handles installing the httpfs extension when necessary. (Yes, it's tragic that the httpfs extension still doesn't work on Windows, owing to how the duckdb R package builds those binaries!) duckdbfs seeks to make the spatial extension immediately visible to R users in the same way, e.g.

library(dplyr)
library(duckdbfs)   # needed for to_sf() at the end of the pipeline

spatial_ex <- paste0("https://raw.githubusercontent.com/cboettig/duckdbfs/",
                     "main/inst/extdata/spatial-test.csv") |>
  duckdbfs::open_dataset(format = "csv")

spatial_ex |>
  mutate(geometry = ST_Point(longitude, latitude)) |>
  mutate(dist = ST_Distance(geometry, ST_Point(0, 0))) |>
  to_sf()

Note we use dplyr / dbplyr to do lazy spatial ops, and parse the result into R as an sf object.
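
For completeness, the write_dataset() side mirrors arrow too. A rough sketch (the bucket is hypothetical, and treat the partitioning argument as approximate, from memory):

library(duckdbfs)

## hypothetical bucket; intended to write a partitioned parquet dataset, arrow-style
mtcars |>
  write_dataset("s3://my-bucket/mtcars", partitioning = "cyl")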

@krlmlr (Member) commented Oct 15, 2023

Thanks for reaching out!

From quickly glancing over it: duckdbfs works today via dbplyr, and it might take us a while to achieve feature parity here, in particular for spatial data frames. That said, at some point duckplyr should be capable of doing everything that duckdbfs can do. Happy to review issues that document where this is not (yet) achieved.

@cboettig (Author)

Thanks Kirill for the reply, and that sounds awesome!

Right -- I was kinda surprised that duckplyr is not using dbplyr. Just curious, what was the motivation for avoiding that route? Is the idea that duckplyr will generally have more optimized behavior than what we get by just letting dbplyr translate dplyr to SQL?

Re places where feature parity may not be there yet -- does or will as_duckplyr_df() take http or S3 URIs like in the above examples? From the docs it looks like it has to be a data.frame / tibble?

@krlmlr (Member) commented Oct 16, 2023

Thanks. We have duckplyr::duckplyr_df_from_file() for reading from files. We might even support S3 URIs, worth a try.
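
Untested, but the shape of the call is roughly this (placeholder URL; the second argument names the duckdb table function that does the reading):

library(duckplyr)

## placeholder URL; "read_parquet" is duckdb's table function for parquet files
df <- duckplyr_df_from_file(
  "https://example.org/some/data.parquet",
  "read_parquet"
)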

We translate dplyr to an intermediate representation dubbed the "relational API", which is closer to Codd's relational algebra; no SQL is involved. The aim is to achieve full dplyr compatibility regarding data types, functions, and verbs.

@cboettig (Author)

Cool! Am I correct in assuming that duckplyr would still work entirely outside of RAM? (e.g. the gbif example above is a few hundred GB). Also, does that mean users will be able to run functions like mutate() that call arbitrary third-party R packages for operations on columns, without having to call collect() first and read everything into RAM?

Looks like duckplyr_df_from_file() will need some additional support to parse arguments for duckdb's S3 config? (arrow uses conventions to express some of these in URI format, e.g. with an endpoint_override query parameter at the end of a URI, which I copied over to duckdbfs.)
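
In duckdbfs that convention ends up looking roughly like this (bucket/endpoint just for illustration, same data as the second example below):

## endpoint_override carried in the URI itself, arrow-style
path <- "s3://neon4cast-scores/parquet/aquatics?endpoint_override=data.ecoforecast.org"
scores <- duckdbfs::open_dataset(path)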

@krlmlr (Member) commented Oct 16, 2023

duckplyr uses duckdb under the hood and inherits all the goodness. We'll need to see what operations we can support without having to collect(), but in any case this will happen transparently if needed, without user intervention.

If you can share raw SQL syntax for reading from S3, that would help me translate it into an equivalent duckplyr_df_from_file() call, or enhance it if needed.

@cboettig (Author)

Sure thing, here are some quick examples (in R, not raw SQL, but close enough to the DBI layer that the SQL is obvious, right 😄). This first one is simple since it uses the default AWS endpoint and default region and is public, so it doesn't require auth keys/tokens. Normally we would have to set all of those by executing a bunch of SET commands (see the sketch just below).
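
These are duckdb's httpfs settings (values are placeholders; conn is the connection created in the first example below):

## credentials / region for a non-public bucket (placeholder values)
DBI::dbExecute(conn, "SET s3_region='us-east-1';")
DBI::dbExecute(conn, "SET s3_access_key_id='<key id>';")
DBI::dbExecute(conn, "SET s3_secret_access_key='<secret>';")
DBI::dbExecute(conn, "SET s3_session_token='<token>';")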

library(duckdb)
library(glue)
library(dplyr)

conn <- DBI::dbConnect(duckdb(), ":memory:")
DBI::dbExecute(conn, "INSTALL 'httpfs';")
DBI::dbExecute(conn, "LOAD 'httpfs';")

## note the explicit recursive glob, `**`; arrow (or duckdbfs) does this implicitly
public_aws <- "s3://gbif-open-data-us-east-1/occurrence/2023-06-01/occurrence.parquet/**"

view_query <- glue::glue("CREATE VIEW 'gbif' ",
                         "AS SELECT * FROM parquet_scan('{public_aws}');")
DBI::dbExecute(conn, view_query)
df <- tbl(conn, "gbif")

A common test case: counting occurrences by lat/lon grid. (duckdb handles this case fine, even though the data are large enough that it would be quite difficult to run this entirely in RAM on most machines).

df |>
  mutate(latitude = round(decimallatitude, 2),
         longitude = round(decimallongitude, 2)) |>
  count(longitude, latitude) |>
  mutate(n = log(n))
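
(Just to make the laziness explicit: nothing is pulled into R until you ask for it; appending collect() brings back only the aggregated grid, while the scan and aggregation run in duckdb.)

grid <- df |>
  mutate(latitude = round(decimallatitude, 2),
         longitude = round(decimallongitude, 2)) |>
  count(longitude, latitude) |>
  mutate(n = log(n)) |>
  collect()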

Here is a second example that is still public data, but uses an alternative endpoint that must be configured explicitly (via SET) in vanilla duckdb:

endpoint <- "data.ecoforecast.org"
DBI::dbExecute(conn, glue("SET s3_endpoint='{endpoint}';"))
DBI::dbExecute(conn, "SET s3_url_style='path';")

tblname <- "scores"
parquet <- "s3://neon4cast-scores/parquet/aquatics/**"
view_query <- glue("CREATE VIEW '{tblname}' ",
                   "AS SELECT * FROM parquet_scan('{parquet}');")
DBI::dbExecute(conn, view_query)
tbl(conn, tblname)

@krlmlr (Member) commented Oct 18, 2023

Thanks, this is helpful. I'll review when I next work on duckplyr.
