duckdbfs - duckplyr comparing notes #59
Comments
Thanks for reaching out! From quickly glancing over it, duckdbfs works today, via dbplyr, and it might take us a while to achieve feature parity here, in particular regarding spatial data frames. That said, at some point, duckplyr should be capable of doing everything that duckdbfs can do. Happy to review issues that document where this is not (yet) achieved.
Thanks Kirill for the reply, and that sounds awesome! Right -- I was kinda surprised that duckplyr was not using dbplyr. Just curious what the motivation for avoiding that route was? Is the idea that duckplyr will generally have more optimized behavior than what we get by just letting dbplyr translate dplyr to SQL? Re places where feature parity may not be there yet -- does or will …
Thanks. We have … We translate dplyr to an intermediate representation dubbed the "relational API", closer to Codd's relational algebra -- no SQL involved. The aim is to achieve full dplyr compatibility regarding data types, functions, and verbs.
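(For illustration, not from the thread: a minimal sketch of what this looks like from the user's side, assuming the `as_duckplyr_df()` entry point from early duckplyr releases -- check the duckplyr docs for the current name. The verbs are ordinary dplyr; duckplyr forwards them to duckdb's relational API instead of generating SQL.)

```r
# Minimal sketch, assuming duckplyr's as_duckplyr_df() entry point.
# Each dplyr verb below is captured and handed to duckdb's relational API;
# no SQL string is ever generated.
library(duckplyr)
library(dplyr)

mtcars |>
  as_duckplyr_df() |>
  filter(cyl > 4) |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg))
```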
Cool! Am I correct in assuming that duckplyr would still work entirely outside of RAM? (e.g. the gbif example above is a few hundred GBs.) Also, does that mean that users will also be able to run functions like …? Looks like …
duckplyr uses duckdb under the hood and inherits all the goodness. We'll need to see what operations we can support without having to … If you can share raw SQL syntax for reading S3, it might help to translate this into an equivalent …
Sure thing, here are some quick examples. (In R, not raw SQL, but close enough to the DBI layer that the SQL is obvious, right 😄)

This one is simple since it uses the default AWS endpoint and default region, and the bucket is public, so it doesn't require auth keys/tokens. Normally we just have to set all of those by executing a bunch of `SET` statements first (as in the second example below).

```r
library(duckdb)
library(glue)
library(dplyr)

conn <- DBI::dbConnect(duckdb(), ":memory:")
DBI::dbExecute(conn, "INSTALL 'httpfs';")
DBI::dbExecute(conn, "LOAD 'httpfs';")

## note the explicit recursive glob, `**`. Arrow (or duckdbfs) does this implicitly
public_aws <- "s3://gbif-open-data-us-east-1/occurrence/2023-06-01/occurrence.parquet/**"
view_query <- glue::glue("CREATE VIEW 'gbif' ",
                         "AS SELECT * FROM parquet_scan('{public_aws}');")
DBI::dbSendQuery(conn, view_query)
df <- tbl(conn, "gbif")
```

A common test case: counting occurrences by lat/lon grid. (duckdb handles this case fine, even though the data are large enough that it would be quite difficult to run this entirely in RAM on most machines.)

```r
df |>
  mutate(latitude = round(decimallatitude, 2),
         longitude = round(decimallongitude, 2)) |>
  count(longitude, latitude) |>
  mutate(n = log(n))
```

Here is a second example that is still public data, but uses an alternative endpoint, which must be set explicitly in vanilla duckdb:

```r
endpoint <- "data.ecoforecast.org"
DBI::dbExecute(conn, glue("SET s3_endpoint='{endpoint}';"))
DBI::dbExecute(conn, glue("SET s3_url_style='path';"))
tblname <- "scores"
parquet <- "s3://neon4cast-scores/parquet/aquatics/**"
view_query <- glue("CREATE VIEW '{tblname}' ",
"AS SELECT * FROM parquet_scan('{parquet}');")
DBI::dbSendQuery(conn, view_query)
tbl(conn, tblname)
```
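(For comparison, a rough sketch of the same read through duckdbfs -- not part of the original comment. `duckdb_s3_config()` and its argument names are assumptions about the duckdbfs API, so check the package docs.)

```r
# Sketch only: duckdbfs equivalent of the second example above.
# duckdb_s3_config() and its argument names are assumed; see the duckdbfs docs.
library(duckdbfs)
library(dplyr)

duckdb_s3_config(s3_endpoint = "data.ecoforecast.org", s3_url_style = "path")

# no explicit "**" needed: duckdbfs applies the recursive glob implicitly
scores <- open_dataset("s3://neon4cast-scores/parquet/aquatics/")
scores |> head() |> collect()
```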
Thanks, this is helpful. I'll review when I next work on duckplyr.
Howdy friends! Just saw this (from the Posit Conf schedule!), looks amazing (though still wrapping my head around scope etc).
I've been playing around with some possibly similar ideas in a very small wrapper package, duckdbfs, because I didn't know about the efforts here. If it makes sense, I'd be happy to merge features into here instead and archive duckdbfs. Alternatively, I'd welcome your feedback on duckdbfs.
My core goal with `duckdbfs` was to have `open_dataset()` / `write_dataset()` functions that operate like they do in arrow (i.e. supporting local and S3 URIs), while also supporting arbitrary https URLs. (Yes, I know we can do things like `arrow::open_dataset() |> to_duckdb()`, but obviously that doesn't support https URLs and adds the overhead of the arrow parser, which we found could be substantially slower than duckdb's native httpfs mechanism.)

e.g. S3 access, with necessary config (as per #39):
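(A sketch only of that config step: `duckdb_s3_config()` and its argument names are assumptions about the duckdbfs API, and the key/secret/bucket values are placeholders.)

```r
# Sketch only: explicit S3 config via duckdbfs before opening a dataset.
# duckdb_s3_config() and its argument names are assumed; values are placeholders.
library(duckdbfs)

duckdb_s3_config(
  s3_access_key_id     = "YOUR_KEY",
  s3_secret_access_key = "YOUR_SECRET",
  s3_endpoint          = "s3.amazonaws.com",
  s3_region            = "us-east-1"
)
ds <- open_dataset("s3://my-bucket/my-data/")   # hypothetical bucket/prefix
```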
https URIs work the same way of course.
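(e.g., with an illustrative URL, not from the thread:)

```r
# hypothetical URL; any https-hosted parquet file works the same way via httpfs
ds <- duckdbfs::open_dataset("https://example.com/data/file.parquet")
```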
`duckdbfs` handles installing the httpfs extension when necessary. (Yes, it's tragic that the httpfs extension still doesn't work on Windows owing to how `duckdb` is building those binaries!)

`duckdbfs` seeks to make the spatial extension immediately visible to R users in the same way, e.g. the sketch below.
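(A rough sketch under assumptions: `load_spatial()`, `to_sf()`, and the `crs` argument are recalled from the duckdbfs API and may differ in current releases; the URL and column names are placeholders.)

```r
# Sketch only: lazy spatial ops via duckdbfs, collected into an sf object.
# load_spatial(), to_sf(), and crs are assumptions; URL and columns are placeholders.
library(duckdbfs)
library(dplyr)

load_spatial()   # installs/loads the duckdb spatial extension

pts <- open_dataset("https://example.com/points.parquet")  # hypothetical dataset

pts |>
  filter(year == 2023) |>
  to_sf(crs = 4326)    # executed lazily in duckdb, returned as an sf object
```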
Note we use `dplyr`/`dbplyr` to do lazy spatial ops, and parse the result into R as an `sf` object.