Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interop with spatial extension #117

Open
krlmlr opened this issue Mar 22, 2024 · 7 comments
Open

Interop with spatial extension #117

krlmlr opened this issue Mar 22, 2024 · 7 comments

Comments

@krlmlr
Copy link
Collaborator

krlmlr commented Mar 22, 2024

From #55 (comment):

Perhaps not a complete summary, but:

  • Right now when a database result contains a column that comes from the DuckDB spatial extension, it shows up in Arrow output as a list(raw()) where each element contains an opaque internal binary representation. This often leads to confusion because the format can't be read anywhere except DuckDB. I think the same is true of a database result that does not go through Arrow (i.e., R/DBI) but I haven't checked.
  • For a database result accessed via R/DBI, I imagine that you might have enough information available to you at "convert to R" time to at least give it a class. That class could implement st::as_sfc() and give an error along the lines of "use st_as_ewkb() before collect(), which is the workaround.
  • For a database result accessed via Arrow, the solution is complex because DuckDB does not currently have a way for an extension type to customize its arrow output ( Representation of spatial types on export to ArrowArrayStream duckdb_spatial#153 ).

CC @paleolimbot @eitsupi @cboettig .

@dcaseykc
Copy link

Assuming my problems are related to this issue (and are not distinct):

The examples below show some attempts at reading/writing from a spatial enabled duckdb. I can get data in through sql but not dbWriteTable or sf::st_write. Nor have I found a way to translate what duckdb returns for the geom column into something R (sf or wk can understand.

library(duckdb)
#> Warning: package 'duckdb' was built under R version 4.3.3
#> Loading required package: DBI
library('sf')
#> Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
library('glue')

# Init DB
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "install spatial; load spatial;")
#> [1] 0

# USE ST read
dbExecute(con, glue("create table spat as select *
          from ST_Read('{system.file('shape/nc.shp', package='sf')}')"))
#> [1] 99

# Try to get the stuff
a1 = dbGetQuery(con, 'select * from spat')
# I've not found a way to convert format of geom into something usable

a2 = st_read(con, 'spat') #breaks on the binary
#> Warning in st_read.DBIObject(con, "spat"): Could not find a simple features
#> geometry column. Will return a `data.frame`.


# Try to write it
# Gets errors
nc = sf::read_sf(system.file("shape/nc.shp", package="sf"))
dbWriteTable(con, name = 'nc', nc)
#> Note: method with signature 'DBIObject#sf' chosen for function 'dbDataType',
#>  target signature 'duckdb_connection#sf'.
#>  "duckdb_connection#ANY" would also be valid
#> Error: rapi_execute: Failed to run query
#> Error: Conversion Error: ParseException: Unknown type: '0106000020AB100000010000000103000000010000001B000000000000A0415E54C000000060FF1D4240000000209D6254C000000080E122424000000080F76354C0000000200523424000000020846854C0000000A09B2B4240000000C06D6F54C00000000026324240000000A0B06C54C000000040633C4240000000A0FA6C54C0000000C07942424000000040E16A54C0000000A0794B424000000060195654C0000000A053494240000000203E5654C000000060DA44424000000020C95454C000000040C0414240000000800D5454C000000080873D4240000000000A5154C000000060F637424000000060D25054C000000060D833424000000080674F54C0000000C090304240000000605A4F54C000000040C42E424000000060E95054C0000000E01B2D4240000000400E5554C000000040872E4240000000C0205754C000000060342D424000000080675754C000000000662B424000000020AA5654C0000000205D26424000000060845754C000000060AC23424000000040025A54C0000000A07C244240000000A0635A54C0000000A03622424000000020965B54C0000000405F21424000000020FC5C54C0000000C0AA1E4240000000A0415E54C000000060FF1D4240'
sf::st_write(nc, dsn = con, layer = 'pts')
#> Error: rapi_execute: Failed to run query
#> Error: Conversion Error: ParseException: Unknown type: '0106000020AB100000010000000103000000010000001B000000000000A0415E54C000000060FF1D4240000000209D6254C000000080E122424000000080F76354C0000000200523424000000020846854C0000000A09B2B4240000000C06D6F54C00000000026324240000000A0B06C54C000000040633C4240000000A0FA6C54C0000000C07942424000000040E16A54C0000000A0794B424000000060195654C0000000A053494240000000203E5654C000000060DA44424000000020C95454C000000040C0414240000000800D5454C000000080873D4240000000000A5154C000000060F637424000000060D25054C000000060D833424000000080674F54C0000000C090304240000000605A4F54C000000040C42E424000000060E95054C0000000E01B2D4240000000400E5554C000000040872E4240000000C0205754C000000060342D424000000080675754C000000000662B424000000020AA5654C0000000205D26424000000060845754C000000060AC23424000000040025A54C0000000A07C244240000000A0635A54C0000000A03622424000000020965B54C0000000405F21424000000020FC5C54C0000000C0AA1E4240000000A0415E54C000000060FF1D4240'
dbWriteTable(con, 'nc', nc, append = T, field.types = c('id' = 'INTEGER', 'geometry' = 'GEOMETRY'))
#> Error: rapi_execute: Failed to run query
#> Error: Conversion Error: ParseException: Unknown type: '0106000020AB100000010000000103000000010000001B000000000000A0415E54C000000060FF1D4240000000209D6254C000000080E122424000000080F76354C0000000200523424000000020846854C0000000A09B2B4240000000C06D6F54C00000000026324240000000A0B06C54C000000040633C4240000000A0FA6C54C0000000C07942424000000040E16A54C0000000A0794B424000000060195654C0000000A053494240000000203E5654C000000060DA44424000000020C95454C000000040C0414240000000800D5454C000000080873D4240000000000A5154C000000060F637424000000060D25054C000000060D833424000000080674F54C0000000C090304240000000605A4F54C000000040C42E424000000060E95054C0000000E01B2D4240000000400E5554C000000040872E4240000000C0205754C000000060342D424000000080675754C000000000662B424000000020AA5654C0000000205D26424000000060845754C000000060AC23424000000040025A54C0000000A07C244240000000A0635A54C0000000A03622424000000020965B54C0000000405F21424000000020FC5C54C0000000C0AA1E4240000000A0415E54C000000060FF1D4240'

Created on 2024-03-28 with reprex v2.1.0

@cboettig
Copy link

@dcaseykc You are really close, just missing two things:

First, you must remember that duckdb spatial has it's own native encoding of a geometry column that is not WKB. So it is up to you to manually call st_AsWKB to turn the geom column into WKB so other tools like sf can read it. Second, you should tell sf where to find the geometry column. Try this:

library(duckdb)
library(sf)
library(glue)
library(dplyr)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "install spatial; load spatial;")
dbExecute(con, glue("create table spat as select *
          from ST_Read('{system.file('shape/nc.shp', package='sf')}')"))
a1 = dbGetQuery(con, 'select * from spat')


q <- tbl(con, "spat") |> mutate(geometry = st_aswkb(geom)) |> dbplyr::sql_render()
a2 = st_read(con, query = q, geometry_column = "geometry") 

However, I find all that really verbose and hard to read. duckdbfs is a small wrapper around this syntax with helper functions. The above is the same as:

library(duckdbfs)
nc <- open_dataset("https://github.com/r-spatial/sf/raw/main/inst/shape/nc.shp")
nc |> to_sf()

HTH

@jacodevries
Copy link

@cboettig thanks for your response to the original post. It is most helpful. Can you please give an example of how open_dataset can be used to open a spatial dataset from duckdb in memory. open_dateset requires a "sources" parameter.

Based on your feedback above the following successfully returns an sf object from the in memory database.

duckDBMemory <- duckdb::dbConnect(duckdb())

query <- 
  tbl(duckDBMemory, "cadastre_tbl") |>
  mutate(geometry = st_aswkb(geom)) |>
  dbplyr::sql_render()

result <- 
  st_read(
    duckDBMemory,
    query = query,
    geometry_column = "geometry") 

My expectation is that the following should do the same:

result <- open_dataset(
  conn = duckDBMemory,
  tblname = "cadastre_tbl"
)

However, this returns Error in open_dataset(): ! argument "sources" is missing, with no default

Your advice will be appreciated.

@cboettig
Copy link

cboettig commented Sep 3, 2024

@jacodevries Not entirely sure I follow your question. Your example doesn't run -- it refers to a tbl(duckDBMemory, "cadastre_tbl"), but this table doesn't yet exist -- your code has provided no indication where this table is coming from, and yet you are creating an new in-memory connection on the line before, so it can't already be in memory on, say, a on-disk table. This spatial data has to come from somewhere, right? in duckdbfs that is the sources argument, as in the examples above (a shapefile, csv, geojson, parquet, etc etc).

If you already have a connection to a duckdb dataset that already has a the desired table materialized in it, you can just do tbl(duckDBMemory, "cadastre_tbl") to establish a connection to it, as normal. Note that duckbfs functions should be able to work with this table, e.g.

library(duckdb)
library(dplyr)
library(duckdbfs)
library(sf)

duckDBMemory <- duckdb::dbConnect(duckdb())

ex = open_dataset("https://github.com/r-spatial/sf/raw/main/inst/shape/nc.shp",
                                  tblname = "cadastre_tbl", conn = duckDBMemory)

# If the table already exists, just use `tbl` like always!
also_ex <- tbl(duckDBMemory, "cadastre_tbl")

# this works just fine with other duckdbfs functions, like to_sf() shown above:
also_ex <- tbl(duckDBMemory, "cadastre_tbl")

# Note that these are in fact identical
identical(ex, also_ex)

so we can use open_dataset() to read from a serialized file, or we can use tbl() as always to connect to an existing table or view already imported to duckdb. Does that make sense?

@jacodevries
Copy link

jacodevries commented Sep 4, 2024

@cboettig My apology - I realised that I left out a few critical details - let me rephrase my question:

  1. I have been struggling with the same issue, in that duckdb returns the geometries of spatial data in raw format.
    i.e. using also_ex <- tbl(duckDBMemory, "cadastre_tbl") returns the data, but the geometry is in raw format.

  2. Your post solved the problem. Using the following suggestion which you posted, the spatial dataset is returned in an sf dataframe.

library(duckdb)
library(sf)
library(glue)
library(dplyr)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "install spatial; load spatial;")
dbExecute(con, glue("create table spat as select *
          from ST_Read('{system.file('shape/nc.shp', package='sf')}')"))
a1 = dbGetQuery(con, 'select * from spat')


q <- tbl(con, "spat") |> mutate(geometry = st_aswkb(geom)) |> dbplyr::sql_render()
a2 = st_read(con, query = q, geometry_column = "geometry") 
  1. However, I noted your comment that the above is verbose, and that the following duckdbfs helper functions can achieved the same.
library(duckdbfs)
nc <- open_dataset("https://github.com/r-spatial/sf/raw/main/inst/shape/nc.shp")
nc |> to_sf()

I've therefore tried to figure out how to use your suggested wrapper function to read the duckdb data directly in order to make the code less verbose.

Therefore using your suggested code snippets:

The spatial database is created using:

library(duckdb)
library(sf)
library(glue)
library(dplyr)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "install spatial; load spatial;")
dbExecute(con, glue("create table spat as select *
          from ST_Read('{system.file('shape/nc.shp', package='sf')}')"))

Then using the following code snippet, the spatial data is returned, but in raw format:

a1 = dbGetQuery(con, 'select * from spat')

The following verbose snippet returns the data in sf format:

q <- tbl(con, "spat") |> mutate(geometry = st_aswkb(geom)) |> dbplyr::sql_render()
a2 = st_read(con, query = q, geometry_column = "geometry") 

My question: how can I use the duckdbfs helper function to make the request less verbose, but return the data in sf format:

library(duckdbfs)
nc <- open_dataset(conn = con, tblname = "spat")

However, this returns Error in open_dataset(): ! argument "sources" is missing, with no default

Is it possible to use open_dataset() to read from a duckdb database in memory - what should the value of the sources parameter be?

@cboettig
Copy link

cboettig commented Sep 5, 2024

@jacodevries yes, I think I understand what you are asking, but please use the function dplyr::tbl() when you have a connection and the name of an existing table. I do not understand why you are trying to use open_dataset() on that. Specifically, in your code above, just run:

library(dplyr); library(duckdbfs)
tbl(con, "spat") |> to_sf()

That is the same as the more verbose code you show:

q <- tbl(con, "spat") |> mutate(geometry = st_aswkb(geom)) |> dbplyr::sql_render()
a2 = st_read(con, query = q, geometry_column = "geometry") 

these two blocks of code are essentially identical.

@jacodevries
Copy link

@cboettig Thanks for your response to my query.

My attempt to use open_dataset is because of a misunderstanding on my part of when to use of the following code chunk:

library(duckdbfs)
nc <- open_dataset("https://github.com/r-spatial/sf/raw/main/inst/shape/nc.shp")
nc |> to_sf()

Thanks very much for your assistance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants