-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update #12
Comments
Thanks for the bug report. This indeed seems to be an issue with how our internal engine reads strings from within the R string pool. Hopefully we can get a fix in soon! |
Thanks. Confirmed on macOS. I'm not sure it's an error though. You need valid UTF-8 strings, and DuckDB is more picky about it than R. library(duckdb)
#> Loading required package: DBI
con <- dbConnect(duckdb::duckdb(), dbdir = "my-db.duckdb")
my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
dbWriteTable(con, "my_table", my_df)
#> Error: rapi_execute: Failed to run query
#> Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update
dbGetQuery(con, "SELECT 'Est\xe2ncia' AS x")
#> Error: Invalid unicode (byte sequence mismatch) detected in value construction
dbGetQuery(con, iconv("SELECT 'Est\xe2ncia' AS x", from = "latin1"))
#> x
#> 1 Estância Created on 2023-12-02 with reprex v2.0.2 |
hi @krlmlr thanks for taking the time to look at this! i feel like data like this is pretty common, i wonder if you'd be willing to weigh in on how R users might approach this issue? for example, is the strategy below a good starting point for R users trying to import non-UTF 8 strings into duckdb?
thanks a lot!! |
Thanks. Cleaning up the encoding should really happen when reading the CSV file. With readr, see https://readr.tidyverse.org/articles/locales.html . |
thank you!! |
I've found that data can have been 'corrupted' downstream, meaning even if it was imported as UTF-8 this error message can occur. In which case @ajdamico's approach worked, I used |
hi, i'm still hitting this error on both the CRAN version and also the
duckdb_0.8.1-9000
dev version..console output:
The text was updated successfully, but these errors were encountered: