Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update #12

Open
ajdamico opened this issue Sep 14, 2023 · 6 comments · May be fixed by #16
Assignees

Comments

@ajdamico
Copy link

hi, i'm still hitting this error on both the CRAN version and also the duckdb_0.8.1-9000 dev version..

# using github
remotes::install_github("duckdb/duckdb-r")


# minimal reproducible example
library(duckdb)
con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
dbWriteTable( con , 'my_table' , my_df )

console output:

> library(duckdb)
Loading required package: DBI
> con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
> my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
> dbWriteTable( con , 'my_table' , my_df )
Error: rapi_execute: Failed to run query
Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update
In addition: Warning message:
Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this. 
> 
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] duckdb_0.8.1-9000 DBI_1.1.3        

loaded via a namespace (and not attached):
[1] compiler_4.3.1 tools_4.3.1   
> 
@Tmonster
Copy link
Contributor

Thanks for the bug report. This indeed seems to be an issue with how our internal engine reads strings from within the R string pool. Hopefully we can get a fix in soon!

@krlmlr
Copy link
Collaborator

krlmlr commented Dec 2, 2023

Thanks. Confirmed on macOS. I'm not sure it's an error though. You need valid UTF-8 strings, and DuckDB is more picky about it than R.

library(duckdb)
#> Loading required package: DBI
con <- dbConnect(duckdb::duckdb(), dbdir = "my-db.duckdb")
my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
dbWriteTable(con, "my_table", my_df)
#> Error: rapi_execute: Failed to run query
#> Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update

dbGetQuery(con, "SELECT 'Est\xe2ncia' AS x")
#> Error: Invalid unicode (byte sequence mismatch) detected in value construction
dbGetQuery(con, iconv("SELECT 'Est\xe2ncia' AS x", from = "latin1"))
#>          x
#> 1 Estância

Created on 2023-12-02 with reprex v2.0.2

@ajdamico
Copy link
Author

ajdamico commented Dec 2, 2023

hi @krlmlr thanks for taking the time to look at this! i feel like data like this is pretty common, i wonder if you'd be willing to weigh in on how R users might approach this issue? for example, is the strategy below a good starting point for R users trying to import non-UTF 8 strings into duckdb?

# download and import official brazilian microdata #

library(httr)
library(archive)
library(readr)
library(duckdb)

tf <- tempfile()

this_url <- "https://download.inep.gov.br/microdados/microdados_enem_2022.zip"

GET( this_url , write_disk( tf ) , progress() )

archive_extract( tf , dir = tempdir() )

enem_fns <- list.files( tempdir() , recursive = TRUE , full.names = TRUE )

enem_fn <- grep( "MICRODADOS_ENEM_([0-9][0-9][0-9][0-9])\\.csv$" , enem_fns , value = TRUE )

enem_tbl <- read_csv2( enem_fn )

enem_df <- data.frame( enem_tbl )

names( enem_df ) <- tolower( names( enem_df ) )


# fails to import as-is, but succeeds after conversion #

converted_enem_df <- enem_df

converted_enem_df[ ] <-
	lapply( 
		enem_df , 
		function( x ){
			if( class( x ) == 'character' ){
				iconv( x , from = "latin1" )
			} else x
		}
	)
	
con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )

# fails
dbWriteTable( con , 'enade' , enade_df )

# works
dbWriteTable( con , 'enade' , converted_enem_df )

thanks a lot!!

@krlmlr
Copy link
Collaborator

krlmlr commented Dec 2, 2023

Thanks. Cleaning up the encoding should really happen when reading the CSV file. With readr, see https://readr.tidyverse.org/articles/locales.html .

@ajdamico
Copy link
Author

ajdamico commented Dec 2, 2023

thank you!!

@eli-daniels
Copy link
Contributor

I've found that data can have been 'corrupted' downstream, meaning even if it was imported as UTF-8 this error message can occur. In which case @ajdamico's approach worked, I used dat <- dat |> mutate(across(where(is.character), stringi::stri_enc_tonative)).
I'm unable to share the data, but hope this helps someone in future

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants