Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update #12

ajdamico · 2023-09-14T20:46:03Z

hi, i'm still hitting this error on both the CRAN version and also the duckdb_0.8.1-9000 dev version..

# using github
remotes::install_github("duckdb/duckdb-r")


# minimal reproducible example
library(duckdb)
con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
dbWriteTable( con , 'my_table' , my_df )

console output:

> library(duckdb)
Loading required package: DBI
> con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
> my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
> dbWriteTable( con , 'my_table' , my_df )
Error: rapi_execute: Failed to run query
Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update
In addition: Warning message:
Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this. 
> 
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] duckdb_0.8.1-9000 DBI_1.1.3        

loaded via a namespace (and not attached):
[1] compiler_4.3.1 tools_4.3.1   
>

Tmonster · 2023-09-15T12:36:02Z

Thanks for the bug report. This indeed seems to be an issue with how our internal engine reads strings from within the R string pool. Hopefully we can get a fix in soon!

krlmlr · 2023-12-02T10:09:28Z

Thanks. Confirmed on macOS. I'm not sure it's an error though. You need valid UTF-8 strings, and DuckDB is more picky about it than R.

library(duckdb)
#> Loading required package: DBI
con <- dbConnect(duckdb::duckdb(), dbdir = "my-db.duckdb")
my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
dbWriteTable(con, "my_table", my_df)
#> Error: rapi_execute: Failed to run query
#> Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update

dbGetQuery(con, "SELECT 'Est\xe2ncia' AS x")
#> Error: Invalid unicode (byte sequence mismatch) detected in value construction
dbGetQuery(con, iconv("SELECT 'Est\xe2ncia' AS x", from = "latin1"))
#>          x
#> 1 Estância

Created on 2023-12-02 with reprex v2.0.2
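
To illustrate the point about valid UTF-8: a minimal base R sketch that checks strings with `validUTF8()` before they reach DuckDB. It assumes the offending bytes are Latin-1, as in the example above; `invalid_utf8_columns()` is a hypothetical helper name, not part of duckdb or DBI.

```r
# 0xE2 is "â" in Latin-1, but on its own it is not a complete UTF-8 sequence.
x <- "Est\xe2ncia"
validUTF8(x)       # FALSE: this is the kind of string DuckDB rejects

# Assumption: the bytes are Latin-1, so re-encode them explicitly.
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")
validUTF8(x_utf8)  # TRUE

# Hypothetical helper: names of character columns holding invalid UTF-8.
invalid_utf8_columns <- function(df) {
  is_bad <- vapply(
    df,
    function(col) is.character(col) && !all(validUTF8(col)),
    logical(1)
  )
  names(df)[is_bad]
}

my_df <- data.frame(
  no_municipio_esc = x,
  no_uf = "SE",
  stringsAsFactors = FALSE
)
invalid_utf8_columns(my_df)
```

Running a check like this before `dbWriteTable()` points at the columns that need conversion, rather than only surfacing the error inside the engine.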

ajdamico · 2023-12-02T20:04:19Z

hi @krlmlr thanks for taking the time to look at this! i feel like data like this is pretty common, i wonder if you'd be willing to weigh in on how R users might approach this issue? for example, is the strategy below a good starting point for R users trying to import non-UTF-8 strings into duckdb?

# download and import official brazilian microdata #

library(httr)
library(archive)
library(readr)
library(duckdb)

tf <- tempfile()

this_url <- "https://download.inep.gov.br/microdados/microdados_enem_2022.zip"

GET( this_url , write_disk( tf ) , progress() )

archive_extract( tf , dir = tempdir() )

enem_fns <- list.files( tempdir() , recursive = TRUE , full.names = TRUE )

enem_fn <- grep( "MICRODADOS_ENEM_([0-9][0-9][0-9][0-9])\\.csv$" , enem_fns , value = TRUE )

enem_tbl <- read_csv2( enem_fn )

enem_df <- data.frame( enem_tbl )

names( enem_df ) <- tolower( names( enem_df ) )


# fails to import as-is, but succeeds after conversion #

converted_enem_df <- enem_df

converted_enem_df[ ] <-
	lapply( 
		enem_df , 
		function( x ){
			if( is.character( x ) ){
				# assumes the source bytes are latin1; convert explicitly to UTF-8
				iconv( x , from = "latin1" , to = "UTF-8" )
			} else x
		}
	)
	
con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )

# fails
dbWriteTable( con , 'enem' , enem_df )

# works
dbWriteTable( con , 'enem' , converted_enem_df )

thanks a lot!!

krlmlr · 2023-12-02T20:15:39Z

Thanks. Cleaning up the encoding should really happen when reading the CSV file. With readr, see https://readr.tidyverse.org/articles/locales.html .
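
A minimal sketch of what declaring the encoding at read time could look like with readr. It assumes the file is Latin-1 encoded; the two-row file written here is a made-up stand-in for the real CSV, and `demo_fn` / `demo_tbl` are hypothetical names.

```r
library(readr)

# Write a tiny semicolon-separated file with Latin-1 bytes (stand-in data).
demo_fn <- tempfile(fileext = ".csv")
city_latin1 <- iconv("Est\u00e2ncia", from = "UTF-8", to = "latin1")
writeLines(
  c(
    "no_municipio_esc;no_municipio_prova",
    paste(city_latin1, city_latin1, sep = ";")
  ),
  demo_fn,
  useBytes = TRUE  # write the raw Latin-1 bytes, no re-encoding on output
)

# Declare the file encoding so readr converts to UTF-8 while parsing.
demo_tbl <- read_csv2(
  demo_fn,
  locale = locale(encoding = "latin1", decimal_mark = ",", grouping_mark = "."),
  show_col_types = FALSE
)

# Every string should now be valid UTF-8.
all(validUTF8(demo_tbl$no_municipio_esc))
```

With the encoding handled at this step, the later `iconv()` pass over every character column should not be needed.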

ajdamico · 2023-12-02T20:16:48Z

thank you!!

eli-daniels · 2024-10-17T14:17:16Z

I've found that data can get 'corrupted' downstream, so this error message can occur even if the data was imported as UTF-8. In that case @ajdamico's approach worked for me; I used dat <- dat |> mutate(across(where(is.character), stringi::stri_enc_tonative)).
I'm unable to share the data, but I hope this helps someone in the future.
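
For mixed data like this, a base R variant that only touches the strings failing `validUTF8()` might look like the sketch below. It assumes the broken strings are Latin-1, which may not hold for every data set; `repair_utf8()` is a hypothetical helper name.

```r
# Hypothetical helper: re-encode only the strings that are not valid UTF-8.
# Assumption: the invalid ones are Latin-1 (change `from` if they are not).
repair_utf8 <- function(x, from = "latin1") {
  if (!is.character(x)) return(x)
  bad <- !is.na(x) & !validUTF8(x)
  x[bad] <- iconv(x[bad], from = from, to = "UTF-8")
  x
}

# Stand-in data: one Latin-1 string, one already-valid UTF-8 string, one NA.
dat <- data.frame(
  city = c("Est\xe2ncia", "Est\u00e2ncia", NA),
  n = 1:3,
  stringsAsFactors = FALSE
)

dat[] <- lapply(dat, repair_utf8)
all(validUTF8(dat$city))
```

Converting only the invalid entries avoids double-encoding the strings that were already fine, which is what a blanket `iconv(x, from = "latin1")` would do to valid UTF-8 text.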

ajdamico mentioned this issue Sep 14, 2023

enem duckdb:yes ajdamico/asdfree#403

Closed

Tmonster assigned Tmonster and unassigned Tmonster Sep 15, 2023

krlmlr linked a pull request Mar 21, 2024 that will close this issue

feat: Throw exception when non-utf8 characters are in a data.frame #16

Open
