-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add col.types
argument to duckdb_read_csv()
#445
Conversation
Thanks for the effort, good idea! This function is getting too complex now, is there a way to split this into more focused functions, one of which would be responsible for inferring the column types? Can you propose a refactoring that will later make it easy to add this functionality? The branch now also conflicts with the main branch, sorry about that. |
R/csv.R
Outdated
@@ -12,6 +14,9 @@ | |||
#' @param delim Which field separator should be used | |||
#' @param quote Which quote character is used for columns in the CSV file | |||
#' @param col.names Override the detected or generated column names | |||
#' @param col.types Character vector of column types in the same order as col.names, | |||
#' or a named character vector where names are column names and types pairs. | |||
#' Valid ypes are DuckDB data types, e.g. VARCHAR, DOUBLE, DATE, BIGINT, BOOLEAN, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of saying etc, can we point to a documentation somewhere?
tests/testthat/test-read.R
Outdated
col.types = c( | ||
Sepal.Length = "DOUBLE", | ||
Sepal.Width = "DOUBLE", | ||
Petal.Length = "DOUBLE", | ||
Petal.Width = "DOUBLE", | ||
Species = "DOUBLE" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no tests for date types, feels risky
Yep, I'll give it a shot and add some tests and better docs. Is there any way to avoid recompiling DuckDB when doing RCMD checks? |
I use |
I've refactored it a bit and added a DATE type test. Ready for review. If there is anything else I can do to make it better, let me know. RCMD Check gives : [ FAIL 0 | WARN 0 | SKIP 52 | PASS 4810 ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks like CI/CD is failing.
Thanks for the review @krlmlr |
Thanks! This is a good start, but ideally, I'd like to proceed as outlined in #118 (comment) . Merging for now, let's see. |
col.types
argument to duckdb_read_csv()
Agreed, but this adds the functionality asked across a few issues. Hopefully it helps someone. I'm still thinking about how to approach the method discussed in #118 |
This addresses 2 issues: #118 and #383
col.types can be given as a named character vector providing the names and types of the columns such as:
col.types = c(col0 = 'VARCHAR', col1 = 'COUBLE' , etc..)
or an unnamed character vector, then col names are taken from the read.csv output or the col.names argument. Column names given by col.types are preferred over col.names.The data types are provided as DuckDB data types, so VARCHAR, DOUBLE, BIGINT, etc...
As part of this I changed dbWriteTable to dbCreateTable. So could easily add the temporary parameter mentioned here #142, but don't want to get ahead of myself.
Also made a minor addition to the docs to mention that the csv files are appended to the table if the table already exists, I hope that is okay.
Happy to modify anything that may be amiss!
Closes #383.