Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add col.types argument to duckdb_read_csv() #445

Merged
merged 17 commits into from
Oct 28, 2024

Conversation

eli-daniels
Copy link
Contributor

@eli-daniels eli-daniels commented Sep 30, 2024

This addresses 2 issues: #118 and #383

col.types can be given as a named character vector providing the names and types of the columns such as: col.types = c(col0 = 'VARCHAR', col1 = 'COUBLE' , etc..) or an unnamed character vector, then col names are taken from the read.csv output or the col.names argument. Column names given by col.types are preferred over col.names.

The data types are provided as DuckDB data types, so VARCHAR, DOUBLE, BIGINT, etc...

As part of this I changed dbWriteTable to dbCreateTable. So could easily add the temporary parameter mentioned here #142, but don't want to get ahead of myself.

Also made a minor addition to the docs to mention that the csv files are appended to the table if the table already exists, I hope that is okay.

Happy to modify anything that may be amiss!

Closes #383.

@krlmlr
Copy link
Collaborator

krlmlr commented Oct 14, 2024

Thanks for the effort, good idea!

This function is getting too complex now, is there a way to split this into more focused functions, one of which would be responsible for inferring the column types? Can you propose a refactoring that will later make it easy to add this functionality?

The branch now also conflicts with the main branch, sorry about that.

R/csv.R Outdated
@@ -12,6 +14,9 @@
#' @param delim Which field separator should be used
#' @param quote Which quote character is used for columns in the CSV file
#' @param col.names Override the detected or generated column names
#' @param col.types Character vector of column types in the same order as col.names,
#' or a named character vector where names are column names and types pairs.
#' Valid ypes are DuckDB data types, e.g. VARCHAR, DOUBLE, DATE, BIGINT, BOOLEAN, etc.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of saying etc, can we point to a documentation somewhere?

Comment on lines 252 to 258
col.types = c(
Sepal.Length = "DOUBLE",
Sepal.Width = "DOUBLE",
Petal.Length = "DOUBLE",
Petal.Width = "DOUBLE",
Species = "DOUBLE"
)
Copy link

@xiaodaigh xiaodaigh Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no tests for date types, feels risky

@eli-daniels
Copy link
Contributor Author

Yep, I'll give it a shot and add some tests and better docs.

Is there any way to avoid recompiling DuckDB when doing RCMD checks?

@krlmlr
Copy link
Collaborator

krlmlr commented Oct 17, 2024

I use ccache to speed up recompilation: https://stackoverflow.com/a/45512708/946850, https://dirk.eddelbuettel.com/blog/2017/11/27/ .

@eli-daniels
Copy link
Contributor Author

I've refactored it a bit and added a DATE type test. Ready for review.

If there is anything else I can do to make it better, let me know.

RCMD Check gives : [ FAIL 0 | WARN 0 | SKIP 52 | PASS 4810 ]

Copy link
Collaborator

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks like CI/CD is failing.

R/csv.R Show resolved Hide resolved
R/csv.R Outdated Show resolved Hide resolved
@eli-daniels
Copy link
Contributor Author

Thanks for the review @krlmlr

@krlmlr krlmlr enabled auto-merge October 28, 2024 16:53
@krlmlr
Copy link
Collaborator

krlmlr commented Oct 28, 2024

Thanks! This is a good start, but ideally, I'd like to proceed as outlined in #118 (comment) . Merging for now, let's see.

@krlmlr krlmlr changed the title add col.types to duckdb_read_csv feat: Add col.types argument to duckdb_read_csv() Oct 28, 2024
@krlmlr krlmlr added this pull request to the merge queue Oct 28, 2024
Merged via the queue into duckdb:main with commit 7c8f306 Oct 28, 2024
23 checks passed
@eli-daniels
Copy link
Contributor Author

Agreed, but this adds the functionality asked across a few issues. Hopefully it helps someone. I'm still thinking about how to approach the method discussed in #118

@eli-daniels eli-daniels deleted the resolve-conflict branch October 28, 2024 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

R: Add support for the type/dtypes in duckdb_read_csv?
3 participants