data_read() preserves class for rds files #558

Merged 6 commits on Oct 19, 2024
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Type: Package
Package: datawizard
Title: Easy Data Wrangling and Statistical Transformations
-Version: 0.13.0.8
+Version: 0.13.0.9
Authors@R: c(
person("Indrajeet", "Patil", , "patilindrajeet.science@gmail.com", role = "aut",
comment = c(ORCID = "0000-0003-1995-6531")),
3 changes: 3 additions & 0 deletions NEWS.md
@@ -24,6 +24,9 @@ BUG FIXES
* `describe_distribution()` no longer errors if the sample was too sparse to compute
CIs. Instead, it warns the user and returns `NA` (#550).

+* `data_read()` preserves variable types when importing files from `rds` or
+`rdata` format (#558).
+
# datawizard 0.13.0

BREAKING CHANGES
41 changes: 21 additions & 20 deletions R/data_read.R
@@ -15,15 +15,16 @@
#' for SAS data files.
#' @param encoding The character encoding used for the file. Usually not needed.
#' @param convert_factors If `TRUE` (default), numeric variables, where all
-#' values have a value label, are assumed to be categorical and converted
-#' into factors. If `FALSE`, no variable types are guessed and no conversion
-#' of numeric variables into factors will be performed. See also section
-#' 'Differences to other packages'. For `data_write()`, this argument only
-#' applies to the text (e.g. `.txt` or `.csv`) or spreadsheet file formats (like
-#' `.xlsx`). Converting to factors might be useful for these formats because
-#' labelled numeric variables are then converted into factors and exported as
-#' character columns - else, value labels would be lost and only numeric values
-#' are written to the file.
+#' values have a value label, are assumed to be categorical and converted into
+#' factors. If `FALSE`, no variable types are guessed and no conversion of
+#' numeric variables into factors will be performed. For `data_read()`, this
+#' argument only applies to file types with *labelled data*, e.g. files from
+#' SPSS, SAS or Stata. See also section 'Differences to other packages'. For
+#' `data_write()`, this argument only applies to the text (e.g. `.txt` or
+#' `.csv`) or spreadsheet file formats (like `.xlsx`). Converting to factors
+#' might be useful for these formats because labelled numeric variables are then
+#' converted into factors and exported as character columns - else, value labels
+#' would be lost and only numeric values are written to the file.
#' @param verbose Toggle warnings and messages.
#' @param ... Arguments passed to the related `read_*()` or `write_*()` functions.
#'
@@ -65,12 +66,13 @@
#' @section Differences to other packages that read foreign data formats:
#' `data_read()` is most comparable to `rio::import()`. For data files from
#' SPSS, SAS or Stata, which support labelled data, variables are converted into
-#' their most appropriate type. The major difference to `rio::import()` is that
-#' `data_read()` automatically converts fully labelled numeric variables into
-#' factors, where imported value labels will be set as factor levels. If a
-#' numeric variable has _no_ value labels or less value labels than values, it
-#' is not converted to factor. In this case, value labels are preserved as
-#' `"labels"` attribute. Character vectors are preserved. Use
+#' their most appropriate type. The major difference to `rio::import()` is for
+#' data files from SPSS, SAS, or Stata, i.e. file types that support
+#' *labelled data*. `data_read()` automatically converts fully labelled numeric
+#' variables into factors, where imported value labels will be set as factor
+#' levels. If a numeric variable has _no_ value labels or less value labels than
+#' values, it is not converted to factor. In this case, value labels are
+#' preserved as `"labels"` attribute. Character vectors are preserved. Use
#' `convert_factors = FALSE` to remove the automatic conversion of numeric
#' variables to factors.
#'
@@ -105,7 +107,7 @@ data_read <- function(path,
por = .read_spss(path, encoding, convert_factors, verbose, ...),
dta = .read_stata(path, encoding, convert_factors, verbose, ...),
sas7bdat = .read_sas(path, path_catalog, encoding, convert_factors, verbose, ...),
-.read_unknown(path, file_type, convert_factors, verbose, ...)
+.read_unknown(path, file_type, verbose, ...)
)

# tell user about empty columns
@@ -188,7 +190,7 @@ data_read <- function(path,
value_labels <- NULL
attr(i, "converted_to_factor") <- TRUE
} else {
-# else, fall back to numeric
+# else, fall back to numeric or factor
i <- as.numeric(i)
}

@@ -288,7 +290,7 @@ data_read <- function(path,
}


-.read_unknown <- function(path, file_type, convert_factors, verbose, ...) {
+.read_unknown <- function(path, file_type, verbose, ...) {
insight::check_if_installed("rio", reason = paste0("to read files of type '", file_type, "'"))
if (verbose) {
insight::format_alert("Reading data...")
@@ -317,6 +319,5 @@ data_read <- function(path,
}
out <- tmp
}
-
-.post_process_imported_data(out, convert_factors, verbose)
+out
}
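
The conversion rule documented above for labelled data (fully labelled numeric variables become factors; partially labelled ones stay numeric and keep their `"labels"` attribute) can be sketched in base R. This is only an illustration under assumptions: `convert_if_fully_labelled()` is a hypothetical helper, not a datawizard function, and the package's internal implementation may differ, for example in how it handles labelled classes from other packages.

```r
# Hypothetical helper, not part of datawizard: mimics the documented rule
# for a single numeric vector that carries a "labels" attribute.
convert_if_fully_labelled <- function(x, convert_factors = TRUE) {
  value_labels <- attr(x, "labels", exact = TRUE)
  # nothing to do if conversion is disabled or there are no value labels
  if (!convert_factors || is.null(value_labels)) {
    return(x)
  }
  observed <- unique(x[!is.na(x)])
  if (all(observed %in% value_labels)) {
    # every observed value has a label: use the labels as factor levels
    return(factor(x, levels = unname(value_labels), labels = names(value_labels)))
  }
  # fewer labels than values: stay numeric, keep the "labels" attribute
  x
}

# Made-up example vectors
fully <- structure(c(1, 2, 2, 1), labels = c(low = 1, high = 2))
partly <- structure(c(1, 2, 3, 1), labels = c(low = 1, high = 2))

class(convert_if_fully_labelled(fully))
class(convert_if_fully_labelled(partly))
class(convert_if_fully_labelled(fully, convert_factors = FALSE))
```

Since `.read_unknown()` no longer takes `convert_factors`, a rule like this would only come into play for the SPSS, SAS and Stata readers.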
32 changes: 17 additions & 15 deletions man/data_read.Rd

Some generated files are not rendered by default.

27 changes: 25 additions & 2 deletions tests/testthat/test-data_read.R
@@ -141,12 +141,12 @@ test_that("data_read - RDS file, matrix, coercible", {
httr::stop_for_status(request)
writeBin(httr::content(request, type = "raw"), temp_file)

-expect_message(expect_message(expect_message({
+expect_message({
d <- data_read(
temp_file,
verbose = TRUE
)
-})), regex = "0 out of 5")
+})

expect_s3_class(d, "data.frame")
expect_identical(dim(d), c(2L, 5L))
@@ -155,6 +155,29 @@ test_that("data_read - RDS file, matrix, coercible", {



+# RDS file, preserve class /types -----------------------------------
+
+test_that("data_read - RDS file, preserve class", {
+withr::with_tempfile("temp_file", fileext = ".rds", code = {
+request <- httr::GET("https://raw.github.com/easystats/circus/main/data/hiv.rds")
+httr::stop_for_status(request)
+writeBin(httr::content(request, type = "raw"), temp_file)
+
+d <- data_read(temp_file)
+expect_s3_class(d, "data.frame")
+expect_identical(
+sapply(d, class),
+c(
+village = "integer", outcome = "integer", distance = "numeric",
+amount = "numeric", incentive = "integer", age = "integer",
+hiv2004 = "integer", agecat = "factor"
+)
+)
+})
+})
+
+
+
# RData -----------------------------------

test_that("data_read - no warning for RData", {
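
The behaviour targeted by the NEWS entry and the new test (column classes of an `rds` file are returned as stored) can be illustrated with a base R round trip. This is a minimal sketch under assumptions: `readRDS()` stands in for the import step so the example runs without datawizard or rio, and the data frame is made up (only the column names echo the `hiv.rds` test data).

```r
# Made-up data; column names borrowed from the test above for illustration
df <- data.frame(
  village = c(1L, 2L, 3L),
  distance = c(1.5, 2.25, 3),
  agecat = factor(c("<35", "35+", "<35"))
)

# Write to a temporary rds file and read it back
tmp <- tempfile(fileext = ".rds")
saveRDS(df, tmp)
restored <- readRDS(tmp) # stand-in for data_read(tmp)
unlink(tmp)

# Column classes before and after the round trip
sapply(df, class)
sapply(restored, class)
```

With datawizard installed, replacing `readRDS(tmp)` by `data_read(tmp)` should, per the NEWS entry, give the same column classes.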