Suggestion of new function: describe_missing()
#561
Thank you, I think it would be good to have `describe_missing()`, but the way it is implemented and documented looks very field-specific to me. I find the output of `skimr::skim()` easier to understand, with `n_missing` and `complete_rate` for instance. I'm also not familiar at all with aggregating stats on missing values across several variables (e.g. `Ozone:Wind`), and the default output looks unexpected to me (I'd rather expect one row per variable).
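For readers unfamiliar with that layout, a one-row-per-variable summary can be sketched in base R on the built-in `airquality` data. This is only an illustration: the helper name `missing_by_variable()` is hypothetical, and the column names `n_missing` and `complete_rate` simply mirror the ones mentioned for `skimr::skim()`; nothing here is taken from the PR's implementation.

```r
# Hypothetical helper (not part of datawizard or skimr):
# returns one row per variable with its missing-value count and completeness.
missing_by_variable <- function(data) {
  # Count NAs column by column
  n_missing <- vapply(data, function(x) sum(is.na(x)), integer(1))
  data.frame(
    variable = names(data),
    n_missing = unname(n_missing),
    complete_rate = unname(1 - n_missing / nrow(data)),
    stringsAsFactors = FALSE
  )
}

res <- missing_by_variable(airquality)
print(res)
```

Each column of the input becomes exactly one row of the result, which is the default behaviour the comment above asks for.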
R/describe_missing.R
Outdated
#' @description Provides a detailed description of missing values in a data frame.
#' This function reports both absolute and percentage missing values of specified
#' column lists or scales, following recommended guidelines. Some authors recommend
#' reporting item-level missingness per scale, as well as a participant's maximum
#' number of missing items by scale. For example, Parent (2013) writes:
#'
#' *I recommend that authors (a) state their tolerance level for missing data by scale
#' or subscale (e.g., "We calculated means for all subscales on which participants gave
#' at least 75% complete data") and then (b) report the individual missingness rates
#' by scale per data point (i.e., the number of missing values out of all data points
#' on that scale for all participants) and the maximum by participant (e.g., "For Attachment
#' Anxiety, a total of 4 missing data points out of 100 were observed, with no participant
#' missing more than a single data point").*
This sounds a bit too focused on survey data, while this function can be useful for all kinds of data. I'd rather keep the first one or two sentences here and move the rest to a specific section in 'Details' (but even there, this seems very field-specific).
I moved everything after "Some authors recommend" to `@details`.
Also, the way I see it, a lot of packages and functions can report basic missing data features, like `skimr::skim()` (that's the "easy" part). What is missing is a way to handle, as you highlight, survey data in that field-specific way. I thought it still fits with `datawizard` even if it offers additional field-specific features, although we can probably try to make it more general for other users. In the details section, I added a paragraph giving more context about scales as used in psychology:
#' In psychology, it is common to ask participants to answer questionnaires in
#' which people answer several questions about a specific topic. For example,
#' people could answer 10 different questions about how extroverted they are.
#' In turn, researchers calculate the average for those 10 questions (called
#' items). These questionnaires are called (e.g., Likert) "scales" (such as the
#' Rosenberg Self-Esteem Scale, also known as the RSES).
I suppose one question we have to answer is: do we want `describe_missing()` to report only basic, field-general missing info, a bit more like `skim()`, OR do we also want it to include the features specific to the survey format? (Said another way, should we remove or keep the survey feature?)
R/describe_missing.R
Outdated
#' missing more than a single data point").*
#'
#' @param data The data frame to be analyzed.
#' @param vars Variable (or lists of variables) to check for missing values (NAs).
We use `select`, `exclude`, etc. in all other data frame functions; I think we should here as well.
Here it works a little differently from `select` elsewhere. `vars` takes a list of character vectors (such as `list(c("openness_1", "openness_2", "openness_3"), c("extroversion_1", "extroversion_2", "extroversion_3"))`) to take into account the nested structure of the items / columns. I can rename it to `select`, but do you think that would create confusion, or the expectation that it relies on and works with `.select_nse`? Or should we include `select` and `exclude` in addition to `vars`? I'm not sure how `.select_nse` could accommodate the nested structure like I'm doing right now 🤔
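To make the nested structure concrete, here is a minimal base R sketch of a per-scale summary driven by a list of character vectors. It is not the PR's implementation: the helper `missing_by_scale()`, the use of a named list, and the toy data are assumptions made for illustration. The output column names (`n_columns`, `n_missing`, `cells`, `missing_max`) follow the reprex output shown later in this thread.

```r
# Hypothetical helper: `scales` is a named list of character vectors,
# each vector holding the columns (items) that make up one scale.
missing_by_scale <- function(data, scales) {
  rows <- lapply(names(scales), function(nm) {
    # Logical matrix: TRUE where a cell of this scale is missing
    na_mat <- is.na(data[, scales[[nm]], drop = FALSE])
    data.frame(
      scale = nm,
      n_columns = ncol(na_mat),
      n_missing = sum(na_mat),
      cells = length(na_mat),
      # Largest number of missing items for any single row (participant)
      missing_max = max(rowSums(na_mat)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Small made-up data set, for illustration only
df <- data.frame(
  openness_1 = c(1, NA, 3, 4),
  openness_2 = c(NA, NA, 2, 5),
  extroversion_1 = c(2, 3, NA, 1),
  extroversion_2 = c(4, 4, 4, 4)
)

res <- missing_by_scale(df, list(
  openness = c("openness_1", "openness_2"),
  extroversion = c("extroversion_1", "extroversion_2")
))
print(res)
```

A flat `select` cannot express which columns belong together, which is why the grouping has to come from somewhere else (a nested list as here, or a grouping variable after reshaping to long format).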
R/describe_missing.R
Outdated
#' @keywords missing values NA guidelines
#' @return A dataframe with the following columns:
#' - `var`: Variables selected.
#' - `items`: Number of items for selected variables.
I think `unique_values` instead of `items` would be clearer.
Hum, so in this case "number of items" refers to the number of columns selected for each "scale" or combination of variables. Maybe I should use that instead, as I'm afraid `unique_values` would suggest unique responses for a given column.
It is indeed specific, as in psychology we tend to think of variables as made of several "items". So items 1-10 create a variable such as the personality trait "extroversion". I'm not sure what to call it, because "variable" might be confused with "scale" (i.e., a composite score). Maybe I could just rename that output column "columns", but I'm open to your suggestions if you have more. A more accurate name (for psychology) would be `n_items`, so perhaps we can use `n_columns`?
Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com>
Thanks for the feedback and comments! We can definitely rename the column names for more clarity e.g., to use
There is one row per variable / scale, but each variable / scale can be defined by multiple items / columns, and so the output has to be able to accommodate that (the current strategy is to use the
But if I understand correctly, you would like the default, instead of reporting on all columns as an aggregate (i.e., always exactly 1 row), to report one row per column, for all columns. Although this would create a long output for large datasets, that could work.
OK, so I changed the default so that when no scales or variables are specified, all columns are reported on separate rows. However, this behaviour is overridden if scales or variables are specified:
library(datawizard)
# Use the entire data frame
set.seed(15)
fun <- function() {
  c(sample(c(NA, 1:10), replace = TRUE), NA, NA, NA)
}
df <- data.frame(
  ID = c("idz", NA),
  openness_1 = fun(), openness_2 = fun(), openness_3 = fun(),
  extroversion_1 = fun(), extroversion_2 = fun(), extroversion_3 = fun(),
  agreeableness_1 = fun(), agreeableness_2 = fun(), agreeableness_3 = fun()
)
describe_missing(df)
#>           variable n_columns n_missing cells missing_percent complete_percent
#> 1               ID         1         7    14           50.00            50.00
#> 2       openness_1         1         4    14           28.57            71.43
#> 3       openness_2         1         4    14           28.57            71.43
#> 4       openness_3         1         3    14           21.43            78.57
#> 5   extroversion_1         1         6    14           42.86            57.14
#> 6   extroversion_2         1         6    14           42.86            57.14
#> 7   extroversion_3         1         5    14           35.71            64.29
#> 8  agreeableness_1         1         3    14           21.43            78.57
#> 9  agreeableness_2         1         4    14           28.57            71.43
#> 10 agreeableness_3         1         3    14           21.43            78.57
#> 11           Total        10        45   140           32.14            67.86
#>    missing_max missing_max_percent all_missing
#> 1            1                 100           7
#> 2            1                 100           4
#> 3            1                 100           4
#> 4            1                 100           3
#> 5            1                 100           6
#> 6            1                 100           6
#> 7            1                 100           5
#> 8            1                 100           3
#> 9            1                 100           4
#> 10           1                 100           3
#> 11          10                 100           2
# If the questionnaire items start with the same name,
# one can list the scale names directly:
describe_missing(df, scales = c("ID", "openness", "extroversion", "agreeableness"))
#>                          variable n_columns n_missing cells missing_percent
#> 1                              ID         1         7    14           50.00
#> 2           openness_1:openness_3         3        11    42           26.19
#> 3   extroversion_1:extroversion_3         3        17    42           40.48
#> 4 agreeableness_1:agreeableness_3         3        10    42           23.81
#> 5                           Total        10        45   140           32.14
#>   complete_percent missing_max missing_max_percent all_missing
#> 1            50.00           1                 100           7
#> 2            73.81           3                 100           3
#> 3            59.52           3                 100           3
#> 4            76.19           3                 100           3
#> 5            67.86          10                 100           2
# Otherwise you can provide nested columns manually:
describe_missing(df,
  select = list(
    c("ID"),
    c("openness_1", "openness_2", "openness_3"),
    c("extroversion_1", "extroversion_2", "extroversion_3"),
    c("agreeableness_1", "agreeableness_2", "agreeableness_3")
  )
)
#>                          variable n_columns n_missing cells missing_percent
#> 1                              ID         1         7    14           50.00
#> 2           openness_1:openness_3         3        11    42           26.19
#> 3   extroversion_1:extroversion_3         3        17    42           40.48
#> 4 agreeableness_1:agreeableness_3         3        10    42           23.81
#> 5                           Total        10        45   140           32.14
#>   complete_percent missing_max missing_max_percent all_missing
#> 1            50.00           1                 100           7
#> 2            73.81           3                 100           3
#> 3            59.52           3                 100           3
#> 4            76.19           3                 100           3
#> 5            67.86          10                 100           2
Created on 2024-12-16 with reprex v2.1.1
I feel like most unresolved comments and questions regarding the documentation and the implementation are related to the scope of this function. I'd rather have a "generalist" function à la
@easystats/core-team what do you think? Are you interested in having some of those field-specific features in this function?
I tend to agree. This function should be more general purpose - and maybe a psych-centric wrapper can be housed in @rempsyc's package (I also just now noticed your handle is the name of the package 😅)
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #561      +/-   ##
==========================================
+ Coverage   91.14%   91.25%   +0.11%
==========================================
  Files          76       77       +1
  Lines        6045     6144      +99
==========================================
+ Hits         5510     5607      +97
- Misses        535      537       +2
If I understand, the main outstanding issue is what to do with the
Alright, in this case, I think I can introduce
Alright, this is a much simplified version, which now also supports `by`. So this is what I have so far:
library(datawizard)
describe_missing(airquality, select = "Ozone:Temp")
#>   variable n_missing missing_percent complete_percent
#> 1    Ozone        37           24.18            75.82
#> 2  Solar.R         7            4.58            95.42
#> 3     Wind         0            0.00           100.00
#> 4     Temp         0            0.00           100.00
#> 5    Total        44            7.19            92.81
describe_missing(airquality, exclude = "Ozone:Temp")
#>   variable n_missing missing_percent complete_percent
#> 1    Month         0               0              100
#> 2      Day         0               0              100
#> 3    Total         0               0              100
# Testing the 'by' argument for survey scales
set.seed(15)
fun <- function() {
  c(sample(c(NA, 1:10), replace = TRUE), NA, NA, NA)
}
df <- data.frame(
  ID = c("idz", NA),
  openness_1 = fun(), openness_2 = fun(), openness_3 = fun(),
  extroversion_1 = fun(), extroversion_2 = fun(), extroversion_3 = fun(),
  agreeableness_1 = fun(), agreeableness_2 = fun(), agreeableness_3 = fun()
)
df_long <- reshape_longer(
  df,
  select = -1,
  names_sep = "_",
  names_to = c("dimension", "item")
)
describe_missing(df_long,
  select = -c(1, 3),
  by = "dimension"
)
#>        variable n_missing missing_percent complete_percent
#> 1 agreeableness        10           23.81            76.19
#> 2  extroversion        17           40.48            59.52
#> 3      openness        11           26.19            73.81
#> 4         Total        38           15.08            84.92
Created on 2024-12-19 with reprex v2.1.1
Anything else you'd find desirable in the function?
@@ -16,6 +16,10 @@ BREAKING CHANGES AND DEPRECATIONS
- if `select` (previously `pattern`) is a named vector, then all elements
  must be named, e.g. `c(length = "Sepal.Length", "Sepal.Width")` errors.

NEW FUNCTIONS

* `describe_missing()`, to report on missing values in a data frame.
Suggested change:
- * `describe_missing()`, to report on missing values in a data frame.
+ * `describe_missing()`, to report on missing values in a data frame (#561).
#' @title Describe Missing Values in Data According to Guidelines
#'
#' @description Provides a detailed description of missing values in a data frame.
#' This function reports both absolute and percentage missing values of specified
Suggested change:
- #' This function reports both absolute and percentage missing values of specified
+ #' This function reports both absolute number and percentage of missing values of specified
#' variables and summary statistics will be computed for each group. Useful
#' for survey data by first reshaping the data to the long format.
I think the last sentence is very specific and hard to understand. This use case is part of the conversation, so we have it in mind, but it's a bit obscure for an external reader. It should be removed IMO.
#' @param sort Logical. Whether to sort the result from highest to lowest
#' percentage of missing data.
Given this can be done with an extra `data_arrange()`, I don't think it's necessary to add this argument.
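As a rough illustration of sorting after the fact, the sketch below orders a hand-built data frame shaped like the `describe_missing()` output shown earlier in this thread. The data frame `out` is typed in by hand for the example rather than produced by the function. The base R `order()` call always runs; the `data_arrange()` call is only attempted if datawizard is installed, and relies on its documented convention of prefixing a column name with a dash for decreasing order.

```r
# Hand-typed stand-in for a describe_missing() result (values from the
# airquality example in this thread, rows deliberately shuffled)
out <- data.frame(
  variable = c("Wind", "Ozone", "Temp", "Solar.R"),
  n_missing = c(0, 37, 0, 7),
  missing_percent = c(0, 24.18, 0, 4.58),
  stringsAsFactors = FALSE
)

# Base R: highest percentage of missing data first
sorted <- out[order(out$missing_percent, decreasing = TRUE), ]
print(sorted)

# Same idea with datawizard, if it is available
if (requireNamespace("datawizard", quietly = TRUE)) {
  sorted_dw <- datawizard::data_arrange(out, "-missing_percent")
  print(sorted_dw)
}
```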
#' - `n_missing`: Number of missing values.
#' - `missing_percent`: Percentage of missing values.
#' - `complete_percent`: Percentage of non-missing values.
#' @param ... Arguments passed down to other functions. Currently not used.
Do we actually need `...`?
If we keep it, it should be positioned before `@return`.
#' describe_missing(
#'   df_long,
#'   select = -c(1, 3),
#'   by = "dimension"
#' )
This fails with an unclear message if there is more than one variable in `by`.
The way this argument works is also not very clear to me. For instance, I'd find it more natural if the `by` variables were used to return a list of data frames, e.g.:
# group 1, subgroup 1
<output of describe_missing() for this particular (nested) group>
# group 1, subgroup 2
<output of describe_missing() for this particular (nested) group>
etc.
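One way the list-of-data-frames idea could look is sketched below in base R. This is an assumption-laden illustration, not the PR's code: `missing_by_group()` and the toy `df_long` (including its `wave` column) are made up here, and the grouping relies on `split()`, which combines several grouping columns into names such as `openness.t1`.

```r
# Hypothetical helper: one summary data frame per group defined by `by`.
# `by` may name one or several grouping columns.
missing_by_group <- function(data, by) {
  value_cols <- setdiff(names(data), by)
  # split() on a data frame of grouping columns uses their interaction
  groups <- split(data[value_cols], data[by], drop = TRUE)
  lapply(groups, function(g) {
    n_missing <- vapply(g, function(x) sum(is.na(x)), integer(1))
    data.frame(
      variable = names(g),
      n_missing = unname(n_missing),
      missing_percent = unname(round(100 * n_missing / nrow(g), 2)),
      stringsAsFactors = FALSE
    )
  })
}

# Made-up long-format data, for illustration only
df_long <- data.frame(
  dimension = rep(c("openness", "extroversion"), each = 3),
  wave = c("t1", "t1", "t2", "t1", "t2", "t2"),
  value = c(1, NA, 3, NA, NA, 2)
)

# One grouping variable: a list with one element per dimension
res_one <- missing_by_group(df_long[c("dimension", "value")], by = "dimension")
print(res_one)

# Two grouping variables: one element per observed (nested) combination
res_two <- missing_by_group(df_long, by = c("dimension", "wave"))
print(res_two)
```

Because the grouping columns are removed from each subset before summarising, there is no need to exclude them through `select`.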
Additionally, the current implementation means that `select` has to be used to exclude the non-`by` variables, otherwise the output is weird:
> describe_missing(df_long, by = "dimension")
        variable n_missing missing_percent complete_percent
1  agreeableness        21           50.00            50.00
2  agreeableness         0            0.00           100.00
3  agreeableness        10           23.81            76.19
4   extroversion        21           50.00            50.00
5   extroversion         0            0.00           100.00
6   extroversion        17           40.48            59.52
7       openness        21           50.00            50.00
8       openness         0            0.00           100.00
9       openness        11           26.19            73.81
10         Total       101           20.04            79.96
#' @return A dataframe with the following columns:
#' - `variable`: Variables selected.
#' - `n_missing`: Number of missing values.
#' - `missing_percent`: Percentage of missing values.
#' - `complete_percent`: Percentage of non-missing values.
Should it also have the total number of observations for better comparison? Although this number would be repeated for all rows...
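For what it's worth, a sketch of what such a column might look like, again in base R on `airquality`. The column name `n_obs` is hypothetical and not something proposed in the PR.

```r
# Sketch only: adds a (hypothetical) n_obs column next to the missing counts.
n_obs <- nrow(airquality)
n_missing <- vapply(airquality, function(x) sum(is.na(x)), integer(1))

res <- data.frame(
  variable = names(airquality),
  n_obs = n_obs, # identical for every row of a plain data frame
  n_missing = unname(n_missing),
  missing_percent = unname(round(100 * n_missing / n_obs, 2)),
  stringsAsFactors = FALSE
)
print(res)
```

As the comment anticipates, the value is the same on every row here; it would only vary if rows were summarised per group.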
Fixes #454