[traits.build adding studies functions] Better options for checking data quality #137

Open
ehwenk opened this issue Nov 22, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@ehwenk
Collaborator

ehwenk commented Nov 22, 2023

After updating adding_data_long in the traits.build-book repo, I realised that there are a number of issues (e.g. issue 71) and standing requests from AusTraits users for a category of function I'll call dataset_check functions.

These are functions to look for suspicious patterns in data (outliers, duplicates within or across datasets) or to work out substitutions/taxonomic updates required. But they can't be part of dataset_test because there isn't the assumption that they can/will be "solved".

The code for some of the "tricks" is still in adding_data_long (generating a list of needed categorical trait value substitutions or taxonomic alignments). But I wonder if this should be a standalone "chapter" in traits.build-book that goes through each of these:

dataset_check_categorical_substitutions - outputs substitutions required (based on excluded data)
dataset_check_numeric_values - outputs out-of-range numeric values (i.e. those in excluded data)
dataset_check_taxonomic_updates - outputs list of original names not in taxon_list
dataset_check_not_pivoting - identifies the measurements that are preventing a dataset from pivoting
dataset_check_outlier_by_genus - for numeric values, looks for values that are x% higher/lower than other values within the genus; only runs if n for trait * genus > 10 (or some other #)
dataset_check_outlier_by_species - for numeric values, looks for values that are x% higher/lower than other values within the species; only runs if n for trait * species > 10 (or some other #)
dataset_check_duplicates_across_datasets - searches for duplicates across datasets; it is tricky to decide how many significant figures to compare, and for some traits duplication is likely anyway (e.g. %N, where there is a narrow range)
dataset_check_duplicates_within_dataset - raise flag if all values for given taxon*trait are identical (i.e. growth form)
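
As a rough illustration of the outlier checks in the list above, here is a minimal base R sketch of `dataset_check_outlier_by_genus`. It is not the existing code mentioned in this issue. It assumes a data frame with hypothetical columns `genus`, `trait_name` and an already-numeric `value`, and it interprets "x% higher/lower" as a fold difference from the trait * genus median, which is an assumption; the `threshold` and `min_n` arguments are likewise made up for illustration.

```r
# Hypothetical sketch, not the traits.build implementation.
# Flags values that differ from the trait x genus median by more than
# `threshold`-fold, only for groups with more than `min_n` values.
dataset_check_outlier_by_genus <- function(data, threshold = 5, min_n = 10) {
  # One group per trait x genus combination
  key <- paste(data$trait_name, data$genus, sep = "|")
  # Group size and group median, repeated for every row of the group
  n <- ave(data$value, key, FUN = length)
  med <- ave(data$value, key, FUN = median)
  ratio <- data$value / med
  flagged <- n > min_n & (ratio > threshold | ratio < 1 / threshold)
  # A zero median gives NaN/Inf ratios; treat undefined comparisons as not flagged
  flagged <- !is.na(flagged) & flagged
  out <- data[flagged, , drop = FALSE]
  out$genus_median <- med[flagged]
  out
}

# Made-up example data: 11 Acacia values (one extreme), 3 Banksia values
example <- data.frame(
  genus = c(rep("Acacia", 11), rep("Banksia", 3)),
  trait_name = "leaf_length",
  value = c(9, 10, 11, 10, 9, 10, 11, 10, 9, 10, 100, 5, 5, 500)
)
flagged_rows <- dataset_check_outlier_by_genus(example)
print(flagged_rows)
```

With this made-up data only the Acacia value of 100 is returned; the extreme Banksia value is ignored because that group has too few values. Swapping `genus` for a species column would give the by-species variant.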

@yangsophieee or I have code for all of them except dataset_check_duplicates_across_datasets (old, in-need-of-rewrite code from a few years ago).
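
For the across-dataset check, one possible starting point is sketched below in base R. It is an assumption-laden sketch rather than a rewrite of the old code: it assumes a data frame with columns `dataset_id`, `taxon_name`, `trait_name` and a numeric `value`, and it handles the significant-figures problem simply by rounding with `signif()` to a user-chosen (hypothetical) `digits` argument before comparing.

```r
# Hypothetical sketch, not the traits.build implementation.
# Lists taxon x trait values that match, after rounding to `digits`
# significant figures, in more than one dataset.
dataset_check_duplicates_across_datasets <- function(data, digits = 3) {
  data$value_rounded <- signif(data$value, digits)
  key <- paste(data$taxon_name, data$trait_name, data$value_rounded, sep = "|")
  # Number of distinct datasets reporting each taxon x trait x rounded value
  n_datasets <- tapply(data$dataset_id, key, function(x) length(unique(x)))
  shared <- names(n_datasets)[n_datasets > 1]
  out <- data[key %in% shared, , drop = FALSE]
  out[order(out$taxon_name, out$trait_name, out$value_rounded, out$dataset_id), ]
}

# Made-up example data
example <- data.frame(
  dataset_id = c("Study_A", "Study_B", "Study_A", "Study_C"),
  taxon_name = c("Acacia aneura", "Acacia aneura",
                 "Banksia serrata", "Banksia serrata"),
  trait_name = "leaf_N_per_dry_mass",
  value = c(21.34, 21.3412, 9.8, 12.1)
)
dups <- dataset_check_duplicates_across_datasets(example)
print(dups)
```

This only lists candidate duplicates. For narrow-range traits such as %N, matches will often be coincidental, so the output would still need manual review or a trait-specific `digits` setting.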

What is the best way to share these functions/code? Just in the book chapter/vignette? Or also as an additional R file somewhere?

ehwenk added a commit to traitecoevo/traits.build-book that referenced this issue Nov 23, 2023
* added new chapter toward traitecoevo/traits.build#137
* these functions might later become standalone functions in a file, but should still be included here, as users might want to change their exact formulations
@ehwenk
Collaborator Author

ehwenk commented Nov 23, 2023

yangsophieee pushed a commit to traitecoevo/traits.build-book that referenced this issue Nov 24, 2023
- Added new chapter toward traitecoevo/traits.build#137
- These functions will also become traits.build functions, but should still be included here, as users might want to change their exact formulations
- Still need to add the `dataset_check_duplicates_across_datasets` function in the future
ehwenk changed the title from "Better options for checking data quality" to "[traits.build adding studies functions] Better options for checking data quality" Jul 31, 2024
ehwenk added the enhancement (New feature or request) label Jul 31, 2024
ehwenk added this to AusTraits Jul 31, 2024
ehwenk moved this to Backlog in AusTraits Jul 31, 2024
Projects
Status: Backlog
Development

No branches or pull requests

1 participant