After updating `adding_data_long` in the traits.build-book repo, I realised that there are a number of issues (e.g. issue 71) and standing requests from AusTraits users for a category of functions I'll call `dataset_check` functions.
These are functions that look for suspicious patterns in data (outliers, duplicates within or across datasets) or work out the substitutions/taxonomic updates required. But they can't be part of `dataset_test`, because there is no assumption that the problems they flag can or will be "solved".
The code for some of the "tricks" is still in `adding_data_long` (generating a list of needed categorical trait value substitutions or taxonomic alignments). But I wonder if this should be a standalone "chapter" in traits.build-book that goes through each of these:
- `dataset_check_categorical_substitutions` - outputs substitutions required (based on excluded data)
- `dataset_check_numeric_values` - outputs out-of-range numeric values (i.e. those in excluded data)
- `dataset_check_taxonomic_updates` - outputs list of original names not in taxon_list
- `dataset_check_not_pivoting` - identifies the measurements that are preventing a dataset from pivoting
- `dataset_check_outlier_by_genus` - for numeric values, looks for values that are x% higher/lower than other values within the genus; only runs if n for trait * genus > 10 (or some other #)
- `dataset_check_outlier_by_species` - for numeric values, looks for values that are x% higher/lower than other values within the species; only runs if n for trait * species > 10 (or some other #)
- `dataset_check_duplicates_across_datasets` - searches for duplicates across datasets; it is tricky to work out significant figures, and for some traits duplication is likely anyway (e.g. %N, where there is a narrow range)
- `dataset_check_duplicates_within_dataset` - raises a flag if all values for a given taxon * trait are identical (e.g. growth form)
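To make the outlier checks concrete, here is a minimal base R sketch of the genus/species idea. It is not the existing code: the function name `check_outlier_by_group`, the column names (`trait_name`, `genus`, `value`), the defaults, and the choice of the group median as the comparison point are all assumptions made for illustration.

```r
# Hypothetical sketch: flag numeric values far from their group's median.
# `data` is assumed to have columns `trait_name`, `value` (numeric) and a
# grouping column such as `genus` or `taxon_name`; names are illustrative.
check_outlier_by_group <- function(data, group_col = "genus",
                                   pct = 200, min_n = 10) {
  # One group per trait x taxon-group combination
  key <- paste(data$trait_name, data[[group_col]], sep = " | ")

  # Per-row group size and group median (missing values ignored)
  data$group_n <- ave(data$value, key,
                      FUN = function(v) sum(!is.na(v)))
  data$group_median <- ave(data$value, key,
                           FUN = function(v) median(v, na.rm = TRUE))

  # Percentage deviation of each value from its group median
  data$pct_diff <- 100 * (data$value - data$group_median) / data$group_median

  # Keep only rows in groups with more than `min_n` values that deviate
  # by more than `pct` percent; which() drops NA/NaN comparisons
  flagged <- which(data$group_n > min_n & abs(data$pct_diff) > pct)
  data[flagged, , drop = FALSE]
}

# Made-up example: eleven values of 10 and one value of 100
example <- data.frame(
  trait_name = "leaf_length",
  genus = "Acacia",
  value = c(rep(10, 11), 100)
)
check_outlier_by_group(example)
```

Passing `group_col = "taxon_name"` (or whichever column holds the species name) would give the by-species version. A fold-difference threshold on log-transformed values may suit traits that span orders of magnitude better than a percentage, which is one reason users might want to adjust the exact formulation.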
@yangsophieee or I have code for all of them except `dataset_check_duplicates_across_datasets` (for which there is only old, in-need-of-rewrite code from a few years ago).
What is the best way to share these functions/code? Just in the book chapter/vignette? Or also as an additional R file somewhere?
* added new chapter toward traitecoevo/traits.build#137
* these functions might later become standalone functions in a file, but should still be included here, as users might want to change their exact formulations
- Added new chapter toward traitecoevo/traits.build#137
- These functions will also become traits.build functions, but should still be included here, as users might want to change their exact formulations
- Still need to add the `dataset_check_duplicates_across_datasets` function in the future
ehwenk changed the title from "Better options for checking data quality" to "[traits.build adding studies functions] Better options for checking data quality" on Jul 31, 2024