[traits.build adding studies functions] Better options for checking data quality #137

ehwenk · 2023-11-22T21:39:58Z

After updating adding_data_long in the traits.build-book repo, I realised that there are a number of issues (e.g. issue 71) and standing requests from AusTraits users for a category of function I'll call dataset_check functions.

These are functions to look for suspicious patterns in data (outliers, duplicates within or across datasets) or to work out substitutions/taxonomic updates required. But they can't be part of dataset_test because there isn't the assumption that they can/will be "solved".

The code for some of the "tricks" is still in adding_data_long (generating a list of needed categorical trait value substitutions or taxonomic alignments). But I wonder if this should be a standalone "chapter" in traits.build-book that goes through each of these:

dataset_check_categorical_substitutions - outputs substitutions required (based on excluded data)
dataset_check_numeric_values - outputs out-of-range numeric values (i.e. those in excluded data)
dataset_check_taxonomic_updates - outputs list of original names not in taxon_list
dataset_check_not_pivoting - identifies the measurements that are preventing a dataset from pivoting
dataset_check_outlier_by_genus - for numeric values, looks for values that are x% higher/lower than other values within the genus; only runs if n for trait * genus > 10 (or some other threshold)
dataset_check_outlier_by_species - for numeric values, looks for values that are x% higher/lower than other values within the species; only runs if n for trait * species > 10 (or some other threshold)
dataset_check_duplicates_across_datasets - searches for duplicates across datasets; it is tricky to decide how many significant figures to compare, and for some traits duplication is likely anyway (e.g. %N, where there is a narrow range)
dataset_check_duplicates_within_dataset - raises a flag if all values for a given taxon * trait are identical (e.g. growth form)
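
To make the outlier checks concrete, here is a minimal sketch of the logic in Python. It is not the traits.build implementation (which is in R). The function name `check_outliers_by_group`, the record keys `trait_name`, `genus` and `value`, the use of the group median as the reference point, and the default thresholds are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def check_outliers_by_group(records, group_key="genus", pct=200.0, min_n=10):
    """Return records whose value differs from the median of their
    trait * group by more than `pct` percent.

    Hypothetical sketch: `records` is a list of dicts with the keys
    "trait_name", `group_key` and a numeric "value".
    """
    # Collect records for each trait * group combination.
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["trait_name"], rec[group_key])].append(rec)

    flagged = []
    for recs in groups.values():
        # Only run the check when n for trait * group exceeds min_n.
        if len(recs) <= min_n:
            continue
        centre = median(r["value"] for r in recs)
        if centre == 0:
            # A percentage deviation from zero is undefined; skip the group.
            continue
        for r in recs:
            deviation = abs(r["value"] - centre) / abs(centre) * 100
            if deviation > pct:
                flagged.append(r)
    return flagged


# Made-up example data: eleven similar values and one suspicious one.
records = [
    {"trait_name": "leaf_N", "genus": "Acacia", "value": v}
    for v in [2.0] * 11 + [20.0]
]
flagged = check_outliers_by_group(records)
```

Passing `group_key="species"` (with a matching key in each record) would give the by-species variant. Both `pct` and `min_n` are left as arguments because, as noted above, the exact thresholds are still open.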

@yangsophieee or I have code for all of them except dataset_check_duplicates_across_datasets, for which there is only old, in-need-of-rewrite code from a few years ago.

What is the best way to share these functions/code? Just in the book chapter/vignette? Or also as an additional R file somewhere?


* added new chapter toward traitecoevo/traits.build#137
* these functions might later become standalone functions in a file, but still should be included here, as users might want to change their exact formulations

ehwenk · 2023-11-23T21:56:47Z

See traitecoevo/traits.build-book@cbe6ceb

- Added new chapter toward traitecoevo/traits.build#137
- These functions will also become traits.build functions, but still should be included here, as users might want to change their exact formulations
- Still need to add the `dataset_check_duplicates_across_datasets` function in the future

yangsophieee mentioned this issue Nov 23, 2023

Should we include tests of build dataset in dataset_test ? #71

Closed

ehwenk mentioned this issue Nov 24, 2023

add chapter, check_dataset_functions traitecoevo/traits.build-book#14

Merged

ehwenk changed the title ~~Better options for checking data quality~~ [traits.build adding studies functions] Better options for checking data quality Jul 31, 2024

ehwenk added the enhancement New feature or request label Jul 31, 2024

ehwenk added this to AusTraits Jul 31, 2024

ehwenk moved this to Backlog in AusTraits Jul 31, 2024
