Data Standardization and Validation

hayesall released this 06 Aug 18:28

b74bda1

Standardized Data Formatting

All datasets are now validated against the grammar defined in srlearn/linter.

Datasets

Four more datasets are included in this release:

financial_nlp_small
nell_sports
boston_housing
icml

Other Changes

RELEASE_VERSION is now appended to the names of zip files. Instead of toy_cancer.zip, this and future releases include the version in the file name (e.g. toy_cancer_v0.0.4.zip).
Add general usage instructions to the main project README.md
Add a hash_datasets.sh script. It is not used yet, but it can compute a hash value for all files in a dataset. This could help track whether two versions of a dataset are exactly the same, even when the zipped contents are different.
Add lint_datasets.sh script for testing dataset content
CI build: on pull requests and pushes to the main branch, the lint_datasets.sh script runs on all datasets under srlearn/
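
The idea behind hashing a dataset can be illustrated with a short Python sketch. This is not the hash_datasets.sh script itself, whose contents are not shown in these notes. It reads the hashing note as "two archives can differ byte for byte (for example in stored timestamps) while holding the same files", which is an assumption. The function names dataset_digest and make_zip, the file name train_pos.txt, and the example facts are all hypothetical.

```python
import hashlib
import io
import zipfile


def dataset_digest(zip_bytes: bytes) -> str:
    """Return a SHA-256 digest over the member names and contents of a zip archive.

    Archive metadata (timestamps, compression settings, member order) is ignored.
    """
    digest = hashlib.sha256()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Sort names so the order of members in the archive does not matter.
        for name in sorted(archive.namelist()):
            if name.endswith("/"):
                continue  # skip directory entries
            digest.update(name.encode("utf-8"))
            digest.update(b"\0")
            # Hash each file separately so name/content boundaries stay unambiguous.
            digest.update(hashlib.sha256(archive.read(name)).digest())
    return digest.hexdigest()


def make_zip(files: dict, timestamp: tuple) -> bytes:
    """Build an in-memory zip (hypothetical helper, used only for this demo)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in files.items():
            archive.writestr(zipfile.ZipInfo(name, date_time=timestamp), text)
    return buffer.getvalue()


# Hypothetical dataset contents, for illustration only.
files = {"train_pos.txt": "cancer(alice).\n"}
zip_a = make_zip(files, (2020, 1, 1, 0, 0, 0))
zip_b = make_zip(files, (2021, 6, 1, 12, 0, 0))
zip_c = make_zip({"train_pos.txt": "cancer(bob).\n"}, (2020, 1, 1, 0, 0, 0))

same_archive_bytes = zip_a == zip_b
same_dataset = dataset_digest(zip_a) == dataset_digest(zip_b)
changed_dataset = dataset_digest(zip_a) != dataset_digest(zip_c)
```

Here zip_a and zip_b differ as archives because of their stored timestamps, yet they produce the same digest, while zip_c changes a file and produces a different one. Comparing digests like this, rather than comparing the zip files directly, is one way to tell whether two releases contain identical data.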
