Data Standardization and Validation
Standardized Data Formatting
All datasets are now validated with the grammar defined in srlearn/linter.
Datasets
Four more datasets are included in this release:
- financial_nlp_small
- nell_sports
- boston_housing
- icml
Other Changes
- `RELEASE_VERSION` is now appended to the end of zipfile names. So instead of releasing `toy_cancer.zip`, this and future versions will have a version (e.g. `toy_cancer_v0.0.4.zip`) as part of the file name.
- Add general usage instructions to the main project `README.md`.
- Add a `hash_datasets.sh` script. This is not used at the moment, but can be used to get a hash value for all files in a dataset. This could be helpful for tracking whether two versions of a dataset are exactly the same, even when the zipped contents are different.
- Add a `lint_datasets.sh` script for testing dataset content.
- CI build: on pull requests and pushes to the main branch, the `lint_datasets.sh` script runs on all datasets under `srlearn/`.
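
The idea behind hashing dataset contents rather than the zipfile itself can be illustrated with a short Python sketch. This is not the `hash_datasets.sh` script and makes no claim about how that script works; the function names `dataset_content_hash` and `build_zip` and the example file names are hypothetical. It assumes that "exactly the same" means the same member names and the same bytes in each member, ignoring timestamps and member order.

```python
import hashlib
import io
import zipfile


def dataset_content_hash(zip_bytes: bytes) -> str:
    """Return a SHA-256 digest over the member names and contents of a zip.

    Members are visited in sorted name order and archive metadata such as
    timestamps is ignored, so two archives holding identical files produce
    the same digest even when the archives differ byte for byte.
    """
    digest = hashlib.sha256()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if name.endswith("/"):
                continue  # skip directory entries
            data = archive.read(name)
            # Separate name and length-prefix the data so that member
            # boundaries cannot be confused with file content.
            digest.update(name.encode("utf-8"))
            digest.update(b"\0")
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
    return digest.hexdigest()


def build_zip(files: dict, timestamp: tuple) -> bytes:
    """Build an in-memory zip from {name: bytes}, stamping members with `timestamp`."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files.items():
            archive.writestr(zipfile.ZipInfo(name, date_time=timestamp), data)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical dataset files, used only for illustration.
    files = {
        "toy_cancer/train/pos.txt": b"cancer(alice).\n",
        "toy_cancer/train/neg.txt": b"cancer(bob).\n",
    }
    first = build_zip(files, (2020, 1, 1, 0, 0, 0))
    second = build_zip(dict(reversed(list(files.items()))), (2021, 6, 1, 12, 0, 0))
    print(first == second)  # the zip bytes differ
    print(dataset_content_hash(first) == dataset_content_hash(second))
```

Comparing the two digests, rather than the archives themselves, is what would let two releases be recognized as carrying the same dataset even though they were zipped at different times.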