Data Standardization and Validation

@hayesall hayesall released this 06 Aug 18:28
b74bda1

Standardized Data Formatting

All datasets are now validated against the grammar defined in srlearn/linter.

Datasets

Four more datasets are included in this release:

  • financial_nlp_small
  • nell_sports
  • boston_housing
  • icml

Other Changes

  • RELEASE_VERSION is now appended to the names of zip files. Instead of toy_cancer.zip, this and future releases include the version in the file name (e.g. toy_cancer_v0.0.4.zip).
  • Add general usage instructions to the main project README.md
  • Add a hash_datasets.sh script. It is not used at the moment, but it can compute a hash value for all files in a dataset. This could help track whether two versions of a dataset are exactly the same, even when the zip archives themselves differ.
  • Add a lint_datasets.sh script for testing dataset content
  • CI build: on pull requests and pushes to the main branch, the lint_datasets.sh script runs on all datasets under srlearn/
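
The idea behind hashing a dataset's files rather than its zip archive can be sketched in Python. This is not the hash_datasets.sh script from this release, whose contents are not shown here. It is a minimal illustration under stated assumptions: the function name hash_dataset is hypothetical, SHA-256 is an arbitrary choice of hash, and the dataset is assumed to be an extracted directory.

```python
import hashlib
from pathlib import Path


def hash_dataset(root: Path) -> str:
    """Return one SHA-256 hex digest covering every file under `root`.

    Hypothetical helper for illustration; not part of this repository.
    """
    digest = hashlib.sha256()
    # Collect (relative path, file) pairs and sort them so the result
    # does not depend on the order the filesystem lists entries in.
    entries = sorted(
        (p.relative_to(root).as_posix(), p)
        for p in root.rglob("*")
        if p.is_file()
    )
    for rel, path in entries:
        # Mix in the relative path so renaming a file changes the hash.
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")
        # Mix in a fixed-length digest of the file's bytes.
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()
```

Because only relative paths and file bytes feed the digest, two extracted copies of a dataset yield the same value even if their zip archives differ in timestamps or compression settings. Comparing hash_dataset(Path("old")) with hash_dataset(Path("new")) then shows whether the two versions hold identical files (the directory names here are placeholders).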