pytest-ctdata-datapackage

See Build Status on Travis CI

Plugin for testing Tabular Data Packages with Tidy data resources.


This Pytest plugin was generated with Cookiecutter along with @hackebrot's Cookiecutter-pytest-plugin template.

Features

Requirements

  • Tidy-formatted data stored as a .csv
  • Datapackage.json

Installation

You can install "pytest-ctdata-datapackage" via pip from GitHub:

    $ pip install -e git+https://github.com/CT-Data-Collaborative/pytest-ctdata-datapackage#egg=pytest-ctdata-datapackage

Usage

This plugin loads and structures a CTData CKAN dataset for value and structure testing. It is designed to be used alongside CTData Dataset Cookiecutter.

The plugin makes a few assumptions about the structure and organization of your data. It assumes that the root of your directory contains a datapackage.json and exactly one resource file. This is stricter than the requirements imposed by the Tabular Data Package standard and stems from how we publish and display data.
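The layout assumption can be checked before any tests run. The following is a minimal sketch, not the plugin's actual loading code: the function name `load_single_resource` is hypothetical, and it relies only on the `resources` array and the per-resource `path` key defined by the Data Package standard.

```python
import json
from pathlib import Path


def load_single_resource(root):
    """Return (metadata, resource_path) for a directory holding one resource.

    Hypothetical helper for illustration; not part of the plugin's API.
    """
    root = Path(root)
    with open(root / "datapackage.json", encoding="utf-8") as handle:
        metadata = json.load(handle)

    resources = metadata.get("resources", [])
    if len(resources) != 1:
        raise ValueError(
            "expected exactly one resource, found {}".format(len(resources))
        )
    # "path" is the resource location relative to datapackage.json.
    return metadata, root / resources[0]["path"]
```

Raising an error early keeps a second, unexpected resource file from being silently ignored by later tests.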

Using the datapackage.json, the plugin will set up a number of fixtures that can then be used to run some basic tests against the final data set. Our cookiecutter plugin contains a testing file that includes a number of standard tests.

Most data published by CTData is associated with a limited set of geographies. Specifically:

  • Town/City
  • School District
  • County

When we publish data, we follow a number of conventions that impact data set testing.

1. All geographic entities are represented in the raw data file, even if no data is available. We consider the absence of data to be a meaningful data point in itself, so we backfill our data files to communicate this. We usually indicate the absence of data by setting the Value field to -9999.

2. All combinations of variables should be present. This follows from #1: if we choose to present a given disaggregation that is not uniformly available, we communicate this by setting the Value field to -9999.
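As an illustration of the backfilling convention, here is a small sketch under stated assumptions. The helper `backfill` and the column names "Town" and "Year" are hypothetical; only the Value field and the -9999 sentinel come from the conventions above.

```python
MISSING = -9999  # sentinel used for "no data available"


def backfill(rows, geographies, years, geo_key="Town", year_key="Year"):
    """Return rows plus a MISSING-valued row for each absent geography/year pair.

    The column names "Town" and "Year" are assumptions for illustration.
    """
    present = {(row[geo_key], row[year_key]) for row in rows}
    filled = list(rows)
    for geo in geographies:
        for year in years:
            if (geo, year) not in present:
                filled.append({geo_key: geo, year_key: year, "Value": MISSING})
    return filled


rows = [{"Town": "Hartford", "Year": "2015", "Value": 12}]
filled = backfill(rows, ["Hartford", "New Haven"], ["2015"])
# filled now also contains a New Haven row whose Value is -9999.
```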

Provided fixtures include:

  • metadata - a dict representing the parsed datapackage.json file
  • geographies - a list of geographical entities present in data
  • domain - a boolean indicating whether the dataset domain is valid
  • years - a list of the years as specified in the metadata
  • dataset - a list of dicts representing the parsed tidy data file
  • spotchecks - a list of lookup keys and expected values
  • spotcheck_results - a list of named tuples, each of which contains the test spec, the expected result and the actual result
  • rowcount_results - a named tuple with the expected row count and actual row count

Metadata Schema

Testing row counts and the success / failure of backfilling and subgroup calculations requires knowing the relationship between factors and the degree to which factors are nested or in parallel.

For example, let's imagine that there is a data set where observations include information about Sex and Race/Ethnicity. There are two common scenarios. These variables could either represent a hierarchy of disaggregation or represent parallel disaggregations.

Let's say that the Sex factor includes the following levels

  • Male
  • Female
  • All

And the Race/Ethnicity factor includes the following levels

  • White
  • Black
  • Hispanic
  • All

If these are nested, each observation where Sex is 'Male' will have a corresponding Race/Ethnicity level that can be any of the four levels, and likewise for 'Female' and 'All'. This results in twelve possible combinations (three Sex levels times four Race/Ethnicity levels):

  • Male/White
  • Female/White
  • All/White
  • Male/Black

and so on, through All/All.

As an alternative, these factors could be parallel, in which case a given observation can include information about Sex or about Race/Ethnicity, but not both. The valid combinations are:

  • Male/All
  • Female/All
  • All/All
  • All/White
  • All/Black
  • All/Hispanic
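The two scenarios can be enumerated in a few lines. This is a sketch for illustration only; the function names are hypothetical and the code is not part of the plugin. Nested factors form a full Cartesian product, while parallel factors allow at most one factor to differ from its "All" level.

```python
from itertools import product

SEX = ["Male", "Female", "All"]
RACE = ["White", "Black", "Hispanic", "All"]


def nested_combinations(*factors):
    """Every level of each factor paired with every level of the others."""
    return list(product(*factors))


def parallel_combinations(*factors, total="All"):
    """Combinations in which at most one factor departs from its total level."""
    return [
        combo
        for combo in product(*factors)
        if sum(level != total for level in combo) <= 1
    ]


print(len(nested_combinations(SEX, RACE)))    # 12
print(len(parallel_combinations(SEX, RACE)))  # 6
```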

Sometimes the situation is more complex. Some factors can be hierarchical, while others can be parallel. This is often the case with education data. For example, data may be disaggregated by Sex and Race/Ethnicity with a separate disaggregation by grade.

Here is an example of how to specify a somewhat complex group of possible combinations:

{
  "dimension_groups" :
    [
      {
          "Unit Type": ["Detached"],
          "Measure Type": ["Number", "Percent"],
          "Variable": ["Housing Units", "Margins of Error"]
      },
      {
          "Unit Type": ["Total"],
          "Measure Type": ["Number"],
          "Variable": ["Housing Units", "Margins of Error"]
      }
    ]
}

Rows that contain data on Detached Unit Type can be either Number or Percent Measure Types. However, Total Unit Type rows only contain Number Measure Type observations (Percents would all be 100%).
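To see how such a specification could drive a row-count test, the sketch below expands each dimension group into its valid combinations. The function names are hypothetical, and the row-count formula assumes a fully backfilled file with one row per combination, geography and year; the plugin's own calculation may differ.

```python
from itertools import product

dimension_groups = [
    {
        "Unit Type": ["Detached"],
        "Measure Type": ["Number", "Percent"],
        "Variable": ["Housing Units", "Margins of Error"],
    },
    {
        "Unit Type": ["Total"],
        "Measure Type": ["Number"],
        "Variable": ["Housing Units", "Margins of Error"],
    },
]


def expand_groups(groups):
    """List every valid combination, one dict per combination."""
    combos = []
    for group in groups:
        names = list(group)
        for levels in product(*(group[name] for name in names)):
            combos.append(dict(zip(names, levels)))
    return combos


def expected_row_count(groups, n_geographies, n_years):
    # Assumes one row per combination, geography and year (fully backfilled).
    return len(expand_groups(groups)) * n_geographies * n_years


print(len(expand_groups(dimension_groups)))         # 6
print(expected_row_count(dimension_groups, 10, 2))  # 120
```

The first group contributes four combinations (1 x 2 x 2) and the second contributes two (1 x 1 x 2).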

First, we include a specification of each factor and the available levels.

Second, we can include a list of the valid combinations.

For the first example (Sex and Race/Ethnicity nested), we would specify as follows:

  • [Sex, Race/Ethnicity]

For the second example (Sex and Race/Ethnicity in parallel), we would specify as follows:

  • Sex
  • Race/Ethnicity

For the third (Sex and Race/Ethnicity nested, Grade in parallel):

  • [Sex, Race/Ethnicity]
  • Grade
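One way to read these specifications is sketched below: each entry is either a single factor or a list of nested factors, and every factor outside the current entry is held at its "All" level. The function `expand_spec`, the Python representation of the spec and the Grade levels are all assumptions made for this example, not the plugin's schema.

```python
from itertools import product

# Hypothetical levels; the Grade levels are invented for this example.
LEVELS = {
    "Sex": ["Male", "Female", "All"],
    "Race/Ethnicity": ["White", "Black", "Hispanic", "All"],
    "Grade": ["Grade 3", "Grade 4", "All"],
}


def expand_spec(spec, levels, total="All"):
    """Expand a nested/parallel spec into the list of valid combinations.

    Each spec entry is either one factor name or a list of nested factor
    names. Factors outside the current entry are held at the total level.
    """
    factors = list(levels)
    combos = []
    for entry in spec:
        group = [entry] if isinstance(entry, str) else list(entry)
        for chosen in product(*(levels[name] for name in group)):
            combo = dict.fromkeys(factors, total)
            combo.update(zip(group, chosen))
            row = tuple(combo[name] for name in factors)
            if row not in combos:  # All/All/All is shared between entries
                combos.append(row)
    return combos


third = expand_spec([["Sex", "Race/Ethnicity"], "Grade"], LEVELS)
print(len(third))  # 14
```

The nested pair yields twelve combinations with Grade held at "All", and Grade adds two more, since its own "All" row duplicates All/All/All.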

Roadmap

Fixtures to add:

  • subdomain - a boolean indicating whether the dataset subdomain is a valid value
  • domain_subdomain - a boolean indicating whether the domain/subdomain combination is valid
  • units - a list of expected measurement types
  • default - a dict of the expected default settings for CKAN

Contributing

Contributions are very welcome. Tests can be run with tox. Please ensure that coverage at least stays the same before you submit a pull request.

License

Distributed under the terms of the MIT license, "pytest-ctdata-datapackage" is free and open source software.

Issues

If you encounter any problems, please file an issue along with a detailed description.
