pytest-ctdata-datapackage

Plugin for testing Tabular Data Packages with Tidy data resources.

This Pytest plugin was generated with Cookiecutter along with @hackebrot's Cookiecutter-pytest-plugin template.

Features

Leverage datapackage.json and JSON Table Schema to set up a series of fixtures for easy testing of `Tidy data`_.

Requirements

Tidy-formatted data stored as a .csv file
A datapackage.json file

Installation

You can install "pytest-ctdata-datapackage" via pip from GitHub:

:code:`$ pip install -e git+https://github.com/CT-Data-Collaborative/pytest-ctdata-datapackage#egg=pytest-ctdata-datapackage`

Usage

This plugin loads and structures a CTData CKAN dataset for value and structure testing. It is designed to be used alongside CTData Dataset Cookiecutter.

The plugin makes a few assumptions about the structure and organization of your data. It assumes that the root of your directory contains a datapackage.json file and exactly one resource file. This is stricter than the requirements imposed by the Tabular Data Package standard and stems from how we publish and display data.

Using the datapackage.json, the plugin will set up a number of fixtures that can then be used to run some basic tests against the final data set. Our cookiecutter plugin contains a testing file that includes a number of standard tests.

Most data published by CTData is associated with a limited set of geographies. Specifically:

Town/City
School District
County

When we publish data, we follow a number of conventions that impact data set testing.

1. All geographic entities are represented in the raw data file, even if no data is available. We consider the absence of data to be a meaningful data point in itself, so we backfill our data files to communicate this. We usually indicate the absence of data by setting the Value field to -9999.

2. All combinations of variables should be present. This follows from #1: if we choose to present a given disaggregation that is not uniformly available, we communicate this by setting the Value field to -9999.
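The backfilling convention above can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's own code: the `backfill` helper and the `Town` and `Year` column names are assumptions made for the example, and only the `Value` field and the -9999 sentinel come from the conventions described here.

```python
from itertools import product

MISSING = -9999  # sentinel used to mark "no data available"


def backfill(rows, towns, years):
    """Return rows plus a MISSING-valued row for every absent (town, year) pair.

    Hypothetical helper; "Town" and "Year" are assumed column names.
    """
    seen = {(row["Town"], row["Year"]) for row in rows}
    filled = list(rows)
    for town, year in product(towns, years):
        if (town, year) not in seen:
            filled.append({"Town": town, "Year": year, "Value": MISSING})
    return filled


rows = [{"Town": "Hartford", "Year": "2015", "Value": 10}]
filled = backfill(rows, ["Hartford", "New Haven"], ["2015", "2016"])
print(len(filled))  # 4 rows: 1 observed, 3 backfilled
```

Because every geography and year is present after backfilling, a row count test can compare the file against a fixed expected total.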

Provided fixtures include:

metadata - a dict representing the parsed datapackage.json file
geographies - a list of geographical entities present in data
domain - a boolean representing check that dataset domain is valid
years - a list of the years as specified in the metadata
dataset - a list of dicts representing the parsed tidy data file
spotchecks - a list of lookup keys and expected value
spotcheck_results - a list of named tuples, each of which contains the test spec, the expected result and the actual result
rowcount_results - a named tuple with the expected row count and actual row count
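As a rough sketch of how the fixtures might be consumed, the checks below operate on the same shapes the `dataset` and `years` fixtures are described as returning (a list of dicts and a list of years). The helper names and the `Year` column name are assumptions for illustration; they are not part of the plugin.

```python
def years_outside_metadata(dataset, years):
    """Return the sorted Year values in the data that the metadata does not list.

    Assumes each row is a dict with a "Year" key (hypothetical column name).
    """
    declared = set(years)
    return sorted({row["Year"] for row in dataset} - declared)


def rows_missing_value(dataset):
    """Return the indexes of rows that have no "Value" field at all."""
    return [i for i, row in enumerate(dataset) if "Value" not in row]


sample = [{"Year": "2015", "Value": 1}, {"Year": "2017"}]
print(years_outside_metadata(sample, ["2015", "2016"]))  # ['2017']
print(rows_missing_value(sample))                        # [1]
```

In a test module, a test such as `def test_years(dataset, years): assert not years_outside_metadata(dataset, years)` would receive the fixtures simply by naming them as arguments, which is how pytest injects fixtures.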

Metadata Schema

Testing row counts and the success / failure of backfilling and subgroup calculations requires knowing the relationship between factors and the degree to which factors are nested or in parallel.

For example, let's imagine that there is a data set where observations include information about Sex and Race/Ethnicity. There are two common scenarios. These variables could either represent a hierarchy of disaggregation or represent parallel disaggregations.

Let's say that the Sex factor includes the following levels:

Male
Female
All

And the Race/Ethnicity factor includes the following levels:

White
Black
Hispanic
All

If these factors are nested, each observation for a given Sex level (for example 'Male') also carries a Race/Ethnicity level, which can be any of the four levels listed above. Three Sex levels times four Race/Ethnicity levels gives twelve possible combinations:

Male/White
Female/White
All/White
Male/Black

and so on, through All/All.

As an alternative, these factors could be parallel, in which case a given observation can include information about either Sex OR Race/Ethnicity, but not both. The possible combinations are then:

Male/All
Female/All
All/All
All/White
All/Black
All/Hispanic
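The two scenarios can be enumerated directly to confirm the counts. This is only an illustrative sketch using the levels listed above; treating "All" as the value held fixed for the non-disaggregated factor is an assumption drawn from the lists in this section.

```python
from itertools import product

sex = ["Male", "Female", "All"]
race = ["White", "Black", "Hispanic", "All"]

# Nested: every Sex level is crossed with every Race/Ethnicity level.
nested = list(product(sex, race))

# Parallel: only one factor is disaggregated at a time; the other stays "All".
# A set removes the All/All pair that both halves would otherwise contribute.
parallel = sorted({(s, "All") for s in sex} | {("All", r) for r in race})

print(len(nested))    # 12
print(len(parallel))  # 6
```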

Sometimes the situation is more complex. Some factors can be hierarchical, while others can be parallel. This is often the case with education data. For example, data may be disaggregated by Sex and Race/Ethnicity with a separate disaggregation by grade.

Here is an example of how to specify a somewhat complex group of possible combinations:

{
  "dimension_groups" :
    [
      {
          "Unit Type": ["Detached"],
          "Measure Type": ["Number", "Percent"],
          "Variable": ["Housing Units", "Margins of Error"]
      },
      {
          "Unit Type": ["Total"],
          "Measure Type": ["Number"],
          "Variable": ["Housing Units", "Margins of Error"]
      }
    ]
}

Rows that contain data on Detached Unit Type can be either Number or Percent Measure Types. However, Total Unit Type rows only contain Number Measure Type observations (Percents would all be 100%).
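To make the link to row counts concrete, the sketch below expands the `dimension_groups` example into its individual combinations. The helper names are hypothetical, and the row count formula (combinations times geographies times years) is an assumption that holds only if the file is fully backfilled as described under Usage; the plugin's actual calculation may differ.

```python
from itertools import product

dimension_groups = [
    {
        "Unit Type": ["Detached"],
        "Measure Type": ["Number", "Percent"],
        "Variable": ["Housing Units", "Margins of Error"],
    },
    {
        "Unit Type": ["Total"],
        "Measure Type": ["Number"],
        "Variable": ["Housing Units", "Margins of Error"],
    },
]


def expand(groups):
    """List every valid combination of levels, one dict per combination."""
    combos = []
    for group in groups:
        names = list(group)
        for levels in product(*(group[name] for name in names)):
            combos.append(dict(zip(names, levels)))
    return combos


def expected_row_count(groups, n_geographies, n_years):
    """Assumed formula: one row per combination, geography and year."""
    return len(expand(groups)) * n_geographies * n_years


combos = expand(dimension_groups)
print(len(combos))                                 # 6 (4 Detached + 2 Total)
print(expected_row_count(dimension_groups, 3, 2))  # 36
```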

First, we include a specification of each factor and the available levels.

Second, we can include a list of the valid combinations.

For the first example (Sex and Race/Ethnicity nested), we would specify as follows:

[Sex, Race/Ethnicity]

For the second example (Sex and Race/Ethnicity in parallel), we would specify as follows:

Sex
Race/Ethnicity

For the third example (Sex and Race/Ethnicity nested, Grade in parallel):

[Sex, Race/Ethnicity]
Grade
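One possible reading of these specifications is that each entry names a group of factors that vary together, while every factor outside the group is held at "All". The sketch below implements that reading for the third example. The `valid_combinations` helper, the list-of-lists encoding of the spec and the Grade levels are all assumptions made for illustration, not the plugin's schema.

```python
from itertools import product

levels = {
    "Sex": ["Male", "Female", "All"],
    "Race/Ethnicity": ["White", "Black", "Hispanic", "All"],
    "Grade": ["Grade 4", "Grade 8", "All"],  # hypothetical levels
}

# Third example: Sex and Race/Ethnicity nested, Grade in parallel.
spec = [["Sex", "Race/Ethnicity"], ["Grade"]]


def valid_combinations(levels, spec, total="All"):
    """Return the set of level tuples allowed by the spec.

    Factors inside a group are fully crossed; factors outside it stay at total.
    """
    factors = list(levels)
    combos = set()
    for group in spec:
        choices = [levels[f] if f in group else [total] for f in factors]
        combos.update(product(*choices))
    return combos


combos = valid_combinations(levels, spec)
print(len(combos))  # 14: 12 nested + 3 by grade, with All/All/All counted once
```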

Roadmap

Fixtures to add:

subdomain - a boolean representing check that dataset subdomain is a valid value
domain_subdomain - a boolean representing check that domain/subdomain combination is a valid value
units - a list of expected measurement types
default - a dict of the expected default settings for CKAN

Contributing

Contributions are very welcome. Tests can be run with tox. Please ensure that coverage at least stays the same before you submit a pull request.

License

Distributed under the terms of the MIT license, "pytest-ctdata-datapackage" is free and open source software.

Issues

If you encounter any problems, please file an issue along with a detailed description.
