The Supreme Court Database, now in Python-friendly formats!

This repository contains Feather and Parquet files derived from the most recent versions of the legacy and modern Supreme Court Database (SCDB) datasets. As discussed on the SCDB website, the SCDB is released annually in a variety of formats that differ from one another along several axes (time period, unit of analysis, database record granularity, and file format). This repository contains a minimally altered version of each of these datasets.

Comparison to Official Datasets

I've made an active effort to ensure that, apart from the datasets in the data/preprocessed directory, the feather and parquet files in this repository are faithful reproductions of those found in the official releases. They should differ from the official data only in the following ways:

Human-readable strings are used instead of numeric codes for variable values. These strings match the ones found in the SPSS release.
In string-valued and categorical columns, np.nan values are replaced by the placeholder string 'MISSING_VALUE'.
Variables are converted to accurate and more-or-less optimal (in terms of storage space) data types. This includes using the experimental pd.StringDtype from pandas. As a result of this and, mostly, the general advantages of these file formats, the largest feather and parquet files created here are 6.5 MB and 3.4 MB, respectively, roughly 1.7% and 6.5% of the size of the largest .sav file from which they were imported.
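If you'd rather work with real missing values than the 'MISSING_VALUE' placeholder, the substitution is straightforward to undo after loading. Here's a minimal sketch, assuming pandas is installed. The helper name restore_missing and the toy rows below are made up for illustration and are not part of this repository.

```python
import pandas as pd

MISSING = "MISSING_VALUE"  # placeholder string described in the list above


def restore_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the placeholder turned back into missing values.

    Hypothetical helper for illustration; not part of this repository.
    """
    out = df.copy()
    for col in out.columns:
        dtype = out[col].dtype
        if isinstance(dtype, pd.CategoricalDtype):
            if MISSING in out[col].cat.categories:
                # Removing a category sets every value in it to NaN.
                out[col] = out[col].cat.remove_categories([MISSING])
        elif isinstance(dtype, pd.StringDtype) or dtype == object:
            # Blank out entries equal to the placeholder; leave the rest alone.
            out[col] = out[col].mask(out[col] == MISSING)
    return out


# Toy data standing in for a loaded dataset (values are invented).
toy = pd.DataFrame(
    {
        "chief": pd.Categorical(["Warren", MISSING, "Burger"]),
        "caseName": pd.array(["A v. B", MISSING, "C v. D"], dtype="string"),
        "term": [1953, 1969, 1970],
    }
)
clean = restore_missing(toy)
print(clean.isna().sum())
```

Numeric columns pass through untouched, and the original frame is left unmodified because the helper works on a copy.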

Available Files

data/raw contains the officially-released SPSS files from which I've derived datasets.
data/feather contains all of the generated feather files.
data/parquet contains (yep, you guessed it) the parquet files.
data/preprocessed contains a more refined version of the case-centric, citation-level dataset. This is a combination of the legacy and modern datasets that also includes some mild error correction and imputation work. If you're curious about the details, all changes are documented in the repository's dvc.yaml file, the data_pipeline package and, with more prose, on my blog beginning with this post. If you're interested in getting involved, contributions are welcome, as are feature requests and issues!
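Loading any of the generated files comes down to picking the right pandas reader. The sketch below assumes the files carry .feather and .parquet extensions and that pyarrow is installed alongside pandas. The read_scdb helper is hypothetical and is not something this repository ships.

```python
from pathlib import Path

import pandas as pd

# Map file suffixes to the matching pandas reader.
READERS = {
    ".feather": pd.read_feather,
    ".parquet": pd.read_parquet,
}


def read_scdb(path):
    """Load a generated file based on its extension (hypothetical helper)."""
    path = Path(path)
    try:
        reader = READERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"unsupported file type: {path.suffix!r}") from None
    return reader(path)


# Usage: substitute the name of an actual file from data/feather or data/parquet.
# df = read_scdb("data/parquet/<file-name>.parquet")
# print(df.dtypes)
```

Dispatching on the suffix keeps calling code the same whichever format you download, and anything else (such as the raw .sav files) fails loudly with a ValueError rather than being guessed at.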

Disclaimer

I'm not affiliated with the Supreme Court Database, and this project is not officially endorsed by members of the Supreme Court Database.
