This repository contains Feather and Parquet files derived from the most recent versions of the legacy and modern Supreme Court Database (SCDB) datasets. As discussed on the SCDB website, the SCDB is released annually in a variety of formats that differ from one another along several axes (time period, unit of analysis, database record granularity, and file format). This repository contains a minimally altered version of each of these datasets.
I've made an active effort to ensure that, apart from datasets in the `data/preprocessed` directory, the feather and parquet files in this repository are faithful reproductions of those found in the official releases. They should differ from the official releases only in that:
- Human-readable strings are used instead of numeric codes for variable values. These strings match the ones found in the SPSS release.
- In string-valued and categorical columns, `np.nan` values are replaced by the description `'MISSING_VALUE'`.
- Variable data types are converted to accurate and more-or-less optimal (in terms of storage space) data types. This includes using the experimental `pd.StringDtype` from pandas. As a result of this and, mostly, general advantages of these file formats, the largest feather and parquet files we create here are 6.5 MB and 3.4 MB, respectively, roughly 1.7% and 6.5% the size of the largest `.sav` file from which we imported.
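
The missing-value and dtype conventions above can be illustrated with a minimal sketch. This is not the code from this repository's pipeline: the helper `fill_missing_labels` and the toy column names are hypothetical, and the sketch only assumes that pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd

MISSING = "MISSING_VALUE"


def fill_missing_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Label missing values in string and categorical columns (hypothetical helper)."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if isinstance(col.dtype, pd.CategoricalDtype):
            # A categorical only accepts declared categories, so add the label first.
            if MISSING not in col.cat.categories:
                col = col.cat.add_categories([MISSING])
            out[name] = col.fillna(MISSING)
        elif pd.api.types.is_object_dtype(col.dtype) or isinstance(
            col.dtype, pd.StringDtype
        ):
            # Fill the gaps, then store the column with the dedicated string dtype.
            out[name] = col.fillna(MISSING).astype(pd.StringDtype())
    return out


# Toy frame; the column names and values are made up for illustration.
toy = pd.DataFrame(
    {
        "chief": pd.Categorical(["Warren", None, "Burger"]),
        "caseName": ["A v. B", np.nan, "C v. D"],
        "term": [1960, 1961, 1970],
    }
)
cleaned = fill_missing_labels(toy)
print(cleaned.dtypes)
```

Numeric columns such as `term` pass through untouched, so only the string-valued and categorical columns carry the `'MISSING_VALUE'` label.
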
- `data/raw` contains the officially released SPSS files from which I've derived datasets.
- `data/feather` contains all of the generated feather files.
- `data/parquet` contains, yep you guessed it, the parquet files.
- `data/preprocessed` contains a more refined version of the case-centric, citation-level dataset. This is a combination of the legacy and modern datasets that also includes some mild error correction and imputation work. If you're curious for more details, all changes are documented in the repository's `dvc.yaml` file, the `data_pipeline` package and, with more prose, on my blog beginning with this post.

If you're interested in getting involved, contributions are welcome, as are feature requests and issues!
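
To see which files the `data/feather` and `data/parquet` directories hold in a local clone, something like the following sketch may help. It assumes the files carry `.feather` and `.parquet` extensions, which is a guess rather than something stated here, and `list_datasets` is a hypothetical helper, not part of the `data_pipeline` package.

```python
from pathlib import Path


def list_datasets(repo_root="."):
    """Return the file names found under data/feather and data/parquet."""
    data_dir = Path(repo_root) / "data"
    # Assumed extensions; adjust if the files in the clone are named differently.
    patterns = {"feather": "*.feather", "parquet": "*.parquet"}
    return {
        kind: sorted(p.name for p in (data_dir / kind).glob(pattern))
        for kind, pattern in patterns.items()
    }


# Run from the root of a local clone; a missing directory yields an empty list.
available = list_datasets(".")
for kind, names in available.items():
    print(kind, names)
```

Any path found this way can be passed to `pd.read_feather` or `pd.read_parquet`, both of which need a backend such as `pyarrow` to be installed.
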
I'm not affiliated with the Supreme Court Database, and this project is not officially endorsed by members of the Supreme Court Database.