
BillML

Introduction

This project creates machine-learning datasets from US Congress bills. It currently supports creation of the following datasets: "bill_summary_us" and "bill_text_us".

Build the Docker image with the utility script

To build the Docker image required for the project to work, you can use a manual docker command or the following utility script:

export DOCKER_IMAGE_TAGS=billml:1.0.0; . ./build_docker_image.sh
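If you prefer the manual route, the equivalent docker command looks roughly like this (a minimal sketch, assuming the Dockerfile sits at the repository root and the tag matches DOCKER_IMAGE_TAGS):

# hypothetical manual equivalent of the utility script
docker build --platform=linux/x86_64 -t billml:1.0.0 .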

Environment variables

This chapter contains a table explaining the environment variables used in the project's Docker container.

| Variable name | Default value | Description |
| --- | --- | --- |
| BILLML_IMAGE_NAME | dreamproit/billml | The name of the billml image used in docker-compose.yaml. |
| BILLML_IMAGE_VERSION | 1.0.0 | The version of the billml image used in docker-compose.yaml. |
| BILLML_CONTAINER_NAME | billml | The name of the billml service container in docker-compose. |
| BILLML_PLATFORM_NAME | linux/x86_64 | The platform used for the billml docker-compose service. |
| BILLML_RUN_FOR_EVER | False | Used in billml's entrypoint.sh to keep the container running indefinitely (very handy in development). |
| BILLML_LOCAL_VOLUME_PATH | /path/to/repo/root | Used by the dev docker-compose to mount a local volume into the container root. |
| BILLML_LOG_OUTPUT_FOLDER | /usr/src/logs | Path to the folder where the project saves .log files and .gz log archives. |
| BILLML_LOCAL_FILESYSTEM_CONCURRENCY_LIMIT | 4000 | The number of concurrent tasks used for dataset creation (lower this number if you encounter an OSError). |
| BILLML_DATASETS_STORAGE_FILEPATH | /datasets_data | Path to the folder where the project saves datasets. |
| CONGRESS_DATA_FOLDER_FILEPATH | /bills_data/data | Path to the folder where the congress project saves bills data. |
| CONGRESS_CACHE_FOLDER_FILEPATH | /bills_data/cache | Path to the folder where the congress project saves cache data. |
| BILLML_BACKUP_BILLS_DATA_CLOUD_PROVIDER | s3 | The name of the cloud provider used to store bills data. |
| BILLML_BACKUP_BILLS_DATA_BUCKET_NAME | your-bucket-name | The name of the AWS S3 bucket for bills data. |
| BILLML_BACKUP_BILLS_DATA_ACCESS_KEY | your-aws-access-key | The AWS access key for your S3 bucket. |
| BILLML_BACKUP_BILLS_DATA_SECRET_KEY | your-aws-secret-key | The AWS secret key for your S3 bucket. |
| BILLML_BACKUP_DATASET_DATA_CLOUD_PROVIDER | s3 | The name of the cloud provider used to store datasets data. |
| BILLML_BACKUP_DATASET_DATA_BUCKET_NAME | your-bucket-name | The name of the AWS S3 bucket for datasets data. |
| BILLML_BACKUP_DATASET_DATA_ACCESS_KEY | your-aws-access-key | The AWS access key for your S3 bucket. |
| BILLML_BACKUP_DATASET_DATA_SECRET_KEY | your-aws-secret-key | The AWS secret key for your S3 bucket. |
| BILLML_BACKUP_DATASET_HF_SECRET_KEY | your-hugging-face-token | The Hugging Face token required by the utility script that uploads datasets to Hugging Face. |

Project setup

Setup .env file

Use the make command to create the .env file:

make env_setup
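After generation, adjust the .env values to your setup. A minimal illustrative example (the values are placeholders; the variable names come from the environment variables table above):

# illustrative .env values; replace the paths and credentials with your own
BILLML_IMAGE_NAME=dreamproit/billml
BILLML_IMAGE_VERSION=1.0.0
BILLML_LOCAL_VOLUME_PATH=/home/me/BillML
BILLML_DATASETS_STORAGE_FILEPATH=/datasets_data
CONGRESS_DATA_FOLDER_FILEPATH=/bills_data/data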

Bills data setup

Run the CLI command to download BILLSTATUS .xml files

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --bulkdata=BILLSTATUS"
Note
You can add the --congress parameter to download bills for other congresses. The data will be stored in the CONGRESS_DATA_FOLDER_FILEPATH folder.
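For example, to limit the download to a single congress (the congress number is illustrative):

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --bulkdata=BILLSTATUS --congress=117"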

Run the CLI command to convert fdsys_billstatus.xml files to data.json and data.xml files

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py bills --collections=BILLS"
Note
You can add the --congress parameter to convert bills for other congresses.
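For example, to convert the files for a single congress (the congress number is illustrative):

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py bills --collections=BILLS --congress=117"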

Run the CLI command to download bills text-versions

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --collections=BILLS --extract=xml,pdf"
Note
You can add the --congress parameter to download bills for other congresses, as well as the --extract parameter to extract other formats.
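For example, to download only the xml text-versions for a single congress (both parameter values are illustrative):

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --collections=BILLS --extract=xml --congress=117"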

Run "check_local_files.py" utility script

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/tasks/check_local_files.py"

This script provides information about the local filesystem:

  • How many BILL_STATUSES are available in the API

  • How many "fdsys_billstatus.xml" files are present locally

  • How many "data.json" files (made from "fdsys_billstatus.xml" files) are present locally

  • How many bills are available in the API

  • How many bills .xml files are present locally

If none of its steps are skipped, the script will also download all missing BILLSTATUS and bills data (very handy for keeping local bills data up to date).

How to create "bill_summary_us" dataset

To create "bill_summary_us" dataset from scratch you need to do following steps with help of congress project:

  • Download BILLSTATUS .xml files (aka "fdsys_billstatus.xml" files)

  • Convert fdsys_billstatus.xml files to data.json and data.xml files

  • Download bills text-versions

  • Run CLI commands to create "bill_summary_us" dataset

The steps above are described in the "Project setup" chapter of this readme.

Run "bill_summary_us" dataset creation CLI command

Use the following command to create the "bill_summary_us" dataset. Dataset creation relies heavily on the presence of bills data, so if you are setting up the project for the first time, make sure you completed the previous steps correctly and the bills data is present.

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/main.py --dataset_names='bill_summary_us'"

"bill_summary_us" CLI command parameters

| Parameter name | Default value | Description |
| --- | --- | --- |
| --dataset_names | None | The name of the dataset the user wants to create. |
| --sections_limit | None | The minimum number of sections a bill must have to be included in the dataset; all bills with a section count greater than or equal to sections_limit are included. |
| --congresses_to_include | None | The congresses the user wants to include in the dataset. If no value is provided, all congresses available in the filesystem are included. |
| --bill_types_to_include | ['hconres', 'hjres', 'hr', 'hres', 's', 'sconres', 'sjres', 'sres'] | The bill types the user wants to include in the dataset(s). If no value is provided, all bill types are included. |
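Putting the parameters together, a filtered run might look like the sketch below (the parameter values, and the comma-separated list syntax, are illustrative assumptions):

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/main.py --dataset_names='bill_summary_us' --sections_limit=5 --congresses_to_include=117,118 --bill_types_to_include=hr,s"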

How to create "bill_text_us" dataset

To create "bill_text_us" dataset from scratch you need to do following steps with help of congress project:

  • Download bills text-versions

  • Run CLI commands to create "bill_text_us" dataset

The steps above are described in the "Project setup" chapter of this readme.

Run "bill_text_us" dataset creation CLI command

Use the following command to create the "bill_text_us" dataset. Dataset creation relies heavily on the presence of bills data, so if you are setting up the project for the first time, make sure you completed the previous steps correctly and the bills data is present.

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/main.py --dataset_names='bill_text_us'"

"bill_text_us" CLI command parameters

| Parameter name | Default value | Description |
| --- | --- | --- |
| --dataset_names | None | The name of the dataset the user wants to create. |
| --sections_limit | None | The minimum number of sections a bill must have to be included in the dataset; all bills with a section count greater than or equal to sections_limit are included. |
| --congresses_to_include | None | The congresses the user wants to include in the dataset. If no value is provided, all congresses available in the filesystem are included. |
| --bill_types_to_include | ['hconres', 'hjres', 'hr', 'hres', 's', 'sconres', 'sjres', 'sres'] | The bill types the user wants to include in the dataset(s). If no value is provided, all bill types are included. |
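As with "bill_summary_us", the parameters can be combined (the values and list syntax are illustrative assumptions):

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/main.py --dataset_names='bill_text_us' --sections_limit=5 --congresses_to_include=118"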

Upload bill data to cloud storage

You can set up credentials to upload locally downloaded bills data to cloud storage (currently AWS S3 is supported). To do that, use the following command:

docker exec billml_compose-billml bash -c ". utils/upload_bills_data.sh"

Download bill data from cloud storage

You can set up credentials to download bills data (previously uploaded via the project) from cloud storage (currently AWS S3 is supported). To do that, use the following command:

docker exec billml_compose-billml bash -c ". utils/download_bills_data.sh"
Note
You must set the ENV variables with your S3 bucket name and keys.
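A minimal illustrative setup (placeholder values; the variable names come from the environment variables table):

# set in .env or export in your shell before running the upload/download scripts
BILLML_BACKUP_BILLS_DATA_CLOUD_PROVIDER=s3
BILLML_BACKUP_BILLS_DATA_BUCKET_NAME=my-bills-bucket
BILLML_BACKUP_BILLS_DATA_ACCESS_KEY=AKIA-example-access-key
BILLML_BACKUP_BILLS_DATA_SECRET_KEY=example-secret-key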

Upload datasets data to cloud storage

You can set up credentials to upload locally created datasets data to cloud storage (currently AWS S3 is supported). To do that, use the following command:

docker exec billml_compose-billml bash -c ". utils/upload_datasets_data.sh"

Setup scheduled running

This chapter explains how to set up the scheduled creation and upload of fresh datasets to a Hugging Face repo. The individual steps are described as separate chapters in this readme.

Setup crontab scheduled steps to create datasets

After you set up the ENV variables in the .env file, you can use the crontab -e command to edit the crontab file and add the backup script. Example of a recommended schedule (at 23:11 on Sunday):

11 23 * * SUN docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --bulkdata=BILLSTATUS; \
poetry run python /usr/src/app/congress/core/run.py bills --collections=BILLS; \
poetry run python /usr/src/app/congress/core/run.py govinfo --collections=BILLS --extract=xml,pdf; \
poetry run python /usr/src/app/congress/core/tasks/check_local_files.py; \
poetry run python /usr/src/app/main.py --dataset_names='bill_summary_us'; \
poetry run python /usr/src/app/main.py --dataset_names='bill_text_us'; \
. utils/upload_bills_data.sh; \
. utils/upload_datasets_data.sh
"

After a scheduled run, you can use the following commands to upload the datasets to Hugging Face:

# find name of created 'bill_summary_us' dataset
ls /datasets_data/bill_summary_us
# use utility script to upload dataset, for example:
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/utils/upload_dataset_to_hf.py --source_dataset_filepath=/datasets_data/bill_summary_us/bill_summary_us_11-10-2023_13-02-35.jsonl --hf_dataset_name=dreamproit/bill_summary_us --hf_path_in_repo=bill_summary_us.jsonl"
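The same utility script can upload the "bill_text_us" dataset; both the source filename and the Hugging Face dataset name below are illustrative:

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/utils/upload_dataset_to_hf.py --source_dataset_filepath=/datasets_data/bill_text_us/bill_text_us_11-10-2023_13-02-35.jsonl --hf_dataset_name=dreamproit/bill_text_us --hf_path_in_repo=bill_text_us.jsonl"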

Examples

Example .jsonl dataset files are stored in the repo's samples folder. Each example dataset file has 10 items.
