- Introduction
- Build docker image with utility script
- Environment variables
- Project setup
- Bills data setup
- How to create "bill_summary_us" dataset
- How to create "bill_text_us" dataset
- Upload bill data to cloud storage
- Download bill data from cloud storage
- Upload datasets data to cloud storage
- Setup scheduled running
- Examples
This project creates machine-learning datasets from US Congress bills. It currently supports creating the following datasets: "bill_summary_us" and "bill_text_us".
To build the docker image required for the project, you can use a manual docker command or the following utility script:
export DOCKER_IMAGE_TAGS=billml:1.0.0; . ./build_docker_image.sh
This chapter contains a table explaining the environment variables used in the project's docker container.
| Variable name | Default value | Description |
| --- | --- | --- |
|  | dreamproit/billml | The name of the docker image. |
|  | 1.0.0 | The version of the docker image. |
|  | billml | The name of the docker container. |
|  | linux/x86_64 | The platform used in the docker build. |
|  | False | Variable used in the dev docker-compose setup. |
|  | /path/to/repo/root | Variable used by the dev docker-compose to mount a local volume to the container root. |
|  | /usr/src/logs | Path to the folder where the project saves .log files and .gz log archives. |
|  | 4000 | The number of concurrent tasks used for dataset creation. |
|  | /datasets_data | Path to the folder where the project saves datasets. |
|  | /bills_data/data | Path to the folder where the congress project saves bills data. |
|  | /bills_data/cache | Path to the folder where the congress project saves cache data. |
|  | s3 | The name of the cloud provider used to store bills data. |
|  | your-bucket-name | The name of the AWS s3 bucket for bills data. |
|  | your-aws-access-key | The AWS access key for your s3 bucket. |
|  | your-aws-secret-key | The AWS secret key for your s3 bucket. |
|  | s3 | The name of the cloud provider used to store datasets data. |
|  | your-bucket-name | The name of the AWS s3 bucket for datasets data. |
|  | your-aws-access-key | The AWS access key for your s3 bucket. |
|  | your-aws-secret-key | The AWS secret key for your s3 bucket. |
|  | your-hugging-face-token | The Hugging Face token required by the utility script that uploads datasets to Hugging Face. |
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --bulkdata=BILLSTATUS"
Note: You can add the `--congress` parameter to download other congresses' bills. The data will be stored in the CONGRESS_DATA_FOLDER_FILEPATH folder.
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py bills --collections=BILLS"
Note: You can add the `--congress` parameter to download other congresses' bills.
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --collections=BILLS --extract=xml,pdf"
Note: You can add the `--congress` parameter to download other congresses' bills, as well as the `--extract` parameter to extract other formats.
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/tasks/check_local_files.py"
This script provides information about the local filesystem:
- How many BILL_STATUSES are available in the API
- How many "fdsys_billstatus.xml" files are present locally
- How many "data.json" files (made from "fdsys_billstatus.xml" files) are present locally
- How many bills are available in the API
- How many bill .xml files are present locally
Also, if no script steps are skipped, the script downloads all missing BILL_STATUS and bills data (very handy for keeping local bills data up to date).
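The counting part of that script can be sketched roughly as follows. This is a minimal illustration that assumes the filenames mentioned above; the real check_local_files.py also queries the API and downloads missing data:

```python
from pathlib import Path

def count_local_bill_files(data_root: str) -> dict:
    """Count bill data files below the congress data folder (sketch only)."""
    root = Path(data_root)
    return {
        # one status file per bill, downloaded from govinfo bulk data
        "fdsys_billstatus.xml": sum(1 for _ in root.rglob("fdsys_billstatus.xml")),
        # JSON files produced from the status XML files
        "data.json": sum(1 for _ in root.rglob("data.json")),
    }
```

Comparing the returned counts against the API totals tells you whether local data is complete.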
To create the "bill_summary_us" dataset from scratch, you need to do the following steps with the help of the congress project:
- Download BILLSTATUS .xml files (aka `fdsys_billstatus.xml` files)
- Convert `fdsys_billstatus.xml` files to `data.json` and `data.xml` files
- Download bills `text-versions`
- Run CLI commands to create the "bill_summary_us" dataset
The steps above are described in the "Project setup" chapter of this readme.
Use the following command to create the "bill_summary_us" dataset. Dataset creation relies heavily on the presence of bills data, so if you are setting up the project for the first time, make sure you completed the previous steps correctly and the bills data is present.
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/main.py --dataset_names='bill_summary_us'"
| Parameter name | Default value | Description |
| --- | --- | --- |
| dataset_names | None | The name(s) of the dataset(s) the user wants to create. |
| sections_limit | None | The minimum number of sections a bill must have to be included in the dataset; all bills with at least sections_limit sections are included. |
|  | None | The number of congresses the user wants to include in the dataset. If no value is provided, all congresses available in the filesystem are included. |
|  | ['hconres', 'hjres', 'hr', 'hres', 's', 'sconres', 'sjres', 'sres'] | The bill types the user wants to include in the dataset(s). If no value is provided, all bill types are included. |
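The filtering these parameters imply can be sketched as follows. This is a hypothetical helper, not the project's actual code, and the record field names are assumptions:

```python
DEFAULT_BILL_TYPES = ['hconres', 'hjres', 'hr', 'hres', 's', 'sconres', 'sjres', 'sres']

def keep_bill(bill: dict, sections_limit=None, bill_types=None) -> bool:
    """Return True if a bill record passes the dataset filters (sketch only)."""
    types = bill_types or DEFAULT_BILL_TYPES
    if bill["bill_type"] not in types:
        return False
    # sections_limit: include only bills with at least this many sections
    if sections_limit is not None and len(bill["sections"]) < sections_limit:
        return False
    return True
```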
To create the "bill_text_us" dataset from scratch, you need to do the following steps with the help of the congress project:
- Download bills `text-versions`
- Run CLI commands to create the "bill_text_us" dataset
The steps above are described in the "Project setup" chapter of this readme.
Use the following command to create the "bill_text_us" dataset. Dataset creation relies heavily on the presence of bills data, so if you are setting up the project for the first time, make sure you completed the previous steps correctly and the bills data is present.
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/main.py --dataset_names='bill_text_us'"
| Parameter name | Default value | Description |
| --- | --- | --- |
| dataset_names | None | The name(s) of the dataset(s) the user wants to create. |
| sections_limit | None | The minimum number of sections a bill must have to be included in the dataset; all bills with at least sections_limit sections are included. |
|  | None | The number of congresses the user wants to include in the dataset. If no value is provided, all congresses available in the filesystem are included. |
|  | ['hconres', 'hjres', 'hr', 'hres', 's', 'sconres', 'sjres', 'sres'] | The bill types the user wants to include in the dataset(s). If no value is provided, all bill types are included. |
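The created datasets are .jsonl files (one JSON object per line), as the Hugging Face upload example later in this readme shows. A minimal, generic sketch of writing and reading that format, with made-up record fields for illustration:

```python
import json

def write_jsonl(path, records):
    """Write an iterable of dicts as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a .jsonl file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```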
You can set up credentials to upload locally downloaded bills data to cloud storage (AWS s3 is currently supported). To do that, use the following command:
docker exec billml_compose-billml bash -c ". utils/upload_bills_data.sh"
You can set up credentials to download bills data (previously uploaded with this project) from cloud storage (AWS s3 is currently supported). To do that, use the following command:
docker exec billml_compose-billml bash -c ". utils/download_bills_data.sh"
Note: You must set up the ENV variables with your s3 bucket name and keys.
You can set up credentials to upload locally created datasets data to cloud storage (AWS s3 is currently supported). To do that, use the following command:
docker exec billml_compose-billml bash -c ". utils/upload_datasets_data.sh"
This chapter explains how to set up scheduled creation and uploading of fresh datasets to a Hugging Face repo. The following steps are described as separate chapters in this readme.
Set up scheduled crontab steps to create datasets
After you set up the ENV variables in the .env file, you can use the crontab -e command to edit the crontab file and add the backup script. Example of the recommended schedule (at 23:11 on Sunday):
11 23 * * SUN docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --bulkdata=BILLSTATUS; \
poetry run python /usr/src/app/congress/core/run.py bills --collections=BILLS; \
poetry run python /usr/src/app/congress/core/run.py govinfo --collections=BILLS --extract=xml,pdf; \
poetry run python /usr/src/app/congress/core/tasks/check_local_files.py; \
poetry run python /usr/src/app/main.py --dataset_names='bill_summary_us'; \
poetry run python /usr/src/app/main.py --dataset_names='bill_text_us'; \
. utils/upload_bills_data.sh; \
. utils/upload_datasets_data.sh
"
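For reference, a standard crontab expression has five fields: minute, hour, day of month, month, and day of week, so `11 23 * * SUN` means 23:11 every Sunday. A tiny sketch that splits an expression into its named fields:

```python
def cron_fields(expr: str) -> dict:
    """Split a five-field crontab expression into named fields."""
    minute, hour, day_of_month, month, day_of_week = expr.split()
    return {
        "minute": minute,
        "hour": hour,
        "day_of_month": day_of_month,
        "month": month,
        "day_of_week": day_of_week,
    }
```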
After the scheduled run, you can use the following commands to upload the datasets to Hugging Face:
# find name of created 'bill_summary_us' dataset
ls /datasets_data/bill_summary_us
# use utility script to upload dataset, for example:
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/utils/upload_dataset_to_hf.py --source_dataset_filepath=/datasets_data/bill_summary_us/bill_summary_us_11-10-2023_13-02-35.jsonl --hf_dataset_name=dreamproit/bill_summary_us --hf_path_in_repo=bill_summary_us.jsonl"
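Instead of running ls by hand, you could pick the newest dataset file programmatically by parsing the timestamp embedded in the filename. This sketch assumes the day-month-year_hour-minute-second pattern seen in the example filename above:

```python
from datetime import datetime
from pathlib import Path

def latest_dataset_file(folder: str, prefix: str) -> Path:
    """Return the dataset file with the newest embedded timestamp (sketch)."""
    def stamp(p: Path) -> datetime:
        # e.g. bill_summary_us_11-10-2023_13-02-35.jsonl -> "11-10-2023_13-02-35"
        raw = p.stem[len(prefix) + 1:]
        return datetime.strptime(raw, "%d-%m-%Y_%H-%M-%S")
    files = Path(folder).glob(f"{prefix}_*.jsonl")
    return max(files, key=stamp)
```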