
BillML

Introduction

This project creates machine-learning datasets from US Congress bills. It currently supports creation of the following datasets: "bill_summary_us" and "bill_text_us".

Build the Docker image with the utility script

To build the Docker image required for the project to work, you can use a manual docker command or the following utility script:

export DOCKER_IMAGE_TAGS=billml:1.0.0; . ./build_docker_image.sh
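If you prefer the manual route, the equivalent docker command looks roughly like this (a minimal sketch, assuming the Dockerfile sits at the repository root and the tag matches DOCKER_IMAGE_TAGS):

# hypothetical manual equivalent of the utility script
docker build --platform=linux/x86_64 -t billml:1.0.0 .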

Environment variables

This chapter contains a table explaining the environment variables used in the project's Docker container.

| Variable name | Default value | Description |
| --- | --- | --- |
| BILLML_IMAGE_NAME | dreamproit/billml | The name of the billml image used in docker-compose.yaml. |
| BILLML_IMAGE_VERSION | 1.0.0 | The version of the billml image used in docker-compose.yaml. |
| BILLML_CONTAINER_NAME | billml | The name of the billml service container in docker-compose. |
| BILLML_PLATFORM_NAME | linux/x86_64 | The platform used for the billml docker-compose service. |
| BILLML_RUN_FOR_EVER | False | Used in billml's entrypoint.sh to keep the container running indefinitely (very handy in development). |
| BILLML_LOCAL_VOLUME_PATH | /path/to/repo/root | Used by the dev docker-compose to mount a local volume into the container root. |
| BILLML_LOG_OUTPUT_FOLDER | /usr/src/logs | Path to the folder where the project saves .log files and .gz log archives. |
| BILLML_LOCAL_FILESYSTEM_CONCURRENCY_LIMIT | 4000 | The number of concurrent tasks used for dataset creation (lower this number if you encounter an OSError). |
| BILLML_DATASETS_STORAGE_FILEPATH | /datasets_data | Path to the folder where the project saves datasets. |
| CONGRESS_DATA_FOLDER_FILEPATH | /bills_data/data | Path to the folder where the congress project saves bills data. |
| CONGRESS_CACHE_FOLDER_FILEPATH | /bills_data/cache | Path to the folder where the congress project saves cache data. |
| BILLML_BACKUP_BILLS_DATA_CLOUD_PROVIDER | s3 | The name of the cloud provider used to store bills data. |
| BILLML_BACKUP_BILLS_DATA_BUCKET_NAME | your-bucket-name | The name of the AWS S3 bucket for bills data. |
| BILLML_BACKUP_BILLS_DATA_ACCESS_KEY | your-aws-access-key | The AWS access key for your S3 bucket. |
| BILLML_BACKUP_BILLS_DATA_SECRET_KEY | your-aws-secret-key | The AWS secret key for your S3 bucket. |
| BILLML_BACKUP_DATASET_DATA_CLOUD_PROVIDER | s3 | The name of the cloud provider used to store datasets data. |
| BILLML_BACKUP_DATASET_DATA_BUCKET_NAME | your-bucket-name | The name of the AWS S3 bucket for datasets data. |
| BILLML_BACKUP_DATASET_DATA_ACCESS_KEY | your-aws-access-key | The AWS access key for your S3 bucket. |
| BILLML_BACKUP_DATASET_DATA_SECRET_KEY | your-aws-secret-key | The AWS secret key for your S3 bucket. |
| BILLML_BACKUP_DATASET_HF_SECRET_KEY | your-hugging-face-token | The Hugging Face token required by the utility script that uploads datasets to Hugging Face. |

Project setup

Setup .env file

Use the make command to create the .env file:

make env_setup
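After generation, adjust the .env values to your setup. A minimal illustrative example (the values are placeholders; the variable names come from the environment variables table above):

# illustrative .env values; replace the paths and credentials with your own
BILLML_IMAGE_NAME=dreamproit/billml
BILLML_IMAGE_VERSION=1.0.0
BILLML_LOCAL_VOLUME_PATH=/home/me/BillML
BILLML_DATASETS_STORAGE_FILEPATH=/datasets_data
CONGRESS_DATA_FOLDER_FILEPATH=/bills_data/data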

Bills data setup

Run the CLI command to download BILLSTATUS .xml files

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --bulkdata=BILLSTATUS"
Note
You can add the --congress parameter to download bills for other congresses. The data will be stored in the CONGRESS_DATA_FOLDER_FILEPATH folder.
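For example, to limit the download to a single congress (the congress number is illustrative):

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --bulkdata=BILLSTATUS --congress=117"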

Run the CLI command to convert fdsys_billstatus.xml files to data.json and data.xml files

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py bills --collections=BILLS"
Note
You can add the --congress parameter to convert bills for other congresses.
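For example, to convert the files for a single congress (the congress number is illustrative):

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py bills --collections=BILLS --congress=117"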

Run the CLI command to download bills text-versions

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --collections=BILLS --extract=xml,pdf"
Note
You can add the --congress parameter to download bills for other congresses, as well as the --extract parameter to extract other formats.
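For example, to download only the xml text-versions for a single congress (both parameter values are illustrative):

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --collections=BILLS --extract=xml --congress=117"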

Run "check_local_files.py" utility script

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/tasks/check_local_files.py"

This script provides information about the local filesystem:

  • How many BILL_STATUSES are available in the API

  • How many "fdsys_billstatus.xml" files are present locally

  • How many "data.json" files (made from "fdsys_billstatus.xml" files) are present locally

  • How many bills are available in the API

  • How many bills .xml files are present locally

If none of its steps are skipped, the script will also download all missing BILLSTATUS and bills data (very handy for keeping local bills data up to date).

How to create "bill_summary_us" dataset

To create "bill_summary_us" dataset from scratch you need to do following steps with help of congress project:

  • Download BILLSTATUS .xml files (aka "fdsys_billstatus.xml" files)

  • Convert fdsys_billstatus.xml files to data.json and data.xml files

  • Download bills text-versions

  • Run CLI commands to create "bill_summary_us" dataset

The steps above are described in the "Project setup" chapter of this readme.

Run "bill_summary_us" dataset creation CLI command

Use the following command to create the "bill_summary_us" dataset. Dataset creation relies heavily on the presence of bills data, so if you are setting up the project for the first time, make sure you completed the previous steps correctly and the bills data is present.

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/main.py --dataset_names='bill_summary_us'"

"bill_summary_us" CLI command parameters

| Parameter name | Default value | Description |
| --- | --- | --- |
| --dataset_names | None | The name of the dataset the user wants to create. |
| --sections_limit | None | The minimum number of sections a bill must have to be included in the dataset; all bills with a section count greater than or equal to sections_limit are included. |
| --congresses_to_include | None | The congresses the user wants to include in the dataset. If no value is provided, all congresses available in the filesystem are included. |
| --bill_types_to_include | ['hconres', 'hjres', 'hr', 'hres', 's', 'sconres', 'sjres', 'sres'] | The bill types the user wants to include in the dataset(s). If no value is provided, all bill types are included. |
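Putting the parameters together, a filtered run might look like the sketch below (the parameter values, and the comma-separated list syntax, are illustrative assumptions):

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/main.py --dataset_names='bill_summary_us' --sections_limit=5 --congresses_to_include=117,118 --bill_types_to_include=hr,s"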

How to create "bill_text_us" dataset

To create "bill_text_us" dataset from scratch you need to do following steps with help of congress project:

  • Download bills text-versions

  • Run CLI commands to create "bill_text_us" dataset

The steps above are described in the "Project setup" chapter of this readme.

Run "bill_text_us" dataset creation CLI command

Use the following command to create the "bill_text_us" dataset. Dataset creation relies heavily on the presence of bills data, so if you are setting up the project for the first time, make sure you completed the previous steps correctly and the bills data is present.

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/main.py --dataset_names='bill_text_us'"

"bill_text_us" CLI command parameters

| Parameter name | Default value | Description |
| --- | --- | --- |
| --dataset_names | None | The name of the dataset the user wants to create. |
| --sections_limit | None | The minimum number of sections a bill must have to be included in the dataset; all bills with a section count greater than or equal to sections_limit are included. |
| --congresses_to_include | None | The congresses the user wants to include in the dataset. If no value is provided, all congresses available in the filesystem are included. |
| --bill_types_to_include | ['hconres', 'hjres', 'hr', 'hres', 's', 'sconres', 'sjres', 'sres'] | The bill types the user wants to include in the dataset(s). If no value is provided, all bill types are included. |
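As with "bill_summary_us", the parameters can be combined (the values and list syntax are illustrative assumptions):

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/main.py --dataset_names='bill_text_us' --sections_limit=5 --congresses_to_include=118"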

Upload bill data to cloud storage

You can set up credentials to upload locally downloaded bills data to cloud storage (currently AWS S3 is supported). To do that, use the following command:

docker exec billml_compose-billml bash -c ". utils/upload_bills_data.sh"

Download bill data from cloud storage

You can set up credentials to download bills data (previously uploaded via the project) from cloud storage (currently AWS S3 is supported). To do that, use the following command:

docker exec billml_compose-billml bash -c ". utils/download_bills_data.sh"
Note
You must set the ENV variables with your S3 bucket name and keys.
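A minimal illustrative setup (placeholder values; the variable names come from the environment variables table):

# set in .env or export in your shell before running the upload/download scripts
BILLML_BACKUP_BILLS_DATA_CLOUD_PROVIDER=s3
BILLML_BACKUP_BILLS_DATA_BUCKET_NAME=my-bills-bucket
BILLML_BACKUP_BILLS_DATA_ACCESS_KEY=AKIA-example-access-key
BILLML_BACKUP_BILLS_DATA_SECRET_KEY=example-secret-key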

Upload datasets data to cloud storage

You can set up credentials to upload locally created datasets data to cloud storage (currently AWS S3 is supported). To do that, use the following command:

docker exec billml_compose-billml bash -c ". utils/upload_datasets_data.sh"

Setup scheduled running

This chapter explains how to set up the scheduled creation and upload of fresh datasets to a Hugging Face repo. The individual steps are described as separate chapters in this readme.

Setup crontab scheduled steps to create datasets

After you set up the ENV variables in the .env file, you can use the crontab -e command to edit the crontab file and add the backup script. Example of a recommended schedule (at 23:11 on Sunday):

11 23 * * SUN docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --bulkdata=BILLSTATUS; \
poetry run python /usr/src/app/congress/core/run.py bills --collections=BILLS; \
poetry run python /usr/src/app/congress/core/run.py govinfo --collections=BILLS --extract=xml,pdf; \
poetry run python /usr/src/app/congress/core/tasks/check_local_files.py; \
poetry run python /usr/src/app/main.py --dataset_names='bill_summary_us'; \
poetry run python /usr/src/app/main.py --dataset_names='bill_text_us'; \
. utils/upload_bills_data.sh; \
. utils/upload_datasets_data.sh
"

After a scheduled run, you can use the following commands to upload the datasets to Hugging Face:

# find name of created 'bill_summary_us' dataset
ls /datasets_data/bill_summary_us
# use utility script to upload dataset, for example:
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/utils/upload_dataset_to_hf.py --source_dataset_filepath=/datasets_data/bill_summary_us/bill_summary_us_11-10-2023_13-02-35.jsonl --hf_dataset_name=dreamproit/bill_summary_us --hf_path_in_repo=bill_summary_us.jsonl"
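The same utility script can upload the "bill_text_us" dataset; both the source filename and the Hugging Face dataset name below are illustrative:

docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/utils/upload_dataset_to_hf.py --source_dataset_filepath=/datasets_data/bill_text_us/bill_text_us_11-10-2023_13-02-35.jsonl --hf_dataset_name=dreamproit/bill_text_us --hf_path_in_repo=bill_text_us.jsonl"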

Examples

Example .jsonl dataset files are stored in the repo's samples folder. Each example dataset file has 10 items.
