BillTitleIndex

Table of Contents

Overview
Tools
- Swagger
- Flower
- PgAdmin
Usage
- API Usage
Deployment with docker-compose
Development (Individual components)
- How to update recent BillStatus XML Files:
- Pipeline

Overview

The BillTitleIndex repository gathers metadata about the titles of the bills in the United States Congress. This information is collected, added to a PostgreSQL database, and then used to create a searchable index and API. This API is combined with an API that identifies similar bills (aih/billtitles-py), and this combined API will be used by the BillMap project (/unitedstates/Bill Map) which provides context and associated documents for similar bills (e.g. companion bills, related bills, etc.).

This repository provides a containerized environment that scrapes bill metadata from scratch; processes and stores data to PostgreSQL and Elasticsearch; and provides an API with FastAPI/Swagger/OpenAPI to search bill titles.

The docker-compose orchestrates the following containers: * PostgreSQL database * Elasticsearch index * Celery workers * RabbitMQ message broker * Flower server * FastAPI API

Tools

Swagger

Swagger is a tool to work with self-documented code. This is a great out of the box feature of FastAPI. It allows writing documentation directly in the code and shows it in the UI, testing API endpoints, describing request parameters and schemas for requests and responses.

Flower

Tool to control and monitor celery workers.

Accessible by this url http://localhost:5555

NOTE: Default Login / Password: admin/root. Can be configured by .env file.

PgAdmin

Tool to query Postgres Database, monitor DB state, etc.

Accessible by this url http://localhost:5050

NOTE: Default Login / Password: admin@localhost.com/root. Can be configured by .env file.

Staring from the scratch you won’t have added databases, so you can do it:

right click on Servers (top left corner)
Register → Server

Then you will see:

Usage

API Usage

Once the API is running, you can access it at port 8000 of the host (defined in the docker-compose).

The API routes and schemas are described with swagger.

Deployment with docker-compose

If you already have loaded bills BULK data - put loaded data into the project directory in the same folder where docker-compose.yml file is found.

How to download data is described here. After downloading one archive you will have to unpack it and you will see such structure of folders and files:

Where the top folder is called data, which contains folder with digits name - which means congress number. Inside each congress you see folders [hconres, hjres, hr and others].

If you want to provide more data (not for only one congress) you have to merge all downloaded data folders (r’data/\d*' - all root folders of each data unpacked archive) into one folder data/, so the folders structure should look like this:

Run this command to process bills information:

docker compose run api usc-run bills

Now you can run loading data into the database (PostgreSQL and replicate data to ElasticSearch) as described here.

To start all services run this command:

docker-compose up -d

This will start the docker containers, including both the scrapers for bill metadata and indexing of the titles and the FastAPI server.

NOTE: If you don’t have loaded congress bills data yet - you will need to trigger tasks through Swagger UI or running this command in terminal.

Docker compose usage

There are 2 docker-compose config files:

docker-compose.yml - is the main configuration file for docker-compose command, which should be used for deployment on the prod server
docker-compose.override.yml - the main purpose of this file is to allow modifications inside the main docker-compose.yml file without actual changes inside it. It allows to re-define services and theirs configuration for local development needs.

Be aware:

Docker compose automatically apply these files if it sees that .override file exists next to the main docker-compose file.
docker-compose.override.yml overrides path to .env file with .env.local. So for local development you can change any variable in .env.local file.
docker-compose.override.yml file should be removed when it’s deploying on the prod - so docker compose will use .env file which can be created by copying and modifying .env.local.

Development

For local development you should install such tools:

pre-commit

Allows to set git pre-commit hooks. Read about this great tool here

To install it run one of provided commands:

pip install pre-commit
# OR
brew install pre-commit

Before actual development started run this command in the root directory of the project. It will setup git hooks:

pre-commit install

Now you can test that it works:

pre-commit run --all-files

If you see this error it means that it modified files to follow standards (eg PEP8):

autoflake................................................................Failed
- hook id: autoflake
- files were modified by this hook
Reorder python imports...................................................Failed
- hook id: reorder-python-imports
- exit code: 1
- files were modified by this hook

And if you run pre-commit run --all-files command again you will see that all everything passed:

autoflake................................................................Passed
Reorder python imports...................................................Passed

NOTE: Usually you don’t need to run it manually. Git will trigger this tool automatically once you try to commit something.

Hooks and linters to run by these hooks:

flake8
some hooks from default set of pre-commit hooks:
- check-ast - to check that code can be compiled
- debug-statements - disallow debug statements in the code
docker-compose-check
Dockerfile linter hadolint
code autoformatting tools:

NOTE: pre-commit tool will install all these tools automatically at first run. You don’t need to install any of them.

Used containers and theirs purposes

Name	Purpose
api	runs FastAPI server
celery_worker	runs periodic tasks and background tasks to scrape data and process it
es01	ElasticSearch service
flower	Celery workers control, monitoring and scaling (with ability to auto-scale) tool
gp_admin	Postgres server tool to explore data, run queries and etc.

Docker environment variables:

To configure the environment of containers there should be .env file.

Name	Explanation	Default value
SECRET_KEY	Generate random hash to secure django application	changethis
APP_MODULE	Path to the main executable fastAPI module	billtitleindex.wsgi:app
FLOWER_BASIC_AUTH	Flower basic authentication. Should be provided in format username:password	admin:root
DATA_DIR	Mount directory inside the container where congress data is stored
LOCAL_DATA_DIR	Where is the bills data stored outside of the docker	../congress/
POSTGRES_HOST	Postgres server host name	postgres
POSTGRES_USER	Postgres user name	btiadmin
POSTGRES_PASSWORD	Postgres user password	btiadmin
POSTGRES_DB	DB name	billtitle
POSTGRES_PORT	Postgres port	5432
MESSAGE_BROKER_URI	URI of broker host. Should be in format: `amqp://{hostname}:{port}/`	amqp://rabbitmq:5672/
ELASTICSEARCH_URI	URI of elasticsearch host. Should be in format: `http://{hostname}:{port}` or `https://{hostname}:{port}`	http://elasticsearch:9200

Tasks

The pipeline includes the following tasks:

scraping-task-midnight-daily: responsible for scraping bills. This task runs these commands under the hood (one by one):

  usc-run govinfo --bulkdata=BILLSTATUS
  usc-run bills

pipeline-task-everyday-4am: runs the pipeline at 4:00 AM. Gets data collected by scrape task, parses files and inserts records into the PostgreSQL database. Once item inserted into DB it replicates data to ElasticSearch indexes.

After deploying with docker-compose, the tasks can be triggered manually through the API.

Scraping is triggered with a GET request to http://localhost:8000/api/scraping/

The pipeline can be manually triggered with a GET request to http://localhost:8000/api/pipeline/.

Or just by clicking a button in swagger UI:

run scrape process here
run pipeline here

TODO: describe in more detail what the scraper and pipeline tasks do.

Development (Individual components)

How to update recent BillStatus XML Files:

Bill data is primarily collected by using unitedstates/congress, which covers 2013 to the present.

This project collects data from the official congressional XML data on legislation, which covers the 113th Congress (2013) to the present.

Preferred way is to use docker containers

Docker image created from this repository already include environment, all needed libraries and has already installed unitedstates/congress.

The process using this tool has 2 parts. First, the XML data must be fetched from Govinfo. The script pulls the bill status XML and on subsequent runs only pulls new and changed files:

docker compose run api usc-run govinfo --bulkdata=BILLSTATUS

Then run the bills task to process any new and changed files:

docker compose run api usc-run bills

It’s also possible to run commands chaining them with &&:

docker compose run api usc-run govinfo --bulkdata=BILLSTATUS && usc-run bills

NOTE

To get the bulk data of bill status before 2013, we can use the ProPublica bulk downloads page.

Data is provided in both JSON and XML formats.

Bulk data from previous congresses can be downloaded by clicking the links below. Bulk data for congresses before and including the 112th was generated by the Sunlight Foundation. Data for congresses the 113th Congress and subsequent congresses was generated by ProPublica, using code from the [@UnitedStates GitHub organization](https://github.com/unitedstates).

For example, zip file here

Pipeline

Preferred way is to use docker container

To run pipeline just execute this command:

Run all containers:

docker compose up -d

Run pipeline command:

docker-compose run api python manage.py runpipeline

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.github/workflows		.github/workflows
billtitleindex		billtitleindex
docs/img		docs/img
es-docker		es-docker
.flake8		.flake8
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.adoc		README.adoc
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BillTitleIndex

Overview

Tools

Swagger

Flower

PgAdmin

Usage

API Usage

Deployment with docker-compose

Docker compose usage

Development

pre-commit

Used containers and theirs purposes

Docker environment variables:

Tasks

Development (Individual components)

How to update recent BillStatus XML Files:

Preferred way is to use docker containers

Pipeline

Preferred way is to use docker container

About

Releases

Packages

Contributors 3

Languages

License

dreamproit/billtitleindex

Folders and files

Latest commit

History

Repository files navigation

BillTitleIndex

Overview

Tools

Swagger

Flower

PgAdmin

Usage

API Usage

Deployment with docker-compose

Docker compose usage

Development

pre-commit

Used containers and theirs purposes

Docker environment variables:

Tasks

Development (Individual components)

How to update recent BillStatus XML Files:

Preferred way is to use docker containers

Pipeline

Preferred way is to use docker container

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages