SCRAPING.adoc
Overview

There are a number of scrapers that get data for this project. In some cases, the initial scraping was done manually and the data was then loaded into the database. Nightly Celery tasks update and maintain the data; these are described in UPDATES_CELERY.
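
For orientation, a nightly task is registered with Celery beat roughly as in the sketch below; the app name, task path, and schedule are illustrative placeholders, and the real definitions are described in UPDATES_CELERY.

from celery import Celery
from celery.schedules import crontab

app = Celery('flatgov')  # hypothetical app name

app.conf.beat_schedule = {
    'nightly-bill-update': {  # illustrative entry name
        'task': 'bills.tasks.update_bill_data',  # hypothetical task path
        'schedule': crontab(hour=2, minute=0),   # run nightly at 02:00
    },
}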

The following scrapers are available:

  1. US Congress (for bills and metadata)

  2. Statements of Administration Policy

  3. CRS reports

  4. Relevant Committee Documents

  5. CBO Scores

Scrapers

US Congress (Bills)

./run govinfo --bulkdata=BILLSTATUS
./run bills

When running this initially, I got an error because the bulk directories had not been created. To unzip the files manually in all directories:

find . -name "*.zip" | xargs -P 5 -I fileName sh -c 'unzip -o -d "$(dirname "fileName")/$(basename -s .zip "fileName")" "fileName"'
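
If any archive paths contain spaces or other unusual characters, a null-delimited variant of the same command is safer; this is an alternative sketch, not a required change:

find . -name "*.zip" -print0 | xargs -0 -P 5 -I{} sh -c 'f="$1"; unzip -o -d "${f%.zip}" "$f"' _ {}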

For more detail, see USCONGRESS_SCRAPER.

Statements of Administration Policy

Instructions for loading the database fixture for the Statements of Administration Policy are in the DATA BACKGROUND document, under Statement of Administration Policy.

To load the latest data (from the Biden Administration), run:

./manage.py biden_statements
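
If you need to inspect or adapt this loader, a Django management command has the general shape sketched below; this skeleton is illustrative and is not the project's actual biden_statements implementation.

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'Load the latest Statements of Administration Policy'

    def handle(self, *args, **options):
        # The real command fetches the statements and saves them to the
        # database; this skeleton only shows the structure.
        self.stdout.write(self.style.SUCCESS('Statements loaded'))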

CRS Reports

The scraper for CRS Reports, and its instructions, are described in CRS_REPORTS_SCRAPER.

Relevant Committee Documents

To load Relevant Committee Documents data, follow these steps (a consolidated command sketch follows the list):

  1. After installing the requirements in the scrapers directory, run crec_scrape_urls.py from that directory.

  2. Go to the crec_scrapy folder and run scrapy crawl crec. It takes about an hour to scrape all the data into crec_scrapy/data/crec_data.json.

  3. Copy the scraped data from crec_scrapy/data/crec_data.json to the Django base directory, first deleting or replacing the old data there.

  4. Run the Django command ./manage.py load_crec to populate the database.
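
Taken together, the steps look roughly like the shell session below; the requirements file name and the Django base directory path are assumptions to adjust for your layout:

$ cd scrapers
$ pip install -r requirements.txt
$ python crec_scrape_urls.py
$ cd crec_scrapy
$ scrapy crawl crec                    # ~1 hour; writes data/crec_data.json
$ cp data/crec_data.json /path/to/django/base/
$ cd /path/to/django/base
$ ./manage.py load_crec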

CBO Scores

The CBO task processes metadata from the bill status files downloaded by the unitedstates/congress scraper (above). It finds the cboCostEstimates field, collects the estimates for all bills, and stores any new scores in the database.
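
For orientation, the core of the extraction can be sketched as below. It assumes the BILLSTATUS XML files sit under a data directory and that the govinfo schema's cboCostEstimates/item elements carry title, url, and pubDate; the function name and file layout are illustrative, not the project's actual code.

import os
import xml.etree.ElementTree as ET

def collect_cbo_estimates(data_dir):
    # Walk the downloaded BILLSTATUS XML files and gather every CBO
    # cost estimate entry; element names are assumed from the govinfo
    # BILLSTATUS schema.
    estimates = []
    for root_dir, _dirs, files in os.walk(data_dir):
        for name in files:
            if not name.endswith('.xml'):
                continue
            try:
                tree = ET.parse(os.path.join(root_dir, name))
            except ET.ParseError:
                continue  # skip malformed or non-XML content
            for block in tree.iter('cboCostEstimates'):
                for item in block.iter('item'):
                    estimates.append({
                        'title': item.findtext('title'),
                        'url': item.findtext('url'),
                        'pubDate': item.findtext('pubDate'),
                    })
    return estimates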

Install the unitedstates/congress repository

The text of bills can be scraped with the Python project here: https://github.com/unitedstates/congress. First, clone this repository as a child of the FlatGovDir directory above, so that congress sits alongside the application code; running the scraper will then fill out the text data within the congress directory.

$ cd /path/to/FlatGovDir
$ git clone https://github.com/unitedstates/congress.git
(no credentials needed; it is a public repo)

Install scraper dependencies

Create the congress Python virtual environment and install the requirements (pip install -r requirements.txt). The scrapers were built with Python 2.7 and have not been upgraded; updates may be needed for a production environment, but the @unitedstates/congress scraper is sufficient to gather the baseline data for testing the utilities in this repository.
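
A minimal setup sketch, assuming virtualenv and the Python 2.7 noted above; the environment name and paths are placeholders:

$ cd /path/to/FlatGovDir/congress
$ virtualenv -p python2.7 env
$ source env/bin/activate
$ pip install -r requirements.txt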

Note
Scraping the initial data can be very time-consuming (most of a day, depending on your internet download speeds). To get started, it is worth finding a source for bulk downloads of the text, if possible.

On macOS (Catalina), installing the congress requirements involved a few adjustments:

  1. Install OpenSSL 1.0.2 with Homebrew. The latest OpenSSL (>1.1) causes problems with certain requirements; unfortunately, version 1.0.0 also failed. A script was set up by a GitHub user to install version 1.0.2:

brew uninstall openssl --ignore-dependencies
brew uninstall openssl --ignore-dependencies
brew uninstall libressl --ignore-dependencies
brew install https://raw.githubusercontent.com/Homebrew/homebrew-core/8b9d6d688f483a0f33fcfc93d433de501b9c3513/Formula/openssl.rb

  2. Link the OpenSSL libraries:

export LDFLAGS=-L/usr/local/opt/openssl/lib
export CPPFLAGS=-I/usr/local/opt/openssl/include

  3. Install pytz, pep517 and cryptography directly:

pip install pytz
pip install pep517
pip install cryptography

  4. Install the requirements. From the congress repository directory, run:

pip install -r requirements.txt