### Workflow Overview
- Retrieve the Budget Estimates PDF files from the government budget website
- Convert the PDFs to text files by running the shell script `convert.sh` (PDF -> TXT)
- Extract the figures from the text files into CSV files by running the Python script `parse.py` (TXT -> CSV)
- Upload the CSV files to Google Spreadsheet and cleanse the data
- Export the data as JSON through the Google Spreadsheet JSON API
- Import the JSON data into Elasticsearch
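
The TXT -> CSV step can be illustrated with a minimal sketch. It is not the actual `parse.py`. It assumes a hypothetical line layout in which a text label is followed by one or more figures such as `1,234.5`; the real layout of the converted Budget Estimates text may differ. The names `parse_line` and `text_to_csv` are made up for this example.

```python
import csv
import io
import re

# Assumption: a figure looks like "1,234.5" or "1300" (digits, optional commas, optional decimals).
FIGURE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_line(line):
    """Split a line into [label, figure, ...]; return None if it has no label or no trailing figures."""
    tokens = line.split()
    figures = []
    # Peel figures off the end of the line; whatever remains is the label.
    while tokens and FIGURE.fullmatch(tokens[-1]):
        figures.insert(0, float(tokens.pop().replace(",", "")))
    if not tokens or not figures:
        return None
    return [" ".join(tokens)] + figures


def text_to_csv(text):
    """Convert every parsable line of `text` into one CSV row and return the CSV as a string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        row = parse_line(line)
        if row is not None:
            writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    sample = "Subhead\nPersonal Emoluments  1,234.5  1,300.0\n"
    print(text_to_csv(sample), end="")
```

Lines without trailing figures (headings, notes) are skipped rather than guessed at, which leaves less to clean up in the later spreadsheet step.
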

The Chinese PDFs are used for processing. Running `convert_batch.sh 021` downloads the PDFs for head 021 (CEO Office) and parses them.

Download the data (see https://code4hk.hackpad.com/CODE4HK-Budget-Hackathon-4Sgfyk51g5m) and extract it to `raw/csv/`.

Start the Docker host VM first, then bring up the environment with the Docker provider:

    cd docker-host-vm
    vagrant up
    cd ..
    vagrant up --provider=docker