Complete pipeline to automate affiliation scraping and geolocation
The requirements to run this program are:
- Selenium and ChromeDriver
- Elasticsearch set up on localhost:9200 (a quick connectivity check is sketched after this list)
- Libpostal
- Other requirements can be installed using requirements.txt
- Links to the pre-trained BERT model and cached data files
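Before running anything, it can help to confirm that the local Elasticsearch instance is actually reachable. The snippet below is a minimal sketch, assuming only that Elasticsearch is listening on the default localhost:9200 noted above; it is not part of the pipeline itself.

```python
# Minimal connectivity check (sketch): assumes Elasticsearch is listening on
# the default localhost:9200 listed in the requirements above.
import requests

try:
    resp = requests.get("http://localhost:9200", timeout=5)
    resp.raise_for_status()
    version = resp.json().get("version", {}).get("number", "unknown")
    print(f"Elasticsearch is reachable (version {version})")
except requests.exceptions.RequestException as err:
    print(f"Elasticsearch is not reachable on localhost:9200: {err}")
```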
I understand that these are quite a lot of requirements and can complicate things for users looking for quick results. Thus, I have provided a quick-start Bash script that loads all of this into a Colab notebook automatically and gets you going immediately!
- Open a new notebook on Colab, then clone this repo and cd into the root directory:
!git clone https://github.com/isdapro/CORD19-Geolocation-Pipeline.git
cd CORD19-Geolocation-Pipeline/
- Install the requirements by executing the Bash script. This may take a while.
!bash ./load_stuff.sh
- Now, download the latest metadata.csv from Kaggle (or whichever version you want to produce results for) and place it in the project's data directory, shown below. You will see other files such as scraped.csv there; please don't modify them. A quick sanity check for the placed file is sketched after the path.
/content/CORD19-Geolocation-Pipeline/geolocation-pipeline/data
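Once metadata.csv is in place, a quick check like the one below can catch a misplaced or truncated download before you start a full run. This is only a sketch: the path mirrors the Colab layout above, and the column names checked (cord_uid, title, authors) are standard CORD-19 metadata fields, not something this pipeline defines.

```python
# Sanity check (sketch): confirm metadata.csv landed in the expected data
# directory and looks like CORD-19 metadata before running main.py.
import os
import pandas as pd

DATA_DIR = "/content/CORD19-Geolocation-Pipeline/geolocation-pipeline/data"
metadata_path = os.path.join(DATA_DIR, "metadata.csv")
assert os.path.exists(metadata_path), f"metadata.csv not found in {DATA_DIR}"

# Read only a handful of rows; the full file is large.
sample = pd.read_csv(metadata_path, nrows=5)
expected = {"cord_uid", "title", "authors"}  # standard CORD-19 metadata columns
missing = expected - set(sample.columns)
print("Columns found:", list(sample.columns)[:8])
print("Missing expected columns:", missing or "none")
```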
- Enter the main directory and execute the main Python script:
cd geolocation-pipeline
!python main.py
From here onwards, you can follow the instructions you see on screen. The script lets you choose to scrape, geolocate, change your API keys, and more.
For each paper in CORD-19, you will be able to see the institute where that paper originated, along with detailed GeoNames data about the location of that institute (an interesting piece of information missing from Kaggle's CORD-19).
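To give a feel for what "detailed GeoNames data" looks like, here is a standalone illustration that queries the public GeoNames search API for an institute name. This is not the pipeline's own lookup mechanism; it only shows the kind of fields (name, country, latitude/longitude) GeoNames provides. The institute string and the username are placeholders you would replace with your own values.

```python
# Illustrative only (not the pipeline's own lookup): query the public
# GeoNames search API to see the kind of location fields it returns
# for an institute name. Replace "demo_user" with your GeoNames username.
import requests

params = {
    "q": "Johns Hopkins University",  # example institute name (placeholder)
    "maxRows": 1,
    "username": "demo_user",          # placeholder; register at geonames.org
}
resp = requests.get("http://api.geonames.org/searchJSON", params=params, timeout=10)
results = resp.json().get("geonames", [])
if results:
    top = results[0]
    print(top.get("name"), top.get("countryName"), top.get("lat"), top.get("lng"))
else:
    print("No GeoNames match returned:", resp.json())
```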
See https://medium.com/swlh/covid-19-research-papers-geolocation-c2d090bf9e06 for more details.