Complete pipeline to automate affiliation scraping and geolocation
The requirements to run this program are:
- Selenium and ChromeDriver
- Elasticsearch set up on localhost:9200 (a quick connectivity check is sketched after this list)
- Libpostal
- Other requirements can be installed using requirements.txt
- Links to the pre-trained BERT model and cached data files
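Before running anything, it can help to confirm that the local Elasticsearch instance is actually reachable. The snippet below is a minimal sketch, assuming only that Elasticsearch is listening on the default localhost:9200 noted above; it is not part of the pipeline itself.

```python
# Minimal connectivity check (sketch): assumes Elasticsearch is listening on
# the default localhost:9200 listed in the requirements above.
import requests

try:
    resp = requests.get("http://localhost:9200", timeout=5)
    resp.raise_for_status()
    version = resp.json().get("version", {}).get("number", "unknown")
    print(f"Elasticsearch is reachable (version {version})")
except requests.exceptions.RequestException as err:
    print(f"Elasticsearch is not reachable on localhost:9200: {err}")
```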
I understand that these are quite a lot of requirements and can complicate things for users looking for quick results. Thus, I have provided a quick-start Bash script that loads all of this into a Colab notebook automatically and gets you going immediately!
- Open a new notebook on Colab, then clone this repo and cd into the root directory:
!git clone https://github.com/isdapro/CORD19-Geolocation-Pipeline.git
cd CORD19-Geolocation-Pipeline/
- Install the requirements by executing the Bash script. This may take a while.
!bash ./load_stuff.sh
- Now, download the latest metadata.csv from Kaggle (or whichever version you want to produce results for) and place it in the project's data directory, shown below. You will see other files such as scraped.csv there; please don't modify them. A quick sanity check for the placed file is sketched after the path.
/content/CORD19-Geolocation-Pipeline/geolocation-pipeline/data
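Once metadata.csv is in place, a quick check like the one below can catch a misplaced or truncated download before you start a full run. This is only a sketch: the path mirrors the Colab layout above, and the column names checked (cord_uid, title, authors) are standard CORD-19 metadata fields, not something this pipeline defines.

```python
# Sanity check (sketch): confirm metadata.csv landed in the expected data
# directory and looks like CORD-19 metadata before running main.py.
import os
import pandas as pd

DATA_DIR = "/content/CORD19-Geolocation-Pipeline/geolocation-pipeline/data"
metadata_path = os.path.join(DATA_DIR, "metadata.csv")
assert os.path.exists(metadata_path), f"metadata.csv not found in {DATA_DIR}"

# Read only a handful of rows; the full file is large.
sample = pd.read_csv(metadata_path, nrows=5)
expected = {"cord_uid", "title", "authors"}  # standard CORD-19 metadata columns
missing = expected - set(sample.columns)
print("Columns found:", list(sample.columns)[:8])
print("Missing expected columns:", missing or "none")
```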
- Enter the main directory and execute the main Python script:
cd geolocation-pipeline
!python main.py
From here onwards, you can follow the instructions you see on screen. The script lets you choose to scrape, geolocate, change your API keys, and more.
For each paper in CORD-19, you will be able to see the institute where that paper originated, along with detailed GeoNames data about the location of that institute (an interesting piece of information missing from Kaggle's CORD-19).
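To give a feel for what "detailed GeoNames data" looks like, here is a standalone illustration that queries the public GeoNames search API for an institute name. This is not the pipeline's own lookup mechanism; it only shows the kind of fields (name, country, latitude/longitude) GeoNames provides. The institute string and the username are placeholders you would replace with your own values.

```python
# Illustrative only (not the pipeline's own lookup): query the public
# GeoNames search API to see the kind of location fields it returns
# for an institute name. Replace "demo_user" with your GeoNames username.
import requests

params = {
    "q": "Johns Hopkins University",  # example institute name (placeholder)
    "maxRows": 1,
    "username": "demo_user",          # placeholder; register at geonames.org
}
resp = requests.get("http://api.geonames.org/searchJSON", params=params, timeout=10)
results = resp.json().get("geonames", [])
if results:
    top = results[0]
    print(top.get("name"), top.get("countryName"), top.get("lat"), top.get("lng"))
else:
    print("No GeoNames match returned:", resp.json())
```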
See https://medium.com/swlh/covid-19-research-papers-geolocation-c2d090bf9e06 for more details.