Crawls the ATGC sequence of the SARS-CoV-2 novel coronavirus from the NCBI website.

SiddharthaAnand/ncbi-sars-cov2-data-crawler

NCBI SARS COV-2 Data Crawler

A crawler that collects the ATGC genome sequences of the novel coronavirus (SARS-CoV-2) found all over the world. This data is uploaded to the NCBI website and updated every day.

What exactly does it scrape?

It scrapes the NCBI website for the ATGC genome sequences as well as the other metadata uploaded there.

Each ATGC sequence is stored in its own XXXXX.txt file, in a directory supplied by the user. So, if there are 2000 accessions, there will be 2000 .txt files.
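As a rough illustration of that layout, the sketch below writes one .txt file per accession. The function name `write_sequences`, the dictionary input, and the sample accession IDs in the usage note are assumptions made for illustration; they are not taken from the crawler's code.

```python
from pathlib import Path


def write_sequences(sequences, out_dir):
    """Write each sequence to <out_dir>/<accession>.txt and return the paths.

    `sequences` maps an accession ID to its ATGC string
    (a hypothetical input shape, chosen for this sketch).
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    written = []
    for accession, sequence in sequences.items():
        target = out_path / "{}.txt".format(accession)
        target.write_text(sequence + "\n", encoding="utf-8")
        written.append(target)
    return written
```

Calling `write_sequences({"ACC0001": "ATGC", "ACC0002": "GGCA"}, "results")` would create `results/ACC0001.txt` and `results/ACC0002.txt`, one file per accession.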

How?

Using Selenium and BeautifulSoup.

We use Selenium to simulate a user session in the browser. BeautifulSoup parses the HTML of the page source and extracts exactly what we need.
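A minimal sketch of that division of labour might look like the following. It assumes the Selenium 4 API (`Service` plus `webdriver.Chrome(service=...)`); an older Selenium release may expect the chromedriver path to be passed differently. The function names `fetch_page_source` and `extract_links`, and the idea of collecting every link on the page, are illustrative assumptions rather than the crawler's actual logic.

```python
def fetch_page_source(url, chromedriver_path):
    """Open `url` in Chrome through Selenium and return the rendered HTML."""
    # Imported lazily so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver = webdriver.Chrome(service=Service(executable_path=chromedriver_path))
    try:
        driver.get(url)            # the browser runs the page's JavaScript
        return driver.page_source  # HTML as it stands after rendering
    finally:
        driver.quit()              # always release the browser process


def extract_links(html):
    """Return {link text: href} for every anchor in `html` that has an href."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return {
        anchor.get_text(strip=True): anchor["href"]
        for anchor in soup.find_all("a", href=True)
    }
```

Keeping the browser step and the parsing step in separate functions means the parsing can be exercised on a saved HTML snippet without launching Chrome.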

The internals

The crawl is divided into two broad steps.

Step 1

  • The SARS-CoV-2 page on the NCBI website. Its Nucleotide column consists of accession IDs. (Screenshot: SARS-CoV-2 novel coronavirus web page on NCBI)
  • The table that contains the accession IDs. (Screenshot: SARS-CoV-2 novel coronavirus table on NCBI)
  • The nucleotide details shown after you click an accession link. (Screenshot: nucleotide details on NCBI)
  • The metadata in a nucleotide window. The relative URL is stored from this window. (Screenshot: nucleotide metadata window on NCBI)
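Step 1 ends with a relative URL being stored for each accession. The sketch below shows one way such a mapping could be persisted and read back. The file name is the 'genome_to_url_mapper_dict' that this README mentions; the JSON format, the function names, and the sample accession and URL in the usage note are assumptions, since the on-disk format is not described here.

```python
import json
from pathlib import Path

MAPPING_FILENAME = "genome_to_url_mapper_dict"  # file name mentioned in this README


def save_url_mapping(mapping, out_dir):
    """Persist {accession: relative_url} as JSON (the JSON format is an assumption)."""
    target = Path(out_dir) / MAPPING_FILENAME
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as handle:
        json.dump(mapping, handle, indent=2, sort_keys=True)
    return target


def load_url_mapping(out_dir):
    """Load the mapping written by save_url_mapping."""
    with (Path(out_dir) / MAPPING_FILENAME).open(encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `save_url_mapping({"ACC0001": "/path/ACC0001"}, "results")` followed by `load_url_mapping("results")` returns the same dictionary, so the second step can run separately from the first.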

Step 2

Read the relative URLs stored in the file called 'genome_to_url_mapper_dict' inside the directory given as an argument on the command line. The crawler reads them one by one, visits each web page, scrapes the data, and stores it in a file named <ACCESSION_NO>.txt.

  • The ATGC genome sequence web page on NCBI. (Screenshot: genome sequence page on NCBI)
  • Note the relative URL that was stored. The code appends it to the base URL to create a new URL every time; the relative URL changes for every accession. (Screenshot: relative URL in the browser)
  • The ATGC sequence that is stored in the .txt files. (Screenshot: ATGC sequence on NCBI)
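The URL construction described above can be sketched as follows. This version uses `urllib.parse.urljoin` from the standard library rather than plain string concatenation, which is a substitution made for this example. The base URL value, the function name `build_urls`, and the sample relative path are assumptions.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ncbi.nlm.nih.gov"  # assumed base URL for this sketch


def build_urls(mapping, base_url=BASE_URL):
    """Return {accession: absolute_url} by joining base_url with each relative URL."""
    return {
        accession: urljoin(base_url, relative_url)
        for accession, relative_url in mapping.items()
    }


# Hypothetical accession and relative path, for illustration only.
urls = build_urls({"ACC0001": "/path/ACC0001"})
print(urls["ACC0001"])  # https://www.ncbi.nlm.nih.gov/path/ACC0001
```

Each resulting URL would then be opened in the browser session, and the scraped sequence written to the matching <ACCESSION_NO>.txt file.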

Why use selenium?

I used Selenium to get this code working quickly. The same result could have been achieved in other ways, for example by replicating the exact request in a Python module and sending it to the NCBI server.

That alternative means recreating the exact request sent to the server, including proper handling of cookies and other headers. A simple GET does not return the HTML data we need, because the NCBI web page relies heavily on JavaScript, which executes once the page opens in a web browser.

Installation

Install chromedriver.

Clone the repository.

$ git clone https://github.com/SiddharthaAnand/ncbi-sars-cov2-data-crawler.git

Switch to the cloned directory.

$ cd ncbi-sars-cov2-data-crawler/

Set up a virtual environment. If you do not have virtualenv, install it first. The following command uses a specific version of Python (python3.5) to create the environment.

$ virtualenv -p /usr/bin/python3.5 venv

Activate the virtualenv, which was named venv.

$ source venv/bin/activate

Install requirements.

$ pip install -r requirements.txt

List the command-line arguments with the help option.

$ python ncbi_sars2_crawler.py -h

Run the code and append the log output to a file.

$ python ncbi_sars2_crawler.py --chromepath <path/to/chromedriver> --filepath <directory/to/store/results> >> <directory/to/store/results/logs_YYYYMMDD>

Run the code with logging on the console.

$ python ncbi_sars2_crawler.py --chromepath <path/to/chromedriver> --filepath <directory/to/store/results>
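For readers curious how the two options above might be declared, here is a minimal `argparse` sketch. Only the flag names `--chromepath` and `--filepath` come from the commands in this README; the function name `build_parser`, the help strings, and marking both flags as required are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the two options shown in the commands above."""
    parser = argparse.ArgumentParser(
        description="Crawl SARS-CoV-2 genome sequences from the NCBI website."
    )
    parser.add_argument(
        "--chromepath",
        required=True,  # assumption: the crawler cannot start without a driver
        help="path to the chromedriver executable",
    )
    parser.add_argument(
        "--filepath",
        required=True,  # assumption: an output directory must be given
        help="directory in which to store the results",
    )
    return parser
```

A script would typically call `build_parser().parse_args()` at startup and read the values as `args.chromepath` and `args.filepath`.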

If you are interested, you are welcome to contribute.