This was my first experience coding in Python.
Date: Summer 2018
This repository contains Python code used to parse and organize gene variants from the supplemental data files of human genomics research articles.
Code is in `src`.
Example output is in `output.zip`; it contains a subset of the files normally produced, including extracted data and logs of the files processed.
Install the following packages on your computer if you don't already have them:
- pandas
- xlrd
- docx
- docx2csv
- XlsxWriter
- antiword
- pdftotext (a command-line tool, although there may be an alternative Python library)
```
pip install pandas
pip install xlrd
pip install docx
pip install docx2csv
pip install XlsxWriter
sudo apt-get install antiword unrtf poppler-utils libjpeg-dev
```

For pdftotext, see http://macappstore.org/pdftotext/ (macOS) or use `pip install pdftotext` (Python package).
- Place the code and `genelist.txt` in the directory that contains the folder of supplemental data files.
- Run `suppdata_scraper.py`. You will be prompted in the terminal to input the name of the folder containing the files and your name.
- As the script runs, the following should happen:
  - File progress will be logged via different `.txt` files. Check `files_processed.txt` for overall progress.
  - The terminal will print statements indicating the filename and index currently being processed. Files are not processed in exact index order because of multiprocessing (see the sketch after this list).
  - `.txt` files will be created in the `dataframes` folder for every individual file that contains data.
  - Files that may contain amino acids or nucleotides are copied to the `manual` folder.
  - `.txt` and `.xlsx` files will be created in the `workspace` folder for parsing purposes.
- Once `suppdata_scraper.py` is done running, run `dataframe.py`. When prompted, input the name of the folder containing the dataframes. Once it finishes, you should have `masterlist.txt` in the `output` folder.
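For reference, the overall flow resembles the sketch below. This is not the repository's actual code: the function name `process_file`, the log format, and the pool setup are illustrative assumptions. It shows why progress messages arrive out of index order when work is spread across a `multiprocessing.Pool`.

```python
# Minimal sketch of a multiprocessing driver loop (hypothetical names, not the real code).
import multiprocessing
import os

def process_file(args):
    index, filename = args
    print(f"Processing file {index}: {filename}")  # terminal progress message
    # ... parse the file and write a per-file dataframe to the dataframes folder ...
    return filename

if __name__ == "__main__":
    folder = input("Name of folder containing supplemental data files: ")
    user = input("Your name: ")
    files = sorted(os.listdir(folder))
    with multiprocessing.Pool() as pool:
        # Results stream back as workers finish, so output is not in index order.
        for done in pool.imap_unordered(process_file, enumerate(files)):
            with open("files_processed.txt", "a") as log:
                log.write(done + "\n")
```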
- `suppdata_scraper.py` is the main scraper program used to parse files and extract gene variants.
- `dataframe.py` combines the dataframes of extracted data from the different files into a single masterlist file containing all the extracted data. Use it when `suppdata_scraper.py` hits a roadblock and is unable to concatenate all the dataframes during its run.
- `big_manual.py` screens and prioritizes large files that contain amino acids and/or nucleotides and need to be manually extracted.
- `manual.py` screens files containing amino acids and/or nucleotides and counts the number of occurrences of each. These files will need to be manually extracted.
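Conceptually, the concatenation step in `dataframe.py` amounts to something like the sketch below, assuming each per-file dataframe was saved as a tab-delimited `.txt` (the delimiter and paths here are assumptions, not the repo's exact format):

```python
# Sketch of combining per-file dataframes into one masterlist (assumed tab-delimited format).
import glob
import pandas as pd

frames = []
for path in glob.glob("dataframes/*.txt"):
    frames.append(pd.read_csv(path, sep="\t"))  # one dataframe per source file

masterlist = pd.concat(frames, ignore_index=True)
masterlist.to_csv("output/masterlist.txt", sep="\t", index=False)
masterlist.to_csv("output/masterlist.csv", index=False)
masterlist.to_excel("output/masterlist.xlsx", index=False)  # uses the XlsxWriter engine
```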
Input: a directory of supplemental data files scraped from the web. These files can be any of the following types:
- doc/docx
- pdf
- txt
- xls/xlsx
- csv/tsv
Output:
output
folder with following:masterlist.txt
--> Main output. All gene variants are stored with files they came from. Alsomasterlist.csv
andmasterlist.xlsx
, which contain same info in different file type.- Following
.txt
files that characterize data:files_processed.txt
: filenames and index in listbad_files.txt
: files that produce an errorgood_files.txt
: files that contain gene variantsmanual.txt
: files that contain nucleotides or amino acidsfiles_ignored.txt
: Other file types such as media files that are not relevantvariant_counts.txt
: Counts for total number of different gene variants for each file that contains dataprocess_time.txt
: File size and time it takes for script to process each file
dataframes
folder with dataframe files containing data extracted from all filesmanual
folder with files that need to be manually extracted and have been copied overbig_files_manual
folder containing large files that need to be manually extracted and have been copied over
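As an illustration, the per-file timing recorded in `process_time.txt` could be produced with something like the following; the log format and function name here are assumptions, not the repo's actual implementation:

```python
# Sketch of logging file size and processing time (hypothetical format).
import os
import time

def timed_process(path):
    size = os.path.getsize(path)          # file size in bytes
    start = time.perf_counter()
    # ... parse the file ...
    elapsed = time.perf_counter() - start
    with open("process_time.txt", "a") as log:
        log.write(f"{path}\t{size} bytes\t{elapsed:.2f} s\n")
```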
Some more details:
- The scraper finds genes by comparing against `genelist.txt` and finds the different variants using regular expressions (sketched below).
- For pdf, doc, and txt files, the scraper goes through the file line by line and pulls out gene and variant matches.
- For xlsx and xls files, the scraper goes through every cell, row by row, and pulls out gene and variant matches.
- For docx files, the scraper extracts any tables it finds, converts them into xlsx files, and then follows the same procedure as for an xlsx file.

This methodology, while perhaps not the most efficient, proved quite accurate and ensured that associations between genes and variants on the same lines/rows of a file were, in most cases, preserved during extraction.
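Conceptually, the line-by-line matching works like the sketch below. The variant pattern shown (a toy HGVS-style `c.`/`p.` notation) is purely illustrative; the script's actual regular expressions are more involved.

```python
# Sketch of the gene/variant matching pass (illustrative regex, not the repo's exact one).
import re

with open("genelist.txt") as f:
    genes = {line.strip() for line in f if line.strip()}

# Toy HGVS-style pattern: matches things like c.123A>G or p.Arg117His.
variant_re = re.compile(r"\b(?:c\.\d+[ACGT]>[ACGT]|p\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2})\b")

def scan_line(line):
    words = set(re.findall(r"[A-Za-z0-9-]+", line))
    hits_genes = words & genes                 # gene symbols named on this line
    hits_variants = variant_re.findall(line)   # variants on the same line
    if hits_genes and hits_variants:
        return hits_genes, hits_variants       # gene/variant association preserved
    return None
```

Scanning a whole line at a time is what keeps a gene and a variant that co-occur on that line associated in the output.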