From gVCFS to ElasticSearch through SparkSQL-Hive-Parquet

bag-cnag/vcfLoader

vcfLoader

A tool to load and annotate VCF files into ElasticSearch using Spark and Hail (https://github.com/hail-is/hail).

The annotations used and their corresponding versions are:

  • VEP (v88)
  • dbNSFP (2.9.3)
  • CADD (1.3)
  • Clinvar (20180930)
  • gnomAD exome (2.0.1)
  • ExAC (0.3)

In order to upload and annotate data with vcfLoader, the steps to follow are:

  1. Install Spark 2.1.0. Download the Spark distribution and add the following lines to your .bashrc file (adjusting the path to where Spark was downloaded):
export SPARK_HOME=path/to/spark
export PATH=$PATH:path/to/spark/bin

Run source .bashrc in your command line to apply the changes.

  2. Install Python 3.6.

  3. Install Hail:

sudo apt-get install g++ cmake
git clone --branch 0.2 https://github.com/broadinstitute/hail.git
cd hail
./gradlew -Dspark.version=2.1.0 shadowJar

Add the following lines to your .bashrc file:

export HAIL_HOME=/path/to/hail
export PYTHONPATH="$PYTHONPATH:$HAIL_HOME/python:$SPARK_HOME/python:`echo $SPARK_HOME/python/lib/py4j*-src.zip`"
export SPARK_CLASSPATH=$HAIL_HOME/build/libs/hail-all-spark.jar

For more detailed instructions, go to:

https://hail.is/docs/stable/getting_started.html#building-hail-from-source

At this point, run a pyspark shell by typing

pyspark --conf='spark.sql.files.openCostInBytes=53687091200' --conf='spark.sql.files.maxPartitionBytes=53687091200'

in your command line. The configuration parameters set are the ones needed by Hail. Once in the shell, type

import hail

to check whether the installation was done correctly. If no errors are reported, Hail has been correctly installed.
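The two configuration values passed to pyspark are the same byte count. As a quick sanity check (a sketch, not part of the pipeline; the variable names simply mirror the --conf keys), they correspond to 50 GiB:

```python
# Both Spark settings above use the same byte count; this verifies
# that it equals 50 GiB. Variable names mirror the --conf keys.
open_cost_in_bytes = 53687091200
max_partition_bytes = 53687091200

GIB = 1024 ** 3  # bytes in one gibibyte

assert open_cost_in_bytes == 50 * GIB
assert max_partition_bytes == open_cost_in_bytes
print(open_cost_in_bytes // GIB)  # → 50
```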

  4. Install ElasticSearch 6.4.3. Download the zipped distribution, unzip it, and run
cd elasticsearch
./bin/elasticsearch

Check the address localhost:9200 in your browser to verify that it is running.
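Instead of the browser, the check can be scripted: localhost:9200 answers with a small JSON document describing the node. Below is a minimal sketch of validating the reported version; the sample response is an abridged assumption of the typical shape (in a live check you would fetch it with curl or urllib):

```python
import json

# Abridged example of the JSON that GET localhost:9200 returns.
sample_response = '{"name": "node-1", "version": {"number": "6.4.3"}}'

info = json.loads(sample_response)
version = info["version"]["number"]
assert version == "6.4.3", f"unexpected ElasticSearch version: {version}"
print(version)  # → 6.4.3
```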

  5. Install VEP. It is one of the external annotation sources used by vcfLoader and needs to be installed locally:

https://www.ensembl.org/info/docs/tools/vep/index.html

Run the following commands to get the specific version used by vcfLoader:

git clone -b release/88 https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
perl INSTALL.pl

When asked to download cache files, choose the id corresponding to homo_sapiens hg37.

  6. Copy the configuration files config.json and vep.properties, available in Gitea at

https://172.16.10.100/gitea/platform/spark_config/

and adapt the paths to your machine.

  7. The configuration file specifies the paths where the annotation tables can be found, as well as the ElasticSearch index information.
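As a purely hypothetical sketch of that structure (the field names below are illustrative assumptions, not the real schema — always use the actual config.json from Gitea), such a file might look like:

```json
{
  "__comment": "Hypothetical structure; the real config.json lives in Gitea",
  "dbNSFPTable": "/path/to/annotations/dbnsfp",
  "caddTable": "/path/to/annotations/cadd",
  "clinvarTable": "/path/to/annotations/clinvar",
  "elasticsearch": {
    "host": "localhost",
    "port": 9200,
    "index": "vcf_index"
  }
}
```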

  8. To run the pipeline locally, download the necessary annotation tables and type the following command:

spark-submit --master local \
  --conf='spark.sql.files.openCostInBytes=53687091200' \
  --conf='spark.sql.files.maxPartitionBytes=53687091200' \
  --jars <path_to_hail>/hail/build/libs/hail-all-spark.jar \
  --executor-memory 8G \
  --py-files python/dist/vcfLoader-0.1-py2.7.egg python/rdconnect/main.py \
  --chrom <chromosome> --step <pipeline_steps>

where <chromosome> is the chromosome number (MT=23, X=24 and Y=25) and <pipeline_steps> is the list of steps to run, given as a single comma-separated string of step names. The available steps are:

createIndex, createIndexCNV, loadVCF, loadGermline, loadSomatic, loadCNV, loaddbNSFP, loadcadd, loadExomesGnomad, loadExAC, loadCGI, annotateCGI, annotateVEP, annotatedbNSFP, annotatecadd, annotateclinvar, annotateExomesGnomad, annotateExAC, transform, toElastic, toElasticCNV
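Since --step takes a single comma-separated string, a typo in one step name can waste a long Spark run. Here is a small sketch (not part of vcfLoader; the helper name is made up) of validating such a string against the step names above before launching:

```python
# Hypothetical helper (not from the repository): split a --step string
# and reject any names that are not in the documented step list.
VALID_STEPS = {
    "createIndex", "createIndexCNV", "loadVCF", "loadGermline",
    "loadSomatic", "loadCNV", "loaddbNSFP", "loadcadd",
    "loadExomesGnomad", "loadExAC", "loadCGI", "annotateCGI",
    "annotateVEP", "annotatedbNSFP", "annotatecadd",
    "annotateclinvar", "annotateExomesGnomad", "annotateExAC",
    "transform", "toElastic", "toElasticCNV",
}

def parse_steps(step_arg):
    """Split a comma-separated --step value and validate each name."""
    steps = [s.strip() for s in step_arg.split(",") if s.strip()]
    unknown = [s for s in steps if s not in VALID_STEPS]
    if unknown:
        raise ValueError(f"unknown pipeline steps: {unknown}")
    return steps

print(parse_steps("loadVCF,annotateVEP,toElastic"))
# → ['loadVCF', 'annotateVEP', 'toElastic']
```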
