Protein domains are independent sections of protein sequences that can have distinct functions. One of the major ways in which proteins evolve is through domain insertion/deletion/duplication. In this project we will build an analysis pipeline that takes in proteomes from two groups of species and finds differences in domain composition between the two groups. The whole pipeline will be packaged inside a Docker container that can be executed on any given data in any machine environment.
- Proteome FASTA file for each species. The number of FASTA files will depend on which two groups of species the user decides to compare and how many species the user wants in each group. It is recommended to include at least 10 species per group to obtain significant results. These can be collected from the following databases.
- The HMM file containing Pfam domains, downloaded from the Pfam database, which maintains a registry of the domains found across organisms. The HMM file must be processed with the hmmpress program to create a binary HMM database (hmmpress takes the HMM file as its argument and writes the pressed database files alongside it). For more details on how to use the hmmpress tool, please see the HMMER user manual.
- A file with two columns, `species` and `species_label`. The `species` column contains the FASTA file names of the individual species with the string "pfamscan" appended. The `species_label` column contains labels (0 or 1) classifying the species into the two groups.
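For illustration only, a species/species_label file might look like the example below; the file names, the "_pfamscan" suffix separator, and the tab-separated layout are assumptions, so adjust them to match your own pfamscan output names.

```
species                  species_label
speciesA.fasta_pfamscan  0
speciesB.fasta_pfamscan  0
speciesC.fasta_pfamscan  1
speciesD.fasta_pfamscan  1
```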
- Akshay Yadav
- Sumegha Godara
- Yafang Guo
a). Construct a container with all the programs and dependencies required for the pipeline to run. The analysis pipeline is composed of 3 major steps: assigning domains to the sequences in the FASTA files, calculating domain matrices, and statistically analyzing the domain matrices.
b). Implement the analysis pipeline with the Snakemake workflow engine.
c). Test the reproducibility of the pipeline.
2.1 Launch an instance on Jetstream using the "Ubuntu 18.04 Devel and Docker v1.22" image with size m1.xlarge (CPU: 24, Mem: 60 GB, Disk: 60 GB), then ssh to the VM:
# get the username and IP address
ssh $USER@xxx.xxx.xxx.xxx
2.2 Download the data
wget https://de.cyverse.org/dl/d/D92472AE-62CA-4029-ABBE-66B2E23D06B1/test_data.tar.gz
Extract the data:
tar -xzvf test_data.tar.gz
**2.3** *For reproducibility, go to Section 3.4 directly.*
- Install make, perl (v5.22.1), hmmer, pfamscan
FROM ubuntu:16.04
RUN apt-get update && \
    apt-get install -y wget build-essential make perl hmmer && \
    cd /root/ && \
    wget "http://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/OldPfamScan/PfamScan1.5/PfamScan.tar.gz"
- Install python3 and the libraries needed to run the scripts
RUN apt-get update \
    && apt-get install -y python3-pip python3-dev #Version:Python 3.5.2
RUN pip3 install pandas #Version:0.24.2
RUN pip3 install rpy2 #Version:3.0.5
RUN pip3 install scipy #Version:1.3.0
RUN pip3 install sklearn #Version:0.21.2
RUN pip3 install matplotlib #Version:3.0.3
- Install snakemake as the workflow management system
RUN pip3 install snakemake #Version:5.5.4
- Add scripts for data analysis inside the container
ADD scripts /usr/local/bin
# make the scripts executable
RUN chmod +x /usr/local/bin/*
3.2.1. The analysis consists of three steps:
- assigning Pfam protein domains to the sequences in each species FASTA file
- calculating the domain matrices (content, duplication, abundance, versatility) from the domain assignments (a sketch of this step follows the list)
- analyzing the domain matrices to identify significantly evolving domains
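To make the second step more concrete, here is a minimal sketch (not the project's actual script) of how a presence/absence "content" matrix could be derived from pfam_scan.pl output files. The directory layout, the file-name pattern, and the column index of the Pfam accession are assumptions.

```python
import glob
import os

import pandas as pd


def domains_in_file(pfamscan_path):
    """Return the set of Pfam accessions found in one pfamscan output file."""
    domains = set()
    with open(pfamscan_path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip comment and empty lines
            fields = line.split()
            domains.add(fields[5])  # assumed column: Pfam (HMM) accession
    return domains


def build_content_matrix(pfamscan_dir):
    """Build a species x domain matrix with 1 if the domain occurs in the proteome."""
    per_species = {
        os.path.basename(path): domains_in_file(path)
        for path in glob.glob(os.path.join(pfamscan_dir, "*pfamscan"))
    }
    all_domains = sorted(set().union(*per_species.values()))
    matrix = pd.DataFrame(0, index=sorted(per_species), columns=all_domains)
    for species, domains in per_species.items():
        matrix.loc[species, sorted(domains)] = 1
    return matrix


# Example usage (paths are placeholders):
# build_content_matrix("results/pfamscan").to_csv("domain_content_matrix.tsv", sep="\t")
```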
3.2.2. The workflow diagram can be generated with:
snakemake --dag -np -s snakefile |dot -Tsvg > protein-domain-evolution-workflow.svg
3.2.3. The snakefile defines the following rules:
- 1st step: run pfam_scan.pl (a sketch of this rule appears after this list)
  rule pfamscan:
- 2nd step: run python scripts to compute the matrices
  rule domain_content_matrix:
  rule domain_duplication_matrix:
  rule domain_abundance_matrix:
  rule domain_versatility_matrix:
- 3rd step: run python scripts to analyze the matrices
  rule domain_content_matrix_analysis:
  rule domain_duplication_matrix_analysis:
  rule domain_abundance_matrix_analysis:
  rule domain_versatility_matrix_analysis:
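As a rough illustration of the first rule, the sketch below shows what the pfamscan rule might look like; the wildcard names, directory layout, and pfam_scan.pl options are assumptions, not the project's exact snakefile.

```
rule pfamscan:
    # Assign Pfam domains to one species proteome with pfam_scan.pl.
    # Paths, wildcard names, and the Pfam database location are hypothetical.
    input:
        fasta="data/{species}.fasta"
    output:
        "results/{species}.fasta_pfamscan"
    threads: 4
    shell:
        "pfam_scan.pl -fasta {input.fasta} -dir /pfam_db -cpu {threads} -outfile {output}"
```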
3.3.1 git clone the repository
git clone https://github.com/cyber-carpentry/Group5-protein-domain-evolution-project.git
Once the Dockerfile and snakefile are ready, build the Docker image from the top-level git directory (not from the Docker directory):
docker build -t akshayayadav/protein-domain-evolution-project -f Docker/Dockerfile .
Use `docker images` to check the built images.
- Pull the docker image from https://hub.docker.com/r/akshayayadav/protein-domain-evolution-project.
docker pull akshayayadav/protein-domain-evolution-project
- Since the data directory is not built into the container, you need to bind-mount your data directory into the container (it is mounted at /data).
docker run -v <path to the data directory>:/data akshayayadav/protein-domain-evolution-project run_analysis.sh -c 10
The value "10" after -c sets the number of cores passed to Snakemake for running the analysis.