
cyber-carpentry/Group5-protein-domain-evolution-project

 
 


Protein domain evolution analysis pipeline

Protein domains are independent sections of protein sequences that can have distinct functions. One of the major ways in which proteins can evolve is through domain insertion, deletion, and duplication. In this project we will attempt to build an analysis pipeline that takes in the proteomes of 2 groups of species and finds differences in domain composition between the two groups. The whole pipeline will be packaged inside a Docker container, which can be executed on any given data in any machine environment.

Inputs

  • Proteome fasta file for each species. The number of fasta files depends on which two groups of species the user decides to compare and how many species the user wants in each group. At least 10 species per group are recommended to obtain significant results. These can be collected from the following databases.

  • The HMM file containing Pfam domains from the Pfam database, which contains a registry of all the domains found in all organisms. The HMM file must be processed with the hmmpress program to create an HMM database. For more details on how to use hmmpress, please see the HMMER user manual.

  • A file with two columns, species and species_label. The species column contains the fasta file names of the individual species with the string "pfamscan" appended. The species_label column contains the labels (0 or 1) that assign each species to one of the two groups.
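
As an illustration of the third input, the sketch below writes a two-column label file for a handful of proteome fasta names. The file names, the "." joiner before "pfamscan", the tab delimiter, and the header row are all assumptions made for this example; check the test data for the exact layout the pipeline expects.

```python
import csv
import io


def build_label_rows(group0, group1, suffix="pfamscan", joiner="."):
    """Return (species, species_label) rows for two groups of fasta file names.

    The joiner placed between the file name and the suffix is an assumption.
    """
    rows = []
    for label, fasta_names in ((0, group0), (1, group1)):
        for name in fasta_names:
            rows.append((f"{name}{joiner}{suffix}", label))
    return rows


def write_label_file(rows, handle, delimiter="\t"):
    """Write a header line followed by the rows; the delimiter is an assumption."""
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["species", "species_label"])
    writer.writerows(rows)


# Hypothetical fasta file names, two species per group for brevity.
rows = build_label_rows(["speciesA.fa", "speciesB.fa"],
                        ["speciesC.fa", "speciesD.fa"])
buffer = io.StringIO()
write_label_file(rows, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example side-effect free; passing an opened file handle instead would produce the file on disk.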

Group members

  • Akshay Yadav
  • Sumegha Godara
  • Yafang Guo

Instructions

1. Goals

a). Construct a container with all the programs and dependencies required for the pipeline to run. The analysis pipeline is composed of 3 major steps: assigning domains to the sequences in the fasta files, calculating domain matrices, and statistical analysis of the domain matrices.

b). Implement the analysis pipeline on the snakemake workflow engine.

c). Test the reproducibility of the pipeline.

2. How to start

2.1 Launch an instance on Jetstream, using Ubuntu 18.04 Devel and Docker v1.22, with size m1.xlarge (CPU: 24, Mem: 60 GB, Disk: 60 GB). Then ssh to the VM using

# get the username and IP address
ssh $USER@xxx.xxx.xxx.xxx

2.2 Download the data

wget https://de.cyverse.org/dl/d/D92472AE-62CA-4029-ABBE-66B2E23D06B1/test_data.tar.gz

Extract the data

tar -xzvf test_data.tar.gz

2.3 For reproducibility, go to Section 3.4 directly.

3. Build a Docker container

3.1 Starting from Dockerfile (explanation)

  • Install make, perl (v5.22.1), hmmer, pfamscan
    FROM ubuntu:16.04
    RUN apt-get update && \
        apt-get install -y wget build-essential make perl hmmer && \
        cd /root/ && \
        wget "http://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/OldPfamScan/PfamScan1.5/PfamScan.tar.gz"
  • Install python3 and libs to run the scripts
    RUN apt-get update \
      && apt-get install -y python3-pip python3-dev  #Version:Python 3.5.2
    RUN pip3 install pandas #Version:0.24.2
    RUN pip3 install rpy2   #Version:3.0.5
    RUN pip3 install scipy  #Version:1.3.0
    RUN pip3 install scikit-learn  #Version:0.21.2
    RUN pip3 install matplotlib #Version:3.0.3
    
  • Install snakemake as the workflow management system
    RUN pip3 install snakemake #Version:5.5.4
    
  • Add scripts for data analysis inside the container
    ADD scripts /usr/local/bin
    # make the scripts executable
    RUN chmod +x /usr/local/bin/* 
    

3.2 Making the snakemake workflow file (explanation)

3.2.1. The analysis includes three steps.

  • assigning pfam protein domains to species fasta
  • calculating the domain matrices (content, duplication, abundance, versatility) from domain assignments
  • analyzing domain matrices for filtering out significantly evolving domains
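
To make the second step concrete, here is a minimal sketch of turning domain assignments into two of the matrices. It assumes that "content" means presence or absence of a domain in a species and that "abundance" means the number of times the domain was assigned; the pipeline's own scripts may define these differently. The species names, protein ids, and domain ids are toy data.

```python
from collections import Counter

# Toy domain assignments: species -> list of (protein_id, domain_id) hits.
assignments = {
    "speciesA": [("p1", "PF00001"), ("p1", "PF00002"), ("p2", "PF00001")],
    "speciesB": [("q1", "PF00002")],
}


def abundance_matrix(assignments):
    """Count how often each domain is assigned in each species."""
    domains = sorted({d for hits in assignments.values() for _, d in hits})
    matrix = {}
    for species, hits in assignments.items():
        counts = Counter(domain for _, domain in hits)
        matrix[species] = {domain: counts.get(domain, 0) for domain in domains}
    return matrix


def content_matrix(abundance):
    """Reduce the counts to presence (1) or absence (0)."""
    return {
        species: {domain: int(count > 0) for domain, count in row.items()}
        for species, row in abundance.items()
    }


abundance = abundance_matrix(assignments)
content = content_matrix(abundance)
print(abundance)
print(content)
```

Each matrix has one row per species and one column per domain, so the rows line up with the species listed in the label file.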

3.2.2. The workflow diagram is generated by

snakemake --dag -np -s snakefile |dot -Tsvg > protein-domain-evolution-workflow.svg

3.2.3. The snakefile

  • 1st step, run pfamscan.pl

    rule pfamscan:
    
  • 2nd step, run python scripts to get matrices

    rule domain_content_matrix:
    rule domain_duplication_matrix:
    rule domain_abundance_matrix:
    rule domain_versatility_matrix:
    
  • 3rd step, run python scripts to analyze the matrices

    rule domain_content_matrix_analysis:
    rule domain_duplication_matrix_analysis:
    rule domain_abundance_matrix_analysis:
    rule domain_versatility_matrix_analysis:
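
The rule names above suggest a simple dependency structure: each matrix rule consumes the pfamscan output, and each analysis rule consumes its own matrix. The sketch below encodes that assumed structure (it is inferred from the rule names, not read from the snakefile) and uses Python's graphlib module to print one valid execution order, which is roughly what snakemake works out from the input and output declarations of the rules.

```python
from graphlib import TopologicalSorter

MATRIX_KINDS = ["content", "duplication", "abundance", "versatility"]


def build_dependencies():
    """Map each rule to the set of rules it is assumed to depend on."""
    deps = {"pfamscan": set()}
    for kind in MATRIX_KINDS:
        matrix_rule = f"domain_{kind}_matrix"
        deps[matrix_rule] = {"pfamscan"}
        deps[f"{matrix_rule}_analysis"] = {matrix_rule}
    return deps


deps = build_dependencies()
order = list(TopologicalSorter(deps).static_order())
for position, rule in enumerate(order, start=1):
    print(position, rule)
```

Under this assumption the four matrix branches are independent of each other once pfamscan has finished, which is why passing several cores to snakemake can speed up the run.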
    

3.3 Build the container using Docker (hands on)

3.3.1 git clone the repository

git clone https://github.com/cyber-carpentry/Group5-protein-domain-evolution-project.git

Once the Dockerfile and snakefile are ready, build the Docker image from the top-level git directory, not from the Docker directory:

docker build -t akshayayadav/protein-domain-evolution-project -f Docker/Dockerfile .

Use "docker images" to check the built images.

3.4 Create and run a writeable container layer over the built image (hands on)

docker pull akshayayadav/protein-domain-evolution-project
  • Since the data directory is not built into the container, you need to bind mount a volume with the data directory into the container.
docker run -v <path to the data directory>:/data akshayayadav/protein-domain-evolution-project run_analysis.sh -c 10

The number "10" gives the number of cores passed to snakemake to run the analysis.
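
An absolute host path is the safest choice for the -v option, since a bare name such as test_data is treated by Docker as a named volume rather than a directory. The sketch below is only a convenience wrapper, not part of the repository: it resolves the data directory to an absolute path and assembles the command shown above. The function name and the test_data directory are placeholders.

```python
from pathlib import Path
import shlex

IMAGE = "akshayayadav/protein-domain-evolution-project"


def build_run_command(data_dir, cores=10):
    """Return the docker run argument list for a given host data directory."""
    host_path = Path(data_dir).resolve()  # absolute path for the bind mount
    return [
        "docker", "run",
        "-v", f"{host_path}:/data",
        IMAGE,
        "run_analysis.sh", "-c", str(cores),
    ]


command = build_run_command("test_data", cores=10)
print(shlex.join(command))
# To execute it, pass the list to subprocess.run(command, check=True).
```

Building the command as a list avoids shell quoting problems when the data path contains spaces.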

About

Cyber Carpentry 2019 Project
