domain-clustering-pipeline

Introduction

This pipeline is designed to automate the prediction of protein domains given only the protein's sequence, based on the method described in (Granata et al., 2017). The pipeline is written in Snakemake framework (Koster and Rahmann, 2012) to allow easy execution both in compute clusters and in local machines and to ensure reproducibility of the generated data. We also strive to make installation as convenient as possible, by having most of our dependencies available on conda.

The pipeline consists of three main processes:

Generation of multiple sequence alignment using an iterative protein sequence search tool called hhblits
Prediction of direct coupling between every pair of protein residues using gplmDCA (Feinauer et al., 2014; & Ekeberg et al., 2013)
Generation of protein domains clusters using spectral clustering (Ponzoni, 2015)

This pipeline can be run end-to-end — i.e. from FASTA sequence input toa table of every residue position and their corresponding domain membership — or can be started in from intermediate files.

This repository also include RMarkdown files we use for our downstream analyses which aims to inspect the possible relationship between the number and identity of ExAC mutations (Lek et al., 2016) and pathogenic variants.

Dependencies

Python, Conda, MATLAB

Setup

Before using the pipeline, you will need to install snakemake on top of the dependencies listed above

conda install -c bioconda -c conda-forge snakemake

Also, hhblits requires a database to be downloaded for which they recommend UniClust30. Download this database into the db/ directory.

Running the pipeline on local machine

On the local machine you could simply run the pipeline by firstly populating the fasta_list.dat file with linebreak-delimieted name of your fasta files (without the .fa suffix). You would also need to modify the parameters in params.json.

Once those are done, you can run

snakemake -k --use-conda

Running the pipeline on a compute cluster

For parallelise implementation on a compute cluster, you would need to create a configuration file called cluster.json, for example

{
  "__default__":
    {
      "partition": "general",
      "mem": "1G",
      "time": "00:30:00"
    }
}

then, simply run

snakemake -k -j 100 --use-conda --cluster-config cluster.json --cluster "sbatch -p {cluster.partition} --mem={cluster.mem} -t {cluster.time} -c {threads}"

where -j 100 means that you only allow a maxiumum of 100 jobs to be run simultaneously on the cluster.

Results

The result along with all intermediate and log files can be found in the results/ directory, where the final result (every residue's domain membership in the protein) is found in results/clustering.

Analysis

Data

Files used to preprocess the clinvar, ExAC and PanCanAtlas variations can be found in downstream_analysis/clinvar_data, downstream_analysis/exac_data and downstream_analysis/pancanatlas_data. The Rmd files in each folder explains what the other files are used for. The link to these datasets can be found here:

Clinvar: ftp://ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh37/archive_2.0/2019/clinvar_20190715_papu.vcf.gz
ExAC: ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3.1/ExAC.r0.3.1.sites.vep.vcf.gz
PanCanAtlas: http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc

Preliminary Analysis

Analysis of the first 6,924 proteins can be found in downstream_analysis/Analysis/Analysis_exac.Rmd, which also explains what the other files in that directory is used for.

Bibliography

Feinauer, C., Skwark, M. J., Pagnani, A., & Aurell, E. (2014). Improving contact prediction along three dimensions. PLoS computational biology, 10(10), e1003847.

Granata, D., Ponzoni, L., Micheletti, C., & Carnevale, V. (2017). Patterns of coevolving amino acids unveil structural and dynamical domains. Proceedings of the National Academy of Sciences, 114(50), E10612-E10621.

Köster, J., & Rahmann, S. (2012). Snakemake—a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520-2522.

Lek, M., Karczewski, K. J., Minikel, E. V., Samocha, K. E., Banks, E., Fennell, T., ... & Tukiainen, T. (2016). Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536(7616), 285.

Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M., & Aurell, E. (2013). Improved contact prediction in proteins: using pseudolikelihoods to
infer Potts models. Physical Review E, 87(1), 012707.

Ponzoni, L., Polles, G., Carnevale, V., & Micheletti, C. (2015). SPECTRUS: A dimensionality reduction approach for identifying dynamical
domains in protein complexes from limited structural datasets. Structure, 23(8), 1516-1525.

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
downstream_analysis		downstream_analysis
env		env
fasta_input		fasta_input
spectrus_slim		spectrus_slim
third-party-code		third-party-code
README.md		README.md
Snakefile		Snakefile
cluster.json		cluster.json
cluster_spectrus.sh		cluster_spectrus.sh
config.yaml		config.yaml
edgelist_to_simmtx.py		edgelist_to_simmtx.py
fasta_done.dat		fasta_done.dat
fasta_list.dat		fasta_list.dat
filter_by_gap.sh		filter_by_gap.sh
gplmDCA_asymmetric.m		gplmDCA_asymmetric.m
params.json		params.json
plot_graphics.R		plot_graphics.R
reformat_a3m_fa.py		reformat_a3m_fa.py
sample_list.dat		sample_list.dat

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

domain-clustering-pipeline

Introduction

Dependencies

Setup

Running the pipeline on local machine

Running the pipeline on a compute cluster

Results

Analysis

Data

Preliminary Analysis

Bibliography

About

Releases

Packages

Languages

kristoforusbryant/domain-clustering-pipeline

Folders and files

Latest commit

History

Repository files navigation

domain-clustering-pipeline

Introduction

Dependencies

Setup

Running the pipeline on local machine

Running the pipeline on a compute cluster

Results

Analysis

Data

Preliminary Analysis

Bibliography

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages