This pipeline is designed to automate the prediction of protein domains given only the protein's sequence, based on the method described in (Granata et al., 2017). The pipeline is written in Snakemake framework (Koster and Rahmann, 2012) to allow easy execution both in compute clusters and in local machines and to ensure reproducibility of the generated data. We also strive to make installation as convenient as possible, by having most of our dependencies available on conda.
The pipeline consists of three main processes:
- Generation of multiple sequence alignment using an iterative protein sequence search tool called hhblits
- Prediction of direct coupling between every pair of protein residues using gplmDCA (Feinauer et al., 2014; & Ekeberg et al., 2013)
- Generation of protein domains clusters using spectral clustering (Ponzoni, 2015)
This pipeline can be run end-to-end — i.e. from FASTA sequence input toa table of every residue position and their corresponding domain membership — or can be started in from intermediate files.
This repository also include RMarkdown files we use for our downstream analyses which aims to inspect the possible relationship between the number and identity of ExAC mutations (Lek et al., 2016) and pathogenic variants.
Python, Conda, MATLAB
Before using the pipeline, you will need to install snakemake on top of the dependencies listed above
conda install -c bioconda -c conda-forge snakemake
Also, hhblits requires a database to be downloaded for which they recommend UniClust30. Download this database into the db/
directory.
On the local machine you could simply run the pipeline by firstly populating the fasta_list.dat
file with linebreak-delimieted name of your fasta files (without the .fa suffix). You would also need to modify the parameters in params.json
.
Once those are done, you can run
snakemake -k --use-conda
For parallelise implementation on a compute cluster, you would need to create a configuration file called cluster.json
, for example
{
"__default__":
{
"partition": "general",
"mem": "1G",
"time": "00:30:00"
}
}
then, simply run
snakemake -k -j 100 --use-conda --cluster-config cluster.json --cluster "sbatch -p {cluster.partition} --mem={cluster.mem} -t {cluster.time} -c {threads}"
where -j 100
means that you only allow a maxiumum of 100 jobs to be run simultaneously on the cluster.
The result along with all intermediate and log files can be found in the results/
directory, where the final result (every residue's domain membership in the protein) is found in results/clustering
.
Files used to preprocess the clinvar, ExAC and PanCanAtlas variations can be found in downstream_analysis/clinvar_data
, downstream_analysis/exac_data
and downstream_analysis/pancanatlas_data
. The Rmd files in each folder explains what the other files are used for. The link to these datasets can be found here:
- Clinvar: ftp://ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh37/archive_2.0/2019/clinvar_20190715_papu.vcf.gz
- ExAC: ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3.1/ExAC.r0.3.1.sites.vep.vcf.gz
- PanCanAtlas: http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc
Analysis of the first 6,924 proteins can be found in downstream_analysis/Analysis/Analysis_exac.Rmd
, which also explains what the other files in that directory is used for.
Feinauer, C., Skwark, M. J., Pagnani, A., & Aurell, E. (2014). Improving contact prediction along three dimensions. PLoS computational biology, 10(10), e1003847.
Granata, D., Ponzoni, L., Micheletti, C., & Carnevale, V. (2017). Patterns of coevolving amino acids unveil structural and dynamical domains. Proceedings of the National Academy of Sciences, 114(50), E10612-E10621.
Köster, J., & Rahmann, S. (2012). Snakemake—a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520-2522.
Lek, M., Karczewski, K. J., Minikel, E. V., Samocha, K. E., Banks, E., Fennell, T., ... & Tukiainen, T. (2016). Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536(7616), 285.
Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M., & Aurell, E. (2013). Improved contact prediction in proteins: using pseudolikelihoods to
infer Potts models. Physical Review E, 87(1), 012707.
Ponzoni, L., Polles, G., Carnevale, V., & Micheletti, C. (2015). SPECTRUS: A dimensionality reduction approach for identifying dynamical
domains in protein complexes from limited structural datasets. Structure, 23(8), 1516-1525.