Materials for a semester project at the Bioinformatics Institute (spring 2021)
Students: Pyankov I., Shemyakina A.
Supervisor: Popov P. (Skoltech)
This work is inspired by the development of proteolysis-targeting chimeras (PROTACs) and related molecules that induce targeted protein degradation by the ubiquitin-proteasome system. They represent a new therapeutic modality and attract great interest; however, progress is hindered by the low efficiency of protein crystallography, which provides the 3D structures of E3 ubiquitin ligases required in the initial steps of PROTAC development. An automated in silico modeling tool could help expand the number of enzymes available for developing targeted protein degradation systems.
Develop a script for performing Multi-template Homology Auto Modeling of a target protein.
- Develop the script.
- Test the script's performance with an E3 ubiquitin ligase target.
An example result of the script (target protein: human E3 ubiquitin-protein ligase TRIM69, EC 2.3.2.27) is available in the results folder (zipped). Inside it:
- Q86WT6.fasta is the target sequence file;
- the orig_templates folder contains the template structures found and downloaded during the homologue search;
- the Modeling folder contains the files generated during Rosetta modeling, including the final models in the corresponding folder;
- the score folder contains the raw score file generated by Ornate;
- the summary_result folder contains the summarized score information: a csv file with the mean and sd of the scores and a graph of the score distribution along the target protein length.
This example contains only one final model and, accordingly, one score file, but in general the script can generate and score any number of models.
Homologue search - mafft-homologs L-INS-i [1]; modeling - rosetta_scripts application [2], FastRelax Mover [3], energy landscape [4]; quality assessment - Ornate [5].
Compatibility is guaranteed for the following Python package versions:
numpy==1.20.3
pandas==1.2.3
lxml==4.5.0
requests==2.25.1
urllib.request==3.8
Bio==1.78
matplotlib==3.3.4
BLAST_DB: db_v5
Mafft version: v7.453
Rosetta version: rosetta.source.release-275 r275 2021.07+release.c48be26. Installation instructions: https://www.rosettacommons.org/demos/latest/tutorials/install_build/install_build
Ornate has no version number, but requires tensorflow==1.14.0 or earlier (and thus Python 3.5-3.7). Installation instructions: https://team.inria.fr/nano-d/software/Ornate/
The script consists of three parts.
The first part downloads the SWISS-MODEL Repository and/or the Worldwide Protein Data Bank, performs a homologue search in these databases and downloads the pdb structures of potential templates.
The second part processes the templates: it chooses the correct chain from each pdb file, keeps the templates above the percent identity threshold and calculates the target coverage by the remaining templates. The script then follows the RosettaCM tutorial and generates the model(s).
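The identity filter and the coverage calculation can be illustrated with a minimal sketch. It is not the code from Process_homologues_and_model.py: the function names, the toy alignment and the 30% threshold are assumptions made for the example, and percent identity is computed here over the target length, which is only one of several common definitions.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent of target residues matched by an identical template residue.

    Both arguments are rows of the same alignment, with '-' marking gaps.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    target_len = sum(1 for t in aligned_target if t != "-")
    matches = sum(
        1 for t, s in zip(aligned_target, aligned_template)
        if t != "-" and t == s
    )
    return 100.0 * matches / target_len if target_len else 0.0


def select_templates(aligned_target, aligned_templates, min_identity):
    """Keep only the templates at or above the identity threshold."""
    return {
        name: row for name, row in aligned_templates.items()
        if percent_identity(aligned_target, row) >= min_identity
    }


def target_coverage(aligned_target, template_rows):
    """Fraction of target residues aligned to a residue of at least one template."""
    rows = list(template_rows)
    total = covered = 0
    for i, t in enumerate(aligned_target):
        if t == "-":
            continue  # alignment column without a target residue
        total += 1
        if any(row[i] != "-" for row in rows):
            covered += 1
    return covered / total if total else 0.0


# Toy alignment (hypothetical sequences, not real templates)
target = "ACDEFGHIKL"
templates = {
    "tmpl_a": "ACDEF-----",
    "tmpl_b": "-----GHIAA",
    "tmpl_c": "AAAAAAAAAA",
}
kept = select_templates(target, templates, min_identity=30.0)  # assumed threshold
coverage = target_coverage(target, kept.values())
```

In this toy case tmpl_c is dropped because of its low identity, and the two remaining templates together cover the whole target.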
The third part evaluates the resulting model(s) and generates a csv file with the mean and sd of the scores and a picture of the score distribution along the target length for each model.
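The summary step can be sketched as follows. This is an illustration only: it assumes the per-residue scores have already been parsed from the Ornate output into plain lists, uses the standard library instead of pandas to stay self-contained, and reports the sample standard deviation, which may differ from what Proteins_score.py does. The function names and column names are hypothetical.

```python
import csv
import io
import statistics


def summarize_scores(per_residue_scores):
    """Return one row with mean and sd per model.

    per_residue_scores maps a model name to its list of per-residue scores.
    """
    rows = []
    for model, scores in sorted(per_residue_scores.items()):
        rows.append({
            "model": model,
            "mean": statistics.mean(scores),
            # Sample sd is undefined for a single value, so fall back to 0.0
            "sd": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        })
    return rows


def rows_to_csv(rows):
    """Serialize the summary rows as csv text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["model", "mean", "sd"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up scores for two hypothetical models
summary = summarize_scores({
    "model_1": [0.25, 0.5, 0.75],
    "model_2": [0.5],
})
csv_text = rows_to_csv(summary)
```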
- The target protein sequence must be no longer than 1000 amino acid residues.
- Provide only an absolute path for the working directory.
- If you have problems creating a database for mafft, you can download db_for_mafft. Unpack it and provide the path to it when the script asks for it.
- Please note that the make_fragments.pl script initially installs many dependencies. The installation requires roughly 73 GB or more of free disk space.
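The first two restrictions can be checked before a run with a small helper. This is a sketch under assumptions: the function name and messages are hypothetical, and the actual checks in Preparation.py may differ.

```python
import os

MAX_TARGET_LENGTH = 1000  # limit stated in the notes above


def validate_inputs(sequence, workdir):
    """Return a list of problems found in the user input (empty if none)."""
    problems = []
    residues = "".join(sequence.split())  # drop whitespace and line breaks
    if not residues:
        problems.append("target sequence is empty")
    elif len(residues) > MAX_TARGET_LENGTH:
        problems.append(
            "target sequence has %d residues, the limit is %d"
            % (len(residues), MAX_TARGET_LENGTH)
        )
    if not os.path.isabs(workdir):
        problems.append("working directory must be an absolute path")
    return problems


# A relative path is rejected, an absolute one is accepted
issues_relative = validate_inputs("MKV" * 10, "my_project")
issues_absolute = validate_inputs("MKV" * 10, os.path.abspath("my_project"))
```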
The main program is in the main.py file. Launch it without any flags. The files Search_and_download_homologues.py, Process_homologues_and_model.py and Proteins_score.py contain the functions required by the main script (parts 1, 2 and 3, respectively). The file Preparation.py contains the functions for interaction with the user.
Note! All scripts must be in one folder!
The script is started with the following command:
python3 main.py
At startup, the script asks for the required information on the command line. Execution may be aborted if software dependencies (Ornate and Rosetta) are missing, if no acceptable homologues are found in the databases, or if the homologues do not cover the target protein sufficiently.
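To avoid an aborted run, the presence of external tools can be checked in advance. The sketch below only tests whether executables are found on PATH; it is an assumption that this matches your setup, since the script itself asks for the locations of Rosetta and Ornate, and the executable names used here are illustrative and depend on the installation.

```python
import shutil


def find_missing_tools(executables):
    """Return the names that cannot be found on PATH."""
    return [name for name in executables if shutil.which(name) is None]


# Hypothetical executable names; adjust them to your installation
REQUIRED = ["mafft", "rosetta_scripts.default.linuxgccrelease"]

missing = find_missing_tools(REQUIRED)
if missing:
    print("Not found on PATH:", ", ".join(missing))
```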
The first part generates the target fasta file in the provided working directory and downloads the templates into the orig_templates folder. The results of the second part are stored in the Modeling folder. The results of the third part are stored in the score folder (Ornate output) and the summary_result folder (summary of the Ornate output).
If you have any questions, please contact haletidy@gmail.com and/or vanypyankov@gmail.com
Thank you for your attention!
[1] Katoh K. et al. (2005). MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33(2):511-518. doi: 10.1093/nar/gki198.
[2] Fleishman SJ, Leaver-Fay A, Corn JE, Strauch E-M, Khare SD, Koga N, Ashworth J, Murphy P, Richter F, Lemmon G, Meiler J, and Baker D. (2011). RosettaScripts: A Scripting Language Interface to the Rosetta Macromolecular Modeling Suite. PLoS ONE 6(6):e20161. doi: 10.1371/journal.pone.0020161.
[3] Khatib F, Cooper S, Tyka MD, Xu K, Makedon I, Popovic Z, Baker D, and Players F. (2011). Algorithm discovery by protein folding game players. Proc Natl Acad Sci USA 108(47):18949-53. doi: 10.1073/pnas.1115898108.
[4] Maguire JB, Haddox HK, Strickland D, Halabiya SF, Coventry B, Griffin JR, Pulavarti SVSRK, Cummins M, Thieker DF, Klavins E, Szyperski T, DiMaio F, Baker D, and Kuhlman B. (2020). Perturbing the energy landscape for improved packing during computational protein design. Proteins, in press. doi: 10.1002/prot.26030.
[5] Pages G, Charmettant B, and Grudinin S. (2018). Protein model quality assessment using 3D oriented convolutional neural networks. bioRxiv. doi: 10.1093/bioinformatics/btz122.