BCrystal: An Interpretable Sequence-Based Protein Crystallization Predictor
Protein crystallization makes it possible to study molecular structure. Novel, accurate, sequence-based in silico predictors of protein crystallization are highly sought after.
This step will install all the dependencies required for running BCrystal. You do not need sudo permissions for this step.
- Install Anaconda
  - Download the Anaconda (64-bit) Python 3.x installer for Linux: https://www.anaconda.com/distribution/#download-section
  - Run the installer:
    bash Anaconda3-2019.03-Linux-x86_64.sh
    and follow the instructions to install.
  - Install xgboost: conda install -c conda-forge xgboost
  - Install shap: conda install -c conda-forge shap
  - Install Biopython: conda install -c anaconda biopython
- R requirements
  - Start the R REPL by running the following:
    R
  - Install the R libraries:
    - Interpol (do install.packages('Interpol'))
    - bio3d (do install.packages('bio3d'))
    - doParallel (do install.packages('doParallel'))
    - zoo (do install.packages('zoo'))
  - Quit the R REPL:
    quit()
- SCRATCH (version SCRATCH-1D release 1.2) (http://scratch.proteomics.ics.uci.edu, Downloads: http://download.igb.uci.edu/#sspro)
  - Run
    wget http://download.igb.uci.edu/SCRATCH-1D_1.2.tar.gz
  - Run
    tar -xvzf SCRATCH-1D_1.2.tar.gz
  - Run
    cd SCRATCH-1D_1.2
  - Run
    perl install.pl
  - Run
    cd ..
  - If you are running on a 64-bit machine, replace the blast in SCRATCH-1D_1.2/pkg/blast-2.2.26 with a 64-bit version of blast-2.2.26 (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26/).
- DISOPRED (version 3.16) (http://bioinfadmin.cs.ucl.ac.uk/downloads/DISOPRED/)
  - Run
    wget http://bioinfadmin.cs.ucl.ac.uk/downloads/DISOPRED/DISOPRED3.16.tar.gz
  - Run
    tar -xvzf DISOPRED3.16.tar.gz
  - Run
    cd DISOPRED/src/
  - Run
    make clean; make; make install
  - In the run_disopred.pl file within the DISOPRED folder, put
    my $NCBI_DIR = <path-to-SCRATCH-folder>/pkg/blast-2.2.26/bin
  - In the run_disopred.pl file, also put
    my $SEQ_DB = <path-to-SCRATCH-folder>/pkg/PROFILpro_1.2/data/uniref50/uniref50
DISOPRED and SCRATCH-1D_1.2 should be in the same directory as the Data folder. The Data folder contains the training set and the 3 test sets of proteins in FASTA format, as well as files with their true labels: crystallized (1) or not (0).
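As a quick sanity check after installation, the setup described above can be verified with a short script. This is a minimal sketch and not part of BCrystal: the helper names `missing_python_packages` and `missing_folders` are hypothetical, and the script only assumes the import names `xgboost`, `shap`, and `Bio` (Biopython) and the folder names DISOPRED, SCRATCH-1D_1.2, and Data mentioned in this README.

```python
import importlib.util
from pathlib import Path

# Import names of the Python packages installed above (Bio is Biopython).
REQUIRED_PACKAGES = ("xgboost", "shap", "Bio")
# Folders that this README expects to sit side by side.
REQUIRED_FOLDERS = ("DISOPRED", "SCRATCH-1D_1.2", "Data")


def missing_python_packages(packages=REQUIRED_PACKAGES):
    """Return the packages that cannot be found in the current environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


def missing_folders(base_dir=".", folders=REQUIRED_FOLDERS):
    """Return the expected folders that are absent from base_dir."""
    base = Path(base_dir)
    return [name for name in folders if not (base / name).is_dir()]


if __name__ == "__main__":
    print("Missing Python packages:", missing_python_packages() or "none")
    print("Missing folders:", missing_folders() or "none")
```

Run it from the directory that holds the Data folder; an empty result for both checks suggests the Python side of the setup is in place. It does not check the R libraries or the SCRATCH and DISOPRED builds themselves.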
To run BCrystal on your own protein sequences you need the following three things:
- Protein Sequence File: The protein sequence or sequences of interest in FASTA format (https://en.wikipedia.org/wiki/FASTA_format). We provide
Data/test.fasta
as a sample test file.
- SCRATCH: Software used to extract structural features from a given protein sequence file. Follow the instructions in the previous section to install SCRATCH.
- DISOPRED: Software used to extract disorder features from a given protein sequence file. Follow the instructions in the previous section to install DISOPRED.
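Before extracting features, it can help to confirm that the input file is well-formed FASTA. The sketch below is illustrative and not part of BCrystal: `read_fasta` is a hypothetical helper that uses only the standard library, and it assumes the usual convention of a `>` header line followed by one or more sequence lines.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    example = [">seq1 example", "MKTAYIAKQR", "QISFVKSHFS", ">seq2", "MADEEKLPPG"]
    for name, seq in read_fasta(example):
        print(name, len(seq))
```

Because file objects iterate over lines, the same function can be called on an open file, for example `read_fasta(open("Data/test.fasta"))`.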
Rscript --vanilla features_PaRSnIP_v2.R <your-test>.fasta
python xgb.py features.csv <your-test>.fasta <output_folder>
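The two commands above can also be chained from Python. This is a minimal sketch under two assumptions: both scripts are run from the BCrystal folder, and the R script writes `features.csv` to the current directory, as the second command suggests. The names `build_commands` and `run_bcrystal`, and the output folder name `results`, are hypothetical.

```python
import subprocess


def build_commands(fasta_path, output_folder):
    """Return the two command lines from this README as argument lists."""
    return [
        ["Rscript", "--vanilla", "features_PaRSnIP_v2.R", fasta_path],
        ["python", "xgb.py", "features.csv", fasta_path, output_folder],
    ]


def run_bcrystal(fasta_path, output_folder):
    """Run feature extraction, then prediction, stopping at the first failure."""
    for command in build_commands(fasta_path, output_folder):
        # check=True raises CalledProcessError if a step exits with a non-zero status.
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_bcrystal(...) to execute them.
    for command in build_commands("Data/test.fasta", "results"):
        print(" ".join(command))
```

Passing the arguments as lists avoids shell quoting problems when a FASTA path contains spaces.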
In the <output_folder> you will find 2 outputs:
- prediction.csv - Containing the crystallization propensity
- bar_plot_i.png - where i=1 if the FASTA file contains a single sequence; otherwise i is the position of the sequence in the test FASTA file.
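Given this naming scheme, the expected contents of the output folder follow from the number of sequences in the input. The sketch below assumes one `bar_plot_i.png` per sequence, numbered from 1, which is an interpretation of the description above rather than something confirmed by the scripts; the helper names are hypothetical.

```python
from pathlib import Path


def expected_outputs(n_sequences):
    """List the output file names expected for n_sequences input sequences."""
    plots = [f"bar_plot_{i}.png" for i in range(1, n_sequences + 1)]
    return ["prediction.csv"] + plots


def missing_outputs(output_folder, n_sequences):
    """Return the expected files that are not present in output_folder."""
    folder = Path(output_folder)
    return [
        name
        for name in expected_outputs(n_sequences)
        if not (folder / name).is_file()
    ]


if __name__ == "__main__":
    print(expected_outputs(3))
```

Calling `missing_outputs(<output_folder>, n)` after a run gives a quick indication of whether prediction finished for all n sequences.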
To train the BCrystal xgboost model on our training proteins, run the following:
Rscript --vanilla features_PaRSnIP_v2.R Data/Train/FULL_Train.fasta
python xgb_train.py
The training set is readily available for ease of use at: https://drive.google.com/file/d/1FRWIcs4xvK2O5OCqhg7u5g_4qm2zJn2d/view and can be used in combination with Step 2 to generate the BCrystal model.
The output will be a file called train.model.