This repository contains code for predicting a pairwise protein-protein interaction network from a set of protein sequences, or from two lists of sequences thought to interact.
The procedure is based on MSA creation and evaluation with AlphaFold2 adapted for pairwise interactions (FoldDock). Note that the method has not been evaluated on random sets of sequences, only on interactions with experimental support. For disordered proteins, the expected performance is low.
AlphaFold2 is available under the Apache License, Version 2.0 and so is SpeedPPI, which is a derivative thereof.
The AlphaFold2 parameters are made available under the terms of the CC BY 4.0 license and have not been modified.
You may not use these files except in compliance with the licenses.
For an example study, see "Towards a structurally resolved human protein interaction network", where 65,484 human protein interactions were evaluated.
The number of possible pairs grows quadratically with the number of input sequences.
As this number rapidly becomes too large to handle, we apply a number of techniques to speed up the network generation. We also greatly reduce the memory footprint of the dependencies for structure prediction and of the generated MSAs. Overall, the speedup is 40x (for a set of 1,000 proteins, i.e. 499,500 pairwise interactions) and the required disk space is reduced 4,000x. The benefits grow with the number of proteins being predicted.
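The pair count is n(n-1)/2; for 1,000 proteins you can verify it directly in the shell:
echo $((1000 * 999 / 2))    # 499500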
We provide two options for running SpeedPPI:
- All-vs-all mode: runs all proteins in a fasta file against each other.
- Some-vs-some mode: runs all proteins in two different fasta files against each other.
The PPI network is built automatically and all structures above a score threshold that you set are
saved to disk.
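The sequences are provided as standard FASTA files, one entry per protein (the test files under ./data/test/ are examples). A minimal illustration, with made-up identifiers and sequences:
>protein_A
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>protein_B
MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS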
Follow the instructions below, provide your sequences and relax while your network is being built.
If you like SpeedPPI, please star this repo, and if you use it in your research, please cite FoldDock and SpeedPPI.
Install all packages into a conda environment (requires miniconda: https://docs.conda.io/en/latest/miniconda.html)
conda env create -f speed_ppi.yml
wait
conda activate speed_ppi
pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
pip install --upgrade dm-haiku
Note that the prediction step runs on GPU. You will therefore have to install the appropriate CUDA drivers for your system; otherwise the GPU will not be used.
The installation scripts do install these drivers, but this can fail in some cases.
To check that the GPU is detected, run:
conda activate speed_ppi
python3 ./src/test_gpu_avail.py
If the output is "gpu", everything is fine.
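Equivalently, you can query JAX directly; jax.default_backend() reports which backend JAX selected (this one-liner is a sketch, not the repo's own check):
python3 -c "import jax; print(jax.default_backend())"
It should print gpu when CUDA is picked up, and cpu otherwise.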
Install HHblits; the static SSE2 build is fetched.
mkdir hh-suite
cd hh-suite
wget https://github.com/soedinglab/hh-suite/releases/download/v3.3.0/hhsuite-3.3.0-SSE2-Linux.tar.gz
tar xvfz hhsuite-3.3.0-SSE2-Linux.tar.gz
cd ..
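To confirm the static binary is in place, you can print its help text (path as extracted above):
./hh-suite/bin/hhblits -h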
Download the Uniclust30 database (25 GB download, 87 GB extracted):
wget http://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz --no-check-certificate
tar -zxvf uniclust30_2018_08_hhsuite.tar.gz -C data/
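After extraction, the database files should sit under data/uniclust30_2018_08/ (assuming the archive keeps its default directory name); you can verify with:
ls data/uniclust30_2018_08/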
Download the AlphaFold2 parameters:
mkdir data/params
wget https://storage.googleapis.com/alphafold/alphafold_params_2021-07-14.tar
mv alphafold_params_2021-07-14.tar data/params
tar -xf data/params/alphafold_params_2021-07-14.tar
mv params_model_1.npz data/params
Cleanup - remove unnecessary files
rm uniclust30_2018_08_hhsuite.tar.gz
rm data/params/alphafold_params_2021-07-14.tar
rm params_*.npz
rm hh-suite/hhsuite*.tar.gz
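After the cleanup, the only parameter file kept above should remain in place:
ls -lh data/params/params_model_1.npz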
All-vs-all mode. Provide the following arguments to create_ppi_all_vs_all.sh, separated by spaces:
- A fasta file with sequences for all proteins you want to analyse
- Path to HHblits (with the install above: ./hh-suite/bin/hhblits)
- pDockQ threshold. This determines which structures are saved to disk (0.5 is recommended).
- Output directory
Try the test case:
conda activate speed_ppi
bash create_ppi_all_vs_all.sh ./data/test/test.fasta ./hh-suite/bin/hhblits 0.5 ./data/test/all_vs_all/
Some-vs-some mode. Provide the following arguments to create_ppi_some_vs_some.sh, separated by spaces:
- A fasta file with sequences for the proteins in list 1
- A fasta file with sequences for the proteins in list 2
- Path to HHblits
- pDockQ threshold. This determines which structures are saved to disk (0.5 is recommended).
- Output directory
Try the test case:
conda activate speed_ppi
bash create_ppi_some_vs_some.sh ./data/test/test1.fasta ./data/test/test2.fasta ./hh-suite/bin/hhblits 0.5 ./data/test/some_vs_some/
To predict a single pair of proteins, run:
bash predict_single.sh ./data/test/4G4S_O.fasta ./data/test/4G4S_P.fasta ./hh-suite/bin/hhblits ./data/test/single/
If you have a computational cluster available, it will be much faster to run your predictions in parallel. This requires some knowledge of computational infrastructure, however. Steps 2 and 3 in create_ppi_[all_vs_all, some_vs_some].sh are written as individual scripts assuming a SLURM infrastructure in ./src/parallel/. You can copy these, modify the paths and variables, and submit them to your cluster to make the predictions even more efficient (see the sketch below).
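As a starting point, a submission script could look like the minimal SLURM sketch below; the job-array range, resource requests and the called script are placeholders that you would adapt to the scripts in ./src/parallel/ and to your cluster.
#!/bin/bash
#SBATCH --job-name=speed_ppi
#SBATCH --array=1-100              # placeholder: one task per protein (or batch of pairs)
#SBATCH --gres=gpu:1               # the prediction step needs a GPU
#SBATCH --time=24:00:00            # placeholder walltime
#SBATCH --output=logs/%A_%a.out
# Placeholder call: point this at your adapted copy of a step script from ./src/parallel/,
# using SLURM_ARRAY_TASK_ID to select which protein/pair the task processes.
bash ./src/parallel/your_adapted_step.sh $SLURM_ARRAY_TASK_ID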