ParaFold

ParaFold is a parallel version of AlphaFold, with split CPU and GPU stages and a wrapper script for optional Amber relax.

ParaFold works by checking whether a file named features.pkl exists. If features.pkl does not exist, ParaFold distributes the first-stage jobs to CPUs; these CPU jobs usually take a few minutes to a few hours to complete. Once the CPUs have generated features.pkl, the second stage of model inference starts on GPUs. ParaFold can also run entirely on GPUs, like AlphaFold, if the prediction job is submitted to GPUs without an existing features.pkl. ParaFold can run large-scale protein predictions on supercomputers in less time and at lower cost.
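
The dispatch decision is just a file-existence check. A minimal sketch of the idea in shell (illustrative only; the real logic lives in run_alphafold.py, and the paths here are placeholders):

# Illustrative sketch of ParaFold's dispatch check; not the actual code
feature_path="output/FASTA_NAME/features.pkl"   # placeholder path
if [ -f "$feature_path" ]; then
    echo "features.pkl found: skip the CPU feature stage, start GPU model inference"
else
    echo "features.pkl missing: run the CPU feature stage (MSA and template search)"
fi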

If you use ParaFold, please cite the paper: ParaFold: Paralleling AlphaFold for Large-Scale Predictions, Bozitao Zhong, Xiaoming Su, Minhua Wen, Sichen Zuo, Liang Hong, James Lin, International Conference on High Performance Computing in Asia-Pacific Region Workshops (2022) - link.

ParaFold supports AlphaFold 2.1.1

How to install

Setting up conda environment

ParaFold uses the same Python environment as AlphaFold, so if you already have a local Python environment for AlphaFold, you can skip this section.

Step 1: Create a conda environment for AlphaFold

# assuming a miniconda module is available on your cluster; otherwise install Miniconda or Anaconda yourself
module load miniconda3
conda create -n alphafold python=3.8
source activate alphafold

Step 2: Install cudatoolkit 10.1 and cudnn:

conda install cudatoolkit=10.1 cudnn

Why use cudatoolkit 10.1:

  • cudatoolkit 10.1 works with TensorFlow 2.3.0, while TensorFlow sometimes fails to find the GPU with cudatoolkit 10.2
  • this installs cudnn 7.6.5
  • for a newer CUDA driver, you can install cudatoolkit 11.2 and TensorFlow 2.5.0 instead (see the sketch below)
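
If your CUDA driver requires the newer toolchain, the equivalent steps might look like this (a sketch based on the bullet above; exact package availability depends on your conda channels):

# Alternative toolchain for newer CUDA drivers (versions from the note above)
conda install cudatoolkit=11.2 cudnn
pip install tensorflow==2.5.0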

Step 3: Install TensorFlow 2.3.0 with pip

pip install tensorflow==2.3.0

Step 4: Install other packages with pip and conda

# Using conda
conda install -c conda-forge openmm=7.5.1 pdbfixer=1.7
conda install -c bioconda hmmer=3.3.2 hhsuite=3.3.0 kalign2=2.04
conda install pandas=1.3.4

# Using pip
pip install biopython==1.79 chex==0.0.7 dm-haiku==0.0.4 dm-tree==0.1.6 immutabledict==2.0.0 jax==0.2.14 ml-collections==0.1.0
pip install --upgrade jax jaxlib==0.1.69+cuda101 -f https://storage.googleapis.com/jax-releases/jax_releases.html

jax installation reference: https://github.com/google/jax. Choose the jaxlib wheel tag that matches your CUDA version:

  • For CUDA 11.1, 11.2, or 11.3, use cuda111.
  • For CUDA 11.0, use cuda110.
  • For CUDA 10.2, use cuda102.
  • For CUDA 10.1, use cuda101.
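
For example, on a CUDA 11.1 system the jaxlib line above would use the cuda111 tag instead (assuming a matching jaxlib build is published for that tag):

pip install --upgrade jax jaxlib==0.1.69+cuda111 -f https://storage.googleapis.com/jax-releases/jax_releases.html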

Clone This Repo

git clone https://github.com/SJTU-HPC/ParaFold.git
alphafold_path="/path/to/alphafold/git/repo"

Give the shell script execute permission:

chmod +x run_alphafold.sh

Final Steps

Download the stereo_chemical_props.txt file to the common folder

wget -q -P alphafold/alphafold/common/ https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt

Apply OpenMM patch

# Set this to the path of your alphafold folder
alphafold_path="/path/to/alphafold/git/repo"
cd ~/.conda/envs/alphafold/lib/python3.8/site-packages/
patch -p0 < $alphafold_path/docker/openmm.patch
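
To check that the patch applies cleanly before it modifies any files, you can do a dry run first (a standard GNU patch flag):

patch -p0 --dry-run < $alphafold_path/docker/openmm.patch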

Some details on the modified files

Three files:

  • run_alphafold.py: modified version of the original run_alphafold.py, with additional functions such as skipping the feature step when features.pkl already exists in the output folder
  • run_alphafold.sh: bash script to run run_alphafold.py
  • run_figure.py: generates plots of the results

How to run

First, run the feature step on CPUs:

./run_alphafold.sh -d data -o output -p monomer_ptm -i input/test.fasta -t 2021-07-27 -m model_1 -f

-f means only run the featurization step, which produces a features.pkl file, and skip the following steps.

In our tests, 8 CPUs are enough; more CPUs do not improve speed.

The feature step writes features.pkl and the MSA folder into your output folder: ./output/FASTA_NAME/

Note: we put the input files in an input folder to keep things organized.
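
On a Slurm cluster, the CPU stage could be submitted like this (a sketch only: the job name, partition, and resource values are assumptions, adjust them for your site):

#!/bin/bash
#SBATCH --job-name=parafold_cpu   # hypothetical job name
#SBATCH --partition=cpu           # hypothetical partition name
#SBATCH --cpus-per-task=8         # 8 CPUs are enough for the feature step
./run_alphafold.sh -d data -o output -p monomer_ptm -i input/test.fasta -t 2021-07-27 -m model_1 -f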

Second, run run_alphafold.sh on GPUs:

./run_alphafold.sh -d data -o output -m model_1,model_2,model_3,model_4,model_5 -i input/test.fasta -t 2021-07-27

If features.pkl was generated successfully, the feature step is skipped and model inference starts almost immediately.
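
The two stages can also be chained so the GPU job starts only after the CPU job succeeds. A sketch using standard Slurm job dependencies (partition and GPU resource names are assumptions for illustration):

# Submit the CPU feature stage, then a GPU inference stage that waits for it
cpu_id=$(sbatch --parsable --partition=cpu --cpus-per-task=8 \
    --wrap "./run_alphafold.sh -d data -o output -p monomer_ptm -i input/test.fasta -t 2021-07-27 -m model_1 -f")
sbatch --dependency=afterok:$cpu_id --partition=gpu --gres=gpu:1 \
    --wrap "./run_alphafold.sh -d data -o output -m model_1,model_2,model_3,model_4,model_5 -i input/test.fasta -t 2021-07-27"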

Finally, you can run run_figure.py to visualize your results: [This will be available soon]

python run_figure.py [SystemName]

This script creates a figure folder inside your output folder.

Note: run_figure.py needs a local conda environment with matplotlib, pymol, and numpy.

Functions

You can use the following flags to change how ParaFold runs predictions:

-x: Skip AMBER refinement

-b: Run in benchmark mode - the JAX model runs twice, and the second run can be used to evaluate running time

-r: Change the number of recycling iterations
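
For example, a run that skips Amber relax and uses a custom recycle count might look like this (a sketch; it assumes -r takes an integer count):

# Hypothetical flag combination: skip Amber relax (-x) and set 3 recycles (-r 3)
./run_alphafold.sh -d data -o output -m model_1 -i input/test.fasta -t 2021-07-27 -x -r 3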