AI-based protein structure prediction pipelines, such as AlphaFold2, have achieved near-experimental accuracy. These advanced pipelines mainly rely on Multiple Sequence Alignments (MSAs) and templates as inputs to learn the co-evolution information from the homologous sequences. Nonetheless, searching MSAs and templates from protein databases is time-consuming, usually taking dozens of minutes. Consequently, we attempt to explore the limits of fast protein structure prediction by using only primary sequences of proteins. HelixFold-Single is proposed to combine a large-scale protein language model with the superior geometric learning capability of AlphaFold2. Our proposed method, HelixFold-Single, first pre-trains a large-scale protein language model (PLM) with thousands of millions of primary sequences utilizing the self-supervised learning paradigm, which will be used as an alternative to MSAs and templates for learning the co-evolution information. Then, by combining the pre-trained PLM and the essential components of AlphaFold2, we obtain an end-to-end differentiable model to predict the 3D coordinates of atoms from only the primary sequence.
For those who want to try out our model without any installation, we also provide an online interface PaddleHelix HelixFold-Single Forecast through web service.
To reproduce the results reported in our paper, specific environment settings are required as below.
- python: 3.7
- cuda: 11.2
- cudnn: 8.10.1
- nccl: 2.12.12
Except those listed in the requirements.txt
, PaddlePaddle dev
package is required to run HelixFold.
Visit here to install PaddlePaddle dev
. Also, we provide a package here if your machine environment is Nvidia A100 with
cuda=11.2.
python -m pip install -r requirements.txt
wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/HelixFold/paddlepaddle_gpu-0.0.0-cp37-cp37m-linux_x86_64.whl
python -m pip install paddlepaddle_gpu-0.0.0-cp37-cp37m-linux_x86_64.whl
Here we provide the trained model that can be used to reproduce the results of our paper.
wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/HelixFold-Single/helixfold-single.pdparams
To run the inference, what you need is a fasta file and the pre-downloaded trained model:
python helixfold_single_inference.py \
--init_model=./helixfold-single.pdparams \
--fasta_file=data/7O9F_B.fasta \
--output_dir="./output"
init_model
: the trained model.fasta_file
: the fasta_file file which contains the protein sequence to be predicted.output_dir
: the path to the output.
The output is organized as:
./output
unrelaxed.pdb
where unrelaxed.pdb
is the predicted pdb file.
If you use the code or data in this repository, please cite:
@article{fang2022helixfold_single,
title={HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein Language Model as an Alternative},
author={Fang, Xiaomin and Wang, Fan and Liu, Lihang and He, Jingzhou and Lin, Dayong and Xiang, Yingfei and Zhang, Xiaonan and Wu, Hua and Li, Hui and Song, Le},
journal={arXiv preprint arXiv:2207.13921},
year={2022}
}