This is the official implementation of the LatentDiff method proposed in the following paper.
Cong Fu*, Keqiang Yan*, Limei Wang, Wing Yee Au, Michael McThrow, Tao Komikado, Koji Maruhashi, Kanji Uchino, Xiaoning Qian, Shuiwang Ji, "A Latent Diffusion Model for Protein Structure Generation", the Second Learning on Graphs Conference (LoG) 2023.
We list the key dependencies below, with the versions we used in parentheses. Our detailed environment setup is available in environment.yml.
- PyTorch (1.11.0)
- PyTorch Geometric (2.1.0)
- biopython (1.79)
- biotite (0.34.1)
- tmalign (20170708)
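Since the full setup lives in environment.yml, one way to reproduce it is with conda. The following is a minimal sketch, assuming conda is installed and that environment.yml sits in the directory you pass in; the helper name create_env is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of the repository): create a conda
# environment from an environment file, defaulting to ./environment.yml.
create_env() {
    env_file="${1:-environment.yml}"
    # Fail early with a clear message if the file is not where we expect it.
    if [ ! -f "$env_file" ]; then
        echo "missing environment file: $env_file" >&2
        return 1
    fi
    # The environment name is taken from the file itself.
    conda env create -f "$env_file"
}
```

Run create_env from the repository root, then activate the environment whose name is declared inside environment.yml.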
- We curate protein data from the Protein Data Bank and AlphaFold DB.
- All curated data is available for download from Google Drive.
Download the data from Google Drive and unzip all the datasets in the data folder.
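The unzip step can be sketched as a small loop. This assumes the downloaded archives are .zip files that you have already placed in the data folder; the archive names are not specified here, so the loop simply picks up every .zip it finds. The helper name unzip_all is hypothetical.

```shell
# Hypothetical helper: extract every .zip archive found in a folder
# (default: data) into that same folder.
unzip_all() {
    dir="${1:-data}"
    for archive in "$dir"/*.zip; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$archive" ] || continue
        # -o overwrites existing files, -q keeps output quiet, -d sets the target folder.
        unzip -o -q "$archive" -d "$dir"
    done
}
```

Call unzip_all from the repository root once the downloads have finished.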
The training process has two stages: (1) train the protein autoencoder; (2) train the diffusion model in the latent protein space.
First, we train the protein autoencoder:
cd scripts
bash train_autoencoder.sh
After training the protein autoencoder, a folder containing the trained model will be created. The folder name is <time stamp> + <suffix>, where the suffix is defined in train_autoencoder.sh.
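Later steps need the name of this folder, so it can help to look it up rather than copy it by hand. The sketch below picks the most recently modified folder whose name ends with a given suffix. The helper name latest_run_dir is hypothetical, and the directory in which training creates the folder is an assumption you need to supply.

```shell
# Hypothetical helper: print the newest sub-directory of $1 whose name
# ends with the suffix $2 (the suffix set in train_autoencoder.sh).
latest_run_dir() {
    root="$1"
    suffix="$2"
    # ls -t sorts by modification time, newest first; -d lists the
    # directories themselves instead of their contents.
    ls -td "$root"/*"$suffix"/ 2>/dev/null | head -n 1
}
```

For example, latest_run_dir . _mysuffix prints the newest matching folder under the current directory, with a trailing slash, or nothing if no folder matches.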
Next, we need to generate training data in the latent protein space using the trained encoder:
cd data
Run all the cells in gen_data_for_diffusion.ipynb
Before running the notebook, replace <path of protein autoencoder checkpoint> and <latent_data_name> in gen_data_for_diffusion.ipynb. Please save the latent protein data in the data folder to avoid path errors in the following steps.
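If you prefer not to open the notebook interactively, it can also be executed from the command line once the two placeholders have been filled in. This is a sketch that assumes Jupyter with nbconvert is available in your environment, which is not stated above; the helper name run_notebook is hypothetical.

```shell
# Hypothetical helper: execute all cells of a notebook in place.
run_notebook() {
    notebook="$1"
    if [ ! -f "$notebook" ]; then
        echo "notebook not found: $notebook" >&2
        return 1
    fi
    # --execute runs every cell; --inplace writes the outputs back to the same file.
    jupyter nbconvert --to notebook --execute --inplace "$notebook"
}
```

Run it from inside the data folder, for example (cd data && run_notebook gen_data_for_diffusion.ipynb), so that relative paths in the notebook resolve as they would interactively.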
Then, we can start training the latent diffusion model:
cd scripts
bash train_diffusion.sh
Note that latent_dataname in train_diffusion.sh must match the <latent_data_name> you set in the previous step.
The diffusion model framework is adapted from EDM.
cd scripts
source gen_diffusion_analysis.sh
There are some variables you need to set in gen_diffusion_analysis.sh:
- autoencoder_path: name of the root folder containing the trained autoencoder model (automatically created when training the autoencoder, with name <time stamp> + <suffix>)
- latent_data_name: the same as latent_dataname in train_diffusion.sh
- diffusion_model_path: name of the root folder containing the trained diffusion model
To run gen_diffusion_analysis.sh, you also need to install OmegaFold. OmegaFold should be installed in a separate conda environment named omegafold.
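Because the latent data name is set in two scripts under slightly different variable names, a quick consistency check before generation can save a failed run. The sketch below assumes both scripts assign the value on a plain name=value line (optionally quoted, with no trailing comment); that layout is an assumption, and the helper names read_var and check_latent_names are hypothetical.

```shell
# Hypothetical helper: print the value assigned to variable $2 in script $1.
# Assumes a plain `name=value` line; surrounding quotes are stripped.
read_var() {
    sed -n "s/^[[:space:]]*$2=//p" "$1" | head -n 1 | tr -d "\"'"
}

# Compare latent_dataname (training script, $1) with
# latent_data_name (generation script, $2).
check_latent_names() {
    train_name=$(read_var "$1" latent_dataname)
    gen_name=$(read_var "$2" latent_data_name)
    if [ -n "$train_name" ] && [ "$train_name" = "$gen_name" ]; then
        echo "ok: $train_name"
    else
        echo "mismatch: '$train_name' vs '$gen_name'" >&2
        return 1
    fi
}
```

From the scripts folder, check_latent_names train_diffusion.sh gen_diffusion_analysis.sh prints the shared name on success and returns a non-zero status otherwise.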
@inproceedings{fu2023latent,
title={A Latent Diffusion Model for Protein Structure Generation},
author={Fu, Cong and Yan, Keqiang and Wang, Limei and Au, Wing Yee and McThrow, Michael and Komikado, Tao and Maruhashi, Koji and Uchino, Kanji and Qian, Xiaoning and Ji, Shuiwang},
booktitle={The Second Learning on Graphs Conference},
year={2023}
}