ProtFill

ProtFill is an inpainting model for protein sequence and structure co-design. It works on antibodies as well as other proteins.

[Figure: ProtFill architecture]

Our model uses custom GVPe message passing layers, a modification of GVP (geometric vector perceptron) layers that adds edge updates.

[Figure: GVPe layer]

Installation

cd protfill
conda create --name protfill python=3.10
conda activate protfill
python -m pip install .
python -m pip install torch_geometric torch_scatter

Data

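The datasets can be downloaded with proteinflow. The first command below downloads the diverse dataset. The remaining commands download the antibody (SAbDab) dataset without splitting it, replace its splits dictionary with the one in data/splits_dict, and then split it.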

proteinflow download --tag 20230102_stable

proteinflow download --tag 20230626_sabdab --skip_splitting
rm -r data/proteinflow_20230626_sabdab/splits_dict/
cp -r data/splits_dict data/proteinflow_20230626_sabdab/
proteinflow split --tag 20230626_sabdab

Configs

There are four models in this repository, and each can be tested or replicated with its corresponding config file. The table below explains the differences between them. The noising scheme refers to either replacing the masked data with samples from a Gaussian distribution (standard) or corrupting it with noise (alternative).

| Name | Dataset | Diffusion | Noising scheme |
| --- | --- | --- | --- |
| protfill_ab | antibody | no | standard |
| protfilldiff | antibody | yes | standard |
| protfill_ppi_standard_noising | diverse | no | standard |
| protfill_ppi_alternative_noising | diverse | no | alternative |
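
The difference between the two noising schemes can be illustrated with a toy sketch. This is not the repository's implementation: it assumes the masked data is a flat list of floats, that the Gaussian has zero mean and unit variance, and that the alternative scheme adds noise to the original values. The function names and the `scale` parameter are hypothetical.

```python
import random


def standard_noising(values, mask, rng):
    """Replace masked entries with fresh Gaussian samples (original values are discarded)."""
    return [rng.gauss(0.0, 1.0) if m else v for v, m in zip(values, mask)]


def alternative_noising(values, mask, rng, scale=1.0):
    """Keep masked entries but perturb them with additive Gaussian noise (assumed form)."""
    return [v + scale * rng.gauss(0.0, 1.0) if m else v for v, m in zip(values, mask)]


rng = random.Random(0)
values = [10.0, 20.0, 30.0, 40.0]
mask = [False, True, True, False]  # True marks the region to redesign

print(standard_noising(values, mask, rng))     # masked entries no longer depend on 20.0 and 30.0
print(alternative_noising(values, mask, rng))  # masked entries stay centred on 20.0 and 30.0
```

In both cases the unmasked entries pass through unchanged. Only the treatment of the masked region differs.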

Training

To retrain one of the models, run the following command, replacing NAME with one of the config names from the table above and DATASET_PATH with the path to the downloaded dataset.

protfill --config configs/train/NAME.yaml --dataset_path DATASET_PATH

An example can look like this.

protfill --config configs/train/protfill_ab.yaml --dataset_path data/proteinflow_20230626_sabdab

Validation

In order to test one of our pre-trained models on the 'easy' test subset, run the following.

protfill --config configs/test/NAME.yaml --dataset_path DATASET_PATH --easy_test

To test on the 'hard' subset, replace --easy_test with --hard_test. To test on a specific CDR, add e.g. --redesign_cdr H3. Note that the 'hard' antibody subset does not contain light chains, and the diverse dataset has neither CDRs nor an 'easy' test subset.
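
The restrictions above are easy to get wrong when scripting many test runs. Below is a minimal sketch of a helper that assembles the command and rejects unsupported combinations. The helper itself and its `dataset_kind` argument are hypothetical and not part of protfill. It only uses the flags documented in this README, and it assumes that light-chain CDRs are named with an `L` prefix (e.g. `L1`).

```python
import shlex


def build_test_command(name, dataset_path, dataset_kind, subset="easy", redesign_cdr=None):
    """Return the argument list for a protfill test run, rejecting unsupported combinations."""
    if dataset_kind not in ("antibody", "diverse"):
        raise ValueError(f"unknown dataset kind: {dataset_kind!r}")
    if subset not in ("easy", "hard"):
        raise ValueError(f"unknown test subset: {subset!r}")
    if dataset_kind == "diverse":
        if subset == "easy":
            raise ValueError("the diverse dataset has no 'easy' test subset")
        if redesign_cdr is not None:
            raise ValueError("the diverse dataset has no CDRs")
    # Assumption: light-chain CDR names start with "L".
    if subset == "hard" and redesign_cdr is not None and redesign_cdr.startswith("L"):
        raise ValueError("the 'hard' antibody subset does not contain light chains")

    args = [
        "protfill",
        "--config", f"configs/test/{name}.yaml",
        "--dataset_path", dataset_path,
        f"--{subset}_test",
    ]
    if redesign_cdr is not None:
        args += ["--redesign_cdr", redesign_cdr]
    return args


cmd = build_test_command(
    "protfill_ab", "data/proteinflow_20230626_sabdab", "antibody",
    subset="hard", redesign_cdr="H3",
)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues in dataset paths.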

Generation

To redesign part of a new structure, run the following command. The input file can have either a .pdb or a .pickle extension; pickle files are those generated by proteinflow.

protfill --config configs/test/NAME.yaml --redesign_file 7kgk.pdb

By default, this command redesigns a random part of the protein. To redesign specific positions, use the --redesign_positions option. The argument has the format chain:start1-end1,start2-end2, e.g. A:5-10,20-21,30-40. The numbering is 0-indexed, and each range includes its start but excludes its end, so 5-10 selects positions 5 through 9. For PDB files, the chain name is the author-assigned chain ID. For pickle files, the numbering should be based on the FASTA sequence of the chain. If the file was generated by proteinflow with CDR information, the command can also be used with the --redesign_cdr option to redesign a specific CDR, e.g. --redesign_cdr H3.
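
To double-check which residues a --redesign_positions value selects, the format can be expanded by hand. The sketch below is an illustrative parser written from the description above, not the one protfill uses internally. The function name is hypothetical.

```python
def parse_redesign_positions(spec: str) -> tuple[str, list[int]]:
    """Expand 'chain:start1-end1,start2-end2' into (chain, sorted 0-indexed positions)."""
    chain, _, ranges = spec.partition(":")
    if not chain or not ranges:
        raise ValueError(f"expected 'chain:start-end,...', got {spec!r}")
    positions = set()
    for part in ranges.split(","):
        start_text, _, end_text = part.partition("-")
        start, end = int(start_text), int(end_text)
        if start >= end:
            raise ValueError(f"empty or reversed range: {part!r}")
        positions.update(range(start, end))  # start included, end excluded
    return chain, sorted(positions)


chain, positions = parse_redesign_positions("A:5-10,20-21,30-40")
print(chain, len(positions), positions[:6])
# A 16 [5, 6, 7, 8, 9, 20]
```

Note that 20-21 selects the single position 20, and position 40 is not part of the selection.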