An autoencoder trained for dimensionality reduction of protein structural features. It was used in my research project, "Common local protein structural motifs supported by specific amino acids". For more details, refer to my research report and the SeqPredNN paper.
- Python: 3.12
- Packages: numpy, pandas, torch, scikit-learn, matplotlib, scipy, biopython, umap-learn
I recommend using Anaconda to install the packages in an isolated environment:
conda env create -f environment.yaml
conda activate autoencoder
python featurise.py -gm -o example_features examples/example_chain_list.csv examples/example_pdb_directory
This generates features (with SeqPredNN's featurise.py) from the PDB files in the example directory and writes chain lists to the output directory.
python inference.py example_features example_features/chain_list.txt pretrained_model/trained_model.pth -o inference --save_latent_space_vectors --plot_feature_space
This loads the pre-trained model, runs a forward pass of the specified features through the autoencoder, and saves the reconstructed features, latent vectors, and metrics.
python train_model.py example_features example_features/chain_list.txt -o model --balanced_sampling
This trains a new autoencoder on the specified features and saves the trained model along with training metrics. The model architecture and training parameters can optionally be customized.
The pre-trained model was trained on 21,690 X-ray crystallographic protein chains (selected with the PISCES server) with resolution ≤ 2 Å, R-factor ≤ 0.25, chain lengths of 40 to 10,000 residues, and sequence identity < 90%. The model has an input dimension of 180, hidden layers of 148 and 116 neurons, and a latent dimension of 84.
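The layer sizes quoted above shrink by 32 units at each step (180, 148, 116, 84). The sketch below reproduces that taper and estimates the number of trainable parameters. It rests on two assumptions that are not stated in this README: that every layer is fully connected with a bias term, and that the decoder mirrors the encoder. The helper names `taper` and `dense_params` are hypothetical and not part of this repository.

```python
def taper(input_dim, latent_dim, n_hidden):
    """Return evenly spaced layer widths from input_dim down to latent_dim."""
    step = (input_dim - latent_dim) / (n_hidden + 1)
    return [round(input_dim - i * step) for i in range(n_hidden + 2)]


def dense_params(widths):
    """Count weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))


# Encoder widths as described for the pre-trained model.
encoder = taper(180, 84, n_hidden=2)   # [180, 148, 116, 84]

# ASSUMPTION: the decoder mirrors the encoder (84 -> 116 -> 148 -> 180).
decoder = encoder[::-1]

encoder_params = dense_params(encoder)  # 53900
decoder_params = dense_params(decoder)  # 53996
total_params = encoder_params + decoder_params

print(encoder, total_params)
```

If the actual model uses a different decoder or layers without biases, the parameter count will differ; the taper itself follows directly from the dimensions given above.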
The pretrained_model directory includes:
- Trained model weights (trained_model.pth)
- PISCES dataset list (pisces_pdb_list.fasta, pisces_pdb_list)
- PDB chain list for feature extraction (chain_list_pdb.csv)
- Feature chain list for training (chain_list_training.txt)
- Feature chain list for testing (chain_list_testing.txt)
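When assembling a custom chain list, it may help to check each chain against the same per-chain thresholds quoted for the pre-trained model's dataset (resolution ≤ 2 Å, R-factor ≤ 0.25, length of 40 to 10,000 residues). The following is a minimal sketch only: the `ChainRecord` class, the `passes_culling` function, and the example records are all made up for illustration and do not reflect the format of the PISCES files in this directory. The sequence identity cutoff is a pairwise criterion applied by PISCES itself, so it is not covered here.

```python
from dataclasses import dataclass


@dataclass
class ChainRecord:
    """Hypothetical per-chain metadata; not the repository's file format."""
    chain_id: str
    resolution: float  # in Å
    r_factor: float
    length: int        # number of residues


def passes_culling(rec, max_resolution=2.0, max_r_factor=0.25,
                   min_length=40, max_length=10_000):
    """Check the per-chain thresholds quoted for the training set."""
    return (rec.resolution <= max_resolution
            and rec.r_factor <= max_r_factor
            and min_length <= rec.length <= max_length)


# Made-up records, chosen so that each rejected one fails a single threshold.
records = [
    ChainRecord("chain_ok", resolution=1.8, r_factor=0.19, length=212),
    ChainRecord("chain_low_res", resolution=2.4, r_factor=0.21, length=150),
    ChainRecord("chain_short", resolution=1.5, r_factor=0.18, length=25),
]

kept = [rec.chain_id for rec in records if passes_culling(rec)]
print(kept)  # ['chain_ok']
```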
This software and code are distributed under the MIT License.
Common local protein structural motifs supported by specific amino acids. Report link (2024).
Lategan, F.A., Schreiber, C. & Patterton, H.G. SeqPredNN: a neural network that generates protein sequences that fold into specified tertiary structures. BMC Bioinformatics 24, 373 (2023). https://doi.org/10.1186/s12859-023-05498-4