Autoencoder for Protein Structural Features

An autoencoder trained for dimensionality reduction of protein structural features. I used it in my research on common local protein structural motifs supported by specific amino acids. For more details, see my research report and the SeqPredNN paper.

Requirements

  • Python: 3.12
  • Packages: numpy, pandas, torch, scikit-learn, matplotlib, scipy, biopython, umap-learn

Getting Started

1. Environment Setup

conda env create -f environment.yaml
conda activate autoencoder

I recommend using Anaconda to install the packages in a contained environment.

2. Feature Generation

python featurise.py -gm -o example_features examples/example_chain_list.csv examples/example_pdb_directory

This runs SeqPredNN's featurise.py to generate features from the PDB files in the example directory and writes the feature files and chain lists to the output directory.

3. Dimensionality Reduction

python inference.py example_features example_features/chain_list.txt pretrained_model/trained_model.pth -o inference --save_latent_space_vectors --plot_feature_space

This loads the pre-trained model, runs the specified features through a forward pass of the autoencoder, and saves the reconstructed features, latent-space vectors, and evaluation metrics.
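Conceptually, the forward pass encodes each 180-dimensional feature vector into an 84-dimensional latent vector and decodes it back. A minimal NumPy sketch of that idea, using the layer sizes from the pre-trained model specification below; the random weights are stand-ins for learned parameters, and none of this is the repository's actual model code:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Encoder layer sizes (input -> latent); the decoder mirrors them.
# Random stand-in weights; a trained model would load these from trained_model.pth.
dims = [180, 148, 116, 84]
enc_w = [rng.normal(size=(a, b)) * 0.05 for a, b in zip(dims, dims[1:])]
rev = dims[::-1]
dec_w = [rng.normal(size=(a, b)) * 0.05 for a, b in zip(rev, rev[1:])]

features = rng.normal(size=(10, 180))   # 10 residues' structural feature vectors

latent = features
for w in enc_w:
    latent = relu(latent @ w)           # encode: 180 -> 148 -> 116 -> 84

reconstruction = latent
for i, w in enumerate(dec_w):
    reconstruction = reconstruction @ w  # decode: 84 -> 116 -> 148 -> 180
    if i < len(dec_w) - 1:
        reconstruction = relu(reconstruction)

print(latent.shape, reconstruction.shape)  # (10, 84) (10, 180)
```

The latent vectors are the dimensionality-reduced representation; the reconstruction is compared against the input to compute the metrics.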

4. Training a Model

python train_model.py example_features example_features/chain_list.txt -o model --balanced_sampling

This trains a new autoencoder on the specified features and saves the trained model along with training metrics. The model architecture and training parameters can be customized via command-line options.
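The core of training is minimizing the reconstruction error between the input features and the autoencoder's output. A hedged PyTorch sketch of that loop, assuming MSE loss, ReLU activations, and the Adam optimizer (these choices are assumptions, not the actual contents of train_model.py):

```python
import torch
from torch import nn

# Illustrative architecture following the README's stated dimensions
# (180 -> 148 -> 116 -> 84, with a mirrored decoder); activations are assumed.
encoder = nn.Sequential(nn.Linear(180, 148), nn.ReLU(),
                        nn.Linear(148, 116), nn.ReLU(),
                        nn.Linear(116, 84))
decoder = nn.Sequential(nn.Linear(84, 116), nn.ReLU(),
                        nn.Linear(116, 148), nn.ReLU(),
                        nn.Linear(148, 180))
model = nn.Sequential(encoder, decoder)

optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
features = torch.randn(64, 180)   # placeholder batch of structural features

for _ in range(5):                # a few illustrative optimization steps
    optimiser.zero_grad()
    loss = loss_fn(model(features), features)  # reconstruction error
    loss.backward()
    optimiser.step()
```

In the real script, batches would come from the featurised dataset rather than random tensors, and options such as --balanced_sampling control how chains are drawn.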


Pre-trained Model Specifications

The pre-trained model was trained on 21,690 X-ray crystallographic protein chains (selected with the PISCES server) with resolution ≤ 2 Å, R-factor ≤ 0.25, chain lengths between 40 and 10,000 residues, and sequence identity < 90%. The model architecture consists of an input dimension of 180, hidden layers of 148 and 116 neurons, and a latent dimension of 84.

The pretrained_model directory includes trained_model.pth, the model weights used in the inference step above.

License

This software is distributed under the MIT License.

Citation

Common local protein structural motifs supported by specific amino acids. Report link (2024).

Lategan, F.A., Schreiber, C. & Patterton, H.G. SeqPredNN: a neural network that generates protein sequences that fold into specified tertiary structures. BMC Bioinformatics 24, 373 (2023). https://doi.org/10.1186/s12859-023-05498-4