Autoencoder for Protein Structural Features

An autoencoder trained for dimensionality reduction of protein structural features. It was used in my research project, "Common local protein structural motifs supported by specific amino acids". For more details, refer to my research report and the SeqPredNN paper.

Requirements

Python: 3.12
Packages: numpy, pandas, torch, scikit-learn, matplotlib, scipy, biopython, umap-learn

Getting Started

1. Environment Setup

conda env create -f environment.yaml
conda activate autoencoder

I recommend using Anaconda so that the packages are installed in a self-contained environment.
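
Once the environment is active, a quick check can confirm that the packages listed under Requirements are importable. The following is a minimal sketch, not part of this repository: the function name `check_environment` is hypothetical, and the only facts it relies on are the usual import names of the listed packages (scikit-learn imports as `sklearn`, biopython as `Bio`, umap-learn as `umap`).

```python
from importlib.util import find_spec

# Install name -> import name for the packages listed under Requirements.
# Some import names differ from the install name.
REQUIRED_MODULES = {
    "numpy": "numpy",
    "pandas": "pandas",
    "torch": "torch",
    "scikit-learn": "sklearn",
    "matplotlib": "matplotlib",
    "scipy": "scipy",
    "biopython": "Bio",
    "umap-learn": "umap",
}


def check_environment(modules=REQUIRED_MODULES):
    """Return {install name: True/False} depending on whether the module is found."""
    # find_spec locates a top-level module without importing it.
    return {pkg: find_spec(mod) is not None for pkg, mod in modules.items()}


if __name__ == "__main__":
    for pkg, ok in check_environment().items():
        print(f"{pkg:<14}{'ok' if ok else 'MISSING'}")
```

Any package reported as MISSING suggests that the conda environment was not created or activated correctly.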

2. Feature Generation

python featurise.py -gm -o example_features examples/example_chain_list.csv examples/example_pdb_directory

This generates features from the PDB files in the example directory, using SeqPredNN's featurise.py, and writes chain lists to the output directory.

3. Dimensionality Reduction

python inference.py example_features example_features/chain_list.txt pretrained_model/trained_model.pth -o inference --save_latent_space_vectors --plot_feature_space

This loads the pre-trained model and the specified features, runs a forward pass through the autoencoder, and saves the reconstructed features, latent vectors, and metrics.
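
To illustrate what a forward pass through an autoencoder produces, here is a rough sketch in PyTorch. It does not reproduce inference.py: the `encoder` and `decoder` below are untrained single-layer stand-ins, the input is random placeholder data, and mean squared error is only assumed as a representative reconstruction metric. Only the dimensions 180 and 84 are taken from the pre-trained model's description.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the real encoder and decoder. The sizes match
# the pre-trained model's input (180) and latent (84) dimensions.
encoder = nn.Linear(180, 84)
decoder = nn.Linear(84, 180)


def run_inference(features):
    """Return (latent vectors, reconstructions, per-row MSE) for a 2-D tensor."""
    encoder.eval()
    decoder.eval()
    with torch.no_grad():  # gradients are not needed for inference
        latent = encoder(features)
        reconstructed = decoder(latent)
        # Mean squared error over the feature axis: one value per input row
        mse = ((reconstructed - features) ** 2).mean(dim=1)
    return latent, reconstructed, mse


features = torch.randn(5, 180)  # random placeholder for 5 feature vectors
latent, reconstructed, mse = run_inference(features)
print(latent.shape, reconstructed.shape, mse.shape)
```

The latent vectors are the reduced 84-dimensional representation, while the per-row error indicates how well each input is reconstructed from it.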

4. Training a Model

python train_model.py example_features example_features/chain_list.txt -o model --balanced_sampling

This trains a new autoencoder on the specified features and saves the trained model along with its training metrics. The model architecture and training parameters can optionally be customized.
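
For readers unfamiliar with how an autoencoder is trained, the core idea is that the input also serves as the target. The sketch below is a generic illustration under several assumptions, not the logic of train_model.py: the function `train_autoencoder`, the single-hidden-layer model, the Adam optimiser, the MSE loss, and the full-batch updates are all choices made here for brevity, and the data is random.

```python
import torch
from torch import nn


def train_autoencoder(features, latent_dim=84, epochs=20, lr=1e-3, seed=0):
    """Fit a small autoencoder to `features`; return (model, loss history)."""
    torch.manual_seed(seed)
    n_features = features.shape[1]
    model = nn.Sequential(
        nn.Linear(n_features, latent_dim),  # encoder
        nn.ReLU(),
        nn.Linear(latent_dim, n_features),  # decoder
    )
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    history = []
    for _ in range(epochs):
        optimiser.zero_grad()
        loss = loss_fn(model(features), features)  # the target is the input itself
        loss.backward()
        optimiser.step()
        history.append(loss.item())
    return model, history


features = torch.randn(256, 180)  # random placeholder data, not real features
model, history = train_autoencoder(features)
print(f"loss: {history[0]:.4f} -> {history[-1]:.4f}")
```

A real training run would typically iterate over mini-batches and hold out a test set; those details are omitted here.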

Pre-trained Model Specifications

The pre-trained model was trained on 21,690 X-ray crystallographic protein chains (generated from the PISCES server) with resolution ≤ 2 Å, R-factor ≤ 0.25, chain lengths between 40 and 10,000 residues, and sequence identity < 90%. The model architecture has an input dimension of 180, hidden layers of 148 and 116 neurons, and a latent dimension of 84.
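
The layer sizes above can be pictured with the following PyTorch sketch. Only the dimensions (180, 148, 116, 84) come from the description. The mirrored decoder, the ReLU activations, and the class layout are assumptions made for illustration and may differ from the definition in autoencoder.py.

```python
import torch
from torch import nn


class Autoencoder(nn.Module):
    """Fully connected autoencoder: 180 -> 148 -> 116 -> 84 -> 116 -> 148 -> 180."""

    def __init__(self, input_dim=180, hidden_dims=(148, 116), latent_dim=84):
        super().__init__()
        encoder_sizes = [input_dim, *hidden_dims, latent_dim]
        decoder_sizes = encoder_sizes[::-1]  # assumption: decoder mirrors the encoder
        self.encoder = self._mlp(encoder_sizes)
        self.decoder = self._mlp(decoder_sizes)

    @staticmethod
    def _mlp(sizes):
        layers = []
        for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
            layers.append(nn.Linear(n_in, n_out))
            if i < len(sizes) - 2:  # assumption: ReLU between layers, none after the last
                layers.append(nn.ReLU())
        return nn.Sequential(*layers)

    def forward(self, x):
        latent = self.encoder(x)
        return self.decoder(latent), latent


model = Autoencoder()
reconstructed, latent = model(torch.randn(4, 180))
print(reconstructed.shape, latent.shape)
```

Each encoder layer narrows the representation by 32 units, so the latent space holds a little under half as many values as the input.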

The pretrained_model directory includes:

Trained model weights (trained_model.pth)
PISCES dataset list (pisces_pdb_list.fasta, pisces_pdb_list)
PDB chain list for feature extraction (chain_list_pdb.csv)
Feature chain list for training (chain_list_training.txt)
Feature chain list for testing (chain_list_testing.txt)
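
The per-chain thresholds quoted for the PISCES dataset (resolution, R-factor, and chain length) can be expressed as a simple filter. This is only a sketch: the `ChainRecord` class, its field names, and the example entries are hypothetical, and the real list was produced by the PISCES server. The sequence identity cutoff is a comparison between chains and is not covered here.

```python
from dataclasses import dataclass


@dataclass
class ChainRecord:
    """Hypothetical per-chain metadata; field names are illustrative only."""
    chain_id: str
    length: int        # number of residues
    resolution: float  # in angstroms
    r_factor: float


def passes_criteria(chain, max_resolution=2.0, max_r_factor=0.25,
                    min_length=40, max_length=10_000):
    """Apply the per-chain thresholds quoted for the pre-trained model's dataset."""
    return (chain.resolution <= max_resolution
            and chain.r_factor <= max_r_factor
            and min_length <= chain.length <= max_length)


chains = [
    ChainRecord("1ABCA", 250, 1.8, 0.19),  # made-up entries for illustration
    ChainRecord("2XYZB", 35, 1.5, 0.20),   # too short
    ChainRecord("3DEFA", 400, 2.6, 0.22),  # resolution above the 2 Å cutoff
]
selected = [c.chain_id for c in chains if passes_criteria(c)]
print(selected)  # ['1ABCA']
```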

License

This software is distributed under the MIT License.

Citation

Common local protein structural motifs supported by specific amino acids. Report link (2024).

Lategan, F.A., Schreiber, C. & Patterton, H.G. SeqPredNN: a neural network that generates protein sequences that fold into specified tertiary structures. BMC Bioinformatics 24, 373 (2023). https://doi.org/10.1186/s12859-023-05498-4

