Autoencoder for Protein Structural Features

An autoencoder trained for dimensionality reduction of protein structural features. The feature space describes the structural environment of an amino acid through the relative positions, orientations, and dihedral angles of neighbouring residues in Euclidean space. This was used for my research on Common local protein structural motifs supported by specific amino acids. For more details, refer to my research report and the SeqPredNN paper.

Requirements

  • Python: 3.12
  • Packages: numpy, pandas, torch, scikit-learn, matplotlib, scipy, biopython, umap-learn

Getting Started

1. Environment Setup

conda env create -f environment.yaml
conda activate autoencoder

I recommend using Anaconda to install the packages in a contained environment.
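
If you prefer to write the environment file by hand, a minimal environment.yaml consistent with the Requirements section could look like the sketch below. The conda-forge channel and the unpinned package versions are assumptions; the environment.yaml shipped with the repository is authoritative. (torch is installed via the pytorch conda package.)

name: autoencoder
channels:
  - conda-forge
dependencies:
  - python=3.12
  - numpy
  - pandas
  - pytorch
  - scikit-learn
  - matplotlib
  - scipy
  - biopython
  - umap-learn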

2. Feature Generation

python featurise.py -gm -o example_features examples/example_chain_list.csv examples/example_pdb_directory

This generates features (via SeqPredNN's featurise.py) from the PDB files in the example directory and writes the feature files and chain lists to the output directory.
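
To sanity-check the generated features before running the model, you can inspect the output directory with a short script. This sketch assumes the features are written as NumPy .npz archives, which is an assumption; adjust the glob pattern to whatever featurise.py actually emits.

import numpy as np
from pathlib import Path

feature_dir = Path("example_features")

# Print every array stored in each feature archive, with its shape and dtype.
for npz_path in sorted(feature_dir.glob("*.npz")):
    with np.load(npz_path) as archive:
        print(npz_path.name)
        for key in archive.files:
            print(f"  {key}: shape={archive[key].shape}, dtype={archive[key].dtype}")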

3. Dimensionality Reduction

python inference.py example_features example_features/chain_list.txt pretrained_model/trained_model.pth -o inference --save_latent_space_vectors --plot_feature_space

This loads the pre-trained model and the specified features, runs a forward pass through the autoencoder, and saves the reconstructed features, latent-space vectors, and evaluation metrics.
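
Since umap-learn is among the requirements, a common follow-up is to project the saved latent vectors to two dimensions for visual inspection. The sketch below assumes --save_latent_space_vectors writes a NumPy array named latent_space_vectors.npy into the inference directory; that filename is a guess, so check the actual output.

import numpy as np
import umap
import matplotlib.pyplot as plt

# Load the saved latent vectors (hypothetical filename) and project 84-D -> 2-D.
latent = np.load("inference/latent_space_vectors.npy")
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(latent)

plt.scatter(embedding[:, 0], embedding[:, 1], s=2, alpha=0.5)
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.title("Autoencoder latent space (2-D UMAP projection)")
plt.savefig("latent_umap.png", dpi=200)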

4. Training a Model

python train_model.py example_features example_features/chain_list.txt -o model --balanced_sampling

This trains a new autoencoder on the specified features and saves the trained model along with training metrics. The model architecture and training parameters can be customised through command-line options; a sketch of an autoencoder with the pre-trained model's layer sizes is shown below.
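
For orientation, here is a minimal sketch of an autoencoder with the pre-trained model's layer sizes (180 -> 148 -> 116 -> 84, mirrored in the decoder; see the specifications below) and a single MSE reconstruction step. The ReLU activations, Adam optimiser, and learning rate are assumptions; train_model.py's actual implementation may differ.

import torch
import torch.nn as nn

INPUT_DIM = 180   # structural feature dimension
LATENT_DIM = 84   # latent dimension of the pre-trained model

class Autoencoder(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(INPUT_DIM, 148), nn.ReLU(),
            nn.Linear(148, 116), nn.ReLU(),
            nn.Linear(116, LATENT_DIM),
        )
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 116), nn.ReLU(),
            nn.Linear(116, 148), nn.ReLU(),
            nn.Linear(148, INPUT_DIM),
        )

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        latent = self.encoder(x)
        return self.decoder(latent), latent

model = Autoencoder()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

batch = torch.randn(256, INPUT_DIM)  # stand-in batch; real features come from featurise.py
optimiser.zero_grad()
reconstruction, latent = model(batch)
loss = criterion(reconstruction, batch)  # reconstruction error against the input
loss.backward()
optimiser.step()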


Pre-trained Model Specifications

The pre-trained model was trained on 21,690 X-ray crystallographic protein chains (obtained from the PISCES server) with resolution ≤ 2 Å, R-factor ≤ 0.25, chain lengths between 40 and 10,000 residues, and sequence identity < 90%. The model architecture consists of an input dimension of 180, hidden layers of 148 and 116 neurons, and a latent dimension of 84.

The pretrained_model directory includes the trained model weights (trained_model.pth).
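
To check that the checkpoint matches the architecture you expect, you can list its contents. The sketch below assumes trained_model.pth was saved with torch.save; whether it holds a raw state_dict or a wrapping dictionary is not specified here, so both cases are handled.

import torch

checkpoint = torch.load("pretrained_model/trained_model.pth", map_location="cpu")

if isinstance(checkpoint, dict):
    # Either a state_dict (tensor values) or a checkpoint dict (nested entries).
    for key, value in checkpoint.items():
        shape = getattr(value, "shape", None)
        print(key, tuple(shape) if shape is not None else type(value).__name__)
else:
    print(type(checkpoint).__name__)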

License

This software and code is distributed under the MIT License.

Citation

Common local protein structural motifs supported by specific amino acids. Report link (2024).

Lategan, F.A., Schreiber, C. & Patterton, H.G. SeqPredNN: a neural network that generates protein sequences that fold into specified tertiary structures. BMC Bioinformatics 24, 373 (2023). https://doi.org/10.1186/s12859-023-05498-4
