Autoencoder for Protein Structural Features

An autoencoder trained for dimensionality reduction of protein structural features. I used it in my research on common local protein structural motifs supported by specific amino acids. For more details, see my research report and the SeqPredNN paper.

Requirements

  • Python: 3.12
  • Packages: numpy, pandas, torch, scikit-learn, matplotlib, scipy, biopython, umap-learn

Getting Started

1. Environment Setup

conda env create -f environment.yaml
conda activate autoencoder

I recommend using Anaconda to install the packages in a contained environment.

2. Feature Generation

python featurise.py -gm -o example_features examples/example_chain_list.csv examples/example_pdb_directory

This runs SeqPredNN's featurise.py to generate features from the PDB files in the example directory and writes the feature files and chain lists to the output directory.

3. Dimensionality Reduction

python inference.py example_features example_features/chain_list.txt pretrained_model/trained_model.pth -o inference --save_latent_space_vectors --plot_feature_space

This loads the pre-trained model, runs the specified features through a forward pass of the autoencoder, and saves the reconstructed features, latent-space vectors, and evaluation metrics.
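Conceptually, the forward pass encodes each 180-dimensional feature vector into an 84-dimensional latent vector and decodes it back. A minimal NumPy sketch of that idea, using the layer sizes from the pre-trained model specification below; the random weights are stand-ins for learned parameters, and none of this is the repository's actual model code:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Encoder layer sizes (input -> latent); the decoder mirrors them.
# Random stand-in weights; a trained model would load these from trained_model.pth.
dims = [180, 148, 116, 84]
enc_w = [rng.normal(size=(a, b)) * 0.05 for a, b in zip(dims, dims[1:])]
rev = dims[::-1]
dec_w = [rng.normal(size=(a, b)) * 0.05 for a, b in zip(rev, rev[1:])]

features = rng.normal(size=(10, 180))   # 10 residues' structural feature vectors

latent = features
for w in enc_w:
    latent = relu(latent @ w)           # encode: 180 -> 148 -> 116 -> 84

reconstruction = latent
for i, w in enumerate(dec_w):
    reconstruction = reconstruction @ w  # decode: 84 -> 116 -> 148 -> 180
    if i < len(dec_w) - 1:
        reconstruction = relu(reconstruction)

print(latent.shape, reconstruction.shape)  # (10, 84) (10, 180)
```

The latent vectors are the dimensionality-reduced representation; the reconstruction is compared against the input to compute the metrics.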

4. Training a Model

python train_model.py example_features example_features/chain_list.txt -o model --balanced_sampling

This trains a new autoencoder on the specified features and saves the trained model along with training metrics. The model architecture and training parameters can be customized via command-line options.
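The core of training is minimizing the reconstruction error between the input features and the autoencoder's output. A hedged PyTorch sketch of that loop, assuming MSE loss, ReLU activations, and the Adam optimizer (these choices are assumptions, not the actual contents of train_model.py):

```python
import torch
from torch import nn

# Illustrative architecture following the README's stated dimensions
# (180 -> 148 -> 116 -> 84, with a mirrored decoder); activations are assumed.
encoder = nn.Sequential(nn.Linear(180, 148), nn.ReLU(),
                        nn.Linear(148, 116), nn.ReLU(),
                        nn.Linear(116, 84))
decoder = nn.Sequential(nn.Linear(84, 116), nn.ReLU(),
                        nn.Linear(116, 148), nn.ReLU(),
                        nn.Linear(148, 180))
model = nn.Sequential(encoder, decoder)

optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
features = torch.randn(64, 180)   # placeholder batch of structural features

for _ in range(5):                # a few illustrative optimization steps
    optimiser.zero_grad()
    loss = loss_fn(model(features), features)  # reconstruction error
    loss.backward()
    optimiser.step()
```

In the real script, batches would come from the featurised dataset rather than random tensors, and options such as --balanced_sampling control how chains are drawn.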


Pre-trained Model Specifications

The pre-trained model was trained on 21,690 X-ray crystallographic protein chains (selected with the PISCES server) with resolution ≤ 2 Å, R-factor ≤ 0.25, chain lengths between 40 and 10,000 residues, and sequence identity < 90%. The model architecture consists of an input dimension of 180, hidden layers of 148 and 116 neurons, and a latent dimension of 84.

The pretrained_model directory includes trained_model.pth, the model weights used in the inference step above.

License

This software is distributed under the MIT License.

Citation

Common local protein structural motifs supported by specific amino acids. Report link (2024).

Lategan, F.A., Schreiber, C. & Patterton, H.G. SeqPredNN: a neural network that generates protein sequences that fold into specified tertiary structures. BMC Bioinformatics 24, 373 (2023). https://doi.org/10.1186/s12859-023-05498-4