Recognizing the potential allergenicity of proteins is essential for ensuring their safety. Allergens are a major concern in protein safety assessment, especially with the increasing use of recombinant proteins in new medical products, and such proteins require careful allergenicity assessment. However, traditional laboratory testing for allergenicity is expensive and time-consuming. Bioinformatics offers efficient and cost-effective alternatives for predicting protein allergenicity. In this study, we developed an enhanced deep-learning model that predicts the potential allergenicity of proteins from their primary structure, represented as protein sequences. In simple terms, the model classifies proteins as allergenic or non-allergenic. Our approach uses two protein language models to extract distinct feature vectors for each sequence, which are then fed into a deep neural network for classification. Each feature vector captures a specific aspect of the protein sequence, and combining them improves the results. Finally, we combined the predictions of our top-performing models using ensemble modeling techniques, which helps balance the model's sensitivity and specificity and improves overall performance. Our proposed model improves on existing models, achieving a sensitivity of 97.91%, specificity of 97.69%, accuracy of 97.80%, and an area under the ROC curve of 99% under standard five-fold cross-validation.
bioRxiv DOI: https://doi.org/10.1101/2024.08.09.607419
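The abstract describes combining the predictions of several models and reports sensitivity, specificity, and accuracy. The sketch below illustrates one way this can work, assuming a simple soft-voting ensemble (averaging each model's predicted allergen probability) and a 0.5 decision threshold. The ensemble technique, the threshold, the function names, and the toy probabilities are all assumptions made for illustration; they are not taken from the notebooks.

```python
from statistics import mean


def soft_vote(prob_lists):
    """Average per-sample allergen probabilities across models (assumed soft voting)."""
    return [mean(sample) for sample in zip(*prob_lists)]


def binary_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, and accuracy for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,  # true positive rate
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,  # true negative rate
        "accuracy": (tp + tn) / len(y_true),
    }


def threshold(probs, cutoff=0.5):
    """Turn probabilities into 0/1 predictions (cutoff is an assumed value)."""
    return [1 if p >= cutoff else 0 for p in probs]


# Hypothetical toy data: 1 = allergenic, 0 = non-allergenic.
y_true = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.4, 0.8, 0.3, 0.6, 0.1]  # misses one allergen, flags one non-allergen
model_b = [0.8, 0.7, 0.6, 0.2, 0.3, 0.2]

single_metrics = binary_metrics(y_true, threshold(model_a))
ensemble_pred = threshold(soft_vote([model_a, model_b]))
ensemble_metrics = binary_metrics(y_true, ensemble_pred)

print("model A alone:", single_metrics)
print("ensemble:     ", ensemble_metrics)
```

In this toy case, averaging corrects both of model A's mistakes, which is the kind of sensitivity/specificity balancing the abstract refers to.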
- feature-extraction
  - 1. ESM-v2-embeddings.ipynb: Extracts embeddings using the ESM-v2 model. Input: protein sequences in FASTA format.
  - 2. ProtT5-embeddings.ipynb: Extracts embeddings using the ProtT5 model. Input: protein sequences in FASTA format.
  - 3. AAC-feature-vectors.ipynb: Generates amino acid composition (AAC) feature vectors. Input: protein sequences in FASTA format.
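As an illustration of the simplest of these features, the sketch below parses FASTA text and computes a 20-dimensional amino acid composition vector (the fraction of each standard residue in a sequence). It is a minimal standard-library sketch, not the code from AAC-feature-vectors.ipynb: the function names, the residue ordering, and the choice to ignore non-standard residues are assumptions.

```python
# The 20 standard amino acids in alphabetical one-letter order (assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequences may span several lines; collect the pieces.
            records[header].append(line.upper())
    return {h: "".join(parts) for h, parts in records.items()}


def aac_vector(sequence):
    """Return the fraction of each standard amino acid in the sequence.

    Non-standard symbols (e.g. X or B) are ignored in both the counts and the
    denominator; this is an assumption of the sketch.
    """
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


# Hypothetical two-line FASTA record used only for demonstration.
fasta_text = ">seq1 hypothetical\nACDA\nAA\n"
records = read_fasta(fasta_text)
vec = aac_vector(records["seq1 hypothetical"])
print(records)
print(dict(zip(AMINO_ACIDS, vec)))
```

Unlike the language-model embeddings, this vector has a fixed length of 20 regardless of sequence length and needs no pretrained model.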
- modeling
  - classic-machine-learning.ipynb: Training and evaluation of classic machine learning models, including SVM, RF, XGBoost, and KNN. This notebook also tests the effect of hyperparameter tuning and of an autoencoder.
  - nonlinear-DNN.ipynb: Training and evaluation of our top-performing deep neural network models, using ESM-v2 and ProtT5 embeddings and AAC feature vectors.
  - single-layer-LSTM.ipynb: Training and evaluation of a single-layer LSTM (Long Short-Term Memory) model.
  - 1D-CNN.ipynb: Training and evaluation of a 1-dimensional CNN (convolutional neural network) model.
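The reported results use standard five-fold cross-validation. The sketch below shows how such a split can be built so that each fold keeps roughly the same allergen/non-allergen ratio. It is a standard-library illustration under assumptions: the function name, the seed, and the stratification itself are not taken from the notebooks, which may use a library routine such as scikit-learn's `StratifiedKFold` instead.

```python
import random


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with class ratios preserved per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        # Shuffle the indices of one class, then deal them out round-robin.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


# Hypothetical balanced toy labels: 10 allergens (1) and 10 non-allergens (0).
labels = [1] * 10 + [0] * 10
splits = list(stratified_kfold(labels, k=5))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    positives = sum(labels[i] for i in test_idx)
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)} positives={positives}")
```

Each sample appears in exactly one test fold, so metrics averaged over the five folds cover the whole dataset once.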
- model-checkpoints
  - Contains saved checkpoints of the trained models required for the nonlinear-DNN notebook.
The dataset used in this study comprises the public AlgPred 2.0 training and validation sets, which are available here.
- Feature Extraction:
  - Navigate to the feature-extraction folder and run the notebooks to extract the necessary feature vectors from protein sequences. Input: protein sequences in FASTA format.
- Model Training and Evaluation:
  - Navigate to the modeling folder.
  - Open and run the nonlinear-DNN.ipynb notebook to train and evaluate the deep neural network model. Ensure the required model checkpoints are available in the model-checkpoints folder.
  - For other models, run the respective notebooks (classic-machine-learning.ipynb, single-layer-LSTM.ipynb, 1D-CNN.ipynb).