USMPep is a simple recurrent neural network for MHC binding affinity prediction. A single model trained from scratch is already competitive with state-of-the-art tools, and ensembling multiple regressors and language model pretraining can slightly improve its performance further. In our paper we report the excellent predictive performance of USMPep on several benchmark datasets.
For a detailed description of the method and the experimental results, please refer to our paper:
USMPep: Universal Sequence Models for Major Histocompatibility Complex Binding Affinity Prediction
Johanna Vielhaben, Markus Wenzel, Wojciech Samek, and Nils Strodthoff
@article{Vielhaben:2020USMPep,
author = {Vielhaben, Johanna and Wenzel, Markus and Samek, Wojciech and Strodthoff, Nils},
title = {{USMPep: Universal Sequence Models for Major Histocompatibility Complex Binding Affinity Prediction}},
journal = {BMC Bioinformatics},
year = {2020},
month={Jul},
volume = {21},
number = {1},
pages={279},
issn={1471-2105},
doi = {10.1186/s12859-020-03631-1},
url= {https://doi.org/10.1186/s12859-020-03631-1}
}
This is the accompanying code repository where we also provide a pretrained language model and predictions of our models on the test datasets discussed in our paper.
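We present an extended version of USMPep, which we evaluated on a recent SARS-CoV-2 dataset, in our paper:
Predicting the Binding of SARS-CoV-2 Peptides to the Major Histocompatibility Complex with Recurrent Neural Networks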
Johanna Vielhaben, Markus Wenzel, Eva Weicken, Nils Strodthoff
@misc{Vielhaben:2021USMPep,
title={Predicting the Binding of SARS-CoV-2 Peptides to the Major Histocompatibility Complex with Recurrent Neural Networks},
author={Johanna Vielhaben and Markus Wenzel and Eva Weicken and Nils Strodthoff},
year={2021},
eprint={2104.08237},
archivePrefix={arXiv},
primaryClass={q-bio.QM}
}
USMPep builds on the UDSMProt-framework: Universal Deep Sequence Models for Protein Classification
- for training/evaluation: pytorch, fastai, fire
- for dataset creation: numpy, pandas, scikit-learn, biopython, sentencepiece, lxml
We recommend using conda as Python package and environment manager.
Either install the environment using the provided proteomics.yml
by running conda env create -f proteomics.yml
or follow the steps below:
- Create conda environment:
conda create -n proteomics
and conda activate proteomics
- Install pytorch:
conda install pytorch -c pytorch
- Install fastai:
conda install -c fastai fastai=1.0.52
- Install fire:
conda install fire -c conda-forge
- Install scikit-learn:
conda install scikit-learn
- Install Biopython:
conda install biopython -c conda-forge
- Install sentencepiece:
pip install sentencepiece
- Install lxml:
conda install lxml
Optionally (for support of threshold 0.4 clusters) install cd-hit and add cd-hit
to the default search path.
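After installation, it can be useful to confirm that the environment actually contains the packages listed above. The following is a minimal sketch of such a check and is not part of this repository: the script name `check_env.py` and the helpers `find_missing` and `has_cd_hit` are hypothetical, and the mapping from package names to import names (for example `scikit-learn` to `sklearn`, `biopython` to `Bio`) is our assumption about the standard distributions of these packages. It only tests whether each module can be located and does not verify versions.

```python
"""check_env.py (hypothetical helper): check that the listed packages are importable."""
import importlib.util
import shutil

# Package name as listed in this README -> top-level import name (assumed).
REQUIRED = {
    "pytorch": "torch",
    "fastai": "fastai",
    "fire": "fire",
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "biopython": "Bio",
    "sentencepiece": "sentencepiece",
    "lxml": "lxml",
}


def find_missing(packages):
    """Return the package names whose import module cannot be located."""
    missing = []
    for package, module in packages.items():
        try:
            spec = importlib.util.find_spec(module)
        except (ImportError, ValueError):
            # Treat lookup errors as "not available".
            spec = None
        if spec is None:
            missing.append(package)
    return missing


def has_cd_hit():
    """Return True if a `cd-hit` executable is on the search path (optional)."""
    return shutil.which("cd-hit") is not None


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages were found.")
    if not has_cd_hit():
        print("Optional: cd-hit was not found on the search path.")
```

Run it inside the activated environment, e.g. `conda activate proteomics` followed by `python check_env.py`. A missing cd-hit is only reported as a note, since it is optional.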
See the USMPep User Guide for extensive usage information.
A second User Guide provides usage information for the extended version of USMPep.
We provide the peptide binding affinity predictions of our tools; see the git-data folder and the corresponding readme file for details.