PitKubi edited this page Nov 4, 2024 · 27 revisions

MHC-validator

MHC-validator is machine learning software for rescoring immunopeptidomics data acquired with mass spectrometers. The input must be results from a database search engine such as Comet. MHC-validator learns from the peptide spectrum features reported by the search engine, MHC binding affinities from NetMHCpan4.1 and MHCflurry, and the peptide sequences themselves to better assess whether a potential immunopeptide is actually present in your mass spectrometry run.

MHC-validator can be built into commonly used immunopeptidomics pipelines. If you integrate MHC-validator into your pipeline, you can significantly boost the number of confidently identified immunopeptides. Depending on sample quality, we report 1.5- to 10-fold more peptide spectrum matches (PSMs) with MHC-validator than with commonly used rescoring tools (such as Percolator, DeepRescore, etc.). MHC-validator not only boosts the number of immunopeptides found, it is also highly specific at finding low-abundance immunopeptides in your samples.

Below is a brief high-level description of how MHC-validator works:

  1. First, MHC-validator loads the peptide sequences and their features. Based on these features and the knowledge of whether a PSM comes from a target or a decoy hit, MHC-validator learns how likely each peptide spectrum match is to be real. MHC-validator uses three types of features to predict this likelihood: a) the features reported by the search engine (target vs. decoy, mass, peptide length, charge, Xcorr, etc.), b) immunopeptide binding affinities reported by NetMHCpan4.1 and/or MHCflurry, and c) the peptide amino acid sequences themselves. The base algorithm learns from the search engine results only (termed MV); immunopeptide binding affinity assessment (MHC) and peptide sequence encoding (PE) can be added by the user via the available options. Let's assume we intend to use MHC-validator to its full potential and set the options 'sequence_encoding' (PE), 'netmhcpan' and 'mhcflurry' (MHC) to 'True' in this example.

  2. Once the sequences have been loaded, MHC-validator first uses NetMHCpan4.1 and MHCflurry to generate MHC binding affinity/elution scores and adds the results to the feature list from the database search. Based on these features, MHC-validator trains a neural network and finally assigns each peptide a probability of being a true hit. This first neural network can (if sequence encoding is set to True, as in our example) be connected to a second neural network that takes the amino acid sequences into account.

  3. Results are reported in the form of a q-value. Peptides with a q-value < 0.01 have less than a 1% chance of being a false positive (i.e., they are considered true hits at a 1% FDR cutoff).

  4. You can use the reported peptide spectrum matches for further analysis.
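The q-value filtering described in step 3 can be done with a simple command once MHC-validator has written its results table. The sketch below uses an invented tab-separated example file; the column name `q_value` and its position (column 3) are assumptions for illustration, so adjust them to match the header of your actual MHC-validator output.

```shell
# Sketch: keep PSMs passing a 1% FDR cutoff.
# The example table and the "q_value" column (column 3) are assumptions;
# check your real output file for the exact column layout.
cat > results_example.tsv <<'EOF'
peptide	score	q_value
SIINFEKL	0.98	0.001
AAAAAAAA	0.40	0.250
GILGFVFTL	0.95	0.005
EOF

# Keep the header plus every row with q_value < 0.01.
awk -F'\t' 'NR==1 || $3 < 0.01' results_example.tsv > results_1pct_fdr.tsv
cat results_1pct_fdr.tsv
```

Only the two confident peptides survive the cutoff; the high-q-value decoy-like row is dropped.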

Setting up an immunopeptidomics pipeline that includes MHC-validator

Now that we understand how MHC-validator works at a high level, let's provide some more information on setting up a pipeline in which MHC-validator is used. If you already have your own pipeline and know how to work with mass spectrometry data, it is easiest to jump to the last step and install MHC-validator as described in the README file. **All analysis is described for Linux users, as many of the tools don't work on Windows or macOS.**

Let's get started:

  1. Gather raw immunopeptidomics files from your mass spectrometry runs in the form of .raw, .mgf, .mzXML or .mzML files. For the Comet search, it is best to use .mgf input files.
  2. If your mass spectrometry files are not already in .mgf format, you can convert them using MSConvert from ProteoWizard. To convert vendor files, you must use the Wine/Docker version. Create a simple shell script from the example code below and execute it in the folder containing your files. The example assumes we have .raw files from the mass spectrometer acquisition and convert them to .mgf for use with the Comet search engine. The shell script should look as follows (adjust the paths for your system; you can copy the code below into a file, or download it here as 'msconvert.sh'):
```bash
#!/bin/bash
sudo docker run -it --rm -v /home/USER/FOLDER/TO_YOUR_RAW_FILES:/data proteowizard/pwiz-skyline-i-agree-to-the-vendor-licenses wine msconvert --filter "peakPicking true 1-2" /data/*.raw --mgf
```

Note: Don't forget to make the file executable (e.g., `chmod +x msconvert.sh`).

This will pull the ProteoWizard Docker image on first run and then convert your .raw files to .mgf files that you can search with Comet in the next step.
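Before starting the search, it is worth checking that every .raw file actually produced a matching .mgf file. The loop below is a small self-contained sketch: it creates dummy files in a temporary directory purely for illustration, whereas in practice you would run only the loop in your data folder.

```shell
# Sketch: verify that each .raw file has a matching .mgf after conversion.
# Dummy files are created here only so the example runs on its own;
# in a real pipeline, run just the for-loop in your data folder.
tmp=$(mktemp -d)
cd "$tmp"
touch run1.raw run2.raw run1.mgf   # run2.mgf is deliberately missing

missing=0
for f in *.raw; do
  if [ ! -e "${f%.raw}.mgf" ]; then
    echo "missing: ${f%.raw}.mgf"
    missing=$((missing + 1))
  fi
done
echo "files without a converted .mgf: $missing"
```

Any file names printed as missing should be re-converted before moving on.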

  3. Download Comet here (choose comet.linux.exe).
  4. Create or download a taxonomy-specific FASTA file, for example from the UniProt database (www.uniprot.org).
  5. Download the Comet parameters file 'immunopeptidomics_example_comet.params' and adjust the path to your FASTA file (and other parameters if needed; the provided parameters are what we use for immunopeptidomics Orbitrap mass spectrometry data).
  6. Put all your .mgf files, the FASTA file and the comet.linux.exe executable in the folder where you want to perform the search.
  7. Run Comet using another simple shell script (see below) in which you only have to adjust the FASTA file name. You can download the file as run_comet.sh and place it in your search folder:
```bash
#!/bin/bash
COMET="./comet.linux.exe"
PARAMS="./immunopeptidomics_example_comet.params"
DIRECTORY="."
FASTA="./FASTA_FILE_NAME.fasta"
$COMET -P$PARAMS -D$FASTA "$DIRECTORY"/*.mgf
```
  8. Make run_comet.sh executable and run it. Once the Comet search finishes, you will see that .pin files (Percolator input files) have been created in your folder.
  9. Use the .pin files as input for MHC-validator.
  10. Install MHC-validator and its dependencies and run the software on the .pin files as described here.
  11. Enjoy the results; hopefully you have some nice data now.
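Before handing the .pin files to MHC-validator, a quick sanity check can confirm they contain both target and decoy PSMs. In the tab-separated Percolator input format, the second column ("Label") is 1 for target PSMs and -1 for decoys. The tiny example .pin below is invented so the snippet is self-contained; point the awk command at your real files instead.

```shell
# Sketch: count target vs. decoy PSMs in a Comet .pin file.
# The example.pin content is fabricated for illustration only.
cat > example.pin <<'EOF'
SpecId	Label	ScanNr	Xcorr	Peptide	Proteins
run1_2_1	1	2	2.31	K.SIINFEKL.A	sp|P01234
run1_3_1	-1	3	0.87	K.LKEFNIIS.A	DECOY_sp|P01234
run1_4_1	1	4	1.95	R.GILGFVFTL.K	sp|P05678
EOF

# Skip the header, then tally the Label column.
awk -F'\t' 'NR>1 { if ($2 == 1) t++; else d++ }
            END { printf "targets: %d, decoys: %d\n", t, d }' example.pin
```

If either count is zero, something went wrong in the Comet search setup (e.g., the decoy search was not enabled in the parameters file).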

Use of Prosit retention time predictions to enhance the MHC-validator performance

This feature is not part of the original publication but has been added due to frequent requests from users. Retention time predictions from Prosit are added with the help of the Koina platform (https://koina.proteomicsdb.org/docs#post-/Prosit_2019_irt/infer), which provides access to the Prosit retention time prediction model (see: Gessulat, S., Schmidt, T., Zolg, D.P. et al. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat Methods 16, 509–518 (2019). https://doi.org/10.1038/s41592-019-0426-7).

MHC-validator aligns and then compares these predicted retention times to the observed retention times and calculates an estimated retention time error. This retention time error, or delta RT, is then added to the MHC-validator feature matrix to increase the rescoring performance of the MHC-validator model.
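The delta RT feature boils down to the absolute difference between observed and aligned predicted retention times. The sketch below illustrates just that comparison; the retention time values are invented, and the alignment step MHC-validator performs beforehand is omitted.

```shell
# Sketch: compute an absolute retention time error (delta RT) per peptide.
# Values are made up for illustration; the RT alignment step that
# MHC-validator performs before this comparison is not shown.
cat > rts_example.tsv <<'EOF'
peptide	observed_rt	predicted_rt
SIINFEKL	12.4	12.1
GILGFVFTL	25.0	26.2
EOF

# Append |observed_rt - predicted_rt| as a new delta_rt column.
awk -F'\t' 'NR==1 { print $0 "\tdelta_rt"; next }
            { d = $2 - $3; if (d < 0) d = -d; printf "%s\t%.1f\n", $0, d }' \
    rts_example.tsv > rts_with_delta.tsv
cat rts_with_delta.tsv
```

This delta_rt column is what gets appended to the MHC-validator feature matrix.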

As can be seen below, the above-mentioned delta RT is a good discriminator between decoy and target hits at a 100% FDR cutoff (label 1 is a potential target, label 0 is a decoy hit):

*[Figure: histogram_rts]*

In addition, when plotting the observed against the predicted retention times of all peptide spectrum matches (PSMs) in a file, the PSMs identified at 1% FDR using the basic MHC-validator model correlate well, as expected:

*[Figure: rts_scatter]*

Hence, retention time (RT) predictions from Prosit can be used to increase the performance of MHC-validator. The performance gain from adding retention time errors to the MHC-validator model is particularly useful when the alleles in the sample are not known and only the basic MHC-validator model (NN+PE) can be used. The respective performance increase from the RT feature can be seen in the ROC curves below:

*[Figure: rt_rocs]*

(x-axis: FDR; y-axis: number of PSMs)

How to add the retention time features to your MHC-validator analysis is explained on the README page.