PitKubi edited this page Jul 18, 2024 · 27 revisions

mhc-validator

MHC-validator is machine learning software for rescoring immunopeptidomics data acquired with mass spectrometers. The input must be the results of a database search engine such as Comet. MHC-validator learns from the peptide-spectrum features reported by the search engine, from MHC binding affinities predicted by NetMHCpan4.1 and MHCflurry, and from the peptide sequences themselves to better assess whether a potential immunopeptide in your mass spectrometry run is present or not.

MHC-validator can be built into commonly used immunopeptidomics pipelines. Doing so can significantly boost the number of confidently identified immunopeptides. Depending on the sample quality, we report 1.5- to 10-fold more peptide-spectrum matches (PSMs) with mhc-validator than with commonly used rescoring tools (e.g. Percolator, DeepRescore). MHC-validator not only boosts the number of immunopeptides found, it is also highly specific in finding low-abundance immunopeptides in your samples.

Below is a brief high-level description of how mhc-validator works:

  1. First, mhc-validator loads the peptide sequences and their features. Based on these features and on whether a peptide comes from a target or a decoy search, mhc-validator tries to learn how likely it is that a peptide-spectrum match is real. MHC-validator uses three types of features: a) the features reported by the search engine (target vs. decoy, mass, peptide length, charge, Xcorr etc.), b) immunopeptide binding affinities reported by NetMHCpan4.1 and/or MHCflurry and c) the peptide amino acid sequences themselves. The base algorithm learns from the search engine results only (termed MV); immunopeptide binding affinity assessment (MHC) and peptide sequence encoding (PE) can be added by the user through the available options. In this example, let's assume we intend to use MHC-validator to its full potential and set the options 'sequence_encoding' (PE), 'netmhcpan' and 'mhcflurry' (MHC) all to 'True'.

  2. Once the sequences have been loaded, mhc-validator uses NetMHCpan4.1 and MHCflurry to generate MHC binding affinities/elution scores and adds the results to the feature list taken from the database search results. Based on these features, MHC-validator trains a neural network and finally assigns each peptide a probability of being a true hit. This first neural network can (if sequence encoding is set to True, as it is in our example) be connected to a second neural network that takes the amino acid sequences into account.

  3. Results are reported in the form of a q-value. Peptides with a q-value < 0.01 pass a 1% FDR cutoff: among all peptides accepted at this threshold, about 1% or fewer are expected to be false positives.

  4. You can now use these peptides for further analysis.
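
To make the q-value in step 3 more concrete, here is a minimal sketch of the standard target-decoy estimate. It is not taken from the mhc-validator code base and may differ from what mhc-validator does internally. The function name `target_decoy_qvalues` and the two-column input format are assumptions made for illustration only.

```bash
#!/bin/bash
# Hypothetical helper (not part of mhc-validator): estimate q-values with the
# standard target-decoy approach.
# Input (stdin): one PSM per line, "<score> <label>", where label is 1 for a
# target and -1 for a decoy, and a higher score means a better match.
# Output: "<score> <label> <q-value>" for every PSM, best score first.
target_decoy_qvalues() {
    LC_ALL=C sort -k1,1gr | awk '
        {
            score[NR] = $1; label[NR] = $2
            if ($2 == 1) targets++; else decoys++
            # FDR estimate when accepting every PSM down to this score
            fdr[NR] = (targets > 0) ? decoys / targets : 1
        }
        END {
            # q-value = lowest FDR at which this PSM is still accepted,
            # i.e. the running minimum of the FDR from the worst score upwards
            qmin = 1
            for (i = NR; i >= 1; i--) {
                if (fdr[i] < qmin) qmin = fdr[i]
                q[i] = qmin
            }
            for (i = 1; i <= NR; i++)
                printf "%s %s %.4f\n", score[i], label[i], q[i]
        }'
}
```

With a hypothetical file `scores.txt` in that format, `target_decoy_qvalues < scores.txt | awk '$2 == 1 && $3 < 0.01'` would keep only the target PSMs that pass a 1% FDR cutoff.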

Setting up a pipeline with mhc-validator

Now that we understand how mhc-validator works at a high level, here is some more information on setting up a pipeline that uses mhc-validator. If you already have your own pipeline and know how to work with mass spectrometry data, it is easiest to just install mhc-validator and set it up as described in the readme file. **All analysis is described for a Linux user, as many tools don't work on Windows or macOS.**

Let's get started:

  1. Gather raw immunopeptidomic files in the form of .raw, .mgf, .mzxml or .mzml files. For the Comet search, .mgf files work best as input.
  2. If you don't have the mass spectrometry files in the form of .mgf files, you can convert them using msconvert from ProteoWizard. To be able to convert vendor files, you must use the wine/docker version. This can be done by creating a simple shell script from the example code below and executing it in the folder where your files are present. The example assumes that we have .raw files from the mass spectrometer acquisition and are converting them to .mgf for use with the Comet search engine. The shell script should look as follows (the path has to be adjusted for your system; the script can be downloaded here as 'msconvert.sh'):

    #!/bin/bash
    sudo docker run -it --rm -v /home/USER/FOLDER/TO_YOUR_RAW_FILES:/data proteowizard/pwiz-skyline-i-agree-to-the-vendor-licenses wine msconvert --filter "peakPicking true 1-2" /data/*.raw --mgf

This should download msconvert and then convert your .raw files to .mgf files that you can search with comet in the next step.
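
Before starting the search, it can be worth checking that every .raw file actually produced an .mgf file. The following is a minimal sketch, assuming that msconvert writes each output next to its input with the same base name. The function name `list_unconverted_raw` is hypothetical and not part of ProteoWizard or mhc-validator.

```bash
#!/bin/bash
# Hypothetical helper: print every .raw file in a folder that has no
# non-empty .mgf file with the same base name next to it.
# Returns a non-zero status if at least one conversion is missing.
list_unconverted_raw() {
    local dir="${1:-.}" raw missing=0
    for raw in "$dir"/*.raw; do
        # If the folder holds no .raw files, the glob stays literal: skip it
        [ -e "$raw" ] || continue
        # -s is true only if the file exists and is not empty
        if [ ! -s "${raw%.raw}.mgf" ]; then
            echo "$raw"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `list_unconverted_raw /home/USER/FOLDER/TO_YOUR_RAW_FILES` prints nothing and returns 0 when all files were converted.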

  1. Download Comet here (choose comet.linux.exe).
  2. Create or download your taxonomy-specific FASTA file, for example from the UniProt database (www.uniprot.org).
  3. Download the Comet parameters file 'immunopeptides_example_comet.params' and adjust the path to your FASTA file (and other parameters if needed; the provided parameters are what we use with immunopeptidomics Orbitrap mass spectrometry data).
  4. Put all your .mgf files, the FASTA file and the comet.linux.exe executable in the folder where you want to perform the search.
  5. Run Comet using another simple shell script in which you only have to adjust the FASTA file name. You can download it here as run_comet.sh and place it in your search folder:

    #!/bin/bash
    COMET="./comet.linux.exe"
    # Must match the name of the parameters file you downloaded
    PARAMS="./immunopeptidomics_example_comet.params"
    DIRECTORY="."
    FASTA="./FASTA_FILE_NAME.fasta"
    "$COMET" -P"$PARAMS" -D"$FASTA" "$DIRECTORY"/*.mgf

  1. Wait and you will see that .pin files (Percolator input files) have been created in your folder.
  2. Use the .pin files as input files for mhc-validator.
  3. Install mhc-validator and its dependencies and run the software on the .pin files as described here.
  4. Enjoy the results, hopefully you have some nice data now.
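
Because mhc-validator learns from the difference between target and decoy matches, a quick sanity check of the .pin files before running it can save time. The sketch below assumes the usual Percolator input layout: a tab-separated file whose header row contains a `Label` column, with 1 for targets and -1 for decoys. The function name `count_pin_labels` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: count target and decoy PSMs in a Percolator input file.
# Assumes a tab-separated file with a header row containing a "Label" column
# (1 = target, -1 = decoy). Fails if no such column is found.
count_pin_labels() {
    awk -F'\t' '
        NR == 1 {
            # Locate the Label column by name instead of assuming its position
            for (i = 1; i <= NF; i++) if ($i == "Label") col = i
            if (!col) { print "no Label column found" > "/dev/stderr"; exit 1 }
            next
        }
        $col == 1  { targets++ }
        $col == -1 { decoys++ }
        END { if (col) printf "targets=%d decoys=%d\n", targets, decoys }
    ' "$1"
}
```

For example, `for f in *.pin; do echo "$f: $(count_pin_labels "$f")"; done` prints one summary line per file. A file that reports zero decoys most likely points to a problem with the search settings.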