Updating SageAdapter with new functionality + compatibility with new …

…Sage versions (OpenMS#7577) * First Attempt to cluster, data reading already works * Clustering test * Clustering with KDE works now, access to masses in unimod.obo implemented * CLustering functional, defect pushing mod-mass up during KDE still a probem * Clustering (with workaround), mapping, and reading to idXML completed * Output as a .tsv file complete, sorting based on charge implemented * Added some bug checks to help debug streamlit app * Added multiple masses being written to output * Removed print-bloat * Annotation meta values are now saved, FDR filtering implemented * Added annotation feature * Edited annotation notation * Fixed naming of output-table for PTMs * Added wide-window option, improved annotation * Updated variable modification specification to be compatible with sage v0.15.0 * Re-added some filtering, tweaked backend, wide window mode available now * Removed smoothing to speed up code, removed some old print statements * Cleaned up code a bit, preparing for PR * Removed unnecessary functions * Refactored code to be compliant with OpenMS coding conventions * Also updated minor changes in percolator to be form-compliant * Update AUTHORS * Update CHANGELOG * Changed matching back-end, removed white-space changes to other files * Added better handling for combinations of mods + speed-up through binary search * Changed ini, some code * Changed to ModificationsDB * THIRDPARTY submodule updated * Updated submodule to latest master * Refactored code and made mapping less buggy, changed two thresholds * Removed old ini file and fixed whitespac issues * Removed commented out code block * Added isotope check to mapping * Changed description of smoothing option * add doc * todo * Added parameter to check for Sage * Refactored and rewrote core parts, clarified names * Added parameter description to Percolator, updated ini * Updated ini file, updated .idXML test file, fixed weird segfault bug in percolator * Removed unnecessary files * Removed last unecessary table * Update src/openms/source/FORMAT/PercolatorInfile.cpp Co-authored-by: Timo Sachsenberg <timo.sachsenberg@uni-tuebingen.de> * Formatting fixes * Fixed build issue on Windows, fixed .idXML test-file * Changed cmake file for tests --------- Co-authored-by: Timo Sachsenberg <timo.sachsenberg@uni-tuebingen.de>
cbielow · Sep 30, 2024 · 594447a · 594447a
1 parent 188deb5
commit 594447a
Show file tree

Hide file tree

Showing 11 changed files with 847 additions and 78 deletions.
diff --git a/AUTHORS b/AUTHORS
@@ -53,6 +53,7 @@ the authors tag in the respective file header.
  - Johan Teleman
  - Johannes Junker
  - Johannes Veit
+ - Johannes von Kleist
  - Joshua Charkow
  - Julia Thueringer
  - Juliane Schmachtenberg

diff --git a/CHANGELOG b/CHANGELOG
@@ -21,12 +21,15 @@ PR - Pull Request (on GitHub), i.e. integration of a new feature or bugfix
 ----                                OpenMS 3.2.0                                      ----
 ------------------------------------------------------------------------------------------
 
+
+
 What's new:
 - Changes breaking backwards compatibility: 
    - Rename of parameters for TOPP tool FeatureFinderCentroided (debug -> advanced), and PeakPickerWavelet/TOFCalibration (optimization -> optimization:type) (#7154)
    - Rename of parameters for TOPP tool IDFilter (score:pep -> score:psm; score:prot -> score:protein; score:protgroup -> score:proteingroup) with 'nan' as new default (#7541)
    - 3.2.0 KNIME package requires KNIME 5.3 or later
 - Support for SubsetNeighborSearch (SNS) via DecoyDatabase (#7565)
+- SageAdapter received large updates including added functionality for PTM discovery + enabling features such as chimera seach, RT prediction, filtering by q-value, etc. 
 
 Library:
 - Extend FileHandler to support load and store operations for our major datastructures (spectra, features, identifications, etc.). Replaced file type specific code with the more generic FileHandler calls to decouple the IO code from other parts of the library.
@@ -43,6 +46,7 @@ New Tools:
 - AssayGeneratorMetaboSirius -- Assay library generation from a SIRIUS project directory (Metabolomics)
 - SiriusExport -- Metabolite identification using single and tandem mass spectrometry
 
+
 Fixes:
 - FileConverter: more robust (#7176)
 - MSFragger: allow relative path to database (#7155)
@@ -57,6 +61,7 @@ Fixes:
 - TOPPAS: open files in TOPPView (#7213)
 - pyOpenMS: Log warnings in pure Python code with warnings.warn instead of print (#7418)
 - more robust parsing of mzIdentML (#7153)
+- SageAdapter now works with sage v0.15.0 and beyond
 - OpenSwath: Fix bug in diaPASEF window determination (#7546)  
 
 Misc:

diff --git a/THIRDPARTY b/THIRDPARTY
diff --git a/src/openms/include/OpenMS/FORMAT/PercolatorInfile.h b/src/openms/include/OpenMS/FORMAT/PercolatorInfile.h
@@ -30,19 +30,42 @@ namespace OpenMS
         int min_charge, 
         int max_charge);
 
-      /** @brief load pin file and convert to a vector of PeptideIdentification using the given score column @p score_name and orientation @p higher_score_better.
-          If a decoy prefix is provided, the decoy status is set from the protein accessions.
-          Otherwise, it assumes that the pin file already contains the correctly annotated decoy status.
-          If @p extra_scores is not empty, the scores are added to the PeptideHit as MetaValues.
-          If a filename column is encountered the set of @p filenames is filled in the order of appearance and PeptideIdentifications annotated with the id_merge_index meta value to link them to the filename (similar to a merged idXML file). 
-          TODO: implement something similar to PepXMLFile().setPreferredFixedModifications(getModifications_(fixed_modifications_names));
-          **/
+
+      /**
+      * @brief Loads peptide identifications from a Percolator input file.
+      * 
+      * This function reads a Percolator input file (`pin_file`) and returns a vector of `PeptideIdentification` objects. 
+      * It extracts relevantinformation such as peptide sequences, scores, charges, annotations, and protein accessions, applying
+      * specified thresholds and handling decoy targets as needed.
+      * Note: If a filename column is encountered the set of @p filenames is filled in the order of appearance and PeptideIdentifications annotated with the id_merge_index meta value to link them to the filename (similar to a merged idXML file). 
+      * 
+      * @param pin_file he path to the Percolator input file with a `.pin` extension.
+      * 
+      * @param higher_score_better A boolean flag indicating whether higher scores are considered better (`true`) or lower scores are better (`false`).
+      * 
+      * @param score_name The name of the primary score to be used for ranking peptide hits.
+      * 
+      * @param extra_scores A list of additional score names that should be extracted and stored in each `PeptideHit`.
+      * 
+      * @param filenames Will be populated with the unique raw file names extracted from the input data.
+      * 
+      * @param decoy_prefix The prefix used to identify decoy protein accessions. Proteins with accessions starting with this prefix are marked as decoys. Otherwise, it assumes that the pin file already contains the correctly annotated decoy status.
+      * @param threshold A double value representing the threshold for the `spectrum_q` value. Only spectra with `spectrum_q` below this threshold are processed.
+                         Implemented to allow prefiltering of Sage results.
+      * @param SageAnnotation A boolean value used to determine if the pin file is coming from Sage or not 
+      * @return A `std::vector` of `PeptideIdentification` objects containing the peptide identifications.
+      
+      * @throws `Exception::ParseError` if any line in the input file does not have the expected number of columns.
+      * TODO: implement something similar to PepXMLFile().setPreferredFixedModifications(getModifications_(fixed_modifications_names));      
+      */
       static std::vector<PeptideIdentification> load(const String& pin_file, 
         bool higher_score_better, 
         const String& score_name, 
         const StringList& extra_scores,
         StringList& filenames, 
-        String decoy_prefix = "");
+        String decoy_prefix = "",
+        double threshold = 0.01, 
+        bool SageAnnotation = false);
 
       // uses spectrum_reference, if empty uses spectrum_id, if also empty fall back to using index
       static String getScanIdentifier(const PeptideIdentification& pid, size_t index);

diff --git a/src/openms/source/FORMAT/PercolatorInfile.cpp b/src/openms/source/FORMAT/PercolatorInfile.cpp
@@ -3,7 +3,7 @@
 //
 // --------------------------------------------------------------------------
 // $Maintainer: Timo Sachsenberg $
-// $Authors: Timo Sachsenberg $
+// $Authors: Timo Sachsenberg, Johannes von Kleist $
 // --------------------------------------------------------------------------
 
 #include <OpenMS/FORMAT/PercolatorInfile.h>
@@ -62,29 +62,105 @@ namespace OpenMS
     const String& score_name,
     const StringList& extra_scores,
     StringList& filenames,
-    String decoy_prefix)
+    String decoy_prefix, 
+    double threshold, 
+    bool SageAnnotation)
   {
     CsvFile csv(pin_file, '\t');
-    StringList header;
-    csv.getRow(0, header);
+
+    //Sage Variables, initialized in the following block if SageAnnotation is set 
+    map<int, vector<PeptideHit::PeakAnnotation>> anno_mapping; 
+    CsvFile tsv; 
+    CsvFile annos; 
+    unordered_map<String, size_t> to_idx_t; 
+
+    if (SageAnnotation) // Block for special treatment of sage 
+    {
+      String tsv_file_path = pin_file.substr(0, pin_file.size()-3);
+      tsv_file_path = tsv_file_path + "tsv"; 
+      tsv = CsvFile(tsv_file_path,'\t'); 
+
+      String temp_diff = "results.sage.pin"; 
+      String anno_file_path = pin_file.substr(0, pin_file.size()-temp_diff.length());
+      anno_file_path = anno_file_path + "matched_fragments.sage.tsv"; 
+      annos = CsvFile(anno_file_path, '\t'); 
+      //map PSMID to vec of PeakAnnotation 
+      StringList sage_tsv_header;
+      tsv.getRow(0, sage_tsv_header); 
+      to_idx_t; // map column name to column index, for full .tsv file 
+      {
+        int idx_t{};
+        for (const auto& h : sage_tsv_header) { to_idx_t[h] = idx_t++; }
+      }
+
+      // processs annotation file
+      StringList sage_annotation_header; 
+      annos.getRow(0, sage_annotation_header);
+      unordered_map<String, size_t> to_idx_a; // map column name to column index, for full annotation file file 
+      {
+        int idx_a{};
+        for (const auto& h : sage_annotation_header) { to_idx_a[h] = idx_a++; }
+      }
+      // map PSMs -> PeakAnnotation vector
+      auto num_rows = annos.rowCount(); 
+
+      for (size_t i = 1; i < num_rows; ++i)
+      {
+        StringList row;
+        annos.getRow(i, row);
+
+        //Check if mapping already has PSM, if it does add 
+        if (anno_mapping.find(row[to_idx_a.at("psm_id")].toInt()) == anno_mapping.end())
+        {                 
+          //Make a new vector of annotations 
+          PeptideHit::PeakAnnotation peak_temp; 
+
+          peak_temp.annotation = row[to_idx_a.at("fragment_type")] + row[to_idx_a.at("fragment_ordinals")]; 
+          peak_temp.charge = row[to_idx_a.at("fragment_charge")].toInt(); 
+          peak_temp.intensity = row[to_idx_a.at("fragment_intensity")].toDouble(); 
+          peak_temp.mz = row[to_idx_a.at("fragment_mz_experimental")].toDouble(); 
+
+          vector<PeptideHit::PeakAnnotation> temp_anno_vec; 
+          temp_anno_vec.push_back(peak_temp); 
+          anno_mapping[ row[to_idx_a.at("psm_id")].toInt() ] = temp_anno_vec;
+        }
+        else
+        {
+          //Add values to exisiting vector 
+          PeptideHit::PeakAnnotation peak_temp; 
+
+          peak_temp.annotation = row[to_idx_a.at("fragment_type")] + row[to_idx_a.at("fragment_ordinals")]; 
+          peak_temp.charge = row[to_idx_a.at("fragment_charge")].toInt(); 
+          peak_temp.intensity = row[to_idx_a.at("fragment_intensity")].toDouble(); 
+          peak_temp.mz = row[to_idx_a.at("fragment_mz_experimental")].toDouble(); 
+
+          anno_mapping[ row[to_idx_a.at("psm_id")].toInt() ].push_back(peak_temp);         
+        }
+      }      
+    }
+
+    StringList pin_header;
+
+    csv.getRow(0, pin_header);
 
     unordered_map<String, size_t> to_idx; // map column name to column index
     {
       int idx{};
-      for (const auto& h : header) { to_idx[h] = idx++; }
+      for (const auto& h : pin_header) { to_idx[h] = idx++; }
     }
 
+    // determine file name column index in percolator in file
     int file_name_column_index{-1};
-    if (auto it = std::find(header.begin(), header.end(), "FileName"); it != header.end())
+    if (auto it = std::find(pin_header.begin(), pin_header.end(), "FileName"); it != pin_header.end())
     {
-      file_name_column_index = it - header.begin();
+      file_name_column_index = it - pin_header.begin();
     }
-
-    // get column indices of extra scores
-    std::set<String> found_extra_scores; // additional (non-main) scores that should be stored in the PeptideHit, order important for comparable idXML
+ 
+    // determine extra scores and store column indices
+    std::set<String> found_extra_scores; // additional (non-main) scores that should be stored in the PeptideHit, order important for comparable idXML  
     for (const String& s : extra_scores)
     {
-      if (auto it = std::find(header.begin(), header.end(), s); it != header.end())
+      if (auto it = std::find(pin_header.begin(), pin_header.end(), s); it != pin_header.end())
       {
         found_extra_scores.insert(s);
       }
@@ -93,7 +169,7 @@ namespace OpenMS
         OPENMS_LOG_WARN << "Extra score: " << s << " not found in Percolator input file." << endl;
       }
     }
-
+      
     // charge columns are not standardized, so we check for the format and create hash to lookup column name to charge mapping
     std::regex charge_one_hot_pattern("^charge\\d+$");
     std::regex sage_one_hot_pattern("^z=\\d+$");
@@ -104,7 +180,7 @@ namespace OpenMS
     // The reason is that sage searches always for the charge annotated in the spectrum raw file. Only if the annotation is missing it will search
     // the suggested charge range.
     bool found_sage_otherz_charge_column{false}; 
-    for (const String& c : header)
+    for (const String& c : pin_header)
     {
       if (std::regex_match(c, charge_one_hot_pattern))
       {
@@ -136,11 +212,21 @@ namespace OpenMS
       StringList row;
       csv.getRow(i, row);
 
-      if (row.size() != header.size())
+      StringList t_row; 
+
+      if (SageAnnotation)
+      {
+        tsv.getRow(i, t_row); 
+        // skip if spectrum_q is above threshold
+        if (t_row[to_idx_t.at("spectrum_q")].toDouble() > threshold ) continue;
+      }
+
+      if (row.size() != pin_header.size())
       {
-        throw Exception::ParseError(__FILE__, __LINE__, OPENMS_PRETTY_FUNCTION, "Error: line " + String(i) + " of file '" + pin_file + "' does not have the same number of columns as the header!", String(i));
+        throw Exception::ParseError(__FILE__, __LINE__, OPENMS_PRETTY_FUNCTION, "Error: line " + String(i) + " of file '" + pin_file + "' does not have the same number of columns as the pin_header!", String(i));
       }
 
+
       if (file_name_column_index >= 0)
       {
         raw_file_name = row[file_name_column_index];
@@ -150,7 +236,6 @@ namespace OpenMS
           map_filename_to_idx[raw_file_name] = filenames.size() - 1;
         }
       }
-
       // NOTE: In our pin files that we WRITE, SpecID will be filename + vendor spectrum native ID
       // However, many search engines (e.g. Sage) choose arbitrary IDs, which is unfortunately allowed
       //  by this loosely defined format.
@@ -159,28 +244,25 @@ namespace OpenMS
       if (auto it = to_idx.find("ion_mobility"); it != to_idx.end())
       {
         const String& sIM = row[it->second];
-        const double IM = sIM.toDouble();  
-        pids.back().setMetaValue(Constants::UserParam::IM, IM);
+        const double IM = sIM.toDouble();   
+        if (!pids.empty())  pids.back().setMetaValue(Constants::UserParam::IM, IM);
       }
-
       // In theory, this should be an integer, but Sage currently cannot extract the number from all vendor spectrum IDs,
       //  so it writes the full ID as string
       String sScanNr = row[to_idx.at("ScanNr")];
-
       if (sSpecId != spec_id)
       {
         pids.resize(pids.size() + 1);
         pids.back().setHigherScoreBetter(higher_score_better);
         pids.back().setScoreType(score_name);
         pids.back().setMetaValue(Constants::UserParam::ID_MERGE_INDEX, map_filename_to_idx.at(raw_file_name));
         pids.back().setRT(row[to_idx.at("retentiontime")].toDouble() * 60.0); // search engines typically write minutes (e.g., sage)
-        pids.back().setMetaValue("PinSpecId", sSpecId);
+        pids.back().setMetaValue("PinSpecId", sSpecId);  
         // Since ScanNr is the closest to help in identifying the spectrum in the file later on,
         // we use it as spectrum_reference. Since it can be integer only or the complete
         // vendor ID, you will need a lookup in case of number only later!!
-        pids.back().setSpectrumReference(sScanNr);
+        pids.back().setSpectrumReference(sScanNr);     
       }
-
       String sPeptide = row[to_idx.at("Peptide")];
       const double score = row[to_idx.at(score_name)].toDouble();
       String target_decoy = row[to_idx.at("Label")].toInt() == 1 ? "target" : "decoy";
@@ -253,12 +335,33 @@ namespace OpenMS
       AASequence aa_seq = AASequence::fromString(sPeptide);
       PeptideHit ph(score, rank, charge, std::move(aa_seq));
       ph.setMetaValue("target_decoy", target_decoy);
+
       for (const auto& name : found_extra_scores)
       {
         ph.setMetaValue(name, row[to_idx.at(name)]);
       }
       ph.setRank(rank);
 
+      // adding own meta values 
+      if (SageAnnotation)
+      {
+        ph.setMetaValue("spectrum_q", t_row[to_idx_t.at("spectrum_q")].toDouble());  //TODO: check if column exists / SAGE specific treatment
+      }
+      ph.setMetaValue("DeltaMass", ( row[to_idx.at("ExpMass")].toDouble() - row[to_idx.at("CalcMass")].toDouble()) ); 
+      // add annotations
+      if (SageAnnotation)
+      {
+        if (anno_mapping.find(sSpecId.toInt()) != anno_mapping.end())
+        {
+          // copy annotations from mapping to PeptideHit
+          vector<PeptideHit::PeakAnnotation> pep_vec;
+          for (const PeptideHit::PeakAnnotation& pep : anno_mapping[sSpecId.toInt()])
+          {
+            pep_vec.push_back(pep) ; 
+          }        
+          ph.setPeakAnnotations(pep_vec);        
+        } 
+      }
       // add link to protein (we only know the accession but not start/end, aa_before/after in protein at this point)
       for (const String& accession : accessions)
       {
@@ -267,6 +370,7 @@ namespace OpenMS
 
       pids.back().insertHit(std::move(ph));
     }
+
     return pids;
   }
 
@@ -542,4 +646,4 @@ namespace OpenMS
     return count;
   }
 
-}
+}