Skip to content

Commit

Permalink
Updating SageAdapter with new functionality + compatibility with new …
Browse files Browse the repository at this point in the history
…Sage versions (OpenMS#7577)

* First Attempt to cluster, data reading already works

* Clustering test

* Clustering with KDE works now, access to masses in unimod.obo implemented

* CLustering functional, defect pushing mod-mass up during KDE still a probem

* Clustering (with workaround), mapping, and reading to idXML completed

* Output as a .tsv file complete, sorting based on charge implemented

* Added some bug checks to help debug streamlit app

* Added multiple masses being written to output

* Removed print-bloat

* Annotation meta values are now saved, FDR filtering implemented

* Added annotation feature

* Edited annotation notation

* Fixed naming of output-table for PTMs

* Added wide-window option, improved annotation

* Updated variable modification specification to be compatible with sage v0.15.0

* Re-added some filtering, tweaked backend, wide window mode available now

* Removed smoothing to speed up code, removed some old print statements

* Cleaned up code a bit, preparing for PR

* Removed unnecessary functions

* Refactored code to be compliant with OpenMS coding conventions

* Also updated minor changes in percolator to be form-compliant

* Update AUTHORS

* Update CHANGELOG

* Changed matching back-end, removed white-space changes to other files

* Added better handling for combinations of mods + speed-up through binary search

* Changed ini, some code

* Changed to ModificationsDB

* THIRDPARTY submodule updated

* Updated submodule to latest master

* Refactored code and made mapping less buggy, changed two thresholds

* Removed old ini file and fixed whitespac issues

* Removed commented out code block

* Added isotope check to mapping

* Changed description of smoothing option

* add doc

* todo

* Added parameter to check for Sage

* Refactored and rewrote core parts, clarified names

* Added parameter description to Percolator, updated ini

* Updated ini file, updated .idXML test file, fixed weird segfault bug in percolator

* Removed unnecessary files

* Removed last unecessary table

* Update src/openms/source/FORMAT/PercolatorInfile.cpp

Co-authored-by: Timo Sachsenberg <timo.sachsenberg@uni-tuebingen.de>

* Formatting fixes

* Fixed build issue on Windows, fixed .idXML test-file

* Changed cmake file for tests

---------

Co-authored-by: Timo Sachsenberg <timo.sachsenberg@uni-tuebingen.de>
  • Loading branch information
JohannesvKL and timosachsenberg authored Sep 30, 2024
1 parent 188deb5 commit 594447a
Show file tree
Hide file tree
Showing 11 changed files with 847 additions and 78 deletions.
1 change: 1 addition & 0 deletions AUTHORS
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@ the authors tag in the respective file header.
- Johan Teleman
- Johannes Junker
- Johannes Veit
- Johannes von Kleist
- Joshua Charkow
- Julia Thueringer
- Juliane Schmachtenberg
Expand Down
5 changes: 5 additions & 0 deletions CHANGELOG
Original file line number Diff line number Diff line change
Expand Up @@ -21,12 +21,15 @@ PR - Pull Request (on GitHub), i.e. integration of a new feature or bugfix
---- OpenMS 3.2.0 ----
------------------------------------------------------------------------------------------



What's new:
- Changes breaking backwards compatibility:
- Rename of parameters for TOPP tool FeatureFinderCentroided (debug -> advanced), and PeakPickerWavelet/TOFCalibration (optimization -> optimization:type) (#7154)
- Rename of parameters for TOPP tool IDFilter (score:pep -> score:psm; score:prot -> score:protein; score:protgroup -> score:proteingroup) with 'nan' as new default (#7541)
- 3.2.0 KNIME package requires KNIME 5.3 or later
- Support for SubsetNeighborSearch (SNS) via DecoyDatabase (#7565)
- SageAdapter received large updates including added functionality for PTM discovery + enabling features such as chimera seach, RT prediction, filtering by q-value, etc.

Library:
- Extend FileHandler to support load and store operations for our major datastructures (spectra, features, identifications, etc.). Replaced file type specific code with the more generic FileHandler calls to decouple the IO code from other parts of the library.
Expand All @@ -43,6 +46,7 @@ New Tools:
- AssayGeneratorMetaboSirius -- Assay library generation from a SIRIUS project directory (Metabolomics)
- SiriusExport -- Metabolite identification using single and tandem mass spectrometry


Fixes:
- FileConverter: more robust (#7176)
- MSFragger: allow relative path to database (#7155)
Expand All @@ -57,6 +61,7 @@ Fixes:
- TOPPAS: open files in TOPPView (#7213)
- pyOpenMS: Log warnings in pure Python code with warnings.warn instead of print (#7418)
- more robust parsing of mzIdentML (#7153)
- SageAdapter now works with sage v0.15.0 and beyond
- OpenSwath: Fix bug in diaPASEF window determination (#7546)

Misc:
Expand Down
39 changes: 31 additions & 8 deletions src/openms/include/OpenMS/FORMAT/PercolatorInfile.h
Original file line number Diff line number Diff line change
Expand Up @@ -30,19 +30,42 @@ namespace OpenMS
int min_charge,
int max_charge);

/** @brief load pin file and convert to a vector of PeptideIdentification using the given score column @p score_name and orientation @p higher_score_better.
If a decoy prefix is provided, the decoy status is set from the protein accessions.
Otherwise, it assumes that the pin file already contains the correctly annotated decoy status.
If @p extra_scores is not empty, the scores are added to the PeptideHit as MetaValues.
If a filename column is encountered the set of @p filenames is filled in the order of appearance and PeptideIdentifications annotated with the id_merge_index meta value to link them to the filename (similar to a merged idXML file).
TODO: implement something similar to PepXMLFile().setPreferredFixedModifications(getModifications_(fixed_modifications_names));
**/

/**
* @brief Loads peptide identifications from a Percolator input file.
*
* This function reads a Percolator input file (`pin_file`) and returns a vector of `PeptideIdentification` objects.
* It extracts relevantinformation such as peptide sequences, scores, charges, annotations, and protein accessions, applying
* specified thresholds and handling decoy targets as needed.
* Note: If a filename column is encountered the set of @p filenames is filled in the order of appearance and PeptideIdentifications annotated with the id_merge_index meta value to link them to the filename (similar to a merged idXML file).
*
* @param pin_file he path to the Percolator input file with a `.pin` extension.
*
* @param higher_score_better A boolean flag indicating whether higher scores are considered better (`true`) or lower scores are better (`false`).
*
* @param score_name The name of the primary score to be used for ranking peptide hits.
*
* @param extra_scores A list of additional score names that should be extracted and stored in each `PeptideHit`.
*
* @param filenames Will be populated with the unique raw file names extracted from the input data.
*
* @param decoy_prefix The prefix used to identify decoy protein accessions. Proteins with accessions starting with this prefix are marked as decoys. Otherwise, it assumes that the pin file already contains the correctly annotated decoy status.
* @param threshold A double value representing the threshold for the `spectrum_q` value. Only spectra with `spectrum_q` below this threshold are processed.
Implemented to allow prefiltering of Sage results.
* @param SageAnnotation A boolean value used to determine if the pin file is coming from Sage or not
* @return A `std::vector` of `PeptideIdentification` objects containing the peptide identifications.
* @throws `Exception::ParseError` if any line in the input file does not have the expected number of columns.
* TODO: implement something similar to PepXMLFile().setPreferredFixedModifications(getModifications_(fixed_modifications_names));
*/
static std::vector<PeptideIdentification> load(const String& pin_file,
bool higher_score_better,
const String& score_name,
const StringList& extra_scores,
StringList& filenames,
String decoy_prefix = "");
String decoy_prefix = "",
double threshold = 0.01,
bool SageAnnotation = false);

// uses spectrum_reference, if empty uses spectrum_id, if also empty fall back to using index
static String getScanIdentifier(const PeptideIdentification& pid, size_t index);
Expand Down
152 changes: 128 additions & 24 deletions src/openms/source/FORMAT/PercolatorInfile.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
//
// --------------------------------------------------------------------------
// $Maintainer: Timo Sachsenberg $
// $Authors: Timo Sachsenberg $
// $Authors: Timo Sachsenberg, Johannes von Kleist $
// --------------------------------------------------------------------------

#include <OpenMS/FORMAT/PercolatorInfile.h>
Expand Down Expand Up @@ -62,29 +62,105 @@ namespace OpenMS
const String& score_name,
const StringList& extra_scores,
StringList& filenames,
String decoy_prefix)
String decoy_prefix,
double threshold,
bool SageAnnotation)
{
CsvFile csv(pin_file, '\t');
StringList header;
csv.getRow(0, header);

//Sage Variables, initialized in the following block if SageAnnotation is set
map<int, vector<PeptideHit::PeakAnnotation>> anno_mapping;
CsvFile tsv;
CsvFile annos;
unordered_map<String, size_t> to_idx_t;

if (SageAnnotation) // Block for special treatment of sage
{
String tsv_file_path = pin_file.substr(0, pin_file.size()-3);
tsv_file_path = tsv_file_path + "tsv";
tsv = CsvFile(tsv_file_path,'\t');

String temp_diff = "results.sage.pin";
String anno_file_path = pin_file.substr(0, pin_file.size()-temp_diff.length());
anno_file_path = anno_file_path + "matched_fragments.sage.tsv";
annos = CsvFile(anno_file_path, '\t');
//map PSMID to vec of PeakAnnotation
StringList sage_tsv_header;
tsv.getRow(0, sage_tsv_header);
to_idx_t; // map column name to column index, for full .tsv file
{
int idx_t{};
for (const auto& h : sage_tsv_header) { to_idx_t[h] = idx_t++; }
}

// processs annotation file
StringList sage_annotation_header;
annos.getRow(0, sage_annotation_header);
unordered_map<String, size_t> to_idx_a; // map column name to column index, for full annotation file file
{
int idx_a{};
for (const auto& h : sage_annotation_header) { to_idx_a[h] = idx_a++; }
}
// map PSMs -> PeakAnnotation vector
auto num_rows = annos.rowCount();

for (size_t i = 1; i < num_rows; ++i)
{
StringList row;
annos.getRow(i, row);

//Check if mapping already has PSM, if it does add
if (anno_mapping.find(row[to_idx_a.at("psm_id")].toInt()) == anno_mapping.end())
{
//Make a new vector of annotations
PeptideHit::PeakAnnotation peak_temp;

peak_temp.annotation = row[to_idx_a.at("fragment_type")] + row[to_idx_a.at("fragment_ordinals")];
peak_temp.charge = row[to_idx_a.at("fragment_charge")].toInt();
peak_temp.intensity = row[to_idx_a.at("fragment_intensity")].toDouble();
peak_temp.mz = row[to_idx_a.at("fragment_mz_experimental")].toDouble();

vector<PeptideHit::PeakAnnotation> temp_anno_vec;
temp_anno_vec.push_back(peak_temp);
anno_mapping[ row[to_idx_a.at("psm_id")].toInt() ] = temp_anno_vec;
}
else
{
//Add values to exisiting vector
PeptideHit::PeakAnnotation peak_temp;

peak_temp.annotation = row[to_idx_a.at("fragment_type")] + row[to_idx_a.at("fragment_ordinals")];
peak_temp.charge = row[to_idx_a.at("fragment_charge")].toInt();
peak_temp.intensity = row[to_idx_a.at("fragment_intensity")].toDouble();
peak_temp.mz = row[to_idx_a.at("fragment_mz_experimental")].toDouble();

anno_mapping[ row[to_idx_a.at("psm_id")].toInt() ].push_back(peak_temp);
}
}
}

StringList pin_header;

csv.getRow(0, pin_header);

unordered_map<String, size_t> to_idx; // map column name to column index
{
int idx{};
for (const auto& h : header) { to_idx[h] = idx++; }
for (const auto& h : pin_header) { to_idx[h] = idx++; }
}

// determine file name column index in percolator in file
int file_name_column_index{-1};
if (auto it = std::find(header.begin(), header.end(), "FileName"); it != header.end())
if (auto it = std::find(pin_header.begin(), pin_header.end(), "FileName"); it != pin_header.end())
{
file_name_column_index = it - header.begin();
file_name_column_index = it - pin_header.begin();
}

// get column indices of extra scores
std::set<String> found_extra_scores; // additional (non-main) scores that should be stored in the PeptideHit, order important for comparable idXML
// determine extra scores and store column indices
std::set<String> found_extra_scores; // additional (non-main) scores that should be stored in the PeptideHit, order important for comparable idXML
for (const String& s : extra_scores)
{
if (auto it = std::find(header.begin(), header.end(), s); it != header.end())
if (auto it = std::find(pin_header.begin(), pin_header.end(), s); it != pin_header.end())
{
found_extra_scores.insert(s);
}
Expand All @@ -93,7 +169,7 @@ namespace OpenMS
OPENMS_LOG_WARN << "Extra score: " << s << " not found in Percolator input file." << endl;
}
}

// charge columns are not standardized, so we check for the format and create hash to lookup column name to charge mapping
std::regex charge_one_hot_pattern("^charge\\d+$");
std::regex sage_one_hot_pattern("^z=\\d+$");
Expand All @@ -104,7 +180,7 @@ namespace OpenMS
// The reason is that sage searches always for the charge annotated in the spectrum raw file. Only if the annotation is missing it will search
// the suggested charge range.
bool found_sage_otherz_charge_column{false};
for (const String& c : header)
for (const String& c : pin_header)
{
if (std::regex_match(c, charge_one_hot_pattern))
{
Expand Down Expand Up @@ -136,11 +212,21 @@ namespace OpenMS
StringList row;
csv.getRow(i, row);

if (row.size() != header.size())
StringList t_row;

if (SageAnnotation)
{
tsv.getRow(i, t_row);
// skip if spectrum_q is above threshold
if (t_row[to_idx_t.at("spectrum_q")].toDouble() > threshold ) continue;
}

if (row.size() != pin_header.size())
{
throw Exception::ParseError(__FILE__, __LINE__, OPENMS_PRETTY_FUNCTION, "Error: line " + String(i) + " of file '" + pin_file + "' does not have the same number of columns as the header!", String(i));
throw Exception::ParseError(__FILE__, __LINE__, OPENMS_PRETTY_FUNCTION, "Error: line " + String(i) + " of file '" + pin_file + "' does not have the same number of columns as the pin_header!", String(i));
}


if (file_name_column_index >= 0)
{
raw_file_name = row[file_name_column_index];
Expand All @@ -150,7 +236,6 @@ namespace OpenMS
map_filename_to_idx[raw_file_name] = filenames.size() - 1;
}
}

// NOTE: In our pin files that we WRITE, SpecID will be filename + vendor spectrum native ID
// However, many search engines (e.g. Sage) choose arbitrary IDs, which is unfortunately allowed
// by this loosely defined format.
Expand All @@ -159,28 +244,25 @@ namespace OpenMS
if (auto it = to_idx.find("ion_mobility"); it != to_idx.end())
{
const String& sIM = row[it->second];
const double IM = sIM.toDouble();
pids.back().setMetaValue(Constants::UserParam::IM, IM);
const double IM = sIM.toDouble();
if (!pids.empty()) pids.back().setMetaValue(Constants::UserParam::IM, IM);
}

// In theory, this should be an integer, but Sage currently cannot extract the number from all vendor spectrum IDs,
// so it writes the full ID as string
String sScanNr = row[to_idx.at("ScanNr")];

if (sSpecId != spec_id)
{
pids.resize(pids.size() + 1);
pids.back().setHigherScoreBetter(higher_score_better);
pids.back().setScoreType(score_name);
pids.back().setMetaValue(Constants::UserParam::ID_MERGE_INDEX, map_filename_to_idx.at(raw_file_name));
pids.back().setRT(row[to_idx.at("retentiontime")].toDouble() * 60.0); // search engines typically write minutes (e.g., sage)
pids.back().setMetaValue("PinSpecId", sSpecId);
pids.back().setMetaValue("PinSpecId", sSpecId);
// Since ScanNr is the closest to help in identifying the spectrum in the file later on,
// we use it as spectrum_reference. Since it can be integer only or the complete
// vendor ID, you will need a lookup in case of number only later!!
pids.back().setSpectrumReference(sScanNr);
pids.back().setSpectrumReference(sScanNr);
}

String sPeptide = row[to_idx.at("Peptide")];
const double score = row[to_idx.at(score_name)].toDouble();
String target_decoy = row[to_idx.at("Label")].toInt() == 1 ? "target" : "decoy";
Expand Down Expand Up @@ -253,12 +335,33 @@ namespace OpenMS
AASequence aa_seq = AASequence::fromString(sPeptide);
PeptideHit ph(score, rank, charge, std::move(aa_seq));
ph.setMetaValue("target_decoy", target_decoy);

for (const auto& name : found_extra_scores)
{
ph.setMetaValue(name, row[to_idx.at(name)]);
}
ph.setRank(rank);

// adding own meta values
if (SageAnnotation)
{
ph.setMetaValue("spectrum_q", t_row[to_idx_t.at("spectrum_q")].toDouble()); //TODO: check if column exists / SAGE specific treatment
}
ph.setMetaValue("DeltaMass", ( row[to_idx.at("ExpMass")].toDouble() - row[to_idx.at("CalcMass")].toDouble()) );
// add annotations
if (SageAnnotation)
{
if (anno_mapping.find(sSpecId.toInt()) != anno_mapping.end())
{
// copy annotations from mapping to PeptideHit
vector<PeptideHit::PeakAnnotation> pep_vec;
for (const PeptideHit::PeakAnnotation& pep : anno_mapping[sSpecId.toInt()])
{
pep_vec.push_back(pep) ;
}
ph.setPeakAnnotations(pep_vec);
}
}
// add link to protein (we only know the accession but not start/end, aa_before/after in protein at this point)
for (const String& accession : accessions)
{
Expand All @@ -267,6 +370,7 @@ namespace OpenMS

pids.back().insertHit(std::move(ph));
}

return pids;
}

Expand Down Expand Up @@ -542,4 +646,4 @@ namespace OpenMS
return count;
}

}
}
Loading

0 comments on commit 594447a

Please sign in to comment.