Skip to content

Commit

Permalink
Sage MS (OpenMS#6786)
Browse files Browse the repository at this point in the history
* reading of version numbers

* scaffold for PercolatorInfile::load

* nop

* minimal parsing

* wip

* more entries

* nop

* Update PercolatorInfile.cpp

* add modification parsing to pin file

* use proper score column for main score

* Update src/openms/include/OpenMS/FORMAT/PercolatorInfile.h

* creates basic test file

* update test

* update test

* update pin file parsing

* Apply suggestions from code review

* some minor work on adapter

* support multi file pin files

* nop

* improved sage support

* improved sage support

* support for multiple files

* add sage subscores for percolator / mokapot rescoring

* support for subscores propagation to percolator

* some preparation to for config template

* wrap easy parameters

* change map_inde to id_merge_index, add enzyme support

* untested support for all relevant parameters

* small refactoring

* first working version

* Update src/topp/SageAdapter.cpp

* Apply suggestions from code review

Co-authored-by: Julianus Pfeuffer <pfeuffer@informatik.uni-tuebingen.de>

* expand search meta data

* prepration for thirdparty test

* removed unneded copy

* fix some missing update error

* add precursor RT and MZ to id results

* fix wrong paramter in test. good test data still WIP

* add Sage test

* mend

* Update src/topp/SageAdapter.cpp

* update test filename in idXML

* Update third_party_tests.cmake

---------

Co-authored-by: timosachsenberg <sachsenb@ibminode02.Cs.Uni-Tuebingen.De>
Co-authored-by: timosachsenberg <sachsenb@ibminode04.cs.uni-tuebingen.de>
Co-authored-by: Samuel Wein <sam@samwein.com>
Co-authored-by: Julianus Pfeuffer <pfeuffer@informatik.uni-tuebingen.de>
  • Loading branch information
5 people authored Aug 14, 2023
1 parent 69a0ee6 commit ef37a65
Show file tree
Hide file tree
Showing 16 changed files with 965 additions and 12 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -249,7 +249,7 @@ namespace OpenMS
/// a map from ScoreType to their names as used around OpenMS
std::map<ScoreType, std::set<String>> type_to_str_ =
{
{ScoreType::RAW, {"XTandem", "OMSSA", "SEQUEST:xcorr", "Mascot", "mvh"}},
{ScoreType::RAW, {"XTandem", "OMSSA", "SEQUEST:xcorr", "Mascot", "mvh", "Sage"}},
//TODO find out reasonable raw scores for SES that provide E-Values as main score or see below
//TODO there is no test for spectraST idXML, so I don't know its score
//TODO check if we should combine RAW and RAW_EVAL:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -85,7 +85,7 @@ namespace OpenMS
static const int N_TERM_MODIFICATION_INDEX; // magic constant to distinguish N_TERM only modifications from ANYWHERE modifications placed at N-term residue
static const int C_TERM_MODIFICATION_INDEX; // magic constant to distinguish C_TERM only modifications from ANYWHERE modifications placed at C-term residue

// Lookup datastructure to allow lock-free generation of modified peptides
// Lookup datastructure to allow lock-free generation of modified peptides. Modifications without origin (e.g., "Protein N-term") set the residue to nullptr.
static MapToResidueType createResidueModificationToResidueMap_(const std::vector<const ResidueModification*>& mods);

// Fast implementation of modification placement. No combinatoric placement is needed in this case - just every site is modified once by each compatible modification. Already modified residues are skipped
Expand Down
13 changes: 11 additions & 2 deletions src/openms/include/OpenMS/FORMAT/PercolatorInfile.h
Original file line number Diff line number Diff line change
Expand Up @@ -58,8 +58,17 @@ namespace OpenMS

/** @brief load pin file and convert to a vector of PeptideIdentification using the given score column @p score_name and orientation @p higher_score_better.
If a decoy prefix is provided, the decoy status is set from the protein accessions.
Otherwise, it assumes that the pin file already contains the correctly annotated decoy status. **/
static std::vector<PeptideIdentification> load(const String& pin_file, bool higher_score_better, const String& score_name, String decoy_prefix = "");
Otherwise, it assumes that the pin file already contains the correctly annotated decoy status.
If @p extra_scores is not empty, the scores are added to the PeptideHit as MetaValues.
If a filename column is encountered the set of @p filenames is filled in the order of appearance and PeptideIdentifications annotated with the id_merge_index meta value to link them to the filename (similar to a merged idXML file).
TODO: implement something similar to PepXMLFile().setPreferredFixedModifications(getModifications_(fixed_modifications_names));
**/
static std::vector<PeptideIdentification> load(const String& pin_file,
bool higher_score_better,
const String& score_name,
const StringList& extra_scores,
StringList& filenames,
String decoy_prefix = "");

// uses spectrum_reference, if empty uses spectrum_id, if also empty fall back to using index
static String getScanIdentifier(const PeptideIdentification& pid, size_t index);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -91,7 +91,7 @@ namespace OpenMS
* @param target_decoy_available whether target decoy information is stored as meta value
* @param fdr_for_targets_smaller fdr threshold for targets
* @return engine (and optional charge state) id -> vector of triplets (score, target, decoy)
* @note supported engines are: XTandem,OMSSA,MASCOT,SpectraST,MyriMatch,SimTandem,MSGFPlus,MS-GF+,Comet
* @note supported engines are: XTandem,OMSSA,MASCOT,SpectraST,MyriMatch,SimTandem,MSGFPlus,MS-GF+,Comet,Sage
*/
static std::map<String, std::vector<std::vector<double>>> extractAndTransformScores(
const std::vector<ProteinIdentification> & protein_ids,
Expand Down
1 change: 1 addition & 0 deletions src/openms/source/APPLICATIONS/ToolHandler.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -120,6 +120,7 @@ namespace OpenMS
tools_map["ProteinQuantifier"] = Internal::ToolDescription("ProteinQuantifier", "Quantitation");
tools_map["ProteinResolver"] = Internal::ToolDescription("ProteinResolver", "Quantitation");
tools_map["QualityControl"] = Internal::ToolDescription("QualityControl", "Quality Control");
tools_map["SageAdapter"] = Internal::ToolDescription("SageAdapter", "Identification");
tools_map["SeedListGenerator"] = Internal::ToolDescription("SeedListGenerator", "Quantitation");
tools_map["SpecLibSearcher"] = Internal::ToolDescription("SpecLibSearcher", "Identification");
tools_map["SpectraFilterBernNorm"] = Internal::ToolDescription("SpectraFilterBernNorm", "Identification");
Expand Down
59 changes: 55 additions & 4 deletions src/openms/source/FORMAT/PercolatorInfile.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,7 @@

#include <regex>
#include <functional>
#include <unordered_set>

namespace OpenMS
{
Expand Down Expand Up @@ -80,18 +81,44 @@ namespace OpenMS
return scan_identifier.removeWhitespaces();
}

vector<PeptideIdentification> PercolatorInfile::load(const String& pin_file, bool higher_score_better, const String& score_name, String decoy_prefix)
vector<PeptideIdentification> PercolatorInfile::load(
const String& pin_file,
bool higher_score_better,
const String& score_name,
const StringList& extra_scores,
StringList& filenames,
String decoy_prefix)
{
CsvFile csv(pin_file, '\t');
StringList header;
csv.getRow(0, header);

unordered_map<String, size_t> to_idx; // map column name to column index
{
size_t idx{};
int idx{};
for (const auto& h : header) { to_idx[h] = idx++; }
}

int file_name_column_index{-1};
if (auto it = std::find(header.begin(), header.end(), "FileName"); it != header.end())
{
file_name_column_index = it - header.begin();
}

// get column indices of extra scores
unordered_set<String> found_extra_scores; // additional (non-main) scores that should be stored in the PeptideHit
for (const String& s : extra_scores)
{
if (auto it = std::find(header.begin(), header.end(), s); it != header.end())
{
found_extra_scores.insert(s);
}
else
{
OPENMS_LOG_WARN << "Extra score: " << s << " not found in Percolator input file." << endl;
}
}

// charge columns are not standardized so we check for the format and create hash to lookup column name to charge mapping
std::regex charge_one_hot_pattern("^charge\\d+$");
std::regex sage_one_hot_pattern("^z=\\d+$");
Expand Down Expand Up @@ -120,6 +147,9 @@ namespace OpenMS
vector<PeptideIdentification> pids;
pids.reserve(n_rows);
String spec_id;
String raw_file_name("UNKNOWN");
unordered_map<String, size_t> map_filename_to_idx; // fast lookup of filename to index in filenames vector

for (size_t i = 1; i != n_rows; ++i)
{
StringList row;
Expand All @@ -130,12 +160,24 @@ namespace OpenMS
throw Exception::ParseError(__FILE__, __LINE__, OPENMS_PRETTY_FUNCTION, "Error: line " + String(i) + " of file '" + pin_file + "' does not have the same number of columns as the header!", String(i));
}

if (file_name_column_index >= 0)
{
raw_file_name = row[file_name_column_index];
if (map_filename_to_idx.find(raw_file_name) == map_filename_to_idx.end())
{
filenames.push_back(raw_file_name);
map_filename_to_idx[raw_file_name] = filenames.size() - 1;
}
}

const String& sSpecId = row[to_idx.at("SpecId")];
if (sSpecId != spec_id)
{
pids.resize(pids.size() + 1);
pids.back().setHigherScoreBetter(higher_score_better);
pids.back().setScoreType(score_name);
pids.back().setMetaValue("id_merge_index", map_filename_to_idx.at(raw_file_name));
pids.back().setRT(row[to_idx.at("retentiontime")].toDouble() * 60.0);
}

int sScanNr = row[to_idx.at("ScanNr")].toInt();
Expand All @@ -157,6 +199,11 @@ namespace OpenMS
}
}

if (charge != 0)
{
pids.back().setMZ((row[to_idx.at("ExpMass")].toDouble() - std::fabs(charge) * Constants::PROTON_MASS_U) / std::fabs(charge));
}

sProteins.split(';', accessions);

// deduce decoy state from accessions if decoy_prefix is set
Expand All @@ -165,13 +212,18 @@ namespace OpenMS
target_decoy = std::all_of(accessions.begin(), accessions.end(), [&decoy_prefix](const String& acc) { return acc.hasPrefix(decoy_prefix); }) ? "decoy" : "target" ;
}

// needs to handle strings like: [+42]-MVLVQDLLHPTAASEAR, [+304.207]-ETC[+57.0215]RQLGLGTNIYNAER
// needs to handle strings like: [+42]-MVLVQDLLHPTAASEAR, [+304.207]-ETC[+57.0215]RQLGLGTNIYNAER etc.
sPeptide.substitute("]-", "]."); // we can parse [+42].MVLVQDLLHPTAASEAR
sPeptide.substitute("-[", ".["); // we can parse MVLVQDLLHPTAASEAR.[+111]
AASequence aa_seq = AASequence::fromString(sPeptide);
PeptideHit ph(score, rank, charge, std::move(aa_seq));
ph.setMetaValue("SpecId", sSpecId);
ph.setMetaValue("ScanNr", sScanNr);
ph.setMetaValue("target_decoy", target_decoy);
for (const auto name : found_extra_scores)
{
ph.setMetaValue(name, row[to_idx.at(name)]);
}
ph.setRank(rank);

// add link to protein (we only know the accession but not start/end, aa_before/after in protein at this point)
Expand Down Expand Up @@ -458,4 +510,3 @@ namespace OpenMS
}

}

Original file line number Diff line number Diff line change
Expand Up @@ -961,6 +961,10 @@ namespace OpenMS::Math
{
return getScore_({"hyperscore"}, hit, current_score_type); //TODO evaluate transformations
}
else if (engine == "SAGE")
{
return getScore_({"hyperscore"}, hit, current_score_type);
}
else if (engine == "MSFRAGGER")
{
return (-1) * log10(getScore_({"expect"}, hit, current_score_type));
Expand All @@ -980,7 +984,7 @@ namespace OpenMS::Math
std::set<Int> charges;
const StringList search_engines = {"XTandem","OMSSA","MASCOT","SpectraST","MyriMatch",
"SimTandem","MSGFPlus","MS-GF+","Comet","MSFragger",
"tide-search","SimpleSearchEngine",
"tide-search","Sage","SimpleSearchEngine",
"OpenMS/ConsensusID_best","OpenMS/ConsensusID_worst","OpenMS/ConsensusID_average"};

if (split_charge)
Expand Down
2 changes: 1 addition & 1 deletion src/pyOpenMS/pyopenms-docs
13 changes: 12 additions & 1 deletion src/tests/class_tests/openms/source/PercolatorInfile_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -65,9 +65,20 @@ END_SECTION

START_SECTION(vector<PeptideIdentification> PercolatorInfile::load(const String& pin_file, bool higher_score_better, const String& score_name, String decoy_prefix))
{
StringList filenames;
// test loading of pin file with automatic update of target/decoy annotation based on decoy prefix in protein accessions
auto pids = PercolatorInfile::load(OPENMS_GET_TEST_DATA_PATH("sage.pin"), true, "ln(hyperscore)", "DECOY_");

// test some extra scores
StringList extra_scores = {"ln(delta_next)", "ln(delta_best)", "matched_peaks"};

auto pids = PercolatorInfile::load(OPENMS_GET_TEST_DATA_PATH("sage.pin"),
true,
"ln(hyperscore)",
extra_scores,
filenames,
"DECOY_");
TEST_EQUAL(pids.size(), 9);
TEST_EQUAL(filenames.size(), 1);
TEST_FALSE(pids[6].getMetaValue("target_decoy") == "decoy") // 7th entry is annotated as target in pin file but only maps to decoy proteins with prefix "DECOY_" -> set to decoy
}
END_SECTION
Expand Down
8 changes: 8 additions & 0 deletions src/tests/topp/THIRDPARTY/SageAdapter_1.fasta
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
>sp|Q99536|VAT1_HUMAN Synaptic vesicle membrane protein VAT-1 homolog OS=Homo sapiens OX=9606 GN=VAT1 PE=1 SV=2
MSDEREVAEAATGEDASSPPPKTEAASDPQHPAASEGAAAAAASPPLLRCLVLTGFGGYD
KVKLQSRPAAPPAPGPGQLTLRLRACGLNFADLMARQGLYDRLPPLPVTPGMEGAGVVIA
VGEGVSDRKAGDRVMVLNRSGMWQEEVTVPSVQTFLIPEAMTFEEAAALLVNYITAYMVL
FDFGNLQPGHSVLVHMAAGGVGMAAVQLCRTVENVTVFGTASASKHEALKENGVTHPIDY
HTTDYVDEIKKISPKGVDIVMDPLGGSDTAKGYNLLKPMGKVVTYGMANLLTGPKRNLMA
LARTWWNQFSVTALQLLQANRAVCGFHLGYLDGEVELVSGVVARLLALYNQGHIKPHIDS
VWPFEKVADAMKQMQEKKNVGKVLLVPGPEKEN
68 changes: 68 additions & 0 deletions src/tests/topp/THIRDPARTY/SageAdapter_1.ini

Large diffs are not rendered by default.

Loading

0 comments on commit ef37a65

Please sign in to comment.