Sergey Venev edited this page Jan 21, 2016 · 8 revisions

GlycoWiki

Notes regarding the dataset features.

Peptide sequence ambiguity.

MS detects short peptides and then maps them to an existing proteome database of your choice. This creates ambiguity: the very same peptide can be matched to multiple protein ids. For the most part this is handled gracefully, yielding a single data entry for a given peptide, with multiple UniProt ids itemized in a comma-separated list in the column named "Protein accession numbers". A slightly less pretty thing happens when this list of accession numbers (UniProt ids) gets split between the columns "Protein accession numbers" and "Other Proteins". And the worst situation observed so far is when there are multiple data entries for the same peptide with potentially overlapping lists of accession numbers. Because peptide sequences are the best candidates to become the unique identifier for the whole dataset, including the peptide spectrum, we must resolve the existing ambiguity.

resolve_peptide_ambiguity.py resolves the aforementioned ambiguity and creates the file uniq_peptides_catalog.sv, which contains an account of all unique peptides with all the proteins they can be associated with.
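
The merging step can be sketched as follows. This is only an illustration of the idea, not the code of resolve_peptide_ambiguity.py: the column name "Peptide sequence", the function names and the ids ID_A, ID_B, ID_C are assumptions made up for the example; only "Protein accession numbers" and "Other Proteins" come from the dataset description above.

```python
from collections import OrderedDict


def split_accessions(field):
    """Split a comma-separated accession field into a list of clean ids."""
    return [acc.strip() for acc in (field or "").split(",") if acc.strip()]


def merge_peptide_rows(rows):
    """Collapse rows that share a peptide into one entry holding the union of ids.

    "Peptide sequence" is a hypothetical column name used for illustration.
    """
    catalog = OrderedDict()
    for row in rows:
        ids = catalog.setdefault(row["Peptide sequence"], [])
        # Accession numbers may be split between these two columns.
        for column in ("Protein accession numbers", "Other Proteins"):
            for acc in split_accessions(row.get(column)):
                if acc not in ids:  # overlapping lists are merged only once
                    ids.append(acc)
    return catalog


# Made-up rows covering the three situations described above.
rows = [
    {"Peptide sequence": "PEPTIDEK",
     "Protein accession numbers": "ID_A,ID_B",
     "Other Proteins": ""},
    {"Peptide sequence": "PEPTIDEK",
     "Protein accession numbers": "ID_B",
     "Other Proteins": "ID_C"},
    {"Peptide sequence": "ANOTHERK",
     "Protein accession numbers": "ID_A",
     "Other Proteins": None},
]

catalog = merge_peptide_rows(rows)
for peptide, ids in catalog.items():
    print(peptide, ",".join(ids))
```

Keeping the ids in a list rather than a set preserves the order in which they were first seen, which makes the resulting catalog easier to compare against the input.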

gsites_catalog.py takes the uniq_peptides_catalog.sv file and extracts all detected glyco-sites from the spectrum. The script then reorders the table so that Gsites become the unique identifiers and the peptides of origin are enumerated for each of them. Predicted Gsites are provided as well. The resulting file is gsites_antology.csv.
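
A rough sketch of the prediction and reordering steps, assuming that predicted Gsites are N-X-S/T motifs with X being anything but P (as the column names gsites_AA1_N, gsites_AA2_XbutP and gsites_AA3_TS suggest) and that positions are 0-based. The function names are hypothetical and this is not the code of gsites_catalog.py. The sequence is the visible prefix of the POPD1_HUMAN record shown in the TODO section.

```python
import re
from collections import defaultdict

# N, then any residue except P, then S or T. The lookahead keeps
# overlapping motifs from hiding each other.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def predict_gsites(prot_seq):
    """Return labels such as 'NYT(1)': the motif and the 0-based index of its N."""
    return ["{}({})".format(m.group(1), m.start())
            for m in SEQUON.finditer(prot_seq)]


def index_by_gsite(peptide_sites):
    """Turn a peptide -> Gsites mapping into a Gsite -> peptides mapping."""
    by_site = defaultdict(list)
    for peptide, sites in peptide_sites.items():
        for site in sites:
            by_site[site].append(peptide)
    return dict(by_site)


# Visible prefix of the protein sequence, not the full 360 residues.
prefix = "MNYTESSPLRESTAIGFTPELESIIPVPSNKTTCENWREIHHLVFH"
print(predict_gsites(prefix))                       # ['NYT(1)', 'NKT(29)']
print(index_by_gsite({"MNYTESSPLR": ["NYT(1)"]}))   # {'NYT(1)': ['MNYTESSPLR']}
```

In a real table a label like NYT(1) is unique only within one protein, so the key would also need the protein id.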

See pipeline.sh to get an idea of the whole pipeline.

TODO:

  • Experimentalists are used to counting from 1, not from 0. Address this.
  • A peptide can be at the very beginning of the protein; what is the preceding AA then? See the following example from new_output_large_121515 - dat.iloc[61].

gsites                     NYT(1)
pept                       (P)MNYTESSPLR(S)
peptide_start              0
all_uids                   sp|Q8NE79|POPD1_HUMAN
prot_seq                   MNYTESSPLRESTAIGFTPELESIIPVPSNKTTCENWREIHHLVFH...
protlen                    360
uid_max                    Q8NE79
prot_name                  sp|Q8NE79|POPD1_HUMAN Blood vessel epicardial ...
uniq_pept_count            1
pept_probab                99.7%
gsites_predicted           NYT(1),NKT(29)
gsites_predicted_number    2
gsite_start                1
gsites_AA1_N               N
gsites_AA2_XbutP           Y
gsites_AA3_TS              T
Name: 61, dtype: object
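
One possible way to address both TODO items is sketched below. It is not part of the existing scripts: the function names and the "-" placeholder for a missing preceding residue are assumptions. The explicit check matters because in Python prot_seq[peptide_start - 1] with peptide_start equal to 0 does not fail; it silently returns the last residue of the protein.

```python
def preceding_residue(prot_seq, peptide_start, edge="-"):
    """Return the residue before a peptide, or a placeholder at the protein start.

    peptide_start is 0-based; edge is a hypothetical marker meaning "no residue".
    """
    if peptide_start > 0:
        return prot_seq[peptide_start - 1]
    # Index -1 would wrap around to the last residue, so guard explicitly.
    return edge


def to_one_based(position):
    """Convert a 0-based position to the 1-based numbering used at the bench."""
    return position + 1


# Visible prefix of the protein sequence from the record above.
prefix = "MNYTESSPLRESTAIGFTPELESIIPVPSNKTTCENWREIHHLVFH"
print(preceding_residue(prefix, 0))   # -
print(preceding_residue(prefix, 1))   # M
print(to_one_based(1))                # 2, i.e. gsite_start 1 becomes residue 2
```

Doing the conversion in one place, just before the table is written out, would keep all internal slicing 0-based.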
