[Pitch] Reproducible train/test splitting for peptides #23

Open
jspaezp opened this issue Jun 8, 2023 · 2 comments

jspaezp commented Jun 8, 2023

Context

Comparing performance among tools is already hard, and it becomes even harder when little is known about the training data used for each model. A consensus way to split data reproducibly could therefore be critical down the road to give consistent assurance that a specific peptide has never been seen by a model.

Proposal

Define a hashing function that assigns a number to every peptide sequence and that, on average, yields an approximately uniform distribution, so it can be used for percentage-based train/test splits.

In other words, propose a way to convert the train = [random() > 0.8 for x in PEPTIDES] pattern to train = [hash(x) > 0.8 for x in PEPTIDES].
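
As an illustration of that pattern (not a decided design, and distinct from the implementation further below), any stable, language-agnostic hash could be used; here MD5 of the sequence is mapped to a float in [0, 1). The function name, the 32-bit truncation, and the toy peptide list are all placeholders of my own choosing:

import hashlib

def stable_peptide_fraction(sequence: str) -> float:
    """Map a peptide sequence to a reproducible float in [0, 1) via MD5."""
    digest = hashlib.md5(sequence.encode("utf-8")).hexdigest()
    # First 8 hex characters = 32 bits, scaled to [0, 1)
    return int(digest[:8], 16) / 2**32

PEPTIDES = ["AAAK", "LESLIEK", "ELVISLIVESK"]  # toy example
train = [p for p in PEPTIDES if stable_peptide_fraction(p) <= 0.8]
heldout = [p for p in PEPTIDES if stable_peptide_fraction(p) > 0.8]

Unlike Python's built-in hash(), a digest like MD5 gives the same value in every language and every run, which is what makes the split reproducible across tools.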

  1. Make a reference implementation and test examples that can be re-implemented in any programming language/framework.

  2. Provide guidelines for specific sequences that should not be trained on (iRT peptides/ProCal, which should be used as landmarks and not as training sequences?).

  • I would recommend NEVER training on anything with a hash in the range [0.9, 1], and I would generally discourage hashes > 0.8 as well, in addition to the Biognosys iRT peptides/ProCal peptides ...
  3. Add some internal testing to verify that compositions/motifs are not being over-represented in any split in a systematic way (see the sketch after this list).
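
For point 3, a minimal sketch of such a sanity check, assuming a select_split-style assignment function like the one shown under "First implementation" below (my_peptides is a placeholder for an actual peptide list):

from collections import Counter
from typing import Callable, Dict, Iterable

def composition_by_split(
    peptides: Iterable[str],
    assign: Callable[[str], str],
) -> Dict[str, Dict[str, float]]:
    """Relative amino-acid frequencies within each split, keyed by split name."""
    counters: Dict[str, Counter] = {}
    for pep in peptides:
        counters.setdefault(assign(pep), Counter()).update(pep)
    return {
        split: {aa: n / sum(counter.values()) for aa, n in counter.items()}
        for split, counter in counters.items()
    }

# Large, consistent frequency differences between splits for any residue
# would suggest the hash is systematically biased by composition, e.g.:
# freqs = composition_by_split(my_peptides, select_split)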

First implementation

This is the hashing that I use in my model to accomplish this task (with some minor modifications for readability):

from typing import Literal

SplitSet = Literal["Train", "Test", "Val"]

# Generated using {x:hash(x) for x in CONFIG.encoding_aa_order} once
# and then hard-coded
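# (Python's str hash() is randomized per interpreter run via PYTHONHASHSEED,
# which is why the values are hard-coded rather than recomputed.)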
HASHDICT = {
    "A": 8990350376580739186,
    "C": -5648131828304525110,
    "D": 6043088297348140225,
    "E": 2424930106316864185,
    "F": 7046537624574876942,
    "G": 3340710540999258202,
    "H": 6743161139278114243,
    "I": -3034276714411840744,
    "K": -6360745720327592128,
    "L": -5980349674681488316,
    "M": -5782039407703521972,
    "N": -5469935875943994788,
    "P": -9131389159066742055,
    "Q": -3988780601193558504,
    "R": -961126793936120965,
    "S": 8601576106333056321,
    "T": -826347925826021181,
    "V": 6418718798924587169,
    "W": -3331112299842267173,
    "X": -7457703884378074688,
    "Y": 2606728663468607544,
}

def select_split(pep: str) -> SplitSet:
    """Assigns a peptide to a split set based on its sequence

    It assigns all iRT peptides to the 'Val' set.
    The rest of the peptides are hashed based on their stripped sequence (no mods),
    so the assignment is deterministic but behaves semi-randomly.

    Args:
        pep (str): Peptide sequence to assign to a split set

    Returns:
        SplitSet: Split set to assign the peptide to.
            This is either one of "Train", "Test" or "Val"

    Examples:
        >>> select_split("AAA")
        'Train'
        >>> select_split("AAAK")
        'Test'
        >>> select_split("AAAKK")
        'Train'
        >>> select_split("AAAMTKK")
        'Train'

    """
    num_hash = sum(HASHDICT[x] for x in pep)
    num_hash = num_hash / 1e4
    num_hash = num_hash % 1
    assert 0 <= num_hash <= 1
    return _select_split(pep, num_hash)

# IRT_PEPTIDES is a hard-coded set of landmark peptide sequences (defined
# elsewhere in my code) that I use to align my retention times but actively
# exclude from training.
def _select_split(pep: str, num_hash: float) -> SplitSet:
    in_landmark = pep in IRT_PEPTIDES
    if num_hash > 0.8 or in_landmark:
        return "Val"
    elif num_hash > 0.6:
        return "Test"
    else:
        return "Train"
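
To make the intended use concrete, a small usage sketch, assuming select_split and IRT_PEPTIDES from above are in scope (my_peptides is again a placeholder):

from collections import Counter

def split_fractions(peptides):
    """Partition peptides with select_split and report the fraction per split."""
    counts = Counter(select_split(pep) for pep in peptides)
    total = sum(counts.values())
    return {split: n / total for split, n in counts.items()}

# On a large peptide list this should land near 60/20/20 for Train/Test/Val
# if the hash behaves approximately uniformly.
# print(split_fractions(my_peptides))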

I think it would be great if we could have a discussion on a good way to do this and, ideally, have an implementation of something like this in our MS-related training frameworks.

(I will progressively add people to the conversation but feel free to add anyone in the community whose input should be included here).


jspaezp commented Jun 8, 2023

@wfondrie any insights on this, and any opinion on adopting it for depth charge?


jspaezp commented Jun 8, 2023

@RalfG what do you think? (If we reach any consensus, I would love to add a simple train/test split tutorial to proteomicsML.)
