Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve assign peptide type 243 #263

Merged
merged 31 commits into from
Oct 25, 2024
Merged

Conversation

elena-krismer
Copy link
Collaborator

closes #243

@jpquast
Copy link
Owner

jpquast commented Aug 3, 2024

Let me know if I should review this.

@elena-krismer
Copy link
Collaborator Author

@jpquast i'm getting an error from vroom on macos and windows- have you seen this error before? (it's not related to the changes i made)

@jpquast
Copy link
Owner

jpquast commented Sep 7, 2024

Hi Elena,
I had a quick look at the function. I think it is necessary to introduce the protein_id column as you do in order to know which peptides belong to the same protein. However, as I can see you don't have a way to know about the initiator methionine. I think maybe I was not very clear about that. The point was to assign the peptide as fully tryptic if it starts basically at position 2 of the protein and there is no other peptide that starts at 1. In those cases the initiator cysteine is likely completely absent for most of the copies of this protein in the cell. As far as I can tell you check if any of the peptides of the protein don't have a preceding methionine (which could also be in the middle of the protein).
What I would suggest is to also require the start and end column for each protein and then just check if any starts with position 1. If yes keep the original annotation as it is. If not then any peptide starting at position 2 is considered fully-tryptic if it fulfils its C-terminal criterium.

As far as I can tell the output of the function is generally currently wrong:

assign_peptide_type(data, aa_before, last_aa, aa_after, protein_id)
  aa_before last_aa aa_after protein_id      pep_type
1         K       R        T         P1 fully-tryptic
2         S       K        R         P1 fully-tryptic
3         T       Y        T         P2   non-tryptic
4         M       K        R         P2  semi-tryptic

Row 2 should be semi-tryptic since the aa_before is not K/R. Not sure why the function now gets those standard cases wrong.
Row 4 works of course as you expected based on your code, but is conceptually wrong as explained above.

Not sure why vroom fails, but if it still does we can have a look at that.

elena-krismer and others added 9 commits September 24, 2024 16:36
The issue was that the uniprot data seems to now be gziped. I handle this case now in try_query. Not sure if it is generally handled for any potential case but it at least works for uniprot.
Copy link
Owner

@jpquast jpquast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have gone through it and fixed some things. It wasn't completely right yet but now it should be good. I haven't run any tests yet. Would be great if you could check if any tests still fail and then fix them. Also I fixed the issue with vroom. Seems like uniprot updated their data format to be gzip now. At least on my mac that was an issue, but it should work now.

@jpquast jpquast self-requested a review September 30, 2024 20:52
Copy link
Owner

@jpquast jpquast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have fixed all issues.
This involved still more issues with the try_query function. It seems like UniProt returned the data actually as gzip and it was not automatically unzipped while downloading as it is apparently the case for other databases (RCSB). So I needed to add another check that checks for the magic number (1f 8b) in the byte format. Now works as expected for all fetch functions.
I fixed also a lot of tests related to assign_peptide_type().
I also updated the version number and the NEWS file.

@jpquast
Copy link
Owner

jpquast commented Sep 30, 2024

@elena-krismer maybe you can have a look at this. the r-devel version on ubuntu seems to fail. The problem is that the adjusted p-value has a different values than in the current version. This seems like a more complicated problem. Would be great if you could have a look at it.
I also created this PR in ggrepel slowkow/ggrepel#263. This is related to all the warnings we get for functions that use ggrepel in plots. There is no way of silencing them. I hope they accept my PR or find another way.

@jpquast jpquast added enhancement New feature or request new feature New feature and removed enhancement New feature or request new feature New feature labels Oct 1, 2024
@elena-krismer
Copy link
Collaborator Author

Note the issue was reported to the R-developers

@jpquast jpquast merged commit 216a44e into developer Oct 25, 2024
5 of 6 checks passed
@jpquast jpquast deleted the improve_assign_peptide_type_243 branch October 29, 2024 12:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants