Allow Canto to replace a UniProtKB accession number #2677
Note that this will also be needed if UniProt ever obsoletes an accession that we've used in an existing session. |
Yep, I think it should be straightforward if you know the old and new accessions.
Could you explain this bit some more? |
My point is that when adding UniProtKB accession numbers using Canto's web interface, it prevents adding accession numbers that are obsolete (i.e. moved to UniParc) or invalid (don't map to any existing accession). For example, the following obsolete accession number shows a warning message in the web interface:
No genes found for these identifiers: I1RA15
But if the script doesn't query the new accessions against UniProtKB before doing the replacement, then it will be possible to replace a valid accession with an obsolete or non-existent accession, and I don't know whether that will cause problems for the code.
Maybe an initial version of the script could just replace the accession number without any checks, and we might not need the checks at all if Canto doesn't break when an invalid accession number is entered, but it would be nice to have some level of sanity checking since curators entering the 'wrong' accession number could be an unfortunately common occurrence. |
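To make the proposed sanity check concrete, here is a minimal Python sketch (not Canto code) of validating a replacement accession before any change is made. The regular expression is the accession pattern published in the UniProt help pages. The `lookup` callable and `fake_lookup` are hypothetical stand-ins for a real UniProtKB query, and it is an assumption that such a lookup returns nothing for obsolete or unknown entries.

```python
import re
from typing import Callable, Optional

# Accession pattern as published in the UniProt help pages.
ACCESSION_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def check_replacement(
    old_id: str, new_id: str, lookup: Callable[[str], Optional[dict]]
) -> list:
    """Return a list of problems; an empty list means the swap looks safe."""
    problems = []
    if old_id == new_id:
        problems.append("old and new accessions are identical")
    if not ACCESSION_RE.match(new_id):
        problems.append(f"{new_id} is not a well-formed accession")
    elif lookup(new_id) is None:
        # Assumption: the lookup returns None for obsolete or unknown entries.
        problems.append(f"No genes found for these identifiers: {new_id}")
    return problems


def fake_lookup(accession: str) -> Optional[dict]:
    """Hypothetical stand-in for a UniProtKB query; knows a single entry."""
    known = {"X0B2H1": {"name": "FOMG_00587"}}
    return known.get(accession)
```

With this stub, `check_replacement("X0BNP7", "X0B2H1", fake_lookup)` reports no problems, while an accession the stub does not know yields the same kind of message that the web interface shows.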
Turns out we discussed this before in issue #2032, but I didn't document how it was fixed (it probably just used a |
Thanks James. It should be no problem to call the function in UniProtUtil that gets entry details for the new accession before making changes. Let's have a Zoom chat about this sometime. Maybe at the start of January? |
Sure. We can arrange a date in the new year. |
Hi James. Is this a good summary of the call?:
|
@kimrutherford Yep, that's a good summary. |
Thanks @kimrutherford and @jseager7 |
Hi @jseager7, this is the example we discussed this morning. PHI-base/curation#196, https://canto.phi-base.org/curs/543da1f17a17a6d3 Please could you replace the entries with the incorrect UniProt id X0BNP7 (FOMG_00267) with X0B2H1 (FOMG_00587). |
@kimrutherford I've been looking into this problem myself. My current thinking is to solve the problem with Canto's command line, similar to what we did for renaming strains. Right now, a throwaway Perl script could suffice, but I think this problem is likely to reoccur once we have more community curators. The thinking below could probably apply to either solution though.
My suggestion would be to add an option
Since replacing a UniProtKB identifier seems similar to the task of renaming strains, I was wondering whether we could copy and repurpose the code in
Problems
There are lots of added complications for this compared to renaming strains:
Solution
Here's what needs to be done. If I do this myself, I'm probably going to need help identifying the relevant subroutines to do most of this:
|
Hi James.
That sounds sensible. It's been a long time since I looked at it but I think the gene table in the Track database acts only as a cache to prevent calls to the UniProt API. So it's probably safe to ignore the Track database while fixing the identifiers in the sessions. After the ID change(s), if a session with a new UniProt ID is accessed, the ID will be looked up with the UniProt API and then cached. There will be a slight pause but it should all happen transparently. (That's if I'm remembering correctly.) I'm happy to help. I'll make a skeleton example tomorrow to get things started. |
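The caching behaviour described here can be pictured as a read-through cache. The Python sketch below is only an illustration under that assumption: `GeneCache`, `fake_fetch` and the dictionary standing in for the gene table are all hypothetical, and none of it reflects Canto's actual schema or code.

```python
from typing import Callable


class GeneCache:
    """Toy read-through cache keyed by accession (hypothetical)."""

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch     # stands in for a call to the UniProt API
        self._rows = {}         # stands in for the cached gene table
        self.remote_calls = 0   # counts how often the remote lookup ran

    def get(self, accession: str) -> dict:
        if accession not in self._rows:
            # Cache miss: this is where the slight pause would happen.
            self.remote_calls += 1
            self._rows[accession] = self._fetch(accession)
        return self._rows[accession]


def fake_fetch(accession: str) -> dict:
    """Hypothetical stand-in for the remote lookup."""
    return {"accession": accession, "name": "example"}


cache = GeneCache(fake_fetch)
cache.get("X0B2H1")  # first access triggers the remote lookup
cache.get("X0B2H1")  # second access is served from the cache
```

If the real gene table behaves like this, a replaced identifier needs no manual refresh: the first access after the change repopulates the cache.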
As promised, here's the start of a PR to do this. There's no error handling yet and it doesn't update allele identifiers but I hope it gives you an idea of what's needed. I used I'll look at it again after the weekend. |
I've added code to do that: 725e2ee |
In --change-gene-id we use a GeneLookup adaptor to check that the "to" and "from" IDs are available in the main database. Refs #2677
I've made another change (dbaf807) so that now at least the code looks up the IDs that are passed in to make sure that they are real. |
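As a rough illustration of the check-then-change ordering (look up both IDs first, modify the database only if both are real), here is a Python sketch using an in-memory SQLite database. The `gene` table, its `primary_identifier` column and the `known_ids` set are assumptions made for the example and are not taken from Canto's schema. The set stands in for the lookup adaptor.

```python
import sqlite3


def change_gene_id(conn, known_ids, from_id, to_id):
    """Replace from_id with to_id, or raise without touching the database."""
    for gene_id in (from_id, to_id):
        if gene_id not in known_ids:  # stand-in for the lookup adaptor
            raise ValueError(f"unknown identifier: {gene_id}")
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE gene SET primary_identifier = ? "
            "WHERE primary_identifier = ?",
            (to_id, from_id),
        )
    return cur.rowcount  # number of rows that were changed


# Hypothetical one-table database holding a single gene row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (primary_identifier TEXT UNIQUE)")
conn.execute("INSERT INTO gene VALUES ('X0BNP7')")
conn.commit()
```

Because the checks run before the transaction starts, a bad identifier leaves every row as it was.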
@kimrutherford Thanks for all the work here. Is it worth me getting involved with any development work, or should I just wait until testing is needed? |
If you're keen to do any development work, that would be great. Could you test the current branch? That would be very helpful. With something like:
It works fine on the PomBase database but we need to test when UniProt is configured as the source of genes/proteins. |
@kimrutherford I've tested the current branch (as of commit dbaf807) and the code seems to work fine with UniProtKB accessions. I wasn't sure whether to put my findings here or in the PR, but since all the other comments are here I'll follow suit. I got a list of session IDs when running the command that matched the affected sessions every time I tried it.
Replacing with an existing ID
Replacing with an ID that already exists in the Track database works fine:
Canto doesn't make any duplicate genes and reuses the existing row as expected. The old ID is seemingly removed from the
Replacing with a new ID
Replacing with an ID that does not exist in the Track database works fine too:
Invalid and obsolete IDs
I tried replacing with an invalid ID and an obsolete ID and both these cases error safely without changing the database:
# Accession number A0AV18 is obsolete
./canto/script/canto_docker /canto/script/canto_admin.pl --change-gene-id Q00909 A0AV18
An example error message from the first command:
Effects in other tables
Next I checked the effects where the old ID had been used in other tables, such as allele IDs and annotations.
(I1RF58 was used in no other curation sessions, and Q4ING3 hadn't been added to the database.) I found that:
Updating allele names might need some careful thought, especially because PHI-Canto curation is prone to using a gene name that doesn't match UniProtKB (though often that's because UniProtKB has no gene name at all). I'll have to check with the PHI-base team about this, because I can see a risk of nightmare scenarios where the only way to reliably update the allele names will be by manually renaming them after the automated replacement is done.
Other cases
I haven't checked what happens in the case where the target ID (the one that the old ID will be replaced with) has been merged into another ID in UniProtKB. I think this is unlikely to show up in practice since the UniProtKB website redirects to the latest ID when searching for a merged ID, and the lookup in Canto probably handles this transparently if it's querying the right property in the UniProt XML. |
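For the merged-ID case, a UniProt XML entry lists its primary accession first, followed by any secondary (merged) accessions. The Python sketch below shows one way a lookup could map a merged ID to the current one. The sample record and its accessions (`PRIMARY1`, `SECOND1`) are invented placeholders, and whether Canto's lookup works this way has not been verified here.

```python
import xml.etree.ElementTree as ET

# Namespace used by UniProt XML entries.
NS = {"up": "http://uniprot.org/uniprot"}

# Invented sample record; the accessions are placeholders, not real entries.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>PRIMARY1</accession>
    <accession>SECOND1</accession>
    <name>EXAMPLE_TEST</name>
  </entry>
</uniprot>"""


def resolve_accession(xml_text, queried):
    """Return (primary_accession, was_secondary) for the queried accession."""
    root = ET.fromstring(xml_text)
    for entry in root.findall("up:entry", NS):
        accessions = [a.text for a in entry.findall("up:accession", NS)]
        if queried in accessions:
            # The first accession element of an entry is the primary one.
            primary = accessions[0]
            return primary, queried != primary
    return None, False
```

A caller could use the second element of the result to warn that the requested target has been merged, and then store the primary accession instead.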
I'll close this issue once I've tested the new script on real curation sessions on the PHI-Canto production server. |
I've tested this on our server and it works. Only problem is the alleles didn't need to be renamed, which I really should've checked before running the script… I'll open a new issue or pull request to add an option to the script that toggles the allele renaming behaviour. |
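One possible shape for such a toggle is sketched below in Python, purely as an illustration. The `enabled` parameter, the function name and the assumption that allele names start with the gene name are all hypothetical. The example allele names are invented, and only the gene names FOMG_00267 and FOMG_00587 come from this thread.

```python
def rename_alleles(allele_names, old_gene, new_gene, enabled=True):
    """Swap a leading gene-name prefix in allele names, unless disabled."""
    if not enabled:
        # Toggle off: return the names untouched.
        return list(allele_names)
    renamed = []
    for name in allele_names:
        if name.startswith(old_gene):
            renamed.append(new_gene + name[len(old_gene):])
        else:
            # Name does not follow the assumed prefix convention; keep it.
            renamed.append(name)
    return renamed


# Invented allele names, for illustration only.
names = ["FOMG_00267delta", "custom-name"]
with_rename = rename_alleles(names, "FOMG_00267", "FOMG_00587")
without_rename = rename_alleles(names, "FOMG_00267", "FOMG_00587", enabled=False)
```

Leaving names that don't match the prefix untouched keeps curator-chosen names safe, at the cost of needing a manual pass for those alleles.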
(Requested by @CuzickA, possibly related to #2669)
There's been a few curation sessions now where we've needed to change one UniProtKB accession number to another. Usually this is because the accession number is from the wrong species or because the accession is not from the reference proteome / strain.
Unfortunately, most of these changes are needed after alleles, genotypes and metagenotypes have been added to the session, so it's very tedious to make the change manually.
This could be fixed by something as simple as a canto_curs_map script to update the accession number in the database, but there's the added complication that we have to prevent invalid or obsolete UniProtKB accession numbers from being used, and we have to refresh the cached gene information so that the correct gene name and product is displayed.
If tools for performing bulk renames are under consideration, then it would be really useful to have something that could handle UniProtKB accession numbers. I'm hoping most of the code needed for this is already in UniProtUtil. I'd be happy to help with the implementation and testing.