Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow Canto to replace a UniProtKB accession number #2677

Closed
jseager7 opened this issue Nov 30, 2022 · 20 comments
Closed

Allow Canto to replace a UniProtKB accession number #2677

jseager7 opened this issue Nov 30, 2022 · 20 comments
Assignees

Comments

@jseager7
Copy link
Collaborator

(Requested by @CuzickA, possibly related to #2669)

There's been a few curation sessions now where we've needed to change one UniProtKB accession number to another. Usually this is because the accession number is from the wrong species or because the accession is not from the reference proteome / strain.

Unfortunately, most of these changes are needed after alleles, genotypes and metagenotypes have been added to the session, so it's very tedious to make the change manually.

This could be fixed by something as a simple as a canto_curs_map script to update the accession number in the database, but there's the added complication that we have to prevent invalid or obsolete UniProtKB accession numbers from being used, and we have to refresh the cached gene information so that the correct gene name and product is displayed.

If tools for performing bulk renames are under consideration, then it would be really useful to have something that could handle UniProtKB accession numbers. I'm hoping most of the code needed for this is already in UniProtUtil. I'd be happy to help with the implementation and testing.

@jseager7 jseager7 self-assigned this Nov 30, 2022
@jseager7
Copy link
Collaborator Author

Note that this will also be needed if UniProt ever obsoletes an accession that we've used in an existing session.

@kimrutherford
Copy link
Member

This could be fixed by something as a simple as a canto_curs_map script to update the accession number in the database

Yep, I think it should be straight forward if you know the old and new accessions.

there's the added complication that we have to prevent invalid or obsolete UniProtKB accession numbers from being used

Could you explain this bit some more?

@jseager7
Copy link
Collaborator Author

jseager7 commented Dec 12, 2022

Could you explain this bit some more?

My point is that when adding UniProtKB accession numbers using Canto's web interface, it prevents adding accession numbers that are obsolete (i.e. moved to UniParc) or invalid (don't map to any existing accession).

For example, the following obsolete accession number shows a warning message in the web interface:

No genes found for these identifiers: I1RA15

But if the script doesn't query the new accessions against UniProtKB before doing the replacement, then it will be possible to replace a valid accession with an obsolete or non-existent accession, and I don't know whether that will cause problems for the code.

Maybe UniProtUtil can already handle this case by showing only the accession number and leaving the other fields blank. We'd need this case handled anyway, since even UniProtKB accessions that we've already annotated may be at risk of going obsolete in the future, and the gene cache can presumably still be refreshed for sessions containing obsolete accessions.

An initial version of the script could just replace the accession number without any checks, and we might not need the checks at all if Canto doesn't break when an invalid accession number is entered, but it would be nice to have some level of sanity checking since curators entering the 'wrong' accession number could be an unfortunately common occurrence.

@jseager7
Copy link
Collaborator Author

Turns out we discussed this before in issue #2032, but I didn't document how it was fixed (it probably just used a canto_curs_map script). I've closed that issue in favour of this one.

@kimrutherford
Copy link
Member

Thanks James. It should be no problem to call the function in UniProtUtil that gets entry details for the new accession before making changes.

Let's have a Zoom chat about this sometime. Maybe at the start of January?

@jseager7
Copy link
Collaborator Author

Sure. We can arrange a date in the new year.

@kimrutherford
Copy link
Member

We can arrange a date in the new year.

Hi James. Is this a good summary of the call?:

  • James will gather some more details from @CuzickA, like:
    • how often this happens
    • perhaps find a current example that needs fixing
  • we might solve the problem initially with a script
  • a web UI in Canto could come later
  • access to the UI for changing IDs would be from the PHI-Canto gene and organism edit page

@jseager7
Copy link
Collaborator Author

@kimrutherford Yep, that's a good summary.

@CuzickA
Copy link
Collaborator

CuzickA commented Jan 16, 2023

Thanks @kimrutherford and @jseager7
I have a session that I'm still working on that needs a UniProt Id changing. I think it will be easier if I finish checking the session first (which may take some time). I will let you know when its done and we can try and change the UniProt id.

@CuzickA
Copy link
Collaborator

CuzickA commented Jan 22, 2024

Hi @jseager7, this is the example we discussed this morning. PHI-base/curation#196, https://canto.phi-base.org/curs/543da1f17a17a6d3

Please could you replace the entries with the incorrect UniProt id X0BNP7 (FOMG_00267) with X0B2H1 (FOMG_00587).

@jseager7
Copy link
Collaborator Author

jseager7 commented Apr 3, 2024

@kimrutherford I've been looking into this problem myself. My current thinking is to solve the problem with Canto's command line, similar to what we did for renaming strains. Right now, a throwaway Perl script could suffice, but I think this problem is likely to reoccur once we have more community curators. The thinking below could probably apply to either solution though.

My suggestion would be to add an option --replace-uniprot-id (or similar) to canto_admin.pl that has as its arguments 1) the old UniProtKB accession number to replace, and 2) the new UniProtKB accession number to replace with. In future, we might need an extra argument to filter the replacement by curation session, in case we don't want the replacement to apply globally.

Since replacing a UniProtKB identifier seems similar to the task of renaming strains, I was wondering whether we could copy and repurpose the code in Canto::Track::TrackUtil.pm that renames strains (e.g. sub rename_strain and sub _get_strain_rs).

Problems

There are lots of added complications for this compared to renaming strains:

  • The primary_identifier column in the allele table includes the UniProtKB accession number, so presumably they'll need to be updated too, with additional checks needed to make sure the rename doesn't create a conflict with existing identifiers.

  • The code will have to change all the matching UniProtKB IDs that are used in annotation extensions, which means doing text replacement on the extension mapping in the YAML data embedded in the data column of the annotation table in each curation session database.

  • The UniProtKB accession number is repeated in the primary_identifier column of the gene table in the curation session database, so the accession number will need to be updated in every curation session too.

  • If the new UniProtKB accession number (that will replace an old accession number) already exists in the Track database, then presumably we should just delete the row for the old accession number rather than changing the primary_identifier. Otherwise, we could end up with two gene rows with the same primary identifier. Note we should only do this if the UniProtKB accession number is not in use in any other curation sessions.

  • The cached metadata from UniProtKB in the gene table in the Track database will need to be updated for every accession number that's been replaced. An easy fix would be to run Canto::Track::refresh_gene_cache after the replacement, but maybe that's a bit overkill given that we know which accessions need to be queried.

Solution

Here's what needs to be done. If I do this myself, I'm probably going to need help identifying the relevant subroutines to do most of this:

  1. If the new accession number is already in the Track database:
    1. Check if the old accession number is used in any other session; if not, remove it from the Track database.
  2. Else:
    1. Use retrieve_entries and parse_results from Canto::UniProt::UniProtUtil to get the data for the new accession number.
    2. Update all relevant columns in the gene table in the Track database.
  3. For each curation session database:
    1. Update the primary_identifier column in the gene table.
    2. Update the primary_identifier column in the allele table.
    3. For each row in the annotation table, find and replace the old accession number with the new accession number in the extension mapping in the YAML data embedded in the data column.

@kimrutherford
Copy link
Member

Hi James.

My suggestion would be to add an option --replace-uniprot-id (or similar) to canto_admin.pl

That sounds sensible.

It's been a long time since I looked at it but I think the gene table in the Track database acts only as a cache to prevent calls to the UniProt API. So it's probably safe to ignore the Track database while fixing the identifiers in the sessions.

After the ID change(s), if session with a new UniProt ID is accessed, the ID will be looked up with the UniProt API and then cached. There will be a slightly pause but it should all happen transparently. (That's if I'm remembering correctly)

I'm happy to help. I'll make a skeleton example tomorrow to get things started.

@kimrutherford
Copy link
Member

I'll make a skeleton example tomorrow to get things started.

As promised, here's the start of a PR to do this. There's no error handling yet and it doesn't update allele identifiers but I hope it gives you an idea of what's needed.

I used --change-gene-id as the flag since there's a chance it might be useful for PomBase one day.

I'll look at it again after the weekend.

kimrutherford added a commit that referenced this issue Apr 8, 2024
@kimrutherford
Copy link
Member

it doesn't update allele identifiers

I've added code to do that: 725e2ee

kimrutherford added a commit that referenced this issue Apr 8, 2024
In --change-gene-id we use a GeneLookup adaptor to check that the "to"
and "from" IDs are available in the main database.

Refs #2677
@kimrutherford
Copy link
Member

There's no error handling yet

I've made another change (dbaf807) so that now at least the code looks up the IDs that are passed in to make sure that they are real.

@jseager7
Copy link
Collaborator Author

jseager7 commented Apr 8, 2024

@kimrutherford Thanks for all the work here. Is it worth me getting involved with any development work, or should I just wait until testing is needed?

@kimrutherford
Copy link
Member

Thanks for all the work here. Is it worth me getting involved with any development work, or should I just wait until testing is needed?

If you're keen to any development work that would be great.

Could test the current branch? That would be very helpful. With something like:

./script/canto_admin.pl --change-gene-id OLD_ID NEW_ID

It works fine on the PomBase database but we need to test when UniProt is configured as the source of genes/proteins.

@jseager7
Copy link
Collaborator Author

jseager7 commented Apr 9, 2024

@kimrutherford I've tested the current branch (as of commit dbaf807) and the code seems to work fine with UniProtKB accessions.

I wasn't sure whether to put my findings here or in the PR, but since all the other comments are here I'll follow suit.

I got a list of session IDs when running the command that matched the affected sessions every time I tried it.

Replacing with an existing ID

Replacing with an ID that already exists in the Track database works fine:

./canto/script/canto_docker /canto/script/canto_admin.pl --change-gene-id I1RF58 Q00909

Canto doesn't make any duplicate genes and reuses the existing row as expected. The old ID is seemingly removed from the gene table.

Replacing with a new ID

Replacing with an ID that does not exist in the Track database works fine too:

./canto/script/canto_docker /canto/script/canto_admin.pl --change-gene-id A0A0E0SJI5 I1RBR4

Invalid and obsolete IDs

I tried replacing with an invalid ID and an obsolete ID and both these cases error safely without changing the database:

./canto/script/canto_docker /canto/script/canto_admin.pl --change-gene-id Q00909 foobar
# Accession number A0AV18 is obsolete
./canto/script/canto_docker /canto/script/canto_admin.pl --change-gene-id Q00909 A0AV18

An example error message from the first command:

{UNKNOWN}: no gene found in the database for "to" ID foobar at /canto/script/canto_admin.pl line 214

Effects in other tables

Next I checked the effects where the old ID had been used in other tables, such as allele IDs and annotations.

./canto/script/canto_docker /canto/script/canto_admin.pl --change-gene-id I1RF58 Q4ING3

(I1RF58 was used in no other curation sessions, and Q4ING3 hadn't been added to the database.)

I found that:

  • the primary identifier in gene table was updated correctly,,
  • the primary identifier in allele table was updated correctly,
  • the allele name in the allele table was not updated (see below), and
  • the annotation extension was not updated, but that was expected since it still needs to be implemented.

Updating allele names might need some careful thought, especially because PHI-Canto curation is prone to using a gene name that doesn't match UniProtKB (though often that's because UniProtKB has no gene name at all). I'll have to check with the PHI-base team about this, because I can see a risk of nightmare scenarios where the only way to reliably update the allele names will be by manually renaming them after the automated replacement is done.

Other cases

I haven't checked what happens in the case where the target ID (the one that the old ID will be replaced with) has been merged into another ID in UniProtKB. I think this is unlikely to show up in practice since the UniProtKB website redirects to the latest ID when searching for a merged ID, and the lookup in Canto probably handles this transparently if it's querying the right property in the UniProt XML.

@jseager7
Copy link
Collaborator Author

I'll close this issue once I've tested the new script on real curation sessions on the PHI-Canto production server.

@jseager7
Copy link
Collaborator Author

I've tested this on our server and it works. Only problem is the alleles didn't need to be renamed, which I really should've checked before running the script… I'll open a new issue or pull request to add an option to the script that toggles the allele renaming behaviour.

kimrutherford added a commit that referenced this issue Apr 23, 2024
In --change-gene-id we use a GeneLookup adaptor to check that the "to"
and "from" IDs are available in the main database.

Refs #2677
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants