How to get a comprehensive map of identifiers to preferred labels? #207

matentzn (Maintainer) · 2022-06-20T15:01:48Z

This is only tangentially related to SSSOM, as part of a pipeline I am building. I would like to make it easy to augment mapping files with the subject_label and object_label fields. So let's say I get a mapping with subject_id, predicate_id and object_id; I want to add subject_label and object_label for better readability as a postprocessing step (again, not SSSOM per se).
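To make the postprocessing step concrete, here is a minimal sketch, assuming the CURIE-to-label map has already been obtained from some source (which is the open question of this thread). The function names `add_labels` and `annotate_tsv` are hypothetical, the sample row is made up for illustration, and the commented metadata header that real SSSOM TSV files carry is not handled here.

```python
import csv
import io

LABEL_COLUMNS = ("subject_label", "object_label")


def add_labels(rows, labels):
    """Return copies of mapping rows with subject_label/object_label filled in.

    `labels` is a plain dict of CURIE -> preferred label. Labels already
    present in a row are kept; unknown identifiers get an empty label.
    """
    out_rows = []
    for row in rows:
        out = dict(row)
        for side in ("subject", "object"):
            key = f"{side}_label"
            if not out.get(key):
                out[key] = labels.get(row[f"{side}_id"], "")
        out_rows.append(out)
    return out_rows


def annotate_tsv(text, labels):
    """Read mapping rows from TSV text and return TSV text with label columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = add_labels(reader, labels)
    fields = list(reader.fieldnames or [])
    for col in LABEL_COLUMNS:
        if col not in fields:
            fields.append(col)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Illustrative input only; not taken from a real mapping set.
    mapping = (
        "subject_id\tpredicate_id\tobject_id\n"
        "HP:0000118\tskos:exactMatch\tMP:0000001\n"
    )
    known = {"HP:0000118": "Phenotypic abnormality"}
    print(annotate_tsv(mapping, known), end="")
```

Keeping the label lookup as a plain dict means the same step works whichever service or dump ends up supplying the labels.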

The question is: what would be the best way to obtain labels for 80-90% of all commonly used identifiers? @cmungall suggested BioPortal. The problem is that I don't want to hit the BioPortal API with 100K queries just to add labels. Any idea how we should handle this moving forward? Essentially: given a list of 100K identifiers, how do we obtain all their preferred labels?

What tools should we use? What APIs? Do we need to build a new service or extend an existing one?

cc @caufieldjh @graybeal as you might have ideas.

caufieldjh · 2022-06-20T15:16:12Z

My short answer is kg-bioportal: we have a non-public build of this graph, so we could essentially just transform the nodelist into CURIEs plus labels. BioPortal wouldn't cover anything beyond ontologies, though.
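As a rough illustration of the nodelist-to-labels transformation, the sketch below assumes the node list is a tab-separated file with an `id` column holding the CURIE and a `name` column holding the label, as in KGX-style node files. The column names are parameters because the actual layout of the non-public build is not known here, and `nodes_to_label_map` is a hypothetical helper.

```python
import csv
import io


def nodes_to_label_map(handle, id_col="id", label_col="name"):
    """Build a CURIE -> label dict from a tab-separated node list.

    Rows without an identifier or without a label are skipped, and the
    first label seen for an identifier wins.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    labels = {}
    for row in reader:
        curie = (row.get(id_col) or "").strip()
        label = (row.get(label_col) or "").strip()
        if curie and label:
            labels.setdefault(curie, label)
    return labels


if __name__ == "__main__":
    # Illustrative node list; the real file would be opened with open(path).
    sample = io.StringIO(
        "id\tcategory\tname\n"
        "MONDO:0005148\tbiolink:Disease\ttype 2 diabetes mellitus\n"
        "EX:0000001\tbiolink:NamedThing\t\n"
    )
    print(nodes_to_label_map(sample))
```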

Other options:

UniProt ID mapping API / SPARQL
MyGene.info
PubChem, though maybe only through a local copy rather than API

I'm also curious about the extent of the ids we're talking about here - just ontologies? Biomolecule IDs / sequence IDs? Other instance data that's likely to change labels frequently? More general sources, e.g. Wikidata?

English labels only?

1 reply

matentzn Jun 21, 2022
Maintainer Author

I haven't thought that far yet. I guess I am primarily concerned with concepts, like UMLS, OBO ontologies, etc., i.e. the "terminological level".

gaurav · 2022-06-20T16:11:03Z

I'm not sure what our coverage is overall (let alone among the identifiers you're trying), but Translator's Node Normalization tool should be able to do this. You can send it a batch of identifiers, and for each one it will return:

The preferred identifier and label
Other identifiers and labels
Biolink types

You can try it out using our Swagger interface. If you find a large gap in the identifiers we support, you can report it on our issues page, but we are pretty focused on NCATS Translator-related identifiers for now.
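A minimal sketch of such a batch lookup follows. The endpoint URL, the `{"curies": [...]}` request body and the response shape (each input CURIE mapped to an object whose `id` field carries `identifier` and `label`, or to null when unknown) are assumptions here and should be checked against the Swagger interface. The helper names are hypothetical, and the HTTP call is injectable so the parsing logic can be tried without network access.

```python
import json
import urllib.request

# Assumed endpoint; confirm the current URL on the service's Swagger page.
NODENORM_URL = "https://nodenormalization-sri.renci.org/get_normalized_nodes"


def post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def preferred_labels(response):
    """Extract input CURIE -> preferred label from an (assumed) response.

    Identifiers the service does not know (null entries) or that have no
    label are left out of the result.
    """
    labels = {}
    for curie, entry in response.items():
        if not entry:
            continue
        label = (entry.get("id") or {}).get("label")
        if label:
            labels[curie] = label
    return labels


def lookup_labels(curies, fetch=post_json):
    """Look up preferred labels for one batch of CURIEs."""
    response = fetch(NODENORM_URL, {"curies": list(curies)})
    return preferred_labels(response)
```

Calling `lookup_labels(["MONDO:0005148"])` would issue one request for the whole batch; passing a stub as `fetch` keeps the function testable offline.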

4 replies

matentzn Jun 21, 2022
Maintainer Author

Interesting! I will give it a shot!

matentzn Jun 21, 2022
Maintainer Author

Super cool! Can you say a word about scalability? Would you say this service is reliable enough to build an SSSOM annotation tool on? I am thinking of something like sssom label map2.sssom.tsv > map2_label.sssom.tsv, which would annotate potentially large lists of IDs (between 100 and 10K) using Translator.

gaurav Jun 22, 2022

We don't have any hard numbers on NodeNorm's bandwidth, but we can get back 5000 results in ~0.9 seconds, so we think you should be able to get 100K in a minute or less. Please let us know on our issues page if you run into any slowdowns or errors with large requests!
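To spell out the arithmetic: at 5000 identifiers per request and about 0.9 seconds per request, 100K identifiers take 20 sequential requests, or roughly 18 seconds before any network overhead. The sketch below shows one way to batch a large list, assuming requests are sent one after another; `batches`, `estimated_seconds` and `label_in_batches` are hypothetical helpers, and `lookup` stands for any function that maps a list of CURIEs to a dict of labels.

```python
import math


def batches(items, size=5000):
    """Yield consecutive slices of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def estimated_seconds(n_ids, batch_size=5000, seconds_per_batch=0.9):
    """Rough sequential estimate based on the figures quoted above."""
    return math.ceil(n_ids / batch_size) * seconds_per_batch


def label_in_batches(curies, lookup, size=5000):
    """Call `lookup` once per batch and merge the resulting label dicts."""
    labels = {}
    for batch in batches(curies, size):
        labels.update(lookup(batch))
    return labels


if __name__ == "__main__":
    # 100K identifiers -> 20 batches of 5000, about 18 seconds in total.
    print(estimated_seconds(100_000))
```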

matentzn Jun 22, 2022
Maintainer Author

Thank you!

AlasdairGray · 2022-06-21T08:55:17Z

> This is only tangentially related to SSSOM, as part of a pipeline I am building. I would like to make it easy to augment mapping files with the subject_label and object_label fields. So let's say I get a mapping with subject_id, predicate_id and object_id; I want to add subject_label and object_label for better readability as a postprocessing step (again, not SSSOM per se).

Are you envisioning doing this at the time the files are generated? Would the labels be stored in the file and then published?

My PhD was recently working with historic RDF data files in the form of nanopublications and a big challenge we had was getting labels for the data.

3 replies

matentzn Jun 21, 2022
Maintainer Author

No, definitely not by default. I envision a utility operation that can be deliberately invoked to augment an SSSOM file with labels. It's a bit challenging!

AlasdairGray Jun 23, 2022

Yep, definitely challenging. Even more challenging when you are doing it with historic data.

Labels are a really useful utility for humans to go with global identifiers.

matentzn Jun 23, 2022
Maintainer Author

Yeah, agreed.
