Future directions #1

Open
hammer opened this issue May 25, 2019 · 12 comments
Comments

@hammer
Member

hammer commented May 25, 2019

Musing about a few things we could do for the next manuscript update

  • Extend analysis to full open access subset of PubMed + bioRxiv
  • Include metrics about how label functions improve over time, as well as a qualitative description of lessons learned
  • Link cytokine, TF, and cell type terms and induce/secrete relations to external ontologies (e.g. PRO, CL, GO)
  • Make use of cytokine family information somehow
  • Try to identify emerging T cell types
  • Add disease association
  • Add chemokine receptor expression and other markers of spatial localization
  • Make use of the manually collected Flow Repository and GEO data sets that support the T cell types somehow
  • Identify a prediction made by the model to test in the laboratory
@hammer
Member Author

hammer commented May 28, 2019

For the integration w/ external ontologies, I was surprised to learn the iX paper used their own "ontology" of cytokines and their receptors. It's available via ImmPort's Cytokine Registry, in case you haven't seen it yet.

@eric-czech
Member

eric-czech commented May 28, 2019

A couple other possibilities:

  • Store mentions by section of paper a la CancerMine to make it possible to segregate new and old findings
  • Consider using word embeddings to resolve cytokine, TF, and/or cell type mentions instead of fuzzy matching against giant lists of synonyms, similar to how WordNet is used in Synonym Expansion for Large Shopping Taxonomies to resolve context-independent terms to nodes in a taxonomy. (Context-independence is a safe assumption for cytokines and cell types, I think, but I'm not so sure about TFs with synonyms like "genesis" or "NER".)
  • See how much of this information has already been captured by sciBERT
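To make the embedding idea above a bit more concrete, here is a minimal sketch of resolving a mention to its nearest taxonomy node by cosine similarity. Everything in it is an assumption for illustration: the node IDs and labels are made up, the similarity threshold is arbitrary, and character-trigram counts stand in for learned word embeddings so the example runs without any model download.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Stand-in embedding: character trigram counts.
    A real system would use learned vectors here instead."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical project-local taxonomy nodes (IDs and labels are illustrative only).
TAXONOMY = {
    "CK:IL2": "interleukin-2",
    "CK:IFNG": "interferon gamma",
    "CT:TH17": "T helper 17 cell",
}
NODE_VECTORS = {node_id: embed(label) for node_id, label in TAXONOMY.items()}

def resolve(mention: str, min_similarity: float = 0.3):
    """Return the ID of the most similar node, or None if nothing is close enough."""
    vec = embed(mention)
    best_id, best_sim = max(
        ((node_id, cosine(vec, node_vec)) for node_id, node_vec in NODE_VECTORS.items()),
        key=lambda pair: pair[1],
    )
    return best_id if best_sim >= min_similarity else None

print(resolve("interleukin 2"))
print(resolve("xyzzy"))
```

Swapping `embed` for a function that returns vectors from a pretrained model would leave the rest of the lookup unchanged, which is the main appeal over maintaining synonym lists by hand.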

@eric-czech
Member

Oh yea I did see that but not until I had already gone more down the MyGene path. I should probably have used that registry instead. Looking at CL though I can't seem to find newer/rarer cell types like tissue resident or stem memory T cells so maybe it's always a good idea to assume an identifier space specific to the project and then map to an external ontology where possible?
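A rough sketch of what "project-specific identifier space, mapped to an external ontology where possible" could look like in code. The class, the local IDs, and the external ID are all hypothetical placeholders, not real CL identifiers.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class CellType:
    local_id: str                       # always present: project-specific identifier
    label: str
    external_id: Optional[str] = None   # filled in only when an external ontology term exists

# Hypothetical registry; the external ID below is a placeholder, not a real CL term.
REGISTRY = [
    CellType("TCELL:0001", "regulatory T cell", external_id="CL:PLACEHOLDER"),
    CellType("TCELL:0002", "stem memory T cell"),
]

def unmapped(registry: List[CellType]) -> List[CellType]:
    """Cell types that exist only in the project's own identifier space."""
    return [cell_type for cell_type in registry if cell_type.external_id is None]

print([cell_type.label for cell_type in unmapped(REGISTRY)])
```

Keeping the external mapping optional means newer or rarer cell types can be tracked immediately and linked later if the external ontology catches up.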

@hammer
Member Author

hammer commented May 28, 2019

> Store mentions by section of paper a la CancerMine to make it possible to segregate new and old findings

Good point on partitioning by section of paper! Open Targets does the same thing. They use a horrifying Perl script called SectionTagger, described in the paper "Section level search functionality in Europe PMC" (2015).

We can also use paper metadata to distinguish e.g. reviews from new research results.
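As a sketch of how section tags and paper metadata might be combined, assuming each stored mention carries a section name and a publication type. The field names, the section vocabulary, and the toy rows are assumptions; this is not how SectionTagger or Open Targets represent things.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    doc_id: str
    section: str            # e.g. "Introduction", "Results" (hypothetical vocabulary)
    publication_type: str   # e.g. "research-article", "review" (hypothetical vocabulary)
    text: str

# Assumption: only these sections are treated as reporting new findings.
NEW_FINDING_SECTIONS = {"results"}

def is_new_finding(mention: Mention) -> bool:
    """Keep mentions from non-review papers that appear in a 'new findings' section."""
    return (mention.publication_type != "review"
            and mention.section.lower() in NEW_FINDING_SECTIONS)

# Toy data for illustration only.
MENTIONS = [
    Mention("doc1", "Introduction", "research-article", "Th17 cells secrete IL-17"),
    Mention("doc1", "Results", "research-article", "IL-6 induced Th17 differentiation"),
    Mention("doc2", "Results", "review", "Th1 cells secrete IFN-gamma"),
]

new_findings = [m for m in MENTIONS if is_new_finding(m)]
print([m.text for m in new_findings])
```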

@hammer
Member Author

hammer commented May 28, 2019

> Looking at CL though I can't seem to find newer/rarer cell types like tissue resident or stem memory T cells so maybe it's always a good idea to assume an identifier space specific to the project and then map to an external ontology where possible?

Yeah, did you read the sections in the methods and supplementary notes of the iX paper where they explain how they use "seed phrases" and incremental expansion of those phrases to identify cell types and then map them back to CL? CL definitely seems like a pretty poor ontology relative to GO and others.

@eric-czech
Member

eric-czech commented May 28, 2019

Oh wow that's a gnarly script.

Re: cell type matching -- I didn't see that when I was just hoping to find synonyms to pull in, but after a closer look I think they're essentially saying they simply matched against the CL names + synonyms in a way where order doesn't matter with a preference for the most specific concepts (and if the candidate has a string not in any name/synonym it is ignored). They explain a scoring system they put together for the candidate matches in Supplementary Note 3 but then at the end of it say they ultimately only take matches with a perfect score of 1. Given the definition of the score, it seems to me like that would then barely be any different than searching for the strings for the names/synonyms directly. Presumably the precision drops off a cliff with a threshold any lower than that. The seed + typed dependencies idea is cool though, maybe there's a way to build on that.
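To illustrate why a threshold of 1 collapses to plain name/synonym lookup, here is a small sketch of the matching as I read it. It is only an interpretation: the ontology entries and IDs are invented, and the score is a hypothetical token-coverage ratio, not the score defined in Supplementary Note 3.

```python
# Hypothetical ontology: ID -> names and synonyms (not real CL entries).
ONTOLOGY = {
    "X:1": ["T cell", "T lymphocyte"],
    "X:2": ["memory T cell"],
}

def tokens(text: str) -> frozenset:
    """Order-independent representation of a phrase."""
    return frozenset(text.lower().split())

INDEX = {tokens(name): concept_id for concept_id, names in ONTOLOGY.items() for name in names}
VOCAB = set().union(*INDEX)  # every token that appears in some name or synonym

def match(candidate: str, min_score: float = 1.0):
    """Return the best-covering concept ID, or None.

    Candidates containing a token outside VOCAB are ignored. The score is the
    fraction of candidate tokens covered by a name (assumed definition), so
    longer, more specific names win.
    """
    cand = tokens(candidate)
    if not cand or not cand <= VOCAB:
        return None
    best = None
    for name_tokens, concept_id in INDEX.items():
        if name_tokens <= cand:
            score = len(name_tokens) / len(cand)
            if best is None or score > best[0]:
                best = (score, concept_id)
    return best[1] if best and best[0] >= min_score else None

print(match("cell T memory"))                        # word order does not matter
print(match("memory T lymphocyte"))                  # no name covers every token
print(match("memory T lymphocyte", min_score=0.6))   # a looser threshold accepts a partial match
```

With `min_score=1.0` the function only accepts candidates whose token set equals some name's token set, which is string lookup up to word order; anything looser starts accepting partial matches.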

@hammer
Member Author

hammer commented Jun 6, 2019

It could also be interesting to look at coreference resolution and relation extraction across sentence boundaries, e.g. Inter-sentence Relation Extraction for Associating Biological Context with Events in Biomedical Texts. This will likely be very hard, so it's probably not something to explore now; just filing it away for later.

@hammer
Member Author

hammer commented Jun 6, 2019

I was doing some thinking about our upcoming call w/ @ajratner after our discussion yesterday. Two topics we discussed: are there ways to automate label function generation, and could we make use of science corpus-specific training data in the models that feed the label functions and relation extractor?

For the first topic, I found two papers from Paroma Varma, another Chris Ré student, that may have some ideas for us. I think both papers describe the same system, called "Snuba" in one and "Reef" in the other. Code is at https://github.com/HazyResearch/reef. Paroma only lists the Snuba paper on her website, so maybe just read that one. We should also bring up w/ Alex the idea of making use of structures used by QA systems to inform candidate heuristic generation (in the language of Reef/Snuba).
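For intuition only, here is a toy sketch of what automated candidate heuristic generation could look like: enumerate simple keyword heuristics and keep the ones that are accurate on a small labeled set. This is not the Snuba/Reef algorithm and does not use its code; the labels, sentences, thresholds, and function names are all made up for illustration.

```python
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

# Tiny hypothetical labeled set: (sentence, does it express a "secretes" relation?)
DEV_SET = [
    ("Th17 cells secrete IL-17", POSITIVE),
    ("Th1 cells secrete IFN-gamma", POSITIVE),
    ("IL-17 was not secreted by Th2 cells", NEGATIVE),
    ("IL-4 did not affect Th1 cells", NEGATIVE),
]

def make_keyword_lf(keyword, label):
    """Build a heuristic that votes `label` when `keyword` is present, else abstains."""
    def lf(sentence):
        return label if keyword in sentence.lower().split() else ABSTAIN
    lf.__name__ = f"lf_{keyword}_{label}"
    return lf

def generate_lfs(dev_set, min_accuracy=1.0, min_coverage=2):
    """Enumerate keyword heuristics and keep those accurate on the examples they cover."""
    vocab = {tok for sentence, _ in dev_set for tok in sentence.lower().split()}
    kept = []
    for keyword in sorted(vocab):
        for label in (POSITIVE, NEGATIVE):
            lf = make_keyword_lf(keyword, label)
            covered = [(lf(s), y) for s, y in dev_set if lf(s) != ABSTAIN]
            if len(covered) < min_coverage:
                continue  # too rare to evaluate on this set
            accuracy = sum(vote == y for vote, y in covered) / len(covered)
            if accuracy >= min_accuracy:
                kept.append(lf)
    return kept

selected = generate_lfs(DEV_SET)
print([lf.__name__ for lf in selected])
```

The interesting question for the call is what the candidate space should be beyond keywords, e.g. whether structures used by QA systems could define it.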

For the second topic, I reread the ScispaCy and SciBERT papers last night and feel like we have to get some lift from making use of these pretrained models, even if they don't have precisely the entity types we need for NER. Table 8 in the ScispaCy paper shows that their custom rule-based tokenizer and domain-specific sentence segmenter massively improve the basic task of sentence segmentation. I wish they gave more details on the en_ner_jnlpba_md training process (e.g. do they fine-tune one of their en_core_sci models? Edit: looks like the code is in train_specialised_ner.py), as it would be interesting to see if we can use the same procedure to make an en_ner_tcellrel model for spaCy. Finally, Table 3 from the SciBERT paper shows a huge lift for the same ChemProt relation extraction task that Snorkel uses as a demo/example, so it seems to me that if we replace the Snorkel biLSTM w/ a model that uses SciBERT embeddings we should do a lot better for free.

@hammer
Member Author

hammer commented Jun 6, 2019

One more project that may be interesting to discuss w/ Alex: Babble Labble. Perhaps writing label functions would be less laborious if you were authoring them in natural language? More interesting than that, though, is whatever data structure represents the parsed natural language that is used to generate the label function. That intermediate representation could be the right target for candidate heuristic generation.

@hammer
Member Author

hammer commented Jun 10, 2019

Doing a bit more reading on fine-grained entity recognition. The foundational paper in this field seems to be Fine-grained entity recognition (2012) by Xiao Ling and Daniel Weld. What's interesting for us is that they do fine-grained NER in support of relation extraction and show a significant gain in performance on RE.

@hammer
Member Author

hammer commented Jun 12, 2019

FWIW I decided to look into the GO terms that correspond to the relations we're learning and I think these map to secretion and differentiation, though I don't know if there's a way to be more specific about cytokine-induced differentiation versus TF-induced differentiation...

@hammer
Member Author

hammer commented Jun 12, 2019

Oh I should also dump the Python libraries I've seen for working w/ ontologies

2 participants