Filter variants by gene symbol #12

Open: wants to merge 4 commits into base: development

@KoalaQin KoalaQin commented Jan 7, 2025

This changes the code back to using the Browser tables, since they apply multi-step processing to the GENCODE table. A few reasons to use their gene table instead of our GENCODE table:

  1. They retrieve the gene symbol from HGNC and fall back to the GENCODE gene name when HGNC has no symbol (however, ~1924 symbols differ between HGNC and GENCODE; should we support gene ID instead?).
  2. They merge the intervals for CDS, UTR and exons step by step.
  3. PAR_Y genes are already removed.
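
To make points 1 and 2 concrete, here is a minimal sketch in plain Python rather than Hail. It is not the Browser's actual pipeline: the function names are hypothetical, intervals are assumed to be inclusive (start, end) pairs on a single chromosome, and treating directly adjacent intervals as mergeable is an assumption.

```python
from typing import List, Optional, Tuple

# (start, end), both inclusive, on a single chromosome (assumption).
Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent inclusive intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def pick_symbol(hgnc_symbol: Optional[str], gencode_gene_name: str) -> str:
    """Prefer the HGNC symbol; fall back to the GENCODE gene name."""
    return hgnc_symbol if hgnc_symbol else gencode_gene_name
```

Merging "step by step" would then amount to calling `merge_intervals` once per feature type (CDS, UTR, exon) and, if needed, once more on the combined list.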

We had a discussion about the display of exon-centric variants in coding and non-coding genes.
Note: this code is a bit slow; for DRD2, it takes ~40s to finish locally.

@KoalaQin KoalaQin changed the base branch from main to development January 7, 2025 17:54
@KoalaQin KoalaQin self-assigned this Jan 7, 2025
@KoalaQin KoalaQin requested a review from jkgoodrich January 7, 2025 17:57
@jkgoodrich
Contributor

jkgoodrich commented Jan 8, 2025

Can you determine a way to make this run faster? I think we absolutely need this to run locally as fast as the current code does. Is there a place where these differences are clearly outlined? I don't like the idea of having resources that differ like this without clear documentation of what the differences are, and ideally an HT that can map exactly from one to the other. For instance, in our research we use this GENCODE HT; is this indicating that we should not be?
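
As a rough illustration of what such a mapping could look like, the sketch below lists the gene IDs whose symbol differs between the two resources. It uses plain dictionaries instead of Hail Tables, and the function name and the gene ID to symbol layout are assumptions, not the structure of either existing table.

```python
from typing import Dict, Optional, Tuple


def symbol_differences(
    browser_symbols: Dict[str, str],
    gencode_symbols: Dict[str, str],
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map gene ID -> (browser symbol, GENCODE symbol) where the two differ.

    A gene ID present in only one resource is reported with None on the
    missing side.
    """
    diffs: Dict[str, Tuple[Optional[str], Optional[str]]] = {}
    for gene_id in sorted(set(browser_symbols) | set(gencode_symbols)):
        browser = browser_symbols.get(gene_id)
        gencode = gencode_symbols.get(gene_id)
        if browser != gencode:
            diffs[gene_id] = (browser, gencode)
    return diffs
```

Keying on gene ID rather than symbol keeps the mapping exact even where the symbols disagree.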

@KoalaQin
Author

KoalaQin commented Jan 8, 2025

The gene symbol shouldn't be a big issue. They may have reasons to prefer HGNC symbols, and they do show "Symbol in GENCODE v35" on the gene page (v35 is a typo there; it should be v39), like here.
Maybe we should check whether we get the same number of variants without merging overlapping intervals, even for non-protein-coding genes.
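
A minimal sketch of that check, again in plain Python with hypothetical function names and toy inclusive intervals rather than the real tables: the set of distinct variant positions should not change when overlapping intervals are merged, while a per-interval count (what an unmerged join might produce) can count a position more than once.

```python
from typing import Iterable, List, Set, Tuple

# (start, end), both inclusive, on a single chromosome (assumption).
Interval = Tuple[int, int]


def variants_in_intervals(positions: Iterable[int], intervals: List[Interval]) -> Set[int]:
    """Distinct positions that fall in at least one interval."""
    return {p for p in positions if any(s <= p <= e for s, e in intervals)}


def interval_hits(positions: Iterable[int], intervals: List[Interval]) -> int:
    """Number of (position, interval) pairs; overlaps count a position repeatedly."""
    return sum(1 for p in positions for s, e in intervals if s <= p <= e)


positions = [120, 180, 300]           # toy variant positions
unmerged = [(100, 200), (150, 250)]   # two overlapping toy exons
merged = [(100, 250)]                 # the same region after merging

same_variants = variants_in_intervals(positions, unmerged) == variants_in_intervals(positions, merged)
extra_hits = interval_hits(positions, unmerged) - interval_hits(positions, merged)
```

If the distinct-variant sets match for every gene, any difference in counts comes from duplicated rows rather than from missing variants.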

2 participants