Skip to content

Commit

Permalink
Merge pull request #159 from monarch-initiative/variant-predicate
Browse files Browse the repository at this point in the history
Implement variant predicates and allele counter
  • Loading branch information
ielis authored Jul 22, 2024
2 parents b76f6ff + 3f7d58b commit b794205
Show file tree
Hide file tree
Showing 34 changed files with 4,865 additions and 1,847 deletions.
2 changes: 1 addition & 1 deletion docs/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -191,7 +191,7 @@
intersphinx_mapping = {
"python": ("https://docs.python.org/3/", None),
# TODO - change to stable when we arrive there
"hpotk": ("https://thejacksonlaboratory.github.io/hpo-toolkit/latest/", None),
"hpotk": ("https://ielis.github.io/hpo-toolkit/latest/", None),
"pandas": ("https://pandas.pydata.org/pandas-docs/version/2.0.0/", None),
"requests": ("https://docs.python-requests.org/en/stable/", None),
"scipy": ("https://docs.scipy.org/doc/scipy-1.11.0/", None),
Expand Down
48 changes: 40 additions & 8 deletions docs/tutorial.rst
Original file line number Diff line number Diff line change
Expand Up @@ -55,23 +55,55 @@ We can then view the data using the list commands.
[('FRAMESHIFT_VARIANT', 1), ('MISSENSE_VARIANT', 1)]

Using the counts, we can choose and run what analyses we want.

For instance, we can partition the patients into two groups based on presence/absence of a *missense* variant:

.. doctest:: tutorial

>>> import pandas as pd
>>> pd.set_option('expand_frame_repr', False)
>>> from genophenocorr.analysis.predicate.genotype import VariantPredicates
>>> from genophenocorr.model import VariantEffect

>>> missense_in_tx = VariantPredicates.variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id='NM_1234.5')

We created a predicate `missense_in_tx` that tests
if a variant leads to an aminoacid change in a transcript `NM_1234.5`.
We will use the predicate in the downstream analysis to assign the patients
into a group and test for genotype-phenotype correlation between the groups.

.. note::

There are many other ways to set up a predicate for testing
for genotype-phenotype correlation.
See the :ref:`predicates` section to learn more about building
a predicate for the analysis of interest.

Next, we will prepare and run the genotype-phenotype analysis
to test the correlation between phenotypes and presence of a missense variant.

.. doctest:: tutorial

>>> from genophenocorr.analysis import configure_cohort_analysis
>>> from genophenocorr.analysis.predicate import PatientCategories # TODO - explain the predicate or update the API
>>> from genophenocorr.model import VariantEffect

>>> cohort_analysis = configure_cohort_analysis(cohort, hpo)
>>> missense = cohort_analysis.compare_hpo_vs_genotype(missense_in_tx)

We prepared the `cohort_analysis` and we ran the analysis to test for correlation
between the clinical signs and symptoms of patients encoded into HPO terms
and presence of a missense variant in the patient.

We stored the results into `missense` variable. We can summarize the results
by showing a data frame with results for each tested HPO term.
We show the first two terms here:

>>> cohort_analysis = configure_cohort_analysis(cohort, hpo)
>>> missense = cohort_analysis.compare_by_variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id='NM_1234.5')
>>> import pandas as pd
>>> pd.set_option('expand_frame_repr', False)
>>> summary_df = missense.summarize(hpo, PatientCategories.YES)
>>> summary_df.head(1) # doctest: +NORMALIZE_WHITESPACE
>>> summary_df.head(2) # doctest: +NORMALIZE_WHITESPACE
MISSENSE_VARIANT on NM_1234.5 Yes No
Count Percent Count Percent p value Corrected p value
Arachnodactyly [HP:0001166] 13/16 81% 1/10 10% 0.000781 0.020299
Count Percent Count Percent p value Corrected p value
Arachnodactyly [HP:0001166] 13/16 81% 1/10 10% 0.000781 0.020299
Spasticity [HP:0001257] 11/16 69% 6/10 60% 0.692449 1.000000

..
Expand Down
1 change: 1 addition & 0 deletions docs/user-guide/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -11,3 +11,4 @@ TODO - write high level overview and bridge to individual sections.
:caption: Contents:

input-data
predicates
3 changes: 2 additions & 1 deletion docs/user-guide/input-data.rst
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,8 @@ Next, let's get a `CohortCreator` for loading the phenopackets. We use the

We can create a cohort starting from a folder with phenopackets stored as JSON files.
For the purpose of this example, we will use a folder `simple_cohort` with 5 example phenopackets located in
`genophenocorr's repository <https://github.com/monarch-initiative/genophenocorr/tree/main/docs/data/simple_cohort>`_
`docs/data/simple_cohort <https://github.com/monarch-initiative/genophenocorr/tree/main/docs/data/simple_cohort>`_
folder of the `genophenocorr` repository:

.. doctest:: input-data

Expand Down
110 changes: 110 additions & 0 deletions docs/user-guide/predicates.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,110 @@
.. _predicates:

==========
Predicates
==========

Genophenocorr uses predicates to test if variant, patient, or any tested item
meets a condition. Based on the test results, the items are assigned into groups.

As described in the :class:`genophenocorr.analysis.predicate.PolyPredicate` API,
the groups must be *exclusive* - the item can be assigned with one and only one group,
and *exhaustive* - the groups must cover all possible scenarios.

However, if the item cannot be assigned into any meaningful category,
the predicate can return `None`, and the item will be omitted from the analysis.

The predicates can be chained to test for more complex conditions.
For instance, "test if a patient has a missense or synonymous variant located in exon 6 of transcript `NM_013275.6`".

Let's demonstrate this on an example with a :class:`genophenocorr.analysis.predicate.genotype.VariantPredicate`.
We will load a cohort of 5 subjects with variants in *ANKRD11*, leading to KBG syndrome.
The the clinical signs and symptoms of the subjects were encoded into HPO terms
along with the pathogenic *ANKRD11* variant.

Let's load the phenopackets, as previously described in greater detail the :ref:`input-data` section.

First, we load HPO using HPO toolkit:

>>> import hpotk
>>> store = hpotk.configure_ontology_store()
>>> hpo = store.load_minimal_hpo(release='v2024-03-06')

then, we will configure the cohort creator:

>>> from genophenocorr.preprocessing import configure_caching_cohort_creator, load_phenopacket_folder
>>> cohort_creator = configure_caching_cohort_creator(hpo)

last, we will load the cohort from a directory with phenopackets:

>>> import os
>>> cohort_dir = os.path.join('docs', 'data', 'simple_cohort')
>>> cohort = load_phenopacket_folder(cohort_dir, cohort_creator)
>>> len(cohort)
5

We loaded a cohort of 5 patients.

Let's use the variant ``16_89281397_89281397_G_C`` that corresponds
to a pathogenic variant `VCV001029215.1 <https://www.ncbi.nlm.nih.gov/clinvar/variation/1029215/>`_
that replaces the tyrosine encoded by the 1715th codon of `NM_013275.6` with a premature stop codon: ``NM_013275.6(ANKRD11):c.5145C>G (p.Tyr1715Ter)``.

>>> variant_key_of_interest = '16_89281397_89281397_G_C'
>>> variant = cohort.get_variant_by_key(variant_key_of_interest)

We will use the variant to exemplify the predicate API.

Simple predicates
*****************

We can check that the variant overlaps with *ANKRD11*:

>>> from genophenocorr.analysis.predicate.genotype import VariantPredicates
>>> gene = VariantPredicates.gene('ANKRD11')
>>> gene.test(variant)
True

it overlaps with the *MANE* transcript:

>>> ankrd11_mane_tx_id = 'NM_013275.6'
>>> tx = VariantPredicates.transcript(ankrd11_mane_tx_id)
>>> tx.test(variant)
True

it in fact overlaps with the exon 9:

>>> exon9 = VariantPredicates.exon(exon=9, tx_id=ankrd11_mane_tx_id)
>>> exon9.test(variant)
True

and it is predicted to introduce a premature termination codon to the MANE transcript:

>>> from genophenocorr.model import VariantEffect
>>> nonsense = VariantPredicates.variant_effect(VariantEffect.STOP_GAINED, tx_id=ankrd11_mane_tx_id)
>>> nonsense.test(variant)
True

See :class:`genophenocorr.analysis.predicate.genotype.VariantPredicates`
for more info on the predicates available off the shelf.


Compound predicates
*******************

The simple predicates can be combined to test for more elaborate conditions.
For instance, we can test if the variant meets *any* or several conditions:

>>> missense = VariantPredicates.variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id=ankrd11_mane_tx_id)
>>> missense_or_nonsense = missense | nonsense
>>> missense_or_nonsense.test(variant)
True

or *all* conditions:

>>> nonsense_and_exon9 = nonsense & exon9
>>> nonsense_and_exon9.test(variant)
True

The `VariantPredicate` overloads Python `&` (AND) and `|` (OR) operators to build a compound predicate from a building block.

That's it for predicates.
Loading

0 comments on commit b794205

Please sign in to comment.