Skip to content

Commit

Permalink
Merge pull request #221 from monarch-initiative/ielis/issue194
Browse files Browse the repository at this point in the history
Improve phenotype and MTC filtering
  • Loading branch information
ielis authored Aug 20, 2024
2 parents bcfcccf + 6ba9b22 commit 2b745bf
Show file tree
Hide file tree
Showing 16 changed files with 803 additions and 767 deletions.
8 changes: 4 additions & 4 deletions docs/user-guide/general_hpo_terms.rst
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
.. _general_hpo_terms:
.. _general-hpo-terms:

=================
General HPO terms
Expand All @@ -25,15 +25,15 @@ We call these terms `level 2`.

Many but not all terms on `level 3` are also "general" in this sense,
e.g., `Abnormality of the parathyroid gland (HP:0000828) <https://hpo.jax.org/browse/term/HP:0000828>`_.
We also skip such `level 3` terms. Our heursitic for identifying these terms is to search the primary label for the word
We also skip such `level 3` terms. Our HPO MTC filter for identifying these terms is to search the primary label for the word
``'Abnormal'``, and remove all terms that start with this string.

According to our heuristic sampler strategy (see :ref:`mtc`), we stop at `level 3`,
According to the HPO MTC filter strategy (see :ref:`mtc`), we stop at `level 3`,
because the terms at lower levels are more likely to be specific,
even if they contain the word "Abnormal" or "Abnormality" in their label.
Users can adapt the procedure if desired.

The following table shows a list of HPO terms that are removed from consideration by the heuristic sampler strategy.
The following table shows a list of HPO terms that are removed from consideration by the HPO MTC filter strategy.



Expand Down
3 changes: 1 addition & 2 deletions docs/user-guide/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,6 @@ TODO - write high level overview and bridge to individual sections.
input-data
exploratory
predicates
stats
mtc
general_hpo_terms
stats
autosomal_recessive
120 changes: 73 additions & 47 deletions docs/user-guide/mtc.rst
Original file line number Diff line number Diff line change
Expand Up @@ -7,11 +7,12 @@ Multiple-testing correction
Background
~~~~~~~~~~

A p-value is the probability that a test result, under the null hypothesis, assumes the observed or a more extreme value. It is important to realize that if we
perform many tests, we are likely to get a "significant" result by chance alone. For instance, if
we test a null hypothesis that is true using a significance level of
:math:`\alpha = 0.05`, then there is a probability of :math:`1-\alpha = 0.95` of
arriving at a correct conclusion of non-significance. If we now test
A p-value is the probability that a test result, under the null hypothesis,
assumes the observed or a more extreme value. It is important to realize that if we
perform many tests, we are likely to get a "significant" result by chance alone.
For instance, if we test a null hypothesis that is true using a significance level
of :math:`\alpha = 0.05`, then there is a probability of :math:`1-\alpha = 0.95`
of arriving at a correct conclusion of non-significance. If we now test
two independent true null hypotheses, the probability that neither
test is significant is :math:`0.95\times 0.95 = 0.90.` If we test 20
independent null hypotheses, the probability that none will be
Expand All @@ -26,25 +27,24 @@ least one significant result.

Implementation in genophenocorr
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, genephenocorr performs a hypothesis test for each HPO term found at least twice
in the cohort, meaning that we may
perform up to hundreds of tests. Therefore, unless we take into
account the fact that multiple statistical tests are being performed,
it is likely that we will obtain one or more false-positive
results.
in the cohort, meaning that we may perform up to hundreds of tests.
Therefore, unless we take into account the fact that multiple statistical tests are being performed,
it is likely that we will obtain one or more false-positive results.

Genephenocorr offers two approaches to mitigate this problem: multiple-testing correction procedures
and heuristic methods to reduce the number of terms to be tested.
Genephenocorr offers two approaches to mitigate this problem: multiple-testing correction (MTC) procedures
and MTC filters to choose the terms to be tested.

Here we will show how to configure the MTC approach
using :class:`genophenocorr.analysis.CohortAnalysisConfiguration` class.
using :class:`~genophenocorr.analysis.CohortAnalysisConfiguration` class.

>>> from genophenocorr.analysis import CohortAnalysisConfiguration
>>> config = CohortAnalysisConfiguration()


Multiple-testing correction (MTC) procedures
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Multiple-testing correction procedures
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A number of MTC procedures have
been developed to limit the probability of false-positive results. The
Expand Down Expand Up @@ -94,7 +94,8 @@ by the number of tests performed (the result is capped at 1.0). This is the *def

Alternatively, procedures that control the false-discovery rate (FDR),
limit the proportion of significant results that are type I
errors (false discoveries). The Benjamini and Hochberg method (``fdr_bh``) is probably the most commonly used one.
errors (false discoveries).
The Benjamini and Hochberg method (``fdr_bh``) is probably the most commonly used one.

This is how we can set an alternative MTC correction procedure:

Expand All @@ -103,66 +104,92 @@ This is how we can set an alternative MTC correction procedure:
'fdr_bh'


MTC Heuristics: Choosing which terms to test
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MTC filters: Choosing which terms to test
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We can reduce the overall MTC burden by choosing which terms to test.
For example, if we choose to test only ten terms out of 450,
then the mutliplication factor of the Bonferroni correction
is only 10 instead of 450, and more p-values
may "survive" the multiple-testing correction.

The genephenocorr package offers two options,
each of which can be combined with any of the MTC methods.
In the context of GPSEA, we represent the concept of phenotype filtering
by :class:`~genophenocorr.analysis.PhenotypeMtcFilter`.
We describe the three filtering strategies in the next sections.


specify terms strategy
~~~~~~~~~~~~~~~~~~~~~~
.. _use-all-terms-strategy:

If you have a specific hypothesis as to which terms may be different between groups,
then you can specify these terms using the ``specify_terms_strategy`` function.
Test all terms
--------------

For example if we want to specifically test
* `Abnormal putamen morphology (HP:0031982) <https://hpo.jax.org/browse/term/HP:0031982>`_ and
* `Abnormal caudate nucleus morphology (HP:0002339) <https://hpo.jax.org/browse/term/HP:0002339>`_,,
The first MTC filtering strategy is the simplest - do not apply any filtering at all.
This will result in testing all terms. We do not recommend this strategy,
but it can be useful to disable MTC filtering.

The strategy is invoked by default,
or explicitly by :func:`~genophenocorr.analysis.CohortAnalysisConfiguration.all_terms_strategy` method:

>>> config.all_terms_strategy()
>>> config.mtc_strategy
<MtcStrategy.ALL_PHENOTYPE_TERMS: 0>

we pass an iterable with these two terms
as an argument to :func:`genophenocorr.analysis.CohortAnalysisConfiguration.specify_terms_strategy` method:

.. _specify-terms-strategy:

>>> abn_putamen = "HP:0031982" # Abnormal putamen morphology
>>> abn_caudate_nucleus = "HP:0002339" # Abnormal caudate nucleus morphology
>>> config.specify_terms_strategy(specified_term_set=(abn_putamen, abn_caudate_nucleus))
Specify terms strategy
----------------------

Later, when using the `config` object in analysis,
this will cause genophenocorr to perform only two hypothesis tests, one for each of the two terms
In presence of a specific hypothesis as to which terms may be different between groups,
then you can specify these terms using
the :func:`~genophenocorr.analysis.CohortAnalysisConfiguration.specify_terms_strategy` method.

For example if we want to specifically test
`Abnormal putamen morphology (HP:0031982) <https://hpo.jax.org/browse/term/HP:0031982>`_ and
`Abnormal caudate nucleus morphology (HP:0002339) <https://hpo.jax.org/browse/term/HP:0002339>`_
we pass an iterable (e.g. a tuple) with these two terms as an argument:

>>> config.specify_terms_strategy(
... terms_to_test=(
... "HP:0031982", # Abnormal putamen morphology
... "HP:0002339", # Abnormal caudate nucleus morphology
... )
... )
>>> config.mtc_strategy
<MTC_Strategy.SPECIFY_TERMS: 2>
<MtcStrategy.SPECIFY_TERMS: 1>
>>> config.terms_to_test
('HP:0031982', 'HP:0002339')

Later, when the `config` is used in analysis,
genophenocorr will only perform two hypothesis tests, one for each of the two terms.


.. _hpo-mtc-filter-strategy:

heuristic sampler strategy
~~~~~~~~~~~~~~~~~~~~~~~~~~
HPO MTC filter strategy
-----------------------

.. _heuristic-sampler-strategy-section:
Last, the HPO MTC strategy involves making several domain judgments to take advantage of the HPO structure.

The heuristic sampler strategy is chosen by invoking
:func:`genophenocorr.analysis.CohortAnalysisConfiguration.heuristic_strategy` method:
The strategy is chosen by invoking
:func:`~genophenocorr.analysis.CohortAnalysisConfiguration.hpo_mtc_strategy` method:

>>> config = CohortAnalysisConfiguration()
>>> config.heuristic_strategy(threshold_HPO_observed_frequency=0.5)
>>> config.hpo_mtc_strategy(min_patients_w_hpo=0.5)
>>> config.mtc_strategy
<MtcStrategy.HPO_MTC: 2>
>>> config.min_patients_w_hpo
0.5

The method takes a threshold as an argument (e.g. 50% here, more info below)
HPO MTC takes a threshold as an argument (e.g. 50% in the example above)
and the method's logic is made up of 8 individual heuristics
designed to skip testing the HPO terms that are unlikely to yield significant or interesting results:

#. Skip terms that occurr very rarely
The ``threshold_HPO_observed_frequency`` determines the mininum proportion of individuals
#. Skip terms that occur very rarely
The ``min_patients_w_hpo`` determines the mininum proportion of individuals
with direct or indirect annotation by the HPO term to test.
We check each of the genotype groups (e.g., MISSENSE vs. not-MISSENSE), and we only retain a term for testing
if the proportion of individuals in at least one genotype group is at least ``threshold_HPO_observed_frequency``.
if the proportion of individuals in at least one genotype group is at least ``min_patients_w_hpo``.

This is because of our assumption that even if there is statistical significance,
if a term is only seen in (for example) 7% of individuals in the MISSENSE group and 2% in the not-MISSENSE group,
Expand Down Expand Up @@ -213,6 +240,5 @@ designed to skip testing the HPO terms that are unlikely to yield significant or
it will lead to at least one of the descendents of
*Abnormality of the nervous system* being significant.

See :ref:`general_hpo_terms` for details.

See :ref:`general-hpo-terms` section for details.

12 changes: 6 additions & 6 deletions docs/user-guide/predicates.rst
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ Predicates
Genophenocorr uses predicates to test if variant, patient, or any tested item
meets a condition. Based on the test results, the items are assigned into groups.

As described in the :class:`genophenocorr.analysis.predicate.PolyPredicate` API,
As described in the :class:`~genophenocorr.analysis.predicate.PolyPredicate` API,
the groups must be *exclusive* - the item can be assigned with one and only one group,
and *exhaustive* - the groups must cover all possible scenarios.

Expand All @@ -17,7 +17,7 @@ the predicate can return `None`, and the item will be omitted from the analysis.
The predicates can be chained to test for more complex conditions.
For instance, "test if a patient has a missense or synonymous variant located in exon 6 of transcript `NM_013275.6`".

Let's demonstrate this on an example with a :class:`genophenocorr.analysis.predicate.genotype.VariantPredicate`.
Let's demonstrate this on an example with a :class:`~genophenocorr.analysis.predicate.genotype.VariantPredicate`.
We will load a cohort of 5 subjects with variants in *ANKRD11*, leading to KBG syndrome.
The the clinical signs and symptoms of the subjects were encoded into HPO terms
along with the pathogenic *ANKRD11* variant.
Expand Down Expand Up @@ -85,7 +85,7 @@ and it is predicted to introduce a premature termination codon to the MANE trans
>>> nonsense.test(variant)
True

See :class:`genophenocorr.analysis.predicate.genotype.VariantPredicates`
See :class:`~genophenocorr.analysis.predicate.genotype.VariantPredicates`
for more info on the predicates available off the shelf.


Expand Down Expand Up @@ -125,7 +125,7 @@ Sometimes we may want to test the variant for a condition that must *not* be met
For instance, we may want to test if the variant is a deletion
that is *not* predicted to shift the transcript reading frame.
One of doing this would be to build a compound predicates
for all variant effects except of `VariantEffect.FRAMESHIFT_VARIANT`:
for all variant effects except of :class:`~genophenocorr.model.VariantEffect.FRAMESHIFT_VARIANT`:

>>> non_frameshift_effects = (
... VariantEffect.SYNONYMOUS_VARIANT, VariantEffect.MISSENSE_VARIANT, VariantEffect.INTRON_VARIANT,
Expand All @@ -149,8 +149,8 @@ This is how we can use the predicate inversion to build the predicate for non-fr
Note the presence of a tilde ``~`` before the variant effect predicate and resulting ``NOT`` in the predicate question.


That's it for predicates. Please see :class:`genophenocorr.analysis.predicate.genotype.VariantPredicates`
and :class:`genophenocorr.analysis.predicate.genotype.ProteinPredicates`
That's it for predicates. Please see :class:`~genophenocorr.analysis.predicate.genotype.VariantPredicates`
and :class:`~genophenocorr.analysis.predicate.genotype.ProteinPredicates`
for a comprehensive list of the predicates available off the shelf.

Please open an issue on our `GitHub tracker <https://github.com/monarch-initiative/genophenocorr/issues>`_ if a predicate seems to be missing.
10 changes: 5 additions & 5 deletions src/genophenocorr/analysis/__init__.py
Original file line number Diff line number Diff line change
@@ -1,16 +1,16 @@
from . import predicate

from ._api import CohortAnalysis, GenotypePhenotypeAnalysisResult, HpoMtcReport
from ._config import CohortAnalysisConfiguration, configure_cohort_analysis, configure_default_protein_metadata_service, MTC_Strategy
from ._gp_analysis import HeuristicMtcFilter, SpecifiedTermsMtcFilter, apply_predicates_on_patients
from ._config import CohortAnalysisConfiguration, configure_cohort_analysis, configure_default_protein_metadata_service, MtcStrategy
from ._gp_analysis import apply_predicates_on_patients
from ._mtc_filter import PhenotypeMtcFilter, UseAllTermsMtcFilter, SpecifiedTermsMtcFilter, HpoMtcFilter

__all__ = [
'configure_cohort_analysis',
'CohortAnalysis', 'GenotypePhenotypeAnalysisResult',
'CohortAnalysisConfiguration', 'MTC_Strategy',
'CohortAnalysisConfiguration', 'MtcStrategy',
'HpoMtcReport',
'HeuristicMtcFilter',
'SpecifiedTermsMtcFilter',
'PhenotypeMtcFilter', 'UseAllTermsMtcFilter', 'SpecifiedTermsMtcFilter', 'HpoMtcFilter',
'apply_predicates_on_patients',
'configure_default_protein_metadata_service',
]
44 changes: 0 additions & 44 deletions src/genophenocorr/analysis/_api.py
Original file line number Diff line number Diff line change
Expand Up @@ -436,47 +436,3 @@ def _group_patients_by_hpo(phenotypic_features: typing.Iterable[hpotk.TermId],
break

return PatientsByHPO(all_with_hpo, all_without_hpo)


class HpoMtcFilter(metaclass=abc.ABCMeta):
"""
`HpoMtcFilter` decides which phenotypes should be tested and which phenotypes
are not worth testing in order to reduce the multiple testing burden.
Note, the filter works only when using the HPO term to represent the phenotype.
Therefore, the expected input asks for :class:`hpotk.TermId` items.
For instance, `n_usable` is a mapping from an *HPO term* to an `int` with the count of the patients
categorized according to the HPO term.
"""
@abc.abstractmethod
def filter_terms_to_test(
self,
n_usable: typing.Mapping[hpotk.TermId, int],
all_counts: typing.Mapping[hpotk.TermId, pd.DataFrame],
) -> typing.Tuple[
typing.Mapping[hpotk.TermId, int],
typing.Mapping[hpotk.TermId, pd.DataFrame],
typing.Mapping[str, int],
]:
"""
Decide which terms to pass through for statistical testing.
The intention of this class is to reduce multiple testing burden by removing terms that are unlikely
to lead to interesting statistical/analytical results.
Args:
n_usable: a mapping from the :class:`hpotk.TermId` to an `int` with the count of patients
that could be binned according to the used genotype/phenotype predicate.
all_counts: a mapping from the :class:`hpotk.TermId` to
Returns:
a tuple with three items:
- a mapping from :class:`hpotk.TermId` ->
- a mapping from :class:`hpotk.TermId` ->
- a mapping from a `str` with reason why a term was filtered out (e.g. *Skipping top level term*)
"""
pass

@abc.abstractmethod
def filter_method_name(self) -> str:
"""returns the name of the heuristic used to limit multiple testing
"""
pass
Loading

0 comments on commit 2b745bf

Please sign in to comment.