Skip to content

Commit

Permalink
Merge pull request #381 from monarch-initiative/poly-predicates-are-a…
Browse files Browse the repository at this point in the history
…ctually-classifiers

Use classifiers instead of genotype/phenotype predicates.
  • Loading branch information
ielis authored Dec 12, 2024
2 parents 0535ef3 + 2f9b036 commit 044f4a4
Show file tree
Hide file tree
Showing 71 changed files with 2,458 additions and 2,348 deletions.
1 change: 1 addition & 0 deletions .github/workflows/pages.yml
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,7 @@ jobs:
cd docs
sphinx-apidoc --separate --module-first -d 2 -H "API reference" --follow-links -o apidocs ../src/gpsea
make clean html
# make clean html 2>&1 > /dev/null | less
mv _build/html/* ../gh-pages/${DOCDIR}
cd ..
Expand Down
44 changes: 23 additions & 21 deletions docs/tutorial.rst
Original file line number Diff line number Diff line change
Expand Up @@ -42,8 +42,8 @@ A typical GPSEA analysis will consist of several steps. Starting with a collecti
we perform input Q/C and functional variant annotation to prepare a cohort.
With the cohort on hand, we generate reports with summary statistics, variant distributions,
and most common Human Phenotype Ontology (HPO) terms or measurements.
We then configure the methods for partitioning the cohort into genotype and phenotype groups,
to test for possible associations between the groups.
We then configure the methods for partitioning the cohort into genotype and phenotype classes,
to test for possible associations between the classes.
We finalize the analysis by statistical testing and evaluation of the results.


Expand Down Expand Up @@ -240,54 +240,56 @@ with one or more variant alleles (*Count*):
Partition the cohort by genotype and phenotype
==============================================

To test for genotype-phenotype associations, we need to partition the cohort into subsets.
In GPSEA, we always assign a cohort member into a genotype group,
where each individual is assigned into a single group and the groups do not overlap.
The phenotype is then used to either assign into a group or to calculate a numeric score.
To test for genotype-phenotype associations, we need to divide the cohort into classes.
In GPSEA, we always assign a cohort member into a genotype class,
where each individual is assigned into a single class and the classes do not overlap.
The phenotype is then used to either assign an individual into a class,
or to calculate a numeric score or survival.


Partition by genotype
---------------------

In context of the tutorial, we assign each cohort member into a group
In context of the tutorial, we assign each cohort member into a class
depending on presence of a single allele of a missense or truncating variant
(e.g. frameshift, stop gain, or splice site region):

>>> from gpsea.model import VariantEffect
>>> from gpsea.analysis.predicate.genotype import VariantPredicates, monoallelic_predicate
>>> is_missense = VariantPredicates.variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id)
>>> from gpsea.analysis.predicate import variant_effect, anyof
>>> from gpsea.analysis.clf import monoallelic_classifier
>>> is_missense = variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id)
>>> truncating_effects = (
... VariantEffect.FRAMESHIFT_VARIANT,
... VariantEffect.STOP_GAINED,
... VariantEffect.SPLICE_DONOR_VARIANT,
... VariantEffect.SPLICE_ACCEPTOR_VARIANT,
... )
>>> is_truncating = VariantPredicates.any(VariantPredicates.variant_effect(e, tx_id) for e in truncating_effects)
>>> gt_predicate = monoallelic_predicate(
>>> is_truncating = anyof(variant_effect(e, tx_id) for e in truncating_effects)
>>> gt_clf = monoallelic_classifier(
... a_predicate=is_missense,
... b_predicate=is_truncating,
... a_label="Missense", b_label="Truncating",
... )
>>> gt_predicate.group_labels
>>> gt_clf.class_labels
('Missense', 'Truncating')

This is a lot of code, and detailed explanations and examples are available in the :ref:`partitioning` section.
For now, it is enough to know that the `gt_predicate` will assign the individuals
into `Missense` or `Truncating` group. The individuals with the number of missense (or truncating) variants
For now, it is enough to know that the `gt_clf` will assign the individuals
into `Missense` or `Truncating` class. The individuals with the number of missense (or truncating) variants
different than one will be omitted from the analysis.


Partition by phenotype
----------------------

We use HPO terms to assign the individuals into phenotype groups,
We use HPO terms to assign the individuals into phenotype classes,
according to the term's presence or exclusion.
The testing leverages the :ref:`true-path-rule` of ontologies.

We now prepare the predicates for assigning into phenotype groups:
We now prepare the classifiers for assigning into phenotype classes:

>>> from gpsea.analysis.predicate.phenotype import prepare_predicates_for_terms_of_interest
>>> pheno_predicates = prepare_predicates_for_terms_of_interest(
>>> from gpsea.analysis.clf import prepare_classifiers_for_terms_of_interest
>>> pheno_clfs = prepare_classifiers_for_terms_of_interest(
... cohort=cohort,
... hpo=hpo,
... )
Expand All @@ -299,7 +301,7 @@ Multiple testing correction
By default, GPSEA performs a test for each HPO term used to annotate at least one individual in the cohort,
and there are 369 such terms in *TBX5* cohort:

>>> len(pheno_predicates)
>>> len(pheno_clfs)
369

However, testing multiple hypothesis on the same dataset increases the chance of receiving false positive result.
Expand Down Expand Up @@ -329,8 +331,8 @@ Now we can perform the testing and evaluate the results.

>>> result = analysis.compare_genotype_vs_phenotypes(
... cohort=cohort,
... gt_predicate=gt_predicate,
... pheno_predicates=pheno_predicates,
... gt_clf=gt_clf,
... pheno_clfs=pheno_clfs,
... )
>>> result.total_tests
17
Expand Down
2 changes: 1 addition & 1 deletion docs/user-guide/analyses/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ and explanations of how they are implemented by our software.
:caption: Contents:

partitioning/index
phenotype-groups
phenotype-classes
mtc
phenotype-scores
measurements
Expand Down
20 changes: 10 additions & 10 deletions docs/user-guide/analyses/measurements.rst
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
.. _measurement-stat:


==========================
Compare measurement values
==========================
====================
Compare measurements
====================


****************
Expand Down Expand Up @@ -55,21 +55,21 @@ Missense should *NOT* be severe.
TODO - create real predicate.

>>> from gpsea.model import VariantEffect
>>> from gpsea.analysis.predicate.genotype import VariantPredicates
>>> is_missense = VariantPredicates.variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id=tx_id)
>>> from gpsea.analysis.predicate import variant_effect
>>> is_missense = variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id=tx_id)
>>> is_missense.description
'MISSENSE_VARIANT on NM_000500.9'

Assuming AR inheritance, we compare missense vs. rest:

>>> from gpsea.analysis.predicate.genotype import biallelic_predicate
>>> gt_predicate = biallelic_predicate(
>>> from gpsea.analysis.clf import biallelic_classifier
>>> gt_clf = biallelic_classifier(
... a_predicate=is_missense,
... b_predicate=~is_missense,
... a_label="Missense", b_label="Other",
... partitions=({0,}, {1, 2}),
... )
>>> gt_predicate.group_labels
>>> gt_clf.class_labels
('Missense/Missense', 'Missense/Other OR Other/Other')

Phenotype score
Expand Down Expand Up @@ -118,7 +118,7 @@ We execute the analysis by running

>>> result = score_analysis.compare_genotype_vs_phenotype_score(
... cohort=cohort,
... gt_predicate=gt_predicate,
... gt_clf=gt_clf,
... pheno_scorer=pheno_scorer,
... )

Expand All @@ -140,7 +140,7 @@ individual 14[PMID_30968594_individual_14] 1 664.0

Prepare genotype category legend:

>>> gt_id_to_name = {c.category.cat_id: c.category.name for c in gt_predicate.get_categorizations()}
>>> gt_id_to_name = {c.category.cat_id: c.category.name for c in gt_clf.get_categorizations()}
>>> gt_id_to_name
{0: 'Missense/Missense', 1: 'Missense/Other OR Other/Other'}

Expand Down
53 changes: 27 additions & 26 deletions docs/user-guide/analyses/partitioning/genotype/allele_count.rst
Original file line number Diff line number Diff line change
Expand Up @@ -16,14 +16,15 @@ with the variant category analysis described in the :ref:`variant-category` sect
The allele count analysis compares the allele counts while keeping the variant type constant,
whereas the variant category analysis compares variant types while keeping the total allele count constant.

A predicate for the allele count analysis is created with
the :func:`~gpsea.analysis.predicate.genotype.allele_count` function.
A classifier for the allele count analysis is created with
the :func:`~gpsea.analysis.clf.allele_count` function.
The function needs two arguments:
``target`` with a :class:`~gpsea.analysis.predicate.genotype.VariantPredicate` for selecting the target variant type,
``target`` with a :class:`~gpsea.analysis.predicate.VariantPredicate` for selecting the target variant type,
and ``counts`` with the target allele counts.
The ``target`` variant predicate typically selects a broad variant types - for instance, all variants that affect a gene of interest.
The ``counts`` take the allele counts to compare.
The inputs are best illustrated on a few examples.

Let's show a few examples.


********
Expand All @@ -33,18 +34,19 @@ Examples
Compare the individuals with *EGFR* mutation
============================================

We can use the allele count analysis to test for G/P associations between the individuals with no mutation in *EGFR*
We can use the allele count analysis to test for G/P associations
between the individuals with no mutation in *EGFR*
and the individuals harboring 1 variant allele.

First, let's create a :class:`~gpsea.analysis.predicate.genotype.VariantPredicate`
First, let's create a :class:`~gpsea.analysis.predicate.VariantPredicate`
to select variants that affect *EGFR*:

>>> from gpsea.analysis.predicate.genotype import VariantPredicates
>>> affects_egfr = VariantPredicates.gene(symbol="EGFR")
>>> from gpsea.analysis.predicate import gene
>>> affects_egfr = gene(symbol="EGFR")
>>> affects_egfr.description
'affects EGFR'

Next, we create allele count predicate to partition the individuals
Next, we create allele count classifier to partition the individuals
based on presence of zero or one *EGFR* mutation allele:

.. figure:: img/allele-count-zero-one.png
Expand All @@ -53,22 +55,23 @@ based on presence of zero or one *EGFR* mutation allele:
:width: 600px


>>> from gpsea.analysis.predicate.genotype import allele_count
>>> gt_predicate = allele_count(
>>> from gpsea.analysis.clf import allele_count
>>> gt_clf = allele_count(
... counts=(0, 1),
... target=affects_egfr,
... )
>>> gt_predicate.group_labels
>>> gt_clf.class_labels
('0', '1')

We create the predicate with two arguments.
The ``counts`` argument takes a tuple of the target allele counts,
to partition the individuals based on zero or one allele.
The ``target`` takes a :class:`~gpsea.analysis.predicate.genotype.VariantPredicate`
selecting the target variants. The ``target`` is particularly important
The ``allele_count`` needs two inputs.
The ``counts`` takes a tuple of the target allele counts,
in order to partition the individuals based on zero or one allele.
The ``target`` takes a :class:`~gpsea.analysis.predicate.VariantPredicate`
to select the target variants.
The ``target`` is particularly important
if the cohort members include variants in other genes than *EGFR*.

The resulting ``gt_predicate`` can partition a cohort along the genotype axis,
The resulting ``gt_clf`` can partition a cohort along the genotype axis,
e.g. to compare the patient survivals in a `survival analysis <survival>`.


Expand All @@ -80,8 +83,8 @@ As another example, let's partition individuals based on one or two alleles of a
For this example, the target mutation is any mutation that affects *LMNA*:


>>> from gpsea.analysis.predicate.genotype import VariantPredicates
>>> affects_lmna = VariantPredicates.gene(symbol="LMNA")
>>> from gpsea.analysis.predicate import gene
>>> affects_lmna = gene(symbol="LMNA")
>>> affects_lmna.description
'affects LMNA'

Expand All @@ -94,16 +97,14 @@ and we will compare the individuals with one allele with those with two alleles:
:width: 600px


>>> gt_predicate = allele_count(
>>> gt_clf = allele_count(
... counts=(1, 2),
... target=affects_lmna,
... )
>>> gt_predicate.group_labels
>>> gt_clf.class_labels
('1', '2')


The predicate will partition the individuals into two groups:
The classifier assigns the individuals into one of two classes:
those with one *LMNA* variant allele and those with two *LMNA* variant alleles.
The individual with other allele counts (e.g. `0` or `3`) will be excluded
from the analysis.

Any cohort member with other allele counts (e.g. `0` or `3`) is ignored.
16 changes: 8 additions & 8 deletions docs/user-guide/analyses/partitioning/genotype/diagnosis.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,26 +4,26 @@
Group by diagnosis
==================

.. figure:: img/diagnosis-predicate.png
:alt: Diagnosis predicate
.. figure:: img/diagnosis-classifier.png
:alt: Diagnosis classifier
:align: center
:width: 600px


We can easily compare individuals diagnosed with different diseases.
:func:`~gpsea.analysis.predicate.genotype.diagnosis_predicate` groups the individuals
:func:`~gpsea.analysis.clf.diagnosis_classifier` groups the individuals
based on a diagnosis presence.

>>> from gpsea.analysis.predicate.genotype import diagnosis_predicate
>>> gt_predicate = diagnosis_predicate(
>>> from gpsea.analysis.clf import diagnosis_classifier
>>> gt_clf = diagnosis_classifier(
... diagnoses=('OMIM:154700', 'OMIM:129600'),
... labels=('Marfan syndrome', 'Ectopia lentis, familial'),
... )
>>> gt_predicate.group_labels
>>> gt_clf.class_labels
('OMIM:154700', 'OMIM:129600')

The predicate takes two or more disease identifiers (`diagnoses`) as well as their names (`labels`),
and it assigns the individuals based on their diagnoses.
The classifier takes two or more disease identifiers (`diagnoses`) as well as their names (`labels`),
and it classifies the individuals based on their diagnoses.

Note, the assignment must be unambiguous; any individual labeled with two or more target diagnoses
(e.g. an individual diagnosed with both *Marfan syndrome* and *Ectopia lentis, familial* in the example above)
Expand Down
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file not shown.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file not shown.
Binary file not shown.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file not shown.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file not shown.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file not shown.
16 changes: 8 additions & 8 deletions docs/user-guide/analyses/partitioning/genotype/index.rst
Original file line number Diff line number Diff line change
@@ -1,20 +1,20 @@
.. _genotype-predicates:
.. _genotype-classifiers:

###################
Genotype predicates
###################
####################
Genotype classifiers
####################


A genotype predicate partitions the individuals based on their genotype.
In GPSEA, genotype predicates leverage information from one of the three areas:
A genotype classifier assigns an individual into a class based on their genotype.
In GPSEA, genotype classifiers leverage information from one of the three areas:

* Sex
* Disease diagnosis
* Presence of variant(s) that meet certain inclusion criteria (e.g. a missense variant in heterozygous genotype)

Partitioning based on sex or disease diagnosis is relatively straightforward - the individuals
Classification based on sex or disease diagnosis is relatively straightforward - the individuals
are assigned by the biological sex or presence of a specific diagnosis.
See :ref:`group-by-sex` and :ref:`group-by-diagnosis` for more details.
See :ref:`group-by-sex` and :ref:`group-by-diagnosis` for more details.

Partitioning based on variants is, however, much more flexible,
to support the analysis of the broad spectrum of pathomechanisms
Expand Down
16 changes: 8 additions & 8 deletions docs/user-guide/analyses/partitioning/genotype/sex.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,23 +4,23 @@
Group by sex
============

.. figure:: img/sex-predicate.png
:alt: Sex predicate
.. figure:: img/sex-classifier.png
:alt: Sex classifier
:align: center
:width: 600px


It is easy to investigate the differences between males and females.
The :func:`~gpsea.analysis.predicate.genotype.sex_predicate` partitions
the individuals based on their :class:`~gpsea.model.Sex`:
The :func:`~gpsea.analysis.clf.sex_classifier` assigns
an individual into a class based on the :class:`~gpsea.model.Sex`:

>>> from gpsea.analysis.predicate.genotype import sex_predicate
>>> gt_predicate = sex_predicate()
>>> gt_predicate.group_labels
>>> from gpsea.analysis.clf import sex_classifier
>>> gt_clf = sex_classifier()
>>> gt_clf.class_labels
('FEMALE', 'MALE')

The individuals with :class:`~gpsea.model.Sex.UNKNOWN_SEX` will be omitted from the analysis.

Note that we implemented this predicate as a genotype predicate.
Note that we implemented this classification as a :class:`~gpsea.analysis.clf.GenotypeClassifier`.
Currently, it is not possible to compare the distribution of genotypes across sexes.

Loading

0 comments on commit 044f4a4

Please sign in to comment.