Merge pull request #381 from monarch-initiative/poly-predicates-are-a…

…ctually-classifiers Use classifiers instead of genotype/phenotype predicates.
monarch-initiative · Dec 12, 2024 · 044f4a4 · 044f4a4
2 parents 0535ef3 + 2f9b036
commit 044f4a4
Show file tree

Hide file tree

Showing 71 changed files with 2,458 additions and 2,348 deletions.
diff --git a/.github/workflows/pages.yml b/.github/workflows/pages.yml
@@ -52,6 +52,7 @@ jobs:
           cd docs
           sphinx-apidoc --separate --module-first -d 2 -H "API reference" --follow-links -o apidocs ../src/gpsea
           make clean html
+          # make clean html 2>&1 > /dev/null | less
           
           mv _build/html/* ../gh-pages/${DOCDIR}
           cd ..

diff --git a/docs/tutorial.rst b/docs/tutorial.rst
@@ -42,8 +42,8 @@ A typical GPSEA analysis will consist of several steps. Starting with a collecti
 we perform input Q/C and functional variant annotation to prepare a cohort.
 With the cohort on hand, we generate reports with summary statistics, variant distributions,
 and most common Human Phenotype Ontology (HPO) terms or measurements.
-We then configure the methods for partitioning the cohort into genotype and phenotype groups,
-to test for possible associations between the groups.
+We then configure the methods for partitioning the cohort into genotype and phenotype classes,
+to test for possible associations between the classes.
 We finalize the analysis by statistical testing and evaluation of the results.
 
 
@@ -240,54 +240,56 @@ with one or more variant alleles (*Count*):
 Partition the cohort by genotype and phenotype
 ==============================================
 
-To test for genotype-phenotype associations, we need to partition the cohort into subsets.
-In GPSEA, we always assign a cohort member into a genotype group,
-where each individual is assigned into a single group and the groups do not overlap.
-The phenotype is then used to either assign into a group or to calculate a numeric score. 
+To test for genotype-phenotype associations, we need to divide the cohort into classes.
+In GPSEA, we always assign a cohort member into a genotype class,
+where each individual is assigned into a single class and the classes do not overlap.
+The phenotype is then used to either assign an individual into a class,
+or to calculate a numeric score or survival.
 
 
 Partition by genotype
 ---------------------
 
-In context of the tutorial, we assign each cohort member into a group
+In context of the tutorial, we assign each cohort member into a class
 depending on presence of a single allele of a missense or truncating variant
 (e.g. frameshift, stop gain, or splice site region):
 
 >>> from gpsea.model import VariantEffect
->>> from gpsea.analysis.predicate.genotype import VariantPredicates, monoallelic_predicate
->>> is_missense = VariantPredicates.variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id)
+>>> from gpsea.analysis.predicate import variant_effect, anyof
+>>> from gpsea.analysis.clf import monoallelic_classifier
+>>> is_missense = variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id)
 >>> truncating_effects = (
 ...    VariantEffect.FRAMESHIFT_VARIANT,
 ...    VariantEffect.STOP_GAINED,
 ...    VariantEffect.SPLICE_DONOR_VARIANT,
 ...    VariantEffect.SPLICE_ACCEPTOR_VARIANT,
 ... )
->>> is_truncating = VariantPredicates.any(VariantPredicates.variant_effect(e, tx_id) for e in truncating_effects)
->>> gt_predicate = monoallelic_predicate(
+>>> is_truncating = anyof(variant_effect(e, tx_id) for e in truncating_effects)
+>>> gt_clf = monoallelic_classifier(
 ...     a_predicate=is_missense,
 ...     b_predicate=is_truncating,
 ...     a_label="Missense", b_label="Truncating",
 ... )
->>> gt_predicate.group_labels
+>>> gt_clf.class_labels
 ('Missense', 'Truncating')
 
 This is a lot of code, and detailed explanations and examples are available in the :ref:`partitioning` section.
-For now, it is enough to know that the `gt_predicate` will assign the individuals
-into `Missense` or `Truncating` group. The individuals with the number of missense (or truncating) variants
+For now, it is enough to know that the `gt_clf` will assign the individuals
+into `Missense` or `Truncating` class. The individuals with the number of missense (or truncating) variants
 different than one will be omitted from the analysis.
 
 
 Partition by phenotype
 ----------------------
 
-We use HPO terms to assign the individuals into phenotype groups,
+We use HPO terms to assign the individuals into phenotype classes,
 according to the term's presence or exclusion.
 The testing leverages the :ref:`true-path-rule` of ontologies.
 
-We now prepare the predicates for assigning into phenotype groups:
+We now prepare the classifiers for assigning into phenotype classes:
 
->>> from gpsea.analysis.predicate.phenotype import prepare_predicates_for_terms_of_interest
->>> pheno_predicates = prepare_predicates_for_terms_of_interest(
+>>> from gpsea.analysis.clf import prepare_classifiers_for_terms_of_interest
+>>> pheno_clfs = prepare_classifiers_for_terms_of_interest(
 ...     cohort=cohort,
 ...     hpo=hpo,
 ... )
@@ -299,7 +301,7 @@ Multiple testing correction
 By default, GPSEA performs a test for each HPO term used to annotate at least one individual in the cohort,
 and there are 369 such terms in *TBX5* cohort:
 
->>> len(pheno_predicates)
+>>> len(pheno_clfs)
 369
 
 However, testing multiple hypothesis on the same dataset increases the chance of receiving false positive result.
@@ -329,8 +331,8 @@ Now we can perform the testing and evaluate the results.
 
 >>> result = analysis.compare_genotype_vs_phenotypes(
 ...     cohort=cohort,
-...     gt_predicate=gt_predicate,
-...     pheno_predicates=pheno_predicates,
+...     gt_clf=gt_clf,
+...     pheno_clfs=pheno_clfs,
 ... )
 >>> result.total_tests
 17

diff --git a/docs/user-guide/analyses/index.rst b/docs/user-guide/analyses/index.rst
@@ -17,7 +17,7 @@ and explanations of how they are implemented by our software.
   :caption: Contents:
 
   partitioning/index
-  phenotype-groups
+  phenotype-classes
   mtc
   phenotype-scores
   measurements

diff --git a/docs/user-guide/analyses/measurements.rst b/docs/user-guide/analyses/measurements.rst
@@ -1,9 +1,9 @@
 .. _measurement-stat:
 
 
-==========================
-Compare measurement values
-==========================
+====================
+Compare measurements
+====================
 
 
 ****************
@@ -55,21 +55,21 @@ Missense should *NOT* be severe.
 TODO - create real predicate.
 
 >>> from gpsea.model import VariantEffect
->>> from gpsea.analysis.predicate.genotype import VariantPredicates
->>> is_missense = VariantPredicates.variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id=tx_id)
+>>> from gpsea.analysis.predicate import variant_effect
+>>> is_missense = variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id=tx_id)
 >>> is_missense.description
 'MISSENSE_VARIANT on NM_000500.9'
 
 Assuming AR inheritance, we compare missense vs. rest:
 
->>> from gpsea.analysis.predicate.genotype import biallelic_predicate
->>> gt_predicate = biallelic_predicate(
+>>> from gpsea.analysis.clf import biallelic_classifier
+>>> gt_clf = biallelic_classifier(
 ...     a_predicate=is_missense,
 ...     b_predicate=~is_missense,
 ...     a_label="Missense", b_label="Other",
 ...     partitions=({0,}, {1, 2}),
 ... )
->>> gt_predicate.group_labels
+>>> gt_clf.class_labels
 ('Missense/Missense', 'Missense/Other OR Other/Other')
 
 Phenotype score
@@ -118,7 +118,7 @@ We execute the analysis by running
 
 >>> result = score_analysis.compare_genotype_vs_phenotype_score(
 ...     cohort=cohort,
-...     gt_predicate=gt_predicate,
+...     gt_clf=gt_clf,
 ...     pheno_scorer=pheno_scorer,
 ... )
 
@@ -140,7 +140,7 @@ individual 14[PMID_30968594_individual_14]         1      664.0
 
 Prepare genotype category legend:
 
->>> gt_id_to_name = {c.category.cat_id: c.category.name for c in gt_predicate.get_categorizations()}
+>>> gt_id_to_name = {c.category.cat_id: c.category.name for c in gt_clf.get_categorizations()}
 >>> gt_id_to_name
 {0: 'Missense/Missense', 1: 'Missense/Other OR Other/Other'}
 

diff --git a/docs/user-guide/analyses/partitioning/genotype/allele_count.rst b/docs/user-guide/analyses/partitioning/genotype/allele_count.rst
@@ -16,14 +16,15 @@ with the variant category analysis described in the :ref:`variant-category` sect
 The allele count analysis compares the allele counts while keeping the variant type constant,
 whereas the variant category analysis compares variant types while keeping the total allele count constant.
 
-A predicate for the allele count analysis is created with
-the :func:`~gpsea.analysis.predicate.genotype.allele_count` function.
+A classifier for the allele count analysis is created with
+the :func:`~gpsea.analysis.clf.allele_count` function.
 The function needs two arguments:
-``target`` with a :class:`~gpsea.analysis.predicate.genotype.VariantPredicate` for selecting the target variant type,
+``target`` with a :class:`~gpsea.analysis.predicate.VariantPredicate` for selecting the target variant type,
 and ``counts`` with the target allele counts.
 The ``target`` variant predicate typically selects a broad variant types - for instance, all variants that affect a gene of interest.
 The ``counts`` take the allele counts to compare.
-The inputs are best illustrated on a few examples.
+
+Let's show a few examples.
 
 
 ********
@@ -33,18 +34,19 @@ Examples
 Compare the individuals with *EGFR* mutation
 ============================================
 
-We can use the allele count analysis to test for G/P associations between the individuals with no mutation in *EGFR*
+We can use the allele count analysis to test for G/P associations
+between the individuals with no mutation in *EGFR*
 and the individuals harboring 1 variant allele.
 
-First, let's create a :class:`~gpsea.analysis.predicate.genotype.VariantPredicate`
+First, let's create a :class:`~gpsea.analysis.predicate.VariantPredicate`
 to select variants that affect *EGFR*:
 
->>> from gpsea.analysis.predicate.genotype import VariantPredicates
->>> affects_egfr = VariantPredicates.gene(symbol="EGFR")
+>>> from gpsea.analysis.predicate import gene
+>>> affects_egfr = gene(symbol="EGFR")
 >>> affects_egfr.description
 'affects EGFR'
 
-Next, we create allele count predicate to partition the individuals
+Next, we create allele count classifier to partition the individuals
 based on presence of zero or one *EGFR* mutation allele:
 
 .. figure:: img/allele-count-zero-one.png
@@ -53,22 +55,23 @@ based on presence of zero or one *EGFR* mutation allele:
    :width: 600px
 
 
->>> from gpsea.analysis.predicate.genotype import allele_count
->>> gt_predicate = allele_count(
+>>> from gpsea.analysis.clf import allele_count
+>>> gt_clf = allele_count(
 ...     counts=(0, 1),
 ...     target=affects_egfr,
 ... )
->>> gt_predicate.group_labels
+>>> gt_clf.class_labels
 ('0', '1')
 
-We create the predicate with two arguments.
-The ``counts`` argument takes a tuple of the target allele counts,
-to partition the individuals based on zero or one allele.
-The ``target`` takes a :class:`~gpsea.analysis.predicate.genotype.VariantPredicate`
-selecting the target variants. The ``target`` is particularly important
+The ``allele_count`` needs two inputs.
+The ``counts`` takes a tuple of the target allele counts,
+in order to partition the individuals based on zero or one allele.
+The ``target`` takes a :class:`~gpsea.analysis.predicate.VariantPredicate`
+to select the target variants.
+The ``target`` is particularly important
 if the cohort members include variants in other genes than *EGFR*.
 
-The resulting ``gt_predicate`` can partition a cohort along the genotype axis,
+The resulting ``gt_clf`` can partition a cohort along the genotype axis,
 e.g. to compare the patient survivals in a `survival analysis <survival>`.
 
 
@@ -80,8 +83,8 @@ As another example, let's partition individuals based on one or two alleles of a
 For this example, the target mutation is any mutation that affects *LMNA*:
 
 
->>> from gpsea.analysis.predicate.genotype import VariantPredicates
->>> affects_lmna = VariantPredicates.gene(symbol="LMNA")
+>>> from gpsea.analysis.predicate import gene
+>>> affects_lmna = gene(symbol="LMNA")
 >>> affects_lmna.description
 'affects LMNA'
 
@@ -94,16 +97,14 @@ and we will compare the individuals with one allele with those with two alleles:
    :width: 600px
 
 
->>> gt_predicate = allele_count(
+>>> gt_clf = allele_count(
 ...     counts=(1, 2),
 ...     target=affects_lmna,
 ... )
->>> gt_predicate.group_labels
+>>> gt_clf.class_labels
 ('1', '2')
 
 
-The predicate will partition the individuals into two groups:
+The classifier assigns the individuals into one of two classes:
 those with one *LMNA* variant allele and those with two *LMNA* variant alleles.
-The individual with other allele counts (e.g. `0` or `3`) will be excluded
-from the analysis.
-
+Any cohort member with other allele counts (e.g. `0` or `3`) is ignored.
diff --git a/docs/user-guide/analyses/partitioning/genotype/diagnosis.rst b/docs/user-guide/analyses/partitioning/genotype/diagnosis.rst
@@ -4,26 +4,26 @@
 Group by diagnosis
 ==================
 
-.. figure:: img/diagnosis-predicate.png
-   :alt: Diagnosis predicate
+.. figure:: img/diagnosis-classifier.png
+   :alt: Diagnosis classifier
    :align: center
    :width: 600px
 
 
 We can easily compare individuals diagnosed with different diseases.
-:func:`~gpsea.analysis.predicate.genotype.diagnosis_predicate` groups the individuals
+:func:`~gpsea.analysis.clf.diagnosis_classifier` groups the individuals
 based on a diagnosis presence.
 
->>> from gpsea.analysis.predicate.genotype import diagnosis_predicate
->>> gt_predicate = diagnosis_predicate(
+>>> from gpsea.analysis.clf import diagnosis_classifier
+>>> gt_clf = diagnosis_classifier(
 ...     diagnoses=('OMIM:154700', 'OMIM:129600'),
 ...     labels=('Marfan syndrome', 'Ectopia lentis, familial'),
 ... )
->>> gt_predicate.group_labels
+>>> gt_clf.class_labels
 ('OMIM:154700', 'OMIM:129600')
 
-The predicate takes two or more disease identifiers (`diagnoses`) as well as their names (`labels`),
-and it assigns the individuals based on their diagnoses.
+The classifier takes two or more disease identifiers (`diagnoses`) as well as their names (`labels`),
+and it classifies the individuals based on their diagnoses.
 
 Note, the assignment must be unambiguous; any individual labeled with two or more target diagnoses
 (e.g. an individual diagnosed with both *Marfan syndrome* and *Ectopia lentis, familial* in the example above)

diff --git a/docs/user-guide/analyses/partitioning/genotype/img/allele-count-zero-one.png b/docs/user-guide/analyses/partitioning/genotype/img/allele-count-zero-one.png
diff --git a/docs/user-guide/analyses/partitioning/genotype/img/allele-count.png b/docs/user-guide/analyses/partitioning/genotype/img/allele-count.png
diff --git a/...-guide/analyses/partitioning/genotype/img/biallelic-classifier-w-partitions.png b/...-guide/analyses/partitioning/genotype/img/biallelic-classifier-w-partitions.png
diff --git a/docs/user-guide/analyses/partitioning/genotype/img/biallelic-classifier.png b/docs/user-guide/analyses/partitioning/genotype/img/biallelic-classifier.png
diff --git a/...r-guide/analyses/partitioning/genotype/img/biallelic-predicate-w-partitions.png b/...r-guide/analyses/partitioning/genotype/img/biallelic-predicate-w-partitions.png
diff --git a/docs/user-guide/analyses/partitioning/genotype/img/biallelic-predicate.png b/docs/user-guide/analyses/partitioning/genotype/img/biallelic-predicate.png
diff --git a/docs/user-guide/analyses/partitioning/genotype/img/diagnosis-classifier.png b/docs/user-guide/analyses/partitioning/genotype/img/diagnosis-classifier.png
diff --git a/docs/user-guide/analyses/partitioning/genotype/img/diagnosis-predicate.png b/docs/user-guide/analyses/partitioning/genotype/img/diagnosis-predicate.png
diff --git a/docs/user-guide/analyses/partitioning/genotype/img/monoallelic-classifier.png b/docs/user-guide/analyses/partitioning/genotype/img/monoallelic-classifier.png
diff --git a/docs/user-guide/analyses/partitioning/genotype/img/monoallelic-predicate.png b/docs/user-guide/analyses/partitioning/genotype/img/monoallelic-predicate.png
diff --git a/docs/user-guide/analyses/partitioning/genotype/img/sex-classifier.png b/docs/user-guide/analyses/partitioning/genotype/img/sex-classifier.png
diff --git a/docs/user-guide/analyses/partitioning/genotype/img/sex-predicate.png b/docs/user-guide/analyses/partitioning/genotype/img/sex-predicate.png
diff --git a/docs/user-guide/analyses/partitioning/genotype/index.rst b/docs/user-guide/analyses/partitioning/genotype/index.rst
@@ -1,20 +1,20 @@
-.. _genotype-predicates:
+.. _genotype-classifiers:
 
-###################
-Genotype predicates
-###################
+####################
+Genotype classifiers
+####################
 
 
-A genotype predicate partitions the individuals based on their genotype.
-In GPSEA, genotype predicates leverage information from one of the three areas:
+A genotype classifier assigns an individual into a class based on their genotype.
+In GPSEA, genotype classifiers leverage information from one of the three areas:
 
 * Sex
 * Disease diagnosis
 * Presence of variant(s) that meet certain inclusion criteria (e.g. a missense variant in heterozygous genotype)
 
-Partitioning based on sex or disease diagnosis is relatively straightforward - the individuals
+Classification based on sex or disease diagnosis is relatively straightforward - the individuals
 are assigned by the biological sex or presence of a specific diagnosis.
-See :ref:`group-by-sex` and :ref:`group-by-diagnosis` for more details. 
+See :ref:`group-by-sex` and :ref:`group-by-diagnosis` for more details.
 
 Partitioning based on variants is, however, much more flexible,
 to support the analysis of the broad spectrum of pathomechanisms

diff --git a/docs/user-guide/analyses/partitioning/genotype/sex.rst b/docs/user-guide/analyses/partitioning/genotype/sex.rst
@@ -4,23 +4,23 @@
 Group by sex
 ============
 
-.. figure:: img/sex-predicate.png
-   :alt: Sex predicate
+.. figure:: img/sex-classifier.png
+   :alt: Sex classifier
    :align: center
    :width: 600px
 
 
 It is easy to investigate the differences between males and females.
-The :func:`~gpsea.analysis.predicate.genotype.sex_predicate` partitions
-the individuals based on their :class:`~gpsea.model.Sex`:
+The :func:`~gpsea.analysis.clf.sex_classifier` assigns
+an individual into a class based on the :class:`~gpsea.model.Sex`:
 
->>> from gpsea.analysis.predicate.genotype import sex_predicate
->>> gt_predicate = sex_predicate()
->>> gt_predicate.group_labels
+>>> from gpsea.analysis.clf import sex_classifier
+>>> gt_clf = sex_classifier()
+>>> gt_clf.class_labels
 ('FEMALE', 'MALE')
 
 The individuals with :class:`~gpsea.model.Sex.UNKNOWN_SEX` will be omitted from the analysis.
 
-Note that we implemented this predicate as a genotype predicate.
+Note that we implemented this classification as a :class:`~gpsea.analysis.clf.GenotypeClassifier`.
 Currently, it is not possible to compare the distribution of genotypes across sexes.