Skip to content

Commit

Permalink
Merge pull request #292 from monarch-initiative/protein_var_table_and…
Browse files Browse the repository at this point in the history
…_doc

adding improvements to protein variant table and adding documentation.
  • Loading branch information
ielis authored Oct 2, 2024
2 parents 64049b4 + d62b146 commit ed12979
Show file tree
Hide file tree
Showing 21 changed files with 935 additions and 139 deletions.
43 changes: 29 additions & 14 deletions docs/tutorial.rst
Original file line number Diff line number Diff line change
Expand Up @@ -114,18 +114,15 @@ The summary report provides an overview about the HPO terms, variants, diseases,
>>> from gpsea.view import CohortViewable
>>> viewer = CohortViewable(hpo)
>>> report = viewer.process(cohort=cohort, transcript_id=tx_id)
>>> with open('docs/report/tbx5_cohort_info.html', 'w') as fh: # doctest: +SKIP
... _ = fh.write(report)
>>> report # doctest: +SKIP

.. raw:: html
:file: report/tbx5_cohort_info.html

.. note::

The report can also be displayed directly in a Jupyter notebook by running::
.. doctest:: tutorial
:hide:

from IPython.display import HTML, display
display(HTML(report))
>>> report.write('docs/report/tbx5_cohort_info.html') # doctest: +SKIP


Plot distribution of variants with respect to the protein sequence
Expand Down Expand Up @@ -155,14 +152,18 @@ and we follow with plotting the diagram of the mutations on the protein:
... cohort,
... ax=ax,
... )
>>> fig.tight_layout()
>>> fig.savefig('docs/img/tutorial/tbx5_protein_diagram.png') # doctest: +SKIP

.. image:: /img/tutorial/tbx5_protein_diagram.png
:alt: TBX5 protein diagram
:align: center
:width: 600px

.. doctest:: tutorial
:hide:

>>> fig.tight_layout()
>>> fig.savefig('docs/img/tutorial/tbx5_protein_diagram.png') # doctest: +SKIP


.. _show-cohort-variants:

Expand All @@ -178,12 +179,16 @@ with one or more variant alleles (*Count*):
>>> from gpsea.view import CohortVariantViewer
>>> viewer = CohortVariantViewer(tx_id=tx_id)
>>> report = viewer.process(cohort=cohort)
>>> with open('docs/report/tbx5_all_variants.html', 'w') as fh: # doctest: +SKIP
... _ = fh.write(report)
>>> report # doctest: +SKIP

.. raw:: html
:file: report/tbx5_all_variants.html

.. doctest:: tutorial
:hide:

>>> report.write('docs/report/tbx5_all_variants.html') # doctest: +SKIP


Prepare genotype and phenotype predicates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Expand Down Expand Up @@ -289,22 +294,32 @@ by exploring the phenotype MTC filtering report.
>>> from gpsea.view import MtcStatsViewer
>>> mtc_viewer = MtcStatsViewer()
>>> mtc_report = mtc_viewer.process(result)
>>> with open('docs/report/tbx5_frameshift_vs_missense.mtc_report.html', 'w') as fh: # doctest: +SKIP
... _ = fh.write(mtc_report)
>>> mtc_report # doctest: +SKIP

.. raw:: html
:file: report/tbx5_frameshift_vs_missense.mtc_report.html

.. doctest:: tutorial
:hide:

>>> mtc_report.write('docs/report/tbx5_frameshift_vs_missense.mtc_report.html') # doctest: +SKIP


and these are the tested HPO terms ordered by the p value corrected with the Benjamini-Hochberg procedure:

>>> from gpsea.view import summarize_hpo_analysis
>>> summary_df = summarize_hpo_analysis(hpo, result)
>>> summary_df.to_csv('docs/report/tbx5_frameshift_vs_missense.csv') # doctest: +SKIP
>>> summary_df # doctest: +SKIP

.. csv-table:: *TBX5* frameshift vs missense
:file: report/tbx5_frameshift_vs_missense.csv
:header-rows: 2

.. doctest:: tutorial
:hide:

>>> summary_df.to_csv('docs/report/tbx5_frameshift_vs_missense.csv') # doctest: +SKIP

We see that several HPO terms are significantly associated
with presence of a frameshift variant in *TBX5*.
For example, `Ventricular septal defect <https://hpo.jax.org/browse/term/HP:0001629>`_
Expand Down
175 changes: 172 additions & 3 deletions docs/user-guide/exploratory.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,177 @@
Cohort exploratory analysis
===========================

TODO - write about exploratory analysis of the cohort. Here we describe what we can learn about the cohort
and the members.
As a general rule for statistical testing, it is preferable to formulate one or several hypotheses and to test
these in a targeted way. Performing numerous tests without any specific plan comes with an increased
danger of false-positive results (although it many be appropriate to generate hypotheses if a validation cohort is
available).

How many variants? Distribution of the variant counts? Variant effects? Plots with phenotypic feature counts?
*************************
Types of statistical test
*************************

GPSEA offers four statistical tests.

- Fisher Exact Test (FET): Association between two categorical variables. In this case, the first categorical variable is presence/absense of a genotype and the second is presence/absense of annotation to an HPO term.
- Mann Whitney U test: Association between presence/absence of a genotype with a phenotype score (e.g., :ref:`devries`).
- t test: Association between presence/absence of a genotype with a numerical test result (e.g., potassium level)
- log rank score: Association between presence/absence of a genotype with age-of-onset of disease or of an HPO feature or with mortality

Please consult the documentation pages for each of these tests if you are unsure which to use.


*****
Power
*****

Two things must be true to identify a valid genotype-phenotype correlation. First, and obviously,
the correlation must actually exist in the dataset being analyzed. Second, there must be statistical power to
identify the effect.
Power is a measure of the ability of an experimental design and hypothesis testing setup to detect a
particular effect if it is truly present (`Wikipedia <https://en.wikipedia.org/wiki/Power_(statistics)>`_).

It will generally not be possible to perform a formal power calculation because the effect size of
most genotype-phenotype correlations is not well known. But, for instance it would not
make much sense to test for correlation with a specific variant that occurs in only one of 50 individuals in a cohort.


***********
Exploration
***********

The purpose of the exploratory analysis is thus to decide which tests to perform.
GPSEA provides tables and graphics that visualize some of the salient aspects
of the cohort and the distribution of the identified variants.
We exemplify the exploratory analysis on a cohort of 156 individuals with mutations
in *TBX5* from Phenopacket Store *0.1.20*. The cohort was preprocessed as described
in the :ref:`input-data` section and stored
at `docs/cohort-data/TBX5.0.1.20.json <https://github.com/monarch-initiative/gpsea/tree/main/docs/cohort-data/TBX5.0.1.20.json>`_
in JSON format.


We start by loading the cohort from the JSON file:

>>> import json
>>> from gpsea.io import GpseaJSONDecoder
>>> fpath_cohort = 'docs/cohort-data/TBX5.0.1.20.json'
>>> with open(fpath_cohort) as fh:
... cohort = json.load(fh, cls=GpseaJSONDecoder)
>>> len(cohort)
156


then we will choose the transcript and protein identifiers, and we fetch the corresponding data:
(see :ref:`choose-tx-and-protein` for more info):

>>> from gpsea.model.genome import GRCh38
>>> from gpsea.preprocessing import VVMultiCoordinateService, configure_default_protein_metadata_service
>>> tx_id = "NM_181486.4"
>>> pt_id = "NP_852259.1"
>>> tx_service = VVMultiCoordinateService(genome_build=GRCh38)
>>> tx_coordinates = tx_service.fetch(tx_id)
>>> pm_service = configure_default_protein_metadata_service()
>>> protein_meta = pm_service.annotate(pt_id)


and we will load HPO `v2024-07-01` for using in the exploratory analysis:

>>> import hpotk
>>> store = hpotk.configure_ontology_store()
>>> hpo = store.load_minimal_hpo(release='v2024-07-01')


Interactive exploration
-----------------------

We designed GPSEA to integrate with interactive Python with a widespread use
in contemporary scientific workflows.

The code for exploration is located in :mod:`gpsea.view` module.
As a rule of thumb, the reports are provided as :class:`~gpsea.view.GpseaReport`
which leverages IPython's "magic" to integrate with environments such as Jupyter notebook.


Cohort summary
--------------

We recommend that users start be generating a cohort summary
with an overview about the HPO terms, variants, diseases, and variant effects that occurr most frequently:

>>> from gpsea.view import CohortViewable
>>> viewer = CohortViewable(hpo)
>>> report = viewer.process(cohort=cohort, transcript_id=tx_id)
>>> report # doctest: +SKIP

.. raw:: html
:file: reports/tbx5_cohort_info.html

.. doctest:: exploratory
:hide:

>>> report.write('docs/user-guide/reports/tbx5_cohort_INFO.html') # doctest: +SKIP


Distribution of variants across protein domains
-----------------------------------------------

GPSEA gathers information about protein domains from the UniProt API, and alternatively allows users to
enter domain information manually (See :ref:`fetch-protein-data`).
Protein domains are distinct functional or structural units in a protein. For instance, the following graphic shows domains of
the *PLD1* protein. The HKD domains contribute to the catalytic activity of the protein whereas the PX and PH domains
regulation PLD1 localization within the cell. Observations such as this may suggest testable hypotheses.

.. figure:: img/PLD1.png
:alt: PLD1
:align: center
:width: 600px

Human *PLD1* with PX, PH, and two HKD domains.


Users can create a table to display the protein domains and the variants
located in them in order to decide whether it might be sensible to test for correlation between variants
located in one or more protein domains and a certain phenotype.

This code will produce the following table on the basis of a cohort of individuals
with variants in the *TBX5* gene:

>>> from gpsea.view import ProteinVariantViewer
>>> cpd_viewer = ProteinVariantViewer(tx_id=tx_id, protein_metadata=protein_meta)
>>> report = cpd_viewer.process(cohort)
>>> report # doctest: +SKIP

.. raw:: html
:file: reports/tbx5_protein_info.html

.. doctest:: exploratory
:hide:

>>> report.write('docs/user-guide/reports/tbx5_protein_info.html') # doctest: +SKIP


Plot distribution of variants with respect to the protein sequence
------------------------------------------------------------------

We use Matplotlib to plot the distribution of variants on a protein diagram:

>>> import matplotlib.pyplot as plt
>>> from gpsea.view import ProteinVisualizer
>>> fig, ax = plt.subplots(figsize=(15, 8))
>>> visualizer = ProteinVisualizer()
>>> visualizer.draw_protein_diagram(
... tx_coordinates,
... protein_meta,
... cohort,
... ax=ax,
... )

.. image:: img/TBX5_protein_diagram.png
:alt: TBX5 protein diagram
:align: center
:width: 600px

.. doctest:: exploratory
:hide:

>>> fig.tight_layout()
>>> fig.savefig('docs/user-guide/img/TBX5_protein_diagram.png') # doctest: +SKIP
Binary file added docs/user-guide/img/PLD1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/user-guide/img/TBX5_protein_diagram.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
44 changes: 27 additions & 17 deletions docs/user-guide/input-data.rst
Original file line number Diff line number Diff line change
Expand Up @@ -47,16 +47,16 @@ the standard `gpsea` installation:
The easiest way to get the `CohortCreator` is to use the
:func:`~gpsea.preprocessing.configure_caching_cohort_creator` convenience method:

.. doctest:: input-data
.. doctest:: input-data

>>> from gpsea.preprocessing import configure_caching_cohort_creator

>>> cohort_creator = configure_caching_cohort_creator(hpo)
>>> cohort_creator = configure_caching_cohort_creator(hpo)

.. note::

The default `CohortCreator` will call Variant Effect Predictor and Uniprot APIs
to perform the functional annotation and protein annotation,
to perform the functional annotation and protein annotation,
and the responses will be cached in the current working directory to reduce the network bandwidth.
See the :func:`~gpsea.preprocessing.configure_caching_cohort_creator` pydoc for more options.

Expand Down Expand Up @@ -88,7 +88,14 @@ Individuals Processed: ...
>>> len(cohort)
19

The cohort includes all 19 individuals.
The cohort includes all 19 individuals.
On top of the ``cohort``, the loader function also provides Q/C results ``qc_results``.
We call :meth:`~gpsea.preprocessing.PreprocessingValidationResult.summarize`
to display the Q/C summary:

>>> qc_results.summarize() # doctest: +SKIP
Validated under none policy
No errors or warnings were found


Alternative phenopacket sources
Expand Down Expand Up @@ -174,9 +181,11 @@ True
We will leave persisting the cohort into an actual file or another data store as an exercise for the interested readers.


***************************************************
Get data for the transcript and protein of interest
***************************************************
.. _choose-tx-and-protein:

*********************************************
Choose the transcript and protein of interest
*********************************************


Choose the transcript
Expand Down Expand Up @@ -215,7 +224,7 @@ Fetch transcript coordinates from Variant Validator REST API

Undoubtedly, the most convenient way for getting the transcript coordinates is to use
the REST API of the amazing `Variant Validator <https://variantvalidator.org/>`_.
GPSEA wraps the boiler-plate associated with querying the API and parsing the response into
GPSEA wraps the boiler-plate associated with querying the API and parsing the response into
:class:`~gpsea.preprocessing.VVMultiCoordinateService`.


Expand Down Expand Up @@ -290,13 +299,14 @@ and the coding sequence includes 1554 coding bases and 518 codons:
518


.. _fetch-protein-data:

Get the protein data
--------------------
Fetch protein data
-------------------


Specific domains of a protein may be associated with genotype-phenotype correlations.
For instance, variants in the pore domain of *PIEZO1* are associated with more severe clinical
Specific domains of a protein may be associated with genotype-phenotype correlations.
For instance, variants in the pore domain of *PIEZO1* are associated with more severe clinical
manifestions in dehydrated hereditary stomatocytosis `Andolfo et al., 2018 <https://pubmed.ncbi.nlm.nih.gov/30187933>`_.

GPSEA uses the protein data in several places: to show distribution of variants with respect to the protein domains
Expand Down Expand Up @@ -347,8 +357,8 @@ which we can see on the following screenshot of the UniProt entry for *TBX5*:

UniProt shows four protein features:

- the Disordered region (1-46)
- the Disordered region (250-356)
- the Disordered region (1-46)
- the Disordered region (250-356)
- presence of Polar residues (263-299)
- presence of Basic and acidic residues (320-346).

Expand All @@ -361,7 +371,7 @@ we can download a JSON file representing the protein features manually,
and load the file into :class:`~gpsea.model.ProteinMetadata`.

To do this, click on the *Download* symbol (see the UniProt screenshot figure above). This will open a dialog
that allows the user to choose the contents of the JSON file.
that allows the user to choose the contents of the JSON file.
Do not change the default option (Features - Domain, Region).
Provided that the file has been saved as `docs/user-guide/data/Q99593.json`,
the ``ProteinMetadata`` can be loaded using :func:`~gpsea.model.ProteinMetadata.from_uniprot_json` function.
Expand All @@ -381,8 +391,8 @@ and `protein_length`, but these are shown in the UniProt entry:
Enter features manually
^^^^^^^^^^^^^^^^^^^^^^^

The information about protein features provided by UniProt entries may not always be complete.
Here we show how to enter the same information manually, in a custom protein dataframe.
The information about protein features provided by UniProt entries may not always be complete.
Here we show how to enter the same information manually, in a custom protein dataframe.

The frame can be created e.g. by running:

Expand Down
Loading

0 comments on commit ed12979

Please sign in to comment.