Improve docs

icbi-lab · Feb 7, 2021 · e5b24eb
1 parent 8209739

Showing 3 changed files with 40 additions and 23 deletions.
diff --git a/docs/infercnv.rst b/docs/infercnv.rst
@@ -3,7 +3,7 @@
 The inferCNV method
 ===================
 
-This methodology in this package is essentially a python reimplementation of
+Essentially, this package is a Python reimplementation of
 `infercnv <https://github.com/broadinstitute/inferCNV/>`_. It mostly follows the computation steps
 outlined `here <https://github.com/broadinstitute/inferCNV/wiki/Running-InferCNV>`_,
 with minor modifications. The computation steps are outlined below.
@@ -20,24 +20,24 @@ The function parameters are documented at :func:`infercnvpy.tl.infercnv`.
    multiple categories are available (i.e. multiple values are specified to
    `reference_cat`), the log fold change is "bounded":
 
-      * compute the mean gene expression for each category separately
+      * Compute the mean gene expression for each category separately.
       * Values that are within the minimum and the maximum of the mean of all
         references, receive a log fold change of 0, since they are not considered
         different from the background.
       * From values smaller than the minimum of the mean of all references, subtract that minimum.
       * From values larger than the maximum of the mean of all references, subtract that maximum.
 
-   This procedure avoids calling false positive CNV due to cell-type specific
-   expression of clustered gene regions (e.g. Immunoglobulin or HLA genes in different
+   This procedure avoids calling false positive CNV regions due to cell-type specific
+   expression of clustered gene regions (e.g. Immunoglobulin- or HLA genes in different
    immune cell types).
 2. Clip the fold changes at `-lfc_cap` and `+lfc_cap`.
 3. Smooth the gene expression by genomic position. Compute the average over a
    running window of length `window_size`. Compute only every nth window
    to save time & space, where n = `step`.
-4. Center the smoothed gene expression by cell, but subtracting the
-   calculating and subtracting the median for each cell.
+4. Center the smoothed gene expression by cell, by subtracting the median of each cell
+   from each cell.
 5. Perform noise filtering. Values `< dynamic_threshold * STDDEV` are set to 0,
-   where STDDEV is the standard deviation of the smoothed gene expression
+   where `STDDEV` is the standard deviation of the smoothed gene expression.
 6. Smooth the final result using a median filter.
 
 .. _input-data:
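
The computation steps listed in this hunk can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the code of `infercnvpy.tl.infercnv`: the function names `bounded_lfc`, `running_mean` and `infercnv_sketch` are hypothetical, the smoothing uses a uniform window without padding, the noise filter is applied to absolute values with a single global standard deviation, and the final median filter (step 6) is left out.

```python
import numpy as np


def bounded_lfc(expr, ref_labels):
    """Step 1: bounded log fold change against several reference categories.

    expr is a (cells x genes) array of log-transformed expression values;
    ref_labels holds one label per cell, None for non-reference cells.
    """
    cats = sorted({c for c in ref_labels if c is not None})
    # Mean gene expression for each reference category separately.
    ref_means = np.vstack(
        [expr[np.array([l == c for l in ref_labels])].mean(axis=0) for c in cats]
    )
    lo, hi = ref_means.min(axis=0), ref_means.max(axis=0)
    # Below the minimum: subtract the minimum. Above the maximum: subtract
    # the maximum. Values in between are not different from the background.
    return np.where(expr < lo, expr - lo, np.where(expr > hi, expr - hi, 0.0))


def running_mean(x, window_size, step):
    """Step 3: average over a running window, keeping every `step`-th window."""
    kernel = np.ones(window_size) / window_size
    smoothed = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="valid"), 1, x
    )
    return smoothed[:, ::step]


def infercnv_sketch(expr, ref_labels, lfc_cap=3.0, window_size=3, step=1,
                    dynamic_threshold=1.5):
    x = bounded_lfc(expr, ref_labels)
    x = np.clip(x, -lfc_cap, lfc_cap)            # step 2: clip fold changes
    x = running_mean(x, window_size, step)       # step 3: smooth by position
    x = x - np.median(x, axis=1, keepdims=True)  # step 4: center each cell
    # Step 5 (assumption: absolute values, one global standard deviation).
    x[np.abs(x) < dynamic_threshold * x.std()] = 0.0
    return x
```

The bounding in step 1 is what keeps genes that differ only between reference cell types from showing up as gains or losses: any value between the lowest and the highest reference mean is mapped to 0.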

diff --git a/docs/tutorials/reproduce_infercnv.md b/docs/tutorials/reproduce_infercnv.md
@@ -9,15 +9,14 @@ jupyter:
       jupytext_version: 1.5.0.rc1
 ---
 
-# Reproduce the heatmap from inverCNV
+# Reproduce the heatmap from inferCNV
 
-This document demonstrates how the [example heatmap](https://github.com/broadinstitute/inferCNV/wiki#demo-example-figure) from the original
+This document demonstrates how to reproduce the [example heatmap](https://github.com/broadinstitute/inferCNV/wiki#demo-example-figure) from the original
 R inferCNV implementation. It is based on a small, 183-cell example dataset of malignant and non-malignant cells from Oligodendroglioma derived from [Tirosh et al. (2016)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5465819/). 
 
 ```python
 import infercnvpy as cnv
 import scanpy as sc
-import numpy as np
 ```
 
 ## Prepare and inspect dataset
@@ -48,7 +47,7 @@ sc.pl.umap(adata, color="cell_type")
 In this case we know which cells are non-malignant. For best results, it is recommended to use
 the non-malignant cells as a background. We can provide this information using `reference_key` and `reference_cat`. 
 
-In order to reproduce the results as exactely as possible, we use a `window_size` of 50, a `step` of 1. 
+In order to reproduce the results as exactly as possible, we use a `window_size` of 100 and a `step` of 1. 
 
 ```python
 %%time

diff --git a/docs/tutorials/tutorial_3k.md b/docs/tutorials/tutorial_3k.md
@@ -47,11 +47,20 @@ sc.logging.print_header()
     must be normalized and log-transformed. For more information, see
     :ref:`input-data`. 
 
+    Also, the genomic positions need to be stored in `adata.var`. The 
+    columns `chromosome`, `start`, and `end` hold the chromosome and 
+    the start and end positions on that chromosome for each gene, 
+    respectively. 
+
+    Infercnvpy provides the :func:`infercnvpy.io.genomic_position_from_gtf` function
+    to read this information from a GTF file and add it to `adata.var`. 
+
 The example dataset is already appropriately preprocessed. 
 <!-- #endraw -->
 
 ```python
 adata = cnv.datasets.maynard2020_3k()
+adata.var.loc[:, ["ensg", "chromosome", "start", "end"]].head()
 ```
 
 Let's first inspect the UMAP plot based on the transcriptomics data:
@@ -69,6 +78,9 @@ region to a reference. The original inferCNV method uses a window size of 100,
 but larger window sizes can make sense, depending on the number of 
 genes in your dataset. 
 
+:func:`~infercnvpy.tl.infercnv` adds a `cell x genomic_region` matrix to 
+`adata.obsm["X_cnv"]`. 
+
 For more information about the method check out :ref:`infercnv-method`. 
 
 .. note::
@@ -135,6 +147,7 @@ Based on these clusters, we can annotate tumor and normal cells.
 
 .. autosummary::
    :toctree: ../generated
+   :noindex:
 
    infercnvpy.tl.pca
    infercnvpy.pp.neighbors
@@ -149,10 +162,10 @@ cnv.pp.neighbors(adata)
 cnv.tl.leiden(adata)
 ```
 
-After running leiden clustering, we plot the chromosome heatmap 
+After running leiden clustering, we can plot the chromosome heatmap 
 by CNV clusters. We can observe that, as opposed to the clusters 
 at the bottom, the clusters at the top have essentially no differentially expressed genomic regions. 
-The differentially expressed regions are likely due to copy number variation and those 
+The differentially expressed regions are likely due to copy number variation and the respective 
 clusters likely represent tumor cells. 
 
 ```python
@@ -161,16 +174,12 @@ cnv.pl.chromosome_heatmap(adata, groupby="cnv_leiden", dendrogram=True)
 
 ### UMAP plot of CNV profiles
 
-
+<!-- #raw raw_mimetype="text/restructuredtext" -->
 We can visualize the same clusters as a UMAP plot. Additionally, 
-we developed a summary score that quantifies the amount of copy
+:func:`infercnvpy.tl.cnv_score` computes a summary score that quantifies the amount of copy
 number variation per cluster. It is simply defined as the
 mean of the absolute values of the CNV matrix for each cluster. 
-
-.. autosummary::
-   :toctree: ../generated
-
-   infercnvpy.tl.cnv_score
+<!-- #endraw -->
 
 ```python
 cnv.tl.umap(adata)
@@ -202,10 +211,19 @@ Again, we can see that there are subclusters of epithelial cells that belong
 to a distinct CNV cluster, and that these clusters tend to have the 
 highest CNV score. 
 
+```python
+fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(
+    2, 2, figsize=(12, 11), gridspec_kw=dict(wspace=0.5)
+)
+ax4.axis("off")
+sc.pl.umap(adata, color="cnv_leiden", ax=ax1, show=False)
+sc.pl.umap(adata, color="cnv_score", ax=ax2, show=False)
+sc.pl.umap(adata, color="cell_type", ax=ax3)
+```
 
 ### Classifying tumor cells
 
-Based on these observations, we can now assign cell as either "tumor" or "normal". 
+Based on these observations, we can now assign cells to either "tumor" or "normal". 
 To this end, we add a new column `cnv_status` to `adata.obs`. 
 
 ```python
@@ -216,12 +234,12 @@ adata.obs.loc[
 ```
 
 ```python
-fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 5))
+fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5), gridspec_kw=dict(wspace=0.5))
 cnv.pl.umap(adata, color="cnv_status", ax=ax1, show=False)
 sc.pl.umap(adata, color="cnv_status", ax=ax2)
 ```
 
-Now, we can also plot the CNV heatmap for tumor and normal cells separately: 
+Now, we can plot the CNV heatmap for tumor and normal cells separately: 
 
 ```python
 cnv.pl.chromosome_heatmap(adata[adata.obs["cnv_status"] == "tumor", :])