Merge pull request #15 from vivekbhr/develop

bugfix + minor doc changes
bhardwaj-lab · Jul 2, 2023 · 70ab9bb · 70ab9bb
2 parents bb51d40 + e105c79
commit 70ab9bb
Show file tree

Hide file tree

Showing 31 changed files with 414 additions and 51 deletions.
diff --git a/.codecov.yaml b/.codecov.yaml
@@ -0,0 +1,17 @@
+# Based on pydata/xarray
+codecov:
+  require_ci_to_pass: no
+
+coverage:
+  status:
+    project:
+      default:
+        # Require 10% coverage, i.e., always succeed
+        target: 1
+    patch: false
+    changes: false
+
+comment:
+  layout: diff, flags, files
+  behavior: once
+  require_base: no
diff --git a/.gitignore b/.gitignore
@@ -1,8 +1,42 @@
+
+notebooks
+
+# Docs
+/docs/_build/
+
+# tests
+/*coverage*
+/nunit/
+/.cache/
+/.pytest_cache/
+/sincei/tests/test*.h5ad
+/sincei/tests/figures/
+/sincei/tests/notebooks/figures/
+
+# Environment management
+/Pipfile
+/Pipfile.lock
+/requirements*.lock
+
+# always-ignore extensions
+*~
+
+# Python build files
 sincei/__pycache__
 bin/__pycache__
-notebooks
+__pycache__/
+/dist/
+/build/
+/sincei.egg-info/
+/*-env/
+/env-*/
+
+# OS stuff
 .DS_Store
+Thumbs.db
 .ipynb_checkpoints/
-.pytest_cache
-build
-docs/_build
+.directory
+
+# IDEs and editors
+/.idea/
+/.vscode/
diff --git a/README.md b/README.md
@@ -3,7 +3,16 @@
 
 [![DOI](https://zenodo.org/badge/271841139.svg)](https://zenodo.org/badge/latestdoi/271841139) [![Documentation Status](https://readthedocs.org/projects/sincei/badge/?version=latest)](https://sincei.readthedocs.io/en/latest/?badge=latest) [![PyPI Version](https://img.shields.io/pypi/v/sincei.svg?style=plastic)](https://pypi.org/project/sincei/) [![test](https://github.com/vivekbhr/sincei/actions/workflows/test.yml/badge.svg)](https://github.com/vivekbhr/sincei/actions/workflows/test.yml) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 
-## [Full Documentation](http://sincei.rtfd.io/).
+## Features
+
+sincei provides a flexible, easy-to-use command-line interface to work with single-cell data directly from BAM files. It can:
+
+ - Aggregate signal in bins, genes or any feature of interest from single-cells.
+ - Perform read-level and count-level quality control.
+ - Perform dimentionality reduction and clustering of all kinds of single-cell data (open chromatin, histone marks, methylation, gene expression etc..).
+ - Create coverage files (bigwigs) for visualization.
+
+## [Full Documentation](http://sincei.rtfd.io/)
 
 ## Installation
 
@@ -12,8 +21,7 @@ sincei is a command line toolkit based on python3, and can be installed using [c
 Create a new conda environment and install sincei stable release from github using:
 
 ```
-cd <your_programs_folder>
-conda create -n sincei -c conda-forge -c bioconda scanpy deeptools
+conda create -n sincei -c anaconda python=3.8
 conda activate sincei
 (sincei): pip install --editable=git+https://github.com/vivekbhr/sincei.git@master#egg=sincei
 ```
@@ -32,4 +40,6 @@ Each tool begins with the prefix sc<tool_name>, such as:
 
  $ scBulkCoverage -b file1.bam -g groupinfo.txt -o coverage
 
+## Questions and discussions
 
+To ask  a question related to sincei or start a new discussion, please use our [github discussion forum](https://github.com/vivekbhr/sincei/discussions).
diff --git a/docs/conf.py b/docs/conf.py
@@ -54,6 +54,7 @@ def get_version(path=VPATH):
     "sphinx.ext.viewcode",
     "sphinx.ext.autosummary",
     "sphinxarg.ext",
+    "sphinx_toolbox.collapse",
 ]
 #    'numpydoc'
 

diff --git a/docs/content/images/UMAP_compared_withOrig.png b/docs/content/images/UMAP_compared_withOrig.png
diff --git a/docs/content/images/igv_snapshot_1.png b/docs/content/images/igv_snapshot_1.png
diff --git a/docs/content/tutorials.rst b/docs/content/tutorials.rst
@@ -1,2 +1,7 @@
 Tutorials
 ===========
+
+.. toctree::
+    :maxdepth: 1
+
+    tutorials/sincei_tutorial.rst
diff --git a/docs/content/tutorials/sincei_tutorial.rst b/docs/content/tutorials/sincei_tutorial.rst
@@ -0,0 +1,220 @@
+Analysis of sc-sortChIC data using sincei
+=========================================
+
+Below, we will demonstrate how we can use sincei to explore the data from the protocol single-cell sortChIC, presented in `Zeller, Yueng et.al. <https://www.nature.com/articles/s41588-022-01260-3>`__. This dataset includes BAM files that represent MNase cuts targetted at **H3K27me1** histone mark in single-cells from adult mouse bone marrow, sorted into 384 well plates. We will also use a metadata file that contain the cell labels defined using celltype-specific surface markers from FACS. This will provide independent confirmation that our clustering captures known cell-types.
+
+1. Download the example dataset
+-------------------------------
+
+The test data here contains:
+
+-  **4x BAM files** (indexed): contain data from 4x 384-well plates,
+   reads are taken from **chromosome 1** (mm10/GRCm38)
+-  **mm10_chr1.2bit:** 2-bit file storing the sequence information for
+   mouse chromosome 1
+-  **mm10_blacklist.bed:** blacklisted regions to avoid
+-  **sortChIC_barcodes.txt:** barcodes corresponding to the 384-well
+   plate
+-  **metadata.txt:** metadata file that defines cell types from FACS
+   label
+
+.. code:: bash
+
+    mkdir sortchic_testdata && wget -O
+    sortchic_testdata/sortChIC_testdata.tar.gz
+    https://figshare.com/ndownloader/files/41303289 tar -xvzf
+    sortchic_testdata/sortChIC_testdata.tar.gz ## releases 12 files
+
+    ## save as bash variables
+
+    blacklist=sortchic_testdata/mm10_blacklist.bed
+    barcodes=sortchic_testdata/sortChIC_barcodes.txt
+    genome=sortchic_testdata/mm10_chr1.2bit
+    bamfiles=sortchic_testdata/\*.bam
+
+
+
+2. Quality control - I (read-level)
+-------------------------------------
+
+In order to define high quality cells, we can use the read-level quality statistics from :ref:`scFilterStats`. There could be several indications of low quality cells in this data, such as:
+
+-   high PCR duplicates (filtered using `--duplicateFilter`)
+-   high fraction of reads aligned to blacklisted regions (filtered using
+    `--blacklist`)
+-   high fraction of reads with poor mapping quality (filtered using
+    `--minMappingQuality`)
+-   vey high/low GC content of the aligned reads, indicating reads mostly
+    aligned to low-complexity regions (filtered using `--GCcontentFilter`)
+-   high level of secondary/supplementary alignments (filtered using
+    `--samFlagExclude/Include`)
+
+.. code:: bash
+
+   scFilterStats -p 20 --motifFilter 'A,TA'  \
+   --minAlignedFraction 0.6 --GCcontentFilter '0.2,0.8' \
+   --minMappingQuality 10 \
+   --duplicateFilter 'start_bc_umi' \
+   --samFlagInclude 64 --samFlagExclude 256 --samFlagExclude 2048 \
+   --genome2bit ${genome} \
+   --barcodes ${barcodes} \
+   -bl ${blacklist}  \
+   --smartLabels -o sincei_output/scFilterStats_output.txt \
+   -b ${bamfiles}
+
+``scFilterStats`` summarizes these outputs as a table, which can then be
+visualized using the `MultiQC tool <https://multiqc.info/docs/>`__, to
+select appropriate list of cells to include for counting.
+
+.. code:: bash
+
+    # visualize output using multiQC
+    multiqc sincei_output # see results in multiqc_report.html
+
+
+3. Signal aggregation (counting reads)
+--------------------------------------
+
+Below we aggregate signal from single-cells in 50-kb bins from ``chromosome 1`` using the tool :ref:`scCountReads`. If needed, we can use the same parameters as in :ref:`scFilterStats` to count only high quality reads from our whitelist of barcodes.
+
+We remove duplicates by matching the start position, barcode and UMI between reads (``--duplicateFilter``). ``start_bc_umi`` is the right choice here, as the sortChIC protocol contains PCR duplicates, as well as IVT duplicates, which do not necessarily have the same fragment “end” position in the genome. We avoid counting reads in blacklisted regions of the mouse genome (``--blacklist``), and let the tool figure out sample names from file labels (``--smartLabels``)
+
+Note: Sam flag filtering is applied before duplicate filtering. Note: All read-filtering is performed first, then the remaining reads are extended/centered before the coverage is calculated. Therefore, always specify ``--samFlagInclude 64`` to only count paired-end reads once (even after read extension). Note: If you deduplicate using read start+end, the chimeric reads with the same UMI+barcode would be counted as unique the start position of second mate is used in those cases.
+
+.. code:: bash
+
+    scCountReads bins -p 20 --binSize 50000 --cellTag BC --region chr1 \
+    --minMappingQuality 10 --samFlagInclude 64 --samFlagExclude 2048 \
+    --duplicateFilter 'start_bc_umi' --extendReads \
+    -bl ${blacklist} -bc ${barcodes} \
+    -o sincei_output/scCounts_50kb_bins \
+    --smartLabels -b ${bamfiles}
+    # Number of bins found: 3923
+
+4. Quality control - II (count-level)
+-------------------------------------
+
+After counting, it’s recommended to perform QC of these counts, in order to filter regions and cells that do not get enough counts, or have low enrichment of counts. Even though we already performed read-level QC before, the counts distribution on our specified regions (bins/genes/peaks) could be different from the whole-genome stats.
+
+The tool :ref:`scCountQC` can be used for count-level QC and filtering of count data. With the ``--outMetrics`` option, the tool outputs the count statistics at region and cell level (labelled as ``<prefix>.regions.tsv`` and ``<prefix>.cells.tsv``). Just like :ref:`scFilterStats`, these outputs can then be visualized using the `MultiQC tool <https://multiqc.info/docs/>`__, to select appropriate metrics to filter out the unwanted cells/regions.
+
+.. code:: bash
+
+    # list the metrics we can use to filter cells/regions
+    scCountQC -i sincei_output/scCounts_50kb_bins.loom --describe
+
+    # export the single-cell level metrices
+    scCountQC -i sincei_output/scCounts_50kb_bins.loom \
+    -om sincei_output/countqc_50kb_bins
+
+    # visualize output using multiQC
+    multiqc sincei_output # see results in multiqc_report.html
+
+
+
+Below, we perform a basic filtering using :ref:`scCountQC`. We exclude the cells with low or very high counts (using ``--filterRegionArgs``). Also, we exclude the regions that are detected in too few or too many cells (using ``--filterCellArgs``).
+
+.. code:: bash
+
+   scCountQC -i sincei_output/scCounts_50kb_bins.loom \
+   -o sincei_output/scCounts_50kb_bins_filtered.loom \
+   --filterRegionArgs "n_cells_by_counts: 50, 2000" \
+   --filterCellArgs "n_genes_by_counts: 100, 3000"
+   # Applying filters
+   # Remaining cells: 1333
+   # Remaining features: 2561
+
+5. Dimentionality reduction and clustering
+------------------------------------------
+
+The tool :ref:`scClusterCells` provides a range of options to reduce the dimentionality of our count data, while preserving biological signal. This can be specified with ``--method`` option. Below, we will use a topic modeling method called Latent Schemantic Analysis to reduce the dimentionality of our data to 20 principle components (the default). The tool then uses distance between nearest neighbors to perform a graph-based clustering, and presents a UMAP (2-dimention) plot of the output (``--outFileUMAP`` option). This option also creates a tsv file with UMAP coordinates and assigned clusters for each cell in our data.
+
+
+.. code:: bash
+
+    scClusterCells -i sincei_output/scCounts_50kb_bins_filtered.loom \
+    --method LSA --clusterResolution 0.7 \
+    --outFileUMAP sincei_output/scClusterCells_UMAP.png \
+    -o sincei_output/scCounts_50kb_bins_clustered.loom
+    # Coherence Score:  -1.5
+    # also produces the tsv file "sincei_output/scClusterCells_UMAP.tsv"
+
+(optional) Confirmation of clustering using metadata
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Below, we will load this data in R and compare it to the cell metadata provided with our files to see if our clustering separates celltypes in a biologically meaningful way.
+
+We can color our UMAP output from :ref:`scClusterCells` with the cell-type information based on FACS-sorting from sortChIC.
+
+.. collapse:: Confirmation with metadata
+
+    .. code-block:: r
+
+        umap <- read.delim("sincei_output/scClusterCells_UMAP.tsv", row.names = 1)
+        meta <- read.delim("sortchic_testdata/metadata.tsv", row.names = 1)
+        umap$celltype_facs <- meta[rownames(umap), "ctype"]
+
+        # keep only FACS-defined labels
+        umap %<>% filter(celltype_facs != "AllCells")
+
+        ## make plots
+        df_center <- group_by(umap, cluster) %>% summarise(UMAP1 = mean(UMAP1), UMAP2 = mean(UMAP2))
+        df_center2 <- group_by(umap, celltype_facs) %>% summarise(UMAP1 = mean(UMAP1), UMAP2 = mean(UMAP2))
+
+        col_pallete <- RColorBrewer::brewer.pal(12, "Paired")
+        # colors for sincei UMAP (8 clusters)
+        colors_cluster <- col_pallete[1:6]
+        names(colors_cluster) <- unique(umap$cluster)
+
+        # colors for metadata (12 celltypes)
+        names(col_pallete) <- c("Tcells", "Bcells", "NKs", "DCs", "Eryths", "Granulocytes", "Monocytes")
+
+        p1 <- umap %>%
+          ggplot(., aes(UMAP1, UMAP2, color=factor(cluster), label=cluster)) +
+          geom_point() + geom_label(data = df_center, aes(UMAP1, UMAP2)) +
+          scale_color_manual(values = colors_cluster) + theme_minimal(base_size = 12) +
+          theme(legend.position = "none") + ggtitle("sincei clusters (LSA + louvain)")
+
+        p2 <- umap %>%
+          ggplot(., aes(UMAP1, UMAP2, color=factor(celltype_facs), label=celltype_facs)) +
+            geom_point() + geom_label(data = df_center2, aes(UMAP1, UMAP2)) +
+            scale_color_manual(values = col_pallete) +
+            labs(color="Cluster") + theme_minimal(base_size = 12) +
+            theme(legend.position = "none") +
+            ggtitle("sortChIC cell-types (from FACS)")
+
+        pl <- p1 + p2
+
+        ggsave(plot=pl, "sincei_output/UMAP_compared_withOrig.png", dpi=300, width = 11, height = 6)
+
+
+.. image:: ./../images/UMAP_compared_withOrig.png
+   :height: 800px
+   :width: 1600 px
+   :scale: 50 %
+
+The figure above shows that we can easily replicate the expected cell-type results from the sortChIC data using **sincei**. This was done using only 1/20th of original data (chromosome 1) and basic pre-processing steps, therefore the results should only improve with full data, better cell/region filtering and optimizing the analysis parameters.
+
+6. Creating bigwigs and visualizing signal on IGV
+---------------------------------------------------
+
+For further exploration of data, It's very useful to create in-silico bulk coverage files (bigwigs) that aggregate the signal across cells in our clusters. The tool **scBulkCoverage** takes sincei clustered `.tsv` file, along with the corresponding BAM files, and aggregate the signal to create these bigwigs.
+
+The parameters here are same as other sincei tools that work on BAM files, except that we can ask for a normalized bulk signal (specified using `--normalizeUsing` option) . Below, we prodice CPM-normalized bigwigs with 1kb bins.
+
+.. code:: bash
+
+   scBulkCoverage -p 20 --normalizeUsing CPM --binSize 1000 \
+   --minMappingQuality 10 --samFlagInclude 64 --samFlagExclude 2048 \
+   --duplicateFilter 'start_bc_umi' --extendReads \
+   -b ${bamfiles} --smartLabels \
+   -i sincei_output/scClusterCells_UMAP.tsv \
+   -o sincei_output/sincei_cluster
+   # creates 6 files with names "sincei_cluster_<X>.bw" where X is 0, 1, 2, 3, 4, 5
+
+We can now inspect these bigwigs on `IGV <https://software.broadinstitute.org/software/igv/>`__,. We can clearly see some regions with cell-type specific signal, such as the ones here for genes Prim3 and Tmem131.
+
+.. image:: ./../images/igv_snapshot_1.png
+   :height: 500px
+   :width: 6000 px
+   :scale: 50 %
diff --git a/docs/index.rst b/docs/index.rst
@@ -9,7 +9,7 @@ sincei
     :target: https://zenodo.org/badge/latestdoi/271841139
     :alt: DOI
 
-Mourragui S. and Bhardwaj V. (2023) sincei: A user-friendly toolkit for QC, counting, clustering and plotting of single-cell (epi)genomics data.
+Bhardwaj V. and Mourragui S. (2023) sincei: A user-friendly toolkit for QC, counting, clustering and plotting of single-cell (epi)genomics data.
 
 
 Installation
@@ -20,16 +20,15 @@ Create a new conda environment and install sincei using:
 
 .. code-block:: bash
 
-    cd <programs_folder>
-    conda create -n sincei -c bioconda -c conda-forge scanpy deeptools
+    conda create -n sincei -c anaconda python=3.8
     conda activate sincei
-    (sincei): pip install --editable=git+https://github.com/vivekbhr/sincei.git@master#egg=sincei
+    (sincei): pip install git+https://github.com/vivekbhr/sincei.git@master#egg=sincei
 
 
 Getting Help
 ------------
 
-* For all kind of questions, suggesting changes/enhancements and to report bugs, please create an issue on `our GitHub repository <https://github.com/vivekbhr/sincei>`_
+* For all kind of questions, suggesting changes/enhancements or to report bugs, please create an issue on `our GitHub repository <https://github.com/vivekbhr/sincei>`_
 
 **Please Note that sincei is under active development.** Some features might be incomplete, untested or might be removed as we move towards a stable version.
 
@@ -63,5 +62,6 @@ Contents:
 
    content/list_of_tools
    content/tutorials
+   content/tutorials/sincei_tutorial
    content/modules/modules
    content/news
diff --git a/pyproject.toml b/pyproject.toml
@@ -69,6 +69,7 @@ doc = [
     "sphinx-autodoc-typehints",
     "readthedocs-sphinx-ext",
     "sphinx-argparse",
+    "sphinx_toolbox"
     # For notebooks
  #   "ipykernel",
  #   "ipython",
@@ -122,4 +123,3 @@ exclude = '''
   )/
 )
 '''
-