Merge pull request #202 from DendrouLab/clustering_g

clustering yaml created
DendrouLab · Apr 26, 2024 · 844e4a0
2 parents d53c976 + c78eea5

Showing 5 changed files with 52 additions and 29 deletions.
diff --git a/docs/yaml_docs/index.rst b/docs/yaml_docs/index.rst
@@ -12,4 +12,5 @@ Workflows configuration files
     spatial_qc
     spatial_preprocess
     spatial_deconvolution
-    pipeline_refmap_yml.md
+    pipeline_refmap_yml
+
diff --git a/docs/yaml_docs/pipeline_clustering_yml.md b/docs/yaml_docs/pipeline_clustering_yml.md
@@ -14,7 +14,10 @@ In this documentation, the parameters of the `clustering` configuration yaml fil
 This file is generated by running `panpipes clustering config`. <br>
 The individual steps run by the pipeline are described in [clustering workflow](https://panpipes-pipelines.readthedocs.io/en/latest/workflows/clustering.html)
 
-When running the clustering workflow, panpipes provides a basic `pipeline.yml` file.
+The `clustering` workflow works with outputs generated by the `integration` workflow, and expects a `MuData` object with 
+`neighbors` saved in the `.uns` of the global layer to run clustering on the multimodal embedding. If `neighbors` are calculated on each modality layer, these can be reused or recalculated on the fly.
+
+When running the clustering workflow, panpipes provides a basic `pipeline.yml` file to customize with parameters.
 To run the workflow on your own data, you need to specify the parameters described below in the `pipeline.yml` file to meet the requirements of your data.
 
 However, we do provide pre-filled versions of the `pipeline.yml` file for individual [tutorials](https://panpipes-pipelines.readthedocs.io/en/latest/tutorials/index.html).
@@ -62,24 +65,30 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th
   Specify the full object if your scaled_obj contains only HVG.  If your scaled_obj contains all the genes then leave full_obj blank. 
   panpipes will use the full object to do marker genes analysis (rank_gene_groups) and for plotting those genes. 
 - <span class="parameter">modalities</span><br>
-  - <span class="parameter">rna</span> `Boolean`, Default: True<br>
+ Which modalities to run clustering on. 
+  - <span class="parameter">rna</span> `Boolean`, Default: True<br> If set to `True`, the workflow will stop if it doesn't find a modality named 'rna'
   - <span class="parameter">prot</span> `Boolean`, Default: True<br>
+  If set to `True`, the workflow will stop if it doesn't find a modality named 'prot'
   - <span class="parameter">atac</span> `Boolean`, Default: False<br>
+   If set to `True`, the workflow will stop if it doesn't find a modality named 'atac'
+
   - <span class="parameter">spatial</span> `Boolean`, Default: False<br>
-  Run clustering on each individual modality.
+  If set to `True`, the workflow will stop if it doesn't find a modality named 'spatial'
+
 
 - <span class="parameter">multimodal</span><br>
-  - <span class="parameter">rna_clustering</span> `Boolean`, Default: True<br>
-  - <span class="parameter">integration_method</span> `String`, Default: WNN<br>
-  Options here include WNN, mofa, and totalVI, and it tells us where to look for.
+  - <span class="parameter">run_clustering</span> `Boolean`, Default: False<br> If set to `True`, runs clustering on the multimodal embedding.
+  - <span class="parameter">integration_method</span> `String`, Default: None<br>
+  If you have run WNN and want to run clustering on the WNN embedding, specify "WNN" here. Only WNN saves its neighbours under a different `--neighbors_key` param; for every other method (totalvi, multivi, mofa), leave this parameter blank. 
+
 
 ## Parameters for finding neighbours 
 
 - <span class="parameter">neighbors:</span> 
  Sets the number of neighbors to use when calculating the graph for clustering and umap.
   - <span class="parameter">rna:</span> 
 
-     - <span class="parameter">use_existing </span> `Boolean`, Default: True<br>
+     - <span class="parameter">use_existing </span> `Boolean`, Default: True<br> Use existing neighbours in .uns calculated in the `integration` workflow. If `False`, it will recalculate using the following parameters
      - <span class="parameter">dim_red </span> `String`, Default: X_pca<br>
        Defines which representation in .obsm to use for nearest neighbors
      - <span class="parameter">n_dim_red</span> `Integer`, Default: 30<br>
@@ -94,7 +103,7 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th
 
   - <span class="parameter">prot:</span> 
 
-     - <span class="parameter">use_existing </span> `Boolean`, Default: True<br>
+     - <span class="parameter">use_existing </span> `Boolean`, Default: True<br> Use existing neighbours in .uns calculated in the `integration` workflow. If `False`, it will recalculate using the following parameters
      - <span class="parameter">dim_red </span> `String`, Default: X_pca<br>
        Defines which representation in .obsm to use for nearest neighbors
      - <span class="parameter">n_dim_red</span> `Integer`, Default: 30<br>
@@ -109,7 +118,7 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th
 
   - <span class="parameter">atac:</span> 
 
-     - <span class="parameter">use_existing </span> `Boolean`, Default: True<br>
+     - <span class="parameter">use_existing </span> `Boolean`, Default: True<br> Use existing neighbours in .uns calculated in the `integration` workflow. If `False`, it will recalculate using the following parameters
      - <span class="parameter">dim_red </span> `String`, Default: X_lsi<br>
        Defines which representation in .obsm to use for nearest neighbors
      - <span class="parameter">n_dim_red</span> `Integer`, Default: 1<br>
@@ -125,7 +134,7 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th
 
   - <span class="parameter">spatial:</span> 
 
-     - <span class="parameter">use_existing </span> `Boolean`, Default: False<br>
+     - <span class="parameter">use_existing </span> `Boolean`, Default: False<br> Use existing neighbours in .uns calculated in the `integration` workflow. If `False`, it will recalculate using the following parameters
      - <span class="parameter">dim_red </span> `String`, Default: X_pca<br>
        Defines which representation in .obsm to use for nearest neighbors
      - <span class="parameter">n_dim_red</span> `Integer`, Default: 30<br>
@@ -142,51 +151,51 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th
 
   - <span class="parameter">umap:</span> 
 
-     - <span class="parameter">run </span> `Boolean`, Default: True<br>
+     - <span class="parameter">run </span> `Boolean`, Default: True<br> If set to `True`, runs the UMAP calculation and plotting.
      - <span class="parameter">rna:</span>
          - <span class="parameter">mindist </span> `Float`, Default: 0.5<br>
-           Can specify an array: 0.25,0.5
+           Can specify a single float or an array: 0.25,0.5
       - <span class="parameter">prot:</span>
          - <span class="parameter">mindist </span> `Float`, Default: 0.5<br>
-           Can specify an array: 0.25,0.5,0.8
+           Can specify a single float or an array: 0.25,0.5,0.8
       - <span class="parameter">atac:</span>
          - <span class="parameter">mindist </span> `Float`, Default: 0.5<br>
-           Can specify an array: 0.25,0.5,0.8
+           Can specify a single float or an array: 0.25,0.5,0.8
       - <span class="parameter">multimodal:</span>
          - <span class="parameter">mindist </span> `Float`, Default: 0.5<br>
-           Can specify an array: 0.25,0.5,0.8
+           Can specify a single float or an array: 0.25,0.5,0.8
       - <span class="parameter">rna:</span>
          - <span class="parameter">mindist </span> `Float`, Default: 0.5<br>
-            Can specify an array: 0.25,0.5,0.8
+            Can specify a single float or an array: 0.25,0.5,0.8
 
 ## Parameters for clustering 
 
   - <span class="parameter">clusterspecs:</span>
       - <span class="parameter">rna:</span>
           - <span class="parameter">resolutions </span> `Float`, Default: 0.2, 0.6, 1<br>
-           Can specify an array: 0.2,0.6,1
+           Can specify a single float or an array: 0.2,0.6,1
           - <span class="parameter">algorithm</span> `String`, Default: leiden<br>
             Options include louvain or leiden. 
       - <span class="parameter">prot:</span>
           - <span class="parameter">resolutions </span> `Float`, Default: 0.2, 0.6, 1<br>
-           Can specify an array: 0.2,0.6,1
+           Can specify a single float or an array: 0.2,0.6,1
           - <span class="parameter">algorithm</span> `String`, Default: leiden<br>
             Options include louvain or leiden.
 
       - <span class="parameter">atac:</span>
           - <span class="parameter">resolutions </span> `Float`, Default: 0.2, 0.6, 1<br>
-           Can specify an array to compute in parallel: 0.2,0.6,1
+           Can specify a single float or an array to compute in parallel: 0.2,0.6,1
           - <span class="parameter">algorithm</span> `String`, Default: leiden<br>
             Options include louvain or leiden. 
       - <span class="parameter">multimodal:</span>
           - <span class="parameter">resolutions </span> `Float`, Default: 0.5, 0.7<br>
-           Can specify an array to compute in parallel: 0.2,0.6,1 
+           Can specify a single float or an array to compute in parallel: 0.2,0.6,1 
           - <span class="parameter">algorithm</span> `String`, Default: leiden<br>
             Options include louvain or leiden.
 
       - <span class="parameter">spatial:</span>
           - <span class="parameter">resolutions </span> `Float`, Default: 0.2, 0.6, 1<br>
-           Can specify an array to compute in parallel: 0.2,0.6,1 
+           Can specify a single float or an array to compute in parallel: 0.2,0.6,1 
           - <span class="parameter">algorithm</span> `String`, Default: leiden<br>
             Options include louvain or leiden. 
 
@@ -207,8 +216,10 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
        Marker analysis is run only for clusters with at least `mincells` cells; clusters with fewer cells are excluded from marker analysis
        - <span class="parameter">pseudo_seurat </span> `Boolean`, Default: False<br>
        - <span class="parameter">minpct </span> `Float`, Default: 0.1<br>
+       Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. 
        This parameter is mandatory if pseudo_seurat is set to True 
        - <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
+       Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells.
        This parameter is mandatory if pseudo_seurat is set to True 
  - <span class="parameter">prot:</span><br>
    - <span class="parameter">run </span> `Boolean`, Default: True<br>
@@ -219,8 +230,10 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
    - <span class="parameter">method </span> `String`, Default: wilcoxon<br>
    - <span class="parameter">pseudo_seurat </span> `Boolean`, Default: False<br>
    - <span class="parameter">minpct </span> `Float`, Default: 0.1<br>
+       Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. 
        This parameter is mandatory if pseudo_seurat is set to True 
    - <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
+    Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. 
        This parameter is mandatory if pseudo_seurat is set to True 
 
  - <span class="parameter">atac:</span><br>
@@ -234,8 +247,10 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
         Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’
     - <span class="parameter">pseudo_seurat </span> `Boolean`, Default: False<br>
     - <span class="parameter">minpct </span> `Float`, Default: 0.1<br>
+       Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. 
        This parameter is mandatory if pseudo_seurat is set to True 
     - <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
+      Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells.
        This parameter is mandatory if pseudo_seurat is set to True 
 
 
@@ -246,9 +261,9 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
         Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’
     - <span class="parameter">pseudo_seurat </span> `Boolean`, Default: False<br>
     - <span class="parameter">minpct </span> `Float`, Default: 0.1<br>
-       This parameter is mandatory if pseudo_seurat is set to True 
+       Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True 
     - <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
-       This parameter is mandatory if pseudo_seurat is set to True
+       Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True
 
 
  - <span class="parameter">spatial:</span><br>
@@ -261,11 +276,12 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
        Marker analysis is run only for clusters with at least `mincells` cells; clusters with fewer cells are excluded from marker analysis
    - <span class="parameter">pseudo_seurat </span> `Boolean`, Default: False<br>
    - <span class="parameter">minpct </span> `Float`, Default: 0.1<br>
-      This parameter is mandatory if pseudo_seurat is set to True 
+      Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True 
    - <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
+       Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. 
        This parameter is mandatory if pseudo_seurat is set to True 
 ## Plot specifications
-Used to define which metadata columns are used in the visualizations 
+Define which layers are used in the markers visualization 
  - <span class="parameter">plotspecs:</span><br>
    - <span class="parameter">layers: </span><br>
      - <span class="parameter">rna </span> `String`, Default: logged_counts<br>
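
Reviewer note on the `mindist` and `resolutions` entries documented in this file, which accept either a single float or a comma-separated array such as `0.2,0.6,1`. The sketch below shows one way such a value could be normalized into a list of floats. The helper name `parse_float_list` is hypothetical and not part of panpipes; the pipeline's own parsing may differ.

```python
from typing import List, Union


def parse_float_list(value: Union[str, int, float, None]) -> List[float]:
    """Normalize a config value such as 0.5 or "0.2,0.6,1" into a list of floats.

    Hypothetical helper for illustration only; not taken from panpipes.
    """
    if value is None:
        # An empty entry in the yaml file yields no values.
        return []
    if isinstance(value, (int, float)):
        # A single number becomes a one-element list.
        return [float(value)]
    # A comma-separated string is split; surrounding whitespace is tolerated.
    return [float(item) for item in str(value).split(",") if item.strip()]


single = parse_float_list(0.5)
several = parse_float_list("0.2, 0.6, 1")
```

Returning a list in both cases lets downstream code loop over the values without checking whether one or several were configured.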

diff --git a/panpipes/panpipes/pipeline_clustering.py b/panpipes/panpipes/pipeline_clustering.py
@@ -43,9 +43,10 @@ def set_up_dirs(log_file):
 ## Single modality scripts
 ## ------------------------------------
 
-# -----------------------------------=
+# --------------------------------------
 # neighbors
 # --------------------------------------
+# TO DO create task to re-run neighbours on multimodal outer representations (this script can only read in each mod layer)
 @follows(set_up_dirs)
 @originate(PARAMS['mudata_with_knn'])
 def run_neighbors(outfile):

diff --git a/panpipes/panpipes/pipeline_clustering/pipeline.yml b/panpipes/panpipes/pipeline_clustering/pipeline.yml
@@ -29,7 +29,7 @@ modalities:
   atac: False
   spatial: False
 
-# if True, will look for WNN, or totalVI output
+# if True, will look for WNN, mofa, multivi, totalVI embeddings
 multimodal:
   run_clustering: True
   integration_method: 
@@ -40,22 +40,26 @@ multimodal:
 # ---------------------------------------
 # 
 # -----------------------------
+
 neighbors:
   rna:
+    #use the knn calculated in the integration workflow. If False it will recalculate
     use_existing: True
     dim_red: X_pca
     n_dim_red: 30
     k: 30
     metric: euclidean
     method: scanpy
   prot:
+    #use the knn calculated in the integration workflow. If False it will recalculate
     use_existing: True
     dim_red: X_pca
     n_dim_red: 30
     k: 30
     metric: euclidean
     method: scanpy
   atac:
+    #use the knn calculated in the integration workflow. If False it will recalculate
     use_existing: True
     dim_red: X_lsi
     dim_remove: 1
@@ -64,6 +68,7 @@ neighbors:
     metric: euclidean
     method: scanpy
   spatial:
+    #use the knn calculated in the integration workflow. If False it will recalculate
     use_existing: False
     dim_red: X_pca
     n_dim_red: 30
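
Reviewer note on the `use_existing` comments added in this file. The sketch below works out which modalities would have their neighbours recalculated for a given configuration. The dictionaries mirror the `modalities` and `neighbors` defaults described in this PR, reduced to the relevant keys. The function name `modalities_to_recalculate` is hypothetical, and the assumption that disabled modalities are skipped entirely is mine rather than something the diff states.

```python
from typing import Dict, List

# Defaults as documented in this PR, reduced to the keys used here.
modalities: Dict[str, bool] = {
    "rna": True,
    "prot": True,
    "atac": False,
    "spatial": False,
}
neighbors: Dict[str, Dict[str, bool]] = {
    "rna": {"use_existing": True},
    "prot": {"use_existing": True},
    "atac": {"use_existing": True},
    "spatial": {"use_existing": False},
}


def modalities_to_recalculate(
    mods: Dict[str, bool], nn_cfg: Dict[str, Dict[str, bool]]
) -> List[str]:
    """List enabled modalities whose neighbours would be recalculated.

    Hypothetical helper: assumes disabled modalities are skipped and that
    use_existing=False is the only trigger for recalculation.
    """
    return [
        name
        for name, enabled in mods.items()
        if enabled and not nn_cfg.get(name, {}).get("use_existing", False)
    ]


default_recalc = modalities_to_recalculate(modalities, neighbors)
spatial_recalc = modalities_to_recalculate({**modalities, "spatial": True}, neighbors)
```

With the defaults, nothing is recalculated; enabling `spatial` without changing its `use_existing: False` would trigger a recalculation for that modality only.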

diff --git a/panpipes/python_scripts/run_umap.py b/panpipes/python_scripts/run_umap.py
@@ -33,7 +33,7 @@
                     default=0.1, 
                     help="no. neighbours parameters for sc.pp.neighbors()")
 parser.add_argument("--neighbors_key", 
-                    default="neighbors", help="algortihm choice from louvain and leiden")
+                    default="neighbors", help="name of the saved knn neighbors")
 
 args, opt = parser.parse_known_args()
 L.info(args)
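
Reviewer note on the corrected `--neighbors_key` help text. The sketch below reproduces only that one argument with the default from this diff and shows how `parse_known_args` separates recognized from unrecognized options. The value `wnn` and the flag `--some_other_flag` are placeholders chosen for illustration; the diff does not state the key under which WNN neighbours are stored.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser containing only the argument touched in this diff."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--neighbors_key",
                        default="neighbors",
                        help="name of the saved knn neighbors")
    return parser


# No flag given: the default key "neighbors" is used.
default_args, _ = build_parser().parse_known_args([])

# "wnn" is a placeholder key; unrecognized options are returned separately.
args, opt = build_parser().parse_known_args(
    ["--neighbors_key", "wnn", "--some_other_flag", "1"]
)
```

Because the script calls `parse_known_args` rather than `parse_args`, extra options passed by the pipeline do not raise an error; they end up in `opt`.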