integration pipeline.yml modified until ATAC modality
Lilly-May committed Mar 6, 2024
1 parent 1637535 commit 3788e93
Showing 3 changed files with 86 additions and 127 deletions.
81 changes: 42 additions & 39 deletions docs/yaml_docs/pipeline_integration_yml.md
@@ -16,41 +16,45 @@ When running the integration workflow, panpipes provides you with a basic `pipeline.yml`

You can download the different integration pipeline.yml files here:
- Basic `pipeline.yml` file (not pre-filled) that is generated when calling `panpipes integration config`: [Download here](https://github.com/DendrouLab/panpipes/blob/main/panpipes/panpipes/pipeline_integration/pipeline.yml)
- `pipeline.yml` for Integration tutorial: [View and Download here](https://panpipes-tutorials.readthedocs.io/en/latest/uni_multi_integration/pipeline_yml.html)

For more information on the functionality implemented in `panpipes` for reading the configuration files, such as reading blocks of parameters and reusing blocks with `&anchors` and `*scalars`, please check [our documentation](./useful_info_on_yml.md).
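As a minimal sketch of how such reuse works (the `neighbors: &rna_neighbors` block is the one defined in the integration `pipeline.yml` below; the reusing key is hypothetical and shown only to illustrate the syntax):

```yaml
# Define a parameter block once and label it with an anchor (&)
neighbors: &rna_neighbors
  npcs: 30
  k: 30
  metric: euclidean
  method: scanpy

# Anywhere later in the same file, reuse the whole block via the matching alias (*)
wnn_neighbors: *rna_neighbors   # hypothetical key, shown only to illustrate reuse
```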

## Compute resources options

- <span class="parameter">resources</span>

<span class="parameter">resources</span><br>
Computing resources to use, specifically the number of threads used for parallel jobs.
Specified by the following parameters:

- <span class="parameter">threads_high</span> `Integer`, Default: 1<br>
Number of threads used for high intensity computing tasks.
For each thread, there must be enough memory to load your MuData object which was created in the preprocessing step of
the workflow.
- <span class="parameter">threads_high</span> `Integer`, Default: 1<br>
Number of threads used for high intensity computing tasks.
For each thread, there must be enough memory to load your MuData object which was created in the preprocessing step of
the workflow.

- <span class="parameter">threads_medium</span> `Integer`, Default: 1<br>
Number of threads used for medium intensity computing tasks.
For each thread, there must be enough memory to load your mudata and do computationally light tasks.
- <span class="parameter">threads_medium</span> `Integer`, Default: 1<br>
Number of threads used for medium intensity computing tasks.
For each thread, there must be enough memory to load your mudata and do computationally light tasks.

- <span class="parameter">threads_low</span> `Integer`, Default: 1<br>
Number of threads used for low intensity computing tasks.
For each thread, there must be enough memory to load text files and do plotting, requires much less memory than the other two.
- <span class="parameter">threads_gpu</span> `Integer`, Default: 2<br>
Number of threads used for low intensity computing tasks.
For each thread, there must be enough memory to load text files and do plotting, requires much less memory than the other two.
- <span class="parameter">threads_gpu</span> `Integer`, Default: 2<br>
Number of cores per gpu used for computing tasks.
For each thread, there must be enough memory to compute the tasks above.

<span class="parameter">condaenv</span> `String`<br>
Path to conda environment that should be used to run panpipes.
Leave blank if running natively or if your cluster automatically inherits the login node environment.

<span class="parameter">queues</span><br>
Allows for tweaking which queues jobs get submitted to, in case there is a special queue for long jobs, or you have access to a gpu-specific queue.
The default queue should be specified in your .cgat.yml file.
Leave blank if you do not want to use any alternative queues.
- <span class="parameter">long</span><br>
- <span class="parameter">gpu</span><br>

## Loading and merging data options
### Data format


<span class="parameter">sample_prefix</span> `String`, Mandatory parameter, Default: test<br>
Prefix for the sample that comes out of the filtering/preprocessing steps of the workflow.

@@ -60,9 +64,9 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th

## Batch correction

**Batch correction is done unimodally, meaning each modality is batch corrected independently.**

### RNA modality

<span class="parameter">rna:</span>
Batch correction for the RNA modality is specified by the following parameters:
@@ -80,7 +84,7 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th

The column name of the covariate you want to batch correct on; if a comma-separated list is specified, all columns will be used simultaneously.
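For example, correcting on two covariates at once could look like the sketch below (the `tissue` column is hypothetical; use column names that exist in your own metadata):

```yaml
rna:
  run: True
  tools: harmony
  # comma-separated, no spaces; the listed columns are merged into one 'batch' column
  column: sample_id,tissue
```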

#### Harmony arguments

- <span class="parameter">harmony:</span>
Basic parameters required to run harmony:
@@ -91,14 +95,14 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th

For more information on `harmony` check the [harmony documentation](https://portals.broadinstitute.org/harmony/reference/RunHarmony.html)

#### BBKNN arguments

- <span class="parameter">bbknn:</span>
- <span class="parameter">neighbors_within_batch:</span> `Integer`, Default: 3<br>

For more information on `bbknn` check the [bbknn documentation](https://bbknn.readthedocs.io/en/latest/)

#### SCVI arguments
- <span class="parameter">scvi</span>: SCVI parameters are specified as
- <span class="parameter">exclude_mt_genes:</span> `Boolean`, Default: True<br>
- <span class="parameter">exclude_mt_genes:</span> `String`, Default: mt<br>
@@ -134,7 +138,7 @@ For more information on `bbknn` check the [bbknn documentation](https://bbknn.re
For more information on `scvi` check the [scvi documentation](https://docs.scvi-tools.org/en/stable/api/reference/scvi.model.SCVI.html)

#### Find neighbour parameters
Parameters to compute the connectivity graph on RNA

- <span class="parameter">neighbors:</span> `String`<br>
@@ -152,7 +156,7 @@ Parameters to compute the connectivity graph on RNA
The method can be either `scanpy` or `hnsw`.


### Protein modality
<span class="parameter">prot:</span>
Batch correction for the protein modality is specified by the following parameters:

@@ -168,34 +172,33 @@ Parameters to compute the connectivity graph on Protein

The column you want to batch correct on; if a comma-separated list is specified, all will be used simultaneously.

#### Harmony arguments

<span class="parameter">harmony</span><br>
Basic parameters required to run harmony:

- <span class="parameter">sigma</span> `Float`, Default: 0.1<br>
- <span class="parameter">theta</span> `Float`, Default: 1.0<br>
- <span class="parameter">npcs</span> `Integer`, Default: 30<br>

For more information on `harmony` check the [harmony documentation](https://portals.broadinstitute.org/harmony/reference/RunHarmony.html)

#### BBKNN arguments

<span class="parameter">bbknn</span><br>
- <span class="parameter">neighbors_within_batch:</span> `Integer`, Default: 3<br>

For more information on `bbknn` check the [bbknn documentation](https://bbknn.readthedocs.io/en/latest/)

#### Find neighbour parameters

Parameters to compute the connectivity graph on Protein

- <span class="parameter">neighbors:</span> `String`, Default: &prot_neighbors<br>
<span class="parameter">neighbors</span> `String`, Default: &prot_neighbors<br>

- <span class="parameter">npcs</span> `Integer`, Default: 30<br>
Number of principal components to calculate for neighbors and Umap
- <span class="parameter">npcs</span> `Integer`, Default: 30<br>
Number of principal components to calculate for neighbors and Umap

- <span class="parameter">k</span> `Integer`, Default: 30<br>
Number of neighbors
@@ -207,7 +210,7 @@ Parameters to compute the connectivity graph on Protein
The method can be either `scanpy` or `hnsw`.


### ATAC modality

<span class="parameter">atac:</span>
Batch correction for the ATAC modality is specified by the following parameters:
@@ -226,7 +229,7 @@ Parameters to compute the connectivity graph on Protein

The column you want to batch correct on; if a comma-separated list is specified, all will be used simultaneously.

#### Harmony arguments

- <span class="parameter">harmony:</span>
Basic parameters required to run harmony:
3 changes: 2 additions & 1 deletion docs/yaml_docs/pipeline_preprocess_yml.md
@@ -424,9 +424,10 @@ Whether applying scaling or not is still a matter of debate, as stated in the [L
- <span class="parameter">color_by</span> `String`, Default: sample_id<br>
Specify the covariate you want to use to color the dimensionality reduction plot.

- <span class="parameter">dim_remove</span> `TODO`<br>
- <span class="parameter">dim_remove</span> `Integer`<br>
Whether to remove the component(s) associated to technical artifacts.
For instance, it is common to remove the first LSI component, as it is often associated with batch effects.
Specify `1` to remove the first component.
Leave blank to avoid removing any.
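As a minimal sketch (the surrounding keys and exact nesting of the preprocessing `pipeline.yml` are omitted here), the two parameters above could be set as:

```yaml
color_by: sample_id
dim_remove: 1   # drop the first LSI component; leave blank to keep all components
```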


129 changes: 42 additions & 87 deletions panpipes/panpipes/pipeline_integration/pipeline.yml
@@ -1,76 +1,59 @@
# ============================================================
# Integration workflow Panpipes (pipeline_integration.py)
# ============================================================
# written by Charlotte Rich-Griffin, Fabiola Curion
# This file contains the parameters for the integration workflow.
# For full descriptions of the parameters, see the documentation at https://panpipes-pipelines.readthedocs.io/en/latest/yaml_docs/pipeline_integration_yml.html

#--------------------------
# Compute resources options
#--------------------------
resources:
# Number of threads used for parallel jobs
# this must be enough memory to load your mudata and do computationally intensive tasks
threads_high: 1
# this must be enough memory to load your mudata and do computationally light tasks
threads_medium: 1
# this must be enough memory to load text files and do plotting, requires much less memory than the other two
threads_low: 1
# if you have access to a gpu-specific queue, how many gpu threads to request; make sure to edit the queues section below,
# so that panpipes can find your gpu queue

threads_gpu: 2
# path to conda env, leave blank if running natively or your cluster automatically inherits the login node environment

condaenv:

# allows for tweaking which queues jobs get submitted to,
# in case there is a special queue for long jobs or you have access to a gpu-specific queue
# the default queue should be specified in your .cgat.yml file
# leave blank if you do not want to use the alternative queues
queues:
long:
gpu:

# --------------------------------
# Loading and merging data options
# --------------------------------

# ----------------------------
# Data format
sample_prefix: test
#this is what comes out of the filtering/preprocessing
preprocessed_obj: ../preprocess/test.h5mu
# contains layers: raw_counts, logged_counts, and has scaled or logged counts in X


#-----------------
# Batch correction
# ----------------
# Batch correction is done unimodally, meaning each modality is batch corrected independently

# ------------
# RNA modality
rna:
# True or false depending on whether you want to run batch correction
run: True
# what method(s) to use to run batch correction, you can specify multiple
# choices: harmony,bbknn,scanorama,scvi (comma-separated string, no spaces)
tools: harmony,bbknn,scanorama,scvi
# this is the column you want to batch correct on. if you specify a comma separated list,
# they will all be used simultaneously.
# Specifically all columns specified will be merged into one 'batch' column.
# if you want to test correction for one at a time,
# specify one at a time and run the pipeline in different folders i.e. integration_by_sample,
# integration_by_tissue ...
column: sample_id

# Harmony arguments
harmony:
# sigma value, used by Harmony
sigma: 0.1
# theta value used by Harmony, default is 1
theta: 1.0
# number of pcs, used by Harmony
npcs: 30
#----------------------------

# BBKNN args # https://bbknn.readthedocs.io/en/latest/
#-----------------------------
bbknn:
neighbors_within_batch:
#-----------------------------

# SCVI args
#-----------------------------
scvi:
exclude_mt_genes: True
mt_column: mt
@@ -89,68 +72,40 @@ rna:
lr_scheduler_metric:
lr_patience: 8
lr_factor: 0.1
# to reuse these params (for example for WNN), please use anchors (&) and scalars (*) in the relevant place
# i.e. &rna_neighbors will be called by *rna_neighbors where referenced

# Find neighbour parameters
neighbors: &rna_neighbors
# number of Principal Components to calculate for neighbours and umap:
# -if no correction is applied, PCA will be calculated and used to run UMAP and clustering on
# -if Harmony is the method of choice, it will use these components to create a corrected dim red.
# the maximum number of dims for neighbors calculation can only be lower or equal to the total number of dims for PCA or Harmony
# note: scvelo default is 30
npcs: 30
# number of neighbours
k: 30
# metric: euclidean | cosine
metric: euclidean
# scanpy | hnsw (from scvelo)
method: scanpy

# ----------------
# Protein modality
prot:
# True or false depending on whether you want to run batch correction
run: True
# what method(s) to use to run batch correction, you can specify multiple
# choices: harmony,bbknn,combat
tools: harmony
# this is the column you want to batch correct on. if you specify a comma separated list (no spaces),
# they will all be used simultaneously. if you want to test correction for one at a time,
# specify one at a time and run the pipeline in different folders i.e. integration_by_sample,
# integration_by_tissue ...
column: sample_id
#----------------------------

# Harmony args
#-----------------------------
harmony:
# sigma value, used by Harmony
sigma: 0.1
# theta value used by Harmony, default is 1
theta: 1.0
# number of pcs, used by Harmony
npcs: 30
#----------------------------

# BBKNN args # https://bbknn.readthedocs.io/en/latest/
#-----------------------------
bbknn:
neighbors_within_batch:

# Find neighbour parameters
neighbors: &prot_neighbors
# number of Principal Components to calculate for neighbours and umap:
# -if no correction is applied, PCA will be calculated and used to run UMAP and clustering on
# -if Harmony is the method of choice, it will use these components to create a corrected dim red.
# note: scvelo default is 30
npcs: 30
# number of neighbours
k: 30
# metric: euclidean | cosine
metric: euclidean
# scanpy | hnsw (from scvelo)
method: scanpy

# -------------
# ATAC modality
atac:
# True or false depending on whether you want to run batch correction
run: False
