diff --git a/docs/usage/gene_list_format.md b/docs/usage/gene_list_format.md index 52e98550..2606f967 100644 --- a/docs/usage/gene_list_format.md +++ b/docs/usage/gene_list_format.md @@ -170,6 +170,33 @@ minimal: Generally in the visualization pipeline all gene groups in the input are plotted. In heatmaps and dotplots, one dotplot per group is plotted. For UMAPs, one plot per gene is plotted, and a new file is saved per group. + +## Plot Markers in the Visualization workflow + +The custom marker csv files for the full and minimal lists must contain three columns and follow this structure: + | mod | feature | group | + |------|----------|--------------| + | prot | prot_CD8 | Tcellmarkers | + | rna | CD8A | Tcellmarkers | + +The full list will be plotted in dot plots and matrix plots, with one plot per group. + +The shorter list will be plotted on umaps as well as other plot types, with one plot per group. + + | feature_1 | feature_2 | colour | + |-----------|-----------|----------------| + | CD8A | prot_CD8 | | + | CD4 | CD8A | doublet_scores | + + + +## Plot metadata variables +The scatter_features.csv file should have the following format: + + | feature_1 | feature_2 | colour | + |-----------|-----------|----------------| + | rna:total_counts | prot:total_counts | doublet_scores | + ## Final notes Be deliberate and informative with the choice of group names for any gene set use, since the `.obs` column generated as output will be named based on the group of the gene list input file. 
diff --git a/docs/yaml_docs/index.rst b/docs/yaml_docs/index.rst index 94ab5126..49bfeeea 100644 --- a/docs/yaml_docs/index.rst +++ b/docs/yaml_docs/index.rst @@ -8,6 +8,7 @@ Workflows configuration files pipeline_ingestion_yml pipeline_preprocess_yml pipeline_integration_yml + pipeline_clustering_yml spatial_qc spatial_preprocess spatial_deconvolution \ No newline at end of file diff --git a/docs/yaml_docs/pipeline_clustering_yml.md b/docs/yaml_docs/pipeline_clustering_yml.md new file mode 100644 index 00000000..bc5a22dd --- /dev/null +++ b/docs/yaml_docs/pipeline_clustering_yml.md @@ -0,0 +1,276 @@ + + +# Clustering YAML + +In this documentation, the parameters of the `clustering` configuration yaml file are explained. +This file is generated by running `panpipes clustering config`.
+The individual steps run by the pipeline are described in the [clustering workflow](https://panpipes-pipelines.readthedocs.io/en/latest/workflows/clustering.html) documentation. + +When running the clustering workflow, panpipes provides a basic `pipeline.yml` file. +To run the workflow on your own data, you need to specify the parameters described below in the `pipeline.yml` file to meet the requirements of your data. + +However, we do provide pre-filled versions of the `pipeline.yml` file for individual [tutorials](https://panpipes-pipelines.readthedocs.io/en/latest/tutorials/index.html). + +For more information on functionalities implemented in `panpipes` to read the configuration files, such as reading blocks of parameters and reusing blocks with `&anchors` and `*aliases`, please check [our documentation](./useful_info_on_yml.md). + +You can download the different clustering pipeline.yml files here: +- Basic `pipeline.yml` file (not prefilled) that is generated when calling `panpipes clustering config`: [Download here](https://github.com/DendrouLab/panpipes/blob/main/panpipes/panpipes/pipeline_clustering/pipeline.yml) +- `pipeline.yml` for [Clustering Tutorial](https://panpipes-tutorials.readthedocs.io/en/latest/_downloads/3895aa0ba60017b15ee1aa6531dc8c25/pipeline.yml) + +## Compute resources options + +- resources
+Computing resources to use, specifically the number of threads used for parallel jobs. +Specified by the following three parameters: + - threads_high `Integer`, Default: 2
+ Number of threads used for high intensity computing tasks. + For each thread, there must be enough memory to load all your input files at once and create the MuData object. + + - threads_medium `Integer`, Default: 2
+ Number of threads used for medium intensity computing tasks. + For each thread, there must be enough memory to load your mudata and do computationally light tasks. + + - threads_low `Integer`, Default: 2
+ Number of threads used for low intensity computing tasks. + For each thread, there must be enough memory to load text files and do plotting; this requires much less memory than the other two. + - fewer_jobs `Boolean`, Default: True
+ + - condaenv `String` (Path)
+ Path to the conda environment that should be used to run panpipes. + Leave blank if running natively or if your cluster automatically inherits the login node environment. + +## Loading data +### Data format + +- sample_prefix `String`, Mandatory parameter, Default: mdata
+Prefix for the sample that comes out of the filtering/preprocessing steps of the workflow. + + +- scaled_obj `String`, Mandatory parameter, Default: mdata_scaled.h5mu
+ Path to the output file from preprocessing (e.g. `../preprocessed/mdata_scaled.h5mu`). + Ensure that the path to the file is correct. + +- full_obj `String`, Default:
+ Specify the full object if your scaled_obj contains only HVG. If your scaled_obj contains all the genes then leave full_obj blank. + panpipes will use the full object for marker gene analysis (rank_genes_groups) and for plotting those genes. +- modalities
+ - rna `Boolean`, Default: True
+ - prot `Boolean`, Default: True
+ - atac `Boolean`, Default: False
+ - spatial `Boolean`, Default: False
+ Run clustering on each individual modality. + +- multimodal
+ - rna_clustering `Boolean`, Default: True
+ - integration_method `String`, Default: WNN
+ Options include WNN, mofa, and totalVI. This setting tells the pipeline which multimodal integration output to look for. + +## Parameters for finding neighbours + +- neighbors: + Sets the number of neighbors to use when calculating the graph for clustering and umap. + - rna: + + - use_existing `Boolean`, Default: True
+ - dim_red `String`, Default: X_pca
+ Defines which representation in .obsm to use for nearest neighbors + - n_dim_red `Integer`, Default: 30
+ Number of components to use for clustering + - k `Integer`, Default: 30
+ Number of neighbours + - metric `String`, Default: euclidean
+ Options here include euclidean and cosine + - method `String`, Default: scanpy
+ Options include scanpy and hnsw (from scvelo) + + + - prot: + + - use_existing `Boolean`, Default: True
+ - dim_red `String`, Default: X_pca
+ Defines which representation in .obsm to use for nearest neighbors + - n_dim_red `Integer`, Default: 30
+ Number of components to use for clustering + - k `Integer`, Default: 30
+ Number of neighbours + - metric `String`, Default: euclidean
+ Options here include euclidean and cosine + - method `String`, Default: scanpy
+ Options include scanpy and hnsw (from scvelo) + + + - atac: + + - use_existing `Boolean`, Default: True
+ - dim_red `String`, Default: X_lsi
+ Defines which representation in .obsm to use for nearest neighbors + - n_dim_red `Integer`, Default: 1
+ Number of components to use for clustering + - k `Integer`, Default: 30
+ Number of neighbours + - metric `String`, Default: euclidean
+ Options here include euclidean and cosine + - method `String`, Default: scanpy
+ Options include scanpy and hnsw (from scvelo) + + + + - spatial: + + - use_existing `Boolean`, Default: False
+ - dim_red `String`, Default: X_pca
+ Defines which representation in .obsm to use for nearest neighbors + - n_dim_red `Integer`, Default: 30
+ Number of components to use for clustering + - k `Integer`, Default: 30
+ Number of neighbours + - metric `String`, Default: euclidean
+ Options here include euclidean and cosine + - method `String`, Default: scanpy
+ Options include scanpy and hnsw (from scvelo) + +## Parameters for umap calculation + + + - umap: + + - run `Boolean`, Default: True
+ - rna: + - mindist `Float`, Default: 0.5
+ Can specify an array: 0.25,0.5 + - prot: + - mindist `Float`, Default: 0.5
+ Can specify an array: 0.25,0.5,0.8 + - atac: + - mindist `Float`, Default: 0.5
+ Can specify an array: 0.25,0.5,0.8 + - multimodal: + - mindist `Float`, Default: 0.5
+ Can specify an array: 0.25,0.5,0.8 + - spatial: + - mindist `Float`, Default: 0.5
+ Can specify an array: 0.25,0.5,0.8 + +## Parameters for clustering + + - clusterspecs: + - rna: + - resolutions `Float`, Default: 0.2, 0.6, 1
+ Can specify an array: 0.2,0.6,1 + - algorithm `String`, Default: leiden
+ Options include louvain or leiden. + - prot: + - resolutions `Float`, Default: 0.2, 0.6, 1
+ Can specify an array: 0.2,0.6,1 + - algorithm `String`, Default: leiden
+ Options include louvain or leiden. + + - atac: + - resolutions `Float`, Default: 0.2, 0.6, 1
+ Can specify an array to compute in parallel: 0.2,0.6,1 + - algorithm `String`, Default: leiden
+ Options include louvain or leiden. + - multimodal: + - resolutions `Float`, Default: 0.5, 0.7
+ Can specify an array to compute in parallel: 0.2,0.6,1 + - algorithm `String`, Default: leiden
+ Options include louvain or leiden. + + - spatial: + - resolutions `Float`, Default: 0.2, 0.6, 1
+ Can specify an array to compute in parallel: 0.2,0.6,1 + - algorithm `String`, Default: leiden
+ Options include louvain or leiden. + +## Parameters for finding marker genes + +In this part of the analysis we define parameters to run marker analysis. +By default, pseudo_seurat is set to False, and we run [scanpy.tl.rank_genes_groups](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.rank_genes_groups.html). +When pseudo_seurat is set to True, a [python implementation](https://github.com/DendrouLab/panpipes/blob/main/panpipes/python_scripts/run_find_markers_multi.py) of `Seurat:::FindMarkers` is run instead. + + - markerspecs:
+ - rna:
+ - run `Boolean`, Default: True
+ - layer `String`, Default: logged_counts
+ Which layer stores counts for differential expression test. + - method `String`, Default: t-test_overestim_var
+ Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’ + - mincells `Integer`, Default: 10
+ Marker analysis is run only for clusters with at least mincells cells. Clusters with fewer cells are excluded from marker analysis. + - pseudo_seurat `Boolean`, Default: False
+ - minpct `Float`, Default: 0.1
+ This parameter is mandatory if pseudo_seurat is set to True + - threshuse `Float`, Default: 0.25
+ This parameter is mandatory if pseudo_seurat is set to True + - prot:
+ - run `Boolean`, Default: True
+ - layer `String`, Default: clr
+ Which layer stores counts for differential expression test. + - mincells `Integer`, Default: 10
+ Marker analysis is run only for clusters with at least mincells cells. Clusters with fewer cells are excluded from marker analysis. + - method `String`, Default: wilcoxon
+ - pseudo_seurat `Boolean`, Default: False
+ - minpct `Float`, Default: 0.1
+ This parameter is mandatory if pseudo_seurat is set to True + - threshuse `Float`, Default: 0.25
+ This parameter is mandatory if pseudo_seurat is set to True + + - atac:
+ - run `Boolean`, Default: False
+ - layer `String`, Default: logged_counts
+ Which layer stores counts for differential expression test. + Options include logged_counts, signac_norm, logTF_norm, and logIDF_norm + - mincells `Integer`, Default: 10
+ Marker analysis is run only for clusters with at least mincells cells. Clusters with fewer cells are excluded from marker analysis. + - method `String`, Default: wilcoxon
+ Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’ + - pseudo_seurat `Boolean`, Default: False
+ - minpct `Float`, Default: 0.1
+ This parameter is mandatory if pseudo_seurat is set to True + - threshuse `Float`, Default: 0.25
+ This parameter is mandatory if pseudo_seurat is set to True + + + - multimodal:
+ - mincells `Integer`, Default: 10
+ If the cluster contains fewer than this number of cells, the marker analysis won't be run. + - method `String`, Default: wilcoxon
+ Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’ + - pseudo_seurat `Boolean`, Default: False
+ - minpct `Float`, Default: 0.1
+ This parameter is mandatory if pseudo_seurat is set to True + - threshuse `Float`, Default: 0.25
+ This parameter is mandatory if pseudo_seurat is set to True + + + - spatial:
+ - run `Boolean`, Default: True
+ - layer `String`, Default: norm_pearson_resid
+ Options include logged_counts, signac_norm, logTF_norm, and logIDF_norm + - method `String`, Default: t-test_overestim_var
+ Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’ + - mincells `Integer`, Default: 10
+ Marker analysis is run only for clusters with at least mincells cells. Clusters with fewer cells are excluded from marker analysis. + - pseudo_seurat `Boolean`, Default: False
+ - minpct `Float`, Default: 0.1
+ This parameter is mandatory if pseudo_seurat is set to True + - threshuse `Float`, Default: 0.25
+ This parameter is mandatory if pseudo_seurat is set to True +## Plot specifications +Used to define which metadata columns are used in the visualizations + - plotspecs:
+ - layers:
+ - rna `String`, Default: logged_counts
+ - prot `String`, Default: clr
+ - atac `String`, Default: signac_norm
+ - spatial `String`, Default: None
+ Options include lognorm and norm_pearson_resid, depending on what was selected during preprocessing. + - top_n_markers `Integer`, Default: 10
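+
+Putting some of the defaults documented above together, the `rna` entries of `neighbors` and `markerspecs` might look like the sketch below. The nesting is inferred from the bullet indentation on this page and is an assumption, not a copy of the shipped file; the `pipeline.yml` generated by `panpipes clustering config` remains the reference.
+
+```yaml
+# Sketch only: key nesting is assumed from the bullet structure of this page.
+# Values are the defaults listed above.
+neighbors:
+  rna:
+    use_existing: True
+    dim_red: X_pca        # representation in .obsm used for nearest neighbors
+    n_dim_red: 30         # number of components
+    k: 30                 # number of neighbours
+    metric: euclidean     # or cosine
+    method: scanpy        # or hnsw
+markerspecs:
+  rna:
+    run: True
+    layer: logged_counts
+    method: t-test_overestim_var
+    mincells: 10          # clusters with fewer cells are excluded
+    pseudo_seurat: False
+    # the two values below are mandatory when pseudo_seurat is True
+    minpct: 0.1
+    threshuse: 0.25
+```
+
+The other modalities (`prot`, `atac`, `spatial`) take the same keys with the defaults listed in their own sections.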