Skip to content

Example of results of Individual Analysis for a school case

mAGLAVE edited this page Jul 20, 2022 · 11 revisions

These results were generated on the Flamingo computing cluster of GR.

The dataset:

For this example, I use the sc5p_v2_hs_PBMC_1k_5 sample from 10XGenomics (found here: https://support.10xgenomics.com/single-cell-vdj/datasets/4.0.0/sc5p_v2_hs_PBMC_1k).

It is Human PBMC sample from a Healthy Donor with:

  • Gene Expression,
  • BCR,
  • TCR,
  • ADT (named Feature Barcoding by 10XGenomics).

This sample contains average 1k cells and it was made by v2 chemistry technology.

Preparation of the analysis:

Make the ADT reference index:

In sc5p_v2_hs_PBMC_1k_feature_ref.csv file (found here too: https://support.10xgenomics.com/single-cell-vdj/datasets/4.0.0/sc5p_v2_hs_PBMC_1k), we have the list of protein names and sequence of the ADT tag.

sc5p_v2_hs_PBMC_1k_feature_ref.csv
id name read pattern sequence feature_type
CD3 CD3_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN CTCATTGTAACTCCT Antibody Capture
CD19 CD19_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN CTGGGCAATTACTCG Antibody Capture
CD45RA CD45RA_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN TCAATCCTTCCGCTT Antibody Capture
CD4 CD4_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN TGTTCCCGCTCAACT Antibody Capture
CD8a CD8a_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN GCTGCGCTTTCCATT Antibody Capture
CD14 CD14_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN TCTCAGACCTCCGTA Antibody Capture
CD16 CD16_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN AAGTTCACTCTTTGC Antibody Capture
CD56 CD56_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN TTCGCCGCATTGAGT Antibody Capture
CD25 CD25_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN TTTGTCCTGTACGCC Antibody Capture
CD45RO CD45RO_TotaC R2 ^NNNNNNNNNN(BC)NNNNNNNNN CTCCGAATCATGTTG Antibody Capture
PD-1 PD-1_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN ACAGCGCCGTATTTA Antibody Capture
TIGIT TIGIT_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN TTGCTTACCGCCAGA Antibody Capture
IgG1 IgG1_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN GCCGGACGACATTAA Antibody Capture
IgG2a IgG2a_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN CTCCTACCTAAACTG Antibody Capture
IgG2b IgG2b_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN ATATGTATCACGCGA Antibody Capture
CD127 CD127_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN GTGTGTTGTCCTATG Antibody Capture
CD15 CD15_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN TCACCAGTACCTAGT Antibody Capture
CD197_(CCR7) CD197_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN AGTTCAGTCAACCGA Antibody Capture
HLA-DR HLA-DR_TotalC R2 ^NNNNNNNNNN(BC)NNNNNNNNN AATAGCGAGCAAGTA Antibody Capture

1. Make the fasta file (thanks to the sc5p_v2_hs_PBMC_1k_feature_ref.csv file)

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/INDEX_ADT/ADT.fa
>CD3
CTCATTGTAACTCCT
>CD19
CTGGGCAATTACTCG
>CD45RA
TCAATCCTTCCGCTT
>CD4
TGTTCCCGCTCAACT
>CD8a
GCTGCGCTTTCCATT
>CD14
TCTCAGACCTCCGTA
>CD16
AAGTTCACTCTTTGC
>CD56
TTCGCCGCATTGAGT
>CD25
TTTGTCCTGTACGCC
>CD45RO
CTCCGAATCATGTTG
>PD-1
ACAGCGCCGTATTTA
>TIGIT
TTGCTTACCGCCAGA
>IgG1
GCCGGACGACATTAA
>IgG2a
CTCCTACCTAAACTG
>IgG2b
ATATGTATCACGCGA
>CD127
GTGTGTTGTCCTATG
>CD15
TCACCAGTACCTAGT
>CCR7
AGTTCAGTCAACCGA
>HLA-DR
AATAGCGAGCAAGTA

2. Make the tr2gs file (thanks to the sc5p_v2_hs_PBMC_1k_feature_ref.csv file)

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/INDEX_ADT/ADT_tr2gs.txt
transcript gene gene_name
CD3 CD3 CD3
CD19 CD19 CD19
CD45RA CD45RA CD45RA
CD4 CD4 CD4
CD8a CD8a CD8a
CD14 CD14 CD14
CD16 CD16 CD16
CD56 CD56 CD56
CD25 CD25 CD25
CD45RO CD45RO CD45RO
PD-1 PD-1 PD-1
TIGIT TIGIT TIGIT
IgG1 IgG1 IgG1
IgG2a IgG2a IgG2a
IgG2b IgG2b IgG2b
CD127 CD127 CD127
CD15 CD15 CD15
CCR7 CCR7 CCR7
HLA-DR HLA-DR HLA-DR

3. Make the index

cd '/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/INDEX_ADT/'
kallisto index 'ADT.fa' -i 'ADT.kidx' --kmer-size=15

Make the Markfile (Additional files):

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/markfile.xlsx
Genes Signatures
IL7R Naive_CD4_T
CCR7 Naive_CD4_T
CD14 CD14_Mono
LYZ CD14_Mono
IL7R Memory_CD4
S100A4 Memory_CD4
MS4A1 B
CD8A CD8_T
FCGR3A FCGR3A_Mono
MS4A7 FCGR3A_Mono
GNLY NK
NKG7 NK
FCER1A DC
CST3 DC
PPBP Platelet

Make the analysis:

As specified in the Pipeline phylosophy (Pipeline details section of this wiki), single-cell data analysis cannot be done through a one-shot pipeline.
So, the rest of the explanation will be a repetition of several phases:

  • Make the Configuration Parameter file
  • Launch of the analysis
  • Interpretation of Results (and Choice of Next Parameters)

Notes:

1- Make the Configuration Parameter file:

We will do all the alignments, and evaluate the quality of the droplets.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Params.yaml
Steps: ["Alignment_countTable_GE","Alignment_countTable_ADT","Alignment_annotations_TCR_BCR","Droplets_QC_GE"]

############################################ GE (RNA) ############################################

Alignment_countTable_GE:
  ### Project
  sample.name.ge : ["sc5p_v2_hs_PBMC_1k_5gex"]
  input.dir.ge : '/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/'
  output.dir.ge : '/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/'
  sctech : '10xv2'
  kindex.ge : '/mnt/beegfs/database/bioinfo/single-cell/INDEX/KB-python_KALLISTO/0.24.4_0.46.2/homo_sapiens/GRCh38/Ensembl/r99/cDNA_LINCs_MIRs/GRCH38_r99_cDNA_linc_mir.kidx'
  tr2g.file.ge : '/mnt/beegfs/database/bioinfo/single-cell/INDEX/KB-python_KALLISTO/0.24.4_0.46.2/homo_sapiens/GRCh38/Ensembl/r99/cDNA_LINCs_MIRs/GRCH38_r99_cDNA_linc_mir_tr2gs.txt'
  reference.txt: "Ensembl reference transcriptome v99 corresponding to the homo sapiens GRCH38 build"

Droplets_QC_GE:
  author.name : "marine aglave"
  author.mail : "marine.aglave@gustaveroussy.fr, bigr@gustaveroussy.fr"

############################################ ADT ############################################

Alignment_countTable_ADT:
  sample.name.adt : ["sc5p_v2_hs_PBMC_1k_5fb"]
  input.dir.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/"
  output.dir.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/"
  kindex.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/INDEX_ADT/ADT.kidx"
  tr2g.file.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/INDEX_ADT/ADT_tr2gs.txt"

############################################ TCR/BCR ############################################

Alignment_annotations_TCR_BCR:
  sample.name.tcr : ["sc5p_v2_hs_PBMC_1k_t"]
  input.dir.tcr : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/"
  sample.name.bcr : ["sc5p_v2_hs_PBMC_1k_b"]
  input.dir.bcr : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/"
  output.dir.tcr_bcr : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/"

1- Launch of the analysis:

For the traceability of the analysis, I prefer to put the command lines in a script, but it is not mandatory.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/launcher.sh
#!/bin/bash

########################################################################
## Single-cell script to launch single-cell pipeline
##
## using: sbatch /mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/launcher.sh
########################################################################
#SBATCH --job-name=pipeline_sc
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --partition=mediumq

conda activate /mnt/beegfs/userdata/m_aglave/.environnement_conda/single_cell_user
module load singularity

path_to_configfile="/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Params.yaml"
path_to_pipeline="/mnt/beegfs/pipelines/single-cell"

snakemake --profile ${path_to_pipeline}/profiles/slurm -s ${path_to_pipeline}/Snakefile --configfile ${path_to_configfile}

conda deactivate
sbatch /mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/launcher.sh

1- Interpretation of results:

Quality control of reads:

#provided by the Alignment_countTable_GE step
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/QC_reads/sc5p_v2_hs_PBMC_1k_5gex_GE_RAW.html
#provided by the Alignment_countTable_ADT step
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5fb_ADT/QC_reads/sc5p_v2_hs_PBMC_1k_5fb_ADT_RAW.html
#provided by the Alignment_annotations_TCR_BCR step
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_t_TCR/QC_reads/sc5p_v2_hs_PBMC_1k_t_TCR_RAW.html
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_b_BCR/QC_reads/sc5p_v2_hs_PBMC_1k_b_BCR_RAW.html

It is a classical way to make controle-quality of reads. We can check the number of reads, the reads lentgh (R1 = Barcodes + UMI; R2 = ARN) and other usual metrics.
If there are duplicated reads, it's totally normal, it's the role of UMI to correct the PCR duplicates.

Alignment metrics:

provided by the Alignment_countTable_GE step

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/KALLISTOBUS/run_info.json
{
	"n_targets": 0,
	"n_bootstraps": 0,
	"n_processed": 79556908,
	"n_pseudoaligned": 51554724,
	"n_unique": 11755362,
	"p_pseudoaligned": 64.8,
	"p_unique": 14.8,
	"kallisto_version": "0.46.2",
	"index_version": 0,
	"start_time": "Fri Apr 16 18:56:49 2021",
	"call": "kallisto bus -i /mnt/beegfs/database/bioinfo/single-cell/INDEX/KB-python_KALLISTO/0.24.4_0.46.2/homo_sapiens/GRCh38/Ensembl/r99/cDNA_LINCs_MIRs/GRCH38_r99_cDNA_linc_mir.kidx -o /mnt/beegfs/userdata/m_aglave/pipeline/test_new_data/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/KALLISTOBUS -x 10xv2 -t 4 /mnt/beegfs/userdata/m_aglave/pipeline/test_new_data/Results/fastq/sc5p_v2_hs_PBMC_1k_5gex_GE_S1_L001_R1_001.fastq.gz /mnt/beegfs/userdata/m_aglave/pipeline/test_new_data/Results/fastq/sc5p_v2_hs_PBMC_1k_5gex_GE_S1_L001_R2_001.fastq.gz /mnt/beegfs/userdata/m_aglave/pipeline/test_new_data/Results/fastq/sc5p_v2_hs_PBMC_1k_5gex_GE_S1_L002_R1_001.fastq.gz /mnt/beegfs/userdata/m_aglave/pipeline/test_new_data/Results/fastq/sc5p_v2_hs_PBMC_1k_5gex_GE_S1_L002_R2_001.fastq.gz"
}

provided by the Alignment_countTable_ADT step

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5fb_ADT/KALLISTOBUS/run_info.json
{
	"n_targets": 0,
	"n_bootstraps": 0,
	"n_processed": 11043775,
	"n_pseudoaligned": 10520338,
	"n_unique": 10520338,
	"p_pseudoaligned": 95.3,
	"p_unique": 95.3,
	"kallisto_version": "0.46.2",
	"index_version": 0,
	"start_time": "Fri Apr 16 18:52:01 2021",
	"call": "kallisto bus -i /mnt/beegfs/userdata/m_aglave/pipeline/test_new_data/INDEX_ADT/ADT.kidx -o /mnt/beegfs/userdata/m_aglave/pipeline/test_new_data/Results/sc5p_v2_hs_PBMC_1k_5fb_ADT/KALLISTOBUS -x 10xv2 -t 1 /mnt/beegfs/userdata/m_aglave/pipeline/test_new_data/Results/fastq/sc5p_v2_hs_PBMC_1k_5fb_ADT_S1_L001_R1_001.fastq.gz /mnt/beegfs/userdata/m_aglave/pipeline/test_new_data/Results/fastq/sc5p_v2_hs_PBMC_1k_5fb_ADT_S1_L001_R2_001.fastq.gz /mnt/beegfs/userdata/m_aglave/pipeline/test_new_data/Results/fastq/sc5p_v2_hs_PBMC_1k_5fb_ADT_S1_L002_R1_001.fastq.gz /mnt/beegfs/userdata/m_aglave/pipeline/test_new_data/Results/fastq/sc5p_v2_hs_PBMC_1k_5fb_ADT_S1_L002_R2_001.fastq.gz"
}
  • n_processed : number of fragments that have been processed.
  • n_pseudoaligned : number of fragments that have been pseudoaligned.
  • p_pseudoaligned : percentage of fragments that have been pseudoaligned.
  • n_unique : number of fragments that have been pseudoaligned to a unique target sequence.
  • p_unique : percentage of fragments that have been pseudoaligned to a unique target sequence.
  • kallisto_version : version of kallisto tool used for the alignment.
  • call : command line used to run the alignment.

Quality control of droplets:

The saturation plot:

Provided by the Droplets_QC_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/QC_droplets/sc5p_v2_hs_PBMC_1k_5gex_GE_saturation_plot.png

These graphs allow to estimate the complexity of the data (more genes is detected per UMI, more complex is our data) and therefore estimate the saturation of the sequencing.
The first graph is a representation of the number of genes by droplet acording to the number of UMIs by droplet. The droplet with the less of genes and the less of UMI are the empty droplets (containing the ambiant RNA). The red line is the trend curve. If the curve forms a plateau at the top, then all the genes have been discovered.
The second graph is a representation of the density, according to the ratio between genes and umi for each droplet. Generally, we expect the novelty score to be above 0.80. The dark blue curve, representing full droplets, shows a peak in density demonstrating that there is a large set of droplets with high complexity. Low complexity droplets (<0.80) are clearly identified as empty droplets. The empty droplet peak on the right should correspond to droplets where only 1UMI (and therefore only 1 gene) has been detected, granting a ratio of 1.
Note:
More information about the second graph here.

The kneeplot:

Provided by the Droplets_QC_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/QC_droplets/sc5p_v2_hs_PBMC_1k_5gex_GE_kneeplot.png

It's a representation of the number of UMIs by droplet rank. The top plateau corresponds to the real cells and the bottom plateau corresponds to the empty droplets. We must to see this 2 plateaus for a good identification of empty droplets.
Between this 2 plateaus, there are little cells or heterogeneous empty droplets. The Knee Line and the Inflection Line are the caracteristic lines computed by the emptyDrops() function.

Histograms:

Provided by the Droplets_QC_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/QC_droplets/sc5p_v2_hs_PBMC_1k_5gex_GE_QChist.png

At this step, the empty droplets are removed.
It's a representation of the cells frequencies according to the number of genes (nFeatures_RNA), the number of transcripts (nCount_RNA), the percentage of mitochondrial transcripts (%mito) and the percentage of ribosomal transcripts (%ribo). The red lines are the thresholds choosen. In the title of each histogram there is the threshold and the the number of kept cell if the threshold is applied.
Cells must have less than 20% of mitochondrial RNA for publication, and in general, we use a value within 5% (20%, 15%, 10% or 5%). Here many cells (947 cells) have less than 20% so the cells are not dying. We could set the threshold at 15%, or even 10% (to be really stringent), to keep only the best quality cells. So for the Filtering_GE step, we will choose 15% as the maximum threshold of the percentage of mitochondrial RNA. For the percentage of ribosomal RNA are rarely filtered.
For the number of genes and the number of transcripts, the filtering depends on data.

2- Make the Configuration Parameter file:

We will do the filtering of droplets, and evaluate the identification of doublets.
We keep the same Configuration Parameter file, but we add the Filtering_GE step, so the parameters of the sample.name.ge and input.rda.ge will be determined automatically. We only have to add the special parameters of the Filtering_GE step (and add "Filtering_GE" into Steps).

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Params.yaml
Steps: ["Alignment_countTable_GE","Alignment_countTable_ADT","Alignment_annotations_TCR_BCR","Droplets_QC_GE","Filtering_GE"]

############################################ GE (RNA) ############################################

Alignment_countTable_GE:
  ### Project
  sample.name.ge : ["sc5p_v2_hs_PBMC_1k_5gex"]
  input.dir.ge : '/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/'
  output.dir.ge : '/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/'
  sctech : '10xv2'
  kindex.ge : '/mnt/beegfs/database/bioinfo/single-cell/INDEX/KB-python_KALLISTO/0.24.4_0.46.2/homo_sapiens/GRCh38/Ensembl/r99/cDNA_LINCs_MIRs/GRCH38_r99_cDNA_linc_mir.kidx'
  tr2g.file.ge : '/mnt/beegfs/database/bioinfo/single-cell/INDEX/KB-python_KALLISTO/0.24.4_0.46.2/homo_sapiens/GRCh38/Ensembl/r99/cDNA_LINCs_MIRs/GRCH38_r99_cDNA_linc_mir_tr2gs.txt'
  reference.txt: "Ensembl reference transcriptome v99 corresponding to the homo sapiens GRCH38 build"

Droplets_QC_GE:
  author.name : "marine aglave"
  author.mail : "marine.aglave@gustaveroussy.fr, bigr@gustaveroussy.fr"

Filtering_GE:
  pcmito.max : 0.15

############################################ ADT ############################################

Alignment_countTable_ADT:
  sample.name.adt : ["sc5p_v2_hs_PBMC_1k_5fb"]
  input.dir.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/"
  output.dir.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/"
  kindex.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/INDEX_ADT/ADT.kidx"
  tr2g.file.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/INDEX_ADT/ADT_tr2gs.txt"

############################################ TCR/BCR ############################################

Alignment_annotations_TCR_BCR:
  sample.name.tcr : ["sc5p_v2_hs_PBMC_1k_t"]
  input.dir.tcr : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/"
  sample.name.bcr : ["sc5p_v2_hs_PBMC_1k_b"]
  input.dir.bcr : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/"
  output.dir.tcr_bcr : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/"

2- Launch of the analysis:

No change in /mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/launcher.sh script.

sbatch /mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/launcher.sh

2- Interpretation of results:

Filtering:

Histograms:

Provided by the Filtering_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSKEPT/sc5p_v2_hs_PBMC_1k_5gex_GE_QChist.png

It is the same histogram as before, but after filtering by the number of genes, the number of transcripts, the percentage of mitochondrial transcripts and the percentage of ribosomal transcripts, and before removing the doublets.
Here there are 907 cells which have less than 15% of mitochondrial RNA.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/sc5p_v2_hs_PBMC_1k_5gex_GE_QChist.png

It is the same histogram as before but after filtering by the number of genes, the number of transcripts, the percentage of mitochondrial transcripts and the percentage of ribosomal transcripts, and after removing the doublets.
Here there are 873 cells which have less than 15% of mitochondrial RNA.

Cell Cycle and Doublet identification:

Provided by the Filtering_GE step. Cell cycle is compute at this step, followed by the doublets indentification. To be sure that the identified doublets do not only correspond to cells in the G2M phase of the cell cycle, a full data analysis is performed with default settings.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSKEPT/LogNormalize/pca/pca_15_0.8/sc5p_v2_hs_PBMC_1k_5gex_GE_RNA_pca_uMAP_dim15_res0.8.png

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSKEPT/LogNormalize/pca/pca_15_0.8/sc5p_v2_hs_PBMC_1k_5gex_GE_RNA_pca_uMAP3d_dim15_res0.8.png

These graphs correspond to the 2D and 3D umaps of the data, and they are not really important at this step of the analysis because it is made with default parameters.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSKEPT/LogNormalize/pca/pca_15_0.8/technical/sc5p_v2_hs_PBMC_1k_5gex_GE_technical_MULTI_ALL_uMAPs.png

This group of graphs corresponds to the plotting of biases on the umap. We can see doublets (compute by the 2 available methods), the attribution of the cell cycle (compute by seurat and cyclone) and the other biases.
Here, we can see that cells in G2M phase are not just doublets and vice versa. So no problem, we can filtering doublets. If there is a correspondance maybe you only need to use one doublets detection method, or not filter them.
The other graphs in this folder (technical) are not really usefull, they correspond to the grouped graphs.

Statistics:

Provided by the Filtering_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/sc5p_v2_hs_PBMC_1k_5gex_GE_stat.txt
Kallisto_Bustools_alignment.Total_reads 79556908
Kallisto_Bustools_alignment.Pseudo_aligned_reads 51554724
Kallisto_Bustools_alignment.Pseudo_aligned_reads_percent 64,8
Kallisto_Bustools_alignment.Pseudo_aligned_reads_to_unique_target_sequence 11755362
Kallisto_Bustools_alignment.Pseudo_aligned_reads_to_unique_target_sequence_percent 14,8
Droplet_Quality.captured_droplet 126489
Droplet_Quality.total_number_UMI 6396083
Droplet_Quality.estimated_cells 997
Droplet_Quality.estimated_UMI 5919852
Droplet_Quality.fraction_read_in_cells 0,925543336
Cells_Quality.mito_summary.Median 0,060621498
Cells_Quality.ribo_summary.Median 0,251040222
Cells_Quality.stress_summary.Median 0,024519549
Cells_Quality.filter_cells_genes 984
Cells_Quality.filter_cells_counts 967
Cells_Quality.genes_per_cell_summary.Median 1952
Cells_Quality.UMI_per_cell_summary.Median 5813
After_QC_cells_filtering.estim_cells_G1 521
After_QC_cells_filtering.estim_cells_G2M 71
After_QC_cells_filtering.estim_cells_S 315
After_QC_cells_filtering.Genes_covered 17374
After_QC_cells_feature_filtering.estim_doublets 34
Final_Cells_Quality.mito_summary.Median 0,059024654
Final_Cells_Quality.ribo_summary.Median 0,268956044
Final_Cells_Quality.stress_summary.Median 0,025089606
Final_Cells_Quality.genes_per_cell_summary.Median 1957
Final_Cells_Quality.UMI_per_cell_summary.Median 5887
Final_Cells_Quality.nb_genes 17374
Final_Cells_Quality.nb_cells 873

This file summarizes some statistics throughout the filters performed (the filters are accumulated progressively in the table).

Alignment
Kallisto_Bustools_alignment.Total_reads number of total detected reads
Kallisto_Bustools_alignment.Pseudo_aligned_reads number of total pseudo-aligned reads
Kallisto_Bustools_alignment.Pseudo_aligned_reads_percent percentage of total pseudo-aligned reads
Kallisto_Bustools_alignment.Pseudo_aligned_reads_to_unique_target_sequence number of total pseudo-aligned reads to a unique location on the reference
Kallisto_Bustools_alignment.Pseudo_aligned_reads_to_unique_target_sequence_percent percentage of total pseudo-aligned reads to a unique location on the reference
Raw data
Droplet_Quality.captured_droplet number of total detected droplets
Droplet_Quality.total_number_UMI number of total UMI (after UMI deduplication)
After removing empty droplets
Droplet_Quality.estimated_cells number of cells after removing empty droplets
Droplet_Quality.estimated_UMI number of UMI after removing empty droplets
Droplet_Quality.fraction_read_in_cells the number of UMI after removing empty droplets divided by the number of UMI before removing empty droplets
Cells_Quality.mito_summary.Median median of mitochondrial RNA after removing empty droplets, before other filters
Cells_Quality.ribo_summary.Median median of ribosomal RNA after removing empty droplets, before other filters
Cells_Quality.stress_summary.Median median of mecanis stress RNA after removing empty droplets, before other filters
Cells_Quality.filter_cells_genes number of cells after removing empty droplets, if filtering on genes
Cells_Quality.filter_cells_counts number of cells after removing empty droplets, if filtering on counts
Cells_Quality.genes_per_cell_summary.Median median of the number of gene per cell after removing empty droplets, before other filters
Cells_Quality.UMI_per_cell_summary.Median median of the number of UMI per cell after removing empty droplets, before other filters
After removing bad quality cells
After_QC_cells_filtering.estim_cells_G1 number of cells in G1 phase, after filtering
After_QC_cells_filtering.estim_cells_G2M number of cells in G2M phase, after filtering
After_QC_cells_filtering.estim_cells_S number of cells in S phase, after filtering
After_QC_cells_filtering.Genes_covered number of genes, after filtering
After removing rare genes
After_QC_cells_feature_filtering.estim_doublets number of doublets, after filtering genes
After all filtering
Final_Cells_Quality.mito_summary.Median median of mitochondrial RNA after all filtering
Final_Cells_Quality.ribo_summary.Median median of ribosomal RNA after all filtering
Final_Cells_Quality.stress_summary.Median median of mecanic stress RNA after all filtering
Final_Cells_Quality.genes_per_cell_summary.Median median of the number of gene per cell after all fitering
Final_Cells_Quality.UMI_per_cell_summary.Median median of the number of UMI per cell after all fitering
Final_Cells_Quality.nb_genes total number of genes
Final_Cells_Quality.nb_cells total number of cells

At the end of all filters, we have 873 cells and 17 374 genes.

3- Make the Configuration Parameter file:

We will do the normalization and the dimension reduction, and evaluate the biais and the clustering.
We keep the same Configuration Parameter file, but we add the Norm_DimRed_Eval_GE step, so the parameters of the sample.name.ge and input.rda.ge will be determined automatically. We only have to add the special parameters of the Norm_DimRed_Eval_GE step (and add "Norm_DimRed_Eval_GE" into Steps).
To simplify the explanation, we will focus on an analysis by SCTransform and pca, but for an optimal analysis, we should test the normalization by LogNormalize followed by a dimension reduction by scbfa, then compare the methods at the umap level (number, shape and coherence of the clusters).

Note:
SCTransform and pca correspond to default parameters of Norm_DimRed_Eval_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Params.yaml
Steps: ["Alignment_countTable_GE","Alignment_countTable_ADT","Alignment_annotations_TCR_BCR","Droplets_QC_GE","Filtering_GE","Norm_DimRed_Eval_GE"]

############################################ GE (RNA) ############################################

Alignment_countTable_GE:
  ### Project
  sample.name.ge : ["sc5p_v2_hs_PBMC_1k_5gex"]
  input.dir.ge : '/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/'
  output.dir.ge : '/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/'
  sctech : '10xv2'
  kindex.ge : '/mnt/beegfs/database/bioinfo/single-cell/INDEX/KB-python_KALLISTO/0.24.4_0.46.2/homo_sapiens/GRCh38/Ensembl/r99/cDNA_LINCs_MIRs/GRCH38_r99_cDNA_linc_mir.kidx'
  tr2g.file.ge : '/mnt/beegfs/database/bioinfo/single-cell/INDEX/KB-python_KALLISTO/0.24.4_0.46.2/homo_sapiens/GRCh38/Ensembl/r99/cDNA_LINCs_MIRs/GRCH38_r99_cDNA_linc_mir_tr2gs.txt'
  reference.txt: "Ensembl reference transcriptome v99 corresponding to the homo sapiens GRCH38 build"

Droplets_QC_GE:
  author.name : "marine aglave"
  author.mail : "marine.aglave@gustaveroussy.fr, bigr@gustaveroussy.fr"

Filtering_GE:
  pcmito.max : 0.15

Norm_DimRed_Eval_GE:
  eval.markers : "GAPDH"

############################################ ADT ############################################

Alignment_countTable_ADT:
  sample.name.adt : ["sc5p_v2_hs_PBMC_1k_5fb"]
  input.dir.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/"
  output.dir.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/"
  kindex.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/INDEX_ADT/ADT.kidx"
  tr2g.file.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/INDEX_ADT/ADT_tr2gs.txt"

############################################ TCR/BCR ############################################

Alignment_annotations_TCR_BCR:
  sample.name.tcr : ["sc5p_v2_hs_PBMC_1k_t"]
  input.dir.tcr : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/"
  sample.name.bcr : ["sc5p_v2_hs_PBMC_1k_b"]
  input.dir.bcr : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/"
  output.dir.tcr_bcr : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/"

3- Launch of the analysis:

No change in /mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/launcher.sh script.

sbatch /mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/launcher.sh

3- Interpretation of results:

Normalization, Dimension Reduction and Clustering:

Correlation graph:

Provided by the Norm_DimRed_Eval_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca_dims.bias.cor.png

This graph represent the correlation between potential biases and each dimension, after nomalization and dimension reduction.
In general, the first dimensions are the most correlated with biases.
From experience, biases with more than 0.3 of correlation with one of the dimensions must be corrected, but of course, this is not always the case.
Note that the number of transcripts (nCount_RNA) is related to the percentages of mitochondrial, ribosomal and mechanical stress RNA. Similarly, the number of transcripts is correlated with the number of genes (nFeature_RNA). So the correction of this number of transcripts will influence the correlation of the other biases with the dimensions. So it is better to test several combinations of factors to correct.

Here, we observe a strong correlation of the number of transcripts (nCount_RNA), the percentage of ribosomal RNA and mechanical stress. Also, we have indicated GAPDH as a marker gene (here it is a negative control gene, so it should not vary with size; conversely, a positive control gene should vary).
Therefore, we will repeat this step with a correction of the number of transcripts.

Notes:

  • As we do not know if the percentage of ribosomal RNA corresponds to the cellular activity (which could be interesting to preserve), the correction of this bias depends on the project. Often, I propose 2 versions of results: with and without correction of this bias.
  • Mechanical stress genes rarely have a real influence on cell separation, even though it is strongly correlated with dimensions because the percentage of stress RNA is often low in the cells (less than 5%).
  • If the correlations are all very low (between 0 and 0.1) for all dimensions, the umap will most likely be a large central patatoid. The explanation is that the correction applied has also corrected the biological variations of interest at the same time as the bias. In this case it is better not to correct.
  • Sometimes the correction worsens the phenomenon (the correlations increase with the correction) and it is better not to correct. This is due to the interdependence of the biases explained above.
  • If you have any doubt about the correction, it is better not to correct. The impact of the biases will be visible in the final results and it will be easier to evaluate whether they should be corrected or not. If so, you can easily go back to this step and correct.

4- Make the Configuration Parameter file:

We will do the normalization and the dimension reduction, and evaluate the biais and the clustering.
We keep the same Configuration Parameter file, but we add the correction of the number of transcripts in the Norm_DimRed_Eval_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Params.yaml
Steps: ["Alignment_countTable_GE","Alignment_countTable_ADT","Alignment_annotations_TCR_BCR","Droplets_QC_GE","Filtering_GE","Norm_DimRed_Eval_GE"]

############################################ GE (RNA) ############################################

Alignment_countTable_GE:
  ### Project
  sample.name.ge : ["sc5p_v2_hs_PBMC_1k_5gex"]
  input.dir.ge : '/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/'
  output.dir.ge : '/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/'
  sctech : '10xv2'
  kindex.ge : '/mnt/beegfs/database/bioinfo/single-cell/INDEX/KB-python_KALLISTO/0.24.4_0.46.2/homo_sapiens/GRCh38/Ensembl/r99/cDNA_LINCs_MIRs/GRCH38_r99_cDNA_linc_mir.kidx'
  tr2g.file.ge : '/mnt/beegfs/database/bioinfo/single-cell/INDEX/KB-python_KALLISTO/0.24.4_0.46.2/homo_sapiens/GRCh38/Ensembl/r99/cDNA_LINCs_MIRs/GRCH38_r99_cDNA_linc_mir_tr2gs.txt'
  reference.txt: "Ensembl reference transcriptome v99 corresponding to the homo sapiens GRCH38 build"

Droplets_QC_GE:
  author.name : "marine aglave"
  author.mail : "marine.aglave@gustaveroussy.fr, bigr@gustaveroussy.fr"

Filtering_GE:
  pcmito.max : 0.15

Norm_DimRed_Eval_GE:
 eval.markers : "GAPDH"
 vtr.biases : "nCount_RNA"

############################################ ADT ############################################

Alignment_countTable_ADT:
  sample.name.adt : ["sc5p_v2_hs_PBMC_1k_5fb"]
  input.dir.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/"
  output.dir.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/"
  kindex.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/INDEX_ADT/ADT.kidx"
  tr2g.file.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/INDEX_ADT/ADT_tr2gs.txt"

############################################ TCR/BCR ############################################

Alignment_annotations_TCR_BCR:
  sample.name.tcr : ["sc5p_v2_hs_PBMC_1k_t"]
  input.dir.tcr : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/"
  sample.name.bcr : ["sc5p_v2_hs_PBMC_1k_b"]
  input.dir.bcr : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/"
  output.dir.tcr_bcr : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/"

4- Launch of the analysis:

No change in /mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/launcher.sh script.

sbatch /mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/launcher.sh

4- Interpretation of results:

Normalization, Dimension Reduction and Biases:

Correlation graph:

Provided by the Norm_DimRed_Eval_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform_nCount_RNA/pca/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca_dims.bias.cor.png

Here the correction is not very convincing, the regressed bias seems to be even more correlated than initially (not more strongly (the maximum having moved from dimension 5 to dimension 4), but on much dimensions). No change for mitochondrial or stress RNA, nor for GAPDH or cell cycle genes (cyclone).
Only the correlation with ribosomal RNA shows a decrease at dimension 5, but remains stable for the other dimensions.
In the end, this regression doesn't bring any improvement, so it is better to continue with the uncorrected version (If you have any doubt about the correction, it is better not to correct.).

Dimensions and Clustering:

UMAPs graph:

Provided by the Norm_DimRed_Eval_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/uMAPs/
sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca3_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca7_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca9_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca11_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca13_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca15_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca17_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca19_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca21_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca23_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca25_ALLres.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/uMAPs/sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca25_ALLres.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca25_ALLres"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca27_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca29_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca31_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca33_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca35_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca37_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca39_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca41_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca43_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca45_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca47_ALLres.png

sc5p_v2_hs_PBMC_1k_5gex_GE_uMAPs_SCT_pca49_ALLres.png

To identify the number of dimensions to keep for clustering as well as the adequate resolution, the pipeline has drawn all possible umaps according to these 2 parameters. By default, it will test between 5 and 49 dimensions with a step of 2, and all resolutions between 0.1 and 1.2 with a step of 0.1.
We have to look at all the umaps and choose the one that seems to be the most "beautiful": cluster well isolated from each other, cells well grouped within its cluster, in accordance with the expected number of clusters (hence the importance of having an estimate of the number of cell types present in the sample).
It is important to know that choosing a low number of dimensions will not capture all the interesting biological information. And, on the contrary, choosing a high number of dimensions will capture all the interesting biological information but also noise. The threshold between the dimensions that bring information and those that bring noise is difficult to detect. However, at the beginning, the more dimensions we add, the more relevant biological information we add, and the more "beautiful" the umaps will be; until a moment when the more dimensions we continue to add, the more noise we will add and the more the umaps will degrade. So you have to take an umap just before the "degradation".

Note:
To choose the right number of dimensions, I look at the umap with a resolution of 1.2 (high resolution), then I choose 2 or 3 numbers of dimensions to keep (depending on the degradation of the umap), then I look at their lower resolution. The evolution of the number of clusters must increase (with the increase of the resolution) and remain coherent for the same number of dimensions (no cells which constantly change cluster, or clusters which appear then disappear). This verification can also be done by the clustree plot.

Clustree plots:

The clutree plot is a tree plot to observe the influence of a parameter on the results.
Here we measure the evolution of the membership of cells to a cluster:

  • the resolution is fixed and the number of dimensions to keep evolves:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/louvain_resolution *Provided by the Norm_DimRed_Eval_GE step.*
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.1.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/louvain_resolution/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.1.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.1"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.2.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/louvain_resolution/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.2.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.2"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.3.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/louvain_resolution/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.3.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.3"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.4.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/louvain_resolution/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.4.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.4"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.5.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/louvain_resolution/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.5.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.5"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.6.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/louvain_resolution/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.6.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.6"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.7.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/louvain_resolution/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.7.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.7"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.8.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/louvain_resolution/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.8.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.8"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.9.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/louvain_resolution/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.9.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res0.9"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res1.0.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/louvain_resolution/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res1.0.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res1.0"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res1.1.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/louvain_resolution/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res1.1.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res1.1"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res1.2.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/louvain_resolution/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res1.2.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_res1.2"></p>
  • the number of dimensions to keep is fixed and the resolution evolves:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/ *Provided by the Norm_DimRed_Eval_GE step.*
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca3.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca3.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca3"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca5.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca5.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca5"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca7.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca7.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca7"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca9.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca9.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca9"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca11.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca11.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca11"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca13.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca13.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca13"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca15.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca15.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca15"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca17.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca17.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca17"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca19.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca19.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca19"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca21.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca21.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca21"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca23.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca23.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca23"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca25.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca25.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca25"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca27.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca27.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca27"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca29.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca29.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca29"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca31.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca31.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca31"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca33.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca33.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca33"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca35.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca35.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca35"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca37.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca37.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca37"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca39.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca39.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca39"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca41.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca41.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca41"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca43.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca43.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca43"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca45.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca45.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca45"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca47.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca47.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca47"></p>
sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca49.png
<p align="center"><img src="https://github.com/gustaveroussy/single-cell/blob/master/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/clustree_SCT_pca/dimensions/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca49.png" title="sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca49"></p>

The goal is that the membership of the cells in a cluster remains relatively stable.

Here, the umap are very similar between 19 and 35 dimensions, except for clusters 6 and 7 which sometimes disassociate from each other. From 37 dimensions, the umap become blurred with less different cell types. Some noise is certainly captured. We can choose 35 dimensions, just before degradation. Keeping only 15 dimensions could also be interesting because it seems that clusters 1,2,6 and 10 of the first 15 dimensions (resolution 1.2) correspond to clusters 0 and 2 of the first 35 dimensions (resolution 1.2). So to go from the results of the first 15 dimensions to the first 35 dimensions, either information has been added or noise has been added. Since the version with 35 dimensions shows a real stability, I assume the first one (information).

Notes:

  • The best practice is to choose at least 2 resolutions:
    • one lower to see main cell types,
    • one higher to see cell sub-types.
  • If in doubt, it is better to use a resolution that is too high, because clusters that should not have been separated can always be joined (the reverse is not possible, you would have to re-cluster).

Here, we can choose a 0.3 and 1.2 resolutions.

What interests me between these 2 resolutions:

  • a part of cluster 6 is attached to cluster 3 in resolution 1.2, while it is part of cluster 3 in resolution 0.3. This is a re-assignment of cell membership (reproducible over several choices of the resolution parameter), and I don't know which version is the right one.
  • clusters 1 and 4 in resolution 1.2 are assembled in a single cluster 0 in resolution 0.3. However, given the proximity and the shape of these clusters, I think that they are the same cell type.

With 35 dimensions, we can see a very small group with 2-3 cells at the bottom of the graph. Maybe it might be interesting to test higher resolutions to isolate those few cells in a cluster.

For the rest of the explanation, I will only show the results with 35 dimension and 1.2 in resolution, but the interpretation is transposable on the version of the results with a resolution of 0.3.

5- Make the Configuration Parameter file:

We will do the clustering, find the marker genes, make the annotation, add ADT, add TCR, add BCR and convert main results into a cerebro object.
We keep the same Configuration Parameter file, but we add Clust_Markers_Annot_GE, Adding_ADT, Adding_TCR, Adding_BCR and Cerebro steps, with specific parameters for Clust_Markers_Annot_GE and Adding_ADT steps.
Reminder: we start again from the uncorrected version.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Params.yaml
Steps: ["Alignment_countTable_GE","Alignment_countTable_ADT","Alignment_annotations_TCR_BCR","Droplets_QC_GE","Filtering_GE","Norm_DimRed_Eval_GE","Clust_Markers_Annot_GE","Adding_ADT","Adding_TCR","Adding_BCR","Cerebro"]

############################################ GE (RNA) ############################################

Alignment_countTable_GE:
### Project
sample.name.ge : ["sc5p_v2_hs_PBMC_1k_5gex"]
input.dir.ge : '/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/'
output.dir.ge : '/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/'
sctech : '10xv2'
kindex.ge : '/mnt/beegfs/database/bioinfo/single-cell/INDEX/KB-python_KALLISTO/0.24.4_0.46.2/homo_sapiens/GRCh38/Ensembl/r99/cDNA_LINCs_MIRs/GRCH38_r99_cDNA_linc_mir.kidx'
tr2g.file.ge : '/mnt/beegfs/database/bioinfo/single-cell/INDEX/KB-python_KALLISTO/0.24.4_0.46.2/homo_sapiens/GRCh38/Ensembl/r99/cDNA_LINCs_MIRs/GRCH38_r99_cDNA_linc_mir_tr2gs.txt'
reference.txt: "Ensembl reference transcriptome v99 corresponding to the homo sapiens GRCH38 build"

Droplets_QC_GE:
author.name : "marine aglave"
author.mail : "marine.aglave@gustaveroussy.fr, bigr@gustaveroussy.fr"

Filtering_GE:
pcmito.max : 0.15

Norm_DimRed_Eval_GE:
 eval.markers : "GAPDH"

Clust_Markers_Annot_GE:
 markfile : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/markfile.xlsx"
 keep.dims : 35
 keep.res : 1.2

############################################ ADT ############################################

Alignment_countTable_ADT:
 sample.name.adt : ["sc5p_v2_hs_PBMC_1k_5fb"]
 input.dir.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/"
 output.dir.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/"
 kindex.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/INDEX_ADT/ADT.kidx"
 tr2g.file.adt : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/INDEX_ADT/ADT_tr2gs.txt"

Adding_ADT:
 gene.names: "CD3G,CD19,PTPRC,CD4,CD8A,CD14,FCGR3A,NCAM1,IL2RA,PTPRC,PDCD1,TIGIT,IGHG1,IGHG2,IGHG2,IL7R,FUT4,CCR7,HLA-DRA"

############################################ TCR/BCR ############################################

Alignment_annotations_TCR_BCR:
sample.name.tcr : ["sc5p_v2_hs_PBMC_1k_t"]
input.dir.tcr : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/"
sample.name.bcr : ["sc5p_v2_hs_PBMC_1k_b"]
input.dir.bcr : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/fastq_10X/"
output.dir.tcr_bcr : "/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/"

5- Launch of the analysis:

No change in /mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/launcher.sh script.

sbatch /mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/launcher.sh

5- Interpretation of results:

Clustering:

Final umap with clusters:

Provided by the Clust_Markers_Annot_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca_uMAP_dim35_res1.2.png

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca_uMAP3d_dim35_res1.2.png

These graphs correspond to the 2D and 3D umaps of the data with the assignment of cells to their cluster. It corresponds to the umap chosen in the previous step.

Final umap with biases:

Provided by the Clust_Markers_Annot_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/technical/sc5p_v2_hs_PBMC_1k_5gex_GE_technical_MULTI_ALL_uMAPs.png

This group of graphs corresponds to the plotting of biases on the umap. The goal of these graphs is to check the correction or not of biases. The cells should not be separated according to biases, but according to the biological processes of interest of the cells.

The very small group with 2-3 cells at the bottom of the graph has a high mitochondrial RNA level, a very low ribosomal RNA level, and a very high stress RNA level. So these are cells belonging to cluster 0, but in such a level of stress that they are separated from other cells of the same cell type.

It can be observed that there is no separation of cells according to the percentage of mitochondrial RNA, as well as for the percentage of stress RNA (except in the small group of cells at the bottom). There is a bias in the ribosomal RNAs (we had indeed observed it in the plot of the correlation between the biases and the dimensions), but it does not seem to be linked to clustering (there are cells with a high rate and the other with a low rate which coexists in the same cluster). We make the same observation for the number of transcripts and for the number of genes.
So the corrections were superfluous.
We can see no doublets and the attribution of the cell cycle is not specific to the clusters.

Notes:

  • If there was a correlation between the clusters and a bias, we should go back to the normalization and dimension reduction step to correct the bias (unless the correction worsens the phenomenon, in which case there is no solution, the results must be interpreted with full knowledge of the facts).
  • The other graphs in this folder (technical) are not really usefull, they correspond to the grouped graphs.

Marker genes of clusters:

Upsetplot:

Provided by the Clust_Markers_Annot_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/found_markers/sc5p_v2_hs_PBMC_1k_5gex_GE_findmarkers_upset_all.png

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/found_markers/sc5p_v2_hs_PBMC_1k_5gex_GE_findmarkers_upset_top10.png

An upsetplot is a kind of venn diagram where you can display as many groups as you need.
Here you can see the number of marker genes specific to a cluster and the number of marker genes shared between several clusters. The first graph matches all the results and the second graphs matches the 10 marker genes with the highest logFC for each cluster. So if 2 clusters share all their marker genes, it is a single cell type so the resolution chosen for clutering was too high. For example, cluster 6 has only 2 specific marker genes, but it shares 5 marker genes with cluster 7 and cluster 8, it shares 4 with 8 only and 1 with 7 only. Of the 21 marker genes identified for cluster 6, only 2 are specific for cluster 6.

Table file:

Provided by the Clust_Markers_Annot_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/found_markers/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_pca.35_res.1.2_findmarkers_all.txt

Ten first lines of file: genes = fmark$gene, logFC = fmark[avg_name], p_val = fmark$p_val, adj.P.Val = fmark$p_val_adj, pct.1 = fmark$pct.1, pct.2 = fmark$pct.2, tested_cluster = fmark$cluster, control_cluster = "All", min.pct = min.pct)

genes logFC p_val adj.P.Val pct.1 pct.2 tested_cluster control_cluster min.pct
RPL30 0.916508905160442 8.12479701874232e-74 1.37959053378245e-69 1 1 0 All 0.75
RPS14 0.872106380074094 1.54261075374736e-71 2.61935305986301e-67 1 0.998 0 All 0.75
RPS15A 0.919787852083529 3.12110760111916e-70 5.29964070670034e-66 1 0.998 0 All 0.75
RPL32 0.897483269401748 8.26611796979938e-70 1.40358683127194e-65 1 1 0 All 0.75
RPS27 1.01449364815204 1.35926612570412e-69 2.3080338814456e-65 1 0.997 0 All 0.75
RPS3A 0.915744210277118 3.95309967442229e-69 6.71236324716905e-65 1 0.998 0 All 0.75
LEF1 1.29790808847987 9.39445446762252e-69 1.5951783686023e-64 0.929 0.274 0 All 0.75
TCF7 1.41638327606605 2.64537156504439e-68 4.49184091744538e-64 0.967 0.427 0 All 0.75
RPL35A 0.779272397622583 3.38257222845832e-66 5.74360764392222e-62 1 1 0 All 0.75

This table lists all the marker genes for each cluster (comparison one cluster against all the others):

  • adj.P.Val > 5%,
  • log2FC > 0.5 (positif log2FC only),
  • pct.1 or pct.2 > 0.75 (pct.x: percentage of cells in the group x expressing the tested gene (example: 0.75 corresponds to 75% of cells); min.pct: threshold used for pct.1 and pct.2).

Heatmap:

Provided by the Clust_Markers_Annot_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/found_markers/sc5p_v2_hs_PBMC_1k_5gex_GE_findmarkers_top10_heatmap.png

The heatmap represents the expression of the 10 best marker genes in logFC for each cluster for all cells group by cluster (also identified in the upsetplot). The expressions were normalized between the genes (scaled with a mean of 0 and an standard deviation of 1) in order to allow a visual comparison between these genes. The width of the top bar representing the cluster corresponds to the number of cells in the cluster, each column representing a cell and each row representing a gene.
This allows to verify that a marker gene is quite specific for a cluster and is not shared by other clusters. It coroborates the results of the upsetplot.
Here we can confirm that clusters 6 and 7 are very close at the expression level, as are clusters 2 and 3. For example, close clusters can be the same cell type in 2 different activity states.

Violinplot:

Provided by the Clust_Markers_Annot_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/found_markers/
sc5p_v2_hs_PBMC_1k_5gex_GE_findmarkers_top10_cluster0_vln.png

sc5p_v2_hs_PBMC_1k_5gex_GE_findmarkers_top10_cluster1_vln.png

sc5p_v2_hs_PBMC_1k_5gex_GE_findmarkers_top10_cluster2_vln.png

sc5p_v2_hs_PBMC_1k_5gex_GE_findmarkers_top10_cluster3_vln.png

sc5p_v2_hs_PBMC_1k_5gex_GE_findmarkers_top10_cluster4_vln.png

sc5p_v2_hs_PBMC_1k_5gex_GE_findmarkers_top10_cluster5_vln.png

sc5p_v2_hs_PBMC_1k_5gex_GE_findmarkers_top10_cluster6_vln.png

sc5p_v2_hs_PBMC_1k_5gex_GE_findmarkers_top10_cluster7_vln.png

sc5p_v2_hs_PBMC_1k_5gex_GE_findmarkers_top10_cluster8_vln.png

The violinplot represents the expression of the 10 best marker genes in logFC for each cluster (also identified in the upsetplot) by cell for all clusters. The expression of the marker gene of each cell is plotted by cluster. The greater the number of cells having the same level of expression, the larger the shape is at the level of this expression.
This allows to verify that a marker gene is quite specific for a cluster and is not shared by other clusters. It coroborates the results of the upsetplot.

Here, we observe that the CCL5 and NKG7 genes, markers of cluster 6 are also strongly expressed in clusters 7 and 8. The B2M genes, marker of cluster 6, is also expressed by all the other clusters.

Thus we can cross the results between the upsetplot, the heatmap and the violoplot in order to confirm the top10 marker genes.

Markers plot from the Markfile:

Provided by the Clust_Markers_Annot_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/markers/sc5p_v2_hs_PBMC_1k_5gex_GE_markers_ALL_uMAPs.png

This is a representation of all genes from the Markfile. It can help to annotate cell types.

The very small group with 2-3 cells at the bottom of the graph seems to express very strongly and specifically the PPBP gene, a marker for Platelet cells. So maybe these are Platelet cells that suffered more from the 10X passage.

Notes:

  • The other graphs in this folder (markers) correspond to the grouped graphs by signature or not grouped graphs.
  • The tracing of gene expression can also be done via the cerebro application.

Annotations:

Provided by the Clust_Markers_Annot_GE step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/cells_annotation/clustifyr/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_uMAP_CFR_hema_microarray_matrix_clust.png

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/cells_annotation/clustifyr/sc5p_v2_hs_PBMC_1k_5gex_GE_SCT_uMAP_CFR_hema_microarray_matrix_cells.png

The automatic annotation is done with 2 tools (clusterifyR and singleR) with the references included in these tools. There are 2 types of annotation:

  • the one made on the whole cell cluster ("_clust" suffix)
  • the one made on each cell ("_cell" suffix) Each tool assigns a score to their annotation and scores below the threshold (cfr.minscore and sr.minscore parameters) are considered uncertain and annotated as "NA". Warning: the references provided with the tools are not necessarily very relevant (from microarray or bulk RNA-seq) but they are the best available at the moment. So, the results should be interpreted with caution.

Note:
It is better to group the information to annotate the cells: the automatic annotation, the marker genes and the Markfile genes.

Here, I present you 2 examples (one realized on the clusters, the other realized on each cell).

By cross-checking the results with the Markfile we can conclude that:

  • cluster 0 corresponds to CD4 T Lymphocytes (Naive?),
  • cluster 1 corresponds to Monocytes,
  • cluster 2 corresponds to CD4 T Lymphocytes (Naive?),
  • cluster 3 corresponds to CD8 T Lymphocytes (Naive?),
  • cluster 4 corresponds to Monocytes,
  • cluster 5 corresponds to B Lymphocytes,
  • cluster 6 corresponds to T Lymphocytes (CD4? CD8? Mature?),
  • cluster 7 corresponds to T Lymphocytes (CD4? CD8? Mature?),
  • cluster 8 corresponds to NK cells,
  • the very small group with 2-3 cells at the bottom of the graph may be Platelet cells,
  • the small group with 10 cells at the top of the cluster 4 may be Dendritic Cells.

ADT:

Comparison gene expression and protein level plot:

Provided by the Adding_ADT step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/ADT_results/ADT_dimplot.png

This plot shows the normalized expression of the genes (left) with the normalized expression of the corresponding proteins (right).
Often protein expression is very strong with background noise due to non-specific hybridization of the antibodies. To solve this problem we can modify the cutoff of the legend by quantiles.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/ADT_results/ADT_dimplot_legend_cutoff.png

This plot shows exactly the same things, but with a legend cutoff (default parameters).

For example we can see an improvement of the graph for the CD19 protein which shows less non-specific expression with the legend cutoff, although the default parameter is not sufficient to remove all the background noise. It should probably be increased the ADT.min.cutoff parameter.

TCR:

Provided by the Adding_TCR step.

Sometimes there are more than on TCR by cell: https://www.jimmunol.org/content/202/3/637

More information about the annotation of contigs by CellRanger: https://support.10xgenomics.com/single-cell-vdj/software/pipelines/3.1/algorithms/annotation

There are several ways to define a clonotype:

  • gene: use the genes comprising the TCR.
  • nt: use the nucleotide sequence of the CDR3 region.
  • aa: use the amino acid sequence of the CDR3 region.
  • gene+nt: use the genes comprising the TCR + the nucleotide sequence of the CDR3 region for T cells. This is the proper definition of clonotype.

Global analysis:

Productivity of receptors:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Global_analysis/QC_quantif.png

This group of graphs represents the number of productive receptors (functional receptors) for the sample.

Here there are more productive than non-productive receptors (both graphs) and most cells have only one receptor (right graph).

This result is consistent with the usual results.

Unique Contigs Quantification:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Global_analysis/quantUniqueContig.png

This group of graphs represents the number of different (unique) clonotypes in the sample.

Here, we can observe that almost all contigs are present in a single copy (they are unique), regardless of the definition of clonotypes chosen. This is a special case, because often there are not only unique clonotypes at the level of the TCRs, even if they are mostly unique.
For clonotypes defined only by their gene, we observe 444 unique contigs out of a total of 451 contigs.

Clonotypes Abundance:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Global_analysis/abundanceContig.png

This group of line graphs represents the number of clonotypes depending on the number of cells where the contig is present. The points of the graph are connected.

Here we have mostly single clonotypes, present in a single cell, which is represented by a dot at over 400 clonotypes with an abundance of 1. We also have some clonotypes present in 2 cells, which is represented by a point to near 0 clonotype with an abundance of 2. These 2 points are connected by a straight line.
In a classical case, these graphs present rather a decreasing exponential, because there are colonotypes present in more than 2 cells, so the axis of "Abundance" can label "10" or more.

For the clonotypes defined only by their gene, if we group these results with those of the quantification of the unique contigs, there are 7 contigs (451-44) present in 2 cells (2 cells maximum according to the abundance).

Another simple example:

Clonal Space Homeostasis:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Global_analysis/clhomeo.png

By examining the clonal space, we are effectively looking at the relative space occupied by clones at specific proportions. It's like proportions of proportions.

Here all clonotypes are present in a single copy (unique) so they are present in the same proportion compared to the repertoire.
1 copy / 450 total clonotypes = 0.0022 so Medium.
2 copy / 450 total clonotypes = 0.0044 so Medium.
There are no clonotypes more represented (Hyperexpanded) and others under-represented (Rare).

Another simple example:

From the clonal population, we can count the abundance (the number of cell with this clonotype), compute the proportion of each clonotype and assign them to a category of clonotype (called Clonotype Group) as "Rare", "Small", "Medium", "Large" and "Hyperexpanded".
Then, we count the number of each category and compute their proportions. This categories' proportions are drawn on the graph.

Clonal Proportion:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Global_analysis/clprop.png

Like clonal space homeostasis above, clonal proportion acts to place clones into separate bins. The key difference is instead of looking at the relative proportion of the clone to the total, the clonalProportion() function will rank the clones by total number and place them into bins. Example: [1:10] are the top 10 clonotypes in each sample.

Here we find that the [1:10] (10 most represented clonotype) contains 11 cells: 9 with a single clonotype and 1 pair of cells with the same clonotype. For the clonotypes defined only by their gene, we observe that the [1:10] (10 most represented clonotype) contains 17 cells: 3 with a unique clonotype and 7 pairs of cells with the same clonotype. The decomposition into 9+1x2, and 3+7x2 is deduced thanks to the abundance graphs.

Another simple example:

From the clonal population, we can count the abundance (the number of cell with this clonotype), rank clonotype by abundance and group clonotypes by bins of rank (called Clonal Indices).
Then, we sum the abundance of all clonotypes by bin and compute their proportions. This bins' proportions are drawn on the graph.

Clonotypes Frequencies:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Global_analysis/Frequency_top_10_umapsc5p_v2_hs_PBMC_1k_5gex.png

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Global_analysis/Frequency_top11to20_umapsc5p_v2_hs_PBMC_1k_5gex.png

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Global_analysis/Frequency_umapsc5p_v2_hs_PBMC_1k_5gex.png

The frequency represents the number of cells that contain the clonotype based on its amino acid sequence. The up graph represent the absolute frequency, and the bottom graph represente the frequency by bin (Single: clonotypes present in one cell; Small: clonotypes present from 2 to 5 cells; etc.)

Here, we can confirm that there is one clonotype present in 2 cells and that the other clonotypes are present in only one cell. The localization of clonotypes on the umap confirms the T cell annotation of the clusters. We also observe that not all T cells have an identified TCR.

CDR3 Length:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Global_analysis/lengthContig.png

This graph represents the length distribution of the CDR3 sequences (nt or aa) with combined or separate chains.

The interpretation of this type of graph depends on the biological context.

Clonotypes Decomposition:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Global_analysis/cloneType.png

This group of graphs represents several umap with the location of each part of the TCR and the size of the TRA and TRB.

The interpretation of this type of graph depends on the biological context.

Diversity:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Global_analysis/cldiv.png

This graph represents the measures the diversity of clonotypes within the sample. It is provided by 4 metrics: Shannon, inverse Simpson, Chao1, and Abundance-based Coverage Estimator (ACE).

The interpretation of this type of graph depends on the biological context.

Physicochemical Properties:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Global_analysis/aaProperties.png

This group of graphs represents a list the physicochemical properties of the amino acids that make up the receptors.

  • Length: total amino acid count.
  • Hydrophobicity: grand average of hydrophobicity.
  • Bulk: average bulkiness.
  • Aliphatic: normalized aliphatic index.
  • Polarity: average polarity.
  • Charge: normalized net charge.
  • Basic: basic side chain residue content.
  • Acidic: acidic side chain residue content.
  • Aromatic: aromatic side chain content. More information here

The interpretation of this type of graph depends on the biological context.

Clusters analysis:

Unique Contigs Quantification:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Clusters_analysis/clust_quantContig_sc5p_v2_hs_PBMC_1k_5gex.png

This group of graphs are similar to that of the globale analysis, but by clusters, so the interpretation is similar too.

Clonotypes Abundance:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Clusters_analysis/abundanceContig.png

This group of graphs are similar to that of the globale analysis, but by clusters, so the interpretation is similar too.

Clonal Space Homeostasis:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Clusters_analysis/clust_clhomeosc5p_v2_hs_PBMC_1k_5gex.png

This group of graphs are similar to that of the globale analysis, but by clusters, so the interpretation is similar too.

Clonal Proportion:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Clusters_analysis/clust_clpropsc5p_v2_hs_PBMC_1k_5gex.png

This group of graphs are similar to that of the globale analysis, but by clusters, so the interpretation is similar too.

Clonotypes Frequencies:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Clusters_analysis/Frequency_top_10_clust0_umapsc5p_v2_hs_PBMC_1k_5gex.png

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Clusters_analysis/Frequency_top_10_clust2_umapsc5p_v2_hs_PBMC_1k_5gex.png

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Clusters_analysis/Frequency_top_10_clust3_umapsc5p_v2_hs_PBMC_1k_5gex.png

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Clusters_analysis/Frequency_top_10_clust6_umapsc5p_v2_hs_PBMC_1k_5gex.png

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Clusters_analysis/Frequency_top_10_clust7_umapsc5p_v2_hs_PBMC_1k_5gex.png

This group of graphs are similar to that of the globale analysis, but by clusters, so the interpretation is similar too.

Overlap:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Clusters_analysis/clust_clOverlap_sc5p_v2_hs_PBMC_1k_5gex.png

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Clusters_analysis/overlap_aa_sc5p_v2_hs_PBMC_1k_5gex.txt
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Clusters_analysis/overlap_gene_nt_sc5p_v2_hs_PBMC_1k_5gex.txt
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Clusters_analysis/overlap_gene_sc5p_v2_hs_PBMC_1k_5gex.txt
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Clusters_analysis/overlap_nt_sc5p_v2_hs_PBMC_1k_5gex.txt
example of overlap_gene_sc5p_v2_hs_PBMC_1k_5gex.txt For more readability the columns tested_clust_list_sequences and ref_clust_list_sequences have been replaced by "...".
tested_clust nb_unique_seq_tested_clust tested_clust_list_sequences ref_clust nb_unique_seq_ref_clust ref_clust_list_sequences nb_overlap overlap_list_sequences
0 233 2 76 0 NA
0 233 3 83 1 NA_TRBV2.TRBJ2-7.None.TRBC2
0 233 6 32 0 NA
0 233 7 27 2 TRAV8-3.TRAJ37.TRAC_TRBV20-1.TRBJ1-1.None.TRBC1;NA_TRBV6-2.TRBJ2-2.None.TRBC2
2 76 0 233 0 NA
2 76 3 83 0 NA
2 76 6 32 0 NA
2 76 7 27 0 NA
3 83 0 233 1 NA_TRBV2.TRBJ2-7.None.TRBC2
3 83 2 76 0 NA
3 83 6 32 0 NA
3 83 7 27 1 NA_TRBV4-2.TRBJ2-6.None.TRBC2
6 32 0 233 0 NA
6 32 2 76 0 NA
6 32 3 83 0 NA
6 32 7 27 0 NA
7 27 0 233 2 NA_TRBV6-2.TRBJ2-2.None.TRBC2;TRAV8-3.TRAJ37.TRAC_TRBV20-1.TRBJ1-1.None.TRBC1
7 27 2 76 0 NA
7 27 3 83 1 NA_TRBV4-2.TRBJ2-6.None.TRBC2
7 27 6 32 0 NA

The graph represents the coefficient of common clonotypes between 2 clusters (number of common clonotypes scaled to the number of unique clonotypes in the smaller cluster). The tables present the number and the sequence of common clonotypes between the clusters.

According to the graph, for the clonotypes defined only by their gene, 7.7% of clonotypes from cluster 7 are shared by the clusters 0; and there is no shared clonotypes between the clusters 0 and 6. According to the text file with the clonotypes defined only by their gene, 2 clonotypes are shared by the clusters 0 and 7; and there is no shared clonotypes between the clusters 0 and 6.

Diversity:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Clusters_analysis/clust_cldivsc5p_v2_hs_PBMC_1k_5gex.png

This group of graphs are similar to that of the globale analysis, but by clusters, so the interpretation is similar too.

Physicochemical Properties:
/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/TCR_results/Clusters_analysis/aaProperties_sc5p_v2_hs_PBMC_1k_5gex.png

This group of graphs are similar to that of the globale analysis, but by clusters, so the interpretation is similar too.

BCR:

Provided by the Adding_BCR step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/BCR_results/

The results provided for the BCRs are the same as for the TCR analysis. So I will not explain again. We observe that all the clonotypes are unique except one. In addition, the clonotypes colocalize well with the cluster of B lymphocytes identified with the annotation. Cluster analysis is not very interesting because we have only one cluster of B lymphocytes in this sample.

Cerebro:

Provided by the Cerebro step.

/mnt/beegfs/userdata/m_aglave/pipeline/examples/complete_individual_analysis_example_of_wiki/Results/sc5p_v2_hs_PBMC_1k_5gex_GE/F200_C1000_M0-0.15_R0-1_G5/DOUBLETSFILTER_all/SCTransform/pca/dims35_res1.2/sc5p_v2_hs_PBMC_1k_5gex_GE_SCTransform_pca_35_1.2_ADT_TCR_BCR.crb

Cerebro file can be loaded into CerebroApp R Shiny to exploit the main results.



Home

Resources of the Theory of single cell RNA-seq

v1.3
Pipeline details
Installation
Usage
Configuration
Results help
Complete Examples of school cases
Individual analysis :
1 sample (scRNA-seq + ADT + TCR + BCR)
Grouped/Integrated analysis :
2 samples (scRNA-seq + ADT + TCR + BCR)

Clone this wiki locally