---
title: "CAF Subpopulation Analysis"
author: "Kevin Ryan"
date: "`r Sys.time()`"
bibliography: citations.bib
output: 
  github_document:
     toc: true
     toc_depth: 3 
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

Cancer-associated fibroblasts (CAFs) are a heterogeneous cell type found in the tumour microenvironment. They have a wide array of functions, and tend to be immunosuppressive and cancer-promoting. There have been many attempts to characterise subpopulations of CAFs, with much of the transcriptomic analysis carried out in the Mechta-Grigoriou lab at Institut Curie. This group has identified four subpopulations, which can be separated based on the expression of different markers:

 -  S1: FAP^High^, CD29^Med-High^, αSMA^High^, PDPN^High^, PDGFRβ^High^
 -  S2: FAP^Neg^, CD29^Low^, αSMA^Neg-Low^, PDPN^Low^, PDGFRβ^Low^
 -  S3: FAP^Neg-Low^, CD29^Med^, αSMA^Neg-Low^, PDPN^Low^, PDGFRβ^Low-Med^
 -  S4: FAP^Low-Med^, CD29^High^, αSMA^High^, PDPN^Low^, PDGFRβ^Med^

[@Pelon2020]

FACS gating strategies can be used to isolate these subpopulations. The Mechta-Grigoriou group has done this and has generated bulk RNA-sequencing data for the S1, S3 and S4 subpopulations, as well as scRNA-sequencing data for the S1 subpopulation. This data was deposited in the European Genome-phenome Archive (EGA), and was accessed via a Data Transfer Agreement.

The following summarises the data obtained:

+---------------+---------------+-----------------------+-------------------------------+
| Subpopulation | Total samples | Studies (Samples)     |  Notes                        |
+===============+===============+=======================+===============================+
| S1            |  28           | - EGAD00001003808 (16)| -  3808 has 12xJuxta-tumor    |
|               |               | - EGAD00001005744 (5) | -  5744 5 samples from LN     |
|               |               | - EGAD00001006144 (7) | -  Sorting vs spreading       |
+---------------+---------------+-----------------------+-------------------------------+
| S2            | 0             | N/A                   | N/A                           |
|               |               |                       |                               |
+---------------+---------------+-----------------------+-------------------------------+
| S3            | 14            | - EGAD00001004810 (14)| -  4810 has 11xJuxta-tumor    |
|               |               |                       | -  Ovarian                    |
+---------------+---------------+-----------------------+-------------------------------+
| S4            | 15            | - EGAD00001003808 (10)| -  3808 has 9xJuxta-tumor     |
|               |               | - EGAD00001005744 (5) | -  5744 5 samples from LN     |
+---------------+---------------+-----------------------+-------------------------------+

For the juxta-tumour data, the tumour and juxta-tumour samples came from the same patients. However, the metadata gives no indication of these pairings. We could possibly use OptiType [@Szolek2014] to determine the HLA alleles of each sample and use them to match the tumour and juxta-tumour samples.

We also have scRNA-seq data for S1, labelled with 8 subpopulations of S1 CAFs. It may be possible to use CIBERSORT [@Newman2015] and BayesPrism [@Chu2022] to deconvolve the bulk S1 RNA-sequencing data to further confirm the presence of these subpopulations.

It is likely that sorting the cells using FACS alters their transcriptional properties compared with separating them using spreading approaches, as is seen in study `EGAD00001006144` and described in [@Kieffer2020]. This is something that we will have to keep in mind.

The data was processed using nf-core/rnaseq version `3.8.1` using the default parameters. STAR/Salmon were used for alignment/quantification.

We would expect our tumour-associated normal samples to be most like the S3 subpopulation (which usually accumulates in juxta-tumours). The S2 subpopulation has been found to accumulate more in luminal A breast cancer, whereas the S4 subpopulation tends to be present in Her2+ breast cancers. Unfortunately, data is not available for the S2 subpopulation, and 11 of the 12 cancers encountered in our samples are Luminal A.

Combining RNA-sequencing datasets from different studies can be very challenging. We can expect batch effects to be present, so it might not be possible to determine whether the differences we observe are due to biological effects or technical artifacts. In addition, a recent study suggests that DESeq2 and edgeR (the most popular differential expression tools) have high false-positive rates when used with large sample sizes [@Li2022]. However, this assertion has been challenged, and it has been suggested that the Li 2022 study did not apply appropriate batch correction and quality control ([Twitter thread](https://threadreaderapp.com/thread/1513468597288452097.html) from Mike Love and associated [code on GitHub](https://github.com/mikelove/preNivolumabOnNivolumab/blob/main/preNivolumabOnNivolumab.knit.md)). One of the datasets (`EGAD00001006144`) was produced using stranded RNA-seq, whereas the other datasets were unstranded. This can reduce the comparability of the datasets [@Zhao2020], so it may be necessary to drop this dataset from the analysis. All samples were prepared by poly(A) selection (using oligo-dT).

# Preparation

The metadata columns will be: `Sample`, `Study`, `Subpopulation` and `Tumor_JuxtaTumor`.

*Here we will be combining data from 5 studies. To begin with, we will only include the metadata available for all studies (except for our unknown CAF Subpopulation label). Breast cancer subtype is only available for certain studies and so is not included at this stage.*

The studies also contain ovarian cancer samples, EPCAM+ cells (EPCAM is an epithelial marker) and samples from lymph nodes. For the time being, I will not consider them.

```{r load packages, include = FALSE}
library(dplyr)
library(stringr)
library(biomaRt)
library(tximport)
library(DT)
library(tidyverse)
library(ggplot2)
library(cowplot)
library(PCAtools)
library(SummarizedExperiment)
library(DESeq2)
library(pheatmap)
library(RColorBrewer)
library(glmpca)
library(hexbin)
library(IHW)
library(sva)
library(ggpubr)
library(vsn)
httr::set_config(httr::config(ssl_verifypeer = FALSE))
library(tximeta)
library(BiocParallel)
library(ashr)
library(GSVA)
library(clusterProfiler)
library(org.Hs.eg.db)
library(HGNChelper)
source("scripts/functions_caf_subpopulation_analysis.R")
library(here)
```

# Read in data

Samples were processed with nf-core/rnaseq version `3.8.1`.

Salmon was used in alignment mode, so there is no Salmon index and therefore no checksum with which to import the metadata. The parameters recommended in the [tximeta vignette](https://bioconductor.org/packages/release/bioc/vignettes/tximeta/inst/doc/tximeta.html#What_if_checksum_isn%E2%80%99t_known) were therefore used to summarise transcript counts to the gene level, using a tx2gene file constructed using `generate_tx2gene_table.R`.
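
As a rough illustration of what summarising to the gene level means for counts, the sketch below sums transcript-level counts per gene using base R only. The transcript IDs, gene names and count values are invented for the example, and the sketch ignores the abundance and effective-length handling that tximeta/tximport perform.

```r
# Toy transcript-level counts (rows = transcripts, columns = samples).
# All identifiers and values here are hypothetical.
tx_counts <- matrix(
  c(10, 0, 5,
    20, 3, 7,
     1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("tx1.1", "tx2.1", "tx3.1"), c("s1", "s2", "s3"))
)

# Two-column transcript-to-gene map, laid out like a tx2gene table.
tx2gene_toy <- data.frame(
  tx   = c("tx1.1", "tx2.1", "tx3.1"),
  gene = c("GENE_A", "GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)

# Look up the gene for each transcript row, then sum counts within each gene.
gene_of_tx  <- tx2gene_toy$gene[match(rownames(tx_counts), tx2gene_toy$tx)]
gene_counts <- rowsum(tx_counts, group = gene_of_tx)
gene_counts
```

Transcripts that share a gene are collapsed into a single row, which is the shape of matrix that the downstream DESeq2 steps expect.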

```{r Prepare to read in data, include = FALSE}
# metadata file created with create_metadata.R
metadata <- read.table(here("intermediate_files/metadata/metadata_all_samples.txt"), row.names = 1, sep = "\t")
metadata_no_inhouse <- read.table(here("intermediate_files/metadata/metadata_no_inhouse.txt"), row.names = 1, sep = "\t")
metadata_no_inhouse_no6144 <- read.table(here("intermediate_files/metadata/metadata_no_inhouse_without_6144.txt"), row.names = 1, sep = "\t")
files <- file.path(metadata$directory, rownames(metadata), "quant.sf")
coldata <- data.frame(files, names=rownames(metadata), Study = metadata$Study, 
                      Subpopulation = metadata$Subpopulation, 
                      Tumor_JuxtaTumor = metadata$Tumor_JuxtaTumor,
                      stringsAsFactors=FALSE)

# tx2gene but using the hgnc symbol instead of ensembl gene id version
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", #host="https://www.ensembl.org")
                host="uswest.ensembl.org")
tx2gene <- getBM(attributes = c("ensembl_transcript_id_version", "hgnc_symbol"), mart = mart, useCache = FALSE)
```

```{r Read in data}
# salmon was used in alignment mode so there is no salmon index, therefore there is no checksum to import the metadata 
# txOut = FALSE means to summarise to gene level (i.e. don't give out transcripts, give out gene level)
se <- tximeta(coldata, skipMeta=TRUE, txOut=FALSE, tx2gene=tx2gene)
```
```{r read in data without inhouse data, include = FALSE}
files_no_inhouse <- file.path(metadata_no_inhouse$directory, rownames(metadata_no_inhouse), "quant.sf")
coldata_no_inhouse <- data.frame(files = files_no_inhouse, names=rownames(metadata_no_inhouse), Study = metadata_no_inhouse$Study, 
                      Subpopulation = metadata_no_inhouse$Subpopulation, 
                      Tumor_JuxtaTumor = metadata_no_inhouse$Tumor_JuxtaTumor,
                      Strandedness = metadata_no_inhouse$Strandedness,
                      stringsAsFactors=FALSE)
# tx2gene file for the gencode v31 file used in the analysis
se_no_inhouse <- tximeta(coldata = coldata_no_inhouse, skipMeta=TRUE, txOut=FALSE, tx2gene=tx2gene)
```

A DESeqDataSet was created. Genes with a total count below 10 across all samples were removed, and then only genes with a count of 10 or more in at least X samples were kept for further analysis, where X is 5% of the number of samples. This was done for the combined dataset both with and without the in-house data. A variance stabilising transformation (*vst*) was also carried out so that the data can be used in downstream processes requiring homoskedastic data, e.g. PCA. *vst* was used instead of the regularised logarithm (*rlog*) because *rlog* takes a long time to run when the sample size is large.  

```{r Carry out all steps up to vsd step for se_hgnc}
dds <- DESeqDataSet(se, design = ~1)
# returns a vector of whether the total count of each gene is >= 10 (True or false)
keep <- rowSums(counts(dds)) >= 10
# only keep rows (genes) for which keep is TRUE
dds <- dds[keep,]
# at least X samples with a count of 10 or more, where X is 5% of samples
X <- round(0.05*ncol(dds))
#X <- 7
keep <- rowSums(counts(dds) >= 10) >= X
dds <- dds[keep,]
vsd <- vst(dds, blind = TRUE)
```

```{r Apply common filters according to DESeq2 vignette to remove lowly expressed genes}
# do no design for the time being, Subpopulation + Batch gives error - model matrix not full rank in DESeq2
# this function stores input values, intermediate calculations and results of DE analysis - makes counts non-negative integers
dds_no_inhouse <- DESeqDataSet(se_no_inhouse, design = ~1)

# returns a vector of whether the total count of each gene is >= 10 (True or false)
keep <- rowSums(counts(dds_no_inhouse)) >= 10
# only keep rows (genes) for which keep is TRUE
dds_no_inhouse <- dds_no_inhouse[keep,]
# at least X samples with a count of 10 or more, where X is 5% of samples
X <- round(0.05*ncol(dds_no_inhouse))
#X <- 7
keep <- rowSums(counts(dds_no_inhouse) >= 10) >= X
dds_no_inhouse <- dds_no_inhouse[keep,]
```
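
The two filtering steps above can be sketched on a toy matrix to show which kinds of genes each step removes. The gene names and counts are invented for illustration, and 40 samples are used so that 5% of samples corresponds to X = 2.

```r
# Toy count matrix: 4 genes x 40 samples (all values are invented).
toy_counts <- rbind(
  never_expressed  = rep(0, 40),
  rarely_expressed = c(50, rep(0, 39)),          # >= 10 in only 1 sample
  often_expressed  = c(rep(15, 5), rep(0, 35)),  # >= 10 in 5 samples
  always_expressed = rep(100, 40)
)

# Step 1: drop genes whose total count across all samples is below 10.
keep_total <- rowSums(toy_counts) >= 10
filtered <- toy_counts[keep_total, , drop = FALSE]

# Step 2: keep genes with a count of >= 10 in at least X samples,
# where X is 5% of the number of samples.
X <- round(0.05 * ncol(filtered))
keep_samples <- rowSums(filtered >= 10) >= X
filtered <- filtered[keep_samples, , drop = FALSE]
rownames(filtered)
```

In this toy case the first step removes the gene that is never expressed, and the second removes the gene that passes the total-count threshold only because of a single sample.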

# Exploratory data analysis

## PCA + clinical correlations all studies

For consistency between the DESeq2 `plotPCA` function (which by default takes the 500 most variable genes) and the PCAtools `pca` function, all genes were used when carrying out PCA.

```{r}
plotPCA(vsd, intgroup = c("Subpopulation"), ntop = nrow(vsd))
plotPCA(vsd, intgroup = c("Study"), ntop = nrow(vsd))
plotPCA(vsd, intgroup = c("Tumor_JuxtaTumor"), ntop = nrow(vsd))

```

There seem to be four groups of samples here: the samples at PC1 < -100, which look like outliers; the main group in the middle (-100 < PC1 < 50); and two groups which separate on PC2. One of these two groups comes entirely from one batch (EGAD00001005744), which is purely tumour CAFs, and the other is a mixture of our in-house samples and study EGAD00001006144. There is no clear separation between the in-house samples and EGAD00001006144, so perhaps they could be called one cluster. The in-house samples underwent culturing in a medium to promote the growth of fibroblastic cells, whereas the EGAD00001006144 samples underwent separation by either sorting or spreading. It is possible that similarities in the conditions under which the samples were kept altered their transcriptomic properties.

We can see that there are 16 samples that explain much of the variation in PC1, meaning that they are quite different from the other samples. Let's have a look at the PCA loadings using the `biplot` function from `PCAtools`.

It is important to note here that our interpretation of the PCA is subjective, and can change depending on the number of highly variable genes we consider when carrying out PCA.
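
To make the dependence on the number of genes concrete, here is a minimal sketch using simulated data and base R only. It selects the `ntop` genes with the highest variance across samples before running PCA, which is the same kind of selection `plotPCA` performs. The helper name `pca_top_genes` and all values are assumptions made for this illustration.

```r
# Simulated "transformed expression" matrix: 200 genes x 12 samples.
# All values are simulated; nothing here comes from the real data.
set.seed(42)
n_genes <- 200
n_samples <- 12
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("s", seq_len(n_samples))))
# Give the first 10 genes a strong shift in half of the samples.
expr[1:10, 1:6] <- expr[1:10, 1:6] + 5

# PCA on the `ntop` genes with the highest variance across samples.
# Returns the percentage of variance explained by each component.
pca_top_genes <- function(mat, ntop) {
  rv <- apply(mat, 1, var)
  selected <- order(rv, decreasing = TRUE)[seq_len(min(ntop, nrow(mat)))]
  pc <- prcomp(t(mat[selected, , drop = FALSE]))
  100 * pc$sdev^2 / sum(pc$sdev^2)
}

pct_top10 <- pca_top_genes(expr, ntop = 10)
pct_all   <- pca_top_genes(expr, ntop = nrow(expr))
round(c(top10 = pct_top10[1], all_genes = pct_all[1]), 1)
```

When only the most variable genes are used, PC1 captures a larger share of the variance than when all genes are included, so the apparent strength of a grouping depends on `ntop`.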


```{r Variance stabilising transformation and PCA on full dataset}
vsd_mat <- assay(vsd)
metadata_pca <- metadata[,1:4]
p <- pca(vsd_mat, metadata = metadata_pca)
```


```{r Biplot}
biplot(p, showLoadings = TRUE, lab = NULL)
```

There seem to be 10 genes that are associated with PC2, separating our main cluster from the in-house/EGAD00001006144 samples.

* FOS
  + Proto-oncogene, forms part of TF complex, regulators of cell proliferation, differentiation, transformation, apoptosis.
* APOD
  + Apolipoprotein D, encodes part of HDL
  + Expression induced in quiescent/senescent fibroblasts [@Rassart2020], and so may inhibit cell growth
  + Downregulated in CAFs in our initial CAF vs TAN DE analysis
* TMEM176B
  + A transmembrane protein
  + Identified as LR8 in 1999 [@Lurton1999] and was proposed as a marker for fibroblasts and their subpopulations.
  + It has recently been found to be important in the AKT/mTOR pathway, which is involved in cell proliferation (and hence can be implicated in cancer) [@Kang2021].
* SELENOP
  + Selenoprotein P
  + Increased expression stops conversion of fibroblasts to myofibroblasts [@Short2017]
* PLXDC1
  + Plexin Domain Containing 1
  + Involved in angiogenesis
  + Cell surface receptor for Pigment Epithelium Derived Factor [@Cheng2014]
* P4HB
  + Protein disulfide isomerase
  + Possible fibroblast marker [@Wetzig2013]
* CHPF
  + Chondroitin polymerising factor
  + Alters the formation of chondroitin sulphate in breast cancer. Chondroitin sulphate forms "abnormal" chains in breast cancer [@Liao2021]
* CEMIP
  + Cell Migration Inducing Hyaluronidase 1
  + WNT-related [@Dong2021]
  + High expression associated with malignancy and increased CAF infiltration [@Dong2021]. Possible biomarker. The Dong study looked at expression in tumour cells; here we can see that its expression also seems to differ between groups of CAFs.
* TFPI2
  + Tissue factor pathway inhibitor 2.
  + Serine proteinase inhibitor
  + Tumour suppressor
  + Inhibits plasmin, thereby inhibiting the activation of MMPs
  + Increased expression of TFPI2 in cancer cells downregulates the expression of MMPs in CAFs (the opposite is the case too) [@Gaud2011].
* GREM1
  + Antagonist of BMP, playing a role in tissue differentiation.
  + Expressed in basal cell carcinoma CAF myofibroblasts [@Kim2017].
  + Expression of GREM1 derived from CAFs thought to promote cancer progression [@Ren2019].
  + Found to be expressed in CAF cell lines but not in breast cancer cell lines [@Ren2019].
  + Its expression in bulk tumour samples is correlated with the expression of CAF markers such as FAP [@Ren2019].
  
```{r, include = FALSE}

peigencor <- eigencorplot(p,
    components = getComponents(p, 1:10),
    metavars = colnames(metadata_pca),
    col = c('white', 'cornsilk1', 'gold', 'forestgreen', 'darkgreen'),
    cexCorval = 0.7,
    colCorval = 'black',
    fontCorval = 2,
    posLab = 'bottomleft',
    rotLabX = 45,
    posColKey = 'top',
    cexLabColKey = 1.5,
    scale = TRUE,
    corFUN = 'pearson',
    corUSE = 'pairwise.complete.obs',
    corMultipleTestCorrection = 'none',
    main = 'PCs clinical correlations',
    colFrame = 'white',
    plotRsquared = TRUE)
```

```{r}
peigencor
```

```{r, include=FALSE}
ppairs <- pairsplot(p, components = getComponents(p, c(1:3)),
    triangle = TRUE, trianglelabSize = 12,
    hline = 0, vline = 0,
    pointSize = 0.8, gridlines.major = FALSE, gridlines.minor = FALSE,
    colby = 'Study',
    title = '', plotaxes = FALSE,
    margingaps = unit(c(0.01, 0.01, 0.01, 0.01), 'cm'),
    legendPosition = 'bottom',
    returnPlot = TRUE)
#ggsave(filename = "/home/kevin/Documents/PhD/rna_seq_bc/images_for_presentation/pairsplot_no_inhouse_col_strandedness.png", plot = ppairs)
```

```{r}
ppairs
```
## Test out removal of samples that are outliers on PCA

```{r}
rv <- rowVars(assay(vsd))
pc <- prcomp(t(assay(vsd)[head(order(-rv),nrow(assay(vsd))),]))
plot(pc$x[,1:2])

idx <- pc$x[,1] < -110
sum(idx)
plot(pc$x[,1:2], col=idx+1, pch=20, asp=1)
outliers_1 <- which(idx == TRUE)
```
### Clinical correlations by outlier status

```{r}
metadata_pca$Outlier <- ifelse(idx, "Yes", "No")
p <- pca(vsd_mat, metadata = metadata_pca)
peigencor <- eigencorplot(p,
    components = getComponents(p, 1:10),
    metavars = colnames(metadata_pca),
    col = c('white', 'cornsilk1', 'gold', 'forestgreen', 'darkgreen'),
    cexCorval = 0.7,
    colCorval = 'black',
    fontCorval = 2,
    posLab = 'bottomleft',
    rotLabX = 45,
    posColKey = 'top',
    cexLabColKey = 1.5,
    scale = TRUE,
    corFUN = 'pearson',
    corUSE = 'pairwise.complete.obs',
    corMultipleTestCorrection = 'none',
    main = 'PCs clinical correlations',
    colFrame = 'white',
    plotRsquared = TRUE)
```

```{r}
peigencor
```


```{r}
patient_samples <- rownames(metadata)
patient_samples <- patient_samples[!idx]
dds_remove_outliers <- dds[,!idx]
vsd_remove_outliers <- vst(dds_remove_outliers, blind = TRUE)
vsd_remove_outliers_mat <- assay(vsd_remove_outliers)
```

```{r}
plotPCA(vsd_remove_outliers, intgroup = c("Subpopulation"), ntop = nrow(vsd_remove_outliers))
plotPCA(vsd_remove_outliers, intgroup = c("Study"), ntop = nrow(vsd_remove_outliers))
plotPCA(vsd_remove_outliers, intgroup = c("Tumor_JuxtaTumor"), ntop = nrow(vsd_remove_outliers))

```

```{r}
metadata_pca_remove_outliers <- colData(vsd_remove_outliers)[,2:4]
#metadata_pca_remove_outliers <- as.data.frame(metadata_pca_remove_outliers[!names(metadata_pca_remove_outliers) %in% c("names")])
p_remove_outliers <- pca(vsd_remove_outliers_mat, metadata = metadata_pca_remove_outliers)
peigencor_remove_outliers <- eigencorplot(p_remove_outliers,
    components = getComponents(p_remove_outliers, 1:10),
    metavars = colnames(metadata_pca_remove_outliers),
    col = c('white', 'cornsilk1', 'gold', 'forestgreen', 'darkgreen'),
    cexCorval = 0.7,
    colCorval = 'black',
    fontCorval = 2,
    posLab = 'bottomleft',
    rotLabX = 45,
    posColKey = 'top',
    cexLabColKey = 1.5,
    scale = TRUE,
    corFUN = 'pearson',
    #corUSE = 'pairwise.complete.obs',
    corUSE = 'pairwise.complete.obs',
    corMultipleTestCorrection = 'none',
    main = 'PCs clinical correlations without\nOutliers',
    colFrame = 'white',
    plotRsquared = TRUE)
```

```{r}
peigencor_remove_outliers
```
Now that the outliers have been removed, batch accounts for most of the variability in PC1.

How many of each subpopulation do we have left?

```{r}
table(colData(dds)$Subpopulation)
table(colData(dds_remove_outliers)$Subpopulation)
```

## Remove batch effects after outlier removal

```{r Batch correction only include batch and covariates}
counts_matrix_remove_outliers <- assay(dds_remove_outliers)
batch_remove_outliers <- colData(dds_remove_outliers)$Study
adjusted_remove_outliers <- ComBat_seq(counts = counts_matrix_remove_outliers, batch = batch_remove_outliers, group = colData(dds_remove_outliers)$Tumor_JuxtaTumor)
```

```{r}
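# use the batch-adjusted counts and sample metadata from the outlier-removed dataset
dds_batch_corrected <- DESeqDataSetFromMatrix(adjusted_remove_outliers, colData = colData(dds_remove_outliers), design = ~1)
# not sure about the blind = FALSE here; with design = ~1 it should make little practical difference
vsd_batch_corrected <- vst(dds_batch_corrected, blind = FALSE)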
```

```{r PC1 vs PC2 batch corrected after outlier removal}
plotPCA(vsd_batch_corrected, intgroup = c("Subpopulation"), ntop = nrow(vsd_batch_corrected))
plotPCA(vsd_batch_corrected, intgroup = c("Study"), ntop = nrow(vsd_batch_corrected))
plotPCA(vsd_batch_corrected, intgroup = c("Tumor_JuxtaTumor"), ntop = nrow(vsd_batch_corrected))

```

## PCA + clinical correlations without In-House data

```{r, include = FALSE}
vsd_no_inhouse <- vst(dds_no_inhouse, blind = TRUE)
vsd_mat_no_inhouse <- assay(vsd_no_inhouse)
```

```{r}
plotPCA(vsd_no_inhouse, intgroup = c("Subpopulation"), ntop = nrow(vsd_no_inhouse))
plotPCA(vsd_no_inhouse, intgroup = c("Study"), ntop = nrow(vsd_no_inhouse))
plotPCA(vsd_no_inhouse, intgroup = c("Tumor_JuxtaTumor"), ntop = nrow(vsd_no_inhouse))

```

```{r}
metadata_pca_no_inhouse <- colData(vsd_no_inhouse)[,1:5]
metadata_pca_no_inhouse <- as.data.frame(metadata_pca_no_inhouse[!names(metadata_pca_no_inhouse) %in% c("names")])
p_no_inhouse <- pca(vsd_mat_no_inhouse, metadata = metadata_pca_no_inhouse)
peigencor_no_inhouse <- eigencorplot(p_no_inhouse,
    components = getComponents(p_no_inhouse, 1:10),
    metavars = colnames(metadata_pca_no_inhouse),
    col = c('white', 'cornsilk1', 'gold', 'forestgreen', 'darkgreen'),
    cexCorval = 0.7,
    colCorval = 'black',
    fontCorval = 2,
    posLab = 'bottomleft',
    rotLabX = 45,
    posColKey = 'top',
    cexLabColKey = 1.5,
    scale = TRUE,
    corFUN = 'pearson',
    #corUSE = 'pairwise.complete.obs',
    corUSE = 'pairwise.complete.obs',
    corMultipleTestCorrection = 'none',
    main = 'PCs clinical correlations without In-House',
    colFrame = 'white',
    plotRsquared = TRUE)
```

```{r}
peigencor_no_inhouse
```

Here we can see that strandedness (i.e. the study `EGAD00001006144`) is significantly correlated with PCs 1-3. These PCs explain 29%, 7% and 5% of the variance of the entire dataset, respectively.
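
The two quantities discussed here, the percentage of variance explained by each PC and the correlation of a PC with a covariate, can be sketched in base R as follows. The data are simulated, the two-level covariate is hypothetical, and the numeric encoding of the factor is a simplifying assumption rather than a description of what `eigencorplot` does internally.

```r
# Simulated data: 100 genes x 10 samples, with a hypothetical two-level
# covariate (e.g. library strandedness) shifting 20 of the genes.
set.seed(7)
expr <- matrix(rnorm(100 * 10), nrow = 100)
strandedness <- rep(c("stranded", "unstranded"), each = 5)
expr[1:20, strandedness == "stranded"] <-
  expr[1:20, strandedness == "stranded"] + 4

pc <- prcomp(t(expr))

# Percentage of total variance explained by the first three PCs.
pct_var <- 100 * pc$sdev^2 / sum(pc$sdev^2)
round(pct_var[1:3], 1)

# Encode the covariate numerically and correlate it with each PC's scores.
covariate_num <- as.numeric(factor(strandedness))
pc_cor <- apply(pc$x[, 1:3], 2, function(scores) cor(scores, covariate_num))
round(pc_cor^2, 2)  # R-squared per PC
```

A covariate that drives a large, coordinated shift in many genes ends up strongly correlated with the leading PC, which is the pattern seen for strandedness above.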

```{r, include=FALSE}
ppairs <- pairsplot(p_no_inhouse, components = getComponents(p_no_inhouse, c(1:3)),
    triangle = TRUE, trianglelabSize = 12,
    hline = 0, vline = 0,
    pointSize = 0.8, gridlines.major = FALSE, gridlines.minor = FALSE,
    colby = 'Strandedness',
    title = '', plotaxes = FALSE,
    margingaps = unit(c(0.01, 0.01, 0.01, 0.01), 'cm'),
    legendPosition = 'bottom',
    returnPlot = TRUE)
#ggsave(filename = "/home/kevin/Documents/PhD/rna_seq_bc/images_for_presentation/pairsplot_no_inhouse_col_strandedness.png", plot = ppairs)
```

```{r}
ppairs
```

## PCA + clinical correlations without In-House data or study EGAD00001006144

Given the plot above, we can see that most of the variability of PC1 can be attributed to Study, indicating batch effects. Methods exist to "remove" batch effects, but they cannot be run on our data while protecting Subpopulation. All of our S3 samples come from study `EGAD00001004810`, so when one tries to make a model matrix including both Study and Subpopulation, the model matrix is of less than full rank and there is an error. Let's look at the clinical correlations plot if we remove our outlier study `EGAD00001006144` as well as our in-house data.
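
The rank problem can be reproduced on a small, made-up sample sheet. In the sketch below, the study names and the sample layout are hypothetical; the only feature copied from our data is that one subpopulation comes entirely from one study, and that study contains nothing else.

```r
# Hypothetical sample sheet in which every S3 sample comes from one study
# and that study contains no other subpopulation.
toy_meta <- data.frame(
  Study = rep(c("study_A", "study_B", "study_C"), each = 4),
  Subpopulation = c("S1", "S1", "S4", "S4",
                    "S1", "S1", "S4", "S4",
                    "S3", "S3", "S3", "S3"),
  stringsAsFactors = TRUE
)

# The cross-tabulation shows the confounding directly.
table(toy_meta$Study, toy_meta$Subpopulation)

# The design matrix has more columns than its rank.
mm <- model.matrix(~ Study + Subpopulation, data = toy_meta)
c(columns = ncol(mm), rank = qr(mm)$rank)
```

The indicator column for `study_C` is identical to the indicator column for `S3`, so the two effects cannot be estimated separately and the matrix is rank deficient.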

```{r Remove study EGAD00001006144, include=FALSE}
idx <- which(se_no_inhouse$Study != "EGAD00001006144")
se_no_inhouse_no_6144 <- se_no_inhouse[,idx]
dds_no_inhouse_no_6144 <- DESeqDataSet(se_no_inhouse_no_6144, design = ~1)
# returns a vector of whether the total count of each gene is >= 10 (True or false)
keep <- rowSums(counts(dds_no_inhouse_no_6144)) >= 10
# only keep rows (genes) for which keep is TRUE
dds_no_inhouse_no_6144 <- dds_no_inhouse_no_6144[keep,]
# at least X samples with a count of 10 or more, where X can be chosen as the sample size of the smallest group of samples
X <- 7
keep <- rowSums(counts(dds_no_inhouse_no_6144) >= 10) >= X
dds_no_inhouse_no_6144 <- dds_no_inhouse_no_6144[keep,]
vsd_no_inhouse_no_6144 <- vst(dds_no_inhouse_no_6144, blind = TRUE)
```

```{r PC1 vs PC2 without study 6144 or inhouse data}
plotPCA(vsd_no_inhouse_no_6144, intgroup = c("Subpopulation"), ntop = nrow(vsd_no_inhouse_no_6144))
plotPCA(vsd_no_inhouse_no_6144, intgroup = c("Study"), ntop = nrow(vsd_no_inhouse_no_6144))
plotPCA(vsd_no_inhouse_no_6144, intgroup = c("Tumor_JuxtaTumor"), ntop = nrow(vsd_no_inhouse_no_6144))

```

```{r, include = FALSE}
vsd_mat_reduced <- assay(vsd_no_inhouse_no_6144)
metadata_pca_reduced <- colData(vsd_no_inhouse_no_6144)[,1:4]
metadata_pca_reduced <- as.data.frame(metadata_pca_reduced[!names(metadata_pca_reduced) %in% c("names")])
p_no_inhouse_no6144 <- pca(vsd_mat_reduced, metadata = metadata_pca_reduced)
peigencor_reduced <- eigencorplot(p_no_inhouse_no6144,
    components = getComponents(p_no_inhouse_no6144, 1:10),
    metavars = colnames(metadata_pca_reduced),
    col = c('white', 'cornsilk1', 'gold', 'forestgreen', 'darkgreen'),
    cexCorval = 0.7,
    colCorval = 'black',
    fontCorval = 2,
    posLab = 'bottomleft',
    rotLabX = 45,
    posColKey = 'top',
    cexLabColKey = 1.5,
    scale = TRUE,
    corFUN = 'pearson',
    #corUSE = 'pairwise.complete.obs',
    corUSE = 'pairwise.complete.obs',
    corMultipleTestCorrection = 'none',
    main = 'PCs clinical correlations\nremove EGAD00001006144',
    colFrame = 'white',
    plotRsquared = TRUE)

```

```{r}
peigencor_reduced
```

```{r pairs plot without EGAD00001006144 colour by study, include = FALSE}
ppairs_no_6144 <- pairsplot(p_no_inhouse_no6144, components = getComponents(p_no_inhouse_no6144, c(1:3)),
    triangle = TRUE, trianglelabSize = 12,
    hline = 0, vline = 0,
    pointSize = 0.8, gridlines.major = FALSE, gridlines.minor = FALSE,
    #colby = 'Strandedness',
    colby = 'Study',
    title = '', plotaxes = FALSE,
    margingaps = unit(c(0.01, 0.01, 0.01, 0.01), 'cm'),
    legendPosition = 'bottom',
    returnPlot = TRUE)
```

```{r}
ppairs_no_6144
```

```{r pairs plot without EGAD00001006144, include = FALSE}
ppairs_no_6144_colour_subpop <- pairsplot(p_no_inhouse_no6144, components = getComponents(p_no_inhouse_no6144, c(1:3)),
    triangle = TRUE, trianglelabSize = 12,
    hline = 0, vline = 0,
    pointSize = 0.8, gridlines.major = FALSE, gridlines.minor = FALSE,
    #colby = 'Strandedness',
    colby = 'Subpopulation',
    title = '', plotaxes = FALSE,
    margingaps = unit(c(0.01, 0.01, 0.01, 0.01), 'cm'),
    legendPosition = 'bottom',
    returnPlot = TRUE)
```

```{r}
ppairs_no_6144_colour_subpop
```

# Batch correction with ComBat-seq and limma

## ComBat-seq

ComBat-seq [@Zhang2020] models the counts for each gene using a negative binomial regression model, which has a mean and a dispersion parameter. It allows changes in counts due to biological condition to be preserved after adjustment. However, in our case, it is not possible to guarantee the preservation of changes due to biological condition (CAF subpopulation) while removing batch effects. Here we carry out batch correction using ComBat-seq, and then look at the adjusted data to see whether biological condition has been preserved. We should note here that the S3 subpopulation does not separate well from the other subpopulations before batch correction, even though the S3 data comes from a different "study".
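
As a reminder of what the mean and dispersion parameters of a negative binomial model control, the following sketch simulates counts for a single hypothetical gene and compares the empirical variance with the value expected under the mean-dispersion parameterisation. The values of `mu` and `alpha` are arbitrary choices for the example, not estimates from our data.

```r
# Simulate counts for one hypothetical gene from a negative binomial
# distribution with mean `mu` and dispersion `alpha`.
set.seed(123)
mu <- 100
alpha <- 0.2
counts <- rnbinom(n = 1e5, mu = mu, size = 1 / alpha)

# Under this parameterisation the variance is mu + alpha * mu^2,
# i.e. larger than the Poisson variance (mu) whenever alpha > 0.
expected_var <- mu + alpha * mu^2
c(empirical_mean = mean(counts),
  empirical_var  = var(counts),
  expected_var   = expected_var)
```

The dispersion term is what lets the model absorb the extra sample-to-sample variability seen in RNA-seq counts.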

### ComBat-seq no clinical covariates

```{r Batch correction only include batch no covariates}
counts_matrix <- assay(dds)
batch_all_samples <- metadata$Study
adjusted_all_samples <- ComBat_seq(counts = counts_matrix, batch = batch_all_samples)
```
```{r}
dds_batch_corrected <- DESeqDataSetFromMatrix(adjusted_all_samples, colData = metadata, design = ~1)
# not sure about the blind = FALSE here
vsd_batch_corrected <- vst(dds_batch_corrected, blind = FALSE)
```
```{r PC1 vs PC2 batch corrected no covariate}
plotPCA(vsd_batch_corrected, intgroup = c("Subpopulation"), ntop = nrow(vsd_batch_corrected))
plotPCA(vsd_batch_corrected, intgroup = c("Study"), ntop = nrow(vsd_batch_corrected))
plotPCA(vsd_batch_corrected, intgroup = c("Tumor_JuxtaTumor"), ntop = nrow(vsd_batch_corrected))

```

```{r, include = FALSE}
vsd_batch_corrected_mat <- assay(vsd_batch_corrected)
metadata_pca_batch_corrected <- colData(vsd_batch_corrected)[,1:4]
#metadata_pca_batch_corrected <- as.data.frame(metadata_pca_batch_corrected[!names(metadata_pca_batch_corrected) %in% c("names")])
p_batch_corrected<- pca(vsd_batch_corrected_mat, metadata = metadata_pca_batch_corrected)
peigencor_batch_corrected <- eigencorplot(p_batch_corrected,
    components = getComponents(p_batch_corrected, 1:10),
    metavars = colnames(metadata_pca_batch_corrected),
    col = c('white', 'cornsilk1', 'gold', 'forestgreen', 'darkgreen'),
    cexCorval = 0.7,
    colCorval = 'black',
    fontCorval = 2,
    posLab = 'bottomleft',
    rotLabX = 45,
    posColKey = 'top',
    cexLabColKey = 1.5,
    scale = TRUE,
    corFUN = 'pearson',
    #corUSE = 'pairwise.complete.obs',
    corUSE = 'pairwise.complete.obs',
    corMultipleTestCorrection = 'none',
    main = 'Clinical correlations after\nbatch correction no covariates',
    cexMain = 1.75,
    colFrame = 'white',
    plotRsquared = TRUE)
```

```{r}
peigencor_batch_corrected
```
### ComBat-seq with Tumor_JuxtaTumor as covariate

```{r Batch correction include batch and Tumor_Juxtatumor}
adjusted_all_samples <- ComBat_seq(counts = counts_matrix, batch = batch_all_samples, group = metadata$Tumor_JuxtaTumor)
```

```{r}
dds_batch_corrected <- DESeqDataSetFromMatrix(adjusted_all_samples, colData = metadata, design = ~1)
# not sure about the blind = FALSE here
vsd_batch_corrected <- vst(dds_batch_corrected, blind = FALSE)
```

```{r PC1 vs PC2 batch corrected Tumor_Juxtatumor covariate}
plotPCA(vsd_batch_corrected, intgroup = c("Subpopulation"), ntop = nrow(vsd_batch_corrected))
plotPCA(vsd_batch_corrected, intgroup = c("Study"), ntop = nrow(vsd_batch_corrected))
plotPCA(vsd_batch_corrected, intgroup = c("Tumor_JuxtaTumor"), ntop = nrow(vsd_batch_corrected))
```

```{r, include = FALSE}
vsd_batch_corrected_mat <- assay(vsd_batch_corrected)
metadata_pca_batch_corrected <- colData(vsd_batch_corrected)[,1:4]
#metadata_pca_batch_corrected <- as.data.frame(metadata_pca_batch_corrected[!names(metadata_pca_batch_corrected) %in% c("names")])
p_batch_corrected<- pca(vsd_batch_corrected_mat, metadata = metadata_pca_batch_corrected)
peigencor_batch_corrected <- eigencorplot(p_batch_corrected,
    components = getComponents(p_batch_corrected, 1:10),
    metavars = colnames(metadata_pca_batch_corrected),
    col = c('white', 'cornsilk1', 'gold', 'forestgreen', 'darkgreen'),
    cexCorval = 0.7,
    colCorval = 'black',
    fontCorval = 2,
    posLab = 'bottomleft',
    rotLabX = 45,
    posColKey = 'top',
    cexLabColKey = 1.5,
    scale = TRUE,
    corFUN = 'pearson',
    #corUSE = 'pairwise.complete.obs',
    corUSE = 'pairwise.complete.obs',
    corMultipleTestCorrection = 'none',
    main = 'Clinical correlations after\nbatch correction tumor-juxtatumor as covariate',
    cexMain = 1.75,
    colFrame = 'white',
    plotRsquared = TRUE)
```

```{r}
peigencor_batch_corrected
```
## limma::removeBatchEffect

### limma::removeBatchEffect with all covariates

limma's `removeBatchEffect` function accepts a design matrix describing the "treatment conditions" we wish to preserve. This is usually the design matrix with all experimental factors other than the batch effects. The ideal scenario would be to include Subpopulation in this matrix. However, this treats `Unknown` as its own Subpopulation, and so will preserve differences between the in-house samples and the other samples. This is contrary to what we want when assigning our samples to a cluster.
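
The general idea of removing a batch term while protecting the design can be sketched with ordinary least squares in base R. This is a simplified illustration and not limma's implementation: the helper `remove_batch_sketch`, the simulated data and the balanced layout are all assumptions made for the example.

```r
# Simplified sketch of linear batch removal that keeps the design terms.
# This is NOT limma's code; it only mirrors the general idea.
remove_batch_sketch <- function(x, batch, design) {
  batch <- factor(batch)
  # Sum-to-zero coding so batch terms are deviations from the overall mean.
  contrasts(batch) <- contr.sum(nlevels(batch))
  batch_mat <- model.matrix(~ batch)[, -1, drop = FALSE]
  # Fit design and batch terms jointly, one regression per gene (row of x).
  fit <- lm.fit(cbind(design, batch_mat), t(x))
  beta_batch <- fit$coefficients[-seq_len(ncol(design)), , drop = FALSE]
  # Subtract only the fitted batch component.
  x - t(batch_mat %*% beta_batch)
}

# Simulated log-scale expression: 3 genes x 8 samples, two batches,
# with condition balanced across batches (all values are invented).
set.seed(1)
condition <- factor(rep(c("tumour", "juxta"), times = 4))
batch <- rep(c("b1", "b2"), each = 4)
x <- matrix(rnorm(3 * 8), nrow = 3)
x[, batch == "b2"] <- x[, batch == "b2"] + 3                    # batch shift
x[1, condition == "tumour"] <- x[1, condition == "tumour"] + 2  # biology

design <- model.matrix(~ condition)
corrected <- remove_batch_sketch(x, batch, design)

# Per-gene differences in group means before and after correction.
batch_gap <- function(m) {
  rowMeans(m[, batch == "b2"]) - rowMeans(m[, batch == "b1"])
}
cond_gap <- function(m) {
  rowMeans(m[, condition == "tumour"]) - rowMeans(m[, condition == "juxta"])
}
round(cbind(batch_before = batch_gap(x), batch_after = batch_gap(corrected),
            cond_before = cond_gap(x), cond_after = cond_gap(corrected)), 3)
```

In this balanced toy layout the difference between batch means disappears while the difference between conditions is unchanged. Any factor placed in the design, including an `Unknown` subpopulation level, is protected in the same way.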

```{r}
vsd_not_blind <- vst(dds, blind = FALSE)
```

```{r removeBatchEffect all covariates model matrix}
mat <- assay(vsd_not_blind)
# create model matrix, full model matrix with Tumor_JuxtaTumor and Subpopulation
mm <- model.matrix(~Tumor_JuxtaTumor+Subpopulation, colData(vsd_not_blind))
mat <- limma::removeBatchEffect(mat, batch = vsd_not_blind$Study, design = mm)
```
```{r}
vsd_batch_corrected_limma_all_covariates <- vsd_not_blind
assay(vsd_batch_corrected_limma_all_covariates) <- mat
```

```{r}
plotPCA(vsd_batch_corrected_limma_all_covariates, intgroup = c("Subpopulation"), ntop = nrow(vsd_batch_corrected_limma_all_covariates))
plotPCA(vsd_batch_corrected_limma_all_covariates, intgroup = c("Study"),ntop = nrow(vsd_batch_corrected_limma_all_covariates) )
plotPCA(vsd_batch_corrected_limma_all_covariates, intgroup = c("Tumor_JuxtaTumor"), ntop = nrow(vsd_batch_corrected_limma_all_covariates))
```
```{r}
plotPCA(vsd_batch_corrected_limma_all_covariates, intgroup = c("Subpopulation"), ntop = 500)
plotPCA(vsd_batch_corrected_limma_all_covariates, intgroup = c("Study"),ntop = 500 )
plotPCA(vsd_batch_corrected_limma_all_covariates, intgroup = c("Tumor_JuxtaTumor"), ntop =500)
```

We can see that this removes batch effects and preserves differences between the subpopulations. However, it treats our `Unknown` samples as their own subpopulation. This is a case where the number of most variable genes used for the PCA (`ntop`) makes a big difference.
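
The effect of `ntop` can be sketched in base R on a made-up matrix. The helper `pca_top_var` below is hypothetical and only assumes that the PCA is run on the `ntop` rows with the highest variance. When a few genes carry the group difference, restricting the PCA to those genes concentrates the variance on the first component.

```r
# Hypothetical helper: PCA on the `ntop` most variable rows (genes) of a matrix.
pca_top_var <- function(mat, ntop = 500) {
  rv <- apply(mat, 1, var)                                  # per-gene variance
  keep <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
  pca <- prcomp(t(mat[keep, , drop = FALSE]))               # samples as rows
  list(n_genes = length(keep),
       percent_var = pca$sdev^2 / sum(pca$sdev^2))
}

# Toy matrix: 200 "genes" x 6 samples; the first 10 genes separate two groups.
set.seed(42)
toy <- matrix(rnorm(200 * 6), nrow = 200)
toy[1:10, 4:6] <- toy[1:10, 4:6] + 8

top10     <- pca_top_var(toy, ntop = 10)
all_genes <- pca_top_var(toy, ntop = nrow(toy))
# Proportion of variance on PC1 with 10 genes versus all genes
round(c(top10 = top10$percent_var[1], all = all_genes$percent_var[1]), 2)
```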

```{r, include = FALSE}
vsd_batch_corrected_mat <- assay(vsd_batch_corrected_limma_all_covariates)
metadata_pca_batch_corrected <- colData(vsd_batch_corrected_limma_all_covariates)[,2:4]
#metadata_pca_batch_corrected <- as.data.frame(metadata_pca_batch_corrected[!names(metadata_pca_batch_corrected) %in% c("names")])
p_batch_corrected<- pca(vsd_batch_corrected_mat, metadata = metadata_pca_batch_corrected)
peigencor_batch_corrected <- eigencorplot(p_batch_corrected,
    components = getComponents(p_batch_corrected, 1:10),
    metavars = colnames(metadata_pca_batch_corrected),
    col = c('white', 'cornsilk1', 'gold', 'forestgreen', 'darkgreen'),
    cexCorval = 0.7,
    colCorval = 'black',
    fontCorval = 2,
    posLab = 'bottomleft',
    rotLabX = 45,
    posColKey = 'top',
    cexLabColKey = 1.5,
    scale = TRUE,
    corFUN = 'pearson',
    #corUSE = 'pairwise.complete.obs',
    corUSE = 'pairwise.complete.obs',
    corMultipleTestCorrection = 'none',
    main = 'Clinical correlations after\nbatch correction with limma all covariates',
    cexMain = 1.75,
    colFrame = 'white',
    plotRsquared = TRUE)
```

```{r}
peigencor_batch_corrected
```

### limma::removeBatchEffect with no clinical covariates or subpopulation information

```{r removeBatchEffect no covariates model matrix}
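# intercept-only model matrix: no clinical covariates or subpopulation information are preserved
mm <- model.matrix(~1, colData(vsd_not_blind))
# start again from the uncorrected data, since `mat` was overwritten by the previous correction
mat <- assay(vsd_not_blind)
mat <- limma::removeBatchEffect(mat, batch = vsd_not_blind$Study, design = mm)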
```

```{r}
vsd_batch_corrected_limma_no_covariates <- vsd_not_blind
assay(vsd_batch_corrected_limma_no_covariates) <- mat
```

```{r}
plotPCA(vsd_batch_corrected_limma_no_covariates, intgroup = c("Subpopulation"), ntop = nrow(vsd_batch_corrected_limma_no_covariates))
plotPCA(vsd_batch_corrected_limma_no_covariates, intgroup = c("Study"),ntop = nrow(vsd_batch_corrected_limma_no_covariates) )
plotPCA(vsd_batch_corrected_limma_no_covariates, intgroup = c("Tumor_JuxtaTumor"), ntop = nrow(vsd_batch_corrected_limma_no_covariates))
```

```{r, include = FALSE}
vsd_batch_corrected_mat <- assay(vsd_batch_corrected_limma_no_covariates)
metadata_pca_batch_corrected <- colData(vsd_batch_corrected_limma_no_covariates)[,2:4]
#metadata_pca_batch_corrected <- as.data.frame(metadata_pca_batch_corrected[!names(metadata_pca_batch_corrected) %in% c("names")])
p_batch_corrected<- pca(vsd_batch_corrected_mat, metadata = metadata_pca_batch_corrected)
peigencor_batch_corrected <- eigencorplot(p_batch_corrected,
    components = getComponents(p_batch_corrected, 1:10),
    metavars = colnames(metadata_pca_batch_corrected),
    col = c('white', 'cornsilk1', 'gold', 'forestgreen', 'darkgreen'),
    cexCorval = 0.7,
    colCorval = 'black',
    fontCorval = 2,
    posLab = 'bottomleft',
    rotLabX = 45,
    posColKey = 'top',
    cexLabColKey = 1.5,
    scale = TRUE,
    corFUN = 'pearson',
    #corUSE = 'pairwise.complete.obs',
    corUSE = 'pairwise.complete.obs',
    corMultipleTestCorrection = 'none',
    main = 'Clinical correlations after\nbatch correction with limma no clinical covariates',
    cexMain = 1.75,
    colFrame = 'white',
    plotRsquared = TRUE)
```

```{r}
peigencor_batch_corrected
```

# Surrogate variable analysis

```{r}
dds_no_inhouse_no_6144 <- DESeq(dds_no_inhouse_no_6144)
dat  <- counts(dds_no_inhouse_no_6144, normalized = TRUE)
idx  <- rowMeans(dat) > 1
dat  <- dat[idx, ]
# we have Subpopulation and tumour-juxtatumour in the model matrix
mod  <- model.matrix(~ Subpopulation + Tumor_JuxtaTumor, colData(dds_no_inhouse_no_6144))
mod0 <- model.matrix(~   1, colData(dds_no_inhouse_no_6144))
svseq_no6144 <- svaseq(dat, mod, mod0)
```
```{r}
sv_names <- paste0("SV", seq_len(svseq_no6144$n.sv))
# add each surrogate variable as a column of the sample metadata
for (i in seq_along(sv_names)){
  colData(dds_no_inhouse_no_6144)[,sv_names[i]] <- svseq_no6144$sv[,i]
}
colData(dds_no_inhouse_no_6144)[,"Subpopulation"] <- as.factor(colData(dds_no_inhouse_no_6144)$Subpopulation)
colData(dds_no_inhouse_no_6144)[,"Tumor_JuxtaTumor"] <- as.factor(colData(dds_no_inhouse_no_6144)$Tumor_JuxtaTumor)
```

# Differential expression analysis

```{r}
# add surrogate variables to design DDS object
design(dds_no_inhouse_no_6144) <- ~ Tumor_JuxtaTumor + SV1 + SV2 + SV3 + SV4 + SV5 + SV6 + SV7 + SV8 + SV9 + SV10 + SV11 + SV12 + SV13 + SV14 + SV15 + SV16 + SV17 + SV18 + Subpopulation
#dds_no_inhouse_no_6144_deseq <- DESeq(ddssva_copy)
#write_rds(dds_no_inhouse_no_6144_deseq, file = "/home/kevin/Documents/PhD/subtypes/caf-subtype-analysis/dds_noinhouse_no_6144_deseq_hgnc_hasntbeenrun.Rds")
dds_no_inhouse_no_6144_deseq <- readRDS("dds_noinhouse_no6144_hgnc_25082022.Rds")
```
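
The design above lists the surrogate variables by hand. As a minimal sketch, the same kind of formula could be built from the variable names with base R's `reformulate`, so that it follows the number of surrogate variables found. The value of `n_sv` here is a hypothetical stand-in for `svseq_no6144$n.sv`.

```r
# Hypothetical number of surrogate variables (stand-in for svseq_no6144$n.sv).
n_sv <- 3
sv_names <- paste0("SV", seq_len(n_sv))

# Same term order as the hand-written design:
# covariate first, then the surrogate variables, then Subpopulation.
design_formula <- reformulate(c("Tumor_JuxtaTumor", sv_names, "Subpopulation"))
design_formula
```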


The aim here is to extract lists of genes that are upregulated in each subpopulation compared to the other 2 subpopulations. To do this, steps from the [following tutorial](https://github.com/tavareshugo/tutorial_DESeq2_contrasts/blob/main/DESeq2_contrasts.md) were followed.

```{r Define model matrix}
# define model matrix
mod_mat <- model.matrix(design(dds_no_inhouse_no_6144_deseq), colData(dds_no_inhouse_no_6144))
# for each group of samples (e.g. all S1 samples), average the rows of the model matrix to get the weight of each coefficient in that group
S1 <- colMeans(mod_mat[dds_no_inhouse_no_6144_deseq$Subpopulation == "S1",])
S3 <- colMeans(mod_mat[dds_no_inhouse_no_6144_deseq$Subpopulation == "S3",])
S4 <- colMeans(mod_mat[dds_no_inhouse_no_6144_deseq$Subpopulation == "S4",])
not_S1 <- colMeans(mod_mat[dds_no_inhouse_no_6144_deseq$Subpopulation %in% c("S3", "S4"),])
not_S3 <- colMeans(mod_mat[dds_no_inhouse_no_6144_deseq$Subpopulation %in% c("S1", "S4"),])
not_S4 <- colMeans(mod_mat[dds_no_inhouse_no_6144_deseq$Subpopulation %in% c("S1", "S3"),])
```
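
A toy example may make the `colMeans` step easier to follow. This sketch uses a made-up design with only Subpopulation in the formula (no covariates or surrogate variables) and shows the numeric contrast that results from subtracting the two weight vectors.

```r
# Toy design: 6 samples, 3 subpopulations (labels reused from the text, data made up).
toy <- data.frame(Subpopulation = factor(c("S1", "S1", "S3", "S3", "S4", "S4")))
mm_toy <- model.matrix(~ Subpopulation, toy)

# Average model matrix rows within each group of samples.
S1_w     <- colMeans(mm_toy[toy$Subpopulation == "S1", ])
not_S1_w <- colMeans(mm_toy[toy$Subpopulation %in% c("S3", "S4"), ])

# Weights applied to (Intercept), SubpopulationS3 and SubpopulationS4.
contrast_S1 <- S1_w - not_S1_w
contrast_S1
# (Intercept) = 0, SubpopulationS3 = -0.5, SubpopulationS4 = -0.5
```

With S1 as the reference level, the contrast compares S1 with the average of S3 and S4.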

From the DESeq2 paper (thanks to Dónal for pointing it out): the default null hypothesis of the `results` function is that the true log fold change (LFC) is exactly zero. Rejecting this null hypothesis does not mean that the change is large enough to be biologically meaningful. A common workaround is to filter the significant genes afterwards using an LFC cutoff, e.g. LFC >= 2. However, there is a better way of doing this. We can incorporate the threshold into the test itself, i.e. we can test whether the true LFC is greater than 2, so that the adjusted p-values refer to this new null hypothesis (LFC <= 2).

```{r Extract results for each pairwise contrast}
res_not_S1 <- results(dds_no_inhouse_no_6144_deseq, contrast = S1 - not_S1, filterFun = ihw, alpha = 0.05, lfcThreshold = 2, altHypothesis = "greater")
res_not_S3 <- results(dds_no_inhouse_no_6144_deseq, contrast = S3 - not_S3, filterFun = ihw, alpha = 0.05, lfcThreshold = 2, altHypothesis = "greater")
res_not_S4 <- results(dds_no_inhouse_no_6144_deseq, contrast = S4 - not_S4, filterFun = ihw, alpha = 0.05, lfcThreshold = 2, altHypothesis = "greater")
```
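
The difference between the two null hypotheses can be sketched with a Wald-type z statistic in base R. The estimates and standard errors below are made up, and the sketch assumes that the thresholded one-sided p-value is the upper tail of (LFC - threshold) / SE. It is an illustration of the idea rather than the DESeq2 code.

```r
lfc <- c(geneA = 2.5, geneB = 2.5, geneC = 4.0)   # made-up log2 fold changes
se  <- c(geneA = 0.2, geneB = 1.0, geneC = 0.5)   # made-up standard errors
threshold <- 2

# H0: LFC = 0 (two-sided) versus H0: LFC <= threshold (one-sided, "greater")
p_zero   <- 2 * pnorm(abs(lfc) / se, lower.tail = FALSE)
p_thresh <- pnorm((lfc - threshold) / se, lower.tail = FALSE)
round(cbind(lfc, se, p_zero, p_thresh), 4)
```

In this toy example geneB is significant against zero but not against the threshold, because its estimate is too uncertain to show that the LFC exceeds 2.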

```{r Extract shrunken LFC for each contrast}
res_not_S1_shrink <- lfcShrink(dds_no_inhouse_no_6144_deseq, contrast = S1 - not_S1, res = res_not_S1, type = "ashr")
res_not_S3_shrink <- lfcShrink(dds_no_inhouse_no_6144_deseq, contrast = S3 - not_S3, res = res_not_S3, type = "ashr")
res_not_S4_shrink <- lfcShrink(dds_no_inhouse_no_6144_deseq, contrast = S4 - not_S4, res = res_not_S4, type = "ashr")
```

```{r Extract upregulated genes}
res_not_S1_shrink_extract <- get_upregulated(res_not_S1_shrink)
res_not_S3_shrink_extract <- get_upregulated(res_not_S3_shrink)
res_not_S4_shrink_extract <- get_upregulated(res_not_S4_shrink)
```

```{r}
dfs_to_filter <- list(res_not_S1_shrink_extract, res_not_S3_shrink_extract, res_not_S4_shrink_extract)
dfs_filtered <- filter_dfs_antijoin_rownames(dfs_to_filter)
names(dfs_filtered) <- c("S1", "S3", "S4")
s1_filtered <- dfs_filtered[[1]]
s1_filtered_annotations <- annotate_de_genes(s1_filtered, filter_by = "hgnc_symbol")
s3_filtered <- dfs_filtered[[2]]
s3_filtered_annotations <- annotate_de_genes(s3_filtered, filter_by = "hgnc_symbol")
s4_filtered <- dfs_filtered[[3]]
s4_filtered_annotations <- annotate_de_genes(s4_filtered, filter_by = "hgnc_symbol")
```

```{r Extract gene sets}
s1_gs <- s1_filtered_annotations$Gene
s3_gs <- s3_filtered_annotations$Gene
s4_gs <- s4_filtered_annotations$Gene
genesets <- list(s1_gs, s3_gs, s4_gs)
names(genesets) <- c("S1", "S3", "S4")
```

## Gene set variation analysis (GSVA) for gene signature identification

```{r Get inhouse expression data}
idx <- which(dds$Study == "InHouse")
dds_inhouse <- dds[,idx]
```

```{r}
inhouse_metadata <- read.csv(here("intermediate_files/metadata/reformat_samples_extra_info.csv"))
caf_es <- gsva(expr = dds_inhouse,
               gset.idx.list = genesets,
               method = "gsva",
               kcdf = "Poisson",
               mx.diff=FALSE)
caf_es$Patient <- inhouse_metadata$Patient
caf_es$Subtype <- inhouse_metadata$Subtype
caf_es$Grade <- inhouse_metadata$Grade
caf_es$Histology <- inhouse_metadata$Histology
```

```{r Plot heatmaps}
subpopulationOrder <- c("tumor", "juxtatumor")
sampleOrderBySubpopulation <- sort(match(caf_es$Tumor_JuxtaTumor, subpopulationOrder),
                             index.return=TRUE)$ix
subpopulationXtable <- table(caf_es$Tumor_JuxtaTumor)
subpopulationColorLegend <- c(tumor="red", juxtatumor="green")
geneSetOrder <- c("S4", "S3", "S1")
geneSetLabels <- geneSetOrder
hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol <- hmcol[length(hmcol):1]
#png(filename = "/home/kevin/Documents/PhD/subtypes/caf-subtype-analysis/caf_rnaseq_combined_analysis_files/figure-gfm/gsva_subpopulation_caf_tan.png")
heatmap(assay(caf_es)[geneSetOrder, sampleOrderBySubpopulation], Rowv=NA,
        Colv=NA, scale="row", margins=c(3,5), col=hmcol,
        ColSideColors=rep(subpopulationColorLegend[subpopulationOrder],
                          times=subpopulationXtable[subpopulationOrder]),
        labCol="", 
        caf_es$Tumor_JuxtaTumor[sampleOrderBySubpopulation],
        labRow=paste(toupper(substring(geneSetLabels, 1,1)),
                     substring(geneSetLabels, 2), sep=""),
        cexRow=2, main=" \n ")
par(xpd=TRUE)
text(0.285,1.1, "CAF", col="red", cex=1.2)
text(0.55,1.1, "TAN", col="green", cex=1.2)
#text(0.17,1.13, "CAF", col="red", cex=1.2)
#text(0.65,1.13, "TAN", col="green", cex=1.2)
mtext("CAF subpopulation signature", side=4, line=0, cex=1.5)
mtext("Samples", side=1, line=4, cex=1.5, at = 0.42)
#dev.off()
subtypeOrder <- c("LuminalA", "TNBC")
sampleOrderBySubtype <- sort(match(caf_es$Subtype, subtypeOrder),
                             index.return=TRUE)$ix
subtypeXtable <- table(caf_es$Subtype)
subtypeColorLegend <- c(LuminalA="red", TNBC="green")
geneSetOrder <- c("S4", "S3", "S1")
geneSetLabels <- geneSetOrder
hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol <- hmcol[length(hmcol):1]

#png(filename = "/home/kevin/Documents/PhD/subtypes/caf-subtype-analysis/caf_rnaseq_combined_analysis_files/figure-gfm/gsva_cancer_subtype.png")
heatmap(assay(caf_es)[geneSetOrder, sampleOrderBySubtype], Rowv=NA,
        Colv=NA, scale="row", margins=c(3,5), col=hmcol,
        ColSideColors=rep(subtypeColorLegend[subtypeOrder],
                          times=subtypeXtable[subtypeOrder]),
        labCol="", 
        #labCol = caf_es$Patient[sampleOrderBySubtype],
        caf_es$Subtype[sampleOrderBySubtype],
        labRow=paste(toupper(substring(geneSetLabels, 1,1)),
                     substring(geneSetLabels, 2), sep=""),
        cexRow=2, main=" \n ")
par(xpd=TRUE)
text(0.4,1.1, "LuminalA", col="red", cex=1.2)
text(0.65,1.1, "TNBC", col="green", cex=1.2)
#text(0.35,1.13, "LuminalA", col="red", cex=1.2)
#text(0.85,1.13, "TNBC", col="green", cex=1.2)
mtext("CAF subpopulation signature", side=4, line=0, cex=1.5)
mtext("Samples", side=1, line=4, cex=1.5, at = 0.42)
#dev.off()

gradeOrder <- c("Grade_2", "Grade_3")
sampleOrderByGrade <- sort(match(caf_es$Grade, gradeOrder),
                             index.return=TRUE)$ix
gradeXtable <- table(caf_es$Grade)
gradeColorLegend <- c(Grade_2="red", Grade_3="green")
geneSetOrder <- c("S4", "S3", "S1")
geneSetLabels <- geneSetOrder
hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol <- hmcol[length(hmcol):1]

#png(filename = "/home/kevin/Documents/PhD/subtypes/caf-subtype-analysis/caf_rnaseq_combined_analysis_files/figure-gfm/gsva_cancer_grade.png")
heatmap(assay(caf_es)[geneSetOrder, sampleOrderByGrade], Rowv=NA,
        Colv=NA, scale="row", margins=c(3,5), col=hmcol,
        ColSideColors=rep(gradeColorLegend[gradeOrder],
                          times=gradeXtable[gradeOrder]),
        labCol="", 
        #labCol = caf_es$Patient[sampleOrderByGrade],
        caf_es$Grade[sampleOrderByGrade],
        labRow=paste(toupper(substring(geneSetLabels, 1,1)),
                     substring(geneSetLabels, 2), sep=""),
        cexRow=2, main=" \n ")
par(xpd=TRUE)
text(0.34,1.1, "Grade 2", col="red", cex=1.2)
text(0.6,1.1, "Grade 3", col="green", cex=1.2)
#text(0.25,1.13, "Grade 2", col="red", cex=1.2)
#text(0.74,1.13, "Grade 3", col="green", cex=1.2)
mtext("CAF subpopulation signature", side=4, line=0, cex=1.5)
mtext("Samples", side=1, line=4, cex=1.5, at = 0.42)
#dev.off()
```


```{r}
caf_idx <- which(colData(caf_es)$Tumor_JuxtaTumor == "tumor")
tan_idx <- which(colData(caf_es)$Tumor_JuxtaTumor == "juxtatumor")
df_plot_S1 <- cbind.data.frame(S1_ES = assay(caf_es)[1,], Tumor_JuxtaTumor = colData(caf_es)$Tumor_JuxtaTumor)
gsva_plot_s1 <- ggplot(df_plot_S1, aes(x = Tumor_JuxtaTumor, y = S1_ES)) +
  geom_point(size = 2,  # reduce point size to minimize overplotting 
    position = position_jitter(
      width = 0.1,  # amount of jitter in horizontal direction
      height = 0     # amount of jitter in vertical direction (0 = none)
    )
  ) +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_blank(), 
        axis.line = element_line(colour = "black"), 
        axis.title.x = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  ylab("S1 GSVA ES")  #, axis.title.x = element_blank())

df_plot_S3 <- cbind.data.frame(S3_ES = assay(caf_es)[2,], Tumor_JuxtaTumor = colData(caf_es)$Tumor_JuxtaTumor)
gsva_plot_s3 <- ggplot(df_plot_S3, aes(x = Tumor_JuxtaTumor, y = S3_ES)) +
  geom_point(size = 2,  # reduce point size to minimize overplotting 
    position = position_jitter(
      width = 0.1,  # amount of jitter in horizontal direction
      height = 0     # amount of jitter in vertical direction (0 = none)
    )
  ) +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_blank(), 
        axis.line = element_line(colour = "black"), 
        axis.title.x = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  ylab("S3 GSVA ES")  #, axis.title.x = element_blank())

df_plot_S4 <- cbind.data.frame(S4_ES = assay(caf_es)[3,], Tumor_JuxtaTumor = colData(caf_es)$Tumor_JuxtaTumor)
gsva_plot_s4 <- ggplot(df_plot_S4, aes(x = Tumor_JuxtaTumor, y = S4_ES)) +
  geom_point(size = 2,  # reduce point size to minimize overplotting 
    position = position_jitter(
      width = 0.1,  # amount of jitter in horizontal direction
      height = 0     # amount of jitter in vertical direction (0 = none)
    )
  ) +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_blank(), 
        axis.line = element_line(colour = "black"), 
        axis.title.x = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  ylab("S4 GSVA ES")  #, axis.title.x = element_blank())
```

```{r}
distribution_es_s1 <- df_plot_S1 %>% 
ggplot(mapping = aes(S1_ES)) +
  geom_density() +
  facet_wrap(~Tumor_JuxtaTumor) +
  theme_linedraw()

distribution_es_s3 <- df_plot_S3 %>% 
ggplot(mapping = aes(S3_ES)) +
  geom_density() +
  facet_wrap(~Tumor_JuxtaTumor) +
  theme_linedraw()

distribution_es_s4 <- df_plot_S4 %>% 
ggplot(mapping = aes(S4_ES)) +
  geom_density() +
  facet_wrap(~Tumor_JuxtaTumor) +
  theme_linedraw()
```

```{r, include = FALSE}
top_row <- plot_grid(gsva_plot_s1, gsva_plot_s3, gsva_plot_s4,
      ncol = 3,
      labels = c('A', 'B', 'C'),
      label_fontfamily = 'serif',
      label_fontface = 'bold',
      label_size = 15,
      align = 'h',
      rel_widths = c(1.10, 0.80, 1.10))

bottom_row <- plot_grid(distribution_es_s1, distribution_es_s3, distribution_es_s4,
      ncol = 3,
      labels = c('D', 'E', 'F'),
      label_fontfamily = 'serif',
      label_fontface = 'bold',
      label_size = 22,
      align = 'h',
      # one relative width per column, matching the top row
      rel_widths = c(1.10, 0.80, 1.10))
```

```{r}
 plot_grid(top_row, bottom_row, nrow = 2,
      rel_heights = c(1.1, 0.9))
```

```{r Shapiro Wilk test for normality}
df_plot_S1$S1_ES %>% 
shapiro.test()

df_plot_S3$S3_ES %>% 
shapiro.test()

df_plot_S4$S4_ES %>% 
shapiro.test()
```

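Where the Shapiro-Wilk p-value is < 0.05, the enrichment scores are not normally distributed and a t-test may be unsuitable for comparing the two groups, so a Wilcoxon test is carried out instead. With `paired = TRUE`, `wilcox.test` performs the Wilcoxon signed-rank test (the paired counterpart of the rank-sum, or Mann-Whitney U, test), which assumes that the tumour and juxta-tumour scores are listed in the same patient order.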

```{r}
wilcox.test(x = df_plot_S1$S1_ES[df_plot_S1$Tumor_JuxtaTumor == "tumor"], 
            y = df_plot_S1$S1_ES[df_plot_S1$Tumor_JuxtaTumor == "juxtatumor"],
            paired = TRUE)
wilcox.test(x = df_plot_S3$S3_ES[df_plot_S3$Tumor_JuxtaTumor == "tumor"], 
            y = df_plot_S3$S3_ES[df_plot_S3$Tumor_JuxtaTumor == "juxtatumor"],
            paired = TRUE)
wilcox.test(x = df_plot_S4$S4_ES[df_plot_S4$Tumor_JuxtaTumor == "tumor"], 
            y = df_plot_S4$S4_ES[df_plot_S4$Tumor_JuxtaTumor == "juxtatumor"],
            paired = TRUE)
```
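
The `paired` argument decides which Wilcoxon test `wilcox.test` runs. The following sketch uses made-up enrichment scores for six hypothetical patients to show the difference. It assumes that element *i* of each vector comes from the same patient.

```r
# Made-up enrichment scores; element i of each vector is the same (hypothetical) patient.
tumor      <- c(0.10, -0.20, 0.35, 0.05, -0.15, 0.25)
juxtatumor <- c(0.32, -0.01, 0.61, 0.37,  0.22, 0.70)

paired_res   <- wilcox.test(tumor, juxtatumor, paired = TRUE)   # signed-rank test
unpaired_res <- wilcox.test(tumor, juxtatumor, paired = FALSE)  # rank-sum test

c(paired = paired_res$p.value, unpaired = unpaired_res$p.value)
```

Every patient has a lower tumour score here, so the paired test gives the smaller p-value. If the two vectors were not in matching patient order, the paired result would not be meaningful.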

```{r Find medians of enrichment scores in S1 and S3 in CAF and TAN}
median(df_plot_S1$S1_ES[df_plot_S1$Tumor_JuxtaTumor == "tumor"])
median(df_plot_S1$S1_ES[df_plot_S1$Tumor_JuxtaTumor == "juxtatumor"])
median(df_plot_S3$S3_ES[df_plot_S3$Tumor_JuxtaTumor == "tumor"])
median(df_plot_S3$S3_ES[df_plot_S3$Tumor_JuxtaTumor == "juxtatumor"])
median(df_plot_S4$S4_ES[df_plot_S4$Tumor_JuxtaTumor == "tumor"])
median(df_plot_S4$S4_ES[df_plot_S4$Tumor_JuxtaTumor == "juxtatumor"])
```

We can see from the above that the enrichment scores for the S1 and S3 gene signatures differ between CAF and TAN. This suggests that the TAN (juxta-tumour) samples are enriched for our S1 and S3 gene signatures relative to the genes outside the signatures, and that this enrichment is greater in the TAN samples than in the CAF samples.

```{r}
results_combined <- cbind.data.frame(colData(caf_es)$names, colData(caf_es)$Patient, df_plot_S1$S1_ES, df_plot_S3$S3_ES, df_plot_S4$S4_ES, df_plot_S1$Tumor_JuxtaTumor, colData(caf_es)$Subtype, colData(caf_es)$Grade, colData(caf_es)$Histology)
colnames(results_combined) <- c("Sample", "Patient", "S1_ES", "S3_ES", "S4_ES", "Tumor_JuxtaTumor", "Subtype", "Grade", "Histology")
results_combined
```

# Chemoresistance gene signature

@Su2018 carried out differential expression analysis between chemoresistant and chemosensitive CAFs from breast cancer. We can try to identify the resulting signature in our samples. We will use the list of genes upregulated in the chemoresistant samples as our marker of chemoresistance and the genes downregulated in the chemoresistant cells as our marker of chemosensitivity.

The `HGNChelper` package can be used to fix outdated gene symbols.

```{r read in gene signatures, include=FALSE}
gene_signature <- read.csv(here("1-s2.0-S0092867418300448-mmc1.csv"), skip = 1)
gene_signature_loc <- gene_signature[str_detect(gene_signature$Gene.name, pattern = "^LOC\\d+$"),]
# write to file and look up loc gene signatures manually, read back in as gene_signature_loc_official
#write.table(gene_signature_loc, file = "/home/kevin/Documents/PhD/rna_seq_bc/gene_signature/chemo_signature_loc.txt", quote = F, row.names = F)
gene_signature_loc_official <- read.table(here("intermediate_files/chemo_signature_loc_official_names.txt"), header = T)
gene_signature_proper_names <- gene_signature
gene_signature_proper_names$Official_name <- gene_signature$Gene.name
idx <- match(gene_signature_loc$Gene.name, gene_signature$Gene.name)
gene_signature_proper_names$Official_name[idx] <- gene_signature_loc_official$Official_symbol
gene_signature_proper_names <- drop_na(gene_signature_proper_names)
# drop rows without a gene name; a logical filter is safe even when no row is "N/A"
gene_signature_proper_names <- gene_signature_proper_names[gene_signature_proper_names$Gene.name != "N/A",]
gene_signature_hgnc <- checkGeneSymbols(gene_signature_proper_names$Official_name, species = "human")
gene_signature_hgnc$Suggested.Symbol[which(gene_signature_hgnc$x == "CCRL1")] <- NA
gene_signature_hgnc$Suggested.Symbol[which(gene_signature_hgnc$x == "FLJ40504")] <- NA
gene_signature_hgnc$Suggested.Symbol[which(gene_signature_hgnc$x == "LETR1")] <- "LETR1"
gene_signature_hgnc <- drop_na(gene_signature_hgnc)
colnames(gene_signature_hgnc) <- c("Gene.name", "Approved", "Suggested.Symbol")
gene_signature_join <- full_join(gene_signature, gene_signature_hgnc, by = "Gene.name")
gene_signature_chemoresistance <- gene_signature_join$Suggested.Symbol[which(gene_signature_join$Regulation == "up")]
gene_signature_chemoresistance <- gene_signature_chemoresistance[!is.na(gene_signature_chemoresistance)]
gene_signature_chemosensitivity <- gene_signature_join$Suggested.Symbol[which(gene_signature_join$Regulation == "down")]
gene_signature_chemosensitivity<- gene_signature_chemosensitivity[!is.na(gene_signature_chemosensitivity)]
genesets_chemo <- list(gene_signature_chemoresistance, gene_signature_chemosensitivity)
names(genesets_chemo) <- c("Resist.", "Sensit.")
```

```{r GSVA for chemoresistance signatures, include=FALSE}
chemo_es <- gsva(expr = dds_inhouse,
               gset.idx.list = genesets_chemo,
               method = "gsva",
               kcdf = "Poisson",
               mx.diff=FALSE)
chemo_es$Patient <- inhouse_metadata$Patient
chemo_es$Subtype <- inhouse_metadata$Subtype
chemo_es$Grade <- inhouse_metadata$Grade
chemo_es$Histology <- inhouse_metadata$Histology
```

```{r}
assay(chemo_es)
```

```{r}
#rownames(assay(chemo_es)) <- c("Resistance", "Sensitive")
subpopulationOrder <- c("tumor", "juxtatumor")
sampleOrderBySubpopulation <- sort(match(chemo_es$Tumor_JuxtaTumor, subpopulationOrder),
                             index.return=TRUE)$ix
subpopulationXtable <- table(chemo_es$Tumor_JuxtaTumor)
subpopulationColorLegend <- c(tumor="red", juxtatumor="green")
geneSetOrder <- c("Resist.", "Sensit.")
geneSetLabels <- geneSetOrder
hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol <- hmcol[length(hmcol):1]

#png(filename = "/home/kevin/Documents/PhD/subtypes/caf-subtype-analysis/caf_rnaseq_combined_analysis_files/figure-gfm/gsva_chemoresistance_caf_tan.png")
heatmap(assay(chemo_es)[geneSetOrder, sampleOrderBySubpopulation], Rowv=NA,
        Colv=NA, scale="row", margins=c(3,5), col=hmcol,
        ColSideColors=rep(subpopulationColorLegend[subpopulationOrder],
                          times=subpopulationXtable[subpopulationOrder]),
        labCol="", 
        chemo_es$Tumor_JuxtaTumor[sampleOrderBySubpopulation],
        labRow=paste(toupper(substring(geneSetLabels, 1,1)),
                     substring(geneSetLabels, 2), sep=""),
        cexRow=2, main=" \n ")
par(xpd=TRUE)
text(0.285,1.1, "CAF", col="red", cex=1.2)
text(0.55,1.1, "TAN", col="green", cex=1.2)
#text(0.16,1.13, "CAF", col="red", cex=1.2)
#text(0.65,1.13, "TAN", col="green", cex=1.2)
mtext("Gene sets", side=4, line=0, cex=1.5)
mtext("Samples", side=1, line=4, cex=1.5, at = 0.42)
#dev.off()

subtypeOrder <- c("LuminalA", "TNBC")
sampleOrderBySubtype <- sort(match(chemo_es$Subtype, subtypeOrder),
                             index.return=TRUE)$ix
subtypeXtable <- table(chemo_es$Subtype)
subtypeColorLegend <- c(LuminalA="red", TNBC="green")
geneSetOrder <- c("Resist.", "Sensit.")
geneSetLabels <- geneSetOrder
hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol <- hmcol[length(hmcol):1]

#png(filename = "/home/kevin/Documents/PhD/subtypes/caf-subtype-analysis/caf_rnaseq_combined_analysis_files/figure-gfm/gsva_chemoresistance_cancer_subtype.png")
heatmap(assay(chemo_es)[geneSetOrder, sampleOrderBySubtype], Rowv=NA,
        Colv=NA, scale="row", margins=c(3,5), col=hmcol,
        ColSideColors=rep(subtypeColorLegend[subtypeOrder],
                          times=subtypeXtable[subtypeOrder]),
        labCol="", 
        #labCol = caf_es$Patient[sampleOrderBySubtype],
        chemo_es$Subtype[sampleOrderBySubtype],
        labRow=paste(toupper(substring(geneSetLabels, 1,1)),
                     substring(geneSetLabels, 2), sep=""),
        cexRow=2, main=" \n ")
par(xpd=TRUE)
text(0.4,1.1, "LuminalA", col="red", cex=1.2)
text(0.65,1.1, "TNBC", col="green", cex=1.2)
#text(0.35,1.13, "LuminalA", col="red", cex=1.2)
#text(0.85,1.13, "TNBC", col="green", cex=1.2)
mtext("Gene sets", side=4, line=0, cex=1.5)
mtext("Samples", side=1, line=4, cex=1.5, at = 0.42)
#dev.off()

gradeOrder <- c("Grade_2", "Grade_3")
sampleOrderByGrade <- sort(match(chemo_es$Grade, gradeOrder),
                             index.return=TRUE)$ix
gradeXtable <- table(chemo_es$Grade)
gradeColorLegend <- c(Grade_2="red", Grade_3="green")
geneSetOrder <- c("Resist.", "Sensit.")
geneSetLabels <- geneSetOrder
hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol <- hmcol[length(hmcol):1]

#png(filename = "/home/kevin/Documents/PhD/subtypes/caf-subtype-analysis/caf_rnaseq_combined_analysis_files/figure-gfm/gsva_chemoresistance_cancer_grade.png")
heatmap(assay(chemo_es)[geneSetOrder, sampleOrderByGrade], Rowv=NA,
        Colv=NA, scale="row", margins=c(3,5), col=hmcol,
        ColSideColors=rep(gradeColorLegend[gradeOrder],
                          times=gradeXtable[gradeOrder]),
        labCol="", 
        #labCol = caf_es$Patient[sampleOrderByGrade],
        chemo_es$Grade[sampleOrderByGrade],
        labRow=paste(toupper(substring(geneSetLabels, 1,1)),
                     substring(geneSetLabels, 2), sep=""),
        cexRow=2, main=" \n ")
par(xpd=TRUE)
text(0.34,1.1, "Grade 2", col="red", cex=1.2)
text(0.6,1.1, "Grade 3", col="green", cex=1.2)
#text(0.25,1.13, "Grade 2", col="red", cex=1.2)
#text(0.75,1.13, "Grade 3", col="green", cex=1.2)
mtext("Gene sets", side=4, line=0, cex=1.5)
mtext("Samples", side=1, line=4, cex=1.5, at = 0.42)
#dev.off()
```

```{r}
# ratio of the absolute resistance score to the absolute sensitivity score for each sample
chemoresistance_ratio <- function(es){
  as_tibble(t(assay(es))) %>%
    mutate(Patient = colData(es)$Patient,
           Chemoresistance_ratio = abs(Resist.)/abs(Sensit.))
}
chemoresistance_ratio(chemo_es) %>% print(n = 24)
```

# Deconvolution using CIBERSORTx

CIBERSORTx [@Newman2019] is one of the most widely used tools for cell-type deconvolution. It is a machine learning method based on support vector regression, which carries out batch correction, correcting for between-platform differences in expression data. In this case, the signature matrix was made using all of the available CAF subpopulation data (40 S1 samples, 25 S3 samples, 24 S4 samples). It is possible to infer the proportions of the different subpopulations as well as a subpopulation-specific gene expression matrix for each sample. scRNA-seq, bulk RNA-seq or microarray data can be used as the reference.

- Create signature matrix for CIBERSORTx
- TPM normalise data (probably optional)
- Run CIBERSORTx to figure out proportions of S1, S3 and S4 in In-house samples
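
The idea behind estimating proportions from a signature matrix can be sketched in base R. This is not the CIBERSORTx algorithm: it swaps the support vector regression for a simple non-negative least squares fit via `optim`, and the signature matrix, mixture and true fractions are all made up.

```r
set.seed(7)
# Made-up signature matrix: 50 genes x 3 subpopulations.
sig <- matrix(rexp(50 * 3, rate = 0.1), nrow = 50,
              dimnames = list(NULL, c("S1", "S3", "S4")))
true_frac <- c(S1 = 0.6, S3 = 0.1, S4 = 0.3)
# Simulated bulk sample: weighted sum of the signatures plus noise.
mixture <- as.vector(sig %*% true_frac) + rnorm(50, sd = 0.5)

# Non-negative least squares via box-constrained optimisation
# (a stand-in for the support vector regression used by CIBERSORTx).
sse  <- function(w) sum((mixture - sig %*% w)^2)
grad <- function(w) as.vector(-2 * t(sig) %*% (mixture - sig %*% w))
fit <- optim(par = rep(1 / 3, 3), fn = sse, gr = grad,
             method = "L-BFGS-B", lower = 0)

# Rescale the weights so that the estimated fractions sum to 1.
est_frac <- setNames(fit$par / sum(fit$par), colnames(sig))
round(est_frac, 2)
```

A subpopulation whose signature is not well separated from the others is harder to recover with this kind of fit.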

The files for CIBERSORTx (mixture file, reference data and phenotype classes file) were prepared using the `cibersortx_prepare_files.R` script.

`CIBERSORTx` was run using the following command:
```
docker run -v /home/kevin/Documents/PhD/cibersort/caf_subpopulation/infiles:/src/data -v /home/kevin/Documents/PhD/cibersort/caf_subpopulation/outfiles:/src/outdir cibersortx/fractions --username k.ryan45@nuigalway.ie --token b7f03b943ade9b4146dc2126b4ac9d19 --single_cell FALSE --refsample caf_subtypes_tpm_for_sig_matrix.txt --mixture caf_tpm_mixture.txt --rmbatchBmode TRUE --outdir /home/kevin/Documents/PhD/cibersort/caf_subpopulation/outfiles --phenoclasses /home/kevin/Documents/PhD/cibersort/caf_subpopulation/infiles/phenoclasses_caf.txt
```

```{r CIBERSORT initial results Docker}
cibersort_results <- read.table(here("intermediate_files/cibersort/CIBERSORTx_Adjusted.txt"), header = T)
cibersort_results
```

The online version of CIBERSORTx was also used, using 500 permutations. The study `EGAD00001006144` was excluded from the signature matrix due to its probable outlier status.

```{r CIBERSORTx online results}
cibersort_results_online <- read.csv(here("intermediate_files/cibersort/2022-09-01-CIBERSORTx_Job6_Results_online_no6144_hgnc.csv"), header = T)
cibersort_results_online 
```

```{r, include = FALSE}
cibersort_results_long <- pivot_longer(cibersort_results, cols = c(S1, S3, S4), names_to = "Subpopulation")
cibersort_results_long$Mixture <- as.character(cibersort_results_long$Mixture)
cibersort_plot_docker <- ggplot(cibersort_results_long, aes(x = Mixture, y = value, fill = `Subpopulation`)) + 
  geom_col() + 
  ggtitle("Docker") + 
  theme(plot.title = element_text(hjust = 0.5), axis.text = element_text(size = 4, angle = 90))+ 
  xlab("Mixture") + 
  ylab("Proportion")
cibersort_plot_docker
```

```{r, include = FALSE}
cibersort_results_long_online <- pivot_longer(cibersort_results_online, cols = c(S1, S3, S4), names_to = "Subpopulation")
cibersort_results_long_online$Mixture <- as.character(cibersort_results_long_online$Mixture)
cibersort_plot_online <- ggplot(cibersort_results_long_online, 
                                aes(x = as.character(Mixture), y = value, fill = `Subpopulation`)) +
  geom_col() + 
  ggtitle("Online") +   
  theme(plot.title = element_text(hjust = 0.5),  axis.text = element_text(size = 4, angle = 90)) +
  xlab("Mixture") + 
  ylab("Proportion")

```

```{r, include = FALSE}
legend <- get_legend(
  # create some space to the left of the legend
  cibersort_plot_docker + 
    theme(legend.box.margin = margin(0, 0, 0, 12),
          legend.key.size = unit(0.5, 'cm'),
          legend.key.height = unit(0.5, 'cm'),
          legend.key.width = unit(0.5, 'cm'),
          legend.title = element_text(size=8),
          legend.text = element_text(size=8)
          )
)
cibersort_grid <- plot_grid(cibersort_plot_online + theme(legend.position = "none"),
                            cibersort_plot_docker + theme(legend.position = "none"),
                             ncol = 2,
      labels = c('A', 'B'),
      label_fontfamily = 'serif',
      label_fontface = 'bold',
      label_size = 15,
      align = 'h')
cibersort_grid <- plot_grid(cibersort_grid, legend, rel_widths = c(3,.5))

ggsave(filename = here("outfiles/cibersort_plot.png"), 
       plot = cibersort_grid,
       width = 20,
       height = 10,
       units = "cm")
```

```{r CIBERSORT plots}
cibersort_grid
```

Using the deconvolution approach with CIBERSORTx, the S3 subpopulation appears to be absent. It is possible that the S3 subpopulation does not have a very well defined transcriptional profile.

The results look strange, with a reported p-value of 9999. CIBERSORTx was run using both the Docker image and the GUI, and different results were obtained. When the Docker image was used and the number of permutations was changed from 0 to 100, the p-value changed to 0.000. The proportions also differ between the two methods, with the GUI predicting about 0.75 S1, with the rest being S4, for most samples.


# References