Add complex heatmap

lcolladotor · May 28, 2024 · 27582dd · 27582dd
1 parent 3c6a771
commit 27582dd
Show file tree

Hide file tree

Showing 6 changed files with 80 additions and 15 deletions.
diff --git a/.DS_Store b/.DS_Store
diff --git a/20_DGE_analysis_overview.Rmd b/20_DGE_analysis_overview.Rmd
@@ -93,22 +93,22 @@ Although here we describe EDA as being comprised by QCA, dimensionality reductio
 
 
 #### Quality Control Analysis (QCA)
-First, the quality metrics of the samples regarding read and RNA contents, and read mapping rates have to be compared to (**Figure 2**: step 1):
+First, the quality metrics of the samples regarding read and RNA contents, and read mapping rates have to be compared to (**Figure 3**: step 1):
 
  1) Identify punctual samples or groups of samples of poor quality that may have arisen by technical causes during experimental steps.  
  2) Evaluate if samples from the groups of interest for DGE (diagnostic, treatment, etc.) differ in their quality metrics as these can represent confounding factors for differential expression. 
  3) Detect high biological variability to subsequently support data partition to perform subanalyses of the data.  
 
-Further, we are also interested in investigating trends and relationships between sample variables to unveil underlying technical and biological aspects of the observed data (**Figure 2**: step 2). 
+Further, we are also interested in investigating trends and relationships between sample variables to unveil underlying technical and biological aspects of the observed data (**Figure 3**: step 2). 
 
 After having identified poor-quality samples, we have to remove them to not include the unreliable expression data they provide in downstream analyses. Cutoffs can be defined for specific QC metrics to decide which samples to keep; this however, is not strongly recommended as no consolidated references exist to define them and therefore rather represent arbitrary values. 
-Other approaches include identifying outlier QC metrics (**Figure 2**: step 3), but again, caution must be taken as outlier definition is also arbitrary and we could be discarding good-quality samples.  
+Other approaches include identifying outlier QC metrics (**Figure 3**: step 3), but again, caution must be taken as outlier definition is also arbitrary and we could be discarding good-quality samples.  
 
 <figure>
     <img src="Figures/QCA.png" width="700px" align=center />
         <figcaption style="color: gray; line-height: 0.9; text-align: justify">
             <font size="-1.8">
-                <b>Figure 2</b>: <b> Quality Control Analysis steps. </b> 
+                <b>Figure 3</b>: <b> Quality Control Analysis steps. </b> 
 
 <b> 1. Evaluate QC metrics for groups of samples</b>: sample QC metrics such as the fraction of reads that mapped to the mitochondrial chromosome (`mitoRate`) and to the reference genome (`overallMapRate`) are compared between sample groups given by the variable of interest (`Group` in this example), technical variables (e.g. `plate` for sample library preparation), and biological variables (e.g. `Age`). 
 
@@ -127,13 +127,13 @@ Other approaches include identifying outlier QC metrics (**Figure 2**: step 3),
 
 #### Exploration of sample-level effects
 
-Sample gene expression profiles can be analyzed and compared after dimensionality reduction procedures such as Principal Component Analysis (PCA) and Multidimensional-Scaling (MDS). These analyses are useful to potentially detect samples with outlier transcriptomic profiles to further remove and to identify sample variables driving gene expression variations (**Figure 3**).
+Sample gene expression profiles can be analyzed and compared after dimensionality reduction procedures such as Principal Component Analysis (PCA) and Multidimensional-Scaling (MDS). These analyses are useful to potentially detect samples with outlier transcriptomic profiles to further remove and to identify sample variables driving gene expression variations (**Figure 4**).
 
 <figure>
-    <img src="Figures/PCA.png" width="700px" align=center />
+    <img src="Figures/PCA.png" width="650px" align=center />
         <figcaption style="color: gray; line-height: 0.9; text-align: justify">
             <font size="-1.8">
-                <b>Figure 2</b>: <b>  Exploration of sample-level effects through PCA</b> 
+                <b>Figure 4</b>: <b>  Exploration of sample-level effects through PCA</b> 
 
 <b> 1. Detection of atypical samples (manual PCA-based sample filtering)</b>: PCx vs PCy plots can expose outlier samples that appear segregated from the rest (purple-squared sample) or samples of a particular group (`Sex`: `F` or `M`) closer to samples from the other group (blue-squared sample). These should be further examined to evalute if they can be kept or must be discarded. In this case, after removing them, PC2 that explains a higher % of variance in gene expression, separates samples by sex. 
 
@@ -145,12 +145,18 @@ Sample gene expression profiles can be analyzed and compared after dimensionalit
 
 #### Model building: covariate selection for *limma*-*voom* 
 
+DGE methods fitting linear models to gene expression data to assess if a covariate impacts significantly on the expression of a gene, require the selection of sample-level variables to model transcriptomic data. 
+If very few variables are present, normally they are all included in the model but that’s not often the case with RNA-seq and it doesn't represent a well founded strategy. Usually, multiple technical and biological variables are implicated in the experiments and sample QC metrics can affect the gene expression levels, even after count normalization, whereas other variables are redundant and/or minimally informative. Therefore, we’d like to identify an optimal set of variables to adjust gene expression for, in addition to the covariate of interest.
+
+We have already introduced one first approximation to that with PCA as this analysis allows us to identify variables explaining high percentages of gene expression variance between samples. In **Chapter XX** we will review how correlation and variance partition analyses at the gene level can help us determine a suitable set of highly explanatory variables. 
 
 
 ## Differential Gene Expression 
+Different mathematical and statistical approaches exist to compare gene expression between two or more conditions. In **Chapter XX** we'll briefly introduce methods based on the negative binomial distribution and address how to perform DGE under the empirical Bayes *limma*-*voom* framework, distinguishing how it operates, its main specifications, inputs, and outputs. 
 
 
 ## Downstream analyses
+After finding DEGs, volcano plots and heat maps are commonly used to graphically represent them, plotting relevant information about them and their expression levels, respectively. In **Chapter XX** we'll also check how to create and interpret these plots.
 
 
 

diff --git a/22_DGE_with_limma_voom.Rmd b/22_DGE_with_limma_voom.Rmd
@@ -667,9 +667,10 @@ de_genes <- top_genes[which(top_genes$adj.P.Val<0.05), ]
 dim(de_genes)
 ```
 
+### Volcano plots
 A very practical and useful plot to graphically represent DEGs and visualize their expression differences between conditions is a volcano plot. This is a scatter plot of the logFC's of the genes in the x-axis vs their adjusted *p*-values in a -log scale in the y-axis. 
 
-```{r volcano plot}
+```{r volcano plot, warning=FALSE, message=FALSE}
 library("ggplot2")
 
 ## Define up- and down-regulated DEGs, and non-DEGs
@@ -720,26 +721,45 @@ ggplot(data = top_genes,
               legend.text = element_text(size=12)) 
 ```
 
+### Heat maps
 Another common way to represent differential expression results is through a heat map. The package [*ComplexHeatmap*](https://bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html) offers a flexible toolkit to easily create heat maps with row and column annotations, a feature of particular value to plot expression data of genes across samples with multiple biological and technical differences. Although initially all genes in your data can be plotted, frequently only DEGs are included as they tend to show clearer gene expression patterns. 
 
 ```{r complexHeatmap, warning=FALSE,message=FALSE}
 library("ComplexHeatmap")
 
 ## We plot lognorm counts
 lognorm_data<- assays(rse_gene_filt)$logcounts
-colnames(lognorm_data) <- paste0("Pup_", 1:dim(rse_gene_filt)[2])
+
+## Subset to DEGs only
+lognorm_data <- lognorm_data[rownames(de_genes),]
+
+## Define column (sample) names to display
+colnames(lognorm_data) <- paste0("Pup_", 1:dim(lognorm_data)[2])
 ```
 
 
 <div style="background-color:#EDEDED; padding:20px; font-family: sans-serif; border: 1px solid black">
-🗒️ **Note**: it is important to regress out the technical variables' contributions on gene expression (if necessary and suitable; functions such as [`cleaningY()`](https://rdrr.io/github/LieberInstitute/jaffelab/man/cleaningY.html) of *jaffelab* can be used for this purpose) and to correctly scale and center (around zero) the lognorm counts to make the differences in the expression of the genes more notorious. A simple way to do that is substracting from each lognorm count $cpm_{gi}$ (from the gene $g$ and sample $i$) the mean expression of the gene across all samples* and dividing by the standard deviation of the same gene expression values. This is formally called the z-score: the number of sd away from the mean.
+🗒️ **Notes**:
+
+1) It is sometimes convenient to regress out the technical variables' contributions on gene expression to see more clearly the effects of interest. This can happen, for instance, when the logFCs are too small to see any significant differences in the plots or when there are other strong confounding factors. Functions such as [`cleaningY()`](https://rdrr.io/github/LieberInstitute/jaffelab/man/cleaningY.html) of *jaffelab* can be used for this purpose.
+
+2) The lognorm counts have to be correctly scaled and centered (around zero) to make the differences in the expression of the genes more notorious in the heat map. A simple way to do that is substracting from each lognorm count $y_{gi}$ (from the gene $g$ and sample $i$) the mean expression of the gene* and dividing by the standard deviation ($\sigma$) of the same gene expression values. This is formally called the *z*-score: the number of standard deviations away from the mean.
 
 $$
-z_{ score}=\frac{cpm_{gi} - \frac{\sum_{k=1}^{n}{cpm_{gk}}}{n}}{sd},
+z=\frac{y_{gi} - \frac{\sum_{k=1}^{n}{y_{gk}}}{n}}{\sigma},
 $$
 $n$ is the number of samples.
 
-<font size="2"> \* This can also be done for rows (features), not only for samples (columns). </font>
+<font size="2"> \* This can also be done by columns (samples), not only by rows (genes). </font>
+
+<p class="link" style="background-color:#EDEDED">
+    👉🏼 For more on centering and scaling, see this video:
+</p>
+
+<p class="link" style="background-color:#EDEDED; text-align: center">
+    <iframe width="560" height="315" src="https://www.youtube.com/embed/c8ffeFVUzk8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
+</p>
+
 </div>
 
 <p>
@@ -750,11 +770,50 @@ $n$ is the number of samples.
 ```{r scale_data}
 ## Center and scale the data to make differences more evident
 lognorm_data <- (lognorm_data - rowMeans(lognorm_data)) / rowSds(lognorm_data)
+
+## Sample annotation: Sex, Group, and library size
+col_anno <- HeatmapAnnotation(
+    df = as.data.frame(colData(rse_gene_filt)[, c("Sex", "Group")]),
+    library_size = anno_barplot(colData(rse_gene_filt)$sum, gp = gpar(fill = "lightyellow2")),
+    col = list("Sex" = c("F" = "hotpink1", "M" = "dodgerblue"),
+               "Group" = c("Control" = "gray68", "Experimental" = "gold2"))
+    )
+
+## Gene annotation: logFC and biotype
+de_genes$logFC_binary <- sapply(de_genes$logFC, function(x){if(x>0){">0"} else{"<0"}})
+de_genes$protein_coding_gene <- sapply(rowData(rse_gene_filt[rownames(de_genes), ])$gene_type, function(x){
+    if(x=="protein_coding"){"TRUE"} else{"FALSE"}})
+
+gene_anno <- rowAnnotation(
+    df = as.data.frame(cbind("logFC" = de_genes$logFC_binary, 
+                             "protein_coding_gene"= de_genes$protein_coding_gene)),
+    col = list("logFC" = c("<0"="deepskyblue3", ">0"="brown2"), 
+               "protein_coding_gene" = c("TRUE"="darkseagreen3", "FALSE"="magenta"))
+    )
 ```
 
-<p class="link">
-    👉🏼 More for on centering and scaling, see this video:
-    <iframe width="560" height="315" src="https://www.youtube.com/embed/c8ffeFVUzk8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
+```{r heat map, warning=FALSE, message=FALSE, fig.width=10,fig.height=6.9}
+library("circlize")
+
+## Plot
+Heatmap(lognorm_data,
+    name = "lognorm counts",
+    show_row_names = FALSE,
+    top_annotation = col_anno,
+    left_annotation = gene_anno,
+    row_km = 2,
+    column_km = 2,
+    col = colorRamp2(c(-4, -0.0001, 00001, 4), c("darkblue", "lightblue", "lightsalmon", "darkred")),
+    row_title = "DEGs",
+    column_title = "Samples",
+    column_names_gp = gpar(fontsize = 7),
+    heatmap_width = unit(12.5, "cm"),
+    heatmap_height = unit(12.5, "cm")
+)
+```
+
+<p class="exercise">
+📝 **Exercise 7**: obtain the number of DEGs you got and represent them in a volcano plot and a heat map. Include all the sample and gene information you consider relevant in the latter. 
 </p>
 
 

diff --git a/Figures/PCA.png b/Figures/PCA.png
diff --git a/Figures/QCA.png b/Figures/QCA.png
diff --git a/Figures/all_analyses.png b/Figures/all_analyses.png