Skip to content

Commit

Permalink
Add complex heatmap
Browse files Browse the repository at this point in the history
  • Loading branch information
daianna21 committed May 28, 2024
1 parent 3c6a771 commit 27582dd
Show file tree
Hide file tree
Showing 6 changed files with 80 additions and 15 deletions.
Binary file modified .DS_Store
Binary file not shown.
20 changes: 13 additions & 7 deletions 20_DGE_analysis_overview.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -93,22 +93,22 @@ Although here we describe EDA as being comprised by QCA, dimensionality reductio


#### Quality Control Analysis (QCA)
First, the quality metrics of the samples regarding read and RNA contents, and read mapping rates have to be compared to (**Figure 2**: step 1):
First, the quality metrics of the samples regarding read and RNA contents, and read mapping rates have to be compared to (**Figure 3**: step 1):

1) Identify punctual samples or groups of samples of poor quality that may have arisen by technical causes during experimental steps.
2) Evaluate if samples from the groups of interest for DGE (diagnostic, treatment, etc.) differ in their quality metrics as these can represent confounding factors for differential expression.
3) Detect high biological variability to subsequently support data partition to perform subanalyses of the data.

Further, we are also interested in investigating trends and relationships between sample variables to unveil underlying technical and biological aspects of the observed data (**Figure 2**: step 2).
Further, we are also interested in investigating trends and relationships between sample variables to unveil underlying technical and biological aspects of the observed data (**Figure 3**: step 2).

After having identified poor-quality samples, we have to remove them to not include the unreliable expression data they provide in downstream analyses. Cutoffs can be defined for specific QC metrics to decide which samples to keep; this however, is not strongly recommended as no consolidated references exist to define them and therefore rather represent arbitrary values.
Other approaches include identifying outlier QC metrics (**Figure 2**: step 3), but again, caution must be taken as outlier definition is also arbitrary and we could be discarding good-quality samples.
Other approaches include identifying outlier QC metrics (**Figure 3**: step 3), but again, caution must be taken as outlier definition is also arbitrary and we could be discarding good-quality samples.

<figure>
<img src="Figures/QCA.png" width="700px" align=center />
<figcaption style="color: gray; line-height: 0.9; text-align: justify">
<font size="-1.8">
<b>Figure 2</b>: <b> Quality Control Analysis steps. </b>
<b>Figure 3</b>: <b> Quality Control Analysis steps. </b>

<b> 1. Evaluate QC metrics for groups of samples</b>: sample QC metrics such as the fraction of reads that mapped to the mitochondrial chromosome (`mitoRate`) and to the reference genome (`overallMapRate`) are compared between sample groups given by the variable of interest (`Group` in this example), technical variables (e.g. `plate` for sample library preparation), and biological variables (e.g. `Age`).

Expand All @@ -127,13 +127,13 @@ Other approaches include identifying outlier QC metrics (**Figure 2**: step 3),

#### Exploration of sample-level effects

Sample gene expression profiles can be analyzed and compared after dimensionality reduction procedures such as Principal Component Analysis (PCA) and Multidimensional-Scaling (MDS). These analyses are useful to potentially detect samples with outlier transcriptomic profiles to further remove and to identify sample variables driving gene expression variations (**Figure 3**).
Sample gene expression profiles can be analyzed and compared after dimensionality reduction procedures such as Principal Component Analysis (PCA) and Multidimensional-Scaling (MDS). These analyses are useful to potentially detect samples with outlier transcriptomic profiles to further remove and to identify sample variables driving gene expression variations (**Figure 4**).

<figure>
<img src="Figures/PCA.png" width="700px" align=center />
<img src="Figures/PCA.png" width="650px" align=center />
<figcaption style="color: gray; line-height: 0.9; text-align: justify">
<font size="-1.8">
<b>Figure 2</b>: <b> Exploration of sample-level effects through PCA</b>
<b>Figure 4</b>: <b> Exploration of sample-level effects through PCA</b>

<b> 1. Detection of atypical samples (manual PCA-based sample filtering)</b>: PCx vs PCy plots can expose outlier samples that appear segregated from the rest (purple-squared sample) or samples of a particular group (`Sex`: `F` or `M`) closer to samples from the other group (blue-squared sample). These should be further examined to evalute if they can be kept or must be discarded. In this case, after removing them, PC2 that explains a higher % of variance in gene expression, separates samples by sex.

Expand All @@ -145,12 +145,18 @@ Sample gene expression profiles can be analyzed and compared after dimensionalit

#### Model building: covariate selection for *limma*-*voom*

DGE methods fitting linear models to gene expression data to assess if a covariate impacts significantly on the expression of a gene, require the selection of sample-level variables to model transcriptomic data.
If very few variables are present, normally they are all included in the model but that’s not often the case with RNA-seq and it doesn't represent a well founded strategy. Usually, multiple technical and biological variables are implicated in the experiments and sample QC metrics can affect the gene expression levels, even after count normalization, whereas other variables are redundant and/or minimally informative. Therefore, we’d like to identify an optimal set of variables to adjust gene expression for, in addition to the covariate of interest.

We have already introduced one first approximation to that with PCA as this analysis allows us to identify variables explaining high percentages of gene expression variance between samples. In **Chapter XX** we will review how correlation and variance partition analyses at the gene level can help us determine a suitable set of highly explanatory variables.


## Differential Gene Expression
Different mathematical and statistical approaches exist to compare gene expression between two or more conditions. In **Chapter XX** we'll briefly introduce methods based on the negative binomial distribution and address how to perform DGE under the empirical Bayes *limma*-*voom* framework, distinguishing how it operates, its main specifications, inputs, and outputs.


## Downstream analyses
After finding DEGs, volcano plots and heat maps are commonly used to graphically represent them, plotting relevant information about them and their expression levels, respectively. In **Chapter XX** we'll also check how to create and interpret these plots.



Expand Down
75 changes: 67 additions & 8 deletions 22_DGE_with_limma_voom.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -667,9 +667,10 @@ de_genes <- top_genes[which(top_genes$adj.P.Val<0.05), ]
dim(de_genes)
```

### Volcano plots
A very practical and useful plot to graphically represent DEGs and visualize their expression differences between conditions is a volcano plot. This is a scatter plot of the logFC's of the genes in the x-axis vs their adjusted *p*-values in a -log scale in the y-axis.

```{r volcano plot}
```{r volcano plot, warning=FALSE, message=FALSE}
library("ggplot2")
## Define up- and down-regulated DEGs, and non-DEGs
Expand Down Expand Up @@ -720,26 +721,45 @@ ggplot(data = top_genes,
legend.text = element_text(size=12))
```

### Heat maps
Another common way to represent differential expression results is through a heat map. The package [*ComplexHeatmap*](https://bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html) offers a flexible toolkit to easily create heat maps with row and column annotations, a feature of particular value to plot expression data of genes across samples with multiple biological and technical differences. Although initially all genes in your data can be plotted, frequently only DEGs are included as they tend to show clearer gene expression patterns.

```{r complexHeatmap, warning=FALSE,message=FALSE}
library("ComplexHeatmap")
## We plot lognorm counts
lognorm_data<- assays(rse_gene_filt)$logcounts
colnames(lognorm_data) <- paste0("Pup_", 1:dim(rse_gene_filt)[2])
## Subset to DEGs only
lognorm_data <- lognorm_data[rownames(de_genes),]
## Define column (sample) names to display
colnames(lognorm_data) <- paste0("Pup_", 1:dim(lognorm_data)[2])
```


<div style="background-color:#EDEDED; padding:20px; font-family: sans-serif; border: 1px solid black">
🗒️ **Note**: it is important to regress out the technical variables' contributions on gene expression (if necessary and suitable; functions such as [`cleaningY()`](https://rdrr.io/github/LieberInstitute/jaffelab/man/cleaningY.html) of *jaffelab* can be used for this purpose) and to correctly scale and center (around zero) the lognorm counts to make the differences in the expression of the genes more notorious. A simple way to do that is substracting from each lognorm count $cpm_{gi}$ (from the gene $g$ and sample $i$) the mean expression of the gene across all samples* and dividing by the standard deviation of the same gene expression values. This is formally called the z-score: the number of sd away from the mean.
🗒️ **Notes**:

1) It is sometimes convenient to regress out the technical variables' contributions on gene expression to see more clearly the effects of interest. This can happen, for instance, when the logFCs are too small to see any significant differences in the plots or when there are other strong confounding factors. Functions such as [`cleaningY()`](https://rdrr.io/github/LieberInstitute/jaffelab/man/cleaningY.html) of *jaffelab* can be used for this purpose.

2) The lognorm counts have to be correctly scaled and centered (around zero) to make the differences in the expression of the genes more notorious in the heat map. A simple way to do that is substracting from each lognorm count $y_{gi}$ (from the gene $g$ and sample $i$) the mean expression of the gene* and dividing by the standard deviation ($\sigma$) of the same gene expression values. This is formally called the *z*-score: the number of standard deviations away from the mean.

$$
z_{ score}=\frac{cpm_{gi} - \frac{\sum_{k=1}^{n}{cpm_{gk}}}{n}}{sd},
z=\frac{y_{gi} - \frac{\sum_{k=1}^{n}{y_{gk}}}{n}}{\sigma},
$$
$n$ is the number of samples.

<font size="2"> \* This can also be done for rows (features), not only for samples (columns). </font>
<font size="2"> \* This can also be done by columns (samples), not only by rows (genes). </font>

<p class="link" style="background-color:#EDEDED">
👉🏼 For more on centering and scaling, see this video:
</p>

<p class="link" style="background-color:#EDEDED; text-align: center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/c8ffeFVUzk8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</p>

</div>

<p>
Expand All @@ -750,11 +770,50 @@ $n$ is the number of samples.
```{r scale_data}
## Center and scale the data to make differences more evident
lognorm_data <- (lognorm_data - rowMeans(lognorm_data)) / rowSds(lognorm_data)
## Sample annotation: Sex, Group, and library size
col_anno <- HeatmapAnnotation(
df = as.data.frame(colData(rse_gene_filt)[, c("Sex", "Group")]),
library_size = anno_barplot(colData(rse_gene_filt)$sum, gp = gpar(fill = "lightyellow2")),
col = list("Sex" = c("F" = "hotpink1", "M" = "dodgerblue"),
"Group" = c("Control" = "gray68", "Experimental" = "gold2"))
)
## Gene annotation: logFC and biotype
de_genes$logFC_binary <- sapply(de_genes$logFC, function(x){if(x>0){">0"} else{"<0"}})
de_genes$protein_coding_gene <- sapply(rowData(rse_gene_filt[rownames(de_genes), ])$gene_type, function(x){
if(x=="protein_coding"){"TRUE"} else{"FALSE"}})
gene_anno <- rowAnnotation(
df = as.data.frame(cbind("logFC" = de_genes$logFC_binary,
"protein_coding_gene"= de_genes$protein_coding_gene)),
col = list("logFC" = c("<0"="deepskyblue3", ">0"="brown2"),
"protein_coding_gene" = c("TRUE"="darkseagreen3", "FALSE"="magenta"))
)
```

<p class="link">
👉🏼 More for on centering and scaling, see this video:
<iframe width="560" height="315" src="https://www.youtube.com/embed/c8ffeFVUzk8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
```{r heat map, warning=FALSE, message=FALSE, fig.width=10,fig.height=6.9}
library("circlize")
## Plot
Heatmap(lognorm_data,
name = "lognorm counts",
show_row_names = FALSE,
top_annotation = col_anno,
left_annotation = gene_anno,
row_km = 2,
column_km = 2,
col = colorRamp2(c(-4, -0.0001, 00001, 4), c("darkblue", "lightblue", "lightsalmon", "darkred")),
row_title = "DEGs",
column_title = "Samples",
column_names_gp = gpar(fontsize = 7),
heatmap_width = unit(12.5, "cm"),
heatmap_height = unit(12.5, "cm")
)
```

<p class="exercise">
📝 **Exercise 7**: obtain the number of DEGs you got and represent them in a volcano plot and a heat map. Include all the sample and gene information you consider relevant in the latter.
</p>


Expand Down
Binary file modified Figures/PCA.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified Figures/QCA.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified Figures/all_analyses.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit 27582dd

Please sign in to comment.