# Differential Gene Expression analysis with *limma*-*voom*
Instructor: Daianna González Padilla
In this chapter you'll learn how DGE analysis is performed under the empirical Bayes framework of the popular *limma*-*voom* pipeline, highlighting key assumptions and concepts, as well as the main differences from other methodologies.
## NB-based DGE methods?
An initial central point of discussion in DGE method development is how to model the distribution of the reads. Many methods model the read counts ($y_{k,ij}$, non-negative integers) of a gene $i$ in the $j$ samples of condition $k$ through the Poisson or the Negative Binomial (NB) distribution. Of these, the NB is often preferred because it allows the mean ($\mu$) and the variance ($\sigma^2$) of the reads to differ, whereas the Poisson distribution assumes $\mu=\sigma^2$. This is particularly important because controlling the variance allows us to account for variability in gene expression levels across biological samples [1].
<figure>
<img src="Figures/NB_DGE_methods.png" width="800px" align=center />
<figcaption style="color: gray; line-height: 0.9; text-align: justify">
<font size="-1.8">
<b>Figure 1</b>: <b> NB-distributed read counts. </b> Modeling of read counts for gene $i$ in the samples of the first and second conditions based on the NB model. Modified from [Li, W. V., & Li, J. J. (2018)](https://pubmed.ncbi.nlm.nih.gov/31456901/).
</font>
</figcaption>
</figure>
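To make the mean-variance distinction concrete, here is a minimal simulation contrasting the two distributions (illustrative only and not part of the original analysis; the `lambda`, `mu`, and dispersion values are arbitrary):
```{r nb_vs_poisson}
## Simulated counts for one hypothetical gene across 1000 replicates
set.seed(123)

## Poisson: the variance equals the mean
pois_counts <- rpois(1000, lambda = 100)
c(mean = mean(pois_counts), var = var(pois_counts))

## NB: the variance is mu + dispersion * mu^2, exceeding the mean
nb_counts <- rnbinom(1000, mu = 100, size = 1 / 0.2) # dispersion of 0.2
c(mean = mean(nb_counts), var = var(nb_counts))
```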
Estimating the NB distribution parameters is necessary to assess DE of each gene $i$ between any two conditions $k=1,2$ (**Figure 1**). Bayesian models are used here, defining prior distributions for and relationships between such parameters. Briefly, after 1) estimating gene-wise NB parameters, 2) the mean-variance relationship across all genes can be used to shrink the gene-wise variance estimates, borrowing information from all genes or incorporating prior knowledge, which is advantageous when sample sizes are small. 3) A statistical test is then used to assess, for each gene $i$, whether its true expression in the first and second conditions ($\theta_{1i}$ and $\theta_{2i}$) is the same (null hypothesis) or differs (alternative hypothesis):
* $H_0: \theta_{1i}=\theta_{2i}$
* $H_1: \theta_{1i} \neq \theta_{2i}$, where the $\theta_{i}$'s are parameters included in the mean ($\mu$) of the NB distributions.
4\) The test statistic is computed for each gene and 5) its associated *p*-value is calculated based on the null distribution. 6) Finally, *p*-values are corrected for multiple testing and DEGs are determined based on an adjusted *p*-value cutoff [1].
Examples of popular methods based on the NB distribution are [*edgeR*](https://bioconductor.org/packages/edgeR) and [*DESeq2*](https://bioconductor.org/packages/DESeq2).
Nevertheless, one limitation NB-based methods face is that they treat the dispersion of the data as a known, global parameter, ignoring observation-specific variation; moreover, fewer statistical methods are available for count distributions than for the normal distribution [1,2]. Here, we'll focus on [*limma*](https://bioconductor.org/packages/limma), which does not rely on a particular count distribution but rather works on $log_2(CPM)$ values (CPM: counts per million) and fits linear models for DGE, enabling the incorporation of additional predictors to model gene expression, a feature especially valuable for complex experimental settings.
## *limma*-*voom* pipeline
[*limma*](https://bioconductor.org/packages/limma) is a package for the analysis of gene expression data arising from microarray or RNA-seq technologies. It has features that make the analyses stable even for experiments with a small number of arrays or samples; this is achieved by borrowing information across genes. It is especially designed for analyzing complex experiments with a variety of experimental conditions and predictors [3].
Usually, *limma* DGE analysis is carried out in five main steps, the last four of them completed by *limma* `R` functions, as described below. We'll use bulk RNA-seq data from the [*smokingMouse*](https://bioconductor.org/packages/smokingMouse) package to exemplify these steps.
```{r download_data_DGE, warning=FALSE, message=FALSE}
## Load the container package for RSE
library("SummarizedExperiment")
## Connect to ExperimentHub
library("ExperimentHub")
eh <- ExperimentHub::ExperimentHub()
## Load package datasets
myfiles <- query(eh, "smokingMouse")
## Download the mouse gene data
rse_gene <- myfiles[["EH8313"]]
## Samples from the nicotine experiment and from pups only
rse_gene_nic <- rse_gene[, which(rse_gene$Expt == "Nicotine" & rse_gene$Age == "Pup")]
## Retain only expressed genes (passed the filtering step)
rse_gene_filt <- rse_gene_nic[
rowData(rse_gene_nic)$retained_after_feature_filtering,
]
```
Let's explore the data a little.
```{r explore_data}
## Data dimensions: number of genes and samples
dim(rse_gene_filt)
## Raw counts for first 3 genes in the first 5 samples
assays(rse_gene_filt)$counts[1:3, 1:5]
## Log-normalized counts for first 3 genes in the first 5 samples
assays(rse_gene_filt)$logcounts[1:3, 1:5]
## Data for the first 2 samples
head(colData(rse_gene_filt), 2)
```
<p class="exercise">
📝 **Exercise 1**: in order to perform a DGE analysis, locate your own RNA-seq datasets if you have any, or download expression data from a study of your interest and build an `RSE` object using *recount3* (see **Chapter 3: recount3 introduction**). A third option is to download gene expression data from the [*smokingMouse*](http://research.libd.org/smokingMouse/index.html) package used here. A fourth option is to download data from GEO, as Sean Davis will explain next. We'll have more time tomorrow to do this exercise with data of your choosing.
</p>
### `model.matrix()`
*limma* fits a linear model to the expression data of each gene (response variable), modeling the systematic part of the data by sample-level covariates (predictors).
<style>
p.exercise {
background-color: #FFFAFA;
padding: 15px;
border: 2px solid black;
margin-left: 0px;
border-radius: 1px;
font-family: sans-serif;
}
</style>
<style>
p.info {
background-color: #FFFFF0;
padding: 20px;
border: 1px solid black;
margin-left: 0px;
border-radius: 1px;
font-family: sans-serif;
}
</style>
<style>
p.conclusion {
background-color: #EEE9E9;
padding: 20px;
border: 1px solid black;
margin-left: 0px;
border-radius: 1px;
font-family: sans-serif;
}
</style>
<style>
p.question{
background-color: #E3E3E3;
padding: 20px;
border: 1px solid black;
margin-left: 0px;
border-radius: 1px;
font-family: sans-serif;
}
</style>
<style>
p.link{
background-color: #FFFFFF;
padding: 10px;
border: 0px solid black;
margin-left: 0px;
border-radius: 1px;
font-size: 13px;
font-family: sans-serif;
}
</style>
<style>
p.comment {
background-color: #F0F0F0;
padding: 20px;
border: 0px solid black;
margin-left: 0px;
border-radius: 1px;
font-family: sans-serif;
}
</style>
<style>
p.alert {
background-color: #FFE4E1;
padding: 14px;
border: 0px solid black;
margin-left: 0px;
border-radius: 1px;
font-family: sans-serif;
}
</style>
<style>
p.success {
background-color: #E0EEE0;
padding: 14px;
border: 0px solid black;
margin-left: 0px;
border-radius: 1px;
font-family: sans-serif;
}
</style>
<p class="info">
💡 A model is a specification of how a set of variables relate to each other. In the case of a linear model, it is a linear equation that describes how the dependent or response variable is explained by the independent variables, also called predictors. A regression analysis with more than one independent variable is called **multiple regression**. Regression with only one independent variable is called **simple regression** [4].
</p>
The *limma* model is specified with a **design matrix**, also known as **model matrix** or **regressor matrix**, often denoted by $X$. This is a matrix of values for explanatory variables of the samples: rows correspond to samples and columns to sample variables.
Say the value the $i$th sample takes for the $k$th covariate is $X_{ik}$ and its coefficient is $\beta_{k}$, for $k = 1, \dots, h$. The predicted expression of a gene in the $i$th sample is then given by $\hat y_i =\beta_0 + \sum_{k=1}^h\beta_{k}X_{ik}$.
$$
\hat y = X\beta=\begin{bmatrix} \hat y_{1}\\ \hat y_{2}\\ \hat y_{3}\\ \vdots \\ \hat y_{n-1}\\ \hat y_{n}\end{bmatrix}={\begin{bmatrix}1&X_{11}&X_{12}&X_{13}&\cdots&X_{1,h-1}&X_{1h}\\1&X_{21}&X_{22}&X_{23}&\cdots&X_{2,h-1}&X_{2h}\\1&X_{31}&X_{32}&X_{33}&\cdots&X_{3,h-1}&X_{3h}\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\1&X_{n-1,1}&X_{n-1,2}&X_{n-1,3}&\cdots&X_{n-1,h-1}&X_{n-1,h}\\1&X_{n1}&X_{n2}&X_{n3}&\cdots&X_{n,h-1}&X_{nh}\end{bmatrix}}{\begin{bmatrix}\beta_{0}\\ \beta_{1}\\ \beta_{2}\\ \beta_{3}\\ \vdots \\ \beta_{h-1}\\ \beta_{h}\end{bmatrix}}
$$
where $n$ is the number of samples.
In the first step, we create this matrix using `model.matrix()`, which receives a formula with the variables to include in the model, together with the sample data.
```{r model.matrix()}
## Define formula
formula <- ~ Group + Sex + flowcell + mitoRate + overallMapRate + totalAssignedGene + detected + ERCCsumLogErr
## Model matrix
model <- model.matrix(formula, data = colData(rse_gene_filt))
head(model)
```
<p class="question">
❓ **Which variables should be included as covariates in the models?** A straightforward strategy is to keep the model as **simple as possible** and, after fitting it, evaluate the comparisons of interest [3]. In **Chapter 7** we will discuss how correlation and variance partition analyses can help us set up the best models.
</p>
<p class="alert">
⚠️ **Very important**: always check which condition group is set as the reference in your model for the coefficient/contrast of interest. For a factor, the coefficient column is named `[Variable][Non-reference group]` (here `GroupExperimental`, with Control as the reference) and equals 1 for samples in that non-reference group. This determines whether a DEG is up- or downregulated in the given condition compared to the reference.</p>
```{r coeff}
## Comparison of interest: Group
coef <- "GroupExperimental"
```
<p class="exercise">
📝 **Exercise 2**: identify the sample data of your study and create the respective design matrix. What is the reference group for your main variable of interest?
Tomorrow we will learn how to use `ExploreModelMatrix` to help us interpret the coefficients.
</p>
### `voom()`
Compared to NB-based methods, *limma* works with $log_2(CPM)$ values, which are approximately normally distributed (as we have seen) and thus open up the possibility of leveraging a wide range of normal-based statistical tools not available for count distributions, including methods developed for microarray data. However, *limma* neither assumes nor requires the data to follow a normal distribution; it simply applies normal-based, microarray-like statistical methods to RNA-seq read counts [2].
> "... limma does not make any assumption that the data appears normal in a histogram."
<div align="right"> \- Gordon Smyth, author of *limma*, in the [Bioconductor support website](https://support.bioconductor.org/p/9140296/#9156131) 2021. </div>
The benefit of using $log_2(CPM)$, however, is not immediate. One limitation for the direct application of normal-based methods to log-counts is that read counts have unequal variabilities even after log-transformation, depending on the count sizes: probability distributions for counts are naturally heteroscedastic (larger counts have larger variances), and log-cpm values do not have constant variances either [2]. It has been proposed that, to design powerful statistical analyses for RNA-seq, it is more important to model the mean-variance relationship in the data than to specify the exact probabilistic distribution of the counts [2]. Importantly, transforming count data while taking such a relationship into account does open up access to normal-based analysis methods. That's why we use `voom()`.
What `voom()` does is:
1. First, it computes log-cpm values. The log-normalized expression for gene $g$ in sample $i$ ($y_{gi}$) is given by
$$
y_{gi}=log_2(\frac{r_{gi} + 0.5}{R_i + 1.0} \times 10^6)
$$
where $r_{gi}$ is the raw count for the gene in the sample and $R_i$ the library size of the sample.
We add 0.5 to the counts to avoid taking the log of zero, and 1 to the library size to ensure that $\frac{r_{gi}+0.5}{R_i+1}$ is strictly less than 1 (even when $r_{gi} = R_i$).
2. A linear model is fitted to gene log-cpm values by ordinary least squares as:
$$ E(y_{gi})=\mu_{gi}=X_i\beta_g $$
where $E(y_{gi})$ is the expected expression of gene $g$ in sample $i$, $X_i$ is the vector with the sample values for the covariates and $\beta_g$ the vector of covariate coefficients for the gene. As a result, we have the estimated $\hat\beta_g$, the fitted log-cpm’s $\hat\mu_{gi}=X_i\hat\beta_g$ and the residual standard deviations $s_g$.
3. Then it estimates the mean-variance trend of the data by fitting a smooth curve to the genes' $\sqrt{s_g}$ values as a function of their average expression (in log-counts, not log-cpm). The $\sqrt{s_g}$'s are used because they are symmetrically distributed. Log-counts typically show a decreasing mean-variance trend.
4. `voom()` then predicts the standard deviation of each individual normalized observation $y_{gi}$ (*limma*-trend does this at the gene level) using this trend curve: the fitted log-count of each observation is mapped to the curve and its $\sqrt{s_{gi}}$ value is obtained. The observation weights are $w_{gi}=\frac{1}{s_{gi}^2}$.
<figure>
<img src="Figures/voom.png" width="800px" align=center />
<figcaption style="color: gray; line-height: 0.9; text-align: justify">
<font size="-1.8">
<b>Figure 2</b>: <b> `voom()` procedure to estimate observation-level variance weights for *limma*. </b> Extracted from the original voom publication ([Law, C. W. *et al.* 2014](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29)).
</font>
</figcaption>
</figure>
5. Log-cpm ($y_{gi}$) and associated weights ($w_{gi}$) can then be entered into the *limma* framework for linear modeling. These weights are used in the linear modeling to adjust for count heteroscedasticity [2].
```{r voom}
library("limma")
## voom():
# 1. Transform counts to log2(cpm)
# ----------------------------------------------------------------------------
# |  Note: we pass voom() the raw counts as input, not the lognorm counts!   |
# ----------------------------------------------------------------------------
# 2. Estimate mean-variance relationship for each gene
# 3. Compute observation weights for limma (next step)
vGene <- voom(assay(rse_gene_filt), design = model, plot = TRUE)
```
Let's explore the outputs of this function.
```{r voom_outs}
## Returned data
names(vGene)
## E: contains the computed log(cpm)
dim(vGene$E)
vGene$E[1:5, 1:5]
## weights: contains the computed variance weight for each observation
dim(vGene$weights)
vGene$weights[1:5, 1:5]
## design: is the provided design matrix
head(vGene$design)
## targets: the sample library sizes used to compute log(cpm) in the first step
dim(vGene$targets)
head(vGene$targets)
identical(vGene$targets$lib.size, colSums(assay(rse_gene_filt)))
```
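To connect the log-cpm formula from step 1 with these outputs, we can recompute the values by hand and compare them against `vGene$E`. This is just a sanity check on our understanding, not part of the original pipeline:
```{r check_logcpm}
## Library sizes R_i
R_i <- colSums(assay(rse_gene_filt))
## log2((r_gi + 0.5) / (R_i + 1) * 1e6), as in voom()'s first step
manual_E <- log2(sweep(assay(rse_gene_filt) + 0.5, 2, R_i + 1, "/") * 1e6)
all.equal(manual_E, vGene$E)
```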
<p class="conclusion">
➡️ In summary, `voom()` non-parametrically estimates the global mean-variance trend of the count data and uses it to predict the variance of each individual expression observation (each log-cpm value) based on its predicted count size. The inverses of the predicted variances are then assigned as weights to each observation; when used in linear modeling, these weights remove the log-cpm mean-variance trend [2].
</p>
<div style="background-color:#E0EEE0; padding:20px; font-family: sans-serif">
<p>
👉🏼 **Advantages**:
<ul>
<li>✅ `voom()` estimates the mean-variance relationship in a non-parametric way. </li>
<blockquote class="ludwig">
"The parametric advantages of the Poisson or NB distributions are mitigated by the fact that the observed mean-variance relationship of RNA-seq data does not perfectly match the theoretical mean-variance relationships inherent in these distributions. While the quadratic mean-variance relationship of the NB distribution captures most of the mean-variance trend, the NB dispersion still shows a non-ignorable trend with gene abundance." [2]
</blockquote>
<li>✅ Since `voom()` is a method to adapt count data to normal models, it gives access to tractable empirical Bayes distribution theory. </li>
<li>✅ The use of normal distribution approaches and variance modeling is supported by generalized linear model theory. </li>
</ul>
</p>
</div>
<p>
</p>
<p class="exercise">
📝 **Exercise 3**: compute the $log_2(CPM)$ values and the residual variance weights for each observation in your data using `voom()`.
</p>
### `lmFit()`
This *limma* function fits a **multiple linear model** to the expression of each gene by weighted or generalized least squares to estimate the coefficients of the sample covariates, which correspond to the logFCs comparing gene expression between sample groups.
**Ordinary least squares (OLS)**
This is used to estimate the coefficients of a linear regression by minimizing the residual sum of squares [5].
<div style="text-align: center">
<figure>
<img src="Figures/lmFit.png" width="400px" align=center/>
<figcaption style="color: gray; line-height: 0.9; text-align: justify; caption-side: bottom">
<font size="-1.8">
<b>Figure 3</b>: <b> Graphical representation of the OLS method for simple regression analysis</b>. Source: Gulve, A. (2020). Ordinary Least Square (OLS) Method for Linear Regression.
</font>
</figcaption>
</figure>
</div>
For simplicity, let's work with one gene and say we have $n$ samples. The fitted expression of the gene in the $j$th sample is $\hat y_j =\beta_{0} + \sum_{k=1}^h\beta_{k}X_{jk}$, where $\beta_k$ is the coefficient of the $k$th covariate and $X_{jk}$ is the value the $j$th sample takes for the $k$th covariate. This can also be written as $\hat y_j =\sum_{k=0}^h\beta_{k}X_{jk}$ if $X_{j0}=1$.
So we have an overdetermined system of $n$ linear equations in $h+1$ unknown parameters, with $n>h+1$:
$\hat y_j =\sum_{k=0}^h\beta_{k}X_{jk}$ with $j=1,2, ..., n$.
Such a system usually has no exact solution, so we need to estimate the coefficients that best fit the data in a linear regression. The problem reduces to a quadratic minimization problem:
$\hat \beta=\underset{\beta}{\arg\min}\ S(\beta)$, where $S(\beta)=\sum_j(y_j -\hat y_j)^2=RSS$ (the residual sum of squares).
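As a small illustration with simulated data (not the course data), the closed-form OLS solution to this minimization problem is $\hat\beta=(X^TX)^{-1}X^Ty$, which we can compute directly:
```{r ols_sketch}
## Toy example: n = 10 samples, intercept + 2 covariates
set.seed(0)
X <- cbind(1, matrix(rnorm(20), ncol = 2))
beta_true <- c(1, 2, -1)
y <- drop(X %*% beta_true + rnorm(10, sd = 0.1))
## OLS estimate via the normal equations; same as lm.fit(X, y)$coefficients
solve(t(X) %*% X) %*% t(X) %*% y
```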
<div style="background-color:#FFFFF0; padding:20px; font-family: sans-serif; border: 1px solid black">
💡 We can think of these $\beta$’s as differences in the fitted (expected) expression of a gene. Say we have two binary categorical variables in the model ($X_1$ and $X_2$), then the expected gene expression in a sample is $E(y|X_1, X_2) =\hat y =\beta_{0} + \beta_1X_1+\beta_2X_2$, where $X_1$ and $X_2$ equal to 1 or 0. Then we have the following 4 combinations:
- <mark style= "background-color: #FAECF8"> $E(y|X_1=1, X_2=1) = \mu_{12}=\beta_{0} + \beta_1+\beta_2$ </mark>
- <mark style= "background-color: #FCF3CF"> $E(y|X_1=1, X_2=0) =\mu_{1}=\beta_{0} + \beta_1$ </mark>
- <mark style= "background-color: #FAECF8"> $E(y|X_1=0, X_2=1) =\mu_{2}=\beta_{0} + \beta_2$ </mark>
- <mark style= "background-color: #FCF3CF"> $E(y|X_1=0, X_2=0) =\mu_{0}=\beta_{0}$ </mark>
So $\beta_1=$ <mark style= "background-color: #FCF3CF">$\mu_1-\mu_0$</mark> $=$ <mark style= "background-color: #FAECF8"> $\mu_{12}-\mu_2$ </mark> and $\beta_2=$ <mark style= "background-color: #FAECF8">$\mu_2$</mark>$-$ <mark style= "background-color: #FCF3CF">$\mu_0$</mark>. Say our variable of interest is $\beta_1$. Then what we are testing is if the expected expression of a gene is different when $X_1=1$ (in the first sample group) and $X_1=0$ (in the second sample group), fixing $X_2$ in either 1 or 0.
</div>
<p>
</p>
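The following toy sketch (hypothetical data, not from the course) confirms that `lm()` recovers exactly these $\beta$'s as differences in expected values:
```{r beta_toy}
## Two binary covariates and a known linear signal: y = 2 + 3*X1 + 1*X2 + noise
toy <- expand.grid(X1 = c(0, 1), X2 = c(0, 1))[rep(1:4, each = 3), ]
set.seed(42)
toy$y <- 2 + 3 * toy$X1 + 1 * toy$X2 + rnorm(nrow(toy), sd = 0.01)
## beta0 ~ 2 (mu_0), beta1 ~ 3 (mu_1 - mu_0), beta2 ~ 1 (mu_2 - mu_0)
coef(lm(y ~ X1 + X2, data = toy))
```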
**Generalized least squares (GLS)**
This is a generalization of OLS that allows for heteroskedasticity and correlation between the residuals [6].
**Weighted least squares (WLS)**
In this case, the function to be minimized is the weighted sum of the squared residuals: squared residuals are weighted by the reciprocal of their variance, so that noisier observations get less weight. That's what we used `voom()` for; see the sketch below.
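Here is a minimal base-R sketch of WLS, reusing `X` and `y` from the OLS sketch above (the weights are arbitrary, for illustration only):
```{r wls_sketch}
## lm.wfit() minimizes the weighted RSS: sum(w * (y - X %*% beta)^2)
set.seed(1)
w <- runif(nrow(X), 0.5, 2)
lm.wfit(X, y, w)$coefficients
```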
`lmFit()` returns a fitted model object with the estimated **coefficients**, **standard errors** ($SE=sd/\sqrt n$) and **residual standard errors/deviations** ($RSE=s_g=\sqrt {RSS/(n-2)}$) for each gene. Depending on the arguments and correlations in the data, this function calls one of the following functions to fit a linear model for each gene [7]:
- `mrlm`: robust regression if `method="robust"`.
- `gls.series`: GLS estimation if `method="ls"` and a correlation structure has been specified.
- `lm.series`: OLS estimation if `method="ls"` and there is no correlation structure.
For the `weights` argument of `lmFit()`, the precision weights for the observations previously computed are extracted from the `voom()` output.
```{r lmFit}
## lmFit():
# 1. Fit linear model for each gene to estimate logFCs
fitGene <- lmFit(vGene)
## Corroborate "ls" method was applied
fitGene$method
## Explore outputs: estimated coefficients (logFCs)
head(fitGene$coefficients)
```
<div style="background-color:#FFFFF0; padding:20px; font-family: sans-serif; border: 1px solid black">
💡 **Interaction terms in linear models**
There may be cases where we want to assess gene expression differences between 2 conditions within more than one specific group; for example, if we were interested in the effects of a treatment ($X_1=1$ for treatment and 0 for controls) in <mark style= "background-color: #FAECF8">females</mark> and <mark style= "background-color: #E0FFFF">males</mark> separately ($X_2=1$ for females and 0 for males). In such cases we can fit an **interaction model** in which we include the product of $X_1$ and $X_2$, so that $X_1X_2=1$ if a sample comes from a treated female and 0 otherwise (see the sketch after this box):
$$E(y|X_1, X_2) =\beta_{0} + \beta_1X_1+\beta_2X_2 + \beta_3X_1X_2$$
- <mark style= "background-color: #FAECF8"> $E(y|X_1=1, X_2=1) =\mu_{12} =\beta_{0} + \beta_1+\beta_2+\beta_3$ </mark>
- <mark style= "background-color: #E0FFFF"> $E(y|X_1=1, X_2=0) =\mu_{1} =\beta_{0} + \beta_1$ </mark>
- <mark style= "background-color: #FAECF8"> $E(y|X_1=0, X_2=1) =\mu_{2} =\beta_{0} + \beta_2$ </mark>
- <mark style= "background-color: #E0FFFF"> $E(y|X_1=0, X_2=0) =\mu_{0} =\beta_{0}$ </mark>
$\beta_1 + \beta_3=$ <mark style= "background-color: #FAECF8">$\mu_{12}-\mu_2$</mark> which is the expression difference between treated and control female samples ($X_2=1$) and $\beta_1 =$ <mark style= "background-color: #E0FFFF">$\mu_{1}-\mu_0$</mark> for male samples ($X_2=0$). Finally $\beta_3$, called the **interaction term**, is (<mark style= "background-color: #FAECF8">$\mu_{12}-\mu_2$</mark>)$-$(<mark style= "background-color: #E0FFFF">$\mu_1-\mu_0$</mark>), described as the difference in gene expression changes driven by the treatment in females compared to males [8].
</div>
<p>
</p>
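A minimal sketch of how such an interaction could be specified with our data, assuming the same `rse_gene_filt` object: in R's formula syntax, `Group * Sex` expands to both main effects plus their interaction (here the interaction column would be named `GroupExperimental:SexM`):
```{r interaction_sketch}
## Interaction model: Group + Sex + Group:Sex
model_int <- model.matrix(~ Group * Sex, data = colData(rse_gene_filt))
head(model_int)
```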
<p class="exercise">
📝 **Exercise 4**: fit a linear regression model to the expression data of your genes and extract the coefficients for the included covariates.
</p>
### `eBayes()`
Next, we want to assess whether the differences in gene expression between the sample groups are statistically significant. Initially, we can think of comparing the mean expression of a gene between the sample groups (e.g., cases and controls), which can be done with a **two-sample *t*-test**, assuming the values in both groups are approximately normally distributed.
Here we use the *t*-score (***t*-stats**) to determine whether the difference in the means is statistically significant, based on a *t*-distribution.
The *t*-stats is given by:
$$
t=\frac{\bar x_1 - \bar x_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}
$$
where $\bar x_1$ and $\bar x_2$ are the means of the expression values of a gene in the first and second sample groups, $s_1$ and $s_2$ are the sample standard deviations of gene expression in the same groups, and $n_1$, $n_2$ the corresponding sample group sizes:
$s_1 = \sqrt{\frac{\sum_{i=1}^ {n_1} (x_i-\bar x_1)^2}{n_1-1}}$ and $s_2 = \sqrt{\frac{\sum_{j=1}^ {n_2} (x_j-\bar x_2)^2}{n_2-1}}$, with $x_i$ and $x_j$ the gene expression values in the samples of group 1 and 2, respectively.
<p class="question">
➡️ Note that we say *sample* means and *sample* standard deviations because they are estimators of the population parameters, computed based on the data that we have.
</p>
We can think of this *t*-stats as a signal-to-noise ratio. The numerator contains the difference between the two means, taken as the **signal** for DE. The denominator corresponds to the **standard error** and represents the **noise** in terms of the gene expression variance within the sample groups; it represents how spread out the signal is [9]. In that way, the *t*-stats is a measure of how strong the DE signal is. Once computed, each *t*-stats has an associated *p*-value based on a Student *t*-distribution under the null hypothesis ($H_0$: $\bar x_1 - \bar x_2=0$).
This is exactly what we can get using `lm()`:
```{r lm}
## Lognorm expression of first gene
rse_gene_one_gene <- rse_gene_filt[1, ]
colData(rse_gene_one_gene) <- cbind(colData(rse_gene_one_gene),
"lognorm_expr" = assays(rse_gene_one_gene)$logcounts[1, ]
)
## Fit simple linear model
formula <- lognorm_expr ~ Group
lm <- lm(formula, data = colData(rse_gene_one_gene))
summary(lm)
## Two sample t-test
t.test(formula, data = colData(rse_gene_one_gene), var.equal = TRUE)
```
<p class="info">
💡 Sample sizes are critical! Larger sample sizes increase the power of the tests and reduce the false discovery rate (FDR): they decrease the denominator of the *t*-stats (increasing their values), so even slight differences can be detected.
</p>
<p class="alert">
⚠️ Now consider that for genes with small variances in their expression the *t*-stats will be greater and we could be detecting non-DEGs as DE (false positives).
</p>
But at least two things must be considered when working with gene expression data:
1. The first is that expression values are usually not normally distributed.
2. Second, the distributions and variances of expression values vary across genes and conditions.
<p class="question">
➡️ With that in mind, inference at the individual gene level can be addressed borrowing information from all the genes in the experiment through a Bayes or empirical Bayes method that produces more powerful tests.
</p>
The idea of **Bayesian statistics** is to give unknown quantities a prior distribution, considering each feature as a member of a population of features such as genes. More specifically, **empirical Bayes methods** are procedures for statistical inference in which the (empirical) prior distribution is estimated from the population of all features (from the data) [8]; in standard Bayesian methods this prior distribution is fixed before observing any data [10].
Inspired by the work of Lönnstedt and Speed (2002), in which a simple expression for the posterior odds of differential expression for each gene was computed using a parametric empirical Bayes approach, [Smyth, G. K. (2004)](https://doi.org/10.2202/1544-6115.1027) generalized this model to experiments with any number of samples and conditions. He reformulated the posterior odds statistic in terms of a **moderated *t*-statistic** in which posterior residual standard deviations are used instead of the ordinary ones, eliminating the need to specify a non-null prior guess for the proportion of differentially expressed genes, as required by the log-odds [11]. Let's see how it proceeds.
First, for each gene $g$ we have a vector with the expression values in the $n$ samples:
$$
y_{g}= (y_{g1}, y_{g2}, ..., y_{gn})
$$
We already know that the expected (predicted) gene expression in the samples is $E(y_{g})=X\alpha_g$ with $X$ the design matrix and $\alpha_g$ the vector of the coefficients for the $h$ covariates in the model; of these the ones of biological interest are the $\beta_g$’s (contrasts of interest).
Then, as previously described, a linear model is fitted to the expression data for each gene to obtain the **coefficient estimators** ($\hat \alpha_g$) (as well as $\hat \beta_g$), the **residual sample variances** ($s_g^2$) as estimators of the (true but unknown) residual variances ($\sigma_g^2$), and the estimated covariance matrices.
Two relevant considerations here:
1. The expression values are not necessarily assumed to be normally distributed.
2. The linear model is not assumed to be necessarily by least squares.
However, there are two assumptions:
1. The contrast estimators $\hat \beta_g$ are assumed to be approximately normally distributed with mean $\beta_g$.
2. The residual sample variances ($s_g^2$) are assumed to follow approximately a scaled chi-square distribution.
Under such assumptions the ordinary *t*-stats for the covariate $j$ in the gene $g$ is defined by:
$$
t_{gj}=\frac{\hat \beta_{gj}}{s_g u_{gj}}=\frac{\hat \beta_{gj}}{SE(\hat \beta_{gj})}
$$
with $s_g$ the residual sample standard deviation of the gene and $u_{gj}$ the unscaled standard deviation. $SE$ stands for standard error.
The key step in the empirical Bayes approach of *limma* is to leverage the information across all genes by defining prior distributions for the unknown coefficients $\beta_{gj}$ and residual variance $\sigma_g^2$ of the genes.
* For $\beta_{gj}$, it models the prior distribution of the coefficients that are not zero, i.e., the expected distribution of the **logFC** of the genes that are DE:
$$\beta_{gj}|\sigma_{g}^2, \beta_{gj}≠0 \sim N(0, v_{0j}\sigma_g^2)$$
> "Saying that the betas have prior information centered around zero implies that we are ignorant of the sign (**+**/**-**) of the beta."
<div align="right"> \– Vincent Carey (personal communication)</div>
<p>
</p>
* For the residual variances, what *limma* does is take the residual sample variances of all genes (the $s_g^2$'s) and estimate the empirical parameters of the prior distribution they are assumed to follow. Specifically, $\frac{1}{\sigma_g^2}$ is modeled by a scaled chi-square (a gamma) prior distribution with mean $\frac{1}{s_0^2}$ and $d_0$ degrees of freedom, describing how the residual variances are expected to vary across genes:
$$\frac{1}{\sigma_g^2} \sim \frac{1}{d_0s_0^2}\chi_{d_0}^2$$
What we want to do next is not simply to take the mean of the prior distribution, but to estimate each gene's residual variance as an empirical Bayes predictor: a weighted average of the **prior value** ($s_0^2$) and the observed **sample variance** ($s_g^2$) of each gene. This is called the **moderated variance**, and what is graphically happening is that we are pulling the observed gene variances towards the prior: large variances are reduced, so the |*t*-stats| increases (a more powerful *t*-test for those genes), and small variances are increased, decreasing the |*t*-stats| and the power of the *t*-test. Under this model, the **posterior residual sample variance** or posterior residual variance estimator ($\tilde s_g^2$) is:
$$
\tilde s_g^2=E(\sigma_g^2|s_g^2)=\frac{d_0s_0^2 + d_gs_g^2}{d_0+d_g}
$$
Moderation is somewhat like having a larger sample size for the estimation of the variance, given that the moderated variances are (on average) closer to the population variance than the original sample variances.
The **moderated *t*-statistic** can now be defined in terms of these **posterior residual sample standard deviations** instead of the ordinary ones:
$$
\tilde t_{gj}=\frac{\hat \beta_{gj}}{\tilde s_g u_{gj}}
$$
These moderated *t*-stats follow a *t*-distribution under the null hypothesis ($H_0:\beta_{gj}=0$) with $d_g+d_0$ degrees of freedom, and the associated *p*-values can be computed from that distribution.
As previously stated, with this redefined formula, large *t*-stats arising merely from very small $s_g$'s are avoided. This results in increased power, a reduced false non-discovery rate (fewer undetected DEGs), and a larger number of detected DEGs [8].
In the end we say we have moderated the residual sample standard deviations of each gene in the *t*-stats denominator by using the distribution of all of them across the population of genes.
<div style="background-color:#E0EEE0; padding:14px; font-family: sans-serif">
✅ The approach of using the posterior values results in shrinking the gene-wise residual sample variances ($s_g^2$) towards the prior mean, making inference more stable when only a small number of samples is available. </div>
<p>
</p>
`eBayes()` implements this empirical Bayes model to compute, for each gene and each contrast, the moderated *t*-statistics and their unadjusted *p*-values. Additionally, it returns moderated *F*-statistics and log-odds of differential expression. The moderated *F*-statistic tests whether any of the contrasts for a gene is non-zero ($H_0:\beta_{g}=0$), i.e., whether that gene is differentially expressed for any contrast; it is similar to the ordinary *F*-statistic from analysis of variance (ANOVA). The *t*-test does this for each individual contrast $j$ ($H_0:\beta_{gj}=0$).
<p class="link">
👉🏼 Check more about F-stats and other statistics computed by `eBayes()` here: https://support.bioconductor.org/p/6124/.
</p>
```{r eBayes}
## eBayes()
## 1. Compute the empirical Bayes statistics for DE
eBGene <- eBayes(fitGene)
## Outputs of interest:
## s2.prior -> prior residual variance (prior mean 1/s0^2)
## in prior distribution of residual variances
eBGene$s2.prior
## df.prior -> degrees of freedom d0 in prior distribution
## of residual variances
eBGene$df.prior
## s2.post -> posterior residual sample variances of the genes (~sg^2)
length(eBGene$s2.post)
head(eBGene$s2.post)
## t -> moderated t-stats of the genes for each contrast
dim(eBGene$t)
eBGene$t[1:5, 1:5]
## p.value: corresponding unadjusted p-values of moderated t-stats
dim(eBGene$p.value)
eBGene$p.value[1:5, 1:5]
```
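As a quick check (a sketch assuming a finite `df.prior` and no variance trend, which holds for the default `eBayes()` call used here), the posterior variances can be reproduced from the formula for $\tilde s_g^2$ above:
```{r check_s2post}
## s2.post = (d0 * s0^2 + dg * sg^2) / (d0 + dg)
d0 <- eBGene$df.prior
s02 <- eBGene$s2.prior
dg <- fitGene$df.residual
sg2 <- fitGene$sigma^2
manual_s2_post <- (d0 * s02 + dg * sg2) / (d0 + dg)
all.equal(unname(manual_s2_post), unname(eBGene$s2.post))
```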
<p class="exercise">
📝 **Exercise 5**: obtain the moderated *t*-stats and associated *p*-values of all genes in your data for all covariates included in your model.
</p>
<p>
</p>
### `topTable()`
This function, also provided by *limma*, summarizes the results of the linear model, performs hypothesis tests, and adjusts the *p*-values for multiple testing [12]. Among the summary statistics it returns are the log2FCs, moderated *t*-statistics, *p*-values, and FDR-adjusted *p*-values of the genes for a given contrast of interest. The default form of *p*-value adjustment is the **Benjamini-Hochberg** method to control the false discovery rate (FDR), which assumes independence between genes.
**Relevant concepts**:
* ***q*-value** → the **FDR-adjusted *p*-value**, used to control the **false discovery rate (FDR)**, i.e., the expected proportion of false discoveries among the discoveries (DEGs). By selecting as discoveries the genes with a *q*-value below $\alpha$, we control FDR ≤ $\alpha$.
Now we have the final statistics to determine which genes are DE.
```{r topTable}
## topTable()
## 1. Obtain gene-wise DE stats for Group (Nicotine vs Ctrl)
top_genes <- topTable(eBGene, coef = coef, p.value = 1, number = nrow(rse_gene_filt), sort.by = "none")
## Outputs for each gene and for the coeff selected (Group):
## logFC: log2-fold-changes
head(top_genes$logFC)
```
In *limma* the $\beta_{gj}$'s are the logFCs, as we can verify:
```{r betas_logFCs}
setdiff(top_genes$logFC, eBGene$coefficients[, "GroupExperimental"])
```
```{r topTable2}
## t: moderated t-stats
head(top_genes$t)
## . P.value: unadjusted p-values of t-stats
head(top_genes$P.Value)
## adj.P.Val: p-values adjusted to control the FDR
head(top_genes$adj.P.Val)
```
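Since `topTable()` uses the Benjamini-Hochberg method by default, the adjusted *p*-values can be reproduced with `p.adjust()` (a quick check, not part of the original pipeline):
```{r check_bh}
## adj.P.Val should match the BH-adjusted p-values
all.equal(p.adjust(top_genes$P.Value, method = "BH"), top_genes$adj.P.Val)
```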
After running these 5 steps, one of the first plots to look at is the histogram of the *p*-values from the moderated *t*-stats of the genes. If there are DEGs, we expect to see a flat distribution of *p*-values corresponding to non-DEGs plus a peak near *p*=0 for the DEGs (for which we reject the null hypothesis). If this peak is absent and the distribution looks uniform, we should not expect many DEGs to survive the multiple-testing correction.
```{r hist_p}
## Histogram of unadjusted p-values
hist(top_genes$P.Value, xlab = "p-values", main = "")
```
If the *p*-value distribution deviates markedly from these expected shapes, the best we can do is to explore whether specific groups of genes (e.g., lowly-expressed genes) present such variable *p*-values, and to revisit the assumptions and considerations of the statistical tests implemented [13].
<p class="exercise">
📝 **Exercise 6**: obtain the DE logFCs, *t*-stats, *p*-values, and adjusted *p*-values of the genes for a given contrast/covariate under study.
</p>
## DE visualization
DEGs are identified by defining a significance threshold on the adjusted *p*-values. Let's quantify the number of DEGs for nicotine exposure in pup brain and visualize their expression and DE statistics.
```{r quantify_DEGs}
## DEGs for FDR<0.05
de_genes <- top_genes[which(top_genes$adj.P.Val < 0.05), ]
## Number of DEGs
dim(de_genes)
```
### Volcano plots
A very practical and useful way to graphically represent DEGs and visualize their expression differences between conditions is a volcano plot. This is a scatter plot of the genes' logFCs on the x-axis vs. their adjusted *p*-values on a $-log_{10}$ scale on the y-axis.
```{r volcano_plot, warning=FALSE, message=FALSE}
library("ggplot2")
## Define up- and down-regulated DEGs, and non-DEGs
FDR <- 0.05
top_genes$DE <- ifelse(top_genes$adj.P.Val > FDR, "n.s.",
    ifelse(top_genes$logFC > 0, "Up", "Down")
)
## Colors, sizes and transparencies for up & down DEGs and non-DEGs
cols <- c("Up" = "indianred2", "Down" = "steelblue2", "n.s." = "grey")
sizes <- c("Up" = 1.3, "Down" = 1.3, "n.s." = 0.8)
alphas <- c("Up" = 0.4, "Down" = 0.6, "n.s." = 0.5)
## Plot volcano plot
ggplot(
data = top_genes,
aes(
x = logFC, y = -log10(adj.P.Val),
color = DE,
fill = DE,
size = DE,
alpha = DE
)
) +
geom_point(shape = 21) +
geom_hline(
yintercept = -log10(FDR),
linetype = "dashed", color = "gray35", linewidth = 0.5
) +
geom_vline(
xintercept = c(-1, 1),
linetype = "dashed", color = "gray35", linewidth = 0.5
) +
labs(y = "-log10(FDR)", x = "logFC(Nicotine vs Control)") +
theme_bw() +
scale_color_manual(values = cols, name = "Differential expression") +
scale_fill_manual(values = cols, name = "Differential expression") +
scale_size_manual(values = sizes, name = "Differential expression") +
scale_alpha_manual(values = alphas, name = "Differential expression") +
theme(
plot.margin = unit(c(1, 1, 1, 1), "cm"),
legend.key.height = unit(0.15, "cm"),
axis.title = element_text(size = (13)),
legend.title = element_text(size = 13),
legend.text = element_text(size = 12)
)
```
### Heat maps
Another common way to represent differential expression results is through a heat map. The package [*ComplexHeatmap*](https://bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html) offers a flexible toolkit to easily create heat maps with row and column annotations, a feature of particular value to plot expression data of genes across samples with multiple biological and technical differences. Although initially all genes in your data can be plotted, frequently only DEGs are included as they tend to show clearer gene expression patterns.
```{r complexHeatmap, warning=FALSE,message=FALSE}
library("ComplexHeatmap")
## We plot lognorm counts
lognorm_data <- assays(rse_gene_filt)$logcounts
## Subset to DEGs only
lognorm_data <- lognorm_data[rownames(de_genes), ]
## Define column (sample) names to display
colnames(lognorm_data) <- paste0("Pup_", seq_len(ncol(lognorm_data)))
```
<div style="background-color:#EDEDED; padding:20px; font-family: sans-serif; border: 1px solid black">
🗒️ **Notes**:
1) It is sometimes convenient to regress out the technical variables' contributions to gene expression in order to see the effects of interest more clearly. This can happen, for instance, when the logFCs are too small to see any significant differences in the plots or when there are other strong confounding factors. Functions such as [`cleaningY()`](https://rdrr.io/github/LieberInstitute/jaffelab/man/cleaningY.html) of *jaffelab* can be used for this purpose.
2) The lognorm counts have to be correctly centered (around zero) and scaled to make the differences in the expression of the genes more evident in the heat map. A simple way to do that is subtracting from each lognorm count $y_{gi}$ (for gene $g$ and sample $i$) the mean expression of the gene* and dividing by the standard deviation ($\sigma$) of the same gene's expression values. This is formally called the *z*-score: the number of standard deviations away from the mean.
$$
z=\frac{y_{gi} - \frac{\sum_{k=1}^{n}{y_{gk}}}{n}}{\sigma},
$$
$n$ is the number of samples.
<font size="2"> \* This can also be done by columns (samples), not only by rows (genes). </font>
<p class="link" style="background-color:#EDEDED">
👉🏼 For more on centering and scaling, see this video:
</p>
<p class="link" style="background-color:#EDEDED; text-align: center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/c8ffeFVUzk8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</p>
</div>
<p>
</p>
```{r scale_data}
## Center and scale the data to make differences more evident
lognorm_data <- (lognorm_data - rowMeans(lognorm_data)) / rowSds(lognorm_data)
## Sample annotation: Sex, Group, and library size
col_anno <- HeatmapAnnotation(
df = as.data.frame(colData(rse_gene_filt)[, c("Sex", "Group")]),
library_size = anno_barplot(colData(rse_gene_filt)$sum, gp = gpar(fill = "lightyellow2")),
col = list(
"Sex" = c("F" = "hotpink1", "M" = "dodgerblue"),
"Group" = c("Control" = "gray68", "Experimental" = "gold2")
)
)
## Gene annotation: logFC and biotype
de_genes$logFC_binary <- ifelse(de_genes$logFC > 0, ">0", "<0")
de_genes$protein_coding_gene <- ifelse(
    rowData(rse_gene_filt[rownames(de_genes), ])$gene_type == "protein_coding",
    "TRUE", "FALSE"
)
gene_anno <- rowAnnotation(
df = as.data.frame(cbind(
"logFC" = de_genes$logFC_binary,
"protein_coding_gene" = de_genes$protein_coding_gene
)),
col = list(
"logFC" = c("<0" = "deepskyblue3", ">0" = "brown2"),
"protein_coding_gene" = c("TRUE" = "darkseagreen3", "FALSE" = "magenta")
)
)
```
```{r heat_map, warning=FALSE, message=FALSE, fig.width=10, fig.height=6.9}
library("circlize")
## Plot
Heatmap(lognorm_data,
name = "lognorm counts",
show_row_names = FALSE,
top_annotation = col_anno,
left_annotation = gene_anno,
row_km = 2,
column_km = 2,
col = colorRamp2(c(-4, -0.0001, 0.0001, 4), c("darkblue", "lightblue", "lightsalmon", "darkred")),
row_title = "DEGs",
column_title = "Samples",
column_names_gp = gpar(fontsize = 7),
heatmap_width = unit(12.5, "cm"),
heatmap_height = unit(12.5, "cm")
)
```
<p class="exercise">
📝 **Exercise 7**: obtain the number of DEGs you got and represent them in a volcano plot and a heat map. Include all the sample and gene information you consider relevant in the latter.
</p>
## References {-}
1. Li, W. V., & Li, J. J. (2018). Modeling and analysis of RNA‐seq data: a review from a statistical perspective. Quantitative Biology, 6(3), 195-209.
2. Law, C. W., Chen, Y., Shi, W., & Smyth, G. K. (2014). voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome biology, 15(2), 1-17.
3. Smyth, G. K., Ritchie, M., Thorne, N., Wettenhall, J., Shi, W., & Hu, Y. (2002). limma: linear models for microarray and RNA-Seq data user’s guide. Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia.
4. van den Berg, S. M. (2022). Analysing data using linear models. Web site: https://bookdown.org/pingapang9/linear_models_bookdown/
5. Wikipedia. (n.d.). Ordinary least squares. Web site: https://en.wikipedia.org/wiki/Ordinary_least_squares
6. Taboga, Marco (2021). "Generalized least squares", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/generalized-least-squares.
7. Documentation for lmFit: https://rdrr.io/bioc/limma/man/lmFit.html
8. The Pennsylvania State University. (2018). Statistical Analysis of Genomics Data. Web site: https://online.stat.psu.edu/stat555/node/36/
9. Tushe, M. (2021). A Simple Trick to Understand the t-test. Web site: https://miroslavtushev.medium.com/a-simple-trick-to-understand-the-t-test-2c2a9e7f1dc5
10. Wikipedia. (n.d.). Empirical Bayes method. Web site: https://en.wikipedia.org/wiki/Empirical_Bayes_method
11. Smyth, G. K. (2004). Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical applications in genetics and molecular biology, 3(1).
12. Documentation for topTable: https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/toptable
13. Robinson, D. (2014). How to interpret a p-value histogram. Web site: http://varianceexplained.org/statistics/interpreting-pvalue-histogram/