-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathCorrelation of differential gene expression between fetal and adult brains.Rmd
362 lines (296 loc) · 12.8 KB
/
Correlation of differential gene expression between fetal and adult brains.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
---
title: Examine the association between gene expression and age_group variable (fetus vs adult)
author: Vy K Phung
output:
rmarkdown::pdf_document:
toc: true
vignette: >
%\VignetteIndexEntry{Developmental regulation of human cortex transcription project}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
## Introduction
The purpose of this re-analysis is to examine the correlation of differential gene expression between fetal and adult brains, which is evaluated through RNA-sequencing. If it has correlation, then count how many up-regulated and down-regulated genes. I will do exploratory analysis and statistical analysis by using R, RStudio.
## Load library
```{r, results="hide", warning=FALSE, message=FALSE}
library(tidyverse)
library(limma)
library(preprocessCore)
library(RColorBrewer)
library(org.Hs.eg.db)
library(AnnotationDbi)
library(edge)
library(sva)
library(DESeq2)
library(broom)
library(readxl)
```
## Data preprocessing
```{r}
count <- read.delim("D:/word/bioinformatics/personal project/PRJNA245228/tidy data/count.csv")
pheno <- read.csv("D:/word/bioinformatics/personal project/PRJNA245228/sample_data/pheno_sample.csv")
head(count)
head(pheno)
```
It is clear that there are some duplications in columns "SYMBOL"(gene symbol) and "ENTREZID", for example, in line 5,6 of "count" table. I will remove duplicated genes and use gene symbol as row name.
```{r}
dup <- duplicated(count$SYMBOL)
table(dup)
count_symbol<- count[!dup,-1:-2]
na <- is.na(count_symbol$SYMBOL)
count_symbol <- count_symbol[!na,]
row.names(count_symbol) <- count_symbol$SYMBOL
count_symbol <- count_symbol[,-1]
head(count_symbol)
```
## Data exploration
Although the target of this exploratory analysis is figuring out if there is a correlation between differential gene expression in fetus vs adult brains, I still plot PCA for the sex group in this data exploration to have a brief overview if there might be an association between sex variable (female vs male) with gene expression in fetus vs adult.
At first, I will explore unfiltered data which still has some genes' row names having 0 reads in "count_symbol" table by using DESeq2.
### Unfiltered data
```{r,warning=FALSE,message=FALSE}
library(DESeq2)
edata <- DESeqDataSetFromMatrix(countData = count_symbol, colData = pheno, design = ~ age_group)
```
```{r}
edata_tr <- rlog(edata, blind = FALSE)
plotPCA(edata_tr, intgroup = c("age_group"))
plotPCA(edata_tr, intgroup = c("sex"))
```
According to plotPCA having intgroup "age_group", there can be an association between differential gene expression and age_group(fetus vs adult). However, in plotPCA having intgroup "sex", we can see that there might be no association between gender and gene expression.
I will also show the table of data transform of the above count_symbol table and the cluster of it in Dendogram so that we could easily visualize the correlation between those samples.
```{r}
edata_tr <- assay(edata_tr)
head(edata_tr)
dist_samples <- dist(t(edata_tr))
gene_fit <- hclust(dist_samples, method="ward.D")
plot(gene_fit, hang=-1)
```
According to cluster Dendogram, there are 2 main branches. The smaller branches SRR15545(41,67,38,68,37) on the main left are fetus group, while the others on the main right are in adult group. This visualization reinforces the hypothesis that there can be a correlation between differential gene expression in fetus vs adult.
Additionally, I also visualize the frequency of number of reads in the count table when data has been transformed on histogram and boxplot. We can see that mostly the reads are below 20 and in the range between 5 to 15.
```{r}
hist(edata_tr,breaks=100,col=2,xlim=c(0,20),ylim=c(0,6000))
boxplot(edata_tr,col=2,range=0)
```
### Filtered data
Removing lowly expressed genes and using DESeq to explore data as the same way as above unfiltered data in order to see if there are any differences between filtered and unfiltered one.
```{r,warning=FALSE,message=FALSE}
count_filter = count_symbol[rowMeans(count_symbol) > 100,]
edata <- DESeqDataSetFromMatrix(countData = count_filter, colData = pheno, design = ~ age_group)
```
```{r}
edata_tr <- rlog(edata, blind = FALSE)
plotPCA(edata_tr, intgroup = c("age_group"))
plotPCA(edata_tr, intgroup = c("sex"))
edata_tr <- assay(edata_tr)
head(edata_tr)
dist_samples <- dist(t(edata_tr))
gene_fit <- hclust(dist_samples, method="ward.D")
plot(gene_fit, hang=-1)
```
We can see that there is almost no difference in the results of unfiltered and filtered data except on the cluster Dendogram of filtered one, the relation between SRR1554568 and SRR1554537 are not close-related as same as that of unfiltered.
```{r}
hist(edata_tr,breaks=100,col=2,xlim=c(0,20),ylim=c(0,6000))
boxplot(edata_tr,col=2,range=0)
```
### Stratified analysis
Because I want to explore if the sex variable (male vs female) might effect the association between age_group variable and gene expression, I get the female data in 10 samples above and analyse the factor age_group in this gender and do the same with the male gender.
#### Female
There are 4/10 samples having sex is female
```{r}
pheno_female = pheno[pheno$sex == "female",]
pheno_female
female_run <- (colnames(count_symbol) %in% pheno_female$Run)
table(female_run)
count_female = count_symbol[,female_run]
head(count_female)
```
```{r, warning=FALSE,message=FALSE}
edata_fe <- DESeqDataSetFromMatrix(count_female, pheno_female, ~age_group)
```
```{r}
edata_fe_tr <- rlog(edata_fe, blind = FALSE)
plotPCA(edata_fe_tr, intgroup = c("age_group"))
```
#### Male
```{r}
pheno_male = pheno[pheno$sex == "male",]
pheno_male
male_run <- (colnames(count_symbol) %in% pheno_male$Run)
table(male_run)
count_male = count_symbol[,male_run]
head(count_male)
```
```{r, warning=FALSE,message=FALSE}
edata_ma <- DESeqDataSetFromMatrix(count_male, pheno_male, ~age_group)
```
```{r}
edata_ma_tr <- rlog(edata_ma, blind = FALSE)
plotPCA(edata_ma_tr, intgroup = c("age_group"))
```
For both female and male, the visualization of plotPCA shows that there still can be an association between gene expression with age_group variable(fetus vs adult).
## Statistical analysis
The target of this statistical analysis is examining the correlation between age_group variable (fetus or adult) and gene expression more clearly so that in the end I will count up and down regulated genes.
In this statistical analysis, I will use limma package and DESeq package to see if there are any different results between two methods.
I will use DEseq package to analyze edata (output of DESeq) and Limma package to analyze edata_tr (edata has been transformed)
### Unadjusted data
At first, I will analyze statistically variable age_group without adjustment factor (sex variable).
#### Fit regression with limma package
```{r,warning=FALSE,message=FALSE}
# age_group
mod_age = model.matrix(~ pheno$age_group)
fit_limma_age = lmFit(edata_tr,mod_age)
ebayes_limma_age = eBayes(fit_limma_age)
re = topTable(ebayes_limma_age, number=dim(count_symbol)[1])
head(re)
```
```{r,warning=FALSE,message=FALSE}
# Statistics
hist(ebayes_limma_age$t[,2], col=2)
# P-values
pval_limma_age = topTable(ebayes_limma_age, number=dim(edata_tr)[1])$P.Value
hist(pval_limma_age)
# Adjusted p-values
adj_pval_limma_age = topTable(ebayes_limma_age,number=dim(edata_tr)[1])$adj.P.Val
hist(adj_pval_limma_age)
```
According to the histogram of adjusted p-value, it suggests that there might be an association between age_group variable and gene expression, which means there is a differential gene expression between fetal and adult brains in these samples.
A number of genes have adjusted p-value less than 0.05
```{r}
sum(re$adj.P.Val < 0.05)
```
#### Fit regression with DESeq
```{r, warning=FALSE,message=FALSE}
dds <- DESeq(edata)
```
```{r}
res <- results(dds)
res = as.data.frame(res)
head(res)
```
```{r}
# Statistic
hist(res$stat)
# P-values
hist(res$pvalue)
#Adjusted p-values
hist(res$padj)
```
We can see that there is also an association between gene expression with age_group as same as the above result of limma package.
A number of genes have adjusted p-value less than 0.05
```{r}
table(res$padj <0.05)
```
### Adjusted data
Because I suspect that sex variable can adjust the association between gene expression with age_group factor, I will analyze statistically data with adjustment factor.
#### Fit regression with limma package with adjustment factor
```{r}
mod_adj = model.matrix(~ pheno$age_group+pheno$sex)
fit_limma_adj = lmFit(edata_tr,mod_adj)
ebayes_limma_adj <- eBayes(fit_limma_adj)
names(ebayes_limma_adj)
```
```{r,warning=FALSE,message=FALSE}
# Statistics
hist(ebayes_limma_adj$t[,2], col=2)
# P-values
pval_limma_adj = topTable(ebayes_limma_adj,number=dim(edata_tr)[1])$P.Value
hist(pval_limma_adj)
# Adjusted p-values
adj_pval_limma_adj = topTable(ebayes_limma_adj,number=dim(edata_tr)[1])$adj.P.Val
hist(adj_pval_limma_adj)
```
```{r, warning=FALSE,message=FALSE}
re_adj = topTable(ebayes_limma_adj,number=dim(edata_tr)[1])
head(re_adj)
sum(re_adj$adj.P.Val < 0.05)
```
#### Fit regression with DESeq with adjustment factor
```{r,warning=FALSE,message=FALSE}
de_adj = DESeqDataSetFromMatrix(countData = count_filter, colData = pheno,~ age_group + sex)
glm_all_adj = DESeq(de_adj)
```
```{r}
results_adj = results(glm_all_adj)
results_adj = as.data.frame(results_adj)
head(results_adj)
```
```{r}
# Statistic
hist(results_adj$stat)
# P-values
hist(results_adj$pvalue)
#Adjusted p-values
hist(results_adj$padj)
```
In the adjustment section, we can see that in the limma package, the results having adjusted p-value less than 0.05 are 8232, which also account for about 52.9% in the total of 15559 genes. However, in DESeq package, the results less than 0.05 are a few, and according to the the histogram of results_adj pvalue from DESeq package, it is likely that there is no association between gene expression with adjusted data(including age_group variable and sex variable).
## Count up and down regulated genes
Because of clear evidence of correlation when analyzing unadjusted data, which only have age_group variable, we can see that there is a correlation between the age_group factor and gene expression. Therefore, I will count the up-regulated genes, which are highly expressed and up-regulation in human especially for fetus, and down-regulated genes, which are down-regulation in human especially when getting older and as a result highly appear in the adult than the fetus
I will get up-regulated and down-regulated genes from unadjusted data of both limma package and DESeq package.
### Limma package
**The number of up-regulated genes**
```{r}
sum(re$adj.P.Val <0.05 & re$logFC > 1)
```
```{r}
up_limma <- re %>% filter (logFC > 1 & adj.P.Val < 0.05) %>% arrange(adj.P.Val)
head(up_limma)
```
**The number of down-regulated genes**
```{r}
sum(re$adj.P.Val <0.05 & re$logFC < -1)
```
```{r}
down_limma <- re %>% filter (logFC < -1 & adj.P.Val < 0.05) %>% arrange(adj.P.Val)
head(down_limma)
```
### DESeq package
**The number of up-regulated genes**
```{r}
sum(res$padj < 0.05 & res$log2FoldChange > 1, na.rm=TRUE)
```
```{r}
up_de <- res %>% filter (log2FoldChange > 1 & padj < 0.05) %>% arrange(padj)
head(up_de)
```
**The number of down-regulated genes**
```{r}
sum(res$padj < 0.05 & res$log2FoldChange < -1, na.rm=TRUE)
```
```{r}
down_de <- res %>% filter (log2FoldChange < -1 & padj < 0.05) %>% arrange(padj)
head(down_de)
```
**The number of common up-regulated and also down-regulated genes from both DESeq package and Limma package**
```{r}
up <- rownames(up_de) %in% rownames(up_limma)
table(up)
down <- rownames(down_de) %in% rownames(down_limma)
table(down)
up = up_de[up,]
head(up)
down = down_de[down,]
head(down)
```
As we can see, there are 2540 common up-regulated genes
and 3026 common down-regulated genes between limma package and DESeq package.
In addition to analyzing the correlation in R, I also want to predict and classify some characteristics of those 10 samples such as gender, or age by using Python. Below is the preparation for that process.
## Preparing data for prediction and classification
```{r}
up_down_reg = rbind(up,down)
dim(up_down_reg)
```
```{r}
up_down_tr <- (rownames(edata_tr) %in% rownames(up_down_reg))
table(up_down_tr)
up_down = edata_tr[up_down_tr,]
head(up_down)
```
```{r}
# df is saved in a name "data for regulated gene.csv"
df <- merge(up_down_reg,up_down, by =0)
row.names(df) <- df$Row.names
df = df[,-1]
head(df)
# save file
# write.csv(df, file = "D:/word/bioinformatics/personal project/genomic-data-science-project-about-fetus-and-adult/tidy data/data for regulated gene.csv")
```