Correlation of differential gene expression between fetal and adult brains.Rmd

---
title: Examine the association between gene expression and age_group variable (fetus vs adult)
author: Vy K Phung
output:
  rmarkdown::pdf_document:
    toc: true
  vignette: >  
    %\VignetteIndexEntry{Developmental regulation of human cortex transcription project}
    %\VignetteEngine{knitr::rmarkdown}
    \usepackage[utf8]{inputenc}
---

## Introduction

The purpose of this re-analysis is to examine the correlation of differential gene expression between fetal and adult brains, which is evaluated through RNA-sequencing. If it has correlation, then count how many up-regulated and down-regulated genes. I will do exploratory analysis and statistical analysis by using R, RStudio. 

## Load library 
```{r, results="hide", warning=FALSE, message=FALSE}
library(tidyverse)
library(limma)
library(preprocessCore)
library(RColorBrewer)
library(org.Hs.eg.db)
library(AnnotationDbi)
library(edge)
library(sva)
library(DESeq2)
library(broom)
library(readxl)
```
## Data preprocessing 
```{r}
count <- read.delim("D:/word/bioinformatics/personal project/PRJNA245228/tidy data/count.csv")
pheno <- read.csv("D:/word/bioinformatics/personal project/PRJNA245228/sample_data/pheno_sample.csv")
head(count)
head(pheno)
```

It is clear that there are some duplications in columns "SYMBOL"(gene symbol) and "ENTREZID", for example, in line 5,6 of "count" table. I will remove duplicated genes and use gene symbol as row name.
```{r}
dup <- duplicated(count$SYMBOL)
table(dup)
count_symbol<- count[!dup,-1:-2]
na <- is.na(count_symbol$SYMBOL)
count_symbol <- count_symbol[!na,]
row.names(count_symbol) <- count_symbol$SYMBOL
count_symbol <- count_symbol[,-1]
head(count_symbol)
```
## Data exploration

Although the target of this exploratory analysis is figuring out if there is a correlation between differential gene expression in fetus vs adult brains, I still plot PCA for the sex group in this data exploration to have a brief overview if there might be an association between sex variable (female vs male) with gene expression in fetus vs adult. 

At first, I will explore unfiltered data which still has some genes' row names having 0 reads in "count_symbol" table by using DESeq2. 

### Unfiltered data
```{r,warning=FALSE,message=FALSE}
library(DESeq2)
edata <- DESeqDataSetFromMatrix(countData = count_symbol, colData = pheno, design = ~ age_group)
```
```{r}
edata_tr <- rlog(edata, blind = FALSE)
plotPCA(edata_tr, intgroup = c("age_group"))
plotPCA(edata_tr, intgroup = c("sex"))
```

According to plotPCA having intgroup "age_group", there can be an association between differential gene expression and age_group(fetus vs adult). However, in plotPCA having intgroup "sex", we can see that there might be no association between gender and gene expression. 

I will also show the table of data transform of the above count_symbol table and the cluster of it in Dendogram so that we could easily visualize the correlation between those samples. 
```{r}
edata_tr <- assay(edata_tr)
head(edata_tr)
dist_samples <- dist(t(edata_tr))
gene_fit <- hclust(dist_samples, method="ward.D")
plot(gene_fit, hang=-1)
```

According to cluster Dendogram, there are 2 main branches. The smaller branches SRR15545(41,67,38,68,37) on the main left are fetus group, while the others on the main right are in adult group. This visualization reinforces the hypothesis that there can be a correlation between differential gene expression in fetus vs adult.  

Additionally, I also visualize the frequency of number of reads in the count table when data has been transformed on histogram and boxplot. We can see that mostly the reads are below 20 and in the range between 5 to 15. 
```{r}
hist(edata_tr,breaks=100,col=2,xlim=c(0,20),ylim=c(0,6000))
boxplot(edata_tr,col=2,range=0)
```

### Filtered data
Removing lowly expressed genes and using DESeq to explore data as the same way as above unfiltered data in order to see if there are any differences between filtered and unfiltered one. 
```{r,warning=FALSE,message=FALSE}
count_filter = count_symbol[rowMeans(count_symbol) > 100,] 
edata <- DESeqDataSetFromMatrix(countData = count_filter, colData = pheno, design = ~ age_group)
```

```{r}
edata_tr <- rlog(edata, blind = FALSE)
plotPCA(edata_tr, intgroup = c("age_group"))
plotPCA(edata_tr, intgroup = c("sex"))
edata_tr <- assay(edata_tr)
head(edata_tr)
dist_samples <- dist(t(edata_tr))
gene_fit <- hclust(dist_samples, method="ward.D")
plot(gene_fit, hang=-1)
```

We can see that there is almost no difference in the results of unfiltered and filtered data except on the cluster Dendogram of filtered one, the relation between SRR1554568 and SRR1554537 are not close-related as same as that of unfiltered.    
```{r}
hist(edata_tr,breaks=100,col=2,xlim=c(0,20),ylim=c(0,6000))
boxplot(edata_tr,col=2,range=0)
```

### Stratified analysis 

Because I want to explore if the sex variable (male vs female) might effect the association between age_group variable and gene expression, I get the female data in 10 samples above and analyse the factor age_group in this gender and do the same with the male gender.

#### Female

There are 4/10 samples having sex is female

```{r}
pheno_female = pheno[pheno$sex == "female",]
pheno_female
female_run <- (colnames(count_symbol) %in% pheno_female$Run)
table(female_run)
count_female = count_symbol[,female_run]
head(count_female)
```
```{r, warning=FALSE,message=FALSE}
edata_fe <- DESeqDataSetFromMatrix(count_female, pheno_female, ~age_group)
```
```{r}
edata_fe_tr <- rlog(edata_fe, blind = FALSE)
plotPCA(edata_fe_tr, intgroup = c("age_group"))
```

#### Male
```{r}
pheno_male = pheno[pheno$sex == "male",]
pheno_male
male_run <- (colnames(count_symbol) %in% pheno_male$Run)
table(male_run)
count_male = count_symbol[,male_run]
head(count_male)
```
```{r, warning=FALSE,message=FALSE}
edata_ma <- DESeqDataSetFromMatrix(count_male, pheno_male, ~age_group)
```
```{r}
edata_ma_tr <- rlog(edata_ma, blind = FALSE)
plotPCA(edata_ma_tr, intgroup = c("age_group"))
```

For both female and male, the visualization of plotPCA shows that there still can be an association between gene expression with age_group variable(fetus vs adult).

## Statistical analysis
The target of this statistical analysis is examining the correlation between age_group variable (fetus or adult) and gene expression more clearly so that in the end I will count up and down regulated genes.

In this statistical analysis, I will use limma package and DESeq package to see if there are any different results between two methods. 

I will use DEseq package to analyze edata (output of DESeq) and Limma package to analyze edata_tr (edata has been transformed) 

### Unadjusted data
At first, I will analyze statistically variable age_group without adjustment factor (sex variable). 

#### Fit regression with limma package

```{r,warning=FALSE,message=FALSE}
# age_group
mod_age = model.matrix(~ pheno$age_group)
fit_limma_age = lmFit(edata_tr,mod_age)
ebayes_limma_age = eBayes(fit_limma_age)
re = topTable(ebayes_limma_age, number=dim(count_symbol)[1])
head(re)
```

```{r,warning=FALSE,message=FALSE}
# Statistics 
hist(ebayes_limma_age$t[,2], col=2)
# P-values
pval_limma_age = topTable(ebayes_limma_age, number=dim(edata_tr)[1])$P.Value
hist(pval_limma_age)
# Adjusted p-values
adj_pval_limma_age = topTable(ebayes_limma_age,number=dim(edata_tr)[1])$adj.P.Val
hist(adj_pval_limma_age)
```

According to the histogram of adjusted p-value, it suggests that there might be an association between age_group variable and gene expression, which means there is a differential gene expression between fetal and adult brains in these samples. 

A number of genes have adjusted p-value less than 0.05
```{r}
sum(re$adj.P.Val < 0.05)
```

#### Fit regression with DESeq

```{r, warning=FALSE,message=FALSE}
dds <- DESeq(edata)
```
```{r}
res <- results(dds)
res = as.data.frame(res)
head(res)
```

```{r}
# Statistic 
hist(res$stat)
# P-values
hist(res$pvalue)
#Adjusted p-values
hist(res$padj)
```

We can see that there is also an association between gene expression with age_group as same as the above result of limma package. 

A number of genes have adjusted p-value less than 0.05
```{r}
table(res$padj <0.05)
```

### Adjusted data
Because I suspect that sex variable can adjust the association between gene expression with age_group factor, I will analyze statistically data with adjustment factor. 

#### Fit regression with limma package with adjustment factor
```{r}
mod_adj = model.matrix(~ pheno$age_group+pheno$sex)
fit_limma_adj = lmFit(edata_tr,mod_adj)
ebayes_limma_adj <- eBayes(fit_limma_adj)
names(ebayes_limma_adj)
```

```{r,warning=FALSE,message=FALSE}
# Statistics 
hist(ebayes_limma_adj$t[,2], col=2)
# P-values
pval_limma_adj = topTable(ebayes_limma_adj,number=dim(edata_tr)[1])$P.Value
hist(pval_limma_adj)
# Adjusted p-values
adj_pval_limma_adj = topTable(ebayes_limma_adj,number=dim(edata_tr)[1])$adj.P.Val
hist(adj_pval_limma_adj)
```

```{r, warning=FALSE,message=FALSE}
re_adj = topTable(ebayes_limma_adj,number=dim(edata_tr)[1])
head(re_adj)
sum(re_adj$adj.P.Val < 0.05)
```

#### Fit regression with DESeq with adjustment factor
```{r,warning=FALSE,message=FALSE}
de_adj = DESeqDataSetFromMatrix(countData = count_filter, colData = pheno,~ age_group + sex)
glm_all_adj = DESeq(de_adj)
```
```{r}
results_adj = results(glm_all_adj)
results_adj = as.data.frame(results_adj)
head(results_adj)
```



```{r}
# Statistic 
hist(results_adj$stat)
# P-values
hist(results_adj$pvalue)
#Adjusted p-values
hist(results_adj$padj)
```

In the adjustment section, we can see that in the limma package, the results having adjusted p-value less than 0.05 are 8232, which also account for about 52.9% in the total of 15559 genes. However, in DESeq package, the results less than 0.05 are a few, and according to the the histogram of results_adj pvalue from DESeq package, it is likely that there is no association between gene expression with adjusted data(including age_group variable and sex variable).

## Count up and down regulated genes 
Because of clear evidence of correlation when analyzing unadjusted data, which only have age_group variable, we can see that there is a correlation between the age_group factor and gene expression. Therefore, I will count the up-regulated genes, which are highly expressed and up-regulation in human especially for fetus, and down-regulated genes, which are down-regulation in human especially when getting older and as a result highly appear in the adult than the fetus

I will get up-regulated and down-regulated genes from unadjusted data of both limma package and DESeq package.  

### Limma package

**The number of up-regulated genes** 
```{r}
sum(re$adj.P.Val <0.05 & re$logFC > 1)
```
```{r}
up_limma <- re %>% filter (logFC > 1 & adj.P.Val < 0.05) %>% arrange(adj.P.Val)
head(up_limma)
```

**The number of down-regulated genes**
```{r}
sum(re$adj.P.Val <0.05 & re$logFC < -1)
```
```{r}
down_limma <- re %>% filter (logFC < -1 & adj.P.Val < 0.05) %>% arrange(adj.P.Val)
head(down_limma)
```

### DESeq package


**The number of up-regulated genes**
```{r}
sum(res$padj < 0.05 & res$log2FoldChange > 1, na.rm=TRUE)
```
```{r}
up_de <- res %>% filter (log2FoldChange > 1 & padj < 0.05) %>% arrange(padj)
head(up_de)
```


**The number of down-regulated genes**
```{r}
sum(res$padj < 0.05 & res$log2FoldChange < -1, na.rm=TRUE)
```
```{r}
down_de <- res %>% filter (log2FoldChange < -1 & padj < 0.05) %>% arrange(padj)
head(down_de)
```

**The number of common up-regulated and also down-regulated genes from both DESeq package and Limma package**
```{r}
up <- rownames(up_de) %in% rownames(up_limma)
table(up)
down <- rownames(down_de) %in% rownames(down_limma)
table(down)
up = up_de[up,]
head(up)
down = down_de[down,]
head(down)
```

As we can see, there are 2540 common up-regulated genes
and 3026 common down-regulated genes between limma package and DESeq package.


In addition to analyzing the correlation in R, I also want to predict and classify some characteristics of those 10 samples such as gender, or age by using Python. Below is the preparation for that process. 

## Preparing data for prediction and classification

```{r}
up_down_reg = rbind(up,down)
dim(up_down_reg)
```
```{r}
up_down_tr <- (rownames(edata_tr) %in% rownames(up_down_reg))
table(up_down_tr)
up_down = edata_tr[up_down_tr,]
head(up_down)
```

```{r}
# df is saved in a name "data for regulated gene.csv"
df <- merge(up_down_reg,up_down, by =0)
row.names(df) <- df$Row.names
df = df[,-1]
head(df)
# save file
# write.csv(df, file = "D:/word/bioinformatics/personal project/genomic-data-science-project-about-fetus-and-adult/tidy data/data for regulated gene.csv")
```