-
Notifications
You must be signed in to change notification settings - Fork 0
/
correlation_analysis.Rmd
347 lines (254 loc) · 14.6 KB
/
correlation_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
---
title: "mos_w3"
author: "Jingsong Zhou"
date: "2023-07-01"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
# read the dataset
dataset <- read.table("my_agam_aging_expr_timeseries_Jun23_2023.tsv", header = TRUE)
```
### 1. Variance Analysis of Single Gene Expression over Time
In this analysis, we investigate whether there is evidence of variance in the expression of a single gene increasing with time. We iterate over a selected set of geneIDs and compute the variance of gene expression for each age group. This analysis helps us understand if the regulatory robustness of gene expression changes with age, which has implications for evolutionary perspectives on gene regulation.
```{r}
# Get unique geneIDs from the dataset
geneIDs <- unique(dataset$geneID)
# Split the dataset by geneID
gene_data_list <- split(dataset, dataset$geneID)
```
The goal of this analysis is to determine whether there is evidence of variance in the expression of a single gene increasing or decreasing with time. By examining the variance of gene expression at different ages and fitting a linear regression model, we can assess the significance of the age coefficient and determine if there is a positive/negative trend in variance over time.
```{r}
# Initialize the variables to count genes with changing trend
genes_with_increasing_variance <- 0
genes_with_decreasing_variance <- 0
increasing_trend_age_genes<-c()
decreasing_trend_age_genes<-c()
# Initialize an empty vector to store the coefficients
coefficients_vector <- c()
# Iterate over the gene IDs
for (i in 1:length(geneIDs)) {
gene <- geneIDs[i]
# Subset the dataset for the specific geneID
gene_data <- gene_data_list[[as.character(gene)]]
# Compute the variance of gene expression for each age
variance_by_age <- aggregate(transcription ~ age, data = gene_data, FUN = var)
# Fit a linear regression model
lm_model <- lm(variance_by_age$transcription ~ variance_by_age$age)
# Check the significance of the age coefficient
p_value <- summary(lm_model)$coefficients[2, 4]
# Check the sign of the age coefficient
coefficient <- summary(lm_model)$coefficients[2, 1]
# Store the coefficient in the vector
coefficients_vector <- c(coefficients_vector, coefficient)
# Make a conclusion based on the p-value and coefficient sign
if (p_value < 0.05 & coefficient > 0) {
# There is significant evidence of increasing variance over ages
genes_with_increasing_variance <- genes_with_increasing_variance + 1
increasing_trend_age_genes<-c(increasing_trend_age_genes, gene)
}
else if (p_value < 0.05 & coefficient < 0) {
# There is significant evidence of decreasing variance over ages
genes_with_decreasing_variance <- genes_with_decreasing_variance + 1
decreasing_trend_age_genes<-c(decreasing_trend_age_genes, gene)
}
}
# Calculate the proportion of genes with increasing variance
proportion_increasing_variance <- genes_with_increasing_variance / length(geneIDs)
# Calculate the proportion of genes with decreasing variance
proportion_decreasing_variance <- genes_with_decreasing_variance / length(geneIDs)
# Print the summary result
cat("Proportion of genes with increasing variance over time:", proportion_increasing_variance, "\n")
cat("Proportion of genes with decreasing variance over time:", proportion_decreasing_variance, "\n")
```
The code below generates a histogram to visualize the distribution of coefficients. The coefficients represent the relationship between age and gene expression variance for each gene. By examining the distribution, we can gain insights into the overall pattern and variability of the coefficients. This information helps us understand the extent to which gene expression variance changes with age.
```{r}
# Plot the distribution of coefficients
hist(coefficients_vector, main = "Distribution of Coefficients",
xlab = "Coefficient", ylab = "Proportion", freq = FALSE)
```
The code creates a quantile-quantile (QQ) plot to assess the normality of the coefficients. The QQ plot compares the observed distribution of coefficients against the expected distribution under normality. If the points on the plot roughly align with the diagonal line (qqline), it suggests that the coefficients follow a normal distribution. Departures from the diagonal line indicate deviations from normality, which can be important for statistical analyses that assume normality.
```{r}
# Create a quantile-quantile (QQ) plot
qqnorm(coefficients_vector)
qqline(coefficients_vector)
```
This section performs bootstrap resampling to estimate the uncertainty associated with a symmetry measure of the coefficients. The symmetry measure, in this case, is the mean of the bootstrap samples. We use Bootstrap method to resample the coefficients with replacement, calculate the mean of each bootstrap sample, and store it in the bootstrap_samples vector. The resulting vector represents a distribution of bootstrap symmetry measures. The confidence interval for the symmetry measure is then computed using the quantile function, specifying the desired percentiles. Finally, the confidence interval is printed to provide an estimate of the uncertainty surrounding the symmetry measure of the coefficients.
```{r}
# Set the number of bootstrap iterations
num_iterations <- 1000
# Initialize an empty vector to store bootstrap samples
bootstrap_samples <- numeric(num_iterations)
# Perform bootstrapping
for (i in 1:num_iterations) {
# Resample the coefficients with replacement
bootstrap_sample <- sample(coefficients_vector, replace = TRUE)
# Calculate the symmetry measure (mean) of the bootstrap sample
bootstrap_symmetry <- mean(bootstrap_sample) # Replace with the symmetry measure
# Store the symmetry measure in the bootstrap_samples vector
bootstrap_samples[i] <- bootstrap_symmetry
}
# Calculate the confidence interval for the symmetry measure
confidence_interval <- quantile(bootstrap_samples, c(0.025, 0.975))
# Print the confidence interval
cat("Bootstrap Confidence Interval:", confidence_interval, "\n")
```
For a more intuitive understanding, we use visualization methods for genes with increasing or decreasing trend in gene expression variance to observe the change in variance with age.
```{r}
# Set the maximum number of iterations
max_iterations <- 9
# Set the number of rows and columns for the grid
num_rows <- 3
num_cols <- 3
# Create a new blank plot
plot.new()
# Set up the grid layout
par(mfrow = c(num_rows, num_cols), mar = c(4, 4, 2, 2))
for (i in 1:max_iterations) {
gene <- increasing_trend_age_genes[i]
# Subset the dataset for the specific geneID
gene_data <- gene_data_list[[as.character(gene)]]
# Compute the variance of gene expression for each age
variance_by_age <- aggregate(transcription ~ age, data = gene_data, FUN = var)
# Plot the variance by age
plot(variance_by_age$age, variance_by_age$transcription, type = "b",
xlab = "Age", ylab = "Expression Variance",
main = paste0("Gene ID:", gene))
}
# Reset the plot layout
par(mfrow = c(1, 1))
```
```{r}
# Set the maximum number of iterations
max_iterations <- 9
# Set the number of rows and columns for the grid
num_rows <- 3
num_cols <- 3
# Create a new blank plot
plot.new()
# Set up the grid layout
par(mfrow = c(num_rows, num_cols), mar = c(4, 4, 2, 2))
for (i in 1:max_iterations) {
gene <- decreasing_trend_age_genes[i]
# Subset the dataset for the specific geneID
gene_data <- gene_data_list[[as.character(gene)]]
# Compute the variance of gene expression for each age
variance_by_age <- aggregate(transcription ~ age, data = gene_data, FUN = var)
# Plot the variance by age
plot(variance_by_age$age, variance_by_age$transcription, type = "b",
xlab = "Age", ylab = "Gene Expression Variance",
main = paste0("Gene ID:", gene))
}
# Reset the plot layout
par(mfrow = c(1, 1))
```
### 2. Correlation Analysis of Gene Expression between Different Genes over Time
The code analyzes the correlation between two selected genes over different ages. It subsets the dataset to extract the gene data, calculates the correlation coefficient for each age, and plots the correlation coefficients over time. This analysis helps to understand how the correlation between the genes changes with age, providing insights into gene interactions and potential biological processes related to aging or development.
```{r}
correlation_genes_over_ages <- function(selected_gene1,selected_gene2){ #This function computes the correlation for the chosen genes pair over ages and return a vector containing the correlation
# Subset the dataset for the selected genes
selected_gene1_data <- gene_data_list[[as.character(selected_gene1)]]
selected_gene2_data <- gene_data_list[[as.character(selected_gene2)]]
# Initialize vectors to store correlation coefficients and ages
correlation_coefficients <- c()
# Iterate over unique ages
for (i in unique(dataset$age)) {
# Subset the data for the current age
subset_gene1_data <- subset(selected_gene1_data, age == i)
subset_gene2_data <- subset(selected_gene2_data, age == i)
# Check if there are multiple measurements for the same experimentID
if (length(unique(dataset$experimentID)) != nrow(selected_gene1_data)) {
# Calculate the average transcription value for each experimentID
subset_gene1_data <- aggregate(transcription ~ experimentID, data = subset_gene1_data, FUN = mean)
}
if (length(unique(dataset$experimentID)) != nrow(selected_gene2_data)) {
# Calculate the average transcription value for each experimentID
subset_gene2_data <- aggregate(transcription ~ experimentID, data = subset_gene2_data, FUN = mean)
}
# Merge the data based on experimentID
merged_data <- merge(subset_gene1_data, subset_gene2_data, by = "experimentID")
# Extract the gene1 and gene2 data for the current age
gene1_data <- merged_data$transcription.x
gene2_data <- merged_data$transcription.y
# Calculate the correlation coefficient
correlation <- cor(gene1_data, gene2_data)
# Store the correlation coefficient and age
correlation_coefficients <- c(correlation_coefficients, correlation)
}
return(correlation_coefficients)
}
```
The code performs a correlation analysis between randomly selected pairs of genes over different ages. The resulting plots provide insights into the variability and patterns of gene-gene correlations across different gene pairs.
```{r}
# Set the maximum number of iterations
max_iterations <- 9
# Set the number of rows and columns for the grid
num_rows <- 3
num_cols <- 3
# Create a new blank plot
plot.new()
# Set up the grid layout
par(mfrow = c(num_rows, num_cols), mar = c(4, 4, 2, 2))
for (i in 1:max_iterations) {
# Randomly select a pair of genes for correlation analysis
random_genes <- sample(geneIDs, size = 2) # we can adjust the size as desired
# Compute the correlation
correlation_coefficients <- correlation_genes_over_ages(random_genes[1],random_genes[2])
# Plot the correlation coefficients over ages
plot(unique(dataset$age), correlation_coefficients,type = 'b', pch = 19, xlab = "Age", ylab = "Correlation Coefficient", main = "Correlation Coefficient over Ages")
}
# Reset the plot layout
par(mfrow = c(1, 1))
```
The goal of this analysis is to determine whether there is evidence of correlation between the expression of two genes increasing or decreasing with time. By examining the correlation of gene expression at different ages and fitting a linear regression model, we can assess the significance of the age coefficient and determine if there is a positive/negative trend in correlation over time.
In the below code, we randomly select gene pairs and calculate the correlation between their expression over age. We then fit a linear regression model to determine the trend based on the slope of the regression line. Gene pairs with positive slopes indicate increasing trends, while gene pairs with negative slopes indicate decreasing trends. The gene pairs with increasing and decreasing trends are stored in separate vectors.
```{r}
# Initialize the variables to count genes with changing trend
genes_with_increasing_correlation <- 0
genes_with_decreasing_correlation <- 0
ages <- unique(dataset$age)
# Initialize an empty vector to store the coefficients
cor_coefficients_vector <- c()
# Set the maximum number of iterations
max_iterations <- 200
for (i in 1:max_iterations) {
# Randomly select a pair of genes for correlation analysis
random_genes <- sample(geneIDs, size = 2)
# Compute the correlation of gene expression for each age
correlation_coefficients <- correlation_genes_over_ages(random_genes[1],random_genes[2])
# Fit a linear regression model
lm_model <- lm(correlation_coefficients ~ ages)
# Check the significance of the age coefficient
p_value <- summary(lm_model)$coefficients[2, 4]
# Check the sign of the age coefficient
coefficient <- summary(lm_model)$coefficients[2, 1]
# Store the coefficient in the vector
cor_coefficients_vector <- c(cor_coefficients_vector, coefficient)
# Make a conclusion based on the p-value and coefficient sign
if (p_value < 0.1 & coefficient > 0) {
# There is significant evidence of increasing correlation over ages
genes_with_increasing_correlation <- genes_with_increasing_correlation + 1
}
else if (p_value < 0.1 & coefficient < 0) {
# There is significant evidence of decreasing correlation over ages
genes_with_decreasing_correlation <- genes_with_decreasing_correlation + 1
}
}
# Calculate the proportion of genes with increasing correlation
proportion_increasing_correlation <- genes_with_increasing_correlation / max_iterations
# Calculate the proportion of genes with decreasing variance
proportion_decreasing_correlation <- genes_with_decreasing_correlation / max_iterations
# Print the summary result
cat("Proportion of genes with increasing correlation over time:", proportion_increasing_correlation, "\n")
cat("Proportion of genes with decreasing correlation over time:", proportion_decreasing_correlation, "\n")
# Plot the distribution of coefficients
hist(coefficients_vector, main = "Distribution of Coefficients",
xlab = "Coefficient", ylab = "Proportion", freq = FALSE)
```
```{r}
# Create a quantile-quantile (QQ) plot
qqnorm(coefficients_vector)
qqline(coefficients_vector)
```