-
Notifications
You must be signed in to change notification settings - Fork 0
/
Homework2.Rmd
647 lines (532 loc) · 26.9 KB
/
Homework2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
---
title: "Stat 115 2019: Homework 2"
author: "(your name)"
date: "Due: February 24, 2019"
output: html_document
---
```{r libraries, message = FALSE}
options(repos = c(CRAN = "http://cran.rstudio.com"))
library(affy)
library(affyPLM)
library(limma)
library(hgu133plus2.db)
library(sva)
library(ggplot2)
library(dplyr)
library(biobroom)
library(class)
library(e1071)
library(xlsx)
library(fdrtool)
library(affxparser)
```
# Part I. Differential Expression
In this part, we will continue to analyze the microarray gene expression
dataset from HW1.
1. The RMA run in HW1 will estimate the expression index of the samples
in a matrix. Use LIMMA, find the differentially expressed genes between
siEZH2 and control in LNCaP cells, and repeat the same in ABL cells. Use
false discovery rate (FDR) 0.05 and fold change (FC) 1.3 as cutoff to
filter the final result list. How many genes are differentially
expressed at this cutoff? Count in terms of gene symbol (e.g. TP53)
instead of transcript like (e.g. NM_000546)?
```{r part1-load, cache = TRUE}
celFiles <- list.celfiles(path = "data", full.names=TRUE)
data.affy <- ReadAffy(filenames = celFiles)
data.rma <- rma(data.affy)
expr.rma <- exprs(data.rma)
```
```{r part1-limma}
# your code here
# knitr::opts_chunk$set(echo = TRUE, fig.align = "center") #to setup for figure alignment
# print(dir("./data"))
#=========================================================#
fit_limma <- function (expr.rma, design.mat){
limma_fit <- lmFit(expr.rma, design.mat) %>%
eBayes() %>%
tidy() %>%
mutate(fdr = p.adjust(p.value, method = "fdr")) %>%
# mutate(fdr = fdrtool(p.value, statistic = "normal")$pval) %>%
arrange(p.value)
return(limma_fit)
}
#=========================================================#
Annot <- data.frame(
# PROBEID = keys(hgu133plus2),
REFSEQ = sapply(contents(hgu133plus2REFSEQ), paste, collapse=", "),
SYMBOL = sapply(contents(hgu133plus2SYMBOL), paste, collapse=", "),
DESC = sapply(contents(hgu133plus2GENENAME), paste, collapse=", "),
stringsAsFactors = FALSE)
Annot$PROBEID <- rownames(Annot)
#=========================================================#
head(expr.rma)
#=========================================================#
sensitive <- c(0, 1, 0, 1 , 0, 1) # assign control and trt groups
design.mat.ABL <- model.matrix(formula(~sensitive))
limma_fit.ABL <- fit_limma(expr.rma[ , 1:6], design.mat.ABL)
# limma_fit.ABL %>% filter(fdr < 0.05, abs(estimate) > log2(1.3))
test <- filter(limma_fit.ABL, fdr < 0.05, abs(estimate) > log2(1.3))
print(paste("Number of Significant genes in ABL Group: ", dim(test)[1], sep = ""))
#=========================================================#
sensitive <- c(0, 1, 0, 1 , 0, 1) # assign control and trt groups
design.mat.LNCaP <- model.matrix(formula(~sensitive))
# design.mat.LNCaP
limma_fit.LNCaP <- fit_limma(expr.rma[ , 7:12], design.mat.LNCaP)
head(limma_fit.LNCaP)
# limma_fit.LNCaP %>% filter(fdr < 0.05, abs(estimate) > log2(1.3))
# limma_fit.LNCaP %>% filter(p.value < 0.05, abs(estimate) > log2(1.3))
test <- filter(limma_fit.LNCaP, fdr < 0.05, abs(estimate) > log2(1.3))
print(paste("Number of Significant genes in LNCaP Group: ", dim(test)[1], sep = ""))
# test <- inner_join(limma_fit.LNCaP, Annot, by = c("gene" = "PROBEID"))
print("Notes: I tried the p.adjust method in STATS package and the fdrtool method in FDRTOOL package. However both of them converted p-value to non-significant levels.This is probabaly due to the existence of batch effect.")
```
2. Draw a hierarchical clustering of the 12 samples. Does the clustering
suggest the presence of batch effect?
Hint: use functions dist, hclust, plot
```{r part1-hclust}
# your code here
clustering <- expr.rma %>%
t() %>%
dist() %>%
hclust(method = "complete")
knitr::opts_chunk$set(fig.width=12, fig.height=8)
plot(clustering, main = "Hierarchical Clustering of the 12 Samples")
dates <- rep(0, 12)
for (i in 1:12){
datheader <- readCelHeader(paste("./data/",dir("./data")[i], sep=''))$datheader
# dd <- gsub(".*([0-9]{2,2}/[0-9]{2,2}/[0-9]{2,2}).*","\\1",datheader)
dd <- gsub(".*([0-9]{2,2}/[0-9]{2,2}/[0-9]{2,2} [0-9]{2,2}:[0-9]{2,2}:[0-9]{2,2}).*","\\1",datheader)
dates[i] <- paste(i, dd, sep =":")
}
print("The measuring date of 12 samples:")
print(dates)
print("The clustering suggests the presence of batch effect. If we cut the tree from the highest level, two subgroups could be found. They are of different size which are 4 and 8, respectively. After printing out the sample date and time (shown above), we can see that GM969458, GM969459, GM969464, GM969465 in the left subtree were all sampled on 03/12/09. On the other hand, these four samples came from the control and treated replicates of ABL and LNCap cell lines, which should have larger distance to each other at the beginning of the clustering process.")
```
3. Use ComBat (included in the `sva` package) to adjust for batch
effects and provide evidence that the batch effects are successfully
adjusted. Repeat the differential expression analysis using LIMMA, FDR
0.05 and FC 1.3. Are there significant genes reported?
```{r part1-combat}
# your code here
sensitive <- c(0, 1, 0, 1 , 0, 1, 0, 1, 0, 1 , 0, 1) # assign control and trt groups
# sensitive <- c(0, 1, 0, 1 , 0, 1, 2, 3, 2, 3 , 2, 3) # assign
design.mat <- model.matrix(formula(~sensitive))
# expr.rma.combat <- ComBat(dat = expr.rma, batch = c(0,1,0,1,0,1,2,3,2,3,2,3), mod = design.mat)
# expr.rma.combat <- ComBat(dat = expr.rma, c(1,1,2,2,2,2,1,1,2,3,2,2), mod = design.mat)
expr.rma.combat <- ComBat(dat = expr.rma, c(1,1,2,2,2,2,3,3,4,5,4,4), mod = design.mat)
clustering_combat <- expr.rma.combat %>%
t() %>%
dist() %>%
hclust(method = "complete")
plot(clustering_combat, main = "Hierarchical Clustering of the 12 samples after ComBat adjustment")
print("Evidence for batch effects were successfully adjusted:")
print("1. The two largest subtree have equal number of leafs which is 6.")
print("2. If we look into how leafs in the left subtree grouped with each other, we could find that GSM969458 and GSM969462 for control condition of ABL cell lines have the nearest distance to each other, although they were measured at difference time points (03/12/09 vs. 04/02/10). Similarly, other samples with the same treatment usually grouped together first, rather than the samples taken at the same batch were grouped first.")
limma_fit.ABL.combat <- fit_limma(expr.rma.combat[ , 1:6], design.mat.ABL)
head(limma_fit.ABL.combat)
limma_fit.ABL.combat <- filter(limma_fit.ABL.combat, fdr < 0.05, abs(estimate) > log2(1.3))
list.ABL <- inner_join(limma_fit.ABL.combat, Annot, by = c("gene" = "PROBEID"))
print(paste("Number of Significant genes in ABL Group: ", length(list.ABL$SYMBOL), sep = ""))
print(paste("First 50 genes:"))
head(list.ABL$SYMBOL, 50)
limma_fit.LNCaP.combat <- fit_limma(expr.rma.combat[ , 7:12], design.mat.LNCaP)
head(limma_fit.LNCaP.combat)
limma_fit.LNCaP.combat <- filter(limma_fit.LNCaP.combat, fdr < 0.05, abs(estimate) > log2(1.3))
list.LNCaP <- inner_join(limma_fit.LNCaP.combat, Annot, by = c("gene" = "PROBEID"))
print(paste("Number of Significant genes in LNCaP Group: ", length(list.LNCaP$SYMBOL), sep = ""))
print(paste("First 50 genes:"))
head(list.LNCaP$SYMBOL, 50)
```
4. FOR GRADUATES: Run K-means clustering of differentially expressed
genes across all 12 samples. Experiment with different K (there may not
be a correct answer here so just explore and explain your reasoning).
Hint: function kmeans
```{r part1-kmeans}
# your code here
limma_fit.ALL.combat <- fit_limma(expr.rma.combat, design.mat)
genelist <- limma_fit.ALL.combat %>%
filter(fdr < 0.05, abs(estimate) >= log2(1.3)) %>%
distinct(gene)
expr_subset.combat <- expr.rma.combat[genelist$gene, ]
print("K-means, centers = 3, nstart = 10, #1")
expr_km1 <- kmeans(t(expr_subset.combat), centers = 3, nstart = 10)
data.frame(type = c(rep(c("ABL control", "ABL siEZH2"), 3), rep(c("LNCaP control", "LNCaP siEZH2"), 3)),
cluster = expr_km1$cluster) %>%
table()
print("K-means, centers = 3, nstart = 10, #2")
expr_km1 <- kmeans(t(expr_subset.combat), centers = 3, nstart = 10)
data.frame(type = c(rep(c("ABL control", "ABL siEZH2"), 3), rep(c("LNCaP control", "LNCaP siEZH2"), 3)),
cluster = expr_km1$cluster) %>%
table()
print("K-means, centers = 3, nstart = 10, #3")
expr_km1 <- kmeans(t(expr_subset.combat), centers = 3, nstart = 10)
data.frame(type = c(rep(c("ABL control", "ABL siEZH2"), 3), rep(c("LNCaP control", "LNCaP siEZH2"), 3)),
cluster = expr_km1$cluster) %>%
table()
print("K-means, centers = 3, nstart = 100, #1")
expr_km1 <- kmeans(t(expr_subset.combat), centers = 3, nstart = 100)
data.frame(type = c(rep(c("ABL control", "ABL siEZH2"), 3), rep(c("LNCaP control", "LNCaP siEZH2"), 3)),
cluster = expr_km1$cluster) %>%
table()
print("K-means, centers = 3, nstart = 100, #2")
expr_km1 <- kmeans(t(expr_subset.combat), centers = 3, nstart = 100)
data.frame(type = c(rep(c("ABL control", "ABL siEZH2"), 3), rep(c("LNCaP control", "LNCaP siEZH2"), 3)),
cluster = expr_km1$cluster) %>%
table()
print("K-means, centers = 3, nstart = 100, #3")
expr_km1 <- kmeans(t(expr_subset.combat), centers = 3, nstart = 100)
data.frame(type = c(rep(c("ABL control", "ABL siEZH2"), 3), rep(c("LNCaP control", "LNCaP siEZH2"), 3)),
cluster = expr_km1$cluster) %>%
table()
```
5. Run the four list of differential genes (up / down, LNCaP / ABL)
separately on DAVID (http://david.abcc.ncifcrf.gov/, you might want to
read their Nat Prot tutorial) to see whether the genes in each list are
enriched in specific biological process, pathways, etc. What’s in common
and what’s the most significant difference in EZH2 regulated genes
between LNCaP and ABL?
```{r}
sensitive <- c(0, 1, 0, 1 , 0, 1, 0, 1, 0, 1 , 0, 1) # assign control and trt groups
design.mat <- model.matrix(formula(~sensitive))
limma_fit.ALL.combat <- fit_limma(expr.rma.combat, design.mat)
head(limma_fit.ABL.combat)
up_genes <- filter(limma_fit.ALL.combat, fdr < 0.05, estimate > log2(1.5))
down_genes <- filter(limma_fit.ALL.combat, fdr < 0.05, estimate < -log2(1.5))
write.csv(up_genes, file = "david_up.csv")
write.csv(down_genes, file = "david_down.csv")
limma_fit.ABL.combat <- fit_limma(expr.rma.combat[ , 1:6], design.mat.ABL)
limma_fit.ABL.combat <- filter(limma_fit.ABL.combat, fdr < 0.05, abs(estimate) > log2(1.5))
limma_fit.LNCaP.combat <- fit_limma(expr.rma.combat[ , 7:12], design.mat.LNCaP)
limma_fit.LNCaP.combat <- filter(limma_fit.LNCaP.combat, fdr < 0.05, abs(estimate) > log2(1.5))
write.csv(limma_fit.ABL.combat, file = "david_ABL.csv")
write.csv(limma_fit.LNCaP.combat, file = "david_LNCaP.csv")
# print("ABL cell line:")
# print("203358_s_at enhancer of zeste 2 polycomb repressive complex 2 subunit(EZH2) RG Homo sapiens")
# print("LNCaP cell line:")
print("The number of enriched clusters for both groups are similar, which are 48 clusters for the ABL group and 42 clusters for the LNCaP group. The most significant difference would be the max enrichment score. ABL group has a max score of 60.41, while LNCaP's max score is 1.72.")
```
6. FOR GRADUATES: Try Gene Set Enrichment analysis
(http://www.broadinstitute.org/gsea/index.jsp) on the siEZH2 experiments
in LNCaP and ABL separately. Do the two cell lines differ in the
enriched gene sets?
```{r part1-gsea}
# your code here
# install.packages("msigdbr")
library(msigdbr)
m_df = msigdbr(species = "Homo sapiens", category = "H")
m_list = m_df %>% split(x = .$gene_symbol, f = .$gs_name)
limma_fit.ABL.combat <- fit_limma(expr.rma.combat[ , 1:6], design.mat.ABL)
limma_fit.LNCaP.combat <- fit_limma(expr.rma.combat[ , 7:12], design.mat.LNCaP)
de_genes.ABL <- limma_fit.ABL.combat %>%
inner_join(Annot, c("gene" = "PROBEID")) %>%
filter(SYMBOL != "NA") %>%
distinct(SYMBOL, .keep_all = TRUE)
stats.ABL <- de_genes.ABL$statistic
names(stats.ABL) <- de_genes.ABL$SYMBOL
de_genes.LNCaP <- limma_fit.LNCaP.combat %>%
inner_join(Annot, c("gene" = "PROBEID")) %>%
filter(SYMBOL != "NA") %>%
distinct(SYMBOL, .keep_all = TRUE)
stats.LNCaP <- de_genes.LNCaP$statistic
names(stats.LNCaP) <- de_genes.LNCaP$SYMBOL
# BiocManager::install("fgsea")
# install.packages("Matrix")
library("fgsea")
library("Matrix")
fgsea_res.ABL <- fgsea(pathways = m_list, stats = stats.ABL, nperm = 10000)
fgsea_res.ABL %>%
filter(padj < 0.05) %>%
arrange(pval) %>%
head()
print("GSEA for ABL Cell Line")
plotEnrichment(m_list[["HALLMARK_MYOGENESIS"]], stats.ABL)
fgsea_res.LNCaP <- fgsea(pathways = m_list, stats = stats.LNCaP, nperm = 10000)
fgsea_res.LNCaP %>%
filter(padj < 0.05) %>%
arrange(pval) %>%
head()
print("GSEA for LNCaP Cell Line")
plotEnrichment(m_list[["HALLMARK_MYOGENESIS"]], stats.LNCaP)
print("The enrichment score for LNCaP cell line has a minimum of -0.2, compared to -0.12 for ABL group.")
```
# Part II. Microarray Clustering and Classification
The sample data is in file "taylor2010_data.txt" included in this
homework. This dataset has expression profiles of 23,974 genes in 27
normal samples, 129 primary cancer samples, 18 metastasized cancer
samples, and 5 unknown samples. Assume the data has been normalized and
summarized to expression index. The skeleton R code is provided here.
```{r loadtaylor}
taylor <- as.matrix(read.csv("data/taylor2010_data.txt", sep="\t",row.names=1))
index_normal <- grepl("N.P", colnames(taylor))
index_primary <- grepl("P.P", colnames(taylor))
index_met <- grepl("M.P", colnames(taylor))
n_normal <- sum(index_normal);
n_primary = sum(index_primary);
n_met = sum(index_met);
# class label (design vector)
taylor_classes = c(rep(0,n_normal), rep(1,n_primary), rep(2,n_met));
# train (known type samples), and test (unknown type samples)
train <- taylor[,1:174];
test <- taylor[,175:179];
tumortype_all <- factor(c(taylor_classes, rep(3, 5)), levels = 0:3,
labels = c("Normal", "Primary", "Metastasized",
"Unknown"))
tumortype_class <- factor(taylor_classes, levels = 0:2,
labels = c("Normal", "Primary",
"Metastasized"))
train_samps <- 1:174
test_samps <- 175:179
```
1. For the 174 samples with known type (normal, primary, metastasized),
use LIMMA to find the differentially expressed genes with fold change
threshold 1.3, and adjusted p-value threshold 0.05. How many
differentially expressed genes are there? Hint: the design vector
consists of type indicator for the 174 samples. For example, 0 for
normal, 1 for primary, and 2 for metastasized.
```{r part2-limma}
# your code here
sensitive <- taylor_classes # assign control and trt groups
design.mat.taylor.train <- model.matrix(formula(~sensitive))
limma_fit.taylor.train <- fit_limma(train, design.mat.taylor.train)
limma_fit.taylor.train <- filter(limma_fit.taylor.train, fdr < 0.05, abs(estimate) > log2(1.3))
print(paste("The number of differentially expressed genes is: ", dim(limma_fit.taylor.train)[1], sep = ""))
```
2. Perform k-means clustering on all samples using the differentially
expressed genes. Do the samples cluster according to disease status?
```{r part2-kmeans}
# your code here
genelist <- limma_fit.taylor.train %>%
distinct(gene)
expr_subset.taylor <- train[genelist$gene, ]
print("K-means, centers = 3, nstart = 10, #1")
taylor.train.km1 <- kmeans(t(expr_subset.taylor), centers = 3, nstart = 10)
data.frame(type = tumortype_class,
cluster = taylor.train.km1$cluster) %>%
table()
print("K-means, centers = 3, nstart = 10, #2")
taylor.train.km1 <- kmeans(t(expr_subset.taylor), centers = 3, nstart = 10)
data.frame(type = tumortype_class,
cluster = taylor.train.km1$cluster) %>%
table()
print("K-means, centers = 3, nstart = 100, #1")
taylor.train.km1 <- kmeans(t(expr_subset.taylor), centers = 3, nstart = 100)
data.frame(type = tumortype_class,
cluster = taylor.train.km1$cluster) %>%
table()
print("K-means, centers = 3, nstart = 100, #2")
taylor.train.km1 <- kmeans(t(expr_subset.taylor), centers = 3, nstart = 100)
data.frame(type = tumortype_class,
cluster = taylor.train.km1$cluster) %>%
table()
print("The samples don't cluster exactly to disease status.")
```
3. Draw PCA biplot on the samples with differentially expressed genes,
and use 4 different colors to distinguish the 4 types of samples
(normal, primary, metastasized and unknown). Do the samples from
different groups look separable?
Hint: use ggplot
```{r part2-pca-biplot}
# your code here
# install.packages("ggfortify")
library(ggfortify)
pca_result <- taylor[genelist$gene, ] %>% t() %>% prcomp(center = TRUE, scale. = TRUE)
# autoplot(pca_result)
edata_pc_df <- as.data.frame(pca_result$x)
edata_pc_df <- edata_pc_df %>%
mutate(batch = as.factor(tumortype_all),
hasCancer = as.factor(tumortype_all))
edata_pc_df[1:5, 1:5]
ggplot(edata_pc_df, aes(x = PC1, y = PC2, color = batch)) +
geom_point() +
ggtitle("PCA Biplot of All Samples")
# ggplot(edata_pc_df, aes(x = PC1, y = PC2, color = hasCancer)) +
# geom_point()
print("The samples look separable.")
```
4. FOR GRADUATES: What percent of variation in the data is captured in
the first two principle components? How many principle components do we
need to capture 80% of the variation in the data?
RHint: use function prcomp.
```{r part2-variation}
# your code here
eigenvals <- (pca_result$sdev)^2
n_eigenvals <- length(eigenvals)
var_explained <- cumsum(eigenvals) / sum(eigenvals)
ggplot(edata_pc_df, aes(x = PC1, y = PC2, color = batch)) +
geom_point() +
ggtitle("PCA Biplot of All Samples.") +
labs(x = paste(sprintf("%0.1f", var_explained[1]*100), "% of variance explained", sep = ""),
y = paste(sprintf("%0.1f", (var_explained[2] - var_explained[1])*100), "% of variance explained", sep = ""))
data.frame(index = 1:n_eigenvals,
var_explained = var_explained) %>%
ggplot(aes(x = index, y = var_explained)) + geom_line() +
xlab("Eigenvalue Index") + ylab("Cum. variance explained") +
ggtitle("PCA Variance Explained")
# which(var_explained > 0.8)
paste("The minimum # of PCs needed to capture 80% of the variance is: ", min(which(var_explained > 0.8)))
```
5. Based on the PCA biplot, can you classify the 5 unknown samples? Put
the PCA biplot in your HW write-up, and indicate which unknown sample
should be classified into which known type (normal, primary,
metastasized). Do you have different confidence for each unknown sample?
Answer: The five unkown samples were colored as purple. We can project them to the x-axis and labeled from 1-5. #1 point sit in the left most area, where most of its closest neighbours are "normal" sample. Thus it is with high confidence that #1 could be "normal" as well. The same applies to #4 and #5 who lie in the right most resion of the PCA biplot, where the majority of "metastasized" samples cluster.
However, unknown sample #2 lies between clusters "normal" and "primary", unknown sample #3 lies between clusters "primary" and "metastasized", which make those two samples difficult to be classified.
6. FOR GRADUATES: Use PCA on all samples and all the genes (instead of
the differentially expressed genes) for sample classification. Compare
to your results in the previous question. Which PCA plot looks better
and why?
```{r part2-pca2}
# your code here
pca_result.All <- taylor %>% t() %>% prcomp(center = TRUE, scale. = TRUE)
edata_pc_df.All <- as.data.frame(pca_result.All$x)
edata_pc_df.All <- edata_pc_df.All %>%
mutate(batch = as.factor(tumortype_all),
hasCancer = as.factor(tumortype_all))
edata_pc_df.All[1:5, 1:5]
eigenvals.All <- (pca_result.All$sdev)^2
n_eigenvals.All <- length(eigenvals.All)
var_explained.All <- cumsum(eigenvals.All) / sum(eigenvals.All)
ggplot(edata_pc_df.All, aes(x = PC1, y = PC2, color = batch)) +
geom_point() +
ggtitle("PCA Biplot of All Samples (all genes)") +
labs(x = paste(sprintf("%0.1f", var_explained.All[1]*100), "% of variance explained", sep = ""),
y = paste(sprintf("%0.1f", (var_explained.All[2] - var_explained.All[1])*100), "% of variance explained", sep = ""))
print("The PCA of all samples and all the genes performs worse than the previous PCA which only calculates the differentially expressed genes. The fours clusters mixed togeter. The percentage of variances explained by PC1 and PC2 also decrease. The reason could be limma removes genes that are unrelated to this biological process, or in other words, it filtered background noise.")
```
7. Run KNN (try K = 1, 3 and 5) on the differential genes and all the
samples, and predict the unknown samples based on the 174 labeled
samples. Hint: use library class and function knn.
```{r part2-knn}
# your code here
# BiocManager::install("caret")
library(caret)
# pca_result <- taylor[genelist$gene, ] %>% t() %>% prcomp(center = TRUE, scale. = TRUE)
# pca_result.All <- taylor %>% t() %>% prcomp(center = TRUE, scale. = TRUE)
print("For DE group")
train_ind <- train_samps
expr_train <- taylor[genelist$gene, train_ind]
expr_test <- taylor[genelist$gene, -train_ind]
type_train <- tumortype_all[train_ind]
type_test <- tumortype_all[-train_ind]
# type_test <- factor(c(rep(3, 5)), labels = c("Unknown"))
type_knn <- knn(t(expr_train), t(expr_test), type_train, k = 1)
confusionMatrix(type_knn, type_test)$table
type_knn <- knn(t(expr_train), t(expr_test), type_train, k = 3)
confusionMatrix(type_knn, type_test)$table
type_knn <- knn(t(expr_train), t(expr_test), type_train, k = 5)
confusionMatrix(type_knn, type_test)$table
print("For all genens")
train_ind <- train_samps
expr_train <- taylor[, train_ind]
expr_test <- taylor[, -train_ind]
type_train <- tumortype_all[train_ind]
type_test <- tumortype_all[-train_ind]
type_knn <- knn(t(expr_train), t(expr_test), type_train, k = 1)
confusionMatrix(type_knn, type_test)$table
type_knn <- knn(t(expr_train), t(expr_test), type_train, k = 3)
confusionMatrix(type_knn, type_test)$table
type_knn <- knn(t(expr_train), t(expr_test), type_train, k = 5)
confusionMatrix(type_knn, type_test)$table
```
8. Run SVM (try a linear kernel) on the differential genes and all the
samples, and predict the unknown samples based on the 174 labeled
samples. Hint: use library e1071 and function svm.
```{r part2-svm}
# your code here
library(e1071)
print("Run SVM on the differential genes:")
train_ind <- train_samps
expr_train <- taylor[genelist$gene, train_ind]
expr_test <- taylor[genelist$gene, -train_ind]
type_train <- tumortype_all[train_ind]
type_test <- tumortype_all[-train_ind]
# get prediction for train dataset
svm_result <- svm(t(expr_train), type_train, kernel = "linear")
confusionMatrix(svm_result$fitted, type_train)$table
# get prediction for test dataset
preds <- predict(svm_result, t(expr_test))
confusionMatrix(preds, type_test)$table
print("Run SVM on all the samples:")
train_ind <- train_samps
expr_train.all <- taylor[, train_ind]
expr_test.all <- taylor[, -train_ind]
type_train.all <- tumortype_all[train_ind]
type_test.all <- tumortype_all[-train_ind]
# get prediction for train dataset
svm_result.all <- svm(t(expr_train.all), type_train.all, kernel = "linear")
confusionMatrix(svm_result.all$fitted, type_train.all)$table
# get prediction for test dataset
preds.all <- predict(svm_result.all, t(expr_test.all))
confusionMatrix(preds.all, type_test.all)$table
```
9. FOR GRADUATES: Implement a 3-fold cross validation on your SVM
classifier, based on the 174 samples with known labels. What is your
average (of 3) classification error rate on the training data?
```{r part2-cv}
# your code here
shuffle_inds <- sample(1:nrow(t(expr_train)), replace = FALSE)
svm_tune <- tune(svm, t(expr_train)[shuffle_inds, ], type_train[shuffle_inds],
kernel = "linear",
ranges = list(cost = c(0.01, 0.1, 1, 10)),
tunecontrol = tune.control(cross = 3))
plot(svm_tune)
svm_tune
print(paste("(DE group) The average classification error rate on the training data is: ", sprintf("%.5f", mean(svm_tune$performances[, "error"]))))
#===========================#
shuffle_inds <- sample(1:nrow(t(expr_train)), replace = FALSE)
svm_tune.all <- tune(svm, t(expr_train.all)[shuffle_inds, ], type_train.all[shuffle_inds],
kernel = "linear",
ranges = list(cost = c(0.01, 0.1, 1, 10)),
tunecontrol = tune.control(cross = 3))
plot(svm_tune.all)
svm_tune.all
print(paste("(All genes group) The average classification error rate on the training data is: ", sprintf("%.5f", mean(svm_tune.all$performances[, "error"]))))
# confusionMatrix(svm_tune$fitted, type_train[shuffle_inds])$table
```
# Part III. High throughput sequencing read mapping
We will give you a simple example to test high throughput sequencing
alignment for RNA-seq data. Normally for paired-end sequencing data,
each sample will have two separate FASTQ files, with line-by-line
correspondence to the two reads from the same fragment. Read mapping
could take a long time, so we have created just two FASTQ files of one
RNA-seq sample with only 3M fragments (2 * 3M reads) for you to run STAR
instead of the full data. The files are located at
`/n/stat115/HW2_2019/`. The mapping will generate one single output
file. Make sure to use the right parameters for single-end (SE) vs
paired-end (PE) modes in BWA and STAR.
Please include the commands that you used to run BWA and STAR in your
answers.
1. Use BWA (Li & Durbin, Bioinformatics 2009) to map the reads to the
Hg38 version of the reference genome, available on Odyssey at
`/n/stat115/HW2_2019/bwa_hg38_index/hg38.fasta`. Use the PE alignment
mode and generate the output in SAM format. Use SAMTools on the output
to find out how many fragments are mappable and uniquely mappable.
```
bwa commands here
```
2. Use STAR (Dobin et al, Bioinformatics 2012) to map the reads to the
reference genome, available on Odyssey at
`/n/stat115/HW2_2019/STARIndex`. Use the paired-end alignment mode and
generate the output in SAM format. STAR should have a report. How many
fragments are mappable and how many are uniquely mappable?
```
STAR commands here
```
3. If you are getting a different number of mappable fragments between
BWA and STAR on the same data, why?
4. For GRADUATES: Run STAR using SE alignment mode on the left read
file. Take a look at the SE SAM file vs the PE SAM file. Are you getting
the same number of aligned fragments using PE mode vs SE mode?
```
STAR commands here
```
# Rules for submitting the homework:
Please submit your solution directly on the canvas website. Please
provide both your code in this Rmd document and an html file for your
final write-up. Please pay attention to the clarity and cleanness of
your homework.
The teaching fellows will grade your homework and give the grades with
feedback through canvas within one week after the due date. Some of the
questions might not have a unique or optimal solution. TFs will grade
those according to your creativity and effort on exploration, especially
in the graduate-level questions.