Homework2.Rmd

---
title: "Stat 115 2019: Homework 2"
author: "(your name)"
date: "Due: February 24, 2019"
output: html_document
---

```{r libraries, message = FALSE}
options(repos = c(CRAN = "http://cran.rstudio.com"))
library(affy)
library(affyPLM)
library(limma)
library(hgu133plus2.db)
library(sva)
library(ggplot2)
library(dplyr)
library(biobroom)
library(class)
library(e1071)
library(xlsx)

library(fdrtool)
library(affxparser)
```

# Part I. Differential Expression

In this part, we will continue to analyze the microarray gene expression
dataset from HW1.

1. The RMA run in HW1 will estimate the expression index of the samples
in a matrix. Use LIMMA, find the differentially expressed genes between
siEZH2 and control in LNCaP cells, and repeat the same in ABL cells. Use
false discovery rate (FDR) 0.05 and fold change (FC) 1.3 as cutoff to
filter the final result list. How many genes are differentially
expressed at this cutoff? Count in terms of gene symbol (e.g. TP53)
instead of transcript like (e.g. NM_000546)?

```{r part1-load, cache = TRUE}
celFiles <- list.celfiles(path = "data", full.names=TRUE)
data.affy <- ReadAffy(filenames = celFiles)
data.rma <- rma(data.affy)
expr.rma <- exprs(data.rma)
```

```{r part1-limma}
# your code here
# knitr::opts_chunk$set(echo = TRUE, fig.align = "center")  #to setup for figure alignment
# print(dir("./data"))
#=========================================================#
fit_limma <- function (expr.rma, design.mat){
    limma_fit <- lmFit(expr.rma, design.mat) %>%
      eBayes() %>%
      tidy() %>%
      mutate(fdr = p.adjust(p.value, method = "fdr")) %>%
      # mutate(fdr = fdrtool(p.value, statistic = "normal")$pval) %>%
      arrange(p.value)
    return(limma_fit)
}
#=========================================================#
Annot <- data.frame(
    # PROBEID = keys(hgu133plus2),
    REFSEQ = sapply(contents(hgu133plus2REFSEQ), paste, collapse=", "),
    SYMBOL = sapply(contents(hgu133plus2SYMBOL), paste, collapse=", "),
    DESC = sapply(contents(hgu133plus2GENENAME), paste, collapse=", "),
    stringsAsFactors = FALSE)
Annot$PROBEID <- rownames(Annot)
#=========================================================#
head(expr.rma)

#=========================================================#
sensitive <- c(0, 1, 0, 1 , 0, 1) # assign control and trt groups
design.mat.ABL <- model.matrix(formula(~sensitive))
limma_fit.ABL <- fit_limma(expr.rma[ , 1:6], design.mat.ABL)  
# limma_fit.ABL %>% filter(fdr < 0.05, abs(estimate) > log2(1.3))
test <- filter(limma_fit.ABL, fdr < 0.05, abs(estimate) > log2(1.3))
print(paste("Number of Significant genes in ABL Group: ", dim(test)[1], sep = ""))
#=========================================================#
sensitive <- c(0, 1, 0, 1 , 0, 1) # assign control and trt groups

design.mat.LNCaP <- model.matrix(formula(~sensitive))
# design.mat.LNCaP
limma_fit.LNCaP <- fit_limma(expr.rma[ , 7:12], design.mat.LNCaP)  
head(limma_fit.LNCaP)
# limma_fit.LNCaP %>% filter(fdr < 0.05, abs(estimate) > log2(1.3))
# limma_fit.LNCaP %>% filter(p.value < 0.05, abs(estimate) > log2(1.3))
test <- filter(limma_fit.LNCaP, fdr < 0.05, abs(estimate) > log2(1.3))
print(paste("Number of Significant genes in LNCaP Group: ", dim(test)[1], sep = ""))
# test <- inner_join(limma_fit.LNCaP, Annot, by = c("gene" = "PROBEID"))

print("Notes: I tried the p.adjust method in STATS package and  the fdrtool method in FDRTOOL package. However both of them converted p-value to non-significant levels.This is probabaly due to the existence of batch effect.")
```

2. Draw a hierarchical clustering of the 12 samples. Does the clustering
suggest the presence of batch effect?
Hint: use functions dist, hclust, plot

```{r part1-hclust}
# your code here
clustering <- expr.rma %>% 
                t() %>% 
                dist() %>%
                hclust(method = "complete")
knitr::opts_chunk$set(fig.width=12, fig.height=8) 
plot(clustering, main = "Hierarchical Clustering of the 12 Samples")

dates <- rep(0, 12)
for (i in 1:12){
  datheader <- readCelHeader(paste("./data/",dir("./data")[i], sep=''))$datheader

  # dd <- gsub(".*([0-9]{2,2}/[0-9]{2,2}/[0-9]{2,2}).*","\\1",datheader)
  dd <- gsub(".*([0-9]{2,2}/[0-9]{2,2}/[0-9]{2,2} [0-9]{2,2}:[0-9]{2,2}:[0-9]{2,2}).*","\\1",datheader)
  dates[i] <- paste(i, dd, sep =":")
  
}
print("The measuring date of 12 samples:")
print(dates)

print("The clustering suggests the presence of batch effect. If we cut the tree from the highest level, two subgroups could be found. They are of different size which are 4 and 8, respectively.  After printing out the sample date and time (shown above), we can see that GM969458, GM969459, GM969464, GM969465 in the left subtree were all sampled on 03/12/09. On the other hand, these four samples came from the control and treated replicates of ABL and LNCap cell lines, which should have larger distance to each other at the beginning of the clustering process.")
```

3. Use ComBat (included in the `sva` package) to adjust for batch
effects and provide evidence that the batch effects are successfully
adjusted. Repeat the differential expression analysis using LIMMA, FDR
0.05 and FC 1.3. Are there significant genes reported?

```{r part1-combat}
# your code here

sensitive <- c(0, 1, 0, 1 , 0, 1, 0, 1, 0, 1 , 0, 1) # assign control and trt groups
# sensitive <- c(0, 1, 0, 1 , 0, 1, 2, 3, 2, 3 , 2, 3) # assign 
design.mat <- model.matrix(formula(~sensitive))

# expr.rma.combat <- ComBat(dat = expr.rma, batch = c(0,1,0,1,0,1,2,3,2,3,2,3), mod = design.mat)
# expr.rma.combat <- ComBat(dat = expr.rma, c(1,1,2,2,2,2,1,1,2,3,2,2), mod = design.mat)
expr.rma.combat <- ComBat(dat = expr.rma, c(1,1,2,2,2,2,3,3,4,5,4,4), mod = design.mat)

clustering_combat <- expr.rma.combat %>% 
                     t() %>% 
                     dist() %>% 
                     hclust(method = "complete")
plot(clustering_combat, main = "Hierarchical Clustering of the 12 samples after ComBat adjustment")
print("Evidence for batch effects were successfully adjusted:")
print("1. The two largest subtree have equal number of leafs which is 6.")
print("2. If we look into how leafs in the left subtree grouped with each other, we could find that GSM969458 and GSM969462 for control condition of ABL cell lines have the nearest distance to each other, although they were measured at difference time points (03/12/09 vs. 04/02/10). Similarly, other samples with the same treatment usually grouped together first, rather than the samples taken at the same batch were grouped first.")

limma_fit.ABL.combat <- fit_limma(expr.rma.combat[ , 1:6], design.mat.ABL)  
head(limma_fit.ABL.combat)
limma_fit.ABL.combat <- filter(limma_fit.ABL.combat, fdr < 0.05, abs(estimate) > log2(1.3))
list.ABL <- inner_join(limma_fit.ABL.combat, Annot, by = c("gene" = "PROBEID"))
print(paste("Number of Significant genes in ABL Group: ", length(list.ABL$SYMBOL), sep = ""))
print(paste("First 50 genes:"))
head(list.ABL$SYMBOL, 50)

limma_fit.LNCaP.combat <- fit_limma(expr.rma.combat[ , 7:12], design.mat.LNCaP)  
head(limma_fit.LNCaP.combat)
limma_fit.LNCaP.combat <- filter(limma_fit.LNCaP.combat, fdr < 0.05, abs(estimate) > log2(1.3))
list.LNCaP <- inner_join(limma_fit.LNCaP.combat, Annot, by = c("gene" = "PROBEID"))
print(paste("Number of Significant genes in LNCaP Group: ", length(list.LNCaP$SYMBOL), sep = ""))
print(paste("First 50 genes:"))
head(list.LNCaP$SYMBOL, 50)

```

4. FOR GRADUATES: Run K-means clustering of differentially expressed
genes across all 12 samples. Experiment with different K (there may not
be a correct answer here so just explore and explain your reasoning).
Hint: function kmeans

```{r part1-kmeans}
# your code here
limma_fit.ALL.combat <- fit_limma(expr.rma.combat, design.mat)

genelist <- limma_fit.ALL.combat %>%
            filter(fdr < 0.05, abs(estimate) >= log2(1.3)) %>%
            distinct(gene)
expr_subset.combat <- expr.rma.combat[genelist$gene, ]

print("K-means, centers = 3, nstart = 10, #1")
expr_km1 <- kmeans(t(expr_subset.combat), centers = 3, nstart = 10)
data.frame(type = c(rep(c("ABL control", "ABL siEZH2"), 3), rep(c("LNCaP control", "LNCaP siEZH2"), 3)),
           cluster = expr_km1$cluster) %>%
    table()

print("K-means, centers = 3, nstart = 10, #2")
expr_km1 <- kmeans(t(expr_subset.combat), centers = 3, nstart = 10)
data.frame(type = c(rep(c("ABL control", "ABL siEZH2"), 3), rep(c("LNCaP control", "LNCaP siEZH2"), 3)),
           cluster = expr_km1$cluster) %>%
    table()

print("K-means, centers = 3, nstart = 10, #3")
expr_km1 <- kmeans(t(expr_subset.combat), centers = 3, nstart = 10)
data.frame(type = c(rep(c("ABL control", "ABL siEZH2"), 3), rep(c("LNCaP control", "LNCaP siEZH2"), 3)),
           cluster = expr_km1$cluster) %>%
    table()

print("K-means, centers = 3, nstart = 100, #1")
expr_km1 <- kmeans(t(expr_subset.combat), centers = 3, nstart = 100)
data.frame(type = c(rep(c("ABL control", "ABL siEZH2"), 3), rep(c("LNCaP control", "LNCaP siEZH2"), 3)),
           cluster = expr_km1$cluster) %>%
    table()

print("K-means, centers = 3, nstart = 100, #2")
expr_km1 <- kmeans(t(expr_subset.combat), centers = 3, nstart = 100)
data.frame(type = c(rep(c("ABL control", "ABL siEZH2"), 3), rep(c("LNCaP control", "LNCaP siEZH2"), 3)),
           cluster = expr_km1$cluster) %>%
    table()

print("K-means, centers = 3, nstart = 100, #3")
expr_km1 <- kmeans(t(expr_subset.combat), centers = 3, nstart = 100)
data.frame(type = c(rep(c("ABL control", "ABL siEZH2"), 3), rep(c("LNCaP control", "LNCaP siEZH2"), 3)),
           cluster = expr_km1$cluster) %>%
    table()

```

5. Run the four list of differential genes (up / down, LNCaP / ABL)
separately on DAVID  (http://david.abcc.ncifcrf.gov/, you might want to
read their Nat Prot tutorial) to see whether the genes in each list are
enriched in specific biological process, pathways, etc. What’s in common
and what’s the most significant difference in EZH2 regulated genes
between LNCaP and ABL?

```{r}
sensitive <- c(0, 1, 0, 1 , 0, 1, 0, 1, 0, 1 , 0, 1) # assign control and trt groups
design.mat <- model.matrix(formula(~sensitive))

limma_fit.ALL.combat <- fit_limma(expr.rma.combat, design.mat)
head(limma_fit.ABL.combat)
up_genes <- filter(limma_fit.ALL.combat, fdr < 0.05, estimate > log2(1.5))
down_genes <- filter(limma_fit.ALL.combat, fdr < 0.05, estimate < -log2(1.5))
write.csv(up_genes, file = "david_up.csv")
write.csv(down_genes, file = "david_down.csv")

limma_fit.ABL.combat <- fit_limma(expr.rma.combat[ , 1:6], design.mat.ABL)  
limma_fit.ABL.combat <- filter(limma_fit.ABL.combat, fdr < 0.05, abs(estimate) > log2(1.5))

limma_fit.LNCaP.combat <- fit_limma(expr.rma.combat[ , 7:12], design.mat.LNCaP)  
limma_fit.LNCaP.combat <- filter(limma_fit.LNCaP.combat, fdr < 0.05, abs(estimate) > log2(1.5))

write.csv(limma_fit.ABL.combat, file = "david_ABL.csv")
write.csv(limma_fit.LNCaP.combat, file = "david_LNCaP.csv")

# print("ABL cell line:")
# print("203358_s_at	enhancer of zeste 2 polycomb repressive complex 2 subunit(EZH2)	RG	Homo sapiens")
# print("LNCaP cell line:")
print("The number of enriched clusters for both groups are similar, which are 48 clusters for the ABL group and 42 clusters for the LNCaP group. The most significant difference would be the max enrichment score. ABL group has a max score of 60.41, while LNCaP's max score is 1.72.")
```
6. FOR GRADUATES: Try Gene Set Enrichment analysis
(http://www.broadinstitute.org/gsea/index.jsp) on the siEZH2 experiments
in LNCaP and ABL separately. Do the two cell lines differ in the
enriched gene sets?


```{r part1-gsea}
# your code here
# install.packages("msigdbr")
library(msigdbr)
m_df = msigdbr(species = "Homo sapiens", category = "H")
m_list = m_df %>% split(x = .$gene_symbol, f = .$gs_name)

limma_fit.ABL.combat <- fit_limma(expr.rma.combat[ , 1:6], design.mat.ABL)  
limma_fit.LNCaP.combat <- fit_limma(expr.rma.combat[ , 7:12], design.mat.LNCaP)  

de_genes.ABL <- limma_fit.ABL.combat %>%
                inner_join(Annot, c("gene" = "PROBEID")) %>%
                filter(SYMBOL != "NA") %>%
                distinct(SYMBOL, .keep_all = TRUE)
stats.ABL <- de_genes.ABL$statistic
names(stats.ABL) <- de_genes.ABL$SYMBOL

de_genes.LNCaP <- limma_fit.LNCaP.combat %>%
                  inner_join(Annot, c("gene" = "PROBEID")) %>%
                  filter(SYMBOL != "NA") %>%
                  distinct(SYMBOL, .keep_all = TRUE)
stats.LNCaP <- de_genes.LNCaP$statistic
names(stats.LNCaP) <- de_genes.LNCaP$SYMBOL

# BiocManager::install("fgsea")
# install.packages("Matrix")
library("fgsea")
library("Matrix")

fgsea_res.ABL <- fgsea(pathways = m_list, stats = stats.ABL, nperm = 10000)
fgsea_res.ABL %>% 
  filter(padj < 0.05) %>% 
  arrange(pval) %>% 
  head()
print("GSEA for ABL Cell Line")
plotEnrichment(m_list[["HALLMARK_MYOGENESIS"]], stats.ABL)

fgsea_res.LNCaP <- fgsea(pathways = m_list, stats = stats.LNCaP, nperm = 10000)
fgsea_res.LNCaP %>% 
  filter(padj < 0.05) %>% 
  arrange(pval) %>% 
  head()
print("GSEA for LNCaP Cell Line")
plotEnrichment(m_list[["HALLMARK_MYOGENESIS"]], stats.LNCaP)

print("The enrichment score for LNCaP cell line has a minimum of -0.2, compared to -0.12 for ABL group.")
```

# Part II. Microarray Clustering and Classification

The sample data is in file "taylor2010_data.txt" included in this
homework. This dataset has expression profiles of 23,974 genes in 27
normal samples, 129 primary cancer samples, 18 metastasized cancer
samples, and 5 unknown samples. Assume the data has been normalized and
summarized to expression index. The skeleton R code is provided here.

```{r loadtaylor}
taylor <- as.matrix(read.csv("data/taylor2010_data.txt", sep="\t",row.names=1))
index_normal <- grepl("N.P", colnames(taylor))
index_primary <- grepl("P.P", colnames(taylor))
index_met <- grepl("M.P", colnames(taylor))
n_normal <- sum(index_normal);
n_primary = sum(index_primary);
n_met = sum(index_met);

# class label (design vector)
taylor_classes = c(rep(0,n_normal), rep(1,n_primary), rep(2,n_met));

# train (known type samples), and test (unknown type samples)
train <- taylor[,1:174];
test <- taylor[,175:179];

tumortype_all <- factor(c(taylor_classes, rep(3, 5)), levels = 0:3,
                        labels = c("Normal", "Primary", "Metastasized",
                                   "Unknown"))
tumortype_class <- factor(taylor_classes, levels = 0:2,
                          labels = c("Normal", "Primary",
                                     "Metastasized"))

train_samps <- 1:174
test_samps <- 175:179
```

1. For the 174 samples with known type (normal, primary, metastasized),
use LIMMA to find the differentially expressed genes with fold change
threshold 1.3, and adjusted p-value threshold 0.05.  How many
differentially expressed genes are there? Hint: the design vector
consists of type indicator for the 174 samples. For example, 0 for
normal, 1 for primary, and 2 for metastasized.

```{r part2-limma}
# your code here
sensitive <- taylor_classes # assign control and trt groups
design.mat.taylor.train <- model.matrix(formula(~sensitive))
limma_fit.taylor.train <- fit_limma(train, design.mat.taylor.train)  
limma_fit.taylor.train <- filter(limma_fit.taylor.train, fdr < 0.05, abs(estimate) > log2(1.3))
print(paste("The number of differentially expressed genes is: ", dim(limma_fit.taylor.train)[1], sep = ""))
```

2. Perform k-means clustering on all samples using the differentially
expressed genes. Do the samples cluster according to disease status?

```{r part2-kmeans}
# your code here
genelist <- limma_fit.taylor.train %>%
            distinct(gene)
expr_subset.taylor <- train[genelist$gene, ]

print("K-means, centers = 3, nstart = 10, #1")
taylor.train.km1 <- kmeans(t(expr_subset.taylor), centers = 3, nstart = 10)
data.frame(type = tumortype_class,
           cluster = taylor.train.km1$cluster) %>%
    table()

print("K-means, centers = 3, nstart = 10, #2")
taylor.train.km1 <- kmeans(t(expr_subset.taylor), centers = 3, nstart = 10)
data.frame(type = tumortype_class,
           cluster = taylor.train.km1$cluster) %>%
    table()

print("K-means, centers = 3, nstart = 100, #1")
taylor.train.km1 <- kmeans(t(expr_subset.taylor), centers = 3, nstart = 100)
data.frame(type = tumortype_class,
           cluster = taylor.train.km1$cluster) %>%
    table()

print("K-means, centers = 3, nstart = 100, #2")
taylor.train.km1 <- kmeans(t(expr_subset.taylor), centers = 3, nstart = 100)
data.frame(type = tumortype_class,
           cluster = taylor.train.km1$cluster) %>%
    table()

print("The samples don't cluster exactly to disease status.")
```

3. Draw PCA biplot on the samples with differentially expressed genes,
and use 4 different colors to distinguish the 4 types of samples
(normal, primary, metastasized and unknown). Do the samples from
different groups look separable?
Hint: use ggplot

```{r part2-pca-biplot}
# your code here
# install.packages("ggfortify")
library(ggfortify)

pca_result <- taylor[genelist$gene, ] %>% t() %>% prcomp(center = TRUE, scale. = TRUE)
# autoplot(pca_result)
edata_pc_df <- as.data.frame(pca_result$x)

edata_pc_df <-  edata_pc_df %>%
                mutate(batch = as.factor(tumortype_all),
                hasCancer = as.factor(tumortype_all))
edata_pc_df[1:5, 1:5]

ggplot(edata_pc_df, aes(x = PC1, y = PC2, color = batch)) +
    geom_point() +
    ggtitle("PCA Biplot of All Samples")
# ggplot(edata_pc_df, aes(x = PC1, y = PC2, color = hasCancer)) +
#     geom_point()
print("The samples look separable.")
```

4. FOR GRADUATES: What percent of variation in the data is captured in
the first two principle components? How many principle components do we
need to capture 80% of the variation in the data?
RHint: use function prcomp.

```{r part2-variation}
# your code here
eigenvals <- (pca_result$sdev)^2
n_eigenvals <- length(eigenvals)
var_explained <- cumsum(eigenvals) / sum(eigenvals)

ggplot(edata_pc_df, aes(x = PC1, y = PC2, color = batch)) +
    geom_point() +
    ggtitle("PCA Biplot of All Samples.") +
    labs(x = paste(sprintf("%0.1f", var_explained[1]*100), "% of variance explained", sep = ""), 
         y = paste(sprintf("%0.1f", (var_explained[2] - var_explained[1])*100), "% of variance explained", sep = ""))

data.frame(index = 1:n_eigenvals,
           var_explained = var_explained) %>%
    ggplot(aes(x = index, y = var_explained)) + geom_line() +
    xlab("Eigenvalue Index") + ylab("Cum. variance explained") +
    ggtitle("PCA Variance Explained")
# which(var_explained > 0.8)
paste("The minimum # of PCs needed to capture 80% of the variance is: ", min(which(var_explained > 0.8)))
```

5. Based on the PCA biplot, can you classify the 5 unknown samples?  Put
the PCA biplot in your HW write-up, and indicate which unknown sample
should be classified into which known type (normal, primary,
metastasized). Do you have different confidence for each unknown sample?

Answer: The five unkown samples were colored as purple. We can project them to the x-axis and labeled from 1-5. #1 point sit in the left most area, where most of its closest neighbours are "normal" sample. Thus it is with high confidence that #1 could be "normal" as well. The same applies to #4 and #5 who lie in the right most resion of the PCA biplot, where the majority of "metastasized" samples cluster. 

However, unknown sample #2 lies between clusters "normal" and "primary", unknown sample #3 lies between clusters "primary" and "metastasized", which make those two samples difficult to be classified.

6. FOR GRADUATES: Use PCA on all samples and all the genes (instead of
the differentially expressed genes) for sample classification. Compare
to your results in the previous question. Which PCA plot looks better
and why?

```{r part2-pca2}
# your code here
pca_result.All <- taylor %>% t() %>% prcomp(center = TRUE, scale. = TRUE)
edata_pc_df.All <- as.data.frame(pca_result.All$x)

edata_pc_df.All <-  edata_pc_df.All %>%
                    mutate(batch = as.factor(tumortype_all),
                    hasCancer = as.factor(tumortype_all))
edata_pc_df.All[1:5, 1:5]

eigenvals.All <- (pca_result.All$sdev)^2
n_eigenvals.All <- length(eigenvals.All)
var_explained.All <- cumsum(eigenvals.All) / sum(eigenvals.All)

ggplot(edata_pc_df.All, aes(x = PC1, y = PC2, color = batch)) +
    geom_point() +
    ggtitle("PCA Biplot of All Samples (all genes)") +
    labs(x = paste(sprintf("%0.1f", var_explained.All[1]*100), "% of variance explained", sep = ""), 
         y = paste(sprintf("%0.1f", (var_explained.All[2] - var_explained.All[1])*100), "% of variance explained", sep = ""))

print("The PCA of all samples and all the genes performs worse than the previous PCA which only calculates the differentially expressed genes. The fours clusters mixed togeter. The percentage of variances explained by PC1 and PC2 also decrease. The reason could be limma removes genes that are unrelated to this biological process, or in other words, it filtered background noise.")
```

7. Run KNN (try K = 1, 3 and 5) on the differential genes and all the
samples, and predict the unknown samples based on the 174 labeled
samples. Hint: use library class and function knn.

```{r part2-knn}
# your code here
# BiocManager::install("caret")
library(caret)
# pca_result <- taylor[genelist$gene, ] %>% t() %>% prcomp(center = TRUE, scale. = TRUE)
# pca_result.All <- taylor %>% t() %>% prcomp(center = TRUE, scale. = TRUE)

print("For DE group")
train_ind <- train_samps
expr_train <- taylor[genelist$gene, train_ind]
expr_test <- taylor[genelist$gene, -train_ind]
type_train <- tumortype_all[train_ind]
type_test <- tumortype_all[-train_ind]
# type_test <- factor(c(rep(3, 5)), labels = c("Unknown"))

type_knn <- knn(t(expr_train), t(expr_test), type_train, k = 1)
confusionMatrix(type_knn, type_test)$table

type_knn <- knn(t(expr_train), t(expr_test), type_train, k = 3)
confusionMatrix(type_knn, type_test)$table

type_knn <- knn(t(expr_train), t(expr_test), type_train, k = 5)
confusionMatrix(type_knn, type_test)$table

print("For all genens")

train_ind <- train_samps
expr_train <- taylor[, train_ind]
expr_test <- taylor[, -train_ind]
type_train <- tumortype_all[train_ind]
type_test <- tumortype_all[-train_ind]

type_knn <- knn(t(expr_train), t(expr_test), type_train, k = 1)
confusionMatrix(type_knn, type_test)$table

type_knn <- knn(t(expr_train), t(expr_test), type_train, k = 3)
confusionMatrix(type_knn, type_test)$table

type_knn <- knn(t(expr_train), t(expr_test), type_train, k = 5)
confusionMatrix(type_knn, type_test)$table
```

8. Run SVM (try a linear kernel) on the differential genes and all the
samples, and predict the unknown samples based on the 174 labeled
samples. Hint: use library e1071 and function svm.

```{r part2-svm}
# your code here
library(e1071)
print("Run SVM on the differential genes:")
train_ind <- train_samps
expr_train <- taylor[genelist$gene, train_ind]
expr_test <- taylor[genelist$gene, -train_ind]
type_train <- tumortype_all[train_ind]
type_test <- tumortype_all[-train_ind]
# get prediction for train dataset
svm_result <- svm(t(expr_train), type_train, kernel = "linear")
confusionMatrix(svm_result$fitted, type_train)$table
# get prediction for test dataset
preds <- predict(svm_result, t(expr_test))
confusionMatrix(preds, type_test)$table

print("Run SVM on all the samples:")
train_ind <- train_samps
expr_train.all <- taylor[, train_ind]
expr_test.all <- taylor[, -train_ind]
type_train.all <- tumortype_all[train_ind]
type_test.all <- tumortype_all[-train_ind]
# get prediction for train dataset
svm_result.all <- svm(t(expr_train.all), type_train.all, kernel = "linear")
confusionMatrix(svm_result.all$fitted, type_train.all)$table
# get prediction for test dataset
preds.all <- predict(svm_result.all, t(expr_test.all))
confusionMatrix(preds.all, type_test.all)$table
```

9. FOR GRADUATES: Implement a 3-fold cross validation on your SVM
classifier, based on the 174 samples with known labels. What is your
average (of 3) classification error rate on the training data?

```{r part2-cv}
# your code here
shuffle_inds <- sample(1:nrow(t(expr_train)), replace = FALSE)
svm_tune <- tune(svm, t(expr_train)[shuffle_inds, ], type_train[shuffle_inds],
                 kernel = "linear",
                 ranges = list(cost = c(0.01, 0.1, 1, 10)),
                 tunecontrol = tune.control(cross = 3))
plot(svm_tune)
svm_tune
print(paste("(DE group) The average classification error rate on the training data is: ", sprintf("%.5f", mean(svm_tune$performances[, "error"]))))
#===========================#
shuffle_inds <- sample(1:nrow(t(expr_train)), replace = FALSE)
svm_tune.all <- tune(svm, t(expr_train.all)[shuffle_inds, ], type_train.all[shuffle_inds],
                 kernel = "linear",
                 ranges = list(cost = c(0.01, 0.1, 1, 10)),
                 tunecontrol = tune.control(cross = 3))
plot(svm_tune.all)
svm_tune.all
print(paste("(All genes group) The average classification error rate on the training data is: ", sprintf("%.5f", mean(svm_tune.all$performances[, "error"]))))
# confusionMatrix(svm_tune$fitted, type_train[shuffle_inds])$table
```

# Part III. High throughput sequencing read mapping

We will give you a simple example to test high throughput sequencing
alignment for RNA-seq data. Normally for paired-end sequencing data,
each sample will have two separate FASTQ files, with line-by-line
correspondence to the two reads from the same fragment. Read mapping
could take a long time, so we have created just two FASTQ files of one
RNA-seq sample with only 3M fragments (2 * 3M reads) for you to run STAR
instead of the full data. The files are located at
`/n/stat115/HW2_2019/`. The mapping will generate one single output
file. Make sure to use the right parameters for single-end (SE) vs
paired-end (PE) modes in BWA and STAR.

Please include the commands that you used to run BWA and STAR in your
answers.

1. Use BWA (Li & Durbin, Bioinformatics 2009) to map the reads to the
Hg38 version of the reference genome, available on Odyssey at
`/n/stat115/HW2_2019/bwa_hg38_index/hg38.fasta`. Use the PE alignment
mode and generate the output in SAM format. Use SAMTools on the output
to find out how many fragments are mappable and uniquely mappable.

```
bwa commands here
```

2. Use STAR (Dobin et al, Bioinformatics 2012) to map the reads to the
reference genome, available on Odyssey at
`/n/stat115/HW2_2019/STARIndex`. Use the paired-end alignment mode and
generate the output in SAM format. STAR should have a report.  How many
fragments are mappable and how many are uniquely mappable?

```
STAR commands here
```

3. If you are getting a different number of mappable fragments between
BWA and STAR on the same data, why?

4. For GRADUATES: Run STAR using SE alignment mode on the left read
file. Take a look at the SE SAM file vs the PE SAM file. Are you getting
the same number of aligned fragments using PE mode vs SE mode?

```
STAR commands here
```

# Rules for submitting the homework:

Please submit your solution directly on the canvas website. Please
provide both your code in this Rmd document and an html file for your
final write-up. Please pay attention to the clarity and cleanness of
your homework.

The teaching fellows will grade your homework and give the grades with
feedback through canvas within one week after the due date. Some of the
questions might not have a unique or optimal solution. TFs will grade
those according to your creativity and effort on exploration, especially
in the graduate-level questions.