add TCGA coad-analysis.Rmd

ksiegmund · Jul 27, 2022 · 3d49e15 · 3d49e15
1 parent 57fac07
commit 3d49e15
Showing 1 changed file with 269 additions and 0 deletions.
diff --git a/week11/w11code1-coad-analysis.Rmd b/week11/w11code1-coad-analysis.Rmd
@@ -0,0 +1,269 @@
+---
+title: "TCGA colon adenocarcinoma"
+author: "ks"
+date: "`r Sys.Date()`"
+output: html_document
+---
+
+# {.tabset}
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+library(ggplot2)
+library(PMA)
+if(!require("corrplot")) {install.packages("corrplot")}
+library(corrplot)
+```
+
+## TCGA colon adenocarcinoma
+
+We're going to analyze colon adenocarcinoma data from the Cancer Genome Atlas (TCGA). Patient clinical information along with some tumor characteristics were downloaded from the R package: curatedTCGAData. The molecular data are gene expression and DNA methylation. The samples selected had the most complete information on the tumor variables MSI status and DNA methylation subtype.  
+
+The expression data were a count matrix from RNA sequencing and there was considerable variation in total counts (library size) across the samples. I normalized the counts by sample (library) using counts per million and removed genes that were not expressed in at least 10% of the samples.  Then I log2-transformed the data to stabilize the variances of the genes. On the log2 scale, the variance of gene expression is approximately independent of the mean expression level.  This is not true on the count scale.
+
+I also filtered the DNA methylation data on variance, requiring variance greater than the median (0.003) and dropping the ~500 features with missing values. Often investigators will also remove features that have a SNP at the target CpG site or that are mapped in repetitive regions. I did not filter on those criteria.
+
+Here are the variables in the data set:
+
+```{r load-data, echo=FALSE}
+load("../data/coad.RData")
+names(coad)
+str(coad)
+```
+
+I limited the covariates to a few interesting variables:   
+* age (=age_at_initial_pathologic_diagnosis)     
+* sex (male/female)  
+* MSI_status (MSI-H/MSI-L/MSS)     
+* methylation_subtype (CIMP.H, CIMP.L, Cluster3, Cluster4)     
+* purity (fraction tumor cells)  
+
+A variable that is important when studying tissue is the cell composition. Different cell types have different DNA methylation profiles and different genes expressed. Human tumors are a complex mixture of cells, and not all cells in the tumor are cancer cells. There are a number  of methods now for estimating tumor purity. I downloaded an estimate obtained from the method: ABSOLUTE (PMID: 22544022) (purity).
+
+Now we can summarize the clinical variables of primary interest by sex.
+
+```{r bysex}
+with(coad$covars,
+     boxplot(age~sex))
+msitab <- with(coad$covars,table(MSI_status,sex))
+msitab
+prop.table(msitab,2)
+```
+
+We see a strong association with females having more MSI high tumors than males.
+
+```{r cimpbysex}
+cimptab <- with(coad$covars,table(methylation_subtype,sex))  
+cimptab
+prop.table(cimptab,2)
+```  
+
+CIMP high tumors are also more common in females.  In fact, there is a strong  association between CIMP high and MSI high tumors.
+
+```{r tumor-vars}
+msicimp <- with(coad$covars,table(methylation_subtype,MSI_status))
+prop.table(msicimp,2) 
+```
+
+Let's see whether tumor purity is associated with any of the variables we've studied so far. 
+
+```{r tumor_purity}
+with(coad$covars,boxplot(purity~sex))
+with(coad$covars,plot(age,purity))
+with(coad$covars,boxplot(purity~MSI_status))
+with(coad$covars,boxplot(purity~methylation_subtype))
+```
+
+There doesn't appear to be any association of tumor purity with the other variables of interest. We can show this in the correlation matrix.
+
+```{r cors}
+cormat <- cor(coad$covars[,1:5],use="pairwise.complete.obs")
+corrplot::corrplot(cormat, method="square")
+```
+
+Still, is there an effect of (lack of) tumor purity on the gene expression and DNA methylation data?  
+
+## Gene Expression
+
+How many variables are in the gene expression data set?
+
+```{r gex}
+dim(coad$lcpmy)
+```
+
+Here's a PCA of the gene expression data. Because we have over 15,000 variables, we're going to filter the features (variables) and analyze the 500 most variable features.
+
+```{r pca}
+# pick 500 most variable features
+fmad  <- matrixStats::rowMads(coad$lcpmy)
+rfmad <- rank(-fmad)
+fidx <- which(rfmad <= 500)
+
+# transpose the expression matrix
+tE <- t(coad$lcpmy[fidx,])
+#tE <- t(coad$lcpmy)
+sdztE <- scale(tE)
+#pca
+my.pcaE <- prcomp(sdztE,retx=TRUE)
+epcs <- as.data.frame(my.pcaE$x)
+
+ggplot(epcs,  aes(x=PC1, y=PC2, color = coad$covars$MSI_status , shape = coad$covars$sex)) + geom_point(size=2.5) +
+  labs(color="MSI status",shape="Sex")
+```
+
+Nice! We see an association of PC1 sith MSI-H, and no noticeable association with sex.
+
+Let's look at the correlation of the first 8 PCs with our covariates
+
+```{r pcacov}
+pc1thru8 <- epcs[,1:8]
+cormat <- cor(coad$covars[,1:5],pc1thru8,use="pairwise.complete.obs")
+corrplot::corrplot(cormat)
+```
+
+The biological variables MSI.H and CIMP.H have the greatest association with PC1, followed by sex. Sex, tumor purity and age are the variables most correlated with PC2. And purity is the variable most correlated with PC3 & PC4.
+
+How much of the gene expression variation do these PCs explain?
+
+```{r cumvarplot}
+pcvar <- my.pcaE$sdev^2   #These are variances of PCs
+cumvar <- cumsum(pcvar)
+pctvar <- cumvar/max(cumvar)
+par(mfrow=c(1,2))
+plot(1:10,pcvar[1:10],xlab="PC",ylab="Variance")
+plot(1:10,pctvar[1:10],xlab="PCs",ylab="Percent Variance Explained",ylim=c(0,1))
+```
+
+10 PCs explain ~50% of the variance in the gene expression data set.
+
+
+## DNA methylation
+
+How many variables are in the DNA methylation data set?
+
+```{r dnam}
+dim(coad$dnam)
+```
+
+What is the distribution of measurements for a single sample?
+```{r density-dnam}
+plot(density(coad$dnam[,4]))
+```
+
+The DNAm values, called Beta values by Illumina, are proportions, bounded by 0 and 1. These should be transformed prior to analysis to be more normally distributed. A logit-like transformation works well here, M values = log2 (beta/1-beta).
+
+```{r transform}
+  mvals <- function(beta){  log2(beta/(1-beta)) }
+  mcoad <- mvals(coad$dnam)
+```
+
+Let's look at a pca plot of these. Again, I will analyze the 500 most variable features only. 
+
+EXTRA CREDIT: Try re-doing this plot without the filtering to see how it affects the relationships with the biological variables. 
+
+```{r pca-dnam}
+# pick 500 most variable features
+fmad  <- matrixStats::rowMads(mcoad)
+rfmad <- rank(-fmad)
+fidx <- which(rfmad <= 500)
+
+# transpose the DNAm matrix
+tM <- t(mcoad[fidx,])
+#tM <- t(mcoad)  #un-comment this line to analyze all variables without filtering
+sdztM <- scale(tM,center=TRUE,scale=
+                 matrixStats::colMads(tM))
+my.mpca <- prcomp(sdztM,retx=TRUE)
+mpcs <- as.data.frame(my.mpca$x)
+
+ggplot(mpcs,  aes(x=PC1, y=PC2, color = coad$covars$methylation_subtype , shape = coad$covars$sex)) + geom_point(size=2.5) +
+  labs(color="methylation subtype",shape="Sex")
+```
+
+PC1 and PC2 can separate patients by sex perfectly.
+
+How much of the DNAm variation do these PCs explain?
+
+```{r cumvarplot-dnam}
+pcvar <- my.mpca$sdev^2   #These are variances of PCs
+cumvar <- cumsum(pcvar)
+pctvar <- cumvar/max(cumvar)
+par(mfrow=c(1,2))
+plot(1:10,pcvar[1:10],xlab="PC",ylab="Variance")
+plot(1:10,pctvar[1:10],xlab="PCs",ylab="Percent Variance Explained",ylim=c(0,1))
+```
+
+PCs 1 & 2 explain over 60% of the variance measured by the top 500 variable features.
+
+Let's check the distribution of CpGs by chromosome.
+
+```{r mprop.table}
+table(coad$fdnam$Chr)
+```
+
+X chromosome methylation is different in males and females (X chromosome inactivation).  This is something you will always have to look out for when analyzing males and females together.
+
+```{r fraction-sexchrfeatures}
+sum(is.element(coad$fdnam$Chr,c("X","Y"))/length(coad$fdnam$Chr))
+```
+
+Just $7.6\%$ of the features are on sex chromosomes but when we filter on variance, we're selecting these into our analysis set  and they will have a large impact on the cluster results. Let's drop the features on the sex chromosomes before performing our dimension reduction techniques.
+
+```{r filtXY}
+mdnam <- mcoad[
+          !is.element(coad$fdnam$Chr,c("X","Y")),]
+fdnam <- coad$fdnam[!is.element(coad$fdnam$Chr,c("X","Y")),]
+dim(mdnam)
+nrow(fdnam)
+```
+
+Now what do we get from a PCA?
+```{r pca-dnam2}
+# pick 500 most variable features
+fmad  <- matrixStats::rowMads(mdnam)
+rfmad <- rank(-fmad)
+fidx <- which(rfmad <= 500)
+
+# transpose the dnam matrix
+tM <- t(mdnam[fidx,])
+sdztM <- scale(tM,center=TRUE,scale=
+                 matrixStats::colMads(tM))
+my.mpca <- prcomp(sdztM,retx=TRUE)
+mpca <- as.data.frame(my.mpca$x)
+
+ggplot(mpca,  aes(x=PC1, y=PC2, color = coad$covars$methylation_subtype, shape = coad$covars$sex)) + geom_point(size=2.5) +
+  labs(color="methylation subtype",shape="Sex")
+```
+
+The separation by sex has disappeared now. How do the top PCs correlate with the covariates?
+
+```{r mpcacov}
+covmat  <- cor(coad$covars[,1:5],mpca[,1:8],use="pairwise.complete.obs")
+corrplot::corrplot(covmat)
+```
+
+PC1 is highly correlated with both CIMP-H and MSI-H.  This resembles the correlation matrix with the PCs from the gene expression data.  Somewhat troubling is the correlation with purity that is seen in multiple PCs.
+
+How much of the DNAm variation do these PCs explain?
+
+```{r cumvar-dnam2}
+pcvar <- my.mpca$sdev^2   #These are variances of PCs
+cumvar <- cumsum(pcvar)
+pctvar <- cumvar/max(cumvar)
+par(mfrow=c(1,2))
+plot(1:10,pcvar[1:10],xlab="PC",ylab="Variance")
+plot(1:10,pctvar[1:10],xlab="PCs",ylab="Percent Variance Explained",ylim=c(0,1))
+```
+
+The first PC explains 40% of the variance, with the next 9 adding another 10%. 
+
+
+Results:  
+A PCA of a subset of the data may find a different cluster result than PCA of the full data set.  (e.g.,  DNA methylation finding clusters by sex)  
+
+Next week: CCA 
+
+## SessionInfo
+
+```{r SessionInfo}
+sessionInfo()
+```