Skip to content

Commit

Permalink
add week12
Browse files Browse the repository at this point in the history
  • Loading branch information
ksiegmund committed Aug 2, 2022
1 parent ba4a865 commit ad9fbb6
Showing 1 changed file with 363 additions and 0 deletions.
363 changes: 363 additions & 0 deletions week12/w12code1-coad-analysis.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,363 @@
---
title: "TCGA colon adenocarcinoma"
author: "ks"
date: "`r Sys.Date()`"
output: html_document
---

# {.tabset}

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(PMA)
if(!require("corrplot")) {install.packages("corrplot")}
library(corrplot)
```

## TCGA colon adenocarcinoma

We're going to analyze colon adenocarcinoma data from the Cancer Genome Atlas (TCGA). Patient clinical information along with some tumor characteristics were downloaded from the R package: curatedTCGAData. The molecular data are gene expression and DNA methylation. The samples selected had the most complete information on the tumor variables MSI status and DNA methylation subtype.

The expression data were a count matrix from RNA sequencing and there was considerable variation in total counts (library size) across the samples. I normalized the counts by sample (library) using counts per million and removed genes that were not expressed in at least 10% of the samples. Then I log2-transformed the data to stabilize the variances of the genes. On the log2 scale, the variance of gene expression is approximately independent of the mean expression level. This is not true on the count scale.

I also filtered the DNA methylation data on variance, requiring variance greater than the median (0.003) and dropping the ~500 features with missing values. Often investigators will also remove features that have a SNP at the target CpG site or that are mapped in repetitive regions. I did not filter on those criteria.

Here are the variables in the data set:

```{r load-data, echo=FALSE}
load("../data/coad.RData")
str(coad)
```

I limited the covariates to a few interesting variables:
* age (=age_at_initial_pathologic_diagnosis)
* sex (male/female)
* MSI_status (MSI-H/MSI-L/MSS)
* methylation_subtype (CIMP.H, CIMP.L, Cluster3, Cluster4)
* purity (fraction tumor cells)

A variable that is important when studying tissue is the cell composition. Different cell types have different DNA methylation profiles and different genes expressed. Human tumors are a complex mixture of cells, and not all cells in the tumor are cancer cells. There are a number of methods now for estimating tumor purity. I downloaded an estimate obtained from the method: ABSOLUTE (PMID: 22544022) (purity).

Now we can summarize the clinical variables of primary interest by sex.

```{r bysex}
with(coad$covars,
boxplot(age~sex))
msitab <- with(coad$covars,table(MSI_status,sex))
msitab
prop.table(msitab,2)
```

We see a strong association with females having more MSI high tumors than males.

```{r cimpbysex}
cimptab <- with(coad$covars,table(methylation_subtype,sex))
cimptab
prop.table(cimptab,2)
```

CIMP high tumors are also more common in females. In fact, there is a strong association between CIMP high and MSI high tumors.

```{r tumor-vars}
msicimp <- with(coad$covars,table(methylation_subtype,MSI_status))
prop.table(msicimp,2)
```

Let's see whether tumor purity is associated with any of the variables we've studied so far.

```{r tumor_purity}
with(coad$covars,boxplot(purity~sex))
with(coad$covars,plot(age,purity))
with(coad$covars,boxplot(purity~MSI_status))
with(coad$covars,boxplot(purity~methylation_subtype))
```

There doesn't appear to be any association of tumor purity with the other variables of interest. We can show this in the correlation matrix.

```{r cors}
cormat <- cor(coad$covars[,1:5],use="pairwise.complete.obs")
corrplot::corrplot(cormat, method="square")
```

Still, what is the effect of (lack of) tumor purity on the analysis of the gene expression and DNA methylation data?

## Gene Expression

Here's a PCA of the gene expression data. Because we have over 15,000 variables, we're going to filter the features (variables) and analyze the 500 most variable features.

```{r pca}
# pick 500 most variable features
fmad <- matrixStats::rowMads(coad$lcpmy)
rfmad <- rank(-fmad)
fidx <- which(rfmad <= 500)
# transpose the expression matrix
tE <- t(coad$lcpmy[fidx,])
#tE <- t(coad$lcpmy)
sdztE <- scale(tE)
#pca
my.pcaE <- prcomp(sdztE,retx=TRUE)
epcs <- as.data.frame(my.pcaE$x)
ggplot(epcs, aes(x=PC1, y=PC2, color = coad$covars$MSI_status , shape = coad$covars$sex)) + geom_point(size=2.5) +
labs(color="MSI status",shape="Sex")
```

Nice! We see an association of PC1 sith MSI-H, and no noticeable association with sex.

Let's look at the correlation of the first 8 PCs with our covariates

```{r pcacov}
pc1thru8 <- epcs[,1:8]
cormat <- cor(coad$covars[,1:5],pc1thru8,use="pairwise.complete.obs")
corrplot::corrplot(cormat)
```

The biological variables MSI.H and CIMP.H have the greatest association with PC1, followed by sex. Sex, tumor purity and age are the variables most correlated with PC2. And purity is the variable most correlated with PC3 & PC4.

How much of the gene expression variation do these PCs explain?

```{r cumvarplot}
pcvar <- my.pcaE$sdev^2 #These are variances of PCs
cumvar <- cumsum(pcvar)
pctvar <- cumvar/max(cumvar)
par(mfrow=c(1,2))
plot(1:10,pcvar[1:10],xlab="PC",ylab="Variance")
plot(1:10,pctvar[1:10],xlab="PCs",ylab="Percent Variance Explained",ylim=c(0,1))
```

10 PCs explain ~50% of the variance in the gene expression data set.


## DNA methylation

Now let's see what happens when we do a similar analysis with DNA methylation data.

The DNAm values, called Beta values by Illumina, are proportions, bounded by 0 and 1. These should be transformed prior to analysis to be more normally distributed. A logit-like transformation works well here, M values = log2 (beta/1-beta).

```{r transform}
mvals <- function(beta){ log2(beta/(1-beta)) }
mcoad <- mvals(coad$dnam)
```

Let's look at a pca plot of these. Again, I will analyze the 500 most variable features only.

EXTRA CREDIT: Try re-doing this plot without the filtering to see how it affects the relationships with the biological variables.

```{r pca-dnam}
# pick 500 most variable features
fmad <- matrixStats::rowMads(mcoad)
rfmad <- rank(-fmad)
fidx <- which(rfmad <= 500)
# transpose the DNAm matrix
tM <- t(mcoad[fidx,])
#tM <- t(mcoad) #un-comment this line to analyze all variables without filtering
sdztM <- scale(tM,center=TRUE,scale=
matrixStats::colMads(tM))
my.mpca <- prcomp(sdztM,retx=TRUE)
mpcs <- as.data.frame(my.mpca$x)
ggplot(mpcs, aes(x=PC1, y=PC2, color = coad$covars$methylation_subtype , shape = coad$covars$sex)) + geom_point(size=2.5) +
labs(color="methylation subtype",shape="Sex")
```

PC1 and PC2 can separate patients by sex perfectly.

How much of the DNAm variation do these PCs explain?

```{r cumvarplot-dnam}
pcvar <- my.mpca$sdev^2 #These are variances of PCs
cumvar <- cumsum(pcvar)
pctvar <- cumvar/max(cumvar)
par(mfrow=c(1,2))
plot(1:10,pcvar[1:10],xlab="PC",ylab="Variance")
plot(1:10,pctvar[1:10],xlab="PCs",ylab="Percent Variance Explained",ylim=c(0,1))
```

PCs 1 & 2 explain over 60% of the variance measured by the top 500 variable features.

Let's check the distribution of CpGs by chromosome.

```{r mprop.table}
table(coad$fdnam$Chr)
```

X chromosome methylation is different in males and females (X chromosome inactivation). This is something you will always have to look out for when analyzing males and females together.

```{r fraction-sexchrfeatures}
sum(is.element(coad$fdnam$Chr,c("X","Y"))/length(coad$fdnam$Chr))
```

Just $7.6\%$ of the features are on sex chromosomes but when we filter on variance, we're selecting these into our analysis set and they will have a large impact on the cluster results. Let's drop the features on the sex chromosomes before performing our dimension reduction techniques.

```{r filtXY}
mdnam <- mcoad[
!is.element(coad$fdnam$Chr,c("X","Y")),]
fdnam <- coad$fdnam[!is.element(coad$fdnam$Chr,c("X","Y")),]
dim(mdnam)
nrow(fdnam)
```

Now what do we get from a PCA?
```{r pca-dnam2}
# pick 500 most variable features
fmad <- matrixStats::rowMads(mdnam)
rfmad <- rank(-fmad)
fidx <- which(rfmad <= 500)
# transpose the dnam matrix
tM <- t(mdnam[fidx,])
sdztM <- scale(tM,center=TRUE,scale=
matrixStats::colMads(tM))
my.mpca <- prcomp(sdztM,retx=TRUE)
mpca <- as.data.frame(my.mpca$x)
ggplot(mpca, aes(x=PC1, y=PC2, color = coad$covars$methylation_subtype, shape = coad$covars$sex)) + geom_point(size=2.5) +
labs(color="methylation subtype",shape="Sex")
```

The separation by sex has disappeared now. How do the top PCs correlate with the covariates?

```{r mpcacov}
covmat <- cor(coad$covars[,1:5],mpca[,1:8],use="pairwise.complete.obs")
corrplot::corrplot(covmat)
```

PC1 is highly correlated with both CIMP-H and MSI-H. This resembles the correlation matrix with the PCs from the gene expression data. Somewhat troubling is the correlation with purity that is seen in multiple PCs.

How much of the DNAm variation do these PCs explain?

```{r cumvar-dnam2}
pcvar <- my.mpca$sdev^2 #These are variances of PCs
cumvar <- cumsum(pcvar)
pctvar <- cumvar/max(cumvar)
par(mfrow=c(1,2))
plot(1:10,pcvar[1:10],xlab="PC",ylab="Variance")
plot(1:10,pctvar[1:10],xlab="PCs",ylab="Percent Variance Explained",ylim=c(0,1))
```

The first PC explains 40% of the variance, with the next 9 adding another 10%.

## Sparse CCA

With high-dimensional data, the standard canonical correlation analysis won't work. There are several new methods that have been proposed for estimating canonical correlations in this setting and we'll use the sparse CCA propsed by Witten et al. (2009) and available in the PMA package. This CCA has feature selection built in, but we need to pick the penalty parameter. We will run a function to search for the best penalty parameters. The smaller the penalty parameter, the fewer variables are fit in the model. Since we only have 163 subjects and we have nearly 100 times as many variables, I'm going to limit my search to low values of the penalty parameter.

First, we standardize the columns of the DNA methylation and gene expression matrices.

```{r stdize}
X <- scale(t(mdnam))
Z <- scale(t(coad$lcpmy))
```


This is going to take a little time to run because it will run 7 permutations over a grid of length 8. I should run more permutations, but this will be sufficient for illustration.

```{r CCAPermute}
set.seed(47)
perm.out <- CCA.permute(X,Z,typex="standard",
typez="standard",
penaltyxs=seq(.01,.2,len=8),
penaltyzs=seq(.01,.2,len=8),
nperms=7)
print(perm.out)
plot(perm.out)
```

Now let's fit CCA using the best penalty parameter, and ask for the top 3 canonical variables.

```{r BestPenaltyFit}
out <- CCA(X,Z,typex="standard",typez="standard",K=3,
penaltyx=perm.out$bestpenaltyx,
penaltyz=perm.out$bestpenaltyz, v=perm.out$v.init)
print(out)
```
Num non-zero u's gives the number of DNA methylation features selected in the 1st, 2nd, and 3rd canonical variables.
Num non-zero v's gives the number of gene expression features selected in the 1st, 2nd, and 3rd canonical variables.
Cor(Xu,Zv) are the correlations between the first three pairs of canonical variables.

Let's check that we can compute the same correlations from our data.
```{r computecor}
mcc1<- X%*%out$u[,1]
ecc1<- Z%*%out$v[,1]
plot(mcc1,ecc1,main=paste("correlation=",round(cor(mcc1,ecc1),7)))
mcc2<- X%*%out$u[,2]
ecc2<- Z%*%out$v[,2]
mcc3<- X%*%out$u[,3]
ecc3<- Z%*%out$v[,3]
ccvs <- cbind.data.frame(mcc1,ecc1,mcc2,ecc2,mcc3,ecc3)
```

Now let's see if the canonical variables correlate with the PCs.

```{r cors-wpcs}
pcs <- cbind.data.frame(mPC1=mpca[,1],
ePC1=epcs[,1],
mPC2=mpca[,2],
ePC2=epcs[,2],
mPC3=mpca[,3],
ePC3=epcs[,3]
)
svs <- cbind.data.frame(ccvs,pcs)
corrplot::corrplot(cor(svs))
```

One thing to realize is that these canonical variables are not independent of one another like they were in low-dimensional space, or like the PCs.

Do they correlate with our clinical variables? (Age, sex, MSI status, methylation subtype.)

```{r corcovar-ccvs}
cormat <- cor(coad$covars[,1:5],ccvs,use="pairwise.complete.obs")
round(cormat,2)
corrplot::corrplot(cormat,method="square")
```

The second pair of canonical variables are now much more correlated with tumor purity than they were from the PC analysis.

Let's recall what we saw with the PCs from separate analyses of the two data sets.

```{r corcovar-pcs}
cormat <- cor(coad$covars[,1:5],pcs,use="pairwise.complete.obs")
round(cormat,2)
corrplot::corrplot(cormat,method="square")
```

The correlation with tumor purity was less noticeable with the PCs of individual data types, but when we analyze the data jointly with CCA, this signal gets accentuated.

This shows that canonical correlation can prioritize signals shared from individual data types. Would we have found this if we had done a PCA of the concatenated data?

## PCA concatenated data

Our earlier PCA was of the 500 most variable features from each data set. Let's concatenate these and do a PCA on 1000 features.

```{r pca-concatenate}
my.cpca <- prcomp(cbind(sdztM,sdztE),retx=TRUE)
combpca <- as.data.frame(my.cpca$x)
ggplot(combpca, aes(x=PC1, y=PC2, color = coad$covars$methylation_subtype, shape = coad$covars$sex)) + geom_point(size=2.5) +
labs(color="methylation subtype",shape="Sex")
```

```{r corcovar-pcscombodata}
cormat <- cor(coad$covars[,1:5],combpca[,1:5],use="pairwise.complete.obs")
round(cormat,2)
corrplot::corrplot(cormat)
```

We find PCs 3 and 4 correlate with tumor purity in the analysis of the concatenated data sets.


Results we found:
1. A PCA of a subset of the data may find a different cluster result than PCA of the full data set. (e.g., DNA methylation finding clusters by sex)
2. Signals shared by different data types might be better prioritized by canonical correlation variables than a PCA of the concatenated data set. (e.g., tumor purity). However, we did have a different feature selection when running the CCA.

## SessionInfo

```{r SessionInfo}
sessionInfo()
```

0 comments on commit ad9fbb6

Please sign in to comment.