-
Notifications
You must be signed in to change notification settings - Fork 0
/
1. Exploratory Data Analysis - MSigDB Gene Sets + GTEx TPM.Rmd
309 lines (245 loc) · 10.3 KB
/
1. Exploratory Data Analysis - MSigDB Gene Sets + GTEx TPM.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
---
title: "Exploratory Data Analysis"
author:
- Kim Williame Lee^[De La Salle University, Manila, Philippines, kim_leejra@dlsu.edu.ph]
- Mark Edward M. Gonzales^[De La Salle University, Manila, Philippines, gonzales.markedward@gmail.com]
- Dr. Anish M.S. Shrestha^[De La Salle University, Manila, Philippines, anish.shrestha@dlsu.edu.ph]
output: html_notebook
---
## I. Preliminaries
### Loading libraries
```{r, warning=FALSE, message=FALSE}
library("tidyverse")
library("tibble")
library("msigdbr")
library("ggplot2")
library("TCGAbiolinks")
library("RNAseqQC")
library("DESeq2")
library("ensembldb")
library("purrr")
library("magrittr")
library("vsn")
library("matrixStats")
library("dplyr")
library("grex")
```
### Constants
```{r}
DATA_DIR <- "data/GTEx/"
```
## II. Loading the GTEx annotations
- `GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt` - A de-identified, open access version of the sample annotations available in dbGaP (database of genotypes and phenotypes)
- `GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt` - A de-identified, open access version of the subject phenotypes available in dbGaP.
Download these files by running:
```
wget https://storage.googleapis.com/adult-gtex/annotations/v8/metadata-files/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt -P data/GTEx/
wget https://storage.googleapis.com/adult-gtex/annotations/v8/metadata-files/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt -P data/GTEx/
```
```{r}
sample.df <- read.delim(paste0(DATA_DIR, "GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt"), as.is = TRUE, header = TRUE, row.names = 1)
subject.df <- read.delim(paste0(DATA_DIR, "GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt"), as.is = TRUE, header = TRUE, row.names = 1)
```
The `DTHHRDY` column refers to the 4-point Hardy scale: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/variable.cgi?study_id=phs000424.v4.p1&phv=169092
Refer to the metadata files here for more information: https://gtexportal.org/home/downloads/adult-gtex/metadata
```{r}
subject.df
```
Refer to the metadata files here for more information: https://gtexportal.org/home/downloads/adult-gtex/metadata
```{r}
sample.df
```
Extract entries that pertain to transcriptomic (RNA) data.
```{r}
rnaseq.sample.df <- sample.df[sample.df["SMAFRZE"] == "RNASEQ", ]
rnaseq.sample.df
```
## III. Filtering colon samples
The `SMTSD` column refers to the tissue type (i.e., the area from which the sample was taken).
```{r}
as.matrix(sort(table(rnaseq.sample.df["SMTSD"]), decreasing = TRUE))
```
We are only interested in those from the colorectal area.
```{r}
colon.sample.df <- rnaseq.sample.df %>% dplyr::filter(SMTSD == "Colon - Sigmoid")
colon.sample.df
```
## IV. Loading TPM data from GTEx
`GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct` contains the gene TPMs.
Download this file by running:
```
wget https://storage.googleapis.com/adult-gtex/bulk-gex/v8/rna-seq/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz -P data/GTEx/
```
```{r}
tpm.df <- read.delim(paste0(DATA_DIR, "GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct"),
as.is = T, row.names = 1, check.names = FALSE, skip = 2
)
gene.names.df <- tpm.df[, "Description", drop = FALSE]
tpm.df <- tpm.df[, !(names(tpm.df) %in% c("Description"))]
{
cat(paste("Number of genes in table: ", dim(tpm.df)[1]))
}
```
Perform some data preprocessing.
```{r}
# Remove version number: https://www.rdocumentation.org/packages/grex/versions/1.9/topics/cleanid
tpm.df$ensembl <- cleanid(rownames(tpm.df))
head(tpm.df$ensembl)
```
FADD is one of the key genes involved in necroptosis.
```{r}
# Create a named vector to map names to IDs
name_to_id <- setNames(rownames(gene.names.df), gene.names.df$Description)
# Create a named vector to map IDs to names
id_to_name <- setNames(gene.names.df$Description, rownames(gene.names.df))
# To retrieve the ID for 'FADD'
print(name_to_id[["FADD"]])
print(id_to_name[["ENSG00000168040.4"]])
```
## V. Loading RCD gene sets
### Necroptosis
We obtain the gene set from the Human MSigDB Collections:
- `GOBP_NECROPTOTIC_SIGNALING_PATHWAY` refers to the necroptosis gene set (found via MSigDB's search functionality: https://www.gsea-msigdb.org/gsea/msigdb/human/search.jsp).
- The category and subcategory parameters were decided based on this gene set.
- `C5` consists of genes annotated by the same ontology term (https://www.gsea-msigdb.org/gsea/msigdb/).
- `GO:BP` refers to the "biological process" category in Gene Ontology.
CAVEAT! `GOBP_NECROPTOTIC_SIGNALING_PATHWAY` contains only 8 genes.
```{r}
necroptosis.genes <- msigdbr(species = "human", category = "C5", subcategory = "GO:BP") %>%
dplyr::filter(gs_name == "GOBP_NECROPTOTIC_SIGNALING_PATHWAY")
necroptosis.genes
```
### Ferroptosis
We obtain the gene set from the Human MSigDB Collections:
- `WP_FERROPTOSIS` refers to the ferroptosis gene set (found via MSigDB's search functionality: https://www.gsea-msigdb.org/gsea/msigdb/human/search.jsp).
- The category and subcategory parameters were decided based on this gene set.
- `C2` consists of the curated gene sets (https://www.gsea-msigdb.org/gsea/msigdb/).
- `CP:WIKIPATHWAYS` refers to the curated gene set from WikiPathways.
Note that there is another ferroptosis gene set: `GOBP_FERROPTOSIS`. <br>
However, it contains only 10 genes (for comparison, `WP_FERROPTOSIS` contains 64 genes).
```{r}
ferroptosis.genes <- msigdbr(species = "human", category = "C2", subcategory = "CP:WIKIPATHWAYS") %>%
dplyr::filter(gs_name == "WP_FERROPTOSIS")
ferroptosis.genes
```
### Pyroptosis
We obtain the gene set from the Human MSigDB Collections:
- `REACTOME_PYROPTOSIS` refers to the pyroptosis gene set (found via MSigDB's search functionality: https://www.gsea-msigdb.org/gsea/msigdb/human/search.jsp).
- The category and subcategory parameters were decided based on this gene set.
- `C2` consists of the curated gene sets (https://www.gsea-msigdb.org/gsea/msigdb/).
- `CP:REACTOME` refers to the curated gene set from Reactome.
`REACTOME_PYROPTOSIS` contains 27 genes.
```{r}
pyroptosis.genes <- msigdbr(species = "human", category = "C2", subcategory = "CP:REACTOME") %>%
dplyr::filter(gs_name == "REACTOME_PYROPTOSIS")
pyroptosis.genes
```
## VI. Exploratory Data Analysis
Select only the tissue samples (columns) from the colorectal area.
```{r}
tpm.colon.df <- tpm.df %>% dplyr::select(c(ensembl, rownames(colon.sample.df)))
tpm.colon.df
```
### Necroptosis
Get the TPM for the genes in the necroptosis gene set.
```{r}
tpm.colon.necro.df <- tpm.colon.df %>% dplyr::filter(ensembl %in% necroptosis.genes$ensembl_gene)
tpm.colon.necro.df2 <- left_join(tpm.colon.necro.df, necroptosis.genes %>% dplyr::select(ensembl_gene, gene_symbol), by = c("ensembl" = "ensembl_gene"))
tpm.colon.necro.df$gene_symbol <- tpm.colon.necro.df2$gene_symbol
tpm.colon.necro.df
```
Compute the median TPM per gene.
```{r}
gene.expressions.necro <- data.frame(
row.names = tpm.colon.necro.df$gene_symbol,
tissue = "Colon",
tpm = matrixStats::rowMedians(as.matrix(tpm.colon.necro.df %>% dplyr::select(-c("ensembl", "gene_symbol"))))
)
gene.expressions.necro
```
Plot the gene expression.
```{r}
ggplot(gene.expressions.necro, aes(y = tissue, x = rownames(gene.expressions.necro), fill = tpm)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
theme_minimal() +
theme(
axis.text.x = element_text(size = 9, hjust = 1)
) +
labs(
title = "Expression of necroptosis-related genes in GTEx colon tissues",
x = "Gene",
y = "Area",
fill = "TPM"
) +
geom_text(aes(label = tpm), vjust = 1)
```
**Sanity Check:** RIPK1 and necroptosis - https://www.nature.com/articles/s12276-022-00847-4
### Ferroptosis
Get the TPM for the genes in the ferroptosis gene set.
```{r}
tpm.colon.ferro.df <- tpm.colon.df %>% dplyr::filter(ensembl %in% ferroptosis.genes$ensembl_gene)
tpm.colon.ferro.df2 <- left_join(tpm.colon.ferro.df, ferroptosis.genes %>% dplyr::select(ensembl_gene, gene_symbol), by = c("ensembl" = "ensembl_gene"))
tpm.colon.ferro.df$gene_symbol <- tpm.colon.ferro.df2$gene_symbol
tpm.colon.ferro.df
```
Compute the median TPM per gene.
```{r}
gene.expressions.ferro <- data.frame(
row.names = tpm.colon.ferro.df$gene_symbol,
tissue = "Colon",
tpm = matrixStats::rowMedians(as.matrix(tpm.colon.ferro.df %>% dplyr::select(-c("ensembl", "gene_symbol"))))
)
gene.expressions.ferro
```
Plot the gene expression.
```{r}
ggplot(gene.expressions.ferro, aes(y = tissue, x = rownames(gene.expressions.ferro), fill = tpm)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
theme_minimal() +
theme(
axis.text.x = element_text(size = 7, angle = 90, hjust = 1)
) +
labs(
title = "Expression of ferroptosis-related genes in GTEx colon tissues",
x = "Gene",
y = "Area",
fill = "TPM"
)
```
**Sanity Check:** FTH1 and ferroptosis - https://www.nature.com/articles/s41420-022-00902-z
### Pyroptosis
Get the TPM for the genes in the pyroptosis gene set.
```{r}
tpm.colon.pyro.df <- tpm.colon.df %>% dplyr::filter(ensembl %in% pyroptosis.genes$ensembl_gene)
tpm.colon.pyro.df2 <- left_join(tpm.colon.pyro.df, pyroptosis.genes %>% dplyr::select(ensembl_gene, gene_symbol), by = c("ensembl" = "ensembl_gene"))
tpm.colon.pyro.df$gene_symbol <- tpm.colon.pyro.df2$gene_symbol
tpm.colon.pyro.df
```
Compute the median TPM per gene.
```{r}
gene.expressions.pyro <- data.frame(
row.names = tpm.colon.pyro.df$gene_symbol,
tissue = "Colon",
tpm = matrixStats::rowMedians(as.matrix(tpm.colon.pyro.df %>% dplyr::select(-c("ensembl", "gene_symbol"))))
)
gene.expressions.pyro
```
Plot the gene expression.
```{r}
ggplot(gene.expressions.pyro, aes(y = tissue, x = rownames(gene.expressions.pyro), fill = tpm)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
theme_minimal() +
theme(
axis.text.x = element_text(size = 7, angle = 90, hjust = 1)
) +
labs(
title = "Expression of pyroptosis-related genes in GTEx colon tissues",
x = "Gene",
y = "Area",
fill = "TPM"
)
```
**Sanity Check:** CHMP4B and pyroptosis - https://pubmed.ncbi.nlm.nih.gov/38823000/