-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path08-abundance-analyses.qmd
327 lines (240 loc) · 14 KB
/
08-abundance-analyses.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
---
format:
html:
smooth-scroll: true
highlight-style: github
embed-resources: true
code-overflow: scroll
code-block-background: true
---
# Taxonomic Analysis with R {#sec-09-taxonomic-analysis-with-R}
::: {.callout-note icon="false"}
## Time
- Teaching: 40 min
- Exercises: 20 min
:::
::: {.callout-tip icon="false"}
## Questions
- How can we know which taxa are in our samples?
- How can we compare depth-contrasting samples?
- How can we manipulate our data to deliver a message?
:::
::: {.callout-warning icon="false"}
## Objectives
- Manipulate data types inside your phyloseq object.
- Extract specific information from taxonomic-assignation data.
:::
## Explore our samples at specific taxonomic levels
With the taxonomic assignment information that we obtained from Kraken, we have measured diversity, and we have visualized the taxa inside each sample with Krona and Pavian, but Phyloseq allows us to make this visualization more flexible and personalized. So now, we will use Phyloseq to make abundance plots of the taxa in our samples.
We will start our exploration at the Phylum level. In order to group all the OTUs that have the same taxonomy at a certain taxonomic rank,
we will use the function `tax_glom()`.
::: {style="background-color: rgb(234, 237, 237);"}
```
> percentages_glom <- tax_glom(percentages, taxrank = 'Phylum')
> View(percentages_glom@tax_table@.Data)
```
:::
![Taxonomic-data table after agglomeration at the phylum level](fig/09-01.png)
Another phyloseq function is `psmelt()`, which melts phyloseq objects into a `data.frame` to manipulate them with packages like `ggplot2` and `vegan`.
::: {style="background-color: rgb(234, 237, 237);"}
```
> percentages_df <- psmelt(percentages_glom)
> str(percentages_df)
'data.frame': 75 obs. of 6 variables:
$ OTU : chr "46157" "46157" "46157" "43389" ...
$ Sample : chr "2-kraken_report" "0-kraken_report" "1-kraken_report" "2-kraken_report" ...
$ Abundance: num 49.4 48 46.2 34.8 30.8 ...
$ Id : chr "2-kraken_report" "0-kraken_report" "1-kraken_report" "2-kraken_report" ...
$ Kingdom : chr "Bacteria" "Bacteria" "Bacteria" "Bacteria" ...
$ Phylum : chr "Proteobacteria" "Proteobacteria" "Proteobacteria" "Actinobacteriota" ...
```
:::
Now, let's create another data frame with the original data. This structure will help us to compare the absolute with the relative abundance and have a complete picture of our samples.
::: {style="background-color: rgb(234, 237, 237);"}
```
> absolute_glom <- tax_glom(physeq = merged_metagenomes, taxrank = "Phylum")
> absolute_df <- psmelt(absolute_glom)
> str(absolute_df)
'data.frame': 75 obs. of 6 variables:
$ OTU : chr "46157" "46157" "43389" "46157" ...
$ Sample : chr "2-kraken_report" "1-kraken_report" "2-kraken_report" "0-kraken_report" ...
$ Abundance: num 20055 17043 14131 12077 11343 ...
$ Id : chr "2-kraken_report" "1-kraken_report" "2-kraken_report" "0-kraken_report" ...
$ Kingdom : chr "Bacteria" "Bacteria" "Bacteria" "Bacteria" ...
$ Phylum : chr "Proteobacteria" "Proteobacteria" "Actinobacteriota" "Proteobacteria" ...
```
:::
With these objects and what we have learned regarding R data structures and `ggplot2`, we can compare them with a plot.
First, let's take some steps that will allow us to personalise our plot, making it accessible for color blindness.
We will create a color palette. With `colorRampPalette`, we will choose eight colors from the Dark2 palette and make a "ramp" with it; that is, convert those eight colors to the number of colors needed to have one for each phylum in our data frame.
We need to have our Phylum column in the factor structure for this.
::: {style="background-color: rgb(234, 237, 237);"}
```
> absolute_df$Phylum <- as.factor(absolute_df$Phylum)
> phylum_colors_abs<- colorRampPalette(brewer.pal(8,"Dark2")) (length(levels(absolute_df$Phylum)))
```
:::
Now, let´s create the figure for the data with absolute abundances (*, i.e.,* `absolute_plot` object)
::: {style="background-color: rgb(234, 237, 237);"}
```
> absolute_plot <- ggplot(data= absolute_df, aes(x=Sample, y=Abundance, fill=Phylum))+
geom_bar(aes(), stat="identity", position="stack")+
scale_fill_manual(values = phylum_colors_abs)
```
:::
With the `position="stack"` command, we are telling the `ggplot` function that the values must stack each other for each sample. In this way, we will
have all of our different categories (OTUs) stacked in one bar and not each in a separate one.
For more info [position_stack](https://ggplot2.tidyverse.org/reference/position_stack.html)
Next, we will create the figure for the representation of the relative abundance data and ask RStudio to show us both plots thanks to the `|` function from the library `patchwork`:
::: {style="background-color: rgb(234, 237, 237);"}
```
> percentages_df$Phylum <- as.factor(percentages_df$Phylum)
> phylum_colors_rel<- colorRampPalette(brewer.pal(8,"Dark2")) (length(levels(percentages_df$Phylum)))
> relative_plot <- ggplot(data=percentages_df, aes(x=Sample, y=Abundance, fill=Phylum))+
geom_bar(aes(), stat="identity", position="stack")+
scale_fill_manual(values = phylum_colors_rel)
> absolute_plot | relative_plot
```
:::
![Taxonomic diversity of absolute and relative abundance](fig/09-02.png)
At once, we can denote the difference between the two plots and how processing the data can
enhance the display of actual results. However, it is noticeable that we have too many taxa
to adequately distinguish the color of each one, less of the ones that hold the most incredible
abundance. In order to change that, we will use the power of data frames and R. We will change
the identification of the OTUs whose relative abundance is less than 0.2%:
::: {style="background-color: rgb(234, 237, 237);"}
```
> percentages_df$Phylum <- as.character(percentages_df$Phylum) # Return the Phylum column to be of type character
> percentages_df$Phylum[percentages_df$Abundance < 0.5] <- "Phyla < 0.5% abund."
> unique(percentages_df$Phylum)
[1] "Proteobacteria" "Actinobacteriota" "Bacteroidota"
[4] "Patescibacteria" "Deinococcota" "Chloroflexi"
[7] "Abditibacteriota" "Firmicutes" "Phyla < 0.5% abund."
```
:::
Let's ask R to display the figures again by re-running our code:
::: {style="background-color: rgb(234, 237, 237);"}
```
> percentages_df$Phylum <- as.factor(percentages_df$Phylum)
> phylum_colors_rel<- colorRampPalette(brewer.pal(8,"Dark2")) (length(levels(percentages_df$Phylum)))
> relative_plot <- ggplot(data=percentages_df, aes(x=Sample, y=Abundance, fill=Phylum))+
geom_bar(aes(), stat="identity", position="stack")+
scale_fill_manual(values = phylum_colors_rel)
> absolute_plot | relative_plot
```
:::
![Taxonomic diversity of absolute and relative abundance with corrections](fig/09-03.png)
## Going further, let's take an exciting lineage and explore it thoroughly
As we have already reviewed, Phyloseq offers many tools to manage and explore data. Let's take a
look at a function we already use but now with guided exploration. The `subset_taxa` command is used to extract specific lineages from a stated taxonomic level; we have used it to get
rid of the reads that do not belong to bacteria with `merged_metagenomes <- subset_taxa(merged_metagenomes, Kingdom == "Bacteria")`.
We will use it now to extract a specific phylum from our data and explore it at a lower
taxonomic level: Genus. We will take as an example the phylum Actionbacteriota, as members of this phylum were found to be enriched in diseased plants:
::: {style="background-color: rgb(234, 237, 237);"}
```
> streptomyces <- subset_taxa(merged_metagenomes, Phylum == "Actinobacteriota")
> unique(streptomyces@tax_table@.Data[,2])
[1] "Actinobacteriota"
```
:::
Let's do a little review of all that we saw today: **Transformation of the data; Manipulation of the
information; and plotting**:
::: {style="background-color: rgb(234, 237, 237);"}
```
> streptomyces_percentages <- transform_sample_counts(streptomyces, function(x) x*100 / sum(x) )
> streptomyces_glom <- tax_glom(streptomyces_percentages, taxrank = "Genus")
> streptomyces_df <- psmelt(streptomyces_glom)
> streptomyces_df$Genus[streptomyces_df$Abundance < 10] <- "Genera < 10.0 abund"
> streptomyces_df$Genus <- as.factor(streptomyces_df$Genus)
> genus_colors_streptomyces <- colorRampPalette(brewer.pal(8,"Dark2")) (length(levels(streptomyces_df$Genus)))
> plot_streptomyces <- ggplot(data=streptomyces_df, aes(x=Sample, y=Abundance, fill=Genus))+
geom_bar(aes(), stat="identity", position="stack")+
scale_fill_manual(values = genus_colors_streptomyces)
> plot_streptomyces
```
:::
![Diversity of Actinobacteriota at genus level inside our samples](fig/09-05.png)
::: {.callout-tip icon="false"}
## Exercise 1: Taxa agglomeration
With the following code, in the dataset with absolute abundances, group together the phyla with a small number of reads to have a better visualization of the data. Remember to check the data classes inside your data frame.
According to the [ColorBrewer](https://github.com/axismaps/colorbrewer/) package it is recommended not to have more than nine different colors in a plot.
What is the correct order to run the following chunks of code?
Compare your graphs with your partners'.
A) `absolute_df$Phylum <- as.factor(absolute_df$Phylum)`
B) `absolute_plot <- ggplot(data= absolute_df, aes(x=Sample, y=Abundance, fill=Phylum))+`
`geom_bar(aes(), stat="identity", position="stack")+`
`scale_fill_manual(values = phylum_colors_abs)`
C) `absolute_$Phylum[absolute_$Abundance < 300] <- "Minoritary Phyla"`
D) `phylum_colors_abs<- colorRampPalette(brewer.pal(8,"Dark2")) (length(levels(absolute_df$Phylum)))`
E) `absolute_df$Phylum <- as.character(absolute_df$Phylum)`
:::
::: {.callout-tip collapse="true" icon="false"}
## Solution
By grouping the samples with less than 300 reads, we can get a more decent plot.
Certainly, this will be difficult since each sample has a contrasting number of reads.
E) `absolute_df$Phylum <- as.character(absolute_df$Phylum)`
C) `absolute_df$Phylum[absolute_df$Abundance < 300] <- "Minoritary Phyla"`
A) `absolute_df$Phylum <- as.factor(absolute_df$Phylum)`
D) `phylum_colors_abs<- colorRampPalette(brewer.pal(8,"Dark2")) (length(levels(absolute_df$Phylum)))`
B) `absolute_plot <- ggplot(data= absolute_df, aes(x=Sample, y=Abundance, fill=Phylum))+`
`geom_bar(aes(), stat="identity", position="stack")+`
`scale_fill_manual(values = phylum_colors_abs)`
Show your plots:
`absolute_plot | relative_plot`
![](fig/09-06.png)
:::
::: {.callout-tip icon="false"}
## Exercise 2: Recap of abundance plotting
Match the chunk of code with its description and put them in the correct order to create a relative abundance plot at the genus level of a particular phylum:
| Description | Command |
| ----------- | ----------- |
| plot the relative abundance at the genus levels. | `plot_proteo` |
| Convert all the genera with less than 3% abundance into only one label. | `proteo_percentages <- transform_sample_counts(proteo, function(x) >x*100 / sum(x) )` |
| Make just one row that groups all the observations of the same genus.| `proteo <- subset_taxa(merged_metagenomes, Phylum == "Proteobacteria")` |
| Create a phyloseq object only with the reads assigned to a certain phylum.| `unique(proteo@tax_table@.Data[,2])` |
| Show the plot. | `proteo_glom <- tax_glom(proteo_percentages, taxrank = "Genus")` |
| Transform the phyloseq object to a data frame. | `plot_proteo <- ggplot(data=proteo_df, aes(x=Sample, y=Abundance, fill=Genus))+` |
| | `geom_bar(aes(), stat="identity", position="stack")+`|
| | `scale_fill_manual(values = genus_colors_proteo)` |
| Convert the Genus column into the factor structure. | `proteo_df$Genus[proteo_df$Abundance < 3] <- "Genera < 3% abund"` |
| Look at the phyla present in your phyloseq object. | `proteo_df <- psmelt(proteo_glom)` |
| Convert the abundance counts to relative abundance. | `genus_colors_proteo<- colorRampPalette(brewer.pal(8,"Dark2")) (length(levels(proteo_df$Genus)))` |
| Make a palette with the appropriate colors for the number of genera. | `proteo_df$Genus <- as.factor(proteo_df$Genus)` |
:::
::: {.callout-tip collapse="true" icon="false"}
## Solution
::: {style="background-color: rgb(234, 237, 237);"}
```
# Create a phyloseq object only with the reads assigned to a certain phylum.
> proteo <- subset_taxa(merged_metagenomes, Phylum == "Proteobacteria")
# Look at the phyla present in your phyloseq object
> unique(proteo@tax_table@.Data[,2])
# Convert the abundance counts to the relative abundance
> proteo_percentages <- transform_sample_counts(proteo, function(x) x*100 / sum(x) )
# Make just one row that groups all the observations of the same genus.
> proteo_glom <- tax_glom(proteo_percentages, taxrank = "Genus")
# Transform the phyloseq object to a data frame
> proteo_df <- psmelt(proteo_glom)
# Convert all the genera that have less than 3% of abundance into only one label
> proteo_df$Genus[proteo_df$Abundance < 3] <- "Genera < 3% abund"
# Convert the Genus column into the factor structure
> proteo_df$Genus <- as.factor(proteo_df$Genus)
# Make a palette with the appropriate colors for the number of genera
> genus_colors_proteo<- colorRampPalette(brewer.pal(8,"Dark2")) (length(levels(proteo_df$Genus)))
# Plot the relative abundance at the genus levels
> plot_proteo <- ggplot(data=proteo_df, aes(x=Sample, y=Abundance, fill=Genus))+
geom_bar(aes(), stat="identity", position="stack")+
scale_fill_manual(values = genus_colors_proteo)
# Show the plot
> plot_proteo
```
:::
![](fig/09-07.png)
A new plot with three bars representing the absolute abundance of Proteobacteria in each of the samples.
Each of the colors represents a Genus. Because we see relative abundances, all the bars have the same height.
:::
::: {.callout-caution icon="false"}
## Keypoints
- Depths and abundances can be visualized using phyloseq.
- The library `phyloseq` lets you manipulate metagenomic data in a taxonomic specific perspective.
:::