-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path07-tackling-diversity-with-R.qmd
375 lines (260 loc) · 16.3 KB
/
07-tackling-diversity-with-R.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
---
format:
html:
smooth-scroll: true
highlight-style: github
embed-resources: true
code-overflow: scroll
code-block-background: true
---
# Diversity Tackled with R {#sec-08-diversity-tackled-with-R}
::: {.callout-note icon="false"}
## Time
- Teaching: 40 min
- Exercises: 10 min
:::
::: {.callout-tip icon="false"}
## Questions
- How can we measure diversity?
- How can I use R to analyze diversity?
:::
::: {.callout-warning icon="false"}
## Objectives
- Plot alpha and beta diversity.
:::
*Look at your fingers; controlled by the mind can do great things. However, imagine if each one has a little brain of its own, with different ideas, desires, and fears ¡How wonderful things will be made out of an artist with such hands!* -Ode to multidisciplinarity
## First plunge into diversity
Species diversity, in its simplest definition, is the number of species in a particular area and their relative abundance (evenness).
Once we know the taxonomic composition of our metagenomes, we can do diversity analyses. Here we will discuss the two most used diversity metrics, α diversity (within one metagenome) and β (across metagenomes).
- α Diversity: Can be represented only as richness (*, i.e.,* the number of different species in an environment), or it can be measured considering the abundance of the species in the environment as well (*i.e.,* the number of individuals of each species inside the environment). To measure α-diversity, we use indexes such as Shannon's, Simpson's, Chao1, etc.
![Alpha diversity diagram](fig/08-01.png)
- β diversity is the difference (measured as distance) between two or more environments. It can be measured with metrics like Bray-Curtis dissimilarity, Jaccard distance, or UniFrac distance, to name a few. Each one of this measures are focused on a characteristic of the community (*e.g.,* Unifrac distance measures the phylogenetic relationship between the species of the community).
In the next example, we will look at the α and the β components of the diversity of a dataset of fishes in three lakes. The most simple way to calculate the β-diversity is to calculate the distinct species between two lakes (sites). Let us take as an example the diversity between Lake A and Lake B. The number of species in Lake A is 3. To this quantity, we will subtract the number of these species that are shared with the Lake B: 2. So the number of unique species in Lake A compared to Lake B is (3-2) = 1. To this number, we will sum the result of the same operations but now take Lake B as our reference site. In the end, the β diversity between Lake A and Lake B is (3-2) + (3-2) = 2. This process can be repeated, taking each pair of lakes as the focused sites.
![Alpha and beta diversity indexes of fishes in a pond](fig/08-02.png) If you want to read more about diversity, we recommend to you this [paper](https://link.springer.com/article/10.1007/s00442-010-1812-0) on the concept of diversity.
## α diversity
![](fig/08-03.png)
![](fig/08-04.png)
### β diversity
Diversity β measures how different two or more communities are, either in their composition (richness) or in the abundance of the organisms that compose it (abundance).
- Bray-Curtis dissimilarity: The difference in richness and abundance across environments (samples). Weight on abundance. Measures the differences from 0 (equal communities) to 1 (different communities)
- Jaccard distance: Based on the presence/absence of species (diversity). It goes from 0 (same species in the community) to 1 (no species in common)
- UniFrac: Measures the phylogenetic distance; how alike the trees in each community are. There are two types, without weights (diversity) and with weights (diversity and abundance)
There are different ways to plot and show the results of such analysis. Among others, PCA, PCoA, or NMDS analysis are widely used.
::: {.callout-tip icon="false"}
## Exercise 1: Simple measure of alpha and beta diversities.
In the next picture, there are two lakes with different fish species:
::: {#fig-two-lakes layout-ncol=2}
![](fig/08-05.png)
![](fig/08-06.png)
:::
Which of the options below is true for the alpha diversity in lakes A, B, and beta diversity between lakes A and B, respectively?
1. 4, 3, 1
2. 4, 3, 5
3. 9, 7, 16
Please, paste your result on the collaborative document provided by instructors.
:::
::: {.callout-tip collapse="true" icon="false"}
## Solution
Answer: 2. 4, 3, 5
**Alpha diversity** in this case, is the sum of different species. Lake **A** has **4** different species and lake **B** has **3** different species.
**Beta diversity** refers to the difference between lake A and lake B. If we use the formula in *Figure 2* we can see that to calculate beta diversity, we have to detect the number of species and the number of shared species in both lakes.
There is only one shared species, so we have to subtract the number of shared species to the total species and sum the result. In this case, in lake A, we have 4 different species and one shared species with lake B (4-1)=3, and in lake B we have three species and one shared species with lake A (3-1)=2.
If we add 3+2, the result is 5.
:::
## Plot alpha diversity
We want to know the bacterial diversity, so we will prune all non-bacterial organisms in our `merged_metagenomes` Phyloseq object.
To do this, we will make a subset of all bacterial groups and save them.
::: {style="background-color: rgb(234, 237, 237);"}
```
> merged_metagenomes <- subset_taxa(merged_metagenomes, Kingdom == "Bacteria")
```
:::
Now let us look at some statistics of our metagenomes. By the output of the `sample_sums()` command:
::: {style="background-color: rgb(234, 237, 237);"}
```
> sample_sums(merged_metagenomes)
0-kraken_report 1-kraken_report 2-kraken_report
32389 53156 55956
```
:::
we can see how many reads there are in the library.
Library B_Sample_97 is the smallest with 32389 reads, while library B_Sample_99 is the largest with 55958 reads.
Also we can obtain the Max, Min, and Mean output on `summary()`, which can give us a sense of the evenness.
For example, the OTU that occurs the most in the sample B_Sample_97 occurs 5979 times, while on average in sample B_Sample_98, an OTU occurs in 48.02 reads.
::: {style="background-color: rgb(234, 237, 237);"}
```
summary(merged_metagenomes@otu_table@.Data)
0-kraken_report 1-kraken_report 2-kraken_report
Min. : 0.00 Min. : 0.00 Min. : 0.00
1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 1.00
Median : 1.00 Median : 2.00 Median : 2.00
Mean : 29.26 Mean : 48.02 Mean : 50.55
3rd Qu.: 3.00 3rd Qu.: 7.00 3rd Qu.: 6.50
Max. :5979.00 Max. :4459.00 Max. :7048.00
```
:::
To have a more visual representation of the diversity inside the samples (i.e., α diversity), we can now look at a graph created using Phyloseq:
::: {style="background-color: rgb(234, 237, 237);"}
```
> plot_richness(physeq = merged_metagenomes,
measures = c("Observed","Chao1","Shannon"))
```
:::
![Alpha diversity indexes for both samples](fig/08-07.png)
Each of these metrics can give an insight into the distribution of the OTUs inside our samples.
For example, the Chao1 diversity index gives more weight to singletons and doubletons observed in our samples, while Shannon is an entropy index remarking the impossibility of taking two reads out of the metagenome "bag" and that these two will belong to the same OTU.
::: {.callout-tip icon="false"}
## Exercise 2: Exploring function flags.
While using the help provided, explore these options available for the function in `plot_richness()`:
1. `title`
2. `nrow`
3. `sortby`
Use these options to generate new figures that show you other ways to present the data.
:::
::: {.callout-tip collapse="true" icon="false"}
## Solution
The code and the plot using the three options will look as follows:
The "title" option adds a title to the figure.
::: {style="background-color: rgb(234, 237, 237);"}
```
> plot_richness(physeq = merged_metagenomes, title = "Alpha diversity indexes for three samples from Healthy Upper Epidermidis", measures = c("Observed","Chao1","Shannon"))
```
:::
![Alpha diversity plot with the title.](fig/08-08.png)
The "nrow" option arranges the graphics horizontally.
::: {style="background-color: rgb(234, 237, 237);"}
```
> plot_richness(physeq = merged_metagenomes, measures = c("Observed","Chao1","Shannon"), nrow=3)
```
:::
![Alpha diversity plot with the three panels arranged in rows](fig/08-09.png)
The "sortby" option orders the samples from least to greatest diversity depending on the parameter. In this case, it is ordered by "Shannon" and tells us that the B_Sample_97 has the lowest diversity and the B_Sample_98 the highest.
::: {style="background-color: rgb(234, 237, 237);"}
```
> plot_richness(physeq = merged_metagenomes, measures = c("Observed","Chao1","Shannon"), sortby = "Shannon")
```
:::
![Samples sorted by Shannon in alpha diversity index plots.](fig/08-10.png)
Considering those mentioned above, together with the three graphs, we can say that B_Sample_98 and B_Sample_99 present a higher diversity compared to sample B_Sample_97.
Moreover, the diversity of the sample B_Sample_98 and B_Sample_99 is mainly given by singletons or doubletons. While the diversity of sample B_Sample_98 provided by species occurs in greater abundance.
Although the values of H (Shannon) above three are considered to have a lot of diversity.
:::
A caution when comparing samples is that differences in some alpha indexes may be the consequence of the difference in the total number of reads of the samples. A sample with more reads is more likely to have more different OTUs, so some normalization is needed.
Here we will work with relative abundances, but other approaches could help reduce this bias.
## Absolute and relative abundances
From the read counts that we just saw, it is evident that there is a great difference in the number of total sequenced reads in each sample.
Before we further process our data we should look to see if we have any non-identified reads. Marked as blank (i.e.,"") on the different taxonomic levels:
::: {style="background-color: rgb(234, 237, 237);"}
```
> summary(merged_metagenomes@tax_table@.Data== "")
Kingdom Phylum Class Order
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:1107 FALSE:1107 FALSE:1107 FALSE:1106
TRUE :1
Family Genus Species
Mode :logical Mode :logical Mode:logical
FALSE:1024 FALSE:860 TRUE:1107
TRUE :83 TRUE :247
```
:::
With the command above, we can see blanks on different taxonomic levels. For example, we have 247 blanks at the genus level.
Although we could expect to see some blanks at the species or even at the genus level; we will get rid of the ones at the genus level to proceed with the analysis:
::: {style="background-color: rgb(234, 237, 237);"}
```
> merged_metagenomes <- subset_taxa(merged_metagenomes, Genus != "") #Only genus that are no blank
> summary(merged_metagenomes@tax_table@.Data== "")
Kingdom Phylum Class Order
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:860 FALSE:860 FALSE:860 FALSE:859
TRUE :1
Family Genus Species
Mode :logical Mode :logical Mode:logical
FALSE:856 FALSE:860 TRUE:860
TRUE :4
```
:::
Next, since our metagenomes have different sizes, it is imperative to convert the number of assigned reads (i.e., absolute abundance) into percentages (i.e., relative abundances) to compare them.
Right now, our OTU table looks like this:
::: {style="background-color: rgb(234, 237, 237);"}
```
> head(merged_metagenomes@otu_table@.Data)
0-kraken_report 1-kraken_report 2-kraken_report
46157 5979 4459 7048
2444 236 308 517
26042 94 98 207
26048 10 40 34
26054 7 19 17
26040 7 25 29
```
:::
To make this transformation to percentages, we will take advantage of a function of Phyloseq:
::: {style="background-color: rgb(234, 237, 237);"}
```
> percentages <- transform_sample_counts(merged_metagenomes, function(x) x*100 / sum(x) )
> head(percentages@otu_table@.Data)
head(percentages@otu_table@.Data)
0-kraken_report 1-kraken_report 2-kraken_report
46157 23.78092435 12.09384323 17.36559405
2444 0.93866836 0.83536751 1.27383827
26042 0.37387638 0.26579875 0.51002809
26048 0.03977408 0.10848929 0.08377273
26054 0.02784186 0.05153241 0.04188636
26040 0.02784186 0.06780580 0.07145321
```
:::
Now, we are ready to compare the abundaces given by percantages of the samples with beta diversity indexes.
## Beta diversity
As we mentioned before, the beta diversity is a measure of how alike or different our samples are (overlap between discretely defined sets of species or operational taxonomic units).
To measure this, we need to calculate an index that suits the objectives of our research. By the next code, we can display all the possible distance metrics that Phyloseq can use:
::: {style="background-color: rgb(234, 237, 237);"}
```
> distanceMethodList
$UniFrac
[1] "unifrac" "wunifrac"
$DPCoA
[1] "dpcoa"
$JSD
[1] "jsd"
$vegdist
[1] "manhattan" "euclidean" "canberra" "bray" "kulczynski"
[6] "jaccard" "gower" "altGower" "morisita" "horn"
[11] "mountford" "raup" "binomial" "chao" "cao"
$betadiver
[1] "w" "-1" "c" "wb" "r" "I" "e" "t" "me" "j" "sor" "m"
[13] "-2" "co" "cc" "g" "-3" "l" "19" "hk" "rlb" "sim" "gl" "z"
$dist
[1] "maximum" "binary" "minkowski"
$designdist
[1] "ANY"
```
:::
Describing all these possible distance metrics is beyond the scope of this lesson, but here we show which are the ones that need a phylogenetic relationship between the species-OTUs present in our samples:
- Unifrac
- Weight-Unifrac
- DPCoA
We do not have a phylogenetic tree or phylogenetic relationships. So we can not use any of those three.
We will use [Bray-curtis](http://www.pelagicos.net/MARS6300/readings/Bray_&_Curtis_1957.pdf) since it is one of the most robust and widely used distance metrics to calculate beta diversity.
**Let's keep this up!** We already have all we need to begin the beta diversity analysis. We will use the Phyloseq command `ordinate` to generate a new object where the distances between our samples will be allocated after calculating them.
For this command, we need to specify which method we will use to generate a matrix. In this example, we will use Non-Metric Multidimensional Scaling or [NMDS](https://academic.oup.com/bioinformatics/article/21/6/730/199398).
NMDS attempts to represent the pairwise dissimilarity between objects in a low-dimensional space, in this case, a two-dimensional plot.
::: {style="background-color: rgb(234, 237, 237);"}
```
meta_ord <- ordinate(physeq = percentages, method = "NMDS", distance = "bray")
```
:::
If you get some warning messages after running this script, fear not. It is because we only have three samples. Few samples make the algorithm warn about the lack of difficulty in generating the distance matrix.
By now, we just need the command `plot_ordination()` to see the results from our beta diversity analysis:
::: {style="background-color: rgb(234, 237, 237);"}
```
> plot_ordination(physeq = percentages, ordination = meta_ord)
```
:::
![Beta diversity with NMDS of our three samples](fig/08-11.png)
In this NMDS plot, each point represents the combined abundance of all its OTUs. As depicted, each sample occupies space in the plot without forming any clusters.
This output is because each sample is different enough to be considered its own point in the NMDS space.
::: {.callout-caution icon="false"}
## Keypoints
- Alpha diversity measures the intra-sample diversity.
- Beta diversity measures the inter-sample diversity.
- Phyloseq includes diversity analyses such as alpha and beta diversity calculation.
:::