05-taxonomic-assignment.qmd

---
format:
  html:
    smooth-scroll: true
    highlight-style: github 
    embed-resources: true
    code-overflow: scroll
    code-block-background: true
---

# Taxonomic Assignment {#sec-06-taxonomic-assignment}

::: {.callout-note icon="false"}
## Time

-   Teaching: 30 min
-   Exercises: 15 min
:::

::: {.callout-tip icon="false"}
## Questions

-   How can I know to which taxa my sequences belong?
:::

::: {.callout-warning icon="false"}
## Objectives

-   Understand how taxonomic assignment works.
-   Use Kraken to assign taxonomies to reads and contigs
-   Visualize taxonomic assignations in graphics
:::

## What is a taxonomic assignment?

A taxonomic assignment is a process of assigning an Operational Taxonomic Unit (OTU, that is, groups of related individuals) to sequences that can be reads or contigs. Sequences are compared against a database constructed using complete genomes. When a sequence finds a good enough match in the database, it is assigned to the corresponding OTU. The comparison can be made in different ways.

### Strategies for taxonomic assignment

There are many programs for doing taxonomic mapping, and almost all of them follow one of the following strategies:

1.  BLAST: Using BLAST or DIAMOND, these mappers search for the most likely hit for each sequence within a database of genomes (i.e. mapping). This strategy is slow.

2.  Markers: They look for markers within the sequences, that correspond to those markers within a database that are associated with particular taxonomies. This allows the sequences to be classified and assigned a particular taxonomy depending on the hits obtained.

3.  K-mers: A genome database is broken into pieces of length k to be able to search for unique pieces by taxonomic group, from a lowest common ancestor (LCA), passing through phylum to species. Then, the algorithm breaks the query sequence (reads/contigs) into pieces of length k, looks for where these are placed within the tree and make the classification with the most probable position.

![Lowest common ancestor assignment example](fig/06-01.png)

### Abundance bias

When you do the taxonomic assignment of metagenomes, a key result is the abundance of each taxon or OTU in your sample. The absolute abundance of a taxon is the number of sequences (reads or contigs, depending on what you did) assigned to it. Moreover, its relative abundance is the proportion of sequences assigned to it. It is essential to be aware of the many biases that can skew the abundances along the metagenomics workflow (see figure below) and that because of them, we may not be obtaining the actual abundance of the organisms in the sample.

![Abundances biases during a metagenomics protocol](fig/06-02.png)

::: {.callout-tip icon="false"}
## Discussion 1: Relation between MAGs and depth

What do you think is harder to assign, a species (like *E. coli*) or a phylum (like Proteobacteria)?
:::

## Using Kraken 2

[Kraken 2](https://ccb.jhu.edu/software/kraken2/) is the newest version of Kraken, a taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds.

When looking at the settings for running **Kraken2** we can see that in addition to our input files, we also need to select a database to compare them. The database you use will determine the result you get for your data. Imagine you are searching for a recently discovered lineage that is not part of the available databases. Would you find it?

There are [several databases](https://benlangmead.github.io/aws-indexes/k2) compatible to be used with Kraken2 in the taxonomical assignment process.

Therefore, when running Kraken2 we need to give the database we use serious consideration.

::: {.callout-tip icon="false"}
## Discussion 2: Considerations when selecting Kraken2 databases

In this workshop we are working with metabarcoding, targeting bacteria. However, the our dataset contains sequencing reads targetting fungi and sequencing reads of metashotgun data. Technically, we could use the standard database for all of these samples types, but our analyses would be more efficient if we use a targeted database.

Have a look at the available databases and consider which would be most suited for each sample type. Then select the database you will use for your sequencings targeting bacteria.
:::

To use Kraken2 our input is the **Unique.seqs** fasta output, we can use default parameters but under the **Create Report** tab, you want to select "Yes" for the `--report` option. For this step, try using two different databases you think would work with your sample.

Using **B_Sample_98** as our example, we can examine the Kraken2 `Classification` table:

| Column 1 | Column 2 | Column 3                       | Column 4 | Column 5                                                                                                                                                     |
|----------|----------|--------------------------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------|
| C        | 1        | Microbacterium (taxid 43534)   | 336      | 3:47 43309:1 3:5 43309:9 43534:9 43490:33 3:87 1:1 3:5 1:3 3:38 43490:1 3:5 43416:1 3:5 43416:8 43490:1 3:5 43309:2 43490:1 3:2 43416:5 43534:1 3:21 1:5 3:1 |
| C        | 2        | Chitinophagaceae (taxid 44006) | 332      | 1:1 3:5 1:3 3:4 43868:5 44006:33 43869:14 3:24 1:23 3:74 46157:2 3:5 46157:43 3:62                                                                           |
| C        | 3        | Bacteria (taxid 3)             | 336      | 3:194 1:1 3:5 1:3 3:2 1:6 3:83 1:2 3:5 1:1                                                                                                                   |

This information may need to be clarified. Let us take out our "cheatsheet" to understand some of its components:

| Column example                                                                                                                                               | Description                                                                                        |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|
| C                                                                                                                                                            | Classified or unclassified                                                                         |
| 1                                                                                                                                                            | FASTA header of the sequence                                                                       |
| Microbacterium (taxid 43534)                                                                                                                                 | Tax ID                                                                                             |
| 336                                                                                                                                                          | Read length                                                                                        |
| 3:47 43309:1 3:5 43309:9 43534:9 43490:33 3:87 1:1 3:5 1:3 3:38 43490:1 3:5 43416:1 3:5 43416:8 43490:1 3:5 43309:2 43490:1 3:2 43416:5 43534:1 3:21 1:5 3:1 | kmers hit to a taxonomic ID **e.g.** tax ID 43309 has nine hits, tax ID 43416 has eight hits, etc. |

The Kraken2 file could be more readable. So let us look at the `Report` file:

![Kraken2 Report File](fig/06-03.png)

Below is an explanation regarding this information:

| Column example | Description                                                                                                                                               |
|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| 78.13          | Percentage of reads covered by the clade rooted at this taxon                                                                                             |
| 587119         | Number of reads covered by the clade rooted at this taxon                                                                                                 |
| 587119         | Number of reads assigned directly to this taxon                                                                                                           |
| U              | A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply '-'. |
| 0              | NCBI taxonomy ID                                                                                                                                          |
| unclassified   | Indented scientific name                                                                                                                                  |

## Visualization of taxonomic assignment results

After we have the taxonomy assignation, what follows is some visualization of our results. [Krona](https://github.com/marbl/Krona/wiki) is a hierarchical data visualization software. Krona allows data to be explored with zooming and multi-layered pie charts and supports several bioinformatics tools and raw data formats. To use Krona in our results, let us first go into our taxonomy directory, which contains the pre-calculated Kraken outputs.

### Krona

With Krona, we will explore the taxonomy of our Kraken Reports. To do this, firs we need to convert our Kraken report to a format compatible with Krona on Galaxy. To do this, search for the tool **Krakentools: Convert kraken report file**, select **Kraken2 on data X: Report** and run with default settings.

Once this completes, search for **Krona pie chart**, change input from **Taxonomy** to **Tabular** and select **Krakentools: Convert kraken report file on data X**. Everything else can be left as default.

Once the job is complete, click the "eye icon" and view the Krona pie chart. You will see a page like this:

![Krona Pie Chart Example](fig/06-04.png)

::: {.callout-tip icon="false"}
##Exercise 1: Exploring Krona visualization

What percentage of bacteria is represented by the phylum Actinobacteriota?

Hint: A search box is in the window's top left corner.
:::

::: {.callout-tip collapse="true" icon="false"}
## Solution

30% of Bacteria corresponds to the phylum Actinobacteriota in sample B_sample_97. In the top right of the window, we see little pie charts that change whenever we change the visualization to expand certain taxa. This displays the percentages of the chosen segment at any point in the hierarchy.
:::

### Pavian

Pavian is another visualization tool that allows comparison between multiple samples. Pavian should be locally installed and needs R and Shiny, but we can try the [Pavian demo WebSite](https://fbreitwieser.shinyapps.io/pavian/) to visualize our results.

First, we need to download the files needed as inputs in Pavian; this time, we will visualize the assignment of the reads of both samples: `Kraken2 on data X: Report`.

These files correspond to our Kraken reports.

We go to the [Pavian demo WebSite](https://fbreitwieser.shinyapps.io/pavian/), click on Browse, and choose our reports. You need to select both reports at the same time.

![Browse and Select two Reports](fig/06-05.png)

We click on the Results Overview tab.

![Select Results Overview tab](fig/06-06.png)

We click on the Sample tab.

![Select Sample to Visualise](fig/06-07.png)

We can look at a comparison of both our samples in the Comparison tab.

![Comparison Between Samples - B_Sample_98 Silva](fig/06-09a.png)

![Comparison Between Samples - B_Sample_98 RDP](fig/06-09b.png)

::: {.callout-tip icon="false"}
## Discussion 3: Unclassified reads

As you can see, a percentage of our data could not be assigned to belong to a specific OTU.\
Which factors can affect the taxonomic assignation so that a read is unclassified?
:::

::: {.callout-tip collapse="true" icon="false"}
## Solution

**Unclassified reads** can be the result of different factors that can go from sequencing errors to problems with the algorithm being used to generate the result. The widely used Next-generation sequencing (NGS) platforms, showed [average error rate of 0.24±0.06% per base](https://www.nature.com/articles/s41598-018-29325-6). Besides the sequencing error, we need to consider the status of the database being used to perform the taxonomic assignation.

All the characterized genomes obtained by different research groups are scattered in different repositories, pages and banks in the cloud. Some are still unpublished. Incomplete databases can affect the performance of the taxonomic assignation. Imagine that the dominant OTU in your sample belongs to a lineage that has never been characterized and does not have a public genome available to be used as a template for the database. This possibility makes the assignation an impossible task and can promote the generation of false positives because the algorithm will assign a different identity to all those reads.
:::

::: {.callout-caution icon="false"}
## Keypoints

-   A database with previously gathered knowledge (genomes) is needed for taxonomic assignment.
-   Taxonomic assignment can be done using Kraken2.
-   Krona and Pavian are web-based tools to visualize the assigned taxa.
:::