-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path07-KrakenUniq.Rmd
250 lines (158 loc) · 6.9 KB
/
07-KrakenUniq.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
# Nanopore WGS Community Profiling (KrakenUniq)\
KrakenUniq
"confident and fast metagenomics classification using unique k-mer counts"
Default kmer = 31
![KrakenUniq Overview](./Figures/KrakenUniq.png){width=60%}
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1568-0
https://github.com/fbreitwieser/krakenuniq
https://gitlab.umiacs.umd.edu/derek/krakenuniq/-/blob/master/MANUAL.md
## Kraken/Centrifuge Family
**Kraken 1** is obsolete.
**Centrifuge** was written to improve Kraken's memory issues. It is completely new code with a different classification index. Its databases are also much smaller. Centrifuge can assign a sequence to multiple taxa. We'll use centrifuge later in the workshop.
**KrakenUniq** is based on Kraken 1. It adds an efficient algorithm to assess coverage of unique kmers, running at the same speed and with only slightly more memory than Kraken 1. KrakenUniq distinguishes low abundance organisms from false positives.
**Kraken 2** uses "probabilistic data structures" to reduce memory and increase speed at the expense of lower accuracy that leads to false positives during the classifications ("a few dozen out of millions").
**KrakenUniq** was updated in May 2022 (v0.7+) to help to deal with the large databases on computers without enough RAM to load them into memory. Databases can be read in chunks that fit in RAM.
![Kraken preload](./Figures/krakenFix.png){width=50%}
KrakenUniq gives you k-mer coverage information, reporting the number and percentage k-mers in an organism that were hit by reads. This helps to differentiate between false positives and good classifications.
![Alignment](./Figures/align.png){width=50%}
https://www.biorxiv.org/content/10.1101/2022.06.01.494344v1.full.pdf
http://ccb.jhu.edu/software/choosing-a-metagenomics-classifier/#:~:text=However%2C%20while%20Kraken%20provides%20only,1%20databases%2C%20not%20Kraken%202.
https://github.com/fbreitwieser/krakenuniq/blob/master/MANUAL.md
## KrakenUniq databases
KrakenUniq requires Kraken1 not Kraken2 databases
We won't build a database since it can take several days, but let's look at the documentation:
https://github.com/fbreitwieser/krakenuniq#database-building
You can also get prebuilt databases:
https://benlangmead.github.io/aws-indexes/k2
We will use a bacterial databases located here:
/home/data/metagenomics-2310/krakenuniq
Make a working directory and go into it.
```{bash,eval=FALSE}
mkdir ~/krakenuniq
cd ~/krakenuniq
```
The taxa it contains are here:
/home/data/metagenomics-2310/krakenuniq/library/bacteria/library_headers.orig
Take a look at the file using head.
<details>
<summary>Click for Answer</summary>
```{bash,eval=FALSE}
head /home/data/metagenomics-2310/krakenuniq/library/bacteria/library_headers.orig
```
</details>\
<!--Point out that the first sequences are Rhizobium, which is a nitrogen fixer.-->
How many sequences are there?
<details>
<summary>Click for Answer</summary>
```{bash,eval=FALSE}
wc -l /home/data/metagenomics-2310/krakenuniq/library/bacteria/library_headers.orig
```
</details>\
How many sequences are Frankia?
<details>
<summary>Click for Answer</summary>
```{bash,eval=FALSE}
grep -c Frankia /home/data/metagenomics-2310/krakenuniq/library/bacteria/library_headers.orig
```
</details>\
How many genera (field 2)?
<details>
<summary>Click for Answer</summary>
```{bash,eval=FALSE}
awk '{print $2}' /home/data/metagenomics-2310/krakenuniq/library/bacteria/library_headers.orig | sort -u| wc -l
```
</details>\
And how many of each genera?
<details>
<summary>Click for Answer</summary>
```{bash,eval=FALSE}
awk '{print $2}' /home/data/metagenomics-2310/krakenuniq/library/bacteria/library_headers.orig | sort | uniq -c | less
```
</details>\
## Link to the krakenuniq database
```{bash,eval=FALSE}
ln -s /home/data/metagenomics-2310/krakenuniq/ .
```
## Run KrakenUniq
Make sure you are in a screen. You can open a new screen or re-enter a screen you already have.
```{bash,eval=FALSE}
screen -S krakenuniq
```
Activate the KrakenUniq environment.
```{bash,eval=FALSE}
conda activate krakenuniq
```
Run Kraken on sample 3469-3.
```{bash,eval=FALSE}
time krakenuniq \
--db /home/data/metagenomics-2310/krakenuniq \
--threads 32 \
--report-file 3469-3.report \
--unclassified-out 3469-3.unclassified.fq \
--classified-out 3469-3.classified.fq \
~/microbe_fastq/3469-3.microbe.fq.gz \
> 3469-3.krakenuniqoutput
```
Make sure you got the following files:\
3469-3.classified.fna\
3469-3.krakenuniqoutput\
3469-3.report\
3469-3.unclassified.fna
This command should show you all of them and give you their sizes so you can make sure none of them are empty.
```{bash,eval=FALSE}
ls -l 3469*
```
Use the less command to look at each of them.
Take a closer look at the report and make sure that you understand each column (see below).
<!--Have the students explain some of these using actual data in the file.-->
____________
**%**\
percent reads belonging to that taxa (including its decsendants)\
**reads**\
reads belonging to that taxa (including its descendants)\
<!--The reads makes the most sense biologically, not the taxreads. It looks like the other columns are based off of reads and not taxreads.-->
**taxreads**\
reads assigned to this taxonomic level (and not to its descendants)
**kmers**\
number of unique kmers
**dup**\
average kmer count
**cov**\
coverage of the kmers of the database clade\
__________
How many classified and unclassified reads are there?\
Note: there are many ways to do this and you should get the same numbers no matter which file you look at.
What percentage of the reads are from the Frankia genus?
<details>
<summary>Click for Answer</summary>
```{bash,eval=FALSE}
# You can just look at the file
less 3469-3.report
# or
grep Frankia 3469-3.report |grep genus
## 51.04 1262819 420785 1309307 45.3 0.008317 1854 genus Frankia
## 51.04% of the reads are from Frankia.
```
</details>\
What is the next most frequent genus? What percentage of the reads belong to this genus?
<details>
<summary>Click for Answer</summary>
```{bash,eval=FALSE}
# You can just look at the file or
grep genus 3469-3.report|sort -g
```
<!--Have the students figure out what the -g is (general numerical sort) using the man command. It allows you to sort the regular and scientific notation together.-->
Now run it on the other sample (4956-3) and compare it to 3469-3.
<!--I put the students in groups for this.-->
## Pavian
Copy the reports to your computer.
Let's open them in Pavian.
<!--We haven't produced bam files for the alignment viewer. We do have sam files of reads aligned to the red alder genome but these aren't what is needed.-->
If you haven't installed Pavian, visit the installs section above.
Once you have Pavian installed, in your local Rstudio, type:\
pavian::runApp(port=5000)
More information on things you can do in Pavian is in the Centrifuge chapter.
Deactivate the environment
```{bash,eval=FALSE}
conda deactivate
```