-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path10-Centrifuge.Rmd
434 lines (284 loc) · 18.3 KB
/
10-Centrifuge.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
<!--We should probably update the databases that we pre-prepare for the studnets in 2025-->
# Illumina WGS Community Profiling (Centrifuge)
## What is Centrifuge?
Centrifuge is very rapid and memory-efficient system for the classification of DNA sequences from microbial
samples, with better sensitivity than and comparable accuracy to other leading systems. The system uses a
novel indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index,
optimized specifically for the metagenomic classification problem[1].
## Tutorial
For this tutorial we will use data from this publication “Metagenomic Characterization of the Human Intestinal
Microbiota in Fecal Samples from STEC-Infected Patients” [2]: https://www.frontiersin.org/articles/10.3389/fcimb.2018.00025/
All data used in this tutorial was retrieved from SRA with fastq-dump. SRAccession: PRJEB23207
In this study, whole genome metagenomic sequencing was used to investigate possible changes in the composition
of the intestinal microbiota in samples from patients with Shiga toxin-producing E. Coli (STEC) infection (N = 2) compared to Crohn’s Patients (N =4), healthy (N = 2) and healed controls (N = 3). Feces samples,
collected during an outbreak, from STEC infected patients showed a lower intestinal abundance of beneficial
microorganisms in comparison to controls where those microorganisms predominated. These differences were
observed with bioinformatic approaches and seemed to be related with the STEC infection. So, using the
metagenomics data from this study can you identify the STEC positive samples? Can you identify the specific
strain of STEC?
Many of the steps have been done for you to save time. However, you will create some smaller databases as practice and we have also provided instructions and code so that you will know how to build large database indices if you need them for your own research.
## Environment
```{bash, eval=FALSE}
# Remember to start a screen
# Activate the sra-tools environment
# Run this first; it makes sure your conda is pointing to the right path
# If there is a system reboot, you will have to run it again
echo "export PATH="/opt/anaconda3/bin:$PATH"" >> ~/.bashrc && source ~/.bashrc
conda activate sra-tools
```
## Practice Database
We will create a database that has only fungi and human genomes so that you learn how to create Centrifuge databases. For the Centrifuge analysis, we will use a much bigger database that we have created for you due to time constraints and space considerations. In the "Full Database" section we have the commands we used. If you need these for your own research, you can run them at a later time to ensure you have the most updated database and that you have the taxa you want included.
### RefSeq and EuPathDB Data
First create a centrifuge directory and another directory inside that called taxonomy.
```{bash, eval=FALSE}
# Create a directory to store results
# The -p directory creates parent directories (in this case "centrifuge") as needed
mkdir -p ~/centrifuge/taxonomy
```
Centrifuge provides a few scripts that will allow you to download the NCBI taxonomy tree and RefSeq reference genomes. In addition to the reference fasta files, we need the taxonomy tree files:
* nodes.dmp: Links taxonomy IDs to their parents
* names.dmp: Links taxonomy IDs to their scientific name
* seqid2taxid.map: Links sequence IDs to taxonomy IDs
If you are using custom taxonomy or sequence files for the database, please refer to the manual (https://ccb.jhu.edu/software/centrifuge/manual.shtml)
Let's download the taxonomy files. This gets the nodes.dmp and the names.dmp files.
**Parameters**
-o specifies the folder to which the files are downloaded
taxonomy database to download = refseq, genbank, contaminants or taxonomy
```{bash, eval=FALSE}
# Takes ~20 seconds
conda deactivate # deactivate sra-tools environments
conda activate centrifuge
centrifuge-download -o ~/centrifuge/taxonomy taxonomy
```
Now lets download complete **fungal** genomes from RefSeq. It will also make the seqid2taxid.map file.
**Parameters**
-a Only download genomes with the specified assembly level. Default: ’Complete Genome’. If you want any assembly level, use 'Any'
-m Mask low-complexity regions using dustmasker. This is part of the blast program that was installed in the metagenomics environment
-P Number of threads or processes to use when downloading
-d What domain to download. One or more of bacteria, viral, archaea, fungi, protozoa, invertebrate, plant, vertebrate_mammalian, vertebrate_other (comma separated).
refseq Tells the program to use the refseq database.
-o Output directory
```{bash, eval=FALSE}
# Takes about 1 minute
centrifuge-download -o library -a "Complete Genome" -m -P 4 -d "fungi" refseq > seqid2taxid.map
```
Download the **human** genome from RefSeq chromosome. This is our host since we have human samples. It will also create the seqid2taxid.map which we will append to the seqid2taxid.map file that we just made with fungal genomes.
**Additional parameters**
-t Only download the specified taxonomy IDs, comma separated. Default: any. 9606 is the taxonomy id for Human.
-c Only download genomes in the specified RefSeq category. Default: any.
```{bash, eval=FALSE}
# Took ~14 minutes during testing
centrifuge-download -o library -a "Chromosome" -m -P 4 -d "vertebrate_mammalian" \
-t 9606 -c "reference genome" refseq >> seqid2taxid.map
```
We will add in **Eukaryotic Pathogens**, which can be helpful for medical metagenomic samples. To save time, we will use data we previously downloaded. Instructions for downloading it are in the "Full Database" section. If you use this database for your own research, you might want to download it later on to make sure it is up to date.
Here is the link to the publication https://doi.org/10.1371/journal.pcbi.1006277.
Here is the webpage for the data download https://ccb.jhu.edu/data/eupathDB/
First, make a directory to put the eupath data in.
```{bash, eval=FALSE}
mkdir -p ~/centrifuge/eupath
cd ~/centrifuge/eupath
```
Copy our already downloaded file.
<!-- This was downloaded 11/8/2024 and took approximately 17 minutes-->
```{bash, eval=FALSE}
cp /home/data/metagenomics-2310/centrifuge/eupath/eupathDB.tar.gz .
```
This is a compressed tarball file that contains several files. Let's uncompress it and look into the library file it created.
```{bash, eval=FALSE}
tar -zxvf eupathDB.tar.gz
ls library
```
How many genomes are there?
Create a directory in the centrifuge library directory that has our fungal and human data and move the fasta files from this library directory there.
```{bash, eval=FALSE}
mkdir ~/centrifuge/library/eupath
mv ~/centrifuge/eupath/library/*.fna ~/centrifuge/library/eupath/
# Note because we are already in ~/centrifuge/eupath/ we could use relative paths:
# mv library/*.fna ../library/eupath/
```
Because we didn't do this through centrifuge-download, we have to create the information for the seqid2taxid.map file. We'll extract it from the prelim_map.txt file that came with the eupath data.
```{bash, eval=FALSE}
awk -F'\t' '{print $2,$3}' ~/centrifuge/eupath/library/prelim_map.txt | \
awk -F'|' '{print $1,$3}' | awk -F' ' '{print $1"\t"$2}' \
>> ~/centrifuge/seqid2taxid.map
```
Now we will concatenate all the reference genome fasta files (fungal, human, eukaryotic pathogens) that we downloaded. We'll make sure we are in the library directory first.
```{bash, eval=FALSE}
cd ~/centrifuge/library/
cat */*.fna > allgenomes.fna
```
Centrifuge doesn't like the header format of some sequences so let's reformt the headers.
```{bash, eval=FALSE}
sed 's/|kraken[^|]*|/ /' allgenomes.fna > allgenomes_fixed.fna
```
### Build database indices
-p   The number of processes/threads to use for creating the index.
--conversion-table   The seqid2taxid.map file that you created with the centrifuge-download commands.
--taxonomy-tree   The nodes.dmp file that was downloaded with the first centrifuge-download command.
--name-table   The names.dmp file that was downloaded with the first centrifuge-download command.
```{bash, eval=FALSE}
# Each run will use about 24 GB of RAM.
# It took 39 minutes to run during testing.
centrifuge-build -p 4 --conversion-table ~/centrifuge/seqid2taxid.map \
--taxonomy-tree ~/centrifuge/taxonomy/nodes.dmp \
--name-table ~/centrifuge/taxonomy/names.dmp \
~/centrifuge/library/allgenomes_fixed.fna \
~/centrifuge/allgenomes_indices
```
## Full Database
Due to time and space considerations, do not run this. If you need to update this database or want to create a similar database for your research, these commands should help you do that at a later date. Refer to the "Practice Database" section for more information on each command.
This database will include archaea, bacteria, viral and fungal genomes from RefSeq. At the time of download, there were 289 Archea, 13,287 bacterial, 8,583 viral, and 10 fungal genomes.
```{bash, eval=FALSE}
# Create a directory to store results
mkdir -p ~/centrifuge_fulldb/taxonomy
# Download microbial genomes
centrifuge-download -o library -a "Complete Genome" -m -P 30 -d "archaea,bacteria,viral,fungi" refseq > seqid2taxid.map
# Download the host genome (human in our case) and append info to the seqid2taxid.map file
centrifuge-download -o library -a "Chromosome" -m -P 4 -d "vertebrate_mammalian" \
-t 9606 -c "reference genome" refseq >> seqid2taxid.map
# Get eupath since we have medical metagenomes (wget downloads the data)
mkdir -p ~/centrifuge_fulldb/eupath
cd ~/centrifuge_fulldb/eupath
wget http://ccb.jhu.edu/data/eupathDB/dl/eupathDB.tar.gz
# Extract the eupath data and move fasta files to the centrifuge library directory
tar -zxvf eupathDB.tar.gz
mkdir ~/centrifuge_fulldb/library/eupath
mv ~/centrifuge_fulldb/eupath/library/*.fna ~/centrifuge/library/eupath/
# Add info to the seqid2taxid.map file
awk -F'\t' '{print $2,$3}' ~/centrifuge_fulldb/eupath/library/prelim_map.txt | \
awk -F'|' '{print $1,$3}' | awk -F' ' '{print $1"\t"$2}' \
>> ~/centrifuge_fulldb/seqid2taxid.map
# Cat the downloaded genomes together
cd ~/centrifuge_fulldb/library/
cat */*.fna > allgenomes.fna
# Reformat the headers
sed 's/|kraken[^|]*|/ /' allgenomes.fna > allgenomes_fixed.fna
# Build the database indices
centrifuge-build -p 4 --conversion-table ~/centrifuge_fulldb/seqid2taxid.map \
--taxonomy-tree ~/centrifuge_fulldb/taxonomy/nodes.dmp \
--name-table ~/centrifuge_fulldb/taxonomy/names.dmp \
~/centrifuge_fulldb/library/allgenomes_fixed.fna \
~/centrifuge_fulldb/allgenomes_indices
```
## Database building
Build the Centrifuge indexes using the centrifuge-build command. This has been done for you already. It took 14 hrs and ~ 410 GBs of RAM using 30 threads to build this database index on the server. And, this was just using the “Complete Genome” Refseq genomes for archea, bacteria, viruses, fungi, human, and EuPathDB. I attempted to build a Centrifuge index using “All” RefSeq genomes for archea, bacteria, viruses, fungi, human plus the EuPAthDB but ran out of memory on a high memory server and Centrifuge
was using 890 GBs of RAM before I killed the program. So, unless you have high-capacity compute you
may not be able to build the indexes on your system. But, the Centrifuge developers have provided various premade
indexes that you can download here, ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p+h+v.tar.gz,
with wget.
## Classification
In this portion of the tutorial we will use the pre-built indexes that contain complete RefSeq genomes for archea,
bacteria, viruses and fungi. The human chromosome level reference genome was included and we also added
the EuPathDB database as well. Each file will use about 36 GB of RAM, so not every one can run
samples at the same time.
• Classify reads with Centrifuge.
• command flags:
-x Path to the indexes
– Index location
/home/data/metagenomics-2310/centrifuge/complete_genomes/Arc_Bac_Vir_Hum_Eupath_v2
-1 Read 1 fastq file
-2 Read 2 fastq file
-t Print wall-clock time taken by search phase. This is optional.
-p Number of alignment threads to launch
--met-file Send metrics to file at <path>
--met-stderr Send metrics to stderr
-S Output file name
```{bash, eval=FALSE}
# Make a directory to store your results:
mkdir -p ~/centrifuge/centrifuge_results
# Run Centrifuge. It will take 10 - 20 minutes per sample to run.
centrifuge \
-x /home/data/metagenomics-2310/centrifuge/complete_genomes/Arc_Bac_Vir_Hum_Eupath_v2 \
-1 /home/data/metagenomics-2310/centrifuge/ERR2271042_1.fastq \
-2 /home/data/metagenomics-2310/centrifuge/ERR2271042_2.fastq -t \
-p 4 --met-file ~/centrifuge/centrifuge_results/ERR2271042_meta.txt --met-stderr \
-S ~/centrifuge/centrifuge_results/ERR2271042_cent.out
# If you have many samples to analyze, use a for loop to run centrifuge on each sample sequentially.
# Use a for loop to store the names of the fastq files in a variable (like a list)
# named fastq using the basename command
fastq=`for i in data/centrifuge/*_1.fastq; do basename -s _1.fastq $i;done`
# You can view the contents of the fastq variable with the echo command
echo $fastq
# Use a second for loop to read through the variable list and run centrifuge on each file in the list
# sequentially.
for j in $fastq; do centrifuge -x /data/centrifuge/complete_genomes/Arc_Bac_Vir_Hum_Eupath_v2 \
-1 /data/centrifuge/${j}_1.fastq \
-2 /data/centrifuge/${j}_2.fastq -t -p 4 \
--met-file ~/centrifuge/centrifuge_results/${j}_meta.txt --met-stderr \
-S ~/centrifuge/centrifuge_results/${j}_cent.out; done
```
## Convert output to Kraken-style reports
• Create a Kraken style report from the Centrifuge output using the centrifuge-kreport command. This command used about
17 GB of memory and took about 1 minute per file to complete during testing.
• command flags:
-x Path to the indexes.
```{bash, eval=FALSE}
# Make a directory to store your output files:
mkdir -p ~/centrifuge/centrifuge_krn_results
centrifuge-kreport \
-x /home/data/metagenomics-2310/centrifuge/complete_genomes/Arc_Bac_Vir_Hum_Eupath_v2 \
~/centrifuge/centrifuge_results/ERR2271042_cent.out \
> ~/centrifuge/centrifuge_krn_results/ERR2271042_cent.out.krn
```
• Use a for loop if you have many samples to analyze.
```{bash, eval=FALSE}
# Use a for loop to store the names of the fastq files in a variable (like a list)
# named output using the basename command
output=`for i in /home/metag/centrifuge/centrifuge_results/*_cent.out; do basename -s _cent.out $i;done`
# You can view the contents of the fastq variable with the echo command
echo $output
# Use a second for loop to read through the variable list and run centrifuge on each file in the list
# sequentially.
for j in $output; do centrifuge-kreport \
-x /home/data/metagenomics-2310/centrifuge/complete_genomes/Arc_Bac_Vir_Hum_Eupath_v2 \
~/centrifuge/centrifuge_results/${j}_cent.out > \
~/centrifuge/centrifuge_krn_results/${j}.krn; done
```
## Parse Metadata
• Use awk to extract columns 2, 10, and 18 from the metadata file located at /home/metag/centrifuge/data/
PRJEB23207_metadata.txt . Then, pipe the output to grep and exclude samples that were sequenced with the “Ion
Torrent PGM” platform
```{bash, eval=FALSE}
# Copy the /home/data/metagenomics-2310/centrifuge/PRJEB23207_metadata.txt to your working directory
cp /home/data/metagenomics-2310/centrifuge/PRJEB23207_metadata.txt \
~/centrifuge
# Use awk and grep to parse the metadata file
# -F is the field separator flag. In this case the metadata file is tab-separated '\t'.
# -i case-insensitive, -v exclude the search string, i.e., exclude lines that contain "ion"
awk -F'\t' '{print $2,$10,$18}' PRJEB23207_metadata.txt | grep -i -v "ion"
```
## Visualize the Kraken Reports with Pavian
• First, download all *.krn files to your computer using secure copy (scp) with unix, linux, or MobaXterm terminals.
• This assumes that you have installed R on your computer.
```{bash, eval=FALSE}
# Run R by opening a terminal
R
# Or click on your R application from R-studio or R-console.
# Install Pavian if you haven't done so:
if (!require(remotes)) { install.packages("remotes") }
remotes::install_github("fbreitwieser/pavian")
# Start Pavian in your browser from R
pavian::runApp(port=5000)
```
• A web browser should pop up. If not, Pavian can be accessed by entering this url into your browser. http://127.0.0.1:5000
• Drag the .krn files to the "Browse" bar under the "Upload files" tab.
• Click on the “Results Overview” tab on the left side of the screen.
• Click on the “Sample” tab on the left side of the screen. You will notice an interactive Sankey chart that displays different
classified taxa. You can click and drag nodes to reorder the nodes if you want. You can select different samples by clicking
on the down arrow next to the “Select sample” header. Select sample ERR2270960 or ERR2270961 since we know that
these two samples are likely infected with STEC.
• From the “Sample” page, click on the “Table” tab. You should see an interactive, filterable and searchable table that
summarizes the results of Centrifuge. For example, type “Escherichia coli” into the “Search:” box. A histogram will appear
that displays the numbers of reads across all samples that were classified as “Escherichia coli”. Sort the table by the most
abundant taxa by clicking on the up arrow next to “TaxonReads”.
• Can you can you identify the strain(s) of STEC that samples ERR2270960 and ERR2270961 likely contain?
## References
Gigliucci F, von Meijenfeldt FAB, Knijn A, Michelacci V, Scavia G, Minelli F, Dutilh BE, Ahmad HM, Raangs GC, Friedrich
AW, Rossen JWA, Morabito S. Metagenomic Characterization of the Human Intestinal Microbiota in Fecal Samples from STECInfected
Patients. Front Cell Infect Microbiol. 2018 Feb 6;8:25. doi: 10.3389/fcimb.2018.00025. eCollection 2018. PubMed
PMID: 29468143; PubMed Central PMCID: PMC5808120.
Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome
Res. 2016 Dec;26(12):1721-1729. Epub 2016 Oct 17. PubMed PMID: 27852649; PubMed Central PMCID: PMC5131823.
Lu J, Salzberg SL. Removing contaminants from databases of draft genomes. PLoS Comput Biol. 2018 Jun 25;14(6):e1006277.
doi: 10.1371/journal.pcbi.1006277. eCollection 2018 Jun. PubMed PMID: 29939994; PubMed Central PMCID: PMC6034898.