# Centrifuge nt
The Centrifuge nt database is the NCBI BLAST nt database pre-processed so that it can be used by Centrifuge, allowing rapid and sensitive classification across a huge range of organisms. It contains on the order of a trillion nucleotides, thus requiring vast supercomputing resources to generate it and, in some cases, to use it.
For further details about the NCBI nt database, please consult the NCBI nt database section on this page.
As an alternative to generating your own database, you can download the latest version of the nt database that we have prepared at LLNL (Lawrence Livermore National Laboratory) following the instructions here: https://benlangmead.github.io/aws-indexes/centrifuge. This database was prepared using a novel pipeline incorporating different quality control measures, including reference decontamination and filtering.
For further details or versions, please see the manuscript Addressing the dynamic nature of reference data: a new nt database for robust metagenomic classification.
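As a minimal sketch of such a download (the archive name below is just an older example; take the exact URL for the latest build from the index page above):
```
# Hypothetical archive name: check https://benlangmead.github.io/aws-indexes/centrifuge
# for the actual URL of the latest nt index before downloading.
wget https://genome-idx.s3.amazonaws.com/centrifuge/nt_2018_3_3.tar.gz
tar xzvf nt_2018_3_3.tar.gz   # extracts the nt.*.cf index files used by centrifuge
```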
The most straightforward way of generating an updated version of the nt database is to use the Makefile provided with Centrifuge:
```
cd your_centrifuge_folder/indices
make THREADS=16 nt
```
or whatever number of threads you can dedicate to the build of the nt database. As the build will take some time, the larger this number, the better (within the scalability limits of the computer where you are running the code). In addition to Centrifuge, you will need `dustmasker` in your path, a program that identifies and masks low-complexity regions. It is part of the NCBI BLAST command-line applications. If you don't want to do this masking, you can pass `DONT_DUSTMASK=1` to `make`.
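For instance, to build with 32 threads while skipping the DustMasker step (both variables as described above):
```
cd your_centrifuge_folder/indices
make THREADS=32 DONT_DUSTMASK=1 nt
```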
If this "automatic" build fails or if you need or want to take control of each stage, in the rest of this page you will find the instructions to generate your own updated version of such database step by step, which requires significative supercomputing resources.
You will need high performance computing (HPC) resources to generate the Centrifuge nt database. Typically, a current fat node will do the job. The last successful build required 128 cores, 2 TiB of memory, a fast scratch storage system, and several weeks. These are the simplified step-by-step instructions:
- The first step is to download the NCBI nt database and unzip it (both operations will take some time, as the file is hundreds of GiB); you can also verify the download, as shown in the sketch after this block:
```
mkdir nt
cd nt
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
gunzip nt.gz
mv -v nt nt.fa
```
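NCBI publishes an MD5 checksum next to the archive in the same FTP directory; checking it right after the `wget` (and before the lengthy `gunzip`) guards against a corrupted download:
```
# Fetch the checksum published next to nt.gz and verify the download
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz.md5
md5sum -c nt.gz.md5   # should print "nt.gz: OK"
```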
- Do the same with the taxdump files (this is the shortest step):
```
mkdir taxonomy
cd taxonomy
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar xvzf taxdump.tar.gz
cd ..
```
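As a quick sanity check, confirm that the two files needed later by `centrifuge-build` were extracted:
```
# nodes.dmp and names.dmp are the files centrifuge-build will need
ls -lh taxonomy/nodes.dmp taxonomy/names.dmp
```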
- Next, generate the accession-to-taxid mapping file (which will exceed 16 GiB as of Nov 2021) using the following commands:
```
wget "ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_*.accession2taxid.gz"
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz
gunzip -c *.accession2taxid.gz | awk -v OFS='\t' '{print $2, $3}' >> acc2tax.map
```
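Note that each accession2taxid file begins with a header line (accession, accession.version, taxid, gi). If you prefer to keep those header rows out of the map, a variant of the command (a sketch) filters them by their first field:
```
# Skip the per-file header line, whose first field is the literal "accession"
gunzip -c *.accession2taxid.gz | awk -v OFS='\t' '$1 != "accession" {print $2, $3}' >> acc2tax.map
```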
- This step is optional; it is the one skipped by passing `DONT_DUSTMASK=1` to `make` when you build with the Makefile. It masks low-complexity sequences using DustMasker, a NCBI BLAST command-line application that you should install on your own (with the rest of the NCBI BLAST+ tools from here, or alone from here). We will run `dustmasker` with the DUST level (the score threshold for subwindows) set to 20, which is the default. Finally, all the masked nucleotides in the DustMasker output will be remasked as `N` using `sed`:
```
mv nt.fa nt_unmasked.fa
dustmasker -infmt fasta -in nt_unmasked.fa -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > nt.fa
```
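To get a rough feel for how much sequence was masked, you can compare the number of `N` nucleotides before and after (a sketch; note that this single-threaded scan of a multi-hundred-GiB FASTA will itself take a while):
```
# Count N characters in the sequence lines (headers excluded) of each file
grep -v '^>' nt_unmasked.fa | tr -cd 'N' | wc -c
grep -v '^>' nt.fa | tr -cd 'N' | wc -c
```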
- Last but not least, we issue the Centrifuge command that will generate the Centrifuge nt database (this is the part that actually benefits from high performance computing):
```
centrifuge-build --ftabchars=14 -p 32 --bmax 1342177280 --conversion-table acc2tax.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp nt.fa nt
```
The last process will finish with a line like this:
```
Total time for call to driver() for forward index: HH:MM:SS
```
That is the time of the last step. In our case (using 32 cores) it took more than 20 hours. As this is not a short time, if `centrifuge-build` is launched not through a batch system but in an interactive session, I strongly recommend using some mechanism to protect the process from unintentional interruptions, for instance `nohup`:
```
nohup centrifuge-build --ftabchars=14 -p 32 --bmax 1342177280 --conversion-table acc2tax.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp nt.fa nt &
```
Then use `tail -f nohup.out` to safely follow the progress.
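On a cluster, submitting the build through the batch system avoids the interruption problem altogether. A minimal sketch, assuming a SLURM scheduler (the resource directives are assumptions based on the figures above and are site-specific):
```
#!/bin/bash
#SBATCH --job-name=cf-nt-build
#SBATCH --cpus-per-task=32      # match the -p value given to centrifuge-build
#SBATCH --mem=2000G             # the build needs on the order of 2 TiB of RAM
#SBATCH --time=2-00:00:00       # it took us over 20 hours; leave ample margin

centrifuge-build --ftabchars=14 -p 32 --bmax 1342177280 \
    --conversion-table acc2tax.map --taxonomy-tree taxonomy/nodes.dmp \
    --name-table taxonomy/names.dmp nt.fa nt
```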
## NCBI nt database

It is not true that the NCBI nt database contains all the sequences from the NCBI nucleotide databases. Currently, it contains "all traditional divisions of GenBank, EMBL, and DDBJ excluding GSS, STS, PAT, EST, HTG, and WGS". So, it does include:
| DB | Contents |
|---|---|
| TSA | Transcriptome shotgun data |
| ENV | Environmental samples |
| PHG | Phages |
| BCT | Bacteria |
| INV | Invertebrates |
| VRL | Viruses |
| MAM | Other mammals |
| PLN | Plants |
| SYN | Synthetic |
| VRT | Other vertebrates |
| UNA | Unannotated |
| PRI | Primates |
| ROD | Rodents |
| HTC | High-throughput cDNA |
and it does NOT include:

| DB | Contents |
|---|---|
| GSS | Genome survey sequences |
| STS | Sequence tagged sites |
| PAT | Patented sequences |
| EST | Expressed sequence tags |
| HTG | High-throughput genomic sequences |
| WGS | Whole-genome shotgun data |
Of these excluded divisions, the following are downloadable from NCBI as separate databases (in addition to nt itself):
| DB | Compressed filename(s) |
|---|---|
| STS | sts.gz |
| PAT | patnt.gz |
| EST | est_human.gz, est_mouse.gz, est_others.gz |
| HTG | htgs.*tar.gz |
The WGS sequences can be downloaded on a per-project basis.
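For example, to fetch and unpack one of these extra databases (a sketch; it assumes the filenames above live in the same NCBI FTP FASTA directory used for nt, which is worth verifying):
```
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/patnt.gz
gunzip patnt.gz
```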
If you use Recentrifuge in your research, please consider citing the paper. Thanks!
Martí JM (2019) Recentrifuge: Robust comparative analysis and contamination removal for metagenomics. PLOS Computational Biology 15(4): e1006967. https://doi.org/10.1371/journal.pcbi.1006967