# Centrifuge nt
The Centrifuge nt database is the NCBI BLAST nt database pre-processed so that it can be used by Centrifuge, allowing rapid and sensitive classification across a huge range of organisms. It contains on the order of a trillion nucleotides, thus requiring vast supercomputing resources to generate it and, in some cases, to use it.
For further details about the NCBI nt database, please consult the NCBI nt database section on this page.
As an alternative to generating your own database, you can download the latest version of the nt database that we have prepared at LLNL (Lawrence Livermore National Laboratory) following the instructions here: https://benlangmead.github.io/aws-indexes/centrifuge. This database was prepared using a novel pipeline incorporating different quality control measures, including reference decontamination and filtering.
For further details or versions, please see the manuscript Addressing the dynamic nature of reference data: a new nt database for robust metagenomic classification.
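As a minimal sketch of such a download (the archive name below is just an older example; take the exact URL for the latest build from the index page above):
```
# Hypothetical archive name: check https://benlangmead.github.io/aws-indexes/centrifuge
# for the actual URL of the latest nt index before downloading.
wget https://genome-idx.s3.amazonaws.com/centrifuge/nt_2018_3_3.tar.gz
tar xzvf nt_2018_3_3.tar.gz   # extracts the nt.*.cf index files used by centrifuge
```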
The most straightforward way of generating an updated version of the nt database is to use the Makefile provided with Centrifuge:
```
cd your_centrifuge_folder/indices
make THREADS=16 nt
```
or whatever number of threads you can dedicate to the build of the nt database. As the build will take some time, the larger this number, the better (within the scalability limits of the computer where you are running the code). In addition to Centrifuge, you will need `dustmasker` in your path, a program that identifies and masks low-complexity regions. It is part of the NCBI BLAST command-line applications. If you don't want to do this masking, you can pass `DONT_DUSTMASK=1` to `make`.
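For instance, to build with 32 threads while skipping the DustMasker step (both variables as described above):
```
cd your_centrifuge_folder/indices
make THREADS=32 DONT_DUSTMASK=1 nt
```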
If this "automatic" build fails or if you need or want to take control of each stage, in the rest of this page you will find the instructions to generate your own updated version of such database step by step, which requires significative supercomputing resources.
You will need high performance computing (HPC) resources to generate the Centrifuge nt database. Typically, a current fat node will do the job. The last successful build required 128 cores, 2 TiB of memory, a fast scratch storage system, and several weeks. These are the simplified step-by-step instructions:
- The first step is to download the NCBI nt database and unzip it (both operations will take some time, as the file is hundreds of GiB); you can also verify the download, as shown in the sketch after this block:
```
mkdir nt
cd nt
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
gunzip nt.gz
mv -v nt nt.fa
```
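NCBI publishes an MD5 checksum next to the archive in the same FTP directory; checking it right after the `wget` (and before the lengthy `gunzip`) guards against a corrupted download:
```
# Fetch the checksum published next to nt.gz and verify the download
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz.md5
md5sum -c nt.gz.md5   # should print "nt.gz: OK"
```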
- Do the same with the taxdump files (this is the shortest step):
```
mkdir taxonomy
cd taxonomy
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar xvzf taxdump.tar.gz
cd ..
```
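As a quick sanity check, confirm that the two files needed later by `centrifuge-build` were extracted:
```
# nodes.dmp and names.dmp are the files centrifuge-build will need
ls -lh taxonomy/nodes.dmp taxonomy/names.dmp
```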
- Next, generate the accession-to-taxid mapping file (which will exceed 16 GiB as of Nov 2021) using the following commands:
```
wget "ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_*.accession2taxid.gz"
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz
gunzip -c *.accession2taxid.gz | awk -v OFS='\t' '{print $2, $3}' >> acc2tax.map
```
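Note that each accession2taxid file begins with a header line (accession, accession.version, taxid, gi). If you prefer to keep those header rows out of the map, a variant of the command (a sketch) filters them by their first field:
```
# Skip the per-file header line, whose first field is the literal "accession"
gunzip -c *.accession2taxid.gz | awk -v OFS='\t' '$1 != "accession" {print $2, $3}' >> acc2tax.map
```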
- This step is optional; it is the one skipped by passing `DONT_DUSTMASK=1` to `make` when you build with the Makefile. It masks low-complexity sequences using DustMasker, a NCBI BLAST command-line application that you should install on your own (with the rest of the NCBI BLAST+ tools from here, or alone from here). We will run `dustmasker` with the DUST level (the score threshold for subwindows) set to 20, which is the default. Finally, all the masked nucleotides in the DustMasker output will be remasked as `N` using `sed`:
```
mv nt.fa nt_unmasked.fa
dustmasker -infmt fasta -in nt_unmasked.fa -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > nt.fa
```
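To get a rough feel for how much sequence was masked, you can compare the number of `N` nucleotides before and after (a sketch; note that this single-threaded scan of a multi-hundred-GiB FASTA will itself take a while):
```
# Count N characters in the sequence lines (headers excluded) of each file
grep -v '^>' nt_unmasked.fa | tr -cd 'N' | wc -c
grep -v '^>' nt.fa | tr -cd 'N' | wc -c
```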
- Last but not least, we issue the Centrifuge command that will generate the Centrifuge nt database (this is the part that actually benefits from high performance computing):
```
centrifuge-build --ftabchars=14 -p 32 --bmax 1342177280 --conversion-table acc2tax.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp nt.fa nt
```
The last process will finish with a line like this:
```
Total time for call to driver() for forward index: HH:MM:SS
```
That is the time of the last step. In our case (using 32 cores) it took more than 20 hours. As this is not a short time, if `centrifuge-build` is launched not through a batch system but in an interactive session, I strongly recommend using some mechanism to protect the process from unintentional interruptions, for instance `nohup`:
```
nohup centrifuge-build --ftabchars=14 -p 32 --bmax 1342177280 --conversion-table acc2tax.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp nt.fa nt &
```
Then use `tail -f nohup.out` to safely follow the progress.
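On a cluster, submitting the build through the batch system avoids the interruption problem altogether. A minimal sketch, assuming a SLURM scheduler (the resource directives are assumptions based on the figures above and are site-specific):
```
#!/bin/bash
#SBATCH --job-name=cf-nt-build
#SBATCH --cpus-per-task=32      # match the -p value given to centrifuge-build
#SBATCH --mem=2000G             # the build needs on the order of 2 TiB of RAM
#SBATCH --time=2-00:00:00       # it took us over 20 hours; leave ample margin

centrifuge-build --ftabchars=14 -p 32 --bmax 1342177280 \
    --conversion-table acc2tax.map --taxonomy-tree taxonomy/nodes.dmp \
    --name-table taxonomy/names.dmp nt.fa nt
```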
## NCBI nt database

It is not true that the NCBI nt database contains all the sequences from the NCBI nucleotide databases. Currently, it contains "all traditional divisions of GenBank, EMBL, and DDBJ excluding GSS, STS, PAT, EST, HTG, and WGS". So, it does include:
| DB | Contents |
|---|---|
| TSA | Transcriptome shotgun data |
| ENV | Environmental samples |
| PHG | Phages |
| BCT | Bacteria |
| INV | Invertebrates |
| VRL | Viruses |
| MAM | Other mammals |
| PLN | Plants |
| SYN | Synthetic |
| VRT | Other vertebrates |
| UNA | Unannotated |
| PRI | Primates |
| ROD | Rodents |
| HTC | High-throughput cDNA |
and it does NOT include:

| DB | Contents |
|---|---|
| GSS | Genome survey sequences |
| STS | Sequence tagged sites |
| PAT | Patented sequences |
| EST | Expressed sequence tags |
| HTG | High-throughput genomic sequences |
| WGS | Whole-genome shotgun data |
Of these excluded divisions, the following are downloadable from NCBI as separate databases (in addition to nt itself):
| DB | Compressed filename(s) |
|---|---|
| STS | sts.gz |
| PAT | patnt.gz |
| EST | est_human.gz, est_mouse.gz, est_others.gz |
| HTG | htgs.*tar.gz |
The WGS sequences can be downloaded on a per-project basis.
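For example, to fetch and unpack one of these extra databases (a sketch; it assumes the filenames above live in the same NCBI FTP FASTA directory used for nt, which is worth verifying):
```
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/patnt.gz
gunzip patnt.gz
```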
If you use Recentrifuge in your research, please consider citing the paper. Thanks!
Martí JM (2019) Recentrifuge: Robust comparative analysis and contamination removal for metagenomics. PLOS Computational Biology 15(4): e1006967. https://doi.org/10.1371/journal.pcbi.1006967