
Running recentrifuge for Kraken

Jose Manuel Martí edited this page Jun 17, 2022 · 8 revisions

Quick start

Suppose you have installed Recentrifuge with pip and have used retaxdump to populate ./taxdump. Now, you would like to analyze and compare the Kraken output from samples S1, S2 and S3. In this case, the command would be:

rcf -k S1.krk -k S2.krk -k S3.krk

Often you have a dataset with a lot of samples and you just want to "recentrifuge" all of them. If they are in the directory my_Kraken_outputs_dir, you do that with the following line:

rcf -k my_Kraken_outputs_dir

Details

File format

Recentrifuge supports Kraken version 1 and version 2 output for both single- and paired-end datasets. If you pass a directory to the -k option instead of individual sample files, Recentrifuge assumes that the Kraken samples are the files in that directory ending with .krk. If you plan to use this autodetection of Kraken samples in a directory, please rename your samples so that they end with .krk.
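The renaming step can be scripted. The following is a minimal sketch, not part of Recentrifuge: the function name rename_to_krk and the starting extension .kraken are hypothetical, so adjust old_suffix to whatever your Kraken runs actually produced.

```python
from pathlib import Path


def rename_to_krk(directory, old_suffix=".kraken"):
    """Rename files ending in old_suffix so that they end in .krk instead.

    old_suffix is an assumption for illustration; use your own extension.
    Returns the new file names, in sorted order.
    """
    renamed = []
    # sorted() materializes the listing before any file is renamed
    for path in sorted(Path(directory).glob(f"*{old_suffix}")):
        target = path.with_name(path.name[: -len(old_suffix)] + ".krk")
        if target.exists():
            # Never overwrite a sample that is already there
            continue
        path.rename(target)
        renamed.append(target.name)
    return renamed


# Example (hypothetical directory name taken from the quick start):
# rename_to_krk("my_Kraken_outputs_dir")
```

After the rename, rcf -k my_Kraken_outputs_dir should be able to pick up the samples through autodetection.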

Problematic Kraken options

  • --use-names: when processing your samples with Kraken, please do not use this flag, as it replaces the taxid with the scientific name in the Kraken output files. Without a plain taxid, Recentrifuge cannot parse them.
  • --quick: please do not use this option, since it causes the Kraken output to lack any data about the score of the taxonomic assignments, so the output is rejected by Recentrifuge's input filters. Scoring information is critical to the robustness of Recentrifuge's analysis and visualization algorithms.
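Before running rcf, you may want to check whether a file was produced with --use-names. The sketch below is not Recentrifuge's own parser. It assumes the usual tab-separated Kraken output in which the third column holds the taxid, and the function names are hypothetical. It does not try to detect --quick.

```python
def has_numeric_taxid(line):
    """Return True if the third tab-separated field is a bare numeric taxid.

    Assumes the standard Kraken output layout (taxid in column 3).
    """
    fields = line.rstrip("\n").split("\t")
    return len(fields) >= 3 and fields[2].isdigit()


def lines_without_taxid(lines):
    """Return the 1-based numbers of the lines lacking a bare numeric taxid."""
    return [n for n, line in enumerate(lines, start=1)
            if not has_numeric_taxid(line)]


# Example with a file (S1.krk is the sample name used in the quick start):
# with open("S1.krk") as handle:
#     print(lines_without_taxid(handle)[:10])
```

An empty list suggests that the taxid column is intact. Any reported line is worth inspecting by hand.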

Use of compressed files

Recentrifuge supports gzip and bz2 compression for Kraken samples. For example, to process the Kraken output from samples S1 and S2 (gzipped) and S3 (bzip2-compressed), the command would be:

rcf -k S1.gz -k S2.gz -k S3.bz2

Currently, the autodetection of Kraken samples does not support compressed files. If your samples are compressed, please proceed as in the preceding example or, alternatively, decompress the samples in the directory where you would like to use the autodetection feature (remember to use the extension .krk for the samples, as discussed above).
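The decompression alternative can be sketched with the Python standard library. This helper is hypothetical and not part of Recentrifuge: it expands every .gz or .bz2 file in a directory into a sibling file ending in .krk and leaves the compressed originals in place.

```python
import bz2
import gzip
import shutil
from pathlib import Path

# Map each supported compressed extension to its standard-library opener
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def decompress_to_krk(directory):
    """Decompress .gz/.bz2 files in directory into files ending in .krk.

    Returns the names of the files created, in sorted order.
    """
    created = []
    for path in sorted(Path(directory).iterdir()):
        opener = OPENERS.get(path.suffix)
        if opener is None or not path.is_file():
            continue
        stem = path.name[: -len(path.suffix)]  # "S1.gz" -> "S1"
        name = stem if stem.endswith(".krk") else stem + ".krk"
        target = path.with_name(name)
        if target.exists():
            # Keep any sample that was already decompressed
            continue
        with opener(path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        created.append(target.name)
    return created
```

For the files in the example above, S1.gz, S2.gz and S3.bz2 would yield S1.krk, S2.krk and S3.krk in the same directory.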

Scoring schemes

There are different options to score the reads classified by Kraken. Recentrifuge supports the following specific scoring schemes for Kraken, which can be selected with the option -s/--scoring:

  • SHEL (Single Hit Equivalent Length): This is a score value in base pairs roughly equivalent to a single hit to the database. In Kraken, this is calculated as the k-mer hit count of the top assignment, plus the default k-mer length in Kraken (35).
  • KRAKEN: This scoring scheme is only available for this classifier. It divides the k-mer hit count of the top assignment by the total k-mers in the read and multiplies the result by 100 to give a percentage of coverage (the fraction of the read's k-mers covered by k-mers belonging to the read's final assignment). This is the default scoring scheme for Kraken samples, and it supports the mixing of samples with different read lengths.

For each of those scoring schemes, the minscore parameter applies to the statistic selected as the score: SHEL or k-mer coverage. So, for example, a minscore of 40 (indicated with the -y 40 option) with SHEL scoring would filter out those reads not reaching 40 nt, while with KRAKEN scoring it would filter out those reads with less than 40% k-mer coverage for the top assignment.
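The arithmetic behind the two Kraken-specific schemes can be illustrated with a short sketch. It is written from the descriptions above and is not Recentrifuge's actual code. It assumes that the last column of the Kraken output is a space-separated list of label:count pairs, that only k-mers mapped exactly to the assigned taxid count as hits, that every counted k-mer (including unclassified or ambiguous ones) contributes to the total, and that a read equal to minscore is kept. All function names are hypothetical.

```python
KMER_LENGTH = 35  # default k-mer length quoted in the SHEL description


def kmer_counts(lca_field):
    """Return (hits per label, total k-mers) from a Kraken LCA mapping field."""
    hits = {}
    total = 0
    for token in lca_field.split():
        label, _, count = token.partition(":")
        if not count.isdigit():
            # Skip separators such as the paired-end marker
            continue
        hits[label] = hits.get(label, 0) + int(count)
        total += int(count)
    return hits, total


def shel_score(lca_field, taxid):
    """SHEL: k-mer hits of the assigned taxid plus the k-mer length."""
    hits, _ = kmer_counts(lca_field)
    return hits.get(str(taxid), 0) + KMER_LENGTH


def kraken_score(lca_field, taxid):
    """KRAKEN: percentage of the read's k-mers that hit the assigned taxid."""
    hits, total = kmer_counts(lca_field)
    return 100.0 * hits.get(str(taxid), 0) / total if total else 0.0


def passes_minscore(score, minscore):
    """Assumed filter rule: keep the read when its score reaches minscore."""
    return score >= minscore
```

For a hypothetical read assigned to taxid 562 with the mapping "562:10 0:40", SHEL is 10 + 35 = 45 and KRAKEN is 100 × 10 / 50 = 20%. Under these assumptions, -y 40 would keep the read with SHEL scoring and filter it out with KRAKEN scoring.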

Recentrifuge also supports the following generic scoring schemes for Kraken, which are especially useful when read lengths span several orders of magnitude, as in nanopore sequencing:

  • LENGTH: The score of a read will be its length (or the combined length of mate pairs).
  • LOGLENGTH: Logarithm (base 10) of the length score.
  • NORMA: This score is the normalized score SHEL / LENGTH expressed as a percentage, so it takes into account both the assignment quality and the length of the read. It is very useful when both the assignment scores and the lengths vary among the reads.

For these three scoring schemes, the minscore parameter applies to the calculated SHEL score of the read. So, for example, a minscore of 35 (indicated with the -y 35 option) will filter out the same reads regardless of which of these three scoring schemes is selected.
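The three generic scores and the SHEL-based filter can be sketched as follows. This is an illustration of the definitions above and not Recentrifuge's implementation. The function names are hypothetical, and keeping a read whose SHEL equals minscore is an assumption.

```python
import math


def generic_scores(shel, length):
    """Return the LENGTH, LOGLENGTH and NORMA scores of one read.

    shel is the read's SHEL score and length its length in nucleotides
    (the combined length for mate pairs); length must be positive.
    """
    return {
        "LENGTH": length,
        "LOGLENGTH": math.log10(length),
        "NORMA": 100.0 * shel / length,
    }


def kept_by_minscore(shel, minscore):
    """For all three schemes the filter looks at SHEL, not at the score.

    Assumed rule: the read is kept when SHEL reaches minscore.
    """
    return shel >= minscore
```

For a hypothetical 1000 nt read with a SHEL of 65, the scores are LENGTH = 1000, LOGLENGTH = 3 and NORMA = 6.5. With -y 35 the read is kept under any of the three schemes, because the comparison uses its SHEL of 65.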

A note about the sample statistics: the value reported for "score limit" in the sample statistics output matches the number entered for the minscore parameter on the command line. However, the score statistics (min, mean, max) are always SHEL-based in order to facilitate comparison with the results from other taxonomic classifiers, such as Centrifuge. So, if you are using the KRAKEN scoring scheme and the minscore filter is actually removing reads in a sample, you can expect the score limit and the score minimum to represent the same threshold in different scoring schemes: the former in KRAKEN and the latter in SHEL.

Advanced example

Let's see a more complex example in detail. In order to analyze the Kraken output:

  • with the taxonomy files downloaded to /my/tax/dir,
  • from samples X1 (file X1.krk), X2 (file X2.krk) and X3 (file X3.krk),
  • with two negative controls (files CTRL1.krk and CTRL2.krk),
  • saving the output to Xsamples.rcf.html file,
  • with the scoring referred to the hit k-mer coverage percentage (KRAKEN),
  • keeping only the reads that reach 25% as a minimum value for such coverage,
  • and excluding the reads assigned to humans (taxid 9606),

the command would be:

rcf -n /my/tax/dir -k CTRL1.krk -k CTRL2.krk -k X1.krk -k X2.krk -k X3.krk -c 2 -o Xsamples.rcf.html -s KRAKEN -y 25 -x 9606
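With many samples, the same command line can be assembled programmatically. The sketch below uses only the options shown in the command above and mirrors its ordering, with the control files placed before the other samples. The helper build_rcf_command is hypothetical and not part of Recentrifuge.

```python
import shlex


def build_rcf_command(taxdir, controls, samples, output,
                      scoring, minscore, exclude_taxid):
    """Return the rcf argument list for the advanced example above."""
    args = ["rcf", "-n", taxdir]
    # Control files go first, followed by the regular samples
    for path in list(controls) + list(samples):
        args += ["-k", path]
    args += ["-c", str(len(controls)), "-o", output,
             "-s", scoring, "-y", str(minscore), "-x", str(exclude_taxid)]
    return args


command = build_rcf_command(
    "/my/tax/dir",
    ["CTRL1.krk", "CTRL2.krk"],
    ["X1.krk", "X2.krk", "X3.krk"],
    "Xsamples.rcf.html", "KRAKEN", 25, 9606,
)
print(shlex.join(command))  # requires Python 3.8 or later
```

The resulting list can be handed to subprocess.run(command, check=True) once rcf is installed and available on the PATH.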

The complete guide to rcf options and flags is in the Recentrifuge command line page.