
Running recentrifuge for CLARK

Jose Manuel Martí edited this page Mar 12, 2022 · 13 revisions

Quick start

Suppose you have installed Recentrifuge with pip and have used retaxdump to populate ./taxdump. Now, you would like to analyze and compare the CLARK full-mode output from samples S1, S2 and S3. In this case, the command would be:

rcf -r S1.csv -r S2.csv -r S3.csv

Often you have a dataset with many samples and you just want to "recentrifuge" all of them. If they are in the directory my_CLARK_outputs_dir, you can do that with a single line:

rcf -r my_CLARK_outputs_dir

Details

File format

Recentrifuge currently supports full-mode output of CLARK, CLARK-l, and CLARK-S, the spaced k-mers version of CLARK. As stated by the author, "The full mode (i.e., -m 0) loads all discriminative k-mers in RAM and provides confidence score for all assignments. Thus, it offers high sensitivity."

CLARK full-mode results are in CSV format with 8 columns, including various scoring indexes: hit counts, gamma, and confidence. Recentrifuge shows statistics about all these CLARK scores for every analyzed sample, for example:
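As a sketch of what these per-read statistics summarize, the snippet below parses a tiny synthetic CSV laid out like CLARK's full-mode output (eight columns; the header names used here are assumptions for illustration, not taken from this page, so check your own file's header) and reports the range of the confidence column:

```python
import csv
import io

# Synthetic data mimicking CLARK full-mode CSV layout (8 columns).
# Column names are an assumption for illustration only.
sample_csv = """\
Object_ID,Length,Gamma,1st_assignment,score1,2nd_assignment,score2,confidence
read_001,200,0.9,9606,120,562,30,0.8
read_002,200,1.1,562,95,1280,20,0.826
read_003,200,0.0,NA,0,NA,0,0.5
"""

confidences = []
for row in csv.DictReader(io.StringIO(sample_csv)):
    if row["1st_assignment"] != "NA":  # skip unclassified reads
        confidences.append(float(row["confidence"]))

print(f"Conf. score: min = {min(confidences):.1f}, "
      f"max = {max(confidences):.1f}, "
      f"avr = {sum(confidences) / len(confidences):.1f}")
```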

Loading output file Specimen_1.csv... OK!
  Seqs read: 20_183_483	[4.04 Gnt]
  Seqs clas: 18_405_324	(8.81% unclassified)
  Seqs pass: 17_412_044	(5.40% rejected)
  Hit (score): min = 1.0, max = 420.0, avr = 97.7
  Conf. score: min = 0.3, max = 1.0, avr = 0.7
  Gamma score: min = 0.0, max = 2.5, avr = 0.9
  Read length: min = 200 nt, max = 200 nt, avr = 200 nt
  4305 taxa with assigned reads
Building from raw data... Specimen_1 sample OK!
Load elapsed time: 97 sec

To analyze your CLARK results with Recentrifuge, please run your flavor of CLARK with the -m 0 flag to select full mode, which enables higher sensitivity and provides the per-assignment scores that Recentrifuge needs to evaluate classification confidence.

Scoring schemes

There are different options to score the reads classified by CLARK. Recentrifuge supports the following scoring schemes specific to CLARK, which can be selected with the -s/--scoring option:

  • SHEL (Single Hit Equivalent Length): A score in base pairs roughly equivalent to a single hit to the database. For CLARK, of the two assignments per read, it is the hit count of the top (classified) one plus the default k-mer length (31). This is currently the default scoring scheme for CLARK data in Recentrifuge, but you will probably want to also try the following schemes (CLARK_C and CLARK_G) on your datasets.
  • CLARK_C: This scoring scheme is not available for other classifiers. It takes the confidence score as the score for a read, conf=h1/(h1+h2), or 1-conf=h2/(h1+h2) in case the majority of a read is not classified (1st assignment unclassified). See CLARK's README file for details on how h1 and h2 are calculated. If you use this scoring, you will probably want to filter to a minimum of 0.5 (-y 0.5) or beyond, since assignments under 0.5 have very low confidence.
  • CLARK_G: This scheme scores every read with its CLARK gamma score, so it is only available for this classifier.
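A minimal sketch of how these three schemes derive a per-read score from CLARK's output fields (h1 and h2 are the hit counts of the first and second assignments; the function names are illustrative, not Recentrifuge's internals):

```python
K = 31  # CLARK's default k-mer length

def shel(h1: int) -> int:
    """SHEL: hit count of the classified (top) assignment plus k."""
    return h1 + K

def clark_c(h1: int, h2: int, first_unclassified: bool = False) -> float:
    """CLARK_C: conf = h1/(h1+h2); 1-conf when the 1st assignment is unclassified."""
    conf = h1 / (h1 + h2)
    return 1.0 - conf if first_unclassified else conf

def clark_g(gamma: float) -> float:
    """CLARK_G: the read's gamma score, taken as-is."""
    return gamma

print(shel(120))         # 151
print(clark_c(120, 30))  # 0.8
print(clark_g(0.9))      # 0.9
```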

For each of those scoring schemes, the minscore parameter applies to the statistic selected as the score: hit count, confidence, and gamma, respectively. So, for example, a minscore of 1 (set with the -y 1 option) will not filter a single read under SHEL scoring, while under CLARK_C scoring no read will pass the filter.
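The minscore behavior just described can be sketched as a simple threshold filter (the pass rule shown, score >= minscore, and the sample score values are assumptions for illustration):

```python
def passes(score: float, minscore: float) -> bool:
    # Assumed rule: a read passes if its score reaches the minimum.
    return score >= minscore

shel_scores = [32.0, 97.7, 420.0]  # SHEL is always >= k + 1 = 32 for a classified read
conf_scores = [0.3, 0.7, 0.99]     # CLARK_C confidence values below 1.0

print(all(passes(s, 1) for s in shel_scores))  # True: -y 1 filters nothing under SHEL
print(any(passes(s, 1) for s in conf_scores))  # False: -y 1 rejects all these under CLARK_C
```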

Recentrifuge also supports the following generic scoring schemes for CLARK, which are useful when read lengths span orders of magnitude, as in nanopore sequencing:

  • LENGTH: The score of a read will be its length (or the combined length of mate pairs).
  • LOGLENGTH: Logarithm (base 10) of the length score.
  • NORMA: The normalized score SHEL / LENGTH expressed as a percentage, so it takes into account both the assignment quality and the length of the read. Very useful when both the assignment scores and the lengths vary among the reads.

For these three scoring schemes, the minscore parameter applies to the SHEL score calculated for the read. So, for example, a minscore of 35 (set with the -y 35 option) will filter the same reads regardless of which of these three schemes is selected.
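A sketch of the three generic schemes for a single read (function names are illustrative; NORMA builds on the SHEL score as described above):

```python
import math

K = 31  # CLARK's default k-mer length

def length_score(read_len: int) -> int:
    """LENGTH: the read's length (or the combined length of mate pairs)."""
    return read_len

def loglength_score(read_len: int) -> float:
    """LOGLENGTH: base-10 logarithm of the length score."""
    return math.log10(read_len)

def norma_score(h1: int, read_len: int) -> float:
    """NORMA: SHEL / LENGTH as a percentage (h1 is the top-assignment hit count)."""
    return 100.0 * (h1 + K) / read_len

print(length_score(200))      # 200
print(loglength_score(1000))  # 3.0
print(norma_score(169, 200))  # 100.0
```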

Advanced example

Let's see a more complex example in detail. In order to analyze the full-mode output of any flavor of CLARK:

  • with the taxonomy files downloaded to /my/tax/dir,
  • from samples X1 (file X1.csv), X2 (file X2.csv) and X3 (file X3.csv),
  • with two negative controls (files CTRL1.csv and CTRL2.csv),
  • saving the output to Xsamples.rcf.html file,
  • scoring each read by its confidence score (CLARK_C) as calculated by CLARK's full mode,
  • filtering reads with 0.3 as a minimum value for such confidence,
  • for the general samples (except negative controls), with 10 as the minimum number of reads assigned to one clade to avoid collapsing it to the parent clade,
  • with 5 as the same parameter but for the control samples,
  • and excluding the reads assigned to humans (taxid 9606),

the command would be:

rcf -n /my/tax/dir -r CTRL1.csv -r CTRL2.csv -r X1.csv -r X2.csv -r X3.csv -c 2 -o Xsamples.rcf.html -s CLARK_C -y 0.3 -m 10 -w 5 -x 9606

The complete guide to rcf options and flags is in the Recentrifuge command line page.