README_CLARK.txt

== AUTHORS

Rachid Ounit (1), Steve Wanamaker(2), Timothy J Close (2) and
Stefano Lonardi (1). 
1: Computer Science and Engineering department, University of California, 
Riverside CA 92521
2: Botany and Plant Sciences department, University of California, 
Riverside CA 92521


== ABOUT

  The problem of DNA sequence classification is central to several application
domains in molecular biology, genomics, metagenomics and genetics. The problem
is computationally challenging due to the size of datasets generated by modern 
sequencing instruments and the growing size of reference sequence databases. 
We present a novel method, called CLARK (CLAssifier based on Reduced K-mers), 
for supervised sequence classification based on discriminative k-mers. Somewhat 
unique among other metagenomic and genomic classification methods, CLARK provides
a confidence score for its assignments which can be used in downstream analysis. 
We demonstrated the utility of CLARK on two distinct specific classification 
problems, namely (1) the assignment of metagenomic reads to known bacterial 
genomes, and (2) the assignment of BAC clones and transcript to chromosome arms 
(in the absence of a finished assembly for the reference genome). 

Three classifiers or variants from the CLARK framework are provided :
CLARK (default): created for powerful workstation, it can require a significant 
amount of RAM to run with large database (e.g., all bacterial genomes from 
NCBI/RefSeq). This classifier queries k-mers with exact matching;

CLARK-l : created for workstations with limited memory (i.e., "l" for light), 
this software tool provides precise classification on small metagenomes. Indeed,
 for metagenomics analysis, CLARK-l works with a sparse or ''light'' database 
(up to 4 GB of RAM) that is built using distant and non-overlapping k-mers.
This classifier queries k-mers with exact matching;

CLARK-S (new): created for powerful workstation exploiting spaced k-mers 
(i.e., "S" for spaced), this classifier requires a higher RAM usage than 
CLARK or CLARK-l, but it does offer a higher sensitivity.
CLARK-S completes the CLARK series of classifiers.

== GENERAL NOTE

The performance of CLARK was introduced, as a poster presentation, at the 5th 
ACM Conference on Bioinformatics, Computational Biology, and Health Informatics 
(ACM-BCB), September 20-23, 2014 at Newport Beach, CA.

CLARK-S was presented at the 14 Workshop on Algorithms in Bioinformatics, Atlanta,
GA, September 2015.


== LICENCE

The source code of CLARK is distributed under the GNU GPL License. CLARK and its 
variants (such as CLARK-l) are a free software: you can redistribute it and/or 
modify it under the terms of the GNU General Public License as published by the 
Free Software Foundation, either version 3 of the License, or (at your option) 
any later version. 
CLARK and its variants are distributed in the hope that it will  be useful, but 
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 
FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU General Public License for more details at http://www.gnu.org/licenses/ 
Copyright 2013-2016, Rachid Ounit (rouni001 at cs.ucr.edu).


== RELEASE

  On 09/01/2014, the version 1.0 is available: 
This version of CLARK is written in C++ and shell script.  It is designed for 
64-bit OS, and can be executed on the latest Linux and Mac systems. 
  On 02/20/2015, the version 1.1 is available:
Compared to v1.0, this version is more efficient to store the database in disk
(some non-necessary data were pruned, and thus a relative reduction of about 15%
can be observed for the disk space needed). The user can also observe the program
loads the database in RAM much faster than before. This version also allows to load 
the database in multithreaded-task. Finally, the v1.1 uses different settings for 
the sampling method used for the default mode, which leads to an increase of the 
classification speed of about 5%.
  On 04/15/2015, the version 1.1.1 is available:
Compared to v1.1, this version is more efficient to load the database from disk to
RAM. The relative reduction is about 16% in average. This improvement also implies
an increase of the classification speed. This release also includes scripts to 
facilitate the classification of metagenomic samples (see section "CLASSIFICATION OF 
METAGENOMIC SAMPLES").
  On 04/22/2015, the version 1.1.2 is available:
This release includes scripts to produce the abundance estimation per target (i.e.,
the count and proportion of objects assigned to a target) with filtering possible
on the confidence score. Also, the user can get the density of assigned objects per 
confidence score. The user can also pass to CLARK objects as compressed
files (GZ format) directly to "classify_metagenome.sh". Finally, bugs fixes and
code improvement are made.
  On 06/03/2015, the version 1.1.3 is available:
This release extends existing features, simplifies output produced by the full mode
(to reduce the disk spaced needed for the results file). Bugs fixes and
code improvement are included.
  On 06/15/2015, the version 1.2.1-beta is available:
This release contains the discriminative spaced k-mers (using multiple spaced-seed).
This version is still in the beta version while features and code improvements on 
the RAM usage are on-going. 
Bug fixes and code improvement are included.
  On 12/11/2015, the version 1.2.2-beta is available:
The speed and RAM usage of the full and spaced mode are significantly improved:
i) the RAM usage is now scalable and the upper-bound is lower than the maximum 
usage needed in previous versions; 
ii) the speed is increased by a factor of 5 to 9, in singlethreaded tasks. 
To analyze results, a script to plot the distribution of the assignments per gamma
score (as done with confidence score) is provided. 
In addition, the option to create Krona-compatible input files 
("http://www.biomedcentral.com/1471-2105/12/385") is available.
  On 04/07/2016, the version 1.2.3 is available:
The classifier CLARK-S is finalized. Scripts to generate the targets definition 
using the accession number instead of the GI number have been updated. 
Additional scripts have been added to facilitate the creation and changes 
of the customized databases. Code improvements and bug fixes are included.


This versatile method allows the classification of objects sequences against a 
database of targets. Sequences of targets must be given by standard fasta and/or 
fastq files, or eventually text file (two-column format, where each line is 
<k-mer> <count>, for example in the case of the file "target.txt" (using 6-mers):

$ cat target.txt
ATATAT 1234
GCGCGC 435
ATTTTT 234
...
)


== COMPATIBILITY BETWEEN VERSIONS

Between v1.* and v1.0:
The version 1.1 (or higher) does use a different algorithm when storing and loading 
the database, compared to v1.0. Therefore, databases files (TSK files) produced by 
the version 1.0 are not compatible with v1.1. This is why it is needed to rebuild 
the database using v1.1 (or newer version). When using v1.1 (or newer), three files are 
created for the database. 
Discriminative k-mers and their targets ID are stored respectively in the files of 
extension *.ky and *.lb. Sizes of all buckets of the hashtable are stored in the 
file of extension *.sz.


== SOFTWARE & SYSTEM REQUIREMENTS

1) EXTERNAL TOOLS & C++ COMPILER VERSION       
unlike most of other metagenomic and genomic classifiers, CLARK does not require
Uny tool from the BLAST family, Jellyfish or other external tool. The main
requirement is a 64-bit operating system (Linux or Mac), and the GNU GCC to
compile version 4.4 or higher. Multithreading operations are assured by the openmp
libraries. If these libraries are not installed, CLARK will run in single-threaded
task. CLARK is expected to successfully run on recent Mac and Linux OS.

2) MEMORY
CLARK can be RAM consuming, especially when loading large database(s). Latest versions
since v1.1.1 require about 58 GB in RAM to load the database in default mode (built 
from  bacteria genomes, default NCBI/RefSeq) and about 156 GB to build that database, 
using 31-mers. Using 20-mers instead, it requires 49 GB to load the database 
(and 107 GB to build the database). About CLARK-S, the memory usage during the 
classification is the highest (~101 GB) because the program loads in memory several 
sets of spaced k-mers (from several spaced seed).

For the space required in disk, CLARK needs about 36 GB to store the database built 
for bacteria (at the genus level).

If memory is a concern to you and you have small metagenomes then please consider
 using the RAM-light version, CLARK-l. This tool is designed to run on 4 GB RAM 
laptop, thus the memory consumption is the main objective of this tool. 
CLARK-l can load data from bacteria and viruses databases within 4 GB. 
If it can't then the user should change the parameter value for the gap 
(see "-g <iteraton>" option, in "MANUAL & OPTIONS") and set the iteration to a 
value higher than the default, 4. 
IMPORTANT NOTE: Results given by CLARK-l should be seen as a draft (or first order 
approximation) of results you would get by running CLARK or CLARK-S.


== INSTALLATION

First, download the zipped package of the latest version (v1.2.3), available from 
the CLARK webpage ("Download tab"), http://clark.cs.ucr.edu.
Second, uncompress the tar.gz file ("tar -xvf CLARKV1.2.3.tar.gz"), then go to 
the sub-directory "CLARKSCV1.2.3" and execute the installation script ("./install.sh"). 

The installation is done! You can now run CLARK and any of the provided scripts.

The installer builds binaries (CLARK, CLARK-l and CLARK-S, in the folder "exe" in 
CLARKSCV1.2.3). 

== SCRIPTS

After the installation, you can also notice that several scripts are available. 
Especially:

- set_targets.sh and classify_metagenome.sh: They allow you to classify your 
metagenomes against several database(s) (downloaded from NCBI or available 
"locally" in your disk)

- estimate_abundance.sh: It computes the abundance estimation (count/proportion 
of objects assigned to targets).

- evaluate_density_confidence.sh and evaluate_density_gamma.sh: These two scripts 
take in input one or several CLARK results (containing confidence/gamma scores) 
and output/plot the density of assignments per confidence score (or gamma score).

- buildSpacedDB.sh: It creates the sets of discriminate spaced k-mers from the 
selected databases.

- clean.sh: This script will delete permanently all data related (generated and 
downloaded) of the database directory defined in set_targets.h.

- resetCustomDB.sh: It resets the targets definition with sequences (newly 
added/modified) of the customized database. Any call of this script must be 
followed by a run of set_target.sh.

- updateTaxonomy.sh: To download the latest taxonomy data (taxonomy id, 
accession numbers, etc.) from the NCBI website.


== INTRODUCTION ON CLARK USAGE

To introduce how to use CLARK, we first begin with some definitions:

- target: a reference sequence or a set of reference sequences (e.g., reads, contigs, 
assemblies, etc.) of any length. In metagenomics, for a particular taxonomy rank r,
any target can be a genome, or genomes of a particular taxon defined at the level r. 
see examples in next sections). In genomics, a target can be reads or the assemblies 
of a chromosome (or any other genomic region of a genome). 
Thus, a target is defined by the context of the classification. 

- object: a sequence (e.g., a read, a contig, a scaffold/assembly, etc.), of any length, 
to be classified against a set of targets defined at the same taxonomy rank (see 
examples in next sections).

In general, target or object, sequences can be fasta or fastq files (or even spectrum),
see details in next sections.

We provide examples of how to classify a set of objects against a set of targets.
The general rule is the user must define the targets definitions (i.e., the 
(reference sequence) <-> (target ID) association), pass objects to be classified, 
files to store results and a directory to store the database (of target-specific k-mers). 
This database is needed, because time-consuming computations are needed to 
find and saved in disk target-specific k-mers. Once these data are saved in disk,
it will be loaded directly and quickly if CLARK needs them again.

We describe in detail and also provide examples in next sections. All options/parameters 
available are detailed in the section "MANUAL & OPTIONS".

For metagenomics, scripts are provided to download/preprocess databases
from NCBI (i.e., genomes related to Bacteria, Viruses and Human), see section 
"CLASSIFICATION OF METAGENOMIC SAMPLES".


== HOW TO CHOOSE THE K-MER LENGTH ?

CLARK is a k-mer based classification method, and the value of k is critical for the 
classification accuracy and speed. The user should use any value between 19 and 23 
in order to get high sensitivity. For high sensitivity, we recommend k = 20 or 21 
(along with the full mode).

However, if the precision and the speed are the main concern, the user should use 
any value between 26 and 32. However, the highest is k, the higher is the RAM usage.
As a good tradeoff between speed, precision, and RAM usage, we recommend k = 31 (along
with the default/express mode). The default k-mer length is 31.

In the version 1.*, CLARK supports any k-mer length up to 32.


== MANUAL & OPTIONS

In this section, we describe options directly available with the CLARK executable 
(binary file in the "exe" directory). 

CLARK offers several options to run the classification. 

A typical command line to run CLARK (or CLARK-l/CLARK-S) looks like:
$ ./CLARK -k <kmerSize> -T <fileTargets> -t <minFreqTarget> -D <directoryDB/> -O <fileObjects> -o <minFreqObject> -R <fileResults> -m <mode> -n <numberofthreads> ...

Definitions of parameters:

-k <kmerSize>,       	 	k-mer length:	integer, >= 2 and <= 32 (in version 1.*). 
			 	The default value for this parameter is 31, except for CLARK-l (it is 27). 
				For CLARK-S, k is set to 31 and the maximum number of mismatches is set to 9.

-T <fileTargets>,    	 	The targets definition is written in fileTargets: filename.  
				This is a two-column file (separated by space, comma or tab), such that for each line:
				column 1: the filename of a reference sequence
				column 2: the target ID (taxon name, or taxonomy ID, ...) of the reference sequence 
				(See example below).
				This parameter is mandatory.

-t <minFreqTarget>,    		minimum of k-mer frequency/occurence for the discriminative k-mers:  integer, >=0.
				The default value is 0. For example, for 1 (or, 2), the program will discard any 
				discriminative k-mer that appears only once (or, less than twice).

-D <directoryDB/>,   	 	Directory of the database : pathname.  
				This parameter is mandatory.

-O <fileObjects>,    	 	file containing objects: filename.  
				This parameter is mandatory.

-P <file1> <file2>,		Paired-end fastq files: filenames.

-o <minFreqObject>,    		minimum of k-mer frequency or occurence in objects:    integer, >=0.
				This option is only available in the spectrum mode (-m 3).

-R <fileResults>,    	 	file to store results:  filename.  
				Results are stored in CSV format in the file <fileResults>.csv (the extension 
				".csv" is automatically added to the filename).
				This parameter is mandatory. 

-m <mode>,           	 	mode of execution: 0,1,2 or 3
				0 (full):    To get detailed results, confidence scores and other statistics.
					     This mode loads all discriminative (contiguous/spaced) k-mers.
					     This mode is available for CLARK, CLARK-S and CLARK-l.
				1 (default): To get results summary and perform best trade-off between
					     classification speed, accuracy and RAM usage.
					     This mode loads half discriminative (contiguous/spaced) k-mers.	
					     This mode is available for CLARK and CLARK-l.
				2 (express): To get results summary with the highest speed possible.
				             This mode loads all discriminative (contiguous/spaced) k-mers.
					     This mode is available for CLARK, CLARK-S and CLARK-l.
				3 (spectrum):To classify data that are given in spectrum form (i.e.,  
					       sequence data are in a two-column file, <k-mer> <count>). 
					     This mode is available for CLARK and CLARK-l.
				
-n <numberofthreads>,	 	number of threads: integer >= 1.
				The program runs in parallel for the classification and, with the option
				'--ldm' on, the loading of the database into RAM.

-g <iterations>,		"gap" or number of non-overlapping k-mers to pass when creating the
				database (for CLARK-l only). The bigger the gap is, the lower the RAM 
				usage is. The default value is 4, but the user can increase this value if 
				needed to reduce the RAM usage but this will degrade the sensitivity.

-s <factor>,			sampling factor value (for CLARK/CLARK-S only). If you want the program to load 
				in memory only half the discriminative (contiguous/spaced) k-mers then use 
				"-s 2". If you want a third of these k-mers then use "-s 3", etc.
				The higher this factor is, the lower the RAM usage is. 
				The higher this factor is, the higher the classification speed/precision is. 
				However, our experiments show that the sensitivity can be quickly 
				degraded, especially for values higher than 3. 
				In the default mode, this factor is set to 2 because it represents a good 
				trade-off between speed, accuracy and RAM usage.

--tsk,               	 	(to request a detailed creation of the database (target specific k-mers files). 
				For each target, the program creates a file containing all its specific k-mers.
				BEWARE: This option requires a high amount of disk space to complete.) 
				This option is no more supported.

--ldm,               	 	to request the loading of database file by memory mapped-file. 
				This option accelerates the loading time but it will require an 
				additional amount of RAM significant. This option also allows to
				load the database in multithreaded-task.

--kso,               	 	to request a preliminary k-spectrum analysis of each object 
				(for mode 3 only). This analysis gives intervals of k-mers 
				frequency for which k-mers are assumed to be informative
				of the object. This interval is reported in the	results file.

--extended			to request an extended output (result file), only for the full mode.


When a filename is required, we recommend absolute paths.


== LOW-LEVEL DESCRIPTION & EXAMPLES

This section describes main features and options for running directly the executable 
CLARK (or CLARK-l/CLARK-S) built (in the folder "exe") after the installation. If you want
to use CLARK as a metagenome classier, we recommend you use the provided scripts
for that purpose and detailed in the section "CLASSIFICATION OF METAGENOMIC SAMPLES".
However, we recommend the reader to go through this section to become familiar with 
CLARK's options/parameters.

If you work directly with CLARK and CLARK-l files (e.g., to classify BAC/transcript 
to chromosomes) then we recommend you copy them to your working directory. 

In this section, the working directory is /home/user1/bin/ and the binaries CLARK, CLARK-l 
and CLARK-S have been copied in it. Indeed,

$ pwd
 /home/user1/bin/
$ ls
CLARK CLARK-l CLARK-S
$

We describe in the following paragraph 1) how to create the targets definition with 
examples for it, in the case CLARK is set to classify metagenomic samples. Then,
we describe how to set different modes, use paired-end reads and run multiple threads.
Finally, we explain how to use CLARK for assignning BAC/transcript to chromosome-arms/
centromeres. 


1) Generalities, and Default Mode: 

- Say, for simplicity, the user have ten genomes in its database and each genome is 
stored into one fasta file and is located in the current working directory (say, 
genome_1.fa, genome_2.fa, ..., genome_10.fa). Say that the first three genomes belong to 
the same species S1, the next three genomes belong to the same species S2, and the last four
genomes belong to S3. And finally, say that the first six genomes belong to the same
 genus G1, and the last four genomes belong to the same genus G2.

- Sequences (or objects) to be classified are three short single-reads all stored in 
one fasta file, "objects.fa". 

First, the user must define targets for the classification (the "targets_addresses.txt" 
file). For metagenomics, for a classification at the species-level, then the user must 
define targets accordingly.  So, at the species level, the targets_addresses.txt file is:

$ cat targets_addresses.txt
/home/user1/bin/genome_1.fa S1
/home/user1/bin/genome_2.fa S1
/home/user1/bin/genome_3.fa S1
/home/user1/bin/genome_4.fa S2
/home/user1/bin/genome_5.fa S2
/home/user1/bin/genome_6.fa S2
/home/user1/bin/genome_7.fa S3
/home/user1/bin/genome_8.fa S3
/home/user1/bin/genome_9.fa S3
/home/user1/bin/genome_10.fa S3
$

In this case, there are three distinct targets (S1, S2 and S3) that the program 
will use for the classification. Defining targets at the genus-level would be:

$ cat targets_addresses.txt
/home/user1/bin/genome_1.fa G1
/home/user1/bin/genome_2.fa G1
/home/user1/bin/genome_3.fa G1
/home/user1/bin/genome_4.fa G1
/home/user1/bin/genome_5.fa G1
/home/user1/bin/genome_6.fa G1
/home/user1/bin/genome_7.fa G2
/home/user1/bin/genome_8.fa G2
/home/user1/bin/genome_9.fa G2
/home/user1/bin/genome_10.fa G2
$

In this case, there are 2 distinct targets (G1 and G2). Results and performance of 
CLARK have been analyzed at the genus and the species level (see the manuscript).

In all these examples, each genome file is given by the absolute path (recommended),
however the user can also use relative path. Since files are in the working 
directory (where the binaries are), we can simply have:

$ cat targets_addresses.txt
./genome_1.fa G1
./genome_2.fa G1
./genome_3.fa G1
...
$

Naturally, G1 and G2 are just labels that the user is free to choose. The user 
can rather use the full genus names (e.g., Staphylococcus and Streptococcus instead of 
G1, and G2), or, as recommended, the taxonomy ID (i.e., 1279 and 1301 resp.).

The definition of targets is similar for genomics (the reader can see the paragraph
"6) Chromosome arms and Centromeres"). Defining targets for genomics or metagenomics 
is equivalent.

Second, based on the definition of the targets, CLARK builds a database. This database 
needs a directory to be created. We recommend the user to create a specific 
directory for each targets definition. For this example, we use the directory "./DBD/".
This directory can be anywhere in your disk, and independantly to the directory you 
chose for the source code.

Third, objects are short reads present in the file "objects.fa". Then, we want 
the classification results to be stored in the file named "results". The results format is 
CSV (see "RESULTS FORMAT" for more detail). If a file "results.csv" already exists then
 its content will be erased.

Finally, to run CLARK, using 31-mers, in default mode, run:
$ ./CLARK -k 31 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results

Or using 20-mers:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results

Note that the results file will have the extension "csv". 

If CLARK is run for the first time on the defined targets and the given k-mer length, 
then CLARK builds first the database, which gathers all Target Specific K-mers (TSK). 
If CLARK is run again on the same targets and same k-mer length, then it will directly 
load TSK files previously produced. Once the classification is done, 
the file results.csv is created.

1.1) Exclusion of discriminative k-mers of low frequency in the database

The user can request the removal of discriminative k-mers that have a frequency equal
 or lower than a certain threshold (like 1, or 2, or, ...). 
This can be done by using the option "-t", for which the default value is 0 (i.e., "-t 0"). 
For example, to exclude discriminative k-mers that "appear" only once, the command line
from the last example in 1) is :
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -t 1

To exclude discriminative k-mers that appear twice or less:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -t 2

See the "MANUAL & OPTIONS" section for more info.

1.2) Exclusion of k-mers of low frequency in the objects

Similarly, the user can also request to ignore k-mers in objects that have a frequency 
equal or lower than a certain threshold (option "-o"). 
In version 1*, this option is available ONLY for the spectrum mode (mode 3), see 
"MANUAL & OPTIONS".


2) Fast, Full, Spaced and Spectrum Mode:

Previous instructions used the default mode of CLARK (i.e., "-m 1"). To use the "express"
 mode to get results more quickly, then simply add the option "-m 2". For the "full" mode 
that will generate detailed results (including confidence scores), add "-m 0".

For example, using 20-mers, to run in express mode:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 2

In full mode:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 0

Note that
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 1
is equivalent to
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results

In the case objects are given as spectrum (a two-column file), the user should 
indicate it with the option "-m 3":
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 3


IMPORTANT NOTES:
Whatever you use CLARK, CLARK-l or CLARK-S:
- The full mode (i.e., "-m 0") loads all discriminative k-mers in RAM and provides 
confidence score for all assignments. Thus, it offers high sensitivity.  
If your data are in the spectrum format, then the spectrum mode (-m 3) is a solution 
for you. 
- If your primary concern is the speed, and do not need detailed results, nor confidence
scores, then use the "express" mode (-m 2), the fastest mode. 

For CLARK or CLARK-l:
- If you do not know what mode to use, then use the default mode (-m 1). It loads in RAM
about half discriminative k-mers stored in disk. Thus, the RAM usage is lower than that
of other modes, but its sensitivity is not the highest. However, the precision and speed
are high, and the default mode is a good compromise between speed, RAM usage and high 
precision/sensitivity.

Finally:
- If you analyze a metagenomic sample from a poorly known microbial habitat (i.e., 
the default RefSeq database is likely not to contain a large fraction of genomes of 
organisms present in your sample), for example, sea water, etc. then use CLARK-S 
with full mode. CLARK-S uses multiple spaced k-mers instead of contiguous k-mers, 
and it produces detailed results. However, this variant is the most RAM-consuming. 
Unlike CLARK, results produced by CLARK-S are already filtered (i.e., assignments 
with confidence score < 0.75 or gamma score < 0.06 are rejected). 
However, you can use a stricter filtering to get more precise results 
(see option of the script estimate_abundance.sh). 


2.1) Running CLARK-S:

The current release exploits spaced k-mers of length 31 and weight 22. Before classifying your 
metagenomic sample, the database of discriminative 31-mers (e.g., from bacterial genomes) and 
then the database of discriminative spaced 31-mers (using the script "buildSpacedDB.sh")
must be created.

Step 0 (Set the database and its directory "DBD"):

$ ./set_targets.sh ./DBD/ bacteria --phylum 
(or $ ./set_targets.sh ./DBD/ bacteria --species)

where "DBD" is the directory where you want to store bacteria genomes and create the database. 

Step 1 (Build the database of discriminative 31-mers):

If the database files of 31-mers are already created for the database and the rank specified 
in step 0 then you can skip this step.
This can be done by running any classification with k=31, for example: 
$ ./classify_metagenome.sh -O sample.fa -R result

where sample.fa is some fasta file data. This operation is only needed to create the  
database of discriminative 31-mers.

Step 2 (Create the databases of discriminative spaced k-mers):

$ ./buildSpacedDB.sh

This task will take several hours to create discriminative spaced k-mers for all the
3 spaced seeds (i.e., about 6 to 7 hours). This operation will write about 60 GBytes of
data on disk.
 
Finally, to classify your metagenome (for example, sample.fq) run:
$ ./exe/CLARK-S -T ./targets_addresses.txt -D ./DBD/ -O ./sample.fq -R ./result -m 0
or
$ ./classify_metagenome.sh -O sample.fq -R result --spaced

Note: If you decide to work with a different taxonomy rank and/or database then 
you will need to repeat the step 0 and step 1.


3) Paired-end reads:

CLARK accepts paired-end reads (fastq files) as objects. Indicate it with the option "-P"
and the two filenames (CLARK will concatenate paired reads with matching ID).
An example of command line, using default mode, is:

$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -P ./sample1.fastq ./sample2.fastq -R paired.results


4) Multi-Objects (or multiple sample files):

The program can process several sample files at the same time (i.e., no need to reload the database 
or restart the program for each sample). 
To classify several fasta/fastq files under the same settings (e.g., mode, threads, etc.) ,
it is possible to pass all of them conveniently (i.e., without the need to reload 
the database for each file). Say the user has 3 fasta files, located in the current working 
directory, ./objects_1.fa, ./objects_2.fa and ./objects_3.fa. To pass these three files
at once, create a file (e.g., "objects.txt") containing filenames of these fasta files:

$ cat objects.txt
./objects_1.fa
./objects_2.fa
./objects_3.fa  
$

The objects.txt file is like a "meta-objects" file. Then, create a meta-results file 
(called "results.txt") such that the i-th line in it indicates the filename to store results
of the i-th fasta file indicated in "objects.txt":

$ cat results.txt
./results_1
./results_2
./results_3
$
./results_1.csv will contain results related to ./objects_1.fa, ..., and 
./results_3.csv will contain results related to ./objects_3.fa.

Indeed, once the classification is done, results_1.csv, results_2.csv and results_3.csv are created.
This way assures that the user can pass any amount of objects files in a scalable fashion.

Note: the "results.txt" file can be just a duplicate (or a simple copy) of "objects.txt". 
Since the program create results files with extension "csv", so you can create these results 
files by using the filename of the fastq/fasta file as prefix.

For example, if you run:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.txt -R ./objects.txt

The program will store the results files as ./objects_1.fa.csv, ./objects_2.fa.csv and ./objects_3.fa.csv

Same for paired-end reads:
$ ./CLARK -k 20 -T ./targets_addresses.txt -O ./DBD/ -P ./paired1.txt ./paired2.txt -R ./results.txt

where files paired1.txt aand paired2.txt contain addresses of paired-end reads:
$ cat paired1.txt
./sample1_R1.fq
./sample2_R1.fq
./sample3_R1.fq

$ cat paired2.txt
./sample1_R2.fq
./sample2_R2.fq
./sample3_R2.fq

$ cat results.txt
./results_1
./results_2
./results_3

Once computations done, results for the paired-end reads for each of the three samples are stored in:
results_1.csv, results_2.csv and results_3.csv.


5) Multi-threading:

CLARK can exploit parallel threads to increase the computations speed.
The option to add is "-n <Number of threads>". For example, using 12 threads:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.txt -R ./results.txt -n 12


6) Chromosome arms and Centromeres

We understand that users can work with chromosome arms as targets. In this case, 
centromeres can be inferred by CLARK. To allow that, the user must indicate in the 
targets_addresses.txt file sequences that can be related to centromeres.

For example, in the case of barley, consider 6 chromosome arms defined by 6 files 
(./chr_arm1L.fa, ./chr_arm1S.fa, ..., ./chr_arm3L.fa, ./chr_arm3S.fa). If the user 
does not want to define centromeres, then targets_addresses.txt would look like:

$ cat targets_addresses.txt
./chr_arm1L.fa 1HL
./chr_arm1S.fa 1HS
./chr_arm2L.fa 2HL
./chr_arm2S.fa 2HS
./chr_arm3L.fa 3HL
./chr_arm3S.fa 3HS
$

If the user wants the program to infer centromere of 2H only, then it will 
indicate the centromere name for it (2HC) at each target related to the chromosome
 2H (so ./chr_arm2L.fa, and ./chr_arm2S.fa) on the same line:

$ cat targets_addresses.txt
./chr_arm1L.fa 1HL
./chr_arm1S.fa 1HS
./chr_arm2L.fa 2HL  2HC
./chr_arm2S.fa 2HS  2HC
./chr_arm3L.fa 3HL
./chr_arm3S.fa 3HS
$

If all centromeres must be inferred, then targets_addresses.txt would look like:

$ cat targets_addresses.txt
./chr_arm1L.fa 1HL 1HC
./chr_arm1S.fa 1HS 1HC
./chr_arm2L.fa 2HL 2HC
./chr_arm2S.fa 2HS 2HC
./chr_arm3L.fa 3HL 3HC
./chr_arm3S.fa 3HS 3HC
$

When defining chromosome arms labels, we strongly recommend that the user follows the 
naming convention used of this example:
For the N-th chromosome, "NS" is the label for the short arm and "NL" is the 
label for the long arm and (same prefix), so "NC" is the label for the centromere. 


IMPORTANT NOTE: 
Examples of command lines above using fasta files are valid as well for
fastq files, except for the paired-end reads case.


== CLASSIFICATION OF METAGENOMIC SAMPLES

We provide several scripts to facilitate the classification in the context of 
metagenomics. CLARK can preprocess databases of bacteria, viruses or human (downloaded 
from NCBI) or a customized set of genomes.

First, we present here two scripts, "set_targets.sh" and "./classify_metagenome.sh" that 
work together.
Second, we present the script to get the abundance estimation (count and proportion for each 
target identified), "estimate_abundance.sh", from one or several results file(s).


1) Setting and classification

1.1) Step I: Setting targets

After the installation (./install.sh), the user must create a directory to store all 
reference sequences (bacteria, viruses, human and custom). For all our examples below,
we name this directory path in a generic way <DIR_DB/> for clarity. This directory
can be anywhere in your disk(s).

Then, the user must indicate what database(s) to consider for the classification among 
bacteria, viruses, human and/or custom.

For example, only bacteria genomes:
$ ./set_targets.sh <DIR_DB/> bacteria

To work with bacteria, viruses and human:
$ ./set_targets.sh <DIR_DB/> bacteria viruses human

To classify against a custom database:
The user will need to copy/move the sequences (fasta files with accession numbers in the 
header, i.e., ">accession.number ..." or ">gi|number|ref|accession.number| ...", 
and one fasta file per reference sequence) in the directory "Custom", inside <DIR_DB/>, 
and then run set_targets. In order words, the user must:
(1) create the directory "Custom" inside  <DIR_DB/> (if it does not exist yet);
(2) copy/move the sequences of interest in the Custom;
(3) run:
$ ./set_targets.sh <DIR_DB/> custom
or for example, when combining the bacteria genomes with the Custom sequences:
$ ./set_targets.sh <DIR_DB/> bacteria custom


In general case (when the user selects bacteria, viruses and/or human), 
if the directory <DIR_DB/> is empty, then the script will download all
the selected database(s) and also data of the taxonomy tree, from NCBI.
Once the sequences are found or downloaded in the directory, the script will build 
the targets for a given taxonomy rank. 

The default taxonomy rank is species. To use a different taxonomy rank, for example, 
genus, the command line is (from the example selecting bacteria, viruses and human):

$ ./set_targets.sh <DIR_DB/> bacteria viruses human --genus

In the current release, the user can choose between six ranks (species to phylum):
--species (the default value), --genus, --family, --order, --class or --phylum.

Indeed, the strength of CLARK is to be able to classify quickly and accurately 
metagenomic samples by using only one taxonomy rank. So as a general rule when 
classifying metagenomic reads: 
Consider first the genus or species rank, then if a high proportion of reads 
cannot be classified, reset your targets definition at a higher taxonomy rank 
(e.g., family or phylum).

Once set_targets.sh is finished, the user can proceed to the step II. However, if 
the user wants to modify the selected databases and/or taxonomy rank, then he/she will 
need to run again set_targets.sh with updated parameters before proceeding to step II.

1.2) Step II: Running the classification

The script to run the classification of a metagenomic sample against the database(s) 
previously set in step I is "classify_metagenome.sh".

For your convenience, this script runs the executable CLARK, CLARK-l or CLARK and allows 
you to pass few parameters.

For example, say objects to be classified are reads in "sample.fa" (e.g., located in the 
current directory), and results to be stored in "result.csv". A basic command line is:

$ ./classify_metagenome.sh -O ./sample.fa -R ./result

As explained in the section "MANUAL & OPTIONS", thanks to identifiers "-O" and "-R", 
the script will pass the objects file "sample.fa" and results will be stored
in "./result.csv". Objects are classified against the targets and the taxonomy rank
defined by the last execution of ./set_targets.sh.

IMPORTANT NOTES:
- The targets definition is automatically passed to CLARK in step II. The file has been
 computed by set_targets.sh.

- The script set_targets.sh assumes that each reference file from bacteria, viruses or 
custom database contains an accession numberr (in the RefSeq records format: 
i.e., ">accession.number ..." or ">gi|number|ref|accession.number| ..." ). 
If an accession number is missing in a file then this file will not be used 
for the classification. 

- set_targets.sh also maps the accession number found in each reference sequence to its 
taxonomy ID based on the latest NCBI taxonomy data. If a mapping cannot be made for a 
given sequence, then it will NOT be counted and excluded from the targets definition.
The total number of excluded files is prompted in the standard output, and all files that 
have been excluded are reported in the file "files_excluded.txt" (located in the specified
database directory (i.e., "./DBD/").
If some files are excluded, then it will probably mean that they have been removed for 
curations for example (visit the RefSeq FAQ webpage) or maybe because your local taxonomy
information are no nore up-to-date (see next point).

- You can update your local taxonomy database thanks to the script "updateTaxonomy.sh"
You can use this script before running set_targets.sh.

- If the user wants to work with a different customized database (for example, by removing
or adding more sequences of interest in the Custom folder) then the targets definition
must be reset. We made it simple with the script "resetCustomDB.sh": 
After the sequences in the Custom folder have been updated, just run:
$ ./resetCustomDB.sh
Then, run set_target.sh with the desired settings.

- The database files (*.ky, *.lb and *.sz) will be created inside a subdirectory of the 
specified database directory in step I (i.e., "./DBD/") by classify_metagenome.sh.

- The default values (the k-mer length, the mode, the number of threads, etc.) are used 
if not specified by the user, just like indicated in the previous section.

- The script classify_metagenome.sh still allows you to pass customized parameters and 
options, similarly to the previous section. classify_metagenome.sh follows options defined
in "MANUAL & OPTIONS"(see below some examples). So you can change the k-mer length,
the number of parallel threads, mode, etc.

We present below some examples of customized classification using classify_metagenome.sh.

a) To use 20-mers (instead of 31-mers):
$ ./classify_metagenome.sh -O ./sample.fa -R ./result -k 20

b) To request the full mode:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result -m 0

c) To classify in full mode multiple sample files (single-end reads):
$ ./classify_metagenome.sh -O ./samples.txt -R ./samples.txt -m 0

where, the file "samples.txt" contains the addresses of all the sample files to be run:
$ cat samples.txt
./sample1.fa
./sample2.fa
./sample3.fa
./sample4.fastq
./sample5.fq
./sample6.fq
...

d) To classify in full mode multiple sample files (paired-end reads):
$ ./classify_metagenome.sh -O ./samples.R.txt ./samples.L.txt -R ./samples.R.txt -m 0

where, files "samples.R.txt" and "samples.L.txt" contain the addresses of all the fastq files
(right and left) to be run:
$ cat samples.R.txt
./sample1.R1.fq
./sample2.R1.fq
./sample3.R1.fq
...

$ cat samples.L.txt
./sample1.R2.fq
./sample2.R2.fq
./sample3.R2.fq
...

Observe in this example that the order of right and left paired-end reads of each sample must be 
preserved in "samples.R.txt" and "samples.L.txt" .

e) To request the express mode, and 8 threads:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result -m 2 -n 8

f) To request the full mode, with gzipped objects file, and using 8 threads:
$ ./classify_metagenome.sh -O ./sample.fa.gz -R ./result -m 0 -n 8 --gzipped

Another example, in default mode, for classifying paired-end reads (./sample1.fastq 
and ./sample2.fastq):
$ ./classify_metagenome.sh -P ./sample1.fastq ./sample2.fastq -R ./result

Notes:
This script can run CLARK-l instead of CLARK, for workstations with limited RAM. 
Then, the user can indicate it with the option  "--light". For example:

$ ./classify_metagenome.sh -P ./sample1.fastq ./sample2.fastq -R ./result --light

This script can run CLARK-S instead of CLARK, if the database files of discriminative
of spaced k-mers have been built. To use CLARK-S, the option is "--spaced". For example:

$ ./classify_metagenome.sh -P ./sample1.fastq ./sample2.fastq -R ./result --spaced

g) To run CLARK-S with full mode and using 8 threads:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result -m 0 -n 8 --gzipped --spaced

h) To run CLARK-S with express mode and using 8 threads on a gzipped file:
$ ./classify_metagenome.sh -O ./sample.fa.gz -R ./result -m 2 -n 8 --spaced

If you want to run CLARK-S but with a much lower RAM usage then you can decide
to download only half the discriminative spaced k-mers in memory using "-s 2".
For example:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result --spaced -s 2

Note:
run "./classify_metagenome.sh" in the terminal to prompt the help/usage describing 
options and parameters.


2) Abundance estimation

The script "estimate_abundance.sh" can analyze CLARK results of a metagenomic sample, 
and can provide for each target identified, its scientific name, taxonomy id, lineage 
(superkingdom, phylum, class, order, family, genus), its count and proportion of objects 
assigned to it. This script also allows to apply some filtering conditions (based on
 the confidence score or gamma score of assignments) to obtain a stricter estimation.
The output format of estimate_abundance.sh is CSV.

For example, say a metagenomic sample contains 100 reads, results by CLARK (full mode) 
indicates that 20 reads (20%) are assigned to the target T1, 70 (70%) are assigned to T2, 
and 10 (10%) are not assigned ("NA"). Thus, the abundance estimation program reports 
a count of 20 (or 22.2% of the assigned reads) for T1, 70 (or 77.7%) for T2, and a 
count of 10 for the category "UNKNOWN" (collecting all unassigned reads or 
assignments that do not satisfy filtering conditions). 
We give below examples of command lines.

Parameters and options of this script are:

estimate_abundance.sh -c <minConfidenceScore> -g <minGamma> -D <Directory_Path> -F <result1>.csv <result2>.csv ... <result_n>.csv -a <minAbundance> ... 

Definition of parameters: 

-c <minConfidenceScore>    To filter assignments based on their confidence score (if available) 
			   using the threshold value minConfidenceScore (a value between 0.5 and 1.0). 
                           The abundance estimation for each target will count only
                           assignments with a confidence score higher than minConfidenceScore.
                      	   Assignments that have a confidence score lower than minConfidenceScore
                       	   will be marked as unclassified and so counted in the 
                      	   category UNKNOWN in the abundance estimation report.
                           The default value is 0.5.

-g <minGamma>         	   To filter assignments based on their gamma score (if available) using the
                      	   threshold value minGamma (a value between 0 and 1.0). 
                       	   The abundance estimation for each target will count only
                       	   assignments with a gamma score higher than minGamma.
                      	   Assignments that have a gamma score lower than minGamma
                       	   will be marked as unclassified and so counted in the 
                       	   category UNKNOWN in the abundance estimation report.
                           The default value is 0.

-D <Directory_Path>    	   The directory path of the database (the same you indicated when calling 
			   set_targets.sh). This parameter is required to load scientific names of 
			   all targets ONLY if you pass results of a metagenomic sample.

-F <result1>.csv ... <result_n>.csv	 results file or list of results files produced by CLARK.
                           Important Note: You can pass a results file produced by any mode of 
				           execution of CLARK (full, express, spectrum, default), 
					   but if you pass several files, make sure they all have 
                                           been produced under the same mode. For example, if you 
					   pass two files result1.csv and result2.csv then result1.csv
                                           and result2.csv should be produced through the same mode 
   					   (e.g., both by full mode).

-a <minAbundance(%)>       To filter abundance estimations that are higher than a certain threshold, 
			   minAbundance. minAbundance is a percentage value (between 0 and 100).

--highconfidence       	   To automatically count only high confidence assignments for the 
			   abundance estimation. This option is equivalent to "-c 0.75 -g 0.03".

--krona		 	   To export results in a 3-column file for the Krona tool 
			   (Hierarchical data browser, Ondov et al, BMC Bioinformatics, 2011,
                       	   doi: 10.1186/1471-2105-12-385. https://github.com/marbl/Krona/wiki).
                       	   With this option on, the program will create the file 'results.krn' containing 
                       	   the unnormalized abundance from CLARK's assignments (with the filtering options if any).
                       	   Then, to create the Krona diagram, call the executable ktImportTaxonomy (from
			   your Krona directory) with the option '-m 3'. For example:
			   $ ktImportTaxonomy -o results.html -m 3 results.krn

Examples:

i) To get the abundance estimation of the results file, ./result.csv:

$ ./estimate_abundance.sh -F ./result.csv -D <DIR_DB/>

where <DIR_PATH> is the directory of the database you have set with set_targets.sh.

To store its output in a file (e.g., "abundance.csv"), you can do by:

$ ./estimate_abundance.sh -F ./result.csv -D <DIR_DB/> > abundance.csv

ii) To filter high confidence assignments for the abundance estimation (i.e., 
low confidence assignments will then be reported in the UNKNOWN category):

$ ./estimate_abundance.sh -F ./result.csv -D <DIR_DB/> --highconfidence

iii) To filter assignnments by using a certain confidence score threshold 
(for example, 0.8) :

$ ./estimate_abundance.sh -F ./result.csv -D <DIR_DB/> -c 0.80

To filter assignnments using a gamma score threshold (for example, -g 0.03)

$ ./estimate_abundance.sh -F ./result.csv -D <DIR_DB/> -g 0.03

Note:
Filtering based on the confidence score and/or gamma on a results file is possible
only if the file was built using the full or spectrum mode. Indeed, default 
and express mode do not produce confidence scores or gamma scores.

IMPORTANT NOTE:
From the v1.2.3, results of CLARK-S are automatically filtered. Assignments with 
low confidence score (i.e., < 0.75) or low gamma value (i.e., < 0.06) are considered 
as low confidence assignments and thus marked as unknown or "NA". 
In other words, only high confidence assignments are kept to guarantee both 
high sensitivity and high precision.

iv) If your result files are large (i.e., contain a lot of targets) then the 
output of the abundance estimation program can be large too. You can filter 
abundances by setting a minimum abundance percentage. 

For example, to print out abundances higher than 2% only, then:

$ ./estimate_abundance.sh -F ./result.csv -D <DIR_DB/> -a 2

Finally, you can pass several results files at the same time:

$ ./estimate_abundance.sh -F ./result1.csv ./result2.csv ...  -D <DIR_DB/>

In that case, to assure consistency, the results files SHOULD be produced by 
the same mode (see parameters and options above). Results from different
modes of execution SHOULD NOT be mixed.

We also remind the user that another script manipulating results files
is offered, "evaluate_density_confidence.sh" (resp. "evaluate_density_gamma.sh"). 
This script estimates and plots the density of assignments per confidence score 
(resp. gamma score). It takes in input results file(s) produced 
in the full/spectrum mode only.


== RESULTS FORMAT

Results are in CSV format. Confidence scores and other statistics about assignments 
are computed only in the full/spectrum mode.

In the full mode (extended), the results format is the following for each line:
<Object_ID>,<hit count in target 1>,...,<hit count in target N>,<Length of object>,<Gamma>,<first assignment>,<hit count of first>,<second assignment>,<hit count of second>,<confidence score> where :
* the "Object_ID"        is the tag name indicated in the header (after ">" or "@") for each 
			 object, and N is the number of targets. 
			 In the spectrum mode, this tag name is merely the filename.
* hit count in target i  is the number of k-mers specific to target i that are in the object
			 (if the option "--extended" is on).
* Length of object       is the number of bases (A,C,G,T,U or N) the object has.
* Gamma                  is the ratio between the total number of hit found in the object 
			 (against all targets) and the number of k-mers in the object.
* first assignment       is the target ID of the target that obtained the highest hit count
                         (ties are broken arbitrarily: the 1st assignment is the target 
			 having the first hit).
* hit count of first     is the number of hit count for the first assignment (h1).
* second assignment      is the target ID of the target that obtained the second highest hit 
                         count (ties are broken arbitrarily).
* hit count of second    is the number of hit count for the second assignment (h2).
* confidence score       is the ratio : h1/(h1+h2).


In the default or express mode, the results format is the following for each line:
<Object_ID>,<Length of object>,<1st_assignment>, where :

* the "Object_ID"        is the tag name indicated in the header 
                         (after ">" or "@") for each object.
* Length of object       is the number of bases the object has.
* 1st assignment         is the target ID of the target that obtained the highest hit count
                         (ties are broken arbitrarily: the 1st assignment is the 
			 target having the first hit).


== VERSIONS

Version 1.2.3 April 7, 2016.
Version 1.2.2-b December 11, 2015.
Version 1.2.1-b June 15, 2015.
Version 1.1.3 June 3, 2015.
Version 1.1.2 April 22, 2015.
Version 1.1.1 April 15, 2015.
Version 1.1.  February 20, 2015.
Version 1.0.  September 01, 2014.


== CONTACT/INFO

For any feedback, suggestion, or question, please feel free to contact Rachid Ounit 
(rouni001 at cs.ucr.edu).