forked from Funatiq/cuclark
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME_CLARK.txt
1119 lines (845 loc) · 51.1 KB
/
README_CLARK.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
== AUTHORS
Rachid Ounit (1), Steve Wanamaker(2), Timothy J Close (2) and
Stefano Lonardi (1).
1: Computer Science and Engineering department, University of California,
Riverside CA 92521
2: Botany and Plant Sciences department, University of California,
Riverside CA 92521
== ABOUT
The problem of DNA sequence classification is central to several application
domains in molecular biology, genomics, metagenomics and genetics. The problem
is computationally challenging due to the size of datasets generated by modern
sequencing instruments and the growing size of reference sequence databases.
We present a novel method, called CLARK (CLAssifier based on Reduced K-mers),
for supervised sequence classification based on discriminative k-mers. Somewhat
unique among other metagenomic and genomic classification methods, CLARK provides
a confidence score for its assignments which can be used in downstream analysis.
We demonstrated the utility of CLARK on two distinct specific classification
problems, namely (1) the assignment of metagenomic reads to known bacterial
genomes, and (2) the assignment of BAC clones and transcript to chromosome arms
(in the absence of a finished assembly for the reference genome).
Three classifiers or variants from the CLARK framework are provided :
CLARK (default): created for powerful workstation, it can require a significant
amount of RAM to run with large database (e.g., all bacterial genomes from
NCBI/RefSeq). This classifier queries k-mers with exact matching;
CLARK-l : created for workstations with limited memory (i.e., "l" for light),
this software tool provides precise classification on small metagenomes. Indeed,
for metagenomics analysis, CLARK-l works with a sparse or ''light'' database
(up to 4 GB of RAM) that is built using distant and non-overlapping k-mers.
This classifier queries k-mers with exact matching;
CLARK-S (new): created for powerful workstation exploiting spaced k-mers
(i.e., "S" for spaced), this classifier requires a higher RAM usage than
CLARK or CLARK-l, but it does offer a higher sensitivity.
CLARK-S completes the CLARK series of classifiers.
== GENERAL NOTE
The performance of CLARK was introduced, as a poster presentation, at the 5th
ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
(ACM-BCB), September 20-23, 2014 at Newport Beach, CA.
CLARK-S was presented at the 14 Workshop on Algorithms in Bioinformatics, Atlanta,
GA, September 2015.
== LICENCE
The source code of CLARK is distributed under the GNU GPL License. CLARK and its
variants (such as CLARK-l) are a free software: you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your option)
any later version.
CLARK and its variants are distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.
See the GNU General Public License for more details at http://www.gnu.org/licenses/
Copyright 2013-2016, Rachid Ounit (rouni001 at cs.ucr.edu).
== RELEASE
On 09/01/2014, the version 1.0 is available:
This version of CLARK is written in C++ and shell script. It is designed for
64-bit OS, and can be executed on the latest Linux and Mac systems.
On 02/20/2015, the version 1.1 is available:
Compared to v1.0, this version is more efficient to store the database in disk
(some non-necessary data were pruned, and thus a relative reduction of about 15%
can be observed for the disk space needed). The user can also observe the program
loads the database in RAM much faster than before. This version also allows to load
the database in multithreaded-task. Finally, the v1.1 uses different settings for
the sampling method used for the default mode, which leads to an increase of the
classification speed of about 5%.
On 04/15/2015, the version 1.1.1 is available:
Compared to v1.1, this version is more efficient to load the database from disk to
RAM. The relative reduction is about 16% in average. This improvement also implies
an increase of the classification speed. This release also includes scripts to
facilitate the classification of metagenomic samples (see section "CLASSIFICATION OF
METAGENOMIC SAMPLES").
On 04/22/2015, the version 1.1.2 is available:
This release includes scripts to produce the abundance estimation per target (i.e.,
the count and proportion of objects assigned to a target) with filtering possible
on the confidence score. Also, the user can get the density of assigned objects per
confidence score. The user can also pass to CLARK objects as compressed
files (GZ format) directly to "classify_metagenome.sh". Finally, bugs fixes and
code improvement are made.
On 06/03/2015, the version 1.1.3 is available:
This release extends existing features, simplifies output produced by the full mode
(to reduce the disk spaced needed for the results file). Bugs fixes and
code improvement are included.
On 06/15/2015, the version 1.2.1-beta is available:
This release contains the discriminative spaced k-mers (using multiple spaced-seed).
This version is still in the beta version while features and code improvements on
the RAM usage are on-going.
Bug fixes and code improvement are included.
On 12/11/2015, the version 1.2.2-beta is available:
The speed and RAM usage of the full and spaced mode are significantly improved:
i) the RAM usage is now scalable and the upper-bound is lower than the maximum
usage needed in previous versions;
ii) the speed is increased by a factor of 5 to 9, in singlethreaded tasks.
To analyze results, a script to plot the distribution of the assignments per gamma
score (as done with confidence score) is provided.
In addition, the option to create Krona-compatible input files
("http://www.biomedcentral.com/1471-2105/12/385") is available.
On 04/07/2016, the version 1.2.3 is available:
The classifier CLARK-S is finalized. Scripts to generate the targets definition
using the accession number instead of the GI number have been updated.
Additional scripts have been added to facilitate the creation and changes
of the customized databases. Code improvements and bug fixes are included.
This versatile method allows the classification of objects sequences against a
database of targets. Sequences of targets must be given by standard fasta and/or
fastq files, or eventually text file (two-column format, where each line is
<k-mer> <count>, for example in the case of the file "target.txt" (using 6-mers):
$ cat target.txt
ATATAT 1234
GCGCGC 435
ATTTTT 234
...
)
== COMPATIBILITY BETWEEN VERSIONS
Between v1.* and v1.0:
The version 1.1 (or higher) does use a different algorithm when storing and loading
the database, compared to v1.0. Therefore, databases files (TSK files) produced by
the version 1.0 are not compatible with v1.1. This is why it is needed to rebuild
the database using v1.1 (or newer version). When using v1.1 (or newer), three files are
created for the database.
Discriminative k-mers and their targets ID are stored respectively in the files of
extension *.ky and *.lb. Sizes of all buckets of the hashtable are stored in the
file of extension *.sz.
== SOFTWARE & SYSTEM REQUIREMENTS
1) EXTERNAL TOOLS & C++ COMPILER VERSION
unlike most of other metagenomic and genomic classifiers, CLARK does not require
Uny tool from the BLAST family, Jellyfish or other external tool. The main
requirement is a 64-bit operating system (Linux or Mac), and the GNU GCC to
compile version 4.4 or higher. Multithreading operations are assured by the openmp
libraries. If these libraries are not installed, CLARK will run in single-threaded
task. CLARK is expected to successfully run on recent Mac and Linux OS.
2) MEMORY
CLARK can be RAM consuming, especially when loading large database(s). Latest versions
since v1.1.1 require about 58 GB in RAM to load the database in default mode (built
from bacteria genomes, default NCBI/RefSeq) and about 156 GB to build that database,
using 31-mers. Using 20-mers instead, it requires 49 GB to load the database
(and 107 GB to build the database). About CLARK-S, the memory usage during the
classification is the highest (~101 GB) because the program loads in memory several
sets of spaced k-mers (from several spaced seed).
For the space required in disk, CLARK needs about 36 GB to store the database built
for bacteria (at the genus level).
If memory is a concern to you and you have small metagenomes then please consider
using the RAM-light version, CLARK-l. This tool is designed to run on 4 GB RAM
laptop, thus the memory consumption is the main objective of this tool.
CLARK-l can load data from bacteria and viruses databases within 4 GB.
If it can't then the user should change the parameter value for the gap
(see "-g <iteraton>" option, in "MANUAL & OPTIONS") and set the iteration to a
value higher than the default, 4.
IMPORTANT NOTE: Results given by CLARK-l should be seen as a draft (or first order
approximation) of results you would get by running CLARK or CLARK-S.
== INSTALLATION
First, download the zipped package of the latest version (v1.2.3), available from
the CLARK webpage ("Download tab"), http://clark.cs.ucr.edu.
Second, uncompress the tar.gz file ("tar -xvf CLARKV1.2.3.tar.gz"), then go to
the sub-directory "CLARKSCV1.2.3" and execute the installation script ("./install.sh").
The installation is done! You can now run CLARK and any of the provided scripts.
The installer builds binaries (CLARK, CLARK-l and CLARK-S, in the folder "exe" in
CLARKSCV1.2.3).
== SCRIPTS
After the installation, you can also notice that several scripts are available.
Especially:
- set_targets.sh and classify_metagenome.sh: They allow you to classify your
metagenomes against several database(s) (downloaded from NCBI or available
"locally" in your disk)
- estimate_abundance.sh: It computes the abundance estimation (count/proportion
of objects assigned to targets).
- evaluate_density_confidence.sh and evaluate_density_gamma.sh: These two scripts
take in input one or several CLARK results (containing confidence/gamma scores)
and output/plot the density of assignments per confidence score (or gamma score).
- buildSpacedDB.sh: It creates the sets of discriminate spaced k-mers from the
selected databases.
- clean.sh: This script will delete permanently all data related (generated and
downloaded) of the database directory defined in set_targets.h.
- resetCustomDB.sh: It resets the targets definition with sequences (newly
added/modified) of the customized database. Any call of this script must be
followed by a run of set_target.sh.
- updateTaxonomy.sh: To download the latest taxonomy data (taxonomy id,
accession numbers, etc.) from the NCBI website.
== INTRODUCTION ON CLARK USAGE
To introduce how to use CLARK, we first begin with some definitions:
- target: a reference sequence or a set of reference sequences (e.g., reads, contigs,
assemblies, etc.) of any length. In metagenomics, for a particular taxonomy rank r,
any target can be a genome, or genomes of a particular taxon defined at the level r.
see examples in next sections). In genomics, a target can be reads or the assemblies
of a chromosome (or any other genomic region of a genome).
Thus, a target is defined by the context of the classification.
- object: a sequence (e.g., a read, a contig, a scaffold/assembly, etc.), of any length,
to be classified against a set of targets defined at the same taxonomy rank (see
examples in next sections).
In general, target or object, sequences can be fasta or fastq files (or even spectrum),
see details in next sections.
We provide examples of how to classify a set of objects against a set of targets.
The general rule is the user must define the targets definitions (i.e., the
(reference sequence) <-> (target ID) association), pass objects to be classified,
files to store results and a directory to store the database (of target-specific k-mers).
This database is needed, because time-consuming computations are needed to
find and saved in disk target-specific k-mers. Once these data are saved in disk,
it will be loaded directly and quickly if CLARK needs them again.
We describe in detail and also provide examples in next sections. All options/parameters
available are detailed in the section "MANUAL & OPTIONS".
For metagenomics, scripts are provided to download/preprocess databases
from NCBI (i.e., genomes related to Bacteria, Viruses and Human), see section
"CLASSIFICATION OF METAGENOMIC SAMPLES".
== HOW TO CHOOSE THE K-MER LENGTH ?
CLARK is a k-mer based classification method, and the value of k is critical for the
classification accuracy and speed. The user should use any value between 19 and 23
in order to get high sensitivity. For high sensitivity, we recommend k = 20 or 21
(along with the full mode).
However, if the precision and the speed are the main concern, the user should use
any value between 26 and 32. However, the highest is k, the higher is the RAM usage.
As a good tradeoff between speed, precision, and RAM usage, we recommend k = 31 (along
with the default/express mode). The default k-mer length is 31.
In the version 1.*, CLARK supports any k-mer length up to 32.
== MANUAL & OPTIONS
In this section, we describe options directly available with the CLARK executable
(binary file in the "exe" directory).
CLARK offers several options to run the classification.
A typical command line to run CLARK (or CLARK-l/CLARK-S) looks like:
$ ./CLARK -k <kmerSize> -T <fileTargets> -t <minFreqTarget> -D <directoryDB/> -O <fileObjects> -o <minFreqObject> -R <fileResults> -m <mode> -n <numberofthreads> ...
Definitions of parameters:
-k <kmerSize>, k-mer length: integer, >= 2 and <= 32 (in version 1.*).
The default value for this parameter is 31, except for CLARK-l (it is 27).
For CLARK-S, k is set to 31 and the maximum number of mismatches is set to 9.
-T <fileTargets>, The targets definition is written in fileTargets: filename.
This is a two-column file (separated by space, comma or tab), such that for each line:
column 1: the filename of a reference sequence
column 2: the target ID (taxon name, or taxonomy ID, ...) of the reference sequence
(See example below).
This parameter is mandatory.
-t <minFreqTarget>, minimum of k-mer frequency/occurence for the discriminative k-mers: integer, >=0.
The default value is 0. For example, for 1 (or, 2), the program will discard any
discriminative k-mer that appears only once (or, less than twice).
-D <directoryDB/>, Directory of the database : pathname.
This parameter is mandatory.
-O <fileObjects>, file containing objects: filename.
This parameter is mandatory.
-P <file1> <file2>, Paired-end fastq files: filenames.
-o <minFreqObject>, minimum of k-mer frequency or occurence in objects: integer, >=0.
This option is only available in the spectrum mode (-m 3).
-R <fileResults>, file to store results: filename.
Results are stored in CSV format in the file <fileResults>.csv (the extension
".csv" is automatically added to the filename).
This parameter is mandatory.
-m <mode>, mode of execution: 0,1,2 or 3
0 (full): To get detailed results, confidence scores and other statistics.
This mode loads all discriminative (contiguous/spaced) k-mers.
This mode is available for CLARK, CLARK-S and CLARK-l.
1 (default): To get results summary and perform best trade-off between
classification speed, accuracy and RAM usage.
This mode loads half discriminative (contiguous/spaced) k-mers.
This mode is available for CLARK and CLARK-l.
2 (express): To get results summary with the highest speed possible.
This mode loads all discriminative (contiguous/spaced) k-mers.
This mode is available for CLARK, CLARK-S and CLARK-l.
3 (spectrum):To classify data that are given in spectrum form (i.e.,
sequence data are in a two-column file, <k-mer> <count>).
This mode is available for CLARK and CLARK-l.
-n <numberofthreads>, number of threads: integer >= 1.
The program runs in parallel for the classification and, with the option
'--ldm' on, the loading of the database into RAM.
-g <iterations>, "gap" or number of non-overlapping k-mers to pass when creating the
database (for CLARK-l only). The bigger the gap is, the lower the RAM
usage is. The default value is 4, but the user can increase this value if
needed to reduce the RAM usage but this will degrade the sensitivity.
-s <factor>, sampling factor value (for CLARK/CLARK-S only). If you want the program to load
in memory only half the discriminative (contiguous/spaced) k-mers then use
"-s 2". If you want a third of these k-mers then use "-s 3", etc.
The higher this factor is, the lower the RAM usage is.
The higher this factor is, the higher the classification speed/precision is.
However, our experiments show that the sensitivity can be quickly
degraded, especially for values higher than 3.
In the default mode, this factor is set to 2 because it represents a good
trade-off between speed, accuracy and RAM usage.
--tsk, (to request a detailed creation of the database (target specific k-mers files).
For each target, the program creates a file containing all its specific k-mers.
BEWARE: This option requires a high amount of disk space to complete.)
This option is no more supported.
--ldm, to request the loading of database file by memory mapped-file.
This option accelerates the loading time but it will require an
additional amount of RAM significant. This option also allows to
load the database in multithreaded-task.
--kso, to request a preliminary k-spectrum analysis of each object
(for mode 3 only). This analysis gives intervals of k-mers
frequency for which k-mers are assumed to be informative
of the object. This interval is reported in the results file.
--extended to request an extended output (result file), only for the full mode.
When a filename is required, we recommend absolute paths.
== LOW-LEVEL DESCRIPTION & EXAMPLES
This section describes main features and options for running directly the executable
CLARK (or CLARK-l/CLARK-S) built (in the folder "exe") after the installation. If you want
to use CLARK as a metagenome classier, we recommend you use the provided scripts
for that purpose and detailed in the section "CLASSIFICATION OF METAGENOMIC SAMPLES".
However, we recommend the reader to go through this section to become familiar with
CLARK's options/parameters.
If you work directly with CLARK and CLARK-l files (e.g., to classify BAC/transcript
to chromosomes) then we recommend you copy them to your working directory.
In this section, the working directory is /home/user1/bin/ and the binaries CLARK, CLARK-l
and CLARK-S have been copied in it. Indeed,
$ pwd
/home/user1/bin/
$ ls
CLARK CLARK-l CLARK-S
$
We describe in the following paragraph 1) how to create the targets definition with
examples for it, in the case CLARK is set to classify metagenomic samples. Then,
we describe how to set different modes, use paired-end reads and run multiple threads.
Finally, we explain how to use CLARK for assignning BAC/transcript to chromosome-arms/
centromeres.
1) Generalities, and Default Mode:
- Say, for simplicity, the user have ten genomes in its database and each genome is
stored into one fasta file and is located in the current working directory (say,
genome_1.fa, genome_2.fa, ..., genome_10.fa). Say that the first three genomes belong to
the same species S1, the next three genomes belong to the same species S2, and the last four
genomes belong to S3. And finally, say that the first six genomes belong to the same
genus G1, and the last four genomes belong to the same genus G2.
- Sequences (or objects) to be classified are three short single-reads all stored in
one fasta file, "objects.fa".
First, the user must define targets for the classification (the "targets_addresses.txt"
file). For metagenomics, for a classification at the species-level, then the user must
define targets accordingly. So, at the species level, the targets_addresses.txt file is:
$ cat targets_addresses.txt
/home/user1/bin/genome_1.fa S1
/home/user1/bin/genome_2.fa S1
/home/user1/bin/genome_3.fa S1
/home/user1/bin/genome_4.fa S2
/home/user1/bin/genome_5.fa S2
/home/user1/bin/genome_6.fa S2
/home/user1/bin/genome_7.fa S3
/home/user1/bin/genome_8.fa S3
/home/user1/bin/genome_9.fa S3
/home/user1/bin/genome_10.fa S3
$
In this case, there are three distinct targets (S1, S2 and S3) that the program
will use for the classification. Defining targets at the genus-level would be:
$ cat targets_addresses.txt
/home/user1/bin/genome_1.fa G1
/home/user1/bin/genome_2.fa G1
/home/user1/bin/genome_3.fa G1
/home/user1/bin/genome_4.fa G1
/home/user1/bin/genome_5.fa G1
/home/user1/bin/genome_6.fa G1
/home/user1/bin/genome_7.fa G2
/home/user1/bin/genome_8.fa G2
/home/user1/bin/genome_9.fa G2
/home/user1/bin/genome_10.fa G2
$
In this case, there are 2 distinct targets (G1 and G2). Results and performance of
CLARK have been analyzed at the genus and the species level (see the manuscript).
In all these examples, each genome file is given by the absolute path (recommended),
however the user can also use relative path. Since files are in the working
directory (where the binaries are), we can simply have:
$ cat targets_addresses.txt
./genome_1.fa G1
./genome_2.fa G1
./genome_3.fa G1
...
$
Naturally, G1 and G2 are just labels that the user is free to choose. The user
can rather use the full genus names (e.g., Staphylococcus and Streptococcus instead of
G1, and G2), or, as recommended, the taxonomy ID (i.e., 1279 and 1301 resp.).
The definition of targets is similar for genomics (the reader can see the paragraph
"6) Chromosome arms and Centromeres"). Defining targets for genomics or metagenomics
is equivalent.
Second, based on the definition of the targets, CLARK builds a database. This database
needs a directory to be created. We recommend the user to create a specific
directory for each targets definition. For this example, we use the directory "./DBD/".
This directory can be anywhere in your disk, and independantly to the directory you
chose for the source code.
Third, objects are short reads present in the file "objects.fa". Then, we want
the classification results to be stored in the file named "results". The results format is
CSV (see "RESULTS FORMAT" for more detail). If a file "results.csv" already exists then
its content will be erased.
Finally, to run CLARK, using 31-mers, in default mode, run:
$ ./CLARK -k 31 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results
Or using 20-mers:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results
Note that the results file will have the extension "csv".
If CLARK is run for the first time on the defined targets and the given k-mer length,
then CLARK builds first the database, which gathers all Target Specific K-mers (TSK).
If CLARK is run again on the same targets and same k-mer length, then it will directly
load TSK files previously produced. Once the classification is done,
the file results.csv is created.
1.1) Exclusion of discriminative k-mers of low frequency in the database
The user can request the removal of discriminative k-mers that have a frequency equal
or lower than a certain threshold (like 1, or 2, or, ...).
This can be done by using the option "-t", for which the default value is 0 (i.e., "-t 0").
For example, to exclude discriminative k-mers that "appear" only once, the command line
from the last example in 1) is :
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -t 1
To exclude discriminative k-mers that appear twice or less:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -t 2
See the "MANUAL & OPTIONS" section for more info.
1.2) Exclusion of k-mers of low frequency in the objects
Similarly, the user can also request to ignore k-mers in objects that have a frequency
equal or lower than a certain threshold (option "-o").
In version 1*, this option is available ONLY for the spectrum mode (mode 3), see
"MANUAL & OPTIONS".
2) Fast, Full, Spaced and Spectrum Mode:
Previous instructions used the default mode of CLARK (i.e., "-m 1"). To use the "express"
mode to get results more quickly, then simply add the option "-m 2". For the "full" mode
that will generate detailed results (including confidence scores), add "-m 0".
For example, using 20-mers, to run in express mode:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 2
In full mode:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 0
Note that
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 1
is equivalent to
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results
In the case objects are given as spectrum (a two-column file), the user should
indicate it with the option "-m 3":
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 3
IMPORTANT NOTES:
Whatever you use CLARK, CLARK-l or CLARK-S:
- The full mode (i.e., "-m 0") loads all discriminative k-mers in RAM and provides
confidence score for all assignments. Thus, it offers high sensitivity.
If your data are in the spectrum format, then the spectrum mode (-m 3) is a solution
for you.
- If your primary concern is the speed, and do not need detailed results, nor confidence
scores, then use the "express" mode (-m 2), the fastest mode.
For CLARK or CLARK-l:
- If you do not know what mode to use, then use the default mode (-m 1). It loads in RAM
about half discriminative k-mers stored in disk. Thus, the RAM usage is lower than that
of other modes, but its sensitivity is not the highest. However, the precision and speed
are high, and the default mode is a good compromise between speed, RAM usage and high
precision/sensitivity.
Finally:
- If you analyze a metagenomic sample from a poorly known microbial habitat (i.e.,
the default RefSeq database is likely not to contain a large fraction of genomes of
organisms present in your sample), for example, sea water, etc. then use CLARK-S
with full mode. CLARK-S uses multiple spaced k-mers instead of contiguous k-mers,
and it produces detailed results. However, this variant is the most RAM-consuming.
Unlike CLARK, results produced by CLARK-S are already filtered (i.e., assignments
with confidence score < 0.75 or gamma score < 0.06 are rejected).
However, you can use a stricter filtering to get more precise results
(see option of the script estimate_abundance.sh).
2.1) Running CLARK-S:
The current release exploits spaced k-mers of length 31 and weight 22. Before classifying your
metagenomic sample, the database of discriminative 31-mers (e.g., from bacterial genomes) and
then the database of discriminative spaced 31-mers (using the script "buildSpacedDB.sh")
must be created.
Step 0 (Set the database and its directory "DBD"):
$ ./set_targets.sh ./DBD/ bacteria --phylum
(or $ ./set_targets.sh ./DBD/ bacteria --species)
where "DBD" is the directory where you want to store bacteria genomes and create the database.
Step 1 (Build the database of discriminative 31-mers):
If the database files of 31-mers are already created for the database and the rank specified
in step 0 then you can skip this step.
This can be done by running any classification with k=31, for example:
$ ./classify_metagenome.sh -O sample.fa -R result
where sample.fa is some fasta file data. This operation is only needed to create the
database of discriminative 31-mers.
Step 2 (Create the databases of discriminative spaced k-mers):
$ ./buildSpacedDB.sh
This task will take several hours to create discriminative spaced k-mers for all the
3 spaced seeds (i.e., about 6 to 7 hours). This operation will write about 60 GBytes of
data on disk.
Finally, to classify your metagenome (for example, sample.fq) run:
$ ./exe/CLARK-S -T ./targets_addresses.txt -D ./DBD/ -O ./sample.fq -R ./result -m 0
or
$ ./classify_metagenome.sh -O sample.fq -R result --spaced
Note: If you decide to work with a different taxonomy rank and/or database then
you will need to repeat the step 0 and step 1.
3) Paired-end reads:
CLARK accepts paired-end reads (fastq files) as objects. Indicate it with the option "-P"
and the two filenames (CLARK will concatenate paired reads with matching ID).
An example of command line, using default mode, is:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -P ./sample1.fastq ./sample2.fastq -R paired.results
4) Multi-Objects (or multiple sample files):
The program can process several sample files at the same time (i.e., no need to reload the database
or restart the program for each sample).
To classify several fasta/fastq files under the same settings (e.g., mode, threads, etc.) ,
it is possible to pass all of them conveniently (i.e., without the need to reload
the database for each file). Say the user has 3 fasta files, located in the current working
directory, ./objects_1.fa, ./objects_2.fa and ./objects_3.fa. To pass these three files
at once, create a file (e.g., "objects.txt") containing filenames of these fasta files:
$ cat objects.txt
./objects_1.fa
./objects_2.fa
./objects_3.fa
$
The objects.txt file is like a "meta-objects" file. Then, create a meta-results file
(called "results.txt") such that the i-th line in it indicates the filename to store results
of the i-th fasta file indicated in "objects.txt":
$ cat results.txt
./results_1
./results_2
./results_3
$
./results_1.csv will contain results related to ./objects_1.fa, ..., and
./results_3.csv will contain results related to ./objects_3.fa.
Indeed, once the classification is done, results_1.csv, results_2.csv and results_3.csv are created.
This way assures that the user can pass any amount of objects files in a scalable fashion.
Note: the "results.txt" file can be just a duplicate (or a simple copy) of "objects.txt".
Since the program create results files with extension "csv", so you can create these results
files by using the filename of the fastq/fasta file as prefix.
For example, if you run:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.txt -R ./objects.txt
The program will store the results files as ./objects_1.fa.csv, ./objects_2.fa.csv and ./objects_3.fa.csv
Same for paired-end reads:
$ ./CLARK -k 20 -T ./targets_addresses.txt -O ./DBD/ -P ./paired1.txt ./paired2.txt -R ./results.txt
where files paired1.txt aand paired2.txt contain addresses of paired-end reads:
$ cat paired1.txt
./sample1_R1.fq
./sample2_R1.fq
./sample3_R1.fq
$ cat paired2.txt
./sample1_R2.fq
./sample2_R2.fq
./sample3_R2.fq
$ cat results.txt
./results_1
./results_2
./results_3
Once computations done, results for the paired-end reads for each of the three samples are stored in:
results_1.csv, results_2.csv and results_3.csv.
5) Multi-threading:
CLARK can exploit parallel threads to increase the computations speed.
The option to add is "-n <Number of threads>". For example, using 12 threads:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.txt -R ./results.txt -n 12
6) Chromosome arms and Centromeres
We understand that users can work with chromosome arms as targets. In this case,
centromeres can be inferred by CLARK. To allow that, the user must indicate in the
targets_addresses.txt file sequences that can be related to centromeres.
For example, in the case of barley, consider 6 chromosome arms defined by 6 files
(./chr_arm1L.fa, ./chr_arm1S.fa, ..., ./chr_arm3L.fa, ./chr_arm3S.fa). If the user
does not want to define centromeres, then targets_addresses.txt would look like:
$ cat targets_addresses.txt
./chr_arm1L.fa 1HL
./chr_arm1S.fa 1HS
./chr_arm2L.fa 2HL
./chr_arm2S.fa 2HS
./chr_arm3L.fa 3HL
./chr_arm3S.fa 3HS
$
If the user wants the program to infer centromere of 2H only, then it will
indicate the centromere name for it (2HC) at each target related to the chromosome
2H (so ./chr_arm2L.fa, and ./chr_arm2S.fa) on the same line:
$ cat targets_addresses.txt
./chr_arm1L.fa 1HL
./chr_arm1S.fa 1HS
./chr_arm2L.fa 2HL 2HC
./chr_arm2S.fa 2HS 2HC
./chr_arm3L.fa 3HL
./chr_arm3S.fa 3HS
$
If all centromeres must be inferred, then targets_addresses.txt would look like:
$ cat targets_addresses.txt
./chr_arm1L.fa 1HL 1HC
./chr_arm1S.fa 1HS 1HC
./chr_arm2L.fa 2HL 2HC
./chr_arm2S.fa 2HS 2HC
./chr_arm3L.fa 3HL 3HC
./chr_arm3S.fa 3HS 3HC
$
When defining chromosome arms labels, we strongly recommend that the user follows the
naming convention used of this example:
For the N-th chromosome, "NS" is the label for the short arm and "NL" is the
label for the long arm and (same prefix), so "NC" is the label for the centromere.
IMPORTANT NOTE:
Examples of command lines above using fasta files are valid as well for
fastq files, except for the paired-end reads case.
== CLASSIFICATION OF METAGENOMIC SAMPLES
We provide several scripts to facilitate the classification in the context of
metagenomics. CLARK can preprocess databases of bacteria, viruses or human (downloaded
from NCBI) or a customized set of genomes.
First, we present here two scripts, "set_targets.sh" and "./classify_metagenome.sh" that
work together.
Second, we present the script to get the abundance estimation (count and proportion for each
target identified), "estimate_abundance.sh", from one or several results file(s).
1) Setting and classification
1.1) Step I: Setting targets
After the installation (./install.sh), the user must create a directory to store all
reference sequences (bacteria, viruses, human and custom). For all our examples below,
we name this directory path in a generic way <DIR_DB/> for clarity. This directory
can be anywhere in your disk(s).
Then, the user must indicate what database(s) to consider for the classification among
bacteria, viruses, human and/or custom.
For example, only bacteria genomes:
$ ./set_targets.sh <DIR_DB/> bacteria
To work with bacteria, viruses and human:
$ ./set_targets.sh <DIR_DB/> bacteria viruses human
To classify against a custom database:
The user will need to copy/move the sequences (fasta files with accession numbers in the
header, i.e., ">accession.number ..." or ">gi|number|ref|accession.number| ...",
and one fasta file per reference sequence) in the directory "Custom", inside <DIR_DB/>,
and then run set_targets. In order words, the user must:
(1) create the directory "Custom" inside <DIR_DB/> (if it does not exist yet);
(2) copy/move the sequences of interest in the Custom;
(3) run:
$ ./set_targets.sh <DIR_DB/> custom
or for example, when combining the bacteria genomes with the Custom sequences:
$ ./set_targets.sh <DIR_DB/> bacteria custom
In general case (when the user selects bacteria, viruses and/or human),
if the directory <DIR_DB/> is empty, then the script will download all
the selected database(s) and also data of the taxonomy tree, from NCBI.
Once the sequences are found or downloaded in the directory, the script will build
the targets for a given taxonomy rank.
The default taxonomy rank is species. To use a different taxonomy rank, for example,
genus, the command line is (from the example selecting bacteria, viruses and human):
$ ./set_targets.sh <DIR_DB/> bacteria viruses human --genus
In the current release, the user can choose between six ranks (species to phylum):
--species (the default value), --genus, --family, --order, --class or --phylum.
Indeed, the strength of CLARK is to be able to classify quickly and accurately
metagenomic samples by using only one taxonomy rank. So as a general rule when
classifying metagenomic reads:
Consider first the genus or species rank, then if a high proportion of reads
cannot be classified, reset your targets definition at a higher taxonomy rank
(e.g., family or phylum).
Once set_targets.sh is finished, the user can proceed to the step II. However, if
the user wants to modify the selected databases and/or taxonomy rank, then he/she will
need to run again set_targets.sh with updated parameters before proceeding to step II.
1.2) Step II: Running the classification
The script to run the classification of a metagenomic sample against the database(s)
previously set in step I is "classify_metagenome.sh".
For your convenience, this script runs the executable CLARK, CLARK-l or CLARK and allows
you to pass few parameters.
For example, say objects to be classified are reads in "sample.fa" (e.g., located in the
current directory), and results to be stored in "result.csv". A basic command line is:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result
As explained in the section "MANUAL & OPTIONS", thanks to identifiers "-O" and "-R",
the script will pass the objects file "sample.fa" and results will be stored
in "./result.csv". Objects are classified against the targets and the taxonomy rank
defined by the last execution of ./set_targets.sh.
IMPORTANT NOTES:
- The targets definition is automatically passed to CLARK in step II. The file has been
computed by set_targets.sh.
- The script set_targets.sh assumes that each reference file from bacteria, viruses or
custom database contains an accession numberr (in the RefSeq records format:
i.e., ">accession.number ..." or ">gi|number|ref|accession.number| ..." ).
If an accession number is missing in a file then this file will not be used
for the classification.
- set_targets.sh also maps the accession number found in each reference sequence to its
taxonomy ID based on the latest NCBI taxonomy data. If a mapping cannot be made for a
given sequence, then it will NOT be counted and excluded from the targets definition.
The total number of excluded files is prompted in the standard output, and all files that
have been excluded are reported in the file "files_excluded.txt" (located in the specified
database directory (i.e., "./DBD/").
If some files are excluded, then it will probably mean that they have been removed for
curations for example (visit the RefSeq FAQ webpage) or maybe because your local taxonomy
information are no nore up-to-date (see next point).
- You can update your local taxonomy database thanks to the script "updateTaxonomy.sh"
You can use this script before running set_targets.sh.
- If the user wants to work with a different customized database (for example, by removing
or adding more sequences of interest in the Custom folder) then the targets definition
must be reset. We made it simple with the script "resetCustomDB.sh":
After the sequences in the Custom folder have been updated, just run:
$ ./resetCustomDB.sh
Then, run set_target.sh with the desired settings.
- The database files (*.ky, *.lb and *.sz) will be created inside a subdirectory of the
specified database directory in step I (i.e., "./DBD/") by classify_metagenome.sh.
- The default values (the k-mer length, the mode, the number of threads, etc.) are used
if not specified by the user, just like indicated in the previous section.
- The script classify_metagenome.sh still allows you to pass customized parameters and
options, similarly to the previous section. classify_metagenome.sh follows options defined
in "MANUAL & OPTIONS"(see below some examples). So you can change the k-mer length,
the number of parallel threads, mode, etc.
We present below some examples of customized classification using classify_metagenome.sh.
a) To use 20-mers (instead of 31-mers):
$ ./classify_metagenome.sh -O ./sample.fa -R ./result -k 20
b) To request the full mode:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result -m 0
c) To classify in full mode multiple sample files (single-end reads):
$ ./classify_metagenome.sh -O ./samples.txt -R ./samples.txt -m 0
where, the file "samples.txt" contains the addresses of all the sample files to be run:
$ cat samples.txt
./sample1.fa
./sample2.fa
./sample3.fa
./sample4.fastq
./sample5.fq
./sample6.fq
...
d) To classify in full mode multiple sample files (paired-end reads):
$ ./classify_metagenome.sh -O ./samples.R.txt ./samples.L.txt -R ./samples.R.txt -m 0
where, files "samples.R.txt" and "samples.L.txt" contain the addresses of all the fastq files
(right and left) to be run:
$ cat samples.R.txt
./sample1.R1.fq
./sample2.R1.fq
./sample3.R1.fq
...
$ cat samples.L.txt
./sample1.R2.fq
./sample2.R2.fq
./sample3.R2.fq
...
Observe in this example that the order of right and left paired-end reads of each sample must be
preserved in "samples.R.txt" and "samples.L.txt" .
e) To request the express mode, and 8 threads:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result -m 2 -n 8
f) To request the full mode, with gzipped objects file, and using 8 threads:
$ ./classify_metagenome.sh -O ./sample.fa.gz -R ./result -m 0 -n 8 --gzipped
Another example, in default mode, for classifying paired-end reads (./sample1.fastq
and ./sample2.fastq):
$ ./classify_metagenome.sh -P ./sample1.fastq ./sample2.fastq -R ./result
Notes:
This script can run CLARK-l instead of CLARK, for workstations with limited RAM.
Then, the user can indicate it with the option "--light". For example:
$ ./classify_metagenome.sh -P ./sample1.fastq ./sample2.fastq -R ./result --light
This script can run CLARK-S instead of CLARK, if the database files of discriminative
of spaced k-mers have been built. To use CLARK-S, the option is "--spaced". For example:
$ ./classify_metagenome.sh -P ./sample1.fastq ./sample2.fastq -R ./result --spaced
g) To run CLARK-S with full mode and using 8 threads:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result -m 0 -n 8 --gzipped --spaced
h) To run CLARK-S with express mode and using 8 threads on a gzipped file:
$ ./classify_metagenome.sh -O ./sample.fa.gz -R ./result -m 2 -n 8 --spaced
If you want to run CLARK-S but with a much lower RAM usage then you can decide
to download only half the discriminative spaced k-mers in memory using "-s 2".
For example:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result --spaced -s 2
Note:
run "./classify_metagenome.sh" in the terminal to prompt the help/usage describing
options and parameters.
2) Abundance estimation
The script "estimate_abundance.sh" can analyze CLARK results of a metagenomic sample,
and can provide for each target identified, its scientific name, taxonomy id, lineage
(superkingdom, phylum, class, order, family, genus), its count and proportion of objects
assigned to it. This script also allows to apply some filtering conditions (based on
the confidence score or gamma score of assignments) to obtain a stricter estimation.
The output format of estimate_abundance.sh is CSV.
For example, say a metagenomic sample contains 100 reads, results by CLARK (full mode)
indicates that 20 reads (20%) are assigned to the target T1, 70 (70%) are assigned to T2,
and 10 (10%) are not assigned ("NA"). Thus, the abundance estimation program reports
a count of 20 (or 22.2% of the assigned reads) for T1, 70 (or 77.7%) for T2, and a
count of 10 for the category "UNKNOWN" (collecting all unassigned reads or
assignments that do not satisfy filtering conditions).
We give below examples of command lines.
Parameters and options of this script are:
estimate_abundance.sh -c <minConfidenceScore> -g <minGamma> -D <Directory_Path> -F <result1>.csv <result2>.csv ... <result_n>.csv -a <minAbundance> ...
Definition of parameters:
-c <minConfidenceScore> To filter assignments based on their confidence score (if available)
using the threshold value minConfidenceScore (a value between 0.5 and 1.0).
The abundance estimation for each target will count only
assignments with a confidence score higher than minConfidenceScore.
Assignments that have a confidence score lower than minConfidenceScore
will be marked as unclassified and so counted in the
category UNKNOWN in the abundance estimation report.
The default value is 0.5.
-g <minGamma> To filter assignments based on their gamma score (if available) using the
threshold value minGamma (a value between 0 and 1.0).
The abundance estimation for each target will count only
assignments with a gamma score higher than minGamma.
Assignments that have a gamma score lower than minGamma
will be marked as unclassified and so counted in the
category UNKNOWN in the abundance estimation report.
The default value is 0.
-D <Directory_Path> The directory path of the database (the same you indicated when calling
set_targets.sh). This parameter is required to load scientific names of
all targets ONLY if you pass results of a metagenomic sample.
-F <result1>.csv ... <result_n>.csv results file or list of results files produced by CLARK.
Important Note: You can pass a results file produced by any mode of
execution of CLARK (full, express, spectrum, default),
but if you pass several files, make sure they all have
been produced under the same mode. For example, if you
pass two files result1.csv and result2.csv then result1.csv
and result2.csv should be produced through the same mode
(e.g., both by full mode).
-a <minAbundance(%)> To filter abundance estimations that are higher than a certain threshold,
minAbundance. minAbundance is a percentage value (between 0 and 100).
--highconfidence To automatically count only high confidence assignments for the
abundance estimation. This option is equivalent to "-c 0.75 -g 0.03".
--krona To export results in a 3-column file for the Krona tool
(Hierarchical data browser, Ondov et al, BMC Bioinformatics, 2011,
doi: 10.1186/1471-2105-12-385. https://github.com/marbl/Krona/wiki).
With this option on, the program will create the file 'results.krn' containing
the unnormalized abundance from CLARK's assignments (with the filtering options if any).