ESS-Color is a bioinformatics tool for constructing a compressed representation of sets of k-mer sets (i.e., a compressed colored de Bruijn graph, or cdBG).
- Linux operating system (64 bit)
- GCC >= 4.8 or a C++11 capable compiler
- Snakemake
- Git
- CMake 3.12+
- Rust (for ggcat)
- KMC
First, install all the prerequisites and make sure their executables are in your PATH. Then, install the additional executables from source:
git clone https://github.com/medvedevgroup/ESSColor.git
cd ESSColor
bash compile.sh
You can move or copy ALL the executables in ESSColor/bin to a bin directory that is already in your PATH. For instance, if /usr/bin is already in your PATH, run the command mv ESSColor/bin/* /usr/bin to move all the ESS-Color executables there. Alternatively, add the location of ESSColor/bin to your PATH.
ESS-Color uses a modified implementation of ESS-Compress: the unitig construction step of ESS-Compress is replaced by GGCAT, which has an optimized implementation. To install ggcat, first install Rust.
To install rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup toolchain install nightly
To install ggcat:
git clone https://github.com/algbio/ggcat --recursive
cd ggcat/
cargo install --path crates/cmdline/ --locked
If the current ggcat version does not work with ESS-Color, please use the following commit (tested during release of manuscript):
git clone https://github.com/algbio/ggcat
cd ggcat/
git checkout dd64634a27467b9e56c8f7aad619eae7f4e7917a
git submodule init
git submodule update --recursive
cargo install --path crates/cmdline/ --locked
The binary is automatically copied to $HOME/.cargo/bin.
Syntax: ./essColorCompress [parameters]
mandatory arguments:
-k [int] k-mer size (must be >=4)
-i [input-file] Path to input file. Input file is a single text file containing the list of multiple fasta/fastq files (one file per line)
-o [output-dir] Path to output directory. [warning: this directory is also used as temp directory, so make sure it does not contain input files]
optional arguments:
-a [int] Default=1. Sets a threshold X, such that k-mers that appear less than X times in the input dataset are filtered out.
-j [int] Default=1. Number of threads.
-p [output-prefix] Default="esscolor". Prefix of output compressed cdbg.
Upon successful completion, the output directory will contain the compressed colored dBG file <output-prefix>.tar.gz.
Syntax: ./essColorDecompress [parameters]
mandatory arguments:
-i [input-file] compressed cdBG generated by `essColorCompress`.
optional arguments:
-h Print this Help
The decompressed folder contains 3 files:
- simplitigs.fa
  - A FASTA file with the set of simplitigs corresponding to the union ESS.
- meta.txt
  - A text file with a header indicating the k-mer size, followed by C rows, each giving the name of one sample.
- matrix.txt
  - The ordered color matrix in plain text.
$ cd example/
$ mkdir -p output_test
Let's say your 4 gzipped FASTA files are stored in the folder example/mini_k18c4m7, named
sample0.fa.gz
sample1.fa.gz
sample2.fa.gz
sample3.fa.gz
If you wish to run essColorCompress on all 4 ".fa.gz" files, first make a list named list_mini_k18c4m7 containing the absolute path of each of the 4 files, one per line.
$ ls $PWD/mini_k18c4m7/*.fa.gz > list_mini_k18c4m7
Now, to compress this list using 8 threads and k-mer size 18, writing the output to the directory output_test/, run the following command:
$ essColorCompress -i list_mini_k18c4m7 -k 18 -o output_test/ -j 8
Upon successful completion, the output directory will contain a file called esscolor.tar.gz, which is the compressed colored dBG.
Run $ essColorDecompress -i esscolor.tar.gz
Output matrix.txt contains the non-labeled color matrix (ESS order). Output simplitigs.fa contains the simplitigs in FASTA format.
Let's look at the first simplitig
$ cat simplitigs.fa | head -n 2
The first simplitig looks like this:
>
AAAAACAAAAAAAAAAAATTT
Let's look at the first 4 rows of the non-labeled color matrix
$ head -n 4 matrix.txt
The first 4 rows of the non-labeled color matrix look like this:
1100
0001
0001
0001
The color vectors in text should be read in MSB order, i.e. the leftmost bit corresponds to the first sample. So, color vector 1100 indicates that the first k-mer AAAAACAAAAAAAAAAAA is present in sample0.fa.gz and sample1.fa.gz and absent from the other two. The 2nd to 4th k-mers (AAAACAAAAAAAAAAAAT, AAACAAAAAAAAAAAATT, AACAAAAAAAAAAAATTT) are present only in sample3.fa.gz.
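The correspondence between the k-mers of a simplitig and the rows of the color matrix can be sketched in Python. This is only an illustration and not part of ESS-Color: the function names `kmers_of` and `samples_in` are hypothetical, and the sketch assumes that the rows of matrix.txt follow the k-mers of the simplitigs in left-to-right order, as in the example above.

```python
def kmers_of(simplitig, k):
    """Return the k-mers of a simplitig in left-to-right order."""
    return [simplitig[i:i + k] for i in range(len(simplitig) - k + 1)]


def samples_in(color_row, sample_names):
    """Return the samples whose bit is '1' (leftmost character = first sample)."""
    return [name for bit, name in zip(color_row, sample_names) if bit == "1"]


# Values taken from the example above.
samples = ["sample0.fa.gz", "sample1.fa.gz", "sample2.fa.gz", "sample3.fa.gz"]
simplitig = "AAAAAC" + "A" * 12 + "TTT"  # first simplitig, 21 bases
rows = ["1100", "0001", "0001", "0001"]  # first 4 rows of matrix.txt

for kmer, row in zip(kmers_of(simplitig, 18), rows):
    print(kmer, samples_in(row, samples))
```

A simplitig of length 21 with k = 18 yields 21 - 18 + 1 = 4 k-mers, which is why the first 4 rows of the matrix belong to this simplitig.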
If you are only interested in obtaining the color matrix from a list of KMC databases, you can use the "genmatrix" module. (WARNING: In the current version of genmatrix, the k-mer size must be at most 32. To support k>32, we use an alternative pipeline using joinCounts.)
genmatrix [OPTION...]
-c, --count-list arg [Mandatory] Path to KMC database files. One line per database
-d, --debug-verif Debug flag to verify if the output corresponds to the input (Time consuming).
-o, --outmatrix arg [Mandatory] Path to the output color matrix
-l, --spss arg [Mandatory] Path to the corresponding union SPSS
-s, --strout String output
Command example for a 100 ecoli matrix:
$ genmatrix -c db_list.txt -o matrix.bin -l kmers.bin
To generate matrix in plain text
$ genmatrix -c db_list.txt -l simplitigs.fa -o matrix.txt -s
To generate matrix in binary
$ genmatrix -c db_list.txt -l simplitigs.fa -o matrix.bin
The file db_list.txt must contain the paths to the KMC databases, one path per line. The paths can be absolute or relative to the execution directory. The file simplitigs.fa must contain the same k-mers, de-duplicated, in FASTA format.
The file matrix.bin contains the color matrix. The matrix has one row per k-mer and C columns (one per sample). The columns are in the same order as the databases in the db_list.txt file. In string format, rows are separated by '\n' characters. Each row is composed of C characters (100 in this example), each 0 or 1 depending on the absence or presence of the row's k-mer in the column's sample.
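As a small illustration of the string format, the following Python sketch checks that every row has one 0/1 character per sample and counts how many k-mers each sample contains. The function name `count_per_sample` is hypothetical and the sketch is not part of genmatrix; it assumes only the layout described above.

```python
def count_per_sample(lines, num_samples):
    """Count, for each column (sample), the rows (k-mers) whose bit is '1'."""
    counts = [0] * num_samples
    for line_number, line in enumerate(lines, start=1):
        row = line.strip()
        if not row:
            continue  # tolerate a trailing empty line
        if len(row) != num_samples or set(row) - {"0", "1"}:
            raise ValueError(f"malformed row at line {line_number}: {row!r}")
        for column, bit in enumerate(row):
            if bit == "1":
                counts[column] += 1
    return counts


# Usage with a plain-text matrix file:
# with open("matrix.txt") as handle:
#     print(count_per_sample(handle, num_samples=100))
```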
In binary format, a row is a large enough multiple of 64 bits.
For our 100 samples, a row is composed of 128 bits (16 Bytes).
The xth bit of the yth byte correspond to the sample
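The row size in binary format can be worked out with a short Python sketch. It assumes that "a large enough multiple of 64 bits" means the smallest multiple of 64 that is at least C; the function names are hypothetical and are not part of genmatrix.

```python
def row_size_bits(num_samples):
    """Smallest multiple of 64 bits that can hold one bit per sample (assumption)."""
    return ((num_samples + 63) // 64) * 64


def row_size_bytes(num_samples):
    """Row size in bytes, derived from the size in bits."""
    return row_size_bits(num_samples) // 8


print(row_size_bits(100), row_size_bytes(100))
```

For the 100 samples of the example this gives 128 bits, i.e. 16 bytes, in agreement with the figures above.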
The file kmers.bin contains the k-mer list corresponding to the matrix. In string format, there is one k-mer per line.
In binary format, all the values in the file are 64 bits wide, each stored as 8 bytes in little-endian order. The first value is k, the second is the number n of k-mers, and the following n values are the k-mers.
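The binary layout just described can be read with Python's `struct` module. This is a minimal sketch, not code from ESS-Color: the function name `parse_kmer_values` is hypothetical, the values are assumed to be unsigned, and the k-mers are returned as raw integers because their nucleotide encoding is not specified here.

```python
import struct


def parse_kmer_values(data):
    """Parse the bytes of a binary k-mer file: k, n, then n k-mer values.

    Every value is read as an unsigned 64-bit little-endian integer
    (the unsigned interpretation is an assumption).
    """
    if len(data) < 16:
        raise ValueError("file too short to contain k and n")
    k, n = struct.unpack_from("<QQ", data, 0)
    if len(data) < 16 + 8 * n:
        raise ValueError("file shorter than the announced number of k-mers")
    kmers = list(struct.unpack_from(f"<{n}Q", data, 16))
    return k, kmers


# Usage with a file on disk:
# with open("kmers.bin", "rb") as handle:
#     k, kmers = parse_kmer_values(handle.read())
```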
If you use ESS-Color in your research, please cite:
- Amatur Rahman, Yoann Dufresne and Paul Medvedev, Compression algorithm for colored de Bruijn graphs, bioRxiv 2023.05.12.540616; doi: https://doi.org/10.1101/2023.05.12.540616