Frac-KMC is a tool that generates FracMinHash sketches from FASTA/FASTQ files. It is a modified version of the k-mer counting tool KMC (hence the name).
KMC is an extremely fast, low-memory k-mer counting tool. It uses minimizers, multiple bins, and multiple threads to count k-mers quickly. A version of KMC modified to compute FracMinHash sketches should therefore be able to compute them very quickly as well. FracMinHash sketches are traditionally computed with sourmash (the sourmash sketch command). Frac-KMC is an attempt to make sketching faster.
Initial investigations gave the following results. We ran sourmash sketch and fracKmcSketch to generate FracMinHash sketches of the human reference genome (GRCh38) for various scaled values and k-mer sizes. The running times are compared below.
The comparison shows that Frac-KMC computes the same sketch up to 5.7 times faster.
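To try a similar timing comparison on your own data, a small helper along the following lines may be enough. This is only a sketch: time_cmd is a hypothetical function name, the file names are placeholders, and it measures whole seconds of wall-clock time, which is not necessarily how the numbers above were obtained.

```shell
# Hypothetical helper: run a command, discard its output, and print the
# elapsed wall-clock time in whole seconds. Returns the command's exit status.
time_cmd() {
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    status=$?
    end=$(date +%s)
    echo "$((end - start))"
    return "$status"
}

# Example calls (genome.fa and the output names are placeholders):
# time_cmd fracKmcSketch genome.fa genome_frackmc.sig --ksize 21 --scaled 1000 --seed 42
# time_cmd sourmash sketch dna genome.fa -p k=21,scaled=1000 -o genome_sourmash.sig
```

For runs that finish in a few seconds, a finer-grained timer would give a more meaningful ratio.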
Frac-KMC is written to produce FracMinHash sketches that are compatible with sourmash. After computing a sketch with Frac-KMC, you can use it as input to sourmash compare. We have not yet tested compatibility with other sourmash commands, but we hope they will work too.
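As an illustration of feeding sketches to sourmash compare, the wrapper below passes two or more sketch files to that command and writes the similarity matrix as CSV. The function name compare_sketches and the file names in the example call are hypothetical. The --csv option belongs to sourmash compare itself.

```shell
# Hypothetical wrapper: compare two or more sketch files with sourmash
# and write the pairwise similarity matrix to a CSV file.
compare_sketches() {
    out_csv="$1"
    shift
    if [ "$#" -lt 2 ]; then
        echo "usage: compare_sketches <out.csv> <sketch1> <sketch2> [...]" >&2
        return 1
    fi
    sourmash compare "$@" --csv "$out_csv"
}

# Example call (file names are placeholders):
# compare_sketches similarity.csv genome_a.sig genome_b.sig
```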
For now, Frac-KMC has only been compiled for Linux systems. We are working on versions for other systems to see if we can make a release.
The easiest way to obtain the executables is by
- downloading the three executables in wrappers/sketch/bin,
- adding execution permission to these three executables, and
- adding the directory that contains the executables to your PATH variable.
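The last two steps could be scripted roughly as follows. This is a minimal sketch: setup_frac_kmc is a hypothetical helper name, and the directory in the example call is an assumption about where you saved the downloaded files.

```shell
# Hypothetical helper: make every file in a Frac-KMC bin directory executable
# and put that directory at the front of PATH for the current shell session.
setup_frac_kmc() {
    bin_dir="$1"
    if [ ! -d "$bin_dir" ]; then
        echo "not a directory: $bin_dir" >&2
        return 1
    fi
    chmod +x "$bin_dir"/*        # add execution permission
    PATH="$bin_dir:$PATH"        # make the executables discoverable
    export PATH
}

# Example call (the path is an assumption about your download location):
# setup_frac_kmc "$HOME/frac-kmc/wrappers/sketch/bin"
```

The PATH change only lasts for the current shell session. To keep it, add an equivalent export line to your shell startup file.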
fracKmcSketch <fasta/fastq_filename> <sketch_name> --ksize 21 --scaled 1000 --seed 42
This command will create a sketch from the FASTA/FASTQ file using 21-mers, a scaled value of 1000, and 42 as the seed for the hash function. The resulting sketch should be compatible with a sketch computed using sourmash sketch dna input_filename -p k=21,scaled=1000 -o sketch_name.
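If you have several input files, the same invocation can be wrapped in a loop. The helper below is a sketch under assumptions: sketch_all is a hypothetical name, and the .sig output suffix is chosen here for illustration only, since Frac-KMC is not documented above as requiring any particular extension.

```shell
# Hypothetical helper: sketch every input file with the parameters used in
# the command above, writing one sketch per input into an output directory.
sketch_all() {
    out_dir="$1"
    shift
    mkdir -p "$out_dir" || return 1
    for input in "$@"; do
        base=$(basename "$input")
        # "${base%.*}" drops the last extension, e.g. genome.fasta -> genome
        fracKmcSketch "$input" "$out_dir/${base%.*}.sig" \
            --ksize 21 --scaled 1000 --seed 42 || return 1
    done
}

# Example call (directory and file names are placeholders):
# sketch_all sketches genomes/a.fasta genomes/b.fastq
```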
Frac-KMC is not associated with any manuscript yet, but if you use Frac-KMC, please make sure to cite the original KMC: