v2.0.0
Pre-release
vcfdist now supports Structural Variants (SVs)
Summary
Numerous performance and memory footprint improvements now make it feasible to evaluate structural variants (SVs) with vcfdist, although the maximum variant size is still limited. With 64 GB RAM and 56 cores, we were able to evaluate a whole human genome in 00h:56m, 03h:40m, or 07h:32m by limiting the maximum variant size to 1 kbp, 5 kbp, or 10 kbp, respectively. For the 10 kbp evaluation, the largest supercluster was almost 40 kbp, which nearly maxed out our 64 GB RAM. Memory usage for realignment and precision-recall calculations is still O(n*n).
Improvements
Variants
- if two variants occur at the same position, force the variant that consumes a reference base to occur second
- left shifting pass after clustering/realignment, in case variants at the start of a cluster can be shifted further left
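The same-position ordering rule can be illustrated with a small comparator. This is only a sketch, not vcfdist's actual code: the `Variant` struct, `consumes_ref`, `variant_before`, and `sort_variants` are hypothetical names, and it assumes a representation in which a pure insertion has an empty reference allele.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical variant record (assumption: a pure insertion has an empty ref).
struct Variant {
    int pos;          // 0-based reference position
    std::string ref;  // reference allele
    std::string alt;  // alternate allele
};

// True if the variant consumes at least one reference base.
bool consumes_ref(const Variant& v) { return !v.ref.empty(); }

// Order by position; at equal positions, a variant that consumes no
// reference base sorts before one that does.
bool variant_before(const Variant& a, const Variant& b) {
    if (a.pos != b.pos) return a.pos < b.pos;
    return !consumes_ref(a) && consumes_ref(b);
}

// Stable sort keeps the input order of variants that compare equal.
void sort_variants(std::vector<Variant>& vars) {
    std::stable_sort(vars.begin(), vars.end(), variant_before);
    assert(std::is_sorted(vars.begin(), vars.end(), variant_before));
}
```

With this ordering, an insertion and a deletion reported at the same position are always processed insertion first.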
Clustering
- clusters are merged as far as possible leftwards and rightwards during each iteration (previously at most one leftwards merge)
- iterative doubling of reach calculation to reduce unnecessary work/allocation
- fixed an erroneous loop in WF SWG max reach; it is much faster now
- removed pointers from WF SWG max reach, since they're unused
- WF SWG max reach now requires O(n) memory, instead of O(n*n)
- added multi-threaded clustering, with one thread per haplotype on each contig
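One plausible reading of "iterative doubling of reach calculation" is: compute the reach inside a small window first, and only double the window when the reach hits its edge. The sketch below assumes that interpretation. `ReachFn` and `max_reach_doubling` are hypothetical names, the callback stands in for the real WF SWG max reach computation, and the early exit assumes the reach does not change once it stops strictly inside the window.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>

// Hypothetical callback: how far the alignment can extend (in bases) when
// only `window` bases are examined. Assumed never to exceed `window`.
using ReachFn = std::function<int(int window)>;

// Grow the examined window geometrically instead of allocating for the
// worst case up front.
int max_reach_doubling(const ReachFn& reach_in_window,
                       int start_window, int max_window) {
    assert(start_window > 0 && start_window <= max_window);
    int window = start_window;
    while (true) {
        int reach = reach_in_window(window);
        // Stop when the reach ends strictly inside the window (assumed
        // final), or when the window cannot grow any further.
        if (reach < window || window == max_window) return reach;
        // Double the window, capped at max_window (avoids int overflow).
        window = (window <= max_window / 2) ? window * 2 : max_window;
    }
}
```

Most clusters have a short reach, so under these assumptions they finish in the first one or two iterations and never pay for a large window.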
Precision/Recall
- converted from maps back to matrices, for improved efficiency and memory footprint
- pointers now stored in uint8_t for reduced memory usage
- only calculated for the best phasing, which reduces footprint by 2x
- added multi-threaded precision-recall calculation
- work-balancing for multi-threading, sorting by supercluster size and spawning new threads for each alignment on large superclusters
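To make the matrix and uint8_t points concrete, here is a minimal sketch of a flat backtrack-pointer matrix that uses one byte per cell. It is an illustration only: `PtrFlag`, `PtrMatrix`, and the flag values are hypothetical and are not taken from vcfdist's source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical backtrack flags; several can be set in the same cell.
enum PtrFlag : std::uint8_t { PTR_NONE = 0, PTR_DIAG = 1, PTR_LEFT = 2, PTR_UP = 4 };

// Row-major matrix stored in one contiguous allocation, one byte per cell.
class PtrMatrix {
public:
    PtrMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}  // cells start at 0

    // Set a flag in cell (r, c) without clearing the others.
    void add(std::size_t r, std::size_t c, PtrFlag f) {
        assert(r < rows_ && c < cols_);
        std::uint8_t& cell = data_[r * cols_ + c];
        cell = static_cast<std::uint8_t>(cell | f);
    }

    // Check whether a flag is set in cell (r, c).
    bool has(std::size_t r, std::size_t c, PtrFlag f) const {
        assert(r < rows_ && c < cols_);
        return (data_[r * cols_ + c] & f) != 0;
    }

    // Size of the cell storage in bytes.
    std::size_t bytes() const { return data_.size() * sizeof(std::uint8_t); }

private:
    std::size_t rows_, cols_;
    std::vector<std::uint8_t> data_;
};
```

As a rough illustration, a hypothetical 40,000 x 40,000 matrix of this kind needs about 1.6 GB at one byte per cell, compared with about 6.4 GB if each cell were a 4-byte integer.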
Writing
- timers now store internal time in nanoseconds, making the summations more accurate
- colored output printing, removed unnecessary per-contig information
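The timer change can be sketched with std::chrono. This is an assumed design, not the actual vcfdist timer class: the `Timer` name and its methods are hypothetical. The idea shown is that elapsed time is accumulated as an integer count of nanoseconds and converted to seconds only for reporting, so summing many short intervals does not accumulate floating-point rounding error.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical accumulating timer that stores its total in nanoseconds.
class Timer {
    using Clock = std::chrono::steady_clock;  // monotonic clock

public:
    void start() {
        assert(!running_);
        begin_ = Clock::now();
        running_ = true;
    }

    // Add the elapsed interval to the running total.
    void stop() {
        assert(running_);
        total_ns_ += std::chrono::duration_cast<std::chrono::nanoseconds>(
                         Clock::now() - begin_).count();
        running_ = false;
    }

    std::int64_t total_ns() const { return total_ns_; }

    // Convert to seconds only when reporting.
    double total_seconds() const { return static_cast<double>(total_ns_) / 1e9; }

private:
    Clock::time_point begin_{};
    std::int64_t total_ns_ = 0;
    bool running_ = false;
};
```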