v2.0.0

Pre-release
TimD1 released this 05 Jun 22:07
· 172 commits to master since this release

vcfdist now supports Structural Variants (SVs)

Summary

Numerous performance and memory footprint improvements now make it feasible to evaluate structural variants (SVs) with vcfdist, although the maximum variant size is still somewhat limited. With 64 GB RAM and 56 cores, we were able to evaluate a whole human genome in 00h:56m, 03h:40m, or 07h:32m by limiting the maximum variant size to 1 kbp, 5 kbp, or 10 kbp, respectively. In the 10 kbp evaluation, the largest supercluster was almost 40 kbp, which nearly exhausted our 64 GB of RAM. Memory usage for realignment and precision-recall calculations is still O(n*n).

Improvements

Variants

  • if two variants occur at the same position, force the variant that consumes a reference base to occur second
  • added a left-shifting pass after clustering/realignment, in case variants at the start of a cluster can be shifted further left
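
The same-position ordering rule can be illustrated with a small comparator. This is a minimal sketch, not vcfdist's code: the `Variant` struct, `consumes_ref`, and `sort_variants` are hypothetical names, and the record is assumed to be normalized so that a pure insertion has an empty reference allele.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical, simplified variant record (not vcfdist's actual type).
struct Variant {
    int pos;          // 0-based reference position
    std::string ref;  // reference allele (assumed empty for a pure insertion)
    std::string alt;  // alternate allele (assumed empty for a pure deletion)
};

// True if the variant consumes at least one reference base.
bool consumes_ref(const Variant& v) { return !v.ref.empty(); }

// Sort by position; on ties, variants that consume no reference base
// come first, so the reference-consuming variant occurs second.
void sort_variants(std::vector<Variant>& vars) {
    std::stable_sort(vars.begin(), vars.end(),
                     [](const Variant& a, const Variant& b) {
                         if (a.pos != b.pos) return a.pos < b.pos;
                         return !consumes_ref(a) && consumes_ref(b);
                     });
}
```

With an insertion and a deletion both at position 5, the insertion is placed first regardless of input order.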

Clustering

  • clusters are merged as far as possible leftwards and rightwards during each iteration (previously at most one leftwards merge)
  • iterative doubling of reach calculation to reduce unnecessary work/allocation
  • fixed an erroneous loop in WF SWG max reach; it is now much faster
  • removed pointers from WF SWG max reach, since they're unused
  • WF SWG max reach now requires O(n) memory, instead of O(n*n)
  • added multi-threaded clustering, with one thread per haplotype on each contig
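
The drop from O(n*n) to O(n) memory once pointers are removed follows a general pattern: when only the final score is needed and no traceback is performed, a dynamic program can keep just the previous and current rows. The sketch below shows this on plain edit distance, which is a deliberately simpler stand-in and not the wavefront SWG algorithm itself; `edit_distance` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Edit distance using only two rows of scores. Because no traceback
// pointers are stored, memory is O(|b|) instead of O(|a| * |b|).
int edit_distance(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);

    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int sub = prev[j - 1] + (a[i - 1] != b[j - 1] ? 1 : 0);
            int del = prev[j] + 1;      // skip a character of a
            int ins = curr[j - 1] + 1;  // skip a character of b
            curr[j] = std::min({sub, del, ins});
        }
        std::swap(prev, curr);  // the current row becomes the previous row
    }
    return prev[b.size()];  // after the final swap, prev holds the last row
}
```

The trade-off is that the alignment path cannot be recovered afterwards, which is acceptable whenever only the score (or, here, the reach) is required.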

Precision/Recall

  • converted from maps back to matrices, for improved efficiency and memory footprint
  • pointers now stored in uint8_t for reduced memory usage
  • now only calculated for the best phasing, which halves the memory footprint
  • added multi-threaded precision-recall calculation
  • added work balancing for multi-threading: superclusters are sorted by size, and new threads are spawned for each alignment on large superclusters
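
To illustrate why a dense matrix of `uint8_t` pointers is compact, here is a minimal sketch of a row-major byte matrix in which each cell holds a set of pointer flags. The flag names and values and the `PtrMatrix` class are assumptions made for illustration; vcfdist's actual encoding may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pointer flags; several may be set in one cell when moves tie.
constexpr std::uint8_t PTR_NONE = 0;
constexpr std::uint8_t PTR_DIAG = 1;
constexpr std::uint8_t PTR_UP   = 2;
constexpr std::uint8_t PTR_LEFT = 4;

// Dense row-major matrix storing one byte of pointer flags per cell.
class PtrMatrix {
public:
    PtrMatrix(std::size_t rows, std::size_t cols)
        : cols_(cols), cells_(rows * cols, PTR_NONE) {}

    // Set a flag in cell (r, c) without clearing flags already present.
    void add(std::size_t r, std::size_t c, std::uint8_t flag) {
        cells_[r * cols_ + c] |= flag;
    }

    // Check whether a flag is set in cell (r, c).
    bool has(std::size_t r, std::size_t c, std::uint8_t flag) const {
        return (cells_[r * cols_ + c] & flag) != 0;
    }

    // Total storage for the cells: exactly rows * cols bytes.
    std::size_t bytes() const { return cells_.size() * sizeof(std::uint8_t); }

private:
    std::size_t cols_;
    std::vector<std::uint8_t> cells_;
};
```

A single contiguous allocation of one byte per cell avoids the per-entry node and key overhead that a map-based layout would typically carry, at the cost of allocating every cell up front.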

Writing

  • timers now store internal time in nanoseconds, making the summations more accurate
  • added colored output printing and removed unnecessary per-contig information
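
As a rough illustration of the timer change, the sketch below accumulates elapsed time as an integer count of nanoseconds using `std::chrono`, so summing many short intervals loses nothing to coarser units. The `Timer` class and its member names are hypothetical and are not taken from vcfdist.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical timer: accumulates elapsed time as integer nanoseconds.
class Timer {
public:
    void start() {
        begin_ = Clock::now();
        running_ = true;
    }

    // Add the time since the last start() to the running total.
    void stop() {
        if (!running_) return;
        total_ns_ += std::chrono::duration_cast<std::chrono::nanoseconds>(
                         Clock::now() - begin_).count();
        running_ = false;
    }

    // Merge an externally measured interval, e.g. another timer's total.
    void add_ns(std::int64_t ns) { total_ns_ += ns; }

    std::int64_t total_ns() const { return total_ns_; }

    // Convert to seconds only when reporting.
    double total_seconds() const { return static_cast<double>(total_ns_) / 1e9; }

private:
    using Clock = std::chrono::steady_clock;  // monotonic clock
    Clock::time_point begin_{};
    std::int64_t total_ns_ = 0;
    bool running_ = false;
};
```

Keeping the internal total in integer nanoseconds and converting to seconds only for display means that per-interval rounding does not accumulate across summations.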