
Releases: TimD1/vcfdist

v2.2.2

01 Nov 18:54
4309dd6
v2.2.2 Pre-release

Added source data for Figures 3-6 in paper.

v2.2.1

27 Oct 15:07
v2.2.1 Pre-release

Major Fixes

  • fixed indexing error while counting TRUTH variants
  • removed the attempt to limit cluster size with g.max_reach_size, which resulted in incorrect behaviour; wf_swg_realign() is now run on each updated cluster to recalculate the minimum SWG distance

Minor Improvements

  • added parameters.txt output log file
  • outputting results to TSV and VCF is now optional
  • minor changes to console printing

v2.2.0

25 Oct 14:47
v2.2.0 Pre-release

Improved phase set/block analysis

  • vcfdist now skips unphased variants by default
  • added phasing info from PS tag
  • switch "errors" are now only counted if they occur within phase sets
  • added phase set count and NG50 (inputs)
  • calculated phase block NG50 (break on switches) and NGC50 (break on flips and switches)
  • reorganization and variable renaming in phase.cpp
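
For readers unfamiliar with the metric, the NG50 reported above can be sketched as follows. This is a minimal illustration of the standard NG50 definition, not code taken from phase.cpp; the function name ng50 and its parameters are assumptions made for this example. Breaking phase sets at switches (or at flips and switches, for NGC50) would simply produce a different list of block lengths to pass in.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical helper (not from vcfdist): returns the NG50 of a set of block
// lengths, i.e. the length of the block at which the running total of blocks,
// taken from longest to shortest, first covers at least half of genome_length.
// Returns 0 if the blocks never cover half of the genome.
std::uint64_t ng50(std::vector<std::uint64_t> block_lengths,
                   std::uint64_t genome_length) {
    // Visit blocks from longest to shortest.
    std::sort(block_lengths.begin(), block_lengths.end(),
              std::greater<std::uint64_t>());
    std::uint64_t covered = 0;
    for (std::uint64_t len : block_lengths) {
        covered += len;
        if (2 * covered >= genome_length) return len;
    }
    return 0;
}
```

For example, blocks of length 50, 30, 20, and 10 on a genome of length 200 reach half coverage only once the 20-long block is included, so the NG50 is 20.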

General improvements

  • VCF parsing and warnings are more concise/consistent
  • added variant selection by FILTER
  • truth and query variants are no longer realigned by default
  • removed unnecessary data copying from sort_superclusters(), which significantly improves overall runtime and fixes uncounted runtime
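
Selecting variants by FILTER can be pictured roughly as below. This is a sketch under assumptions: the function name filter_selected is hypothetical, and whether vcfdist keeps a variant when any entry matches (as here) or requires something stricter is not stated in these notes. The only fact relied on is that the VCF FILTER column is a semicolon-separated list (or PASS, or "." when missing).

```cpp
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper (not from vcfdist): returns true if any
// semicolon-separated entry of a VCF FILTER field appears in `selected`.
bool filter_selected(const std::string& filter_field,
                     const std::set<std::string>& selected) {
    std::stringstream ss(filter_field);
    std::string entry;
    while (std::getline(ss, entry, ';')) {
        if (selected.count(entry) > 0) return true;
    }
    return false;
}
```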

v2.1.0

18 Oct 20:05
v2.1.0 Pre-release

More efficient superclustering

  • added max_reach_size global variable to limit supercluster size
    explosion from large INDELs; reaches now treated as if variant is
    max_reach_size

Added Docker image

  • added Dockerfile; the image is now built and uploaded

Phasing evaluation improvements

  • updated supercluster.tsv and phase-blocks.tsv outputs, both now
    include a column for SUPERCLUSTER and PHASE_BLOCK id
  • phase-blocks.tsv renamed PHASE column to BLOCK_STATE
  • renamed summary.vcf tags PS to BS and PF to FE
  • phase and flip errors are now printed per-contig
  • phase blocks are now correctly printed if there is only one

Improved inputs/outputs/command-line

  • added more warnings for potential edit distance errors
  • added optional print argument to wf_ed()
  • removed obsolete query and truth specific alignment parameter options
  • collapsed and re-organized command-line argument printing
  • added a warning if the ratio of heterozygous variants on each
    haplotype is too far off

Bug fixes

  • fixed edge case of wf_ed() causing an error
  • if there are two INSertions at the same location, one is now filtered

v2.0.3

09 Oct 14:29
v2.0.3 Pre-release

Improved clustering and superclustering

  • new clustering method caches left/right reaches from previous iterations, and only recalculates active clusters, not the neighbors as well
  • superclustering now uses the cached cluster reaches, rather than gap-based heuristic
  • fixed bug: accidentally counted clusters per supercluster, not variants per supercluster
  • no Valgrind errors or warnings on ./run dataset

Improved handling of phasing

  • precision and recall are now calculated for both phasings, not just the better one. This mostly matters for superclusters where phasing is undecided until after backtracking
  • added phase block PB, phase switch PS and phase flip PF fields to summary VCF

Improved summary VCF output

  • added phase block PB, phase switch PS and phase flip PF fields to summary VCF
  • added "sync group" SG output to summary VCF, allowing you to determine which variants matched with one another from the Truth and Query VCFs
  • matching heterozygous variants are no longer consolidated to a single variant since they may have different credit BC, category BK, and sync group SG fields
  • fixed GT fields for Truth, which were incorrect

v2.0.1

21 Sep 18:50
v2.0.1 Pre-release

New vcfdist release in order to assign a Zenodo DOI.

v2.0.0

05 Jun 22:07
v2.0.0 Pre-release

vcfdist now supports Structural Variants (SVs)

Summary

Numerous performance and memory footprint improvements now make the evaluation of structural variants (SVs) feasible with vcfdist. However, it is still somewhat limited in terms of maximum variant size. With 64 GB RAM and 56 cores, we were able to evaluate a whole human genome in 00h:56m, 03h:40m, or 07h:32m by limiting the maximum variant size to 1kbp, 5kbp, or 10kbp, respectively. For the last evaluation, the largest supercluster was almost 40kbp, which nearly maxed out our 64 GB RAM. Memory usage for realignment and precision-recall calculations is still O(n*n).

Improvements

Variants

  • if two variants occur at the same position, force the variant that consumes a reference base to occur second
  • left shifting pass after clustering/realignment, in case variants at the start of a cluster can be shifted further left
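
The left-shifting pass can be illustrated with a single insertion. The sketch below shows generic left-normalization of an insertion against a reference string; the Insertion struct and left_shift function are hypothetical names for this example and do not reflect vcfdist's internal data structures, which operate on whole clusters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical representation of an insertion (not vcfdist's own type).
struct Insertion {
    std::size_t pos;   // 0-based: seq is inserted immediately before ref[pos]
    std::string seq;   // inserted bases
};

// Shifts an insertion as far left as possible without changing the
// resulting haplotype sequence.
Insertion left_shift(const std::string& ref, Insertion ins) {
    while (ins.pos > 0 && !ins.seq.empty() &&
           ref[ins.pos - 1] == ins.seq.back()) {
        // Rotate the inserted sequence right by one base and move one
        // position left; the edited sequence is unchanged.
        const char last = ins.seq.back();
        ins.seq.pop_back();
        ins.seq.insert(ins.seq.begin(), last);
        --ins.pos;
    }
    return ins;
}
```

For example, inserting CA before the final T of GCACACAT is equivalent to inserting CA right after the leading G, which is the left-most representation.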

Clustering

  • clusters are merged as far as possible leftwards and rightwards during each iteration (previously at most one leftwards merge)
  • iterative doubling of reach calculation to reduce unnecessary work/allocation
  • fixed erroneous loop in WF SWG max reach, which is now much faster
  • removed pointers from WF SWG max reach, since they're unused
  • WF SWG max reach now requires O(n) memory, instead of O(n*n)
  • added multi-threaded clustering, with one thread per haplotype on each contig
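
To give a feel for why dropping pointers allows linear memory in a wavefront computation, here is a simplified sketch. It is an assumption-laden illustration: it computes plain unit-cost edit distance rather than the affine-gap Smith-Waterman-Gotoh distance vcfdist uses, and wavefront_edit_distance is a hypothetical name, not vcfdist's wf_ed(). Because no backtracking is needed, only the previous and current wavefronts are kept.

```cpp
#include <algorithm>
#include <limits>
#include <string>
#include <vector>

// Hypothetical sketch: unit-cost edit distance via wavefronts.
// For each score s, the wavefront stores, per diagonal k = i - j, the
// furthest position i reached in `a`. Only two wavefronts are stored,
// so memory is O(|a| + |b|) instead of a full matrix.
int wavefront_edit_distance(const std::string& a, const std::string& b) {
    const int n = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());
    const int NONE = std::numeric_limits<int>::min() / 2;  // unreachable
    const int shift = m + 1;  // maps diagonal k in [-m-1, n+1] to an index
    std::vector<int> prev(n + m + 3, NONE), curr(n + m + 3, NONE);

    // Slide along diagonal k for as long as the characters match.
    auto extend = [&](int i, int k) {
        while (i < n && i - k < m && a[i] == b[i - k]) ++i;
        return i;
    };
    // A cell is valid if both coordinates lie inside the alignment grid.
    auto valid = [&](int i, int k) {
        return i >= 0 && i <= n && i - k >= 0 && i - k <= m;
    };

    prev[shift] = extend(0, 0);
    const int k_end = n - m;  // diagonal containing the end cell (n, m)
    for (int s = 0;; ++s) {
        if (prev[k_end + shift] == n) return s;
        std::fill(curr.begin(), curr.end(), NONE);
        const int lo = std::max(-(s + 1), -m);
        const int hi = std::min(s + 1, n);
        for (int k = lo; k <= hi; ++k) {
            int best = NONE;
            const int candidates[3] = {
                prev[k + shift] + 1,      // substitution: stay on diagonal
                prev[k - 1 + shift] + 1,  // consume a base of `a` only
                prev[k + 1 + shift]       // consume a base of `b` only
            };
            for (int cand : candidates) {
                if (valid(cand, k)) best = std::max(best, cand);
            }
            if (best != NONE) curr[k + shift] = extend(best, k);
        }
        prev.swap(curr);
    }
}
```

Recovering the alignment itself would require keeping earlier wavefronts (or pointers), which is the memory cost that a distance-only or reach-only calculation avoids.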

Precision/Recall

  • converted from maps back to matrices, for improved efficiency and memory footprint
  • pointers now stored in uint8_t for reduced memory usage
  • only calculated for the best phasing, which reduces footprint by 2x
  • added multi-threaded precision-recall calculation
  • work-balancing for multi-threading, sorting by supercluster size and spawning new threads for each alignment on large superclusters
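
As a rough picture of storing backtracking pointers in uint8_t, the sketch below packs several pointer directions into one byte per matrix cell. The flag names and the PointerMatrix class are hypothetical and chosen only for illustration; vcfdist's actual flags and layout may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flag names (assumptions, not vcfdist's own constants).
constexpr std::uint8_t PTR_NONE = 0;
constexpr std::uint8_t PTR_DIAG = 1 << 0;
constexpr std::uint8_t PTR_UP   = 1 << 1;
constexpr std::uint8_t PTR_LEFT = 1 << 2;

// A dense rows x cols matrix with one byte of pointer flags per cell.
class PointerMatrix {
public:
    PointerMatrix(std::size_t rows, std::size_t cols)
        : cols_(cols), cells_(rows * cols, PTR_NONE) {}

    // Records an additional backtracking direction for cell (r, c).
    void add(std::size_t r, std::size_t c, std::uint8_t flag) {
        cells_[r * cols_ + c] |= flag;
    }
    // Tests whether cell (r, c) has the given direction set.
    bool has(std::size_t r, std::size_t c, std::uint8_t flag) const {
        return (cells_[r * cols_ + c] & flag) != 0;
    }
    // Total bytes used by the cell storage.
    std::size_t bytes() const { return cells_.size() * sizeof(std::uint8_t); }

private:
    std::size_t cols_;
    std::vector<std::uint8_t> cells_;
};
```

Compared with storing a 4-byte int per cell, a flat byte matrix like this uses a quarter of the memory while still allowing several pointers per cell.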

Writing

  • timers now store internal time in nanoseconds, making the summations more accurate
  • colored output printing, removed unnecessary per-contig information
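
A timer that accumulates integer nanoseconds might look like the sketch below. The Timer class and format_hm helper are hypothetical names for this example, and the output format is only assumed to mirror the 00h:56m style used in the summary above. Summing integer nanoseconds avoids the rounding error that builds up when many short intervals are first truncated to coarser units.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical timer (not vcfdist's own class) that accumulates elapsed
// time across start()/stop() pairs as integer nanoseconds.
class Timer {
public:
    void start() { begin_ = std::chrono::steady_clock::now(); }
    void stop() {
        total_ns_ += std::chrono::duration_cast<std::chrono::nanoseconds>(
                         std::chrono::steady_clock::now() - begin_)
                         .count();
    }
    std::int64_t total_ns() const { return total_ns_; }

private:
    std::chrono::steady_clock::time_point begin_{};
    std::int64_t total_ns_ = 0;
};

// Formats a nanosecond total as hours and minutes, e.g. "00h:56m".
std::string format_hm(std::int64_t ns) {
    const std::int64_t minutes = ns / 60000000000LL;  // 60 * 1e9 ns
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%02lldh:%02lldm",
                  static_cast<long long>(minutes / 60),
                  static_cast<long long>(minutes % 60));
    return std::string(buf);
}
```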

v1.3.1

19 May 15:54
v1.3.1 Pre-release

Improved support for monoploid/haploid contigs

  • ploidy is saved per contig and we ensure all variants on a contig are of the same ploidy
  • all summary VCFs now correctly output GT information with correct ploidy
  • if no GT field is found, monoploidy is assumed and a warning is printed
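
The ploidy rule above can be sketched by counting allele separators in a GT string. The function name gt_ploidy is an assumption for this example; it treats an empty string as a missing GT field and falls back to monoploid, leaving the warning and the per-contig consistency check to the caller.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper (not from vcfdist): returns the ploidy implied by a
// VCF GT string such as "1", "0/1", or "0|1". An empty string stands for a
// missing GT field and is treated as monoploid.
int gt_ploidy(const std::string& gt) {
    if (gt.empty()) return 1;
    // Alleles are separated by '/' (unphased) or '|' (phased).
    const auto separators = std::count_if(
        gt.begin(), gt.end(), [](char c) { return c == '/' || c == '|'; });
    return static_cast<int>(separators) + 1;
}
```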

v1.3.0

17 May 14:41
v1.3.0 Pre-release

Improvements

  • Summary VCF now reports partial credit information
  • Added high-level timers for basic profiling overview of bottlenecks
  • Smith-Waterman-Gotoh distance and realignment algorithm now uses WaveFront algorithm, which drastically reduces RAM usage
  • Added demo/ directory with simple example usage of vcfdist
  • Added a LICENSE file, now licensed under GNU GPLv3

Fixes

  • Variants are now maximally left-shifted past supercluster starts (previously only within superclusters)
  • Partial Positive partial credit is now calculated based on minimum edit distance, not edit distance of original representation
  • Fixed all Valgrind memory errors

v1.2.3

21 Apr 02:22
v1.2.3 Pre-release
  • #11 fixed: corrected supercluster output in query.tsv and truth.tsv
  • #2 and #10 addressed: haploid GT allowed, as well as alleles >2
  • relaxed ctg requirements: BED, Query VCF, and Truth VCF no longer need exactly the same contigs
  • improved printing: more sane defaults for INFO/WARN/ERROR
  • new command-line options: added -c for citation, -v for version, -s -l for filtering short/long variants, and -t -q for keeping truth/query; removed -q -m for min/max qual