Releases: TimD1/vcfdist
v2.2.2
v2.2.1
Major Fixes
- fixed indexing error while counting TRUTH variants
- removed attempt to limit cluster size with `g.max_reach_size`, which resulted in incorrect behaviour, and replaced it with performing `wf_swg_realign()` in each updated cluster to recalculate minimum SWG distance
Minor Improvements
- added `parameters.txt` output log file
- outputting results to TSV and VCF is now optional
- minor changes to console printing
v2.2.0
Improved phase set/block analysis
- vcfdist now skips unphased variants by default
- add phasing info from `PS` tag
- switch "errors" only counted if they occur within phase sets
- added phase set count and NG50 (inputs)
- calculated phase block NG50 (break on switches) and NGC50 (break on flips and switches)
- reorganization and variable renaming in `phase.cpp`
General improvements
- VCF parsing and warnings are more concise/consistent
- allow variant selection by `FILTER`
- truth and query variants are no longer realigned by default
- removed unnecessary data copying from `sort_superclusters()`, which significantly improves overall runtime and fixes uncounted runtime
v2.1.0
More efficient superclustering
- added `max_reach_size` global variable to limit supercluster size explosion from large INDELs; reaches now treated as if variant is `max_reach_size`
Added Docker image
- added Dockerfile, building and uploading image now
Phasing evaluation improvements
- updated `supercluster.tsv` and `phase-blocks.tsv` outputs, both now include a column for SUPERCLUSTER and PHASE_BLOCK ID
- `phase-blocks.tsv` renamed PHASE column to BLOCK_STATE
- renamed `summary.vcf` tags `PS` to `BS` and `PF` to `FE`
- phase and flip errors are now printed per-contig
- phase blocks are now correctly printed if there is only one
Improved inputs/outputs/command-line
- added more warnings for potential edit distance errors
- added optional `print` argument to `wf_ed()`
- removed obsolete query and truth specific alignment parameter options
- collapsed and re-organized command-line argument printing
- added warning if the ratio of heterozygous variants on each haplotype is too far off
Bug fixes
- fixed edge case of `wf_ed()` causing an error
- if there are two INSertions at the same location, one is filtered now
v2.0.3
Improved clustering and superclustering
- new clustering method caches left/right reaches from previous iterations, and only recalculates active clusters, not the neighbors as well
- superclustering now uses the cached cluster reaches, rather than gap-based heuristic
- fixed bug: accidentally counted clusters per supercluster, not variants per supercluster
- no Valgrind errors or warnings on `./run` dataset
Improved handling of phasing
- precision and recall are now calculated for both phasings, not just the better one. This improvement mostly matters just for superclusters where phasing is undecided until after backtracking
- added phase block `PB`, phase switch `PS` and phase flip `PF` fields to summary VCF
Improved summary VCF output
- added phase block `PB`, phase switch `PS` and phase flip `PF` fields to summary VCF
- added "sync group" `SG` output to summary VCF, allowing you to determine which variants matched with one another from the Truth and Query VCFs
- matching heterozygous variants are no longer consolidated to a single variant since they may have different credit `BC`, category `BK`, and sync group `SG` fields
- fixed `GT` fields for Truth, which were incorrect
v2.0.1
v2.0.0
vcfdist now supports Structural Variants (SVs)
Summary
Numerous performance and memory footprint improvements now make the evaluation of structural variants (SVs) feasible with vcfdist. However, it is still somewhat limited in terms of maximum variant size. With 64 GB RAM and 56 cores, we were able to evaluate a whole human genome in 00h:56m, 03h:40m, or 07h:32m by limiting the maximum variant size to 1kbp, 5kbp, or 10kbp respectively. For the last evaluation, the largest supercluster was almost 40kbp, which nearly maxed out our 64 GB RAM. Memory usage for realignment and precision-recall calculations is still O(n*n).
Improvements
Variants
- if two variants occur at the same position, force the variant that consumes a reference base to occur second
- left shifting pass after clustering/realignment, in case variants at the start of a cluster can be shifted further left
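
As an illustration of the left-shifting pass mentioned above, here is a minimal sketch of the standard INDEL left-shift rule. It is not vcfdist's implementation: the function name `left_shift_indel`, the 0-based coordinates, and the convention that the allele holds only the inserted or deleted bases (no anchor base) are assumptions made for this example.

```python
def left_shift_indel(ref, pos, allele):
    """Shift an insertion or deletion as far left as possible.

    ref    : reference sequence
    pos    : 0-based index; a deletion removes ref[pos:pos+len(allele)],
             an insertion places `allele` immediately before ref[pos]
    allele : the deleted or inserted bases (non-empty)

    An INDEL can move one base left without changing the resulting
    haplotype whenever the reference base just before it equals the
    last base of the allele; the allele is rotated by one base.
    """
    while pos > 0 and ref[pos - 1] == allele[-1]:
        allele = allele[-1] + allele[:-1]
        pos -= 1
    return pos, allele


# Deleting the last "CA" repeat unit of GCACACAT is equivalent to
# deleting the first one, so the variant is reported at position 1.
shifted = left_shift_indel("GCACACAT", 5, "CA")   # (1, "CA")
```

Because the rule is the same for insertions and deletions, one function covers both cases under the stated coordinate convention.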
Clustering
- clusters are merged as far as possible leftwards and rightwards during each iteration (previously at most one leftwards merge)
- iterative doubling of reach calculation to reduce unnecessary work/allocation
- fixed erroneous loop in WF SWG max reach, it's much faster now
- removed pointers from WF SWG max reach, since they're unused
- WF SWG max reach now requires O(n) memory, instead of O(n*n)
- added multi-threaded clustering, with one thread per haplotype on each contig
Precision/Recall
- converted from maps back to matrices, for improved efficiency and memory footprint
- pointers now stored in `uint8_t` for reduced memory usage
- only calculated for the best phasing, which reduces footprint by 2x
- added multi-threaded precision-recall calculation
- work-balancing for multi-threading, sorting by supercluster size and spawning new threads for each alignment on large superclusters
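
The idea of sorting work by supercluster size can be sketched with a greedy longest-first assignment. This is a swapped-in textbook heuristic, not vcfdist's scheduler: the function name `balance`, the quadratic cost model (chosen only because the notes describe O(n*n) behaviour), and the fixed worker count are all assumptions, and the sketch does not model spawning extra threads inside large superclusters.

```python
import heapq


def balance(sizes, n_workers):
    """Assign jobs to workers, largest job first, always to the worker
    with the smallest estimated load so far.

    sizes     : supercluster sizes (one job per supercluster)
    n_workers : number of worker threads
    Returns one list of job indices per worker.
    """
    heap = [(0, w) for w in range(n_workers)]   # (estimated load, worker id)
    bins = [[] for _ in range(n_workers)]
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    for idx in order:
        load, worker = heapq.heappop(heap)
        bins[worker].append(idx)
        # Assumed cost model: work grows quadratically with size.
        heapq.heappush(heap, (load + sizes[idx] ** 2, worker))
    return bins


# Two workers, one dominant job: the largest job gets a worker to itself.
plan = balance([10, 1, 1, 1, 9], 2)   # [[0], [4, 1, 2, 3]]
```

Handling the largest jobs first keeps a single huge supercluster from being scheduled last and leaving the other workers idle.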
Writing
- timers now store internal time in nanoseconds, making the summations more accurate
- colored output printing, removed unnecessary per-contig information
v1.3.1
Improved support for monoploid/haploid contigs
- ploidy is saved per contig and we ensure all variants on a contig are of the same ploidy
- all summary VCFs now correctly output `GT` information with correct ploidy
- if no `GT` field is found, monoploidy is assumed and a warning is printed
v1.3.0
Improvements
- Summary VCF now reports partial credit information
- Added high-level timers for basic profiling overview of bottlenecks
- Smith-Waterman-Gotoh distance and realignment algorithm now uses WaveFront algorithm, which drastically reduces RAM usage
- Added `demo/` directory with simple example usage of `vcfdist`
- Added a `LICENSE` file, now licensed under GNU GPLv3
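
To show why a wavefront formulation saves memory, here is a minimal sketch of the wavefront idea applied to plain unit-cost edit distance. It is a simplification: the Smith-Waterman-Gotoh variant mentioned above uses affine gap penalties, which this example does not model, and `wf_edit_distance` is a hypothetical name rather than vcfdist's own function. Only the furthest-reaching offset per diagonal is stored for the current score, instead of a full len(a) x len(b) matrix.

```python
def wf_edit_distance(a, b):
    """Unit-cost edit distance between a and b via furthest-reaching wavefronts."""
    n, m = len(a), len(b)
    final_diag = n - m            # diagonal k = i - j that contains (n, m)

    def extend(i, k):
        """Slide along diagonal k while characters match (free moves)."""
        j = i - k
        while i < n and j < m and a[i] == b[j]:
            i += 1
            j += 1
        return i

    # wave[k] = furthest index i into `a` reached on diagonal k
    wave = {0: extend(0, 0)}
    score = 0
    while wave.get(final_diag, -1) < n:
        score += 1
        nxt = {}
        # Only diagonals that exist in the alignment matrix are visited.
        for k in range(max(-score, -m), min(score, n) + 1):
            cands = []
            if k - 1 in wave:
                cands.append(wave[k - 1] + 1)   # consume a[i] only
            if k in wave:
                cands.append(wave[k] + 1)       # substitution
            if k + 1 in wave:
                cands.append(wave[k + 1])       # consume b[j] only
            if cands:
                # Clamp to the matrix boundary, then extend over matches.
                i = min(max(cands), n, m + k)
                nxt[k] = extend(i, k)
        wave = nxt
    return score


distance = wf_edit_distance("kitten", "sitting")   # 3
```

At score s at most 2s + 1 diagonals are live, so the working set grows with the distance between the sequences rather than with the product of their lengths.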
Fixes
- Variants are now maximally left-shifted past supercluster starts (previously only within superclusters)
- Partial Positive partial credit is now calculated based on minimum edit distance, not edit distance of original representation
- Fixed all Valgrind memory errors
v1.2.3
- #11 fixed: corrected supercluster output in `query.tsv` and `truth.tsv`
- #2 and #10 addressed: haploid GT allowed, as well as alleles >2
- relaxed ctg requirements: BED, Query VCF, and Truth VCF no longer need exactly the same contigs
- improved printing: more sane defaults for INFO/WARN/ERROR
- new command-line options: added `-c` for citation, added `-v` for version, added `-s` `-l` for filtering short/long variants, added `-t` `-q` for keeping truth/query, removed `-q` `-m` for min/max qual