diff --git a/README.md b/README.md index 0cd3856..b1db7aa 100644 --- a/README.md +++ b/README.md @@ -69,7 +69,7 @@ Please cite the following works if you use vcfdist: ## Installation ### Option 1: GitHub Source -vcfdist is developed for Linux and its only dependencies are GCC v8+ and HTSlib. If you don't have HTSlib already, please set it up as follows: +vcfdist is developed for Linux and its only dependencies are GCC v9.1+ and HTSlib v1.17. If you don't have HTSlib already, please set it up as follows: ```bash > wget https://github.com/samtools/htslib/releases/download/1.17/htslib-1.17.tar.bz2 > tar -xvf htslib-1.17.tar.bz2 @@ -87,7 +87,7 @@ If you do already have HTSlib installed elsewhere, make sure you've added it to > make > sudo make install > vcfdist --version -vcfdist v2.3.3 +vcfdist v2.3.4 ``` ### Option 2: Docker Image diff --git a/docs/v2.3.4/01-Overview.md b/docs/v2.3.4/01-Overview.md new file mode 100644 index 0000000..cab080b --- /dev/null +++ b/docs/v2.3.4/01-Overview.md @@ -0,0 +1,42 @@ +vcfdist evaluates the correctness of a set of phased variant calls (query VCF) relative to a set of phased ground truth variant calls (truth VCF) for a subset (regions BED) of the desired genome (reference FASTA). vcfdist was designed to evaluate human genomes, but should work on other monoploid and diploid species. It can evaluate variants of any type, including STRs (simple tandem repeats) and CNVs (copy number variants), but vcfdist classifies variants into SNPs (single nucleotide polymorphisms), INDELS (insertions and deletions), and SVs (structural variants) during evaluation. Evaluating variants larger than 10,000 bases is not recommended at the moment, as it will require large amounts of memory (over 50GB RAM). Below is a diagrammatic overview of vcfdist. Inputs are shown in red, internal steps in yellow, and optional steps in gray. + +

overview

+ +## Index + - [Parameters and Usage](https://github.com/TimD1/vcfdist/wiki/02-Parameters-and-Usage) + - [Variant Filtering](https://github.com/TimD1/vcfdist/wiki/03-Variant-Filtering) + - [VCF Normalization](https://github.com/TimD1/vcfdist/wiki/04-VCF-Normalization) + - [Variant Clustering](https://github.com/TimD1/vcfdist/wiki/05-Variant-Clustering) + - [Precision and Recall](https://github.com/TimD1/vcfdist/wiki/06-Precision-and-Recall) + - [Phasing Analysis](https://github.com/TimD1/vcfdist/wiki/07-Phasing-Analysis) + - [Alignment Distance](https://github.com/TimD1/vcfdist/wiki/08-Alignment-Distance) + - [Outputs](https://github.com/TimD1/vcfdist/wiki/09-Outputs) + - [Variant Stratification](https://github.com/TimD1/vcfdist/wiki/10-Variant-Stratification) + +## Repository Structure + + + + + + + + + + + + + + + + + + + + + + + + + +
Folder Description
src contains all C++ source code for vcfdist
demo contains a simple self-contained vcfdist example script, including inputs and expected output
analysis contains analysis scripts for "vcfdist: accurately benchmarking phased small variant calls"
analysis-v2 contains analysis scripts for "Jointly benchmarking phased small and structural variant calls with vcfdist"
docs contains old wiki documentation
\ No newline at end of file diff --git a/docs/v2.3.4/02-Parameters-and-Usage.md b/docs/v2.3.4/02-Parameters-and-Usage.md new file mode 100644 index 0000000..ee7d304 --- /dev/null +++ b/docs/v2.3.4/02-Parameters-and-Usage.md @@ -0,0 +1,188 @@ +# Usage +vcfdist <query.vcf> <truth.vcf> <ref.fasta> [optional arguments] + +# Required Arguments +### query.vcf +*type: string, default: none* + +Phased VCF file containing variant calls to evaluate. + +### truth.vcf +*type: string, default: none* + +Phased VCF file containing ground truth variant calls. + +### ref.fasta +*type: string, default: none* + +FASTA file containing the draft reference sequence. + +# Optional Arguments + +## Inputs/Outputs: +### -b, --bed +*type: string, default: none* + +BED file containing regions to evaluate. Variants located on the border of a BED region are currently included in the evaluation (details [here](https://github.com/TimD1/vcfdist/wiki/03-Variant-Filtering#bed-region)). + +### -v, --verbosity +*type: integer, default: 1* + +Printing verbosity (0: succinct, 1: default, 2: verbose). +- Succinct: Only warnings, errors, and the precision-recall summary are logged to console. +- Default: High-level info on parsed variants, superclustering, phasing, output results, and timing is additionally logged. +- Verbose: For debugging; warnings are printed each time they occur with helpful data included. + +### -p, --prefix +*type: string, default: ./* + +Prefix for output files (directories need a trailing slash). +For example `-p results/` will store `results/summary.vcf`, `-p test_` will store `test_summary.vcf`. + +### -n, --no-output-files +*type: flag* + +Skip writing output files, only print summary to console. + +## Variant Filtering/Selection: +### -f, --filter +*type: comma-separated string, default: all variants pass filtering stage* + +Select just variants passing these FILTERs (OR operation). + +### -s, --smallest-variant +*type: integer, default: 1* + +Minimum variant size to be evaluated; smaller variants are ignored (SNPs are size 1). + +### -l, --largest-variant +*type: integer, default: 5000* + +Maximum variant size to be evaluated, larger variants are ignored. + +### -sv, --sv-threshold +*type: integer, default: 50* + +Variants of this size or larger are considered SVs, not INDELs. This is useful because precision-recall summary statistics are reported separately for SNPs, INDELs, and SVs. + +### -mn, --min-qual +*type: integer, default: 0* + +Minimum variant quality, lower quality variants are ignored. + +### -mx, --max-qual +*type: integer, default: 60* + +Maximum variant quality, higher quality variants are kept but their Q-score is thresholded to this value. + +## Clustering: +### -c, --cluster +*type: string [and integer], default: biwfa* + +Select clustering method, one of: `biwfa`, `size N`, and `gap N` +#### biwfa +Clusters are generated using bi-directional wave-front alignment, essentially an efficient algorithm for finding possible alternate alignments (and therefore if nearby variants are independent). See the papers on [BiWFA](https://academic.oup.com/bioinformatics/article/39/2/btad074/7030690) and [WFA](https://academic.oup.com/bioinformatics/article/37/4/456/5904262) for more details. This is the currently recommended (and default) vcfdist clustering algorithm because it is the most accurate; it will always find dependencies if they exist. However, when evaluating large structural variants (above 1kbp) it tends to create large clusters, which results in large memory usage and slower evaluations. For evaluating large variants, `--cluster size 100` may be preferable. + +#### gap N +Gap-based clustering is the simplest and fastest clustering method: group together all variants less than N bases apart. It is also the least accurate, and will miss variant dependencies if N is too small. Conversely, as N nears the reciprocal of the background rate of genomic variation between humans (one SNP every 1000 bases), clusters will grow to be very large. We recommend 50 < N < 200, and to limit evaluations to small variants when using this option. + +#### size N +This is a heuristic that compromises in terms of efficiency and accuracy, basically extending the gap N heuristic to work with larger variants. Once a variant is larger than size N, the required gap to consider it independent of an adjacent variant is the size of the variant, instead of N. + +### -i, --max-iterations +*type: integer, default: 4* + +Maximum number of iterations for expanding/merging clusters, only applicable if `--cluster biwfa` is selected (which is the default). + +## Re-Alignment: +### -rq, --realign-query +*type: flag* + +Realign query variants using Smith-Waterman parameters -x -o -e + +### -rt, --realign-truth +*type: flag* + +Realign truth variants using Smith-Waterman parameters -x -o -e + +### -ro, --realign-only +*type: flag* + +Standardize truth and query variant representations, then exit. + +### -x, --mismatch-penalty +*type: integer, default: 3* + +Smith-Waterman mismatch (substitution) penalty. + +### -o, --gap-open-penalty +*type: integer, default: 2* + +Smith-Waterman gap opening penalty. + +### -e, --gap-extend-penalty +*type: integer, default: 1* + +Smith-Waterman gap extension penalty. + +## Precision and Recall: +### -ct, --credit-threshold +*type: float, default: 0.70* + +Minimum partial credit (calculated as a fractional reduction in edit distance over if the variant is omitted) to consider a query variant a true positive. + +## Phasing: +### -pt, --phasing-threshold +*type: float, default: 0.60* + +Minimum fractional reduction in edit distance over other phasing in order to consider this supercluster phased. Phased superclusters are then used to calculate switch and flip errors. + +## Distance: +### -d, --distance +*type: flag* + +Flag to include alignment distance calculations, which are skipped by default. + +### -ex, --eval-mismatch-penalty +*type: integer, default: 3* + +Mismatch penalty (`--distance` evaluation only). + +### -eo, --eval-gap-open-penalty +*type: integer, default: 2* + +Gap opening penalty (`--distance` evaluation only). + +### -ee, --eval-gap-extend-penalty +*type: integer, default: 1* + +Gap extension penalty (`--distance` evaluation only). + +## Utilization: +### -t, --max-threads +*type: integer, default: 64* + +Maximum threads to use for clustering and precision/recall alignment. + +### -r, --max-ram +*type: float, default: 64.000* + +Approximate maximum RAM to use for precision/recall alignment. Evaluation of superclusters requiring RAM usage above this threshold will still occur, but with a warning. + +## Miscellaneous: +### -h, --help +*type: flag* + +Prints a help message listing all required and optional command-line parameters. + +### -a, --advanced +*type: flag* + +Prints a help message listing all command-line parameters, including advanced options that are not recommended for most users. + +### -ci, --citation +*type: flag* +Prints the BibTeX and MLA formatted citations for vcfdist. + +### -v, --version +Prints the current version of vcfdist. \ No newline at end of file diff --git a/docs/v2.3.4/03-Variant-Filtering.md b/docs/v2.3.4/03-Variant-Filtering.md new file mode 100644 index 0000000..8c562b0 --- /dev/null +++ b/docs/v2.3.4/03-Variant-Filtering.md @@ -0,0 +1,38 @@ +### Variant Locations +#### BED Region +The `--bed FILENAME` option can be used to select a region to evaluate. Variants outside this region are discarded from the truth and query VCFs. + +Variants on the border of BED regions are currently included (to match with vcfeval), but this will likely eventually be changed so that such variants are excluded (to match with Truvari). In the example below, the BED region is `ref 10 20`. As you can see, Truvari excludes deletions overlapping the border (including if the preceding reference base in the VCF overlaps) whereas vcfeval does not. This was done to be consistent with how most benchmarking truth BEDs were generated (discussion [here](https://github.com/ACEnglish/truvari/issues/193)). + +

+bed region example +

+ +#### Spanning deletions +If a variant call is present within a spanning deletion (denoted by an `ALT` field of `*`), it is discarded. + +#### Overlapping variants +If two variants overlap or two insertions occur at the exact same position (on the same haplotype), the second variant (and third, etc. if applicable) which would conflict and create an ambiguous or invalid sequence is discarded. + +### Variant Attributes + +#### Variant Genotype (`GT`) +Variants that are reference calls (`0|0`) or have unknown alleles (`.|.`) are discarded. All other variant genotypes are converted to `1|0` or `0|1` (splitting variants such as `1|1` or `1|2` into two variants and selecting the correct alleles). + +#### Genotype separators (`/` vs `|`) +In VCFs, a pipe separator is used for phased variants (`0|1`) and a forward slash is used for unphased variants (`0/1`). All unphased variants are discarded, unless the alleles are the same (`1/1`or `2/2`). + +#### Phase set tags (`PS`) +If `PS` is not defined in the VCF header, then all variants are assumed to be globally phased. If a variant is missing a `PS` tag, it is considered to be in the same phase set as adjacent variants missing phase tags (if present). + +#### VCF FILTER field +The `--filter FILTER1,FILTER2` option can be used to select variants passing FILTER1 or FILTER2. By default, if this option is not selected, all variants will pass this stage. + +#### Variant Size +The `-s SIZE` and `-l SIZE` options can be used to select the smallest and largest variant sizes to be evaluated (inclusive). Other variants are discarded. + +#### Variant Quality +The `--min-qual QUAL` and `--max-qual QUAL` options can be used to select the range of variant qualities to evaluate. Variants with lower quality are discarded. Variants with higher quality scores are kept, but their quality score is set to `max_qual`. + +### Complex Variants +Complex variants are left and right-trimmed, and then converted to an insertion plus a deletion (or substitution if that is the case). \ No newline at end of file diff --git a/docs/v2.3.4/04-VCF-Normalization.md b/docs/v2.3.4/04-VCF-Normalization.md new file mode 100644 index 0000000..f02267f --- /dev/null +++ b/docs/v2.3.4/04-VCF-Normalization.md @@ -0,0 +1,14 @@ +Following variant [clustering](https://github.com/TimD1/vcfdist/wiki/05-Variant-Clustering), variants are optionally realigned by selecting the `--realign-query` and/or `--realign-truth` options. The `--realign-only` flag can be used to skip downstream evaluations. + +### Best-Alignment Normalization +As initially introduced in [this manuscript](https://doi.org/10.1093/bioinformatics/btw748) and further explored in [our work](https://doi.org/10.1038/s41467-023-43876-x), best alignment normalization can be used to select between several possible variant representations when complex variants are involved. Affine gap [Smith Waterman](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm) alignment is used to select the "best" variant representation, defined by a given set of alignment parameters. The design space for these parameters (m, x, o, e) is shown below, with many common alignment tools plotted and four example (A,B,C,D) alignments with their resulting variant representations. By default, the representation selected by vcfdist is at Point C. +

+the affine-gap design space for variant representation +

+ +### Standard VCF Normalization +The traditional method of variant normalization involves decomposing complex variants, trimming unnecessary bases from the variant representation, and then left-aligning INDELs. +

+variant decomposition, trimming, and left-shifting +

+This procedure is sufficient to create a unique canonical representation for a single variant, but not when multiple or complex variants are involved. \ No newline at end of file diff --git a/docs/v2.3.4/05-Variant-Clustering.md b/docs/v2.3.4/05-Variant-Clustering.md new file mode 100644 index 0000000..11fb4a3 --- /dev/null +++ b/docs/v2.3.4/05-Variant-Clustering.md @@ -0,0 +1,40 @@ +In order to make whole-genome alignment-based evaluation of variant calling tractable, we need a method of breaking contig-sized alignments (~100Mb) into many smaller sub-problems (each <50kb). This is done by identifying dependent variant calls through clustering, a process that occurs in two stages. Firstly, variants are clustered within their haplotype (there are 4 in total, from diploid truth and query VCFs). Then, a second stage which we call "superclustering" groups variants across all four haplotypes. + +### Clustering +#### Summary +There are currently three implemented clustering methods: biwfa (default, most accurate), gap N (simple, efficient), and size N (efficient, good for large SVs). For each method, the left and right "reaches" are first calculated, which define the genomic interval for which this variant is not independent of other variants. Any variants with overlapping intervals are merged into a single cluster. Only `biwfa` occurs iteratively, with up to `-i` iterations. + +#### `biwfa` +``` +left_reach = variant_end - [maximum reach of all leftward alignments starting from variant_end + with lesser alignment penalty to current variant(s) representation] +right_reach = variant_start - [maximum reach of all rightward alignments starting from variant_start + with lesser alignment penalty to current variant(s) representation] +``` +WFA is a time and space efficient affine-gap alignment algorithm, which we use bi-directionally to find possible alternate variant alignments (and therefore if nearby variants are independent). See the papers on [BiWFA](https://academic.oup.com/bioinformatics/article/39/2/btad074/7030690) and [WFA](https://academic.oup.com/bioinformatics/article/37/4/456/5904262) for more details. This is the currently recommended (and default) vcfdist clustering algorithm because it is the most accurate; it will always find dependencies if they exist. However, when evaluating large structural variants (above 1kbp) it tends to create large clusters (10-50kbp), which results in large memory usage and slower evaluations. For evaluating large variants, using `--cluster size 100` may be preferable. + +

+biwfa clustering +

+ +#### `gap GAP_WIDTH` +``` +left_reach = variant_start - GAP_WIDTH +right_reach = variant_end + GAP_WIDTH +``` +Gap-based clustering is the simplest and fastest clustering method: group together all variants less than N bases apart. It is also the least accurate, and will miss variant dependencies if GAP_WIDTH is too small. Conversely, as GAP_WIDTH nears the reciprocal of the background rate of genomic variation between humans (one SNP every 1000 bases), clusters will be merged on average and will grow to be very large. We recommend 50 < GAP_WIDTH < 200, and to limit evaluations to small variants when using this option. + +#### `size GAP_WIDTH` +``` +left_reach = variant_start - std::max(GAP_WIDTH, sizeof(variant)) +right_reach = variant_end + std::max(GAP_WIDTH, sizeof(variant)) +``` +This is a heuristic that compromises in terms of efficiency and accuracy, basically extending the gap heuristic to work with larger variants. Once a variant is larger than size GAP_WIDTH, the required gap to consider it independent of an adjacent variant is the size of the variant, instead of GAP_WIDTH. + +### Superclustering +Using the calculated left and right reaches for clusters on each of the four haplotypes, a merge occurs across all four haplotypes whenever cluster reaches overlap, into a supercluster. If there is no overlap, a cluster is converted into a supercluster nevertheless. + +

+superclustering +

+In the above image, clusters within the first supercluster are assumed to have overlapping left and right reaches, even if the variants within the clusters themselves don't always overlap. \ No newline at end of file diff --git a/docs/v2.3.4/06-Precision-and-Recall.md b/docs/v2.3.4/06-Precision-and-Recall.md new file mode 100644 index 0000000..0bc6271 --- /dev/null +++ b/docs/v2.3.4/06-Precision-and-Recall.md @@ -0,0 +1,9 @@ +### Summary +Variant correctness is determined by first calculating the minimum edit distance alignment of the query "sequence" to the truth sequence. The query "sequence" is not quite a sequence but instead a specific type of graph which allows arbitrary omission of query variants (so that we can ignore false positives). By backtracking through all optimal alignments, we can determine which query variants are false positives by seeing if the chosen path through the query graph included or excluded a particular variant. We can then assign credit to the remaining variants by marking the points at which all optimal paths diverge and coalesce. We call these "sync points". Between sync points, all query and truth variants are given the same `credit` based on the edit distance of the original truth variants and the current path between the two enclosing sync points. A diagram of this process is shown below. + +

+precision-recall algorithm +

+ +Some of the minor implementation details of this algorithm may change, but I expect the high-level process to remain the same. + diff --git a/docs/v2.3.4/07-Phasing-Analysis.md b/docs/v2.3.4/07-Phasing-Analysis.md new file mode 100644 index 0000000..b8c2cfa --- /dev/null +++ b/docs/v2.3.4/07-Phasing-Analysis.md @@ -0,0 +1,35 @@ +#### Phasing a single supercluster +In order to phase a single supercluster, we first perform four separate alignments and store the edit distance (ED) for each. +1. Query Haplotype #1 to Truth Haplotype #1 (`Q1T1`) +2. Query Haplotype #2 to Truth Haplotype #2 (`Q2T2`) +3. Query Haplotype #1 to Truth Haplotype #2 (`Q1T2`) +4. Query Haplotype #2 to Truth Haplotype #1 (`Q2T1`) + +Then, the two possible phasings are compared by summing the edit distances for a corresponding pair of alignments. + +Phasing A (original): `ED(Q1T1) + ED(Q2T2) = A` + +Phasing B (flipped): `ED(Q1T2) + ED(Q2T1) = B` + +If there is a significant difference between the edit distance resulting from the two possible phasings, the supercluster is considered phased: + +``` +(max(A,B) - min(A,B)) / max(A,B) > phasing_threshold ? phased : unphased; +if (phased) { phasing = A < B ? 'A' : 'B'; } +``` + +The default threshold is `--phasing-threshold 0.6`. A value of `0.0` will simply choose the phasing with lower edit distance, whereas a value of `1.0` will require one phasing to have an edit distance of zero in order to consider the supercluster phased. A threshold of `0.6` requires at least a 60% reduction in edit distance to consider the supercluster confidently phased. + +#### Calculating switch and flip errors +Once per-supercluster phasings are assigned, a dynamic programming algorithm is used to minimize the total number of switch and flip errors within each phase block. Black arrows represent penalized transitions, whereas colored arrows are not penalized. Note that changing the phasing threshold will affect which superclusters are considered phased, thus affecting the number of reported flip errors (an unphased supercluster will never be a flip error). + +

+phasing algorithm overview +

+ +Note: The label "Supercluster Phasing Error" in the above diagram we now refer to as a supercluster flip error. + +#### Metric Definitions +Phase set NG50 is the largest phased region such that all phased regions of that size and larger cover at least 50% of the genome. +Phase block switch NGC50 is calculated similarly, but instead of only breaking regions at new phase sets, it also breaks regions when switch errors occur. +Phase block switchflip NGC50 additionally breaks regions on flip errors. \ No newline at end of file diff --git a/docs/v2.3.4/08-Alignment-Distance.md b/docs/v2.3.4/08-Alignment-Distance.md new file mode 100644 index 0000000..316e78f --- /dev/null +++ b/docs/v2.3.4/08-Alignment-Distance.md @@ -0,0 +1,3 @@ +Alignment distance is no longer calculated during vcfdist's default evaluation, but can be re-enabled using the `--distance` flag. Instead of counting different categorizations of variant calls (SNPs, INDELs, SVs) which are ill-defined when complex variants are involved, this approach uses a purely sequence-based evaluation approach that is independent of variant representations. For this reason, it can be a useful supplemental comparison method. + +vcfdist performs Smith Waterman alignment at [Point B](https://github.com/TimD1/vcfdist/wiki/04-VCF-Normalization#best-alignment-normalization), or m=0, x=3, o=2, e=1. The total number of substitutions, insertions, and deletions parsed from this alignment are reported, as well as the total alignment score. \ No newline at end of file diff --git a/docs/v2.3.4/09-Outputs.md b/docs/v2.3.4/09-Outputs.md new file mode 100644 index 0000000..2316a6a --- /dev/null +++ b/docs/v2.3.4/09-Outputs.md @@ -0,0 +1,190 @@ + +# Output `.vcf` Files +### summary.vcf +Output summary VCF containing detailed information regarding each query and truth input variant. Please [note](https://github.com/TimD1/vcfdist/wiki/11-Implementation-Notes) that homozygous variants are split into two heterozygous variants prior to vcfdist evaluation. +| FORMAT Tag | Name | Type | Description | +| ---------- | ---- | ---- | ----------- | +| GT | GenoType | string | Genotype of the current variant. | +| BD | Benchmarking Decision | categorical | Benchmarking decision classification of the current variant (TP/FP/FN). | +| BC | Benchmarking Credit | float | Benchmarking credit allocated to the current variant in the range [0.0, 1.0]. | +| RD | Reference edit Distance | integer | Edit distance between the reference and truth sequences for this sync group. | +| QD | Query edit Distance | integer | Edit distance between the query and truth sequences for this sync group. | +| BK | Benchmarking Kategory | categorical | GA4GH category is assigned to 'gm' if credit == 1 (genotype match), 'lm' if credit > 0 (local match), or else '.' (no match). | +| QQ | Quality | integer | Variant PHRED-scaled quality score. | +| SC | SuperCluster | integer | 0-based index of supercluster in which this variant occurs, on this contig. | +| SG | Sync Group | integer | 0-based index of the current variant's sync group in this supercluster on this haplotype. Variants in the same sync group are considered complex, evaluated together, and assigned a single credit value. | +| PS | Phase Set | integer | Phase set identifier for the current variant. | +| PB | Phase Block | integer | 0-based index of current phase block in this contig. | +| BS | Block State | integer | Phasing state of the current phase block, which determines the truth-to-query haplotype mappings: 0 = T1Q1:T2Q2, 1 = T1Q2:T2Q1. | +| FE | Flip Error | integer | Whether the current supercluster's phasing is flipped (1) or not (0). | + +For full false positive (FP) variants where credit is 0.0, RD and QD are missing (.) because the query variant is never aligned and there is no corresponding truth variant. + +### orig-query.vcf, orig-truth.vcf +Original query and truth VCFs, as parsed by vcfdist. + +### query.vcf, truth.vcf +Output query and truth VCFs, (optionally standardized by vcfdist using --realign). + +# Output `.tsv` Files +`.tsv`, or tab-separated value files, are a simple text-based format used to store vcfdist's output. + +## Variants +### query.tsv, truth.tsv +Reports detailed information regarding each variant. +| Column | Type | Description | +| ------ | ---- | ----------- | +| CONTIG | string | Name of the contig on which this variant is located. | +| POS | integer | 0-based index of genomic position where this variant occurs on this contig. Insertions occur before this base. | +| HAP | integer | Variant occurs on this haplotype (0 or 1 if diploid). | +| REF | string | Reference allele of this variant. | +| ALT | string | Alternate allele of this variant. | +| QUAL | float | Quality score of this variant. | +| TYPE | categorical | Type of variant (REF/SNP/INS/DEL/CPX). | +| ERR_TYPE | categorical | Classification error type (TP/FP/FN). | +| CREDIT | float | Fraction of partial credit (in terms of correctness) that this variant received. --credit-threshold determines whether it is a true or false positive. | +| CLUSTER | integer | 0-based index of the cluster on the current haplotype of this contig containing this variant. | +| SUPERCLUSTER | integer | 0-based index of the supercluster on this contig containing this variant. | +| SYNC_GROUP | integer | 0-based index of credit grouping for variants on this haplotype and within this supercluster. | +| REF_DIST | integer | Edit distance between the reference and truth sequences for the current sync group. | +| QUERY_DIST | integer | Edit distance between the query and truth sequences for the current sync group. | +| LOCATION | categorical | Location of variant relative to BED regions (INSIDE/OUTSIDE/BORDER/OFF_CTG) | + +For full false positive (FP) variants where credit is 0.0, REF_DIST and QUERY_DIST set to 0 because the query variant is never aligned and there is no corresponding truth variant. + +## Clusters +### superclusters.tsv +Reports the size, location, and composition of each supercluster. +| Column | Type | Description | +| ------ | ---- | ----------- | +| CONTIG | string | Name of the contig on which this supercluster is located. | +| SUPERCLUSTER | integer | 0-based index of supercluster on this contig. | +| START | integer | 0-based leftmost genomic position of this supercluster on this contig, inclusive. | +| STOP | integer | 0-based rightmost genomic position of this supercluster on this contig, exclusive | +| SIZE | integer | Size of this supercluster, STOP-START. | +| QUERY1_VARS | integer | Number of variants from query haplotype 0 within this supercluster. | +| QUERY2_VARS | integer | Number of variants from query haplotype 1 within this supercluster. | +| TRUTH1_VARS | integer | Number of variants from truth haplotype 0 within this supercluster. | +| TRUTH2_VARS | integer | Number of variants from truth haplotype 1 within this supercluster. | +| ORIG_ED | integer | The minimum edit distance of this supercluster, assuming the original reported phasing is correct. | +| SWAP_ED | integer | The minimum edit distance of this supercluster, assuming the phasing of all variants is flipped. | +| PHASE_STATE | integer | Current phase block phasing state at this supercluster (0 = T1Q1:T2Q2, 1 = T1Q2: T2Q1). | +| SC_PHASE | categorical | Phase state of this supercluster, based on minimum edit distances. | +| PHASE_SET | integer | Phase set of this supercluster (`PS` tag input, genomic position of leftmost variant in phase set). | +| PHASE_BLOCK | integer | 0-based index of phase block within this contig. | +| FLIP_ERROR | integer | Whether this supercluster is considered a flip error (1) or not (0). | + +## Precision and Recall +### precision-recall-summary.tsv +High-level precision/recall overview of SNP/INDEL/SV performance. For each category, there is one line for performance at the chosen minimum quality score, and one line for the quality score threshold that results in the best performance. +| Column | Type | Description | +| ------ | ---- | ----------- | +| VAR_TYPE | categorical | Variant type category. SNP for substitutions, INDEL for variants smaller than --sv-threshold, and SV for larger variants. | +| MIN_QUAL | integer | Variants above this minimum quality score are used to calculate the performance metrics on this line. | +| TRUTH_TP | integer | Total measured true positive truth variants in this variant category. | +| QUERY_TP | integer | Total measured true positive query variants in this variant category. | +| TRUTH_FN | integer | Total measured false negative truth variants in this variant category. | +| QUERY_FP | integer | Total measured false positive query variants in this variant category. | +| PREC | float | Measured variant calling precision in the range [0.0, 1.0]: QUERY_TP / (QUERY_FP + QUERY_TP). | +| RECALL | float | Measured variant calling recall in the range [0.0, 1.0]: TRUTH_TP / (TRUTH_TP + TRUTH_FN). | +| F1_SCORE | float | Measured variant calling F1 score in the range [0.0, 1.0]: (2*PREC*RECALL) / (PREC + RECALL). | +| F1_QSCORE | float | PHRED-scaled measure of F1 score: -10*log10(1-F1_SCORE). | + +### precision-recall.tsv +For each category (SNP/INDEL/SV), there is one line reporting the precison/recall performance at each possible quality score. +| Column | Type | Description | +| ------ | ---- | ----------- | +| VAR_TYPE | categorical | Variant type category. SNP for substitutions, INDEL for variants smaller than --sv-threshold, and SV for larger variants. | +| MIN_QUAL | integer | Variants above this minimum quality score are used to calculate the performance metrics on this line. | +| PREC | float | Measured variant calling precision in the range [0.0, 1.0]: QUERY_TP / (QUERY_FP + QUERY_TP). | +| RECALL | float | Measured variant calling recall in the range [0.0, 1.0]: TRUTH_TP / (TRUTH_TP + TRUTH_FN). | +| F1_SCORE | float | Measured variant calling F1 score in the range [0.0, 1.0]: (2*PREC*RECALL) / (PREC + RECALL). | +| F1_QSCORE | float | PHRED-scaled measure of F1 score: -10*log10(1-F1_SCORE). | +| TRUTH_TOTAL | integer | Total truth variants in this variant category. | +| TRUTH_TP | integer | Total measured true positive truth variants in this variant category. | +| TRUTH_FN | integer | Total measured false negative truth variants in this variant category. | +| QUERY_TOTAL | integer | Total query variants in this variant category. | +| QUERY_TP | integer | Total measured true positive query variants in this variant category. | +| QUERY_FP | integer | Total measured false positive query variants in this variant category. | + +## Phasing Analysis +### phase-blocks.tsv +Reports the size, location, and composition of each phase block. +| Column | Type | Description | +| ------ | ---- | ----------- | +| CONTIG | string | Name of the contig which contains the current phase block. | +| PHASE_BLOCK | integer | 0-based index of this phase block within the contig. | +| START | integer | 0-based index of the leftmost genomic position within current phase block, inclusive. | +| STOP | integer | 0-based index of the rightmost genomic position within current phase block, exclusive. | +| SIZE | integer | Size of the current phase block. | +| SUPERCLUSTERS | integer | Total superclusters comprising this phase block. | +| FLIP_ERRORS | integer | Total flip errors that occur within this phase block. | +| SWITCH_ERRORS | integer | Total switch errors that occur within this phase block. | + +### phasing-summary.tsv +High-level summary of phasing performance. +| Column | Type | Description | +| ------ | ---- | ----------- | +| PHASE_BLOCKS | integer | Total phase blocks, across all evaluated contigs. | +| SWITCH_ERRORS | integer | Total switch errors, across all evaluated contigs. | +| FLIP_ERRORS | integer | Total flip errors, across all evaluated contigs. | +| NG_50 | integer | NG50 (break regions on new phase block), across all evaluated contigs. | +| SWITCH_NGC50 | integer | Switch NGC50 (break regions on new phase block or switch error), across all evaluated contigs. | +| SWITCHFLIP_NGC50 | integer | Switchflip NGC50 (break regions on new phase block, switch error, or flip error), across all evaluated contigs. | + +### switchflips.tsv +Detailed breakdown of all switch and flip errors. +| Column | Type | Description | +| ------ | ---- | ----------- | +| CONTIG | string | Contig on which phasing error occurred. | +| START | integer | 0-based index of leftmost genomic position where phasing error could have occurred. | +| STOP | integer | 0-based index of rightmost genomic position where phasing error could have occurred. | +| SWITCH_TYPE | categorical | Type of phasing error. | +| SUPERCLUSTER | integer | 0-based index of supercluster on this contig where the phasing error occurred. | +| PHASE_BLOCK | integer | 0-based index of phase block on this contig which the phasing error occurred within. | + +## Alignment Distance +### distance-summary.tsv +High-level alignment distance overview of SNP/INDEL performance. For each category (SNP/INDEL/ALL), there is one line for performance at the minimum, best, and maximum quality score threshold. +| Column | Type | Description | +| ------ | ---- | ----------- | +| VAR_TYPE | categorical | Variant category considered on this line (ALL/SNP/INS/DEL/INDEL). | +| MIN_QUAL | integer | Minimum variant quality considered. | +| EDIT_DISTANCE | integer | Total edit distance from truth sequence after query variants above MIN_QUAL are applied. | +| DISTINCT_EDITS | integer | Total number of edits from truth sequence after query variants above MIN_QUAL are applied. | +| ED_QSCORE | float | PHRED-scaled measure of query sequence quality in terms of edit distance: -10*log10(EDIT_DISTANCE / ORIG_EDIT_DISTANCE), where ORIG_EDIT_DISTANCE is the query edit distance from the truth sequence when no query variants are applied. | +| DE_QSCORE | float | PHRED-scaled measure of query sequence quality in terms of distinct edits: -10*log10(DISTINCT_EDITS / TRUTH_TOTAL), where TRUTH_TOTAL is total number of truth variants. | +| ALN_QSCORE | float | PHRED-scaled measure of query sequence quality in terms of alignment score: -10*log10(ALN_SCORE / ORIG_ALN_SCORE), where ORIG_ALN_SCORE is the query alignment score (to the truth sequence) when no query variants are applied. | + +### distance.tsv +There is one line reporting the alignment distance performance at each possible quality score. +| Column | Type | Description | +| ------ | ---- | ----------- | +| MIN_QUAL | integer | Minimum variant quality considered. | +| SUB_DE | integer | Total number of substitution edits during alignment. | +| INS_DE | integer | Total number of insertion edits during alignment. | +| DEL_DE | integer | Total number of deletion edits during alignment. | +| SUB_ED | integer | Total number of bases substituted during alignment. | +| INS_ED | integer | Total number of bases inserted during alignment. | +| DEL_ED | integer | Total number of bases deleted during alignment. | +| DISTINCT_EDITS | integer | Total number of edits from truth sequence after query variants above MIN_QUAL are applied. | +| EDIT_DIST | integer | Total edit distance from truth sequence after query variants above MIN_QUAL are applied. | +| ALN_SCORE | integer | Smith-Waterman alignment score of query sequence to truth sequence, after applying query variants above MIN_QUAL. +| ALN_QSCORE | float | PHRED-scaled measure of query sequence quality in terms of alignment score: -10*log10(ALN_SCORE / ORIG_ALN_SCORE), where ORIG_ALN_SCORE is the query alignment score (to the truth sequence) when no query variants are applied. | + +### edits.tsv +This file reports for each edit (where called query sequence differs from truth sequence) the contig, pos, hap, len, supercluster, and the quality range for which these edits occur. +| Column | Type | Description | +| ------ | ---- | ----------- | +| CONTIG | string | Name of the contig on which this edit occurs. | +| START | integer | 0-based genomic position on this contig where this edit is located. | +| HAP | integer | Haplotype on which this edit occurs. | +| TYPE | categorical | Edit category type (SNP/INS/DEL). | +| SIZE | integer | Size of the edit, in terms of reference or query bases affected. | +| SUPERCLUSTER | integer | 0-based index of the supercluster containing the edit, on this contig. | +| MIN_QUAL | integer | The edit is present at and above this quality score. | +| MAX_QUAL | integer | The edit is present below this quality score. | + +# Output `.txt` Files +### parameters.txt +This file stores all internal and command-line parameters used by vcfdist (for reproducibility and debugging purposes). Each line stores a single parameter in the format `parameter = value`. \ No newline at end of file diff --git a/docs/v2.3.4/10-Variant-Stratification.md b/docs/v2.3.4/10-Variant-Stratification.md new file mode 100644 index 0000000..37b40e7 --- /dev/null +++ b/docs/v2.3.4/10-Variant-Stratification.md @@ -0,0 +1,26 @@ +We currently output an intermediate VCF in GA4GH compatible format, meaning the results can be stratified and analyzed by `hap.py`'s quantification helper script `qfy.py`. +In order to use `qfy.py` please install `hap.py`. +`tabix` and `bgzip` should already be included as part of HTSlib. + +```bash +> ./vcfdist \ # run vcfdist + query.vcf.gz \ + truth.vcf.gz \ + reference.fasta \ + -b analysis-regions.bed \ + -p output-prefix/ +> bgzip output-prefix/summary.vcf # compress summary VCF +> tabix -p vcf output-prefix/summary.vcf.gz # index summary VCF +> export HGREF=/path/to/reference.fasta # set reference path +> source /path/to/happy/venv2/bin/activate # activate hap.py virtualenv +> python /path/to/happy/install/bin/qfy.py \ # run quantification script + -t ga4gh \ + --write-vcf \ + --write-counts \ + --stratification strat.tsv \ + --roc QUAL \ + --o results/qfy-output-prefix \ + output-prefix/summary.vcf.gz +``` +Ensure that `strat.tsv` contains one stratification region per line; each line consists of a region name and BED file name separated by a tab. +GIAB stratification regions for GRCh38 can be found here. \ No newline at end of file diff --git a/docs/v2.3.4/11-Implementation-Notes.md b/docs/v2.3.4/11-Implementation-Notes.md new file mode 100644 index 0000000..15e473a --- /dev/null +++ b/docs/v2.3.4/11-Implementation-Notes.md @@ -0,0 +1,9 @@ +### Homozygous variant counting +Prior to evaluation, all homozygous variants with genotype `1|1` are split into two heterozygous variants `0|1` and `1|0` (with identical reference and alternate alleles). We do this because variants on separate haplotypes may be realigned to a new representation, and to deal with cases where only one of the two calls is correct (or part of a complex variant). Thus, a fully-correct homozygous variant call will be counted as two true positives, whereas most other evaluation tools will count one true positive. + +### Limitations +The current version of vcfdist is not designed to support: + - somatic variants + - unphased variants + - overlapping variants (on the same haplotype) + - polyploid contigs diff --git a/docs/v2.3.4/12-Installation-Help.md b/docs/v2.3.4/12-Installation-Help.md new file mode 100644 index 0000000..bf81353 --- /dev/null +++ b/docs/v2.3.4/12-Installation-Help.md @@ -0,0 +1,24 @@ +Below are some common errors you may run into when attempting to install or run `vcfdist`. + +If you encounter any problems not listed below, please open a new [issue](https://github.com/TimD1/vcfdist/issues). + +### Failure to compile on Mac +On Mac devices, `g++` is aliased to `clang`, which is currently not supported. Please install `g++` through [homebrew](https://brew.sh/). + +### Errors with libstdc++fs +Although `std::filesystem` is officially part of the C++17 standard library, g++ < 9.1 and llvm < 9.0 require linking to `-lstdc++fs` during compilation. If you're using g++ < 9.1, please add `-lstdc++fs` to `LDFLAGS` in `src/Makefile` before running `make`. + +### Errors while loading shared libraries +You may encounter the following error while trying to run vcfdist: +```bash +vcfdist: error while loading shared libraries: libhts.so.2to3part12: cannot open shared object file: No such file or directory +``` +This occurs because the HTSlib library cannot be found. If you have installed HTSlib in `/usr/local`, add `/usr/local/lib` to your linker's library path: +```bash +export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH +``` +Optionally, you can add this export to your [`.bashrc`](https://www.digitalocean.com/community/tutorials/bashrc-file-in-linux) file so that HTSlib is always discoverable: +```bash +echo "export LD_LIBRARY_PATH=/usr/local/lib:\$LD_LIBRARY_PATH" >> ~/.bashrc +``` +Please note the escape `\` character before `$`, and use of `>>` to append to your `.bashrc` (NOT `>`, which will overwrite the file). \ No newline at end of file diff --git a/docs/v2.3.4/13-Release-Instructions.md b/docs/v2.3.4/13-Release-Instructions.md new file mode 100644 index 0000000..1b33dfd --- /dev/null +++ b/docs/v2.3.4/13-Release-Instructions.md @@ -0,0 +1,40 @@ +## Release Notes +More information on previous releases of vcfdist can be found [here](https://github.com/timd1/vcfdist/releases). + +## Steps to Create New Release + +1. Compare against previous releases using the `run` and `demo` scripts, updating `output.txt` for `demo` if necessary + +2. Update `vcfdist` version in `src/globals.h` + +3. Update `vcfdist` version in `README.md` + +4. Cache vcfdist wiki. +``` +cd vcfdist/docs +git clone https://github.com/TimD1/vcfdist.wiki.git vX.X.X +rm -rf vX.X.X/.git +``` + +5. Make final Git commit of changes +```bash +git add src/globals.h README.md docs/ +git commit -m "version bump (vX.X.X)" +git checkout master +git merge dev +git push origin master +git checkout dev +``` + +6. Make new release and tag on Github + +7. Build and deploy new Docker image +```bash +cd ~/vcfdist/src +sudo docker login -u timd1 +sudo docker build --no-cache -t timd1/vcfdist . +sudo docker image ls +sudo docker image tag timd1/vcfdist:latest timd1/vcfdist:vX.X.X +sudo docker image push timd1/vcfdist:vX.X.X +sudo docker image push timd1/vcfdist:latest +``` \ No newline at end of file diff --git a/docs/v2.3.4/Home.md b/docs/v2.3.4/Home.md new file mode 100644 index 0000000..4d1f0d9 --- /dev/null +++ b/docs/v2.3.4/Home.md @@ -0,0 +1,64 @@ +**Welcome to the vcfdist wiki!** + +If you have any **questions**, **ideas**, or **feedback**: please open a new [discussion](https://github.com/TimD1/vcfdist/discussions). + +If you found a **bug** or have a **feature request**: please submit an [issue](https://github.com/TimD1/vcfdist/issues). + +For information on vcfdist **updates**: check the latest [releases](https://github.com/TimD1/vcfdist/releases). + +## Wiki Index + - [Home](https://github.com/TimD1/vcfdist/wiki/Home) + - [Overview](https://github.com/TimD1/vcfdist/wiki/01-Overview) + - [Parameters and Usage](https://github.com/TimD1/vcfdist/wiki/02-Parameters-and-Usage) + - [Variant Filtering](https://github.com/TimD1/vcfdist/wiki/03-Variant-Filtering) + - [VCF Normalization](https://github.com/TimD1/vcfdist/wiki/04-VCF-Normalization) + - [Variant Clustering](https://github.com/TimD1/vcfdist/wiki/05-Variant-Clustering) + - [Precision and Recall](https://github.com/TimD1/vcfdist/wiki/06-Precision-and-Recall) + - [Phasing Analysis](https://github.com/TimD1/vcfdist/wiki/07-Phasing-Analysis) + - [Alignment Distance](https://github.com/TimD1/vcfdist/wiki/08-Alignment-Distance) + - [Outputs](https://github.com/TimD1/vcfdist/wiki/09-Outputs) + - [Variant Stratification](https://github.com/TimD1/vcfdist/wiki/10-Variant-Stratification) + - [Implementation Notes](https://github.com/TimD1/vcfdist/wiki/11-Implementation-Notes) + - [Installation Help](https://github.com/TimD1/vcfdist/wiki/12-Installation-Help) + - [Release Instructions](https://github.com/TimD1/vcfdist/wiki/13-Release-Instructions) + +### Citing vcfdist +If you use vcfdist in your research, please cite the following works: +
+ +[Nature Comms] vcfdist: Accurately benchmarking phased small variant calls in human genomes + + +
+@article{dunn2023vcfdist,
+  author={Dunn, Tim and Narayanasamy, Satish},
+  title={vcfdist: Accurately benchmarking phased small variant calls in human genomes},
+  journal={Nature Communications},
+  year={2023},
+  volume={14},
+  number={1},
+  pages={8149},
+  issn={2041-1723},
+  doi={10.1038/s41467-023-43876-x},
+  URL={https://doi.org/10.1038/s41467-023-43876-x}
+}
+
+
+ +
+ +[bioRxiv] Jointly benchmarking small and structural variant calls with vcfdist + + +
+@article{dunn2024vcfdist,
+  author={Dunn, Tim and Zook, Justin M and Holt, James M and Narayanasamy, Satish},
+  title={Jointly benchmarking small and structural variant calls with vcfdist},
+  journal={bioRxiv},
+  year={2024},
+  publisher={Cold Spring Harbor Laboratory},
+  doi={10.1101/2024.01.23.575922},
+  URL={https://doi.org/10.1101/2024.01.23.575922}
+}
+
+
diff --git a/src/globals.h b/src/globals.h index 8524aea..410ada3 100644 --- a/src/globals.h +++ b/src/globals.h @@ -78,7 +78,7 @@ class Globals { void init_timers(std::vector timer_strs); // program data - const std::string VERSION = "2.3.3"; + const std::string VERSION = "2.3.4"; const std::string PROGRAM = "vcfdist"; std::vector timers; }; diff --git a/src/run b/src/run index c803706..4d58955 100755 --- a/src/run +++ b/src/run @@ -1,7 +1,5 @@ #!/bin/bash -source ../pipeline/includes.sh - # valgrind --leak-check=full \ # --show-leak-kinds=all \ # --track-origins=yes \ @@ -11,7 +9,7 @@ source ../pipeline/includes.sh ./vcfdist \ ../test/query20.vcf.gz \ ../test/truth20.vcf.gz \ - /x/gm24385/reference/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta \ + /home/timdunn/vcfdist/data/refs/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta \ -b ../test/bench20.bed \ -l 1000 \ -p ../out/dev.