Added docs/ for cached wiki.

TimD1 · Feb 10, 2024 · aaecae9 · aaecae9
1 parent ebec11d
commit aaecae9
Show file tree

Hide file tree

Showing 15 changed files with 718 additions and 4 deletions.
diff --git a/.gitignore b/.gitignore
@@ -21,10 +21,8 @@ analysis/*/*/
 analysis-v2/*/*/
 analysis-v2/hla
 
-# ignore input
-data/*/*
-
-# ignore output
+# ignore input/output
+data/
 out/
 
 # don't track test cases

diff --git a/docs/v2.3.3/01-Overview.md b/docs/v2.3.3/01-Overview.md
@@ -0,0 +1,42 @@
+vcfdist evaluates the correctness of a set of phased variant calls (query VCF) relative to a set of phased ground truth variant calls (truth VCF) for a subset (regions BED) of the desired genome (reference FASTA). vcfdist was designed to evaluate human genomes, but should work on other monoploid and diploid species. It can evaluate variants of any type, including STRs (simple tandem repeats) and CNVs (copy number variants), but vcfdist classifies variants into SNPs (single nucleotide polymorphisms), INDELS (insertions and deletions), and SVs (structural variants) during evaluation. Evaluating variants larger than 10,000 bases is not recommended at the moment, as it will require large amounts of memory (over 50GB RAM). Below is a diagrammatic overview of vcfdist. Inputs are shown in red, internal steps in yellow, and optional steps in gray.
+
+<p align="center"><img src="https://github.com/TimD1/vcfdist/assets/13918078/85ecfbdb-0028-4bc8-aae0-3422849d1fd2" alt="overview"/></p>
+
+## Index
+ - [Parameters and Usage](https://github.com/TimD1/vcfdist/wiki/02-Parameters-and-Usage)
+ - [Variant Filtering](https://github.com/TimD1/vcfdist/wiki/03-Variant-Filtering)
+ - [VCF Normalization](https://github.com/TimD1/vcfdist/wiki/04-VCF-Normalization)
+ - [Variant Clustering](https://github.com/TimD1/vcfdist/wiki/05-Variant-Clustering)
+ - [Precision and Recall](https://github.com/TimD1/vcfdist/wiki/06-Precision-and-Recall)
+ - [Phasing Analysis](https://github.com/TimD1/vcfdist/wiki/07-Phasing-Analysis)
+ - [Alignment Distance](https://github.com/TimD1/vcfdist/wiki/08-Alignment-Distance)
+ - [Outputs](https://github.com/TimD1/vcfdist/wiki/09-Outputs)
+ - [Variant Stratification](https://github.com/TimD1/vcfdist/wiki/10-Variant-Stratification)
+
+## Repository Structure
+<table>
+<tr>
+  <th> Folder </th>
+  <th> Description </th>
+</tr>
+<tr>
+  <td> <code>src</code> </td>
+  <td> contains all C++ source code for vcfdist</td>
+</tr>
+<tr>
+  <td> <code>demo</code> </td>
+  <td> contains a simple self-contained vcfdist example script, including inputs and expected output</td>
+</tr>
+<tr>
+  <td> <code>analysis</code> </td>
+  <td> contains analysis scripts for "vcfdist: accurately benchmarking phased small variant calls" </td>
+</tr>
+<tr>
+  <td> <code>analysis-v2</code> </td>
+  <td> contains analysis scripts for "Jointly benchmarking phased small and structural variant calls with vcfdist" </td>
+</tr>
+<tr>
+  <td> <code>docs</code> </td>
+  <td> contains old wiki documentation </td>
+</tr>
+</table>
diff --git a/docs/v2.3.3/02-Parameters-and-Usage.md b/docs/v2.3.3/02-Parameters-and-Usage.md
@@ -0,0 +1,188 @@
+# Usage
+vcfdist &lt;query.vcf&gt; &lt;truth.vcf&gt; &lt;ref.fasta&gt; [optional arguments]
+
+# Required Arguments
+### query.vcf
+*type: string, default: none*
+
+Phased VCF file containing variant calls to evaluate.
+
+### truth.vcf
+*type: string, default: none*
+
+Phased VCF file containing ground truth variant calls.
+
+### ref.fasta	
+*type: string, default: none*
+
+FASTA file containing the draft reference sequence.
+
+# Optional Arguments
+
+## Inputs/Outputs:
+### -b, --bed
+*type: string, default: none*
+
+BED file containing regions to evaluate. Variants located on the border of a BED region are currently included in the evaluation (details [here](https://github.com/TimD1/vcfdist/wiki/03-Variant-Filtering#bed-region)).
+
+### -v, --verbosity
+*type: integer, default: 1*
+
+Printing verbosity (0: succinct, 1: default, 2: verbose).
+- Succinct: Only warnings, errors, and the precision-recall summary are logged to console.
+- Default: High-level info on parsed variants, superclustering, phasing, output results, and timing is additionally logged.
+- Verbose: For debugging; warnings are printed each time they occur with helpful data included.
+
+### -p, --prefix
+*type: string, default: ./*
+
+Prefix for output files (directories need a trailing slash).
+For example `-p results/` will store `results/summary.vcf`, `-p test_` will store `test_summary.vcf`.
+
+### -n, --no-output-files
+*type: flag*
+
+Skip writing output files, only print summary to console. 
+
+## Variant Filtering/Selection:
+### -f, --filter
+*type: comma-separated string, default: all variants pass filtering stage*
+
+Select just variants passing these FILTERs (OR operation).
+
+### -s, --smallest-variant
+*type: integer, default: 1*
+
+Minimum variant size to be evaluated; smaller variants are ignored (SNPs are size 1).
+
+### -l, --largest-variant
+*type: integer, default: 5000*
+
+Maximum variant size to be evaluated, larger variants are ignored.
+
+### -sv, --sv-threshold
+*type: integer, default: 50*
+
+Variants of this size or larger are considered SVs, not INDELs. This is useful because precision-recall summary statistics are reported separately for SNPs, INDELs, and SVs.
+
+### -mn, --min-qual
+*type: integer, default: 0*
+
+Minimum variant quality, lower quality variants are ignored.
+
+### -mx, --max-qual
+*type: integer, default: 60*
+
+Maximum variant quality, higher quality variants are kept but their Q-score is thresholded to this value.
+
+## Clustering:
+### -c, --cluster
+*type: string [and integer], default: biwfa*
+
+Select clustering method, one of: `biwfa`, `size N`, and `gap N`
+#### biwfa
+Clusters are generated using bi-directional wave-front alignment, essentially an efficient algorithm for finding possible alternate alignments (and therefore if nearby variants are independent). See the papers on [BiWFA](https://academic.oup.com/bioinformatics/article/39/2/btad074/7030690) and [WFA](https://academic.oup.com/bioinformatics/article/37/4/456/5904262) for more details. This is the currently recommended (and default) vcfdist clustering algorithm because it is the most accurate; it will always find dependencies if they exist. However, when evaluating large structural variants (above 1kbp) it tends to create large clusters, which results in large memory usage and slower evaluations. For evaluating large variants, `--cluster size 100` may be preferable.
+
+#### gap N
+Gap-based clustering is the simplest and fastest clustering method: group together all variants less than N bases apart. It is also the least accurate, and will miss variant dependencies if N is too small. Conversely, as N nears the reciprocal of the background rate of genomic variation between humans (one SNP every 1000 bases), clusters will grow to be very large. We recommend 50 < N < 200, and to limit evaluations to small variants when using this option.
+
+#### size N
+This is a heuristic that compromises in terms of efficiency and accuracy, basically extending the gap N heuristic to work with larger variants. Once a variant is larger than size N, the required gap to consider it independent of an adjacent variant is the size of the variant, instead of N. 
+
+### -i, --max-iterations
+*type: integer, default: 4*
+
+Maximum number of iterations for expanding/merging clusters, only applicable if `--cluster biwfa` is selected (which is the default).
+
+## Re-Alignment:
+### -rq, --realign-query
+*type: flag*
+
+Realign query variants using Smith-Waterman parameters -x -o -e
+
+### -rt, --realign-truth
+*type: flag*
+
+Realign truth variants using Smith-Waterman parameters -x -o -e
+
+### -ro, --realign-only
+*type: flag*
+
+Standardize truth and query variant representations, then exit.
+
+### -x, --mismatch-penalty
+*type: integer, default: 3*
+
+Smith-Waterman mismatch (substitution) penalty.
+
+### -o, --gap-open-penalty
+*type: integer, default: 2*
+
+Smith-Waterman gap opening penalty.
+
+### -e, --gap-extend-penalty
+*type: integer, default: 1*
+
+Smith-Waterman gap extension penalty.
+
+## Precision and Recall:
+### -ct, --credit-threshold 
+*type: float, default: 0.70*
+
+Minimum partial credit (calculated as a fractional reduction in edit distance over if the variant is omitted) to consider a query variant a true positive. 
+
+## Phasing:
+### -pt, --phasing-threshold 
+*type: float, default: 0.60*
+
+Minimum fractional reduction in edit distance over other phasing in order to consider this supercluster phased. Phased superclusters are then used to calculate switch and flip errors.
+
+## Distance:
+### -d, --distance
+*type: flag*
+
+Flag to include alignment distance calculations, which are skipped by default.
+
+### -ex, --eval-mismatch-penalty
+*type: integer, default: 3*
+
+Mismatch penalty (`--distance` evaluation only).
+
+### -eo, --eval-gap-open-penalty
+*type: integer, default: 2*
+
+Gap opening penalty (`--distance` evaluation only).
+
+### -ee, --eval-gap-extend-penalty
+*type: integer, default: 1*
+
+Gap extension penalty (`--distance` evaluation only).
+
+## Utilization:
+### -t, --max-threads
+*type: integer, default: 64*
+
+Maximum threads to use for clustering and precision/recall alignment.
+
+### -r, --max-ram
+*type: float, default: 64.000*
+
+Approximate maximum RAM to use for precision/recall alignment. Evaluation of superclusters requiring RAM usage above this threshold will still occur, but with a warning.
+
+## Miscellaneous:
+### -h, --help
+*type: flag*
+
+Prints a help message listing all required and optional command-line parameters.
+
+### -a, --advanced
+*type: flag*
+
+Prints a help message listing all command-line parameters, including advanced options that are not recommended for most users.
+
+### -ci, --citation
+*type: flag*
+Prints the BibTeX and MLA formatted citations for vcfdist.
+
+### -v, --version
+Prints the current version of vcfdist.
diff --git a/docs/v2.3.3/03-Variant-Filtering.md b/docs/v2.3.3/03-Variant-Filtering.md
@@ -0,0 +1,38 @@
+### Variant Locations
+#### BED Region
+The `--bed FILENAME` option can be used to select a region to evaluate. Variants outside this region are discarded from the truth and query VCFs.
+
+Variants on the border of BED regions are currently included (to match with vcfeval), but this will likely eventually be changed so that such variants are excluded (to match with Truvari). In the example below, the BED region is `ref 10 20`. As you can see, Truvari excludes deletions overlapping the border (including if the preceding reference base in the VCF overlaps) whereas vcfeval does not. This was done to be consistent with how most benchmarking truth BEDs were generated (discussion [here](https://github.com/ACEnglish/truvari/issues/193)).
+
+<p align="center">
+<img src="https://github.com/TimD1/vcfdist/assets/13918078/cc360f0f-aba8-42ff-859e-af0ab8633e32" alt="bed region example"/>
+</p>
+
+#### Spanning deletions
+If a variant call is present within a spanning deletion (denoted by an `ALT` field of `*`), it is discarded.
+
+#### Overlapping variants
+If two variants overlap or two insertions occur at the exact same position (on the same haplotype), the second variant (and third, etc. if applicable) which would conflict and create an ambiguous or invalid sequence is discarded.
+
+### Variant Attributes
+
+#### Variant Genotype (`GT`)
+Variants that are reference calls (`0|0`) or have unknown alleles (`.|.`) are discarded. All other variant genotypes are converted to `1|0` or `0|1` (splitting variants such as `1|1` or `1|2` into two variants and selecting the correct alleles).
+
+#### Genotype separators (`/` vs `|`)
+In VCFs, a pipe separator is used for phased variants (`0|1`) and a forward slash is used for unphased variants (`0/1`). All unphased variants are discarded, unless the alleles are the same (`1/1`or `2/2`).
+
+#### Phase set tags (`PS`)
+If `PS` is not defined in the VCF header, then all variants are assumed to be globally phased. If a variant is missing a `PS` tag, it is considered to be in the same phase set as adjacent variants missing phase tags (if present).
+
+#### VCF FILTER field
+The `--filter FILTER1,FILTER2` option can be used to select variants passing FILTER1 or FILTER2. By default, if this option is not selected, all variants will pass this stage.
+
+#### Variant Size
+The `-s SIZE` and `-l SIZE` options can be used to select the smallest and largest variant sizes to be evaluated (inclusive). Other variants are discarded.
+
+#### Variant Quality
+The `--min-qual QUAL` and `--max-qual QUAL` options can be used to select the range of variant qualities to evaluate. Variants with lower quality are discarded. Variants with higher quality scores are kept, but their quality score is set to `max_qual`.
+
+### Complex Variants
+Complex variants are left and right-trimmed, and then converted to an insertion plus a deletion (or substitution if that is the case).
diff --git a/docs/v2.3.3/04-VCF-Normalization.md b/docs/v2.3.3/04-VCF-Normalization.md
@@ -0,0 +1,14 @@
+Following variant [clustering](https://github.com/TimD1/vcfdist/wiki/05-Variant-Clustering), variants are optionally realigned by selecting the `--realign-query` and/or `--realign-truth` options. The `--realign-only` flag can be used to skip downstream evaluations.
+
+### Best-Alignment Normalization
+As initially introduced in [this manuscript](https://doi.org/10.1093/bioinformatics/btw748) and further explored in [our work](https://doi.org/10.1038/s41467-023-43876-x), best alignment normalization can be used to select between several possible variant representations when complex variants are involved. Affine gap [Smith Waterman](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm) alignment is used to select the "best" variant representation, defined by a given set of alignment parameters. The design space for these parameters (m, x, o, e) is shown below, with many common alignment tools plotted and four example (A,B,C,D) alignments with their resulting variant  representations. By default, the representation selected by vcfdist is at Point C.
+<p align="center">
+<img src="https://github.com/TimD1/vcfdist/assets/13918078/047cdde9-57d1-4625-993d-49071e2b6095" alt="the affine-gap design space for variant representation"/>
+</p>
+
+### Standard VCF Normalization
+The traditional method of variant normalization involves decomposing complex variants, trimming unnecessary bases from the variant representation, and then left-aligning INDELs.
+<p align="center">
+<img src="https://github.com/TimD1/vcfdist/assets/13918078/bf15f99c-1c00-4abe-bc03-d8d2afff1cf0" alt="variant decomposition, trimming, and left-shifting"/>
+</p>
+This procedure is sufficient to create a unique canonical representation for a single variant, but not when multiple or complex variants are involved.
diff --git a/docs/v2.3.3/05-Variant-Clustering.md b/docs/v2.3.3/05-Variant-Clustering.md
@@ -0,0 +1,40 @@
+In order to make whole-genome alignment-based evaluation of variant calling tractable, we need a method of breaking contig-sized alignments (~100Mb) into many smaller sub-problems (each <50kb). This is done by identifying dependent variant calls through clustering, a process that occurs in two stages. Firstly, variants are clustered within their haplotype (there are 4 in total, from diploid truth and query VCFs). Then, a second stage which we call "superclustering" groups variants across all four haplotypes.
+
+### Clustering
+#### Summary
+There are currently three implemented clustering methods: biwfa (default, most accurate), gap N (simple, efficient), and size N (efficient, good for large SVs). For each method, the left and right "reaches" are first calculated, which define the genomic interval for which this variant is not independent of other variants. Any variants with overlapping intervals are merged into a single cluster. Only `biwfa` occurs iteratively, with up to `-i` iterations.
+
+#### `biwfa`
+```
+left_reach = variant_end - [maximum reach of all leftward alignments starting from variant_end 
+                            with lesser alignment penalty to current variant(s) representation]
+right_reach = variant_start - [maximum reach of all rightward alignments starting from variant_start 
+                               with lesser alignment penalty to current variant(s) representation]
+```
+WFA is a time and space efficient affine-gap alignment algorithm, which we use bi-directionally to find possible alternate variant alignments (and therefore if nearby variants are independent). See the papers on [BiWFA](https://academic.oup.com/bioinformatics/article/39/2/btad074/7030690) and [WFA](https://academic.oup.com/bioinformatics/article/37/4/456/5904262) for more details. This is the currently recommended (and default) vcfdist clustering algorithm because it is the most accurate; it will always find dependencies if they exist. However, when evaluating large structural variants (above 1kbp) it tends to create large clusters (10-50kbp), which results in large memory usage and slower evaluations. For evaluating large variants, using `--cluster size 100` may be preferable.
+
+<p align="center">
+<img src="https://github.com/TimD1/vcfdist/assets/13918078/8f2f5b35-5a23-4079-ab21-d1660dc88e6a" alt="biwfa clustering"/>
+</p>
+
+#### `gap GAP_WIDTH`
+```
+left_reach = variant_start - GAP_WIDTH
+right_reach = variant_end + GAP_WIDTH
+```
+Gap-based clustering is the simplest and fastest clustering method: group together all variants less than N bases apart. It is also the least accurate, and will miss variant dependencies if GAP_WIDTH is too small. Conversely, as GAP_WIDTH nears the reciprocal of the background rate of genomic variation between humans (one SNP every 1000 bases), clusters will be merged on average and will grow to be very large. We recommend 50 < GAP_WIDTH < 200, and to limit evaluations to small variants when using this option.
+
+#### `size GAP_WIDTH`
+```
+left_reach = variant_start - std::max(GAP_WIDTH, sizeof(variant))
+right_reach = variant_end + std::max(GAP_WIDTH, sizeof(variant))
+```
+This is a heuristic that compromises in terms of efficiency and accuracy, basically extending the gap heuristic to work with larger variants. Once a variant is larger than size GAP_WIDTH, the required gap to consider it independent of an adjacent variant is the size of the variant, instead of GAP_WIDTH.
+
+### Superclustering
+Using the calculated left and right reaches for clusters on each of the four haplotypes, a merge occurs across all four haplotypes whenever cluster reaches overlap, into a supercluster. If there is no overlap, a cluster is converted into a supercluster nevertheless.
+
+<p align="center">
+<img width="800px" src="https://github.com/TimD1/vcfdist/assets/13918078/e1690116-0690-4776-86c9-a5d57d5b95a6" alt="superclustering"/>
+</p>
+In the above image, clusters within the first supercluster are assumed to have overlapping left and right reaches, even if the variants within the clusters themselves don't always overlap.
diff --git a/docs/v2.3.3/06-Precision-and-Recall.md b/docs/v2.3.3/06-Precision-and-Recall.md
@@ -0,0 +1,9 @@
+### Summary
+Variant correctness is determined by first calculating the minimum edit distance alignment of the query "sequence" to the truth sequence. The query "sequence" is not quite a sequence but instead a specific type of graph which allows arbitrary omission of query variants (so that we can ignore false positives). By backtracking through all optimal alignments, we can determine which query variants are false positives by seeing if the chosen path through the query graph included or excluded a particular variant. We can then assign credit to the remaining variants by marking the points at which all optimal paths diverge and coalesce. We call these "sync points". Between sync points, all query and truth variants are given the same `credit` based on the edit distance of the original truth variants and the current path between the two enclosing sync points. A diagram of this process is shown below. 
+
+<p align="center">
+<img src="https://github.com/TimD1/vcfdist/assets/13918078/da1f6d24-ae47-4a45-9be4-9235827a9b4f" alt="precision-recall algorithm"/>
+</p>
+
+Some of the minor implementation details of this algorithm may change, but I expect the high-level process to remain the same.
+