A Python program for evaluating site-level concordance of a query VCF against a truth VCF.
The summary metric CSV file contains:
- Variant type: SNV or INDEL
- Total number of variants in the truth and query VCF files
- Total true-positive, false-positive, and false-negative calls
- Recall and Precision
This tool compares two variant callsets against each other and produces a CSV file with summary metrics.
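To make the idea of site-level concordance concrete, here is a minimal sketch, not the actual VCFCompare.py implementation. It assumes a site is identified by (CHROM, POS, REF, ALT), that multi-allelic records are split per ALT allele, and that a variant is an SNV when REF and ALT are both one base long. The names `Site`, `parse_sites`, `variant_type`, and `concordance` are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Site(NamedTuple):
    """Hypothetical site key: genotypes are ignored, only the allele change counts."""
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_sites(vcf_lines):
    """Collect one Site per ALT allele from tab-separated VCF data lines."""
    sites = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            sites.add(Site(chrom, int(pos), ref, alt))
    return sites


def variant_type(site):
    """Assumption: single-base REF and ALT means SNV, anything else is INDEL."""
    return "SNV" if len(site.ref) == 1 and len(site.alt) == 1 else "INDEL"


def concordance(truth, query):
    """Count shared (TP), query-only (FP), and truth-only (FN) sites per type."""
    counts = {}
    for vtype in ("SNV", "INDEL"):
        t = {s for s in truth if variant_type(s) == vtype}
        q = {s for s in query if variant_type(s) == vtype}
        counts[vtype] = {"TP": len(t & q), "FP": len(q - t), "FN": len(t - q)}
    return counts


# Tiny made-up records, for illustration only
truth = parse_sites([
    "\t".join(["chr1", "100", ".", "A", "G"]),
    "\t".join(["chr1", "200", ".", "AT", "A"]),
    "\t".join(["chr1", "300", ".", "C", "T"]),
])
query = parse_sites([
    "\t".join(["chr1", "100", ".", "A", "G"]),
    "\t".join(["chr1", "200", ".", "A", "AT"]),
    "\t".join(["chr1", "400", ".", "G", "C"]),
])
counts = concordance(truth, query)
print(counts)
```

The sketch does no normalization (left-alignment, symbolic alleles, `*` or `.` in ALT), so a real comparison may count sites differently.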
In the following examples, we assume that the code has been installed in the directory ${VCFCompare}.
#option 1 (default): look at both SNV and INDEL concordance
$ python3 ${VCFCompare}/src/python/VCFCompare.py \
example/gatk_variants.vcf \
example/bcftools_variants.vcf \
--out test
$ ls test.*
#option 2 (optional): look at either SNV or INDEL concordance
$ python3 ${VCFCompare}/src/python/VCFCompare.py \
example/gatk_variants.vcf \
example/bcftools_variants.vcf \
--out test \
--type SNV   # or: --type INDEL
$ ls test.*
test.SNV.csv (or test.INDEL.csv)
The examples above compare the calls from an example GATK run against the calls from an example bcftools 1.9 run on the same randomly chosen sample.
For this example, the summary metrics CSV file contains:
| Type | TRUTH.TOTAL | TP | FP | FN | QUERY.TOTAL | Recall | Precision |
|------|-------------|----|----|----|-------------|--------|-----------|
| SNV | 3610 | 3573 | 538 | 37 | 4111 | 0.989750693 | 0.869131598 |
| INDEL | 205 | 104 | 101 | 101 | 247 | 0.507317073 | 0.421052632 |
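The Recall and Precision values in the table are consistent with Recall = TP / TRUTH.TOTAL and Precision = TP / QUERY.TOTAL. The sketch below recomputes them from the example rows. It assumes the CSV is comma-delimited and uses the column names shown in the table; the inline `csv_text` and the helper `recompute` are hypothetical stand-ins for reading the real output file.

```python
import csv
import io

# Example rows copied from the table above; the comma-delimited layout is an assumption.
csv_text = """Type,TRUTH.TOTAL,TP,FP,FN,QUERY.TOTAL,Recall,Precision
SNV,3610,3573,538,37,4111,0.989750693,0.869131598
INDEL,205,104,101,101,247,0.507317073,0.421052632
"""


def recompute(row):
    """Return (recall, precision) derived from the TP count and the two totals."""
    tp = int(row["TP"])
    recall = tp / int(row["TRUTH.TOTAL"])
    precision = tp / int(row["QUERY.TOTAL"])
    return recall, precision


rows = list(csv.DictReader(io.StringIO(csv_text)))
recomputed = {row["Type"]: recompute(row) for row in rows}

for row in rows:
    recall, precision = recomputed[row["Type"]]
    print(f"{row['Type']}: recall={recall:.9f} precision={precision:.9f}")
```

To run the same check on a real result, replace `io.StringIO(csv_text)` with an opened CSV file handle.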
To visualize the differences and the intersection between the truth and query VCF files, use the upset.R script under src/R/. The script runs on the VCFCompare.py results written with the --out flag, as in the examples above:
$ Rscript ${VCFCompare}/src/R/upset.R \
-i test.csv \
-o TestRun
$ ls TestRun.*
TestRun.INDEL.pdf TestRun.SNV.pdf
This produces two PDF files: one for SNVs and one for INDELs. Below is a screenshot for SNVs.