Genotype likelihood simulator for VCF/BCF files
Installation
·
Tutorial
·
Report Bug
·
Request Feature
·
Cite
Table of Contents
vcfgl is a lightweight command-line program for simulating VCF/BCF and gVCF files. It allows you to simulate sequencing data with various parameters, such as read depth, base-calling error rates, quality score errors, and genotype likelihood models.
You can install vcfgl using one of the following methods:
You can download the latest release tarball from the GitHub releases page.
This method uses the htslib submodule included in the repository.
git clone https://github.com/isinaltinkaya/vcfgl.git;
cd vcfgl;
make;
This method assumes you have a systemwide htslib installation.
git clone https://github.com/isinaltinkaya/vcfgl.git;
cd vcfgl;
make HTSSRC="systemwide";
This method allows you to specify the path to your htslib installation.
git clone https://github.com/isinaltinkaya/vcfgl.git;
cd vcfgl;
make HTSSRC=path/to/htslib;
For detailed compilation instructions, in vcfgl directory, run:
make help
$ make help ---------------------------------------- Program: vcfgl Version: v0.4-bb8df2d License: GNU GPLv3.0 ---------------------------------------- Usage: make [target] [FLAG=value...] Targets: help - Print this help message dev - Compile in developer/debug mode (activates flags: -g -Wall -O0) clean - Clean up the directory test - Run unit tests Flags: HTSSRC - Specifies the source of HTSlib. Values: (empty) - Use the HTSlib submodule [default] systemwide - Use the systemwide HTSlib installation /path/to/htslib - Use the HTSlib installation at /path/to/htslib Examples: make - Compile in release mode using HTSlib submodule make HTSSRC=/path/to/htslib - Compile in release mode using /path/to/htslib make dev HTSSRC=systemwide - Compile in developer mode using the systemwide HTSlib installation Note: If no values are provided for HTSSRC, CXX, CXXFLAGS, or LIBS, defaults will be used.
You can access the command-line help page using vcfgl --help
or vcfgl -h
:
vcfgl --help
Usage: vcfgl -i [options] -h, --help _____________________ Print this help message and exit -v, --version __________________ Print version and build information and exit Option descriptions: -s, --long-option TYPE [X] _____ Description -s Short option (if any) --long-option Long option TYPE Type of the argument value, can be: - INT (integer) - INT+ (additive integer: sum values to use together - FLOAT (floating point number) - STRING (string) - FILE (filename) - x|y|z (one of the listed values x, y or z) [X] Default argument value (if any) _____ Connector to the option description for better readability General options: -V, --verbose INT [0] ___________ Verbosity level -@, --threads INT [1] ___________ Number of threads -s, --seed INT [time] ___________ Random seed for initializing the random number generator Input/Output: -i, --input FILE ________________ Input VCF/BCF file --source [0]|1 ______________ 0: Input REF/ALT alleles are in binary format (REF=0, ALT=1; typically outputted from msprime BinaryMutationModel) 1: Input REF/ALT alleles are in VCF format (REF=i, ALT=j(,k..); i, j and k from {A,C,G,T}; i.e. the regular VCF format) -o, --output STRING ['output'] __ Output filename prefix -O, --output-mode [b]|u|z|v _____ b: Compressed BCF (.bcf), u: uncompressed BCF (.bcf), z: compressed VCF (.vcf.gz), v: uncompressed VCF (.vcf) Simulation parameters: -d, --depth FLOAT|'inf' _________ Mean per-site read depth ('inf') Simulate true values (requires: -addFormatDP 0 -addInfoDP 0) -df, --depths-file FILE __________ File containing mean per-site read depth values for each sample. One value per line. -e, --error-rate FLOAT __________ Base-calling error probability -eq, --error-qs [0]|1|2 __________ 0: Do not simulate errors in quality scores. Assumes all quality score assignments are correct 1: Simulate site-specific errors in the probability of wrong base calls (requires: -bv FLOAT) 2: Simulate the errors in the reported quality scores and genotype likelihoods (requires: -bv FLOAT) -bv, --beta-variance FLOAT _______ Designated variance for the beta distribution -GL, --gl-model 1|[2] ____________ Genotype likelihood model to be used in simulation 1: Genotype likelihood model with correlated errors (a.k.a. Li model, samtools model, angsd -GL 1) 2: Canonical genotype likelihood model with independent errors (a.k.a. McKenna model, GATK model, angsd -GL 2) --gl1-theta FLOAT [0.83] ___ Theta parameter for the genotype likelihood model 1 (requires: -GL 1) --qs-bins FILE _____________ File containing the quality score binning to be used in the simulation --precise-gl [0]|1 _________ 0: Use the discrete phred-scaled error probabilities in the genotype likelihood calculation 1: Use precise error probabilities in the genotype likelihood calculation (requires: -GL 2) --i16-mapq INT [20] ________ Mapping quality score for I16 tag (requires: -addI16 1) --gvcf-dps INT(,INT..) _____ Minimum per-sample read depth range(s) for constructing gVCF blocks (requires: -doGVCF 1) Example: `--gvcf-dps 5,10,20` will group invariable sites into three types of gVCF blocks: [5,10), [10,20) and [20,inf) Sites with minimum depth < 5 will be printed as regular VCF records. --adjust-qs INT+ [0] _______ 0: Do not adjust quality scores 1: Use adjusted quality scores in genotype likelihoods (requires: --precise-gl 0) 2: Use adjusted quality scores in calculating the quality score sum (QS) tag (requires: -addQS 1) 4: Use adjusted quality scores in pileup output (requires: --printPileup 1) 8: Use adjusted quality scores in --printQScores output (requires: --printQScores 1) 16: Use adjusted quality scores in --printGlError output (requires: --printGlError 1) --adjust-by FLOAT [0.499] __ Adjustment value for quality scores (requires: --adjust-qs > 0) -explode [0]|1 _____________ 1: Explode to sites that are not in input file. Useful for simulating invariable sites when the input file only contains variable sites. Sets all genotypes in exploded sites to homozygous reference. --rm-invar-sites INT+ [0]___ 0: Do not remove invariable sites 1: Remove sites where all individuals' true genotypes in the input file are homozygous reference 2: Remove sites where all individuals' true genotypes in the input file are homozygous alternative 4: Remove sites where the all simulated reads among all individuals are the same base Example: '--rm-invar-sites 3' (1+2) will do both 1 and 2 (i.e. remove all homozygous sites) --rm-empty-sites [0]|1 _____ 0: Do not remove empty sites 1: Remove empty sites (i.e. sites where no reads were simulated) Output options: -doUnobserved INT [1] ______ 0: Trim unobserved alleles. Only alleles that are observed will be listed in REF and ALT fields 1: Use '<*>' notation to represent unobserved alleles 2: Use '' notation to represent unobserved alleles (a.k.a. GATK notation) 3: Explode unobserved bases from {A,C,G,T} list 4: Use '<*>' notation to represent unobserved alleles and explode unobserved bases from {A,C,G,T} list 5: Use '' notation to represent unobserved alleles and explode unobserved bases from {A,C,G,T} list -doGVCF [0]|1 ______________ 0: Disabled 1: Output in gVCF format (requires: --rm-invar-sites 0, -doUnobserved 2, -addPL 1 and --gvcf-dps INT) Output VCF/BCF tags: 0: Do not add, 1: Add -addGL 0|[1] _______________ Genotype likelihoods (GL) tag -addGP [0]|1 _______________ Genotype probabilities (GP) tag -addPL [0]|1 _______________ Phred-scaled genotype likelihoods (PL) tag -addI16 [0]|1 ______________ I16 tag -addQS [0]|1 _______________ Quality score sum (QS) tag -addFormatDP [1]|0 _________ Per-sample read depth (FORMAT/DP) tag -addFormatAD [0]|1 _________ Per-sample allelic read depth (FORMAT/AD) tag -addFormatADF [0]|1 ________ Per-sample forward-strand allelic read depth (FORMAT/ADF) tag -addFormatADR [0]|1 ________ Per-sample reverse-strand allelic read depth (FORMAT/ADR) tag -addInfoDP [0]|1 ___________ Total read depth (INFO/DP) tag -addInfoAD [0]|1 ___________ Total allelic read depth (INFO/AD) tag -addInfoADF [0]|1 __________ Total forward-strand allelic read depth (INFO/ADF) tag -addInfoADR [0]|1 __________ Total reverse-strand allelic read depth (INFO/ADR) tag Additional output formats: -printPileup [0]|1 _________ 0: Disabled, 1: Output in pileup format (.pileup.gz) -printTruth [0]|1 __________ 0: Disabled, 1: Output the VCF file containing the true genotypes (.truth.vcf) -printBasePickError [0]|1 __ 0: Disabled, 1: Print the base picking error probability to stdout. If --error-qs 1 is used, writes per-read base picking error probabilities to stdout. If --error-qs 0 or 2 is used, writes a single value which is used for all samples and sites. The columns are: type, sample_id, contig, site, read_index, base_pick_error_prob -printQsError [0]|1 ________ 0: Disabled, 1: Print the error probability used in quality score calculations to stdout. If --error-qs 2 is used, writes per-read quality score error probabilities to stdout. If --error-qs 0 or 1 is used, writes a single value which is used for all samples and sites. The columns are: type, sample_id, contig, site, read_index, error_prob -printGlError [0]|1 ________ 0: Disabled, 1: Print the error probability used in genotype likelihood calculations to stdout. (requires: -GL 2) Since -GL 1 works directly with quality scores, this option is only available when -GL 2 is used. If --error-qs 2 is used, writes per-read error probabilities to stdout. If --error-qs 0 or 1 is used, writes a single value which is used for all samples and sites. If --precise-gl 1 is used, the printed values are the same as those printed by -printQsError. The columns are: type, sample_id, contig, site, read_index, error_prob -printQScores [0]|1 ________ 0: Disabled, 1: Print the quality scores to stdout. The columns are: type, sample_id, contig, site, read_index, qScore
Option descriptions
Option format is -s, --long-option TYPE [X]: Description
where -s
is the short format of the option (if any), --long-option
is the long option, TYPE
is the type of the argument value, and [X]
denotes that the default argument value is X
(if any).
Type of the argument values can be:
Type | Description | Example |
---|---|---|
INT |
Integer | --long-option 3 |
INT+ |
Additive integer | --long-option 3 will do both --long-option 1 and --long-option 2 |
FLOAT |
Floating point number | --long-option 3.14 |
STRING |
String | --long-option "hello" |
FILE |
Filename | --long-option "file.txt" or --long-option /path/to/file.txt |
x|y|z |
One of the listed values x, y, or z | --long-option x |
INT(,INT..) |
A single integer or a comma-separated list of integers | --long-option 1,2,3 or --long-option 1 |
General options
-
-v, --verbose INT [0]
: Verbosity level. -
-@, --threads INT [1]
: Number of threads. -
-s, --seed INT [time]
: Random seed for initializing the random number generator. If not defined, the seed will be set using the current system time.N.B. If multiple simulations are run in parallel and the seed is not defined, the random number generator may be initialized with the same seed for all of the simulations, given that it is likely that the simulations start at the same time. Therefore, it is highly recommended to set the seed manually when running multiple simulations in parallel.
Input/Output
-
-i, --input FILE
: Input filename. -
-o, --output STRING ['output']
: Output filename prefix. If not defined, the output filename prefix will be set tooutput
. The program will append the suffix to the output filename prefix based on the given options. A simulation argument values log file (ending with.arg
) is always generated.-
Example:
vcfgl -i test/data/data1.vcf -o test/data1_output -e 0.1 -d 1
will generatedata1_output.arg
,data1_output.bcf
files in thetest
directory. -
Example:
vcfgl -i test/data/data1.vcf -e 0.1 -d 1 -printPileup 1 -printTruth 1 -O z -o data1
will generatedata1.arg
,data1.pileup.gz
(-printPileup 1
),data1.truth.vcf.gz
(-printTruth 1
), anddata1.vcf.gz
(-O z
) files in the current directory.
-
-
-O, --output-mode [b]|u|z|v
: Output file format.b
: [Default] Compressed BCF (.bcf
)u
: Uncompressed BCF (.bcf
)z
: Compressed VCF (.vcf.gz
)v
: Uncompressed VCF (.vcf
)
Simulation parameters
-d, --depth FLOAT|'inf'
: Mean per-site read depth-df, --depths-file FILE
: File containing mean per-site read depth values for each sample (one value per line)-e, --error-rate FLOAT
: Base-calling error probability-eq, --error-qs [0]|1|2
: Error simulation type:0
: Do not simulate errors in quality scores1
: Simulate errors in the per-site probability of wrong base calls (requires--beta-variance
)2
: Simulate errors in the reported quality scores and genotype likelihoods (requires--beta-variance
)
-bv, --beta-variance FLOAT
: Designated variance for the beta distribution-GL, --gl-model 1|[2]
: Genotype likelihood model to be used in simulation1
: Genotype likelihood model with correlated errors (Li model)2
: Canonical genotype likelihood model with independent errors (McKenna model)
--gl1-theta FLOAT [0.83]
: Theta parameter for genotype likelihood model 1 (requires-GL 1
)--qs-bins FILE
: File containing the quality score binning to be used in the simulation--precise-gl [0]|1
: Use precise error probabilities in the genotype likelihood calculation (requires-GL 2
)--i16-mapq INT [20]
: Mapping quality score for I16 tag (requires-addI16 1
)--gvcf-dps INT(,INT..)
: Minimum per-sample read depth range(s) for constructing gVCF blocks (requires-doGVCF 1
)- Example:
--gvcf-dps 5,10,20
will group invariable sites into three types of gVCF blocks:[5,10)
,[10,20)
, and[20,inf)
. Sites with minimum depth < 5 will be printed as regular VCF records.
- Example:
-explode [0]|1
: Explode to sites that are not in the input file--rm-invar-sites INT+
: Remove invariable sites--rm-empty-sites [0]|1
: Keep or remove empty sites in the output file-doUnobserved INT [1]
: Trim unobserved alleles-doGVCF [0]|1
: Output in GATK GVCF format-printPileup [0]|1
: Output in pileup format-printTruth [0]|1
: Output the VCF file containing the true genotypes-printBasePickError [0]|1
: Print the base picking error probability to stdout-printQsError [0]|1
: Print the error probability used in quality score calculations to stdout. The columns are: type, sample_id, contig, site, read_index, error_prob-printGlError [0]|1
: Print the error probability used in genotype likelihood calculations to stdout. The columns are: type, sample_id, contig, site, read_index, error_prob-printQScores [0]|1
: Print the quality scores to stdout. The columns are: type, sample_id, contig, site, read_index, qScore
Output VCF/BCF tags
Control which tags are added to the output VCF/BCF files.
-addGL 0|[1]
: Genotype likelihoods (GL) tag-addGP [0]|1
: Genotype probabilities (GP) tag-addPL [0]|1
: Phred-scaled genotype likelihoods (PL) tag-addI16 [0]|1
: I16 tag-addQS [0]|1
: Quality score sum (QS) tag-addFormatDP [1]|0
: Per-sample read depth (FORMAT/DP) tag-addFormatAD [0]|1
: Per-sample allelic read depth (FORMAT/AD) tag-addFormatADF [0]|1
: Per-sample forward-strand allelic read depth (FORMAT/ADF) tag-addFormatADR [0]|1
: Per-sample reverse-strand allelic read depth (FORMAT/ADR) tag-addInfoDP [0]|1
: Total read depth (INFO/DP) tag-addInfoAD [0]|1
: Total allelic read depth (INFO/AD) tag-addInfoADF [0]|1
: Total forward-strand allelic read depth (INFO/ADF) tag-addInfoADR [0]|1
: Total reverse-strand allelic read depth (INFO/ADR) tag
The basic functionality of the program (i.e. simulation without errors in quality scores, --error-qs 0
) is designed to simulate data equivalent to those simulated using msToGlf.
Similar to msToGlf, vcfgl can simulate the genotype likelihoods using direct genotype likelihood model via -GL 2
, and output pileup format via -printPileup 1
.
msToGlf -in input.ms -out output -err 0.01 -depth 1 -pileup 1
is equivalent to
vcfgl -i input.vcf -o output --error-qs 0 -e 0.01 -d 1 -GL 2 -printPileup 1
If you have any questions about the program, feature requests, or bug reports, you can open a new issue at GitHub Issues. Alternatively, you can contact the main developer using the contact information at isinaltinkaya.com.
The preprint is freely available on bioRxiv through the following link: bioRxiv Preprint
You can use the BibTex entry below for referencing this program in your work:
@article {Altinkaya2024.04.09.586324,
author = {Isin Altinkaya and Rasmus Nielsen and Thorfinn Sand Korneliussen},
title = {vcfgl: A flexible genotype likelihood simulator for VCF/BCF files},
elocation-id = {2024.04.09.586324},
year = {2024},
doi = {10.1101/2024.04.09.586324},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/04/09/2024.04.09.586324},
eprint = {https://www.biorxiv.org/content/early/2024/04/09/2024.04.09.586324.full.pdf},
journal = {bioRxiv}
}