-
Notifications
You must be signed in to change notification settings - Fork 0
Usage
Nikita Tikhomirov edited this page Oct 8, 2024
·
3 revisions
$ piawka
piawka v0.8.1
Usage:
piawka -g groups_tsv -v vcf_gz [OPTIONS]
Options:
-1, --persite output values for each site
-a, --all output more cols: numerator, denominator, nGeno, nMiss
-b, --bed <arg> BED file with regions to be analyzed
-B, --targets <arg> BED file with targets (faster than -b for numerous small regions)
-D, --nodxy do not output Dxy (and other pairwise metrics)
-f, --fst output Hudson Fst
-F, --fstwc output Weir and Cockerham Fst instead
-g, --groups <arg> 2-columns sample ID / group ID TSV file
-h, --help show this help message
-H, --het output only per-sample pi = heterozygosity
-j, --jobs <arg> number of parallel jobs to run
-m, --mult use multiallelic sites
-M, --miss max share of missing GT per group at site
-P, --nopi do not output pi values
-r, --rho output Ronfort's rho
-t, --tajimalike output Tajima's D-like stat (manages missing data but isn't a test)
-T, --tajima output vanilla Tajima's D instead (sensitive to missing data)
-v, --vcf <arg> gzipped and tabixed VCF file
See Options and Examples for further details.
- a VCF file: it is most sensible to include invariant sites for an unbiased estimation of π and Dxy. Consult the well-written guide by
pixy
authors.piawka
only looks at what looks like discrete genotype calls, so any cell can have any imaginable number of genotypes. Regardless of the calculation method,piawka
does not make assumptions about missing genotype calls and does not include them in the calculation (see why this is good here).piawka
parses multiallelic sites but does not parse sites with indels. - a groups file: this is a 2-column TSV file with no header, first column being the sample IDs from the VCF and the second being the group ID.
piawka
can handle arbitrary number of groups. If a sample from the VCF file is missing in the groups file, it will not be used for calculation. Groups file can also contain non-existent sample IDs, they will not be considered.
-
-1, --persite
output values for each site (instead of region/file-wide average). Will not work for--tajima
or--tajimalike
. -
-a, --all
output additional 8th—11th cols:numerator
,denominator
,nGeno
,nMiss
. Automatically turned on for multi-threaded analysis withough regions specified, because this is the only way for accurate integration of several outputs. -
-b, --bed <arg>
BED file with regions to be analyzed -
-B, --targets <arg>
BED file with targets (faster than -b for numerous small regions) -
-D, --nodxy
do not output Dxy (and other pairwise metrics) -
-f, --fst
output Fst values for population pairs (Hudson (1992) after Bhatia et al. (2013)). Note that Fst behavior is less well-described in presence of missing data! Therefore, consider running the analysis with different--miss
values. -
-F, --fstwc
output Weir and Cockerham Fst (Weir and Cockerham (1984) as interpreted by Bhatia et al. (2013)). Supersedes--fst
if toggled. -
-g, --groups <arg>
2-columns sample ID / group ID TSV file -
-h, --help
show this help message -
-H, --het
only output heterozygosity, i.e. within-sample$π$ values. All samples present in the first column ofgroups_file
are used, the second column is ignored. Same is runningpiawka --nodxy
with single-sample groups but much more efficient. -
-j, --jobs <arg>
number of parallel jobs to run -
-m, --mult
use multiallelic sites when calculating$π$ ,$D_{xy}$ and$F_{ST}$ . Default is biallelic sites only (including sites that are multiallelic in the VCF as a whole but have two alleles in the given groups). Note that behavior of multiallelic sites diversity is less well-described, use at own risk! -
-M, --miss
maximum share of missing genotypes at a site per group. Higher-value sites are skipped. Should be a number between 0 and 1; default 1. -
-P, --nopi
do not output pi values -
-r, --rho
calculates Ronfort (1998)'s$\rho$ , a ploidy-corrected divergence metric that is best fit for comparing divergence values between ploidy levels. Not to be used in mixed-ploidy comparisons! -
-t, --tajimalike
calculate Tajima's D-like statistic. WithMIS=0
it is identical to Tajima's D, otherwise it corrects for missing data using Ferretti-like adjustment of theta-W and differences in expected numbers of segregating sites under different sample sizes as derived by Tajima. The result seems to be centered around the "true" Tajima's D value under a broad range of missing data percentages, but its variance should be broader. Thus, statistical significance of deviation is harder to estimate (maybe jackknife/bootstrap?) Not to be used with mixed-ploidy groups! -
-T, --tajima
output canonical Tajima's D instead (sensitive to missing data) -
-v, --vcf <arg>
gzipped and tabixed VCF file
piawka
outputs a long-format table with the following columns:
-
locus : either genomic position of analyzed locus or custom
LOCUS
value. -
nSites : the length of the region in the analyzed VCF file. Can be passed via the
NSITES
argument. - pop1 : analyzed group 1
-
pop2 :
.
for pi values or group 2 for Dxy and Fst values - nUsed : number of sites used for pi calculation (i.e. SNPs and invariant sites; for weighted pi or dxy these should also pass the 50% genotyping rate threshold)
-
metric : pi or dxy (or "het" for heterozygosity), average weighted or
pixy
-like - value of the metric
If --all
was supplied, four additional columns are appended:
- numerator of the statistic
- denominator
- nGeno : total number of called sites, with a possible maximum of ploidy * n_samples * nUsed
- nMiss : total number of missing genotype calls