Skip to content
Nikita Tikhomirov edited this page Oct 8, 2024 · 3 revisions

Basic usage

$ piawka
piawka v0.8.1
Usage:
piawka -g groups_tsv -v vcf_gz [OPTIONS]
Options:
-1, --persite       output values for each site
-a, --all           output more cols: numerator, denominator, nGeno, nMiss
-b, --bed <arg>     BED file with regions to be analyzed
-B, --targets <arg> BED file with targets (faster than -b for numerous small regions)
-D, --nodxy         do not output Dxy (and other pairwise metrics)
-f, --fst           output Hudson Fst
-F, --fstwc         output Weir and Cockerham Fst instead
-g, --groups <arg>  2-columns sample ID / group ID TSV file
-h, --help          show this help message
-H, --het           output only per-sample pi = heterozygosity
-j, --jobs <arg>    number of parallel jobs to run
-m, --mult          use multiallelic sites
-M, --miss          max share of missing GT per group at site
-P, --nopi          do not output pi values
-r, --rho           output Ronfort's rho
-t, --tajimalike    output Tajima's D-like stat (manages missing data but isn't a test)
-T, --tajima        output vanilla Tajima's D instead (sensitive to missing data)
-v, --vcf <arg>     gzipped and tabixed VCF file

See Options and Examples for further details.

Input files

  • a VCF file: it is most sensible to include invariant sites for an unbiased estimation of π and Dxy. Consult the well-written guide by pixy authors. piawka only looks at what looks like discrete genotype calls, so any cell can have any imaginable number of genotypes. Regardless of the calculation method, piawka does not make assumptions about missing genotype calls and does not include them in the calculation (see why this is good here). piawka parses multiallelic sites but does not parse sites with indels.
  • a groups file: this is a 2-column TSV file with no header, first column being the sample IDs from the VCF and the second being the group ID. piawka can handle arbitrary number of groups. If a sample from the VCF file is missing in the groups file, it will not be used for calculation. Groups file can also contain non-existent sample IDs, they will not be considered.

Options

  • -1, --persite output values for each site (instead of region/file-wide average). Will not work for --tajima or --tajimalike.
  • -a, --all output additional 8th—11th cols: numerator, denominator, nGeno, nMiss. Automatically turned on for multi-threaded analysis withough regions specified, because this is the only way for accurate integration of several outputs.
  • -b, --bed <arg> BED file with regions to be analyzed
  • -B, --targets <arg> BED file with targets (faster than -b for numerous small regions)
  • -D, --nodxy do not output Dxy (and other pairwise metrics)
  • -f, --fst output Fst values for population pairs (Hudson (1992) after Bhatia et al. (2013)). Note that Fst behavior is less well-described in presence of missing data! Therefore, consider running the analysis with different --miss values.
  • -F, --fstwc output Weir and Cockerham Fst (Weir and Cockerham (1984) as interpreted by Bhatia et al. (2013)). Supersedes --fst if toggled.
  • -g, --groups <arg> 2-columns sample ID / group ID TSV file
  • -h, --help show this help message
  • -H, --het only output heterozygosity, i.e. within-sample $π$ values. All samples present in the first column of groups_file are used, the second column is ignored. Same is running piawka --nodxy with single-sample groups but much more efficient.
  • -j, --jobs <arg> number of parallel jobs to run
  • -m, --mult use multiallelic sites when calculating $π$, $D_{xy}$ and $F_{ST}$. Default is biallelic sites only (including sites that are multiallelic in the VCF as a whole but have two alleles in the given groups). Note that behavior of multiallelic sites diversity is less well-described, use at own risk!
  • -M, --miss maximum share of missing genotypes at a site per group. Higher-value sites are skipped. Should be a number between 0 and 1; default 1.
  • -P, --nopi do not output pi values
  • -r, --rho calculates Ronfort (1998)'s $\rho$, a ploidy-corrected divergence metric that is best fit for comparing divergence values between ploidy levels. Not to be used in mixed-ploidy comparisons!
  • -t, --tajimalike calculate Tajima's D-like statistic. With MIS=0 it is identical to Tajima's D, otherwise it corrects for missing data using Ferretti-like adjustment of theta-W and differences in expected numbers of segregating sites under different sample sizes as derived by Tajima. The result seems to be centered around the "true" Tajima's D value under a broad range of missing data percentages, but its variance should be broader. Thus, statistical significance of deviation is harder to estimate (maybe jackknife/bootstrap?) Not to be used with mixed-ploidy groups!
  • -T, --tajima output canonical Tajima's D instead (sensitive to missing data)
  • -v, --vcf <arg> gzipped and tabixed VCF file

Output

piawka outputs a long-format table with the following columns:

  • locus : either genomic position of analyzed locus or custom LOCUS value.
  • nSites : the length of the region in the analyzed VCF file. Can be passed via the NSITES argument.
  • pop1 : analyzed group 1
  • pop2 : . for pi values or group 2 for Dxy and Fst values
  • nUsed : number of sites used for pi calculation (i.e. SNPs and invariant sites; for weighted pi or dxy these should also pass the 50% genotyping rate threshold)
  • metric : pi or dxy (or "het" for heterozygosity), average weighted or pixy-like
  • value of the metric

If --all was supplied, four additional columns are appended:

  • numerator of the statistic
  • denominator
  • nGeno : total number of called sites, with a possible maximum of ploidy * n_samples * nUsed
  • nMiss : total number of missing genotype calls
Clone this wiki locally