
Available statistics

Nikita Tikhomirov edited this page Oct 8, 2024 · 2 revisions

The metrics discussed below are well-reviewed here, including the implications of polyploidy for their calculation.

$π$ and $D_{xy}$

First and foremost, piawka reports $π$ (nucleotide diversity = gene diversity = expected heterozygosity) and $D_{xy}$ (absolute divergence = $π_{XY}$ = Nei's D) weighted by the amount of available data as in pixy:

$$ π, D_{xy} = { \sum^n N_{diff} \over \sum^n N_{comp} } $$

Here $N_{diff}$ and $N_{comp}$ denote the numbers of differences and comparisons at a site (within a group for $π$, between groups for $D_{xy}$; missing haplotypes are excluded), and $n$ is the number of sites used in the calculation. Only one division is performed per VCF file, after the numerators and denominators from all sites have been summed. This weighting gives less weight to sites with fewer genotyped alleles (i.e. fewer possible comparisons) and is robust to missing data.

These metrics are output by default. Like most other metrics here, they can also be calculated for every site separately using --persite.
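The ratio-of-sums weighting described above can be illustrated with a minimal Python sketch. It is not piawka's code: the function names (`pi_counts`, `dxy_counts`, `ratio_of_sums`) and the list-of-allele-strings input are assumptions made for illustration only.

```python
from collections import Counter

MISSING = "."  # VCF-style missing allele (assumed input convention)

def pi_counts(alleles):
    """Return (N_diff, N_comp) among the called alleles of one group at one site."""
    counts = Counter(a for a in alleles if a != MISSING)
    n = sum(counts.values())
    n_comp = n * (n - 1) // 2                      # all pairs of called alleles
    n_same = sum(c * (c - 1) // 2 for c in counts.values())
    return n_comp - n_same, n_comp

def dxy_counts(alleles_x, alleles_y):
    """Return (N_diff, N_comp) for between-group allele pairs at one site."""
    cx = Counter(a for a in alleles_x if a != MISSING)
    cy = Counter(a for a in alleles_y if a != MISSING)
    n_comp = sum(cx.values()) * sum(cy.values())   # every x allele vs every y allele
    n_same = sum(cx[a] * cy[a] for a in cx)        # Counter gives 0 for absent alleles
    return n_comp - n_same, n_comp

def ratio_of_sums(per_site):
    """Sum numerators and denominators over sites, then divide once."""
    per_site = list(per_site)
    total_diff = sum(d for d, _ in per_site)
    total_comp = sum(c for _, c in per_site)
    return total_diff / total_comp if total_comp else float("nan")

# Two sites in one group of four haplotypes; the second site has two missing calls.
sites = [["0", "0", "1", "1"], ["0", "1", ".", "."]]
pi = ratio_of_sums(pi_counts(s) for s in sites)    # (4 + 1) / (6 + 1) = 5/7
```

In this toy example a plain average of the per-site values would be (4/6 + 1/1) / 2 ≈ 0.83, whereas the weighted value is 5/7 ≈ 0.71, because the poorly genotyped second site contributes only one comparison.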

Heterozygosity

In polyploids, heterozygosity can be seen as a special case of $π$ (calculated within a single sample). We made it a separate option because it allows for more efficient calculation.
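As a rough sketch of what "$π$ within a single sample" means, the snippet below counts pairwise differences among the alleles of one genotype at each site and divides the summed differences by the summed comparisons. The function name `heterozygosity` and the input (a list of VCF-style GT strings for one sample) are hypothetical, and this is not the optimized calculation piawka performs.

```python
import re

def heterozygosity(genotypes):
    """Within-sample pi over sites, given GT strings such as '0/0/1/1' or '0|1'."""
    total_diff = total_comp = 0
    for gt in genotypes:
        # Split on phased or unphased separators and drop missing alleles.
        alleles = [a for a in re.split(r"[/|]", gt) if a != "."]
        n = len(alleles)
        comp = n * (n - 1) // 2
        same = sum(alleles.count(a) * (alleles.count(a) - 1) // 2
                   for a in set(alleles))
        total_diff += comp - same
        total_comp += comp
    return total_diff / total_comp if total_comp else float("nan")

# One tetraploid sample at three sites, the second partially missing.
h = heterozygosity(["0/0/1/1", "0/1/./.", "0/0/0/0"])  # (4 + 1 + 0) / (6 + 1 + 6)
```

For a diploid sample this reduces to the fraction of heterozygous sites, since each fully called site offers exactly one comparison.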

$F_{ST}$ and Ronfort's $\rho$

piawka can calculate relative divergence $F_{ST}$ for pairs of populations. So far it does not account for the diversity outside the currently analyzed pair of populations.

Two popular $F_{ST}$ estimators (Hudson and Weir-Cockerham) are implemented following this paper; the former is claimed to be more reliable. Sample size is calculated as the number of diploid individuals (number of alleles / 2), which is the simplest way to account for polyploid samples.
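To make the sample-size convention concrete, here is a hedged sketch of Hudson's estimator for biallelic sites in its commonly used form, with numerator $(p_1-p_2)^2 - p_1(1-p_1)/(n_1-1) - p_2(1-p_2)/(n_2-1)$ and denominator $p_1(1-p_2) + p_2(1-p_1)$. The function name `hudson_fst`, the tuple input and the ratio-of-sums combination across sites are assumptions for illustration; piawka's actual implementation may differ in these details.

```python
def hudson_fst(sites):
    """Hudson's F_ST over biallelic sites.

    sites: iterable of (alt1, total1, alt2, total2), i.e. alternate-allele and
    total called-allele counts in populations 1 and 2 (hypothetical input format).
    """
    num_sum = den_sum = 0.0
    for alt1, total1, alt2, total2 in sites:
        # Sample size as "diploid individuals" = called alleles / 2, as in the text.
        n1, n2 = total1 / 2, total2 / 2
        if n1 <= 1 or n2 <= 1:
            continue  # the bias correction needs n - 1 > 0
        p1, p2 = alt1 / total1, alt2 / total2
        num_sum += (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
        den_sum += p1 * (1 - p2) + p2 * (1 - p1)
    return num_sum / den_sum if den_sum else float("nan")

# Four diploids (8 alleles, 6 alternate) vs two tetraploids (8 alleles, 2 alternate):
# both populations get a sample size of 4 under the alleles / 2 convention.
fst = hudson_fst([(6, 8, 2, 8)])
```

With these counts the numerator is 0.125 and the denominator 0.625, so the estimate is 0.2.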

Ronfort's $\rho$ is a metric tailored for comparisons between ploidy levels. That is, if one wants to compare the divergence of two diploid populations to the divergence of two tetraploid populations, $\rho$ is a good bet. Note that this does not hold if populations contain individuals of both ploidy levels at once! In this case, $F_{ST}$ is the way to go.

Tajima's $D$ and Tajima's $D$-like metric

Another single-population metric reported is Tajima's $D$, a summary of the shape of the allele frequency spectrum (AFS), which is affected by selection and demographic history.

Unlike other metrics in piawka, canonical Tajima's $D$ depends on the sample size. To make $D$ values comparable between windows in the presence of missing data (at the cost of the ability to use $D$ values directly for significance tests), we supply a home-brewed "Tajima's $D$-like" metric that estimates the contribution of the missing genotypes to the AFS as if they were neutrally evolving. This is not perfect but seems to get the job done in the test datasets.

Again unlike other metrics here, Tajima's $D$ cannot be calculated for one site only as it needs an AFS.
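For reference, the canonical Tajima's $D$ (not the "$D$-like" variant, whose details are not given here) can be sketched as follows for a window with a fixed sample size and no missing data. The function name `tajimas_d` and its input, a list of alternate-allele counts per site, are assumptions for illustration. The constants $a_1$, $a_2$, $e_1$, $e_2$ all depend on the sample size $n$, which is why missing data makes values hard to compare between windows.

```python
import math

def tajimas_d(alt_counts, n):
    """Canonical Tajima's D for n sampled alleles, complete data only.

    alt_counts: alternate-allele count at each site in the window (hypothetical input).
    """
    seg = [k for k in alt_counts if 0 < k < n]   # keep segregating sites only
    s = len(seg)
    if s == 0:
        return float("nan")
    # Mean number of pairwise differences, summed over sites.
    pi = sum(2 * k * (n - k) / (n * (n - 1)) for k in seg)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    var = e1 * s + e2 * s * (s - 1)
    if var <= 0:
        return float("nan")                      # e.g. very small n
    return (pi - s / a1) / math.sqrt(var)

d_rare = tajimas_d([1, 1, 1, 1], n=10)    # excess of singletons: negative D
d_common = tajimas_d([5, 5, 5, 5], n=10)  # intermediate frequencies: positive D
```

The function needs the whole set of segregating sites in a window, which mirrors the point above that $D$ cannot be computed from a single site.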
