B) hogwash algorithms

How hogwash works

This package reads in one phenotype (either continuous or binary), a matrix of binary genotypes, and a phylogenetic tree. Given these inputs, it performs an ancestral reconstruction of the phenotype and of each genotype. The ancestral reconstructions are then used to perform one of three tests to associate the genotypes with the phenotype:

Continuous Test
Synchronous Test
PhyC

Once a test finishes running it returns data to summarize the results, including a P-value for each tested genotype, and plots of the data.

A note on nomenclature:

Talking about phylogenetic trees:

Nodes: bifurcation points
Tips: termini
Edges: the line defined by two adjacent nodes or a tip and the nearest node

[Figure: phylogenetic tree]

Convergent evolution vs parallel evolution:

While convergent evolution and parallel evolution have specific meanings, those definitions are often interchanged. For the purposes of this software package, I prefer to call all episodes of independent evolution of a trait "convergence."

What is convergence?

Convergence describes when a trait evolves independently multiple times on a tree. In this example we have a phenotype shown in black and pink (antibiotic sensitivity and resistance, respectively). In the left tree the phenotype evolves only once so there is no convergence. In the right tree the phenotype evolves four times and therefore is convergent.

Genotypes can also display convergence; in this example genomic variants are shown as colored stars. Consider the blue star variants. In the left tree the blue variants arise only once and therefore do not converge. In the right tree the blue star variants evolve four times and therefore converge.

In both trees the phenotype (antibiotic resistance) and the blue genomic variant occur in exactly the same samples, but the relationship between the phenotype and genotype is far less likely to have arisen by chance in the tree on the right than in the tree on the left. The relationship between that phenotype and genotype in the right tree is therefore of far more interest than in the left tree.

All tests start with ancestral reconstruction & confidence

Ancestral reconstruction

Given a phylogenetic tree and a trait value for each tree tip, an ancestral reconstruction predicts the value of the trait at each internal node. To learn more about the ancestral reconstruction tool, please read the ape::ace() documentation. Ancestral reconstruction for binary characters is performed with maximum likelihood and either an equal rates or an all rates different model (the best model is chosen for each reconstruction).

Binary phenotype: ancestral reconstruction

From the ancestral reconstruction above we may infer that antibiotic resistance evolved three separate times:

Ancestor to t1 & t9
Ancestor to t7
Ancestor to t2 & t6

Confidence

A tree edge is excluded from analysis if it:

has low bootstrap support (default: < 70%),
is very long (>10% of total tree edge length),
or has low ancestral reconstruction support (maximum likelihood < 0.875) from either the phenotype or genotype reconstruction.
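The three exclusion rules above can be sketched as a per-edge filter. This is a minimal illustration in Python rather than hogwash's own R code; the function name, the argument names, and the idea that bootstrap and reconstruction support are already available as one value per edge are assumptions made for the example. Only the three thresholds come from the text above.

```python
def high_confidence_edges(bootstrap, edge_length, pheno_support, geno_support,
                          min_bootstrap=70.0, max_length_fraction=0.10,
                          min_recon_support=0.875):
    """Return one boolean per edge: True if the edge passes all three filters.

    All four inputs are hypothetical per-edge lists of equal length.
    """
    total_length = sum(edge_length)
    keep = []
    for boot, length, p_sup, g_sup in zip(bootstrap, edge_length,
                                          pheno_support, geno_support):
        keep.append(
            boot >= min_bootstrap                              # bootstrap filter
            and length <= max_length_fraction * total_length   # long-edge filter
            and p_sup >= min_recon_support                     # phenotype reconstruction
            and g_sup >= min_recon_support                     # genotype reconstruction
        )
    return keep


# Toy tree with 12 edges; edges 1 to 4 each fail exactly one filter.
n_edges = 12
bootstrap = [100.0] * n_edges
edge_length = [2.0] * n_edges
pheno_support = [0.99] * n_edges
geno_support = [0.99] * n_edges
bootstrap[1] = 50.0      # low bootstrap support
edge_length[2] = 10.0    # long edge: 10 of 32 total length units
pheno_support[3] = 0.60  # uncertain phenotype reconstruction
geno_support[4] = 0.80   # uncertain genotype reconstruction

keep = high_confidence_edges(bootstrap, edge_length, pheno_support, geno_support)
print(keep)
```

Note that the long-edge rule only makes sense on trees with more than ten edges; on a smaller tree every edge would exceed 10% of the total length.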

1. Continuous test (input: continuous phenotype)

Does the phenotype change more than expected by chance on genotype transition edges than on genotype non-transition edges?

Genotype transitions are defined as any edge where the parent node and child node are not equal.

Genotype edge type	parent node value	child node value
Transition	wild type (0)	mutant (1)
Transition	mutant (1)	wild type (0)
Non-transition	wild type (0)	wild type (0)
Non-transition	mutant (1)	mutant (1)
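The table above amounts to comparing the reconstructed state at the two ends of each edge. A minimal Python sketch follows, assuming the tree is stored as a list of (parent, child) node pairs and the reconstruction as a dictionary from node to state; both representations and the function name are hypothetical choices for this example.

```python
def is_transition(edges, node_state):
    """Return True for each edge whose parent and child states differ."""
    return [node_state[parent] != node_state[child] for parent, child in edges]


# Toy tree: root 7 -> internal nodes 5 and 6 -> tips 1 to 4.
edges = [(7, 5), (7, 6), (5, 1), (5, 2), (6, 3), (6, 4)]
geno_state = {7: 0, 5: 0, 6: 1, 1: 0, 2: 1, 3: 1, 4: 0}

transitions = is_transition(edges, geno_state)
print(transitions)  # [False, True, False, True, False, True]
```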

The absolute value of the phenotype change on each edge is measured and scaled from 0 to 1. We calculate the sum of |Δphenotype| on only the genotype transition edges. Then a permutation test is performed wherein the classification of each edge as a genotype transition or non-transition edge is randomized. The new sum of |Δphenotype| on the permuted genotype transition edges is calculated for each permutation. An empirical P-value is calculated by comparing the observed sum to the permuted sums.
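The permutation procedure described above can be sketched as follows. This is an illustration in Python, not hogwash's implementation: the function name is hypothetical, min-max scaling is assumed for "scaled from 0 to 1", the permutation is assumed to draw transition edges uniformly at random without replacement, and the add-one correction in the P-value is a common convention rather than something stated here.

```python
import random


def continuous_test(delta_pheno, is_geno_transition, n_perm=10000, seed=0):
    """Permutation test on the summed |delta phenotype| over transition edges."""
    rng = random.Random(seed)

    # Absolute phenotype change per edge, min-max scaled to [0, 1] (assumed).
    abs_delta = [abs(d) for d in delta_pheno]
    lo, hi = min(abs_delta), max(abs_delta)
    span = hi - lo
    scaled = [(d - lo) / span if span > 0 else 0.0 for d in abs_delta]

    n_transition = sum(is_geno_transition)
    observed = sum(s for s, t in zip(scaled, is_geno_transition) if t)

    # Null distribution: relabel which edges count as genotype transitions.
    at_least_as_large = 0
    for _ in range(n_perm):
        permuted_sum = sum(rng.sample(scaled, n_transition))
        if permuted_sum >= observed - 1e-12:  # tolerance for float summation order
            at_least_as_large += 1

    # Empirical P-value with an add-one correction (an assumed convention).
    p_value = (at_least_as_large + 1) / (n_perm + 1)
    return observed, p_value


# Ten edges; the two genotype transition edges carry the largest phenotype changes.
delta_pheno = [0.0, 0.1, -0.1, 2.0, -1.9, 0.05, 0.0, 0.1, 0.02, 0.03]
is_geno_transition = [False, False, False, True, True,
                      False, False, False, False, False]

observed, p_value = continuous_test(delta_pheno, is_geno_transition)
print(observed, p_value)
```

In this toy example only one of the 45 possible pairs of edges reaches the observed sum, so the empirical P-value lands near 1/45.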

2. Synchronous test (input: binary phenotype)

Do genotype transitions occur more often than expected by chance on phenotype transition edges than on phenotype non-transition edges?

This test is an extension of PhyC (see below), but with the goal of requiring a more stringent association between the genotype and phenotype. The number of edges on the tree where both a genotype transition and a phenotype transition occur is calculated. Then a permutation test is run where the classification of edges as genotype transition edges is randomized on the tree. The number of edges where the permuted genotype transitions coincide with a phenotype transition is recorded for each permutation, creating a null distribution. An empirical P-value is calculated by comparing the observed number of edges to the null distribution.

Transition edges are defined as in (1.) but phenotypes, which are now binary, are also classified in this way:

Genotype or phenotype edge type	parent node value	child node value
Transition	wild type (0)	mutant (1)
Transition	mutant (1)	wild type (0)
Non-transition	wild type (0)	wild type (0)
Non-transition	mutant (1)	mutant (1)
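A minimal Python sketch of the Synchronous Test, taking the two per-edge transition vectors as given. The function name is hypothetical; placing the permuted genotype transitions uniformly at random across edges and the add-one correction in the P-value are assumptions made for this example, not details taken from hogwash.

```python
import random


def synchronous_test(geno_transition, pheno_transition, n_perm=10000, seed=0):
    """Permutation test on the number of edges with both transitions."""
    rng = random.Random(seed)
    n_edges = len(geno_transition)
    n_geno = sum(geno_transition)

    # Observed overlap: edges where genotype and phenotype both transition.
    observed = sum(1 for g, p in zip(geno_transition, pheno_transition) if g and p)

    at_least_as_large = 0
    for _ in range(n_perm):
        # Place the same number of genotype transitions on random edges.
        permuted_edges = rng.sample(range(n_edges), n_geno)
        overlap = sum(1 for i in permuted_edges if pheno_transition[i])
        if overlap >= observed:
            at_least_as_large += 1

    p_value = (at_least_as_large + 1) / (n_perm + 1)
    return observed, p_value


# Twelve edges; genotype and phenotype transition on the same three edges.
geno_transition = [True, True, True] + [False] * 9
pheno_transition = [True, True, True] + [False] * 9

sync_observed, sync_p = synchronous_test(geno_transition, pheno_transition)
print(sync_observed, sync_p)
```

Only 1 of the 220 ways to place three transitions on twelve edges reproduces the full overlap, so the P-value here is small.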

3. PhyC (input: binary phenotype)

Does the genotype transition from wild type to mutant more often than expected by chance on edges where the phenotype is present than where the phenotype is absent?

This test is my implementation of the PhyC algorithm* as described in Farhat et al.’s 2013 Nature Genetics paper. If a genotype mutates more often on edges with the phenotypic trait of interest, there is a positive correlation between the genotype mutation and the phenotype. This approach controls for population structure by requiring the overlap of the phenotype with the genotype transition, rather than the overlap of the phenotype presence with genotype presence.

The number of edges on the tree where the genotype mutates (0 → 1) and the phenotype is present is calculated. Then a permutation test is run where the genotype mutations (0 → 1) are randomized on the tree. The number of edges where the permuted genotype mutation (0 → 1) coincides with phenotype presence is recorded for each permutation, creating a null distribution. An empirical P-value is calculated by comparing the observed number of edges to the null distribution.

Edge definitions are as follows:

Genotype: the only genotype edges of interest are edges where a mutation appears.
Phenotype: rather than consider changes in the phenotype, only consider the presence or absence of the phenotype on each edge.

Genotype edge type	parent node value	child node value
Transition	wild type (0)	mutant (1)
Non-transition	mutant (1)	wild type (0)
Non-transition	wild type (0)	wild type (0)
Non-transition	mutant (1)	mutant (1)

Phenotype edge type	Phenotype edge value
Present	mutant (1)
Absent	wild type (0)
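The PhyC edge definitions differ from the other two tests, so a short Python sketch may help. The names here are hypothetical, and two points are assumptions made for the example: the phenotype value of an edge is taken from its child node, and permuted mutations are placed uniformly at random across edges. The rule that pairs with fewer than 2 overlapping edges receive a P-value of 1 follows the implementation note below.

```python
import random


def phyc_edge_vectors(edges, geno_state, pheno_state):
    """Build the two per-edge vectors used by PhyC."""
    # Genotype: only wild type (0) -> mutant (1) counts; reversions do not.
    geno_mutation = [geno_state[p] == 0 and geno_state[c] == 1 for p, c in edges]
    # Phenotype: presence on the edge, read from the child node (assumed).
    pheno_present = [pheno_state[c] == 1 for _, c in edges]
    return geno_mutation, pheno_present


def phyc_test(geno_mutation, pheno_present, n_perm=10000, seed=0):
    """Permutation test on edges with a 0 -> 1 mutation and phenotype present."""
    rng = random.Random(seed)
    observed = sum(1 for g, p in zip(geno_mutation, pheno_present) if g and p)
    if observed < 2:
        return observed, 1.0  # convergence not detectable for this pair

    n_edges = len(geno_mutation)
    n_mut = sum(geno_mutation)
    at_least_as_large = 0
    for _ in range(n_perm):
        permuted_edges = rng.sample(range(n_edges), n_mut)
        overlap = sum(1 for i in permuted_edges if pheno_present[i])
        if overlap >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)


# Toy tree: root 7 -> internal nodes 5 and 6 -> tips 1 to 4.
edges = [(7, 5), (7, 6), (5, 1), (5, 2), (6, 3), (6, 4)]
geno_state = {7: 0, 5: 0, 6: 0, 1: 1, 2: 0, 3: 1, 4: 0}
pheno_state = {7: 0, 5: 0, 6: 0, 1: 1, 2: 0, 3: 1, 4: 1}

geno_mutation, pheno_present = phyc_edge_vectors(edges, geno_state, pheno_state)
phyc_observed, phyc_p = phyc_test(geno_mutation, pheno_present)
print(phyc_observed, phyc_p)
```

A six-edge tree is far too small to reach significance; it is used only to keep the edge vectors easy to check by hand.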

*Note: Our implementation of PhyC has some changes from the original:

PhyC was originally implemented with Bonferroni multiple test correction, but this implementation uses False Discovery Rate.
Hogwash reduces the multiple testing burden by testing only those genotype-phenotype pairs for which convergence is detectable; genotypes with fewer than 2 transition edges are excluded, and genotype-phenotype pairs with fewer than 2 edges where the genotype transition overlaps with phenotype presence are assigned a P-value of 1.
Ancestral reconstruction for genotypes and phenotypes is performed using only maximum likelihood (the original PhyC used multiple approaches).
Users only supply one phylogenetic tree to hogwash instead of three.

Largely, these changes were implemented to make hogwash fast and easy to use.
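To make the difference from Bonferroni concrete, here is a sketch of a False Discovery Rate adjustment in Python. The note above does not say which FDR procedure is used, so the Benjamini-Hochberg step-up procedure shown here is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest P first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.5]
fdr = benjamini_hochberg(p_values)
bonferroni = [min(1.0, p * len(p_values)) for p in p_values]
print(fdr)
print(bonferroni)
```

On these four P-values the Bonferroni adjustment multiplies each by 4, while the FDR adjustment is less conservative for all but the smallest.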

Comparing the three tests:

	Phenotype	Genotype	Question
Continuous	Continuous. Absolute value of phenotype change on each edge.	Transition edges: nodes unequal. Non-transition edges: nodes equal.	Does the phenotype change more than expected on genotype transition edges than on genotype non-transition edges?
Synchronous	Discrete. Transition edges: nodes unequal. Non-transition edges: nodes equal.	Transition edges: nodes unequal. Non-transition edges: nodes equal.	Do genotype transitions occur more often than expected on phenotype transition edges than on phenotype non-transition edges?
PhyC	Discrete. Ancestral reconstruction edge value: phenotype is 1 or 0.	Transition edges: parent node == 0 & child node == 1	Does the genotype transition from wild type (0) to mutant (1) more often than expected by chance on phenotype present (1) edges than phenotype absent (0) edges?

Grouping

A feature of hogwash is the ability to organize genotypes into biologically meaningful groups. Testing for an association between an individual SNP and a phenotype is quite stringent, but patterns may emerge when biologically related genotypes are grouped together. For example, grouping together all variants (insertions, deletions and SNPs) within a gene or promoter region could allow the user to identify a particular gene as being associated with a phenotype even when no individual variant within that gene has high penetrance in the isolates being tested. Grouping genotypes can increase the power to identify convergent evolution because groups capture larger trends in functional impact and reduce the multiple testing correction burden. Use cases for this method include grouping SNPs into genes or genes into pathways. Each of the three tests can be run on disaggregated data, or on aggregated data with the inclusion of a grouping key, which is described later in the wiki.

The user can choose between two grouping methods: either pre- or post-ancestral reconstruction grouping.

Post-ancestral reconstruction grouping occurs after the ancestral states and genotype transitions are determined for each individual (un-grouped) genotype. This method was the only grouping method in hogwash release 1.0.0 and is the default grouping method for release 1.2.0+. We recommend using post-ancestral reconstruction grouping as it is the most comprehensive method and treats each variant as having its own evolutionary history. However, post-ancestral reconstruction grouping is slower than grouping prior to ancestral reconstruction.
Pre-ancestral reconstruction grouping occurs before the genotype ancestral states are inferred; genotypes are grouped and then ancestral reconstruction is performed. Pre-ancestral reconstruction grouping is very fast, but is not as sensitive as post-ancestral reconstruction grouping. In fact, grouping pre-ancestral reconstruction may obscure some associations, especially if the individual loci that comprise the group substantially overlap at the tree tips. This was the only grouping method in hogwash release 1.1.0.

To illustrate the differences between the two grouping methods compare the following scenarios:

Pre-ancestral reconstruction grouping

We've constructed data such that 9 of the 13 tips of this tree have a variant within GENE1. Each variant is unique, but all originate in GENE1. The variants are grouped together prior to ancestral reconstruction; as a result, there are only two genotype transition edges identified in the plot on the right.

Post-ancestral reconstruction grouping

Using the same data, but grouping the variants together only after ancestral reconstruction we observe 9 genotype transition edges. The strength of the post-ancestral reconstruction method is in scenarios such as this, where functionally related but poorly penetrant variants occur in clonal populations. Note that post-ancestral reconstruction grouping is far slower than pre-ancestral reconstruction grouping as the ancestral reconstruction step is rather slow.
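The two scenarios differ in where the grouping step is applied. The Python sketch below shows that step only and leaves the ancestral reconstruction itself out. It assumes that grouping is a logical OR across the variants in a group, and that each single-tip variant's transition falls on its own terminal edge; the names and the data layout are hypothetical.

```python
def group_or(rows):
    """Elementwise logical OR across equal-length 0/1 vectors."""
    return [int(any(column)) for column in zip(*rows)]


n_tips = 13
n_variants = 9

# Hypothetical data echoing the scenario above: variant i occurs only at tip i.
tip_matrix = [[int(tip == variant) for tip in range(n_tips)]
              for variant in range(n_variants)]

# Pre-ancestral reconstruction grouping: collapse the variants at the tips.
# The single grouped vector would then be reconstructed once (not shown).
pre_grouped_tips = group_or(tip_matrix)

# Post-ancestral reconstruction grouping: each variant is reconstructed on its
# own (not shown), giving one transition vector per variant. Here each vector
# is indexed by terminal edge, with the transition assumed to sit on the edge
# leading to the tip that carries the variant.
transition_matrix = [[int(edge == variant) for edge in range(n_tips)]
                     for variant in range(n_variants)]
post_grouped_transitions = group_or(transition_matrix)

print(sum(pre_grouped_tips))          # tips carrying any GENE1 variant
print(sum(post_grouped_transitions))  # grouped genotype transition edges
```

Under pre-grouping the reconstruction sees only one presence/absence pattern for GENE1, so neighboring tips can be explained by a shared ancestral event. Under post-grouping every variant keeps its own transition edge before the union is taken.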

Next: running hogwash.
