
F) Exploring your hogwash results

Katie Saund edited this page Oct 9, 2020 · 25 revisions

Running hogwash can sometimes produce a large number of results. This page provides some suggestions for dissecting them.

Plot all of your variants: P-value vs. epsilon

Example code to plot your variants:

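library(tidyverse)

# Loading the hogwash output creates the object `hogwash_phyc`
load("phyc_output_file_name.rda")

# Pair each variant's FDR-corrected P-value with its epsilon value
df <- tibble(`P-value` = hogwash_phyc$hit_pvals$fdr_corrected_pvals,
             Epsilon = hogwash_phyc$convergence$epsilon)

# Jitter the points so that overlapping variants remain visible
df %>% 
  ggplot(aes(x = Epsilon, y = `P-value`)) + 
  geom_jitter()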

Use the raw P-value to adjust the P-value with your preferred method

If you would rather use Bonferroni to adjust your P-values, you can access the raw -log(P-values) in hogwash_phyc$raw_pvals, convert them back to P-values, and apply Bonferroni yourself.

p.adjust(exp(-hogwash_phyc$raw_pvals$neg_log_unadjusted_pvals), method = "bonferroni")

Highly significant variants with low convergence may be due to high betagenotype

We expect a positive correlation between the -log(P-value) and epsilon for the variants tested, as shown in this plot.

Sometimes variants clearly buck this trend; some variants are highly significant but also have low epsilon values. Notice many of the genotypes highlighted in red.
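One way to pull out these trend-bucking variants is to flag every variant that is significant but has a low epsilon. The following is a minimal sketch in base R, not part of hogwash: the data frame, its column names, and both cutoffs are hypothetical, and the values are made up for illustration.

```r
# Toy data standing in for real hogwash output (all values are made up).
variants <- data.frame(
  genotype     = c("g1", "g2", "g3", "g4", "g5"),
  neg_log_pval = c(0.5, 2.0, 4.5, 6.0, 5.5),
  epsilon      = c(0.01, 0.10, 0.30, 0.02, 0.45),
  stringsAsFactors = FALSE
)

# Hypothetical cutoffs; choose values that suit your own data.
pval_cutoff    <- -log(0.05)  # significant if -log(P-value) exceeds this
epsilon_cutoff <- 0.05        # "low convergence" if epsilon is below this

# Flag variants that are significant despite a low epsilon.
variants$bucks_trend <- variants$neg_log_pval > pval_cutoff &
  variants$epsilon < epsilon_cutoff

print(variants[variants$bucks_trend, ])
```

The flagged column can then be mapped to point color in a plot to highlight these variants.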

In our own data we find that some variants with relatively high betagenotype are highly significant but have very low epsilon values. The red genotypes from the previous plot are shown again here in red.

We've observed that some of these variants are highly dispersed on the tree. As a result, they can have fairly low confidence ancestral reconstructions, so their reconstruction is limited to a very shallow portion of the tree. In the following tree, which illustrates genotype transitions for one such genotype, notice that much of the tree is low confidence (grey edges). Sim25149 has a high betagenotype of 54.

Contrast this to another genotype with far more high confidence tree edges; Sim162 has a low betagenotype = 5.

Here are those two genotypes highlighted on the P-value vs. Epsilon plot.

In effect, restricting the reconstruction to the tips of the tree and other very recent edges means that genotype transitions may be found only on very short tree edges. Because of these short transition edges, together with the presence of some moderately long non-transition edges in the reconstruction, the sampling method of the permutation step will result in a highly significant P-value for the genotype.

Sometimes edges leading to tree tips get classified as low confidence

Sometimes your phenotype and genotype plots may show an edge leading to a tree tip classified as low confidence (grey). At first this can be alarming, because a phenotype or genotype reconstruction at a tree tip should always be high confidence: the value is simply inferred as the tip value. Keep in mind, however, that hogwash also considers the length of the edge. If an edge is very long (10% or more of the sum of all edge lengths in the tree), then the edge is classified as low confidence.

To illustrate this scenario we've re-plotted the above tree but now (a) recolored the tree edges that are >10% of the total tree length (orange) or <10% of the total tree length (black) and (b) used the tree edge lengths in the plot. Notice that the tree edges that are greyed out in the phenotype reconstruction largely match the overly long edges.

If you're interested in how edges are classified as overly long or not see the function hogwash::identify_short_edges().
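To check your own tree for overly long edges, the 10% rule described above can be sketched in a few lines of base R. This is an illustration only, not the code in hogwash::identify_short_edges(): the edge lengths are made up, and the variable names are hypothetical. With a real ape phylo object the lengths would come from tree$edge.length.

```r
# Toy edge lengths (made up); for a real tree use tree$edge.length.
edge_lengths <- c(0.02, 0.05, 0.40, 0.03, 0.08, 0.02, 0.40)

# Fraction of the total tree length contributed by each edge.
edge_fraction <- edge_lengths / sum(edge_lengths)

# Cutoff assumed from the text: 10% or more of the total tree length.
long_edge_cutoff <- 0.10
is_long_edge <- edge_fraction >= long_edge_cutoff

# Indices of edges that would be treated as low confidence under this rule.
which(is_long_edge)
```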

When the ancestral reconstructions are a bit... wacky.

Most of the time the ancestral reconstruction inferred from your tree & genotype looks pretty great (plausible, fairly parsimonious). This one, for example:

The ancestral reconstruction step is performed by ape::ace(). For binary inputs (genotypes or binary phenotypes) ancestral reconstruction is run twice: once with an equal rates (ER) model and once with an all rates different (ARD) model. Hogwash chooses the best model based on comparison between the models' log likelihoods and the Akaike Information Criterion (AIC). Most of the time, this seems to choose a pretty good looking reconstruction. For example, hogwash will choose ER (left) over ARD (right) in this case:

However, occasionally, either the chosen model isn't very convincing or neither model is convincing. This is a shortcoming of ancestral reconstruction. Hogwash may provide hard-to-interpret results if the ancestral reconstruction looks questionable.

Take, for example, this case below. Both the ER and ARD reconstruction results are pretty questionable as there are seemingly spurious red edges leading to black tips and black edges leading to the one red clonal group. A parsimonious reconstruction might have given us a tree where one red edge leads to the red clonal group; then hogwash would exclude this particular variant from analysis because it doesn't converge on the tree at all.

Keep this odd case in mind if you get some inexplicable hogwash results -- the issue may be the underlying ancestral reconstruction.
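If you want to check for yourself which model would be preferred for a suspicious genotype, the ER-versus-ARD comparison described above can be approximated with a plain AIC calculation. This is a sketch under stated assumptions, not hogwash's own selection code, and hogwash's exact rule may differ. The helper names aic() and choose_model() are hypothetical, and the log likelihoods are made up. With ape, each log likelihood would come from the loglik element of a fitted ape::ace() object.

```r
# AIC = 2k - 2 * logLik, where k is the number of free rate parameters.
aic <- function(loglik, n_params) {
  2 * n_params - 2 * loglik
}

# Pick the model with the lower AIC.
# For a binary trait, ER has 1 rate parameter and ARD has 2.
choose_model <- function(loglik_er, loglik_ard) {
  aic_values <- c(ER = aic(loglik_er, 1), ARD = aic(loglik_ard, 2))
  names(which.min(aic_values))
}

# Made-up log likelihoods: ARD fits only slightly better than ER here,
# so the extra parameter is not justified.
best_model <- choose_model(loglik_er = -20.5, loglik_ard = -20.1)
print(best_model)
```

If both models give similar AIC values and both reconstructions look implausible, treat the results for that genotype with caution.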