Comparing dates between simulated and inferred tree sequences #301
Comments
This is very interesting, thanks @nspope. I'm thinking that one of the strategies of this method is not to look at all the nodes, but simply to focus on those present in the inferred tree sequence. However, in an inferred TS with no mismatch, each mutation should be the source of a single ancestor, so this may not be that different from identifying nodes by the mutations directly above them? Maybe we should plot those as different colours on the plot above, to contrast? It might be worth summarising discrepancies from the predicted ages by looking at the histogram too (as in the Brandt paper). I wonder if it would be helpful to plot this?
Yes, I think that's right -- should coincide closely in this case. Though, it's useful to get a measure of similarity (between query node and "best-matching" true node), because these may be used to filter / weight the result.
Yes, although I think that when nodes are omitted due to polytomies it'll create artefacts in marginal statistics that don't really reflect dating error. For example, if calculating the distribution of pair coalescence times, a tree with large polytomies should have a dearth of younger pair-times relative to a similar binary tree. That's not really an issue with dating per se, as the node ages themselves could be totally reasonable.
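To make the polytomy point concrete, here is a toy sketch in plain Python (not tskit, and not code from this thread). The four-sample tree, the node ids, and the `pair_times` helper are all made up for illustration. The root has the same age in both trees; only the internal nodes are dropped.

```python
from itertools import combinations


def pair_times(parent, time, samples):
    """Return the sorted MRCA ages over all sample pairs in a single tree.

    `parent` maps child -> parent; `time` maps node -> age.
    (Hypothetical helper for illustration only.)
    """
    def ancestors(u):
        # Path from u up to the root, including u itself.
        path = [u]
        while u in parent:
            u = parent[u]
            path.append(u)
        return path

    times = []
    for a, b in combinations(samples, 2):
        anc_a = set(ancestors(a))
        # First node on b's path to the root that is also above a.
        mrca = next(u for u in ancestors(b) if u in anc_a)
        times.append(time[mrca])
    return sorted(times)


# Node ages are identical in both trees.
time = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 2.0, 6: 3.0}

# Binary tree: ((0,1)4,(2,3)5)6
binary = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6}
# Same root, but nodes 4 and 5 collapsed into a polytomy: (0,1,2,3)6
polytomy = {0: 6, 1: 6, 2: 6, 3: 6}

binary_times = pair_times(binary, time, range(4))
polytomy_times = pair_times(polytomy, time, range(4))
print(binary_times)
print(polytomy_times)
```

In the binary tree two of the six pairs coalesce below the root; in the polytomy every pair coalesces at the root, even though the root's age is unchanged.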
Anyway, it's useful to have a measure that doesn't rely on having any mutations in the tree sequence; that is "branch only".
As a side note, I could imagine doing something like this by (a) removing mutations from the original ts; (b) adding new mutations under infinite-sites with msprime to the original ts; (c) mapping the mutations to the new ts by parsimony;
For tree sequences inferred from simulations, we can't directly compare inferred dates to true dates because node sets differ between the inference and the simulation.
Instead, we've been looking at summary statistics such as the distribution of pairwise coalescence times, relative to what is expected under the coalescent with recombination. Unfortunately, time-indexed statistics are likely to be biased if there are polytomies in the inferred tree sequence, because the degrees of the polytomies are likely to depend on node ages. It's not clear to me how to systematically characterize/correct for these biases, so as to assess whether or not the inferred tree sequence has reasonable estimates for node dates. For example, polytomies may contribute to the distortions seen in #198 (comment). This seems especially problematic for comparing tsdate to other inference tools like Relate that only output binary trees. Also, marginal statistics are a pretty indirect way to get at what we're after (node-level predictions).
@petrelharp and @hfr1tz3 pointed me towards a tree discrepancy metric they developed in https://github.com/petrelharp/num_edges/blob/main/Discrepancy%20Function.ipynb, that gives a meaningful way to compare ages across different tree sequences by finding a mapping between nodes. For a given node in the inferred tree sequence, the idea is to find the node in the true tree sequence that shares the same set of descendants over the longest shared span. This is something like finding the "true" ancestral segment that is most similar to a given inferred ancestral segment.
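A rough sketch of the matching idea, in plain Python. This is not the implementation from the linked notebook. It assumes each tree sequence has already been flattened into a list of `(left, right, {node: frozenset_of_sample_descendants})` tuples, one per tree, sorted by position. With tskit that could presumably be built from `ts.trees()`, `tree.interval`, and `tree.samples(u)`. The `best_matches` name and the toy data are made up.

```python
from collections import defaultdict


def best_matches(inferred, true):
    """For each inferred node, find the true node that has an identical set of
    sample descendants over the longest total span.

    Both arguments are lists of (left, right, {node: frozenset(samples)}),
    one entry per tree, in left-to-right order (an assumed layout).
    Returns {inferred_node: (true_node, shared_span)}.
    """
    shared = defaultdict(float)  # (inferred node, true node) -> shared span
    i = j = 0
    while i < len(inferred) and j < len(true):
        il, ir, inodes = inferred[i]
        tl, tr, tnodes = true[j]
        overlap = min(ir, tr) - max(il, tl)
        if overlap > 0:
            # Index the true tree's nodes by their descendant sample set.
            by_set = defaultdict(list)
            for v, samples in tnodes.items():
                by_set[samples].append(v)
            for u, samples in inodes.items():
                for v in by_set.get(samples, ()):
                    shared[u, v] += overlap
        # Advance whichever tree ends first (both if they end together).
        if ir <= tr:
            i += 1
        if tr <= ir:
            j += 1

    best = {}
    for (u, v), span in shared.items():
        if u not in best or span > best[u][1]:
            best[u] = (v, span)
    return best


# Toy data: three samples (0, 1, 2) on a 100bp sequence.
true = [
    (0, 60, {10: frozenset({0, 1}), 11: frozenset({0, 1, 2})}),
    (60, 100, {12: frozenset({1, 2}), 11: frozenset({0, 1, 2})}),
]
inferred = [
    (0, 40, {5: frozenset({0, 1}), 6: frozenset({0, 1, 2})}),
    (40, 100, {7: frozenset({1, 2}), 6: frozenset({0, 1, 2})}),
]

matches = best_matches(inferred, true)
for u, (v, span) in sorted(matches.items()):
    print(f"inferred node {u} -> true node {v} (shared span {span})")
```

The matched pairs can then be used to plot inferred age against the age of the matched true node. The shared span is a natural weight or filter, and inferred nodes with no exact match are simply absent from the result. Rebuilding the sample sets for every tree is quadratic-ish and would be too slow for large simulations; an incremental approach over edge differences would be needed there.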
This seems to work quite well on a small (1Mb, 200 samples) example, plotting the inferred ages against the ages of the "most similar" nodes from the simulation:
and suggests that tsdate is doing better than we'd judge from the global statistics in #198 (comment). It'd be great to make this fast enough to work with larger simulations. Code: