Investigate effect of polytomies / topological uncertainty on posteriors #359
Comments
Another consideration is that a polytomy implies a greater total edge span than the original binary topologies, which I'd think would introduce bias. IIRC we don't see a relationship between arity and bias, however.
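To make the "greater total edge span" point concrete, here is a minimal sketch (not from the thread): it assumes the `remove_edges` and `unsupported_edges` helpers defined in the code below, and uses an illustrative msprime simulation rather than the stdpopsim setup used later.

```python
# Hypothetical illustration: total edge area (span * branch length) grows when
# mutationless internal edges are collapsed into polytomies. Assumes the
# remove_edges() and unsupported_edges() helpers defined later in this thread.
import msprime

ts = msprime.sim_ancestry(
    100, sequence_length=1e6, recombination_rate=1e-8,
    population_size=10_000, random_seed=1,
)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)
poly_ts = remove_edges(ts, unsupported_edges(ts))

def total_edge_area(ts):
    # Sum over edges of genomic span times branch length (parent time - child time)
    return sum(
        (e.right - e.left) * (ts.nodes_time[e.parent] - ts.nodes_time[e.child])
        for e in ts.edges()
    )

# The polytomy version should have the larger area (the effect discussed above)
print(total_edge_area(ts), total_edge_area(poly_ts))
```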
Now that we have a decent routine to create polytomies (tskit-dev/tskit#2926), we can test out the effect of polytomies on dating. Here's an example, using the true topologies, without and with induced polytomies (where edges without a mutation on them are removed to make a polytomy; see the trees below the plot). It appears as if making polytomies like this biases the mutation dates to younger times as the sample size increases:

Example first tree (original, then with induced polytomies):

FWIW, the pattern doesn't change much if we use the metadata-stored mutation times instead.

Click to reveal code to reproduce the plot above:

```python
import itertools
import collections
import numpy as np

def remove_edges(ts, edge_id_remove_list):
    edges_to_remove_by_child = collections.defaultdict(list)
    edge_id_remove_list = set(edge_id_remove_list)
    for remove_edge in edge_id_remove_list:
        e = ts.edge(remove_edge)
        edges_to_remove_by_child[e.child].append(e)

    # sort left-to-right for each child
    for k, v in edges_to_remove_by_child.items():
        edges_to_remove_by_child[k] = sorted(v, key=lambda e: e.left)
        # check no overlaps
        for e1, e2 in zip(edges_to_remove_by_child[k], edges_to_remove_by_child[k][1:]):
            assert e1.right <= e2.left

    # Sanity check: this means the topmost node will deal with modified edges left at the end
    assert ts.edge(-1).parent not in edges_to_remove_by_child

    new_edges = collections.defaultdict(list)
    tables = ts.dump_tables()
    tables.edges.clear()
    samples = set(ts.samples())
    # Edges are sorted by parent time, youngest first, so we can iterate over
    # nodes-as-parents visiting children before parents by using itertools.groupby
    for parent_id, ts_edges in itertools.groupby(ts.edges(), lambda e: e.parent):
        # Iterate through the ts edges *plus* the polytomy edges we created in previous steps.
        # This allows us to re-edit polytomy edges when the edges_to_remove are stacked
        edges = list(ts_edges)
        if parent_id in new_edges:
            edges += new_edges.pop(parent_id)
        if parent_id in edges_to_remove_by_child:
            for e in edges:
                assert parent_id == e.parent
                l = -1
                if e.id in edge_id_remove_list:
                    continue
                # NB: we go left to right along the target edges, reducing edge e as required
                for target_edge in edges_to_remove_by_child[parent_id]:
                    # As we go along the target_edges, gradually split e into chunks.
                    # If edge e is in the target_edge region, change the edge parent
                    assert target_edge.left > l
                    l = target_edge.left
                    if e.left >= target_edge.right:
                        # This target edge is entirely to the LHS of edge e, with no overlap
                        continue
                    elif e.right <= target_edge.left:
                        # This target edge is entirely to the RHS of edge e with no overlap.
                        # Since target edges are sorted by left coord, all other target edges
                        # are to RHS too, and we are finished dealing with edge e
                        tables.edges.append(e)
                        e = None
                        break
                    else:
                        # Edge e must overlap with current target edge somehow
                        if e.left < target_edge.left:
                            # Edge had region to LHS of target
                            # Add the left hand section (change the edge right coord)
                            tables.edges.add_row(left=e.left, right=target_edge.left, parent=e.parent, child=e.child)
                            e = e.replace(left=target_edge.left)
                        if e.right > target_edge.right:
                            # Edge continues after RHS of target
                            assert e.left < target_edge.right
                            new_edges[target_edge.parent].append(
                                e.replace(right=target_edge.right, parent=target_edge.parent)
                            )
                            e = e.replace(left=target_edge.right)
                        else:
                            # No more of edge e to RHS
                            assert e.left < e.right
                            new_edges[target_edge.parent].append(e.replace(parent=target_edge.parent))
                            e = None
                            break
                if e is not None:
                    # Need to add any remaining regions of edge back in
                    tables.edges.append(e)
        else:
            # NB: sanity check at top means that the oldest node will have no edges above,
            # so the last iteration should hit this branch
            for e in edges:
                if e.id not in edge_id_remove_list:
                    tables.edges.append(e)
    assert len(new_edges) == 0
    tables.sort()
    return tables.tree_sequence()

def unsupported_edges(ts, per_interval=False):
    """
    Return the internal edges that are unsupported by a mutation.
    If ``per_interval`` is True, each interval needs to be supported,
    otherwise, a mutation on an edge (even if there are multiple intervals
    per edge) will result in all intervals on that edge being treated
    as supported.
    """
    edges_to_remove = np.ones(ts.num_edges, dtype="bool")
    edges_to_remove[[m.edge for m in ts.mutations()]] = False
    # We don't remove edges above samples
    edges_to_remove[np.isin(ts.edges_child, ts.samples())] = False
    if per_interval:
        return np.where(edges_to_remove)[0]
    else:
        keep = ~edges_to_remove
        for p, c in zip(ts.edges_parent[keep], ts.edges_child[keep]):
            edges_to_remove[np.logical_and(ts.edges_parent == p, ts.edges_child == c)] = False
        return np.where(edges_to_remove)[0]

###########
from matplotlib import pyplot as plt
import stdpopsim
import tsdate
print(f"Using tsdate {tsdate.__version__}")
species = stdpopsim.get_species("HomSap")
model = species.get_demographic_model("AmericanAdmixture_4B11")
contig = species.get_contig("chr20", mutation_rate=model.mutation_rate, length_multiplier=0.1)
engine = stdpopsim.get_engine("msprime")
sizes = (1, 10, 100, 1000, 10000)
fig, axes = plt.subplots(len(sizes), 2, figsize=(10, 5*len(sizes)))
axes[0][0].set_title("True topology")
axes[0][1].set_title("Topology with induced polytomies")
axes[-1][0].set_xlabel("True mutation times")
axes[-1][1].set_xlabel("True mutation times")

for ax, s in zip(axes, sizes):
    samples = {'AFR': s, 'EUR': s, 'ASIA': s, 'ADMIX': s}
    ts = engine.simulate(model, contig, samples, seed=123)
    print(ts.num_trees, "trees,", ts.num_sites, "sites,", ts.num_edges, "edges")
    poly_ts = remove_edges(ts, unsupported_edges(ts))
    print(poly_ts.num_edges, "edges in unresolved ts")
    # Check it is doing the right thing
    dated_ts = tsdate.variational_gamma(
        ts.simplify(), mutation_rate=model.mutation_rate, normalisation_intervals=100
    )
    dated_poly_ts = tsdate.variational_gamma(
        poly_ts.simplify(), mutation_rate=model.mutation_rate, normalisation_intervals=100
    )
    x = [max(m.time for m in s.mutations) for s in ts.sites()]
    y = [max(m.time for m in s.mutations) for s in dated_ts.sites()]
    y_poly = [max(m.time for m in s.mutations) for s in dated_poly_ts.sites()]
    ax[0].set_ylabel(f"Inferred times ({ts.num_samples} samples)")
    ax[0].hexbin(x, y, bins="log", xscale="log", yscale="log")
    ax[0].plot(np.logspace(1, 5), np.logspace(1, 5), "-", c="red")
    ax[1].hexbin(x, y_poly, bins="log", xscale="log", yscale="log")
    ax[1].plot(np.logspace(1, 5), np.logspace(1, 5), "-", c="red")
```

resulting in the plot above, along with the printed tree, site, and edge counts for each sample size.
Nice!
This is great @hyanwong, thanks. I can think of a few things to try that might reduce bias -- will report back.
Thanks @nspope: from a few tests it appears as if the bias is less pronounced in tsinfer-inferred tree sequences. Plots below -- the right-hand column is tsinfer on the same data:
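The exact tsinfer settings used for these plots aren't shown in the thread; a minimal sketch of the extra re-inference step (building on the simulated `ts` and the `model` from the script above, with default tsinfer parameters) might look like this:

```python
# Hypothetical sketch of the tsinfer column: re-infer the topology from the
# simulated variant data, then date it as before. Parameters are illustrative,
# not the exact ones used for the plots.
import tsinfer
import tsdate

sample_data = tsinfer.SampleData.from_tree_sequence(ts, use_sites_time=False)
inferred_ts = tsinfer.infer(sample_data).simplify()
dated_inferred_ts = tsdate.variational_gamma(
    inferred_ts, mutation_rate=model.mutation_rate, normalisation_intervals=100
)
```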
Looking first at node ages ... the reason there's bias in dating nodes after introducing polytomies is that there's more mutational area than in the original binary trees. That is, we're increasing the total branch length, which means that when we match moments using segregating sites we end up shrinking the timescale. To be a bit more precise: the current normalisation strategy calculates the total edge area and the total number of mutations, then rescales time such that the expected number of mutations matches the observed number of mutations.

Instead, consider doing the following: for each tree, sample a path from a randomly selected leaf to the root, and only accumulate edge area and mutations on the sampled paths. This should be unbiased, because the "path length" is the same regardless of the presence of polytomies. In fact, we can do this sampling deterministically, because the probability that a randomly selected path passes through a given edge is proportional to the number of samples subtended by that edge. In other words, we normalise as before but weight edges by the number of samples they subtend.

Using this alternative "path normalisation" strategy seems to greatly help with bias (1000 samples, 10 Mb):

This more-or-less carries over for mutations:
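A sketch of the statistic this describes (not the actual tsdate implementation, which works on edge arrays in numba): weight each edge's area, and each mutation, by the fraction of samples it subtends, so the totals correspond to an average leaf-to-root path.

```python
# Illustrative sketch of "path normalisation": accumulate edge area and
# mutation counts, each weighted by the fraction of samples below the edge.
# The ratio then estimates mutations per unit time along an average
# leaf-to-root path, which is unchanged by collapsing nodes into polytomies.
import tskit

def path_weighted_stats(ts):
    n = ts.num_samples
    area = 0.0        # sample-weighted edge area (time units * genome length)
    num_muts = 0.0    # sample-weighted mutation count
    for tree in ts.trees():
        for u in tree.nodes():
            if tree.parent(u) != tskit.NULL:
                w = tree.num_samples(u) / n
                area += w * tree.branch_length(u) * tree.span
        for mut in tree.mutations():
            num_muts += tree.num_samples(mut.node) / n
    return area, num_muts
```

Rescaling time so that `mutation_rate * area` matches `num_muts` would then be the path analogue of the segregating-sites normalisation.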
Oh wow. This is amazing. Thanks Nate. Does it cause any overcorrection problems for tsinferred tree sequences? I assume it shouldn't... |
Another way to phrase this is that we're moment matching against a different summary statistic (rather than segregating sites): the expected number of differences between a single sample and the root. In my opinion, this choice of summary statistic is a conceptually more straightforward way to measure time with mutational density.
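In branch-length terms this statistic is just the mutation rate times the span-weighted root time; a tiny sketch, assuming contemporaneous samples at time 0 and a single root per tree:

```python
# Sketch of the summary statistic in branch-length terms: the expected number
# of differences between one sample and the root, summed across the sequence.
# Assumes samples at time 0 and exactly one root per tree.
def expected_sample_to_root_differences(ts, mutation_rate):
    total = 0.0
    for tree in ts.trees():
        # Every leaf-to-root path has length root_time when samples are at time 0
        total += tree.time(tree.root) * tree.span
    return mutation_rate * total
```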
I did a quick check on inferred simulated tree sequences -- the original routine was more or less unbiased (as Yan observed above) and the new routine does about the same. Would be interesting to compare the two on real data. Regardless, this new routine seems like the right approach.
That's great.
Absolutely. We should go with the new approach. I wonder how both approaches perform on reinference? I can check this once there are instructions for how to run the new version.
The API is exactly the same, with the new normalisation scheme used by default. The old normalisation scheme can be toggled if you pass
Great, thanks for the info. Is it currently much slower than the old version? It seems maybe not? |
It shouldn't be, but it would be good to check (if you enable logging it'll print out the time spent during normalisation). There's an additional pass over edges, but this is done in numba. It might add a few minutes on GEL- or UKBB-sized data, so it would be good to enable logging there to get a sense of the overhead.
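For reference, a minimal way to surface that logging, assuming tsdate reports progress through Python's standard `logging` module as implied above (the exact logger names, levels, and message formats are an assumption, not documented API):

```python
# Hypothetical sketch: turn on verbose logging before dating so that timing
# information (including the normalisation phase) is printed.
import logging
import tsdate

logging.basicConfig(level=logging.DEBUG)
dated = tsdate.variational_gamma(ts.simplify(), mutation_rate=model.mutation_rate)
```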
Actually, I don't think it'll change reinference at all -- ancestor building just uses the ordering of mutations, right? Normalisation won't change the order, just the inter-node time differences.
Hannes had an interesting idea: how do polytomies affect the variation in posterior times for a node? We could test this by taking a known topology and collapsing some of the nodes into polytomies, then dating, and looking at how the posterior distribution of times of the component nodes compares to the posterior distribution estimated for the collapsed polytomy.
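A sketch of how that experiment could be set up, building on the helpers and the simulated `ts`/`model` above; this uses node-time point estimates as a stand-in, whereas comparing full posterior spreads would additionally need tsdate's posterior output, which isn't shown here:

```python
# Hypothetical sketch: collapse unsupported edges into polytomies, date both
# versions, and compare the inferred age of each polytomy parent with the
# inferred ages of the binary nodes it absorbed.
import collections
import numpy as np
import tskit
import tsdate

ts = ts.simplify()  # work from a simplified copy so node IDs stay consistent
removed = unsupported_edges(ts)
poly_ts = remove_edges(ts, removed)
simple_poly_ts, node_map = poly_ts.simplify(map_nodes=True)

dated_binary = tsdate.variational_gamma(ts, mutation_rate=model.mutation_rate)
dated_poly = tsdate.variational_gamma(simple_poly_ts, mutation_rate=model.mutation_rate)

# Each removed edge (child -> parent) merges the child into a polytomy at the parent
absorbed = collections.defaultdict(set)
for edge_id in removed:
    e = ts.edge(edge_id)
    absorbed[e.parent].add(e.child)

for polytomy_parent, children in absorbed.items():
    new_id = node_map[polytomy_parent]
    if new_id == tskit.NULL:
        continue  # the parent node was itself simplified away
    t_poly = dated_poly.nodes_time[new_id]
    t_components = [dated_binary.nodes_time[u] for u in children | {polytomy_parent}]
    print(polytomy_parent, t_poly, np.mean(t_components), np.std(t_components))
```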