Incorporate recombination clock #5
The fundamental problem is that, as a consequence of the tree sequence data format, you inevitably lose some of the edges when a recombination occurs, and this leads the recombination clock to underestimate node age. This problem can be managed in the n=2 case (this is how GEVA works), but our current thinking is that it is an intractable problem for n=3 and above... or maybe not! I think it'll be worth bringing this up again with Gil.
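To make the clock concrete: for a haplotype segment shared by two lineages that diverged t generations ago, recombination breaks the segment at rate 2tr per bp (r per bp per generation on each lineage), so the segment length is roughly exponential with mean 1/(2tr). That gives the naive moment estimator sketched below, with illustrative values for r and L; note this is the idealised clock, before the edge-loss bias described above kicks in:

```python
# Naive recombination-clock estimate of the age of a shared haplotype.
# Segment length L is ~ Exponential(rate = 2 * t * r), so a moment
# estimator of the age is t_hat = 1 / (2 * r * L).
r = 1e-8  # recombination rate per bp per generation (illustrative)
L = 1_000_000  # observed shared haplotype length in bp (illustrative)
t_hat = 1 / (2 * r * L)
print(t_hat)  # ~50 generations
```

Because edges vanish from the tree sequence, the observed span overestimates nothing but the surviving edge's extent, which is why t_hat is biased downwards in practice.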
There's an interesting application of this to the phasing-singletons issue in #374. We could probably test (with real data) how much extra information there is in the haplotype lengths (i.e. the recombination clock) for phasing singletons, once the tsdated branch lengths are taken into account. The haplotype lengths were essentially what was used by Platt et al in Jody Hey's lab for phasing singletons (the inspiration for Nate's singleton phasing idea). Since any analysis of singletons is only based on data present at the very tips of the trees, we would be testing the "best case" scenario, where we might expect the recombination clock to be pretty valuable for estimating dates.

Essentially this is like trying to predict the phase (mum or dad) from (a) the difference in tsdated times between the samples in the individual and the two different parent nodes, and (b) the difference in span of the edge between the samples in the individual and the two different parent nodes. We can then do a logistic regression of phase on (a) and (b). You might also expect the average branch length to play a role in the robustness of the model, so I have done an analysis in R with that added as a covariate, and also put in the interaction terms. I've used log dates and log spans too.

This is only a rough way to do the analysis. A better way would be to use the theoretically expected IBD lengths (like in Platt et al) to calibrate the expected contribution of haplotype length on phasing. But I think it's good enough to get a feel for the relative magnitude of contributions.

```r
df <- read.csv("analysis_output.csv")
model <- glm(TruePhase ~ difflogtime * meanlogtime * logspandiff, data=df, family="binomial")
anova(model)
```

Gives:
In other words, we can probably increase the predictive accuracy of the phasing algorithm by about 25% more than we currently have (deviance of ...).

This gives a (very) rough idea of how much extra information we have in the recombination clock in the best-case situation of "recent" variants. However, as "recent" is relative to the number of samples in the dataset, I suspect that haplotype length will provide more information than this as the number of samples increases to tens or hundreds of thousands (we should test this when we have the GEL data).

Code for this example:

```python
import numpy as np
import pandas as pd
import tskit
import tszip
import tsdate

# file from https://github.com/tskit-dev/tsdate/issues/374#issuecomment-2176188312
ts = tszip.decompress("1kgp_bal1500_phased_singletons.trees.tsz")
dts = tsdate.date(ts, mutation_rate=1.29e-8, singletons_phased=False, progress=True)


# Identify singletons:
def id_singletons(correct_phase_ts, dts, combine_edges=False):
    """
    combine_edges=True is slower and uses the combined span of all edges with the same parent/child
    combination in the spandiff and spanmean values. This seems to make the fit worse, so not worth doing
    """
    assert np.all(correct_phase_ts.breakpoints(as_array=True) == dts.breakpoints(as_array=True))
    assert correct_phase_ts.num_nodes == dts.num_nodes
    assert np.all(correct_phase_ts.samples() == dts.samples())
    samples = set(dts.samples())
    data = []
    for cp_tree, tree in zip(correct_phase_ts.trees(), dts.trees()):
        if tree.num_edges == 0:
            continue
        for cp_site, site in zip(cp_tree.sites(), tree.sites()):
            if len(cp_site.mutations) != 1:
                continue
            if cp_site.mutations[0].node not in samples:
                continue
            cp_indiv = correct_phase_ts.nodes_individual[cp_site.mutations[0].node]
            indiv = dts.nodes_individual[site.mutations[0].node]
            assert cp_indiv == indiv
            if cp_indiv == tskit.NULL:
                continue
            cp_indiv = correct_phase_ts.individual(cp_indiv)
            nodes = np.array(cp_indiv.nodes)
            assert cp_site.mutations[0].node in nodes
            assert len(nodes) == 2
            assert nodes[0] != nodes[1]
            # Find parents above each node in this position
            parents = np.array([cp_tree.parent(u) for u in nodes])
            if combine_edges:
                edgespans = []
                for u, p in zip(nodes, parents):
                    use = np.logical_and(correct_phase_ts.edges_child == u, correct_phase_ts.edges_parent == p)
                    edgespans.append(np.sum(correct_phase_ts.edges_right[use] - correct_phase_ts.edges_left[use]))
            else:
                edges = [correct_phase_ts.edge(cp_tree.edge(u)) for u in nodes]
                for e, u, p in zip(edges, nodes, parents):
                    assert e.parent == p
                    assert e.child == u
                edgespans = [e.right - e.left for e in edges]
            # Both parents must exist, since we take times (and logs) of both below
            assert np.all(parents != tskit.NULL)
            times = dts.nodes_time[parents]
            difftime = np.diff(times)[0]
            meantime = np.mean(times)
            spandiff = np.diff(edgespans)[0]
            spanmean = np.mean(edgespans)
            difflogtime = np.diff(np.log(times))[0]
            meanlogtime = np.mean(np.log(times))
            logspandiff = np.diff(np.log(edgespans))[0]
            logspanmean = np.mean(np.log(edgespans))
            val = int(nodes[0] == cp_site.mutations[0].node)
            data.append([val, difftime, meantime, spandiff, spanmean, difflogtime, meanlogtime, logspandiff, logspanmean])
    return data


data = id_singletons(ts, dts)
df = pd.DataFrame(
    data,
    columns=["TruePhase", "difftime", "meantime", "spandiff", "spanmean",
             "difflogtime", "meanlogtime", "logspandiff", "logspanmean"]
)
df.to_csv("analysis_output.csv")
```
It would be easy to test using simulations how this proportion changes as the number of samples changes. Also, FWIW, here's the ANOVA table when ...
So if I simplify down to the first 100 samples, it indeed seems to be (as I would predict) that the extra information provided by the recombination clock decreases (in this case to 3% additional deviance reduction). I would hope that with e.g. the GEL or UKBB data, we would see the opposite, in that the haplotype length would provide quite a lot of extra useful information on branch lengths / singleton phasing that we aren't currently getting from tsdate.
Just adding this (obvious) issue so that we have somewhere to keep track of this discussion. Perhaps @awohns can add a little explanation about what the critical sticking points are?