Incorporate recombination clock #5

hyanwong · 2019-09-05T10:47:37Z

Just adding this (obvious) issue so that we have somewhere to keep track of this discussion on this. Perhaps @awohns can add a little explanation about what the critical sticking points are?

awohns · 2019-09-05T21:55:41Z

The fundamental problem is that a consequence of the tree sequence data format is that you inevitably lose some of the edges when a recombination occurs and this leads the recombination clock to underestimate node age. This problem can be managed in the n=2 case (this is how GEVA works), but our current thinking is that it is an intractable problem in n=3 and above... or maybe not! I think it'll be worth bringing this up again with Gil

hyanwong · 2024-06-24T12:53:44Z

There's an interesting application of this to the phasing singletons issue in #374. We could probably test (with real data), how much extra information there is in the haplotype lengths (i.e. the recombination clock) for phasing singletons, once the tsdated branch lengths are taken into account. The haplotype lengths were essentially what was used by Platt et al in Jody Hey's lab for phasing singletons (the inspiration for Nate's singleton phasing idea)

Since any analysis of singletons is only based on data present at the very tips of the trees, we would be testing the "best case" scenario, where we might expect the recombination clock to be pretty valuable for estimating dates.

Essentially this is like trying to predict the phase (mum or dad) from (a) the difference in tsdated times between the samples in the individual and the two different parent nodes and (b) the difference in span of the edge between the samples in the individual and the two different parent nodes.

We can then do a logistic regression of phase on (a) and (b). You might also expect the average branch length to play a role in the robustness of the model, so I have done an analysis in R with that added that as a covariate, and also put in the interaction terms. I've used log dates and log spans too.

This is only a rough way to do the analysis. A better way would be to use the theoretically expected IBD lengths (like in Platt et al) to calibrate the expected contribution of haplotype length on phasing. But I think it's good enough to get a feel for the relative magnitude of contributions.

df <- read.csv("analysis_output.csv")
model <- glm(TruePhase ~ difflogtime * meanlogtime * logspandiff, data=df, family="binomial")
anova(model)

Gives:

Analysis of Deviance Table

Model: binomial, link: logit

Response: TruePhase

Terms added sequentially (first to last)


                                    Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                               119207     165246              
difflogtime                          1  19540.4    119206     145705 < 2.2e-16 ***
meanlogtime                          1      8.8    119205     145697  0.002975 ** 
logspandiff                          1   5365.7    119204     140331 < 2.2e-16 ***
difflogtime:meanlogtime              1    778.3    119203     139552 < 2.2e-16 ***
difflogtime:logspandiff              1      1.6    119202     139551  0.204540    
meanlogtime:logspandiff              1    199.3    119201     139352 < 2.2e-16 ***
difflogtime:meanlogtime:logspandiff  1      1.0    119200     139351  0.322044    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In other words, we can probably increase the predictive accuracy of the phasing algorithm by about 25% more than we currently have (deviance of logspandiff: 5365.7 compared to that of difflogtime: 19540.4), by also incorporating the length of the haplotype into the calculation.

This gives a (very) rough idea of how much extra information we have in the recombination clock in the best-case situation of "recent" variants. However, as "recent" is relative to the number of samples in the dataset, I suspect that haplotype length will provide more information than this as the number of samples increases to tens or hundreds of thousands (we should test this when we have the GEL data).

Code for this example

import tszip
import tsdate

# file from https://github.com/tskit-dev/tsdate/issues/374#issuecomment-2176188312
ts = tszip.decompress("1kgp_bal1500_phased_singletons.trees.tsz")
dts = tsdate.date(ts, mutation_rate=1.29e-8, singletons_phased=False, progress=True)

# Identify singletons:
def id_singletons(correct_phase_ts, dts, combine_edges=False):
    """
    combine_edges=True is slower and uses the combined span of all edges with the same parent/child
    combination in the spandiff and spanmean values. This seesm to make the fit worse, so not worth doing
    """
    assert np.all(correct_phase_ts.breakpoints(as_array=True) == dts.breakpoints(as_array=True))
    assert correct_phase_ts.num_nodes == dts.num_nodes
    assert np.all(correct_phase_ts.samples() == dts.samples())
    samples = set(dts.samples())
    data = []
    for cp_tree, tree in zip(correct_phase_ts.trees(), dts.trees()):
        if tree.num_edges == 0:
            continue
        for cp_site, site in zip(cp_tree.sites(), tree.sites()):
            if len(cp_site.mutations) != 1:
                continue
            if cp_site.mutations[0].node not in samples:
                continue
            cp_indiv = correct_phase_ts.nodes_individual[cp_site.mutations[0].node]
            indiv = dts.nodes_individual[site.mutations[0].node]
            assert cp_indiv == indiv
            if cp_indiv == tskit.NULL:
                continue
            cp_indiv = correct_phase_ts.individual(cp_indiv)
            nodes = np.array(cp_indiv.nodes)
            assert cp_site.mutations[0].node in nodes
            assert len(nodes) == 2
            assert nodes[0] != nodes[1]
            # Find parents above each node in this position
            parents = np.array([cp_tree.parent(u) for u in nodes])
            if combine_edges:
                edgespans = []
                for u, p in zip(nodes, parents):
                    use = np.logical_and(correct_phase_ts.edges_child == u, correct_phase_ts.edges_parent == p)
                    edgespans.append(np.sum(correct_phase_ts.edges_right[use] - correct_phase_ts.edges_left[use]))
            else:
                edges = [correct_phase_ts.edge(cp_tree.edge(u)) for u in nodes]
                for e, u, p in zip(edges, nodes, parents):
                    assert e.parent == p
                    assert e.child == u
                edgespans = [e.right - e.left for e in edges]
            assert parents[0] != tskit.NULL
            times = dts.nodes_time[parents]
            difftime = np.diff(times)[0]
            meantime = np.mean(times)
            spandiff = np.diff(edgespans)[0]
            spanmean = np.mean(edgespans)
            difflogtime = np.diff(np.log(times))[0]
            meanlogtime = np.mean(np.log(times))
            logspandiff = np.diff(np.log(edgespans))[0]
            logspanmean = np.mean(np.log(edgespans))
            val = int((nodes[0] == cp_site.mutations[0].node))
            data.append([val, difftime, meantime, spandiff, spanmean, difflogtime, meanlogtime, logspandiff, logspanmean])
    return data
data = id_singletons(ts, dts)

import pandas as pd
df = pd.DataFrame(
    data,
    columns=["TruePhase", "difftime", "meantime",  "spandiff", "spanmean", "difflogtime", "meanlogtime", "logspandiff", "logspanmean"]
)
df.to_csv("analysis_output.csv")

hyanwong · 2024-06-25T17:53:39Z

It would be easy to test using simulations how this proportion is changed as the number of samples changes. Also, FWIW, here's the ANOVA table when logspandiff is first. So it looks like there is about the same amount of information (actually a tad more) in the haplotype length as we are using in the node age differences. But (as expected), the majority of information is shared between haplotype length and tsdate-estimated time..

                                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                            119207     165246              
spandiff                          1  20991.6    119206     144254 < 2.2e-16 ***
difflogtime                       1   3886.5    119205     140368 < 2.2e-16 ***
meanlogtime                       1      7.0    119204     140361  0.008256 ** 
spandiff:difflogtime              1      4.1    119203     140357  0.043326 *  
spandiff:meanlogtime              1    400.4    119202     139956 < 2.2e-16 ***
difflogtime:meanlogtime           1      2.5    119201     139954  0.113650    
spandiff:difflogtime:meanlogtime  1     15.4    119200     139938 8.809e-05 ***

hyanwong · 2024-06-25T18:06:35Z

It would be easy to test using simulations how this proportion is changed as the number of samples changes

So if I simplify down to the first 100 samples, it indeed seems to be (as I would predict) that the extra information provided by the recombination clock decreases (in this case to 3% additional deviance reduction). I would hope that with e.g. the GEL or UKBB data, we would see the opposite, in that the haplotype length would provide quite a lot of extra useful information on branch lengths / singleton phasing that we aren't currently getting from tsdate.

                                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                             16623      23045              
difflogtime                       1   9289.8     16622      13755 < 2.2e-16 ***
meanlogtime                       1      0.4     16621      13755   0.53894    
spandiff                          1    301.6     16620      13453 < 2.2e-16 ***
difflogtime:meanlogtime           1     32.6     16619      13421 1.143e-08 ***
difflogtime:spandiff              1      3.8     16618      13417   0.05229 .  
meanlogtime:spandiff              1     25.5     16617      13392 4.455e-07 ***
difflogtime:meanlogtime:spandiff  1      4.2     16616      13387   0.03971 *

awohns added this to the tsdate Wish List milestone Nov 21, 2020

awohns added the enhancement New feature or request label Nov 21, 2020

hyanwong mentioned this issue Jun 24, 2024

Phasing singletons with expectation propagation #374

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorporate recombination clock #5

Incorporate recombination clock #5

hyanwong commented Sep 5, 2019

awohns commented Sep 5, 2019 •

edited

Loading

hyanwong commented Jun 24, 2024 •

edited

Loading

hyanwong commented Jun 25, 2024 •

edited

Loading

hyanwong commented Jun 25, 2024

Incorporate recombination clock #5

Incorporate recombination clock #5

Comments

hyanwong commented Sep 5, 2019

awohns commented Sep 5, 2019 • edited Loading

hyanwong commented Jun 24, 2024 • edited Loading

hyanwong commented Jun 25, 2024 • edited Loading

hyanwong commented Jun 25, 2024

awohns commented Sep 5, 2019 •

edited

Loading

hyanwong commented Jun 24, 2024 •

edited

Loading

hyanwong commented Jun 25, 2024 •

edited

Loading