Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorporate recombination clock #5

Open
hyanwong opened this issue Sep 5, 2019 · 4 comments
Open

Incorporate recombination clock #5

hyanwong opened this issue Sep 5, 2019 · 4 comments
Labels
enhancement New feature or request

Comments

@hyanwong
Copy link
Member

hyanwong commented Sep 5, 2019

Just adding this (obvious) issue so that we have somewhere to keep track of this discussion on this. Perhaps @awohns can add a little explanation about what the critical sticking points are?

@awohns
Copy link
Member

awohns commented Sep 5, 2019

The fundamental problem is that a consequence of the tree sequence data format is that you inevitably lose some of the edges when a recombination occurs and this leads the recombination clock to underestimate node age. This problem can be managed in the n=2 case (this is how GEVA works), but our current thinking is that it is an intractable problem in n=3 and above... or maybe not! I think it'll be worth bringing this up again with Gil

@awohns awohns added this to the tsdate Wish List milestone Nov 21, 2020
@awohns awohns added the enhancement New feature or request label Nov 21, 2020
@hyanwong
Copy link
Member Author

hyanwong commented Jun 24, 2024

There's an interesting application of this to the phasing singletons issue in #374. We could probably test (with real data), how much extra information there is in the haplotype lengths (i.e. the recombination clock) for phasing singletons, once the tsdated branch lengths are taken into account. The haplotype lengths were essentially what was used by Platt et al in Jody Hey's lab for phasing singletons (the inspiration for Nate's singleton phasing idea)

Since any analysis of singletons is only based on data present at the very tips of the trees, we would be testing the "best case" scenario, where we might expect the recombination clock to be pretty valuable for estimating dates.

Essentially this is like trying to predict the phase (mum or dad) from (a) the difference in tsdated times between the samples in the individual and the two different parent nodes and (b) the difference in span of the edge between the samples in the individual and the two different parent nodes.

We can then do a logistic regression of phase on (a) and (b). You might also expect the average branch length to play a role in the robustness of the model, so I have done an analysis in R with that added that as a covariate, and also put in the interaction terms. I've used log dates and log spans too.

This is only a rough way to do the analysis. A better way would be to use the theoretically expected IBD lengths (like in Platt et al) to calibrate the expected contribution of haplotype length on phasing. But I think it's good enough to get a feel for the relative magnitude of contributions.

df <- read.csv("analysis_output.csv")
model <- glm(TruePhase ~ difflogtime * meanlogtime * logspandiff, data=df, family="binomial")
anova(model)

Gives:

Analysis of Deviance Table

Model: binomial, link: logit

Response: TruePhase

Terms added sequentially (first to last)


                                    Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                               119207     165246              
difflogtime                          1  19540.4    119206     145705 < 2.2e-16 ***
meanlogtime                          1      8.8    119205     145697  0.002975 ** 
logspandiff                          1   5365.7    119204     140331 < 2.2e-16 ***
difflogtime:meanlogtime              1    778.3    119203     139552 < 2.2e-16 ***
difflogtime:logspandiff              1      1.6    119202     139551  0.204540    
meanlogtime:logspandiff              1    199.3    119201     139352 < 2.2e-16 ***
difflogtime:meanlogtime:logspandiff  1      1.0    119200     139351  0.322044    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In other words, we can probably increase the predictive accuracy of the phasing algorithm by about 25% more than we currently have (deviance of logspandiff: 5365.7 compared to that of difflogtime: 19540.4), by also incorporating the length of the haplotype into the calculation.

This gives a (very) rough idea of how much extra information we have in the recombination clock in the best-case situation of "recent" variants. However, as "recent" is relative to the number of samples in the dataset, I suspect that haplotype length will provide more information than this as the number of samples increases to tens or hundreds of thousands (we should test this when we have the GEL data).

Code for this example
import tszip
import tsdate

# file from https://github.com/tskit-dev/tsdate/issues/374#issuecomment-2176188312
ts = tszip.decompress("1kgp_bal1500_phased_singletons.trees.tsz")
dts = tsdate.date(ts, mutation_rate=1.29e-8, singletons_phased=False, progress=True)

# Identify singletons:
def id_singletons(correct_phase_ts, dts, combine_edges=False):
    """
    combine_edges=True is slower and uses the combined span of all edges with the same parent/child
    combination in the spandiff and spanmean values. This seesm to make the fit worse, so not worth doing
    """
    assert np.all(correct_phase_ts.breakpoints(as_array=True) == dts.breakpoints(as_array=True))
    assert correct_phase_ts.num_nodes == dts.num_nodes
    assert np.all(correct_phase_ts.samples() == dts.samples())
    samples = set(dts.samples())
    data = []
    for cp_tree, tree in zip(correct_phase_ts.trees(), dts.trees()):
        if tree.num_edges == 0:
            continue
        for cp_site, site in zip(cp_tree.sites(), tree.sites()):
            if len(cp_site.mutations) != 1:
                continue
            if cp_site.mutations[0].node not in samples:
                continue
            cp_indiv = correct_phase_ts.nodes_individual[cp_site.mutations[0].node]
            indiv = dts.nodes_individual[site.mutations[0].node]
            assert cp_indiv == indiv
            if cp_indiv == tskit.NULL:
                continue
            cp_indiv = correct_phase_ts.individual(cp_indiv)
            nodes = np.array(cp_indiv.nodes)
            assert cp_site.mutations[0].node in nodes
            assert len(nodes) == 2
            assert nodes[0] != nodes[1]
            # Find parents above each node in this position
            parents = np.array([cp_tree.parent(u) for u in nodes])
            if combine_edges:
                edgespans = []
                for u, p in zip(nodes, parents):
                    use = np.logical_and(correct_phase_ts.edges_child == u, correct_phase_ts.edges_parent == p)
                    edgespans.append(np.sum(correct_phase_ts.edges_right[use] - correct_phase_ts.edges_left[use]))
            else:
                edges = [correct_phase_ts.edge(cp_tree.edge(u)) for u in nodes]
                for e, u, p in zip(edges, nodes, parents):
                    assert e.parent == p
                    assert e.child == u
                edgespans = [e.right - e.left for e in edges]
            assert parents[0] != tskit.NULL
            times = dts.nodes_time[parents]
            difftime = np.diff(times)[0]
            meantime = np.mean(times)
            spandiff = np.diff(edgespans)[0]
            spanmean = np.mean(edgespans)
            difflogtime = np.diff(np.log(times))[0]
            meanlogtime = np.mean(np.log(times))
            logspandiff = np.diff(np.log(edgespans))[0]
            logspanmean = np.mean(np.log(edgespans))
            val = int((nodes[0] == cp_site.mutations[0].node))
            data.append([val, difftime, meantime, spandiff, spanmean, difflogtime, meanlogtime, logspandiff, logspanmean])
    return data
data = id_singletons(ts, dts)

import pandas as pd
df = pd.DataFrame(
    data,
    columns=["TruePhase", "difftime", "meantime",  "spandiff", "spanmean", "difflogtime", "meanlogtime", "logspandiff", "logspanmean"]
)
df.to_csv("analysis_output.csv")

@hyanwong
Copy link
Member Author

hyanwong commented Jun 25, 2024

It would be easy to test using simulations how this proportion is changed as the number of samples changes. Also, FWIW, here's the ANOVA table when logspandiff is first. So it looks like there is about the same amount of information (actually a tad more) in the haplotype length as we are using in the node age differences. But (as expected), the majority of information is shared between haplotype length and tsdate-estimated time..

                                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                            119207     165246              
spandiff                          1  20991.6    119206     144254 < 2.2e-16 ***
difflogtime                       1   3886.5    119205     140368 < 2.2e-16 ***
meanlogtime                       1      7.0    119204     140361  0.008256 ** 
spandiff:difflogtime              1      4.1    119203     140357  0.043326 *  
spandiff:meanlogtime              1    400.4    119202     139956 < 2.2e-16 ***
difflogtime:meanlogtime           1      2.5    119201     139954  0.113650    
spandiff:difflogtime:meanlogtime  1     15.4    119200     139938 8.809e-05 ***

@hyanwong
Copy link
Member Author

It would be easy to test using simulations how this proportion is changed as the number of samples changes

So if I simplify down to the first 100 samples, it indeed seems to be (as I would predict) that the extra information provided by the recombination clock decreases (in this case to 3% additional deviance reduction). I would hope that with e.g. the GEL or UKBB data, we would see the opposite, in that the haplotype length would provide quite a lot of extra useful information on branch lengths / singleton phasing that we aren't currently getting from tsdate.

                                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                             16623      23045              
difflogtime                       1   9289.8     16622      13755 < 2.2e-16 ***
meanlogtime                       1      0.4     16621      13755   0.53894    
spandiff                          1    301.6     16620      13453 < 2.2e-16 ***
difflogtime:meanlogtime           1     32.6     16619      13421 1.143e-08 ***
difflogtime:spandiff              1      3.8     16618      13417   0.05229 .  
meanlogtime:spandiff              1     25.5     16617      13392 4.455e-07 ***
difflogtime:meanlogtime:spandiff  1      4.2     16616      13387   0.03971 *  

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants