Phasing singletons with expectation propagation #374

nspope · 2024-05-03T19:59:06Z

It'll frequently be the case that singletons are not phased. Singleton variants aren't used by tsinfer, but are crucial for correct dating. Here's how we'll handle unphased singletons:

Split edges leading to the same individual into "phase blocks" -- e.g. the intersection of intervals for these edges. For example, over some region of the genome the pair of edges leading to the same individual might have coordinates $[l_i, r_i]$ and $[l_j, r_j]$. The phase block for these two edges would have coordinates $[\max(l_i, l_j), \min(r_i, r_j)]$.
Use an expectation propagation update by phase block rather than separately for the two constituent edges. For example, say we have a phase block where the parents of the constituent edges have ages $t_i$ and $t_j$. The children of the edges are both present-day haplotypes (of age 0). If the phase block has (mutation-weighted) span $\mu$ and contains $y$ singleton mutations, then the total mutational target is $\lambda_{ij} = \mu (t_i + t_j)$ and the mutation likelihood is $y \sim \mathrm{Poisson}(\lambda_{ij})$.

So, if we have variational posteriors $t_i \sim \mathrm{Gamma}(\alpha_i, \beta_i), t_j \sim \mathrm{Gamma}(\alpha_j, \beta_j)$, the partition function used in the expectation propagation update for the phase block is,
$$\int_0^{\infty} \int_0^{\infty} t_i^{\alpha_i - 1} e^{-\beta_i t_i} t_j^{\alpha_j - 1} e^{-\beta_j t_j} (t_i + t_j)^{y} e^{-\mu (t_i + t_j)} dt_i dt_j = \mathcal{B}(\alpha_i, \alpha_j) \frac{\Gamma(\alpha_i + \alpha_j + y)}{(\beta_i + \mu)^{\alpha_i + \alpha_j + y}} {_2} F_1 (\alpha_j, \alpha_i + \alpha_j + y; \alpha_i + \alpha_j; 1 - \frac{\beta_j + \mu}{\beta_i + \mu})$$ and we get the updated moments of $t_i, t_j$ by differentiating this wrt $\alpha, \beta$ as usual.

Finally, we will want to phase individual mutations (e.g. for normalisation, or to date the mutations). Let $z_{i}$ be an indicator that the mutation is on the branch leading to parent $i$. Conditional on node ages, $\mathbb{P}(z_i = 1 \mid t_i, t_j) = \frac{t_i}{t_i + t_j}$. We have variational posteriors for node ages rather than point estimates, so the unconditional probability is $\mathbb{P}(z_i = 1) \propto \int_0^\infty \int_0^\infty \mathbb{P}(z_i = 1 \mid t_i, t_j) t_i^{\alpha_i - 1} e^{-\beta_i t_i} t_j^{\alpha_j - 1} e^{-\beta_j t_j} dt_i dt_j$. This integral has a very similar looking solution to the partition function above (e.g. with integer perturbations to the parameters).

This gets substantially more complicated if the individual in question is not in the present day (especially if the age of the individual is unknown). So for now we'll have to assume that mutations on ancient samples are phased (I think these are often pseudohaploid sequences anyways).

nspope · 2024-05-03T20:05:36Z

I think the big API question is whether we want singletons to be present in the input tree sequence or not. If the former, they could be rephased (e.g. the output might swap node labels for some mutations). If the latter, the question is how to pass them into tsdate and how to return them (e.g. should they just be used for dating? Should they be added to the mutation table?)

In general we'll have to assume singletons from a particular individual are all phased or all unphased.

hyanwong · 2024-05-03T20:22:07Z

I don't think the standard tsdate should mess with anything other than the node and mutation times (and metadata). There's a cleanliness to making sure that the dating algorithm doesn't mess with the number of rows in the tables.

But if a pair of same-length arrays (individual_ids, singleton_position) is passed in, it would, I guess, be possible to pass out an array of node IDs (and times?) of the same length, specify which nodes are the most likely for the singleton at each position (or I guess you could pass out a probability of the singleton being in the first vs the second node).

You could then have a separate tsdate.util.XXX function to add these singletons to the resulting tree sequence.

nspope · 2024-05-03T20:54:48Z

Hmm ... so this wouldn't be usable from the command-line, then? It seems more clunky to me to have a series of functions that have to be called in a particular order rather than a single invocation, with a flag phase_singletons for example.

I also feel this way about split_disjoint_nodes, which also modifies tables, in that this seems to universally improve the performance of the dating algorithm. So it should really be done by default, not through a separate routine which will (probably) be skipped by many users.

hyanwong · 2024-05-03T22:41:29Z

this wouldn't be usable from the command-line, then

That's the same for input, right? I think you would input and output the data from/to files or from/to stdin/stdout? But maybe Jerome has a better idea?

it should really be done by default, not through a separate routine

There is a "tsdate.pre_process()" routine, which I think is how we should incorporate split_disjoint_nodes. We should recommend that this is done before dating.

hyanwong · 2024-05-03T22:45:03Z

this wouldn't be usable from the command-line, then

Just to add, you need a way to input the data that is independent of the tree sequence (which assumes phased data), so incorporating singletons via the CLI will always involve messing with extra files. It doesn't seem too bad to me to output extra files in this case too.

hyanwong · 2024-05-03T22:47:34Z

Oh, and one last thought: I suppose that you could have a wrapper function that called all this stuff in the right order. So tsdate.date() is guaranteed not to change the topology, just the times, but e.g. tsdate.date_with_pre_and_post_processing(...) wraps all the commonly used steps together. We'd need to think of a better name, of course.

nspope · 2024-05-03T23:19:09Z

I seem to be unusually fixated on this particular API detail, so I'll work on the lower-level stuff for the moment and we can revisit once we've got an idea of how all this works in practice. But:

Just to add, you need a way to input the data that is independent of the tree sequence (which assumes phased data)

I think this is where I'm hung up -- you don't need to supply additional data if the singletons are already in the TS. If they're mapped to a edge, then they're implicitly phased. The phasing could be wrong, of course, so it's worth having a routine to correct the mapping. This seems much cleaner to me than having multiple input/output streams and utility functions, which are confusing to use and maintain.

Maybe a compromise here is to assume that only the mutations in the TS are going to be used for dating. So, singletons are included in the TS, regardless of whether they're well-phased or randomly assigned to a branch. But there's a flag to date which makes the dating algorithm agnostic to phase of singleton branches. However the dating algorithm doesn't modify the tables-- there would be a separate utility that could adjust the phase of singletons based on the node times.

The reason I don't like a separate input/output stream for singleton mutations is because in general, we won't be able to support a mix of phased and unphased mutations in the same individual, with a principled algorithm like the one above. What do you do if the input TS was subset down from tsinfer output, so that there's singletons already in there? Throw an error, in which case we'd need another utility function to strip singleton mutations? Silently assume that they're unphased and lump them in with the additional input stream of singleton mutations? IMO it's nicer to assume the data is wrapped in a single input, and have it so that additional arguments modify the behavior of the algorithm rather than augment the input.

I suppose that you could have a wrapper function that called all this stuff in the right order

This is a nice idea!

hyanwong · 2024-05-04T08:04:41Z

If they're mapped to a edge, then they're implicitly phased.

I guess this is the main problem I have. I'm not convinced we should be adding unphased things to the tree sequence by forcing a random phasing. Or at least, we would need to think hard about how unphased stuff should be represented in a tree sequence. I think @jeromekelleher and/or @petrelharp should comment here though. There might be a way to put a mutation in a table without an associated node (perhaps tskit.NULL in the node field?) so that it doesn't actually change the genotypes. I think this is a much bigger issue though.

jeromekelleher · 2024-05-07T08:45:38Z

How about we make the internal functions first, and decide about the external API later? The internal function would be based on numpy arrays as input and output, taking an input tree sequence as immutable? So, write the absolute minimal core piece of functionality in a way that is convenient and non-destructive with tests, and then we can talk about the external API more fruitfully I think.

I agree with @nspope that it should be easy to do the right things here from a CLI invocation.

hyanwong · 2024-05-07T08:49:03Z

That seems like an excellent plan @jeromekelleher

nspope · 2024-06-12T03:19:48Z

Finally got this working... first taking a look at ages of singleton mutations. This is the usual equillibrium population, human-ish parameters setting. We have to use the mutation posterior in this case. I'm comparing against the branch midpoint on the true trees to avoid randomness that we can't hope to infer.

Correct vs phase-agnostic, with true topologies:

and with inferred topologies:

Looks pretty much identical!

nspope · 2024-06-12T04:33:40Z

For a baseline, here's the original algorithm but with a random phase assigned to singletons:

True trees,

Inferred trees,

hyanwong · 2024-06-12T11:11:46Z

Wow. This is very neat, and convincing too. Does adding unphased singletons make the algorithm more numerically unstable?

nspope · 2024-06-12T16:27:46Z

I think I've got the numerics under control (previous issues were a bug). But we'll see, I think the GEL 40K is a good litmus test

petrelharp · 2024-06-12T17:59:56Z

Holy shit, that looks great.

nspope · 2024-06-13T17:57:30Z

Whoops, mistake on my part. These definitely looked too good to be true. Here's the correct comparison,

Still pretty good. While we integrate over phase to get mutation age, we can also get MAP estimates of what the phase should be. This gets it right ~80% of the time in the example above.

duncanMR · 2024-06-18T14:03:27Z

To test the dating performance on singletons with new phase-agnostic algorithm, I used the 1500 individual subset I made of the 1kgp-chr20p data. That way, I could use the singletons zarr from here which has a column for the true phase. I first had to fix an oversight: there were ~300 sites with two alternatives alleles that are both singletons, so I just kept one of them in each case.

import tsdate
import tszip
import zipfile
import os
import xarray as xr
import pandas as pd
import numpy as np

zip_path = 'data/1kgp_bal1500_singletons.zarr.zip'
output_path = os.path.splitext(zip_path)[0]
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    print(f'Unzipping to {output_path}')
    zip_ref.extractall(output_path)
    ds = xr.open_dataset(output_path, engine='zarr')

ts = tszip.decompress('data/1kgp_bal1500-chr20p-filterNton23-truncate-0-0-0-mm0-post-processed.trees.tsz')
ts = ts.simplify()
ts = tsdate.util.split_disjoint_nodes(ts)

#Keep just one allele for multi-allelic singletons
dupe_pos = ds['position'].values
duplicate_mask = ~pd.Series(dupe_pos).duplicated(keep='first').values
assert(np.sum(duplicate_mask) == len(np.unique(dupe_pos)))
duplicate_mask_xarray = xr.DataArray(duplicate_mask,
                                     dims=['variants'],
                                     name='duplicate_position_mask')
ds.update({'duplicate_position_mask': duplicate_mask_xarray})

#construct arrays for adding singletons
dedupe_ds = ds.where(ds.duplicate_position_mask, drop=True)
stn_pos = dedupe_ds['position'].values
stn_sample_id = dedupe_ds['sample_id'].values
stn_ancestral = dedupe_ds['allele'][:,0].values
stn_derived = dedupe_ds['allele'][:,1].values

#map sample_ids to individual ids
sample_id_dict = {}
for individual in ts.individuals():
    sample_id = individual.metadata['sample_id']
    sample_id_dict[sample_id] = individual.id
stn_individ = np.array([sample_id_dict[sample_id] for sample_id in stn_sample_id])

unphased_ts = tsdate.phasing.insert_unphased_singletons(ts, 
                                                position = stn_pos,
                                                individual = stn_individ,
                                                ancestral_state = stn_ancestral,
                                                derived_state = stn_derived
                                                )

There were 303 339 singletons to add to the ts. The insert_phased_singletons function only took a few seconds on my laptop, so performance doesn't seem to be an issue @nspope. One odd issue I had was that split_disjoint_nodes failed if I did it after adding the singletons, hence why I simplified and split disjoint nodes first.

Click to reveal split_disjoint_nodes error

# This is unphased_ts run without simplification or splitting disjoint nodes of the ts at the beginning
unphased_ts = unphased_ts.simplify()
unphased_ts = tsdate.util.split_disjoint_nodes(unphased_ts)

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[16], [line 2](vscode-notebook-cell:?execution_count=16&line=2)
      [1](vscode-notebook-cell:?execution_count=16&line=1) unphased_ts = unphased_ts.simplify()
----> [2](vscode-notebook-cell:?execution_count=16&line=2) unphased_ts = tsdate.util.split_disjoint_nodes(unphased_ts)

File ~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:550, in split_disjoint_nodes(ts, record_provenance)
    [541](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:541) node_is_sample = np.bitwise_and(ts.nodes_flags, tskit.NODE_IS_SAMPLE).astype(bool)
    [542](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:542) edges_parent, edges_child, nodes_order, split_nodes = _split_disjoint_nodes(
    [543](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:543)     ts.edges_parent,
    [544](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:544)     ts.edges_child,
   (...)
    [547](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:547)     node_is_sample,
    [548](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:548) )
--> [550](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:550) mutations_node = _relabel_mutations_node(
    [551](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:551)     ts.mutations_node,
    [552](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:552)     ts.sites_position[ts.mutations_site],
    [553](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:553)     nodes_order,
    [554](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:554)     edges_parent,
    [555](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:555)     edges_child,
    [556](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:556)     ts.edges_left,
    [557](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:557)     ts.edges_right,
    [558](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:558)     ts.indexes_edge_insertion_order,
    [559](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:559)     ts.indexes_edge_removal_order,
    [560](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:560) )
...
--> [513](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:513)     assert nodes_map[mutations_node[m]] != tskit.NULL
    [514](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:514)     output[m] = nodes_map[mutations_node[m]]
    [515](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:515)     m += 1

AssertionError:

It seems to be a problem with the node id assigned to the singletons that are added, but I haven't looked into it beyond that. Anyway, dating with singletons_phased=False was fast on my laptop in this case. The main issue is that 60% of the singletons are removed due to having 'invalid posteriors', which is higher than I expected:

import logging
logging.getLogger().setLevel(logging.INFO)

unphased_dated_ts = tsdate.date(unphased_ts, mutation_rate=1.29e-08, singletons_phased=False)

2024-06-18 11:39:13,481 [1640] INFO     root: Extracted mutations in 0.09639501571655273 seconds
2024-06-18 11:39:13,977 [1640] INFO     root: Found 119208 unphased singleton mutations
2024-06-18 11:39:13,978 [1640] INFO     root: Split unphased singleton edges into 1048995 blocks
2024-06-18 11:39:13,980 [1640] INFO     root: Phased singletons in 0.4944934844970703 seconds
2024-06-18 11:39:56,218 [1640] INFO     root: Skipped 0 edges with invalid factors
2024-06-18 11:39:56,219 [1640] INFO     root: Calculated node posteriors in 41.7481415271759 seconds
2024-06-18 11:39:56,630 [1640] INFO     root: Skipped 184131 mutations with invalid posteriors
2024-06-18 11:39:56,631 [1640] INFO     root: Calculated mutation posteriors in 0.40827155113220215 seconds
2024-06-18 11:39:56,654 [1640] INFO     root: Switched phase of 36525 singletons
2024-06-18 11:40:10,803 [1640] INFO     root: Timescale rescaled in 14.147191524505615 seconds
2024-06-18 11:40:10,874 [1640] INFO     root: Modified ages of 938 nodes to satisfy constraints
2024-06-18 11:40:10,919 [1640] INFO     root: Set ages of 184131 nonsegregating mutations to root times.
2024-06-18 11:40:10,945 [1640] INFO     root: Constrained node ages in 0.08892679214477539 seconds
2024-06-18 11:40:10,956 [1640] INFO     root: Set metadata schema on NodeTable
2024-06-18 11:40:14,065 [1640] INFO     root: Set metadata schema on MutationTable
2024-06-18 11:40:18,648 [1640] INFO     root: Inserted node and mutation metadata in 7.70140266418457 seconds
2024-06-18 11:40:20,790 [1640] INFO     root: Sorted tree sequence in 2.140110731124878 second

I then adapted the convenience function for adding singletons to add singletons with the true phases (0/1) that I stored in the zarr:

Click to reveal code to add phased singletons

import tskit


def insert_phased_singletons(
    ts,
    position,
    individual,
    ancestral_state,
    derived_state,
    phase,
):
    """
    Insert phased singletons into the tree sequence.
    :param tskit.TreeSequence ts: the tree sequence to add singletons to
    :param np.ndarray position: the position of the variants
    :param np.ndarray individual: the individual id in which the variant occurs
    :param np.ndarray ancestral_state: the ancestral state of the variant
    :param np.ndarray derived_state: the derived state of the variant
    :param np.ndarray phase: the phase of the variant (0 or 1)
    :returns: A copy of the tree sequence with singletons inserted
    """
    tables = ts.dump_tables()
    sites_id = {p: i for i, p in enumerate(ts.sites_position)}
    for pos, ind, ref, alt, phase in zip(
        position, individual, ancestral_state, derived_state, phase
    ):
        if pos in sites_id:
            if ref != ts.site(sites_id[pos]).ancestral_state:
                raise ValueError(
                    f"Existing site at position {pos} has a different ancestral state"
                )
            muts = ts.site(sites_id[pos]).mutations
            set_time = len(muts) and np.isfinite(muts[0].time)
        else:
            sites_id[pos] = tables.sites.add_row(position=pos, ancestral_state=ref)
            set_time = False
        # TODO: more efficient to do in bulk?
        site = sites_id[pos]
        node = ts.individual(ind).nodes[phase]
        time = ts.nodes_time[node] if set_time else tskit.UNKNOWN_TIME
        tables.mutations.add_row(
            site=site,
            node=node,
            time=time,
            derived_state=alt,
        )
    tables.sort()
    tables.build_index()
    tables.compute_mutation_parents()
    return tables.tree_sequence()


phased_ts = insert_phased_singletons(
    ts,
    position=stn_pos,
    individual=stn_individ,
    ancestral_state=stn_ancestral,
    derived_state=stn_derived,
    phase=stn_phase,
)
phased_ts

Dating with singletons_phased=True is still fast, but again 60% of singletons are skipped:

phased_dated_ts = tsdate.date(phased_ts, mutation_rate=1.29e-08, singletons_phased=True)

2024-06-18 12:03:42,365 [1640] INFO     root: Extracted mutations in 0.09936237335205078 seconds
2024-06-18 12:03:42,523 [1640] INFO     root: Found 0 unphased singleton mutations
2024-06-18 12:03:42,525 [1640] INFO     root: Split unphased singleton edges into 0 blocks
2024-06-18 12:03:42,526 [1640] INFO     root: Phased singletons in 0.1573958396911621 seconds
2024-06-18 12:04:23,682 [1640] INFO     root: Skipped 0 edges with invalid factors
2024-06-18 12:04:23,684 [1640] INFO     root: Calculated node posteriors in 40.6736102104187 seconds
2024-06-18 12:04:24,023 [1640] INFO     root: Skipped 184131 mutations with invalid posteriors
2024-06-18 12:04:24,024 [1640] INFO     root: Calculated mutation posteriors in 0.33773255348205566 seconds
2024-06-18 12:04:24,027 [1640] INFO     root: Switched phase of 0 singletons
2024-06-18 12:04:37,877 [1640] INFO     root: Timescale rescaled in 13.847219467163086 seconds
2024-06-18 12:04:37,942 [1640] INFO     root: Modified ages of 936 nodes to satisfy constraints
2024-06-18 12:04:37,975 [1640] INFO     root: Set ages of 184131 nonsegregating mutations to root times.
2024-06-18 12:04:37,996 [1640] INFO     root: Constrained node ages in 0.07012248039245605 seconds
2024-06-18 12:04:37,998 [1640] INFO     root: Set metadata schema on NodeTable
2024-06-18 12:04:41,254 [1640] INFO     root: Set metadata schema on MutationTable
2024-06-18 12:04:45,927 [1640] INFO     root: Inserted node and mutation metadata in 7.9303083419799805 seconds
2024-06-18 12:04:48,069 [1640] INFO     root: Sorted tree sequence in 2.1404826641082764 seconds

Finally, I used evaluations.mutations_time to compare the times of singletons with true phase (x) to the phase-agnostic singleton times:

I then generated a random Bernoulli vector to use for the phase and ran dating with singletons_phased=True (in hindsight I realise I could have just used the arbitarily assigned phases from insert_unphased_singletons instead):

Click to reveal code to reproduce the plots above:

from tsdate import evaluation

#True phase vs phase-agnostic
evaluation.mutations_time(
    phased_dated_ts,
    unphased_dated_ts,
    use_posterior=False,
    min_freq=None,
    max_freq=1,
    plotpath='results/1500_subset_singletons.png',
    title='1500 subset singletons: true phase vs phase-agnostic dating',
    subtending_node=False,
)

#True phase vs random phase
np.random.seed(42)
stn_phase_random = np.random.randint(2, size=len(stn_pos))
random_phase_ts = insert_phased_singletons(
    ts,
    position=stn_pos,
    individual=stn_individ,
    ancestral_state=stn_ancestral,
    derived_state=stn_derived,
    phase=stn_phase_random,
)
random_phase_dated_ts = tsdate.date(random_phase_ts, mutation_rate=1.29e-08, singletons_phased=True)

evaluation.mutations_time(
    phased_dated_ts,
    random_phase_dated_ts,
    use_posterior=False,
    min_freq=None,
    max_freq=1,
    plotpath='results/1500_subset_singletons_random.png',
    title='1500 subset singletons: true phase vs random phase',
    subtending_node=False,
)

Qualitatively, they seem similar to the simulation results, just with a tonne more variance. @nspope, is the high proportion of variants with invalid posteriors also surprising to you? If you'd like to investigate it further, here is the TS with the correctly phased singletons added. Let me know if there are other files you want:

1kgp_bal1500_phased_singletons.trees.zip

I'm going to test it in GEL next.

duncanMR · 2024-06-18T15:51:03Z

Just remembered that I can't install from GitHub in GEL anymore, so I unfortunately can't test it until they approve the Airlock transfer of this branch of tsdate, annoyingly! I just submitted the request, so it should be sorted within two days.

nspope · 2024-06-18T15:59:09Z

Thanks @duncanMR, very helpful and a lot to chew on.

is the high proportion of variants with invalid posteriors also surprising to you?

could you check the count of nonsegregating mutations (above the root)? ideally those'll be the only ones where we can't assign a posterior.

duncanMR · 2024-06-18T16:45:48Z

Thanks @duncanMR, very helpful and a lot to chew on.

is the high proportion of variants with invalid posteriors also surprising to you?

could you check the count of nonsegregating mutations (above the root)? ideally those'll be the only ones where we can't assign a posterior.

I see: the number of mutations above root matches the number with invalid posteriors. So it's the root-to-sample edges problem rearing it's head again! It's concerning how many there are here.

nspope · 2024-06-18T16:50:00Z

it's the root-to-sample edges problem

Oh -- I think I missed the discussion of this previously. What is going on? Are these not simply mutations that are segregating in the full data but fixed in the subset?

duncanMR · 2024-06-18T17:08:14Z

it's the root-to-sample edges problem

Oh -- I think I missed the discussion of this previously. What is going on? Are these not simply mutations that are segregating in the full data but fixed in the subset?

Woops, just realised my mistake. The problem is simply that many of the singletons I extracted fall outside the range of site positions in the TS, because I extracted them directly from the zarr and forgot to limit the positions by the site density mask. So mutations were being placed onto nodes in the first and last null trees in the TS that cover the range of the genome that's been masked out. So it's nothing to worry about, sorry for the false alarm!

hyanwong · 2024-06-18T17:27:27Z

Aha - of course! Easy mistake to make. We should catch this and error out, I think.

nspope · 2024-06-18T19:41:44Z

We should catch this and error out, I think

Do you think this is a problem @hyanwong? It's not unusual to have some "above-the-root" mutations introduced by tsinfer (why, I don't know).

nspope · 2024-06-18T20:29:52Z

@duncanMR two things that occurred to me -- we should be using mutation posteriors here; as that takes a weighted average of the two branch midpoints (posterior mean) rather than always choosing the longest branch (MAP estimate). I'd also be interested in seeing the parent node age-- an upper bound on mutation age may be more of interest if we're focusing on old singletons

hyanwong · 2024-06-18T20:33:44Z

We should catch this and error out, I think

Do you think this is a problem @hyanwong? It's not unusual to have some "above-the-root" mutations introduced by tsinfer (why, I don't know).

No, I mean we should error out if if singletons are being added into regions where there are no edges

hyanwong · 2024-06-18T20:35:35Z

I assume the "above the root" mutations are added by parsimony, when e.g. we force the wrong ancestral state for a triallelic site

duncanMR · 2024-06-19T07:49:38Z

@duncanMR two things that occurred to me -- we should be using mutation posteriors here; as that takes a weighted average of the two branch midpoints (posterior mean) rather than always choosing the longest branch (MAP estimate). I'd also be interested in seeing the parent node age-- an upper bound on mutation age may be more of interest if we're focusing on old singletons

Cool, I'll look into that today. I've also got tsdate through airlock so I'll test in GEL today too.

I assume the "above the root" mutations are added by parsimony, when e.g. we force the wrong ancestral state for a triallelic site

Yeah, above-the-root mutations are extremely rare in practice: I've only seen a handful of them in the largest tree sequences. Hopefully we can figure out a solution in the long-term with my DPhil project.

duncanMR · 2024-06-19T16:33:30Z

When adding singletons in GeL that Sam extracted, there are sites where the alternative allele (e.g. A) is a singleton, but there is another alternate allele (e.g. T) at the same site that is at a higher frequency and already in the TS. Do we want to exclude these singletons from being added? It seems a bit inconsistent to exclude multi-allelics from inference, but allow multi-allelics if one of the alleles is a singleton. There are quite a few: 9250 sites have this problem in chr17p alone.

duncanMR · 2024-06-19T16:57:20Z

@nspope Adding singletons is failing for chr17p in GEL. It's a bit annoying that it didn't show up in the 1kgp data, since I can't share a reproducible example. It looks like a numerical issue: a mutation phase was generated outside [0,1]. I can try and set up a debugger to find out what the problematic phase is.

nspope · 2024-06-19T17:06:10Z

@duncanMR thanks Duncan. I can probably just clamp values to that range, if it's in the range of fp error. Would you mind inserting a print statement to see how far out of wack it is?

duncanMR · 2024-06-20T18:36:23Z

It turns out that it's not an issue where the phase is within fp error of 0 or 1. After updating my singletons filtering procedure, I now have 1 million singletons to add to the ~1million variants already in the chr17p TS. Of those, there are three singletons which for some reason have invalid posteriors, but they aren't above the root this time and aren't in null trees. They are on short branches at the bottom of trees.

I first tried removing the three sites, but bizarrely, five more sites popped up with invalid posteriors. The quick fix was easy: the bug is that the invalid posteriors cause there to be three nan values in the self.mutation_phase array, which then get passed into reallocate_unphased and trigger the assert statements. All you have to do is check for nan values and skip them. For some reason, this problem doesn't occur with invalid posteriors due to mutations being placed in null trees. So it seems to work fine now:

@nspope I will add some comments to the PR where I've had issues so far.

hyanwong · 2024-06-21T07:25:03Z

When adding singletons in GeL that Sam extracted, there are sites where the alternative allele (e.g. A) is a singleton, but there is another alternate allele (e.g. T) at the same site that is at a higher frequency and already in the TS. Do we want to exclude these singletons from being added? It seems a bit inconsistent to exclude multi-allelics from inference, but allow multi-allelics if one of the alleles is a singleton. There are quite a few: 9250 sites have this problem in chr17p alone.

We don't ignore multiallelics in tsdate though, do we? True, they aren't used the help tsinfer inference, but once they are placed back on the ts by parsimony, they can be used for dating. I don't know if we do this for the GEL data though: it could be that Ben excludes them from the pipeline entirely.

duncanMR · 2024-06-21T09:07:18Z

When adding singletons in GeL that Sam extracted, there are sites where the alternative allele (e.g. A) is a singleton, but there is another alternate allele (e.g. T) at the same site that is at a higher frequency and already in the TS. Do we want to exclude these singletons from being added? It seems a bit inconsistent to exclude multi-allelics from inference, but allow multi-allelics if one of the alleles is a singleton. There are quite a few: 9250 sites have this problem in chr17p alone.

We don't ignore multiallelics in tsdate though, do we? True, they aren't used the help tsinfer inference, but once they are placed back on the ts by parsimony, they can be used for dating. I don't know if we do this for the GEL data though: it could be that Ben excludes them from the pipeline entirely.

My impression was that we exclude all multi-allelic sites instead of just keeping one allele at random, but we can check with Ben later.

hyanwong · 2024-06-21T09:41:59Z

We don't need to keep one at random: tsinfer can place non-inference sites back into the tree sequence using parsimony. They will then be used by tsdate.

It might be that multiallelics are discarded by the snakemake pipeline before tsinfer actually sees them, and are not placed back into the tree sequence using parsimony. There is an argument that perhaps we should (or if not, we should adjust the mutation rate to take into account the fact that some sites have been discarded)

duncanMR · 2024-06-21T11:10:55Z

That's a good point. We definitely aren't placing any multi-allelics by parsimony at the moment (there is one mutation per site) but I don't know if they are available in the zarr to be added later. If that's the case, then I think adding back all the multi-allelics (not just singletons) makes sense.

hyanwong · 2024-06-21T11:58:45Z

I suspect they are in the zarr but masked out. @benjeffery will know. We can ask at 4.30 today.

duncanMR · 2024-06-21T12:35:36Z

After adding and dating the unphased singletons to 1kgp-chr20p, the upper bound estimates (parent time) of singleton ages have a much lighter tail than posterior means of higher frequency variants. Seems reasonable to me:

Click to reveal code to reproduce the plot above:

import numpy as np
import tsdate
import logging
import tszip
import pandas as pd

#generated elsewhere
singletons_df = pd.read_csv('data/unphased_chr20p_singletons.csv')

ts = tszip.decompress('data/1kgp_all-chr20p-filterNton23-truncate-0-0-0-mm0-post-processed.trees.tsz')
ts = ts.simplify()
ts = tsdate.util.split_disjoint_nodes(ts)

singletons_df.drop_duplicates(subset=['position'], keep='first', inplace=True)
left = min(ts.sites_position)
right = max(ts.sites_position)
singletons_df = singletons_df[(singletons_df.position >= left) & (singletons_df.position <= right)]

stn_pos = singletons_df['position'].values
stn_sample_id = singletons_df['sample_id'].values
stn_ancestral = singletons_df['ref'].values
stn_derived = singletons_df['alt'].values

sample_id_dict = {}
for individual in ts.individuals():
    sample_id = individual.metadata['sample_id']
    sample_id_dict[sample_id] = individual.id

stn_individ = np.array([sample_id_dict[sample_id] for sample_id in stn_sample_id])

#insert singletons and date
singletons_ts = tsdate.phasing.insert_unphased_singletons(ts, 
                                                position = stn_pos,
                                                individual = stn_individ,
                                                ancestral_state = stn_ancestral,
                                                derived_state = stn_derived
                                                )
logging.getLogger().setLevel(logging.INFO)
singletons_dated_ts = tsdate.date(singletons_ts, mutation_rate=1.29e-08, singletons_phased=False)

stn_site_id = np.where(np.isin(ts.sites_position, stn_pos))[0]
assert(len(stn_site_id) == len(stn_pos))

stn_node = np.full(len(stn_site_id), -1)
stn_post_time = np.full(len(stn_site_id), np.nan)
stn_parent_time = np.full(len(stn_site_id), np.nan)
tree = ts.first()

for index, site_id in enumerate(stn_site_id):
    site = ts.site(site_id)
    pos = site.position
    assert(len(site.mutations) == 1)
    mut = site.mutations[0]
    stn_post_time[index] = mut.metadata['mn']
    child = mut.node
    tree.seek(pos)
    parent = tree.parent(mut.node)
    stn_parent_time[index] = tree.time(parent)

non_stn_pos = ts.sites_position[~np.isin(ts.sites_position, stn_pos)]
non_stn_site_id = np.where(np.isin(ts.sites_position, non_stn_pos))[0]
non_stn_post_time = np.full(len(non_stn_site_id), np.nan)

for index, site_id in enumerate(non_stn_site_id):
    site = ts.site(site_id)
    pos = site.position
    assert(len(site.mutations) == 1)
    mut = site.mutations[0]
    non_stn_post_time[index] = mut.metadata['mn']

def compute_cdf(data):
    sorted_data = np.sort(data)
    cdf = np.arange(1, len(sorted_data) + 1) / len(sorted_data)
    return sorted_data, cdf

stn_post_time_sorted, stn_post_time_cdf = compute_cdf(stn_post_time)
stn_parent_time_sorted, stn_parent_time_cdf = compute_cdf(stn_parent_time)
non_stn_post_time_sorted, non_stn_post_time_cdf = compute_cdf(non_stn_post_time)

plt.plot(stn_post_time_sorted, stn_post_time_cdf, label='singletons: posterior time')
plt.plot(stn_parent_time_sorted, stn_parent_time_cdf, label='singletons: parent time')
plt.plot(non_stn_post_time_sorted, non_stn_post_time_cdf, label='all other variants: posterior time')

plt.xscale('log')
plt.title('CDFs for singletons vs other variants in 1kgp-chr20p-all')
plt.xlabel('mutation time')
plt.ylabel('CDF')
plt.legend()
plt.grid(True)
plt.show()

I'm going to look at the 1kgp annotations dataframe next.

duncanMR · 2024-06-24T12:35:12Z

In case anyone needs them, I fixed the problem with the singletons being placed outside the range of the TS. Here is the undated TS with phased singletons added:

1kgp_bal1500_phased_singletons.trees.zip

The dated TS is too big for Github so you can get it here: https://drive.google.com/drive/folders/1vn5xMlBBXTB-_sosX9LpaQ_45E36UVoa?usp=drive_link

hyanwong · 2024-06-24T13:20:43Z

Using a rough analysis, I reckon our current singleton phasing approach misses about 25% of the information we could incorporate if we also added haplotype lengths. See #5 (comment)

I don't think this is enough to warrant modifying out existing approach, but something like this might make an interesting observation in the paper discussion.

hyanwong mentioned this issue Jun 12, 2024

Phase-agnostic singletons #412

Merged

hyanwong mentioned this issue Jun 24, 2024

Incorporate recombination clock #5

Open

Phasing singletons with expectation propagation #374

Phasing singletons with expectation propagation #374

Comments

nspope commented May 3, 2024

nspope commented May 3, 2024 • edited Loading

hyanwong commented May 3, 2024

nspope commented May 3, 2024 • edited Loading

hyanwong commented May 3, 2024 • edited Loading

hyanwong commented May 3, 2024

hyanwong commented May 3, 2024 • edited Loading

nspope commented May 3, 2024 • edited Loading

hyanwong commented May 4, 2024

jeromekelleher commented May 7, 2024

hyanwong commented May 7, 2024

nspope commented Jun 12, 2024 • edited Loading

nspope commented Jun 12, 2024

hyanwong commented Jun 12, 2024

nspope commented Jun 12, 2024

petrelharp commented Jun 12, 2024

nspope commented Jun 13, 2024 • edited Loading

duncanMR commented Jun 18, 2024 • edited Loading

duncanMR commented Jun 18, 2024

nspope commented Jun 18, 2024

duncanMR commented Jun 18, 2024 • edited Loading

nspope commented Jun 18, 2024

duncanMR commented Jun 18, 2024

hyanwong commented Jun 18, 2024

nspope commented Jun 18, 2024

nspope commented Jun 18, 2024

hyanwong commented Jun 18, 2024

hyanwong commented Jun 18, 2024 • edited Loading

duncanMR commented Jun 19, 2024

duncanMR commented Jun 19, 2024

duncanMR commented Jun 19, 2024

nspope commented Jun 19, 2024

duncanMR commented Jun 20, 2024

hyanwong commented Jun 21, 2024

duncanMR commented Jun 21, 2024 • edited Loading

hyanwong commented Jun 21, 2024

duncanMR commented Jun 21, 2024

hyanwong commented Jun 21, 2024

duncanMR commented Jun 21, 2024

duncanMR commented Jun 24, 2024

hyanwong commented Jun 24, 2024 • edited Loading

nspope commented May 3, 2024 •

edited

Loading

nspope commented May 3, 2024 •

edited

Loading

hyanwong commented May 3, 2024 •

edited

Loading

hyanwong commented May 3, 2024 •

edited

Loading

nspope commented May 3, 2024 •

edited

Loading

nspope commented Jun 12, 2024 •

edited

Loading

nspope commented Jun 13, 2024 •

edited

Loading

duncanMR commented Jun 18, 2024 •

edited

Loading

duncanMR commented Jun 18, 2024 •

edited

Loading

hyanwong commented Jun 18, 2024 •

edited

Loading

duncanMR commented Jun 21, 2024 •

edited

Loading

hyanwong commented Jun 24, 2024 •

edited

Loading