Ancient samples that are directly ancestral to other samples #2260

apragsdale · 2024-02-12T23:24:56Z

Hi all, I'm wondering whether it is possible, and if so whether there is a preferred way, to run a simulation in which we sample ancient individuals that are direct ancestors of other samples that lived more recently.

[I am pretty sure this has been discussed before, perhaps on Slack, but I didn't find anything in Issues or Discussions here, so I thought this could be a good place to ask in a searchable location.]

A simple example would be to sample the entire population at time zero as well as the entire population in the previous generation. For this to work, we would need a discrete-time simulation model like the DTWF instead of the continuous-time Hudson model. Essentially, can I "census" a population at a given time, and have those censused individuals also be ancestral to later samples? Note that this is different from a msprime.CensusEvent, which simply adds nodes to all branches at a given time. That is handy for bookkeeping, but it would miss extant lineages that are not ancestral, in a given tree, to a more recent sample.

import msprime

demog = msprime.Demography.isolated_model([2])  # single population of size 2 diploids
samples = [
    msprime.SampleSet(2),  # sample whole population at time zero
    msprime.SampleSet(2, time=1),  # whole population one generation ago
]
ts = msprime.sim_ancestry(
    random_seed=1, samples=samples, demography=demog, model="dtwf"
)

print(ts.draw_text())

This should print out something like this:

5.00┊          12     ┊
    ┊       ┏━━━┻━━━┓ ┊
4.00┊      11       ┃ ┊
    ┊    ┏━━┻━━━┓   ┃ ┊
2.00┊   10      9   ┃ ┊
    ┊ ┏━┳┻┳━┓  ┏┻━┓ ┃ ┊
1.00┊ ┃ ┃ 5 7  8  6 4 ┊
    ┊ ┃ ┃     ┏┻┓     ┊
0.00┊ 0 2     1 3     ┊
    0                 1

What would be desired for the censused population at time 1 is for the branches from nodes 0 to 3 to go to nodes 4 to 7 (which are also sample nodes). As it is, the population size at time 1 is larger than the specified size in the model, and we end up with cousins between generations 0 and 1, but no parent-offspring relationships among our samples. Is this something that is possible to do (either within DTWF, or another option like fixed pedigree simulation)?
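To make "no parent-offspring relationships among our samples" concrete, here is a minimal pure-Python sketch that walks up a single tree and reports any sample that has another sample among its ancestors. The helper name `sample_ancestors` and the `parent_of` dictionary are illustrative assumptions (they are not part of msprime or tskit); the dictionary is transcribed by hand from the tree drawn above.

```python
def sample_ancestors(parent_of, samples):
    """Map each sample to the sample nodes found among its ancestors.

    parent_of: dict child -> parent for one tree (hypothetical input format).
    samples:   iterable of sample node IDs.
    """
    samples = set(samples)
    found = {}
    for s in sorted(samples):
        u = parent_of.get(s)
        while u is not None:  # walk towards the root
            if u in samples:
                found.setdefault(s, []).append(u)
            u = parent_of.get(u)
    return found


# child -> parent, read off the tree printed above
parent_of = {
    0: 10, 2: 10, 5: 10, 7: 10,
    1: 8, 3: 8, 8: 9, 6: 9,
    10: 11, 9: 11, 11: 12, 4: 12,
}
samples = range(8)  # nodes 0-3 at time 0, nodes 4-7 at time 1

naive = sample_ancestors(parent_of, samples)
print(naive)

# A hypothetical "desired" tree in which sample 0 descends from sample 4
desired = sample_ancestors({0: 4, 1: 4, 4: 8, 5: 8}, [0, 1, 4, 5])
print(desired)
```

For the naive simulation the result is empty, which is exactly the problem described above; in the hypothetical second tree, samples 0 and 1 both report node 4 as a sampled ancestor.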

[I think this question is somewhat related to Discussion #2188, but not quite the same.]

Answered by petrelharp

Feb 13, 2024


apragsdale · 2024-02-13T04:18:54Z

Author

Edit: Looks like I should check out the additional_nodes option here. That may be a much more elegant solution to what I put below.

I've been playing around with this a bit. The goal below is to sample the entire population at time zero and at some time in the past. The solution I came up with is a bit hacky, so I'd be interested if anyone has a better idea. I know this is a long comment, but I tried to be complete with the code and my thought process.

We'll set up a single population of size 2 diploids, and set the ancient sample time at t=3:

import msprime
import numpy as np
seed = 3
pop_size = 2
demog = msprime.Demography.isolated_model([pop_size])
sample_time = 3
L = 2

A naive DTWF simulation will not result in direct ancestry from any ancient samples to those at time zero:

samples = [
    msprime.SampleSet(pop_size),  # sample whole population at time zero
    msprime.SampleSet(pop_size, time=sample_time),  # whole population at sample time
]
ts = msprime.sim_ancestry(
    random_seed=seed,
    samples=samples,
    demography=demog,
    model="dtwf",
    sequence_length=L,
    recombination_rate=0.1,
)
print(ts.draw_text())

Note the exceeded population size at the sample time:

    ┊       ┏━━━┻┳━━┓ ┊         ┏━┻━━┓  ┊
6.00┊       ┃    ┃  ┃ ┊        14    ┃  ┊
    ┊       ┃    ┃  ┃ ┊       ┏━┻━┓  ┃  ┊
5.00┊      13    ┃  ┃ ┊       ┃   ┃  ┃  ┊
    ┊     ┏━┻━┓  ┃  ┃ ┊       ┃   ┃  ┃  ┊
4.00┊    11   ┃ 12  ┃ ┊      11   ┃ 12  ┊
    ┊   ┏━┻━┓ ┃ ┏┻┓ ┃ ┊    ┏━━┻━┓ ┃ ┏┻┓ ┊
3.00┊   ┃   4 ┃ 5 6 7 ┊    ┃    4 7 5 6 ┊
    ┊   ┃     ┃       ┊    ┃            ┊
2.00┊  10     ┃       ┊   10            ┊
    ┊  ┏┻━┓   ┃       ┊  ┏━┻━┓          ┊
1.00┊  8  ┃   ┃       ┊  8   9          ┊
    ┊ ┏┻┓ ┃   ┃       ┊ ┏┻┓ ┏┻┓         ┊
0.00┊ 0 3 2   1       ┊ 0 3 1 2         ┊
    0                 1                 2
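As a rough check of the "exceeded population size" claim, the sketch below counts the lineages present in one tree at a given time: nodes sitting exactly at that time plus branches that pass through it. The function `lineages_at` and its inputs are assumptions made for illustration; the node times and edges are transcribed by hand from the left-hand tree above, leaving out the edges into the root because the top of the drawing is cut off (they do not cross time 3 anyway).

```python
def lineages_at(time, node_time, edges):
    """Count lineages in a single tree at `time`.

    node_time: dict node -> time.
    edges:     list of (parent, child) pairs for that tree.
    """
    at_time = sum(1 for t in node_time.values() if t == time)
    crossing = sum(
        1 for parent, child in edges
        if node_time[child] < time < node_time[parent]
    )
    return at_time + crossing


# Left-hand tree above (root and its edges omitted)
node_time = {
    0: 0, 1: 0, 2: 0, 3: 0,
    8: 1, 10: 2,
    4: 3, 5: 3, 6: 3, 7: 3,
    11: 4, 12: 4, 13: 5,
}
edges = [
    (8, 0), (8, 3), (10, 8), (10, 2),
    (11, 10), (11, 4), (13, 11), (13, 1),
    (12, 5), (12, 6),
]

pop_size = 2
n_lineages = lineages_at(3, node_time, edges)
print(n_lineages, "lineages at time 3; capacity is", 2 * pop_size)
```

Here the four ancient sample nodes plus the two branches passing through time 3 add up to more genome copies than a population of two diploids can hold.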

Let's instead try to stop our simulation at the sample time, and then "fill in" the ancient sampled population before finishing the simulation:

samples = [
    msprime.SampleSet(pop_size),  # sample whole population at time zero
]
ts = msprime.sim_ancestry(
    random_seed=seed,
    samples=samples,
    demography=demog,
    model="dtwf",
    end_time=sample_time,
    sequence_length=L,
    recombination_rate=0.1,
)
print(ts.draw_text())

Here's our halted simulated tree sequence:

3.00┊   8   7 ┊         ┊
    ┊   ┃   ┃ ┊         ┊
2.00┊   6   ┃ ┊    6    ┊
    ┊  ┏┻━┓ ┃ ┊  ┏━┻━┓  ┊
1.00┊  4  ┃ ┃ ┊  4   5  ┊
    ┊ ┏┻┓ ┃ ┃ ┊ ┏┻┓ ┏┻┓ ┊
0.00┊ 0 3 2 1 ┊ 0 3 1 2 ┊
    0         1         2

Now we can try to add the rest of the samples into the population at the sample time. These added nodes won't be ancestral to our samples at time 0, but we still want them in order to census the full population at the ancient sample time:

# get the node table columns
tables = ts.tables

times = tables.nodes.time
flags = tables.nodes.flags
pops = tables.nodes.population
indivs = tables.nodes.individual

# set flags to 1 for nodes at time of sample time
flags[times == sample_time] = 1

nodes_to_add = 2 * pop_size - sum(times == sample_time)
times = np.concatenate((times, [sample_time] * nodes_to_add)).astype(times.dtype)
flags = np.concatenate((flags, [1] * nodes_to_add)).astype(flags.dtype)
pops = np.concatenate((pops, [0] * nodes_to_add)).astype(pops.dtype)
indivs = np.concatenate((indivs, [-1] * nodes_to_add)).astype(indivs.dtype)

# now assign those ancient sample nodes to diploid individuals
diploids = np.random.permutation(np.where(times == sample_time)[0]).reshape(pop_size, 2)
for dip in diploids:
    indivs[dip] = max(indivs) + 1
    tables.individuals.add_row()

# reset node table columns and sort
tables.nodes.set_columns(flags=flags, time=times, population=pops, individual=indivs)
tables.sort()
ts = tables.tree_sequence()

# finish the simulation
ts = msprime.sim_ancestry(
    random_seed=seed,
    initial_state=ts,
    model="dtwf",
    demography=demog,
    start_time=sample_time,
    recombination_rate=0.1,
)
print(ts.draw_text())

This isn't quite right, because one of the trees had fully coalesced before reaching the sample time, but we're closer:

8.00┊       14     ┊        14        ┊
    ┊     ┏━━┻━━┓  ┊    ┏━━━━┻━━━┓    ┊
5.00┊     ┃     ┃  ┊    ┃       13    ┊
    ┊     ┃     ┃  ┊    ┃      ┏━┻━┓  ┊
4.00┊    12    11  ┊    ┃     11  12  ┊
    ┊   ┏━┻━┓ ┏━┻┓ ┊    ┃    ┏━┻┓ ┏┻┓ ┊
3.00┊   8   9 7 10 ┊    ┃    7 10 8 9 ┊
    ┊   ┃     ┃    ┊    ┃             ┊
2.00┊   6     ┃    ┊    6             ┊
    ┊  ┏┻━┓   ┃    ┊  ┏━┻━┓           ┊
1.00┊  4  ┃   ┃    ┊  4   5           ┊
    ┊ ┏┻┓ ┃   ┃    ┊ ┏┻┓ ┏┻┓          ┊
0.00┊ 0 3 2   1    ┊ 0 3 1 2          ┊
    0              1                  2
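The "fill in" step above comes down to two pieces of bookkeeping: pad the lineages that reached the sample time up to `2 * pop_size` nodes, then pair all of those nodes at random into diploid individuals. A minimal sketch of just that logic, without any tskit tables, might look like the following. The function `fill_and_pair` and its arguments are hypothetical names introduced here, and the standard-library `random` module stands in for the NumPy permutation used in the post.

```python
import random


def fill_and_pair(existing_nodes, pop_size, next_node_id, seed=None):
    """Pad nodes at the sample time to 2 * pop_size and pair them into diploids.

    existing_nodes: node IDs already present at the sample time.
    next_node_id:   first unused node ID (hypothetical argument).
    Returns (new_node_ids, list_of_node_pairs).
    """
    n_missing = 2 * pop_size - len(existing_nodes)
    if n_missing < 0:
        raise ValueError("more lineages than genome copies in the population")
    new_nodes = list(range(next_node_id, next_node_id + n_missing))
    all_nodes = list(existing_nodes) + new_nodes
    random.Random(seed).shuffle(all_nodes)  # random assignment to individuals
    pairs = [tuple(all_nodes[i:i + 2]) for i in range(0, len(all_nodes), 2)]
    return new_nodes, pairs


# In the halted tree sequence above, nodes 7 and 8 sit at the sample time
new_nodes, diploids = fill_and_pair([7, 8], pop_size=2, next_node_id=9, seed=1)
print(new_nodes)
print(diploids)
```

The explicit error for a negative `n_missing` is a design choice of this sketch: with a DTWF simulation the number of surviving lineages should never exceed the number of genome copies, so hitting it would indicate a mismatch between the simulation and the assumed population size.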

We need those edges in the coalesced marginal trees to reach up to the sampling time. The simplest work-around I came up with is to add a "dummy" node above the sampling time. This prevents any marginal tree from ever fully coalescing before hitting the sampling time, and then we can remove that node and continue the simulation to completion:

samples = [
    msprime.SampleSet(pop_size),  # sample whole population at time zero
    msprime.SampleSet(1, ploidy=1, time=sample_time + 1),  # our "dummy" node
]

ts = msprime.sim_ancestry(
    random_seed=seed,
    samples=samples,
    demography=demog,
    model="dtwf",
    end_time=sample_time,
    sequence_length=L,
    recombination_rate=0.1,
)

print("Running DTWF up to the sample time:")
print(ts.draw_text())

# get the node table columns
tables = ts.tables

times = tables.nodes.time
flags = tables.nodes.flags
pops = tables.nodes.population
indivs = tables.nodes.individual

# remove node above sample time (only the "dummy" node lives there)
to_keep = times <= sample_time
to_del = np.where(~to_keep)[0]
times = times.compress(to_keep)
flags = flags.compress(to_keep)
pops = pops.compress(to_keep)
indivs = indivs.compress(to_keep)

# set flags to 1 for nodes at time of sample time
flags[times == sample_time] = 1

nodes_to_add = 2 * pop_size - sum(times == sample_time)
times = np.concatenate((times, [sample_time] * nodes_to_add)).astype(times.dtype)
flags = np.concatenate((flags, [1] * nodes_to_add)).astype(flags.dtype)
pops = np.concatenate((pops, [0] * nodes_to_add)).astype(pops.dtype)
indivs = np.concatenate((indivs, [-1] * nodes_to_add)).astype(indivs.dtype)
# now assign those ancient sample nodes to diploid individuals
tables.individuals.clear()
for i in range(sum(np.unique(indivs) >= 0)):
    tables.individuals.add_row()
diploids = np.random.permutation(np.where(times == sample_time)[0]).reshape(pop_size, 2)
for dip in diploids:
    indivs[dip] = max(indivs) + 1
    tables.individuals.add_row()

# reset edge table by shifting node indexes as required (since we removed a node);
# the comparisons below rely on to_del holding exactly one node ID (the dummy node)
p = tables.edges.parent
c = tables.edges.child
p[p > to_del] -= 1
c[c > to_del] -= 1
tables.edges.set_columns(
    left=tables.edges.left, right=tables.edges.right, parent=p, child=c
)

# reset node table columns and sort
tables.nodes.set_columns(flags=flags, time=times, population=pops, individual=indivs)
tables.sort()
ts = tables.tree_sequence()

print("Tree sequence with added samples and removed 'dummy' node:")
print(ts.draw_text())

# finish the simulation
ts = msprime.sim_ancestry(
    random_seed=seed,
    initial_state=ts,
    model="dtwf",
    demography=demog,
    start_time=sample_time,
    recombination_rate=0.1,
)

print("Finishing the simulation with DTWF:")
print(ts.draw_text())

I think this leaves us with what we want:

Running DTWF up to the sample time:
4.00┊         4 ┊         4 ┊
    ┊           ┊           ┊
3.00┊   9   8   ┊    9      ┊
    ┊   ┃   ┃   ┊    ┃      ┊
2.00┊   7   ┃   ┊    7      ┊
    ┊  ┏┻━┓ ┃   ┊  ┏━┻━┓    ┊
1.00┊  5  ┃ ┃   ┊  5   6    ┊
    ┊ ┏┻┓ ┃ ┃   ┊ ┏┻┓ ┏┻┓   ┊
0.00┊ 0 3 2 1   ┊ 0 3 1 2   ┊
    0           1           2

Tree sequence with added samples and removed 'dummy' node:
3.00┊   8   7 9 10 ┊    8    7 9 10 ┊
    ┊   ┃   ┃      ┊    ┃           ┊
2.00┊   6   ┃      ┊    6           ┊
    ┊  ┏┻━┓ ┃      ┊  ┏━┻━┓         ┊
1.00┊  4  ┃ ┃      ┊  4   5         ┊
    ┊ ┏┻┓ ┃ ┃      ┊ ┏┻┓ ┏┻┓        ┊
0.00┊ 0 3 2 1      ┊ 0 3 1 2        ┊
    0              1                2

Finishing the simulation with DTWF:
7.00┊      14      ┊                ┊
    ┊   ┏━━━┻━━━┓  ┊                ┊
5.00┊   ┃      13  ┊         13     ┊
    ┊   ┃     ┏━┻┓ ┊       ┏━━┻━━┓  ┊
4.00┊   ┃    11  ┃ ┊      12    11  ┊
    ┊   ┃   ┏━┻┓ ┃ ┊    ┏━━┻━┓ ┏━┻┓ ┊
3.00┊   8   7 10 9 ┊    8    9 7 10 ┊
    ┊   ┃   ┃      ┊    ┃           ┊
2.00┊   6   ┃      ┊    6           ┊
    ┊  ┏┻━┓ ┃      ┊  ┏━┻━┓         ┊
1.00┊  4  ┃ ┃      ┊  4   5         ┊
    ┊ ┏┻┓ ┃ ┃      ┊ ┏┻┓ ┏┻┓        ┊
0.00┊ 0 3 2 1      ┊ 0 3 1 2        ┊
    0              1                2
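The edge renumbering in the code above works because exactly one node (the dummy) is deleted. A slightly more general version of the same idea, sketched here in plain Python, builds an old-to-new ID map so that any number of nodes can be dropped. The function `drop_nodes` is a hypothetical helper, not a tskit call; I believe tskit's `TableCollection.subset` covers similar ground on real tables, but treat that as a pointer to check rather than a tested recipe.

```python
def drop_nodes(num_nodes, edges, to_delete):
    """Remove nodes and renumber the survivors consecutively.

    edges: list of (parent, child) pairs using the old node IDs.
    Returns (old_to_new_id_map, renumbered_edges); edges that touch a
    deleted node are discarded.
    """
    to_delete = set(to_delete)
    new_id = {}
    for old in range(num_nodes):
        if old not in to_delete:
            new_id[old] = len(new_id)
    kept = [
        (new_id[parent], new_id[child])
        for parent, child in edges
        if parent not in to_delete and child not in to_delete
    ]
    return new_id, kept


# Left-hand tree from "Running DTWF up to the sample time"; node 4 is the dummy
edges = [(5, 0), (5, 3), (7, 5), (7, 2), (9, 7), (8, 1)]
new_id, renumbered = drop_nodes(10, edges, to_delete=[4])
print(renumbered)
```

The renumbered edges match the node labels in the "added samples and removed 'dummy' node" drawing above, where the old nodes 5 to 9 have become 4 to 8.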


jeromekelleher Feb 13, 2024
Maintainer

Nice! I'd have to think hard about the details to be sure, but to me this approach works.

petrelharp · 2024-02-13T04:54:38Z

Maintainer

I think you could do this more simply with a union? Something like:

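import msprime
import numpy as np
import tskit

rng = np.random.default_rng()
pop_size = 3  # diploids sampled at sample_time (note: the demography below has size 2)
demog = msprime.Demography.isolated_model([2])
L = 1
sample_time = 2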

# history up until sample_time
ts1 = msprime.sim_ancestry(
    samples=[msprime.SampleSet(2)],
    demography=demog,
    end_time=sample_time,
    sequence_length=L,
    recombination_rate=0.1,
)
# history before that
ts2 = msprime.sim_ancestry(
    samples=[msprime.SampleSet(pop_size, time=sample_time)],
    demography=demog,
    sequence_length=L,
    recombination_rate=0.1,
)

roots = [n.id for n in ts1.nodes() if n.time == sample_time]
tips = ts2.samples()
rng.shuffle(tips)
node_mapping = [tskit.NULL for _ in ts2.nodes()]
for t, n in zip(tips[:len(roots)], roots):
    node_mapping[t] = n

ts = ts1.union(ts2, node_mapping, check_shared_equality=False)

print(ts.draw_text())

gets

5.76┊          15    ┊
    ┊        ┏━━┻━━┓ ┊
4.97┊       14     ┃ ┊
    ┊     ┏━━┻━━┓  ┃ ┊
3.87┊    13     ┃  ┃ ┊
    ┊  ┏━━┻━┓   ┃  ┃ ┊
2.43┊  ┃   12   ┃  ┃ ┊
    ┊  ┃   ┏┻━┓ ┃  ┃ ┊
2.39┊  ┃  11  ┃ ┃  ┃ ┊
    ┊  ┃  ┏┻┓ ┃ ┃  ┃ ┊
2.00┊  5  7 8 9 6 10 ┊
    ┊  ┃  ┃     ┃    ┊
0.29┊  4  ┃     ┃    ┊
    ┊ ┏┻┓ ┃     ┃    ┊
0.00┊ 0 3 1     2    ┊
    0                1
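The key object in the union approach is `node_mapping`: a list with one entry per node of the second tree sequence, holding either the ID of the equivalent node in the first one or NULL for nodes that should be added as new. The sketch below reproduces just that construction in plain Python so the shape of the list is easy to inspect. `build_node_mapping` is a hypothetical helper, `NULL = -1` mirrors the value of `tskit.NULL`, and the sizes are loosely modeled on the example above (three roots at the sample time, six tips, eleven nodes in the older tree sequence).

```python
import random

NULL = -1  # same value as tskit.NULL


def build_node_mapping(num_other_nodes, other_tips, roots, seed=None):
    """Map a random subset of the older tree sequence's tips onto the roots.

    Entry i is the node ID in the recent tree sequence that node i of the
    older one is identified with, or NULL if node i should be added as new.
    """
    tips = list(other_tips)
    if len(roots) > len(tips):
        raise ValueError("more roots than tips to attach them to")
    random.Random(seed).shuffle(tips)
    mapping = [NULL] * num_other_nodes
    for tip, root in zip(tips, roots):
        mapping[tip] = root
    return mapping


mapping = build_node_mapping(
    num_other_nodes=11, other_tips=range(6), roots=[5, 6, 7], seed=42
)
print(mapping)
```

Each root is matched to a distinct tip, the leftover tips stay NULL and become the "extra" censused nodes, and every internal node of the older tree sequence is NULL as well.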


apragsdale Feb 13, 2024
Author

This is nice. I like the union approach you suggest. This is what I've settled on for now: we can include that "dummy" node above the sample time to prevent full coalescence in any local tree, and then just use ts1.simplify(range(2 * pop_size), keep_unary=True) to remove it before the union.

import msprime
import tskit
import numpy as np

pop_size = 2
sample_time = 6
L = 4
r = 0.1

rng = np.random.default_rng()

samples = [
    msprime.SampleSet(pop_size),
    msprime.SampleSet(1, ploidy=1, time=sample_time + 1),
]

# recent history up to sample time
ts1 = msprime.sim_ancestry(
    samples=samples,
    population_size=pop_size,
    model="dtwf",
    end_time=sample_time,
    sequence_length=L,
    recombination_rate=r,
)

# remove dummy node above the sample time
ts1 = ts1.simplify(range(2 * pop_size), keep_unary=True)

# history before sample time
ts2 = msprime.sim_ancestry(
    samples=[msprime.SampleSet(pop_size, time=sample_time)],
    model="dtwf",
    population_size=pop_size,
    sequence_length=L,
    recombination_rate=r,
)

# remap roots to samples in ts2
roots = [n.id for n in ts1.nodes() if n.time == sample_time]
tips = ts2.samples()
rng.shuffle(tips)
node_mapping = [tskit.NULL for _ in ts2.nodes()]
for t, n in zip(tips[:len(roots)], roots):
    node_mapping[t] = n

ts = ts1.union(ts2, node_mapping, check_shared_equality=False)

print(ts.draw_text())

Gives, for example, something like this:

17.00┊          13    ┊                ┊              ┊
     ┊        ┏━━┻━━┓ ┊                ┊              ┊
10.00┊        ┃     ┃ ┊          12    ┊        12    ┊
     ┊        ┃     ┃ ┊       ┏━━━┻━━┓ ┊     ┏━━━┻━━┓ ┊
7.00 ┊       11     ┃ ┊      11      ┃ ┊    11      ┃ ┊
     ┊     ┏━━┻┳━┓  ┃ ┊    ┏━━┻━┳━┓  ┃ ┊ ┏━━━╋━━━┓  ┃ ┊
6.00 ┊     7   8 9 10 ┊    7    8 9 10 ┊ 7   8   9 10 ┊
     ┊   ┏━┻━┓        ┊    ┃           ┊ ┃   ┃        ┊
2.00 ┊   6   ┃        ┊    6           ┊ ┃   6        ┊
     ┊  ┏┻━┓ ┃        ┊  ┏━┻━┓         ┊ ┃  ┏┻━┓      ┊
1.00 ┊  5  ┃ ┃        ┊  5   4         ┊ ┃  4  ┃      ┊
     ┊ ┏┻┓ ┃ ┃        ┊ ┏┻┓ ┏┻┓        ┊ ┃ ┏┻┓ ┃      ┊
0.00 ┊ 0 3 2 1        ┊ 0 3 1 2        ┊ 0 1 2 3      ┊
     0                1                2              4

apragsdale Feb 13, 2024
Author

There is a small wrinkle that I found. By doing ts1.union(ts2, ...), the sample node flags from ts2 will be overwritten by the flags from ts1, changing them from samples to not samples. I think the solution is to reverse the node mapping and the union operation:

node_mapping = [tskit.NULL for _ in ts1.nodes()]
for t, n in zip(tips[:len(roots)], roots):
    node_mapping[n] = t

ts = ts2.union(ts1, node_mapping, check_shared_equality=False)
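Reversing the union means the mapping has to be indexed by the nodes of `ts1` rather than `ts2`. If a mapping in the original direction already exists, it can be flipped mechanically, as in this small sketch. `flip_node_mapping` is a hypothetical helper, `NULL = -1` mirrors `tskit.NULL`, and the example IDs are made up for illustration.

```python
NULL = -1  # same value as tskit.NULL


def flip_node_mapping(mapping, num_target_nodes):
    """Invert a node mapping.

    mapping[i] = j says node i (in A) is node j (in B). The result has one
    entry per node of B, with flipped[j] = i and NULL everywhere else.
    """
    flipped = [NULL] * num_target_nodes
    for a_node, b_node in enumerate(mapping):
        if b_node != NULL:
            flipped[b_node] = a_node
    return flipped


# Hypothetical mapping from 11 nodes of ts2 into the 8 nodes of ts1:
# tips 2, 0 and 4 of ts2 are identified with roots 5, 6 and 7 of ts1
ts2_to_ts1 = [NULL] * 11
ts2_to_ts1[2], ts2_to_ts1[0], ts2_to_ts1[4] = 5, 6, 7

ts1_to_ts2 = flip_node_mapping(ts2_to_ts1, num_target_nodes=8)
print(ts1_to_ts2)
```

Flipping twice recovers the original list, which is a quick sanity check that no two nodes were mapped to the same target.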

petrelharp Feb 14, 2024
Maintainer

That makes sense. Also, I should have said: it's the mismatched node flags for the samples that make check_shared_equality=False necessary (I think).

And gee, that's a clever way to keep it simulating!

apragsdale Feb 14, 2024
Author

I see, that makes sense. I had never used the union function before, and it's a nice way to avoid having to messily manipulate tables like I was trying to do.

Is there anything here that would be useful to have in the documentation? Otherwise I think this is fully answered.

petrelharp Feb 14, 2024
Maintainer

Well, we could make this a short vignette somewhere? Not necessary, I think, but it's a nice example...
