Confusion about ploidy and coalescent time scale #1761

grahamgower · 2021-07-12T14:53:29Z

grahamgower
Jul 12, 2021
Collaborator

The ploidy section in the docs implies that the only difference between simulating with ploidy=1 and ploidy=2 is the coalescent time scale (plus the sampling). Indeed, one might imagine that to reproduce the behaviour of the legacy msprime.simulate() function, one can just set ploidy=1 and then multiply the times by two. For at least some models this isn't sufficient, and the effect can be quite subtle. Below, I've compared TMRCA distributions for the legacy simulate() function, a time-adjusted sim_ancestry(ploidy=1), and the appropriate incantation sim_ancestry(ploidy=2, samples=[msprime.SampleSet(..., ploidy=1), ...]).

I guess the differences for my "strong bottleneck" model below are driven by the rate of coalescence being different with the different ploidy values? But the ploidy docs don't mention coalescent rates, so I certainly missed this (probably basic) point. Or maybe there's something else going on that I've missed? It might also be prudent to have an "equivalent to msprime.simulate()" example for sim_ancestry() somewhere?

import numpy as np
import msprime
import matplotlib.pyplot as plt

# constant size
pc_a = [msprime.PopulationConfiguration(initial_size=10_000)]
demog_a = msprime.Demography.from_old_style(population_configurations=pc_a)

# bottleneck: size 1000 from the present until time 200, then 10_000 further back
pc_b = [msprime.PopulationConfiguration(initial_size=1000)]
de_b = [msprime.PopulationParametersChange(200, initial_size=10_000)]
demog_b = msprime.Demography.from_old_style(
    population_configurations=pc_b, demographic_events=de_b
)

# strong bottleneck: size 100 from the present until time 200, then 10_000 further back
pc_c = [msprime.PopulationConfiguration(initial_size=100)]
de_c = [msprime.PopulationParametersChange(200, initial_size=10_000)]
demog_c = msprime.Demography.from_old_style(
    population_configurations=pc_c, demographic_events=de_c
)

NUM_SAMPLES = 20
NUM_REPLICATES = 10000
RANDOM_SEED = None  # e.g. 1234

def sim_legacy(pc, de):
    return msprime.simulate(
        population_configurations=pc,
        demographic_events=de,
        samples=[(0, 0)] * NUM_SAMPLES,
        num_replicates=NUM_REPLICATES,
        random_seed=RANDOM_SEED
    )

def sim_ploidy1(demography):
    return msprime.sim_ancestry(
        demography=demography,
        ploidy=1,
        samples=NUM_SAMPLES,
        num_replicates=NUM_REPLICATES,
        random_seed=RANDOM_SEED
    )

def sim_ploidy2(demography):
    return msprime.sim_ancestry(
        demography=demography,
        ploidy=2,
        samples=[msprime.SampleSet(NUM_SAMPLES, population=0, ploidy=1)],
        num_replicates=NUM_REPLICATES,
        random_seed=RANDOM_SEED
    )

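def tmrca(ts_iter):
    """Get vector of tmrcas, one for each replicate."""
    times = []  # avoid shadowing the function name
    for ts in ts_iter:
        # No recombination is simulated, so each replicate has a single tree.
        for tree in ts.trees():
            times.append(tree.time(tree.root))
    return np.array(times)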


a_legacy = tmrca(sim_legacy(pc_a, None))
a_p1 = 2 * tmrca(sim_ploidy1(demog_a))
a_p2 = tmrca(sim_ploidy2(demog_a))

b_legacy = tmrca(sim_legacy(pc_b, de_b))
b_p1 = 2 * tmrca(sim_ploidy1(demog_b))
b_p2 = tmrca(sim_ploidy2(demog_b))

c_legacy = tmrca(sim_legacy(pc_c, de_c))
c_p1 = 2 * tmrca(sim_ploidy1(demog_c))
c_p2 = tmrca(sim_ploidy2(demog_c))

def plot_qq(ax, title, /, **kwargs):
    (x_label, x), (y_label, y) = kwargs.items()
    quantiles = np.linspace(0, 1, 101)
    xq = np.nanquantile(x, quantiles)
    yq = np.nanquantile(y, quantiles)
    ax.scatter(xq, yq, marker="o", edgecolor="black", facecolor="none")
    ax.scatter(xq[50], yq[50], marker="x", lw=2, c="red", label="median")
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.set_title(title)
    ax.legend()
    # diagonal line
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    min_ = min(xlim[0], ylim[0])
    max_ = max(xlim[1], ylim[1])
    ax.autoscale(False) # don't change the x/y limits anymore
    ax.plot([min_, max_], [min_, max_], c="lightgray", ls="--", lw=1, zorder=-10)

fig, axs = plt.subplots(
    ncols=3, nrows=2, tight_layout=True, figsize=plt.figaspect(9 / 16)
)
plot_qq(axs[0,0], "constant size", ploidy1=a_p1, ploidy2=a_p2)
plot_qq(axs[0,1], "bottleneck", ploidy1=b_p1, ploidy2=b_p2)
plot_qq(axs[0,2], "strong bottleneck", ploidy1=c_p1, ploidy2=c_p2)

plot_qq(axs[1,0], "constant size", legacy=a_legacy, ploidy2=a_p2)
plot_qq(axs[1,1], "bottleneck", legacy=b_legacy, ploidy2=b_p2)
plot_qq(axs[1,2], "strong bottleneck", legacy=c_legacy, ploidy2=c_p2)
fig.suptitle("TMRCA quantile-quantile")
fig.savefig("/tmp/ploidy.pdf")

jeromekelleher · 2021-07-12T16:52:23Z

jeromekelleher
Jul 12, 2021
Maintainer

I'm not sure what's going on here @grahamgower; I would have thought these were equivalent too. This is weird: when I run the three examples with equal seeds, I would have thought we'd get identical random trajectories, but we don't:

ts1 = sim_legacy(pc_c, de_c)
ts2 = sim_ploidy1(demog_c)
ts3 = sim_ploidy2(demog_c)

print(ts1.tables.nodes)
print(ts2.tables.nodes)
print(ts3.tables.nodes)

╔══╤═════╤══════════╤══════════╤═════════════╤════════╗
║id│flags│population│individual│time         │metadata║
╠══╪═════╪══════════╪══════════╪═════════════╪════════╣
║0 │    1│         0│        -1│   0.00000000│     b''║
║1 │    1│         0│        -1│   0.00000000│     b''║
║2 │    1│         0│        -1│   0.00000000│     b''║
║3 │    1│         0│        -1│   0.00000000│     b''║
║4 │    1│         0│        -1│   0.00000000│     b''║
║5 │    1│         0│        -1│   0.00000000│     b''║
║6 │    0│         0│        -1│   2.83464876│     b''║
║7 │    0│         0│        -1│  36.89187493│     b''║
║8 │    0│         0│        -1│  88.18475739│     b''║
║9 │    0│         0│        -1│2321.79057562│     b''║
║10│    0│         0│        -1│6747.66484293│     b''║
╚══╧═════╧══════════╧══════════╧═════════════╧════════╝

╔══╤═════╤══════════╤══════════╤════════════╤════════╗
║id│flags│population│individual│time        │metadata║
╠══╪═════╪══════════╪══════════╪════════════╪════════╣
║0 │    1│         0│         0│  0.00000000│     b''║
║1 │    1│         0│         1│  0.00000000│     b''║
║2 │    1│         0│         2│  0.00000000│     b''║
║3 │    1│         0│         3│  0.00000000│     b''║
║4 │    1│         0│         4│  0.00000000│     b''║
║5 │    1│         0│         5│  0.00000000│     b''║
║6 │    0│         0│        -1│  1.41732438│     b''║
║7 │    0│         0│        -1│ 18.44593747│     b''║
║8 │    0│         0│        -1│ 44.09237870│     b''║
║9 │    0│         0│        -1│109.78932642│     b''║
║10│    0│         0│        -1│142.14985992│     b''║
╚══╧═════╧══════════╧══════════╧════════════╧════════╝

╔══╤═════╤══════════╤══════════╤═════════════╤════════╗
║id│flags│population│individual│time         │metadata║
╠══╪═════╪══════════╪══════════╪═════════════╪════════╣
║0 │    1│         0│         0│   0.00000000│     b''║
║1 │    1│         0│         1│   0.00000000│     b''║
║2 │    1│         0│         2│   0.00000000│     b''║
║3 │    1│         0│         3│   0.00000000│     b''║
║4 │    1│         0│         4│   0.00000000│     b''║
║5 │    1│         0│         5│   0.00000000│     b''║
║6 │    0│         0│        -1│   2.83464876│     b''║
║7 │    0│         0│        -1│  36.89187493│     b''║
║8 │    0│         0│        -1│  88.18475739│     b''║
║9 │    0│         0│        -1│2321.79057562│     b''║
║10│    0│         0│        -1│6747.66484293│     b''║
╚══╧═════╧══════════╧══════════╧═════════════╧════════╝

Tables 1 and 3 are identical, as expected, but table 2 is not simply off by a factor of 2 everywhere, which is what I'd have expected.

0 replies

jeromekelleher · 2021-07-12T17:02:01Z

jeromekelleher
Jul 12, 2021
Maintainer

This is the only place that ploidy occurs in the above code paths AFAIK @grahamgower. I don't understand what's going on, but I'm terrible at this scaling stuff.

I guess we could compare against msprime 0.x to see what it produces in the strong bottleneck case?

0 replies

grahamgower · 2021-07-12T17:13:32Z

grahamgower
Jul 12, 2021
Collaborator Author

The ploidy is affecting the time scale only indirectly, via the coalescent rate. At t=200, the size change happens, which also affects the coalescent rate. Before this size-change event, the nodes 6, 7, 8 are created and thus have times that match, up to a scaling factor of 2.
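This can be checked against the node tables posted above. The short sketch below copies the internal-node times from the first (legacy) and second (ploidy=1) tables and compares them; the variable names are mine, and the numbers are only those printed above.

```python
# Times of internal nodes 6-10, copied from the tables above.
legacy_times = [2.83464876, 36.89187493, 88.18475739, 2321.79057562, 6747.66484293]
ploidy1_times = [1.41732438, 18.44593747, 44.09237870, 109.78932642, 142.14985992]

SIZE_CHANGE_TIME = 200  # time of the PopulationParametersChange event

# Ratio of legacy time to ploidy=1 time for each node.
ratios = [t_legacy / t_p1 for t_legacy, t_p1 in zip(legacy_times, ploidy1_times)]

for node, (t_legacy, ratio) in enumerate(zip(legacy_times, ratios), start=6):
    epoch = "before" if t_legacy < SIZE_CHANGE_TIME else "after"
    print(f"node {node}: ratio {ratio:.3f}, legacy node is {epoch} the size change")
```

Nodes 6, 7 and 8 have a ratio of 2, while nodes 9 and 10 do not. In the ploidy=1 run every coalescence finishes before t=200, whereas in the legacy run nodes 9 and 10 are older than 200 and so fall in the epoch with the larger population size.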

0 replies

petrelharp · 2021-07-12T19:07:17Z

petrelharp
Jul 12, 2021
Maintainer

Shouldn't you be scaling the time of the bottleneck with ploidy?
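A sketch of why, assuming the usual coalescent rate of 1/(pN) per generation for a pair of lineages in a population of N individuals with ploidy p. The symbols N(t) for the size in the diploid model and M(s) for the size in the haploid model are notation introduced here, not msprime's.

```latex
\begin{align*}
\text{diploid model, time } t: \quad & \lambda_2(t)\,dt = \frac{dt}{2N(t)} \\
\text{haploid model, time } s: \quad & \lambda_1(s)\,ds = \frac{ds}{M(s)} \\
\text{doubling haploid times, } t = 2s: \quad & \frac{ds}{M(s)} = \frac{dt}{2M(t/2)} \\
\text{the two agree if and only if:} \quad & M(t/2) = N(t), \quad\text{i.e.}\quad M(s) = N(2s)
\end{align*}
```

So a size change at time T in the diploid model has to sit at T/2 in the haploid model before the haploid times are doubled. For the strong bottleneck above, that would mean placing the event at 100 rather than 200.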

0 replies

grahamgower · 2021-07-13T06:09:10Z

grahamgower
Jul 13, 2021
Collaborator Author

Yes, thanks @petrelharp. It's obvious in hindsight, but caused me quite a lot of head scratching!

2 replies

jeromekelleher Jul 13, 2021
Maintainer

Does this resolve the issue @grahamgower?

grahamgower Jul 13, 2021
Collaborator Author

Yes, absolutely. It implies that demographic models are inherently tied to the ploidy, which is a mixed bag I guess.

petrelharp · 2021-07-13T16:56:37Z

petrelharp
Jul 13, 2021
Maintainer

> Yes, absolutely. It implies that demographic models are inherently tied to the ploidy, which is a mixed bag I guess.

But, if you doubled all the population sizes when you set ploidy=1, then you should get the same thing?

2 replies

grahamgower Jul 13, 2021
Collaborator Author

So, multiply the population sizes and divide the event times? That would probably do the trick. I don't really need to do this, I was just interested in understanding what was going on. Thanks for your very clear and direct pointers!

I think this does warrant some kind of note in the docs, maybe in the ploidy section with the example that shows time scaling of a simulation. I'll rummage up a patch later this week.

petrelharp Jul 13, 2021
Maintainer

No, I mean multiplying population sizes or dividing event times. Multiplying initial_size by two and setting ploidy=1 will give you a population with the same number of chromosomes, which should do exactly the same thing, since msprime doesn't really work with individuals. The fact that taking half as many chromosomes is equivalent to dividing all times by two is a property of the coalescent.
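Both options can be tried without msprime in a toy model. The sketch below is not msprime's algorithm: it is a minimal single-population coalescent written for illustration (the names `coalescence_times`, `same` and `compare` are hypothetical), assuming each pair of lineages coalesces at rate 1/(ploidy × size) per generation. It uses the strong-bottleneck parameters from the top of the thread and the same seed for every run.

```python
import math
import random


def coalescence_times(n, ploidy, epochs, seed):
    """Coalescence times for n lineages under piecewise-constant population size.

    epochs is a list of (start_time, size) pairs sorted by start_time, with the
    first epoch starting at 0. Each pair of lineages is assumed to coalesce at
    rate 1 / (ploidy * size) per generation.
    """
    rng = random.Random(seed)
    times, t, k, i = [], 0.0, n, 0
    while k > 1:
        size = epochs[i][1]
        rate = k * (k - 1) / 2 / (ploidy * size)
        wait = rng.expovariate(rate)
        epoch_end = epochs[i + 1][0] if i + 1 < len(epochs) else math.inf
        if t + wait >= epoch_end:
            # The size changes before the event: jump to the boundary and
            # redraw with the new rate (valid because waits are memoryless).
            t, i = epoch_end, i + 1
            continue
        t += wait
        times.append(t)
        k -= 1
    return times


def same(a, b):
    """True if two lists of times agree up to floating-point rounding."""
    return len(a) == len(b) and all(math.isclose(x, y) for x, y in zip(a, b))


def compare(seed, n=6):
    """Compare three haploid parameterisations against the diploid model."""
    diploid = coalescence_times(n, 2, [(0, 100), (200, 10_000)], seed)
    # Naive: same model with ploidy=1, then double the times.
    naive = [2 * t for t in coalescence_times(n, 1, [(0, 100), (200, 10_000)], seed)]
    # Halve the event time as well, then double the times.
    scaled_time = [2 * t for t in coalescence_times(n, 1, [(0, 100), (100, 10_000)], seed)]
    # Double the sizes instead, with no rescaling of time at all.
    scaled_size = coalescence_times(n, 1, [(0, 200), (200, 20_000)], seed)
    return same(diploid, naive), same(diploid, scaled_time), same(diploid, scaled_size)


results = [compare(seed) for seed in range(50)]
print(sum(r[0] for r in results), "of 50 seeds match under naive doubling")
print("halved event time always matches:", all(r[1] for r in results))
print("doubled sizes always match:", all(r[2] for r in results))
```

Under these assumptions, halving the event time (then doubling the times) and doubling the sizes each reproduce the diploid times for every seed, while naive doubling agrees only for seeds where every coalescence happens before the size change.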

grahamgower · 2021-07-15T15:40:52Z

grahamgower
Jul 15, 2021
Collaborator Author

I see that the DemographyDebugger.mean_coalescence_time() (and presumably coalescence_rate_trajectory()) calculates coalescent rates according to a diploid model. I assume that scaling the result of these functions to adjust for ploidy is wrong for the same reason, and the demographic model must instead be scaled to be a diploid model? Is there anywhere else where someone might be bitten by a discrepancy between model ploidy and ploidy used for the coalescent rate?
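To illustrate the concern with a worked example: the sketch below computes the mean coalescence time of two lineages for a single population with piecewise-constant size, for a given ploidy. It is not the DemographyDebugger implementation; the function `mean_pairwise_coalescence_time` is a hypothetical helper, and it assumes a pairwise coalescence rate of 1/(ploidy × size) per generation.

```python
import math


def mean_pairwise_coalescence_time(ploidy, epochs):
    """Expected coalescence time of two lineages under piecewise-constant size.

    epochs is a list of (start_time, size) pairs sorted by start_time, with the
    first epoch starting at 0. The pairwise coalescence rate within an epoch is
    assumed to be 1 / (ploidy * size).
    """
    mean = 0.0
    survival = 1.0  # probability of not having coalesced by the epoch start
    for i, (start, size) in enumerate(epochs):
        rate = 1 / (ploidy * size)
        end = epochs[i + 1][0] if i + 1 < len(epochs) else math.inf
        p_no_coal = math.exp(-rate * (end - start))
        # Integral of the survival function over this epoch.
        mean += survival * (1 - p_no_coal) / rate
        survival *= p_no_coal
    return mean


strong_bottleneck = [(0, 100), (200, 10_000)]
diploid = mean_pairwise_coalescence_time(2, strong_bottleneck)
haploid = mean_pairwise_coalescence_time(1, strong_bottleneck)
# Haploid model with the event time halved as well.
haploid_rescaled = mean_pairwise_coalescence_time(1, [(0, 100), (100, 10_000)])
print(diploid, haploid, haploid_rescaled)
```

For the strong bottleneck this gives roughly 7484 generations under ploidy 2 but roughly 1440 under ploidy 1, which is not half the diploid value. Only the haploid model with the event time halved gives exactly half. For a constant-size model there are no event times to rescale, so halving the diploid result does work there.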

2 replies

petrelharp Jul 16, 2021
Maintainer

Ah, good point. This should be an issue.

petrelharp Jul 16, 2021
Maintainer

#1771
