Unexpected neutral theta using tskit-built tree sequences #2252

mawassw · 2024-01-10T17:06:34Z


Hi everyone,

Let me start with the main problem: I am trying to calculate diversity-Ne by simulating neutral mutations with msprime.sim_mutations along a tree sequence recorded by our own forward-time simulator of diploid genomes, which incorporates linkage blocks and recombination at given hotspots.

More details on the tree sequence (I've attached an example of the tree sequence tables I am using, so that any of you can test with them; tskit_tables_example.zip): We use the tskit API, integrated into our forward-time simulator, to build a tree sequence that keeps track of the mutations occurring within the linkage blocks. So we don't have sites, but linkage blocks in which multiple mutations can occur, and the mutations have values sampled from DFEs.

We are fairly sure that the output tables and the tree sequences we build from them are correct because we get very good concordance between the fitness flux estimated directly in our simulator and the fitness flux calculated based on the fixed mutations we extract from the tree sequence.

The issue we are having now is that the same tree sequence, when used to simulate neutral mutations to calculate diversity-Ne, is giving wild results. We are not sure whether the parameterization we are using for msprime.sim_mutations is the issue. See the snippet of code below, which we use to perform this operation. We are using the InfiniteAlleles model since our setup is similar to that of SLiM, and discrete_genome=False since we assumed that neutral mutations can happen anywhere in our genomes even though our linkage blocks are integers (in the example I give we have 1150 linkage blocks = 23 chr * 50 blocks per chr). Changing it to True doesn't change much anyway. The amount of diversity we are getting is very high for a rate of 10^-8. Any insights on what we are doing wrong here?

import io

import msprime
import tskit

#read tables for tskit
with open('sitetable.txt') as f:
    sites = f.read()

with open('nodetable.txt') as f:
    nodes = f.read()

with open('mutationtable.txt') as f:
    mutations = f.read()

with open('edgetable.txt') as f:
    edges = f.read()

#load in the tree sequence data
ts = tskit.load_text(
    nodes = io.StringIO(nodes),
    edges = io.StringIO(edges),
    sites = io.StringIO(sites),
    mutations = io.StringIO(mutations),
    strict = False)

#Estimating Ne based on theta
print(f"tree sequence length is {ts.sequence_length}")
model = msprime.InfiniteAlleles()
rate = float(input("Enter mutation rate:"))
nts = msprime.sim_mutations(tree_sequence=ts, rate=rate, random_seed=1989, model=model, discrete_genome=False, keep=False)
print(f"The tree sequence has {nts.num_mutations} neutral mutations.")
Ne = nts.diversity()/(4*rate)
print(f"The average diversity {nts.diversity()}.")
print(f"Ne is {Ne}")

jeromekelleher · 2024-01-11T11:13:35Z

Maintainer

Hi @mawassw, welcome! 👋

It's always exciting to see people building cool new stuff on top of tskit, so hopefully we can help.

I think there's probably just some confusion here about calibrating branch lengths, sequence lengths and the rates. When I run the branch-mode diversity like this

print(nts.diversity(mode="branch"))

I get 38717732.54502242, which tells me that the average distance between samples along the trees is very large. What are your time-units here?

I'm also not clear what the units of sequence length mean here. It looks like each tree is one unit of sequence long - so what's your model of "site density" here then?

These tricky scaling problems need to be worked out so that you can provide the appropriate rate to sim_mutations.
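To make the link between the branch-mode number and the site-mode result concrete: under an infinite-sites assumption, the expected site-mode diversity is the mutation rate multiplied by the branch-mode diversity. The lines below are only an arithmetic sketch using the branch-mode value quoted above and the 10^-8 rate from the original post. They assume the rate is interpreted per unit of sequence length per unit of tree-sequence time, which is how sim_mutations treats it.

```python
# Branch-mode diversity quoted above, in the tree sequence's own time units.
branch_diversity = 38717732.54502242

# Rate from the original post, taken here as per unit of sequence length
# per unit of tree-sequence time (no rescaling applied yet).
rate = 1e-8

# Expected site-mode diversity = rate * branch-mode diversity.
expected_site_diversity = rate * branch_diversity

# Estimator from the original snippet: Ne = diversity / (4 * rate).
# The rate cancels, so this is just branch_diversity / 4.
ne_estimate = expected_site_diversity / (4 * rate)

print(f"expected site diversity: {expected_site_diversity:.4f}")
print(f"implied Ne: {ne_estimate:.0f}")
```

With these inputs the expected diversity is roughly 0.39 per unit of sequence length, and the implied Ne is a quarter of the branch-mode diversity, expressed in whatever time units the tree sequence uses rather than in generations.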

I wouldn't bother with the InfiniteAlleles model here by the way --- because you are using a continuous genome you aren't going to have multiple mutations at a site.

Is there a particular reason for using the text-based encoding rather than the dump() and load() methods? They are much faster and more convenient:

# Created from your files
ts.dump("example.ts")

# Same tree sequence loaded in a fraction of a second
ts = tskit.load("example.ts")

2 replies

josephmatheson Jan 18, 2024

Thank you for the reply! My name is Joseph Matheson, I'm working on a project using the same code, so I might be able to answer some of the questions about our code. I've definitely been confused about the appropriate way to scale the mutation rate given the model we're using, so that's more than likely the issue.

Our model is a Moran model. I work on the relative fitness version, where the time-units are time between individual birth or death events, calibrated to be 1 time unit between birth and death events on average. For the absolute fitness case that Walid works with, they should be similar, but there's a bit more calibration there -- Walid might have to describe that in more detail. So the extremely long distance between samples just reflects the fact that our time units represent very small amounts of time -- do we need to adjust the neutral mutation rate parameter to accommodate that?

The forward-time simulations have equal mutation rates at all 'sites' in the genome, but only allow recombination at hotspots between non-recombining blocks (those are the linkage blocks Walid mentioned). We were similarly going to simulate neutral mutations uniformly at random along the genome. Is there a parameter we need to specify for that?

I think the reason for using text-based encoding is that I first implemented a version of this code with tskit in 2018/2019, before the dump() and load() methods existed! We should definitely update that, thank you for reminding us!

jeromekelleher Jan 19, 2024
Maintainer

> So the extremely long distance between samples just reflects the fact that our time units represent very small amounts of time -- do we need to adjust the neutral mutation rate parameter to accommodate that?

Msprime generates mutations at the given rate per unit of sequence length, per unit of time (assumed to be generations, but it doesn't really matter). So, I think you need to scale your mutation rate both by your Moran time-scale (probably converting to generations by dividing by N or so, is simplest?), and by your "density of sites" within each non-recombining block.
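A minimal sketch of that rate conversion, with made-up numbers: the helper `scaled_rate`, the population size `N`, and the block size `k` are all hypothetical, and "one generation is about N Moran steps" is an assumption that may need adjusting (for example, for diploids or for the absolute fitness calibration).

```python
def scaled_rate(mu_per_base_per_generation, bases_per_unit_length, steps_per_generation):
    """Convert a per-base, per-generation rate into a rate per unit of
    tree-sequence length per unit of tree-sequence time."""
    return mu_per_base_per_generation * bases_per_unit_length / steps_per_generation


# Hypothetical values, for illustration only.
N = 1_000    # population size; assumes roughly N Moran steps per generation
k = 100_000  # assumed number of bases represented by one linkage block
mu = 1e-8    # per base, per generation

# Multiply by k (more bases per unit length means more mutations) and
# divide by N (each time unit is only 1/N of a generation).
rate = scaled_rate(mu, bases_per_unit_length=k, steps_per_generation=N)
print(rate)
```

The resulting value is what would be passed as `rate` to sim_mutations. When converting diversity back to Ne, the same scaling has to be applied consistently.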

The docs will hopefully help.

The alternative would be to rescale your time units into generations within the tree sequence, and also change your blocks to be k-bases long.

Both are a bit fiddly, but that's the general approach.
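To illustrate that the two approaches are equivalent, here is a toy check in plain Python (no tskit calls). The edge list, `N`, and `k` are hypothetical, and edges carry their own times here only to keep the example short. It relies on the expected number of mutations on an edge being rate × span × branch length.

```python
# Toy edges as (left, right, child_time, parent_time) in the simulator's own
# units: block coordinates and Moran time steps. Hypothetical values.
edges = [
    (0.0, 1.0, 0.0, 5000.0),
    (0.0, 2.0, 0.0, 12000.0),
    (1.0, 2.0, 5000.0, 12000.0),
]

N = 1_000    # assumed Moran steps per generation
k = 100_000  # assumed bases per linkage block
mu = 1e-8    # per base, per generation


def expected_mutations(edge_list, rate):
    """Sum rate * span * branch length over all edges."""
    return sum(
        rate * (right - left) * (parent_time - child_time)
        for left, right, child_time, parent_time in edge_list
    )


# Option 1: leave the tree sequence alone and rescale the rate.
option_1 = expected_mutations(edges, mu * k / N)

# Option 2: rescale times to generations and coordinates to bases, keep mu.
rescaled = [
    (left * k, right * k, child_time / N, parent_time / N)
    for left, right, child_time, parent_time in edges
]
option_2 = expected_mutations(rescaled, mu)

print(option_1, option_2)
```

Both options give the same expected mutation count (up to floating-point rounding), so the choice is mostly about which units you want the tree sequence itself to carry.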
