simulating genome-wide SNPs #2099

mandresapata · 2022-07-29T17:26:08Z

Hello!
I am having trouble understanding how to simulate a specific number of genome-wide SNPs (e.g., 190,000 SNPs) under a demographic model in msprime. I want to simulate genome-wide SNPs under a human demographic model, so that they mimic 190K SNPs distributed along chromosomes 1-22.

Is there any way to simulate this? I have only seen the option -L [LENGTH], but I think this option only specifies the segment size and not the number of SNPs to simulate, right? Or should I think about it differently, e.g., simulate each chromosome independently (with different lengths) and then just sample the 190,000 SNPs from the output?

Sorry if this question is a little basic or confusing, but I just want to know which approach is better for my objective: is there an option to specify the number of SNPs to simulate in msprime, or must this be done by sampling from the outputs?

I would appreciate any help,
Thanks

molpopgen · 2022-07-29T17:30:20Z

Maintainer

Hi @mandresapata -- see here for previous discussions. TL;DR -- it is not built in, but can be done manually.

petrelharp · 2022-07-29T23:01:16Z

Maintainer

Hold on @molpopgen - I don't think that's the question? What are those SNPs, @mandresapata? If they're 190K SNPs from a genotyping array, then they are ascertained, and so to mimic them you'll have to do something in your code to mimic the ascertainment procedure (e.g., simulate a large number of SNPs and pick an unlinked set at intermediate frequency). If they're actually an unbiased collection of SNPs, then the thing to do is to just choose the mutation rate to get you a bit more than the right number, and then choose a random sample of the size you want. (But I'm guessing they're ascertained somehow.)
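The ascertainment idea above (simulate many SNPs, then keep a spread-out set at intermediate frequency) could be sketched roughly as follows. This is only an illustration and not msprime functionality: the function name `ascertain_snps`, the frequency bounds, and the minimum spacing are all assumptions, and physical distance is used here as a crude stand-in for "unlinked". A real array's ascertainment scheme would need to be modelled on how that array was actually designed.

```python
def ascertain_snps(positions, frequencies, min_freq=0.2, max_freq=0.8,
                   min_spacing=50_000):
    """Return indices of SNPs passing a toy ascertainment filter.

    positions   -- SNP coordinates along one chromosome (any order)
    frequencies -- allele frequency of each SNP in the simulated sample
    The thresholds are illustrative assumptions, not recommended values.
    """
    kept = []
    last_pos = None
    # Scan SNPs from left to right along the chromosome.
    for i in sorted(range(len(positions)), key=lambda k: positions[k]):
        # Step 1: discard SNPs that are too rare or too common.
        if not (min_freq <= frequencies[i] <= max_freq):
            continue
        # Step 2: keep a SNP only if it is far enough from the last kept one.
        if last_pos is None or positions[i] - last_pos >= min_spacing:
            kept.append(i)
            last_pos = positions[i]
    return kept


# Toy input (made-up numbers, for illustration only).
toy_positions = [1_000, 20_000, 60_000, 61_000, 130_000, 200_000]
toy_frequencies = [0.5, 0.05, 0.4, 0.5, 0.9, 0.3]
toy_kept = ascertain_snps(toy_positions, toy_frequencies)
```

With a simulated tree sequence, the positions and frequencies would typically be collected by iterating over `ts.variants()` (reading `variant.site.position` and the mean of `variant.genotypes` for biallelic sites), then passed to a filter like this one chromosome at a time.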

If 190K is really all the SNPs then I'd say that the TL;DR is that you almost certainly don't actually want to simulate a fixed number of SNPs: instead, choose a mutation rate and demographic model to give you the right expected number of SNPs; then you'll get a somewhat different and random number of SNPs, but your analysis pipeline should be robust to that. And, that's the point of simulation: to get real-ish data (and the precise number of SNPs is going to be random, in reality). Similarly, you can't ask to get SNPs at exactly the positions of the observed SNPs. (Well, you could ask for that, but (a) it's computationally infeasible, and (b) it's no longer clear just what you're simulating.)
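As a rough guide to choosing a mutation rate that gives "the right expected number of SNPs", the sketch below uses Watterson's formula, E[S] = 4 N_e μ L Σ_{i=1}^{n-1} 1/i. That formula assumes a neutral, constant-size diploid population under the infinite-sites model, so for a realistic human demographic model it is only a starting point; in practice one would probably run a small pilot simulation and rescale. The function names and the numbers in the usage lines are illustrative assumptions.

```python
import random


def watterson_a(n_genomes):
    """Harmonic sum a_n = 1 + 1/2 + ... + 1/(n-1) for n sampled genomes."""
    return sum(1.0 / i for i in range(1, n_genomes))


def expected_segregating_sites(n_genomes, ne, mu, length):
    """Expected SNP count under a constant-size diploid model (Watterson)."""
    return 4 * ne * mu * length * watterson_a(n_genomes)


def mutation_rate_for_target(target_snps, n_genomes, ne, length):
    """Per-bp, per-generation rate giving `target_snps` SNPs in expectation."""
    return target_snps / (4 * ne * length * watterson_a(n_genomes))


def subsample_site_indices(num_sites, target, seed=42):
    """Pick `target` site indices uniformly at random (all if too few)."""
    if target >= num_sites:
        return list(range(num_sites))
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_sites), target))


# Illustrative numbers only: 200 sampled genomes, Ne = 10,000, 100 Mb,
# aiming slightly above 190K so that a random subset can be drawn afterwards.
mu = mutation_rate_for_target(200_000, n_genomes=200, ne=10_000, length=1e8)
chosen = subsample_site_indices(num_sites=200_000, target=190_000)
```

The resulting `mu` would be the value passed as `rate` to `msprime.sim_mutations`. The number of SNPs actually simulated will still vary from run to run, which is the behaviour described above.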

1 reply

jeromekelleher Jul 30, 2022
Maintainer

This is a great answer @petrelharp, and probably something we should get into the documentation so that hopefully people will hit it when they google some terms like "msprime fixed number of SNPs".

Maybe we add a section "Fixed number of SNPs" to the end of the mutations page?

mandresapata · 2022-07-30T02:02:20Z

Author

Thank you both! @petrelharp and @molpopgen
Yes, those 190K SNPs are a subset of positions located in intergenic regions from a human SNP array (1.4 million SNPs in total). My idea is to use this "real data set" of intergenic SNPs (assuming that they are neutral sites) to infer demographic parameters later.

But first, I want to do a power analysis to find out whether these 190K SNPs have enough power to infer demographic parameters. To do that, I want to simulate genome-wide SNPs in msprime to get a "simulated data set" (I was thinking of getting ~190K intergenic SNPs across the human genome, which would be close to my real subsetted data). Then, I want to use that simulated data to infer known demographic parameters and evaluate whether I am able to recover them.

So, based on your comment @petrelharp, the best way to do this would be to simulate a large number of SNPs and pick an unlinked set of intermediate-frequency SNPs? So my code would simulate each whole chromosome 1..22 in msprime and then pick, for example, 10K randomly chosen, well-separated SNPs (so that they are independent) from each chromosome, which in total would give ~190-200K SNPs, similar to my data set, right?
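One possible variation on the per-chromosome plan described here is to split the genome-wide target across chromosomes in proportion to their lengths rather than taking the same number from each. The sketch below shows that bookkeeping only; `allocate_snps` is a hypothetical helper, and the chromosome lengths in the usage lines are placeholder values, not real human chromosome lengths.

```python
def allocate_snps(chrom_lengths, total_snps):
    """Split `total_snps` across chromosomes proportionally to length.

    Uses the largest-remainder method so the quotas sum exactly to the target.
    chrom_lengths -- dict mapping chromosome name to length in bp
    """
    total_length = sum(chrom_lengths.values())
    # Exact (fractional) share of the target for each chromosome.
    exact = {c: total_snps * length / total_length
             for c, length in chrom_lengths.items()}
    # Start from the rounded-down shares.
    quota = {c: int(share) for c, share in exact.items()}
    # Hand the remaining SNPs to the chromosomes with the largest remainders.
    leftover = total_snps - sum(quota.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for c in by_remainder[:leftover]:
        quota[c] += 1
    return quota


# Placeholder lengths for three hypothetical chromosomes (not real values).
toy_lengths = {"chrA": 300, "chrB": 200, "chrC": 100}
toy_quota = allocate_snps(toy_lengths, total_snps=10)
```

Each chromosome would then be simulated separately with its own length, and its quota used as the number of SNPs to retain after any ascertainment filtering.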

I hope this helps clarify my question, and thank you so much!
