simulating genome-wide SNPs #2099
-
Hi @mandresapata -- see here for previous discussions. TL;DR -- it is not built in, but can be done manually.
-
Hold on @molpopgen - I don't think that's the question? What are those SNPs, @mandresapata?
If they're 190K SNPs from a genotyping array, then they are ascertained, and so to mimic them you'll have to do something in your code to mimic the ascertainment procedure (e.g., simulate a large number of SNPs and pick an unlinked set at intermediate frequency). If they're actually an unbiased collection of SNPs, then the thing to do is just to choose the mutation rate to get you a bit more than the right number, and then choose a random sample of the size you want. (But I'm guessing they're ascertained somehow.)
If 190K is really all the SNPs, then I'd say that the TL;DR is that you almost certainly don't actually want to simulate a fixed number of SNPs: instead, choose a mutation rate and demographic model to give you the right expected number of SNPs; then you'll get a somewhat different and random number of SNPs, but your analysis pipeline should be robust to that. And that's the point of simulation: to get real-ish data (and the precise number of SNPs is going to be random, in reality).
Similarly, you can't ask to get SNPs at exactly the positions of the observed SNPs. (Well, you could ask for that, but (a) it's computationally infeasible, and (b) it's no longer clear just what you're simulating.)
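
As a rough illustration of "choose the mutation rate to get a bit more than the right number, then take a random sample", here is a minimal sketch. It assumes a single constant-size population, where Watterson's expectation E[S] = 4 * Ne * mu * L * a_n holds; under a real human demographic model that formula is only a starting guess, and a short pilot simulation would be a safer way to calibrate the rate. The helper names (`harmonic`, `mutation_rate_for_target`, `simulate_and_subsample`) and the 10% oversampling factor are made up for this example; the msprime and tskit calls (`sim_ancestry`, `sim_mutations`, `delete_sites`) are the standard ones.

```python
import random


def harmonic(n_genomes):
    """a_n = sum_{i=1}^{n-1} 1/i for n sampled haploid genomes."""
    return sum(1.0 / i for i in range(1, n_genomes))


def mutation_rate_for_target(n_snps, n_genomes, ne, seq_len):
    """Per-bp, per-generation rate such that E[S] = n_snps.

    Assumes a constant-size neutral population (Watterson):
    E[S] = 4 * Ne * mu * L * a_n.
    """
    return n_snps / (4.0 * ne * seq_len * harmonic(n_genomes))


def simulate_and_subsample(n_snps, n_diploid, ne, seq_len, recomb, seed,
                           oversample=1.1):
    """Simulate somewhat more than n_snps sites, then keep a random subset."""
    import msprime  # third-party; imported here so the helpers above stand alone

    # Aim a bit high so that most replicates have at least n_snps sites.
    mu = oversample * mutation_rate_for_target(
        n_snps, 2 * n_diploid, ne, seq_len)
    ts = msprime.sim_ancestry(
        samples=n_diploid, population_size=ne, sequence_length=seq_len,
        recombination_rate=recomb, random_seed=seed)
    ts = msprime.sim_mutations(ts, rate=mu, random_seed=seed)

    # Keep a uniformly random subset of sites and delete the rest.
    rng = random.Random(seed)
    keep = set(rng.sample(range(ts.num_sites), min(n_snps, ts.num_sites)))
    drop = [s for s in range(ts.num_sites) if s not in keep]
    return ts.delete_sites(drop)
```

If a replicate happens to produce fewer than `n_snps` sites, this sketch simply keeps all of them, which is in the spirit of letting the number of SNPs vary from run to run.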
-
Thank you both, @petrelharp and @molpopgen! First, I want to run a power analysis to find out whether these 190K SNPs have enough power to infer demographic parameters. To do that, I want to simulate genome-wide SNPs in msprime to get a "simulated data set" (I was thinking of getting ~190K intergenic SNPs across the human genome, which would be close to my real subsetted data). Then I want to use that simulated data to infer known demographic parameters and evaluate whether I am able to recover them.
So, based on your comment @petrelharp, the best way to do this would be to simulate a large number of SNPs and pick an unlinked set of intermediate-frequency SNPs? That is, my code would simulate each whole chromosome (1..22) in msprime and then pick, for example, 10K randomly chosen, distant SNPs (so they are independent) from each chromosome, which in total will be ~190-200K SNPs, similar to my data set, right?
I hope this helps clarify my question, and thank you so much!
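
One way the "pick distant, intermediate-frequency SNPs from each chromosome" step could look is sketched below. This is only a crude stand-in for array ascertainment, not the scheme of any particular genotyping array: the function names and the default thresholds (`min_maf=0.05`, `min_gap=50_000`) are assumptions chosen for illustration, and a minimum spacing reduces linkage between kept SNPs without making them strictly independent. `derived_freqs` expects a tskit tree sequence such as the output of `msprime.sim_mutations`, simulated once per chromosome.

```python
def thin_snps(positions, freqs, min_maf=0.05, min_gap=50_000):
    """Greedy left-to-right thinning of SNPs on one chromosome.

    positions: site positions in bp, sorted in increasing order.
    freqs: derived-allele frequency of each site, in [0, 1].
    Returns the indices of the kept sites: those with minor allele
    frequency >= min_maf that lie at least min_gap bp after the
    previously kept site.
    """
    kept = []
    last_pos = None
    for idx, (pos, freq) in enumerate(zip(positions, freqs)):
        maf = min(freq, 1.0 - freq)
        if maf < min_maf:
            continue  # too rare (or nearly fixed) to count as "intermediate"
        if last_pos is not None and pos - last_pos < min_gap:
            continue  # too close to the last kept SNP
        kept.append(idx)
        last_pos = pos
    return kept


def derived_freqs(ts):
    """Positions and non-ancestral allele frequencies for each site in ts."""
    positions, freqs = [], []
    for var in ts.variants():
        positions.append(var.site.position)
        # Any allele index > 0 is counted as derived, including at
        # multi-allelic sites.
        freqs.append(float((var.genotypes > 0).mean()))
    return positions, freqs
```

For a per-chromosome tree sequence `ts`, calling `thin_snps(*derived_freqs(ts))` gives the site indices to keep; if more sites pass the filter than are needed, a random subsample of those indices can then be drawn.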
-
Hello!
I am having trouble understanding how to simulate a specific number of genome-wide SNPs (e.g., 190,000 SNPs) under a demographic model in msprime. I want to simulate genome-wide SNPs under a human demographic model, so that these SNPs mimic 190K SNPs distributed along chromosomes 1-22.
Is there any way to simulate this? I have only seen the option -L [LENGTH], but I think this option only specifies the segment size and not the number of SNPs that I want to use for the simulation, right? Or should I think about it differently, e.g., simulate each chromosome independently (with different lengths) and then just sample the 190,000 SNPs from the output?
Sorry if this question is a little basic or confusing, but I just want to know which way would be better for my objective: is there an option to specify the number of SNPs to simulate in msprime, or must this be done by sampling from the outputs?
I would appreciate any help,
Thanks