Simulatio to get heterozygosity of 0.5 #2320
Replies: 1 comment
-
I have good and bad news. The good news is that I don't think there's anything wrong with your code (but I haven't carefully checked). The bad news is that you're not getting heterozygosity of 0.5 because it is extremely unlikely to happen - you could simulate for the lifetime of the universe and it still wouldn't happen. For heterozygosity (of segregating sites, as defined here) to equal 0.5, you'd need every variant to be at frequency exactly 0.5. The chance of any one variant being close to 0.5 is already small, but since you're simulating a long genome, there's lots of variants, and the chance that all the variants are even above 0.01 is vanishingly small. So - my suggestion is to readjust your goal? Best of luck. I'll close this, but please re-open if I've misinterpreted your post. Happy simulating! |
Beta Was this translation helpful? Give feedback.
-
Hi everyone,
I am currently running a simulation in msprime with the objectives detailed below. I've attached the script for reference. Despite running the code multiple times with various generation settings—1000 generations, 1 million generations, and even 400 million generations—I have been unable to generate a VCF file with a heterozygosity of 0.5.
I would greatly appreciate any assistance or suggestions on how to achieve this target. Thank you!
##Procedure of Simulation
Initial Simulation to Find Heterozygosity at 0.5 excatly:
The script initializes the simulation with a variable current_generation and increases it by 100 generations in each iteration.
It simulates until it finds a generation where the heterozygosity is 0.5 excatly.
Saving VCF File at the Generation with Heterozygosity of 0.5:
Once the target heterozygosity is reached, the script saves a VCF file at this generation with the name indicating the heterozygosity of 0.5.
Continued Simulation from This Generation:
I have tried to run it for 1000 generations, 1 million and 4 billion generations. Script is attested below. Scenarios
Low Ne-100, Low migration rate:0.001
import os
import msprime
import numpy as np
import subprocess
Create directories for output files and filtered VCF files
output_dir = "/shared/jezkovt_bistbs_shared/Tiger_demography_msprime/LowNe_LowMig1"
filtered_output_dir = "/shared/jezkovt_bistbs_shared/Tiger_demography_msprime/LowNe_LowMig1/filtered_LowNe_LowMig1"
os.makedirs(output_dir, exist_ok=True)
os.makedirs(filtered_output_dir, exist_ok=True)
Initialize demography
demography = msprime.Demography()
demography.add_population(name="Panthera_tigris_tigris", initial_size=7819)
demography.add_population(name="Panthera_tigris_tigris_A", initial_size=100)
demography.add_population(name="Panthera_tigris_tigris_B", initial_size=100)
demography.set_migration_rate(source="Panthera_tigris_tigris_A", dest="Panthera_tigris_tigris_B", rate=0.001)
n_loci = 1000
ind_names = ["Panthera_tigris_tigris_A_" + str(i) for i in range(25)]
Function to calculate heterozygosity
def calculate_heterozygosity(tsm):
allele_frequencies = [variant.genotypes for variant in tsm.variants()]
heterozygosities = []
1. Simulate up to 6,389,572,031,159 generations to reach heterozygosity of 0.5
current_generation = 6389572031159
ts = msprime.sim_ancestry(
{"Panthera_tigris_tigris_A": 25},
ploidy=2,
demography=demography,
sequence_length=122500000,
recombination_rate=1e-8,
end_time=current_generation
)
tsm = msprime.sim_mutations(ts, rate=0.35e-8)
heterozygosity = calculate_heterozygosity(tsm)
print(f"Heterozygosity at {current_generation} generations: {heterozygosity}")
Save the VCF file at the starting point where heterozygosity is around 0.5
vcf_file_path = os.path.join(output_dir, f"Heterozygosity_0.5_gen_{current_generation}.vcf")
with open(vcf_file_path, "w") as vcf_file:
tsm.write_vcf(vcf_file, individual_names=ind_names)
2. Continue simulation and save VCF files at specified intervals (100 generations each)
for additional_generations in range(100, 1001, 100):
ts = msprime.sim_ancestry(
{"Panthera_tigris_tigris_A": 25},
ploidy=2,
demography=demography,
sequence_length=122500000,
recombination_rate=1e-8,
end_time=current_generation + additional_generations
)
tsm = msprime.sim_mutations(ts, rate=0.35e-8)
3. Filter VCF files using vcftools and save them to the new filtered directory
for filename in os.listdir(output_dir):
if filename.endswith(".vcf"):
input_vcf = os.path.join(output_dir, filename)
output_vcf = os.path.join(filtered_output_dir, f"filtered_{os.path.splitext(filename)[0]}")
# Run vcftools to filter VCF files
subprocess.run([
"vcftools", "--vcf", input_vcf,
"--max-alleles", "2",
"--recode", "--out", output_vcf
])
I appreciate your help.
Beta Was this translation helpful? Give feedback.
All reactions