New features are actively under construction (Fall 2024).
- Integrating user-friendly multiple testing correction
- Replacing the applet filter-lines.jar, which is prone to corruption, with a Python script
- Documenting simulation results for an upcoming paper
Contact sethtem@umich.edu or open a GitHub issue for troubleshooting.
See misc/announcements.md for high-level updates on this repo.
See misc/fixes.md for any major bug fixes.
See misc/usage.md to evaluate if this methodology fits your study.
See misc/cluster-options.md for some suggested cluster options to use in pipelines.
See the closed Issues on GitHub for comments I (Seth) left about the pipeline.
Please cite if you use this package.
Temple, S.D., Waples, R.K., Browning, S.R. (2024). Modeling recent positive selection using identity-by-descent segments. The American Journal of Human Genetics. https://doi.org/10.1016/j.ajhg.2024.08.023.
Temple, S.D., Thompson, E.A. (2024). Identity-by-descent in large samples. Preprint at bioRxiv, 2024.06.05.597656. https://www.biorxiv.org/content/10.1101/2024.06.05.597656v1.
Temple, S.D. (2024). Statistical Inference using Identity-by-Descent Segments: Perspectives on Recent Positive Selection. PhD thesis (University of Washington). https://www.proquest.com/docview/3105584569?sourcetype=Dissertations%20&%20Theses.
Acronym: incomplete Selective sweep With Extended haplotypes Estimation Procedure
This software presents methods to study recent, strong positive selection.
- By recent, we mean within the last 500 generations
- By strong, we mean selection coefficient s >= 0.015 (1.5%)
The methods relate lengths of IBD segments to a coalescent model under selection.
We assume 1 selected allele at a locus.
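For intuition, here is a standard coalescent-theory heuristic, not a result specific to this package: if a pair of haplotypes coalesces g generations ago at the locus, recombination has had 2g meioses per Morgan on each side to break up the shared haplotype, so the IBD segment around the locus extends an Exponential(2g) distance in Morgans in each direction, with expected total length

$$
\mathbb{E}[\ell] = \frac{1}{2g} + \frac{1}{2g} = \frac{1}{g}\ \text{Morgans} = \frac{100}{g}\ \text{cM}.
$$

Recent, strong selection shortens coalescent times among carriers of the sweeping allele, which inflates the local rate of long IBD segments.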
- A genome-wide selection scan for anomalously large IBD rates
  - With multiple testing correction
- Inferring anomalously large IBD clusters
- Ranking alleles based on evidence for selection
- Computing a measure of cluster agglomeration (Gini impurity index)
- Estimating frequency and location of unknown sweeping allele
- Estimating a selection coefficient
- Estimating a confidence interval
See misc/usage.md.
- Whole genome sequences
- At least ~500 diploid samples
- Phased VCF data (0|1)
- No apparent population structure
- No apparent close relatedness
- A genetic map (bp ---> cM)
  - If one is not available, create a genetic map with a uniform rate (see the sketch below)
- Recombining diploid chromosomes
  - Not extended to the human X chromosome
- Access to cluster computing
  - For human-scale data, you should have at least 25 GB of RAM and 6 CPUs on a node
  - More memory and cores for TOPMed- or UKBB-scale sequence datasets
  - Not extended to cloud computing
The chromosome numbers in genetic maps should match the chromosome numbers in VCFs.
The genetic maps should be tab-separated.
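If no genetic map is available, here is a minimal sketch for building a uniform-rate, tab-separated map (1 cM/Mb is a common default). The input file chrom-lengths.txt and the three-column layout are assumptions for illustration; check the pipeline README for the exact columns your workflow expects.

```
# Sketch: emit a tab-separated genetic map with a uniform 1 cM/Mb rate.
# chrom-lengths.txt is a hypothetical two-column file: <chrom> <end_bp>.
# Column layout (chrom, bp, cM) is an assumption; verify against the pipeline docs.
while read -r chrom end_bp; do
  awk -v c="$chrom" -v e="$end_bp" 'BEGIN {
    OFS = "\t"
    for (bp = 0; bp <= e; bp += 1000000)  # one row per Mb
      print c, bp, bp / 1e6               # uniform rate: cM = bp / 1e6
  }'
done < chrom-lengths.txt > uniform.map
```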
This repository contains a Python package and some Snakemake bioinformatics pipelines.
- The package ---> src/
- The pipelines ---> workflow/
You should run all Snakemake pipelines in their workflow/some-pipeline/ directory.
You should be in the isweep environment (mamba activate isweep) for analyses.
You should run the analyses using cluster jobs.
We have made README.md files in most subfolders.
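Putting these conventions together, a typical session looks like the following sketch (workflow/scan is just one example pipeline directory):

```
# Sketch of the conventions above: activate the environment,
# then work from the pipeline's own directory.
mamba activate isweep
cd workflow/scan   # or whichever workflow/some-pipeline/ you are running
# launch snakemake from here, as shown in the procedure below
```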
See misc/installing-mamba.md to get a Python package manager.
- Clone the repository:
```
git clone https://github.com/sdtemple/isweep.git
```
- Get the Python package:
```
mamba env create -f isweep-environment.yml
mamba activate isweep
python -c 'import site; print(site.getsitepackages())'
```
- Download software:
```
bash get-software.sh software
```
  - Puts these in a folder called software/
  - Requires wget
- For the simulation study, download SLiM yourself: https://messerlab.org/slim/
  - Put it in software/
- You need to cite these software packages.
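To sanity-check the setup, you can try importing the package from the activated environment. This assumes the package is importable under the name isweep; adjust if the import name differs.

```
# Quick check (assumption: the package is importable as `isweep`).
mamba activate isweep
python -c 'import isweep; print("isweep imported OK")'
```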
See the workflow/other-methods/ folder for how we run the methods we compare to.
This is the overall procedure. You will see more details for each step in workflow/some-pipeline/README.md files.
Phase data with Beagle or SHAPEIT beforehand. Subset data in light of global ancestry and close relatedness.
- Here is a pipeline we built for these purposes: https://github.com/sdtemple/flare-pipeline
- You could use IBDkin to detect close relatedness: https://github.com/YingZhou001/IBDkin
- You could use PCA, ADMIXTURE, or FLARE to determine global ancestry.
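For example, once you have a list of samples passing your ancestry and relatedness filters, a minimal bcftools sketch for subsetting might look like this (keep-samples.txt is a hypothetical file with one sample ID per line):

```
# Sketch: keep only samples passing ancestry/relatedness filters.
bcftools view -S keep-samples.txt in.vcf.gz -Oz -o subset.vcf.gz
bcftools index -t subset.vcf.gz   # index the subset for downstream tools
```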
- Make pointers to large (phased) VCF files.
- Edit YAML files in the different workflow directories.
- Run the selection scan (workflow/scan):
```
nohup snakemake -s Snakefile-scan.smk -c1 --cluster "[options]" --jobs X --configfile *.yaml &
```
- See the file misc/cluster-options.md for support.
- Recommendation: do a test run with your 2 smallest chromosomes.
- Check *.log files from ibd-ends. If it recommends an estimated error rate, change the error rate in the YAML file.
- Then, run with all your chromosomes.
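As a concrete illustration, on a SLURM cluster the [options] placeholder might be filled in as below. The memory, CPU, time, and job-count values are assumptions for illustration only; see misc/cluster-options.md for suggestions.

```
# Illustrative SLURM submission (resource values are assumptions, not requirements).
nohup snakemake -s Snakefile-scan.smk -c1 \
  --cluster "sbatch --mem=25G --cpus-per-task=6 --time=24:00:00" \
  --jobs 20 --configfile *.yaml &
```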
- Estimate recent effective sizes: workflow/scan/scripts/run-ibdne.sh
- Make the Manhattan plot: workflow/scan/scripts/manhattan.py
- Check out the roi.tsv file.
  - Edit with locus names if you want.
  - Edit to change defaults: additive model and 95% confidence intervals.
- Run the region of interest analysis (workflow/roi):
```
nohup snakemake -s Snakefile-roi.smk -c1 --cluster "[options]" --jobs X --configfile *.yaml &
```
The flow chart below shows the steps ("rules") in the selection scan pipeline.
Diverging paths "mle" versus "scan" refer to different detection thresholds (3.0 and 2.0 cM).
See dag-roi.png for the steps in the sweep modeling pipeline.
- cM, not bp, windowing
- Replace filter-lines.jar with a Python script
  - Applet prone to corruption
- Integrate multiple testing correction into the pipeline
- Extension to IBD mapping