Sites will be selected by first running a GWAS of each ancestry in 1000 Genomes (https://www.internationalgenome.org/). For each ancestry, associated genetic variants that are also mQTLs (http://mqtldb.godmc.org.uk/) will be ordered by mQTL effect size and the top 50 selected for the panel.
-
Input:
- mQTL summary statistics from GoDMC http://fileserve.mrcieu.ac.uk/mqtl/assoc_meta_all.csv.gz
- mapping between rsid and genomic coordinates ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/snp151Common.txt.gz
-
Output: GoDMC summary statistics with hg38 coordinates in
godmc-hg38.csv.gz
mkdir output
Rscript src/extract-mqtls.r godmc-hg38.csv.gz
- Input: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
- Output: GWAS phenotype files for each ancestry in
pheno/
'Super-populations' (i.e. AFR AMR EAS EUR SAS) will each be compared to all other super-populations. Other populations will be compared to all other populations within the same super-population.
On the compute cluster:
- create a folder for the scripts, inputs and outputs (in scratch space),
call it
BASE
- copy
ancestry/src/
toBASE
- generate GWAS phenotype files as follows:
Rscript src/generate-phenotype-files.r pheno
- Input:
ancestries.txt
,pheno/
, 1000 Genomes data - Output: GWAS outputs for each ancestry in
gwas-fst/
andgwas-glm/
On the compute cluster:
- copy
ancestry/ancestries.txt
toBASE
- download 1000 Genomes data (http://hgdownload.cse.ucsc.edu/gbdb/hg38/1000Genomes/) to
BASE/1000G
- submit the GWAS jobs to the system as follows:
mkdir gwas-glm
mkdir gwas-fst
sbatch src/gwas-glm.sh ancestries.txt 1000G pheno gwas-glm
sbatch src/gwas-fst.sh ancestries.txt 1000G pheno gwas-fst
- Input:
gwas-fst/
,gwas-glm/
,godmc-hg38.csv.gz
,ancestries.txt
- Output: mQTLs associated with each ancestry in
sites.csv
(at logistic p < 5e-8)
On the compute cluster, submit the jobs to cluster as follows:
sbatch src/filter-sites.sh ancestries.txt gwas-glm gwas-fst godmc-hg38.csv.gz output/sites
Afterward, collate the top 50 for each ancestry into a single file:
Rscript src/select-sites.r output/sites output/sites.csv
Determine to what extent the SNPs identified actually capture genetic variation of ancestry.
- Input:
output/sites.csv
, 1000 Genomes data - Output: Genotypes for each mQTL in
output/sites.csv
in VCF files ingenotypes/
Submit the jobs to extract genotypes to the system as follows:
Rscript src/extract-sites.r output/sites.csv sites.txt
mkdir genotypes
sbatch src/extract-genotypes.sh sites.txt 1000G genotypes
For further checks of the CpG sites selected here, see ../checks/readme.md.