This toolkit is for generating Single Nucleotide Polymorphism (SNP) based Polygenic Risk Scores (PRS) for common polygenic disease / traits where it is necessary to impute HLA alleles using single SNPs and model epistasis.
The major advantage of this is the ability to assign HLA haplogenotypes (e.g DR3/DQ2.5) using SNPs alone which is quicker and more cost effective than typical methods such array imputation and HLA typing. Demo data and guide files are provided for our own published PRS but these could be easily generated in order to apply code to your own HLA calling and or scoring schema.
For any questions about code on this page please contact Seth Sharp by email (Ssharp@stanford.edu).
7/8/2022 - Updated the SNP lists for TOPMED (r2) imputations as per NIH TOPMED server
7/8/2022 - Reordered the SNP lists folder to split by TOPMED vs 1000G imputation
7/8/2022 - Added a wrapper script
hla-prs-toolkit-0.22a.zip - Latest version of all scripts and demo data.
Alternatively "git clone https://github.com/sethsh7/hla-prs-toolkit.git" to clone onto your working local directory or server.
PLINK 1.9 is required.
Anaconda or Python (Version 3.8).
PLINK genotyping data (.bed/.bim/.fam)
python 1_plink2call.py --bfile <prefix> --mapping <mapping> --rank <ranking>
This script takes your PLINK data containing a set of SNPs for which you wish to generate HLA haplogenotypes and a mapping file which should be text (space or tab delimited) in the format:
SNP ALLELE
snp1 allele1
snp2 allele2
snp3 allele3
...
The SNPs are matched on RSID and should match exactly with your PLINK .bim file. The script will output the following results if succesful:
a) <prefix>_clean.txt - A table of allele counts by sample and mapped allele.
b) <prefix>_count.txt - Row wise sum of alleles per person for quality control.
c) <prefix>_cat.txt - List of categorical haplogenotypes by sample.
Additionally a ranking file is now needed to resolve impossible calls and invoke the HLA-LD tiebreak algorithm (see below).
Since only 2 HLA calls are possible per person (corresponding to 2 chromosomes) any samples with >2 calls were previously filtered out. This is no longer the case as an (unpublished) HLA LD-Tiebreak algorithm as been added based on a reference ranking. The reference ranking provided has been calculated based on the following:
Ranking score = SNP LD (r2) * Prior odds (frequency in reference population)
The reference prior odds were generated from HLA typing data take from Klitz et al (2003) and are therefore appropriate for a White US (European ancestry) population. This scenario typically occurs either in genotyping miscalls or in non-European ancestry samples.You can generate your own reference if you have other ethnicity data from any published HLA typing database. Our internal data shows ~72% of calls recovered this way are accurate compared to HLA typing.
The ranking file provided must have the following format:
ALELE RANK
allele1 1
allele2 2
allele3 3
...
python 2_cat2scores.py --cat <your_cat.txt> --score <scorefile>
The second script takes the by-sample categorical haplogenotype list file (c above) generated from PLINK2CALL and applies a scoring scheme. This can be EITHER on a full haplogenotype basis (one weight per combination) or on an additive basis (sum of each allele weight) or a mixture of both. The scorefile should be text (space or tab delimited) in the format:
ALLELE1 ALLELE2 BETA
allele1 allele2 value - Use this format for complete haplogenotypes
allele1 allele3 value
allele3 allele2 value
NULL allele1 value - For addition of an additive model use the NULL representation for the first column.
...
By default the script will NOT generate an additive term if a complete haplogenotype term is available (this differs from specific "interaction terms"). The script will output the following results file:
a) <prefix>_Scored.txt - List of samples and their corresponding scores as above
To generate a complete PRS (e.g. variants outside the HLA region) you can then sum this component with regular allele scoring generated using PLINK score function.
Scoring and mapping files are provided to generate scores from our publications with either TOPMED or 1000G imputed data.
10-SNP / 30-SNP score ("GRS1") [1] - Original 10 and 30 SNP Exeter scores, they use a simple interaction model for DR3 and DR4 haplotypes.
67-SNP score ("GRS2") [2] - Updated 67 SNP Exeter score, complex interaction model, combines interaction and additive models (as described above), best prediction.
42-SNP score [3] - 42 SNP Exeter score, interaction model only consisting of DQ2.5, DQ2.2, DQ8.1 and DQ2.2 haplotypes.
In the scripts folder a subfolder "Demo" contains randomly generated data on 50 samples. Run with RunDemo.sh.
You will need to impute your data using NIH Imputation Server. You will then need to convert your VCF of imputed dosages to PLINK format and extract the variants you need.
For TOPMED it should be hg38, for 1000G hg19.
We have provided SNP lists for TOPMED r2 and 1000 Genomes as an alternative.
You will need to download your raw genome data and impute it via a third party service.
They are set to 0 by default as the algorithm is conservative with calls.
[1] "A Type 1 Diabetes Genetic Risk Score Can Aid Discrimination Between Type 1 and Type 2 Diabetes in Young Adults", DiabetesCare 2016, RA Oram et al
[2] "Development and Standardization of an Improved Type 1 Diabetes Genetic Risk Score for Use in Newborn Screening and Incident Diagnosis", DiabetesCare 2019, SA Sharp et al
[3] "A single nucleotide polymorphism genetic risk score to aid diagnosis of coeliac disease: a pilot study in clinical care", AP&T 2020, SA Sharp et al
Resources are provided under the GNU General Public License v3.0. For more information please LICENSE.md.