Joyce Y. Wang1, Neeka Lin1, Michael Zietz2, Jason Mares3, Vagheesh M. Narasimhan1,4, Paul J. Rathouz4,5 and Arbel Harpak1,5,+
1 Department of Integrative Biology, The University of Texas at Austin, Austin, TX
2 Department of Biomedical Informatics, Columbia University, New York, NY
3 Department of Neurology, Columbia University, New York, NY
4 Department of Statistics and Data Science, The University of Texas at Austin, Austin, TX
5 Department of Population Health, The University of Texas at Austin, Austin, TX
+ Correspondence should be addressed to A.H. (arbelharpak@utexas.edu)
Provided below are instructions and details for scripts used to generate the results and figures in "Three Open Questions in Polygenic Score Portability".
Install the software:
Download the UK Biobank (UKB) dataset, following their guidelines. The scripts also use the 1000 Genomes phase 3 dataset provided by Plink, but you do not need to download it beforehand, because 05h_ukb_kgp_pca.sh downloads it.
For running the scripts, we recommend creating a conda environment.
git clone https://github.com/harpak-lab/Portability_Questions
cd Portability_Questions
conda env create -f environment.yml
Before execution, edit the directory paths in the scripts so that they point to your own directories.
Execute the bash scripts ending with .sh with bash <script_name.sh>. Please see the details for each script in the following sections.
Execute the following files to filter and prepare the data:
00_make_directories.sh
01a_extract_data_fields.sh (make sure to edit the file so it's pointing to the correct UKB basket file)
01b_filter_individuals_job.sh
01d_filter_genotype_files.sh
02_prepare_covariates_phenotypes.sh
In the selection of the GWAS sample, we used the White British classification as provided by the UKB.
Execute the following files to perform GWAS, clumping, and thresholding:
03_gwas.sh
04a_clumping.sh
04e_after_clumping.sh
The fixation index (Fst) is a natural single-number measure of the divergence between two sets of chromosomes, and we considered using it to measure the distance between an individual's pair of chromosomes and the chromosomes in the GWAS sample. However, calculating Fst was computationally costly, so we instead used Euclidean distance in PC space as a single-number proxy for genetic distance from the GWAS sample.
Execute the following files to calculate Fst:
05a_pc_dist_fst.sh
- All the scripts created under temp_fst_path
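For orientation, here is a minimal sketch of one common Fst estimator (Hudson's estimator, computed as a ratio of sums across SNPs). This README does not state which estimator the scripts use, so the choice of estimator, the function name `hudson_fst`, and the toy frequencies are all assumptions made for illustration.

```python
def hudson_fst(p1, n1, p2, n2):
    """Hudson's Fst between two samples, as a ratio of sums across SNPs.

    p1, p2: per-SNP frequencies of the same allele in sample 1 and sample 2.
    n1, n2: number of chromosomes in each sample (2 for a single diploid
            individual). Both must be greater than 1.
    Assumption: this estimator is illustrative, not necessarily the one
    used by 05a_pc_dist_fst.sh.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for sampling noise.
        numerator += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that two chromosomes, one from each sample, differ.
        denominator += a * (1 - b) + b * (1 - a)
    # Undefined (division by zero) if every SNP is fixed for the same allele.
    return numerator / denominator


# Toy example: an individual (2 chromosomes) fixed for the opposite allele
# at both SNPs compared with a 200-chromosome sample.
print(hudson_fst([1.0, 0.0], 2, [0.0, 1.0], 200))  # 1.0
```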
Then, execute the following files to calculate Euclidean distance:
05e_find_best_num_pc.sh (creates Fig. S1)
05h_ukb_kgp_pca.sh (downloads 1000 Genomes phase 3 dataset provided by Plink)
05j_pc_dist_fst_plots.sh (creates Fig. 1)
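The distance computed in this step can be sketched as follows. The sketch assumes that "distance from the GWAS sample" means distance from the centroid (mean PC vector) of the GWAS sample, which this README does not state explicitly. The function names and the two-PC toy values are hypothetical.

```python
import math


def centroid(pc_vectors):
    """Mean of each PC coordinate across a list of equal-length PC vectors."""
    n = len(pc_vectors)
    return [sum(column) / n for column in zip(*pc_vectors)]


def pc_distance(individual_pcs, gwas_centroid):
    """Euclidean distance between one individual's PCs and the centroid."""
    return math.sqrt(
        sum((x - c) ** 2 for x, c in zip(individual_pcs, gwas_centroid))
    )


# Toy data: three GWAS-sample individuals described by two PCs (made-up values).
gwas_sample_pcs = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
center = centroid(gwas_sample_pcs)        # [1.0, 1.0]
print(pc_distance([4.0, 5.0], center))    # 5.0
```

The number of PCs to include is itself a choice; in this repository it is selected by 05e_find_best_num_pc.sh.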
Execute the following file to calculate PGS:
06_compute_prs.sh
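As a reminder of what a PGS is, the sketch below computes the standard additive score: the sum over SNPs of the effect-allele dosage times the estimated effect size. It is an illustration only and is not taken from 06_compute_prs.sh; the function name and toy values are hypothetical.

```python
def polygenic_score(dosages, effect_sizes):
    """Additive PGS for one individual.

    dosages: number of copies (0, 1, or 2) of the effect allele at each SNP.
    effect_sizes: GWAS effect estimate for the same allele at each SNP.
    """
    return sum(d * beta for d, beta in zip(dosages, effect_sizes))


# Toy example with three index SNPs (made-up values).
print(polygenic_score([2, 1, 0], [0.5, -0.25, 0.125]))  # 0.75
```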
We evaluated PGS prediction accuracy at both the group level and individual level:
07_group_ind_level_pred.sh (creates Fig. 2, S2-13)
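The two levels of evaluation can be illustrated with a small sketch. It assumes that group-level accuracy is a squared correlation between PGS and trait value among individuals in a group (for example, a bin of genetic distance), and that individual-level accuracy is each person's squared prediction error. These definitions are assumptions for illustration; consult the script and the paper for the exact measures, including any covariate adjustment.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def group_level_r2(pgs, trait):
    """Squared correlation between PGS and trait within one group."""
    return pearson_r(pgs, trait) ** 2


def individual_squared_errors(predicted, observed):
    """Squared prediction error for each individual."""
    return [(p - o) ** 2 for p, o in zip(predicted, observed)]


# Toy data (made-up values).
print(group_level_r2([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))      # 1.0
print(individual_squared_errors([1.0, 2.0], [1.5, 2.0]))     # [0.25, 0.0]
```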
We compared the variance in squared prediction error explained by 8 raw measures: genetic distance, Townsend Deprivation Index, average yearly total household income before tax, educational attainment (converted into years of education), minor allele counts for SNPs with different magnitudes of effects (three equally sized bins of small, medium, and large squared effect sizes; see Fig. S23), and minor allele counts of all SNPs:
08a_prepare_for_ma_counts.sh
08b_calc_ma_counts.sh
08d_ind_pred_plots.sh (creates Fig. 3, S14-21)
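A minor allele count for one individual can be sketched as below. The sketch assumes dosages are recorded for a reference "counted" allele and that the minor allele is defined by that allele's frequency in some reference sample; which sample defines the minor allele is not stated in this README, so treat that as an assumption. The function name is hypothetical.

```python
def minor_allele_count(dosages, counted_allele_freqs):
    """Total number of minor alleles carried by one individual.

    dosages: copies (0, 1, or 2) of the counted allele at each SNP.
    counted_allele_freqs: frequency of the counted allele at each SNP in
        the reference sample (assumption: this sample defines "minor").
    """
    total = 0
    for dosage, freq in zip(dosages, counted_allele_freqs):
        # If the counted allele is the major allele, the individual carries
        # 2 - dosage copies of the minor allele instead.
        total += dosage if freq <= 0.5 else 2 - dosage
    return total


# Toy example with three SNPs (made-up values).
print(minor_allele_count([2, 0, 1], [0.1, 0.9, 0.5]))  # 5
```

Restricting the SNP list to one effect-size bin before calling the function would give a per-bin count.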
To understand why immunity-related traits like lymphocyte count have group-level prediction accuracy that drops near zero even at a short genetic distance, we performed additional analyses.
We first performed two additional GWASs and compared the allelic effects across the three GWASs:
09a_prepare_close_far_pca.sh
09c_close_far_pca.sh
09d_prepare_close_far_gwas.sh
09f_gwas_close.sh and 09f_gwas_far.sh
We calculated heterozygosity at index SNPs as a function of genetic distance:
09g_calc_heterozygosity.sh
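One way to compute heterozygosity at a SNP within a group of individuals (for example, a bin of genetic distance) is the expected heterozygosity 2p(1 - p), where p is the allele frequency in that group. Whether the script uses expected or observed heterozygosity is not stated here, so the sketch below is an assumption, and the function name is hypothetical.

```python
def expected_heterozygosity(dosages):
    """Expected heterozygosity 2p(1 - p) at one SNP.

    dosages: copies (0, 1, or 2) of one allele for each individual in the
        group; p is estimated as the mean dosage divided by 2.
    """
    p = sum(dosages) / (2 * len(dosages))
    return 2 * p * (1 - p)


# Toy example: four individuals at one index SNP (made-up genotypes).
print(expected_heterozygosity([0, 1, 2, 1]))  # 0.5
```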
We examined the variance of PGS as a function of genetic distance:
09j_compare_effect_sizes_heterozygosity_var_pgs_plots.sh (creates Fig. 4, S22)
We estimated the heritability associated with each index SNP:
10a_compare_heritability.sh (creates Fig. S23-24)
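Under a simple additive model, the heritability attributable to a single SNP is often approximated as 2p(1 - p)β² divided by the trait variance, where p is the allele frequency and β the allelic effect. The sketch below shows that approximation only; it is an assumption that the script uses this form, and the function name and values are hypothetical.

```python
def snp_heritability(allele_freq, beta, trait_variance=1.0):
    """Approximate variance explained by one SNP under an additive model.

    allele_freq: frequency p of the effect allele.
    beta: per-allele effect estimate, in trait units.
    trait_variance: phenotypic variance (1.0 for a standardized trait).
    """
    return 2 * allele_freq * (1 - allele_freq) * beta ** 2 / trait_variance


# Toy example (made-up values).
print(snp_heritability(0.5, 0.5))  # 0.125
```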