PRSmix

The PRSmix package can be used to benchmark PRSs and compute the linear combination of trait-specific scores (PRSmix) and cross-trait scores (PRSmix+) to improve prediction accuracy.

INSTALLATION

To use PRSmix:

install.packages("devtools")
devtools::install_github("buutrg/PRSmix")
library(PRSmix)

MANUAL

We demonstrate the usage of PRSmix with PGS obtained from (but not limited to) PGS catalog and evaluated on an independent cohort. The steps are:

Harmonize SNP effect sizes to the risk alleles (with harmonize_snpeffect_toALT function)
Calculate PRS with all scores (with compute_PRS function)
Evaluate PRSs and performed linear combination: trait-specific (PRSmix) and cross-trait (PRSmix+) (with combine_PRS function)

NOTE:

If you already have the PRSs estimated in the target cohort (e.g. similar to plink2 format) with _SUM columns (via --score <your SNP effect file> cols=+scoresums no-mean-imputation) and want to benchmark and combine scores, you can directly go to step 3 (evaluate and perform linear combination of the scores) with the combine_PRS function. If you want to return the adjusted SNP effect sizes, you will need the harmonized SNP effect sizes to the alternative allele of the target cohort (Step 1).

BONUS:

If you want to evaluate a single score, you can use the eval_single_PRS function (described below).

1. Harmonize per-allele effect sizes to the same risk allele

The harmonize_snpeffect_toALT function:

Argument	Default	Description
`ref_file`		Reference file contain SNP ID (ID), reference allele (REF) and alternative allele (ALT) columns (e.g allele frequency output --freq from PLINK2)
`pgs_folder`		Directory to folder contain each PGS per-allele SNP effect sizes ending with .txt
`pgs_list`		File contain suffixes of file names (don't include suffix .txt) of single PGS on each line. The files must exist in the pgs_folder folder
`out`		Filename of the output for the weight file
`isheader`	TRUE	TRUE if the weight files contain header, otherwise the file would be read as SNP IDs, effect allele and effect size for column 1, 2, 3, respectively
`snp_col`	SNP	Column name of SNP IDs. Can be a number of column index with isheader=FALSE
`a1_col`	A1	Column name of effect allele. Can be a number of column index with isheader=FALSE
`beta_col`	BETA	Column name of effect size. Can be a number of column index with isheader=FALSE
`ncores`	1	Number of cores for parallel processing
`chunk_size`	1	Number of scores to process each chunk

For example:

The ref_file contains 3 columns: ID, REF and ALT (This could be called via PLINK2 to calculate the allele frequencies in your target data with --freq):

ID	ALT	REF
rs1	A	G
rs2	T	G
rs3	G	A

In the pgs_folder directory (i.e. ~/example/allPGScatalog/), there are files of per-allele SNP effect sizes: PGS000001.txt, PGS000002.txt. Each of the PGS will contains 3 columns: SNP, A1, BETA represent SNP ID, Effect allele and Effect size, respectively.

I.e. The SNP effect size file PGS000001.txt contains:

SNP	A1	BETA
rs1	A	0.01
rs2	T	0.02
rs3	G	0.03

The SNP effect size file PGS000002.txt contains:

SNP	A1	BETA
rs1	A	0
rs2	T	0.03

The pgs_list (i.e. ~/example/allscoresID.txt) file will contain (please don't include suffix .txt as this will be the name of the computed PRSs below and our package can automatically add this suffix):

PGS000001
PGS000002

Optional: To split all scores into smaller chunks: E.g. for 20 scores/chunk and 16 cores at once:

chunk_size = 20
ncores = 16

Then, to harmonize SNP effect sizes:

harmonize_snpeffect_toALT(
	ref_file = "~/example/geno.afreq", 
	pgs_folder = "~/example/allPGScatalog/",
	pgs_list = "~/example/allscoresID.txt",
	snp_col = 1,
	a1_col = 2,
	beta_col = 3,
	isheader = F,
	chunk_size = 20,
	ncores = 16,
	out = "~/example/weights.txt"
)

The output file will contains SNP ID, A1, A2, and columns of SNP effect sizes harmonized to the A1 (alternative) allele. For example:

SNP	A1	A2	PGS000001	PGS000002	...
rs1	A	G	0.01	0	...
rs2	T	G	0.02	0.03	...
rs3	G	A	0.03	0	...

2. Compute PRSs for all scores

The compute_PRS function:

Argument	Default	Description
`geno`		Prefix of genotype file in plink format (bed/bim/fam)
`weight_file`		The per-allele SNP effect output from harmonize_snpeffect_toALT function above
`plink2_path`		Path to executable PLINK2 (downloaded from https://www.cog-genomics.org/plink/2.0/)
`start_col`	4	Index of the starting column of SNP effect sizes to estimate PRS in the weight file
`out`		Name of output file, suffix sscore from PLINK2 will be added

Then, to compute PRSs:

compute_PRS(
	geno = "~/example/geno",
	weight_file = "~/example/weights.txt",
	plink2_path = "./plink2",
	start_col = 4,
	out = "~/example/pgs"
)

The output contains multiple column names ending with _SUM (and _AVG) represent for PGS for each set of SNP effect.

FID	IID	ALLELE_CT	NAMED_ALLELE_DOSAGE_SUM	PGS000001_AVG	PGS000001_SUM	PGS000002_AVG	PGS000002_SUM	...

See more at: https://www.cog-genomics.org/plink/2.0/formats#sscore

3. Perform linear combination: trait-specific (PRSmix) and cross-trait (PRSmix+)

The combine_PRS function:

Argument	Default	Description
`pheno_file`		Directory to the phenotype file
`covariate_file`		Directory to file with covariate information (age,sex,PC1..10)
`score_files_list`		A vector contains directories of the PGSs to be read as result of the `compute_PRS` function
`trait_specific_score_file`		A filename contain PGS IDs of trait-specific to combine (PRSmix), one score per line
`pheno_name`		Column name of the phenotype in phenofile
`isbinary`		TRUE if the phenotype is a binary trait
`out`		Prefix of output
`liabilityR2`	FALSE	TRUE if liability R2 should be reported, otherwise partial R2 (for continuous traits) or Nagelkerke R2 (for binary traits) will be reported
`IID_pheno`	IID	Column name of IID of phenotype file (e.g IID, person_id)
`covar_list`	`c("age", "sex", paste0("PC", 1:10))`	A vector of of covariates, must exists as columns in covariate_file
`cat_covar_list`	sex	Array of categorical covariates, must exists as columns in covariate_file. The program will generate dummy columns
`ncores`	1	Number of CPU cores for parallel processing
`is_extract_adjSNPeff`	FALSE	TRUE if extract adjusted SNP effects from PRSmix and PRSmix+, FALSE if only calculate the combined PRS as linear combination of PRS x mixing weights. May consume extended memory
`original_beta_files_list`	NULL	The vector contains directories to SNP effect sizes used to compute original PRSs (as weight_file argument from compute PRS above)
`train_size_list`	NULL	A vector of training sample sizes. If NULL, a random 80% of the samples will be used
`power_thres_list`	0.95	A vector of power thresholds to select scores
`pval_thres_list`	0.05	A vector of P-value thresholds to select scores
`read_pred_training`	FALSE	TRUE if PRSs were assessed in the training set was already run and can be read from file
`read_pred_testing`	FALSE	TRUE if PRSs were assessed in the testing set was already run and can be read from file

The output of the combination framework contains several files:

Description	Suffix
The number of cases and controls (for binary trait)	`_case_counts.txt`
The dataframe of training and testing sample split from the main dataframe	`_train_df.txt` and `_test_df.txt`
The prediction accuracy in the training set for all PRSs	`_train_allPRS.txt`
The prediction accuracy in the testing set for trait-specific PRSs	`_test_allPRS.txt`
The prediction accuracy assessed in the testing set of the best PRS selected from the training set	`_best_acc.txt`
The AUC (for binary trait) of the NULL model of only covariates, the best PGS, PRSmix and PRSmix+ (adjusted for covariates)	`_auc_NULL.txt`, `_auc_BestPGS.txt`, `_auc_PRSmix.txt` and `_auc_PRSmixPlus.txt`
Odds Ratio of the best PGS, PRSmix and PRSmix+ (for binary trait) (adjusted for covariates)	`_OR_bestPGS.txt`, `_OR_PRSmix.txt` and `_OR_PRSmixPlus.txt`
The mixing weights of the scores used in combination	`_weight_PRSmix.txt` and `_weight_PRSmixPlus.txt`
The adjusted SNP effects to estimate PRSmix and PRSmix+ (if is_extract_adjSNPeff=TRUE)	`_adjSNPeff_PRSmix.txt` and `_adjSNPeff_PRSmixPlus.txt`
The prediction accuracy of PRSmix and PRSmix+	`_test_summary_traitPRS_withPRSmix.txt` and `_test_summary_traitPRS_withPRSmixPlus.txt`

For example,

The pheno_file would be formatted as:

FID	IID	CAD
1	1	1
2	2	0

Therefore,

pheno_name = "CAD"
isbinary = TRUE
liabilityR2 = TRUE # FALSE if Nagelkerke R2 should be reported
IID_pheno = "IID"

covariate_file as:

FID	IID	sex	age	PC1	...	PC10
1	1	1	40	0.02	...	0.03
2	2	0	50	0.01	...	0.04

Therefore,

covar_list = c("sex", "age", "PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9", "PC10")
cat_covar_list = c("sex")

trait_specific_score_file as (i.e. PGS IDs for EFO_0001645 of coronary artery disease from PGS Catalog):

PGS000962
PGS000116
PGS000011
PGS000012
PGS000013
PGS000018
PGS000019
PGS000058
PGS000296
PGS000337

Other parameters could be:

ncores = 4
is_extract_adjSNPeff = TRUE
original_beta_files_list = "~/example/weights.txt"
train_size_list = NULL # randomly split 80% of the data as the training set
power_thres_list = 0.95
pval_thres_list = 0.05/2600 # P-value after Bonferroni correction
read_pred_training = FALSE
read_pred_testing = FALSE

To run the linear combination (PRSmix and PRSmix+):

combine_PRS(
	pheno_file = "~/example/phenotype.txt",
	covariate_file = "~/example/covariate.txt",
	score_files_list = c("~/example/pgs.sscore"),
	trait_specific_score_file = "cad_list.txt",
	pheno_name = "CAD",
	isbinary = TRUE,
	out = "CAD_test_",
	liabilityR2 = TRUE,
	IID_pheno = "IID",
	covar_list = c("age", "sex", "PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9", "PC10"),
	cat_covar_list = c("sex"),
	ncores = 4,
	is_extract_adjSNPeff = TRUE,
	original_beta_files_list = "~/example/weights.txt",
	train_size_list = NULL,
	power_thres_list = 0.95,
	pval_thres_list = 0.05/2600,
	read_pred_training = FALSE,
	read_pred_testing = FALSE
)

BONUS

The eval_single_PRS function can be used to evaluate a single score and return prediction accuracy R2 as partial R2, liability R2 or Nagelkerke R2:

Argument	Default	Description
`data_df`		Data to assess prediction accuracy which at least contains the phenotype, covariates and a PRS
`pheno`		Name of phenotype column
`prs_name`		PGS list of the trait, must exist in the column name of data_df
`covar_list`		Array of covariates, must exist in the column name of data_df
`isbinary`	FALSE	TRUE if the phenotype is a binary trait
`liabilityR2`	FALSE	TRUE if liability R2 should be reported, otherwise partial R2 (for continuous traits) or Nagelkerke R2 (for binary traits) will be reported

For example: data_df can be formatted as:

CAD	PRS_example	sex	age	PC1	...	PC10
1	0.02	1	40	0.02	...	0.03
0	0.03	0	50	0.01	...	0.04

Therefore, to evaluate PRS_example:

eval_single_PRS(
	data_df,
	pheno = "CAD",
	prs_name = "PRS_example",
	covar_list = c("age", "sex", "PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9", "PC10"),
	isbinary = TRUE,
	liabilityR2 = TRUE
)

The output of the function will contain the R2, standard error, 95% lower and upper bound, P-value and power

References

Truong, B., Hull, L. E., Ruan, Y., Huang, Q. Q., Hornsby, W., Martin, H. C., van Heel, D. A., Wang, Y., Martin, A. R., Lee, H. and Natarajan, P. 2023. "Integrative Polygenic Risk Score Improves The Prediction Accuracy Of Complex Traits And Diseases". doi:10.1101/2023.02.21.23286110.

Contact information

Please contact Buu Truong (btruong@broadinstitute.org) or Pradeep Natarajan (pradeep@broadinstitute.org) if you have any queries.

Name		Name	Last commit message	Last commit date
Latest commit History 362 Commits
R		R
man		man
.DS_Store		.DS_Store
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
CITATION.cff		CITATION.cff
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
PRSmix.Rproj		PRSmix.Rproj
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PRSmix

INSTALLATION

MANUAL

1. Harmonize per-allele effect sizes to the same risk allele

2. Compute PRSs for all scores

3. Perform linear combination: trait-specific (PRSmix) and cross-trait (PRSmix+)

BONUS

References

Contact information

About

Releases 2

Packages

Languages

buutrg/PRSmix

Folders and files

Latest commit

History

Repository files navigation

PRSmix

INSTALLATION

MANUAL

1. Harmonize per-allele effect sizes to the same risk allele

2. Compute PRSs for all scores

3. Perform linear combination: trait-specific (PRSmix) and cross-trait (PRSmix+)

BONUS

References

Contact information

About

Resources

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

Packages