Configuration
Below, you can find the full list of customizable parameters included in the configuration file (pgstoolkit.conf).
Note that before running the toolkit, you will also need to change the SLURM settings at the top of the pgstoolkit.sh file. Also, make sure to remove the '/' (forward slash) at the end of any directory variable.
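The trailing-slash rule can also be enforced defensively before a directory variable is used. A minimal sketch, assuming the configuration values are available as shell variables; the path is a hypothetical example:

```shell
# Hypothetical value that ends with a slash, which the toolkit does not expect.
PROJECT_DIR="/hpc/projects/my_pgs_project/"

# "${VAR%/}" removes one trailing "/" if present and leaves other values unchanged.
PROJECT_DIR="${PROJECT_DIR%/}"

echo "$PROJECT_DIR"   # prints /hpc/projects/my_pgs_project
```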
R = required, O = optional
Parameter | Description |
---|---|
PRSMETHOD | Indicate which method to use [PLINK/RAPIDOPGS/PRSCS/PRSICE/NONE]. Pick NONE if you only wish to perform quality control. |
PROJECTNAME | Name of the project. |
PROJECT_DIR | Path to where the main analysis directory resides. |
OUTPUT_DIRNAME | Name of the output directory within the PROJECT_DIR directory. |
SUBPROJECT_DIR_NAME | Name of the (sub)project; this will be used to create subfolders within the OUTPUTDIR. |
MAIN_WORKDIR_NAME | Name of the working directory within the main analysis directory, used for temporary files. |
LOG_DIRNAME | Name of the subdirectory of the PROJECT_DIR directory used for storing log files. |
QC | Indicate whether quality control should be applied according to the MAF and INFO parameters. [YES/NO] |
MAF | Minimum minor allele frequency to keep variants, e.g. "0.005". |
INFO | Minimum imputation quality score to keep variants, e.g. "0.3". |
KEEP_TEMP_FILES | Keep the temporary files generated by the toolkit at the end of the job. [TRUE/FALSE] |
SAVE_CONFIG | Save a copy of this configuration file along with the results. [TRUE/FALSE] |
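As an illustration, the general block of pgstoolkit.conf might look as follows. This is a sketch only: it assumes the file uses shell-style KEY="value" assignments (as the PRSICE_SETTINGS example on this page suggests), and every value shown is a made-up placeholder. The check at the end is not part of the toolkit; it only shows how an invalid PRSMETHOD could be caught early.

```shell
# General settings (all values are hypothetical placeholders).
PRSMETHOD="PRSCS"                 # one of PLINK/RAPIDOPGS/PRSCS/PRSICE/NONE
PROJECTNAME="my_pgs_project"
PROJECT_DIR="/hpc/projects/pgs"   # no trailing slash
OUTPUT_DIRNAME="output"
SUBPROJECT_DIR_NAME="subproject_a"
MAIN_WORKDIR_NAME="tmp"
LOG_DIRNAME="logs"
QC="YES"
MAF="0.005"
INFO="0.3"
KEEP_TEMP_FILES="FALSE"
SAVE_CONFIG="TRUE"

# Optional sanity check (an assumption, not toolkit code): reject unknown methods.
case "$PRSMETHOD" in
  PLINK|RAPIDOPGS|PRSCS|PRSICE|NONE) METHOD_OK="yes" ;;
  *) METHOD_OK="no" ;;
esac
echo "PRSMETHOD=$PRSMETHOD valid: $METHOD_OK"
```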
Parameter | Description | RapidoPGS | PRS-CS | PRSice | PLINK |
---|---|---|---|---|---|
BASEDATA | Path to the file containing the base data. | R | R | R | R |
BF_BUILD | Build of the base file, e.g. "hg19" or "hg38". | R | |||
BF_ID_COL | Name of the SNP ID column in the base file. | R | R | R | R |
BF_CHR_COL | Name of the chromosome column in the base file. | R | R | ||
BF_POS_COL | Name of the position column in the base file. | R | R | ||
BF_EFFECT_COL | Name of the effect allele column in the base file. | R | R | R | R |
BF_NON_EFFECT_COL | Name of the non-effect allele column in the base file. | R | R | R | |
BF_STAT | Type of measure in the BF_STAT_COL, either "beta" or "or". | * | R | R | |
BF_STAT_COL | Name of the beta/OR/effect size column in the base file. | R | R | R | R |
BF_FRQ_COL | Name of the effect allele frequency column in the base file. | R/O** | |||
BF_SE_COL | Name of the column containing the standard error of the beta/OR value. | R | |||
BF_PVALUE_COL | Name of the column containing the P-values of the association test. | R | R | R | |
BF_SBJ_COL | Name of the column containing the sample size for each variant. | R/O*** | |||
BF_SAMPLE_SIZE | Sample size of the GWAS. | R/O*** | R | ||
BF_TARGET_TYPE | "cc" for a case-control trait, "quant" for a quantitative trait. | R | |||
LDDATA | Path to the linkage disequilibrium reference data. PRS-CS and PRSice require a different format. | R**** | O***** | ||
VALIDATIONDATA | Path to the directory containing the validation data, e.g. /hpc/data/_ae_originals. | R | R | R | R |
VALIDATIONPREFIX | Prefix of the validation files in BGEN format v1.2, excluding the chr-number and extension, e.g. aegs_combo_1kGp3GoNL5_RAW_chr. | R | R | R | R |
VAL_REF_POS | Position of the reference allele in the BGEN files relative to the alternative allele: ref-first, ref-last or ref-unknown. | R | R | R | |
SAMPLE_FILE | Path to the sample file. A description of the sample file format can be found here. | R | R | R | R |
PRSICE_PHENOTYPE | Phenotype used by PRSice to find the best-fitting set of polygenic scores. This phenotype must be present in the sample file. | R | |||
PRSICE_PHENOTYPE_BINARY | [TRUE/FALSE] indicating whether PRSICE_PHENOTYPE contains a binary phenotype. | R | |||
STATS_FILE | Path to the stats file. | O | O | O | O |
STATS_ID_COL | Name of the stats file column containing the SNP IDs. These IDs must match the IDs that occur in the base file. | O | O | O | O |
STATS_MAF_COL | Name of the stats file column containing the minor allele frequency. | O | O | O | O |
STATS_INFO_COL | Name of the stats file column containing the imputation score. | O | O | O | O |
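To make the interplay between the QC thresholds and the stats file concrete, the sketch below filters a stats file on MAF and INFO. It is not the toolkit's own implementation: the file layout (whitespace-separated with a header line), the column names SNP, MAF and INFO, the sample rows, and the choice to treat the thresholds as inclusive are all assumptions made for illustration.

```shell
# Thresholds as they might be set in the configuration file.
MAF="0.005"
INFO="0.3"

# Hypothetical column names, standing in for the STATS_*_COL parameters.
STATS_ID_COL="SNP"
STATS_MAF_COL="MAF"
STATS_INFO_COL="INFO"

# Small made-up stats file.
STATS_FILE="$(mktemp)"
cat > "$STATS_FILE" <<'EOF'
SNP MAF INFO
rs1 0.2500 0.98
rs2 0.0010 0.95
rs3 0.1200 0.25
rs4 0.0050 0.30
EOF

# Look up the columns by header name, then print the IDs that pass both
# thresholds (assumed inclusive: value >= threshold).
KEPT="$(awk -v maf="$MAF" -v info="$INFO" \
            -v idc="$STATS_ID_COL" -v mafc="$STATS_MAF_COL" -v infoc="$STATS_INFO_COL" '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  ($(col[mafc]) + 0 >= maf + 0) && ($(col[infoc]) + 0 >= info + 0) { print $(col[idc]) }
' "$STATS_FILE")"

echo "$KEPT"
rm -f "$STATS_FILE"
```

With the sample rows above, rs2 is dropped for its low MAF and rs3 for its low INFO score, leaving rs1 and rs4.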
Parameter | Description | RapidoPGS | PRS-CS | PRSice | PLINK |
---|---|---|---|---|---|
RUNTIME_QC | Maximum duration of the quality control sub-job. | O | O | O | O |
RUNTIME_PLINKSCORE | Maximum duration of the PLINK score sub-job. | R | R | R | |
RUNTIME_PLINKSUM | Maximum duration of the PLINK sum sub-job. | R | R | R | |
RUNTIME_RAPIDO | Maximum duration of the RapidoPGS sub-job. | R | |||
RUNTIME_PRSICE | Maximum duration of the PRSice sub-job. | R | |||
RUNTIME_PRSCS | Maximum duration of the PRS-CS sub-job. | R | |||
RUNTIME_PRSCS_format | Maximum duration of the PRS-CS format sub-job. | R | |||
MEMORY_QC | Maximum amount of RAM used for the quality control sub-job. | O | O | O | O |
MEMORY_PLINKSCORE | Maximum amount of RAM used for the PLINK score sub-job. | R | R | R | |
MEMORY_PLINKSUM | Maximum amount of RAM used for the PLINK sum sub-job. | R | R | R | |
MEMORY_RAPIDO | Maximum amount of RAM used for the RapidoPGS sub-job. | R | |||
MEMORY_PRSICE | Maximum amount of RAM used for the PRSice sub-job. | R | |||
MEMORY_PRSCS | Maximum amount of RAM used for the PRS-CS sub-job. | R | |||
MEMORY_PRSCS_format | Maximum amount of RAM used for the PRS-CS format sub-job. | R | |||
PRSICE_CPUS | Maximum number of CPUs used for the PRSice sub-job. | R | |||
PRSCS_CPUS | Maximum number of CPUs used for the PRS-CS sub-job. | R |
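Since the sub-jobs run under SLURM, the runtime, memory and CPU parameters presumably end up as resource requests. The sketch below shows one way such values could be passed to sbatch using its standard --time, --mem and --cpus-per-task options. The value formats, the script name prscs_job.sh and the mapping itself are assumptions and are not taken from pgstoolkit.sh.

```shell
# Hypothetical values; check which formats your cluster accepts.
RUNTIME_PRSCS="12:00:00"
MEMORY_PRSCS="32G"
PRSCS_CPUS="8"

# Assemble (but do not submit) an sbatch command line.
# "prscs_job.sh" is a placeholder script name.
SBATCH_CMD="sbatch --time=${RUNTIME_PRSCS} --mem=${MEMORY_PRSCS} --cpus-per-task=${PRSCS_CPUS} prscs_job.sh"

echo "$SBATCH_CMD"
```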
PRSice calculates the PRS for all individuals in the target population for a given phenotype. For more on PRSice parameters go here.
Parameter | Description | Required |
---|---|---|
PRSICE_EXTRACT | File containing SNPs to be included in the analysis. PRSice will return an error if it runs into duplicate SNPs; in this case it will write the non-duplicate SNPs to a file in the working directory. Put the path to the generated file in this parameter to avoid this error. | O |
PRSICE_EXCLUDE | File containing SNPs to be excluded from the analysis. | O |
PRSICE_CLUMP_KB | Distance for clumping in kb, the default is "250". | R |
PRSICE_CLUMP_P | P-value threshold used for clumping, the default is "1". | R |
PRSICE_CLUMP_R2 | r2 threshold for clumping, the default is "0.1". | R |
PRSICE_PERM | Number of permutations to perform, the default is "10000". | R |
PRSICE_THREADS | Number of threads to use, e.g. "20". If set to "max", the number of threads will be derived from the number of dedicated CPUs (PRSICE_CPUS). | R |
PRSICE_SETTINGS | Some (not all) additional settings for PRSice, e.g. PRSICE_SETTINGS="--no-clump --print-snp --extract PRSice.valid --score sum --missing center" | O |
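The "max" behaviour of PRSICE_THREADS can be pictured as follows. This is a minimal sketch of the rule as described in the table, with hypothetical values; how the toolkit actually resolves the value is not shown on this page.

```shell
# Hypothetical settings.
PRSICE_CPUS="16"
PRSICE_THREADS="max"

# Resolve "max" to the CPU count reserved for the job; pass explicit numbers through.
if [ "$PRSICE_THREADS" = "max" ]; then
  THREADS="$PRSICE_CPUS"
else
  THREADS="$PRSICE_THREADS"
fi

echo "threads: $THREADS"   # prints threads: 16
```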
Below are the parameters for the PLINK allelic scoring function. This function is also used by RapidoPGS and PRS-CS, as those are only able to compute effect sizes. Note that within this toolkit, PLINK is set to calculate the sum of the allele scores instead of the default average allele score. The reason is that if we calculated the average for each chromosome, we would not be able to sum the scores across chromosomes. More on PLINK --score parameters here.
Parameter | Description | Required |
---|---|---|
PLINK_SETTINGS | Optional settings of PLINK, e.g. "center no-mean-imputation se zs" | O |
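The reason for summing rather than averaging can be seen in a small example: per-chromosome sums add up to the genome-wide sum, whereas per-chromosome averages are each divided by a different allele count and cannot simply be added. The sketch below sums per-sample scores over chromosomes. The two-column layout (sample ID, score sum) and the numbers are invented for illustration and do not reflect PLINK's actual output format.

```shell
WORKDIR="$(mktemp -d)"

# Made-up per-chromosome score sums: sample ID, score sum.
printf 'S1 0.50\nS2 -0.25\n' > "$WORKDIR/chr1.scores"
printf 'S1 1.25\nS2 0.75\n'  > "$WORKDIR/chr2.scores"

# Add up the per-chromosome sums for each sample.
TOTALS="$(awk '{ total[$1] += $2 }
               END { for (id in total) printf "%s %.2f\n", id, total[id] }' \
          "$WORKDIR"/chr*.scores | sort)"

echo "$TOTALS"
rm -rf "$WORKDIR"
```

Here S1 ends up with 0.50 + 1.25 = 1.75 and S2 with -0.25 + 0.75 = 0.50.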
More on RapidoPGS parameters here.
Parameter | Description | Required |
---|---|---|
RP_filt_threshold | Scalar indicating the ppi threshold (if filt_threshold < 1) or the number of top SNPs by absolute weights (if filt_threshold >= 1). | R |
RP_recalc | Logical [TRUE/FALSE] indicating whether weights should be recalculated after thresholding. Only relevant if filt_threshold is defined. | R |
RP_ppi | Scalar representing the prior probability, the default is "1e-04". | R |
RP_prior | The prior specifies that BETA at causal SNPs follows a centred normal distribution with standard deviation sd.prior. Sensible and widely used defaults are 0.2 for case-control traits and 0.15 * var(trait) for quantitative traits (selected if trait == "quant"). | R |
RP_REF | Path to the reference file the SNPs should be filtered and aligned to. This file should have 5 columns (CHR, BP, SNPID, REF and ALT) and should be in the same build as the summary statistics. | O |
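The two meanings of RP_filt_threshold can be made explicit with a small helper. The function name describe_filt_threshold is hypothetical and only restates the rule from the table; it does not call RapidoPGS.

```shell
# Describe how a given RP_filt_threshold value would be interpreted.
describe_filt_threshold() {
  awk -v t="$1" 'BEGIN {
    if (t + 0 < 1) print "ppi threshold " t
    else           print "top " t " SNPs by absolute weight"
  }'
}

describe_filt_threshold "0.01"   # prints: ppi threshold 0.01
describe_filt_threshold "500"    # prints: top 500 SNPs by absolute weight
```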
Look here for a more detailed description of PRS-CS parameters.
Parameter | Description | Required |
---|---|---|
PRSCS_THREADS | Maximum number of threads. When empty, PRS-CS uses as many threads as there are CPUs dedicated to the job. | O |
BIM_FILE_AVAILABLE | [YES/NO] indicating whether a .bim file is already available, in which case BIM_FILE_PATH should also be specified. A .bim file (https://www.cog-genomics.org/plink/2.0/formats#bim) can be retrieved from the tmp files of a previous run and is specific to the validation dataset. | R |
BIM_FILE_PATH | Path to the .bim file. | O |
PRSCS_SETTINGS | Optional settings of PRS-CS, e.g. PRSCS_SETTINGS="--a 1 --b 0.5 --chrom 1,3,5". | O |
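Because BIM_FILE_PATH is only needed when BIM_FILE_AVAILABLE is set to YES, the two settings are easy to get out of step. The check below is a sketch of how that could be caught before submitting a job. The function check_bim_settings and the path are hypothetical and not part of the toolkit.

```shell
BIM_FILE_AVAILABLE="YES"
BIM_FILE_PATH="/hpc/projects/pgs/tmp/validation.bim"   # hypothetical path

# Returns 0 when the two settings are consistent, 1 otherwise.
check_bim_settings() {
  if [ "$1" = "YES" ] && [ -z "$2" ]; then
    echo "BIM_FILE_AVAILABLE is YES but BIM_FILE_PATH is empty" >&2
    return 1
  fi
  return 0
}

check_bim_settings "$BIM_FILE_AVAILABLE" "$BIM_FILE_PATH" \
  && echo "BIM settings look consistent"
```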