-
Context: I am applying pgsc_calc (2.0.0-alpha, Linux/Singularity/AWS) to whole-genome 1000 Genomes data (which has presented its own set of challenges).

Concern: In the course of generating scores, I find that most PGS models I use produce scores in the expected range (mostly +/-2.0) over nearly all genomes. However, about 15% consistently produce quite large scores. For example, PGS000749 (just one of the affected PGS) routinely produces "SUM" scores in the 100.0-500.0 range.

Question: As I understand it, models are generally normalized during development, so that in binary models scores have a mean of ~0 and a standard deviation of ~1.0, and the scores produced can be loosely interpreted as Z-scores. Is it reasonable to assume that large-magnitude scores on binary models are simply un-normalized results, or is this more likely a software/data issue or bug?

Possible workaround: If the scores are un-normalized, is there some way to get the development/evaluation normalization coefficients from data in the PGS Catalog? As an alternative, I can renormalize scores across my sample population, but I'd rather use the parameters developed by the submitters, in case my sample is biased with respect to the predicted trait.

Thanks!
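Edit: for reference, here's the kind of quick check I used to spot the affected scores. This is a minimal sketch only; the file name and the "PGS"/"SUM" column names are assumptions about the aggregated pgsc_calc output, so adjust them to match your actual run.

```python
# Minimal sketch: flag scoring files whose raw SUMs look un-normalized.
# The file name and column names ("PGS", "SUM") are assumptions about the
# aggregated pgsc_calc output -- adjust them to match your actual run.
import pandas as pd

scores = pd.read_csv("aggregated_scores.txt.gz", sep="\t")

summary = (
    scores.groupby("PGS")["SUM"]
    .agg(["mean", "std", "min", "max"])
)

# Scores developed on a standardized scale should sit near mean ~0, SD ~1;
# raw (un-normalized) sums can land far away, e.g. in the 100-500 range.
print(summary[summary["mean"].abs() > 2.0])
```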
-
The `SUM` scores output by the calculator are the raw sum of effect_weight * dosage over all variants in the scoring file, so it is possible that these could be large and not follow a normal distribution centred around 0. Usually people just centre those results across the whole sample to make them more comparable to the others.
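For example, centering/standardizing within your own sample could look like the sketch below (file and column names are illustrative, not the exact pgsc_calc output schema):

```python
# Minimal sketch of the centering described above: standardize each score's
# raw SUM across the whole sample. File and column names ("PGS", "SUM") are
# illustrative -- match them to your aggregated pgsc_calc output.
import pandas as pd

scores = pd.read_csv("aggregated_scores.txt.gz", sep="\t")

# Z-score each PGS within the sample: subtract the sample mean and divide
# by the sample SD, so all scores land on a common, comparable scale.
scores["Z"] = scores.groupby("PGS")["SUM"].transform(
    lambda s: (s - s.mean()) / s.std()
)
```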
-
So it sounds like submitters aren't required to include normalizing coefficients or mean/SD values for scores from the score development population, and these may not generally be available in PGS Catalog models or metadata. It is a touch problematic to use measures like OR/SD from the development population without knowing what the observed SD was in that population. No worries; it is most helpful just to know not to look for them here.

There may be growing interest in converting PGS all the way to probabilities, which ideally requires the normalization used in the score development population. For now, I can work around this by deriving the mean and SD of scores generated from a relevant validation population (I think this is also what you were referring to).

Thanks, indeed, for the quick response!
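Edit: for completeness, a minimal sketch of that workaround. The reference mean/SD and the logistic coefficients below are hypothetical placeholders I'd estimate from a validation population and outcome data myself; they are not values available from the PGS Catalog.

```python
# Minimal sketch: standardize raw SUMs against reference-population
# parameters, then (optionally) map the Z-score to a probability via a
# logistic model. mu_ref, sd_ref, intercept, and beta are all hypothetical
# placeholders estimated from a validation population and outcome data.
import math

mu_ref, sd_ref = 250.0, 40.0  # hypothetical reference-population mean/SD

def standardize(raw_sum: float) -> float:
    """Z-score a raw SUM relative to the reference population."""
    return (raw_sum - mu_ref) / sd_ref

def risk_probability(z: float, intercept: float = -2.0, beta: float = 0.5) -> float:
    """Logistic mapping from Z-score to probability of the binary trait."""
    return 1.0 / (1.0 + math.exp(-(intercept + beta * z)))

print(risk_probability(standardize(300.0)))
```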