Replies: 10 comments
-
@moe1619 the top row would be the correct format for the pvar (and first 5 columns of the VCF). The current software doesn't fill in missing alleles for matching. Plink will read the hard calls from the columns using in the GT field in the sample columns (plink2 docs: https://www.cog-genomics.org/plink/2.0/input#vcf). |
Beta Was this translation helpful? Give feedback.
-
Thank you so much for your help and responsiveness. The pipeline does not like my inpu!. I created my VCF with calls at every position in the PRS using haplotypecaller with the -ERC BP_RESOLUTION flag. Specifically, my vcf handles variants like this
The program seems to expect
My reference calls look like this
I'm not exactly sure what the pipeline expects for those. I'm discussing with my bioinfo team and hopefully will figure it out on my end although suggestions welcome. Thanks! |
Beta Was this translation helpful? Give feedback.
-
OK, I made some adjustments and I'm getting closer. After running haplotypecaller with the flags -ERC BP_RESOLUTION, I ran GenotypeGVCFs with the flag --include-non-variant-sites true. This produces a VCF with values at all 297 sites. The variant sites look like this
My reference sites look like this
I match on 33.0% of variants. Any thoughts on why I'm not matching more? |
Beta Was this translation helpful? Give feedback.
-
and congrats on the release! |
Beta Was this translation helpful? Give feedback.
-
I ran it mostly through but dropped the matching requirement to 0.3 just to see what happens. Looking at the log showing matched and unmatched variants, it seems that the references are not matching. |
Beta Was this translation helpful? Give feedback.
-
I think because on these rows we don't know what the possible ALT alleles are. If the scoring file effect_allele at this position is G it wouldn't match because it wouldn't see it even if G was a possible alternate. One solution is to run your VCFs through imputation to get files with variants that are genotyped for the ALT alleles to output the types of VCF that the pipeline expects to see. Because the variants in PGS are usually genotyped/imputed in this way we usually know which is the REF allele and which possible ALT alleles we could genotype and count the dosage of. I'm going to do some more googling to see if/how other people have solved these problems! |
Beta Was this translation helpful? Give feedback.
-
making some progress. When I make a multisample vcf, there are more variant sites so the number of matched variants increases (I think that's how its working). It still doesn't recognize the invariant (reference) sites but I may ultimately have enough samples so that the union of all variants sites makes all of the sites appear as variant in the VCF. This is a work around as I still don't know how to run on a single sample (and would take your advice). I think as an added feature, it would be good to recognize a VCF with invariant sites. Thanks again for your work on this tool. |
Beta Was this translation helpful? Give feedback.
-
Got some PRSs! For any other users starting with WGS BAMs, I used haplotypecaller with BP_RESOLUTION just for the positions in the PRS, then CombineGVCFs then GenotypeGVCFs. I used --dbsnp and --include-non-variant-sites when possible. OK, I think this ticket can be closed too. Thank you for your help and advice. Great tool!!! |
Beta Was this translation helpful? Give feedback.
-
@moe1619, thanks for your reporting your experience here and in the other issue/thread! The pipeline will definitely work better for larger numbers of samples at the moment, and only outputs the weighted sum of dosges*weights which is hard to interpret for a single-sample without a reference distribution (or set of samples). We're working on features to calculate these distributions and normalised PGS using genetic ancestry for later versions of the pipeline (see explainer tweet). We'll definitely use your experience to add some information about using gVCF files to the documentation: if you have a set of example GATK commands that you used it would be helpful to see them! And if you know of any public data in this format we can try testing it on our end as well. |
Beta Was this translation helpful? Give feedback.
-
create a gvcf for each sample
I am not sure BP_RESOLUTION is actually required based on my next steps but I'm still working out the kinks merge gvcfs
then genotype
this output worked. Thanks again for your work on the tool. |
Beta Was this translation helpful? Give feedback.
-
Hi, thanks for developing this fantastic tool. Question about the expected input data.
I noticed in your example data
https://github.com/PGScatalog/pgsc_calc/blob/main/assets/examples/target_genomes/cineca_synthetic_subset.pvar
that there are only variant calls. No reference calls. I would assume you would need both reference and variant calls to have values at every site relevant for the PRS. Is that true? Or do you assume all absent calls are reference?
Also, I see the format for the expected input is
If the call were reference, would the correct format be the example below?
Thanks very much
Beta Was this translation helpful? Give feedback.
All reactions