format for input data includes reference calls? #123

moe1619 · 2022-09-16T04:05:07Z

Hi, thanks for developing this fantastic tool. Question about the expected input data.

I noticed in your example data

https://github.com/PGScatalog/pgsc_calc/blob/main/assets/examples/target_genomes/cineca_synthetic_subset.pvar

that there are only variant calls. No reference calls. I would assume you would need both reference and variant calls to have values at every site relevant for the PRS. Is that true? Or do you assume all absent calls are reference?

Also, I see the format for the expected input is

#CHROM	POS	ID	REF	ALT
22	17080378	rs5746679	G	A

If the call were reference, would the correct format be the example below?

#CHROM	POS	ID	REF	ALT
22	17080378	rs5746679	G	.

Thanks very much

smlmbrt (Maintainer) · 2022-09-16T08:25:36Z

@moe1619 the top row would be the correct format for the pvar (and for the first 5 columns of the VCF). The current software doesn't fill in missing alleles for matching. Plink reads the hard calls from the GT field in the sample columns (plink2 docs: https://www.cog-genomics.org/plink/2.0/input#vcf).

moe1619 (Author) · 2022-09-16T13:10:55Z

Thank you so much for your help and responsiveness.

The pipeline does not like my input. I created my VCF with calls at every position in the PRS using HaplotypeCaller with the -ERC BP_RESOLUTION flag.

Specifically, my VCF represents variants like this

#CHROM    POS    ID    REF    ALT
1    9341786    .    C    T,<NON_REF>

The program seems to expect

#CHROM    POS    ID    REF    ALT
1    9341786    .    C    T

My reference calls look like this

#CHROM    POS    ID    REF    ALT
1    166533517    .    T    <NON_REF>

I'm not exactly sure what the pipeline expects for those.
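
One quick way to see which of these ALT forms a file contains is to tally them. The following is a minimal sketch, assuming a plain-text, tab-delimited VCF or gVCF on standard input. The function name `alt_kinds` and the four category labels are made up for illustration and are not part of pgsc_calc, plink2, or GATK.

```shell
# Tally how the ALT column (5th field) is written across all records.
# Reads VCF text on stdin; prints one "label<TAB>count" line per category.
alt_kinds() {
  awk -F'\t' '
    /^#/              { next }                           # skip header lines
    $5 == "."         { n["no_alt"]++; next }            # invariant site, no ALT listed
    $5 == "<NON_REF>" { n["non_ref_only"]++; next }      # gVCF reference record
    $5 ~ /<NON_REF>/  { n["alt_plus_non_ref"]++; next }  # e.g. T,<NON_REF>
                      { n["concrete_alt"]++ }            # ordinary ALT such as A
    END {
      split("concrete_alt alt_plus_non_ref non_ref_only no_alt", order, " ")
      for (i = 1; i <= 4; i++) printf "%s\t%d\n", order[i], n[order[i]]
    }'
}

# Demo on the three records quoted in this thread.
printf '#CHROM\tPOS\tID\tREF\tALT\n22\t17080378\trs5746679\tG\tA\n1\t9341786\t.\tC\tT,<NON_REF>\n1\t166533517\t.\tT\t<NON_REF>\n' | alt_kinds
```

For a compressed file, something like `gzip -dc sample.g.vcf.gz | alt_kinds` should work, where `sample.g.vcf.gz` is a placeholder name.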

I'm discussing with my bioinfo team and hopefully will figure it out on my end although suggestions welcome.

Thanks!

moe1619 (Author) · 2022-09-16T15:04:32Z

OK, I made some adjustments and I'm getting closer. After running HaplotypeCaller with the flag -ERC BP_RESOLUTION, I ran GenotypeGVCFs with the flag --include-non-variant-sites true. This produces a VCF with values at all 297 sites. The variant sites look like this

#CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | sample_1
1 | 9341786 | . | C | T | 636.64 | . | AC=1;AF=0.500;AN=2;BaseQRankSum=-7.890e-01;DP=30;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=21.22;ReadPosRankSum=-1.690e-01;SOR=0.693 | GT:AD:DP:GQ:PL | 0/1:12,18:30:99:644,0,407

My reference sites look like this

#CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | sample_1
1 | 166533517 | . | T | . | . | . | DP=40 | GT:AD:DP:RGQ | 0/0:40:40:99

I match on 33.0% of variants. Any thoughts on why I'm not matching more?
Thanks!

moe1619 (Author) · 2022-09-16T15:20:58Z

and congrats on the release!

moe1619 (Author) · 2022-09-16T15:45:35Z

I ran it mostly through but dropped the matching requirement to 0.3 just to see what happens. Looking at the log showing matched and unmatched variants, it seems that the reference (invariant) sites are not matching.

smlmbrt (Maintainer) · 2022-09-16T15:50:52Z

#CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | sample_1
1 | 166533517 | . | T | . | . | . | DP=40 | GT:AD:DP:RGQ | 0/0:40:40:99

I think it's because on these rows we don't know what the possible ALT alleles are. If the scoring file's effect_allele at this position is G, it won't match: the row never lists G, even if G is a possible alternate allele. One solution is to run your VCFs through imputation, which produces the kind of VCF the pipeline expects, with every variant genotyped against explicit ALT alleles. Variants in a PGS are usually genotyped or imputed this way, so we normally know which allele is REF and which ALT alleles can be genotyped and have their dosage counted. I'm going to do some more googling to see if/how other people have solved these problems!
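
To make the point concrete, here is a simplified sketch of allele matching. It is not the pipeline's actual matching logic, which is more involved. It assumes a hypothetical four-column scoring table (chromosome, position, effect allele, other allele) and a pvar-style target file, and it calls a site matched only when both scoring alleles are listed among the target's REF and ALT alleles. The function name `match_alleles`, the file names, and the choice of T as the other allele are illustrative assumptions.

```shell
# usage: match_alleles scoring.tsv target.pvar
#   scoring.tsv (hypothetical layout): chrom, pos, effect_allele, other_allele
#   target.pvar: #CHROM, POS, ID, REF, ALT
match_alleles() {
  awk -F'\t' -v OFS='\t' '
    /^#/ { next }                          # skip header/comment lines in both files
    FNR == NR {                            # first file: remember the scoring alleles
      eff[$1 ":" $2] = $3; oth[$1 ":" $2] = $4
      next
    }
    {
      key = $1 ":" $2
      if (!(key in eff)) next              # site is not used by the score
      alleles = "," $4 "," $5 ","          # REF plus every listed ALT
      ok = index(alleles, "," eff[key] ",") > 0 && index(alleles, "," oth[key] ",") > 0
      print key, eff[key], (ok ? "matched" : "unmatched")
    }' "$1" "$2"
}

# Demo: one ordinary variant site and the invariant row quoted above (ALT is ".").
dir="$(mktemp -d)"
printf '22\t17080378\tA\tG\n1\t166533517\tG\tT\n' > "$dir/scoring.tsv"
printf '#CHROM\tPOS\tID\tREF\tALT\n22\t17080378\trs5746679\tG\tA\n1\t166533517\t.\tT\t.\n' > "$dir/target.pvar"
match_alleles "$dir/scoring.tsv" "$dir/target.pvar"
rm -rf "$dir"
```

In this toy run the invariant row is reported as unmatched because G never appears in its REF or ALT column, which mirrors the behaviour described above.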

moe1619 (Author) · 2022-09-19T01:26:38Z

Making some progress. When I make a multisample VCF, there are more variant sites, so the number of matched variants increases (I think that's how it's working). It still doesn't recognize the invariant (reference) sites, but I may ultimately have enough samples that the union of all variant sites makes every site appear as variant in the VCF. This is a workaround, as I still don't know how to run on a single sample (and would take your advice).

I think as an added feature, it would be good to recognize a VCF with invariant sites.

Thanks again for your work on this tool.

moe1619 (Author) · 2022-09-19T11:46:49Z

Got some PRSs! For any other users starting with WGS BAMs: I ran HaplotypeCaller with -ERC BP_RESOLUTION on just the positions in the PRS, then CombineGVCFs, then GenotypeGVCFs. I used --dbsnp and --include-non-variant-sites when possible. OK, I think this ticket can be closed too. Thank you for your help and advice. Great tool!!!

smlmbrt (Maintainer) · 2022-09-20T10:09:29Z

@moe1619, thanks for reporting your experience here and in the other issue/thread! The pipeline will definitely work better with larger numbers of samples at the moment. It only outputs the weighted sum of dosages*weights, which is hard to interpret for a single sample without a reference distribution (or set of samples). We're working on features to calculate these distributions and normalised PGS using genetic ancestry for later versions of the pipeline (see explainer tweet).
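
As a worked illustration of that weighted sum, the sketch below adds up dosage times weight over three variants for one sample. The variant labels, dosages, and weights are made-up toy numbers, and `pgs_sum` is a hypothetical helper, not a pgsc_calc command.

```shell
# Sum dosage * weight over all variants for one sample.
# Input (stdin, tab-delimited): variant label, effect-allele dosage (0-2), effect weight.
pgs_sum() {
  awk -F'\t' '{ s += $2 * $3 } END { printf "%.4f\n", s }'
}

# Toy input: 1*0.12 + 0*(-0.30) + 2*0.05
printf 'varA\t1\t0.12\nvarB\t0\t-0.30\nvarC\t2\t0.05\n' | pgs_sum
```

The result is a single raw number for the sample. Without scores from other samples to compare against, there is no way to tell whether it is high or low.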

We'll definitely use your experience to add some information about using gVCF files to the documentation: if you have a set of example GATK commands that you used it would be helpful to see them! And if you know of any public data in this format we can try testing it on our end as well.

moe1619 (Author) · 2022-09-22T12:54:10Z

create a gvcf for each sample

# -L restricts calling to the PRS positions; -ERC BP_RESOLUTION emits a record for every position
$gatk_path --java-options "-Xmx4G" HaplotypeCaller \
  -R $reference_grch37 \
  -L PRS_snp_positions.list \
  -I $bam_path \
  -O $gvcf_for_prs -ERC BP_RESOLUTION --dbsnp $dbsnp_file

I am not sure BP_RESOLUTION is actually required based on my next steps but I'm still working out the kinks

merge gvcfs

$gatk_path --java-options "-Xmx4g" CombineGVCFs \
     -R $reference_grch37 \
     --variant $list_of_gvcfs \
     -O $multisample_gvcf --dbsnp $dbsnp_file

then genotype

$gatk_path --java-options "-Xmx4g" GenotypeGVCFs \
   -R $reference_grch37 \
   -V $multisample_gvcf \
   -O $multisample_gt_vcf --dbsnp $dbsnp_file --include-non-variant-sites true

this output worked. Thanks again for your work on the tool.
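
The `PRS_snp_positions.list` file passed to `-L` in the first command is not shown in the thread. One way it could be produced is sketched below, assuming a PGS Catalog-style scoring file: tab-delimited, with `#` comment lines followed by a header row that includes `chr_name` and `chr_position` columns. The helper name `make_interval_list` and the toy weight are hypothetical, and the contig names in the scoring file are assumed to match those in the reference FASTA (for example `1` rather than `chr1`).

```shell
# Write one GATK-style interval (chr:pos-pos) per unique scoring-file position.
# usage: make_interval_list scoring.tsv > PRS_snp_positions.list
make_interval_list() {
  awk -F'\t' '
    /^#/ { next }                                # skip comment header lines
    !have_header {                               # first remaining line names the columns
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("chr_name" in col) || !("chr_position" in col)) {
        print "scoring file lacks chr_name/chr_position columns" > "/dev/stderr"
        exit 1
      }
      have_header = 1
      next
    }
    {
      chr = $(col["chr_name"]); pos = $(col["chr_position"])
      if (chr != "" && pos != "" && !seen[chr ":" pos]++)
        printf "%s:%s-%s\n", chr, pos, pos
    }' "$1"
}

# Demo with a one-variant toy scoring file (the weight 0.1 is made up).
tmp="$(mktemp)"
printf '# toy scoring file\nrsID\tchr_name\tchr_position\teffect_allele\tother_allele\teffect_weight\nrs5746679\t22\t17080378\tA\tG\t0.1\n' > "$tmp"
make_interval_list "$tmp"
rm -f "$tmp"
```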
