Add tests for genotype imputation #2815
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2815       +/-   ##
===========================================
- Coverage   89.79%   76.97%   -12.82%
===========================================
  Files          30       30
  Lines       30399    30399
  Branches     5909     5643      -266
===========================================
- Hits        27296    23400     -3896
- Misses       1778     5614     +3836
- Partials     1325     1385       +60
Flags with carried forward coverage won't be shown. 13 files have indirect coverage changes.
Excellent!
And the $10M question - do we compute the same thing with tskit???
I think the tricky part is getting the normalization factors right.
So we might not get exactly the same values, but I guess we would expect the output matrices to be proportional (since normalisation factors can be arbitrary).
200b6ab to b47db86
I've updated this to get the tskit code to run @szhan, and to remove the dependence on pandas for CSV parsing (adding pandas would have meant updating all the requirements.txt files, which would have been a pain). We can do the same thing (more or less) with numpy.
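As a rough illustration of the pandas-free parsing mentioned above, a purely numeric CSV can be read with `numpy.loadtxt`. The column names and values below are made up for the example and are not the actual files stored in this PR, whose layout may differ.

```python
import io

import numpy as np

# Hypothetical CSV contents; the real test files in this PR may be laid out differently.
csv_text = """site,h0,h1,h2
0,0.25,0.50,0.25
1,0.10,0.80,0.10
"""

# skiprows=1 skips the header line; delimiter="," splits fields on commas.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)
sites = data[:, 0].astype(int)  # first column: site index
probs = data[:, 1:]             # remaining columns: one value per haplotype
```

This only works when every field after the header is numeric; mixed text and numeric columns would need `numpy.genfromtxt` or the standard `csv` module instead.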
dc0d3e9 to b4628f8
Here is the first stab at the comparison. Rows = sites.
Forward probability matrix (beagle)
Backward probability matrix (beagle)
The forward values look pretty proportional to me. But I don't get why the values in the BEAGLE backward matrix are so uniform.
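One way to make "pretty proportional" checkable, sketched here under the assumption that rows are sites and that the two matrices may differ by a separate positive scaling factor at each site: rescale every row to sum to 1 and compare. The function name and the numbers are illustrative only.

```python
import numpy as np


def rows_proportional(a, b, rtol=1e-6):
    """Return True if each row of `a` is a scalar multiple of the same row of `b`.

    Assumes non-negative entries and a non-zero sum in every row.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Rescale each row to sum to 1, which removes any per-site scaling factor.
    a_norm = a / a.sum(axis=1, keepdims=True)
    b_norm = b / b.sum(axis=1, keepdims=True)
    return np.allclose(a_norm, b_norm, rtol=rtol)


# Made-up values: rows are sites, columns are reference haplotypes.
fwd_a = np.array([[0.2, 0.6, 0.2], [0.1, 0.1, 0.8]])
fwd_b = fwd_a * np.array([[3.0], [0.5]])  # a different scaling at each site
print(rows_proportional(fwd_a, fwd_b))
```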
I suspect it has to do with my compiled copy of BEAGLE 4.1. BEAGLE's forward-backward algorithm only keeps track of the values for one site/marker at a time in order to reduce memory requirements. For some reason, I'm not seeing it being updated across the iterations.
Also, I'm seeing that BEAGLE is initialising the backward probability matrix to 1/n, where n = number of haplotypes in the reference panel. See this line.
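To make the two points above concrete (one site held in memory at a time, and initialisation to 1/n), here is a sketch of a backward pass for a generic Li and Stephens style copying model. It is not BEAGLE's code: the function name, the `emission` layout and the single `switch_prob` parameter are assumptions chosen for the example, and every site is assumed to have at least one non-zero emission value.

```python
import numpy as np


def backward_last_to_first(emission, switch_prob):
    """Yield (site, backward values) from the last site to the first.

    emission[m, j] is the probability of the observed allele at site m given
    that reference haplotype j is copied; switch_prob is the per-step
    probability of switching to a uniformly chosen haplotype.
    """
    num_sites, n = emission.shape
    b = np.full(n, 1.0 / n)  # initialise to 1/n at the last site
    yield num_sites - 1, b.copy()
    for m in range(num_sites - 1, 0, -1):
        weighted = emission[m] * b
        # Either stay on the same haplotype or switch to any haplotype uniformly.
        b = (1 - switch_prob) * weighted + switch_prob * weighted.sum() / n
        b /= b.sum()  # rescale; only relative values matter
        yield m - 1, b.copy()
```

Only the vector `b` is kept between steps, so memory use does not grow with the number of sites. In this generic model a `switch_prob` of 1 makes every backward vector exactly uniform, which shows that parameter values alone can flatten the output; the sketch does not establish what BEAGLE itself is doing.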
OK, I guess we'll need to figure this out. But we should be able to check that the final imputed values (info scores?) are equal, easily enough?
There's also the matrix that combines the forward and backward values. I think BEAGLE gets the allele probabilities out of that.
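For a generic HMM, the combination described above is the elementwise product of the forward and backward values, normalised per site, with allele probabilities obtained by summing over the haplotypes that carry each allele. The sketch below assumes that textbook formulation; the function name and the `ref_alleles` layout are hypothetical and not taken from BEAGLE or tskit.

```python
import numpy as np


def allele_probabilities(fwd, bwd, ref_alleles, num_alleles=2):
    """Posterior allele probabilities at each site from forward/backward values.

    fwd, bwd: arrays of shape (num_sites, n); rows are sites.
    ref_alleles: integer array of shape (num_sites, n) giving the allele
    carried by each reference haplotype at each site.
    """
    state = fwd * bwd
    state = state / state.sum(axis=1, keepdims=True)  # per-site normalisation
    probs = np.zeros((fwd.shape[0], num_alleles))
    for a in range(num_alleles):
        # Sum the state probabilities of the haplotypes carrying allele a.
        probs[:, a] = np.where(ref_alleles == a, state, 0.0).sum(axis=1)
    return probs
```

Because of the per-site normalisation, any constant scaling of the forward or backward rows cancels out, so two implementations with different normalisation factors should still agree on the allele probabilities.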
I just discussed with Duncan. He spotted that the uniform backward values have to do with parametrization.
2dd1007 to 0395bf1
This looks really excellent @szhan! I'm not sure whether the Beagle code belongs here or in the tsimpute repo, but it's really good progress.
I guess ultimately what we want in tskit is some intermediate results that we compute using the Beagle algorithms that we can compare with what we compute in tskit. The actual job of imputation has a lot of moving parts, which we probably don't want to put together directly in tskit.
I think we should just focus on the interpolated allele probabilities at all the imputed markers on query haplotypes. We can compare these allele probabilities from BEAGLE and our implementation here.
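A sketch of what "interpolated allele probabilities" could look like, assuming state probabilities are linearly interpolated by position between genotyped markers and then summed by the allele each reference haplotype carries at the imputed marker. BEAGLE's actual scheme (for example its use of genetic map distances) may differ, and all names here are hypothetical.

```python
import numpy as np


def interpolated_allele_probs(pos, state_probs, imputed_pos, imputed_alleles,
                              num_alleles=2):
    """Allele probabilities at imputed markers by linear interpolation.

    pos: increasing positions of the genotyped markers.
    state_probs: shape (len(pos), n), state probabilities at genotyped markers.
    imputed_pos: positions of the markers to impute.
    imputed_alleles: shape (len(imputed_pos), n), reference alleles there.
    """
    n = state_probs.shape[1]
    # Interpolate each haplotype's state probability along the positions.
    interp = np.column_stack(
        [np.interp(imputed_pos, pos, state_probs[:, j]) for j in range(n)]
    )
    probs = np.zeros((len(imputed_pos), num_alleles))
    for a in range(num_alleles):
        probs[:, a] = np.where(imputed_alleles == a, interp, 0.0).sum(axis=1)
    return probs
```

Note that `np.interp` holds the end values constant for positions outside the genotyped range, which is a choice of this sketch rather than something stated in the discussion.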
Just to summarise the outputs from BEAGLE we can probably use for comparison here:
It is hard to replicate exactly what BEAGLE is doing, so at best we will be able to look at correlations between the values (as discussed above).
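If the comparison ends up being a correlation rather than an exact match, something along these lines would do. The values are made up and the helper name is hypothetical; any threshold for "close enough" would need to be agreed separately.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation between two arrays of the same shape, flattened."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    return float(np.corrcoef(x, y)[0, 1])


# Made-up allele probabilities from two hypothetical implementations.
probs_a = np.array([0.10, 0.50, 0.90, 0.30])
probs_b = np.array([0.12, 0.48, 0.93, 0.27])
r = pearson_r(probs_a, probs_b)
```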
9119d88 to fe70e44
cb323a0 to 9895525
Hi @szhan - what's the plan with this work?
Will circle back to this. Please keep.
Description
Add toy examples for testing imputation. Results obtained by running BEAGLE 4.1 are stored for comparison.
Fixes #2802