-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check for duplicate alleles #960
Conversation
How long does it take for a large file? |
2aa6c46
to
542737e
Compare
I'll test. Meanwhile, I messed up the GH commits with another PR, so will fix and force push |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #960 +/- ##
=======================================
Coverage 93.16% 93.17%
=======================================
Files 18 18
Lines 6337 6340 +3
Branches 1133 1135 +2
=======================================
+ Hits 5904 5907 +3
Misses 294 294
Partials 139 139
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Also fix tests to create valid datasets with a given seed
542737e
to
3421907
Compare
LGTM - I don't think the loop through the alleles will be that expensive for most cases - unless lots of sites have hundreds of alleles. This has made me realise the vcfzarr spec should probably be more explicit that zero-length alleles are not allowed. It is currently there as implicit as it states |
Zero-length alleles are not allowed in VCF files, so I guess that's another way in which it is implicit. However, we might want to allow representation of indels internally using a |
Rolled into #963 as part of review, and tested with a large file. |
Fixes #927.
I assume we are happy to take the hit of looping through all the sites and doing a
set()
. An alternative would be to do annp.unique
by row, but then we would need to remove the duplicate""
entries.