Check for duplicate alleles #960

hyanwong · 2024-09-04T11:25:24Z

Fixes #927.

I assume we are happy to take the hit of looping through all the sites and calling `set()` on each site's alleles. An alternative would be to run `np.unique` by row, but then we would need to remove the duplicate "" entries.
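For reference, the per-site loop can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `find_duplicate_allele_sites` and the toy `alleles` array are hypothetical, and it assumes alleles are stored as a 2D string array padded with "" as the fill value.

```python
import numpy as np


def find_duplicate_allele_sites(alleles):
    """Return the row indexes (sites) that contain a repeated non-empty allele.

    Hypothetical helper for illustration only; assumes `alleles` is a 2D
    array of strings in which "" is padding rather than a real allele.
    """
    bad_sites = []
    for site_index, row in enumerate(alleles):
        # Drop the "" fill entries so that padding is never counted as a duplicate.
        used = [allele for allele in row if allele != ""]
        # A set collapses repeats, so a length mismatch means a duplicate exists.
        if len(set(used)) != len(used):
            bad_sites.append(site_index)
    return bad_sites


# Toy data: the second site lists "A" twice; the third is padded with two "".
alleles = np.array([["A", "T", ""], ["A", "A", ""], ["G", "", ""]])
print(find_duplicate_allele_sites(alleles))
```

Filtering out "" before building the set is what avoids the padding problem mentioned above for the `np.unique` route.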

jeromekelleher · 2024-09-04T11:32:00Z

How long does it take for a large file?

hyanwong · 2024-09-04T11:40:21Z

I'll test. Meanwhile, I mixed up the commits with another PR, so I will fix that and force-push.

codecov · 2024-09-04T11:52:40Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.17%. Comparing base (a080e8f) to head (3421907).
Report is 8 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #960   +/-   ##
=======================================
  Coverage   93.16%   93.17%           
=======================================
  Files          18       18           
  Lines        6337     6340    +3     
  Branches     1133     1135    +2     
=======================================
+ Hits         5904     5907    +3     
  Misses        294      294           
  Partials      139      139

Flag	Coverage Δ
C	`93.17% <100.00%> (+<0.01%)`	⬆️
python	`95.53% <100.00%> (+<0.01%)`	⬆️



benjeffery · 2024-09-04T12:36:52Z

LGTM - I don't think the loop through the alleles will be that expensive for most cases - unless lots of sites have hundreds of alleles.
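If the loop ever did become a bottleneck, one possible vectorised alternative is to sort each row and compare neighbouring entries. The sketch below is only an assumption about how that could look, not something proposed in this PR: `duplicate_allele_mask` is a hypothetical name, and it again assumes a 2D string array padded with "".

```python
import numpy as np


def duplicate_allele_mask(alleles):
    """Return a boolean array, True for each row with a repeated non-empty allele.

    Hypothetical helper for illustration only; assumes `alleles` is a 2D
    array of strings in which "" is padding rather than a real allele.
    """
    # After sorting along each row, equal alleles sit next to each other.
    sorted_alleles = np.sort(alleles, axis=1)
    same_as_previous = sorted_alleles[:, 1:] == sorted_alleles[:, :-1]
    # Ignore matches between "" fill values, which are expected to repeat.
    non_empty = sorted_alleles[:, 1:] != ""
    return np.any(same_as_previous & non_empty, axis=1)


alleles = np.array(
    [["A", "T", ""], ["A", "A", ""], ["G", "", ""], ["C", "G", "C"]]
)
print(duplicate_allele_mask(alleles))
```

This trades the Python-level loop for a sort of every row, so whether it is actually faster would need measuring on a realistically large file.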

This has made me realise the vcfzarr spec should probably be more explicit that zero-length alleles are not allowed. At the moment this is only implicit, in that the spec states that "" is a fill value.

hyanwong · 2024-09-04T13:45:32Z

> This has made me realise the vcfzarr spec should probably be more explicit that zero-length alleles are not allowed. It is currently there as implicit as it states "" is a fill value.

Zero-length alleles are not allowed in VCF files, so I guess that's another way in which it is implicit. However, we might want to allow indels to be represented internally using "", otherwise there are forms of genetic variation that can be represented in a VCF but not in the tsinfer input (see #893 (comment)).

hyanwong · 2024-09-10T05:52:27Z

Rolled into #963 as part of review, and tested with a large file.

hyanwong force-pushed the check-duplicate-alleles branch from 2aa6c46 to 542737e · September 4, 2024 11:38

mergify bot mentioned this pull request Sep 4, 2024

Override VariantData.from_tree_sequence #961

Merged

Test alleles are unique

3421907

Also fix tests to create valid datasets with a given seed

hyanwong force-pushed the check-duplicate-alleles branch from 542737e to 3421907 · September 4, 2024 12:17

hyanwong mentioned this pull request Sep 4, 2024

Do not warn about unknown states if ancestral allele is "N" #962

Closed

hyanwong closed this Sep 10, 2024
