VariantData #944

benjeffery · 2024-07-24T17:39:26Z

No description provided.

codecov · 2024-07-25T12:59:50Z

Codecov Report

Attention: Patch coverage is 89.83051% with 6 lines in your changes missing coverage. Please review.

Project coverage is 93.23%. Comparing base (0c5d895) to head (aeefa3a).

Files	Patch %	Lines
tsinfer/formats.py	89.47%	5 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #944      +/-   ##
==========================================
- Coverage   93.31%   93.23%   -0.08%     
==========================================
  Files          18       18              
  Lines        6281     6299      +18     
  Branches     1131     1139       +8     
==========================================
+ Hits         5861     5873      +12     
- Misses        285      290       +5     
- Partials      135      136       +1

Flag	Coverage Δ
C	`93.23% <89.83%> (-0.08%)`	⬇️
python	`95.65% <89.83%> (-0.12%)`	⬇️

Flags with carried forward coverage won't be shown.

benjeffery · 2024-07-25T13:15:37Z

@hyanwong The API is currently:

VariantData(
        path,
        ancestral_allele_name_or_array="variant_ancestral_allele",
        sample_mask_name_or_array=None,
        site_mask_name_or_array=None,
        sites_time_name_or_array=None,
    )

I just have to add a few more tests, but does this look good to you?
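To illustrate what a "name or array" parameter implies, here is a minimal sketch of how such an argument could be resolved. The helper `resolve_name_or_array` and the dict-based dataset are hypothetical stand-ins for illustration only; they are not taken from tsinfer's code.

```python
# Hypothetical helper, not tsinfer's implementation: resolve an argument that
# may be either the name of an array stored in the dataset or an array itself.
def resolve_name_or_array(dataset, name_or_array, default=None):
    if name_or_array is None:
        # Nothing supplied: fall back to the default (e.g. "no mask").
        return default
    if isinstance(name_or_array, str):
        # A string is treated as the name of an array inside the dataset.
        if name_or_array not in dataset:
            raise ValueError(f"Array {name_or_array!r} not found in the dataset")
        return dataset[name_or_array]
    # Anything else is assumed to be an array-like supplied directly.
    return name_or_array


# A plain dict stands in for the on-disk dataset here.
dataset = {"variant_ancestral_allele": ["A", "C", "G"]}
by_name = resolve_name_or_array(dataset, "variant_ancestral_allele")
by_array = resolve_name_or_array(dataset, ["T", "C", "G"])
no_mask = resolve_name_or_array(dataset, None)
```

Raising on a missing name, rather than silently falling back, is one possible design choice; it makes an absent ancestral allele array an explicit error.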

hyanwong · 2024-07-25T13:39:31Z

Yes, this looks great. I'm not sure about the default being "variant_ancestral_allele", as that's often not present in the VCF, so I'm wondering what happens if it's absent. But I guess we need to get it from somewhere.

hyanwong · 2024-07-25T13:42:29Z

We might want to swap in, e.g., new array masks, or flip some ancestral states. Would we just create a new VariantData object in that case, or would there be the ability to edit the one we already have, e.g. via

vd = VariantData(...)
vd.set_ancestral_allele(my_numpy_array)

Incidentally, although VariantData is a reasonable name, I'm not sure about using "vd" for the shorthand object 😬

hyanwong · 2024-07-25T20:27:12Z

I'm just looking at the tutorial page. In this, we invent a dataset, and use the SampleData.add_site() function to add data that can be read by tsinfer. Is there a way to do this using the new input format (i.e. to create a VCF Zarr file from scratch, without starting with a VCF)? Or do we keep the old interface around for teaching purposes (I would prefer not to, I think).

Edit - added: we could use sg.simulate_genotype_call_dataset to make some example data, but it has a bug which makes it inappropriate to use in general. We could pick a particular random seed that doesn't trigger that bug, though.

benjeffery · 2024-07-25T20:34:22Z

> I'm just looking at the tutorial page. In this, we invent a dataset, and use the SampleData.add_site() function to add data that can be read by tsinfer. Is there a way to do this using the new input format (i.e. to create a VCF Zarr file from scratch, without starting with a VCF)? Or do we keep the old interface around for teaching purposes (I would prefer not to, I think).

The dataset is just a set of arrays, so you can create one from scratch or use sgkit.simulate_genotype_call_dataset for a simple example dataset.
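As a rough illustration of "just a set of arrays", the sketch below builds a tiny dataset in memory with NumPy. The array names follow sgkit / VCF Zarr naming conventions as an assumption, and the values are invented. Writing these arrays to a store that VariantData can open is not covered here.

```python
import numpy as np

# Assumed dimensions for a toy dataset: 3 sites, 2 diploid samples.
num_sites, num_samples, ploidy = 3, 2, 2

# Array names are assumed to follow sgkit / VCF Zarr conventions.
dataset = {
    "variant_position": np.array([10, 20, 30], dtype=np.int32),
    # One row of allele strings per site.
    "variant_allele": np.array([["A", "T"], ["C", "G"], ["G", "A"]]),
    # Genotypes indexed as (site, sample, ploidy); values index into variant_allele.
    "call_genotype": np.array(
        [
            [[0, 1], [0, 0]],
            [[1, 1], [0, 1]],
            [[0, 0], [1, 0]],
        ],
        dtype=np.int8,
    ),
    "sample_id": np.array(["sample_0", "sample_1"]),
    # One ancestral allele string per site.
    "variant_ancestral_allele": np.array(["A", "C", "G"]),
}
```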

hyanwong · 2024-07-25T20:36:10Z

Since you are writing tests, do you have time to roll in a check for duplicate alleles (#927), @benjeffery? That would make using simulate_genotype_call_dataset less prone to giving weird results in the tutorial.
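For context, a duplicate-allele check could look roughly like the sketch below. The function `find_duplicate_alleles` is hypothetical, and treating the empty string as a padding value for unused allele slots is an assumption; the thread does not show how #927 is actually implemented.

```python
# Hypothetical check, not the fix for #927: report allele strings that occur
# more than once at a single site.
def find_duplicate_alleles(alleles, fill=""):
    seen = set()
    duplicates = set()
    for allele in alleles:
        if allele == fill:
            # Assumed padding value for unused allele slots; ignore it.
            continue
        if allele in seen:
            duplicates.add(allele)
        seen.add(allele)
    return sorted(duplicates)


site_alleles = [["A", "T"], ["C", "C"], ["G", "A", ""]]
# Indices of sites whose allele list contains a repeated allele.
bad_sites = [i for i, alleles in enumerate(site_alleles) if find_duplicate_alleles(alleles)]
```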

hyanwong · 2024-07-26T09:13:31Z

How do we specify an unknown ancestral allele? I'm guessing we provide any non-allele character, e.g. "." in the input array?

benjeffery · 2024-07-26T09:49:09Z

> I'm not sure about the default being "variant_ancestral_allele"

I removed the default and made it required.

> How do we specify an unknown ancestral allele?

Yep, anything that isn't in the alleles for the site will do.
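The rule above can be sketched as a lookup of the ancestral allele string in the site's allele list. The helper name and the use of -1 as the "unknown" marker are assumptions made for illustration, not tsinfer's internal representation.

```python
# Hypothetical helper: map an ancestral allele string to its index in the
# site's alleles, or to a sentinel if it is not one of them.
def ancestral_allele_index(alleles, ancestral_allele, unknown=-1):
    try:
        return list(alleles).index(ancestral_allele)
    except ValueError:
        # Any string not among the site's alleles (e.g. ".") counts as unknown.
        return unknown


sites = [(("A", "T"), "A"), (("C", "G"), "."), (("G", "A"), "A")]
indexes = [ancestral_allele_index(alleles, aa) for alleles, aa in sites]
```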

hyanwong · 2024-07-26T11:00:04Z

Thanks. Incidentally, I wonder if we should transparently convert an array of numpy byte strings as ancestral alleles into "proper" strings, by calling .astype(str) on the AA array? That saves some hassle with sg.simulate_genotype_call_dataset, which makes byte strings.

hyanwong · 2024-07-26T11:01:26Z

How do we specify the AA on the command-line, by the way?

benjeffery · 2024-07-26T11:20:15Z

> I wonder if we should transparently convert an array of numpy byte strings as ancestral alleles into "proper" strings, by calling .astype(str) on the AA array?

That's done already.
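For readers unfamiliar with the conversion being discussed, this small NumPy example shows what calling .astype(str) does to an array of byte strings. It only demonstrates the NumPy behaviour and makes no claim about where tsinfer performs the conversion.

```python
import numpy as np

# An array of byte strings (dtype kind "S"), like the ones mentioned above.
byte_alleles = np.array([b"A", b"C", b"G"])

# astype(str) decodes the bytes into a unicode string array (dtype kind "U").
str_alleles = byte_alleles.astype(str)
```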

> How do we specify the AA on the command-line, by the way?

There is no command line support for VariantData yet.

hyanwong · 2024-07-26T11:41:20Z

The name "ancestral_allele_name_or_array" is quite long. That's OK, but I wonder if we just get people to use positional arguments, or if we want to enforce using the actual parameter name. The following looks OK to me:

vdata = tsinfer.VariantData("data.vcz", ancestral_alleles)

benjeffery · 2024-07-26T11:45:37Z

> The name "ancestral_allele_name_or_array" is quite long.

The current code supports positional for the AA.

hyanwong · 2024-07-26T12:39:28Z

> > The name "ancestral_allele_name_or_array" is quite long.
>
> The current code supports positional for the AA.

Great, as long as we are happy recommending that, all's good. Thanks

jeromekelleher

Looks good as the top-level API, but I think we can shorten the names without loss of information. Should be a simple global search-and-replace job.

jeromekelleher · 2024-07-26T12:43:12Z

tsinfer/formats.py

+    def __init__(
+        self,
+        path,
+        ancestral_allele_name_or_array,


The _name_or_array suffix is redundant for all of these, isn't it? We don't put type information into most things, and since these all share the same suffix I don't see what problem it solves.

Good point - fixed

benjeffery force-pushed the variant-data branch from acd0aa6 to ee3474f Compare July 25, 2024 12:41

hyanwong mentioned this pull request Jul 26, 2024

Update tutorial for new VariantData format #945

Merged

benjeffery force-pushed the variant-data branch from e2c5449 to c6a4458 Compare July 26, 2024 11:44

benjeffery changed the title ~~WIP - VariantData~~ VariantData Jul 26, 2024

benjeffery marked this pull request as ready for review July 26, 2024 11:45

jeromekelleher reviewed Jul 26, 2024

VariantData class

aeefa3a

benjeffery force-pushed the variant-data branch from c6a4458 to aeefa3a Compare July 26, 2024 13:32

benjeffery added the AUTOMERGE-REQUESTED label Jul 26, 2024

benjeffery merged commit a803933 into tskit-dev:main Jul 26, 2024
10 of 12 checks passed

mergify bot removed the AUTOMERGE-REQUESTED label Jul 26, 2024
