Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VariantData #944

Merged
merged 1 commit into from
Jul 26, 2024
Merged

VariantData #944

merged 1 commit into from
Jul 26, 2024

Conversation

benjeffery
Copy link
Member

No description provided.

Copy link

codecov bot commented Jul 25, 2024

Codecov Report

Attention: Patch coverage is 89.83051% with 6 lines in your changes missing coverage. Please review.

Project coverage is 93.23%. Comparing base (0c5d895) to head (aeefa3a).

Files Patch % Lines
tsinfer/formats.py 89.47% 5 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #944      +/-   ##
==========================================
- Coverage   93.31%   93.23%   -0.08%     
==========================================
  Files          18       18              
  Lines        6281     6299      +18     
  Branches     1131     1139       +8     
==========================================
+ Hits         5861     5873      +12     
- Misses        285      290       +5     
- Partials      135      136       +1     
Flag Coverage Δ
C 93.23% <89.83%> (-0.08%) ⬇️
python 95.65% <89.83%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@benjeffery
Copy link
Member Author

@hyanwong The API is currently:

VariantData(
        path,
        ancestral_allele_name_or_array="variant_ancestral_allele",
        sample_mask_name_or_array=None,
        site_mask_name_or_array=None,
        sites_time_name_or_array=None,
    )

I just have to add a few more tests, but does this look good to you?

@hyanwong
Copy link
Member

Yes, this looks great. I'm not sure about the default being "variant_ancestral_allele", as that's often not present in the VCF, so I'm wondering what happens if it's absent. But I guess we need to get it from somewhere.

@hyanwong
Copy link
Member

We might want to swap e.g. new array masks or flip some ancestral states. Would we jut create a new VariantData object in that case, or would there be the ability to edit the one we have already, e.g. via

vd = VariantData(...)
vd.set_ancestral_allele(my_numpy_array)

Incidentally, although VariantData is a reasonable name, I'm not sure about using "vd" for the shorthand object 😬

@hyanwong
Copy link
Member

hyanwong commented Jul 25, 2024

I'm just looking at the tutorial page. In this, we invent a dataset, and use the SampleData.add_site() function to add data that can be read by tsinfer. Is there a way to do this using the new input format (i.e. to create a VCF Zarr file from scratch, without starting with a VCF)? Or do we keep the old interface around for teaching purposes (I would prefer not to, I think).

Edit - added: we could use sg.simulate_genotype_call_dataset to make some example data, but it has a bug which makes it inappropriate to use in general. We could pick a particular random seed that doesn't trigger that bug, though.

@benjeffery
Copy link
Member Author

I'm just looking at the tutorial page. In this, we invent a dataset, and use the SampleData.add_site() function to add data that can be read by tsinfer. Is there a way to do this using the new input format (i.e. to create a VCF Zarr file from scratch, without starting with a VCF)? Or do we keep the old interface around for teaching purposes (I would prefer not to, I think).

The dataset is just a set of arrays, so you can create one from scratch or use sgkit.simulate_genotype_call_dataset for a simple example dataset.

@hyanwong
Copy link
Member

Since you are writing tests, do you have time to roll in a check for duplicate alleles (#927) @benjeffery ? That would make using simulate_genotype_call_dataset less prone to giving weird results in the tutorial.

@hyanwong
Copy link
Member

How do we specify an unknown ancestral allele? I'm guessing we provide any non-allele character, e.g. "." in the input array?

@benjeffery
Copy link
Member Author

I'm not sure about the default being "variant_ancestral_allele"

I removed the default and made it required.

How do we specify an unknown ancestral allele?

Yep, anything that isn't in the alleles for the site will do.

@hyanwong
Copy link
Member

Thanks. Incidentally, I wonder if we should transparently convert an array of numpy byte strings as ancestral alleles into "proper" strings, by calling .astype(str) on the AA array? That saves some hassle with sg.simulate_genotype_call_dataset which makes byte strings

@hyanwong
Copy link
Member

How do we specify the AA on the command-line, by the way?

@benjeffery
Copy link
Member Author

I wonder if we should transparently convert an array of numpy byte strings as ancestral alleles into "proper" strings, by calling .astype(str) on the AA array?

That's done already.

How do we specify the AA on the command-line, by the way?

There is no command line support for VariantData yet.

@hyanwong
Copy link
Member

The name "ancestral_allele_name_or_array" is quite long. That's OK, but I wonder if we just get people to use positional arguments, or if we want to enforce using the actual parameter name. The following looks OK to me:

vdata = tsinfer.VariantData("data.vcz", ancestral_alleles)

@benjeffery benjeffery changed the title WIP - VariantData VariantData Jul 26, 2024
@benjeffery
Copy link
Member Author

The name "ancestral_allele_name_or_array" is quite long.

The current code supports positional for the AA.

@benjeffery benjeffery marked this pull request as ready for review July 26, 2024 11:45
@hyanwong
Copy link
Member

The name "ancestral_allele_name_or_array" is quite long.

The current code supports positional for the AA.

Great, as long as we are happy recommending that, all's good. Thanks

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good as the top-level API, but I think we can shorten the names without loss of information. Should be simple global search and replace job.

def __init__(
self,
path,
ancestral_allele_name_or_array,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _name_or_array suffix is redundant for all these isn't it? We don't put type information into most things, and since these all share the same suffix I don't see what problem it solves.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - fixed

@benjeffery benjeffery merged commit a803933 into tskit-dev:main Jul 26, 2024
10 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants