-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VariantData #944
VariantData #944
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #944 +/- ##
==========================================
- Coverage 93.31% 93.23% -0.08%
==========================================
Files 18 18
Lines 6281 6299 +18
Branches 1131 1139 +8
==========================================
+ Hits 5861 5873 +12
- Misses 285 290 +5
- Partials 135 136 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
@hyanwong The API is currently: VariantData(
path,
ancestral_allele_name_or_array="variant_ancestral_allele",
sample_mask_name_or_array=None,
site_mask_name_or_array=None,
sites_time_name_or_array=None,
) I just have to add a few more tests, but does this look good to you? |
Yes, this looks great. I'm not sure about the default being "variant_ancestral_allele", as that's often not present in the VCF, so I'm wondering what happens if it's absent. But I guess we need to get it from somewhere. |
We might want to swap e.g. new array masks or flip some ancestral states. Would we jut create a new VariantData object in that case, or would there be the ability to edit the one we have already, e.g. via
Incidentally, although VariantData is a reasonable name, I'm not sure about using "vd" for the shorthand object 😬 |
I'm just looking at the tutorial page. In this, we invent a dataset, and use the Edit - added: we could use |
The dataset is just a set of arrays, so you can create one from scratch or use |
Since you are writing tests, do you have time to roll in a check for duplicate alleles (#927) @benjeffery ? That would make using |
How do we specify an unknown ancestral allele? I'm guessing we provide any non-allele character, e.g. "." in the input array? |
I removed the default and made it required.
Yep, anything that isn't in the alleles for the site will do. |
Thanks. Incidentally, I wonder if we should transparently convert an array of numpy byte strings as ancestral alleles into "proper" strings, by calling |
How do we specify the AA on the command-line, by the way? |
That's done already.
There is no command line support for VariantData yet. |
The name "ancestral_allele_name_or_array" is quite long. That's OK, but I wonder if we just get people to use positional arguments, or if we want to enforce using the actual parameter name. The following looks OK to me: vdata = tsinfer.VariantData("data.vcz", ancestral_alleles) |
The current code supports positional for the AA. |
Great, as long as we are happy recommending that, all's good. Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good as the top-level API, but I think we can shorten the names without loss of information. Should be simple global search and replace job.
tsinfer/formats.py
Outdated
def __init__( | ||
self, | ||
path, | ||
ancestral_allele_name_or_array, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _name_or_array
suffix is redundant for all these isn't it? We don't put type information into most things, and since these all share the same suffix I don't see what problem it solves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - fixed
No description provided.