Extra sample nodes being created from .vcz file when ploidies are uneven #952
Comments
Yes, mixed ploidy isn't supported for now. I guess detecting the not-present samples and then masking them out is the only way to do this.
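As a rough illustration of the "detect and mask" idea, here is a minimal sketch in plain Python. It assumes the calls for one variant are given as a list of per-individual allele lists, padded to the maximum ploidy with -2 for allele copies that are not present. The helper names `present_sample_mask` and `drop_absent` are hypothetical and are not part of tsinfer or sgkit.

```python
FILL = -2  # assumed padding value for allele copies that are not present


def present_sample_mask(calls):
    """Return one boolean per haplotype slot, flattened individual by individual.

    `calls` holds the genotype calls of a single variant: one list of
    alleles per individual, each padded to the maximum ploidy with FILL.
    """
    return [allele != FILL for call in calls for allele in call]


def drop_absent(values, mask):
    """Keep only the entries of `values` whose mask entry is True."""
    return [v for v, keep in zip(values, mask) if keep]


# One diploid, one haploid (padded), one diploid individual.
calls = [[0, 1], [1, FILL], [0, 0]]
mask = present_sample_mask(calls)
slot_ids = list(range(len(mask)))      # hypothetical haplotype slot ids
real_slots = drop_absent(slot_ids, mask)
```

The mask here is derived from a single variant, so it is only valid if the padding pattern is the same at every site.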
Ah yes: https://sgkit-dev.github.io/sgkit/0.5.0/vcf.html#polyploid-and-mixed-ploidy-vcf. It looks like vcf2zarr doesn't have a
Do you mean the only way to do this at the moment, or the only foreseeable way forward in the long term?
Mixed_ploidy isn't a property of the dataset, and was introduced in sgkit as a way of forcing calls to appear as if they are not mixed ploidy (from the output of certain variant callers on human datasets). I think we would need to tell tsinfer explicitly what ploidy to expect for each sample to support this properly. We could add a
The .vcz file "knows" the ploidy in each case, though, as it has a
Is it not just a question of the VariantData object correctly inferring the ploidy of each sample from the
That involves a full scan of the data, though, as you have to examine every element of the genotype matrix to find any -2s. That's definitely not an operation we want to do "implicitly". It's not that big a chore for a user who already knows the ploidy of their samples to supply it as an argument.
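To make the cost difference concrete, here is a minimal sketch contrasting a single-variant check with a full scan. It assumes the genotype matrix is a nested list shaped (variants, samples, max_ploidy), padded with -2. Both function names are hypothetical illustrations rather than existing tsinfer or sgkit functions.

```python
FILL = -2  # padding value for absent allele copies, as discussed above


def ploidies_from_variant(variant_calls):
    """Ploidy of each sample, judged from one variant only (cheap)."""
    return [sum(1 for allele in call if allele != FILL) for call in variant_calls]


def ploidies_full_scan(genotypes):
    """Ploidy of each sample, verified against every variant (expensive).

    Touches every element of the matrix, so the cost grows with
    variants x samples x max_ploidy. Raises ValueError if a sample's
    ploidy differs between variants.
    """
    rows = iter(genotypes)
    expected = ploidies_from_variant(next(rows))
    for index, variant_calls in enumerate(rows, start=1):
        if ploidies_from_variant(variant_calls) != expected:
            raise ValueError(f"ploidy changes at variant {index}")
    return expected


genotypes = [
    [[0, 1], [1, FILL]],   # variant 0: a diploid and a haploid sample
    [[1, 1], [0, FILL]],   # variant 1: same padding pattern
]
quick = ploidies_from_variant(genotypes[0])
checked = ploidies_full_scan(genotypes)
```

The single-variant version reads one row and trusts it; the full scan is the only way to be sure the pattern holds everywhere, which is why it should be an explicit user choice.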
We could have a flag in VariantData to say to "check the first variant for non-
Some other options:
Another option, which makes @jeromekelleher's suggestion easier:
data = tsinfer.VariantData(
    "tmp.vcz",
    ploidies=tsinfer.ploidies_from_vcz("tmp.vcz", variant_id=0),
)
Option 3 is straightforward enough all right.
It might also be a good idea to warn on creation of the VariantData object if the first variant has any
Can't do it on create: it involves reading all values and can take tens of minutes.
If we have a mix of ploidies in a .vcz file, we create sample nodes up to the maximum ploidy for each individual, and the extra nodes just attach to the root (because they have no data). I.e. we make a huge polytomy of "fake" samples at the root. I'm not even sure how we identify these samples to delete them.
There are a couple of people at the Oslo workshop who are working on haplodiploids, and others who are working on mixed-ploidy systems in plants (e.g. @situssog), so this would be good to fix.
Here's an example: