Extra sample nodes being created from .vcz file when ploidies are uneven #952

hyanwong · 2024-08-20T15:38:21Z

If we have a mix of ploidies in a .vcz file, we create sample nodes for the maximum ploidy size for each individual, that just attach to the root (because they have no data). I.e. we make a huge polytomy at the root of "fake" samples. I'm not even sure how we identify these samples to delete them.

There are a couple of people at the Oslo workshop who are working on haplodiploids, and others who are working on mixed policy systems in plants (e.g. @situssog), so this would be good to fix.

Here's an example:

import tempfile
import os
import sys
import subprocess
import msprime
import pysam
from Bio import bgzf


# make some haploids, diploids, and tetraploids
ts = msprime.sim_ancestry(
    [msprime.SampleSet(2, ploidy=1), msprime.SampleSet(3, ploidy=2), msprime.SampleSet(1, ploidy=4)],
    population_size=1e3,
    sequence_length=1e4,
    recombination_rate=1e-6,
)
ts = msprime.sim_mutations(ts, rate=1e-7)
zarr_file_name = "tmp.vcz"
with tempfile.TemporaryDirectory() as tmpdirname:
    vcf_name = os.path.join(tmpdirname, zarr_file_name.replace("/", "") + ".vcf.gz")
    with bgzf.open(vcf_name, "wt") as f:
        ts.write_vcf(f)
    pysam.tabix_index(vcf_name, preset="vcf")
    subprocess.run([sys.executable, "-m", "bio2zarr", "vcf2zarr", "convert", "--force", vcf_name, zarr_file_name])

vdata = tsinfer.VariantData("tmp.vcz", ancestral_allele=np.array([s.ancestral_state for s in ts.sites()]))
inferred_ts = tsinfer.infer(vdata)
print("ts.num_samples:", ts.num_samples, "-- inferred_ts.num_samples:", inferred_ts.num_samples)

The text was updated successfully, but these errors were encountered:

benjeffery · 2024-08-21T00:45:30Z

Yes, mixed ploidy isn't supported for now. I guess detecting the not-present samples and then masking them out is the only way to do this.

hyanwong · 2024-08-22T10:41:44Z

Ah yes: https://sgkit-dev.github.io/sgkit/0.5.0/vcf.html#polyploid-and-mixed-ploidy-vcf. It looks like vcf2zarr doesn't have a mixed_ploidy option like sgkit.io.vcf.vcf_to_zarr does. I assume this is something we eventually plan to incorporate? And in the same way, we would set a sentinel of -2 for each genotype call?

I guess detecting the not-present samples and then masking them out is the only way to do this.

Do you mean the only way to do this at the moment, or the only foreseeable way forward in the long term?

jeromekelleher · 2024-08-27T10:17:26Z

Mixed_ploidy isn't a property of the dataset, and was introduced in sgkit as a way of forcing calls to appear as if they are not mixed ploidy (from the output of certain variant callers on human datasets). I think we would need to tell tsinfer explicitly what ploidy to expect for each sample to support this properly.

We could add a ploidy= argument in SampleData, which could be a single number or an array (one value for each sample)?

hyanwong · 2024-09-01T21:20:08Z

The .vcz file "knows" the ploidy in each case, though, as it has a -2 in the right place. E.g. here is the call_genotype array for the example above:

array([[[ 1, -2, -2, -2],
        [ 1, -2, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  1, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  1,  0,  0]],

       [[ 0, -2, -2, -2],
        [ 0, -2, -2, -2],
        [ 0,  0, -2, -2],
        [ 1,  0, -2, -2],
        [ 0,  0, -2, -2],
        [ 1,  0,  0,  1]],

       [[ 0, -2, -2, -2],
        [ 0, -2, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  1, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  0,  0,  0]],

       [[ 0, -2, -2, -2],
        [ 0, -2, -2, -2],
        [ 1,  1, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  1, -2, -2],
        [ 0,  0,  0,  1]],

       [[ 0, -2, -2, -2],
        [ 0, -2, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  0,  0,  1]]], dtype=int8)

Is it not just a question of the VariantData object correctly inferring the ploidy of each sample from the -2 values?

jeromekelleher · 2024-09-02T08:39:18Z

That involves a full scan of the data, though, as you have to examine every element of the genotype matrix to find any -2s. That's definitely not an operation we want to be done "implicitly".

It's not that big a chore for a user who already knows the ploidy of their samples to supply it as an argument.

hyanwong · 2024-09-02T09:13:49Z

We could have a flag in VariantData to say to "check the first variant for non--2 values and use that as the ploidy"? I don't think -2 is used for other stuff, is it (missing data is -1). The users I was speaking to will probably just be pulling in VCFs from somewhere else, and I suspect it would be a tedious extra step to find the sample IDs and figure out the ploidy, when it's all in the VCF anyway.

Some other options:

a VariantData function, check_ploidy that "fixes" the problem on an optional basis
some sort of check when we do go though the variants (e.g. when building ancestors or matching samples) that the ploidies are correct?

hyanwong · 2024-09-02T09:24:15Z

Another option, which makes @jeromekelleher's suggestion easier:

A function which extracts the ploidies for all samples from a single variant, which could then be used as the input to the VariantData constructor, without having to figure out sample IDs etc.

data = tsinfer.VariantData(
    "tmp.vcz",
    ploidies = tsinfer.ploidies_from_vcz("tmp.vcz", variant_id=0)
)

jeromekelleher · 2024-09-02T10:54:25Z

Option 3 is straightforward enough all right.

hyanwong · 2024-09-02T11:00:17Z

It might also be a good idea to warn on creation of the VariantData object if the first variant has any -2 values and a ploidy argument is not specified?

jeromekelleher · 2024-09-02T12:32:11Z

Can't do it on create - it involves reading all values and can take 10s of minutes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extra sample nodes being created from .vcz file when ploidies are uneven #952

Extra sample nodes being created from .vcz file when ploidies are uneven #952

hyanwong commented Aug 20, 2024 •

edited

Loading

benjeffery commented Aug 21, 2024

hyanwong commented Aug 22, 2024

jeromekelleher commented Aug 27, 2024

hyanwong commented Sep 1, 2024

jeromekelleher commented Sep 2, 2024

hyanwong commented Sep 2, 2024 •

edited

Loading

hyanwong commented Sep 2, 2024 •

edited

Loading

jeromekelleher commented Sep 2, 2024

hyanwong commented Sep 2, 2024

jeromekelleher commented Sep 2, 2024

Extra sample nodes being created from .vcz file when ploidies are uneven #952

Extra sample nodes being created from .vcz file when ploidies are uneven #952

Comments

hyanwong commented Aug 20, 2024 • edited Loading

benjeffery commented Aug 21, 2024

hyanwong commented Aug 22, 2024

jeromekelleher commented Aug 27, 2024

hyanwong commented Sep 1, 2024

jeromekelleher commented Sep 2, 2024

hyanwong commented Sep 2, 2024 • edited Loading

hyanwong commented Sep 2, 2024 • edited Loading

jeromekelleher commented Sep 2, 2024

hyanwong commented Sep 2, 2024

jeromekelleher commented Sep 2, 2024

hyanwong commented Aug 20, 2024 •

edited

Loading

hyanwong commented Sep 2, 2024 •

edited

Loading

hyanwong commented Sep 2, 2024 •

edited

Loading