Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VariantData sequence_length not being taken from zarr contig_length field #949

Open
hyanwong opened this issue Aug 14, 2024 · 7 comments · May be fixed by #964
Open

VariantData sequence_length not being taken from zarr contig_length field #949

hyanwong opened this issue Aug 14, 2024 · 7 comments · May be fixed by #964

Comments

@hyanwong
Copy link
Member

hyanwong commented Aug 14, 2024

I'm having problems getting VariantData.sequence_length to match the value in the contig_length field of the zarr file. It wasn't clear until reading the code that I needed to set z.attrs["sequence_length"]. If there isn't a z.attrs["sequence_length"] defined, should we take the value from contig_length instead?

I was having the problem when trying to round-trip data from a tree sequence via a vcf file into another inferred tree sequence, and the sequence lengths didn't match up.

@hyanwong hyanwong changed the title Sequence_length not being taken from zarr contig_length field on inference VariantData sequence_length not being taken from zarr contig_length field Aug 14, 2024
@benjeffery
Copy link
Member

To use contig_length we would need to know which value from that array to use. There's the usual issue here that sgkit has many contigs, whereas tsinfer only allows one.
One option would be to error out if there is only one contig - but that seems overly restrictive.

@hyanwong
Copy link
Member Author

Don't we need to error out if the variants cover multiple contigs? I guess only if the sites mask leaves more than one contig available. Presumably we check this when running tsinfer on the VariantData object?

That implies (to me) that the sequence length is essentially dependent on the mask used, right? Following on from that thought, the sequence length can be picked from the contig_length field (and cached) after checking the uniqueness of the (masked) CHROM field, I suppose? Sounds a bit complex though!

@benjeffery
Copy link
Member

If sequence_length isn't present we currently default to last_position+1. We don't currently error on differing chroms - but we do error if positions aren't monotonically increasing.

@jeromekelleher
Copy link
Member

jeromekelleher commented Aug 27, 2024

If there is more than one contig present in the VCF Zarr we require a contig parameter to be provided in the constructor. If a contig_length is present, we use this as sequence_length if not, we do the last_position + 1 thing.

I don't think we should be requiring any custom Zarr attrs, as we won't have write access in many cases.

@hyanwong
Copy link
Member Author

hyanwong commented Sep 4, 2024

If there is more than one contig present in the VCF Zarr we require a contig parameter to be provided in the constructor. If a contig_length is present, we use this as sequence_length if not, we do the last_position + 1 thing.

Sorry, I'm not getting this. At the moment users can quite happily make a .vcz file with multiple defined contigs, and feed that to tsinfer. It then only barfs because positions aren't monotonically increasing. I just wrote some extra docs to show how to mask out all but one contig.

Are you saying that we should detect multiple contigs when making a VariantData object, and require one of them to be specified? If so, then I guess it would be also reasonable to get the equivalent contig_length from the .vcz file (if it exists), which would be handy.

I don't think we should be requiring any custom Zarr attrs, as we won't have write access in many cases.

Yes, that seems the right approach.

@jeromekelleher
Copy link
Member

Yeah, it seems that if there are multiple contigs in the VCF then it would be helpful to tell tsinfer which one it should use. You're right that this cuts across the site masking though, so it's not entirely clear what the right thing to do here is. Perhaps just providing a sequence_length option which the user can fill in themselves is the easiest.

@hyanwong
Copy link
Member Author

hyanwong commented Sep 4, 2024

I think the best thing (as long as it doesn't kill efficiency) is to allow a contig to be specified (default None), and if so, start with that as the site mask, then OR the mask with any provided site mask.

We can then check if the (masked) variant_contig array has only a single value, and store that in a VariantData attribute (this would be useful anyway, I think). Then it's easy to pick out a contig_length, if it exists (or use the N+1 hack if it doesn't).

hyanwong added a commit to hyanwong/tsinfer that referenced this issue Sep 4, 2024
Introduces a `contig_id` parameter to variant_data, as described in tskit-dev#949. Fixes tskit-dev#249
hyanwong added a commit to hyanwong/tsinfer that referenced this issue Sep 4, 2024
Introduces a `contig_id` parameter to variant_data, as described in tskit-dev#949. Fixes tskit-dev#249
@hyanwong hyanwong linked a pull request Sep 4, 2024 that will close this issue
hyanwong added a commit to hyanwong/tsinfer that referenced this issue Sep 6, 2024
Introduces a `contig_id` parameter to variant_data, as described in tskit-dev#949. Fixes tskit-dev#249
hyanwong added a commit to hyanwong/tsinfer that referenced this issue Sep 10, 2024
Introduces a `contig_id` parameter to variant_data, as described in tskit-dev#949. Fixes tskit-dev#249
hyanwong added a commit to hyanwong/tsinfer that referenced this issue Sep 10, 2024
Introduces a `contig_id` parameter to variant_data, as described in tskit-dev#949. Fixes tskit-dev#249
hyanwong added a commit to hyanwong/tsinfer that referenced this issue Sep 10, 2024
Introduces a `contig_id` parameter to variant_data, as described in tskit-dev#949. Fixes tskit-dev#249
hyanwong added a commit to hyanwong/tsinfer that referenced this issue Sep 10, 2024
Introduces a `contig_id` parameter to variant_data, as described in tskit-dev#949. Fixes tskit-dev#249
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants