-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Edge case of decompressing legacy file with top-level metadata #65
Comments
Thanks for the report @brianzhang01. Just so we're clear, did the version of the file format that you're using ever get released? If not, then it's hard for us to justify adding complexity to the codebase to support it, unless there are end-users out there being affected. Would a script to fix-up your tszipped files so that they use the same format as the released version help, or do you have files "out in the wild" that use this format? |
Thanks @jeromekelleher . I expect myself to be the only end user, as I was using a version of the code between PR #35 and PR #42 (which did not get its own PyPI release). I also don't know of anyone besides me who uses top-level metadata in tskit.TreeSequence objects. I do have thousands of tszipped files which wrote top-level metadata in this way... If it's simpler, feel free to close the issue and I will add the 5 lines of code posted above to a local version of tszip as a bridge between old and new formats. We can alternatively merge those 5 lines of code in, but it would only benefit myself. Also, I looked into my earlier comment about "whether the |
Using development versions of any tskit-dev project in production is something we do discourage @brianzhang01, becomes it makes the process of incremental development (where we have to ensure that code is production ready on every commit) far more dificult. We need to be free to try stuff out without taking on the contract of perpetual support. Adding your patch above will end up being quite a lot of work because we'll have to contrive specific test cases for it and maintain it into the future. Perhaps a reasonable way forward would be to convert your files to use the production format? It should be only a few lines of Python to convert the top-level metadata, and it should be very fast because you won't need to decompress the data arrays. I'm happy to help with this process if you post a small example of one of your zarr files. I wouldn't recommend using a local patched version of tszip, unless it's for a once-off upgrade to the production format. |
Back in April 2021, I had some TreeSequence objects with top-level metadata that I wanted to write out using tszip, so I proposed PR #35. I've stuck to a version of tszip with that change for the past year.
PR #42 involved a rewrite of tszip that mostly supports legacy versions, but I have found that the top-level metadata in files I had previously written seem to be missing. Top-level metadata from newly tszipped files works fine as it is stored in
root.items()
and is correctly handled, but the previous metadata is inroot.attrs.items()
. Another piece of good news is that PR #35 never got its own tszip version on PyPI, so I'm probably the only who uses it. 😄After quite a bit of debugging, trying to understand what PR #42 did, I came up with a solution that seems to work locally.
Would someone with more knowledge of the codebase be able to take a look and modify as necessary? My PR #35 includes a test,
test_small_msprime_top_level_metadata
, that can be used to construct a TreeSequence with some top-level metadata. If you write it out with a version right after PR #35 and then read it in with the latest version, I see the following keys for root.attrs.items():and the following are the keys of root.items():
I would also suggest checking whether the
format_name
,format_version
, andprovenance
fields ofroot.attrs
are correctly handled for the legacy versions. It seemed fine to me with the provenance correctly carried over, but I notice that onlyroot.attrs["sequence_length"]
is accessed in the currentdecompress_zarr()
function.Thank you very much.
The text was updated successfully, but these errors were encountered: