
Header as a HDF5 compound datatype #2

Open
mkitti opened this issue Jun 6, 2022 · 10 comments

Comments

@mkitti

mkitti commented Jun 6, 2022

HDF5 has the ability to create a dataset with a compound datatype, which is analogous to a C struct.

https://portal.hdfgroup.org/display/HDF5/Datatype+Basics#DatatypeBasics-compound
https://api.h5py.org/h5t.html#compound-types

It may also be possible to construct this from a NumPy record array. I suspect it may be easier to build the datatype with the low-level API from the CSV files that you created.
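
As a minimal sketch of the high-level route (purely hypothetical field names, formats, and values; the real ones would come from the per-version header table), a NumPy structured dtype maps directly onto an HDF5 compound datatype when the dataset is created:

```python
import numpy as np
import h5py

# Hypothetical subset of header fields; names, formats and values are
# placeholders, not the real Jeiss header layout.
header_dtype = np.dtype([
    ("FileMagicNum", ">u4"),
    ("FileVersion", ">u2"),
    ("ChanNum", ">u1"),
])

header = np.zeros(1, dtype=header_dtype)
header["FileVersion"] = 8
header["ChanNum"] = 2

with h5py.File("example.h5", "w") as f:
    # h5py translates the structured dtype into an HDF5 compound datatype
    f.create_dataset("header", data=header)
```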

@mkitti
Author

mkitti commented Jun 6, 2022

h5py then provides the ability to read individual fields directly.
https://docs.h5py.org/en/stable/high/dataset.html?highlight=compound#reading-writing-data
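
For example (continuing the hypothetical dataset from above; `Dataset.fields` needs h5py ≥ 3.0):

```python
import h5py

with h5py.File("example.h5", "r") as f:
    dset = f["header"]
    version = dset["FileVersion"]                          # read a single member
    subset = dset.fields(["FileVersion", "ChanNum"])[()]   # or several at once
```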

@clbarnes
Owner

clbarnes commented Jun 6, 2022

Presumably this would end up very close to Davis' approach here? https://github.com/janelia-cosem/fibsem-tools/blob/f4bedbfc4ff81ec1b83282908ba6702baf98c734/src/fibsem_tools/io/fibsem.py#L81

It's smart and probably a better representation of what's going on, but is this kind of access standard across common HDF5 implementations? The HDF5 spec is colossal, so it wouldn't surprise me if many APIs only cover a subset of its functionality; in that case I'd prefer to target that common subset of "basic" features rather than go deep into the HDF5 spec to find something which is technically allowed but not available to many users.

@mkitti
Author

mkitti commented Jun 6, 2022

I was thinking of this as a way to encode the jeiss-convert TSV files as a datatype in HDF5 itself. In the worst case, one could always use H5Dread to just read the bytes with uint8 as the memory type, which is the status quo.
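
As a sketch of that fallback (assuming the structured dtype is packed to exactly the same 1024-byte layout as the on-disk header, and reusing the hypothetical file from above), the structured array h5py returns can always be flattened back into raw bytes:

```python
import h5py

with h5py.File("example.h5", "r") as f:
    arr = f["header"][()]   # structured array with the compound fields
    raw = arr.tobytes()     # back to a flat byte string, as in the status quo
```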

Many packages support compound datatypes. Perhaps the most common use of compound datatypes is complex numbers.

Java: https://bitbucket.hdfgroup.org/pages/HDFFV/hdf5doc/master/browse/html/javadoc/index.html?hdf/hdf5lib/H5.html
MATLAB: https://www.mathworks.com/help/matlab/import_export/import-hdf5-files.html
Julia: https://juliaio.github.io/HDF5.jl/stable/#Supported-data-types

@mkitti
Author

mkitti commented Jul 19, 2022

JHDF5, which is currently used by the Java tools BigDataViewer and SciJava (FIJI, etc.), has a compound datatype reader here:
https://svnsis.ethz.ch/doc/hdf5/hdf5-19.04/ch/systemsx/cisd/hdf5/IHDF5CompoundReader.html

@mkitti
Author

mkitti commented Jul 19, 2022

@clbarnes, let me know if you have time to chat for a few minutes. One concern about embracing HDF5 for this is that we're not sure whether it works for everyone at Cambridge. Albert in particular seemed to prefer text-based attributes via JSON or similar.

@clbarnes
Owner

I actually have a fork which writes to zarr, which is exactly that - a JSON file for the metadata, plus an npy-esque binary dump (which can be chunked). Zarr is getting a lot of attention, but the spec is anticipated to change sometime soon, in a way which will make it less convenient for this sort of thing.
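
A minimal sketch of that layout, assuming the zarr-python v2 API and hypothetical names/values: the attributes end up in a plain-text `.zattrs` JSON file alongside the binary chunk files.

```python
import zarr

# Hypothetical shape/dtype; placeholders rather than the real Jeiss layout.
z = zarr.open("dump.zarr", mode="w", shape=(2, 1000, 1000),
              chunks=(1, 1000, 1000), dtype="i2")
z.attrs["FileVersion"] = 8              # written to dump.zarr/.zattrs as JSON
raw_header = b"\x00" * 1024             # placeholder for the 1024-byte header
z.attrs["_header"] = raw_header.hex()   # bytes need a JSON-safe encoding
```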

I'm flexible for the rest of the week if we can figure out time differences! I'm in BST.

@mkitti
Author

mkitti commented Jul 19, 2022

Yes, I participated in the discussion on the Zarr shard specification that should be part of v3:
zarr-developers/zarr-python#876 (comment)
It looks like an HDF5 file with an extra linear dataset could also be a Zarr shard.

Extracting that index from HDF5 should be quite fast if we use H5Dchunk_iter, currently in HDF5 1.13, or the h5ls command-line utility:
https://docs.hdfgroup.org/hdf5/develop/group___h5_d.html#gac482c2386aa3aea4c44730a627a7adb8
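
From Python, h5py's low-level wrappers already expose this kind of chunk index (h5py ≥ 2.10 built against HDF5 ≥ 1.10.5). A rough sketch, assuming a hypothetical chunked dataset named "data":

```python
import h5py

with h5py.File("example.h5", "r") as f:
    dsid = f["data"].id                      # low-level DatasetID
    for i in range(dsid.get_num_chunks()):
        info = dsid.get_chunk_info(i)
        # logical chunk coordinates, plus byte offset and size within the file
        print(info.chunk_offset, info.byte_offset, info.size)
```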

Another extreme is https://github.com/HDFGroup/hdf5-json

Nonetheless, once we have the data in one standard format, I do not mind investing in tooling to move between standard formats or using something like kerchunk. The best part is that tooling may already exist.

@clbarnes
Owner

I have an implementation of this with a convenient Mapping wrapper, which round-trips correctly through bytes. What I'm trying to figure out now is where it fits with the rest of the program as it currently stands - if the header is written into the HDF5 as this compound dtype array, do we still want to encode the same metadata as attributes, which is the more HDF5-y way to do it? That duplication concerns me a bit. If not, then we've made the attributes a bit more awkward to access. Is having today's header encoded byte-for-byte in the HDF5 file a goal in its own right?

It also gets more complicated to add the zarr/n5 implementations, which don't support compound dtypes (to my knowledge). In those cases, you'd need to serialise the metadata as attributes anyway (which, again, is more convenient for downstream users). I'm also not entirely convinced zarr/n5 support is a good way to go: keeping everything contained in the same file and having a single supported workflow from proprietary to open format is a benefit, and given that these files will almost certainly require post-processing, downstream users can write to other formats at that stage if they want.

@mkitti
Author

mkitti commented Jul 26, 2022

Is having today's header encoded byte-for-byte in the HDF5 file a goal in its own right?

This was a stated goal of the last round, to help ensure round-trip durability. Originally it was just going to be an opaque datatype or byte stream, but I realized that we may be able to do better with the compound datatype. We do not want preserving the header to depend on someone bumping the version number, or on the accuracy of the reader's table of offsets and types.

One option might be to save the 1 KB header as a separate file for reference. For Zarr this might just be an opaque block of bytes. N5 has an N5-HDF5 backend that may be able to take advantage of the compound datatype.

@clbarnes
Owner

My current implementation does store the raw header as well as the exploded metadata values, without using the compound dtype. For HDF5, there is a u8 attribute "_header" (as well as "_footer"); for the N5 and Zarr implementations in the PR, these are hex-encoded strings (I'm open to using base64 as well). The tests round-trip from the exploded values, rather than relying on the byte dump.
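
A rough sketch of those two encodings (hypothetical names and values; the real header bytes would come straight from the .dat file):

```python
import numpy as np
import h5py

raw_header = b"\x00" * 1024   # placeholder for the real 1024-byte header

with h5py.File("example.h5", "a") as f:
    ds = f.require_dataset("data", shape=(1,), dtype="u1")  # stand-in dataset
    # HDF5: store the raw bytes as a uint8 attribute
    ds.attrs["_header"] = np.frombuffer(raw_header, dtype=np.uint8)
    assert ds.attrs["_header"].tobytes() == raw_header

# N5/Zarr: JSON attributes need a text encoding for the same bytes
hex_header = raw_header.hex()
assert bytes.fromhex(hex_header) == raw_header
```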

The compound dtype is just calculated from the table of offsets and dtypes, so it isn't any more robust in that respect. I don't think there's a better way to do that which doesn't just duplicate the information and introduce a new source of error. The reader doesn't need to explicitly state the version, as it's read from the metadata, and (in my implementation anyway) will fail if the version's spec isn't known.
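
For concreteness, that calculation amounts to something like the following, assuming hypothetical rows from a version's offsets/dtypes table:

```python
import numpy as np

# Hypothetical (name, offset, format) rows; the real ones come from the
# per-version table shipped with the converter.
rows = [
    ("FileMagicNum", 0, ">u4"),
    ("FileVersion", 4, ">u2"),
    ("ChanNum", 32, ">u1"),
]

header_dtype = np.dtype({
    "names":    [name for name, _, _ in rows],
    "offsets":  [offset for _, offset, _ in rows],
    "formats":  [fmt for _, _, fmt in rows],
    "itemsize": 1024,   # full fixed header length, padding included
})
```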
