
Introduction to LINDI, NWB Developer Hackathon

Jeremy Magland edited this page Apr 18, 2024 · 7 revisions

Introduction to LINDI

Jeremy Magland, Ryan Ly, Oliver Ruebel

NWB Developer Hackathon, DataJoint Headquarters, April 2024

What is LINDI (Linked Data Interface)?

HDF5 as Zarr as JSON for NWB

LINDI provides a JSON representation of NWB (Neurodata Without Borders) data where the large data chunks are stored separately from the main metadata. This enables efficient storage, composition, and sharing of NWB files on cloud systems such as DANDI without duplicating the large data blobs.

LINDI provides:

  • A specification for representing arbitrary HDF5 files as Zarr stores. This handles scalar datasets, references, soft links, and compound data types for datasets.
  • A Zarr wrapper for remote or local HDF5 files (LindiH5ZarrStore).
  • A mechanism for creating .lindi.json (or .nwb.lindi.json) files that reference data chunks in external files, inspired by kerchunk.
  • An h5py-like interface for reading from and writing to these data sources that can be used with pynwb.
  • A mechanism for uploading and downloading these data sources to and from cloud storage, including DANDI.

This project was inspired by kerchunk and hdmf-zarr.
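
To make the idea of storing large chunks separately from the metadata concrete, here is a minimal sketch of a kerchunk-style reference: a JSON entry that points at a byte range in another file. The `refs` layout with `[url, offset, length]` triples follows the kerchunk convention. The helper `read_reference` and the file names are hypothetical, and a local temporary file stands in for a remote HDF5 file.

```python
import json
import os
import tempfile


def read_reference(ref):
    """Resolve one reference entry to bytes (hypothetical helper).

    Inline entries are plain strings; external entries are
    [path, offset, length] triples pointing into another file.
    """
    if isinstance(ref, str):
        return ref.encode("utf-8")
    path, offset, length = ref
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# A local file stands in for a large remote HDF5 file (assumption).
tmp_dir = tempfile.mkdtemp()
blob_path = os.path.join(tmp_dir, "big_data.bin")
with open(blob_path, "wb") as f:
    f.write(b"HEADER" + b"chunk-0-bytes" + b"chunk-1-bytes")

# Small metadata is stored inline; large chunks are referenced by byte range.
rfs = {
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "data/0": [blob_path, 6, 13],
        "data/1": [blob_path, 19, 13],
    }
}

chunk0 = read_reference(rfs["refs"]["data/0"])
chunk1 = read_reference(rfs["refs"]["data/1"])
print(chunk0, chunk1)
```

Because the JSON file holds only pointers, several JSON files can reference the same chunks without copying them.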

Use cases

  • Represent a remote NWB/HDF5 file as a .nwb.lindi.json file.
  • Read a local or remote .nwb.lindi.json file using pynwb or other tools.
  • Edit a .nwb.lindi.json file using pynwb or other tools.
  • Add datasets to a .nwb.lindi.json file using a local staging area.
  • Upload a .nwb.lindi.json file to a cloud storage service such as DANDI.

DANDI Integration (WIP)

https://gui-staging.dandiarchive.org/dandiset/213569/draft/files?location=000946%2Fsub-BH494&page=1


Advantages over HDF5

  • Efficient remote access: Lazy reading from remote HDF5 is inherently inefficient. Many serial requests are needed to load metadata. LINDI is as efficient as Zarr for remote access.
  • Flexible data composition without duplication.
  • Ability to edit files without rewriting or re-uploading the large data chunks.
  • Scalability
  • Accessibility and interoperability: JSON is more widely supported than HDF5.
  • Custom compression codecs

Disadvantage: more than one file is needed to represent the data.

Advantages over traditional Zarr

  • More flexible in terms of where data chunks are stored.
  • Supports HDF5 features such as scalar datasets, references, and compound data types.

Represent a remote NWB/HDF5 file as a .nwb.lindi.json file

import json
import pynwb
import lindi

# URL of the remote NWB file
h5_url = "https://api.dandiarchive.org/api/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/download/"

# Create a read-only Zarr store as a wrapper for the h5 file
store = lindi.LindiH5ZarrStore.from_file(h5_url)

# Generate a reference file system
rfs = store.to_reference_file_system()

# Save it to a file for later use
with open("example.lindi.json", "w") as f:
    json.dump(rfs, f, indent=2)

# Create an h5py-like client from the reference file system
client = lindi.LindiH5pyFile.from_reference_file_system(rfs)

# Open using pynwb
with pynwb.NWBHDF5IO(file=client, mode="r") as io:
    nwbfile = io.read()
    print(nwbfile)
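
As a quick sanity check on a generated file, the sketch below counts how many entries are stored inline and how many point to byte ranges in an external file. It assumes the reference file system is a dict with a top-level `refs` key in the kerchunk style; `summarize_refs` is a hypothetical helper, and the toy dict stands in for the output of `to_reference_file_system()`.

```python
def summarize_refs(rfs):
    """Count inline entries versus external byte-range references."""
    inline = 0
    external = 0
    external_bytes = 0
    for value in rfs["refs"].values():
        if isinstance(value, list):
            # [url, offset, length]: a chunk stored in another file
            external += 1
            if len(value) == 3:
                external_bytes += value[2]
        else:
            # Inline content such as .zgroup, .zarray, or .zattrs
            inline += 1
    return {
        "inline": inline,
        "external": external,
        "external_bytes": external_bytes,
    }


# Toy reference file system with made-up keys, URL, and offsets.
toy_rfs = {
    "refs": {
        ".zgroup": {"zarr_format": 2},
        ".zattrs": {"example_attribute": "example_value"},
        "acquisition/data/0.0": ["https://example.org/file.nwb", 1024, 4096],
        "acquisition/data/1.0": ["https://example.org/file.nwb", 5120, 4096],
    }
}

summary = summarize_refs(toy_rfs)
print(summary)
```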

Read a local or remote .nwb.lindi.json file using pynwb or other tools

import pynwb
import lindi

# URL of the remote .nwb.lindi.json file
url = 'https://lindi.neurosift.org/dandi/dandisets/000939/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/zarr.json'

# Load the h5py-like client for the reference file system
client = lindi.LindiH5pyFile.from_reference_file_system(url)

# Open using pynwb
with pynwb.NWBHDF5IO(file=client, mode="r") as io:
    nwbfile = io.read()
    print(nwbfile)

Edit a .nwb.lindi.json file using pynwb or other tools

import json
import lindi

# URL of the remote .nwb.lindi.json file
url = 'https://lindi.neurosift.org/dandi/dandisets/000939/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/zarr.json'

# Load the h5py-like client for the reference file system
# in read-write mode
client = lindi.LindiH5pyFile.from_reference_file_system(url, mode="r+")

# Edit an attribute
client.attrs['new_attribute'] = 'new_value'

# Save the changes to a new .nwb.lindi.json file
rfs_new = client.to_reference_file_system()
with open('new.nwb.lindi.json', 'w') as f:
    f.write(json.dumps(rfs_new, indent=2, sort_keys=True))
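
One way to confirm that an edit touched only metadata is to compare the reference entries before and after. The sketch below assumes both reference file systems are dicts with a top-level `refs` key and that, as in Zarr v2, root attributes live under the `.zattrs` key. `diff_refs` is a hypothetical helper and the toy values are made up.

```python
import json


def diff_refs(old_rfs, new_rfs):
    """List the keys added, removed, or changed between two reference sets."""
    old, new = old_rfs["refs"], new_rfs["refs"]
    return {
        "added": sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "changed": sorted(k for k in new if k in old and new[k] != old[k]),
    }


# Toy "before" and "after" states; the chunk reference is untouched.
chunk_ref = ["https://example.org/file.nwb", 100, 50]
rfs_old = {"refs": {".zattrs": json.dumps({}), "data/0": chunk_ref}}
rfs_edited = {
    "refs": {
        ".zattrs": json.dumps({"new_attribute": "new_value"}),
        "data/0": chunk_ref,
    }
}

changes = diff_refs(rfs_old, rfs_edited)
print(changes)
```

In this toy case only the attribute entry differs, which is why an edited JSON file can be shared without re-uploading any chunk data.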

Add datasets to a .nwb.lindi.json file using a local staging area

import lindi

# URL of the remote .nwb.lindi.json file
url = 'https://lindi.neurosift.org/dandi/dandisets/000939/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/zarr.json'

# Load the h5py-like client for the reference file system
# in read-write mode with a staging area
with lindi.StagingArea.create(base_dir='lindi_staging') as staging_area:
    client = lindi.LindiH5pyFile.from_reference_file_system(
        url,
        mode="r+",
        staging_area=staging_area
    )
    # add datasets to client using pynwb or other tools
    # upload the changes to the remote .nwb.lindi.json file

Upload a .nwb.lindi.json file to DANDI

See this example

Special Zarr Annotations in LINDI

LINDI defines a set of special Zarr annotations to correspond with HDF5 features that are not natively supported in Zarr.

Scalar Datasets

_SCALAR = True

In HDF5, datasets can be scalar, but Zarr does not natively support scalar arrays. To overcome this limitation, LINDI represents scalar datasets as Zarr arrays with a shape of (1,) and marks them with the _SCALAR = True attribute.
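
The round trip can be sketched as follows. The dict layout (`shape`, `data`, `attrs`) and the helpers `encode_scalar` and `decode_array` are hypothetical stand-ins for a Zarr array; only the shape of `(1,)` and the `_SCALAR` attribute come from the convention described above.

```python
def encode_scalar(value):
    """Wrap a scalar as a one-element array marked with _SCALAR."""
    return {"shape": [1], "data": [value], "attrs": {"_SCALAR": True}}


def decode_array(arr):
    """Return a bare value for _SCALAR arrays, the full data otherwise."""
    if arr["attrs"].get("_SCALAR", False):
        return arr["data"][0]
    return arr["data"]


encoded_scalar = encode_scalar(30000.0)
regular_array = {"shape": [1], "data": [7], "attrs": {}}

print(decode_array(encoded_scalar))  # scalar value
print(decode_array(regular_array))   # ordinary one-element array
```

The attribute is what distinguishes a true scalar from an ordinary array that happens to hold one element.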

Soft Links

_SOFT_LINK = { 'path': '...' }

Soft links in HDF5 are pointers to other groups within the file. LINDI utilizes the _SOFT_LINK attribute on a Zarr group to represent this relationship. The path key within the attribute specifies the target group within the Zarr structure. Soft link groups in Zarr should be otherwise empty, serving only as a reference to another location in the dataset.

Note that we do not currently support external links.
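
A reader has to follow these links when resolving a path. The sketch below shows one way to do that over a nested dict that stands in for a Zarr hierarchy. The dict layout, the `get_group` helper, the hop limit, and the assumption that `path` is absolute from the root are all illustrative choices, not LINDI's implementation.

```python
def get_group(root, path, _hops=0):
    """Walk a path from the root, following _SOFT_LINK attributes."""
    if _hops > 10:
        raise RuntimeError("Too many soft link hops (possible cycle)")
    node = root
    for part in path.strip("/").split("/"):
        if not part:
            continue
        node = node["children"][part]
        link = node["attrs"].get("_SOFT_LINK")
        if link is not None:
            # Jump to the link target, then keep walking from there.
            node = get_group(root, link["path"], _hops + 1)
    return node


behavior = {"attrs": {"description": "target group"}, "children": {}}
root = {
    "attrs": {},
    "children": {
        "processing": {"attrs": {}, "children": {"behavior": behavior}},
        # An otherwise empty group that only carries the link.
        "shortcut": {
            "attrs": {"_SOFT_LINK": {"path": "/processing/behavior"}},
            "children": {},
        },
    },
}

resolved = get_group(root, "/shortcut")
print(resolved["attrs"])
```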

References

{
  "_REFERENCE": {
    "source": ".",
    "path": "...",
    "object_id": "...",
    "source_object_id": "..."
  }
}
  • source: Always . for now, indicating that the reference is to an object within the same Zarr.
  • path: Path to the target object within the Zarr.
  • object_id: The object_id attribute of the target object (for validation).
  • source_object_id: The object_id attribute of the source object (for validation).

This largely follows the convention used by hdmf-zarr.

HDF5 references can appear within both attributes and datasets. For attributes, the value of the attribute is a dict in the above form. For datasets, the value of an item within the dataset is a dict in the above form.

Note: Region references are not supported at this time and are planned for future implementation.
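
The sketch below builds a reference dict in this form and uses the two object IDs for validation. The helpers `make_reference` and `check_reference`, the dict layout for objects, and the example IDs and path are hypothetical.

```python
import json


def make_reference(source_obj, target_path, target_obj):
    """Build a _REFERENCE dict pointing from source_obj to target_obj."""
    return {
        "_REFERENCE": {
            "source": ".",  # same Zarr hierarchy
            "path": target_path,
            "object_id": target_obj["attrs"]["object_id"],
            "source_object_id": source_obj["attrs"]["object_id"],
        }
    }


def check_reference(ref, source_obj, target_obj):
    """Validate a reference against the objects it claims to connect."""
    r = ref["_REFERENCE"]
    return (
        r["source"] == "."
        and r["object_id"] == target_obj["attrs"].get("object_id")
        and r["source_object_id"] == source_obj["attrs"].get("object_id")
    )


# Made-up objects and IDs for illustration.
source = {"attrs": {"object_id": "aaaa-1111"}}
target = {"attrs": {"object_id": "bbbb-2222"}}
other = {"attrs": {"object_id": "cccc-3333"}}

ref = make_reference(source, "/general/devices/probe", target)
print(json.dumps(ref, indent=2))
print(check_reference(ref, source, target))
print(check_reference(ref, source, other))
```

Storing both IDs lets a reader detect a reference whose target path now holds a different object.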

Compound Data Types

_COMPOUND_DTYPE: [['x', 'int32'], ['y', 'float64'], ...]

Zarr arrays can represent compound data types from HDF5 datasets. The _COMPOUND_DTYPE attribute on a Zarr array indicates this, listing each field's name and data type. The array data should be JSON encoded, aligning with the specified compound structure. The h5py.Reference type is also supported within these structures (represented by the type string '<REFERENCE>').
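
As an illustration, the sketch below pairs JSON-encoded rows with the field names from a `_COMPOUND_DTYPE` attribute. It assumes each array element is encoded as a list of field values in declaration order, which is an assumption about the encoding rather than a statement of LINDI's exact format. `decode_rows` is a hypothetical helper.

```python
import json


def decode_rows(encoded, compound_dtype):
    """Turn JSON-encoded rows into dicts keyed by compound field name."""
    names = [name for name, _dtype in compound_dtype]
    decoded = []
    for row in json.loads(encoded):
        if len(row) != len(names):
            raise ValueError("Row length does not match _COMPOUND_DTYPE")
        decoded.append(dict(zip(names, row)))
    return decoded


compound_dtype = [["x", "int32"], ["y", "float64"]]
encoded = json.dumps([[1, 2.5], [3, 4.0]])

rows = decode_rows(encoded, compound_dtype)
print(rows)
```

The same name and type pairs can also be passed to NumPy as a structured dtype, for example `np.dtype([("x", "int32"), ("y", "float64")])`.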

External Array Links

_EXTERNAL_ARRAY_LINK = {'link_type': 'hdf5_dataset', 'url': '...', 'name': '...'}

For datasets with so many chunks that including every chunk reference in the Zarr store or reference file system is impractical, LINDI uses the _EXTERNAL_ARRAY_LINK attribute on a Zarr array. This attribute points to an external HDF5 file, specifying the url for remote access (or a local path) and the name of the target dataset within that file. When that dataset is sliced, the LindiH5pyFile client handles data retrieval, using h5py and remfile for remote access.
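
The decision can be sketched as a chunk count compared against a limit. The helpers `num_chunks` and `array_attrs`, the example URL and dataset name, and the `max_chunks` value are hypothetical; the threshold LINDI actually applies is not stated here.

```python
import math


def num_chunks(shape, chunks):
    """Number of chunks needed to tile an array of the given shape."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


def array_attrs(shape, chunks, url, name, max_chunks=1_000_000):
    """Return an _EXTERNAL_ARRAY_LINK attribute when chunks are too many.

    max_chunks is an illustrative threshold, not LINDI's actual setting.
    """
    if num_chunks(shape, chunks) > max_chunks:
        return {
            "_EXTERNAL_ARRAY_LINK": {
                "link_type": "hdf5_dataset",
                "url": url,
                "name": name,
            }
        }
    return {}


url = "https://example.org/file.nwb"  # made-up URL
name = "/acquisition/ElectricalSeries/data"  # made-up dataset name

small = array_attrs((1000, 384), (100, 384), url, name)
large = array_attrs((30_000_000, 384), (10, 384), url, name)
print(small)
print(large)
```

With a link in place, the JSON file stays small regardless of how finely the source dataset is chunked.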