This repository has been archived by the owner on Aug 5, 2021. It is now read-only.

THIS REPO IS ARCHIVED, AND ONLY AVAILABLE FOR HISTORICAL PURPOSES. YOU'RE PROBABLY LOOKING FOR https://github.com/beacon-biosignals/Onda.jl.

Onda Dataset Format

Onda is a lightweight format for storing and manipulating sets of multi-sensor, multi-channel, LPCM-encodable, annotated, time-series recordings.

The latest tagged version is v0.5.1.

Terminology

This document uses the term...

  • ..."LPCM" to refer to linear pulse code modulation, a form of signal encoding where multivariate waveforms are digitized as a series of samples uniformly spaced over time and quantized to a uniformly spaced grid.

  • ..."signal" to refer to the digitized output of a process. A signal is comprised of metadata (e.g. LPCM encoding, channel information, sample data path/format information, etc.) and associated multi-channel sample data.

  • ..."recording" to refer to a collection of one or more signals recorded simultaneously over some time period.

  • ..."annotation" to refer to a piece of (meta)data associated with a specific time span within a specific recording.

Design Principles

Onda is useful...

  • ...when segments of a signal can fit in memory simultaneously, but an entire signal cannot.
  • ...when each signal in each recording in your dataset can fit in memory, but not all signals in each recording can fit in memory simultaneously.
  • ...when each recording in your dataset can fit in memory, but not all recordings in your dataset can fit in memory simultaneously.
  • ...when your dataset's signals benefit from sensor-specific encodings/compression codecs.
  • ...as an intermediate target format for wrangling unstructured signal data before bulk ingestion into a larger data store.
  • ...as an intermediate target format for local experimentation after bulk retrieval from a larger data store.
  • ...as a format for sharing datasets comprised of several gigabytes to several terabytes of signal data.
  • ...as a format for sharing datasets comprised of hundreds to hundreds of thousands of recordings.

Onda's design must...

  • ...depend only upon technologies with standardized, implementation-agnostic specifications that are well-used across multiple application domains.
  • ...support recordings where each signal in the recording may have a unique channel layout, physical unit resolution, bit depth and sample rate.
  • ...be well-suited for ingestion into/retrieval from...
    • ...popular distributed analytics tools (e.g. Spark, TensorFlow).
    • ...traditional databases (e.g. PostgreSQL, Cassandra).
    • ...object-based storage systems (e.g. S3, GCP Cloud Storage).
  • ...enable metadata, annotations etc. to be stored and processed separately from raw sample data without significant communication overhead.
  • ...enable extensibility without sacrificing interpretability. New signal encodings, annotations, sample data file formats, etc. should all be user-definable by design.
  • ...be simple enough that a decent programmer (with Google access) should be able to fully interpret (and write performant parsers for) an Onda dataset without ever reading Onda documentation.

Onda is not...

  • ...a sample data file format. Onda allows dataset authors to utilize whatever file format is most appropriate for a given signal's sample data, as long as the author provides a mechanism to deserialize sample data from that format to a standardized interleaved LPCM representation.
  • ...a transactional database. The majority of an Onda dataset's mandated metadata is stored in tabular manifests containing recording information, signal descriptions, annotations etc. This simple structure is tailored towards Onda's target regimes (see above), and is not intended to serve as a persistent backend for external services/applications.
  • ...an analytics platform. Onda seeks to provide a data model that is purposefully structured to enable various sorts of analysis, but the format itself does not mandate/describe any specific implementation of analysis utilities.

Specification

Versioning

This specification document is versioned in accordance with semantic versioning; version numbers take the form major.minor.patch where...

  • ...increments to major correspond to changes/additions that are likely to break existing Onda readers
  • ...increments to minor correspond to changes/additions that are unlikely to break existing Onda readers
  • ...increments to patch correspond to purely textual changes, e.g. clarifying a phrase or fixing a typo

Note that, in accordance with the semantic versioning specification, minor increments in the 0.y.z release series may include breaking changes:

"Major version zero (0.y.z) is for initial development. Anything MAY change at any time. The public API SHOULD NOT be considered stable."

Overview

The Onda format describes three different types of files:

  • *.onda.annotations.arrow files: Arrow files that contain annotation (meta)data associated with a dataset.
  • *.onda.signals.arrow files: Arrow files that contain signal metadata (e.g. LPCM encoding, channel information, sample data path/format, etc.) required to find and read sample data files associated with a dataset.
  • sample data files: Files of user-defined formats that store the sample data associated with signals.

Note that *.onda.annotations.arrow files and *.onda.signals.arrow files are largely orthogonal to one another - there's nothing inherent to the Onda format that prevents dataset producers/consumers from separately constructing/manipulating/transferring/analyzing these files. Furthermore, there's nothing that prevents dataset producers/consumers from working with multiple files of the same type referencing the same set of recordings (e.g. splitting all of a dataset's annotations across multiple *.onda.annotations.arrow files).

The Arrow tables contained in *.onda.annotations.arrow and *.onda.signals.arrow must have attached custom metadata containing the key "onda_format_version" whose value specifies the version of the Onda format that an Onda reader must support in order to properly read the file. This string takes the form "vM.m.p" where M is a major version number, m is a minor version number, and p is a patch version number.
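
As a minimal sketch (not part of the specification), a reader might parse the "onda_format_version" string as shown here. The function names parse_onda_format_version and reader_supports are hypothetical, and the compatibility policy is an assumption motivated by the versioning rules above rather than something Onda mandates: the major version must match, and within the 0.y.z series the minor version must match as well.

```python
import re

# Matches strings of the form "vM.m.p", e.g. "v0.5.1".
_VERSION_PATTERN = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def parse_onda_format_version(value):
    """Return (major, minor, patch) parsed from a "vM.m.p" string."""
    match = _VERSION_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid onda_format_version: {value!r}")
    return tuple(int(part) for part in match.groups())


def reader_supports(file_version, reader_version):
    """Hypothetical compatibility check; the policy is an assumption."""
    f_major, f_minor, _ = parse_onda_format_version(file_version)
    r_major, r_minor, _ = parse_onda_format_version(reader_version)
    if f_major != r_major:
        return False
    if f_major == 0:
        # In the 0.y.z series, minor increments may include breaking changes.
        return f_minor == r_minor
    # Assumption: a reader handles files written at its own minor version or earlier.
    return f_minor <= r_minor
```

Under this assumed policy, a reader targeting v0.5.1 would accept a file tagged v0.5.0 and reject one tagged v0.4.0.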

Each of the aforementioned file types is further specified in the following sections. These sections refer to the logical types defined by the Arrow specification. Onda reader/writer implementations may additionally employ Arrow extension types that directly alias a column's specified logical type in order to support application-level features (first-class UUID support, custom file_path type support, etc.).

*.onda.annotations.arrow Files

An *.onda.annotations.arrow file contains an Arrow table whose first 3 columns are:

  1. recording (128-bit FixedSizeBinary): The UUID identifying the recording with which the annotation is associated.
  2. id (128-bit FixedSizeBinary): The UUID identifying the annotation.
  3. span (Struct): The annotation's time span within the recording. This structure has two fields:
    • start (Duration w/ NANOSECOND unit): The start offset in nanoseconds from the beginning of the recording. The minimum possible value is 0.
    • stop (Duration w/ NANOSECOND unit): The stop offset in nanoseconds (exclusive) from the beginning of the recording. This value must be greater than start.

Note that this table may contain additional author-provided columns following the columns mandated above.

An example of an *.onda.annotations.arrow table (whose author-provided my_custom_value column happens to contain strings):

| recording | id | span | my_custom_value |
|---|---|---|---|
| 0xb14d2c6d8d844e46824f5c5d857215b4 | 0x81b17ea902504371954e7b8b167236a6 | (start=5e9, stop=6e9) | "this is a value" |
| 0xb14d2c6d8d844e46824f5c5d857215b4 | 0xdaebbd1b0cab4b89acdde51f9c9a1d7c | (start=3e9, stop=7e9) | "this is a different value" |
| 0x625fa5eadfb24252b58d1eb350fa7df6 | 0x11aeeb4b743149808b53547642652f0e | (start=1e9, stop=2e9) | "this is another value" |
| 0xa5c01f0e50fe4acba065fcf474e263f5 | 0xbc0be95e3da2495391daba233f035acc | (start=2e9, stop=3e9) | "wow what a great value" |
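
To illustrate the constraints on the mandated columns, the following sketch models a single annotation row in plain Python and checks the span rules (start at least 0, stop strictly greater than start). The Annotation class and its method are hypothetical illustrations, not part of any Onda implementation; the values are taken from the first row of the example table.

```python
import uuid
from dataclasses import dataclass

NANOSECONDS_PER_SECOND = 1_000_000_000


@dataclass(frozen=True)
class Annotation:
    recording: uuid.UUID  # 128-bit UUID of the associated recording
    id: uuid.UUID         # 128-bit UUID of the annotation itself
    start: int            # offset from the recording's beginning, in nanoseconds
    stop: int             # exclusive stop offset, in nanoseconds

    def __post_init__(self):
        if self.start < 0:
            raise ValueError("span.start must be at least 0")
        if self.stop <= self.start:
            raise ValueError("span.stop must be greater than span.start")

    def duration_in_seconds(self):
        return (self.stop - self.start) / NANOSECONDS_PER_SECOND


first_row = Annotation(
    recording=uuid.UUID("b14d2c6d8d844e46824f5c5d857215b4"),
    id=uuid.UUID("81b17ea902504371954e7b8b167236a6"),
    start=5 * NANOSECONDS_PER_SECOND,
    stop=6 * NANOSECONDS_PER_SECOND,
)
```

Each UUID's bytes attribute yields the 16 bytes that a 128-bit FixedSizeBinary column would hold.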

*.onda.signals.arrow Files

A *.onda.signals.arrow file contains an Arrow table whose first 11 columns are:

  1. recording (128-bit FixedSizeBinary): The UUID identifying the recording with which the signal is associated.
  2. file_path (Utf8): A string identifying the location of the signal's associated sample data file. This string must either be a valid URI or a relative file path (specifically, relative to the location of the *.onda.signals.arrow file itself).
  3. file_format (Utf8): A string identifying the format of the signal's associated sample data file. All Onda readers/writers must support the following file formats (and may define and support additional values as desired):
    • "lpcm": signals are stored in raw interleaved LPCM format (see format description below).
    • "lpcm.zst": signals are stored in raw interleaved LPCM format and compressed via zstd.
  4. span (Struct): The signal's time span within the recording. This has the same structure as an *.onda.annotations.arrow table's span column (specified in the previous section).
  5. kind (Utf8): A string identifying the kind of signal that the row represents. Valid kind values are alphanumeric, lowercase, snake_case, and contain no whitespace, punctuation, or leading/trailing underscores.
  6. channels (List of Utf8): A list of strings where the ith element is the name of the signal's ith channel. A valid channel name either...
    • ...conforms to the same format as kind (alphanumeric, lowercase, snake_case, and containing no whitespace, punctuation, or leading/trailing underscores), or...
    • ...conforms to an a-b format where a and b are valid channel names. Furthermore, to allow arbitrary cross-signal referencing, a and/or b may be channel names from other signals contained in the recording. If this is the case, such a name must be qualified in the format signal_name.channel_name. For example, an eog signal might have a channel named left-eeg.m1 (the left eye electrode referenced to the mastoid electrode from a 10-20 EEG signal).
  7. sample_unit (Utf8): The name of the signal's canonical unit as a string. This string should conform to the same format as kind (alphanumeric, lowercase, snake_case, and containing no whitespace, punctuation, or leading/trailing underscores), be singular, and contain no abbreviations (e.g. "uV" is bad, "microvolt" is good; "l/m" is bad, "liter_per_minute" is good).
  8. sample_resolution_in_unit (Int or FloatingPoint): The signal's resolution in its canonical unit. This value, along with the signal's sample_type and sample_offset_in_unit fields, determines the signal's LPCM quantization scheme.
  9. sample_offset_in_unit (Int or FloatingPoint): The signal's zero-offset in its canonical unit (thus allowing LPCM encodings that are centered around non-zero values).
  10. sample_type (Utf8): The primitive scalar type used to encode each sample in the signal. Valid values are:
    • "int8": signed little-endian 1-byte integer
    • "int16": signed little-endian 2-byte integer
    • "int32": signed little-endian 4-byte integer
    • "int64": signed little-endian 8-byte integer
    • "uint8": unsigned little-endian 1-byte integer
    • "uint16": unsigned little-endian 2-byte integer
    • "uint32": unsigned little-endian 4-byte integer
    • "uint64": unsigned little-endian 8-byte integer
    • "float32": 32-bit floating point number
    • "float64": 64-bit floating point number
  11. sample_rate (Int or FloatingPoint): The signal's sample rate.

Note that this table may contain additional author-provided columns after the columns mandated above.
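
The naming rules for kind and channels can be sketched as a pair of validators. This is one possible reading of the rules, not a normative check: it assumes names are restricted to ASCII lowercase letters and digits separated by single underscores, and that an a-b name is valid whenever every hyphen-separated operand is a plain or signal-qualified name. The function names are hypothetical.

```python
import re

# Assumption: "alphanumeric, lowercase, snake_case" means ASCII lowercase
# letters and digits joined by single underscores, with no leading or
# trailing underscore.
_NAME = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*")


def is_valid_kind(name):
    return _NAME.fullmatch(name) is not None


def _is_valid_operand(name):
    # An operand of an a-b name is either a plain name or a qualified
    # "signal_name.channel_name" reference to another signal's channel.
    parts = name.split(".")
    return len(parts) in (1, 2) and all(is_valid_kind(part) for part in parts)


def is_valid_channel_name(name):
    operands = name.split("-")
    if len(operands) == 1:
        # No hyphen: the name must follow the same format as kind.
        return is_valid_kind(name)
    return all(_is_valid_operand(operand) for operand in operands)
```

With these definitions, "fp1" and "left-eeg.m1" are accepted, while "Fp1" (uppercase) and "_fp1" (leading underscore) are rejected.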

An example *.onda.signals.arrow table:

| recording | file_path | file_format | span | kind | channels | sample_unit | sample_resolution_in_unit | sample_offset_in_unit | sample_type | sample_rate | my_custom_value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0xb14d2c6d8d844e46824f5c5d857215b4 | "./relative/path/to/samples.lpcm" | "lpcm" | (start=10e9, stop=10900e9) | "eeg" | ["fp1", "f3", "f7", "fz", "f4", "f8"] | "microvolt" | 0.25 | 3.6 | "int16" | 256 | "this is a value" |
| 0xb14d2c6d8d844e46824f5c5d857215b4 | "s3://bucket/prefix/obj.lpcm.zst" | "lpcm.zst" | (start=0, stop=10800e9) | "ecg" | ["avl", "avr"] | "microvolt" | 0.5 | 1.0 | "int16" | 128.3 | "this is a different value" |
| 0x625fa5eadfb24252b58d1eb350fa7df6 | "s3://other-bucket/prefix/obj_with_no_extension" | "flac" | (start=100e9, stop=500e9) | "audio" | ["left", "right"] | "scalar" | 1.0 | 0.0 | "float32" | 44100 | "this is another value" |
| 0xa5c01f0e50fe4acba065fcf474e263f5 | "./another-relative/path/to/samples" | "custom_price_format:{\"parseable_json_parameter\":3}" | (start=0, stop=3600e9) | "price" | ["price"] | "dollar" | 0.01 | 0.0 | "uint32" | 50.75 | "wow what a great value" |
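
Since file_path is either a URI or a path relative to the *.onda.signals.arrow file, a reader needs to resolve it before opening the sample data. The sketch below shows one way this might be done; resolve_file_path is a hypothetical helper, and it assumes POSIX-style paths and treats any string with a URI scheme as a URI to be returned unchanged.

```python
import posixpath
from urllib.parse import urlparse


def resolve_file_path(file_path, signals_table_path):
    """Resolve a signal's file_path against the location of its signals table."""
    # Assumption: the presence of a scheme (e.g. "s3") marks the value as a URI.
    if urlparse(file_path).scheme:
        return file_path
    # Otherwise the path is relative to the directory containing the
    # *.onda.signals.arrow file itself.
    base_directory = posixpath.dirname(signals_table_path)
    return posixpath.normpath(posixpath.join(base_directory, file_path))
```

For example, resolving "./relative/path/to/samples.lpcm" against a table stored at "/data/ds/example.onda.signals.arrow" gives "/data/ds/relative/path/to/samples.lpcm".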

Sample Data Files

All sample data is encoded as specified by the corresponding signal's sample_type, sample_resolution_in_unit, and sample_offset_in_unit fields, serialized to raw LPCM format, and formatted as specified by the signal's file_format field.

While Onda explicitly supports arbitrary choice of file format for serialized sample data via the file_format field, Onda reader/writer implementations should support (de)serialization of sample data from any implementation-supported format into the following standardized interleaved LPCM representation:

Given an n-channel signal, the byte offset for the ith channel value in the jth multichannel sample is given by ((i - 1) + (j - 1) * n) * byte_width(signal.sample_type). This layout can be expressed in the following table (where w = byte_width(signal.sample_type)):

| Byte Offset | Value |
|---|---|
| 0 | 1st channel value for 1st sample |
| w | 2nd channel value for 1st sample |
| ... | ... |
| (n - 1) * w | nth channel value for 1st sample |
| (n + 0) * w | 1st channel value for 2nd sample |
| (n + 1) * w | 2nd channel value for 2nd sample |
| ... | ... |
| (2*n - 1) * w | nth channel value for 2nd sample |
| (2*n + 0) * w | 1st channel value for 3rd sample |
| (2*n + 1) * w | 2nd channel value for 3rd sample |
| ... | ... |
| (3*n - 1) * w | nth channel value for 3rd sample |
| ... | ... |
| ((i - 1) + (j - 1) * n) * w | ith channel value for jth sample |
| ... | ... |

Values are stored in little-endian format.
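
The layout above can be sketched with Python's standard struct module. The names STRUCT_FORMATS, byte_width, byte_offset, and read_value are illustrative helpers rather than part of any Onda implementation; i and j are 1-based, matching the formula in the text.

```python
import struct

# struct format for each Onda sample_type; "<" selects little-endian
# byte order with standard sizes.
STRUCT_FORMATS = {
    "int8": "<b", "int16": "<h", "int32": "<i", "int64": "<q",
    "uint8": "<B", "uint16": "<H", "uint32": "<I", "uint64": "<Q",
    "float32": "<f", "float64": "<d",
}


def byte_width(sample_type):
    return struct.calcsize(STRUCT_FORMATS[sample_type])


def byte_offset(i, j, n, sample_type):
    """Offset of the ith channel value in the jth sample of an n-channel signal."""
    return ((i - 1) + (j - 1) * n) * byte_width(sample_type)


def read_value(data, i, j, n, sample_type):
    """Read one encoded value from a buffer of raw interleaved LPCM bytes."""
    offset = byte_offset(i, j, n, sample_type)
    return struct.unpack_from(STRUCT_FORMATS[sample_type], data, offset)[0]


# Two channels, three samples, int16: values are interleaved sample by sample.
data = struct.pack("<6h", 10, -10, 20, -20, 30, -30)
```

Here read_value(data, 2, 3, 2, "int16") reads the 2nd channel of the 3rd sample, which sits at byte offset 10.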

An individual value in a multichannel sample can be converted to its encoded representation from its canonical unit representation via:

encoded_value = (decoded_value - sample_offset_in_unit) / sample_resolution_in_unit

where the division is followed/preceded by whatever quantization strategy is chosen by the user (e.g. rounding/truncation/dithering etc). Complementarily, an individual value in a multichannel sample can be converted ("decoded") from its encoded representation to its canonical unit representation via:

decoded_value = (encoded_value * sample_resolution_in_unit) + sample_offset_in_unit
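
As a worked sketch of the two formulas, the functions below encode and decode a single value. The quantization strategy is left to the user by the format, so the choices made here are assumptions for illustration only: rounding to the nearest integer and clamping to the range of the sample type. The names encode, decode, and INTEGER_RANGES are hypothetical.

```python
# Representable ranges for two integer sample types (an illustrative subset).
INTEGER_RANGES = {
    "int16": (-(2**15), 2**15 - 1),
    "uint32": (0, 2**32 - 1),
}


def encode(decoded_value, resolution, offset, sample_type="int16"):
    """Convert a value in canonical units to its encoded integer representation."""
    low, high = INTEGER_RANGES[sample_type]
    # Assumed quantization strategy: Python's round() (ties go to the nearest
    # even integer), followed by clamping to the sample type's range.
    quantized = round((decoded_value - offset) / resolution)
    return max(low, min(high, quantized))


def decode(encoded_value, resolution, offset):
    """Convert an encoded value back to canonical units."""
    return encoded_value * resolution + offset
```

Using the eeg row of the example signals table (resolution 0.25, offset 3.6, int16), a value of 10.0 microvolts encodes to 26, which decodes to approximately 10.1 microvolts; the difference is the quantization error, which stays within half the resolution for values inside the representable range.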

Potential Alternatives

In this section, we describe several alternative technologies/solutions considered during Onda's design.

  • HDF5: HDF5 was a candidate for Onda's de facto underlying storage layer. While featureful, ubiquitous, and technically based on an open standard, HDF5 is infamous for being a hefty dependency with a fairly complex reference implementation. While HDF5 solves many problems inherent to filesystem-based storage, most use cases for Onda involve storing large binary blobs in domain-specific formats that already exist quite naturally as files on a filesystem. Though it was decided that Onda should not explicitly depend on HDF5, nothing inherently technically precludes Onda dataset content from being stored in HDF5 in the same manner as any other similarly structured filesystem directory. For practical purposes, however, Onda readers/writers may not necessarily automatically be able to read such a dataset unless they explicitly feature HDF5 support (since HDF5 support isn't mandated by the format).

  • Avro: Avro was originally considered as an alternative to Onda's current approach (associating one sample data file per row in *.onda.signals.arrow). Avro's consideration was initially motivated by Uber's use of the format in a manner that was extremely similar to an early Onda prototype's use of NPY. Unfortunately, it seems that most of the well-maintained tooling for Avro is Spark-centric; in fact, the overarching Avro project has struggled (until very recently) to keep a dedicated set of maintainers engaged with the project. Avro's most desirable features, from the perspective of Onda, were its compression and "random" row access. However, early tests indicated that neither of those features worked particularly well for signals of interest compared to domain-specific seekable compression formats like FLAC.

  • EDF/MEF/etc.: Onda was originally motivated by bulk electrophysiological dataset manipulation, a domain in which there are many different recording file formats that are all generally designed to support a one-file-per-recording use case and are constrained to certain domain-specific assumptions (e.g. specific bit depth assumptions, annotations stored within signal artifacts, etc.). Technically, since Onda itself is agnostic to choice of file formats used for signal serialization, one could store Onda sample data in EDF/MEF.

  • BIDS: BIDS is an alternative option for storing neuroscience datasets. As mentioned above, Onda's original motivation is electrophysiological dataset manipulation, so BIDS appeared to be a highly relevant candidate. Unfortunately, BIDS restricts EEG data to very specific file formats and also does not account for the plurality of LPCM-encodable signals that Onda seeks to handle generically.

  • MessagePack: Before v0.5.0, the Onda format used MessagePack to store all signal/annotation metadata. See this issue for background on the switch to Arrow.

  • JSON: In early Onda implementations, JSON was used to serialize signal/annotation metadata. While JSON has the advantage of being ubiquitous/simple/flexible/human-readable, the performance overhead of textual decoding/encoding was greater than desired for datasets with lots of annotations. In comparison, switching to MessagePack yielded a ~3x performance increase in (de)serialization for practical usage. The subsequent switch from MessagePack to Arrow in v0.5.0 of the Onda format yielded even greater (de)serialization improvements.

  • BSON: BSON was considered as a potential serialization format for signal/annotation metadata. Before v0.5.0 of the Onda format, MessagePack was chosen over BSON due to the latter's relative complexity compared to the former. After v0.5.0 of the Onda format, BSON remains less preferable than Arrow from a tabular/columnar data storage perspective.