THIS REPO IS ARCHIVED, AND ONLY AVAILABLE FOR HISTORICAL PURPOSES. YOU'RE PROBABLY LOOKING FOR https://github.com/beacon-biosignals/Onda.jl.
Onda is a lightweight format for storing and manipulating sets of multi-sensor, multi-channel, LPCM-encodable, annotated, time-series recordings.
The latest tagged version is v0.5.1.
This document contains:
- Onda's Terminology
- Onda's Design Principles
- Onda's Specification
- Potential Alternative Technologies/Approaches
Implementations:
- Julia: Onda.jl (https://github.com/beacon-biosignals/Onda.jl)
This document uses the term...
- ..."LPCM" to refer to linear pulse code modulation, a form of signal encoding where multivariate waveforms are digitized as a series of samples uniformly spaced over time and quantized to a uniformly spaced grid.
- ..."signal" to refer to the digitized output of a process. A signal is comprised of metadata (e.g. LPCM encoding, channel information, sample data path/format information, etc.) and associated multi-channel sample data.
- ..."recording" to refer to a collection of one or more signals recorded simultaneously over some time period.
- ..."annotation" to refer to a piece of (meta)data associated with a specific time span within a specific recording.
Onda is useful...
- ...when segments of a signal can fit in memory simultaneously, but an entire signal cannot.
- ...when each signal in each recording in your dataset can fit in memory, but not all signals in each recording can fit in memory simultaneously.
- ...when each recording in your dataset can fit in memory, but not all recordings in your dataset can fit in memory simultaneously.
- ...when your dataset's signals benefit from sensor-specific encodings/compression codecs.
- ...as an intermediate target format for wrangling unstructured signal data before bulk ingestion into a larger data store.
- ...as an intermediate target format for local experimentation after bulk retrieval from a larger data store.
- ...as a format for sharing datasets comprised of several gigabytes to several terabytes of signal data.
- ...as a format for sharing datasets comprised of hundreds to hundreds of thousands of recordings.
Onda's design must...
- ...depend only upon technologies with standardized, implementation-agnostic specifications that are well-used across multiple application domains.
- ...support recordings where each signal in the recording may have a unique channel layout, physical unit resolution, bit depth and sample rate.
- ...be well-suited for ingestion into/retrieval from...
    - ...popular distributed analytics tools (e.g. Spark, TensorFlow).
    - ...traditional databases (e.g. PostgreSQL, Cassandra).
    - ...object-based storage systems (e.g. S3, GCP Cloud Storage).
- ...enable metadata, annotations etc. to be stored and processed separately from raw sample data without significant communication overhead.
- ...enable extensibility without sacrificing interpretability. New signal encodings, annotations, sample data file formats, etc. should all be user-definable by design.
- ...be simple enough that a decent programmer (with Google access) should be able to fully interpret (and write performant parsers for) an Onda dataset without ever reading Onda documentation.
Onda is not...
- ...a sample data file format. Onda allows dataset authors to utilize whatever file format is most appropriate for a given signal's sample data, as long as the author provides a mechanism to deserialize sample data from that format to a standardized interleaved LPCM representation.
- ...a transactional database. The majority of an Onda dataset's mandated metadata is stored in tabular manifests containing recording information, signal descriptions, annotations etc. This simple structure is tailored towards Onda's target regimes (see above), and is not intended to serve as a persistent backend for external services/applications.
- ...an analytics platform. Onda seeks to provide a data model that is purposefully structured to enable various sorts of analysis, but the format itself does not mandate/describe any specific implementation of analysis utilities.
This specification document is versioned in accordance with semantic versioning; version numbers take the form `major.minor.patch` where...

- ...increments to `major` correspond to changes/additions that are likely to break existing Onda readers
- ...increments to `minor` correspond to changes/additions that are unlikely to break existing Onda readers
- ...increments to `patch` correspond to purely textual changes, e.g. clarifying a phrase or fixing a typo
Note that, in accordance with the semantic versioning specification, minor increments in the `0.y.z` release series may include breaking changes:

> Major version zero (0.y.z) is for initial development. Anything MAY change at any time. The public API SHOULD NOT be considered stable.
The Onda format describes three different types of files:
- `*.onda.annotations.arrow` files: Arrow files that contain annotation (meta)data associated with a dataset.
- `*.onda.signals.arrow` files: Arrow files that contain signal metadata (e.g. LPCM encoding, channel information, sample data path/format, etc.) required to find and read sample data files associated with a dataset.
- sample data files: Files of user-defined formats that store the sample data associated with signals.
Note that `*.onda.annotations.arrow` files and `*.onda.signals.arrow` files are largely orthogonal to one another - there's nothing inherent to the Onda format that prevents dataset producers/consumers from separately constructing/manipulating/transferring/analyzing these files. Furthermore, there's nothing that prevents dataset producers/consumers from working with multiple files of the same type referencing the same set of recordings (e.g. splitting all of a dataset's annotations across multiple `*.onda.annotations.arrow` files).
The Arrow tables contained in `*.onda.annotations.arrow` and `*.onda.signals.arrow` files must have attached custom metadata containing the key `"onda_format_version"` whose value specifies the version of the Onda format that an Onda reader must support in order to properly read the file. This string takes the form `"vM.m.p"` where `M` is a major version number, `m` is a minor version number, and `p` is a patch version number.
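For illustration, here is a minimal sketch (using Python and pyarrow; the file name is hypothetical) of how a reader might check this metadata key before consuming a file:

```python
import pyarrow as pa

# Open an Onda Arrow file and inspect its custom schema metadata.
with pa.OSFile("example.onda.signals.arrow", "rb") as source:
    reader = pa.ipc.open_file(source)
    metadata = reader.schema.metadata or {}
    version = metadata[b"onda_format_version"].decode("utf-8")  # e.g. "v0.5.1"

# Parse "vM.m.p" and compare against the version(s) this reader supports.
major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
if (major, minor) > (0, 5):
    raise ValueError(f"unsupported Onda format version: {version}")
```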
Each of the aforementioned file types is further specified in the following sections. These sections refer to the logical types defined by the Arrow specification. Onda reader/writer implementations may additionally employ Arrow extension types that directly alias a column's specified logical type in order to support application-level features (first-class UUID support, custom `file_path` type support, etc.).
An `*.onda.annotations.arrow` file contains an Arrow table whose first 3 columns are:

- `recording` (128-bit `FixedSizeBinary`): The UUID identifying the recording with which the annotation is associated.
- `id` (128-bit `FixedSizeBinary`): The UUID identifying the annotation.
- `span` (`Struct`): The annotation's time span within the recording. This structure has two fields:
    - `start` (`Duration` w/ `NANOSECOND` unit): The start offset in nanoseconds from the beginning of the recording. The minimum possible value is `0`.
    - `stop` (`Duration` w/ `NANOSECOND` unit): The stop offset in nanoseconds (exclusive) from the beginning of the recording. This value must be greater than `start`.
Note that this table may contain additional author-provided columns following the columns mandated above.
An example of an `*.onda.annotations.arrow` table (whose `my_custom_value` column happens to contain strings):
| recording | id | span | my_custom_value |
|---|---|---|---|
| 0xb14d2c6d8d844e46824f5c5d857215b4 | 0x81b17ea902504371954e7b8b167236a6 | (start=5e9, stop=6e9) | "this is a value" |
| 0xb14d2c6d8d844e46824f5c5d857215b4 | 0xdaebbd1b0cab4b89acdde51f9c9a1d7c | (start=3e9, stop=7e9) | "this is a different value" |
| 0x625fa5eadfb24252b58d1eb350fa7df6 | 0x11aeeb4b743149808b53547642652f0e | (start=1e9, stop=2e9) | "this is another value" |
| 0xa5c01f0e50fe4acba065fcf474e263f5 | 0xbc0be95e3da2495391daba233f035acc | (start=2e9, stop=3e9) | "wow what a great value" |
A `*.onda.signals.arrow` file contains an Arrow table whose first 11 columns are:

- `recording` (128-bit `FixedSizeBinary`): The UUID identifying the recording with which the signal is associated.
- `file_path` (`Utf8`): A string identifying the location of the signal's associated sample data file. This string must either be a valid URI or a relative file path (specifically, relative to the location of the `*.onda.signals.arrow` file itself).
- `file_format` (`Utf8`): A string identifying the format of the signal's associated sample data file. All Onda readers/writers must support the following file formats (and may define and support additional values as desired):
    - `"lpcm"`: signals are stored in raw interleaved LPCM format (see format description below).
    - `"lpcm.zst"`: signals are stored in raw interleaved LPCM format and compressed via `zstd`.
- `span` (`Struct`): The signal's time span within the recording. This has the same structure as an `*.onda.annotations.arrow` table's `span` column (specified in the previous section).
- `kind` (`Utf8`): A string identifying the kind of signal that the row represents. Valid `kind` values are alphanumeric, lowercase, `snake_case`, and contain no whitespace, punctuation, or leading/trailing underscores.
- `channels` (`List` of `Utf8`): A list of strings where the `i`th element is the name of the signal's `i`th channel. A valid channel name...
    - ...conforms to the same format as `kind` (alphanumeric, lowercase, `snake_case`, and contains no whitespace, punctuation, or leading/trailing underscores).
    - ...or conforms to an `a-b` format where `a` and `b` are valid channel names. Furthermore, to allow arbitrary cross-signal referencing, `a` and/or `b` may be channel names from other signals contained in the recording. If this is the case, such a name must be qualified in the format `signal_name.channel_name`. For example, an `eog` signal might have a channel named `left-eeg.m1` (the left eye electrode referenced to the mastoid electrode from a 10-20 EEG signal).
- `sample_unit` (`Utf8`): The name of the signal's canonical unit as a string. This string should conform to the same format as `kind` (alphanumeric, lowercase, `snake_case`, and contain no whitespace, punctuation, or leading/trailing underscores), should be singular, and should not contain abbreviations (e.g. `"uV"` is bad, `"microvolt"` is good; `"l/m"` is bad, `"liter_per_minute"` is good).
- `sample_resolution_in_unit` (`Int` or `FloatingPoint`): The signal's resolution in its canonical unit. This value, along with the signal's `sample_type` and `sample_offset_in_unit` fields, determines the signal's LPCM quantization scheme.
- `sample_offset_in_unit` (`Int` or `FloatingPoint`): The signal's zero-offset in its canonical unit (thus allowing LPCM encodings that are centered around non-zero values).
- `sample_type` (`Utf8`): The primitive scalar type used to encode each sample in the signal. Valid values are:
    - `"int8"`: signed little-endian 1-byte integer
    - `"int16"`: signed little-endian 2-byte integer
    - `"int32"`: signed little-endian 4-byte integer
    - `"int64"`: signed little-endian 8-byte integer
    - `"uint8"`: unsigned little-endian 1-byte integer
    - `"uint16"`: unsigned little-endian 2-byte integer
    - `"uint32"`: unsigned little-endian 4-byte integer
    - `"uint64"`: unsigned little-endian 8-byte integer
    - `"float32"`: 32-bit floating point number
    - `"float64"`: 64-bit floating point number
- `sample_rate` (`Int` or `FloatingPoint`): The signal's sample rate.
Note that this table may contain additional author-provided columns after the columns mandated above.
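For illustration, here is a minimal sketch (using Python and pyarrow) of a schema containing the mandated columns. The concrete `float64` choices below are just one option; the spec permits any Arrow integer or floating point type for the `Int`/`FloatingPoint` columns:

```python
import pyarrow as pa

# Same span struct as in *.onda.annotations.arrow files.
span_type = pa.struct([("start", pa.duration("ns")), ("stop", pa.duration("ns"))])

signals_schema = pa.schema(
    [
        ("recording", pa.binary(16)),              # 128-bit FixedSizeBinary
        ("file_path", pa.utf8()),
        ("file_format", pa.utf8()),
        ("span", span_type),
        ("kind", pa.utf8()),
        ("channels", pa.list_(pa.utf8())),         # List of Utf8
        ("sample_unit", pa.utf8()),
        ("sample_resolution_in_unit", pa.float64()),
        ("sample_offset_in_unit", pa.float64()),
        ("sample_type", pa.utf8()),
        ("sample_rate", pa.float64()),
    ],
    metadata={"onda_format_version": "v0.5.1"},
)
```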
An example `*.onda.signals.arrow` table:
| recording | file_path | file_format | span | kind | channels | sample_unit | sample_resolution_in_unit | sample_offset_in_unit | sample_type | sample_rate | my_custom_value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0xb14d2c6d8d844e46824f5c5d857215b4 | "./relative/path/to/samples.lpcm" | "lpcm" | (start=10e9, stop=10900e9) | "eeg" | ["fp1", "f3", "f7", "fz", "f4", "f8"] | "microvolt" | 0.25 | 3.6 | "int16" | 256 | "this is a value" |
| 0xb14d2c6d8d844e46824f5c5d857215b4 | "s3://bucket/prefix/obj.lpcm.zst" | "lpcm.zst" | (start=0, stop=10800e9) | "ecg" | ["avl", "avr"] | "microvolt" | 0.5 | 1.0 | "int16" | 128.3 | "this is a different value" |
| 0x625fa5eadfb24252b58d1eb350fa7df6 | "s3://other-bucket/prefix/obj_with_no_extension" | "flac" | (start=100e9, stop=500e9) | "audio" | ["left", "right"] | "scalar" | 1.0 | 0.0 | "float32" | 44100 | "this is another value" |
| 0xa5c01f0e50fe4acba065fcf474e263f5 | "./another-relative/path/to/samples" | "custom_price_format:{\"parseable_json_parameter\":3}" | (start=0, stop=3600e9) | "price" | ["price"] | "dollar" | 0.01 | 0.0 | "uint32" | 50.75 | "wow what a great value" |
All sample data is encoded as specified by the corresponding signal's `sample_type`, `sample_resolution_in_unit`, and `sample_offset_in_unit` fields, serialized to raw LPCM format, and formatted as specified by the signal's `file_format` field.
While Onda explicitly supports arbitrary choice of file format for serialized sample data via the `file_format` field, Onda reader/writer implementations should support (de)serialization of sample data from any implementation-supported format into the following standardized interleaved LPCM representation:
Given an `n`-channel signal, the byte offset for the `i`th channel value in the `j`th multichannel sample is given by `((i - 1) + (j - 1) * n) * byte_width(signal.sample_type)`. This layout can be expressed in the following table (where `w = byte_width(signal.sample_type)`):
| Byte Offset | Value |
|---|---|
| `0` | 1st channel value for 1st sample |
| `w` | 2nd channel value for 1st sample |
| ... | ... |
| `(n - 1) * w` | `n`th channel value for 1st sample |
| `(n + 0) * w` | 1st channel value for 2nd sample |
| `(n + 1) * w` | 2nd channel value for 2nd sample |
| ... | ... |
| `(2*n - 1) * w` | `n`th channel value for 2nd sample |
| `(2*n + 0) * w` | 1st channel value for 3rd sample |
| `(2*n + 1) * w` | 2nd channel value for 3rd sample |
| ... | ... |
| `(3*n - 1) * w` | `n`th channel value for 3rd sample |
| ... | ... |
| `((i - 1) + (j - 1) * n) * w` | `i`th channel value for `j`th sample |
| ... | ... |
Values are stored in little-endian format.
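As a sketch of deserializing this layout (using Python with NumPy, plus the `zstandard` package for the `"lpcm.zst"` case; the helper name is hypothetical), an implementation might do something like:

```python
import io
import numpy as np
import zstandard  # assumption: "lpcm.zst" payloads are standard zstd frames

def read_lpcm_samples(path, file_format, sample_type, channel_count):
    """Deserialize a sample data file into a (sample_count, channel_count) array.

    Hypothetical helper; only handles the "lpcm" and "lpcm.zst" formats
    mandated by the spec.
    """
    with open(path, "rb") as fh:
        raw = fh.read()
    if file_format == "lpcm.zst":
        raw = zstandard.ZstdDecompressor().stream_reader(io.BytesIO(raw)).read()
    elif file_format != "lpcm":
        raise ValueError(f"unsupported file_format: {file_format}")
    # Onda's sample_type names map directly onto little-endian NumPy dtypes.
    dtype = np.dtype(sample_type).newbyteorder("<")
    flat = np.frombuffer(raw, dtype=dtype)
    # Interleaved layout: the channel index varies fastest, so each row of the
    # reshaped array is one multichannel sample.
    return flat.reshape(-1, channel_count)
```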
An individual value in a multichannel sample can be converted to its encoded representation from its canonical unit representation via:
`encoded_value = (decoded_value - sample_offset_in_unit) / sample_resolution_in_unit`
where the division is followed/preceded by whatever quantization strategy is chosen by the user (e.g. rounding/truncation/dithering etc). Complementarily, an individual value in a multichannel sample can be converted ("decoded") from its encoded representation to its canonical unit representation via:
`decoded_value = (encoded_value * sample_resolution_in_unit) + sample_offset_in_unit`
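A minimal sketch of these conversions (Python/NumPy; the helper names are hypothetical, and rounding is just one valid quantization choice):

```python
import numpy as np

def encode(decoded_values, resolution, offset, sample_type):
    """Quantize canonical-unit values into the signal's encoded representation.

    Uses rounding as the quantization strategy; the spec leaves the choice of
    rounding/truncation/dithering to the user.
    """
    encoded = np.round((np.asarray(decoded_values) - offset) / resolution)
    return encoded.astype(np.dtype(sample_type).newbyteorder("<"))

def decode(encoded_values, resolution, offset):
    """Convert encoded sample values back to the signal's canonical unit."""
    return np.asarray(encoded_values) * resolution + offset

# e.g. with the example "eeg" signal above (resolution=0.25, offset=3.6, "int16"):
# encode([12.35], 0.25, 3.6, "int16") -> array([35], dtype=int16)
# decode([35], 0.25, 3.6)             -> array([12.35])
```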
In this section, we describe several alternative technologies/solutions considered during Onda's design.
- HDF5: HDF5 was a candidate for Onda's de facto underlying storage layer. While featureful, ubiquitous, and technically based on an open standard, HDF5 is infamous for being a hefty dependency with a fairly complex reference implementation. While HDF5 solves many problems inherent to filesystem-based storage, most use cases for Onda involve storing large binary blobs in domain-specific formats that already exist quite naturally as files on a filesystem. Though it was decided that Onda should not explicitly depend on HDF5, nothing technically precludes Onda dataset content from being stored in HDF5 in the same manner as any other similarly structured filesystem directory. For practical purposes, however, Onda readers/writers may not be able to read such a dataset automatically unless they explicitly feature HDF5 support (since HDF5 support isn't mandated by the format).
- Avro: Avro was originally considered as an alternative to Onda's current approach (associating one sample data file per row in `*.onda.signals.arrow`). Avro's consideration was initially motivated by Uber's use of the format in a manner that was extremely similar to an early Onda prototype's use of NPY. Unfortunately, it seems that most of the well-maintained tooling for Avro is Spark-centric; in fact, the overarching Avro project has struggled (until very recently) to keep a dedicated set of maintainers engaged with the project. Avro's most desirable features, from the perspective of Onda, were its compression and "random" row access. However, early tests indicated that neither of those features worked particularly well for signals of interest compared to domain-specific seekable compression formats like FLAC.
- EDF/MEF/etc.: Onda was originally motivated by bulk electrophysiological dataset manipulation, a domain in which there are many different recording file formats that are all generally designed to support a one-file-per-recording use case and are constrained to certain domain-specific assumptions (e.g. specific bit depth assumptions, annotations stored within signal artifacts, etc.). Technically, since Onda itself is agnostic to the choice of file formats used for signal serialization, one could store Onda sample data in EDF/MEF.
- BIDS: BIDS is an alternative option for storing neuroscience datasets. As mentioned above, Onda's original motivation is electrophysiological dataset manipulation, so BIDS appeared to be a highly relevant candidate. Unfortunately, BIDS restricts EEG data to very specific file formats and also does not account for the plurality of LPCM-encodable signals that Onda seeks to handle generically.
- MessagePack: Before v0.5.0, the Onda format used MessagePack to store all signal/annotation metadata. See this issue for background on the switch to Arrow.
- JSON: In early Onda implementations, JSON was used to serialize signal/annotation metadata. While JSON has the advantage of being ubiquitous/simple/flexible/human-readable, the performance overhead of textual decoding/encoding was greater than desired for datasets with lots of annotations. In comparison, switching to MessagePack yielded a ~3x performance increase in (de)serialization for practical usage. The subsequent switch from MessagePack to Arrow in v0.5.0 of the Onda format yielded even greater (de)serialization improvements.
- BSON: BSON was considered as a potential serialization format for signal/annotation metadata. Before v0.5.0 of the Onda format, MessagePack was chosen over BSON due to the latter's relative complexity compared to the former. After v0.5.0 of the Onda format, BSON remains less preferable than Arrow from a tabular/columnar data storage perspective.