Merge pull request #222 from maffettone/bulk-external-resources
Add a new document type for "bulk" external resources with no predetermined shape.
tacaswell authored Nov 3, 2022
2 parents 772ae3c + 541e914 commit eac55ca
Showing 8 changed files with 426 additions and 12 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -79,4 +79,7 @@ target/
.vscode/*

#Ipython Notebook
.ipynb_checkpoints
.ipynb_checkpoints

# pre-commit
.pre-commit-config.yaml
90 changes: 88 additions & 2 deletions docs/source/data-model.rst
@@ -24,7 +24,7 @@ with useful components.
Overview
========

The data model is composed of six types of Documents, which in Python are
The data model is composed of eight types of Documents, which in Python are
represented as dictionaries but could be represented as nested mappings (e.g.
JSON) in any language. Each document class has a defined, but flexible, schema.

@@ -35,8 +35,13 @@ JSON) in any language. Each document class has a defined, but flexible, schema.
* Event Descriptor --- Metadata about a series of Events. Envision
  richly-detailed column headings in a table, encompassing physical units,
  hardware configuration information, etc.
* Resource --- A pointer to an external file (or resource in general).
* Resource --- A pointer to an external file (or resource in general) that has
  predictable and fixed dimensionality.
* Datum --- A pointer to a specific slice of data within a Resource.
* Stream Resource (Experimental) --- A pointer to an external resource that
  contains a stream of data without restriction on the length of the stream.
  These are resources with a known 'column' dimension but an unknown number of
  rows (e.g., time series, point detectors); see the sketch after this list.
* Stream Datum (Experimental) --- A pointer to a specific slice of data within
  a Stream Resource.
* Run Stop Document --- Everything that we can only know at the very end, such
  as the time it ended and the exit status (succeeded, aborted, failed due to
  error).
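
For orientation, here is an illustrative sketch of the order in which these
documents might appear for a run that uses the experimental stream documents.
It is not normative: every dictionary below is heavily abbreviated, uses
placeholder uids, and would not validate against the formal schemas.

.. code-block:: python

    # Illustrative (name, document) pairs in the order a consumer might
    # receive them; most required fields are omitted for brevity.
    docs = [
        ('start', {'uid': 'start-uid'}),
        ('descriptor', {'uid': 'desc-uid', 'run_start': 'start-uid'}),
        ('stream_resource', {'uid': 'sres-uid', 'run_start': 'start-uid',
                             'spec': 'SOME_SPEC',
                             'stream_names': ['point_det']}),
        ('stream_datum', {'resource': 'sres-uid', 'datum_id': 'sres-uid/0',
                          'block_idx': 0, 'event_count': 1}),
        ('stop', {'uid': 'stop-uid', 'run_start': 'start-uid'}),
    ]
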
@@ -463,6 +468,87 @@ Formal Datum Page schema:
.. literalinclude:: ../../event_model/schemas/datum_page.json

.. _stream_resource:

Stream Resource Document (Experimental)
----------------------------------------

See :doc:`external` for details on the role Stream Resource documents play in
referencing external assets that are natively ragged, such as single-photon
detectors, or assets where there are many relatively small data sets (e.g.
scanned fluorescence data).

Minimal nontrivial valid example:

.. code-block:: python

    # 'Stream Resource' document
    {'path_semantics': 'posix',
     'resource_kwargs': {},
     'resource_path': '/local/path/subdirectory/data_file',
     'root': '/local/path/',
     'run_start': '10bf6945-4afd-43ca-af36-6ad8f3540bcd',
     'spec': 'SOME_SPEC',
     'stream_names': ['point_det'],
     'uid': '272132cf-564f-428f-bf6b-149ee4287024'}
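
Assuming this changeset registers the new document types in
``event_model.DocumentNames`` and ``event_model.schema_validators``, as is
done for the existing types, the document above can be checked against the
formal schema in the usual way. A minimal sketch:

.. code-block:: python

    import event_model

    doc = {'path_semantics': 'posix',
           'resource_kwargs': {},
           'resource_path': '/local/path/subdirectory/data_file',
           'root': '/local/path/',
           'run_start': '10bf6945-4afd-43ca-af36-6ad8f3540bcd',
           'spec': 'SOME_SPEC',
           'stream_names': ['point_det'],
           'uid': '272132cf-564f-428f-bf6b-149ee4287024'}

    # Raises jsonschema.ValidationError if the document does not conform.
    event_model.schema_validators[
        event_model.DocumentNames.stream_resource].validate(doc)
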
Typical example:

.. code-block:: python

    # 'Stream Resource' document
    {'spec': 'AD_HDF5',
     'root': '/GPFS/DATA/Andor/',
     'resource_path': '2020/01/03/8ff08ff9-a2bf-48c3-8ff3-dcac0f309d7d.h5',
     'resource_kwargs': {'frame_per_point': 1},
     'path_semantics': 'posix',
     'stream_names': ['point_det'],
     'uid': '3b300e6f-b431-4750-a635-5630d15c81a8',
     'run_start': '10bf6945-4afd-43ca-af36-6ad8f3540bcd'}
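
As with ordinary Resource documents, ``root`` and ``resource_path`` together
give the location of the asset; with ``'posix'`` path semantics this is a
plain path join. A small sketch:

.. code-block:: python

    import os

    resource = {'root': '/GPFS/DATA/Andor/',
                'resource_path': '2020/01/03/8ff08ff9-a2bf-48c3-8ff3-dcac0f309d7d.h5',
                'path_semantics': 'posix'}

    # '/GPFS/DATA/Andor/2020/01/03/8ff08ff9-a2bf-48c3-8ff3-dcac0f309d7d.h5'
    full_path = os.path.join(resource['root'], resource['resource_path'])
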
Formal schema:

.. literalinclude:: ../../event_model/schemas/stream_resource.json

.. _stream_datum:

Stream Datum Document (Experimental)
-------------------------------------

See :doc:`external` for details on the role Stream Datum documents play in
referencing external assets that are natively ragged, such as single-photon
detectors, or assets where there are many relatively small data sets (e.g.
scanned fluorescence data).

Minimal nontrivial valid example:

.. code-block:: python

    # 'Stream Datum' document
    {'resource': '272132cf-564f-428f-bf6b-149ee4287024',  # foreign key
     'datum_kwargs': {},  # format-specific parameters
     'datum_id': '272132cf-564f-428f-bf6b-149ee4287024/1',
     'block_idx': 0,
     'event_count': 1}
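
The ``resource`` field is the foreign key tying a Stream Datum back to the
Stream Resource it slices into. A minimal sketch of that invariant, using the
minimal examples above:

.. code-block:: python

    stream_resource_uid = '272132cf-564f-428f-bf6b-149ee4287024'
    stream_datum = {'resource': '272132cf-564f-428f-bf6b-149ee4287024',
                    'datum_kwargs': {},
                    'datum_id': '272132cf-564f-428f-bf6b-149ee4287024/1',
                    'block_idx': 0,
                    'event_count': 1}

    # Each Stream Datum must name the uid of its Stream Resource.
    assert stream_datum['resource'] == stream_resource_uid
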
Typical example:

.. code-block:: python

    # 'Stream Datum' document
    {'resource': '3b300e6f-b431-4750-a635-5630d15c81a8',
     'datum_kwargs': {'index': 3},
     'datum_id': '3b300e6f-b431-4750-a635-5630d15c81a8/3',
     'block_idx': 0,
     'event_count': 5,
     'event_offset': 14}
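
A sketch of how a consumer might use these fields, under the assumption that
``event_offset`` is the sequence position of the first event covered by this
block and ``event_count`` is the number of events it covers; this reading of
the fields is an illustration, not part of the schema:

.. code-block:: python

    stream_datum = {'block_idx': 0, 'event_count': 5, 'event_offset': 14}

    # Event positions covered by this block, under the assumption above.
    first = stream_datum['event_offset']
    covered = range(first, first + stream_datum['event_count'])
    print(list(covered))  # [14, 15, 16, 17, 18]
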
It is an implementation detail that ``datum_id`` is often formatted as
``{resource}/{counter}`` but this should not be considered part of the schema.
Formal schema:

.. literalinclude:: ../../event_model/schemas/stream_datum.json

.. _bulk_events:

"Bulk Events" Document (DEPRECATED)
2 changes: 1 addition & 1 deletion docs/source/use-cases.rst
@@ -3,7 +3,7 @@ Event Model Patterns
============================
When implementing a system with the Event Model for a particular use case (technique, scan type, etc.), many design choices can be made. For example: how many streams you define (through Event Descriptor documents), what events you put into those streams, and how data points are stored in an event. Here, we present a use case and a potential design, discussing the pros and cons of different options.

To further complicate things, we can consider that a document stream might be be optimized for very different scenarios within the same technique. For example, for the sample scan of the same sample, a document stream might read as the scan as it is being run for the purpose of providing in-scan feedback to a user or beamline analysis tool. For the same scan, a docstream might be serialized into a Mongo database and read back out by analysis and visualiztion tools. In these different uses of data from the same scan, the level of granularity that one puts intto an Event Document might be very different. The streaming consumer might require a large number of very small granular events in order to quickly make decisions that affect the course of the scan. On the other hand, MongoDB document retrieval is much more efficient with a small number of larger documents, and a small number of events that each contain data from multiple time steps might be preferrable.
To further complicate things, a document stream might be optimized for very different scenarios within the same technique. For example, for the same scan of the same sample, a document stream might be read live, as the scan is being run, to provide in-scan feedback to a user or a beamline analysis tool. For the same scan, a document stream might be serialized into a Mongo database and read back out by analysis and visualization tools. In these different uses of data from the same scan, the level of granularity that one puts into an Event Document might be very different. The streaming consumer might require a large number of very small, granular events in order to quickly make decisions that affect the course of the scan. On the other hand, MongoDB document retrieval is much more efficient with a small number of larger documents, and a small number of events that each contain data from multiple time steps might be preferable.
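
As an illustrative sketch (not a normative schema), the same five readings
could be delivered either way; the "paged" shape below mirrors the idea behind
Event Pages:

.. code-block:: python

    times = [0.0, 0.1, 0.2, 0.3, 0.4]
    readings = [10, 11, 12, 13, 14]

    # Many small events: one reading per Event, good for low-latency
    # streaming decisions.
    small_events = [{'seq_num': i, 'time': t, 'data': {'det': r}}
                    for i, (t, r) in enumerate(zip(times, readings), start=1)]

    # One larger, page-like document: fewer, bigger documents are cheaper
    # to store and retrieve from MongoDB.
    paged = {'seq_num': [1, 2, 3, 4, 5],
             'time': times,
             'data': {'det': readings}}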


Use Case - Tomography Tiling and MongoDB Serialization
