Merge pull request #222 from maffettone/bulk-external-resources
Add a new document type for "bulk" external resources with no predetermined shape.
tacaswell authored Nov 3, 2022
2 parents 772ae3c + 541e914 commit eac55ca
Showing 8 changed files with 426 additions and 12 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -79,4 +79,7 @@ target/
.vscode/*

#Ipython Notebook
.ipynb_checkpoints
.ipynb_checkpoints

# pre-commit
.pre-commit-config.yaml
90 changes: 88 additions & 2 deletions docs/source/data-model.rst
@@ -24,7 +24,7 @@ with useful components.
Overview
========

The data model is composed of six types of Documents, which in Python are
The data model is composed of eight types of Documents, which in Python are
represented as dictionaries but could be represented as nested mappings (e.g.
JSON) in any language. Each document class has a defined, but flexible, schema.

@@ -35,8 +35,13 @@ JSON) in any language. Each document class has a defined, but flexible, schema.
* Event Descriptor --- Metadata about a series of Events. Envision
  richly-detailed column headings in a table, encompassing physical units,
  hardware configuration information, etc.
* Resource --- A pointer to an external file (or resource in general).
* Resource --- A pointer to an external file (or resource in general) that has
  predictable and fixed dimensionality.
* Datum --- A pointer to a specific slice of data within a Resource.
* Stream Resource (Experimental) --- A pointer to an external resource that
  contains a stream of data without restriction on the length of the stream.
  These are resources with a known 'column' dimension but an unknown number of
  rows (e.g., time series, point detectors); see the sketch after this list.
* Stream Datum (Experimental) --- A pointer to a specific slice of data within
  a Stream Resource.
* Run Stop Document --- Everything that we can only know at the very end, such
  as the time it ended and the exit status (succeeded, aborted, failed due to
  error).
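
For orientation, here is an illustrative sketch of the order in which these
documents might appear for a run that uses the experimental stream documents.
It is not normative: every dictionary below is heavily abbreviated, uses
placeholder uids, and would not validate against the formal schemas.

.. code-block:: python

    # Illustrative (name, document) pairs in the order a consumer might
    # receive them; most required fields are omitted for brevity.
    docs = [
        ('start', {'uid': 'start-uid'}),
        ('descriptor', {'uid': 'desc-uid', 'run_start': 'start-uid'}),
        ('stream_resource', {'uid': 'sres-uid', 'run_start': 'start-uid',
                             'spec': 'SOME_SPEC',
                             'stream_names': ['point_det']}),
        ('stream_datum', {'resource': 'sres-uid', 'datum_id': 'sres-uid/0',
                          'block_idx': 0, 'event_count': 1}),
        ('stop', {'uid': 'stop-uid', 'run_start': 'start-uid'}),
    ]
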
@@ -463,6 +468,87 @@ Formal Datum Page schema:
.. literalinclude:: ../../event_model/schemas/datum_page.json

.. _stream_resource:

Stream Resource Document (Experimental)
----------------------------------------

See :doc:`external` for details on the role Stream Resource documents play in
referencing external assets that are natively ragged, such as single-photon
detectors, or assets where there are many relatively small data sets (e.g.
scanned fluorescence data).

Minimal nontrivial valid example:

.. code-block:: python

    # 'Stream Resource' document
    {'path_semantics': 'posix',
     'resource_kwargs': {},
     'resource_path': '/local/path/subdirectory/data_file',
     'root': '/local/path/',
     'run_start': '10bf6945-4afd-43ca-af36-6ad8f3540bcd',
     'spec': 'SOME_SPEC',
     'stream_names': ['point_det'],
     'uid': '272132cf-564f-428f-bf6b-149ee4287024'}
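
Assuming this changeset registers the new document types in
``event_model.DocumentNames`` and ``event_model.schema_validators``, as is
done for the existing types, the document above can be checked against the
formal schema in the usual way. A minimal sketch:

.. code-block:: python

    import event_model

    doc = {'path_semantics': 'posix',
           'resource_kwargs': {},
           'resource_path': '/local/path/subdirectory/data_file',
           'root': '/local/path/',
           'run_start': '10bf6945-4afd-43ca-af36-6ad8f3540bcd',
           'spec': 'SOME_SPEC',
           'stream_names': ['point_det'],
           'uid': '272132cf-564f-428f-bf6b-149ee4287024'}

    # Raises jsonschema.ValidationError if the document does not conform.
    event_model.schema_validators[
        event_model.DocumentNames.stream_resource].validate(doc)
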
Typical example:

.. code-block:: python

    # 'Stream Resource' document
    {'spec': 'AD_HDF5',
     'root': '/GPFS/DATA/Andor/',
     'resource_path': '2020/01/03/8ff08ff9-a2bf-48c3-8ff3-dcac0f309d7d.h5',
     'resource_kwargs': {'frame_per_point': 1},
     'path_semantics': 'posix',
     'stream_names': ['point_det'],
     'uid': '3b300e6f-b431-4750-a635-5630d15c81a8',
     'run_start': '10bf6945-4afd-43ca-af36-6ad8f3540bcd'}
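
As with ordinary Resource documents, ``root`` and ``resource_path`` together
give the location of the asset; with ``'posix'`` path semantics this is a
plain path join. A small sketch:

.. code-block:: python

    import os

    resource = {'root': '/GPFS/DATA/Andor/',
                'resource_path': '2020/01/03/8ff08ff9-a2bf-48c3-8ff3-dcac0f309d7d.h5',
                'path_semantics': 'posix'}

    # '/GPFS/DATA/Andor/2020/01/03/8ff08ff9-a2bf-48c3-8ff3-dcac0f309d7d.h5'
    full_path = os.path.join(resource['root'], resource['resource_path'])
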
Formal schema:

.. literalinclude:: ../../event_model/schemas/stream_resource.json

.. _stream_datum:

Stream Datum Document (Experimental)
-------------------------------------

See :doc:`external` for details on the role Stream Datum documents play in
referencing external assets that are natively ragged, such as single-photon
detectors, or assets where there are many relatively small data sets (e.g.
scanned fluorescence data).

Minimal nontrivial valid example:

.. code-block:: python

    # 'Stream Datum' document
    {'resource': '272132cf-564f-428f-bf6b-149ee4287024',  # foreign key
     'datum_kwargs': {},  # format-specific parameters
     'datum_id': '272132cf-564f-428f-bf6b-149ee4287024/1',
     'block_idx': 0,
     'event_count': 1}
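
The ``resource`` field is the foreign key tying a Stream Datum back to the
Stream Resource it slices into. A minimal sketch of that invariant, using the
minimal examples above:

.. code-block:: python

    stream_resource_uid = '272132cf-564f-428f-bf6b-149ee4287024'
    stream_datum = {'resource': '272132cf-564f-428f-bf6b-149ee4287024',
                    'datum_kwargs': {},
                    'datum_id': '272132cf-564f-428f-bf6b-149ee4287024/1',
                    'block_idx': 0,
                    'event_count': 1}

    # Each Stream Datum must name the uid of its Stream Resource.
    assert stream_datum['resource'] == stream_resource_uid
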
Typical example:

.. code-block:: python

    # 'Stream Datum' document
    {'resource': '3b300e6f-b431-4750-a635-5630d15c81a8',
     'datum_kwargs': {'index': 3},
     'datum_id': '3b300e6f-b431-4750-a635-5630d15c81a8/3',
     'block_idx': 0,
     'event_count': 5,
     'event_offset': 14}
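
A sketch of how a consumer might use these fields, under the assumption that
``event_offset`` is the sequence position of the first event covered by this
block and ``event_count`` is the number of events it covers; this reading of
the fields is an illustration, not part of the schema:

.. code-block:: python

    stream_datum = {'block_idx': 0, 'event_count': 5, 'event_offset': 14}

    # Event positions covered by this block, under the assumption above.
    first = stream_datum['event_offset']
    covered = range(first, first + stream_datum['event_count'])
    print(list(covered))  # [14, 15, 16, 17, 18]
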
It is an implementation detail that ``datum_id`` is often formatted as
``{resource}/{counter}`` but this should not be considered part of the schema.
Formal schema:

.. literalinclude:: ../../event_model/schemas/stream_datum.json

.. _bulk_events:

"Bulk Events" Document (DEPRECATED)
2 changes: 1 addition & 1 deletion docs/source/use-cases.rst
@@ -3,7 +3,7 @@ Event Model Patterns
============================
When implementing a system with the Event Model for a particular use case (technique, scan type, etc.), many design choices can be made. For example: how many streams you define (through Event Descriptor documents), what events you put into those streams, and how data points are stored in an event. Here, we present a use case and a potential design, discussing the pros and cons of different options.

To further complicate things, we can consider that a document stream might be be optimized for very different scenarios within the same technique. For example, for the sample scan of the same sample, a document stream might read as the scan as it is being run for the purpose of providing in-scan feedback to a user or beamline analysis tool. For the same scan, a docstream might be serialized into a Mongo database and read back out by analysis and visualiztion tools. In these different uses of data from the same scan, the level of granularity that one puts intto an Event Document might be very different. The streaming consumer might require a large number of very small granular events in order to quickly make decisions that affect the course of the scan. On the other hand, MongoDB document retrieval is much more efficient with a small number of larger documents, and a small number of events that each contain data from multiple time steps might be preferrable.
To further complicate things, a document stream might be optimized for very different scenarios within the same technique. For example, for the same scan of the same sample, a document stream might be read live, as the scan is being run, to provide in-scan feedback to a user or a beamline analysis tool. For the same scan, a document stream might be serialized into a Mongo database and read back out by analysis and visualization tools. In these different uses of data from the same scan, the level of granularity that one puts into an Event Document might be very different. The streaming consumer might require a large number of very small, granular events in order to quickly make decisions that affect the course of the scan. On the other hand, MongoDB document retrieval is much more efficient with a small number of larger documents, and a small number of events that each contain data from multiple time steps might be preferable.
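
As an illustrative sketch (not a normative schema), the same five readings
could be delivered either way; the "paged" shape below mirrors the idea behind
Event Pages:

.. code-block:: python

    times = [0.0, 0.1, 0.2, 0.3, 0.4]
    readings = [10, 11, 12, 13, 14]

    # Many small events: one reading per Event, good for low-latency
    # streaming decisions.
    small_events = [{'seq_num': i, 'time': t, 'data': {'det': r}}
                    for i, (t, r) in enumerate(zip(times, readings), start=1)]

    # One larger, page-like document: fewer, bigger documents are cheaper
    # to store and retrieve from MongoDB.
    paged = {'seq_num': [1, 2, 3, 4, 5],
             'time': times,
             'data': {'det': readings}}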


Use Case - Tomography Tiling and MongoDB Serialization
