From bff2496e12e1b12b741d284af87502fe29f98b41 Mon Sep 17 00:00:00 2001 From: Matthieu Dorier Date: Tue, 5 Mar 2024 13:36:12 +0000 Subject: [PATCH] some improvements to the documentation --- docs/index.rst | 4 ++-- docs/usage/consumer.rst | 16 ++++++++-------- docs/usage/producer.rst | 40 ++++++++++++++++++++------------------- docs/usage/quickstart.rst | 4 ++-- docs/usage/topics.rst | 34 ++++++++++++++++----------------- 5 files changed, 50 insertions(+), 48 deletions(-) diff --git a/docs/index.rst b/docs/index.rst index 78eafa8..c8db66e 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -15,10 +15,10 @@ RPC and RDMA library and a high level of on-node concurrency using `Argobots `_. Mofka provides a C++ and a Python interface. One of its particularities is that it -splits events into two parts: a *Data* part, referencing raw, potentially large data, +splits events into two parts: a **data** part, referencing potentially large, raw data, which Mofka will try its best not to copy more than necessary (e.g., by relying on RDMA to transfer it directly from a client application's memory to a storage device -on servers) and a *Metadata* part, which consists of structured information about +on servers) and a **metadata** part, which consists of structured information about the data (usually expressed in JSON). Doing so allows Mofka to store each part independently, batch metadata together, and allow an event to reference (a subset of) the data of another event. This interface is also often more adapted to HPC applications, diff --git a/docs/usage/consumer.rst b/docs/usage/consumer.rst index 4b23ab4..b6f9e99 100644 --- a/docs/usage/consumer.rst +++ b/docs/usage/consumer.rst @@ -41,18 +41,18 @@ A consumer can be created with five parameters, four of which are optional. send batches as soon as possible but will increase the batch size if the consumer is not responding fast enough. 
-* **Data selector**: the consumer first receives the Metadata part of an event and runs - the user-provided data selector function on the Metadata to know whether the data should - be pulled. This function takes the Metadata part of the event as well as a :code:`DataDescriptor` +* **Data selector**: the consumer first receives the metadata part of an event and runs + the user-provided data selector function on the metadata to know whether the data should + be pulled. This function takes the metadata part of the event as well as a :code:`DataDescriptor` instance. The latter is an opaque key that Mofka can use to locate the actual data. The above code is an example of a data selector that will tell the consumer to pull the data - only if the *"energy"* field in the Metadata is greater than 20. It does so by returning + only if the *"energy"* field in the metadata is greater than 20. It does so by returning the provided :code:`DataDescriptor` if the field is greater than 20, and by returning :code:`mofka::DataDescriptor::Null()` if it isn't. The data selector could tell Mofka to pull *only a subset of an event's data*. More on this in the :ref:`Data descriptors` section. -* **Data broker**: if the data selector returned a non-null DataDescriptor, the user-provided - data broker function is invoked by the consumer. This function takes the event's Metadata +* **Data broker**: if the data selector returned a non-null :code:`DataDescriptor`, the user-provided + data broker function is invoked by the consumer. This function takes the event's metadata and the :code:`DataDescriptor` returned by the data selector, and must return a :code:`mofka::Data` object pointing to the location in memory where the application wishes for the data to be placed. This memory could be non-contiguous; it could be allocated by the data broker, or it could point to @@ -82,7 +82,7 @@ we can pull the events out of the consumer.
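The decision rule that the data selector implements (pull the data only when the *"energy"* field exceeds 20) can be sketched as a plain function. This is an illustration of the logic only: the names and the use of `None` in place of :code:`mofka::DataDescriptor::Null()` are hypothetical, not Mofka's actual callback signature.

```python
import json

# Illustrative data-selector logic: return the descriptor to pull the data,
# or None (standing in for DataDescriptor::Null()) to skip it.
def data_selector(metadata, descriptor):
    if metadata.get("energy", 0) > 20:
        return descriptor  # tell the consumer to pull this event's data
    return None            # tell the consumer to skip this event's data

metadata_list = [json.loads(s) for s in ('{"energy": 10}', '{"energy": 42}')]
selected = [data_selector(m, "descriptor") for m in metadata_list]
# Only the second event's data would be pulled.
```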
The following code shows how to do t :code:`consumer.pull()` is a non-blocking function that returns a :code:`mofka::Future` that can be tested for completion and waited on. Waiting on the future gets us a :code:`mofka::Event` instance which contains the - event's Metadata and Data. + event's metadata and data. The call to :code:`event.acknowledge()` tells the Mofka partition manager that all the events in the partition up to this one have been processed by this consumer @@ -90,7 +90,7 @@ we can pull the events out of the consumer. The following code shows how to do t .. note:: - In this example we have allocated the Data in our data broker function, + In this example we have allocated the memory for the data in our data broker function, so we need to free it when we no longer need it. .. group-tab:: Python diff --git a/docs/usage/producer.rst b/docs/usage/producer.rst index c515c4b..0402e50 100644 --- a/docs/usage/producer.rst +++ b/docs/usage/producer.rst @@ -5,7 +5,7 @@ Applications that need to produce events into one or more topics will need to create a :code:`Producer` instance. This object is an interface to produce events into a designated topic. It will internally run the Validator, Partition selector, and Serializer on the events it is being passed to validate the event's -Metadata and Data, select a destination partition for each event, and serialize +metadata and data, select a destination partition for each event, and serialize the event's metadata into batches aimed at the same partition. .. note:: @@ -70,8 +70,8 @@ A producer can be created with four optional parameters. Producing events ---------------- -As explained earlier, Mofka splits events into two parts: Metadata and Data. -The Metadata part is JSON-structured, small, and can be batched with the Metadata +As explained earlier, Mofka splits events into two parts: metadata and data. 
+The metadata part is JSON-structured, small, and can be batched with the metadata of other events to issue fewer RPCs to partition managers. The Data part is optional and represents potentially larger, raw data that can benefit from being transferred via zero-copy mechanisms such as RDMA. @@ -82,7 +82,8 @@ by a JSON fragment containing the timestamp and detector information (e.g., call parameters), as well as information about the images (e.g., dimensions, pixel format). The data part of an event would be the image itself. -The code bellow shows how to create the Data and Metadata pieces of an event. +The code below shows how to create the data and metadata pieces of an event +in the form of a :code:`Data` instance and a :code:`Metadata` instance, respectively. .. tabs:: @@ -94,13 +95,13 @@ The code bellow shows how to create the Data and Metadata pieces of an event. - The first Data object, :code:`data1`, is a view of a single contiguous + The first :code:`mofka::Data` object, :code:`data1`, is a view of a single contiguous segment of memory underlying the :code:`segment1` vector. The second - Data object, :code:`data2`, is a view of two non-contiguous such segments. + :code:`Data` object, :code:`data2`, is a view of two non-contiguous segments. - The first Metadata object, :code:`metadata1`, is created from a raw string - representing a JSON object with and "energy" field. The second Metadata object - contains the same information but is initialized using an :code:`nlohmann::json` + The first :code:`mofka::Metadata` object, :code:`metadata1`, is created from a + raw string representing a JSON object with an "energy" field. The second :code:`Metadata` + object contains the same information but is initialized using an :code:`nlohmann::json` instance, which is the library used by Mofka to manage JSON data in C++. .. group-tab:: Python @@ -123,7 +124,7 @@ The code bellow shows how to create the Data and Metadata pieces of an event. is freed.
However the user should still take care that they are not written to until the data has been transferred. -Having created the Metadata and the Data part of an event, we can now push the event +Having created the metadata and the data part of an event, we can now push the event into the producer, as shown in the code below. .. tabs:: @@ -140,13 +141,13 @@ into the producer, as shown in the code bellow. Work in progress... -The producer's :code:`push` function takes the Metadata and the Data and returns a :code:`Future`. -Such a future can be tested for completion (:code:`future.completed()`) and can be blocked -on until it completes (:code:`future.wait()`). The latter method returns the event ID of the -created event (64-bits unsigned integer). It is perfectly OK to drop the future if you do not care -to wait for its completion or for the resulting event ID, as examplified with the second event. -Event IDs are monotonically increasing and are per-partition, so two events stored in distinct -partitions may end up with the same ID. +The producer's :code:`push` function takes the :code:`Metadata` and the :code:`Data` +objects and returns a :code:`Future`. Such a future can be tested for completion +(:code:`future.completed()`) and can be blocked on until it completes (:code:`future.wait()`). +The latter method returns the event ID of the created event (a 64-bit unsigned integer). +It is perfectly OK to drop the future if you do not care to wait for its completion or +for the resulting event ID, as exemplified with the second event. Event IDs are monotonically +increasing and are per-partition, so two events stored in distinct partitions may end up with the same ID. Calling :code:`producer.flush()` is a blocking call that will force all the pending batches of events to be sent, regardless of whether they have reached the requested size.
It can be useful to ensure @@ -156,6 +157,7 @@ that all the events have been sent either periodically or before terminating the application. .. warning:: If the batch size used by the producer is anything else than :code:`mofka::BatchSize::Adaptive()`, a call to :code:`future.wait()` will block until the batch containing the corresponding event - has been filled up to the requested size and sent to its target partition. Hence, and easy + has been filled up to the requested size and sent to its target partition. Hence, an easy mistake to make is to call :code:`future.wait()` when the batch is not full and with no other threads - filling it up. This situation will result in a deadlock. + pushing more events to it. In this situation the batch will never fill up, will never be sent, + and :code:`future.wait()` will never complete. diff --git a/docs/usage/quickstart.rst b/docs/usage/quickstart.rst index 118533c..601bf33 100644 --- a/docs/usage/quickstart.rst +++ b/docs/usage/quickstart.rst @@ -9,8 +9,8 @@ about it, from the implementation of its databases, down to how they share resources such as hardware threads and I/O devices, ensuring that you can configure it to maximize performance on each individual platform and for each individual use case. The downside of this approach, however, is that -you will need a lot more knowledge about Mochi than you would need about -the inner workings of other services like Kafka. +you will need more knowledge about Mochi than you would need about the inner +workings of other services like Kafka. In this section, we will quickly deploy the bare minimum for a single-node, functional Mofka service accessible locally, before we can dive into the diff --git a/docs/usage/topics.rst b/docs/usage/topics.rst index acc17bc..8e36c9e 100644 --- a/docs/usage/topics.rst +++ b/docs/usage/topics.rst @@ -5,10 +5,10 @@ Events in Mofka are pushed into *topics*. A topic is a distributed collection of *partitions* to which events are appended.
When creating a topic, users have to give it a name, and optionally provide three objects. -* **Validator**: a validator is an object that validates that the Metadata and Data +* **Validator**: a validator is an object that validates that the metadata and data part comply with whatever is expected for the topic. Metadata are JSON documents by default, so for instance a validator could check that some expected fields - are present. If the Metadata part describes the Data part in some way, a validator + are present. If the metadata part describes the data part in some way, a validator could check that this description is actually correct. This validation will happen before the event is sent to any server, resulting in an exception if the event is not valid. If not provided, the default validator will accept all the events it is @@ -20,10 +20,10 @@ give it a name, and optionally provide three objects. strategy. If not provided, the default partition selector will cycle through the partitions in a round robin manner. -* **Serializer**: a serializer is an object that can serialize a Metadata object into - a binary representation, and deserialize a binary representation back into a Metadata - object. If not provided, the default serializer will convert the Metadata into a - string representation. +* **Serializer**: a serializer is an object that can serialize a :code:`Metadata` object + into a binary representation, and deserialize a binary representation back into a + :code:`Metadata` object. If not provided, the default serializer will convert the + :code:`Metadata` into a string representation. .. image:: ../_static/TopicPipeline-dark.svg :class: only-dark @@ -34,13 +34,13 @@ give it a name, and optionally provide three objects. Mofka will take advantage of multithreading to parallelize and pipeline the execution of the validator, partition selector, and serializer over many events. These objects can be customized and parameterized. 
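The round-robin behavior of the default partition selector described above can be sketched in plain Python. This is an illustration of the strategy only; the class and method names are hypothetical and do not reflect Mofka's actual partition selector interface.

```python
# Illustrative round-robin partition selector: successive events cycle
# through the partition indices, ignoring the event's metadata, which is
# what the default selector does when no custom one is provided.
class RoundRobinSelector:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.next_index = 0

    def select_partition(self, metadata):
        index = self.next_index
        self.next_index = (index + 1) % self.num_partitions
        return index

selector = RoundRobinSelector(3)
picks = [selector.select_partition({"energy": e}) for e in range(5)]
# picks cycles through 0, 1, 2, 0, 1
```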
For instance, a validator that checks the content -of a JSON Metadata could be provided with a list of fields it expects to find in the -Metadata of each event. +of JSON metadata could be provided with a list of fields it expects to find in the +metadata of each event. .. topic:: A motivating example Hereafter, we will create a topic accepting events that represent collisions in a - particle accelerator. We will require that the Metadata part of such events have + particle accelerator. We will require that the metadata part of such events has an *energy* value, represented by an unsigned integer (just so we can show what optimizations could be done with Mofka's modularity). Furthermore, let's say that the detector is calibrated to output energies from 0 to 99. We can create a validator that @@ -48,7 +48,7 @@ Metadata of each event. than 100. If we would like to aggregate events with similar energy values into the same partition, we could have the partition selector make its decision based on this energy value. Finally, since we know that the energy value is between 0 and 99 and is the only relevant - part of an event's Metadata, we could serialize this value into a single byte (:code:`uint8_t`), + part of an event's metadata, we could serialize this value into a single byte (:code:`uint8_t`), drastically reducing the metadata size compared with a string like :code:`{"energy":42}`. .. important:: @@ -149,15 +149,15 @@ is the object that will receive and respond to RPCs targeting the partition's data and metadata. While it is possible to implement your own partition manager, Mofka already comes with two implementations. -* **Memory**: The *"memory"* partition manager is a manager that keeps the Metadata - and Data in the local memory of the process it runs on. This partition manager +* **Memory**: The *"memory"* partition manager is a manager that keeps the metadata + and data in the local memory of the process it runs on. This partition manager
This partition manager doesn't have any dependency and is easy to use for testing, for instance, but it won't provide persistence and will be limited by the amount of memory available on the node. * **Default**: The *"default"* partition manager is a manager that relies on a `Yokan `_ provider for storing - Metadata and on a `Warabi `_ - provider for storing Data. Yokan is a key/value storage component with a number + metadata and on a `Warabi `_ + provider for storing data. Yokan is a key/value storage component with a number of database backends available, such as RocksDB, LevelDB, BerkeleyDB, etc. Warabi is a blob storage component also with a variety of backend implementations including Pmem. @@ -218,7 +218,7 @@ Two required arguments when adding partitions are the name of the topic and the of the server to which the partition should be added. Here because we only have one server, the rank is 0. -With a default partition manager, we can specify the Metadata provider in the form +With a default partition manager, we can specify the metadata provider in the form of an "address" interpretable by Bedrock. Here *"my_metadata_provider@local"* asks Bedrock to look for a provider named *"my_metadata_provider"* in the same process as the partition manager. In :ref:`Deployment` we will see that we could easily run these @@ -226,10 +226,10 @@ providers on different processes. .. note:: - If we don't specify the Metadata (resp. Data) provider in the above + If we don't specify the metadata (resp. data) provider in the above code/commands, Mofka will look for a Yokan (resp. Warabi) provider with the tag :code:`"mofka:metadata"` (resp. :code:`"mofka:data"` ) in the - target server process and use that as the Metadata (resp. Data) provider. + target server process and use that as the metadata (resp. data) provider. If multiple such providers exist, Mofka will choose the first one it finds in the configuration file.