some improvements to the documentation
mdorier committed Mar 5, 2024
1 parent 379419e commit bff2496
Showing 5 changed files with 50 additions and 48 deletions.
4 changes: 2 additions & 2 deletions docs/index.rst
@@ -15,10 +15,10 @@ RPC and RDMA library and a high level of on-node concurrency using
`Argobots <https://www.argobots.org/>`_.

Mofka provides a C++ and a Python interface. One of its particularities is that it
splits events into two parts: a **data** part, referencing potentially large, raw data,
which Mofka will try its best not to copy more than necessary (e.g., by relying on
RDMA to transfer it directly from a client application's memory to a storage device
on servers) and a **metadata** part, which consists of structured information about
the data (usually expressed in JSON). Doing so allows Mofka to store each part
independently, batch metadata together, and allow an event to reference (a subset of)
the data of another event. This interface is also often more adapted to HPC applications,
16 changes: 8 additions & 8 deletions docs/usage/consumer.rst
@@ -41,18 +41,18 @@ A consumer can be created with five parameters, four of which are optional.
send batches as soon as possible but will increase the batch size if the consumer is not
responding fast enough.

* **Data selector**: the consumer first receives the metadata part of an event and runs
the user-provided data selector function on the metadata to know whether the data should
be pulled. This function takes the metadata part of the event as well as a :code:`DataDescriptor`
instance. The latter is an opaque key that Mofka can use to locate the actual data.
The above code is an example of a data selector that tells the consumer to pull the data
only if the *"energy"* field in the metadata is greater than 20. It does so by returning
the provided :code:`DataDescriptor` when that condition holds, and by returning
:code:`mofka::DataDescriptor::Null()` when it does not. The data selector could also tell Mofka to pull
*only a subset of an event's data*. More on this in the :ref:`Data descriptors` section.
A sketch of this callback, together with the data broker described below, follows this list.

* **Data broker**: if the data selector returned a non-null :code:`DataDescriptor`, the user-provided
data broker function is invoked by the consumer. This function takes the event's metadata
and the :code:`DataDescriptor` returned by the data selector, and must return a :code:`mofka::Data`
object pointing to the location in memory where the application wishes for the data to be placed.
This memory could be non-contiguous, it could be allocated by the data broker or it could point to
@@ -82,15 +82,15 @@ we can pull the events out of the consumer. The following code shows how to do this.
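The example referenced above is collapsed in this diff view. Below is a rough sketch of what
the two callbacks described in the list and the pull loop can look like. Only :code:`pull()`,
:code:`wait()`, :code:`acknowledge()`, and :code:`DataDescriptor::Null()` come from this page;
the callback signatures, the :code:`json()` and :code:`size()` accessors, and the
:code:`Data` constructor are assumptions.

.. code-block:: cpp

   // Sketch only: signatures and accessors marked as assumed are not taken
   // verbatim from the Mofka API.
   auto data_selector = [](const mofka::Metadata& metadata,
                           const mofka::DataDescriptor& descriptor) -> mofka::DataDescriptor {
       // Pull the data only when the "energy" field exceeds 20
       // (the json() accessor is an assumption).
       if(metadata.json()["energy"] > 20)
           return descriptor;
       return mofka::DataDescriptor::Null();
   };

   auto data_broker = [](const mofka::Metadata& metadata,
                         const mofka::DataDescriptor& descriptor) -> mofka::Data {
       // Allocate a contiguous buffer for the selected data
       // (the size() accessor and the Data constructor are assumptions).
       auto buffer = new char[descriptor.size()];
       return mofka::Data{buffer, descriptor.size()};
   };

   // Pull, process, and acknowledge events.
   while(true) {
       mofka::Future<mofka::Event> future = consumer.pull(); // non-blocking
       mofka::Event event = future.wait(); // blocks until an event is available
       // ... process the event's metadata and data ...
       event.acknowledge(); // events up to this one won't be re-sent on restart
       // Free the buffer allocated by the data broker once done with it.
   }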
:code:`consumer.pull()` is a non-blocking function that returns a
:code:`mofka::Future<Event>` that can be tested for completion and waited on.
Waiting on the future gets us a :code:`mofka::Event` instance which contains the
event's metadata and data.

The call to :code:`event.acknowledge()` tells the Mofka partition manager that
all the events in the partition up to this one have been processed by this consumer
and should not be sent again, should the consumer restart.

.. note::

In this example we have allocated the memory for the data in our data broker function,
so we need to free it when we no longer need it.

.. group-tab:: Python
40 changes: 21 additions & 19 deletions docs/usage/producer.rst
@@ -5,7 +5,7 @@ Applications that need to produce events into one or more topics will need
to create a :code:`Producer` instance. This object is an interface to produce
events into a designated topic. It will internally run the Validator, Partition
selector, and Serializer on the events it is passed, in order to validate each event's
metadata and data, select a destination partition for each event, and serialize
the event's metadata into batches aimed at the same partition.

.. note::
@@ -70,8 +70,8 @@ A producer can be created with four optional parameters.
Producing events
----------------

As explained earlier, Mofka splits events into two parts: metadata and data.
The metadata part is JSON-structured, small, and can be batched with the metadata
of other events to issue fewer RPCs to partition managers. The data part is optional
and represents potentially larger, raw data that can benefit from being transferred
via zero-copy mechanisms such as RDMA.
@@ -82,7 +82,8 @@ by a JSON fragment containing the timestamp and detector information (e.g., calibration
parameters), as well as information about the images (e.g., dimensions, pixel format).
The data part of an event would be the image itself.
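To make this concrete, the metadata of such an event could look like the following
:code:`nlohmann::json` fragment (all field names and values here are illustrative,
not a schema mandated by Mofka):

.. code-block:: cpp

   nlohmann::json metadata = {
       {"timestamp", "2024-03-05T12:34:56Z"},           // acquisition time
       {"detector", {{"name", "det0"}, {"gain", 1.5}}}, // calibration parameters
       {"image", {{"width", 2048}, {"height", 2048},
                  {"pixel_format", "uint16"}}}          // image description
   };
   // The data part of the event would be the raw image buffer itself.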

The code below shows how to create the data and metadata pieces of an event
in the form of a :code:`Data` instance and a :code:`Metadata` instance respectively.

.. tabs::

@@ -94,13 +95,13 @@ The code below shows how to create the data and metadata pieces of an event
:end-before: END EVENT
:dedent: 8

The first :code:`mofka::Data` object, :code:`data1`, is a view of a single contiguous
segment of memory underlying the :code:`segment1` vector. The second
:code:`Data` object, :code:`data2`, is a view of two non-contiguous segments.

The first :code:`mofka::Metadata` object, :code:`metadata1`, is created from a
raw string representing a JSON object with an "energy" field. The second :code:`Metadata`
object contains the same information but is initialized using an :code:`nlohmann::json`
instance, which is the library used by Mofka to manage JSON data in C++.
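The C++ example itself is collapsed in this diff view; a sketch of what these four
objects can look like is given below (the :code:`Data` and :code:`Metadata` constructors
and the header name are assumptions based on the description above):

.. code-block:: cpp

   #include <mofka/Client.hpp> // header name is an assumption
   #include <nlohmann/json.hpp>
   #include <vector>

   std::vector<char> segment1(1024), segment2(512);

   // A view of a single contiguous segment of memory:
   mofka::Data data1{segment1.data(), segment1.size()};

   // A view of two non-contiguous segments:
   mofka::Data data2{{{segment1.data(), segment1.size()},
                      {segment2.data(), segment2.size()}}};

   // From a raw string representing a JSON object:
   mofka::Metadata metadata1{"{\"energy\": 42}"};

   // From an nlohmann::json instance:
   mofka::Metadata metadata2{nlohmann::json{{"energy", 42}}};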

.. group-tab:: Python
@@ -123,7 +124,7 @@ The code below shows how to create the data and metadata pieces of an event
is freed. However, the user should still take care that they are not written to
until the data has been transferred.

Having created the metadata and the data part of an event, we can now push the event
into the producer, as shown in the code below.

.. tabs::
@@ -140,13 +141,13 @@ into the producer, as shown in the code below.

Work in progress...

The producer's :code:`push` function takes the :code:`Metadata` and the :code:`Data`
objects and returns a :code:`Future`. Such a future can be tested for completion
(:code:`future.completed()`) and can be blocked on until it completes (:code:`future.wait()`).
The latter method returns the event ID of the created event (a 64-bit unsigned integer).
It is perfectly OK to drop the future if you do not care to wait for its completion or
for the resulting event ID, as exemplified with the second event. Event IDs are monotonically
increasing and are per-partition, so two events stored in distinct partitions may end up with the same ID.
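As a sketch, pushing the two events created earlier can look as follows
(:code:`push`, :code:`wait`, :code:`completed`, and :code:`flush` come from this
page; the variable names and the producer itself are assumed):

.. code-block:: cpp

   // First event: keep the future and wait for the resulting event ID.
   auto future = producer.push(metadata1, data1);
   std::uint64_t event_id = future.wait(); // blocks until the event was sent

   // Second event: dropping the future is fine if we do not need the ID.
   producer.push(metadata2, data2);

   // Force all pending batches to be sent, e.g. before terminating.
   producer.flush();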

Calling :code:`producer.flush()` is a blocking call that will force all the pending batches of events
to be sent, regardless of whether they have reached the requested size. It can be useful to ensure
@@ -156,6 +157,7 @@ that all the events have been sent, either periodically or before terminating the application.

If the batch size used by the producer is anything other than :code:`mofka::BatchSize::Adaptive()`,
a call to :code:`future.wait()` will block until the batch containing the corresponding event
has been filled up to the requested size and sent to its target partition. Hence, an easy
mistake to make is to call :code:`future.wait()` when the batch is not full and with no other threads
pushing more events to it. In this situation the batch will never get full, will never be sent,
and :code:`future.wait()` will never complete.
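The following sketch illustrates this deadlock, assuming the producer was created with a
fixed, hypothetical batch size of 64 events:

.. code-block:: cpp

   // Single-threaded producer with a fixed batch size (the BatchSize
   // configuration shown in this comment is hypothetical).
   auto future = producer.push(metadata1, data1); // the batch holds 1 of 64 events
   future.wait(); // DEADLOCK: nothing else fills the batch, so it is never
                  // sent and the future never completes

In such a situation, calling :code:`producer.flush()` before :code:`future.wait()` would
force the partial batch out and let the future complete.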
4 changes: 2 additions & 2 deletions docs/usage/quickstart.rst
@@ -9,8 +9,8 @@ about it, from the implementation of its databases, down to how they share
resources such as hardware threads and I/O devices, ensuring that you can
configure it to maximize performance on each individual platform and for
each individual use case. The downside of this approach, however, is that
you will need more knowledge about Mochi than you would need about the inner
workings of other services like Kafka.

In this section, we will quickly deploy the bare minimum for a single-node,
functional Mofka service accessible locally, before we can dive into the
34 changes: 17 additions & 17 deletions docs/usage/topics.rst
@@ -5,10 +5,10 @@ Events in Mofka are pushed into *topics*. A topic is a distributed collection
of *partitions* to which events are appended. When creating a topic, users have to
give it a name, and optionally provide three objects.

* **Validator**: a validator is an object that validates that the metadata and data
parts comply with whatever is expected for the topic. Metadata are JSON documents
by default, so for instance a validator could check that some expected fields
are present. If the metadata part describes the data part in some way, a validator
could check that this description is actually correct. This validation will happen
before the event is sent to any server, resulting in an exception if the event is
not valid. If not provided, the default validator will accept all the events it is
@@ -20,10 +20,10 @@ give it a name, and optionally provide three objects.
strategy. If not provided, the default partition selector will cycle through the
partitions in a round robin manner.

* **Serializer**: a serializer is an object that can serialize a :code:`Metadata` object
into a binary representation, and deserialize a binary representation back into a
:code:`Metadata` object. If not provided, the default serializer will convert the
:code:`Metadata` into a string representation.

.. image:: ../_static/TopicPipeline-dark.svg
:class: only-dark
@@ -34,21 +34,21 @@ give it a name, and optionally provide three objects.
Mofka will take advantage of multithreading to parallelize and pipeline the execution
of the validator, partition selector, and serializer over many events. These objects
can be customized and parameterized. For instance, a validator that checks the content
of the JSON metadata could be provided with a list of fields it expects to find in the
metadata of each event.
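Such a parameterized validator could encapsulate its list of expected fields as sketched
below (the interface is invented for illustration; only the field-checking logic comes
from the text):

.. code-block:: cpp

   #include <nlohmann/json.hpp>
   #include <string>
   #include <vector>

   // Illustrative only: this is not Mofka's actual validator interface.
   struct RequiredFieldsValidator {
       std::vector<std::string> required_fields;

       bool operator()(const nlohmann::json& metadata) const {
           // Accept the event only if every expected field is present.
           for(const auto& field : required_fields)
               if(!metadata.contains(field)) return false;
           return true;
       }
   };

   RequiredFieldsValidator validator{{"energy", "timestamp"}};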

.. topic:: A motivating example

Hereafter, we will create a topic accepting events that represent collisions in a
particle accelerator. We will require that the metadata part of such events has
an *energy* value, represented by an unsigned integer (just so we can show
what optimizations could be done with Mofka's modularity). Furthermore, let's say that
the detector is calibrated to output energies from 0 to 99. We can create a validator that
checks that the energy field is not only present, but that its value is also strictly lower
than 100. If we would like to aggregate events with similar energy values into the same partition,
we could have the partition selector make its decision based on this energy value.
Finally, since we know that the energy value is between 0 and 99 and is the only relevant
part of an event's metadata, we could serialize this value into a single byte (:code:`uint8_t`),
drastically reducing the metadata size compared with a string like :code:`{"energy":42}`.
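The logic described in this example can be sketched with free functions as below
(Mofka's actual validator, partition selector, and serializer interfaces are not
shown in this diff, so only the energy-based logic is illustrated):

.. code-block:: cpp

   #include <nlohmann/json.hpp>
   #include <cstddef>
   #include <cstdint>

   // Validator logic: "energy" must be present, unsigned, and strictly below 100.
   bool validate(const nlohmann::json& metadata) {
       return metadata.contains("energy")
           && metadata.at("energy").is_number_unsigned()
           && metadata.at("energy").get<std::uint64_t>() < 100;
   }

   // Partition selector logic: aggregate events with similar energies
   // into the same partition.
   std::size_t select_partition(const nlohmann::json& metadata,
                                std::size_t num_partitions) {
       return metadata.at("energy").get<std::uint64_t>() * num_partitions / 100;
   }

   // Serializer logic: an energy in [0, 99] fits in a single byte.
   std::uint8_t serialize(const nlohmann::json& metadata) {
       return static_cast<std::uint8_t>(metadata.at("energy").get<std::uint64_t>());
   }
   nlohmann::json deserialize(std::uint8_t energy) {
       return nlohmann::json{{"energy", energy}};
   }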

.. important::
@@ -149,15 +149,15 @@ is the object that will receive and respond to RPCs targeting the partition's
data and metadata. While it is possible to implement your own partition manager,
Mofka already comes with two implementations.

* **Memory**: The *"memory"* partition manager is a manager that keeps the Metadata
and Data in the local memory of the process it runs on. This partition manager
* **Memory**: The *"memory"* partition manager is a manager that keeps the metadata
and data in the local memory of the process it runs on. This partition manager
doesn't have any dependency and is easy to use for testing, for instance, but it
won't provide persistence and will be limited by the amount of memory available
on the node.
* **Default**: The *"default"* partition manager is a manager that relies on a
`Yokan <https://mochi.readthedocs.io/en/latest/yokan.html>`_ provider for storing
metadata and on a `Warabi <https://github.com/mochi-hpc/mochi-warabi>`_
provider for storing data. Yokan is a key/value storage component with a number
of database backends available, such as RocksDB, LevelDB, BerkeleyDB, etc.
Warabi is a blob storage component, also with a variety of backend implementations
including Pmem.
@@ -218,18 +218,18 @@ Two required arguments when adding partitions are the name of the topic and the rank
of the server to which the partition should be added. Here because we only have one
server, the rank is 0.

With a default partition manager, we can specify the metadata provider in the form
of an "address" interpretable by Bedrock. Here *"my_metadata_provider@local"* asks
Bedrock to look for a provider named *"my_metadata_provider"* in the same process as
the partition manager. In :ref:`Deployment` we will see that we could easily run these
providers on different processes.

.. note::

If we don't specify the metadata (resp. data) provider in the above
code/commands, Mofka will look for a Yokan (resp. Warabi)
provider with the tag :code:`"mofka:metadata"` (resp. :code:`"mofka:data"` ) in the
target server process and use that as the metadata (resp. data) provider.
If multiple such providers exist, Mofka will choose the first one it finds in the
configuration file.
