This repository contains the code that I developed as part of my master’s thesis “Design of a Benchmark Concept for Data Stream Management Systems (DSMS) in the Context of Smart Factories”. The following sections give a rough overview. If you are interested in the design decisions or in more details on the queries and performance results, send me a message on LinkedIn.
“Currently, there does not exist a satisfying application benchmark for distributed DSMSs in the area of smart factories.”
- Definition of a set of queries to be executed by the System Under Test (SUT).
- Design and setup of the benchmark architecture.
- Definition of a set of benchmark metrics to evaluate the SUT’s performance.
- Provision of a basic toolkit including a data sender, validator and system setup scripts.
- Provision of a prototypical reference implementation for a subset of the queries.
- Correctness
- Response Time (90th-percentile)
- Single Stream Throughput (in records/s)
- Number of Streams
- Number of input data streams (scale factor)
- Record frequency per data stream
- Benchmark duration
- Queries to be executed on each data stream
These parameters should be set in `tools/commons/commons.conf` (see the sketch below for how they might be loaded).
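The `.conf` suffix suggests the Typesafe Config (HOCON) format. The following minimal Scala sketch shows how such parameters could be read; all key names are illustrative assumptions, not the repository’s actual keys.

```scala
import com.typesafe.config.{Config, ConfigFactory}

// Sketch only: the key names below are hypothetical placeholders.
object BenchmarkParameters {
  private val config: Config =
    ConfigFactory.parseFile(new java.io.File("tools/commons/commons.conf"))

  val numberOfStreams: Int  = config.getInt("benchmark.number-of-streams")  // scale factor
  val recordFrequency: Int  = config.getInt("benchmark.record-frequency")   // records/s per stream
  val durationSeconds: Long = config.getLong("benchmark.duration-seconds")
  val queries: java.util.List[String] = config.getStringList("benchmark.queries")
}
```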
Contains code that is used by multiple modules as well as the file `commons.conf`, in which the main benchmark parameters are set.
Contains the data sender. Kafka-specific configurations can be made in `tools/datasender/datasender.conf`.
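As a rough illustration of what such a sender does, here is a minimal Scala sketch that reads records from a file and produces them to Kafka at a fixed rate. The broker address, topic name, file path and frequency are placeholder assumptions, not the repository’s actual values.

```scala
import java.util.Properties
import java.util.concurrent.locks.LockSupport
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.io.Source

object DataSenderSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    val recordsPerSecond = 1000 // assumed frequency
    val pauseNanos = 1000000000L / recordsPerSecond

    val source = Source.fromFile("data/input-records.csv") // assumed input file
    try {
      for (line <- source.getLines()) {
        producer.send(new ProducerRecord[String, String]("input-0", line))
        // Crude pacing; a real sender would compensate for send latency.
        LockSupport.parkNanos(pauseNanos)
      }
    } finally {
      source.close()
      producer.close()
    }
  }
}
```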
Contains the validator, a streaming application that makes use of the Akka Stream Kafka library.
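A minimal sketch of the consuming side of such a validator, assuming Akka 2.5-era APIs; the topic and group names are placeholders, and the actual validation logic is only hinted at in the comments.

```scala
import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import org.apache.kafka.common.serialization.StringDeserializer

object ValidatorSketch extends App {
  implicit val system: ActorSystem = ActorSystem("validator")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  val settings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092") // assumed broker
    .withGroupId("validator")               // assumed group id

  // Stream the SUT's output records; a real validator would compare each
  // record against independently computed gold-standard results.
  Consumer
    .plainSource(settings, Subscriptions.topics("identity-output-0"))
    .runWith(Sink.foreach { record =>
      // record.timestamp() is the Kafka message timestamp used later
      // for the response-time calculation.
      println(s"offset=${record.offset()} ts=${record.timestamp()} value=${record.value()}")
    })
}
```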
Contains setup and configuration scripts and a benchmark runner. All of them are defined with Ansible.
Contains utility functions to create/delete/redistribute Kafka topics and to get current offsets in topics.
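A hedged sketch of what such utilities might look like when built directly on the Kafka clients; the broker address and all parameter values are placeholders.

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import scala.collection.JavaConverters._

object KafkaTopicUtilsSketch {
  private def adminClient(): AdminClient = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker
    AdminClient.create(props)
  }

  def createTopic(name: String, partitions: Int, replication: Short): Unit = {
    val admin = adminClient()
    try admin.createTopics(Collections.singleton(new NewTopic(name, partitions, replication))).all().get()
    finally admin.close()
  }

  def deleteTopic(name: String): Unit = {
    val admin = adminClient()
    try admin.deleteTopics(Collections.singleton(name)).all().get()
    finally admin.close()
  }

  // Current end offsets per partition, read with a throwaway consumer.
  def endOffsets(topic: String, partitions: Int): Map[TopicPartition, Long] = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    val consumer = new KafkaConsumer(props, new StringDeserializer, new StringDeserializer)
    try {
      val tps = (0 until partitions).map(new TopicPartition(topic, _)).asJava
      consumer.endOffsets(tps).asScala.map { case (tp, off) => tp -> off.longValue() }.toMap
    } finally consumer.close()
  }
}
```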
Partial benchmark implementation with Apache Flink for the Identity Query (incoming events are written back to Kafka without modification) and the Statistics Query (min, max, mean, sum and count over a tumbling window of 1 second). Each query runs in a separate job so that queries can be executed in parallel while the order of records is still preserved.
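The following Scala sketch illustrates the two queries with Flink’s DataStream API. It is not the repository’s actual code: the topic names are placeholders, records are assumed to be single numeric values, and both queries are shown in one program for brevity even though the reference implementation runs each query as a separate job. Depending on the Flink version, the Kafka connector classes may be versioned variants such as `FlinkKafkaConsumer011`.

```scala
import java.util.Properties
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}

object QueriesSketch {
  case class Stats(min: Double, max: Double, sum: Double, count: Long) {
    def mean: Double = if (count == 0) 0.0 else sum / count
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092") // assumed broker
    props.setProperty("group.id", "flink-benchmark")

    val input: DataStream[String] =
      env.addSource(new FlinkKafkaConsumer[String]("input-0", new SimpleStringSchema(), props))

    // Identity Query: write records back unchanged.
    input.addSink(new FlinkKafkaProducer[String]("identity-output-0", new SimpleStringSchema(), props))

    // Statistics Query: min, max, mean, sum and count per 1 s tumbling window.
    input
      .map(_.toDouble)
      .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(1)))
      .aggregate(new AggregateFunction[Double, Stats, String] {
        override def createAccumulator(): Stats = Stats(Double.MaxValue, Double.MinValue, 0.0, 0L)
        override def add(v: Double, acc: Stats): Stats =
          Stats(math.min(acc.min, v), math.max(acc.max, v), acc.sum + v, acc.count + 1)
        override def getResult(acc: Stats): String =
          s"${acc.min},${acc.max},${acc.mean},${acc.sum},${acc.count}"
        override def merge(a: Stats, b: Stats): Stats =
          Stats(math.min(a.min, b.min), math.max(a.max, b.max), a.sum + b.sum, a.count + b.count)
      })
      .addSink(new FlinkKafkaProducer[String]("statistics-output-0", new SimpleStringSchema(), props))

    env.execute("benchmark-queries-sketch")
  }
}
```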
In the case of a single data stream, the data sender reads the data records from a provided file (e.g. taken from here) and sends them to the Kafka input topic at the configured frequency. The SUT consumes the records, runs the configured queries and writes the results to the Kafka output topics (one dedicated Kafka topic per query). Afterwards, the validator can read the same records from the input topics, create gold-standard results and compare them to the results created by the SUT to check for correctness. Furthermore, the 90th percentile of response times is calculated based on the Kafka message timestamps.
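As a hedged illustration of that metric: a record’s response time can be taken as the difference between its output and input Kafka message timestamps, and the 90th percentile can then be computed by nearest rank (the exact method used in the thesis may differ).

```scala
object ResponseTimeSketch {
  /** Nearest-rank 90th percentile; `responseTimesMs` are assumed to be
    * per-record (output timestamp - input timestamp) differences. */
  def percentile90(responseTimesMs: Seq[Long]): Long = {
    require(responseTimesMs.nonEmpty, "no response times recorded")
    val sorted = responseTimesMs.sorted
    val rank = math.ceil(0.9 * sorted.length).toInt
    sorted(rank - 1)
  }
}
```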
In the case of multiple data streams, the setup is similar. Each data stream is sent to a dedicated Kafka input topic. The SUT is required to run all configured queries on all data streams and to write the results to the dedicated output topics. The following image shows how the Data Stream Management System executes the Identity and Statistics Query on each data stream.
To run the benchmark on a cluster, it is advisable to install Ansible. The provided scripts allow installing the necessary software and running the benchmark (`tools/configuration/plays/benchmark-runner.yml`).
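For example, assuming a hypothetical inventory file named `inventory` that describes the cluster hosts, the runner could be invoked with `ansible-playbook -i inventory tools/configuration/plays/benchmark-runner.yml`.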
If you prefer running the modules without Ansible, you can compile the whole project with `sbt assembly` or a specific module with `sbt "project [module]" assembly`. The created jars can be run with `java -jar /path/to/jar.jar`.