This repository contains the code for the Open Stream Processing Benchmark.
All documentation can be found in our wiki.
The repository includes:
- benchmark: benchmark pipeline implementations (docs).
- data-stream-generator: data stream generator to generate input streams locally or on a DC/OS cluster (docs).
- output-consumer: consumes the output of the processing job and metrics-exporter from Kafka and stores it on S3 (docs).
- evaluator: computes performance metrics on the output of the output consumer (docs).
- result analysis: Jupyter notebooks to visualize the results (docs).
- deployment: deployment scripts to run the benchmark on a DC/OS setup on AWS (docs).
- kafka-cluster-tools: Kafka scripts to start a cluster and read from a topic for local development (docs).
- metrics-exporter: exports metrics from JMX and cAdvisor and writes them to Kafka (docs).
Currently, the benchmark includes Apache Spark (Spark Streaming and Structured Streaming), Apache Flink, and Kafka Streams.
- van Dongen, G., & Van den Poel, D. (2020). Evaluation of Stream Processing Frameworks. IEEE Transactions on Parallel and Distributed Systems, 31(8), 1845-1858. The Supplemental Material of this paper can be found here.
- Earlier work-in-progress publication: van Dongen, G., Steurtewagen, B., & Van den Poel, D. (2018, July). Latency measurement of fine-grained operations in benchmarking distributed stream processing frameworks. In 2018 IEEE International Congress on Big Data (BigData Congress) (pp. 247-250). IEEE.
Talks related to this publication:
- Spark Summit Europe 2019: Stream Processing: Choosing the Right Tool for the Job - Giselle van Dongen
Are you having issues with anything related to the project? Do you wish to use this project or extend it? The fastest way to contact me is through:
LinkedIn: giselle-van-dongen