The repository is structured as follows:
- Explanation for the idea of Keyed Watermarks.
- How to set up and run the cluster to be able to run Apache Flink with Keyed Watermarks.
- The repo. contains the jars, source java classes of Keyed Watermarks, pipeline to test with, docker file used to build the cluster, and instructions on how to use everything.
Big Data Stream processing engines, exemplified by tools like Apache Flink, employ windowing techniques to manage unbounded streams of events. The aggregation of relevant data within Windows holds utmost importance for event-time windowing due to its impact on result accuracy. A pivotal role in this process is attributed to watermarks, unique timestamps signifying event progression in time. Nonetheless, the existing watermark generation method within Apache Flink, operating at the input stream level, exhibits a bias towards faster sub-streams, causing the omission of events from slower counterparts. Through our analysis, we determined that Apache Flink's standard watermark generation approach results in an approximate
Set up an Apache Flink Cluster using Docker
- Clone this repo. using the following command:
git clone https://github.com/TawfikYasser/kw-flink-cluster-docker.git
. - Download the
build-target
folder from thislink
and put it in the same directory of the repo. - Clone this repo. and in the same directory run the following command:
docker build -t <put-your-docker-image-name-here> .
, a docker image will be created. - Then we need to create 3 docker containers, one for the
JobManager
, and two for theTaskManagers
.- To create the
JobManager
container run the following command:docker run -it --name <put-your-docker-container-name-here> -p 8081:8081 --network bridge <put-your-docker-image-name-here>:latest
, we're exposing the 8081 port in order to be able to access the Apache Flink Web UI from outside the containers, also we're attaching the container to thebridge
network. - To create the
TaskManagers
run the following two commands:docker run -it --name <put-your-docker-container-name-here> --network bridge <put-your-docker-image-name-here>:latest
,docker run -it --name <put-your-docker-container-name-here> --network bridge <put-your-docker-image-name-here>:latest
.
- To create the
- Now, you have to configure the
masters
,workers
, andflink-config.yml
files on each container as follows: - Start the containers, and start the
ssh
service on the containers of taskManagers using:service ssh start
.
- Now, you're ready to start the cluster, go to
/home/flink/bin/
inside the container containing the jobManager and run the cluster using:./start-cluster.sh
. - Run a Flink job inside
/home/flink/bin/
using the following command:./flink run <pipeline-jar-path>
. (-p optional if you want to override the default parallelism inflink-config.yml
) - Open the Web UI of Apache Flink using:
http://localhost:8081
. (If you're running the containers on a VM use the VM's External IP, otherwise use your local machine's IP)
- In the following
java
file, use it as a starting point.
@INPROCEEDINGS{10296717,
author={Yasser, Tawfik and Arafa, Tamer and El-Helw, Mohamed and Awad, Ahmed},
booktitle={2023 5th Novel Intelligent and Leading Emerging Sciences Conference (NILES)},
title={Keyed Watermarks: A Fine-grained Tracking of Event-time in Apache Flink},
year={2023},
volume={},
number={},
pages={23-28},
doi={10.1109/NILES59815.2023.10296717}}
IMPORTANT: Code Base of Keyed Watermarks
& Ready to Run Flink Cluster