Academic project for the Middleware Technologies course at Polimi.
Here is the request:
Goal: Implement a data processing pipeline in Kafka.
Requirements:
- Provide administrative tools / scripts to create and deploy a processing pipeline that processes messages from a given topic.
- A processing pipeline consists of multiple stages, each of them processing an input message at a time and producing one output message for the downstream stage.
- Different processing stages could run on different processes for scalability.
- Messages have a key, and the processing of messages with different keys is independent.
- Stages are stateful, and their state is partitioned by key.
- Each stage consists of multiple processes that handle messages with different keys in parallel.
- Messages having the same key are processed in FIFO order with end-to-end exactly-once delivery semantics.
Assumptions:
- Processes can fail.
- Kafka topics with replication factor > 1 can be considered reliable.
- You are only allowed to use the Kafka Producer and Consumer APIs.
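Kafka's keyed partitioning is what makes the per-key guarantees above attainable: records produced with the same key are hashed to the same partition, and each partition preserves order. A minimal sketch of this behavior (topic name, key, and values are illustrative):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KeyedProduceExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Both records carry the key "user-42", so they hash to the same
                // partition and the downstream stage consumes them in FIFO order.
                producer.send(new ProducerRecord<>("stage-1-topic", "user-42", "event-a"));
                producer.send(new ProducerRecord<>("stage-1-topic", "user-42", "event-b"));
            }
        }
    }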
We implemented four types of processors. One of them embeds a stateful operation: an aggregate function over a physical (count-based) window. The other three processors are stateless:
- map
- flatmap
- filter
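The interfaces below are a hypothetical illustration of the contract each operator fulfils, not the project's actual API:

    import java.util.List;

    // Hypothetical operator shapes (illustrative only).
    interface MapProcessor<K, V> { V map(K key, V value); }               // one message in, one out
    interface FlatMapProcessor<K, V> { List<V> flatMap(K key, V value); } // one in, zero or more out
    interface FilterProcessor<K, V> { boolean keep(K key, V value); }     // one in, pass or drop
    // Stateful: per-key state accumulates values and emits an aggregate
    // whenever a count-based (physical) window fills up.
    interface WindowAggregateProcessor<K, V> { List<V> aggregate(K key, V value); }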
This project cannot be run on a local machine: its executables require AWS EC2.
- make sure you satisfy the basic Kafka and AWS prerequisites listed below
- compile the jar file with ProcessorStarter.java as the main class (see the pom.xml fragment after this list):
mvn clean compile assembly:single
- make sure your AWS key.pem is in the main folder of the project
- launch an EC2 instance (t2.2xlarge recommended)
- register an AMI that satisfies the AMI prerequisites listed below
- run
scp -i kafka-pipeline.pem target/processor-1.0-SNAPSHOT-jar-with-dependencies.jar ubuntu@server.ip:
to upload the jar to the EC2 instance
- save the launch template from the EC2 console under a name such as Kafkatemplate
- modify launch_instances.sh with the name of the template (a sketch of such a script follows below)
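The main class of the fat jar is selected in pom.xml; assuming the standard maven-assembly-plugin (which also explains the jar-with-dependencies suffix of the jar used in the demo), the relevant fragment would look roughly like:

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <!-- swap this value to rebuild the jar for ClusterLauncher below -->
            <mainClass>org.middleware.project.ProcessorStarter</mainClass>
          </manifest>
        </archive>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
    </plugin>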
Now you can tweak the main parameters in the configuration file, following the syntax described below.
Once you've defined the pipeline, recompile the jar file with ClusterLauncher.java as the main class:
mvn clean compile assembly:single
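launch_instances.sh is expected to start the worker instances from the saved launch template through the AWS CLI; a minimal sketch, assuming the template name chosen above (the instance count is illustrative):

aws ec2 run-instances --launch-template LaunchTemplateName=Kafkatemplate --count 4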
Basic Kafka and AWS prerequisites:
- Kafka 2.3.1
- Java 8
- AWS CLI:
pip install awscli

AMI prerequisites:
- Ubuntu 16.04.6 LTS
- Java JRE (version >= 1.8)
- Kafka (version >= 2.12-2.3.1)
In config.properties you must specify
- pipeline.length = <positive_number>
- replication.factor = <positive_number>
Then, you must add a configuration for each stage of the pipeline. Each stage is identified by the following two parameters:
- processors.at.<positive_number> = <positive_number>
- function.at.<positive_number> = "windowaggregate" | "flatmap" | "map" | "filter"
where:
- <positive_number> ::= <positive_digit>+
- <positive_digit> ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
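For example, a hypothetical three-stage pipeline on replicated topics would be described as:

    pipeline.length = 3
    replication.factor = 2
    processors.at.1 = 2
    function.at.1 = map
    processors.at.2 = 3
    function.at.2 = windowaggregate
    processors.at.3 = 1
    function.at.3 = filter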
Resilience to processor failures is managed so as to guarantee end-to-end exactly-once semantics. We used MapDB as a store for each processor's recent operations and for the recovery of failed processors.
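A minimal sketch of the idea using the MapDB API (the file name, map name, and offset-based deduplication scheme are illustrative, not the project's actual recovery protocol):

    import org.mapdb.DB;
    import org.mapdb.DBMaker;
    import org.mapdb.HTreeMap;
    import org.mapdb.Serializer;

    public class ProcessedStore {
        private final DB db = DBMaker.fileDB("processor-state.db")
                .transactionEnable()   // atomic commits survive a crash
                .make();
        private final HTreeMap<String, Long> lastOffsetByKey = db
                .hashMap("last-offset", Serializer.STRING, Serializer.LONG)
                .createOrOpen();

        // Returns false if this record was already handled (a duplicate replayed after a crash).
        public boolean markProcessed(String key, long offset) {
            Long last = lastOffsetByKey.get(key);
            if (last != null && offset <= last) return false; // already processed: skip
            lastOffsetByKey.put(key, offset);
            db.commit(); // persist before producing downstream
            return true;
        }
    }

On restart, a recovered processor reopens the same file and skips any input it had already processed, which is one way to approximate exactly-once delivery on top of the plain Producer and Consumer APIs.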
We provide a sample source to run a demo.
First, follow the getting-started steps above to set up deployment and local installation.
Once you've defined a pipeline schema, run:
java -cp target/processor-1.0-SNAPSHOT-jar-with-dependencies.jar org.middleware.project.ProcessorStarter ./source.properties
java -cp target/processor-1.0-SNAPSHOT-jar-with-dependencies.jar org.middleware.project.ProcessorStarter ./sink.properties
We force crashes on random processors in the code to demonstrate the exactly-once semantics.
The pipeline writes the processed source records to sink.txt.
Built with:
- Kafka 2.3.1
- MapDB (v3.5)
- AWS CLI v2