Kafka Replicator Exactly-Once is a tool for replicating data between different Apache Kafka clusters with an exactly-once guarantee.
See the launch options below for a quick start.
- Making a copy of data for disaster recovery purposes.
- Gathering data from different regions into a central one for aggregation.
- Data sharing between organizations.
- ...
More in the Confluent docs.
- MirrorMaker from Apache Kafka.
- Replicator from Confluent.
- A simple self-made "consume-produce in a loop" application.
These tools provide either at-most-once or at-least-once delivery.
Apache Kafka has a transactional API, which can be used for exactly-once delivery. The fundamental idea is to commit the consumer offsets and the producer records in a single transaction. Kafka Streams uses it to provide a high-level abstraction and easy access to the benefits of exactly-once. It can be enabled with a single config option, literally without code changes.
This works only within the same Kafka cluster.
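For illustration, the option in Kafka Streams is processing.guarantee; the application id and bootstrap servers below are placeholders:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");   // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");    // placeholder
// The single option that switches on exactly-once processing (Kafka 3.0+):
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);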
The replication tools listed above are not compatible with exactly-once delivery, because in their case the consumer offsets and the producer records live in different clusters. Apache Kafka can't wrap operations on different clusters in one transaction. There is a Kafka Improvement Proposal on how to make it possible, and good reading about it, but it has been around since 2020 and none of it is implemented as of 2023.
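To see why, here is a minimal sketch of the standard single-cluster consume-transform-produce pattern (topic, group and transactional ids are placeholders). The offsets commit goes through the producer, so it can only target the cluster the producer is connected to:

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;

// Both clients point at the SAME cluster; that's the constraint.
Properties cp = new Properties();
cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
cp.put(ConsumerConfig.GROUP_ID_CONFIG, "copy-group");
cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
cp.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");

Properties pp = new Properties();
pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");  // same cluster as the consumer
pp.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "copy-tx-1");
pp.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArraySerializer");
pp.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArraySerializer");

try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cp);
     KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pp)) {
    consumer.subscribe(List.of("input-topic"));
    producer.initTransactions();
    while (true) {
        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(100));
        if (records.isEmpty()) continue;
        producer.beginTransaction();
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (ConsumerRecord<byte[], byte[]> r : records) {
            producer.send(new ProducerRecord<>("output-topic", r.key(), r.value()));
            offsets.put(new TopicPartition(r.topic(), r.partition()),
                        new OffsetAndMetadata(r.offset() + 1));
        }
        // Offsets are committed to the cluster the producer is connected to,
        // so records and offsets end up in ONE transaction on ONE cluster.
        producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
        producer.commitTransaction();
    }
}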
- Replicate messages to the destination cluster with an at-least-once guarantee. Wrap the messages with some metadata and apply repartitioning.
- Deduplicate, unwrap and restore the initial partitioning, using exactly-once delivery within the destination cluster (sketched below).
As a drawback, it requires about twice as much processing as plain at-least-once replication.
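A minimal sketch of the deduplication idea (the names are illustrative, not the project's actual classes): the wrapper carries the source partition and offset, and the second stage applies a record only if its offset advances past the last applied offset for that source partition:

import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper: the first stage attaches source coordinates to each record.
record Wrapped(int sourcePartition, long sourceOffset, byte[] key, byte[] value) {}

class Deduplicator {
    // Highest source offset already applied, per source partition.
    private final Map<Integer, Long> lastApplied = new HashMap<>();

    // True if the record was not seen before and should be produced downstream.
    boolean shouldApply(Wrapped r) {
        long last = lastApplied.getOrDefault(r.sourcePartition(), -1L);
        if (r.sourceOffset() <= last) {
            return false; // a duplicate caused by an at-least-once retry
        }
        lastApplied.put(r.sourcePartition(), r.sourceOffset());
        return true;
    }
}

Since this check runs under exactly-once delivery within the destination cluster, the seen-offsets state and the produced records commit atomically, so a crash can't let a duplicate slip through.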
Architecture diagram: see design-schema.drawio in the project root.
Set your bootstrap servers and topic names for both clusters and run:
docker run \
-e KAFKA_CLUSTERS_SOURCE_BOOTSTRAP_SERVERS=source-kafka-cluster:9092 \
-e KAFKA_CLUSTERS_SOURCE_TOPIC=source-topic \
-e KAFKA_CLUSTERS_DESTINATION_BOOTSTRAP_SERVERS=destination-kafka-cluster:9092 \
-e KAFKA_CLUSTERS_DESTINATION_TOPIC=destination-topic \
emitskevich/kafka-reo
Or build it from sources first:
./gradlew check installDist
docker build . -t emitskevich/kafka-reo --build-arg MODULE=replicator
docker run \
-e KAFKA_CLUSTERS_SOURCE_BOOTSTRAP_SERVERS=source-kafka-cluster:9092 \
-e KAFKA_CLUSTERS_SOURCE_TOPIC=source-topic \
-e KAFKA_CLUSTERS_DESTINATION_BOOTSTRAP_SERVERS=destination-kafka-cluster:9092 \
-e KAFKA_CLUSTERS_DESTINATION_TOPIC=destination-topic \
emitskevich/kafka-reo
Replace the env vars in docker-compose.yml, then run:
docker-compose up
Or build it from sources first:
./gradlew check installDist
docker-compose build
docker-compose up
Replace the env vars in k8s-deployment.yml, then run:
kubectl apply -f k8s-deployment.yml
Launch it as close to the destination cluster as possible. This gives a notable performance boost, since the deduplication step uses the transactional API of the destination cluster and is latency-sensitive.
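For example, assuming the dedup stage is built on Kafka Streams (an assumption for illustration, not the project's documented internals), every commit is a Kafka transaction with several round trips to the destination brokers; network proximity shortens those round trips, and the commit cadence can be tuned:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// Every commit is a transaction: several round trips to the destination brokers.
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
// Commit cadence trades end-to-end latency against throughput (value is illustrative).
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100);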