Associated blog post for this project - https://dr563105.github.io/posts/2023-08-29-cdc-debezium-kafka-pg/
Observe and monitor a Postgres DB source with a schema `commerce` and tables `users` and `products`. When there is a change in their rows, we capture that changed data, send it downstream through Apache Kafka, and make it available for analytics through DuckDB.
We use an Ubuntu 20.04 LTS AWS EC2 machine for this project.
We need the following:
- git version >= 2.37.1,
- Docker version >= 20.10.17 and Docker compose v2 version >= v2.10.2,
- pgcli,
- make,
- optional: Python >= 3.7.
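If some of these are already installed, a quick way to confirm the versions (assuming the tools are on your PATH) is:
git --version           # expect >= 2.37.1
docker --version        # expect >= 20.10.17
docker compose version  # expect >= v2.10.2
python3 --version       # optional, expect >= 3.7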
To make things easier, I have scripted these prerequisites. Just clone my repo and run the instructions below.
sudo apt update && sudo apt install git make -y
git clone https://github.com/dr563105/cdc-debezium-kafka.git
cd cdc-debezium-kafka
make install_conda
make install_docker
source ~/.bashrc
Log out and log back in to the instance. To test whether Docker is working, run
docker run --rm hello-world # should return "Hello from Docker!" without errors
Set environment variables:
export POSTGRES_USER=postgres
export POSTGRES_PASSWORD=postgres
export POSTGRES_DB=cdc-demo-db
export POSTGRES_HOST=postgres
export AWS_KEY_ID=minio
export AWS_SECRET_KEY=minio123
export AWS_BUCKET_NAME=commerce
export DB_SCHEMA=commerce
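These variables are read when the containers are started. To sanity-check that they are all set in the current shell, something like this works:
env | grep -E 'POSTGRES_|AWS_|DB_SCHEMA'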
Now we're ready to execute our project.
cd cdc-debezium-kafka
make up # runs all docker containers and connections
# wait 100-120 seconds to allow data to be pushed to MinIO (S3)
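Once the stack is up, you can also generate change events yourself. Here is a minimal sketch using pgcli, assuming Postgres is published on localhost:5432 and that the tables have an `id` column (the column names here are illustrative, not taken from the repo):
pgcli -h localhost -p 5432 -U $POSTGRES_USER -d $POSTGRES_DB
# inside pgcli, any insert/update/delete on commerce.users or commerce.products
# will be captured by Debezium and eventually land in MinIO, e.g.:
# DELETE FROM commerce.products WHERE id = 1;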
Open a browser and go to `localhost:9001` to open the MinIO UI. Log in with `minio` as the username and `minio123` as the password. Then navigate to buckets -> `commerce` -> `debezium.commerce.products` and further down to reach the `json` files. Similarly, navigate to `debezium.commerce.users` to reach that table's `json` files.
Note: On some cloud platforms such as GCP and Azure, `localhost` doesn't work. In that case, use the server's public IP address -- `<public_ipaddress>:9001` -- to access MinIO.
These `json` files contain the change data (upserts and deletes) for the respective tables. From here we can use DuckDB to analyse the data. There you have it: a complete data pipeline that fetches change data from the source and brings it to the sink (downstream).
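As a rough sketch of that analysis step, DuckDB's httpfs extension can read the `json` files straight out of MinIO. The S3 API port (9000) and the object path layout inside the bucket are assumptions here; adjust them to match what your `make up` run actually produced:
duckdb -c "
INSTALL httpfs; LOAD httpfs;
SET s3_endpoint='localhost:9000';     -- MinIO S3 API port (assumed)
SET s3_access_key_id='minio';
SET s3_secret_access_key='minio123';
SET s3_use_ssl=false;
SET s3_url_style='path';
SELECT * FROM read_json_auto('s3://commerce/debezium.commerce.users/*.json') LIMIT 10;
"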
To bring down all containers and return to the original state, run the following:
make down #shuts down all project processes and docker containers
# to delete minio buckets with json files
sudo rm -rf minio/ psql_vol/
For testing, a separate environment with each component is created. To get started, execute the following commands sequentially.
export TEST_POSTGRES_USER=test_postgres
export TEST_POSTGRES_PASSWORD=test_postgres
export TEST_POSTGRES_DB=test_cdc-demo-db
export TEST_POSTGRES_HOST=test_postgres
export TEST_DB_SCHEMA=commerce
export TEST_AWS_KEY_ID=test_minio
export TEST_AWS_SECRET_KEY=test_minio123
export TEST_AWS_BUCKET_NAME=commerce
make tsetup # tests the pipeline end-to-end
make tdown # shuts down all test resources and docker containers
# to delete test minio buckets and postgres volume:
# sudo rm -rf test_minio/ test_psql_vol/