This project demonstrates a full end-to-end real-time data streaming pipeline built with Apache Kafka, Apache Spark, and Cassandra, all running in Docker containers. Data is fetched from the randomuser API and streamed into Kafka by an Apache Airflow DAG that runs for one minute.
Apache Spark processes the Kafka stream in real time, and the reformatted data is stored in Cassandra. The Kafka infrastructure, including Confluent Control Center and Schema Registry, provides monitoring and maintains data integrity throughout the pipeline.
## Table of Contents

- [Project Overview](#project-overview)
- [Technologies Used](#technologies-used)
- [Setup and Installation](#setup-and-installation)
- [Steps](#steps)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [Features](#features)
- [License](#license)
## Project Overview

The project fetches data from an external API and processes it using Apache Airflow, then streams the data through Apache Kafka for real-time processing in Apache Spark. The processed data is stored in a Cassandra database. The infrastructure is containerized using Docker and monitored with Confluent.
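As a rough sketch of how the orchestration fits together, a minimal version of the `stream_to_kafka` DAG could look like the following. The task id, schedule, and callable body are assumptions; the project's real DAG lives in `dags/stream_to_kafka.py`.

```python
# Hypothetical sketch of dags/stream_to_kafka.py -- task id, schedule, and
# callable body are assumptions, not the project's actual code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def stream_data_to_kafka() -> None:
    """Fetch users from the randomuser API and produce them to Kafka.

    The real project keeps this logic in plugins/kafka_streamer.py and
    streams for roughly one minute before stopping.
    """
    ...


with DAG(
    dag_id="stream_to_kafka",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered manually via `airflow dags trigger`
    catchup=False,
) as dag:
    PythonOperator(
        task_id="stream_from_api",
        python_callable=stream_data_to_kafka,
    )
```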
## Technologies Used

- Docker: Containerization of all services.
- Apache Airflow: For orchestrating data fetch and process pipelines.
- Apache Kafka: For real-time data streaming.
- Apache Spark: For real-time data processing.
- Cassandra: NoSQL database for storing processed data.
- ZooKeeper: For Kafka coordination.
- Confluent: For Kafka monitoring and schema registry.
- PostgreSQL: For intermediate data storage.
## Setup and Installation

- Clone the Repository: Clone this repository to your local machine.

  ```bash
  git clone https://github.com/MekWiset/Realtime_Data_Streaming_project.git
  ```

- Install Docker and Docker Compose: Ensure you have Docker and Docker Compose installed on your system.

- Configure `config.yaml`: Update `config.yaml` to include your specific configurations, such as API endpoints and Kafka topics. A minimal loader sketch follows this list.

- Start the Docker Containers: Use `docker-compose` to build and start the containers.

  ```bash
  docker-compose up --build
  ```

- Install Python Dependencies: Install the required Python packages.

  ```bash
  pip install -r requirements.txt
  ```
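For reference, a minimal sketch of loading such a configuration in Python is shown below. The keys are illustrative assumptions rather than the project's actual `config.yaml` schema; the project ships its own `utils/configparser.py` for this purpose.

```python
# Minimal config-loading sketch (requires PyYAML). The keys below are
# illustrative assumptions, not the project's actual config.yaml schema.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Assumed example layout:
# api:
#   endpoint: https://randomuser.me/api
# kafka:
#   topic: users_created
#   bootstrap_servers: localhost:9092
api_endpoint = config["api"]["endpoint"]
kafka_topic = config["kafka"]["topic"]
print(f"Streaming {api_endpoint} -> Kafka topic {kafka_topic}")
```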
## Steps

- Fetch Data: Apache Airflow fetches data from the external API at https://randomuser.me/api and stores it in PostgreSQL.
- Stream to Kafka: The data is streamed from Airflow into Kafka using a custom Kafka producer (see the sketch after this list).
- Process with Spark: Spark consumes the Kafka stream, processes the data, and prepares it for storage.
- Store in Cassandra: The processed data is stored in Cassandra for further analysis and querying.
- Monitor with Confluent: Kafka is monitored via Confluent to ensure stream health and data integrity.
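To make the fetch-and-stream steps concrete, here is a stripped-down producer sketch. It skips the PostgreSQL staging step, and the topic name, bootstrap server, and field selection are assumptions; the project's actual logic lives in `jobs/kafka_stream.py` and `plugins/kafka_streamer.py`.

```python
# Hypothetical producer sketch (requires requests and kafka-python).
# Topic name, bootstrap server, and field selection are assumptions;
# the project's actual logic lives in jobs/kafka_stream.py.
import json
import time

import requests
from kafka import KafkaProducer


def fetch_user() -> dict:
    """Fetch one random user and keep a few flattened fields."""
    user = requests.get("https://randomuser.me/api", timeout=10).json()["results"][0]
    return {
        "first_name": user["name"]["first"],
        "last_name": user["name"]["last"],
        "email": user["email"],
    }


producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Stream for roughly one minute, matching the DAG described above.
deadline = time.time() + 60
while time.time() < deadline:
    producer.send("users_created", fetch_user())

producer.flush()
```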
## Usage

- Clone the Repository: Clone this repository to your local machine.

  ```bash
  git clone https://github.com/MekWiset/Realtime_Data_Streaming_project.git
  cd Realtime_Data_Streaming_project
  ```

- Prepare Your Configuration: Ensure all configuration files, including `config.yaml`, are correctly set before starting the services.

- Set PYTHONPATH: Export the `PYTHONPATH` to include the project directory (either on your local machine or within Docker).

  ```bash
  export PYTHONPATH=path/to/Realtime_Data_Streaming
  ```

- Launch Docker Compose: Start the services by running the following command:

  ```bash
  docker compose up -d
  ```

- Trigger the Airflow DAG: Initiate the Airflow DAG that fetches data from the API and streams it to Kafka.

  ```bash
  airflow dags trigger stream_to_kafka
  ```

  Alternatively, monitor the DAG in the Airflow UI at http://localhost:8080.

- Monitor Kafka: Use Confluent Control Center to monitor Kafka topics and the streaming status. Access the Control Center at http://localhost:9021.

- Launch Cassandra: Enter Cassandra's interactive shell to query the database.

  ```bash
  docker exec -it cassandra cqlsh -u cassandra -p cassandra localhost 9042
  ```

- Submit Spark Job: Stream data from Kafka to Cassandra using Spark (a condensed sketch of this job follows the list).

  ```bash
  spark-submit --master spark://localhost:7077 spark_stream.py
  ```

- Query Data in Cassandra: View the streamed data in Cassandra.

  ```sql
  -- Describe the table
  DESCRIBE spark_streams.created_users;

  -- View the data
  SELECT * FROM spark_streams.created_users;
  ```
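For orientation, below is a condensed sketch of what a Kafka-to-Cassandra streaming job like `spark_stream.py` can look like. The topic name, message schema, and hosts are assumptions; the keyspace and table match the CQL queries above, and the job expects the spark-cassandra-connector package on the Spark classpath.

```python
# Condensed Kafka -> Cassandra streaming sketch. Topic name, schema, and
# hosts are assumptions; keyspace and table match the CQL queries above.
# Requires the spark-cassandra-connector package on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder.appName("SparkDataStreaming")
    .config("spark.cassandra.connection.host", "localhost")
    .getOrCreate()
)

# Assumed shape of the JSON messages produced to Kafka.
schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
])

users = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "users_created")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

query = (
    users.writeStream.format("org.apache.spark.sql.cassandra")
    .option("checkpointLocation", "/tmp/spark_checkpoint")
    .option("keyspace", "spark_streams")
    .option("table", "created_users")
    .start()
)
query.awaitTermination()
```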
## Project Structure

- `README.md`: Provides an overview and instructions for the project.
- `config.yaml`: Configuration file for variable settings.
- `dags/`: Contains Airflow DAGs.
  - `stream_to_kafka.py`: Main DAG to handle data fetching and streaming to Kafka.
- `docker-compose.yml`: Docker Compose file to set up the environment.
- `helpers/`: Contains helper SQL scripts.
  - `create_table_created_users.sql`: Script to create the necessary tables in PostgreSQL.
- `jobs/`: Contains streaming and processing scripts.
  - `kafka_stream.py`: Kafka producer logic to stream data.
  - `spark_stream.py`: Spark consumer logic to process the Kafka stream.
- `plugins/`: Contains custom Airflow classes.
  - `kafka_streamer.py`: A class for fetching data from the API and managing Kafka streaming.
  - `spark_streamer.py`: A class for handling Spark streaming and the Cassandra connector (see the sketch after this list).
- `utils/`: Utility modules for configurations and helper functions.
  - `configparser.py`: Custom parser for handling YAML configurations.
- `requirements.txt`: Lists the Python packages required for the project.
- `script/entrypoint.sh`: Shell script to automate Docker container startup.
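As a small illustration of the Cassandra side handled by `spark_streamer.py`, keyspace and table bootstrap with the Python cassandra-driver might resemble the sketch below. The column set and replication settings are assumptions; the keyspace, table name, and credentials come from the Usage section.

```python
# Hypothetical keyspace/table bootstrap (requires cassandra-driver).
# Column set and replication settings are assumptions; keyspace, table,
# and credentials match the Usage section above.
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

auth = PlainTextAuthProvider(username="cassandra", password="cassandra")
cluster = Cluster(["localhost"], port=9042, auth_provider=auth)
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS spark_streams
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS spark_streams.created_users (
        email TEXT PRIMARY KEY,
        first_name TEXT,
        last_name TEXT
    );
""")

cluster.shutdown()
```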
## Features

- Real-time Data Streaming: Handles real-time data ingestion and processing using Kafka and Spark.
- API Integration: Fetches data dynamically from an API source.
- Dockerized Environment: Complete setup using Docker and Docker Compose for easy deployment.
- Kafka Monitoring: Uses Confluent for Kafka topic and schema monitoring.
- Scalable Architecture: Easily extendable for additional data sources or processing logic.
## License

This project is licensed under the MIT License. See the `LICENSE` file for details.