Real-time-data-pipeline-kafka-mongo-elasticsearch-pyspark

A real-time data pipeline built with Kafka, MongoDB, Elasticsearch, and PySpark. It streams raw data from Kafka, enriches it with sentiment analysis using a Hugging Face model, stores the results in MongoDB, and indexes them in Elasticsearch for visualization in Kibana. The design is intended to scale for real-time data analytics and machine learning workloads.

This project demonstrates an end-to-end real-time data pipeline that performs sentiment analysis on Yelp reviews. It combines Confluent Kafka, Spark Structured Streaming, MongoDB Atlas, HuggingFace's DistilBERT base uncased finetuned SST-2 model, and Elasticsearch with Kibana for real-time data visualization. The pipeline processes the Yelp dataset (from Kaggle) in real time, and the project provides step-by-step instructions for building the architecture from scratch.

System Architecture

The architecture consists of the following components:

Kafka Producer (Kaggle Notebook): Streams Yelp review data in real time from a CSV file.
Apache Spark Structured Streaming: Processes and transforms the data in real time.
MongoDB Atlas: Serves as an intermediary storage layer for holding processed data.
Confluent Kafka: Manages data ingestion and stream processing.
HuggingFace Sentiment Model: DistilBERT base uncased finetuned SST-2 performs sentiment analysis on the reviews.
Elasticsearch: Stores and indexes the data for efficient search and visualization.
Kibana: Provides real-time visualization dashboards for exploring processed data.
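
The producer step at the top of this list can be sketched roughly as follows. This is a minimal illustration, not the code from the project's notebooks: the topic name `yelp_reviews`, the column names `review_id`, `stars`, and `text`, and the helper names are all assumptions. The `producer` argument is expected to behave like `confluent_kafka.Producer` (it needs `produce`, `poll`, and `flush`), which keeps the sketch runnable without a broker.

```python
import csv
import io
import json


def review_messages(csv_text, key_field="review_id"):
    """Yield one (key, value) pair of bytes per CSV row, ready to send to Kafka."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # The key column name is an assumption; an empty key is used if it is absent.
        key = (row.get(key_field) or "").encode("utf-8")
        value = json.dumps(row).encode("utf-8")
        yield key, value


def stream_reviews(csv_text, producer, topic="yelp_reviews"):
    """Send every CSV row to `topic` and return the number of messages produced."""
    sent = 0
    for key, value in review_messages(csv_text):
        producer.produce(topic, value=value, key=key)
        producer.poll(0)  # serve delivery callbacks without blocking
        sent += 1
    producer.flush()  # wait for outstanding messages before returning
    return sent
```

With a real cluster, the producer would be created with something like `confluent_kafka.Producer({...})` using the Confluent Cloud connection settings, which are not shown here.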

Technologies

Python: Used to develop the Kafka producer, Spark stream processor, and data analysis scripts.
Apache Kafka (Confluent Cloud): Handles data ingestion and message brokering.
Apache Spark: Used for real-time data processing.
MongoDB Atlas: Temporary storage for streaming data.
HuggingFace Model: Performs sentiment analysis on incoming reviews.
Elasticsearch: Stores, indexes, and searches the processed data.
Kibana: Used for building dashboards and visualizing the data.
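
The sentiment step could look roughly like the sketch below. It is an illustration under assumptions rather than the project's actual code: the Hub model id `distilbert-base-uncased-finetuned-sst-2-english`, the `text` field, and the output field names `sentiment` and `sentiment_score` are guesses. The classifier is passed in as a callable so the enrichment logic can be tried without downloading the model.

```python
def load_classifier(model_id="distilbert-base-uncased-finetuned-sst-2-english"):
    """Build a Hugging Face sentiment pipeline (model id is an assumption)."""
    from transformers import pipeline  # imported lazily; needs `transformers` installed

    return pipeline("sentiment-analysis", model=model_id)


def enrich_review(review, classifier, text_field="text"):
    """Return a copy of `review` with sentiment label and score added.

    `classifier` is any callable that takes a string and returns a list like
    [{"label": "POSITIVE", "score": 0.99}], as the sentiment pipeline does.
    """
    result = classifier(review[text_field])[0]
    enriched = dict(review)  # leave the input record untouched
    enriched["sentiment"] = result["label"]
    enriched["sentiment_score"] = float(result["score"])
    return enriched
```

In a Spark job, a function like `enrich_review` would typically be wrapped in a UDF and applied to each micro-batch before the records are written to MongoDB. Very long reviews may also need truncating to fit the model's input limit.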

Repository: Trups39/Real-Time-Yelp-Review-Sentiment-Analysis-Pipeline