This project offers a dual approach to understanding e-commerce customer behavior:
- Batch data analysis (/Main/For batch-data)
- Real-time data processing (/Main/For Streaming-data)
The goal is to glean insights into purchasing patterns, product preferences, buying frequency, and the temporal impact on online shopping behavior to answer the following questions:
- Can we segment customers based on their demographic information (Age, Gender, City) and shopping behaviors (Total Spend, Number of Items Purchased, Membership Type)? (A sketch of one approach follows this list.)
- Which customers are at risk of not making future purchases based on their Days Since Last Purchase and Satisfaction Level?
- Can we predict a customer’s Satisfaction Level based on their demographic and purchase history data?
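To make the first question concrete, here is a minimal PySpark sketch of customer segmentation with K-Means. It is illustrative only: the HDFS path, k=4, and the exact column names are assumptions based on the dataset description, not code from this repository.

```python
# segmentation_sketch.py -- illustrative only; path, k, and column names are assumptions
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("CustomerSegmentation").getOrCreate()
df = spark.read.csv("hdfs:///E-commerceCustomerBehavior-Sheet1.csv",
                    header=True, inferSchema=True).dropna()

# Encode the categorical demographics as numeric indices
for c in ["Gender", "City", "Membership Type"]:
    df = StringIndexer(inputCol=c, outputCol=c + "_idx",
                       handleInvalid="keep").fit(df).transform(df)

# Combine demographic and behavioral features into one vector
assembler = VectorAssembler(
    inputCols=["Age", "Total Spend", "Items Purchased",
               "Gender_idx", "City_idx", "Membership Type_idx"],
    outputCol="features")
features = assembler.transform(df)

# Cluster customers into k segments (k=4 is an arbitrary starting point)
model = KMeans(k=4, seed=42).fit(features)
model.transform(features).groupBy("prediction").count().show()
```

In practice the features should be standardized first (e.g. with StandardScaler), since Total Spend dominates the raw scale.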
NOTE:
- You will find our visualizations and analysis in /Analysis-report.
- Further information regarding the Hadoop_Docker_cluster_setup is also provided in /Analysis-report.
Access the dataset at E-commerce Customer Behavior Dataset.
- Hadoop: To store and preprocess the large datasets of user logs and transaction data.
- YARN: To efficiently manage resources for complex analytics tasks.
- Apache Kafka: To ingest real-time e-commerce transaction data.
- Apache Spark Streaming: For real-time data processing, analyzing customer behavior, and predicting future buying patterns.
- PostgreSQL: For storing the results of the data analysis and insights.
The batch approach employs a Hadoop Docker cluster and Apache Spark. It extracts insights from the data to answer the questions above and uses a Logistic Regression model to analyze customer behavior and predict future buying patterns.
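predict.py itself is not reproduced here, but a minimal sketch of a Logistic Regression model in PySpark ML, predicting Satisfaction Level from demographics and purchase history, might look like the following. The split ratio and exact column names are assumptions.

```python
# predict_sketch.py -- illustrative sketch, not the repository's predict.py
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("SatisfactionPrediction").getOrCreate()
df = spark.read.csv("hdfs:///E-commerceCustomerBehavior-Sheet1.csv",
                    header=True, inferSchema=True).dropna()

# Index the categorical target and categorical features, then assemble and fit
stages = [
    StringIndexer(inputCol="Satisfaction Level", outputCol="label"),
    StringIndexer(inputCol="Gender", outputCol="Gender_idx"),
    StringIndexer(inputCol="Membership Type", outputCol="Membership_idx"),
    VectorAssembler(
        inputCols=["Age", "Total Spend", "Items Purchased",
                   "Days Since Last Purchase", "Gender_idx", "Membership_idx"],
        outputCol="features"),
    LogisticRegression(maxIter=50),  # multinomial when the label has >2 classes
]

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=stages).fit(train)
accuracy = MulticlassClassificationEvaluator(
    metricName="accuracy").evaluate(model.transform(test))
print(f"Test accuracy: {accuracy:.3f}")
```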
- Ensure Docker is installed.
- Clone the repository.
- Run Docker Compose:
  docker-compose up
- Upload the dataset to HDFS:
  docker exec -it namenode hdfs dfs -put /path/to/E-commerceCustomerBehavior-Sheet1.csv /E-commerceCustomerBehavior-Sheet1.csv
- Start analysis and preprocessing using PySpark:
  docker exec -it spark-master bash
  spark-submit --master yarn /path/to/Analysis.py
- Run the prediction job:
  docker exec -it spark-master bash
  spark-submit /path/to/predict.py
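For orientation, a sketch of the kind of aggregation Analysis.py might run is shown below, including a simple at-risk filter for the churn question. The 60-day threshold and the "Unsatisfied" label are assumptions for illustration, not the project's actual values.

```python
# analysis_sketch.py -- illustrative; thresholds and column names are assumptions
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BatchAnalysis").getOrCreate()
df = spark.read.csv("hdfs:///E-commerceCustomerBehavior-Sheet1.csv",
                    header=True, inferSchema=True).dropna()

# Average spend and purchase volume per membership tier
df.groupBy("Membership Type").agg(
    F.avg("Total Spend").alias("avg_spend"),
    F.avg("Items Purchased").alias("avg_items"),
).show()

# Customers at risk of churning: long inactivity plus low satisfaction
at_risk = df.filter((F.col("Days Since Last Purchase") > 60) &
                    (F.col("Satisfaction Level") == "Unsatisfied"))
at_risk.show()
```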
The streaming approach introduces a real-time pipeline using Apache Kafka, Flume, and Spark Streaming to capture and analyze dynamic customer behavior and transactions, detect anomalies in real time, and save the results to Hadoop HDFS or PostgreSQL.
- Set up Kafka: Install and configure Apache Kafka to ingest real-time e-commerce transaction data. Define Kafka topics to categorize different types of transaction data.
- Use Apache Spark Streaming for real-time data processing: filtering, aggregating, and detecting anomalies as events arrive.
- Implement Kafka Producer: Develop a Kafka producer to simulate or generate real-time e-commerce transactions and feed them into the Kafka topics (a sketch follows this list).
- Run the corresponding stream-processing consumer to process the real-time data.
- Store the processed data, including summaries and insights, in Hadoop HDFS or PostgreSQL for further analysis and historical record-keeping.
- Answer the previous questions with cool visualizations.
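As a sketch of the producer step above, the snippet below uses the kafka-python client to publish simulated transactions. The broker address, the transactions topic name, and the event fields are assumptions for illustration.

```python
# producer_sketch.py -- simulated transaction feed; broker/topic/fields are assumptions
import json
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

while True:
    event = {
        "customer_id": random.randint(1, 500),
        "total_spend": round(random.uniform(10.0, 500.0), 2),
        "items_purchased": random.randint(1, 10),
        "ts": time.time(),
    }
    producer.send("transactions", event)  # hypothetical topic name
    time.sleep(1)  # one simulated transaction per second
```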
Same as before, except that the "preprocessing part" will be done by the consumers. To sum up:
- Start Kafka: Ensure it's running with topics created.
- Run Kafka Producer: Simulate or generate real-time transactions.
- Start Stream Processing: Use Spark Streaming consumers to process real-time data (see the sketch after this list).
- Store Processed Data: Verify data is stored in Hadoop HDFS or PostgreSQL.
- Generate Visual Reports: Utilize visualization tools for real-time insights.
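To tie the steps together, here is a minimal Spark Structured Streaming consumer matching the producer sketch above. The schema, broker, topic, anomaly threshold, and HDFS paths are all assumptions, and running it requires the spark-sql-kafka connector package on the Spark classpath; writing to PostgreSQL instead would typically go through foreachBatch with a JDBC sink.

```python
# consumer_sketch.py -- illustrative Structured Streaming job; all names are assumptions
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

spark = SparkSession.builder.appName("TransactionStream").getOrCreate()

schema = StructType([
    StructField("customer_id", IntegerType()),
    StructField("total_spend", DoubleType()),
    StructField("items_purchased", IntegerType()),
    StructField("ts", DoubleType()),
])

# Read raw JSON events from the hypothetical 'transactions' topic
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "transactions")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("e"))
          .select("e.*"))

# Flag simple anomalies: unusually large orders (threshold is an assumption)
flagged = events.withColumn("anomaly", F.col("total_spend") > 400)

# Persist to HDFS as Parquet for downstream analysis and visualization
query = (flagged.writeStream.format("parquet")
         .option("path", "hdfs:///streams/transactions")
         .option("checkpointLocation", "hdfs:///streams/checkpoints")
         .start())
query.awaitTermination()
```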
By combining batch and real-time approaches, this project provides a comprehensive understanding of both historical and dynamic e-commerce customer behavior. And that's it :D