Unlock valuable insights and improve decision-making with advanced data analysis and predictive modeling using the Online Retail Dataset.
This project demonstrates an end-to-end data science workflow using the Online Retail Dataset. From cleaning messy data to forecasting sales and segmenting customers, the project combines statistical analysis, machine learning, and visualization to deliver actionable insights.
- Data Cleaning: Prepare raw data for reliable analysis.
- EDA: Identify trends in sales and customer behavior.
- Forecasting: Predict future sales with time-series models.
- Customer Segmentation: Classify customers based on purchase behavior.
- Interactive Dashboard: Visualize trends and predictions with Shiny.
Source: UCI Machine Learning Repository
Content:
- InvoiceNo: Unique transaction code.
- StockCode: Product identifier.
- Description: Product description.
- Quantity: Number of products purchased.
- InvoiceDate: Date and time of transaction.
- UnitPrice: Price per unit of product.
- CustomerID: Unique customer ID.
- Country: Customer location.
| Category | Tools Used |
|---|---|
| Languages | R |
| Visualization | ggplot2, plotly, shiny |
| Data Wrangling | dplyr, janitor, tidyverse, lubridate |
| Forecasting | forecast, prophet |
| Customer Analysis | caret, cluster, factoextra |
- Data Cleaning
  - Handle missing values and invalid data.
  - Generate new features like total revenue.
- Exploratory Data Analysis (EDA)
  - Visualize trends, top products, and country-wise sales.
- Sales Forecasting
  - Use ARIMA and Prophet for predicting future revenue.
- Customer Segmentation
  - Perform RFM analysis to classify customers into loyalty tiers.
- Interactive Dashboard (Optional)
  - Build a Shiny app to showcase key insights dynamically.
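The data cleaning step can be sketched in base R as follows. This is a minimal illustration, not the project's actual script: the toy rows, the `clean_transactions` helper, and the specific filtering rules (drop rows without a `CustomerID`, drop non-positive quantities and prices) are assumptions chosen to match the column list above.

```r
# Toy stand-in for the raw transactions (values are made up for illustration).
raw <- data.frame(
  InvoiceNo  = c("536365", "536366", "C536367", "536368", "536369"),
  StockCode  = c("85123A", "71053", "84406B", "22752", "21730"),
  Quantity   = c(6, 8, -2, 3, 4),
  UnitPrice  = c(2.55, 3.39, 2.75, 0, 4.25),
  CustomerID = c(17850, NA, 17850, 13047, 13047),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the filtering rules are assumptions, not the
# project's confirmed logic.
clean_transactions <- function(df) {
  # Keep rows with a known customer and strictly positive quantity and price.
  keep <- !is.na(df$CustomerID) & df$Quantity > 0 & df$UnitPrice > 0
  keep[is.na(keep)] <- FALSE
  out <- df[keep, , drop = FALSE]
  # Derived feature: revenue per line item.
  out$Revenue <- out$Quantity * out$UnitPrice
  rownames(out) <- NULL
  out
}

cleaned <- clean_transactions(raw)
print(cleaned)
```

Whether returns (negative quantities) should be dropped or netted against sales depends on the analysis; dropping them is simply the easiest choice for a sketch.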
Exploratory Data Analysis (EDA) was conducted to uncover insights into sales performance, top products, and revenue trends over time.
The chart below highlights the top-performing countries by total revenue, with the United Kingdom contributing the most significantly.
The following plot shows the top 10 products that generated the highest revenue.
The product "REGENCY CAKESTAND 3 TIER" dominates as the top-selling item.
This time-series plot displays the monthly revenue trends throughout the year.
There is a noticeable increase in revenue during the last quarter, followed by a steep drop, likely indicating incomplete December data.
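One simple way to check whether a drop in the final month reflects incomplete data is to report, alongside monthly revenue, how many distinct days each month actually covers. The sketch below uses made-up transactions and assumes a `Revenue` column has already been derived; the column names `Month`, `Day`, and `ActiveDays` are illustrative.

```r
# Made-up transactions; Revenue is assumed to be Quantity * UnitPrice.
tx <- data.frame(
  InvoiceDate = as.POSIXct(c("2011-10-03 10:15:00", "2011-10-20 14:30:00",
                             "2011-11-05 09:00:00", "2011-11-18 16:00:00",
                             "2011-12-01 12:45:00"), tz = "UTC"),
  Revenue = c(120.5, 80, 300.25, 100, 45)
)

tx$Month <- format(tx$InvoiceDate, "%Y-%m")
tx$Day   <- format(tx$InvoiceDate, "%Y-%m-%d")

# Total revenue and number of distinct trading days per month.
monthly_revenue <- tapply(tx$Revenue, tx$Month, sum)
active_days     <- tapply(tx$Day, tx$Month, function(d) length(unique(d)))

monthly <- data.frame(
  Month      = names(monthly_revenue),
  Revenue    = as.numeric(monthly_revenue),
  ActiveDays = as.integer(active_days)
)
print(monthly)
```

A month with far fewer active days than its neighbours is a candidate for exclusion or rescaling before it is compared with complete months.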
The distribution of recency (days since a customer's last purchase) shows how recently customers have been active.
Key Insights:
- Majority of Customers are Inactive:
  - A significant portion of customers have not made a purchase in a long time, indicating potential churn.
- Skewed Distribution:
  - Most customers have high recency values (inactive), while very few have low recency values (recent activity).
- Retention Opportunity:
  - Customers with high recency values can be targeted with re-engagement campaigns to revive interest.
  - Recent customers should be incentivized to ensure continued loyalty.
The map below visualizes revenue distribution across countries.
Countries with higher revenue are highlighted in darker shades. The United Kingdom and Australia dominate in terms of revenue contribution.
Key Insights:
- Revenue is concentrated in a few major markets, primarily in the UK, followed by other European countries and Australia.
- Emerging opportunities may exist in countries with lower revenue contributions.
The plot below shows the total revenue generated across different hours of the day.
Key Insights:
- Sales activity peaks between 10 AM and 3 PM, with the highest revenue observed around midday.
- Early mornings and evenings have lower sales volumes, suggesting focused business hours.
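The hourly view can be reproduced by extracting the hour from `InvoiceDate` and summing revenue per hour. The following is a small sketch on made-up data; the `Revenue` and `Hour` columns are assumed names.

```r
# Made-up transactions for illustration only.
tx <- data.frame(
  InvoiceDate = as.POSIXct(c("2011-03-01 08:10:00", "2011-03-01 12:05:00",
                             "2011-03-02 12:40:00", "2011-03-02 15:20:00"),
                           tz = "UTC"),
  Revenue = c(50, 200, 150, 90)
)

# Hour of day (0-23) taken from the timestamp.
tx$Hour <- as.integer(format(tx$InvoiceDate, "%H"))

# Total revenue per hour and the hour with the highest total.
hourly    <- aggregate(Revenue ~ Hour, data = tx, FUN = sum)
peak_hour <- hourly$Hour[which.max(hourly$Revenue)]

print(hourly)
print(peak_hour)
```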
The following chart highlights the top 10 products in terms of quantity sold.
"JUMBO BAG RED RETROSPOT" is the most sold product, followed by "WORLD WAR 2 GLIDERS ASSTD DESIGNS".
Key Insights:
- Top-selling products by quantity differ from top revenue-generating products.
- Lower-priced products may dominate in quantity, while premium products contribute more to total revenue.
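The difference between the two rankings is easy to see on a toy example. The product names and prices below are placeholders, not items from the dataset; the sketch only shows that ranking by summed quantity and ranking by summed revenue can disagree.

```r
# Placeholder products: a cheap high-volume item and a premium low-volume item.
items <- data.frame(
  Description = c("CHEAP ITEM", "CHEAP ITEM", "PREMIUM ITEM", "MID ITEM"),
  Quantity    = c(100, 50, 10, 30),
  UnitPrice   = c(0.5, 0.5, 12, 2),
  stringsAsFactors = FALSE
)
items$Revenue <- items$Quantity * items$UnitPrice

# Sum quantity and revenue per product.
by_product <- aggregate(cbind(Quantity, Revenue) ~ Description,
                        data = items, FUN = sum)

top_by_quantity <- as.character(
  by_product$Description[order(-by_product$Quantity)][1])
top_by_revenue <- as.character(
  by_product$Description[order(-by_product$Revenue)][1])

print(by_product)
```

Here the cheap item leads on units sold while the premium item leads on revenue, which is the pattern described in the insights above.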
- Revenue Distribution:
  - The United Kingdom is the leading country in revenue generation.
  - Seasonal peaks are observed during the last quarter.
- Product Analysis:
  - "REGENCY CAKESTAND 3 TIER" generates the most revenue.
  - "JUMBO BAG RED RETROSPOT" is the most frequently purchased product.
- Customer Behavior:
  - Sales activity peaks during midday hours.
  - Recency analysis suggests opportunities for re-engagement campaigns to target inactive customers.
- Geographical Insights:
  - Revenue is concentrated in key markets like the UK and Australia, with untapped potential in other regions.
Feature Engineering transforms raw data into meaningful insights and improves the performance of machine learning models. The following steps were performed:
- Recency: Days since the customer's last purchase.
- Frequency: Number of transactions per customer.
- Monetary Value: Total revenue generated by each customer.
- Extracted features like hour, day, week, and month from the purchase date to capture temporal purchasing patterns.
- Average Revenue per Order
- Total Revenue per Customer
- Average Quantity Sold per Transaction
- Applied K-Means Clustering to segment customers based on RFM scores, identifying high-value and low-value groups.
These engineered features enable better analysis, improved model performance, and actionable customer insights.
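The RFM features listed above can be computed per customer roughly as follows. This is a sketch on made-up data. Two choices here are assumptions rather than the project's confirmed method: the snapshot date is set to the day after the last transaction, and frequency counts distinct invoices rather than line items.

```r
# Made-up line items; two rows share invoice "A1".
tx <- data.frame(
  CustomerID  = c(1, 1, 1, 2, 2, 3),
  InvoiceNo   = c("A1", "A1", "A2", "B1", "B2", "C1"),
  InvoiceDate = as.Date(c("2011-11-01", "2011-11-01", "2011-12-01",
                          "2011-06-15", "2011-09-10", "2011-01-20")),
  Revenue     = c(20, 30, 50, 10, 15, 200),
  stringsAsFactors = FALSE
)

# Assumption: recency is measured from the day after the last transaction.
snapshot <- max(tx$InvoiceDate) + 1

# Dates are handled as day counts because tapply() drops the Date class.
last_purchase <- tapply(as.numeric(tx$InvoiceDate), tx$CustomerID, max)
frequency     <- tapply(tx$InvoiceNo, tx$CustomerID,
                        function(x) length(unique(x)))
monetary      <- tapply(tx$Revenue, tx$CustomerID, sum)

rfm <- data.frame(
  CustomerID = as.numeric(names(last_purchase)),
  Recency    = as.numeric(snapshot) - as.numeric(last_purchase),
  Frequency  = as.integer(frequency),
  Monetary   = as.numeric(monetary)
)

# Derived metric: average revenue per order.
rfm$AvgOrderValue <- rfm$Monetary / rfm$Frequency
print(rfm)
```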
The K-Means clustering algorithm was applied to segment customers based on their RFM features:
- Recency: Time since last purchase.
- Frequency: Total number of purchases.
- Monetary Value: Total revenue generated.
The resulting clusters are visualized below:
- Cluster 1 (Red): High-value and frequent buyers.
- Cluster 2 (Green): Customers with high monetary value but low frequency.
- Cluster 3 (Cyan): Regular customers with moderate activity.
- Cluster 4 (Purple): Infrequent and low-value customers.
Purpose:
This segmentation helps businesses identify customer behaviors, enabling targeted marketing strategies and resource optimization.
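A minimal sketch of K-Means on RFM features, using base R's `kmeans()`, is shown below. The data are synthetic: four well-separated groups are generated to loosely mirror the four cluster descriptions above, and the group means, the seed, and `nstart = 50` are all illustrative assumptions. Cluster numbers assigned by `kmeans()` are arbitrary and will not necessarily match the numbering in the list.

```r
set.seed(42)
n <- 15  # synthetic customers per group

# Hypothetical helper that draws one group of customers around given means.
make_group <- function(recency, frequency, monetary) {
  data.frame(
    Recency   = rnorm(n, recency, 3),
    Frequency = rnorm(n, frequency, 0.5),
    Monetary  = rnorm(n, monetary, 100)
  )
}

rfm <- rbind(
  make_group(10, 30, 5000),   # high-value, frequent
  make_group(60, 3, 4000),    # high monetary value, low frequency
  make_group(40, 12, 1500),   # regular, moderate activity
  make_group(250, 2, 150)     # infrequent, low value
)

# Standardize so no single feature dominates the distance calculation.
rfm_scaled <- scale(rfm)

# Multiple random starts reduce the risk of a poor local optimum.
km <- kmeans(rfm_scaled, centers = 4, nstart = 50)
rfm$Cluster <- km$cluster

# Mean RFM values per cluster, used to label the segments.
profile <- aggregate(cbind(Recency, Frequency, Monetary) ~ Cluster,
                     data = rfm, FUN = mean)
print(profile)
```

Scaling matters here because monetary values are orders of magnitude larger than frequency counts; without it the clusters would be driven almost entirely by revenue.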
We applied ARIMA (AutoRegressive Integrated Moving Average) to predict hourly sales trends for better inventory and revenue planning. The baseline model was ARIMA(3,1,3)(0,0,1)[24]: three autoregressive terms, first-order differencing, three moving-average terms, and one seasonal moving-average term with a 24-hour period.
Steps:
- Prepared hourly revenue data and tested for stationarity.
- Optimized ARIMA model parameters using AIC for best fit.
- Forecasted future hourly sales with confidence intervals (80% and 95%) to account for uncertainty.
- Validated the model using residual analysis and Ljung-Box test.
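The steps above can be sketched with base R's `arima()`, `predict()`, and `Box.test()`. This is an illustration under several assumptions: the hourly series is simulated rather than taken from the dataset, the model order is a simpler ARIMA(1,0,0)(1,0,0)[24] chosen so that it matches the simulated process (not the ARIMA(3,1,3)(0,0,1)[24] reported above), and the intervals are built from the normal approximation to the forecast standard errors.

```r
set.seed(123)

# Simulated hourly revenue: AR(1) combined with a 24-hour seasonal AR(1).
# The coefficients expand (1 - 0.5B)(1 - 0.6B^24).
ar_coefs <- c(0.5, rep(0, 22), 0.6, -0.3)
sim <- arima.sim(model = list(ar = ar_coefs), n = 24 * 30)
y <- ts(100 + 10 * as.numeric(sim), frequency = 24)

# Fit a seasonal ARIMA by maximum likelihood (order is an assumption).
fit <- arima(y, order = c(1, 0, 0),
             seasonal = list(order = c(1, 0, 0), period = 24),
             method = "ML")

# Forecast the next 24 hours with 80% and 95% intervals.
fc  <- predict(fit, n.ahead = 24)
z80 <- qnorm(0.90)
z95 <- qnorm(0.975)
forecast_tbl <- data.frame(
  mean = as.numeric(fc$pred),
  lo80 = as.numeric(fc$pred - z80 * fc$se),
  hi80 = as.numeric(fc$pred + z80 * fc$se),
  lo95 = as.numeric(fc$pred - z95 * fc$se),
  hi95 = as.numeric(fc$pred + z95 * fc$se)
)

# Ljung-Box test on residuals; fitdf = number of estimated ARMA coefficients.
lb <- Box.test(residuals(fit), lag = 24, type = "Ljung-Box", fitdf = 2)

print(head(forecast_tbl))
print(lb$p.value)
```

A large Ljung-Box p-value means the test finds no evidence of remaining autocorrelation in the residuals; it does not by itself say the forecasts are accurate.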
Results:
- Forecast Plot: Displayed actual vs. predicted sales with confidence intervals.
- Residual Analysis: The residuals appear random with no significant autocorrelation, suggesting a good model fit.
- Mean Error (ME): 11.77
- Root Mean Squared Error (RMSE): 3039.78
- Mean Absolute Error (MAE): 1843.33
- Mean Absolute Percentage Error (MAPE): 336.65%
- Ljung-Box Test (p-value): 0.3741 (residuals show no significant autocorrelation).
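For reference, the four error metrics can be computed as in the sketch below. The `forecast_accuracy` helper and the numbers are illustrative; errors are defined as actual minus predicted, and zero actuals are excluded from MAPE (both are assumptions about conventions). The example also shows how a single hour with very low actual revenue can push MAPE far above what RMSE and MAE suggest, which is one possible explanation for a MAPE well over 100% on hourly data.

```r
# Hypothetical helper computing ME, RMSE, MAE, and MAPE.
forecast_accuracy <- function(actual, predicted) {
  err <- actual - predicted
  nonzero <- actual != 0  # MAPE is undefined where the actual value is zero
  c(
    ME   = mean(err),
    RMSE = sqrt(mean(err^2)),
    MAE  = mean(abs(err)),
    MAPE = 100 * mean(abs(err[nonzero] / actual[nonzero]))
  )
}

# Made-up values: the third hour has very low actual revenue.
actual    <- c(100, 200, 5, 400)
predicted <- c(110, 190, 20, 380)

metrics <- forecast_accuracy(actual, predicted)
print(round(metrics, 2))
```

In this toy case the low-revenue hour alone contributes a 300% error, so the overall MAPE is 80% even though the other three forecasts are within 10%.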
Plots:
- Interactive Hourly Sales Forecast: Shows the actual values vs. the forecasted values along with confidence intervals.
- Residual Diagnostics: Checks whether the residuals are approximately normally distributed and uncorrelated, which supports the adequacy of the model.
We also used the Prophet model for time-series forecasting to analyze and predict sales trends. The results include:
- Trend Analysis:
  - The sales trend showed a decline at the beginning of the year.
  - Recovery began mid-2011, with a sharp upward trend toward the end of the year, likely due to seasonal effects or increased demand.
- Weekly Seasonality:
  - Sales peaked on Fridays and midweek, indicating higher purchasing activity during weekdays.
  - The lowest sales occurred on Sundays and Saturdays, reflecting reduced business activity on weekends.
- Forecast Plot:
  - Forecasted future values with confidence intervals were generated to visualize expected sales performance.
- The Prophet model successfully captured both long-term trends and weekly seasonality in the sales data.
- Plots provide actionable insights into purchasing patterns and help validate the current forecast.
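A Prophet fit with weekly seasonality might look like the sketch below. The daily series is synthetic (weekdays high, weekends low), and the seasonality switches and 30-day horizon are assumptions, not the project's settings. The model is only fitted when the `prophet` package is installed, so the data-preparation part runs either way.

```r
set.seed(1)

# Synthetic daily revenue with a weekday/weekend pattern.
dates   <- seq(as.Date("2011-01-03"), by = "day", length.out = 120)
wday    <- as.POSIXlt(dates)$wday          # 0 = Sunday, ..., 6 = Saturday
weekend <- wday %in% c(0, 6)

# Prophet expects a data frame with columns `ds` (date) and `y` (value).
daily <- data.frame(
  ds = dates,
  y  = ifelse(weekend, 200, 1000) + rnorm(120, sd = 50)
)

if (requireNamespace("prophet", quietly = TRUE)) {
  m <- prophet::prophet(daily,
                        weekly.seasonality = TRUE,
                        yearly.seasonality = FALSE,
                        daily.seasonality  = FALSE)
  future <- prophet::make_future_dataframe(m, periods = 30)
  fc <- predict(m, future)
  # yhat is the point forecast; yhat_lower/yhat_upper bound the interval.
  print(tail(fc[, c("ds", "yhat", "yhat_lower", "yhat_upper")]))
}
```

`prophet::prophet_plot_components(m, fc)` can then be used to inspect the fitted trend and weekly pattern separately.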
- Model Optimization: Fine-tune the Prophet model parameters to improve forecast accuracy.
- Additional Seasonality: Incorporate monthly or holiday seasonality to capture short-term variations.
- Model Comparison: Evaluate alternative models such as SARIMA or machine learning-based regressors for better performance.
- Validation: Assess model performance using error metrics like RMSE, MAE, and MAPE.
This process ensures a robust, data-driven approach to sales forecasting and trend analysis.
- Further refine the model using additional seasonal components or deep learning techniques (LSTM).
- Explore ensemble methods (e.g., SARIMA + XGBoost) to improve forecast accuracy.
- Scale forecasting to include monthly or weekly sales trends.
- Clone the repository:
git clone https://github.com/yasirusama61/online-retail-analysis.git
- Install required R packages:
install.packages(c("tidyverse", "janitor", "lubridate", "ggplot2", "readxl", "forecast", "prophet"))
- Run the scripts: Navigate to the notebooks/ folder to execute individual analysis steps. For the Shiny dashboard, open and run shiny_dashboard/app.R.
This project was a collaborative effort aimed at analyzing and forecasting sales data using machine learning techniques and statistical models. The following contributions were made:
- Data Cleaning and Preprocessing:
  - Removed missing values and invalid records.
  - Normalized and aggregated data to prepare it for analysis.
- Exploratory Data Analysis (EDA):
  - Analyzed sales trends (hourly and monthly).
  - Identified top-selling products, revenue-generating countries, and customer behavior.
  - Visualized insights using interactive and static plots.
- Feature Engineering:
  - Created RFM (Recency, Frequency, Monetary) features for customer segmentation.
  - Extracted time-based features to capture sales patterns (hour, month, etc.).
  - Generated derived metrics like average revenue per order and demand metrics.
- Customer Segmentation:
  - Performed clustering (K-Means) on RFM features to segment customers into actionable groups.
- Forecasting:
  - Implemented ARIMA and LSTM models for time-series sales forecasting.
  - Evaluated model performance using metrics such as RMSE, MAE, and residual diagnostics.
- Visualizations:
  - Built informative plots to communicate key insights, including:
    - Revenue trends
    - Residual diagnostics
    - Customer segmentation
  - Used tools like ggplot2 and plotly for interactive data visualizations.
- Online Retail Dataset:
  - Source: UCI Machine Learning Repository - Online Retail Dataset
  - Description: A transactional dataset of a UK-based online retailer containing purchase details between 2010 and 2011.
- R Packages:
  - tidyverse: For data manipulation and cleaning. Link: tidyverse.org
  - lubridate: For date-time handling. Link: lubridate on CRAN
  - janitor: For cleaning column names and data. Link: janitor on CRAN
  - forecast: For ARIMA time-series modeling. Link: forecast on CRAN
  - keras and tensorflow: For building and training LSTM deep learning models. Link: Keras for R
- Clustering Algorithms:
  - K-Means: For customer segmentation based on RFM features. Link: K-Means Clustering
- Time-Series Forecasting Concepts:
  - ARIMA: AutoRegressive Integrated Moving Average Model.
  - LSTM: Long Short-Term Memory Model for sequential data.