Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied. You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.
- Problem Statement and Project Description
- Project Files Description
- Goal
- Dataset Information
- Exploratory Data Analysis
- Random Forest Model
- Technologies Used
This project contains the following files and resources:
- Rossmann Sales Prediction - Capstone Project.ipynb - Google Colab notebook containing the data summary, exploration, visualisations, modeling, hyperparameter tuning, model performance evaluation and conclusion.
- Data & Resources link : https://drive.google.com/drive/folders/1qnxqMxy8_gI-siwVhUaQOBW1QbM_07g0
Demand for a product changes over time. No business can plan its financial growth without accurately assessing customer interest and the future demand for its products. Sales forecasting is the process of estimating the demand for, or sales of, a particular product over a specific period of time. This project tackles a real-world sales forecasting problem and builds a machine learning model for it.
Our goal is to forecast each store's sales for the next six weeks, identify the factors that influence those sales, and recommend ways to improve them.
Features in the dataset: Most of the fields are self-explanatory. The following are descriptions for those that aren't.
- Id - an Id that represents a (Store, Date) pair within the test set
- Store - a unique Id for each store
- Sales - the turnover for any given day (this is what you are predicting)
- Customers - the number of customers on a given day
- Open - an indicator for whether the store was open: 0 = closed, 1 = open
- StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
- SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
- StoreType - differentiates between 4 different store models: a, b, c, d
- Assortment - describes an assortment level: a = basic, b = extra, c = extended
- CompetitionDistance - distance in meters to the nearest competitor store
- CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
- Promo - indicates whether a store is running a promo on that day
- Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
- Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
- PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
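As an illustration of how the `Promo2` and `PromoInterval` fields could be turned into a model feature, here is a minimal sketch. The function name `promo2_starts_this_month` is hypothetical and not taken from the notebook. It only checks whether a date falls in one of the listed start months, ignores `Promo2Since[Year/Week]`, and assumes (defensively) that month names may be longer than three letters, for example "Sept".

```python
from datetime import date

# Three-letter month abbreviations, indexed by month number - 1.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def promo2_starts_this_month(day: date, promo2: int, promo_interval) -> bool:
    """Return True if a Promo2 round starts in the month of `day`.

    Hypothetical helper for illustration only. It does not check
    Promo2Since[Year/Week], so it cannot tell whether the store had
    already joined Promo2 on that date.
    """
    # Stores that do not participate, or have no interval, never match.
    if promo2 != 1 or not isinstance(promo_interval, str) or not promo_interval:
        return False
    # Compare on the first three letters so "Sept" is treated like "Sep".
    start_months = {name.strip()[:3] for name in promo_interval.split(",")}
    return MONTH_ABBR[day.month - 1] in start_months


print(promo2_starts_this_month(date(2015, 8, 3), 1, "Feb,May,Aug,Nov"))  # True
print(promo2_starts_this_month(date(2015, 7, 3), 1, "Feb,May,Aug,Nov"))  # False
```

A boolean feature like this can be computed per (Store, Date) row after joining the store attributes onto the daily sales records.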
Sales were highest on Mondays, probably because most shops are closed on Sundays, which had the lowest sales of the week. Store type b, although few in number, had the highest average sales. Likely reasons are that these stores carry all three assortment levels, including assortment level b, which is only available at type b stores, and that they are also open on Sundays. The outliers in the dataset showed justifiable behaviour: they either belonged to store type b or occurred while a promotion was running, which increased sales.
Store type b was open on all seven days of the week and had higher sales than any other store type. Promotions had a positive effect on sales across all store types.
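The day-of-week comparison above can be sketched as a simple group-and-average. The rows below are made-up toy values for illustration only, not figures from the dataset, and the `DayOfWeek` field (1 = Monday, 7 = Sunday) is an assumption about how the day is encoded. The sketch shows why closed days matter: averaging over all rows pulls Sunday down, while averaging over open stores only does not.

```python
from collections import defaultdict
from statistics import mean

# Made-up toy rows for illustration only: (DayOfWeek, Open, Sales).
rows = [
    (1, 1, 8200), (1, 1, 7900),
    (2, 1, 6100), (2, 1, 6400),
    (7, 0, 0), (7, 0, 0), (7, 1, 9000),
]


def mean_sales_by_day(rows, open_only=False):
    """Average Sales per DayOfWeek, optionally skipping closed-store rows."""
    buckets = defaultdict(list)
    for day, is_open, sales in rows:
        if open_only and not is_open:
            continue
        buckets[day].append(sales)
    return {day: mean(values) for day, values in sorted(buckets.items())}


print(mean_sales_by_day(rows))                  # averages over all rows
print(mean_sales_by_day(rows, open_only=True))  # averages over open stores only
```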
Random forest is a supervised learning algorithm. It creates a "forest" out of an ensemble of decision trees, which are commonly trained using the "bagging" method. The bagging method's basic premise is that combining different learning models improves the overall output. Simply said, random forest combines many decision trees to produce a more accurate and stable prediction.
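Furthermore, random forests are efficient to train, can handle a large number of input variables, and tend to give accurate predictions without extensive tuning. Since the target here is a continuous sales value, the model is used as a regressor rather than a classifier. It is a strong tool that requires very little code to implement with a standard machine learning library.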
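To make the bagging idea concrete, here is a minimal pure-Python sketch. It is not the notebook's model and not a full random forest: it bags one-split regression "stumps" on a single feature and skips the per-split feature subsampling that a random forest adds. All function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression stump that minimises squared error."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so both sides of the split are always non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: predict the mean everywhere
        overall = mean(ys)
        return (xs[0], overall, overall)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged(xs, ys, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x):
    """Bagged prediction: the average of the individual stump predictions."""
    return mean(predict_stump(stump, x) for stump in stumps)


# Toy data for illustration only: the target grows linearly with the feature.
xs = list(range(20))
ys = [100 + 10 * x for x in xs]
stumps = fit_bagged(xs, ys)
print(predict_bagged(stumps, 0), predict_bagged(stumps, 19))
```

Each stump alone can only output two values, but because every stump sees a different bootstrap sample and picks a slightly different split, their average is smoother and more stable than any single one.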
The XGB Regressor model (XGBRegressor) is the regression interface of XGBoost, an optimized implementation of gradient boosting. It is particularly useful for large datasets and high-dimensional data, and is often used in Kaggle competitions and other machine learning challenges. The model uses an ensemble of decision trees as its base learners. Trees are added one at a time, and each new tree is fit using the gradient of the loss function with respect to the current predictions, so the loss is reduced step by step. XGBoost is a powerful library that supports both regression and classification tasks.
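For reference, one step of generic gradient boosting can be written as below. This is the textbook formulation, not something taken from the notebook, and XGBoost itself adds regularisation and second-order gradient information on top of it. The symbols are assumptions introduced here: $F_m$ is the ensemble after $m$ trees, $h_m$ is the new tree, $L$ is the loss, and $\eta$ is the learning rate.

```latex
\begin{aligned}
r_i^{(m)} &= -\left.\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right|_{F = F_{m-1}} \\
h_m &= \arg\min_{h} \sum_{i=1}^{n} \bigl(r_i^{(m)} - h(x_i)\bigr)^2 \\
F_m(x) &= F_{m-1}(x) + \eta \, h_m(x)
\end{aligned}
```

For the squared-error loss $L = \tfrac{1}{2}\bigl(y - F(x)\bigr)^2$, the negative gradient is simply the residual $r_i^{(m)} = y_i - F_{m-1}(x_i)$, so each new tree is trained to predict what the current ensemble still gets wrong.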
In this case, the Random Forest model has a Test_R2 score of 0.9527, which is 3.49% higher in relative terms than the Decision Tree model's score of 0.9206. This suggests that the Random Forest model makes better predictions than the Decision Tree model.
The tuned XGB Regressor model has a Test_R2 score of 0.955427, which is 0.29% higher in relative terms than the Random Forest model's score of 0.9527. This suggests that the tuned XGB Regressor makes slightly better predictions than the Random Forest model. However, the difference is small and may not matter for all use cases. Other performance metrics and the trade-offs between the models should therefore be evaluated before deciding which one is best suited to a particular task.
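The percentages quoted above are relative improvements in Test_R2, not percentage-point differences. A small sketch of the arithmetic, using only the scores reported in this section (the helper name `relative_gain_pct` is hypothetical):

```python
def relative_gain_pct(new_score, old_score):
    """Relative improvement of new_score over old_score, in percent."""
    return (new_score - old_score) / old_score * 100


# Test_R2 scores as reported in this README.
decision_tree_r2 = 0.9206
random_forest_r2 = 0.9527
xgb_tuned_r2 = 0.955427

rf_vs_dt = relative_gain_pct(random_forest_r2, decision_tree_r2)
xgb_vs_rf = relative_gain_pct(xgb_tuned_r2, random_forest_r2)

print(round(rf_vs_dt, 2))   # 3.49
print(round(xgb_vs_rf, 2))  # 0.29
```

In absolute terms the gaps are about 0.0321 and 0.0027 in R2, which makes it easier to see how close the Random Forest and the tuned XGB Regressor are.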
Andrew Udell, 'Predicting E-Commerce Sales with Random Forest'. [Online].
ChatGPT. [Online].
Available: https://chat.openai.com/chat
Builtin.com, 'Random Forest'. [Online].
Available: https://builtin.com/data-science/random-forest-algorithm
Machine Learning Mastery, 'Random Forest for Time Series Prediction'. [Online].
Available: https://machinelearningmastery.com/random-forest-for-time-series-forecasting/
Mohd Zahid Ansari | Avid Learner | Data Scientist | Machine Learning Engineer | Deep Learning enthusiast
Contact me for Data Science Project Collaborations