
Group 6 Project 4: Sales Volume Forecasting

Project 4 Group Members:

  • Zachary Corbett
  • Victor Donstov
  • Sara Parveen
  • Set Z

Project Overview and Purpose of Data Model

  • For this project, we focused on improving the inventory planning process of an anonymous company.

  • This anonymous company acts as a product distributor by purchasing office products from manufacturers, holding them and then reselling them to retailers and end-users.

  • By analyzing their office products' purchases, sales and product details data, we created a data model that predicts future sales quantities for each product category.

  • Knowing how much will be sold of each product category in the future can help determine how much of each product category should be ordered or kept in stock at any given time.

  • Having a tool that assists in determining the ideal inventory levels at any given time is crucial to:

    • meet customer demand and ensure customer satisfaction
    • avoid having too much inventory which can lead to unnecessary storage and handling costs
    • avoid stockouts which can result in sales loss and/or fines from customers
    • maintain high profitability

Target Audience of Data Model

The target audiences for this model are:

  • the anonymous company whose data was analyzed as well as other distributors, manufacturers or retailers that depend on predicting future sales for goods procurement
  • all product sales planning, supply chain, and procurement professionals

Data Sources

  • The data used for this project comes in the form of CSV files obtained from the anonymous company.
  • The original data has been anonymized for the purposes of this project.
  • The CSVs contain five years' worth of data (2018 to 2022) for Purchases, Sales, and Product Details.

Data Hosting

Database Creation and Data Cleaning

  • To help set up the database, an ERD schema was created as shown below:

image

  • The schema was defined using PySpark SQL in the Databricks notebook: "Data Management Python"
  • Once the schema is defined, the raw CSV files are read in from the S3 Bucket and transformed to meet the Table requirements defined by the schema.
  • The ETL was performed using PySpark by completing the following steps (a minimal sketch of the grouping step is shown after this list):
    1. Loading and merging DataFrames
    2. Selecting and filtering the columns to be used
    3. Converting the columns to the appropriate data types
    4. Grouping the data to get yearly and monthly aggregates
  • The ETL process is documented in the Databricks notebook "ETL_Data_Cleaning" (an export of the ETL script), which can be viewed here.
  • The transformed data is then inserted into their respective Tables.
  • Due to the limitations of the free Community Edition tier of Databricks, the "Data Management Python" and "ETL_Data_Cleaning" Databricks notebooks are run whenever a new Cluster is connected.
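
Below is a minimal sketch of the grouping step, assuming hypothetical column names (`sale_date`, `product_category`, `quantity`), a placeholder S3 path, and the `spark` session that Databricks notebooks provide; it is not the exact code from the "ETL_Data_Cleaning" notebook.

```python
# Minimal sketch of the PySpark grouping step (hypothetical column names and path).
from pyspark.sql import functions as F

# Read one of the raw CSVs from the S3 bucket (placeholder path).
sales_df = spark.read.csv("s3://<bucket>/sales.csv", header=True, inferSchema=True)

monthly_volume = (
    sales_df
    .withColumn("sale_date", F.to_date("sale_date"))      # convert to a date type
    .withColumn("year", F.year("sale_date"))
    .withColumn("month", F.month("sale_date"))
    .groupBy("year", "month", "product_category")          # yearly/monthly aggregates
    .agg(F.sum("quantity").alias("total_quantity"))
    .orderBy("year", "month", "product_category")
)

# Insert the transformed data into its table.
monthly_volume.write.mode("overwrite").saveAsTable("monthly_sales_volume")
```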

Data Model Description

  • The model utilizes data retrieved via PySpark SQL.

We have used a time series model to make sales volume predictions. The machine learning model is called SARIMAX (Seasonal Auto-Regressive Integrated Moving Average with eXogenous factors), which is an extension of the ARIMA class of models.

For predicting future sales volumes of each product category, we applied the SARIMAX model, which takes into account past values (the autoregressive component) and seasonality patterns. The model includes the following parameters:

p - order of the autoregressive part
d - degree of first differencing involved
q - order of the moving average part
P, D, Q - the seasonal counterparts of p, d and q
s - length of the seasonal cycle in the data
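
As an illustration only, here is a minimal sketch of how these parameters map onto the statsmodels SARIMAX constructor; the specific values are placeholders, not the project's final parameters, and `monthly_volume_series` is an assumed pandas Series of monthly sales volume.

```python
# Sketch: mapping (p, d, q)(P, D, Q, s) onto statsmodels' SARIMAX constructor.
# The values below are placeholders, not the project's final parameters.
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(
    monthly_volume_series,         # assumed pandas Series of monthly sales volume
    order=(0, 1, 1),               # (p, d, q): AR order, differencing, MA order
    seasonal_order=(1, 1, 0, 12),  # (P, D, Q, s): seasonal terms, 12-month cycle
)
results = model.fit(disp=False)
print(results.summary())
```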

The following steps were completed to make predictions about overall sales volume using the SARIMAX model.

1. Identified the stationarity of the time series:

Based on the results of the Dickey-Fuller test, the p-value = 0.11034. As the p-value is greater than 0.05, we conclude that our time series is non-stationary.

Analyzing the ACF (AutoCorrelation Function) chart, we can suggest the parameter p = 0, since the autocorrelation coefficients decrease slowly over the lags; in other words, the autocorrelation is not significant.

image
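
A minimal sketch of this stationarity check with statsmodels, assuming `monthly_volume_series` is a pandas Series of monthly sales volume:

```python
# Augmented Dickey-Fuller test on the (assumed) monthly sales volume series.
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *_ = adfuller(monthly_volume_series)
print(f"ADF statistic: {adf_stat:.4f}, p-value: {p_value:.5f}")
# p-value > 0.05 -> the unit-root hypothesis cannot be rejected: the series is non-stationary.
```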

2. Suggested the initial parameters:

Since our series is not stationary, for further analysis we need to make the time series stationary, which was done using first-order differencing (lag = 1): y_diff(t) = y(t) - y(t-1).

After applying the differencing method, the p-value of the Dickey-Fuller test equals 0.00000, so we can conclude that the data is now stationary. We can also suggest the parameter q = 1, as the first lag shows a spike.

image
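
A minimal sketch of the differencing step and the repeated Dickey-Fuller test, under the same assumed `monthly_volume_series`:

```python
# First-order differencing: y_diff(t) = y(t) - y(t-1), then re-test for stationarity.
from statsmodels.tsa.stattools import adfuller

diff_series = monthly_volume_series.diff().dropna()
_, p_value, *_ = adfuller(diff_series)
print(f"p-value after differencing: {p_value:.5f}")  # expected to drop below 0.05
```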

We can see small spikes at lags 12 and 24 on the ACF (AutoCorrelation Function), so P can be set to 1. Decomposing the time series reveals a seasonal pattern. We use s = 12 because we are using monthly sales volume data for the predictions.

image
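
A minimal sketch of this seasonal inspection (ACF plot of the differenced series and a classical decomposition), again assuming `monthly_volume_series`:

```python
# Sketch: ACF of the differenced series (spikes at lags 12 and 24 suggest P = 1)
# and a classical decomposition confirming the 12-month seasonality.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.seasonal import seasonal_decompose

plot_acf(monthly_volume_series.diff().dropna(), lags=24)
seasonal_decompose(monthly_volume_series, model="additive", period=12).plot()
plt.show()
```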

3. Generated the final parameters for the model:

In order to find the best parameters for our model, we generated different combinations of the parameters (p, d, q, P, D, Q). The parameters were chosen based on the lowest AIC (Akaike's Information Criterion) score.

image
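
A minimal sketch of such a grid search, assuming `train_series` is a pandas Series of monthly sales volume used for fitting:

```python
# Sketch of a small grid search over (p, d, q)(P, D, Q, 12), keeping the lowest-AIC fit.
import itertools
from statsmodels.tsa.statespace.sarimax import SARIMAX

best_aic, best_params = float("inf"), None
for order in itertools.product(range(2), range(2), range(2)):
    for seasonal in itertools.product(range(2), range(2), range(2)):
        try:
            fit = SARIMAX(train_series, order=order,
                          seasonal_order=(*seasonal, 12)).fit(disp=False)
        except Exception:
            continue  # skip combinations that fail to converge
        if fit.aic < best_aic:
            best_aic, best_params = fit.aic, (order, (*seasonal, 12))

print(f"Lowest AIC: {best_aic:.2f} for parameters {best_params}")
```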

4. Ran the SARIMAX model:

The SARIMAX model generated monthly sales volume predictions for the next 12-month period. The predictions for the overall sales volume have a Mean Absolute Percentage Error (MAPE) of 12.72%.

image
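
A minimal sketch of this forecasting step: training on 2018-2021, forecasting the next 12 months, and scoring against the held-out 2022 data with MAPE. `monthly_volume_series` is an assumed pandas Series with a monthly DatetimeIndex, and the order/seasonal_order values are placeholders.

```python
# Sketch: train on 2018-2021, forecast 12 months, score against held-out 2022 data.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

train = monthly_volume_series.loc["2018":"2021"]
test = monthly_volume_series.loc["2022"]

fit = SARIMAX(train, order=(0, 1, 1), seasonal_order=(1, 1, 0, 12)).fit(disp=False)
forecast = fit.forecast(steps=12)  # next 12 monthly predictions

mape = np.mean(np.abs((test.values - forecast.values) / test.values)) * 100
print(f"Mean Absolute Percentage Error: {mape:.2f}%")
```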

The analysis of the residuals suggests that the model fits the data well.

image
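
A minimal sketch of the residual check, reusing the fitted results object (`fit`) from the previous sketch:

```python
# Residual diagnostics: standardized residuals, histogram + KDE, Q-Q plot, correlogram.
import matplotlib.pyplot as plt

fit.plot_diagnostics(figsize=(12, 8))  # `fit` is the fitted SARIMAX results object
plt.show()
```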

Attempt to Optimize the Model

The charts in the section above are based on data from all 10 product categories. Similar steps were performed for each individual product category and can be viewed in the Images folder. The visualizations showing the analysis and predictions are based on the optimized model. Optimization was attempted by selecting different parameters in order to improve the accuracy.

The results of the first attempt at implementing the model can be viewed in the Images folder.

The following table shows a comparison of the Mean Absolute Percentage Errors for the two attempts:

image

We tried running the model on Databricks but ran into formatting errors when creating one of the charts, so we primarily used Jupyter Notebook for running the model and creating the charts. The two model-optimization attempts in Jupyter Notebook can be accessed in the Notebook_Jupyter folder, and the Databricks attempt can be accessed in the Notebooks_Databricks folder.

Data Model Results

  • The model was trained on the first four years of data (2018-2021). This time series model produces an equation which was then tested against the 2022 data and used to predict future sales volumes.
  • Although the model generated good results for overall sales, it did not achieve good accuracy for some product categories, even after optimization.

Visualizations and Dashboard Views

  • Databricks was used to create both the visualizations and the dashboards.
  • Two dashboards were created: one for the historical data (2018-2022), and a second for the predicted sales data (2023) extracted from the SARIMAX model.

Dashboard 1: Historical data (2018-2022)

  • This dashboard shows visualizations related to historical inventory sales volume, profits and costs.

  • The HTML version of the Databricks dashboard can be accessed here. (Please note that this version does not display the filters which are visible and functional on the local file).

  • Here are screenshots of the dashboard:

    Overall view of dashboard:

    image

    Word Cloud Chart #1 is based on Product Category Sales Volume

    image

    Word Cloud Chart #2 is based on Monthly Sales Volume

    image

    Revenue Breakdown (Stacked Bar graph)

    image

    Total Profit by Month (Stacked Bar graph)

    image

    Total Profit Table

    image

    Total Sales Volume by Month (Area graph)

    image

    Total Sales Volume Table

    image

    Sales Volume vs Profit (Bubble Size = Cost) (Bubble Chart)

    image

    Profit vs Cost Combo Chart (Bar graph and Line graph)

    image

Dashboard 2: Predictions data (2023)

  • This dashboard shows visualizations related to predicted sales volume, margin error and mean absolute percentage error.

  • The HTML version of the Databricks dashboard can be accessed here. (Please note that this version does not display the filters which are visible and functional on the local file).
  • Here are screenshots of the dashboard:

    Overall view of dashboard:

    image

    Predicted Sales Quantity Table

    image

    Margin Error Table

    image

    Mean Absolute Percentage Error Table

    image

    Mean Absolute Percentage Error vs AIC Scores (Bubble Size = Predicted Sales Quantity) (Bubble Chart)

    image

Limitations and Assumptions

1- There are 134 different product SKUs within the original dataset which can be classified into 10 unique product categories. For the purposes of our analysis and predictive model, we decided to focus on the 10 unique product categories to avoid scope creep.

2- We have assumed that we have unlimited warehouse space to store the products in order to meet the sales demand.

3- We have assumed we have unlimited resources and budget to order the ideal sales volume.

4- We assumed there are no minimum order quantities.

5- The time period covered by the raw data includes the early years of the COVID-19 global pandemic. This introduced a significant number of external factors which could be considered as outliers or could contribute to a higher number of outliers within the dataset.

Challenges

1- We used Databricks for ETL which truncates the data to 10,000 rows. In order to fix this issue, we used groupings based on yearly and monthly data for each product category which reduced the number of rows to 592.

2- The dashboards in Databricks do not have a default option for adding filters. We used Databricks Widgets to create filters by year and by product category, which makes the dashboard more dynamic (a minimal widget sketch is shown after this list).

3- The dashboard filters in Databricks do not carry over into the HTML file that is exported from the Databricks dashboards.
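
A minimal sketch of the filter widgets in a Databricks notebook; the widget names and the category values listed here are hypothetical:

```python
# Sketch of dashboard filter widgets in a Databricks notebook (hypothetical names/values).
dbutils.widgets.dropdown("year", "2022", [str(y) for y in range(2018, 2023)], "Year")
dbutils.widgets.dropdown("category", "All",
                         ["All", "PVC Chair Mat", "Desk Pad"], "Product Category")

selected_year = dbutils.widgets.get("year")
selected_category = dbutils.widgets.get("category")
# These values can then be used in the notebook's queries to filter the dashboard charts.
```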

Conclusion

In our project, we created a statistical (SARIMAX) model to estimate the future sales of a distribution company. This model was based on an analysis of the historical sales (unit sales volume) for 10 product categories over the last five-year period (2018-2022).

The sales predictions using the SARIMAX model are helpful, but they do have certain limitations. Using other available models that take into account the warehouse space/dimensions as well as budget constraints would help produce more applicable predictions. These changes would require additional resources.

The Margin Error increases as we predict further into the future. This limits the use case of the model to shorter ordering intervals (1-3 months of stock). This may prove appropriate for a distributor/retailer that operates with a Just-In-Time (JIT) principle in mind.

image

The sales predictions produced by the SARIMAX model show that overall sales will slightly decline next year, though for the majority of product categories, sales are not expected to change significantly. Based on the SARIMAX model predictions, the following categories are expected to exhibit stable sales in the coming year:

| Category                | % Share of Unit Sales |
| ----------------------- | --------------------- |
| PVC Chair Mat           | 35%                   |
| Desk Pad                | 19%                   |
| Polycarbonate Chair Mat | 13%                   |
| Recycled Chair Mat      | 12%                   |
| Anti-Fatigue Mat        | 9%                    |

In the end, SARIMAX does seem like a very appropriate model for predicting sales volumes based on historical time series.