-
Notifications
You must be signed in to change notification settings - Fork 4
/
SuperStoreSalesinR.Rmd
443 lines (325 loc) · 17.4 KB
/
SuperStoreSalesinR.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
---
title: "Superstore Sales Predictor"
author:
- Sarah Haley, slh54@drexel.edu
- Zach Carlson, zc378@drexel.edu
- Nancy Melucci, njm99@drexel.edu
date: "November 14, 2021"
output:
prettydoc::html_pretty:
theme: Architect
highlight: github
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(root.dir= '/tmp')
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
#### Libraries
```{r libraries used}
library(tidyverse) #Tidyverse v3.6.3
library(lubridate)
library(ggplot2)
library(ggfortify)
library(knitr)
library(kableExtra)
library(zoo)
library(forecast)
```
### ARIMA in R and Time Series Forecasting
For this project, we will be using the Autoregressive Integrated Moving Average (ARIMA) forecasting model which in turn is based on the Box-Jenkins Methodology. This is comprised of three major steps:
1) Conditioning the data and selecting a model (from chapter 8 in Data Science and Big Data Analytics)
-identify and account for any trends or seasonality in the time series
-examine the remaining time series for any trends or seasonality in the time series
2) Estimate the model parameters
3) Assess the model and return to step one if necessary
*Time series data* is made up of four possible components: Trend [T], Cycle [C], Seasonality [S], and Random[R]. To that end, we will employ a few libraries that were developed to deal with these particular issues in Time Series data.
So to begin, we will start looking at the data and getting a sense of its shape and possibilities.
## Import and View the Data
```{r sales}
sales_raw <- read.csv("./data/superstore_dataset2011-2015.csv", header = TRUE, sep = ",")
sales_raw <- mutate_at(sales_raw, vars("Order.Date","Ship.Date"),funs(dmy))
```
The dates need to be converted using the `lubridate` library. The two date columns are now correctly reading as `<date>`.
```{r }
sales_raw %>%
select(Order.Date, Ship.Date) %>%
head(., 6)
```
#### Initial Observations
Using the dpylr package's glimpse function, we see there are
- `51,290` instances with `24` features
- lubridate dates are now in `YYYY-MM-DD` format
- most features are factored but there are also int and nums where sensible:
+ Quantity is in integers
+ Sales is num (floating point)
+ For ease of analysis, we will assume all sales are in USD
+ Sales column row-wise represents the total amount by Quantity(1-n)
For the superstore data we want to know a number of things that time series can help us understand. Things such as:
- Are Sales & Profits increasing over time?
- What categories of sales are most profitable over time?
- What region has the most sales and are the most profitable over time?
- Predicted growth rates and declines will help us establish which markets, regions, and products we should be investing in for best profitability.
#### Exploratory Data Analysis
```{r}
dplyr::glimpse(sales_raw, width = getOption("width"))
```
Right away we can see there are a lot of NA values for Postal.Code so we will drop that column; we will not be using it for this project.
```{r}
sales <- subset(sales_raw, select = -c(Postal.Code))
```
Additionally, we will also add a few columns to the data set to make it easier for analysis throughout the project. A year column, a YearMonth column, a monthabbr column, and a month date object. Note some of these are not considered actual date parts but can be wrangled to look like dates.
```{r}
sales <- sales %>%
mutate(
YearMonth = format(Order.Date, "%Y-%m")
) #character
sales <- sales %>%
mutate(
year = year(Order.Date), monthAbbr = month(Order.Date, label = TRUE)
) #Ordinal
sales$month <- as.Date(cut(sales$Order.Date, breaks = "month")) #Date
```
A quick look at the new columns:
```{r}
sales %>%
select(YearMonth, year, month, monthAbbr) %>%
tail(., n = 10)
```
And here we will look at a few 30,000ft summary statistics around the factors affecting Revenue (Sales, Quantity, Discount Rates, Shipping Costs, and Profits) over the four year lifecycle:
```{r}
kable(sales %>%
select(Sales, Quantity, Discount, Shipping.Cost, Profit) %>%
summary()
, format = "html", digits = 2) %>%
kable_styling(bootstrap_options= c("striped"), full_width = F, font_size = 14)
```
`Discount` is in a percentage format already, showing an average at about 14% and the average quantity per sale is relatively low with an IQR of 2-5 and a median value of 3 per sale item. We will look more closely at sales volume trends shortly.
Dropping `Quantity` and `Discount` data, we can see the main revenue factors in side-by-side box plots. Total revenue, or the Sales plot, is the highest, as it should be, and Profit and Shipping Cost are subsets of that number. Profit is also quite large in the plot but Shipping Costs are not far behind, leading to some questions around efficiency for upper management to contemplate further. Additionally, there are a large number of outliers in the data and it is highly positively skewed, hence the addition of log10 on the y scale.
```{r}
options(scipen=999)# prevents scientific numbering on axis
sales %>%
select(Sales, Shipping.Cost, Profit) %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(x= name, y = value, fill=name)) +
geom_boxplot()+
xlab("Factors in Revenue") +
scale_y_log10("Values in $ (Log 10 scale)") +
ggtitle("Boxplot of Profit, Sales, and Shipping Costs", subtitle = "Y scale in Log 10 Scale")+
scale_fill_brewer(palette = "Paired")+
theme_bw()
```
Looking a little more closely at Sales by year:
```{r}
kable(sales %>%
group_by(year) %>%
summarise(total_sales = sum(Sales), mean_sales = mean(Sales), median_sales = median(Sales), max_sales = max(Sales), min_sales = min(Sales)), format = "html", digits = 2) %>%
kable_styling(full_width = F, font_size = 12)
```
Total sales have grown every year while average sales have not. It looks like a lot of small sales make up the vast majority of the revenue for this company. Quantity sold, (our sales volume), should help us understand if this supposition is true and help us determine if there are visible seasonal trends.
Let's start by looking at monthly sales trends, aggregating the `Quantity` column by month.
```{r}
monthly_sales_quantity <- sales %>%
group_by(year, monthAbbr) %>%
summarise(total.qty = sum(Quantity))
tail(monthly_sales_quantity, 24)
```
And used to visualize, we get:
```{r fig.width = 10}
monthly_sales_quantity %>%
ggplot(aes(x = monthAbbr, y = total.qty, group = year)) +
geom_area(aes(fill=year), position = "stack") +
labs(title = "Volume of Sales by Month", x= "", y ="Total Sales Qty") +
scale_y_continuous() +
theme(legend.position = "top")+
theme_minimal()
```
There are obvious seasonal trends in terms of quantity sold with slow and steady growth throughout the year capped with three peaks, and a bulk of sales happening in the last quarter of the year. There is a small peak between April through to July, another between July and October, and the third (and largest) trending from October to the end of December.
We will continue to look at total quantity (volume of sales) and total sales (revenue) over the course of this project. Since the per sale dollar amount already reflects the unit by price * quantity, the charts below are representing the `volume` of sales per month by individual days' (points) `revenue`. Each point represents a Sales line item multiplied the quantity sold.
```{r}
ggplot(sales, aes(monthAbbr, Sales)) +
#geom_point(color= "steelblue")
geom_jitter(alpha =0.5, aes(color=year), position = position_jitter(width = .2))+
labs(x = "Daily Sales by Month",
y = " Sales",
title = "Annual SuperStore Sales by Month") +
geom_line()+
#stat_summary(fun.y = sum, geom = "bar")+
#geom_bar(stat = "identity") +
facet_wrap(~year(month),ncol = 2) +
theme_bw() +
theme(legend.position = "none",axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1))
```
Looking at this panel, we see we have a number of outliers in the data and that most of the sales amounts per sale are below $5000 (and often much less than that). Also, it is apparent that there was less variance in the 2012 month to month data points than from the other years. Finally, outside of the obvious connection between sales and quantity sold, a closer look at the relationship between sales and profits can reveal more details.
Let's look at a few of the ouliers to see if we can determine what is causing them.
```{r}
kable(sales %>%
group_by(year) %>% arrange(year) %>%
select(., c(Order.Date, Customer.Name, Segment, Product.Name, Sales, Quantity, Discount, Profit)) %>%
filter(Sales == max(Sales)),format = "html", digits = 2, table.attr = "style='width:40%;'") %>%
kable_styling(full_width = F, font_size = 12)
```
The above appear to be genuine sales, although they are certainly much larger than the average sales and very rare in the overall dataset. Also, they appear to be strangely segmented - I have yet to meet someone who buys 6 video conferencing units for their `Home Office` and the Canon sales are each attributed to a different segment for the same product.
#### Sales and Profit Relationships
Now to plot the multiple line plots to look at the relationship between the "Sales" and "Profit" columns by year by month.
```{r}
monthly_sales_profit <- sales %>%
group_by(month) %>%
summarise_at(c("Sales", "Profit"), sum)
MSP <-monthly_sales_profit %>%
gather(key = "variable", value = "values", -month)
MSP
```
```{r}
options(scipen=999)# prevents scientific numbering on axis
ggplot(MSP, aes(x = month, y = values)) +
geom_line(aes(color = variable), size = 1)+
labs(title = "Monthly Totals of Sales and Profits by Year", y=NULL)+
scale_color_manual(values = c("#96be25", "#2596be"))+
theme_minimal()
```
It is really hard to see if anything is changing at all with profits. So we will plot again using a log 10 scale on the Y-axis.
```{r}
options(scipen=999)# prevents scientific numbering on axis
ggplot(MSP, aes(x = month, y = values)) +
geom_line(aes(color = variable), size = 1)+
labs(title = "Monthly Totals of Sales and Profits by Year", subtitle = "Y axis scaled by log 10", y=NULL)+
scale_color_manual(values = c("#96be25", "#2596be"))+
theme_minimal()+
scale_y_log10(breaks = scales::log_breaks(10))
```
Using a `log10 scale` on the y axis helps us see what is happening with the data. While not moving in complete tandem, the growth of total sales does have an impact on total profits and we appear to have the answer to our first question; sales and profits are increasing, but total profits appear to remain relatively low in relation - never once breaking the \$50,000 mark despite total sales growing past $500,000\.
Let's see how correlated the two values are:
```{r}
cor(sales$Sales, sales$Profit)
```
```{r}
ggplot(sales, aes(Sales, Profit)) +
geom_point()+
geom_smooth(method = "lm") +
theme_minimal()
```
and looking at a scatter plot is not very helpful. The raw values of the data show a poor alignment between the two values.
Let's standardize for easier readability. To do that we will first look at the percentage change month over month for Sales and Profits.
```{r}
MSP<-MSP %>%
mutate(pct_change = (values/lag(values)-1)*100)
tail(MSP, 5)
```
Charting the data we get:
```{r}
options(scipen=999)# prevents scientific numbering on axis
ggplot(MSP, aes(x = month, y = pct_change)) +
geom_line(aes(color = variable), size = 1)+
labs(title = "Monthly % Change of Sales and Profits by Year", y=NULL)+
scale_color_manual(values = c("#96be25", "#2596be"))+
theme_minimal()
```
This makes things a little clearer and shows that sales and profits are both trending at similar rates, with profits being much less stable and more variable than sales.
### Rolling Averages
Further standardizing, we will next look a the rolling means:
```{r}
daily_sales <- sales %>%
select(Order.Date, Sales, Quantity, Profit, Shipping.Cost) %>%
group_by(Order.Date) %>%
summarise(
daSales = sum(Sales),
daProfits = sum(Profit),
daShipping = sum(Shipping.Cost))
rolling <-daily_sales %>%
mutate(sales_01da = rollmean(daSales, k = 1, fill = NA),
sales_07da = rollmean(daSales, k = 7, fill = NA),
sales_30da = rollmean(daSales, k = 30, fill = NA),
sales_360da = rollmean(daSales, k = 365, fill = NA),
profits_01da = rollmean(daProfits, k = 1, fill = NA),
profits_07da = rollmean(daProfits, k = 7, fill = NA),
profits_30da = rollmean(daProfits, k = 30, fill = NA),
profits_360da = rollmean(daProfits, k = 365, fill = NA))
rolling <- subset(rolling, select = -c(daSales, daProfits, daShipping))
rolling
```
```{r width = 15}
rs <- ggplot(rolling, aes(x = Order.Date, y = sales_01da)) +
geom_line(color="#2596be") +
geom_line(aes(y = sales_07da), color = "dark blue", size = 0.75)+
labs (x = "Order Date", y = "Daily Sales", title = "Daily Sales with 7 day Rolling Average") +
theme_minimal()
rs
```
```{r width = 15}
sp7 <- ggplot(rolling, aes(x = Order.Date)) +
geom_line(aes(y = profits_07da), color="#96be25") +
geom_line(aes(y = sales_07da), color = "#2596be", size = 0.75)+
labs (x = "Order Date", y = "Daily Sales", title = "Rolling 7-day Averages of Sales and Profit")
sp7
```
```{r}
ggplot(rolling, aes(x = Order.Date))
```
### ARIMA in R and Time Series Forecasting
Now we come to the heart of the matter, using the Box-Jenkins Methodology,with the Autoregressive Integrated Moving Average model(ARIMA). This is comprised of three major steps:
So far, our data, appears to have strong seasonality trends, although the magnitude is hard to tell. It may also have cyclicity but that is also hard to tell with the current graphs. To that end we will employ a few libraries that were developed to deal with these particuar issues in Time Series data.
In R this is done by converting the information to a timeseries object and then use the forecast or other similar libraries to perform some decomposition on the data. Decomposition is a process of seperating out different time series peices, rendering the data in a more "predictable" format.
For the ARIMA model we must also ensure the data is in a `stationary` format.
- the mean of Y(t) is a constant for all values of t
- the variance of Y(t) is finite
- the covariance of Y(t) and Y(t + h) depends only on the value of h = 0,1,2,...for all of t.
#### Stationary Data Test
#### Using the Time Series Object in R
Converting the monthly sales information into a ts (time series) object in R, we can use its native library and the stats package to decompose the monthly sales data to see if we are on the right track. This will also confirm which parts of the data are due to randomness, which is something we cannot gleen from the earlier charts.
I am going to plot the STL (seasonal and trend decomposition using Loess) for `Sales` and then we will look at some of the other factors seperately.
```{r}
monthly_revenue_factors <- sales %>%
group_by(month) %>%
summarise(total.sales = sum(Sales), total.profit = sum(Profit), total.shipping = sum(Shipping.Cost))
monthly_ts <-ts(monthly_revenue_factors[,2:4], start = 2011, end = 2014, freq = 12)
monthly_sales <- sales %>%
group_by(month) %>%
summarise(total.sales = sum(Sales))
sales_ts <- ts(monthly_sales[,2], start = 2011, end = 2014, frequency = 12)
sales_ts %>%
stl( s.window = "period") %>%
autoplot() +
theme_bw()
```
With everything split out, it is possible to see the trend and seasonal data very clearly. Sales are growing and seasonality cannot be denied. The `Random` or reaminder data is also often used for prediction as it can contain underpinning structure within the 'noise' letting us know whether or not we have achieved our goals with the data.
Since we know there is seasonality in the data, it means the series is not stationary and for the ARIMA models of prediction, we will need to do a litle work.
Another chart that is a part of the Box Jenkins Methodology is the ACF or Autocorrelation Function. This function provides an autocorrelation with previous lags or time periods of a dataset with itself. The ACF will peak around seasonal lags or at the average cycle length.
Below is an ACF plot on just the monthly sales data. The dashed lines are there to show when a correlation is significantly above or below zero. The lags at 1 and 3 and 12 show that the data is likley not happening by random chance. This is data we can use to build a forcasting model.
```{r}
ggAcf(sales_ts) +
theme_minimal()
```
Here, we will start with some basic predictions of the data.
```{r}
library(caret)
# training and test data
set.seed(42)
index = createDataPartition(sales$Sales, p = 0.70, list = FALSE)
train = sales[index, ]
test = sales[-index, ]
```
we are going to try out the
```{r}
nf_sales <- naive(sales_ts, h = 4)
autoplot(nf_sales) +
theme_minimal()
```
```{r}
summary(nf_sales)
```
We added `Shipping Cost` back into the equation as well, to look at it in relationship to to the sales and profits side of the equation, as it is a variable of interest.
```{r}
autoplot(monthly_ts) + #without facets
theme_minimal()
```
Here we see that the sales appears multiplicative in nature, but with the scale being so narrow and small in comparison for profits and shipping, it is harder to tell. Placing this in a facet could help.
```{r}
autoplot(monthly_ts, facets = TRUE) + #without facets
theme_minimal()
```
```{r}
monthly_ts_stl <-stl(sales_ts, s.window = "period")
plot(monthly_ts_stl)
```