---
title: "Consumer Brand Preference Predictions"
author: "Jennifer Brosnahan"
date: "9/12/2020"
output:
  html_document:
    keep_md: yes
    theme: lumen
    highlight: haddock
---
## Background
#### An electronics company would like us to predict which computer brand customers from a new region will prefer, Acer or Sony. Doing so will help determine inventory needs for the new clientele, and individual customers' predicted preferences will support in-store and online marketing efforts.
## Objective
#### The objective is to build predictive models and choose one that can predict consumer computer brand preference with at least 80% accuracy on test data. The ideal goal is a model that can predict brand preference with at least a 90% level of certainty.
## Load libraries
```{r Loading packages, message=FALSE, warning=FALSE}
# Loading packages
library(tidyverse)
library(caret)
library(ggplot2)
library(corrplot)
library(openxlsx)
library(h2o)
library(knitr)
library(kableExtra)
```
## Import data
```{r}
# training data
complete <- read.csv(file.path('C:/Users/jlbro/OneDrive/C3T2/C3T2', 'complete.csv'), stringsAsFactors = TRUE)
# testing data
incomplete <- read.csv(file.path('C:/Users/jlbro/OneDrive/C3T2/C3T2', 'incomplete.csv'), stringsAsFactors = TRUE)
```
## Understand data
```{r}
# check structure of training data
str(complete)
```
#### Observations
* 9,898 observations and 7 features
* Six features are customer descriptors; the seventh, 'brand', is the consumer brand preference we want to predict
* The data description gives the codes for brand: Acer = 0 and Sony = 1 (an optional relabeling sketch follows)
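#### The analysis below keeps the 0/1 coding, but the codes could be mapped to readable labels. A minimal sketch, not run so the downstream coding is unchanged:
```{r, eval=FALSE}
# optional relabeling per the data description (Acer = 0, Sony = 1);
# eval = FALSE keeps the 0/1 coding used by the rest of this analysis
complete$brand <- factor(complete$brand, levels = c(0, 1), labels = c('Acer', 'Sony'))
```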
```{r}
# check descriptive stats
summary(complete)
summary(complete$brand)
```
```{r, message=FALSE, warning=FALSE}
# bar chart of 'brand' counts
ggplot(complete) +
  geom_bar(aes(x = brand))
```
#### The binary classes are slightly imbalanced: roughly 62% Sony to 38% Acer
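#### That ratio can be checked directly; a quick sketch:
```{r}
# class proportions of 'brand' (0 = Acer, 1 = Sony)
round(prop.table(table(complete$brand)), 2)
```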
## Preprocessing
```{r}
# check for NAs
sum(is.na(complete))
# convert 'brand' to a factor so caret treats this as classification
complete$brand <- as.factor(complete$brand)
str(complete)
```
## Modeling
```{r}
set.seed(123)
# createDataPartition() for a stratified 75/25 split
index1 <- createDataPartition(complete$brand, p=0.75, list = FALSE)
train1 <- complete[ index1,]
test1 <- complete[-index1,]
# check structure of training set
str(train1)
```
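#### createDataPartition() samples within each class, so the split should preserve the 62/38 ratio; a quick check:
```{r}
# class proportions in the training and testing sets
round(prop.table(table(train1$brand)), 2)
round(prop.table(table(test1$brand)), 2)
```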
```{r}
# set 10-fold cross-validation (with repeats = 1, 'repeatedcv' is equivalent to plain 'cv')
control <- trainControl(method = 'repeatedcv',
                        number = 10,
                        repeats = 1)
```
## Random Forest using automatic tuning
```{r}
set.seed(123)
# train algorithm
rf1 <- train(brand~.,
             data = train1,
             method = 'rf',
             trControl = control,
             tuneLength = 1)
rf1
```
## Random Forest using manual tuning
```{r}
set.seed(123)
# train
rf2 <- train(brand~.,
             data = train1,
             method = 'rf',
             trControl = control,
             tuneLength = 5)
rf2
```
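#### Note that tuneLength = 5 still lets caret choose the five mtry candidates automatically. Fully manual tuning would pass an explicit grid; a sketch (not run, and the mtry values here are illustrative):
```{r, eval=FALSE}
# explicit mtry grid via tuneGrid instead of tuneLength
rfGrid <- train(brand~.,
                data = train1,
                method = 'rf',
                trControl = control,
                tuneGrid = expand.grid(mtry = c(2, 3, 4, 5, 6)))
```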
#### Observations
* Both Random Forest models perform about the same. We will use rf1, with an mtry of 2 and 91.8% accuracy, as it is the simpler and faster model and meets the client goal of at least 80% accuracy.
* Kappa is 82.5%. Kappa is a more useful measure than raw accuracy when the predicted classes are imbalanced, because it corrects for the agreement expected by chance (a toy sketch of the calculation follows).
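#### A toy sketch of the kappa calculation (the counts below are illustrative, not model output):
```{r}
# confusion counts for an imbalanced two-class problem, roughly 62/38
cm <- matrix(c(850, 70, 90, 1440), nrow = 2,
             dimnames = list(predicted = c('0', '1'), actual = c('0', '1')))
observed_acc <- sum(diag(cm)) / sum(cm)
# accuracy expected by chance, from the marginal class frequencies
expected_acc <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
# kappa: agreement above chance, where 1 is perfect agreement
(observed_acc - expected_acc) / (1 - expected_acc)
```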
```{r}
# variable importance using ggplot
ggplot(varImp(rf1, scale = FALSE)) +
  geom_bar(stat = 'identity') +
  ggtitle('Variable Importance of Top Random Forest Model')
```
## Predict on test dataset using optimal model
```{r}
rfPreds <- predict(rf1, newdata = test1)
```
```{r}
# type = 'prob' returns the class probabilities for each observation
rfProbs <- predict(rf1, newdata = test1, type = 'prob')
head(rfProbs, 10)
```
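#### The class prediction is simply the higher-probability column; a custom cutoff could be applied to rfProbs directly. A sketch using the default 0.5 threshold:
```{r}
# derive class labels from the Sony (= '1') probability column
head(factor(ifelse(rfProbs[['1']] > 0.5, '1', '0'), levels = levels(test1$brand)), 10)
```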
## Confusion matrix of top model
```{r}
confusionMatrix(data = rfPreds, test1$brand)
```
#### Observation
* The confusion matrix shows 92.6% accuracy, 92% sensitivity (the share of positives we catch), and 93% specificity (the share of negatives we catch). This is a good model (a by-hand sketch of these two rates follows).
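#### Both rates can be recovered by hand from the confusion table; note that caret treats the first factor level ('0' = Acer) as the positive class:
```{r}
cm_test <- table(predicted = rfPreds, actual = test1$brand)
sens <- cm_test['0', '0'] / sum(cm_test[, '0'])  # true positives / all actual positives
spec <- cm_test['1', '1'] / sum(cm_test[, '1'])  # true negatives / all actual negatives
c(sensitivity = sens, specificity = spec)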
```{r}
# postResample compares test-set accuracy and kappa with training performance to flag overfitting
postResample(rfPreds, test1$brand)
```
#### rf1 is not overfitting: test accuracy and kappa are both comparable to training and high (see the comparison sketch below)
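#### A sketch of that comparison, placing the cross-validated training accuracy next to the test accuracy:
```{r}
round(c(train_cv = max(rf1$results$Accuracy),
        test = unname(postResample(rfPreds, test1$brand)['Accuracy'])), 3)
```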
```{r}
# compare predictions to actual within same data frame
compare_rf <- data.frame(test1,rfPreds)
head(compare_rf, 100)
```
```{r, message=FALSE, warning=FALSE}
# summarize and plot
summary(rfPreds)
ggplot(compare_rf) +
  geom_bar(aes(x = rfPreds)) +
  xlab('Brand Preference Predictions') +
  ggtitle('Distribution of Brand Preference Predictions')
```
#### On test data, the model predicted a brand ratio very close to the actual preference ratio (38% Acer to 62% Sony), as the proportion sketch below shows
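```{r}
# predicted vs. actual class proportions on the test set
round(prop.table(table(rfPreds)), 2)
round(prop.table(table(test1$brand)), 2)
```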
## Predict brand preference on new customer dataset: 'incomplete'
```{r}
# check structure of new dataset
str(incomplete)
```
## Preprocess
```{r}
# check for NAs
sum(is.na(incomplete))
# convert 'brand' to a factor to match the training data
incomplete$brand <- as.factor(incomplete$brand)
# check structure and summary
str(incomplete)
summary(incomplete)
```
#### Observation
* The summary reveals that some values of 'brand' are filled in as 1
* A deeper look shows the first 102 rows of 'brand' have been answered
* The remaining rows (103:5000) are unanswered and need to be predicted (the sketch below inspects both groups)
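#### A quick exploratory sketch of both groups of rows (how the unanswered rows are coded depends on the file):
```{r}
summary(incomplete$brand[1:102])     # answered rows
summary(incomplete$brand[103:5000])  # unanswered rows
```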
```{r}
# make brand predictions on new customer data 'incomplete'
incompletePreds <- data.frame(predict(rf1, newdata = incomplete))
names(incompletePreds)[1] <- 'Predictions'
str(incompletePreds)
```
```{r}
# postResample on the first 102 answered observations to gauge performance on the new data
subset_incomplete <- incomplete %>% slice(1:102)
subset_PR <- incompletePreds %>% slice(1:102)
postResample(subset_PR$Predictions, subset_incomplete$brand)
```
#### Observation
* Accuracy is high at 85.3%; kappa is lower, perhaps because of the small sample size
```{r}
# compare the first 102 actual observations to predictions
bind_cols(actual = subset_incomplete$brand, subset_PR)
```
```{r, message=FALSE, warning=FALSE}
# summarize and plot predictions
summary(incompletePreds)
ggplot(incompletePreds) +
  geom_bar(aes(x = Predictions))
```
#### Observation
* The distribution of predictions closely matches training and testing: 38% predicted to prefer Acer and 62% predicted to prefer Sony
```{r}
# attach predictions to the new customer data
compare_incomplete <- data.frame(incomplete,incompletePreds)
head(compare_incomplete, 150)
```
## View first 200 predictions of consumer brand preference on new dataset 'incomplete'
```{r}
kable(compare_incomplete[1:200,], format = 'html', digits = 3,
      caption = 'Brand Preference Predictions on New Customer Dataset') %>%
  kable_styling(bootstrap_options = 'striped', full_width = FALSE)
```
## Actionable Insights
* Maintain a product inventory ratio of roughly 62% Sony to 38% Acer for the new customer clientele
* Target the new customer base with direct-to-consumer email and mail promotions based on each customer's predicted brand preference
* Deploy the algorithm in the online platform for upselling and recommender strategies across all client online shopping experiences (a minimal persistence sketch follows)
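#### A minimal persistence sketch for the deployment insight above (not run; the file name is an assumption):
```{r, eval=FALSE}
# save the trained model so a serving process can load it and score new customers
saveRDS(rf1, 'rf1_brand_model.rds')
rf1_loaded <- readRDS('rf1_brand_model.rds')
new_customer_preds <- predict(rf1_loaded, newdata = incomplete)
```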