---
title: "Wine Ratings Predictions Project"
author: "Joseph Wellman"
date: "04 June 2020"
output: pdf_document
---
# 1. INTRODUCTION
In this project, we will study a data set of wine ratings available on Kaggle (https://www.kaggle.com/zynicide/wine-reviews/data). The data set comprises information on wine reviews scraped from the Wine Enthusiast Magazine website (https://www.winemag.com/ratings/) on November 22nd, 2017.
Our data set contains nearly 130,000 reviews written by wine critics for the magazine, with information on wine grape variety, price, place of origin, and several other variables. The full text of the review is included for each entry, as well as a points score between 80 and 100, which falls into one of the following six categories (see https://www.winemag.com/wine-vintage-chart/):
Score | Category
------ | ----------
98-100 | Classic
94-97 | Superb
90-93 | Excellent
87-89 | Very Good
83-86 | Good
80-82 | Acceptable
Wines scoring below 80 are not reviewed or recommended by the magazine.
We will look to build a machine learning algorithm that predicts one of the six score categories for wines with unknown titles or wineries by utilizing other available variables in the data set. We will first explore the data set in some detail to undertake the necessary data cleaning and to determine which variables are usable based on the available data.
Then we will construct training and testing subsets from the full data set with points scores stratified by category, and perform some data visualization to evaluate each variable's usefulness in predicting wine score categories.
Several machine learning algorithms will then be trained on the training set and we will assess them based on their accuracy rate in determining correct points categories.
Finally, we will select one final machine learning model to run on our testing set, review its accuracy performance, and provide suggestions for improvements or next steps for further study.
# 2. DATA ANALYSIS AND ALGORITHM METHODS
## 2.1 DATA ANALYSIS
```{r, include=FALSE}
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(tidytext)) install.packages("tidytext", repos = "http://cran.us.r-project.org")
if(!require(Rborist)) install.packages("Rborist", repos = "http://cran.us.r-project.org")
if(!require(dslabs)) install.packages("dslabs", repos = "http://cran.us.r-project.org")
```
```{r, include=FALSE}
url <- "https://www.dropbox.com/s/wug790n9fqrj7uz/winemag-data-130k-v2.csv?dl=1"
if(!dir.exists("./wineratings")) dir.create("./wineratings")
if(!file.exists("./wineratings/winemag-data-130k-v2.csv")){
download.file(url, "./wineratings/winemag-data-130k-v2.csv")
}
dataset <- read.csv("./wineratings/winemag-data-130k-v2.csv", encoding = "UTF-8")
```
Examining the structure of the raw data set below, we see that it includes 129,971 entries across 14 variables:
* `X`: integer referencing the row number;
* `country`: factor providing the country where the wine was produced;
* `description`: factor containing the text of the review written by the reviewer as originally posted on https://www.winemag.com/ratings/;
* `designation`: factor of any special designations given by the winery in the wine's name;
* `points`: integer giving the wine score of the review;
* `price`: numeric providing the price per bottle of the wine in US Dollars;
* `province`: factor providing the province of the country where the wine was produced;
* `region_1`: factor for regional detail where the wine was produced;
* `region_2`: factor for further sub-regional detail where the wine was produced;
* `taster_name`: factor giving the name of the reviewer;
* `taster_twitter_handle`: factor giving the twitter handle of the reviewer;
* `title`: factor giving the title of the wine beginning with the name of the winery, vintage year, and then the name of the wine;
* `variety`: factor giving the grape variety of the wine; and
* `winery`: factor giving the name of the winery that produced the wine.
```{r}
str(dataset)
```
We show the first six entries of the data set below for illustration:
```{r, echo=FALSE}
dataset %>% select(X, country, designation, points, price, province, region_1) %>% head() %>% knitr::kable()
dataset %>% select(region_2, title) %>% head() %>% knitr::kable()
dataset %>% select(winery, variety, taster_name, taster_twitter_handle) %>% head() %>% knitr::kable()
dataset %>% select(description) %>% head() %>% knitr::kable()
```
Noting that the wine vintage year is included in the title labels, and that this is a particularly important factor when it comes to evaluating wines, we extract the vintage year from the wine titles and save it in a new variable named `vintage`. We notice that some winery names also include years (likely the year of their founding), so we make sure to exclude years contained in winery names:
```{r, warning=FALSE}
# Extract vintage year information from wine titles
dataset <- dataset %>% mutate(vintage = as.numeric(substr(title, str_length(winery)+2,
str_length(winery)+5)))
dataset %>% select(title, vintage) %>% head() %>% knitr::kable()
```
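As an aside, this position-based extraction depends on the title always beginning with the winery name followed by a space. A regex-based alternative (a sketch only, not used in the analysis below; `vintage_alt` is a hypothetical name) would strip the winery name and then take the first four-digit run:
```{r, eval=FALSE}
# Sketch: remove the winery name from the title, then extract the first
# four-digit run as the vintage year (returns NA when no year is present)
dataset %>%
mutate(vintage_alt = as.numeric(str_extract(
str_remove(as.character(title), fixed(as.character(winery))), "\\d{4}")))
```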
We also notice from examining the structure of the data set above that it appears there are a number of blank or missing entries in several of the variables. We check the number for each variable and summarise this in the table below:
```{r, echo=FALSE}
data.frame(Variable = dataset %>% names() %>% .[. != "X"],
Blank_or_NA = c(dataset %>% filter(is.na(country) | country == "") %>% nrow(),
dataset %>% filter(is.na(description) | description == "") %>% nrow(),
dataset %>% filter(is.na(designation) | designation == "") %>% nrow(),
dataset %>% filter(is.na(points) | points == "") %>% nrow(),
dataset %>% filter(is.na(price) | price == "") %>% nrow(),
dataset %>% filter(is.na(province) | province == "") %>% nrow(),
dataset %>% filter(is.na(region_1) | region_1 == "") %>% nrow(),
dataset %>% filter(is.na(region_2) | region_2 == "") %>% nrow(),
dataset %>% filter(is.na(taster_name) | taster_name == "") %>% nrow(),
dataset %>% filter(is.na(taster_twitter_handle) | taster_twitter_handle == "") %>% nrow(),
dataset %>% filter(is.na(title) | title == "") %>% nrow(),
dataset %>% filter(is.na(variety) | variety == "") %>% nrow(),
dataset %>% filter(is.na(winery) | winery == "") %>% nrow(),
dataset %>% filter(is.na(vintage) | vintage == "") %>% nrow())) %>%
knitr::kable()
```
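(As an aside, the per-variable counts above can also be computed more compactly; the following is a sketch of an equivalent approach rather than the code used to build the table:)
```{r, eval=FALSE}
# Sketch: count blank or NA entries in each column except the row index X
sapply(dataset %>% select(-X), function(col) sum(is.na(col) | col == ""))
```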
In evaluating the results above, we see that more than half of the entries (79,460) have missing data for `region_2`, 21,247 entries have missing `region_1` data, 37,465 have missing `designation` data, 8,996 have missing `price` data, and 4,630 have missing `vintage` data. Only 63 entries have missing `country` and `province` data, while just one entry has missing `variety` data. We will not focus on the `taster_name` or `taster_twitter_handle` in this study, and so we can ignore these missing entries.
Based on the available data in the data set, it looks appropriate to ignore the variables `region_1`, `region_2`, `designation`, `taster_name`, and `taster_twitter_handle` in predicting our wine ratings. We will also ignore the `title` and `winery` variables, since our models will be designed to predict wine quality for wines with previously unseen titles and wineries through utilizing other factors.
We filter our data set to include only non-blank entries for `price`, `country`, `province`, `variety`, and `vintage`:
```{r}
# Filter out NA or blank entries for the relevant variables in our data set
dataset <- dataset %>% filter(!is.na(price) & price != "" &
!is.na(country) & country != "" &
!is.na(province) & province != "" &
!is.na(variety) & variety != "" &
!is.na(vintage) & vintage != "")
dataset %>% nrow() # Count the remaining entries in the data set
```
We see there are now 116,761 entries remaining, and we re-check the number of blank or NA entries for each variable:
```{r, echo=FALSE}
data.frame(Variable = dataset %>% names() %>% .[. != "X"],
Blank_or_NA = c(dataset %>% filter(is.na(country) | country == "") %>% nrow(),
dataset %>% filter(is.na(description) | description == "") %>% nrow(),
dataset %>% filter(is.na(designation) | designation == "") %>% nrow(),
dataset %>% filter(is.na(points) | points == "") %>% nrow(),
dataset %>% filter(is.na(price) | price == "") %>% nrow(),
dataset %>% filter(is.na(province) | province == "") %>% nrow(),
dataset %>% filter(is.na(region_1) | region_1 == "") %>% nrow(),
dataset %>% filter(is.na(region_2) | region_2 == "") %>% nrow(),
dataset %>% filter(is.na(taster_name) | taster_name == "") %>% nrow(),
dataset %>% filter(is.na(taster_twitter_handle) | taster_twitter_handle == "") %>% nrow(),
dataset %>% filter(is.na(title) | title == "") %>% nrow(),
dataset %>% filter(is.na(variety) | variety == "") %>% nrow(),
dataset %>% filter(is.na(winery) | winery == "") %>% nrow(),
dataset %>% filter(is.na(vintage) | vintage == "") %>% nrow())) %>%
knitr::kable()
```
Next, we check to see the number of distinct entries for the relevant variables to be used:
```{r, echo=FALSE}
data.frame(Count = t(dataset %>% summarize(
"Distinct Countries" = n_distinct(country),
"Distinct Descriptions" = n_distinct(description),
"Distinct Points" = n_distinct(points),
"Distinct Prices" = n_distinct(price),
"Distinct Provinces" = n_distinct(province),
"Distinct Varieties" = n_distinct(variety),
"Distinct Vintages" = n_distinct(vintage)))) %>%
knitr::kable()
```
Looking at province names, we want to check that each province name applies to only one country (for example, that there is no province name, say "Southern", used in more than one country). We can check this by using the `unite()` function to create a temporary new variable, `country_province`, and then counting the number of distinct entries. If the number is greater than the number of distinct provinces, then some province name is used in more than one country:
```{r}
dataset %>% unite(country_province, country, province, sep = "_") %>%
summarise(n_distinct(country_province)) %>%
knitr::kable()
```
We see that the result matches the number of distinct province names, 415, and therefore each province name is unique to one country. This means that country information is implied in the `province` variable, which may make the `country` variable unnecessary in our algorithms.
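Had the counts not matched, we could list the offending names directly. The following sketch (not needed here, since the counts match) would return any province name used in more than one country:
```{r, eval=FALSE}
# Sketch: list any province name that appears under more than one country
dataset %>%
distinct(country, province) %>%
count(province) %>%
filter(n > 1)
```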
Now, we proceed to build a training set and a test set from our original full data set.
We first reduce the size of our data set to 10,000 entries in order to provide a more practical sample size for running our machine learning algorithms in a manageable amount of time. We also add a `category` variable that maps each points score to one of the categories A-F according to the table below:
Score | Category
------ | ---------
98-100 | A
94-97 | B
90-93 | C
87-89 | D
83-86 | E
80-82 | F
We then construct a training set containing 80% of the entries and a testing set containing approximately 20%, discarding entries with provinces and grape varieties that do not appear in the training set. We choose an 80:20 split in order to have a substantial proportion of the data entries available in our training set to train the models:
```{r, warning=FALSE}
# Set sample seed to 1 for replicability
set.seed(1, sample.kind="Rounding")
# We filter the data set to only include the columns that will be relevant for our study,
# then sample 10,000 entries and add a category variable with the stratified scores
dataset <- dataset %>% select(-X, -designation, -country, -region_1, -region_2,
-taster_name, -taster_twitter_handle, -title, -winery) %>%
sample_n(10000) %>%
mutate(category = as.factor(case_when(points > 97 ~ "A",
points > 93 ~ "B",
points > 89 ~ "C",
points > 86 ~ "D",
points > 82 ~ "E",
points >= 80 ~ "F")))
# Set sample seed to 1 for replicability
set.seed(1, sample.kind="Rounding")
# We create a test index using 20% of the entries in the dataset
test_index <- createDataPartition(y = dataset$points, times = 1, p = 0.2, list = FALSE)
# We then create the training and testing sets using the test index
train_set <- dataset[-test_index,]
test_set <- dataset[test_index,]
# Make sure that provinces and varieties in the test set are also in the training set
test_set <- test_set %>%
semi_join(train_set, by = "province") %>%
semi_join(train_set, by = "variety")
# Remove unused factors for the variety and province variables in the training and test sets
train_set <- train_set %>% mutate(variety = droplevels(variety),
province = droplevels(province))
test_set <- test_set %>% mutate(variety = droplevels(variety),
province = droplevels(province))
# Match the factor levels in the test set to those of the training set for use in our models
test_set <- test_set %>% mutate(variety = factor(variety, levels = levels(train_set$variety)),
province = factor(province, levels = levels(train_set$province)))
# Removing the dataset and test_index objects to clean up since they are no longer needed
rm(dataset, test_index)
```
```{r, echo=FALSE}
data.frame(Data_set = c("train_set", "test_set"),
No_entries = c(nrow(train_set), nrow(test_set))) %>%
knitr::kable()
```
We see that the training set and test set contain data entries in approximately an 80:20 proportion.
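As a further sanity check (a sketch; the exact proportions will vary slightly), we could compare the category proportions in the two sets to confirm that the stratified split preserved the score distribution:
```{r, eval=FALSE}
# Sketch: compare category proportions between the training and testing sets
data.frame(Category = levels(train_set$category),
Train = as.numeric(prop.table(table(train_set$category))),
Test = as.numeric(prop.table(table(test_set$category)))) %>%
knitr::kable()
```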
Now we begin visualizing the data in our training set to further understand the links between the different variables and wine ratings.
Firstly, looking at the distribution of wine review scores (points), we plot a histogram below:
```{r, warning=FALSE, echo=FALSE}
library(dslabs)
ds_theme_set()
train_set %>% ggplot(aes(points)) +
geom_histogram(breaks = seq(80, 100, 1), col = "black", fill = "dodgerblue3") +
xlab("Points") +
ylab("Count")
```
```{r, echo=FALSE}
options(digits = 5)
train_set %>% summarise("Mean points" = mean(points),
"SD points" = sd(points),
"Min points" = min(points),
"Max points" = max(points)) %>%
knitr::kable()
```
We can see that the distribution of points somewhat resembles a normal distribution, albeit with two peaks, with a mean of 88.5 and a standard deviation of 3.08.
```{r, echo=FALSE, warning=FALSE}
train_set %>% ggplot(aes(category)) +
geom_histogram(stat = "count", col = "black", fill = "dodgerblue3") +
xlab("Category") +
ylab("Count")
```
Plotting a histogram by points score category shows a similar distribution.
Now looking at wine prices, we plot the distribution of wine prices in our data set:
```{r, warning=FALSE, echo=FALSE}
train_set %>% ggplot(aes(price)) +
geom_histogram(bins = 30, col = "black", fill = "dodgerblue3") +
scale_x_log10() +
xlab("Price (log scale)") +
ylab("Count")
```
```{r, echo=FALSE}
train_set %>% summarise("Mean price" = mean(price),
"SD price" = sd(price),
"Min price" = min(price),
"Max price" = max(price)) %>%
knitr::kable()
```
We see that there is a very large range in wine prices, from \$5 to \$3,300 per bottle in our training set, but with a heavy skew towards the lower end, with a mean of \$35.70 and a standard deviation of \$57.
We next examine the relationship between wine price and points in the training set:
```{r, echo=FALSE}
train_set %>% group_by(price) %>%
summarise(points = mean(points)) %>%
ggplot(aes(price, points)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_log10() +
xlab("Price (log scale)") +
ylab("Average points")
```
We see from the plot above that there is a strong positive trend between wine price and average points, although the points for the wine at the maximum price of \$3,300 are actually below the average of all ratings (88.5):
```{r, echo=FALSE}
train_set %>% group_by(price) %>%
filter(price == 3300) %>% summarize("Average points" = mean(points), count = n()) %>%
knitr::kable()
```
Next we look at the distribution of wine vintage years in our training set:
```{r, echo=FALSE}
train_set %>% ggplot(aes(vintage)) +
geom_histogram(binwidth = 1, col = "black", fill = "dodgerblue3") +
scale_x_continuous(breaks = seq (1930, 2017, 20)) +
xlab("Vintage year") +
ylab("Count")
```
```{r, echo=FALSE}
train_set %>% summarise("Mean year" = mean(vintage),
"SD year" = sd(vintage),
"Min year" = min(vintage),
"Max year" = max(vintage)) %>%
knitr::kable()
```
We see that the vintages for wines in our training set are heavily clustered in more recent years, with an average vintage of 2010 and an earliest vintage of 1934.
We look at a plot of the relationship of vintage against points below:
```{r, echo=FALSE}
train_set %>% group_by(vintage) %>%
summarise(points = mean(points)) %>%
ggplot(aes(vintage, points)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(breaks = seq (1930, 2017, 20)) +
xlab("Vintage year") +
ylab("Average points")
```
We see there is a strong negative trend for wine points versus vintage year; newer wines appear to be rated lower than older ones. However, since there are far fewer data points available for older wines, we re-plot the data using only vintages after 1995:
```{r, echo=FALSE}
train_set %>% group_by(vintage) %>%
filter(vintage > 1995) %>%
summarise(points = mean(points)) %>%
ggplot(aes(vintage, points)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(breaks = seq (1990, 2017, 5)) +
xlab("Vintage year") +
ylab("Average points")
```
Clearly the negative trend for wine points versus vintage year is much less noticeable when focusing only on more recent years. Nevertheless, vintage year looks to provide useful information when it comes to predicting ratings.
Now, we examine the link between the province of wine origin and points. First, we plot the distribution among provinces with at least 50 reviews in our training set below:
```{r, warning=FALSE, echo=FALSE}
train_set %>% group_by(province) %>%
summarize(count = n()) %>%
mutate(province = reorder(province, desc(count))) %>%
filter(count >= 50) %>%
ggplot(aes(province, count)) +
geom_histogram(stat = "identity", col = "black", fill = "dodgerblue3") +
xlab("Wine province") +
ylab("Count") +
theme(axis.text.x=element_text(angle = 90, hjust = 1, vjust = 0.5),
axis.ticks.x=element_blank(),
panel.grid.major.x=element_blank())
```
We can see that almost one third of the wines in our training set originate from California, with other US, Italian, and French provinces also featuring prominently.
```{r, warning=FALSE, echo=FALSE}
train_set %>% mutate(province = reorder(province, points, FUN = mean)) %>%
group_by(province) %>%
filter(n() >= 50) %>%
summarize(mean = mean(points), sd = sd(points), count = n()) %>%
ggplot(aes(province, mean)) +
geom_point(aes(size = count), col = "dodgerblue3") +
labs(size="Review count") +
geom_errorbar(aes(x = province,
ymin = mean - sd,
ymax = mean + sd)) +
xlab("Wine province") +
ylab("Average points") +
scale_y_continuous(breaks = seq (82, 94, 2)) +
theme(axis.text.x=element_text(angle = 90, hjust = 1, vjust = 0.5),
axis.ticks.x=element_blank(),
panel.grid.major.x=element_blank())
```
Looking at an ordered plot of average points for provinces with at least 50 reviews, with error bars at one standard deviation, we see some disparity between provinces, although the provinces with the highest numbers of reviews tend to cluster around the overall training set average of 88.5. The standard deviation of points within each province is approximately 3 points. The outliers at the high and low ends generally tend to be provinces with very few reviews:
```{r, echo=FALSE}
train_set %>% group_by(province) %>%
summarize(Mean_points = mean(points), SD_points = sd(points), Count = n()) %>%
arrange(desc(Mean_points)) %>%
top_n(5, Mean_points) %>%
knitr::kable()
train_set %>% group_by(province) %>%
summarize(Mean_points = mean(points), SD_points = sd(points), Count = n()) %>%
arrange(Mean_points) %>%
top_n(-5, Mean_points) %>%
knitr::kable()
```
Overall, wine province looks to be a useful variable for use in training our algorithms.
We now look at wine grape variety in a similar manner, plotting review counts and average ratings for varieties with at least 50 reviews:
```{r, warning=FALSE, echo=FALSE}
train_set %>% group_by(variety) %>%
summarize(count = n()) %>%
mutate(variety = reorder(variety, desc(count))) %>%
filter(count >= 50) %>%
ggplot(aes(variety, count)) +
geom_histogram(stat = "identity", col = "black", fill = "dodgerblue3") +
xlab("Wine grape variety") +
ylab("Count") +
theme(axis.text.x=element_text(angle = 90, hjust = 1, vjust = 0.5),
axis.ticks.x=element_blank(),
panel.grid.major.x=element_blank())
```
```{r, warning=FALSE, echo=FALSE}
train_set %>% mutate(variety = reorder(variety, points, FUN = mean)) %>%
group_by(variety) %>%
filter(n() >= 50) %>%
summarize(mean = mean(points), sd = sd(points), count = n()) %>%
ggplot(aes(variety, mean)) +
geom_point(aes(size = count), col = "dodgerblue3") +
labs(size="Review count") +
geom_errorbar(aes(x = variety,
ymin = mean - sd,
ymax = mean + sd)) +
xlab("Wine grape variety") +
ylab("Average points") +
scale_y_continuous(breaks = seq (82, 94, 2)) +
theme(axis.text.x=element_text(angle = 90, hjust = 1, vjust = 0.5),
axis.ticks.x=element_blank(),
panel.grid.major.x=element_blank())
```
We see that the most popular grape varieties are Pinot Noir, Chardonnay, and Cabernet Sauvignon, along with Red Blend, all with average ratings between 88 and 89.5, around the overall average. The standard deviation of points within each variety is again approximately 3 points. Below, we again see that the outliers at the high and low ends of average points have only one or two reviews each:
```{r, echo=FALSE}
train_set %>% group_by(variety) %>%
summarize(Mean_points = mean(points), SD_points = sd(points), Count = n()) %>%
arrange(desc(Mean_points)) %>%
top_n(5, Mean_points) %>%
knitr::kable()
train_set %>% group_by(variety) %>%
summarize(Mean_points = mean(points), SD_points = sd(points), Count = n()) %>%
arrange(Mean_points) %>%
top_n(-5, Mean_points) %>%
knitr::kable()
```
Wine grape variety again looks to provide useful information for our algorithms.
Finally, we can evaluate the usefulness of the wine descriptions, the text of the reviews themselves, in predicting wine ratings. The descriptions are free-form text, so we can use the `tidytext` package to tokenize the words of each review, remove stop words, and then perform a sentiment analysis on the remaining words.
We tokenize the words in the descriptions of a 10-entry subset of our training set for experimentation. We remove stop words and numbers and then run a sentiment analysis on the remaining words using the 'Afinn' lexicon, which assesses the positivity or negativity of words on a scale of -5 to 5:
```{r, warning=FALSE}
afinn <- get_sentiments("afinn") # Load the afinn sentiments
set.seed(1, sample.kind = "Rounding") # Set seed for replicability
sample_set <- train_set %>%
mutate(description = as.character(description)) %>% # Convert descriptions from factors
sample_n(10) %>% # Sample 10 rows
unnest_tokens(word, description) %>% # Tokenize words
filter(!word %in% stop_words$word & # Filter out stop words
!str_detect(word, "^\\d+$")) %>% # Filter out numbers
inner_join(afinn, by="word") # Join the afinn sentiments by word
sample_set %>% knitr::kable()
```
We can see from the above that only a very small subset of words in the descriptions have a sentiment value attached to them, and if we group the sentiments by review, we see that only eight out of the ten reviews have any sentiment value attached at all:
```{r}
# Group the word sentiment scores by review and calculate the average sentiment
sample_set %>% unite(review, province, vintage, variety, sep = " ") %>%
group_by(review) %>%
summarise(points = points[1], Avg_sentiment = mean(value)) %>%
knitr::kable()
```
We also look at a sentiment analysis using the 'bing' lexicon, which classifies words only as either positive or negative:
```{r, warning=FALSE}
bing <- get_sentiments("bing") # Load the bing sentiments
set.seed(1, sample.kind = "Rounding") # Set seed for replicability
sample_set_bing <- train_set %>%
mutate(description = as.character(description)) %>% # Convert descriptions from factors
sample_n(10) %>% # Sample 10 rows
unnest_tokens(word, description) %>% # Tokenize words
filter(!word %in% stop_words$word & # Filter out stop words
!str_detect(word, "^\\d+$")) %>% # Filter out numbers
inner_join(bing, by="word") # Join the bing sentiments by word
sample_set_bing %>% knitr::kable()
```
Slightly more words are picked up by the Bing lexicon than by Afinn; however, it is still a small subset of all words. If we group sentiments by review and calculate the percentage of positive words for each entry, we see that all ten entries now have some sentiment value attached to them:
```{r}
# Group the word sentiment scores by review and calculate the positive word percentage
sample_set_bing %>% unite(review, province, vintage, variety, sep = " ") %>%
group_by(review) %>%
summarise(points = points[1],
Positive_percentage = sum(sentiment == "positive")*100/n()) %>%
knitr::kable()
```
However, the sentiments here clearly have little relation to the points scores: five of the entries have the maximum 100% positive sentiment yet points scores in the 80s, including the minimum score of 80, while the wines scoring above 90 have lower proportions of positive word sentiments.
Overall, it appears that the wine descriptions may be of limited use in our machine learning algorithms for predicting wine quality compared to our other variables. For the purposes of this study, we will leave out further analysis of wine descriptions.
In summary then, in our machine learning algorithms we will use the **price**, **vintage**, **province**, and **variety** variables to predict wine points categories.
We now remove the `description` and `points` variables from our training and testing sets, as well as remove remaining objects from the text sentiment analysis that are no longer required.
```{r}
# Remove unrequired variables from training and testing sets and unused objects
train_set <- train_set %>% select(-description, -points)
test_set <- test_set %>% select(-description, -points)
rm(afinn, bing, sample_set, sample_set_bing)
```
## 2.2 ALGORITHM METHODS
**CLASSIFICATION TREE**
We will first look at a Classification Tree model. We run the `rpart` algorithm using the `train()` function in the `caret` package, which by default tunes using 25 bootstrap resamples. We tune the complexity parameter `cp` over a range of 0 to 0.05:
```{r}
# Train a classification tree model on train_set
train_rpart <- train(category ~ .,
method = "rpart",
data = train_set,
tuneGrid = data.frame(cp = seq(0.0, 0.05, len = 20)))
```
```{r, echo=FALSE}
ggplot(train_rpart, highlight = TRUE)
```
```{r}
train_rpart$results %>% filter(Accuracy == max(Accuracy))
```
We see that utilizing the optimal complexity parameter for our model, the accuracy is only `r max(train_rpart$results$Accuracy)`.
```{r, echo=FALSE}
plot(train_rpart$finalModel, margin = 0.1)
text(train_rpart$finalModel, cex = 0.9)
```
Examining the optimal tree structure, we find that only points categories C, D, and E are being assigned and `price` is the only variable being used to determine the category.
```{r, echo=FALSE, warning=FALSE}
varImp(train_rpart)
```
Examining the variable importance confirms that `price` is overwhelmingly the most important variable in this model.
```{r, echo=FALSE, warning=FALSE}
model_results <- data_frame(Model="Classification Tree",
Accuracy = max(train_rpart$results$Accuracy))
model_results %>% knitr::kable()
```
**RANDOM FOREST**
Next we look at a Random Forest model using the `Rborist` package. We will limit the algorithm to 3-fold cross validation, reduce the number of trees to 50, and take a random subset of 500 observations when constructing each tree, in order to save on computation time.
We run the algorithm with the number-of-predictors tuning parameter `predFixed` over a range of 20 to 80, and with minimum node sizes of 1, 5, and 10 via the `minNode` parameter:
```{r, warning=FALSE}
# Set to 3-fold cross validation and our tuning parameter values to test
control <- trainControl(method="cv", number = 3)
grid <- expand.grid(minNode = c(1, 5, 10), predFixed = seq(20, 80, 10))
# Train random forest model on train_set with 50 trees sampling 500 rows each
train_rf <- train(category ~ .,
method = "Rborist",
data = train_set,
nTree = 50,
tuneGrid = grid,
trControl = control,
nSamp = 500)
```
```{r, echo=FALSE}
ggplot(train_rf, highlight = TRUE)
```
```{r, echo=FALSE}
train_rf
```
We see that our optimal model is using `r train_rf$bestTune$predFixed` predictors and a minimum node size of `r train_rf$bestTune$minNode`, resulting in an accuracy of `r max(train_rf$results$Accuracy)`.
```{r, echo=FALSE,warning=FALSE}
varImp(train_rf)
```
Looking at the variable importance for the random forest model, we again see that price is the most important variable, followed by vintage.
```{r, echo=FALSE, warning=FALSE}
model_results <- bind_rows(model_results,
data_frame(Model="Random Forest",
Accuracy = max(train_rf$results$Accuracy)))
model_results %>% knitr::kable()
```
**MULTINOMIAL LOGISTIC REGRESSION**
Next, we look at a Multinomial Logistic Regression model, `multinom` from the `nnet` package. This model fits a logistic regression for multiple category outcomes, as required in this case. We limit training to 5-fold cross validation to save computing time:
```
# Set to 5-fold cross validation
control <- trainControl(method = "cv", number = 5)
# Train multinomial logistic regression model on train_set
train_mn <- train(category ~ .,
data = train_set,
method = "multinom",
trControl = control,
MaxNWts = 3400)
```
```{r, include=FALSE}
control <- trainControl(method = "cv", number = 5)
train_mn <- train(category ~ .,
data = train_set,
method = "multinom",
trControl = control,
MaxNWts = 3400)
```
```{r, echo=FALSE, warning=FALSE}
ggplot(train_mn, highlight = TRUE) +
scale_x_log10()
```
```{r, echo=FALSE}
train_mn
```
We see using the optimal tuning parameter that we have a model accuracy of only `r max(train_mn$results$Accuracy)`.
```{r, echo=FALSE, warning=FALSE}
model_results <- bind_rows(model_results,
data_frame(Model="Multinomial Logistic Regression",
Accuracy = max(train_mn$results$Accuracy)))
model_results %>% knitr::kable()
```
**QDA MODEL**
Next we look at a QDA model. The model fails to run when the `variety` and `province` variables are included, due to insufficient data points for individual factor levels in our sample, so we use only the `price` and `vintage` variables:
```{r, warning=FALSE}
# Train QDA model on train_set
train_qda <- train(category ~ price + vintage,
data = train_set,
method = "qda")
```
```{r, echo=FALSE}
train_qda
```
We see that the accuracy is only `r train_qda$results$Accuracy`, the lowest yet.
```{r, echo=FALSE, warning=FALSE}
model_results <- bind_rows(model_results,
data_frame(Model="QDA",
Accuracy = max(train_qda$results$Accuracy)))
model_results %>% knitr::kable()
```
**K-NEAREST NEIGHBORS**
Finally, we will look at training a k-Nearest Neighbors (kNN) algorithm on our training set.
Since categorical variables cannot be used with the kNN algorithm, we first need to create dummy variables for each level of the categorical variables `variety` and `province`. We do this by using the `dummyVars()` function in the `caret` package, which creates variables that are either 1 or 0 for each factor level.
```{r, warning=FALSE}
# Set our training set outcomes in a separate vector
y_train <- train_set$category
# Create a dummy variable matrix of predictors for factor variable levels
dummyvars <- dummyVars( ~ price + vintage + variety + province, data = train_set)
train_dummyvars <- predict(dummyvars, newdata = train_set)
dim(train_dummyvars)
```
We see that we have created a matrix of predictors with 541 variables. We see this is correct since we have 217 distinct province factor levels and 322 distinct variety factor levels in our training set, plus the price and vintage variables:
```{r, echo=FALSE}
data.frame(Count = t(train_set %>% summarize(
"train_set distinct provinces" = n_distinct(province),
"train_set distinct varieties" = n_distinct(variety)))) %>%
knitr::kable()
```
We run a kNN algorithm, again limiting to 5-fold cross validation to save on computation time. We run the algorithm with values between 40 and 100 for our tuning parameter, the number of neighbors k:
```{r}
# Set to 5-fold cross validation
control <- trainControl(method = "cv", number = 5)
# Train kNN model using the dummy variables and outcomes for train_set
train_knn <- train(train_dummyvars,
y_train,
method = "knn",
tuneGrid = data.frame(k = seq(40, 100, 10)),
trControl = control)
```
```{r, echo=FALSE}
ggplot(train_knn, highlight = TRUE)
```
```{r, echo=FALSE}
train_knn
```
We see that the optimal value for the number of neighbors k is `r train_knn$bestTune$k`, with an accuracy level of `r max(train_knn$results$Accuracy)` - a close second best.
```{r, echo=FALSE, warning=FALSE}
model_results <- bind_rows(model_results,
data_frame(Model="kNN",
Accuracy = max(train_knn$results$Accuracy)))
model_results %>% knitr::kable()
```
Overall, our model with the highest accuracy is the Random Forest model and we will proceed to test this model on our testing set.
# 3. RESULTS
We first fit a final Random Forest model on the training set using the `Rborist()` function directly, setting the optimal parameters from the previous cross validation and increasing the number of trees to 500. The `Rborist()` function requires a numeric matrix of predictors, so we again use the dummy variables for each of the factor levels.
```{r}
# Train final random forest model using train_set dummy variables and outcomes with 500 trees
final_model <- Rborist(x = train_dummyvars,
y = y_train,
nTree = 500,
predFixed = train_rf$bestTune$predFixed,
minNode = train_rf$bestTune$minNode)
```
Now we create a matrix of predictors including dummy variables for our testing set, `test_dummyvars`, and predict our wine points categories for the test set using our final model with the `predict()` function:
```{r}
# Create a dummy variable matrix of predictors for factor variable levels in the testing set
dummyvars <- dummyVars( ~ price + vintage + variety + province, data = test_set)
test_dummyvars <- predict(dummyvars, newdata = test_set)
# Create our model predictions for the testing set using the final model
y_hat <- as.factor(predict(final_model, test_dummyvars)$yPred)
cm <- confusionMatrix(y_hat, test_set$category)
cm
```
Looking at the confusion matrix above, we see that the accuracy of our final model on the test set is `r round(cm$overall['Accuracy'], 3)[[1]]`. This is in line with our earlier expected accuracy level from the training process.
We see that the sensitivity for the different categories ranges from as low as 0 for classes A and F, which have very few entries, up to around 0.5-0.7 for classes C and D, the most populated categories. Specificity is better across all classes, although it drops below 0.7 for the most populated class, D.
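The per-class figures quoted above can be extracted directly from the confusion matrix object, for example:
```{r, eval=FALSE}
# Extract per-class sensitivity and specificity from the confusion matrix
cm$byClass[, c("Sensitivity", "Specificity")] %>% knitr::kable()
```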
# 4. CONCLUSION
In this project we constructed a machine learning algorithm to predict wine points score categories for wines with unknown titles and wineries in a data set of ratings sourced from Kaggle with nearly 130,000 wine reviews from the Wine Enthusiast Magazine website.
After initially inspecting the data and performing some data cleaning, we took a subset of 10,000 reviews in order to have a more practical sample size for model fitting purposes and built training and testing sets in an 80:20 proportion. We then analyzed and visualized the different variables in some detail to explore their link to wine points scores and determined that price, vintage year, province of origin, and grape variety all appeared to have an impact on score.
We then trained various machine learning algorithms on the training set, including a classification tree model, random forest, multinomial logistic regression, QDA model, and a k-Nearest Neighbors model, and found that the random forest model indicated the best accuracy performance on our training set.
Finally, we ran a random forest model trained on our training set on the testing set and found a final accuracy value of `r round(cm$overall['Accuracy'], 3)[[1]]`. This final value is lower than we might have hoped for: the correct points category is predicted just over half of the time. Ideally, we would want an accuracy value greater than 0.8 or 0.9 in order to have a more useful model.
To improve our model results, we could use a larger sample size for training, up to say over 100,000 reviews. This would, of course, greatly increase the computing time needed to train the models, in particular the computationally intense random forest and kNN models. We could also include more variables in our models, such as country, sub-regional detail where available, and winery if we wanted our models to use that information. The text descriptions might also be put to use if we explored the text sentiment analysis further and found it to be helpful at scale.
In addition, further machine learning models that were not tried in this study could be explored, including k-means clustering, neural networks, and a matrix factorization model using singular value decomposition and principal component analysis. We would again, however, need to choose a sample size that strikes a balance between a practical amount of computing time and model accuracy.