# Classification
**Learning objectives:**
- Compare and contrast **classification** with linear regression.
- Perform classification using **logistic regression**.
- Perform classification using **linear discriminant analysis (LDA)**.
- Perform classification using **quadratic discriminant analysis (QDA)**.
- Perform classification using **naive Bayes**.
- Identify the **strengths and weaknesses** of the various classification models.
- Model count data using **Poisson regression**.
## An Overview of Classification
- **Classification**: approaches for making inferences about and/or predicting a qualitative (categorical) response variable
- Some common classification techniques (classifiers):
- logistic regression
- linear discriminant analysis (LDA)
- quadratic discriminant analysis (QDA)
- naive Bayes
- K-nearest neighbors
<br>
- **Examples of classification problems:**
<br>
1. A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?
- Predictor variable: Symptoms
- Response variable: Type of medical condition
<br>
2. An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
- Predictor variable: User's IP address, past transaction history, etc
- Response variable: Fraudulent activity (Yes/No)
<br>
3. On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.
- Predictor variable: DNA sequence data
- Response variable: Presence of deleterious gene (Yes/No)
<br>
- In the following section, we are going to explore the `Default` data set. The annual income ($X_1$ = `income`) and monthly credit card balance ($X_2$ = `balance`) are used to predict whether an individual will default on his or her credit card payment.
```{r fig4-1, cache=TRUE, echo=FALSE, fig.align="center", fig.cap="The distribution of balance and income split by the binary default variable respectively; Note. Defaulters represented as orange plus sign; non-defaulters represented as blue circle"}
knitr::include_graphics("./images/fig4_1.jpg", error = FALSE)
```
## Why NOT Linear Regression?
- there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression, e.g.
$$Y = \left\{ \begin{array}{ll}
1 & \mbox{if stroke};\\
2 & \mbox{if epileptic seizure};\\
3 & \mbox{if drug overdose}.\end{array} \right.$$
- Depending on the complexity of the problem, a regression method will not provide meaningful estimates of $Pr(Y|X)$;
- There are times when a binary *qualitative* response can be modeled using the *dummy variable* approach. Example:
$$Y = \left\{ \begin{array}{ll}
0 & \mbox{if stroke};\\
1 & \mbox{if drug overdose}.\end{array} \right.$$
- in such cases, a prediction of $\hat{Y} > 0.5$ can be associated with *drug overdose*.
- The main issue is that some estimates might fall outside the [0, 1] probability interval, e.g. the left panel of fig4-2:
```{r fig4-2, cache=TRUE, echo=FALSE, fig.align="center", fig.cap="Classification using the Default data. Left: Estimated probability of default using linear regression. Some estimated probabilities are negative! The orange ticks indicate the 0/1 values coded for default(No or Yes). Right: Predicted probabilities of default using logistic regression. All probabilities lie between 0 and 1."}
knitr::include_graphics("./images/fig4_2.jpg", error = FALSE)
```
## Logistic Regression
### The Logistic Model
- **Logistic regression**: models the probability that Y belongs to a particular category, given the predictor(s) X
- Here the response Y is binary (0/1)
$$p(X) = β_0 + β_1X \space \Longrightarrow {Linear \space regression}$$
$$p (X) = \frac{e^{\beta_{0} + \beta_{1}X}}{1 + e^{\beta_{0} + \beta_{1}X}} \space \Longrightarrow {Logistic \space function}$$
$$odds = \frac{p (X)}{1 - p (X)} = e^{\beta_{0} + \beta_{1}X} \Longrightarrow {odds \space value \space in \space [0, ∞)}$$
By logging the whole equation, we get
$$\log \biggl(\frac{p(X)}{1- p(X)}\bigg) = \beta_{0} + \beta_{1}X \Longrightarrow {log \space odds/logit}$$
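As a minimal sketch of this model in R (assuming the `Default` data from the `ISLR` package used later in these notes), `glm()` with `family = binomial` fits $p(X) = Pr(\text{default} = \text{Yes} \mid \text{balance})$:
```{r logit-default-sketch}
library(ISLR)  # provides the Default data set

# Logistic regression of default (No/Yes) on balance;
# glm() models the probability of the second factor level ("Yes")
logit_default <- glm(default ~ balance, data = Default, family = binomial)
coef(summary(logit_default))

# Predicted probabilities of default for balances of $1,000 and $2,000
predict(logit_default, newdata = data.frame(balance = c(1000, 2000)),
        type = "response")
```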
### Estimating the Regression Coefficient
To estimate the regression coefficients, we use **maximum likelihood (ML)**.
***Likelihood Function***
$$ℓ (\beta_{0}, \beta_{1}) = \prod_{i: y_{i}= 1} p (x_i) \prod_{i': y_{i'}= 0} (1- p (x_{i'})) \Longrightarrow {Likelihood \space function}$$
- The aim is to find the $\beta$ values that maximize $\ell$.
- The least squares method is a special case of maximum likelihood.
### Multiple Logistic Regression
$$\log \biggl(\frac{p(X)}{1- p(X)}\bigg) = \beta_{0} + \beta_{1}X_1 + ... + \beta_{p}X_p \\ \Downarrow \\ p(X) = \frac{e^{\beta_{0} + \beta_{1}X_1 + ... + \beta_{p}X_p}}{1 + e^{\beta_{0} + \beta_{1}X_1 + ... + \beta_{p}X_p}}$$
```{r fig4-3, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Confounding in the Default data. Left: Default rates are shown for students (orange) and non-students (blue). The solid lines display default rate as a function of balance, while the horizontal broken lines display the overall default rates. Right: Boxplots of balance for students (orange) and non-students (blue) are shown."}
knitr::include_graphics("./images/fig4_3.jpg", error = FALSE)
```
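A short sketch of the multiple logistic regression behind this figure, again assuming the `Default` data from `ISLR`; in the book, the coefficient for `student` is positive when used alone but negative once `balance` is included, which is the confounding shown above:
```{r multi-logit-default-sketch}
library(ISLR)

# student alone vs. student together with balance and income
coef(glm(default ~ student, data = Default, family = binomial))
coef(glm(default ~ student + balance + income, data = Default, family = binomial))
```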
### Multinomial Logistic Regression
- This is used in the setting where K > 2 classes. In multinomial, we select a single class to serve as the baseline.
- However, the interpretation of the coefficients in a multinomial logistic regression model must be done with care, since it is tied to the choice of baseline.
- Alternatively, you can use _softmax_ coding, where we _treat all K classes symmetrically_ rather than selecting a baseline: we estimate coefficients for all K classes, instead of estimating coefficients for K − 1 classes.
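This chapter's `Default` response is binary, so as a hedged stand-in example, `nnet::multinom()` (from the recommended `nnet` package, an assumption on my part rather than something used in the book's labs) fits the baseline-category multinomial model to the three-class `iris` data:
```{r multinom-sketch}
library(nnet)

# Multinomial logistic regression; the first factor level ("setosa") serves
# as the baseline class, so the coefficients are log-odds of the other two
# classes relative to that baseline.
multi_fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
summary(multi_fit)$coefficients
```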
## Generative Models for Classification
**Why is logistic regression not always ideal?**
- When there is substantial separation between the two classes, the
parameter estimates for the logistic regression model are surprisingly
unstable.
- If the distribution of the predictors X is approximately normal in
each of the classes and the sample size is small, then the generative modelling may be more accurate than logistic regression.
- Generative modelling can be naturally extended to the case
of more than two response classes.
<br>
**Common notations:**
<br>
- $K \Longrightarrow$ number of response classes
- $π_k \Longrightarrow$ overall or _prior_ probability that a randomly chosen observation comes from the kth class; can be estimated from a random sample from the population
- $f_k(X) ≡ Pr(X|Y = k) \Longrightarrow$ the density function of X for an observation that comes from the kth class; requires some underlying assumptions to estimate
<br>
Bayes’ theorem states that
$$Pr(Y = k|X = x) = \frac {π_k f_k(x)}{\sum_{l=1}^{K} π_lf_l(x)}$$
- $p_k(x) = Pr(Y = k|X = x) \Longrightarrow$ the _posterior probability_ that an observation X = x belongs to the kth class; computed from $f_k(X)$
## A Comparison of Classification Methods
Each of the classifiers below uses different estimates of $f_k(x)$.
- linear discriminant analysis;
- quadratic discriminant analysis;
- naive Bayes
### Linear Discriminant Analysis for p = 1
- one predictor
- classify an observation to the class for which $p_k(x)$ is greatest
**Assumptions:**
- we assume that $f_k(x)$ is normal (Gaussian) with a class-specific mean, and
- a shared variance term across all K classes [$σ^2_1 = · · · = σ^2_K = σ^2$]
The normal density takes the form
$$f_k(x) = \frac{1}{\sqrt{2π}σ_k}\exp\Big(- \frac{1}{2σ^2_k}(x- \mu_k)^2\Big)$$
Then, the posterior probability (probability that the observation belongs to the kth class, given the predictor value for that observation) is
$$p_k(x) = \frac{π_k \frac{1}{\sqrt{2π}σ}\exp(- \frac{1}{2σ^2}(x- \mu_k)^2)}{\sum^K_{l=1} π_l \frac{1}{\sqrt{2π}σ}\exp(- \frac{1}{2σ^2}(x- \mu_l)^2)}$$
**Additional mathematical formula**
After taking the log of the above equation and rearranging, you get the following formula. In the two-class case with $π_1 = π_2$, the Bayes classifier assigns an observation to class 1 if $2x (μ_1 − μ_2) > μ_1^2 − μ_2^2$, and to class 2 otherwise.
$$δ_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(π_k) \Longrightarrow {Equation \space 4.18}$$
The Bayes decision boundary is the point for which $δ_1(x) = δ_2(x)$
$$x = \frac{μ_1^2 − μ_2^2}{2(μ_1 − μ_2)} = \frac{μ_1 + μ_2}{2}$$
```{r fig4-4, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Left: Two one-dimensional normal density functions are shown. The dashed vertical line represents the Bayes decision boundary. Right: 20 observations were drawn from each of the two classes, and are shown as histograms. The Bayes decision boundary is again shown as a dashed vertical line. The solid vertical line represents the LDA decision boundary estimated from the training data."}
knitr::include_graphics("./images/fig4_4.jpg", error = FALSE)
```
The **linear discriminant analysis (LDA)** method approximates the Bayes classifier by plugging estimates for $π_k$, $μ_k$, and $σ^2$ into Equation 4.18.
$\hat μ_k$ is the average of all the training observations from the kth class
$$\hat{\mu}_{k} = \frac{1}{n_{k}}\sum_{i: y_{i}= k} x_{i}$$
$\hat σ^2$ is the weighted average of the sample variances for each of the K classes
$$\hat{\sigma}^2 = \frac{1}{n - K} \sum_{k = 1}^{K} \sum_{i: y_{i}= k} (x_{i} - \hat{\mu}_{k})^2$$
Note.
n = total number of training observations,
$n_k$ = number of training observations in the kth class
$\hat π_k$ is estimated as the proportion of the training observations
that belong to the kth class.
$$\hat π_k = \frac{n_k}{n}$$
LDA classifier assigns an observation X = x to the class for which $δ_k(x)$ is largest.
$$δ_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(π_k) \Longrightarrow {Equation \space 4.18} \\ \Downarrow \\ \hat δ_k(x) = x \cdot \frac{\hat \mu_k}{\hat \sigma^2} - \frac{\hat \mu_k^2}{2\hat \sigma^2} + \log(\hat π_k)$$
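A small simulation sketch of this plug-in idea (the sample sizes, seed, and means below are arbitrary choices, not from the text): with $μ_1 = -1.25$, $μ_2 = 1.25$, $σ^2 = 1$, and equal priors, the estimated boundary $(\hat μ_1 + \hat μ_2)/2$ should land near the Bayes boundary $(μ_1 + μ_2)/2 = 0$, and `MASS::lda()` applies the same rule:
```{r lda-p1-sim}
library(MASS)

set.seed(42)
n <- 20
sim <- data.frame(
  x = c(rnorm(n, mean = -1.25), rnorm(n, mean = 1.25)),
  class = factor(rep(c("1", "2"), each = n))
)

# Plug-in estimates of mu_1 and mu_2, and the estimated decision boundary
mu_hat <- tapply(sim$x, sim$class, mean)
mean(mu_hat)   # estimated boundary; Bayes boundary is (mu_1 + mu_2)/2 = 0

# The LDA fit from MASS applies the same classification rule
lda_sim <- lda(class ~ x, data = sim)
table(predicted = predict(lda_sim)$class, actual = sim$class)
```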
### Linear Discriminant Analysis for p > 1
- multiple predictors; p > 1 predictors
- observations come from a multivariate Gaussian (or multivariate normal) distribution, with a **class-specific mean vector** and a common **covariance matrix**; $$N(μ_k,Σ)$$
**Assumptions:**
- each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors
```{r fig4-5, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Two multivariate Gaussian density functions are shown, with p = 2. Left: The two predictors are uncorrelated and it has a circular base. Var(X_1) = Var(X_2) and Cor(X_1,X_2) = 0; Right: The two variables have a correlation of 0.7 with a elliptical base"}
knitr::include_graphics("./images/fig4_5.jpg", error = FALSE)
```
The multivariate Gaussian density is defined as:
$$f(x) = \frac{1}{(2π)^{\frac{p}{2}}|Σ|^{\frac{1}{2}}}\exp\Big(-\frac{1}{2}(x - \mu)^T Σ^{−1}(x − μ)\Big)$$
Bayes classifier assigns an observation X = x to the class for which $δ_k(x)$ is largest.
$$δ_k(x) = x^T Σ^{−1}μ_k - \frac{1}{2}μ_k^T Σ^{−1} μ_k + \log π_k \Longrightarrow vector/matrix \space version \\ δ_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(π_k) \Longrightarrow {Equation \space 4.18}$$
```{r fig4-6, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "An example with three classes. The observations from each class are drawn from a multivariate Gaussian distribution with p = 2, with a class-specific mean vector and a common covariance matrix. Left: Ellipses that contain 95% of the probability for each of the three classes are shown. The dashed lines are the Bayes decision boundaries. Right: 20 observations were generated from each class, and the corresponding LDA decision boundaries are indicated using solid black lines. The Bayes decision boundaries are once again shown as dashed lines. Overall, the LDA decision boundaries are pretty close to the Bayes decision boundaries, shown again as dashed lines. The test error rates for the Bayes and LDA classifiers are 0.0746 and 0.0770, respectively."}
knitr::include_graphics("./images/fig4_6.jpg", error = FALSE)
```
All classification models have training error rate, which can be displayed with a **confusion matrix**.
**Caveats of error rate:**
- training error rates will usually be lower than test error rates, which are the real quantity of interest. The higher the ratio of parameters _p_ to number of samples n, the more we expect this _overfitting_ to play a role.
- the trivial null classifier will achieve an error rate that is only a bit higher than the LDA training set error rate
- a binary classifier such as this one can make two types of errors (Type I and II)
- Class-specific performance _(sensitivity and specificity)_ is important in certain fields (e.g., medicine)
LDA (on the `Default` data) has low sensitivity because:
1. LDA is trying to approximate the Bayes classifier, which has the lowest
total error rate out of all classifiers
2. In the process, the Bayes classifier will yield the smallest possible total number of misclassified observations, regardless of the class from which the errors stem.
3. It also uses a threshold of 50% for the posterior probability of default in order to assign an observation to the default class
For example, lowering the threshold from 0.5 to 0.2 flags more potential defaulters, improving sensitivity at the cost of more errors among non-defaulters:
$$Pr(default = Yes|X = x) > 0.5 \quad\longrightarrow\quad Pr(default = Yes|X = x) > 0.2$$
```{r fig4-7, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "The figure illustrates the trade-off that results from modifying the threshold value for the posterior probability of default. For the Default data set, error rates are shown as a function of the threshold value for the posterior probability that is used to perform the assignment. The black solid line displays the overall error rate. The blue dashed line represents the fraction of defaulting customers that are incorrectly classified, and the orange dotted line indicates the fraction of errors among the non-defaulting customers."}
knitr::include_graphics("./images/fig4_7.jpg", error = FALSE)
```
- As the threshold is reduced, the error rate among individuals who default decreases steadily, but the error rate among the individuals who do not default increases. The decision on the threshold must be based on **domain knowledge** (e.g., detailed information about the costs associated with default)
- The ROC curve is a way to illustrate the two types of errors at all possible thresholds.
```{r fig4-8, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "The true positive rate is the sensitivity: the fraction of defaulters that are correctly identified, using a given threshold value. The false positive rate is 1-specificity: the fraction of non-defaulters that we classify incorrectly as defaulters, using that same threshold value. The ideal ROC curve hugs the top left corner, indicating a high true positive rate and a low false positive rate. The dotted line represents the “no information” classifier; this is what we would expect if student status and credit card balance are not associated with probability of default."}
knitr::include_graphics("./images/fig4_8.jpg", error = FALSE)
```
An ideal ROC curve will hug the top left corner, so the larger **area under the ROC curve (AUC)**, the better the classifier.
```{r tbl4_6, cache=FALSE, echo=FALSE, fig.align="center", fig.cap="Possible results when applying a classifier or diagnostic test to a population"}
library("htmlTable")
library("magrittr")
matrix(c("True Neg. (TN)", "False Pos. (FP)", "N", "False Neg. (FN)", "True Pos. (TP)", "P", "N∗", "P∗", ""),
ncol = 3,
dimnames = list("Predicted class" = c(" − or Null", " + or Non-null", "Total"),
"True class" = c("Neg. or Null", "Pos. or Non-null", "Total"))) %>%
addHtmlTableStyle(align = "lcr") %>%
htmlTable
```
Important measures for classification and diagnostic testing (a small numeric sketch follows this list):
- **False Positive rate (FP/N)** $\Longrightarrow$ Type I error, 1−Specificity
- **True Positive rate (TP/P)** $\Longrightarrow$ 1−Type II error, power, sensitivity, recall
- **Pos. Predicted value (TP/P∗)** $\Longrightarrow$ Precision, 1−false discovery proportion
- **Neg. Predicted value (TN/N∗)**
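A quick base-R sketch of these measures; the confusion-matrix counts below are made up for illustration (they are not from the text), laid out as in the table above with rows = predicted class and columns = true class:
```{r classification-measures-sketch}
# Hypothetical confusion-matrix counts (not from the text)
TN <- 950; FP <- 50   # true class negative: N = TN + FP
FN <- 40;  TP <- 60   # true class positive: P = FN + TP

c(
  false_positive_rate = FP / (TN + FP),  # FP / N  (Type I error, 1 - specificity)
  true_positive_rate  = TP / (FN + TP),  # TP / P  (sensitivity, recall, power)
  pos_pred_value      = TP / (TP + FP),  # TP / P* (precision)
  neg_pred_value      = TN / (TN + FN)   # TN / N*
)
```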
### Quadratic Discriminant Analysis (QDA)
- Like LDA, QDA assumes that the observations from each class are drawn from a Gaussian distribution and plugs estimates for the parameters into Bayes' theorem in order to perform prediction
- QDA assumes that each class has its own covariance matrix
$$X ∼ N(μ_k,Σ_k) \Longrightarrow {Σ_k \space is \space the \space covariance \space matrix \space for \space the \space kth \space class}$$
**Bayes classifier**
$$δ_k(x) = - \frac{1}{2}(x - \mu_k)^T Σ_k^{−1}(x - \mu_k) - \frac{1}{2}\log|Σ_k| + \log(π_k) \\ \Downarrow \\ δ_k(x) = - \frac{1}{2}x^T Σ_k^{−1}x + x^T Σ_k^{−1} \mu_k - \frac{1}{2}μ_k^T Σ_k^{−1} μ_k - \frac{1}{2}\log|Σ_k| + \log π_k$$
QDA classifier involves plugging estimates for **$Σ_k$, $μ_k$, and $π_k$** into the above equation, and then assigning an observation X = x to the class for which this quantity is **largest**.
The quantity x appears as a quadratic function, hence the name.
<br>
**When is LDA preferred to QDA, or vice-versa?**
<br>
1. **Bias-variance trade-off**
<br>
- Pro LDA: LDA assumes that the K classes share a common covariance matrix, so the discriminant is linear in x and there are only $Kp$ linear coefficients to estimate. LDA is a much less flexible classifier than QDA, and so has substantially *lower variance*, which can improve prediction performance.
- Con LDA: if the assumption that the K classes share a common covariance matrix is badly off, LDA can suffer from *high bias*.
- Conclusion: use LDA when there are relatively few training observations; use QDA when the training set is very large or a common covariance matrix is untenable.
```{r fig4-9, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Left: The Bayes (purple dashed), LDA (black dotted), and QDA (green solid) decision boundaries for a two-class problem with Σ1 = Σ2. The shading indicates the QDA decision rule. Since the Bayes decision boundary is linear, it is more accurately approximated by LDA than by QDA. Right: Details are as given in the left-hand panel, except that Σ1 ̸= Σ2. Since the Bayes decision boundary is non-linear, it is more accurately approximated by QDA than by LDA."}
knitr::include_graphics("./images/fig4_9.jpg", error = FALSE)
```
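A hedged simulation sketch of this trade-off (the class means, covariance matrices, sample sizes, and seed below are arbitrary choices, not values from the text): when the two classes have different covariance matrices, the Bayes boundary is non-linear and QDA should edge out LDA on test error.
```{r lda-vs-qda-sim}
library(MASS)

set.seed(1)
make_class <- function(n, mu, Sigma, label) {
  data.frame(mvrnorm(n, mu = mu, Sigma = Sigma), class = label)
}

# Class-specific covariance matrices (Sigma_1 != Sigma_2)
Sigma1 <- matrix(c(1,  0.5,  0.5, 1), nrow = 2)
Sigma2 <- matrix(c(1, -0.5, -0.5, 1), nrow = 2)

train <- rbind(make_class(200,  c(0, 0), Sigma1, "A"),
               make_class(200,  c(1, 1), Sigma2, "B"))
test  <- rbind(make_class(2000, c(0, 0), Sigma1, "A"),
               make_class(2000, c(1, 1), Sigma2, "B"))

# Test error rates for LDA (common covariance) vs. QDA (class-specific)
lda_err <- mean(predict(lda(class ~ X1 + X2, data = train), test)$class != test$class)
qda_err <- mean(predict(qda(class ~ X1 + X2, data = train), test)$class != test$class)
c(LDA = lda_err, QDA = qda_err)
```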
### Naive Bayes
- Estimating a p-dimensional density function is challenging; naive Bayes makes a different assumption than LDA and QDA.
- an alternative to LDA that does not assume normally distributed
predictors
$$f_k(x) = f_{k1}(x_1) × f_{k2}(x_2)×· · ·×f_{kp}(x_p),$$
where $f_{kj}$ is the density function of the jth predictor among observations in the kth class
*Within the kth class, the p predictors are independent.*
**Why is naive Bayes powerful?**
1. By assuming that the p covariates are independent within each class, we assume that there is no association between the predictors. This sidesteps the hard part of estimating a p-dimensional density function: we no longer need the *joint distribution* of the predictors, only the *marginal distribution* of each one.
2. Although the p covariates are usually not truly independent within each class, the assumption is convenient and gives pretty decent results, especially when n is small relative to p.
3. It reduces variance, though it has some bias (Bias-variance trade-off)
**Options to estimate the one-dimensional density function $f_{kj}$ using training data**
1. [For Quantitative $X_j$] -> We assume $X_j |Y = k ∼ N(μ_{jk},σ_{jk}^2)$, where within each class, the jth predictor is drawn from a (univariate) normal distribution. It is **QDA-like with diagonal class-specific covariance matrix**
2. [For Quantitative $X_j$] -> Use a *non-parametric estimate* for $f_{kj}$. First, a histogram for the within-class observations and then estimate $f_{kj}(x_j)$. Or else, use **kernel density estimator**.
3. [For Qualitative $X_j$] ->Count the proportion of training observations for the jth predictor corresponding to each class.
Note: At a fixed threshold, naive Bayes has a slightly higher overall error rate than LDA on the `Default` data, but it correctly identifies a larger fraction of the true defaulters (higher sensitivity).
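A minimal sketch of naive Bayes in R, assuming the `e1071` package (an assumption; it is not loaded elsewhere in these notes, so the chunk is not evaluated) and the `Default` data from `ISLR`; its `naiveBayes()` uses the Gaussian option above for quantitative predictors and class proportions for qualitative ones:
```{r naive-bayes-sketch, eval=FALSE}
library(ISLR)
library(e1071)  # assumed to be installed

# Gaussian naive Bayes: class-specific univariate normals for balance and
# income, class-conditional proportions for the qualitative student variable
nb_fit <- naiveBayes(default ~ student + balance + income, data = Default)

# Confusion matrix at the default 0.5 posterior-probability threshold
table(predicted = predict(nb_fit, Default), actual = Default$default)
```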
## Summary of the classification methods
### An Analytical Comparison
- **LDA** and **logistic regression** assume that the log odds of the posterior probabilities is _linear_ in x.
- **QDA** assumes that the log odds of the posterior probabilities is _quadratic_ in x.
- **LDA** is simply a restricted version of QDA with $Σ_1 = · · · = Σ_K = Σ$
- **LDA** is a special case of naive Bayes and vice-versa!
- **LDA** assumes that the features are normally distributed with a common within-class covariance matrix, and naive Bayes instead assumes _independence_ of the features.
- **Naive Bayes** can produce a more _flexible_ fit.
- **QDA** might be more accurate in settings where interactions among the predictors are important in discriminating between classes.
- **LDA > logistic regression** when the observations in each of the K classes are approximately normal.
- **K-nearest neighbors (KNN)** will be the better classifier when the decision boundary is non-linear, n is large, and p is small.
- **KNN** has low bias but large variance; as such, KNN requires a lot of observations relative to the number of predictors.
- If the decision boundary is non-linear but n is only modest (or p is not tiny), then QDA may be preferred to KNN.
- KNN does not tell us which predictors are important!
<br>
_Final note._ The choice of method depends on (1) the true distribution of the predictors in each of the K classes, and (2) the values of n and p, i.e. the bias-variance trade-off.
### An Empirical Comparison
```{r fig4-11, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Boxplots of the test error rates for each of the linear scenarios described in the main text."}
knitr::include_graphics("./images/fig4_11.jpg", error = FALSE)
```
**When Bayes decision boundary is linear,**
_Scenario 1_: Binary class response, equal observations in each class, uncorrelated predictors
_Scenario 2_: Similar to Scenario 1, but the predictors had a correlation of −0.5.
_Scenario 3_: Predictors had a negative correlation and were drawn from a t-distribution (more extreme points in the tails)
```{r fig4-12, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Boxplots of the test error rates for each of the non-linear scenarios described in the main text"}
knitr::include_graphics("./images/fig4_12.jpg", error = FALSE)
```
**When Bayes decision boundary is non-linear,**
_Scenario 4_: Normal distribution, correlation of 0.5 between the predictors in the first class, and correlation of −0.5 between the predictors in the second class.
_Scenario 5_: Normal distribution, uncorrelated predictors
_Scenario 6_: Normal distribution, different diagonal covariance matrix for each class, small n
## Generalized Linear Models
**Count data** (e.g. number of bikers per hour) is neither quantitative nor qualitative
=> neither linear regression nor the classification approaches considered so far are applicable.
## Linear regression with count data - negative values
The results of fitting a least squares regression model to the `Bikeshare` data provides some reasonable results:
* as weather progressively worsens, the number of bikers decreases (_coefficients become negative wrt baseline_)
* the coefficients associated with season and time of day match expected patterns (_lowest in winter, and highest during peak commute times_)
```{r tab4-10, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "_Results for a least squares linear model fit to predict bikers in the Bikeshare data. For the qualitative variable weathersit, the baseline level corresponds to clear skies._"}
knitr::include_graphics("./images/tab4_10.jpg", error = FALSE)
```
```{r fig4-13, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "_A least squares linear regression model was fit to predict bikers in the Bikeshare data set. Left: The coefficients associated with the month of the year. Bike usage is highest in the spring and fall, and lowest in the winter. Right: The coefficients associated with the hour of the day. Bike usage is highest during peak commute times, and lowest overnight._"}
knitr::include_graphics("./images/fig4_13.jpg", error = FALSE)
```
***Problem 1***: <mark>*model predicts negative numbers of bikers at times*</mark>
## Linear regression with count data - heteroscedasticity
In this example, the variance of biker numbers changes as the mean number changes:
* during worse conditions, there are few bikers, and little variation in the number of bikers
* during better conditions, there are many bikers on average, but also larger variation in the number of bikers
```{r fig4-14, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "_Left: On the Bikeshare dataset, the number of bikers is displayed on the y-axis, and the hour of the day is displayed on the x-axis. For the most part, as the mean number of bikers increases, so does the variance in the number of bikers. A smoothing spline fit is shown in green. Right: The log of the number of bikers is displayed on the y-axis._"}
knitr::include_graphics("./images/fig4_14.jpg", error = FALSE)
```
***Problem 2***: <mark>*observed heteroscedasticity is a violation of linear model assumptions*</mark>
$$Y = \beta_{0} + \sum_{j=1}^p \beta_{j}X_j + \epsilon$$
where $\epsilon$ is a mean-zero error term with a constant variance
Transforming the response to $\log(Y)$ tames the heteroscedasticity, but cannot be used when the response can take on a value of 0.
Log transformation also results in challenges in interpretation:
e.g. “_a one-unit increase in $X_j$ is associated with an increase in the mean of the log of $Y$ by an amount $β_j$_”
## Problems with linear regression of count data
***Problem 1***: <mark>*model predicts negative numbers of bikers at times*</mark>
***Problem 2***: <mark>*observed heteroscedasticity is a violation of linear model assumptions*</mark>
***Problem 3***: <mark>*integer values (bikers) predicted using a continuous response $Y$*</mark>
"_[A] Poisson regression model provides a much more natural and elegant approach for this task._"
## Poisson distribution
A count response variable $Y$ (which takes on non-negative integer values) can be modeled using the **Poisson distribution**, where the probability that $Y$ takes on a given count value $k$ can be calculated as:
$Pr(Y = k) = \frac{e^{-\lambda}\lambda^k}{k!}$ for $k$ = 0, 1, 2, ...
where $\lambda$ represents both the expected value (mean) and variance of $Y$:
$\lambda = E(Y) = Var(Y)$
=> "_[I]f $Y$ follows the Poisson distribution, then the larger the mean of $Y$, the larger its variance._"
```{r fig.cap= "_Plots of Poisson Distributions with different lambda values, showing how variance increases with increasing lambda. Note all values are non-negative integer values, suitable for modelling counts, k._"}
par(mfrow = c(2,2))
lambda <- c(1:4)
k <- c(0:10)
for (lam in lambda) {
Prk <- (exp(-lam)*lam^k)/factorial(k)
plot(k, Prk, type = 'b', ylim = c(0, 0.4), main = paste("lambda =", lam))
}
```
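A quick simulation check of that mean-variance statement (the sample size and seed are arbitrary):
```{r poisson-mean-var-check}
set.seed(123)
y <- rpois(1e5, lambda = 4)           # draws from a Poisson with lambda = 4
c(mean = mean(y), variance = var(y))  # both should be close to 4
```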
## Poisson Regression Model mean (lambda)
"_[R]ather than modeling [a count response variable], $Y$, as a Poisson distribution with a fixed mean value like $\lambda$ = 5, we would like to allow the mean to vary as a function of the covariates._"
The mean $\lambda$ can be modeled as a function of the predictor variables as follows:
$\log(\lambda(X_1, ..., X_p)) = \beta_0 + \beta_1X_1 + ... + \beta_pX_p$
NB: taking the log ensures that $\lambda$ can only be non-negative.
This is equivalent to representing the mean $\lambda$ as follows:
$\lambda = \text{E}(Y) = \lambda(X_1, ..., X_p) = e^{\beta_0 + \beta_1X_1 + ... + \beta_pX_p}$
## Estimating the Poisson Regression parameters
The calculation of $\lambda$ can then be used in the formula of the Poisson Distribution, allowing the Maximum Likelihood approach to be used in estimating the parameters, $\beta_0$, $\beta_1$,..., $\beta_p$:
Poisson Distribution Formula: $Pr(Y = k) = \frac{e^{-\lambda}\lambda^k}{k!}$ for $k$ = 0, 1, 2, ...
Maximum likelihood: $l(\beta_0, \beta_1, ..., \beta_p) = \prod_{i=1}^n\frac{e^{-\lambda(x_i)}\lambda(x_i)^{y_i}}{y_i!}$
where $\lambda(x_i) = e^{\beta_0 + \beta_1x_{i1} + ... + \beta_px_{ip}}$
Coefficients that maximize the likelihood $l(\beta_0, \beta_1, ..., \beta_p)$ (make the observed data as likely as possible) are chosen.
## Interpreting Poisson Regression
An increase in $X_j$ by one unit is associated with a change in $E(Y) = \lambda$ by a factor of $exp(\beta_j)$
```{r tab4-11, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "_Results for Poisson regression model fit to predict bikers in the Bikeshare data. For the qualitative variable weathersit, the baseline level corresponds to clear skies._"}
knitr::include_graphics("./images/tab4_11.jpg", error = FALSE)
```
A change in weather from clear to cloudy skies is associated with a change in mean bike usage by a factor of
exp(-0.08) = 0.923
i.e. on average, only 92.3% as many people will use bikes compared to when it is clear (baseline weather).
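A hedged sketch of the fit behind this table, assuming the `Bikeshare` data from the `ISLR2` package (the data set the book uses for this example; `ISLR2` is not loaded elsewhere in these notes, so the chunk is not evaluated):
```{r poisson-bikeshare-sketch, eval=FALSE}
library(ISLR2)  # provides the Bikeshare data

# Poisson regression: log of the mean number of bikers is linear in the predictors
pois_fit <- glm(bikers ~ mnth + hr + workingday + temp + weathersit,
                data = Bikeshare, family = poisson)

# exp(beta_j) is the multiplicative change in E(Y) for a one-unit change in X_j
# (here, each weather level relative to the clear-skies baseline)
exp(coef(pois_fit)[grep("^weathersit", names(coef(pois_fit)))])
```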
## Advantages of Poisson Regression
Poisson regression has several advantages in modeling count data:
**Mean-variance relationship** We implicitly assume that mean bike usage in a given hour equals the variance of bike usage during that hour (cf. the constant variance assumed in linear regression).
**Non-negative fitted values** There are no negative predictions using the Poisson regression model.
## Generalized Linear Models
Generalized linear models (GLMs) all follow the same 'recipe' (illustrated in the sketch after this list):
* use a set of predictors $X_1$, ..., $X_p$ to predict a response $Y$
* model the response $Y$ as coming from a particular distribution
e.g. Poisson Distribution, for Poisson regression
* transform the mean of the response (via a _link function_ $\eta$) so that the transformed mean is a linear function of the predictors
e.g. for Poisson regression, $\log(\lambda(X_1, ..., X_p)) = \beta_0 + \beta_1X_1 + ... + \beta_pX_p$
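In R, the recipe corresponds to the single `glm()` interface with different `family` arguments; the sketch below is illustrative only (`y`, `x1`, `x2`, and `dat` are hypothetical placeholders):
```{r glm-recipe-sketch, eval=FALSE}
# Same recipe, different response distribution and link function
glm(y ~ x1 + x2, family = gaussian(link = "identity"), data = dat)  # linear regression
glm(y ~ x1 + x2, family = binomial(link = "logit"),    data = dat)  # logistic regression
glm(y ~ x1 + x2, family = poisson(link = "log"),       data = dat)  # Poisson regression
```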
## Addendum - Logistic Regression Assumptions
```{r}
library(dplyr)
library(titanic)
library(car)
```
Source: [The 6 Assumptions of Logistic Regression (With Examples)](https://www.statology.org/assumptions-of-logistic-regression/)
Source: [Assumptions of Logistic Regression, Clearly Explained](https://towardsdatascience.com/assumptions-of-logistic-regression-clearly-explained-44d85a22b290)
**Logistic regression** is a method to fit a regression model usually when the response variable is binary.
#### Assumption #1 - The response variable is binary
Examples:
- Yes or No
- Male or Female
- Pass or Fail
For more than two possible outcomes, a multinomial or ordinal regression is the model of choice.
#### Assumption #2 - Observations are independent
As with OLS regression, logistic regression requires that the observations are independent and identically distributed (iid).
The easiest check is to plot the residuals against time (i.e. the order of observations) and see whether the pattern looks random.
#### Assumption #3 - No multicollinearity among predictors
Multicollinearity occurs when two or more explanatory variables are highly correlated to each other, such that they do not provide unique or independent information in the regression model. If the degree of correlation is high enough between variables, it can cause problems when fitting and interpreting the model.
Use variance inflation factors (`car::vif()`) to check multicollinearity (values > 10 indicate strong collinearity among predictors).
#### Assumption #4 - No extreme outliers
Logistic regression assumes that there are no extreme outliers or influential observations in the dataset.
Compute Cook's distance (`cooks.distance()`) for each observation to flag influential points.
#### Assumption #5 - There is a Linear Relationship Between Explanatory Variables and the Logit of the Response Variable
Logistic regression assumes that there exists a linear relationship between each explanatory variable and the logit of the response variable. Recall that the logit is defined as:
$\text{logit}(p) = \log(p / (1-p))$, where $p$ is the probability of a positive outcome.
Use the Box-Tidwell test (`car::boxTidwell()`) to check this assumption.
Example:
```{r}
titanic <- titanic_train %>%
select(Survived, Age, Fare) %>%
na.omit() %>%
janitor::clean_names()
glimpse(titanic)
```
Build the model
```{r}
# survived (target ~ age + fare)
log_reg <- glm(survived ~ age + fare, data = titanic, family = binomial(link = "logit"))
summary(log_reg)
```
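Before moving to assumption #5, the fitted model above can also be used for quick checks of assumptions #3 and #4; a small sketch using the `car` and base functions already loaded:
```{r titanic-assumption-checks}
# Assumption #3: multicollinearity among predictors (VIF > 10 is a red flag)
vif(log_reg)

# Assumption #4: extreme outliers / influential observations
head(sort(cooks.distance(log_reg), decreasing = TRUE))
plot(log_reg, which = 4)   # Cook's distance plot
```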
Box-Tidwell test
```{r}
# Shift age and fare by +1 so that all values are strictly positive,
# as required by the Box-Tidwell transformation test
titanic <- titanic %>%
mutate(age_1 = age + 1, fare_1 = fare + 1)
boxTidwell(survived ~ age_1 + fare_1, data = titanic)
```
#### Assumption #6 - Sample size must be sufficiently large
Logistic regression assumes that the sample size of the dataset is large enough to draw valid conclusions from the fitted logistic regression model.
As a rule of thumb, you should have a minimum of 10 cases with the least frequent outcome for each explanatory variable. For example, if you have 3 explanatory variables and the expected probability of the least frequent outcome is 0.20, then you should have a sample size of at least (10*3) / 0.20 = 150.
## Lab: Classification Methods
## Exercises
<!--
adapted from https://onmee.github.io/assets/docs/ISLR/Classification.pdf
-->
### Conceptual
1. Claim: The logistic function representation for logistic regression
$$p (X) = \frac{e^{\beta_{0} + \beta_{1}X}}{1 + e^{\beta_{0} + \beta_{1}X}} \space \Longrightarrow {Logistic \space function}$$
is equivalent to the logit function representation for logistic regression.
$$\log \biggl(\frac{p(X)}{1- p(X)}\bigg) = \beta_{0} + \beta_{1}X \Longrightarrow {log \space odds/logit}$$
Proof:
$$\begin{array}{rcl}
p(X) & = & \frac{e^{\beta_{0} + \beta_{1}X}}{1 + e^{\beta_{0} + \beta_{1}X}} \\
p(X)[1 + e^{\beta_{0} + \beta_{1}X}] & = & e^{\beta_{0} + \beta_{1}X} \\
p(X) + p(X)e^{\beta_{0} + \beta_{1}X} & = & e^{\beta_{0} + \beta_{1}X} \\
p(X) & = & e^{\beta_{0} + \beta_{1}X} - p(X)e^{\beta_{0} + \beta_{1}X} \\
p(X) & = & e^{\beta_{0} + \beta_{1}X}[1 - p(X)] \\
\frac{p(X)}{1 - p(X)} & = & e^{\beta_{0} + \beta_{1}X} \\
\log \biggl(\frac{p(X)}{1- p(X)}\bigg) & = & \beta_{0} + \beta_{1}X \\
\end{array}$$
2. Under the assumption that the observations in the $k^{th}$ class are drawn from a Gaussian $N(\mu_{k}, \sigma^{2})$ distribution,
$$p_k(x) = \frac{π_k \frac{1}{\sqrt{2π}σ}exp(- \frac{1}{2σ^2}(x- \mu_k)^2)}{\sum^K_{l=1} π_l \frac{1}{\sqrt{2π}σ}exp(- \frac{1}{2σ^2}(x- \mu_l)^2)}$$
is largest when $x = \mu_{k}$ (i.e. an observation is classified to the kth class when $x$ is close to $\mu_{k}$). We can proceed toward the discriminant function $\delta_{k}(x) = \ln C^{-1}p_{k}(x)$ using $C$ as a scaling factor of proportionality
$$\begin{array}{rcl}
p_k(x) & = & \frac{π_k \frac{1}{\sqrt{2π}σ}exp(- \frac{1}{2σ^2}(x- \mu_k)^2)}{\sum^K_{l=1} π_l \frac{1}{\sqrt{2π}σ}exp(- \frac{1}{2σ^2}(x- \mu_l)^2)} \\
p_k(x) & \propto & π_k \frac{1}{\sqrt{2π}σ}exp(- \frac{1}{2σ^2}(x- \mu_k)^2) \\
p_k(x) & \propto & π_k \frac{1}{\sqrt{2π}σ}exp(- \frac{1}{2σ^2}(x^{2}- 2\mu_{k}x + \mu_{k}^{2})) \\
p_k(x) & = & Cπ_k exp(- \frac{1}{2σ^2}(-2\mu_{k}x + \mu_{k}^{2})) \\
C^{-1}p_k(x) & = & π_k exp(- \frac{1}{2σ^2}(-2\mu_{k}x + \mu_{k}^{2})) \\
\ln C^{-1}p_{k}(x) & = & \ln \pi_{k} + \frac{\mu_{k}x}{\sigma^{2}} - \frac{\mu_{k}^{2}}{2\sigma^{2}} \\
\delta_{k}(x) & = & \frac{\mu_{k}x}{\sigma^{2}} - \frac{\mu_{k}^{2}}{2\sigma^{2}} + \ln \pi_{k}
\end{array}$$
where the observation is also classified into the $k^{th}$ class when $x$ is close to $\mu_{k}$.
3. For QDA, whose observations $X_{k} \sim N(\mu_{k}, \sigma_{k}^{2})$, consider the case with one feature (i.e. $p = 1$). Prove that the Bayes Classifier is quadratic (i.e. not linear).
In a similar proof as the previous exercise, but without the assumption of the same variance, so each class has its own variance $\sigma_{k}$,
$$p_k(x) = \frac{π_k \frac{1}{\sqrt{2π}σ_{k}}exp(- \frac{1}{2σ_{k}^2}(x- \mu_k)^2)}{\sum^K_{l=1} π_l \frac{1}{\sqrt{2π}σ_{l}}exp(- \frac{1}{2σ_{l}^2}(x- \mu_l)^2)}$$
and we would arrive at the discriminant function
$$\begin{array}{rcl}
C^{-1}p_k(x) & = & \frac{\pi_k}{\sigma_{k}} exp(- \frac{1}{2σ_{k}^2}(x^{2}-2\mu_{k}x + \mu_{k}^{2})) \\
\ln C^{-1}p_{k}(x) & = & \ln\frac{π_k}{\sigma_{k}} - \frac{x^{2}}{2\sigma_{k}^{2}} + \frac{\mu_{k}x}{\sigma_{k}^{2}} - \frac{\mu_{k}^{2}}{2\sigma_{k}^{2}} \\
\delta_{k}(x) & = & - \frac{1}{2\sigma_{k}^{2}}x^{2} + \frac{\mu_{k}}{\sigma_{k}^{2}}x - \frac{\mu_{k}^{2}}{2\sigma_{k}^{2}} + \ln \pi_{k} - \ln\sigma_{k}
\end{array}$$
which is quadratic with respect to $x$.
4. When the number of features $p$ is large, we may encounter the *curse of dimensionality*.
a) $p = 1, X \sim U(0,1)$, and we classify a test observation using the training observations within 10 percent of its value. On average, we use 10 percent of the available observations.
b) $p = 2$, we would use 1 percent of the observations.
c) $p = 100$, we would use $0.1^{p-2}$ percent of the observations.
d) KNN becomes unreliable: as p grows, essentially no training observations are close to a given test observation, so predictions are based on very few (or very distant) neighbors.
e) One idea is to extend the side length of the $p$-dimensional hypercube until it captures 10 percent of the observations, but the required side length ($0.1^{1/p}$) approaches 1 as p grows, so the neighborhood is no longer local.
5. LDA versus QDA
* If the Bayes decision boundary is linear, QDA may be better on the training set with its flexibility, but will probably be worse on the test set due to higher variance. Therefore, LDA is advised.
* If the Bayes decision boundary is non-linear, QDA is advised.
6. We model with logistic regression
* $Y$: receive an A
* $X_{1}$: hours studied, $X_{2}$: undergraduate GPA
* coefficients $\hat{\beta}_{0} = -6$, $\hat{\beta}_{1} = 0.05$, $\hat{\beta}_{2} = 1$
a) Estimate the probability that a student who studies for 40 hours and has an undergraduate GPA of 3.5 gets an A in the class.
$$Y = \frac{e^{\hat{\beta}_{0} + \hat{\beta}_{1}X_{1} + \hat{\beta}_{2}X_{2}} }{1 + e^{\hat{\beta}_{0} + \hat{\beta}_{1}X_{1} + \hat{\beta}_{2}X_{2}}} = \frac{e^{-0.5}}{1 + e^{-0.5}} \approx 0.3775$$
b) How many hours would the student in part (a) need to study to have a 50 percent chance of getting an A in the class?
$$\begin{array}{rcl}
\ln\left(\frac{Y}{1 - Y}\right) & = & \hat{\beta}_{0} + \hat{\beta}_{1}X_{1} + \hat{\beta}_{2}X_{2} \\
\ln\left(\frac{0.5}{1 - 0.5}\right) & = & -6 + 0.05X_{1} + 3.5 \\
X_{1} & = & 50 \text{ hours} \\
\end{array}$$
7. Predict $Y$ (whether or not a stock will issue a dividend: "Yes" or "No") based on $X$ (last year's percent profit).
* issued dividend: $X \sim N(10, 36)$
* no dividend: $X \sim N(0, 36)$
* P(issue dividend) = 0.80
Using Bayes' Rule
$$\begin{array}{rcl}
P(Y = \text{yes}|X) & = & \frac{\pi_{\text{yes}}\exp(-\frac{1}{2\sigma^{2}}(x-\mu_{\text{yes}})^{2})}{\sum_{l = 1}^{K} \pi_{l}\exp(-\frac{1}{2\sigma^{2}}(x-\mu_{l})^{2})} \\
P(Y = \text{yes}|X = 4) & = & \frac{0.8\exp(-0.5)}{0.8\exp(-0.5) + 0.2\exp(-16/72)} \\
P(Y = \text{yes}|X = 4) & \approx & 0.7519 \\
\end{array}$$
8. Two models
1) logistic regression: 30% training error, 20% test error
2) KNN (K = 1) averaged 18% error over training and test sets
The KNN with K=1 model would fit the training set exactly and so the training error would be zero. This means the test error has to be 36% in order for the average of the errors to be 18%. As model selection is based on performance on the test set, we will choose logistic regression to classify new observations.
9. About odds
a) If the odds of defaulting on a credit card payment are 0.37, then the probability of default is
$$0.37 = \frac{P(X)}{1 - P(X)} \quad\Rightarrow\quad P(X) = \frac{0.37}{1.37} \approx 0.2701$$
b) If an individual has 16% chance of defaulting, then their odds are
$$\text{odds} = \frac{P(X)}{1 - P(X)} = \frac{0.16}{1 - 0.16} \approx 0.1905$$
### Applied
13.
```{r}
library("ISLR")
# (a) numerical and graphical summaries
summary(Weekly)
```
```{r}
# scatterplot matrix
pairs(Weekly[,1:8])
```
```{r}
# correlation matrix
round(cor(Weekly[,1:8]),2)
```
```{r}
# (b) logistic regression
logistic_fit = glm(Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Weekly, family=binomial)
summary(logistic_fit)
```
```{r}
# (c) confusion matrix
logistic_probs = predict(logistic_fit, type="response")
logistic_preds = rep("Down", 1089) # Vector of 1089 "Down" elements.
logistic_preds[logistic_probs>0.5] = "Up" # Change "Down" to up when probability > 0.5.
attach(Weekly)
table(logistic_preds,Direction)
```
$$\text{accuracy} = \frac{54 + 557}{54 + 48 + 430 + 557} \approx 0.5611$$
```{r}
# Training observations from 1990 to 2008.
train = (Year<2009)
# Test observations from 2009 to 2010.
Test = Weekly[!train ,]
Test_Direction= Direction[!train]
# Logistic regression on training set.
logistic_fit2 = glm(Direction ~ Lag2, data=Weekly, family=binomial, subset=train)
# Predictions on the test set.
logistic_probs2 = predict(logistic_fit2,Test, type="response")
logistic_preds2 = rep("Down", 104)
logistic_preds2[logistic_probs2>0.5] = "Up"
# Confusion matrix.
table(logistic_preds2,Test_Direction)
```
$$\text{accuracy} = \frac{9 + 56}{9 + 5 + 34 + 56} = 0.625$$
```{r}
# Using LDA
library("MASS")
lda_fit = lda(Direction ~ Lag2, data=Weekly, subset=train)
#lda_fit
# Predictions on the test set.
lda_pred = predict(lda_fit,Test)
lda_class = lda_pred$class
# Confusion matrix.
table(lda_class,Test_Direction)
```
$$\text{accuracy} = \frac{9 + 56}{9 + 5 + 34 + 56} = 0.625$$
```{r}
# Using QDA.
qda_fit = qda(Direction ~ Lag2, data=Weekly, subset=train)
qda_pred = predict(qda_fit,Test)
qda_class = qda_pred$class
table(qda_class,Test_Direction)
```
```{r}
# Using KNN
library("class")
set.seed(1)
train_X = Weekly[train,3]
test_X = Weekly[!train,3]
train_direction = Direction[train]
# Changing from vector to matrix by adding dimensions
dim(train_X) = c(985,1)
dim(test_X) = c(104,1)
# Predictions for K=1
knn_pred = knn(train_X, test_X, train_direction, k=1)
table(knn_pred, Test_Direction)
```
$$\text{accuracy} = \frac{21 + 31}{21 + 30 + 22 + 31} = 0.5$$
14. Develop a model to predict whether a given car gets high or low gas mileage based on the `Auto` data set.
```{r}
# binary variable for "high" versus "low"
# Dataframe with "Auto" data and empty "mpg01" column
df = Auto
df$mpg01 = NA
median_mpg = median(df$mpg)
# Loop
for(i in 1:dim(df)[1]){
if (df$mpg[i] > median_mpg){
df$mpg01[i] = 1
}else{
df$mpg01[i] = 0
}
}
```
```{r}
# graphical summary
pairs(df[,c(1:8,10)])
```
```{r}
# correlation matrix
round(cor(df[,c(1:8,10)]),2)
```
```{r}
library('tidyverse')
# split into training and test set
set.seed(123)
df <- df |>
mutate(splitter = sample(c("train", "test"), nrow(df), replace = TRUE))
train2 <- df |> filter(splitter == "train")
test2 <- df |> filter(splitter == "test")
```
```{r}
# LDA model
lda_fit3 = lda(mpg01 ~ cylinders+displacement+horsepower+weight, data=train2)
# Predictions and confusion matrix
lda_pred3 = predict(lda_fit3,test2)
predictions = lda_pred3$class
actual = test2$mpg01
table(predictions,actual)