```{r setup5, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache=TRUE)
```
# Logistic regression {#logistic-regression}
**Preliminary code**
This code is needed to make other code below work:
```{r, message=F, error=F, warning=F}
library(gridExtra) # for grid.arrange() to print plots side-by-side
library(languageR)
library(ggplot2)
library(dplyr)
library(arm)
library(boot)
## loads givennessMcGillLing620.csv from OSF project for Wagner (2012) data
givenness <- read.csv(url("https://osf.io/q9e3a/download"))
## make standardized (numeric/centered/scaled) versions of 'givenness' dataset predictors:
givenness <- mutate(givenness,
conditionLabel.williams = arm::rescale(conditionLabel),
clabel.williams = arm::rescale(conditionLabel),
npType.pronoun = arm::rescale(npType),
npType.pron = arm::rescale(npType),
voice.passive = arm::rescale(voice),
order.std = arm::rescale(order),
stressshift.num = (as.numeric(stressshift) - 1),
acoustics.std = arm::rescale(acoustics)
)
```
<script src="js/hideOutput.js"></script>
**Note**: Answers to some questions/exercises not listed in text are in [Solutions](#c4solns)
<!-- TODO FUTURE: actually fill in full solutions -->
## Simple logistic regression
Analogously to "simple linear regression", which predicts a continuous $Y$ from a single predictor $X$, in *simple logistic regression* we predict a binary $Y$ from a single predictor.
Notation:
* Our sample consists of $n$ observations: $(x_1, y_1), ..., (x_n, y_n)$
* The **response** $Y$ is binary: each $y_i$ = 1 or 0.
* There is a single **predictor** $X$, which may be continuous or discrete.
We model the **log-odds** of $Y = 1$ vs. $Y = 0$ for the $i^{\text{th}}$ observation:
$$
\log \left[ \frac{P(y_i = 1)}{P(y_i = 0)} \right] \equiv \text{logit}(P(y_i = 1))
$$
as a function of the single predictor:
$$
\text{logit}(P(Y = 1)) = \beta_0 + \beta_1 X
(\#eq:logit)
$$
This is the model in "logit space": the right-hand side of the equation (called the *linear predictor*) looks the same as for simple linear regression, except that there is no error term, while the left-hand side is the quantity being modeled transformed by the logit function (called the *link function*).
The same model can be written in "probability space":
\begin{align}
P(Y = 1) & = \text{logit}^{-1}(\beta_0 + \beta_1 X) \\
& = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}
\end{align}
The model in Eq. \@ref(eq:logit) predicts the log-odds of $Y=1$ for a single observation as:
\begin{equation*}
\text{logit}(\hat{p}_i) = \hat{\beta}_0 + \hat{\beta}_1 x_i
\end{equation*}
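As a quick numerical check, the logit and its inverse are available in R as `qlogis()` and `plogis()` (equivalent to `invlogit()` from the `arm` package loaded above); here is a minimal sketch verifying the probability-space formula for an arbitrary value of the linear predictor:
```{r}
## an arbitrary value of the linear predictor (log-odds scale), just for illustration
eta <- 0.5
## inverse logit via the explicit formula...
1 / (1 + exp(-eta))
## ...matches plogis() (and arm's invlogit())
plogis(eta)
## qlogis() maps the probability back to log-odds
qlogis(plogis(eta))
```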
### Hypothesis testing {#log-reg-hyp-test}
The test statistic
\begin{equation}
z = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}
(\#eq:wald)
\end{equation}
turns out to be approximately normally distributed (e.g. @agresti2003categorical 5.5.2).^[This "large-sample approximation" is what is assumed by `glm` in R in calculating logistic regression coefficient significances. However the Wald test becomes unreliable for small enough samples ($n < 100$ is sometimes given) or very large $|\beta|$, and for these reasons @agresti2003categorical (5.2.1) recommends always using a likelihood ratio test to assess coefficient significances. The LR and Wald tests will give very similar significances for large samples.] This means we can perform a *Wald test* of the null hypothesis that there is no relationship between $P(Y=1)$ and $X$:
$$
H_0~:~\beta_1 = 0
$$
and get a $p$-value and confidence interval for the slope ($\beta_1$). The same is true for the intercept ($\beta_0$).
### Interpreting the coefficients: Logit, odds, and probability
The regression coefficients, $\beta_0$ and $\beta_1$, can be interpreted in terms of log-odds, odds, or probability:
Intercept:
* $\beta_0$: **predicted log-odds** of $Y = 1$ when $X = 0$
* $e^{\beta_0}$: predicted odds of $Y = 1$ when $X = 0$
* $\text{logit}^{-1}(\beta_0)$: probability of $Y = 1$ when $X = 0$
Slope:
* $\beta_1$: **predicted change in log-odds** of $Y = 1$ for a unit change in $X$
* $e^{\beta_1}$: the factor by which the odds of $Y = 1$ are multiplied for a unit change in $X$
* The meaning of $\beta_1$ in probability depends on the value of $X$
### Logistic regression as a GLM
Logistic regressions are fit in R using the `glm()` function with the option `family="binomial"`.
Why? Logistic regression is one type of *generalized linear model* (GLM): a family of models that look like linear regression, but with different choices for each part of Eq. \@ref(eq:logit): link function, linear predictor, probability distribution over $Y$. Logistic regression assumes a binary response $Y$, and uses the logit link function. Because the logit link is the most commonly used for binomially-distributed data (of which a binary response is a special case, with $n=1$), specifying `family="binomial"` gives a logit link by default (it's the same as writing `family = binomial(link = "logit")`).^[There is a further complication here: we have presented logistic regression as modeling data with a binary outcome, that is a *Bernoulli* random variable $y_i$ for the $i^{\text{th}}$ observation, where we model $P(y_i=1)$. Why is the option to `glm()` then `family = "binomial"`, given that a binomial distribution refers to $n {\ge} 0$ trials with probability $p$ of any outcome being 1? (The Bernoulli case is when $n=1$.) It turns out that logistic regression can be characterized either as the probability of a binary outcome for each observation, or as modeling the number of "hits" for a Bernoulli distribution over $n$ outcomes, for each set of observations that share the same predictor values. The `glm()` with `family = "binomial"` can handle data in either format ($y_i$ binary or giving the number of 0's and 1's for a set of predictor values), and if the response variable in the data passed to `glm()` only has two values, R assumes that you're modeling a binary outcome.] There are many types of GLM, including some described in an [Appendix below](#c4appendix2), but we'll only discuss logistic regression in this book.
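For example, the following sketch fits the same (purely illustrative) model both ways, confirming that `family = "binomial"` and `family = binomial(link = "logit")` give identical estimates:
```{r}
## the same illustrative model, specified two ways
m.a <- glm(stressshift ~ acoustics.std, data = givenness, family = "binomial")
m.b <- glm(stressshift ~ acoustics.std, data = givenness, family = binomial(link = "logit"))
## identical coefficient estimates
cbind(m.a = coef(m.a), m.b = coef(m.b))
```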
#### Example 1: Single continuous predictor {- #c4ex1}
In the `givenness` data (described in detail [here](#givedata)), the acoustic evidence for stress shift is captured by the continuous variable `acoustics`, a composite of prosodic measurements, while `stressshift` records whether stress was perceived as shifted. We use a simple logistic regression to check how the perceptual measure is related to the acoustic measure. That is:
* $Y$ is `stressshift`
* $=1$ if shifted, 0 if not
* $X$ is `acoustics`
* (centered + scaled)
To fit and summarize this model:
```{r}
mod1 <- glm(stressshift ~ acoustics.std, data=givenness, family="binomial")
summary(mod1)
```
A few notes about the model output, compared to the output of the linear regression models we've seen so far:
* There are no references to "residuals" or "residual standard error". This is because the logistic regression equation Eq. \@ref(eq:logit) has no error term, so there are no residuals.
* There are no "R-squared" values. $R^2$ is not a well-defined concept for logistic regression, for which different measures of goodness of fit are used (discussed [below](#logistic-regression-pseudo-r2)).
* The model table contains `z value` and `Pr(>|z|)` columns rather than `t value` and `Pr(>|t|)` columns. This is because the test statistic $\hat{\beta}_i/SE(\hat{\beta}_i)$ for logistic regression ($z$ in Eq. \@ref(eq:wald)) looks like a z-score and [follows a normal distribution](#log-reg-hyp-test), while the test statistic follows a $t$ distribution for linear regression.
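To make the connection to Eq. \@ref(eq:wald) concrete, the `z value` and `Pr(>|z|)` entries can be reproduced by hand from the estimate and standard error in the coefficient table; a sketch using `mod1`:
```{r}
## coefficient table from summary(): Estimate, Std. Error, z value, Pr(>|z|)
ctab <- summary(mod1)$coefficients
## Wald z for the slope: estimate divided by standard error
z <- ctab["acoustics.std", "Estimate"] / ctab["acoustics.std", "Std. Error"]
## two-sided p-value from the standard normal distribution
c(z = z, p = 2 * pnorm(-abs(z)))
## approximate 95% (Wald) confidence interval for the slope
ctab["acoustics.std", "Estimate"] + c(-1.96, 1.96) * ctab["acoustics.std", "Std. Error"]
```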
#### Interpreting the model: Odds and log-odds
> **Questions**:
>
> * What are the log-odds of stress shifting when `acoustics` = 0 ? (Remember this means "mean value of `acoustics`", because it's centered.)
<div class="fold o">
```{r, echo=FALSE}
coefficients(mod1)[[1]]
```
</div>
> * What is $P$(stress shift) when `acoustics` = 0 ?
<div class="fold o">
```{r, echo=FALSE}
invlogit(coefficients(mod1)[[1]])
```
</div>
> * Increasing `acoustics` by 1 (= 2 SD) corresponds to increasing the log-odds of stress shifting by _______ ?
<div class="fold o">
```{r, echo=FALSE}
coefficients(mod1)[[2]]
```
</div>
> * Increasing `acoustics` by 1 (= 2 SD) corresponds to multiplying the odds of stress shifting by _______ ?
<div class="fold o">
```{r, echo=FALSE}
exp(coefficients(mod1)[[2]])
```
</div>
<!-- invlogit(-0.6897) = 0.334. increasing acoustics.std by 1 corresponds to increase in logit by 1.637, = multiplying odds by 5.14 = exp(1.637) -->
#### Interpreting the model: Probability
We can plot the predicted $P$(stress shift) as a function of `acoustics` to get a sense of what the model predicts:
```{r, fig.align='center', fig.height=3, fig.width=3.5}
## set up dataframe with range of acoustics.std we want to predict over
newdata <- data.frame(acoustics.std=seq(-1.75,1.75, by=0.01))
## get the model's predictions in log-odds space
newdata$predLogit <- predict(mod1, newdata=newdata)
## transform those predictions to probability space
newdata <- newdata %>% mutate(predP=invlogit(predLogit))
ggplot(aes(x=acoustics.std, y=predP), data=newdata) +
geom_line() +
geom_point(data=givenness, aes(y=(as.numeric(stressshift)-1)), size=0.5) +
xlab("acoustics (standardized)") +
ylab("P(stress shift)")
```
As expected, there is a strong relationship between the acoustic proxy for stress shift, and whether stress is perceived as having shifted.
<!-- * Note: The data covers most of the range in probability space. Strong effect, good design. -->
Let's work out in more detail what "increase in log-odds" means in probability space for this model:
* When `acoustics` = 0:
* log-odds = -0.68
* odds = 0.506
* $P$(stress shift) = 0.336 = $\text{logit}^{-1}$(-0.68)
* When `acoustics` = 1:
* log-odds = 0.947
* odds = 2.58 = $e^{0.947}$
* $P$(Y = 1) = 0.720 = $\text{logit}^{-1}$(0.947)
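These values can also be obtained directly with `predict()`, which returns log-odds by default and probabilities with `type = "response"`; a quick check:
```{r}
newX <- data.frame(acoustics.std = c(0, 1))
## predicted log-odds (default) and probabilities (type = "response")
cbind(newX,
      logOdds = predict(mod1, newdata = newX),
      prob    = predict(mod1, newdata = newX, type = "response"))
```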
Thus, the change in probability $P$(Y = 1) when you increase $X$ by 1 is 0.384 here ($X: 0 \to 1$). However, in general the change in probability that results when $X$ is increased by 1 (a "unit change") will depend on where you start from:
```{r, echo=FALSE, fig.align='center', fig.height=3, fig.width=3.5}
x <- seq(-2, 2, by=0.01)
pred1 <- predict(mod1, newdata=data.frame(acoustics.std=x))
pred2 <- predict(mod1, newdata=data.frame(acoustics.std=x+1))
df <- data.frame(x=x, probDiff = invlogit(pred2) - invlogit(pred1))
ggplot(aes(x=x, y=probDiff), data=df) +
geom_line() +
xlab("X") +
ylab("Change in prob. for unit change in X")
```
Note that the greatest change in probability occurs when $P(Y=1|X) = 0.5$ (at $X = -\beta_0/\beta_1 \approx 0.42$, where $\beta_0 + \beta_1 X = 0$). The change in probability here ($\approx 0.4$) is roughly 1/4 of the fitted slope value:
$$
\frac{\beta_1}{4} = \frac{1.637}{4} = 0.41
$$
It turns out that this relationship holds more generally, and the *divide-by-4 rule* can be used to roughly interpret logistic regression slopes in terms of probability (@gelman2007data, p. 82): the slope (in log-odds) is about 4x the maximum change in probability.
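As a rough numerical check of the divide-by-4 rule for this model, we can compare $\hat{\beta}_1/4$ to the largest change in probability the fitted model predicts for a unit change in $X$:
```{r}
b <- coef(mod1)
## slope divided by 4
unname(b[2] / 4)
## largest change in probability for a unit change in X, found by brute force
xgrid <- seq(-3, 3, by = 0.001)
max(plogis(b[1] + b[2] * (xgrid + 1)) - plogis(b[1] + b[2] * xgrid))
```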
#### Example 2: Single categorical predictor {- #c4ex2}
Let's now predict the probability of stress shifting (`stressshift`) for the `givenness` data, as a function of the NP type (`npType`). This predictor is "dummy coded", with values:
* $X = 0$: Full NP
* $X = 1$: Pronoun
The regression model is still Eq. \@ref(eq:logit).
> **Questions**: What are the interpretations of the regression coefficients?
>
> * Intercept ($\beta_0$)
>
> * Slope ($\beta_1$)
To fit and summarize this model:
```{r}
mod2 <- glm(stressshift ~ npType, data=givenness, family="binomial")
summary(mod2)
```
The model predicts $P$(stress shift) to be:
* Full NPs: $\text{logit}^{-1}(-0.832)=$ **0.303**
* Pronouns: $\text{logit}^{-1}(-0.832 + 0.435)=$ **0.402**
We can compare to the observed proportion of stress-shifted observations for each NP type:
```{r}
## Observed condition means:
prop.table(xtabs(~npType + stressshift, data=givenness), margin=1)
```
and see that the model matches these proportions exactly (`shift` column). This makes sense, as predicting these two numbers is all the model has to do for simple logistic regression with one categorical predictor.
### Differences from linear regression: Fitting and interpretation {#c4differences}
There are two important differences between logistic regression and linear regression, for our purposes:
1. The **relationship between the response and what's being modeled**. In linear regression, these are the same thing: $Y$ is the response variable, which is also what the regression is modeling. In logistic regression, these are not the same thing: we observe $Y$, which takes on values 0 or 1, but we model the *expected value* of $Y$ ($E(Y)$): the probability that $Y=1$.
2. The **presence of an error term**. In linear regression, there is an error term $\epsilon_i$: each observation is modeled as the linear predictor ($\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$) plus an error capturing the difference between the fitted value ($\hat{y}_i$) and the actual value ($y_i$). In logistic regression, there is no error term: the right-hand side of Eq. \@ref(eq:logit) predicts $P(Y=1)$ using just the linear predictor.
Why? For a Bernoulli random variable $Y$ (basically, a coin that comes up "heads" with some probability $p$), the variance and expectation only depend on one unknown parameter, $p$:
$$
E(Y) = p, \quad \text{Var}(Y) = p(1-p)
$$
So unlike for linear regression, where the mean ($\mu$) and variance ($\sigma^2$) in our model of the response can be tweaked independently to better fit the data, in logistic regression these two things are not independent. By fitting the mean (the probability that $Y=1$), we have already determined the variance.
Both differences make interpreting and fitting logistic regression models less straightforward than linear regression models.
### Fitting a logistic regression model
For linear regression, it was possible to estimate the regression coefficients by "least squares": minimizing the sum of squared differences between the model's predictions and the observed data ($\sum_i \epsilon_i^2$) across all points. Least-squares has a *closed-form* solution: the regression coefficients can be solved for by plugging the predictor and response values for all observations into an equation.
For logistic regression, there are no residuals, so we can't use a least-squares method. Instead, we use a likelihood-based method: determine the probability of the observed data, given values of the regression coefficients (the data *likelihood*), then find the regression coefficient values that maximize this probability.
For simple logistic regression: for any $\beta_0$ and $\beta_1$, the likelihood of the data is:
\begin{equation}
L(\beta_0, \beta_1) = \prod^n_{i = 1} \left[ \text{logit}^{-1}(\beta_0 + \beta_1 x_i) \right]^{y_i} \left[ 1 - \text{logit}^{-1}(\beta_0 + \beta_1 x_i) \right]^{1 - y_i}
(\#eq:likelihood)
\end{equation}
We choose $\hat{\beta}_0, \hat{\beta}_1$ that maximize the likelihood: the *maximum likelihood* (ML) estimates. There is no closed form solution for $\hat{\beta}_0, \hat{\beta}_1$, which must instead be estimated numerically, using an estimation algorithm.^[`glm()` uses the *Fisher scoring algorithm* by default, hence the reference to "Number of Fisher Scoring iterations" in the `summary()` output of a model fitted with `glm()`. ]
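To get a feel for what the estimation algorithm is doing, here is a minimal sketch that maximizes the log-likelihood numerically with `optim()` and compares the result to `glm()`'s estimates (the `negLogLik` function and variable names are written just for this illustration):
```{r}
## response as 0/1 and the predictor (names chosen just for this sketch)
y01   <- as.numeric(givenness$stressshift) - 1
xpred <- givenness$acoustics.std
## negative log-likelihood of (beta0, beta1) for the simple logistic regression
negLogLik <- function(beta) {
  p <- plogis(beta[1] + beta[2] * xpred)
  -sum(y01 * log(p) + (1 - y01) * log(1 - p))
}
## numerically maximize the likelihood (i.e. minimize -log L), starting from (0, 0)
optim(c(0, 0), negLogLik)$par
## glm()'s maximum-likelihood estimates, for comparison
coef(mod1)
```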
### Interpretation
You of course don't need to understand the math to fit `lm()` or `glm()` models---R will do this for you automatically. However, it can be helpful to know some basic aspects of how these models are fitted to interpret model output, and some otherwise cryptic error messages. For example:
* `(Dispersion parameter for binomial family taken to be 1)` in logistic regression output: related to the relationship between mean and variance---the amount of variability in the data is assumed to be similar to what you'd expect based on the estimated mean. If it's not, your data is "over-dispersed" or "under-dispersed".^[Dispersion is not a concern for logistic regression models where each observation is just 0 or 1, the only kind of logistic regression covered in this chapter. (As opposed to more complex flavors, like observing proportions between 0 and 1 over multiple observations.)]
* `Number of Fisher Scoring iterations: X` in logistic regression output: refers to the numeric algorithm used to fit logistic regression models (see footnote 3).
* There is no $F$ statistic or $p$ value in logistic regression output: this is because there are no residuals, so the "full" and "reduced" models cannot be compared via their residual sums of squares (which are used to conduct an $F$ test for linear regression).
* If you try to fit a logistic regression model with just $Y=1$ response values (no $Y=0$ values), you'll get the message `glm.fit: algorithm did not converge`:
```{r}
badmodel <- glm(stressshift ~ acoustics.std, data=filter(givenness, stressshift=='shift'), family="binomial")
```
This means that the numeric algorithm used to maximize likelihood couldn't find a good solution for the regression coefficients. This makes sense, because the model is trying to fit 100\% probabilities (so log-odds = $\infty$), and has no data on the basis of which it can estimate the slope ($\beta_1$).
## Evaluating logistic regression models {#evaluating-logistic-regression-models}
We would like a measure of goodness-of-fit for logistic regression models: how well does the model predict $Y$, compared to a baseline model? For linear regression, we used $R^2$ to quantify how similar the model's predictions ($\hat{y}_i$) were to the observations ($y_i$), relative to a baseline model where the model's prediction is always the grand mean---$R^2$ was simply the squared correlation between observed and predicted values.
However, for logistic regression we cannot directly compare model predictions ($p_i$: (log-odds of) probabilities) to observations ($y_i$: 0 or 1), so we need a different measure of goodness-of-fit. We consider three commonly used options:
1. Likelihood ratio test
2. Classification accuracy
3. Pseudo-$R^2$
to compare a logistic regression model to a baseline (intercept-only model). We assume here that the logistic regression model has just one predictor $X$, and denote this model by $M_1$ and the baseline model by $M_0$.
### Likelihood ratio test {#c4lrt}
This option is analogous to the $F$ test for linear regression models, where the full model is compared to the intercept-only model, and we test whether the change in sum-of-squares is significant, given the added predictors in the full model.
Logistic regression models don't have sums-of-squares because they don't have residuals. Instead, we compare the **likelihood** (Eq. \@ref(eq:likelihood)) of the two models: is the change in likelihood between $M_0$ and $M_1$ significant?
Most methods involving likelihood use log-transformed likelihood, mostly to avoid numerical issues when computing with very small numbers. Instead of working with straight log-likelihood, it is also customary to define the *deviance* of a model $M$ (@agresti2007 3.4.3), as
\begin{equation*}
D = -2 \log L(M)
\end{equation*}
Thus, a better model (higher likelihood) has lower deviance.
The difference in deviance ($\Delta D$) between models with ($M_1$) and without ($M_0$) the single predictor is related to the log of their *likelihood ratio*:
\begin{equation*}
\Delta D = -2 \log \frac{L(M_0)}{L(M_1)}
\end{equation*}
For a large enough sample size, $\Delta D$ follows a $\chi^2(1)$ distribution if the slope of $X$ ($\beta_1$) is 0. Thus, we can use $\Delta D$ as a test statistic for the null hypothesis that $\beta_1 = 0$. This kind of test is called a *likelihood ratio test*. The $p$ value of this test gives a measure of goodness of fit, while $\Delta D$ gives a sort of measure of variance accounted for (if you think of deviance as capturing "variance" from the perfect model).
For our [Example 1](#c4ex1), an LR test is conducted in R as follows:
```{r}
mod1 <- glm(stressshift ~ acoustics.std, data=givenness, family="binomial")
mod0 <- glm(stressshift ~ 1, data=givenness, family="binomial")
## LR test of the effect of acoustics.std in mod1
anova(mod0, mod1, test='Chisq')
```
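The same test can be reconstructed by hand from the two models' log-likelihoods (or deviances), which makes explicit what `anova()` is computing; a sketch:
```{r}
## difference in deviance, computed from the two models' log-likelihoods
deltaD <- as.numeric(-2 * (logLik(mod0) - logLik(mod1)))
## equivalently, the difference between the models' deviances
c(deltaD, deviance(mod0) - deviance(mod1))
## p-value from the chi-squared distribution with 1 degree of freedom
pchisq(deltaD, df = 1, lower.tail = FALSE)
```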
Note that for simple logistic regression, this is just another way to test the hypothesis that $\beta_1 = 0$, in addition to the $z$ test reported in the regression summary.
```{r}
summary(mod1)
```
However, for multiple regression models with more than one predictor, the LR test gives a measure of overall model fit, and is not testing the same thing as the $z$ tests reported by `summary(model)` for each predictor.
### Classification accuracy
A second way to assess model quality comes from thinking of the model as predicting $Y$, as follows:
\begin{equation*}
\hat{y}_i =
\begin{cases}
1 & \text{if } \hat{p}_i > 0.5 \\
0 & \text{if } \hat{p}_i \leq 0.5
\end{cases}
\end{equation*}
That is, just predict $Y=1$ if the predicted log-odds are positive, and $Y=0$ if the predicted log-odds are negative. We can then define *classification accuracy* as the percentage of observations where the predicted and observed values are the same ($\hat{y}_i = y_i$).
Classification accuracy always needs to be compared to a baseline, usually "how often would we classify correctly if we chose the most common case?" Formally, if $n_1$ and $n_2$ are the number of observations where $y_i = 0$ and $y_i = 1$ in the data, the baseline classification accuracy is:
\begin{equation*}
\max \left( \frac{n_1}{n}, \frac{n_2}{n} \right)
\end{equation*}
Here are functions for computing the classification accuracy and baseline accuracy of a logistic regression model:
```{r}
## function for computing accuracy of a logistic regression model
## (on the dataset used to fit the model)
## lrMod = fitted model
## responseVar = name of response variable for lrMod
##
## adapted from: https://www.r-bloggers.com/evaluating-logistic-regression-models/
lrAcc <- function(lrMod, responseVar){
## convert response variable into a factor if it's not one
if(!is.factor(model.frame(lrMod)[,responseVar])){
model.frame(lrMod)[,responseVar] <- as.factor(model.frame(lrMod)[,responseVar])
}
## model predictions in log-odds
preds = predict(lrMod, newdata=model.frame(lrMod))
## transform to 0/1 prediction
preds <- ((sign(preds)/2)+0.5)
## response variable values, transformed to 0/1
y <- (as.numeric(model.frame(lrMod)[,responseVar])-1)
## how often is prediction the same as the actual response
acc <- sum(preds==y)/length(preds)
return(acc)
}
## baseline accuracy for a logistic regression model lrMod
## with a given response variable
baselineAcc <- function(lrMod, responseVar){
  response <- model.frame(lrMod)[,responseVar]
  tab <- table(response)
  return(max(tab)/sum(tab))
}
```
For our [Example 1](#c4ex1) (`stressshift ~ acoustics.std`), the classification accuracy is
```{r}
lrAcc(mod1, 'stressshift')
```
This compares to a baseline accuracy of `r round(baselineAcc(mod1, 'stressshift'),2)` (computed with `baselineAcc(mod1, 'stressshift')`). The model performs better than the baseline, but its accuracy (70\%) is less impressive than it first seems, given that the most common case (not shifting stress) already accounts for 65\% of the data.
### Pseudo-$R^2$ {#logistic-regression-pseudo-r2}
There is no exact equivalent of $R^2$ for logistic regression---that is, no quantity with the same properties and multiple interpretations:
* Fraction reduction in sum-of-squares (there is no sum-of-squares)
* Degree of "variance" accounted for ("variance" isn't well-defined)
* Squared correlation between fitted and observed values (different scales: probabilities versus 0/1)
Nonetheless, we might want an $R^2$-like quantity that at least has similar properties, if we find such measures easier to interpret than classification accuracy or an LR test result. A number of *pseudo-$R^2$* measures exist, of which the two most common are:
* **Cox-Snell** pseudo-$R^2$: a value $\geq 0$ whose maximum possible value is less than 1
* **Nagelkerke** pseudo-$R^2$: the Cox-Snell value rescaled to lie between 0 and 1 (like $R^2$ for linear regression)
Both measures are related to the likelihood ratio of the full and reduced (intercept-only) model. Pseudo-$R^2$ measures should not be taken too seriously, but can be useful, for example for comparing goodness of fit between logistic regression and linear regression models. We won't consider these methods further, but they are reported in some papers.
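For reference, here is a sketch of how the two measures can be computed by hand from the log-likelihoods of the fitted (`mod1`) and intercept-only (`mod0`) models, using the standard Cox-Snell and Nagelkerke formulas:
```{r}
## log-likelihoods of the fitted model (mod1) and the intercept-only model (mod0)
ll1 <- as.numeric(logLik(mod1))
ll0 <- as.numeric(logLik(mod0))
n   <- nobs(mod1)
## Cox-Snell pseudo-R2
r2.cs <- 1 - exp(2 * (ll0 - ll1) / n)
## Nagelkerke pseudo-R2: Cox-Snell rescaled so that its maximum possible value is 1
r2.nag <- r2.cs / (1 - exp(2 * ll0 / n))
c(CoxSnell = r2.cs, Nagelkerke = r2.nag)
```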
#### Example {-}
How well does our [Example 2](#c4ex2) model (`stressshift ~ npType`) do at predicting whether stress shifts?
```{r}
mod2 <- glm(stressshift ~ npType, data=givenness, family="binomial")
```
By the likelihood ratio test:
```{r}
## LR test of the effect of npType in mod2
anova(mod2,mod0, test='Chisq')
```
there is a (barely) significant effect of `npType`, corresponding to a difference in deviance of $\Delta D = 4.1$.
However, classification accuracy:
```{r}
## accuracy of mod2
lrAcc(mod2, 'stressshift')
## baseline accuracy (the same for mod2, since it is fit to the same data)
baselineAcc(mod2, 'stressshift')
```
does not differ between the baseline and full models. This example illustrates a couple of points:
* Different methods for comparing two models won't necessarily give the same qualitative answers---as we [already saw](#non-nested-model-comparison) with AIC versus BIC-based model selection.
<!-- (See Sec. \@ref(non-nested-model-comparison) for more details.) -->
* Classification accuracy is a blunter tool than an LR test---it does not reflect effects that are small compared to baseline accuracy, because in order to affect classification accuracy an effect has to be big enough to change the sign of the predicted log odds.
## Multiple logistic regression
Like for linear regression, generalizing from one predictor to multiple predictors is straightforward for logistic regression.
We assume there are $p$ predictors ($X_1, ..., X_p$) of a binary response $Y$, where each $X_i$ can be continuous or categorical. The logistic regression model is now:
\begin{equation*}
\text{logit}(P(Y = 1)) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
\end{equation*}
Rewriting this model in terms of odds by exponentiating each side:
\begin{equation*}
\frac{P(Y = 1)}{P(Y = 0)} = e^{\beta_0}e^{\beta_1 X_1}\cdots e^{\beta_p X_p}
\end{equation*}
or, rewriting with $\alpha_i = e^{\beta_i}$:
\begin{equation*}
\frac{P(Y = 1)}{P(Y = 0)} = \alpha_0 \alpha_1^{X_1}\cdots \alpha_p^{X_p}
\end{equation*}
That is, each predictor's effect on the odds is **multiplicative**. A predictor that has no effect corresponds to multiplying by 1 ($\alpha_i = 1$), while predictors with "negative" effects decrease the odds ($\alpha_i < 1$) and predictors with "positive" effects increase the odds ($\alpha_i > 1$).
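In practice this means that exponentiating a fitted model's coefficients gives the multiplicative factors $\alpha_i$ directly. A small sketch, using a two-predictor model fit here only for illustration:
```{r}
## a two-predictor model, fit here only to illustrate the odds-ratio interpretation
modOdds <- glm(stressshift ~ npType.pron + voice.passive,
               data = givenness, family = "binomial")
## exponentiated coefficients: the multiplicative effect of each predictor on the odds
exp(coef(modOdds))
```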
### Likelihood ratio test: General case
The likelihood ratio test generalizes to the case of comparing two nested logistic regression models, with $p$ and $q$ predictors:
\begin{align}
M_0 & : X_1, ..., X_p \\
M_1 & : X_1, ..., X_p, X_{p + 1}, ..., X_q
\end{align}
With the difference in deviance again defined as
$$
\Delta D = - 2\log\frac{L(M_0)}{L(M_1)}
$$
$\Delta D$ approximately follows a $\chi^2(q - p)$ distribution (@agresti2003categorical 3.4.4), if
$$
\beta_{p + 1} = \beta_{p + 2} = \cdots = \beta_q = 0
$$
Thus, we can use $\Delta D$ as a test statistic for the null hypothesis that all the added predictors have no effect:
$$
H_0: \beta_{p + 1} = \beta_{p + 2} = \cdots = \beta_q = 0
$$
### Worked example {#log-reg-worked-example}
In this example, we model the probability of shifting stress (`stressshift`) for the `givenness` data, as a function of four predictors---all standardized (see code chunk at the beginning of this chapter):
1. Condition: `conditionLabel.williams`
2. NP Type: `npType.pronoun`
3. Voicing: `voice.passive`
4. Trial number: `order.std`
We will carry out a full analysis (except model criticism):
* Exploratory data analysis, motivating which terms to include in a model
* Fit a first model
* Variable selection: adding and dropping terms
* Model evaluation
#### Exploratory data analysis {-}
In this empirical plot of how `order` affects the probability of stress shifting, our actual observations appear as points at $y=0$ or $y=1$, while the prediction from a simple logistic regression (on just `order`) appears as the fitted blue line:
```{r, fig.align='center', fig.height=4, fig.width=6}
ggplot(aes(x=order, y=stressshift.num), data=givenness) +
geom_smooth(method="glm", method.args = list(family = "binomial")) +
geom_jitter(position=position_jitter(width=0.2,height=0.03),size=1) +
ylab("Proportion shifted") +
ggtitle("Order vs. proportion with shifted prominence") +
theme(plot.title = element_text(hjust = 0.5))
```
Proportion tables showing how the probability of stress shifting depends on each categorical predictor (plots would be better):
* `conditionLabel`:
```{r}
prop.table(xtabs(~conditionLabel+stressshift, data=givenness), margin=1)
```
* `npType`:
```{r}
prop.table(xtabs(~npType+stressshift, data=givenness), margin=1)
```
* `voice`:
```{r}
prop.table(xtabs(~voice+stressshift, data=givenness), margin=1)
```
The expected directions of effects of the predictors on percent stress-shifted are:
* `order`: positive
* `conditionLabel.williams`, `npType.pronoun`, `voice.passive`: positive
#### Model {-}
We fit a first multiple logistic regression model using just main effects of the four predictors:
```{r}
mod3 <- glm(stressshift ~ npType.pron + clabel.williams + voice.passive + order.std,
family = "binomial",
data = givenness)
summary(mod3)
```
Note that because the predictors are standardized, we can interpret their values as relative effect sizes: the `conditionLabel` effect is much stronger than the others, then `voice` > `npType` > `order`. The direction of all effects is positive, as predicted.
Let's interpret a couple terms from the model, in terms of log-odds and probability:
1. Intercept
* $\beta_0 = -1.01$, corresponds to probability $0.267$
* Thus, 27% stress shift is predicted overall (all predictors held at mean values)
2. Slope for `conditionLabel.williams`
* Coefficient = 3.18
* Interpretation: the odds of stress shifting are multiplied by **24.0** ($e^{3.18}$) in Williams condition (with other predictors at mean values). This is a huge effect!
* Corresponds to a maximum change of 0.80 in probability (divide-by-4-rule: 3.18/4=0.80).
**Exercise**:
What are the predicted log-odds of stress shifting when:
* `order` = mean value
* `npType` = pronoun
* `conditionLabel` = Williams
* `voice` = active
Hint: assume that the three categorical variables just take on values -0.5 and 0.5.
<div class="fold o">
```{r, echo=FALSE}
coeffs <- coefficients(mod3)
coeffs[['(Intercept)']] + 0.5*coeffs[['npType.pron']] + 0.5*coeffs[['clabel.williams']] - 0.5*coeffs[['voice.passive']]
```
</div>
#### Model evaluation {-}
To evaluate the model in terms of how much likelihood improves from a null model, we first set the null model:
```{r}
mod3.0 <- glm(stressshift ~ 1, family = "binomial", data = givenness)
```
Then carry out the likelihood ratio test:
```{r}
anova(mod3, mod3.0, test="Chisq")
```
Note that the degrees of freedom of this comparison is $q - p = 4$, since there are four predictors in the full model and none in the null model. The difference in deviance is $\Delta D = 160.34$. Since we find `Pr(>Chisq) < 2.2e-16`, we may conclude that our four terms make a very significant contribution to the overall model likelihood.
To assess model quality using classification accuracy:
```{r}
## classification accuracy
lrAcc(mod3, 'stressshift')
```
Compared to baseline accuracy:
```{r}
## baseline accuracy
baselineAcc(mod3, 'stressshift')
```
Thus, the four predictors improve classification accuracy by about 15 percentage points over the baseline.
#### Variable selection {-}
In this exercise we will follow the variable selection guidelines from Gelman & Hill discussed [in the previous chapter](#c2varselect).
First two steps:
1. Include main effects for predictors expected to affect the response (done)
2. Consider interactions between terms with large effects (we do this now)
These empirical plots, for each pair of variables with the largest effects, tell us which interactions are most suggestive:
```{r, fig.align='center', fig.height=4.5, fig.width=8}
day14plt1 <- ggplot(aes(x=conditionLabel, y=stressshift.num), data = givenness) +
stat_summary(fun.data="mean_cl_boot", geom="errorbar", width=0.25, aes(color=voice)) +
facet_wrap(~voice) +
ylab("% shifted stress") +
ylim(0,1)
day14plt2 <- ggplot(aes(x=voice, y=stressshift.num), data = givenness) +
stat_summary(fun.data="mean_cl_boot", geom="errorbar", width=0.25, aes(color=npType)) +
facet_wrap(~npType) +
ylab("% shifted stress") +
ylim(0,1)
day14plt3 <- ggplot(aes(x=conditionLabel, y=stressshift.num), data = givenness) +
stat_summary(fun.data="mean_cl_boot", geom="errorbar", width=0.25, aes(color=npType)) +
facet_wrap(~npType) +
ylab("% shifted stress") +
ylim(0,1)
grid.arrange(day14plt1, day14plt2, day14plt3, ncol = 2)
```
Let's try adding the `voice:conditionLabel` interaction, which looks the largest:
```{r}
mod3 <- glm(stressshift ~ npType.pron + clabel.williams + voice.passive + order.std,
family = "binomial",
data = givenness)
mod3.1 <- glm(stressshift ~ npType.pron + clabel.williams * voice.passive + order.std,
data = givenness,
family = "binomial")
anova(mod3, mod3.1, test = 'Chisq')
```
This model comparison suggests we should include the `voice:conditionLabel` interaction in the model.
In more detail:
* $\Delta D = 10.08$
* The probability of a $\Delta D$ at least this large under a $\chi^2(1)$ distribution is 0.0015
* $\implies$ **reject** $H_0$ that $\beta_{\texttt{voice:conditionLabel}} = 0$
* $\implies$ **include** `voice:conditionLabel` interaction in the model.
We can carry out similar tests for the other two possible two-way interactions, and get:
```{r}
mod3.2 <- glm(stressshift ~ npType.pron + clabel.williams + voice.passive + order.std + npType.pron * clabel.williams,
data = givenness,
family = "binomial")
anova(mod3, mod3.2, test="Chisq")
```
```{r}
mod3.3 <- glm(stressshift ~ npType.pron + clabel.williams+voice.passive + order.std + voice.passive * npType.pron,
data = givenness,
family = "binomial")
anova(mod3, mod3.3, test="Chisq")
```
Thus, $p>0.3$ for the `npType:conditionLabel` and `voice:npType` interactions, so we don't add these interactions to the model.
The new model is:
```{r}
summary(mod3.1)
```
Examining the coefficient values, and comparing to our prior expectations---from EDA and from the study design---there are no coefficients to consider removing by Gelman & Hill's guidelines. (Make sure you understand the different reasons we don't drop the `order.std` or `voice.passive` main effects.)
The classification accuracy for this new model is:
```{r}
lrAcc(mod3.1, 'stressshift')
```
Recall that without the `voice:conditionLabel` interaction, we had the classification accuracy:
```{r}
lrAcc(mod3, 'stressshift')
```
So accuracy actually **decreases** slightly when this interaction is added. This can happen, because improving model likelihood and improving classification accuracy are not the same thing.^[It's worth noting that classification accuracy is very similar for the two models---close enough to be effectively the same, taking finite sample size into account. For a proportion estimated at $\hat{p} = 0.80$ (like classification accuracy, here) from $n=382$ observations, standard error is $\sqrt{p(1-p)/n} = 0.02$, so the 95\% CIs of classification accuracy of the two models overlap.]
## Model criticism for logistic regression {#model-criticism-logistic-regression}
Regression assumptions and model diagnostics are just as necessary to assess a fitted model ("model criticism") as for linear regression, where [we covered these topics](#linear-regression-assumptions) in depth. The assumptions and diagnostics differ somewhat for logistic regression, but not at a qualitative level. For example:
* Linearity: The predictors are assumed to be linearly related to **log-odds of $Y=1$** (rather than to $Y$ itself, for linear regression).
* Collinearity: The predictors are assumed to not be linearly dependent, and the presence of collinearity causes similar issues for regression coefficients as for linear regression.
* Residual plots and measures of influence: conceptually very similar for logistic regression, except "residuals" needs to be defined differently because there is no error term ($e_i$) for logistic regression.
We will only briefly discuss these topics here, but you should read more (e.g. @chatterjee2012regression, @baayen2008analyzing, @gelman2007data) if you are using logistic regression to analyze data.
### Residual plots {#residual-plots}
The **raw** residuals for logistic regression would consist of the observation (0 or 1) minus the model's predicted probability for point $i$. R will give you these residuals if you call `resid(model, type="response")` on a logistic regression model (the default type for `resid()` is deviance residuals, discussed below), but any plot using raw residuals is not very informative, because the model's predictions and the observations are conceptually different (as discussed in Sec. \@ref(c4differences)). For example, here is a fitted-residuals plot for a model we fit above, of the kind used to assess homoscedasticity for linear regression models:
```{r}
plot(mod1,which = 1)
```
We get a pattern of two lines in plots using raw residuals, corresponding to 0/1 observations. This plot is not very useful for assessing whether there is constant "variance" in some sense throughout the dataset.
There are several alternative residuals that can be defined for logistic regression, including:
* Deviance residuals (use `glm.diag()` in the `boot` package)
* Binned residuals (use `binnedplot()` in the `arm` package)
Using either kind of residuals, you can evaluate a logistic regression model using similar diagnostic plots as for linear regression: Q-Q plots, fitted-residual plots, and plots of each predictor versus residuals.
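For example, deviance residuals can be extracted with `residuals(model, type = "deviance")` (or via `glm.diag()`) and examined with the usual plots; a brief sketch using `mod1` from Example 1:
```{r, fig.align='center', fig.height=4, fig.width=4}
## deviance residuals for the simple model from Example 1
devRes <- residuals(mod1, type = "deviance")
## normal Q-Q plot of the deviance residuals
qqnorm(devRes)
qqline(devRes)
```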
#### Example: Binned residuals versus expected values {-}
*Binned residuals* are the average (raw) residual over observations within a certain range of expected values (log-odds). That is, you chop the plot immediately above into vertical slices, and average the "Residuals" value within each slice to get a single $y$-axis number for each slice.
We expect these residuals to be roughly normally distributed with mean 0 and constant variance across different expected values---that is, we expect an expected-value/binned-residual plot to look like the ideal fitted/residual plot for a linear regression.
Consider the final model we ended up with in [the example above](#log-reg-worked-example):
```{r}
ex1.mod <- glm(stressshift ~ npType.pron + conditionLabel.williams + voice.passive + order.std + conditionLabel.williams:voice.passive,
family="binomial",
data=givenness)
```
To make a plot of expected values (= fitted values) versus binned residuals:
```{r, fig.align='center', fig.height=4, fig.width=6}
## binned residual plot
binnedplot(predict(ex1.mod), resid(ex1.mod))
```
The diagnostic plot doesn't suggest any major issue (like a "funnel" shape would), but the observations around expected value = -2.5 merit further investigation.
### Cook's distance {#logistic-regression-cooks-distance}
Cook's distance (CD) can be defined for each observation in a logistic regression model, similarly [to linear regression](#lin-reg-measuring-influence). In R, the `glm.diag()` function in the `boot` package computes CD. Similarly to linear regression, the distribution of CD values can be plotted and used to identify observations with extreme values.
Here is an example using the multiple logistic regression model considered above:
```{r, message=FALSE, fig.align='center', fig.height=4, fig.width=6}
## add a column showing Cook's distance values to dataframe
givenness <- mutate(givenness, cooksDistance=glm.diag(ex1.mod)$cook)
ggplot(aes(x=cooksDistance), data=givenness) +
geom_histogram()
```
It turns out that the points with highest Cook's Distance are passive voice observations in contrast condition where stress was shifted:
```{r, fig.align='center', fig.height=4, fig.width=6, message=FALSE}
ggplot(aes(x=cooksDistance), data=givenness) +
geom_histogram(aes(fill=conditionLabel)) +
facet_wrap(~stressshift+voice)
```
In fact, all shifted observations in the *contrast* condition are influential, as well as many non-shifted observations in the *williams* condition (top-right panel). This is related to the fact that the Williams effect in this data is very strong: basically, any observation that is unexpected---does not show the Williams effect in the *williams* condition, or does show the effect in the control (*contrast*) condition---is influential, because it goes against the overwhelming majority of the data.
This example illustrates why it may be preferable to use visual inspection of the distribution of CD values to flag "highly influential" points, rather than a rigid cutoff (@chatterjee2012regression, 4.9.1). One common suggested cutoff is 4/$n$ ($n$ = number of observations), which would be `r round(4/nrow(givenness),2)` for this dataset. This cutoff would flag as influential *most* data where stress doesn't shift as expected by the Williams effect. This is correct, in some sense (those points are more influential than others), but not helpful in terms of deciding which observations are "highly influential"---these are the exact observations which let us estimate the size of the Williams effect.
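For reference, here is a sketch of how that cutoff would be applied, and how the flagged observations break down by condition and response:
```{r}
## observations exceeding the common 4/n cutoff for Cook's distance
cutoff <- 4 / nrow(givenness)
sum(givenness$cooksDistance > cutoff)
## how the flagged observations break down by condition and stress shift
xtabs(~ conditionLabel + stressshift, data = filter(givenness, cooksDistance > cutoff))
```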
## Other readings
Some good sources for general audiences covering the same logistic regression topics in more detail:
* @gelman2007data Ch. 4.6, 5.1-5.5 (The rest of Ch. 5, on additional topics, is excellent.)
* @chatterjee2012regression 12.1-12.5
* @agresti2007 Ch. 4-5
For language scientists and psychologists specifically, we are not familiar with in-depth treatments. Some shorter ones:
* @johnson2008quantitative 5.4-5.7 (5.7 is interesting if you are familiar with GoldVarb from sociolinguistics)
* @jaeger08
* @baayen2008analyzing 6.3.1
## Solutions {#c4solns}
**Q**: What are the interpretations of the regression coefficients?
* Intercept ($\beta_0$)
* Slope ($\beta_1$)
**A**: $\beta_0$ is the log-odds of a stress shift when the NP type is 'Full NP'. $\beta_1$ is the change in log-odds of a stress shift from 'Full NP' to 'Pronoun' (equivalently, $e^{\beta_1}$ is the factor by which the odds of a stress shift are multiplied).
## Appendix: Other Generalized Linear Models {#c4appendix2}
Logistic regression is the only type of generalized linear model we will cover in this book. However, other types of generalized linear model are very useful and cover many other types of data. For example:
* **Multinomial regression**: response = multiple discrete outcomes
* Example: Verb `regularity` data (outcomes: "zijn", "zijnheb", "hebben")
* Multinomial regression models are approximately equivalent to "maximum entropy models" widely used in linguistics (especially phonology)
* **Poisson regression**: response = counts (0, 1, 2, 3, ...)
* Example: Number of words in a lexicon with a given phonotactic shape, or a number of times a word occurs in a corpus (e.g. @baayen2001wfd)
* **Linear model log link**: response = positive values only, somewhat right-skewed data.
* Example: Many phonetic parameters (e.g. VOT)
* Apparently this model is often more appropriate for log-normal data than just transforming the response to $\log(Y)$ and modeling that; see [here](https://stats.stackexchange.com/questions/47840/linear-model-with-log-transformed-response-vs-generalized-linear-model-with-log/48679#48679).