```{r setup3, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache=TRUE)
library(languageR)
```
# Linear regression
**Preliminary code**
This code is needed to make other code below work:
```{r, message=F, error=F, warning=F}
library(gridExtra) # for grid.arrange() to print plots side-by-side
require(languageR)
library(dplyr)
library(ggplot2)
library(arm)
## loads alternativesMcGillLing620.csv from OSF project for Wagner (2016) data
alt <- read.csv(url("https://osf.io/6qctp/download"))
## loads french_medial_vowel_devoicing.txt from OSF project for Torreira & Ernestus (2010) data
df <- read.delim(url("https://osf.io/uncd8/download"))
## loads halfrhymeMcGillLing620.csv from OSF project for Harder (2013) data
halfrhyme <- read.csv(url("https://osf.io/37uqt/download"))
```
<!-- TODO later: just wrap this into actual text -->
<!-- Old links: -->
<!-- alt <- read.delim('datasets/alternatives.txt') -->
<!-- df <- read.delim('datasets/french_medial_vowel_devoicing.txt') -->
<script src="js/hideOutput.js"></script>
**Note**: Most answers to questions/exercises not listed in text are in [Solutions](#c2solns).
This chapter introduces linear regression. The broad topics we will cover are:
1. Regression: general introduction
2. Simple linear regression
3. Multiple linear regression
4. Linear regression assumptions, model criticism, and interpretation
5. Model comparison
## Regression: General introduction
First, what is "regression"? @chatterjee2012regression define it as "a conceptually simple method for investigating functional relationships among variables":
* The variable to be explained is called the *response*, often written $Y$
* The explanatory variables are called *predictors*, often written $X_1, X_2, \ldots$.
Here is an example of a study that uses regression analysis:
<center>
![](images/cow_milk.png)
</center>
> **Questions**:
>
> * What is the response variable in this study?
>
> * What could be a predictor variable?
### Linear models
The relationship between variables is captured by a regression *model*:
\begin{equation*}
Y = f(X_1, X_2, ..., X_p) + \epsilon
\end{equation*}
In this model, $Y$ is approximated by a function of the predictors, and the difference between the model and reality is called the *error* ($\epsilon$).
Throughout this book we will be dealing exclusively with *linear models* (including "generalized linear models", like logistic regression), where the model takes the form:
\begin{equation}
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon
(\#eq:linreg1)
\end{equation}
The $\beta$'s are called *regression coefficients*.
This turns out to be an extremely general class of model, which can be applied to a wide range of phenomena. Some important kinds of models that don't appear at first glance to fit Equation \@ref(eq:linreg1) can be linearized into this form.
### Terminology
We will consider two broad types of regression in this book:
1. *linear regression*, where the response ($Y$) is a continuous variable. An example would be modeling reaction time (`RTlexdec`) as a function of word frequency (`WrittenFrequency`) for [the `english` dataset](#engdata).
2. Later we will consider *logistic regression*, where the response ($Y$) is binary: 0 or 1. An example would be modeling whether tapping occurs or not (`tapping`) as a function of `vowelDuration` and `speakingRate` in [the `tapping` dataset](#tapdata).
Regression with just one predictor is called *simple*, while regression with multiple predictors is called *multiple*. The two examples just given would be "simple linear regression" and "multiple logistic regression".
Predictors can be *continuous*, such as milk consumption, or *categorical*---also called "factors"---such as participant gender, or word type.
Certain special cases of linear models that are common go by names such as:
* *Analysis of variance*: continuous $Y$, categorical predictors. (Models variability between groups)
* *Analysis of covariance*: continuous $Y$, mix of categorical and continuous predictors.
We are not covering these cases in this book, but ANOVAs (and ANCOVAs) are widely used in language research, and you may have seen them before. ANOVAs can be usefully thought of as just a **special case** of regression, as discussed in @vasishth2011foundations, @levy2012probabilistic, @gelman2007data. Once you understand linear regression well, understanding ANOVA analyses is relatively straightforward.
### Steps and assumptions of regression analysis
Regression analyses have five broad steps, as usefully discussed by @chatterjee2012regression:
1. Statement of the problem
* Ex: Does milk consumption affect height gain?
2. Selection of potentially relevant variables
* Ex: Height gain, daily milk consumption, age, and sex.
3. Data collection
* Ex: data from an existing database.
4. Model specification & fitting
* Ex: height gain = $\beta_0 + \beta_1 \cdot$ milk consumption $+ \beta_2 \cdot$ sex $+$ error
5. Model validation and criticism
It is important to remember that the **validity of a regression analysis depends on the assumptions of the data and model**.
For example, if you are modeling data where $Y$ has a maximum value, fitting a simple linear regression (= a line) to this data doesn't make conceptual sense, a priori. Here's an example using the `english` dataset:
```{r, fig.align='center', fig.height=4, fig.width=5}
library(languageR) ## makes the 'english' dataset available
ggplot(aes(x = Familiarity, y = CorrectLexdec), data = english) + geom_point(alpha=0.1) + geom_smooth(method='lm')
```
Because `CorrectLexdec` has a maximum value of 30, fitting a line doesn't make sense---the predicted value when `Familiarity`=6 is above 30, but this is impossible given the definition of `CorrectLexdec`.
For now, we will not go into what the assumptions of linear regression are, and just assume that they are met. After introducing simple and multiple linear regressions, we'll come back to this issue in Section \@ref(linear-regression-assumptions):
* What are the assumptions (of linear regression)?
* For each assumption, how do we determine whether it's valid?
* How much of a problem is it, and what can be done, if the assumption is not met?
Note that R almost never checks whether your data and model meet regression assumptions, unlike some other software (e.g. SPSS), which sometimes does.
## Simple linear regression
The simplest application of simple linear regression (SLR) is to model an association between two continuous variables.
#### Example: `english` data, young participants only {-}
* $X$: `WrittenFrequency` (= predictor)
* $Y$: `RTlexdec` (= response)
```{r, fig.align='center', fig.height=3, fig.width=5, warning=FALSE}
young <- filter(english, AgeSubject=='young')
ggplot(young, aes(WrittenFrequency, RTlexdec)) +
geom_point(size=0.5)
```
<!-- TODO later: when "previous chapters" put into book, add this back in: -->
<!-- In previous chapters, we have described this kind of association using a _correlation -->
One way to describe this kind of association, which you may be familiar with, is a _correlation_ coefficient, which gives us two types of information about the relationship:
1. *direction*: $r = -0.434$
* $\implies$ negative relationship
2. *strength*: $r^2 = 0.189$
* $\implies$ weak relationship ($0 \le r^2 \le 1$)
A simple linear regression gives a *line of best fit*:
```{r, fig.align='center', fig.height=3, fig.width=5}
ggplot(young, aes(WrittenFrequency, RTlexdec)) +
geom_point(size=0.5) +
geom_smooth(method="lm", se=F)
```
which gives some information not captured by a correlation coefficient:
* Prediction of $Y$ for a given $X$
* Numerical description of relationship
Regression gives both types of information, and more.
### SLR: Continuous predictor
The formula for simple linear regression, written for the $i^{\text{th}}$ observation, is:
\begin{equation}
y_i = \underbrace{\beta_0}_{\text{intercept}} + \underbrace{\beta_1}_{\text{slope}} x_i + \epsilon_i
(\#eq:linreg2)
\end{equation}
In this expression:
* $\beta_0, \beta_1$ are *coefficients*
* For the $i^{\text{th}}$ observation
* $x_i$ is the value of the *predictor*
* $y_i$ is the value of the *response*
* $\epsilon_i$ is the value of the *error* (or *residual*)
This is our first linear model of a random variable ($Y$) as a function of a predictor variable ($X$). The model itself, written in terms of the variables rather than individual observations, is:
$$
Y = \beta_0 + \beta_1 X + \epsilon
$$
That is, we use notation like $X$ for a variable, and notation like $x_5$ for actual values that it takes on.
For example, for [the `english` dataset](#engdata), with $Y$ = `RTlexdec` and $X$ = `WrittenFrequency`:
```{r}
head(dplyr::select(english, WrittenFrequency, RTlexdec))
```
* $x_2$=`r as.numeric(english[2,'WrittenFrequency'])` (predictor value for second observation)
* $y_1$=`r as.numeric(english[1,'RTlexdec'])` (response value for first observation)
> **Question**
>
> * What is $y_5$?
### SLR: Parameter estimation
To get a line of best fit, we want: $\beta_0$ and $\beta_1$, the *population values*. Recall that we can't actually observe these (Sec.\@ref(sample-population)), so we obtain *sample estimates*, written $\hat{\beta}_0, \hat{\beta}_1$.
Once sample estimates are specified, Equation \@ref(eq:linreg2) gives *fitted values* for each observation, written $\hat{y}_i$:
$$
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i
$$
Note that there are no residuals in this equation---$\epsilon_i$ are again population values, which we can't observe. Our estimates of the residuals, given an estimated line of best fit, are written $e_i$ (*error*):
$$
e_i = y_i - \hat{y}_i
$$
This diagram shows the relationship between some of these quantities, for a single observation:
<center>
![](images/slr_continuous_model.png){width=600}
</center>
Our goal is to find coefficient values that minimize the difference between observed and expected values---the magnitudes of error ($|e_i|$), which are minimized by minimizing the squared errors ($e_i^2$).
We choose $\hat{\beta}_0$, $\hat{\beta}_1$ that minimize the sum of the $e_i^2$; these are called the *least-squares estimates*.
One useful property of the resulting regression line is that it always passes through the point (mean($X$), mean($Y$)). A consequence is that SLR is easily thrown off by observations which are outliers in $X$ or $Y$ (why?).
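As a sanity check, the least-squares estimates can be computed directly from sample means, variances, and covariances (a quick sketch; the example below fits the same model with `lm()`):
```{r}
## closed-form least-squares estimates, computed "by hand"
young <- filter(english, AgeSubject == "young")
x <- young$WrittenFrequency
y <- young$RTlexdec
b1_hat <- cov(x, y) / var(x)          # slope
b0_hat <- mean(y) - b1_hat * mean(x)  # intercept
c(intercept = b0_hat, slope = b1_hat)
## the fitted line passes through the point of means:
b0_hat + b1_hat * mean(x)  # equals mean(y)
mean(y)
```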
#### Example {-}
Here is an example of fitting an SLR model in R, of reaction time vs. frequency for young speakers in the `english` dataset:
```{r}
young <- filter(english, AgeSubject == "young")
m <- lm(RTlexdec ~ WrittenFrequency, young)
```
The model output is:
```{r}
m
```
The interpretation of the two coefficients is:
* $\beta_0$: the predicted $Y$ value when $X = 0$
```{r}
m$coefficients[1]
```
* $\beta_1$: predicted change in $Y$ for every unit change in $X$
```{r}
m$coefficients[2]
```
The regression line is:
\begin{equation}
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i
\end{equation}
In terms of the variables used in this example:
\begin{equation}
\text{Predicted RTlexdec}_i = 6.625 - 0.037 \cdot \text{WrittenFrequency}_i
\end{equation}
> **Questions**:
>
> What is the predicted `RTlexdec` when:
>
> * `WrittenFrequency` = 5?
>
> * `WrittenFrequency` = 10?
<!-- 6.44, 6.25 -->
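One way to check your answers is to ask the fitted model `m` directly, using `predict()` (a quick sketch, hidden so you can try the hand calculation first):
<div class="fold s o">
```{r}
## predicted RTlexdec at WrittenFrequency = 5 and 10
predict(m, newdata = data.frame(WrittenFrequency = c(5, 10)))
```
</div>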
### Hypothesis testing
The least-squares estimators are normally distributed. These are sample estimates, so we can also approximate the standard errors of the estimators:
\begin{equation*}
SE(\hat{\beta}_0) \quad SE(\hat{\beta}_1)
\end{equation*}
and apply $t$-tests to test for significance and obtain confidence intervals (CI) for the coefficients.
In particular, we are testing the null hypotheses of no relationship:
* $H_0~:~\beta_1 = 0$
(and similarly for $\beta_0$).
We then apply a $t$-test, using test statistic:
\begin{align}
t_1 &= \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \\
df &= n - 2 \nonumber
\end{align}
(where $n$ = number of observations).
The resulting $p$-value tells us how surprised we should be to get a slope estimate this far from zero ($\hat{\beta}_1$) under $H_0$, given this standard error and this much data.
To see the results of these hypothesis tests in R:
```{r}
summary(lm(RTlexdec~WrittenFrequency, young))
```
The $t$-values and associated $p$-values are in the `Coefficients:` table, where the first row is for $\hat{\beta}_0$ and the second row is for $\hat{\beta}_1$.
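These values can be reproduced by hand from the estimate and its standard error (a quick sketch; the two-sided $p$-value uses the $t$ distribution with $n - 2$ degrees of freedom):
```{r}
m <- lm(RTlexdec ~ WrittenFrequency, young)
est <- coef(summary(m))["WrittenFrequency", "Estimate"]
se  <- coef(summary(m))["WrittenFrequency", "Std. Error"]
t_stat <- est / se
p_val  <- 2 * pt(-abs(t_stat), df = nrow(young) - 2)  # two-sided p-value
c(t = t_stat, p = p_val)
```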
> **Questions**:
>
> * Why does the column for the $p$-value say `Pr(>|t|)`?
Having the SEs of the coefficients also lets us compute 95\% confidence intervals for the least-squares estimators. Going from the null hypothesis ($H_0$) above: if the 95\% CI of the slope ($\beta_1$) does not include 0, we can reject $H_0$ with $\alpha = 0.05$.
In R, you can get confidence intervals for a fitted model as follows:
```{r}
confint(lm(RTlexdec~WrittenFrequency, young))
```
Neither CI includes zero, consistent with the very low $p$-values in the model table above.
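These intervals can also be computed by hand as estimate $\pm$ ($t$ critical value) $\times$ SE (a quick sketch, for the slope):
```{r}
m <- lm(RTlexdec ~ WrittenFrequency, young)
est  <- coef(summary(m))["WrittenFrequency", "Estimate"]
se   <- coef(summary(m))["WrittenFrequency", "Std. Error"]
crit <- qt(0.975, df = df.residual(m))  # critical t value for a 95% CI
c(lower = est - crit * se, upper = est + crit * se)
```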
#### Example: Small subset {-}
For the full dataset of `english` young speakers, it's a little silly to do hypothesis testing given how much data there is and the clarity of the pattern---the line of best fit has a tiny confidence interval. Just for exposition, let's look at the line of best fit for a subset of just $n=100$ points:
```{r, fig.align='center', fig.height=3, fig.width=5}
set.seed(2903) # This makes the following "random" sampling step always give the same result
young_sample <- young %>% sample_n(100)
ggplot(young_sample, aes(WrittenFrequency, RTlexdec)) +
geom_point() +
geom_smooth(method="lm")
```
> **Questions**:
>
> * What does `geom_smooth(method="lm")` do?
In this plot, the shading around the line is the 95\% confidence interval. "Can we reject $H_0$?" is equivalent to asking, "Can a line with 0 slope, passing through (mean($X$), mean($Y$)), stay within the shaded area across the range of $x$?"
### Quality of fit
Here is the model we have been discussing, plotted on top of the empirical data:
```{r, fig.align='center', fig.height=3, fig.width=5}
young <- filter(english, AgeSubject == "young")
ggplot(young, aes(WrittenFrequency, RTlexdec)) +
geom_point(size=0.25) +
geom_smooth(method="lm") +
theme_bw()
```
We often want a metric quantifying how well a model fits the data---the *goodness of fit*.
For simple linear regression, we can derive such a metric by first defining three quantities:
* *SST*: Total sum of squares
* <span style="color:blue">*SSR*</span>: Sum of squares due to regression
* <span style="color:orange">*SSE*</span>: Sum of squares due to error
<center>
![](images/slr_quality_of_fit.png)
</center>
(Source: slides from *Business Statistics: A First Course (Third edition)*)
<!-- TODO FUTURE: make own figure or get public domain figure -->
The fundamental equality is:
*SST* = <span style="color:blue">*SSR*</span> + <span style="color:orange">*SSE*</span>
Intuitively we want a measure of how much of SST is accounted for by SSR. This is $R^2$: the proportion of total variability in $Y$ accounted for by $X$:
\begin{equation}
R^2 = \frac{SSR}{SST}
\end{equation}
$R^2$ always lies between 0 and 1, which can conceptually be thought of as:
* 0: none of the variance in $Y$ is accounted for by $X$
* 1: all of the variance in $Y$ is accounted for by $X$
It turns out $R^2$ is also the square of (Pearson's) correlation, $r$.
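These quantities are easy to compute from a fitted model; here is a quick sketch verifying that SST = SSR + SSE and that $R^2 = SSR/SST = r^2$ for the young-speaker model:
```{r}
m <- lm(RTlexdec ~ WrittenFrequency, young)
SST <- sum((young$RTlexdec - mean(young$RTlexdec))^2)  # total sum of squares
SSR <- sum((fitted(m) - mean(young$RTlexdec))^2)       # sum of squares due to regression
SSE <- sum(resid(m)^2)                                 # sum of squares due to error
c(SST = SST, SSR_plus_SSE = SSR + SSE)
c(R2 = SSR / SST,
  r_squared = cor(young$WrittenFrequency, young$RTlexdec)^2)
```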
#### Example {-}
Here is the SLR model fitted to a subset of 100 data points, repeated for convenience:
```{r}
set.seed(2903)
d <- english %>% filter(AgeSubject=="young") %>% sample_n(100)
summary(lm(RTlexdec~WrittenFrequency, d))
```
In the model table, note the value of the test statistic (`t value`) and its significance (`Pr(>|t|)`) in the `WrittenFrequency` row, as well as the value of `Multiple R-squared`.
This is a hypothesis test for Pearson's $r$ for the same data, checking whether it is significantly different from 0:
```{r}
cor.test(d$WrittenFrequency, d$RTlexdec)
```
Note that $t$, $p$, and $r$ (the square root of `Multiple R-squared`) are exactly the same. Thus, fitting a simple linear regression and conducting a correlation test give us two ways of finding the same information.
### Categorical predictor
Simple linear regression easily extends to the case of a binary $X$ (a *factor*).
#### Example {-}
* `english` data
* Predictor: `AgeSubject`
* Response: `RTlexdec`
Everything is the same as for the case where $X$ is continuous, except now we have:
* $x_i = 0$: if `AgeSubject == "old"`
* $x_i = 1$: if `AgeSubject == "young"`
The regression equation is exactly the same as Equation \@ref(eq:linreg2):
$$
y_i = \beta_0 + \beta_1 x_i + \epsilon_i
$$
Only the interpretation of the coefficients differs:
* $\beta_0$: **mean** `RTlexdec` when `AgeSubject == "old"` (since $x_i = 0 \iff$ `AgeSubject == "old"`)
* $\beta_1$: **difference in mean** `RTlexdec` between `young` and `old`
Hypothesis tests, $p$-values, CIs, and goodness of fit work exactly the same as for a continuous predictor.
#### Example {-}
Suppose we want to test whether the difference in group means is statistically significantly different from 0:
```{r, fig.align='center', fig.height=4, fig.width=3}
english %>% ggplot(aes(AgeSubject, RTlexdec)) +
geom_boxplot() +
stat_summary(fun.y="mean", geom="point", color="blue", size=3) +
theme_bw()
```
> **Question**:
>
> * What goes in the blanks?
```{r, eval=F}
m1 <- lm(_____ ~ _____, english)
summary(m1)
```
<!-- ```{r} -->
<!-- m1 <- lm(RTlexdec ~ AgeSubject, english) -->
<!-- summary(m1) -->
<!-- ``` -->
Note the $t$- and $p$-values for `AgeSubjectyoung`; we'll need them in a second.
### SLR with a binary categorical predictor vs. two-sample $t$-test
Conceptually, we just did the same thing as a two-sample $t$-test---tested the difference between two groups in the value of a continuous variable. Let's see what the equivalent $t$-test gives us:
```{r}
t.test(RTlexdec ~ AgeSubject, english, var.equal=T)
```
(The `var.equal` option forces the $t$-test to assume equal variances in both groups, which is one assumption of linear regression.)
You should find that both tests give identical $t$ and $p$-values. So, a $t$-test can be thought of as a special case of simple linear regression.
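Relatedly, the coefficient interpretation from above can be checked against the group means directly (a quick sketch): the intercept should equal the mean `RTlexdec` for `old` speakers, and the `AgeSubjectyoung` coefficient the young-minus-old difference.
```{r}
## group means of RTlexdec by AgeSubject
english %>%
  group_by(AgeSubject) %>%
  summarise(meanRT = mean(RTlexdec))
## compare with the regression coefficients
coef(lm(RTlexdec ~ AgeSubject, english))
```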
#### Bonus: Linear vs. smooth regression lines {-}
We have forced an SLR fit in the plots above using the `method='lm'` flag, but by default `geom_smooth` uses a *nonparametric smoother* (such as LOESS, the `geom_smooth` default for small samples):
```{r, fig.align='center', fig.height=3, fig.width=7, message=FALSE}
young <- filter(english, AgeSubject=='young')
set.seed(2903)
young_sample <- young %>% sample_n(100)
day7_plt1 <- ggplot(young_sample, aes(WrittenFrequency, RTlexdec)) +
geom_point(size=0.5) +
geom_smooth(method="lm") +
ggtitle("Linear regression line and 95% CI")
day7_plt2 <- ggplot(young_sample, aes(WrittenFrequency, RTlexdec)) +
geom_point(size=0.5) +
geom_smooth() +
ggtitle("Smooth regression line and 95% CI")
grid.arrange(day7_plt1, day7_plt2, ncol = 2)
```
Note the differences in the two plots:
* Linear/nonlinear
* CI widths related to distance from mean, versus **amount of data nearby**
(Hence "**L**ocal" in LOESS.)
## Multiple linear regression
In *multiple linear regression*, we use a linear model to predict a continuous response with $p$ predictors ($p>1$):
$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon
$$
Each predictor $X_i$ can be continuous or categorical.
#### Example: RT ~ frequency + age {- #c2ex1}
For [the `english` dataset](#engdata), let's model reaction time as a function of word frequency and participant age. Recall that in addition to the word frequency effect, older speakers react more slowly than younger speakers:
```{r, fig.align='center', fig.height=4, fig.width=4}
english %>% ggplot(aes(AgeSubject, RTlexdec)) +
geom_boxplot() +
stat_summary(fun.y="mean", geom="point", color="blue", size=3) +
theme_bw()
```
The response and predictors are:
* $Y$: `RTlexdec`
* $X_1$: `WrittenFrequency` (continuous)
* $X_2$: `AgeSubject` (categorical) (0: `old`, 1: `young`)
Because $p=2$, the regression equation for observation $i$ is
$$
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i
$$
where $x_{ij}$ means the value of the $j^{\text{th}}$ predictor for the $i^{\text{th}}$ observation.
To fit this model in R:
```{r}
m2 <- lm(RTlexdec~WrittenFrequency+AgeSubject, english)
```
Summary of the model:
```{r}
summary(m2)
```
This model tells us that the least-squares solution for the regression line is:
$$
\texttt{RTlexdec} = \underbrace{6.846}_{\beta_0} + \underbrace{(- 0.037)}_{\beta_1} \cdot \texttt{WrittenFrequency} + \underbrace{(- 0.221)}_{\beta_2} \cdot \texttt{AgeSubject} + \text{error}
$$
> **Questions**:
>
> * What RT does the model predict for an observation with `WrittenFrequency`=5 and `AgeSubject`='old'?
<div class="fold s o">
```{r}
6.846 + (-0.037 * 5) + (-0.221 * 0)
```
</div>
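The same prediction can be obtained directly from the fitted model with `predict()` (a quick sketch; it will differ slightly from the hand calculation above, which uses rounded coefficients):
```{r}
predict(m2, newdata = data.frame(WrittenFrequency = 5, AgeSubject = "old"))
```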
Note that in this MLR model, the interpretation of each coefficient is:
* $\beta_0$: predicted value when all predictors = 0
* $\beta_1$, $\beta_2$: change in a predictor **when others are held constant**
For example, the difference between old and young speakers in RT is 0.221, when word frequency is held constant.
### Goodness of fit metrics
With $R^2$ defined as for simple linear regression, in terms of sums of squares, the exact same measure works to quantify goodness of fit of a multiple linear regression:
$$
R^2 = \frac{SSR}{SST}
$$
sometimes called *multiple $R^2$*.
An alternative to $R^2$ when there's more than one predictor (MLR) is *adjusted $R^2$*, defined as:
$$
R^2_{\text{adj}} = 1 - \frac{SSE/(n - p - 1)}{SST/(n - 1)}
$$
where $p$ is the number of predictors and $n$ is the number of observations.
In this expression, the ratio compares the variance left unexplained by the "full" model (the one with $p$ predictors) to the total variance, which is what is left unexplained by the "baseline" model (the one with just the intercept). Each sum of squares is scaled by its degrees of freedom; intuitively, this gives a measure of how much variance is explained **given the number of predictors**. (We expect that if you throw more predictors into a model, more variance can be explained, just by chance.)
Adjusted $R^2$ only increases if newly added predictors improve the model more than would be expected by chance.
* **Pro**: Adjusted $R^2$ is more appropriate as a metric for comparing different possible models---unlike "multiple $R^2$", adjusted $R^2$ doesn't automatically increase whenever new predictors are added.
* **Con**: Unlike multiple $R^2$, adjusted $R^2$ is not interpretable as the "fraction of the variation accounted for by the model".
R reports both adjusted and non-adjusted versions, as seen in the model summary above.
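Both versions can be recomputed by hand from the model's sums of squares; a quick sketch using the two-predictor model `m2` from above (so $p = 2$):
```{r}
n <- nrow(english)
p <- 2  # number of predictors in m2
SST <- sum((english$RTlexdec - mean(english$RTlexdec))^2)
SSE <- sum(resid(m2)^2)
R2     <- 1 - SSE / SST
R2_adj <- 1 - (SSE / (n - p - 1)) / (SST / (n - 1))
c(R2 = R2, reported = summary(m2)$r.squared,
  R2_adj = R2_adj, reported_adj = summary(m2)$adj.r.squared)
```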
### Interactions and factors
The models we have considered so far assume that each predictor affects the response **independently**. For example, in [the example above](#c2ex1) (`RT ~ frequency + age`), our model assumes that the slope of the frequency effect on RT is the same for old speakers as for young speakers. This looks like it might be approximately true:
```{r, fig.align='center', fig.height=3, fig.width=5}
ggplot(english, aes(WrittenFrequency, RTlexdec)) +
geom_point(size=0.5) +
geom_smooth(method="lm", se=F) +
facet_wrap(~AgeSubject)
```
in that there seems to be a similarly negative slope for both groups. That is, a model of this form seems approximately correct:
$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon
$$
($\epsilon$ = error).
Here is a (fake) example where the independence assumption is definitely not true:
* $Y$: Job performance (continuous)
* $X1$: Training (categorical)
* $X2$: Autonomy (categorical)
```{r, echo=F, fig.align='center', fig.height=3, fig.width=6}
jp <- c(2.6, 2.4, 3.4, 4.5)
training <- c("Low training", "Low training", "High training", "High training")
autonomy <- c("Low autonomy", "High autonomy", "Low autonomy", "High autonomy")
fake.df <- data.frame(training, autonomy, jp)
fake.df$training <- factor(fake.df$training, levels = c("Low training", "High training"))
fake.df$autonomy <- factor(fake.df$autonomy, levels = c("Low autonomy", "High autonomy"))
ggplot(fake.df, aes(x = training, y = jp, shape = autonomy, linetype = autonomy, group = autonomy)) +
geom_point() +
geom_line() +
ylab("Job performance") +
xlab("") +
guides(shape=guide_legend(title = NULL),
linetype=guide_legend(title = NULL)) +
theme_bw()
```
The effect of training on job performance is larger for high-autonomy participants. In this case, we say there is an *interaction* between training and autonomy: the value of one predictor modulates the effect of the other. This interaction is modeled by adding an extra term to the regression equation:
$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon
$$
which is the *product* of the two terms. Note that
$$
\beta_2 X_2 + \beta_3 X_1 X_2 = (\beta_2 + \beta_3 X_1) X_2
$$
so the interaction coefficient $\beta_3$ modulates the slope of $X_2$: depending on the value of $X_1$, $X_2$ has a different effect on $Y$.
#### Example {-}
Returning to [the example above](#c2ex1), suppose we'd like to know how much the slope of `WrittenFrequency` actually differs between old and young speakers, and whether the difference is statistically significant ($\alpha = 0.05$). (Note that this example uses naming RTs, `RTnaming`, rather than `RTlexdec`.)
```{r, fig.align='center', fig.height=3, fig.width=5}
english %>% ggplot(aes(WrittenFrequency, RTnaming)) +
geom_point(size=0.5) +
geom_smooth(aes(group=AgeSubject, color=AgeSubject), method="lm", se=F)
```
`X1:X2` means "the interaction between `X1` and `X2`", and the notation used in R for a model with an interaction is `X1*X2`, which expands automatically to `X1 + X2 + X1:X2`. (The non-interaction terms are sometimes called *main effects*.) Note that in R these are equivalent:
```
lm(RTnaming ~ WrittenFrequency * AgeSubject, english)
lm(RTnaming ~ WrittenFrequency + AgeSubject + WrittenFrequency:AgeSubject, english)
```
To fit a model including an interaction between frequency and age:
```{r}
m3 <- lm(RTnaming ~ WrittenFrequency * AgeSubject, english)
```
In the summary of this model, of interest is the `WrittenFrequency:AgeSubjectyoung` row, which is the interaction effect:
```{r}
summary(m3)
```
We see that there is indeed a significant interaction between `WrittenFrequency` and `AgeSubject`.
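To make the interaction concrete, the `WrittenFrequency` slope implied for each age group can be read off the fitted coefficients (a quick sketch):
```{r}
cf <- coef(m3)
## old speakers are the reference level, so their slope is just beta_1;
## for young speakers, the interaction coefficient is added to that slope
c(old   = unname(cf["WrittenFrequency"]),
  young = unname(cf["WrittenFrequency"] + cf["WrittenFrequency:AgeSubjectyoung"]))
```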
> **Questions**:
>
> * What does it mean that this coefficient is positive?
>
> * For this regression model including an interaction, what is the model for an observation with `WrittenFrequency=3` and `AgeSubject=='old'`? [Solution](#c2sol1).
>
> * What is the model for an observation with `WrittenFrequency=3` and `AgeSubject=='young'`? [Solution](#c2sol2).
### Plotting interactions
In general, making plots is indispensable for interpreting interactions. It is possible, with practice, to interpret interactions from the regression table, but examining a good plot is usually also necessary and much faster.
In later chapters we will cover how to actually visualize model predictions---exactly what the model predicts for different combinations of predictor values. You can usually get a reasonable approximation of this by making the relevant empirical plot, such as:
```{r, fig.align='center', fig.height=3, fig.width=5}
english %>% ggplot(aes(WrittenFrequency, RTnaming)) +
geom_smooth(method='lm', aes(color=AgeSubject)) +
xlab('Log written frequency') +
ylab('Naming RT in log(s)')
```
It is often better to use empirical plots to visualize interactions---even though, strictly speaking, you are not plotting the model's predictions.
* **Pros**: Empirical plots are more intuitive, and if you have a robust effect it should probably show up in an empirical plot.
* **Cons**: Empirical plots don't show actual model predictions, and in particular don't control for the effects of other predictors.
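That said, plotting actual model predictions is not hard for a model like `m3`: use `predict()` on a grid of predictor values (a minimal sketch, anticipating the fuller treatment in later chapters; the grid here is purely illustrative):
```{r, fig.align='center', fig.height=3, fig.width=5}
## illustrative grid of predictor values
newdat <- expand.grid(
  WrittenFrequency = seq(min(english$WrittenFrequency),
                         max(english$WrittenFrequency), length.out = 50),
  AgeSubject = levels(english$AgeSubject)
)
newdat$pred <- predict(m3, newdata = newdat)
ggplot(newdat, aes(WrittenFrequency, pred, color = AgeSubject)) +
  geom_line() +
  xlab('Log written frequency') +
  ylab('Predicted naming RT in log(s)')
```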
### Categorical factors with more than two levels
We are often interested in categorical predictors with more than two levels. For example, for the [Dutch `regularity` data](#dregdata), we might wonder whether a verb's morphological family size is affected by which auxiliary it takes in the past tense.
```{r, fig.align='center', fig.height=4, fig.width=5}
regularity %>% ggplot(aes(Auxiliary, FamilySize)) +
geom_boxplot() +
stat_summary(fun.y="mean", geom="point", color="blue", size=3)
```
The relevant variable, `Auxiliary`, has three levels. Let's see how this kind of variable is dealt with in a regression model.
#### Exercise {-}
1. Fit a regression model predicting `FamilySize` from `Auxiliary`.
2. What does the intercept ($\beta_0$) represent?
3. What do the two coefficients for `Auxiliary` ($\beta_1$, $\beta_2$) represent?
Hint: Compare $\beta$ coefficients with group means, which you can check using `summarise()` from `dplyr`.
Solution to (1):
<div class="fold s o">
```{r}
m4 <- lm(FamilySize ~ Auxiliary, regularity)
summary(m4)
```
</div>
Solution to (2): The predicted mean `FamilySize` for verbs with `Auxiliary`="hebben" (the reference level).
Solution to (3): The predicted difference in `FamilySize` between `Auxiliary`="zijn" and "hebben", and between `Auxiliary`="zijnheb" and "hebben".
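Following the hint, here is a quick sketch comparing the fitted coefficients with the group means:
```{r}
## mean FamilySize for each level of Auxiliary
regularity %>%
  group_by(Auxiliary) %>%
  summarise(meanFamilySize = mean(FamilySize))
## compare with the regression coefficients
coef(m4)
```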
### Releveling factors
It's often useful for conceptual understanding to change the ordering of a factor's levels. For the `regularity` example, we could make `zijn` the base level of the `Auxiliary` factor:
```{r}
regularity$Auxiliary <- relevel(regularity$Auxiliary, "zijn")
m5 <- lm(FamilySize ~ Auxiliary, regularity)
summary(m5)
```
> **Questions**:
>
> * What is the interpretation of the intercept and the two `Auxiliary` coefficients in this new model?
## Linear regression assumptions {#linear-regression-assumptions}
Up to now, we have discussed regression models without worrying about the assumptions that linear regression makes about your data and the model. We will cover six main assumptions. The first four have to do with the form of the model and errors:
1. Linearity
2. Independence of errors
3. Normality of errors
4. Constant variance of errors (*homoscedasticity*)
followed by two assumptions about the predictors and observations:
5. Linear independence of predictors
6. Observations have roughly equal influence on the model
We'll discuss each in turn.
Our presentation of regression assumptions and diagnostics is indebted to Chapters 4, 6 and 9 of @chatterjee2012regression, where you can find more detail.
### Visual methods
Visualization is crucial for checking model assumptions, and for data analysis in general. A famous example illustrating this is *Anscombe's quartet*: a set of four small datasets of $(x,y)$ pairs with:
* The same mean and variance for $x$
* The same mean and variance for $y$
* A correlation($x$, $y$) = 0.816
* The same regression line ($y = 3 + 0.5\cdot x$)
in each case---and yet the datasets show qualitatively different patterns, as can be seen by plotting $y$ against $x$:
<center>
![](images/anscombe.png)
</center>
(Source: unknown, but definitely taken from somewhere)
<!-- TODO FUTURE: find or fix -->
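R ships with this dataset as the built-in `anscombe` data frame, so you can verify these properties yourself; a quick sketch for the first two of the four $(x, y)$ pairs:
```{r}
## means are (nearly) identical across the four x and four y columns
sapply(anscombe, mean)
## (near-)identical regression lines and correlations, despite very different data
coef(lm(y1 ~ x1, anscombe))
coef(lm(y2 ~ x2, anscombe))
c(cor(anscombe$x1, anscombe$y1), cor(anscombe$x2, anscombe$y2))
```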
With more than one predictor it becomes difficult to check regression assumptions by just plotting the data, and visual methods such as *residual plots* (presented below) are crucial.
### Assumption 1: Linearity
The first assumption of a linear regression model is that the relationship between the response ($Y$) and predictors ($X_i$) is... linear.
While obvious, this assumption is very important: if it is violated, the model's predictions can be in serious error.
The linearity assumption can be partially checked by making a scatterplot of $Y$ as a function of each predictor $X_i$. It is hard to exhaustively check linearity for MLR, because nonlinearity might only become apparent when $Y$ is plotted as a function of several predictors.
#### Example {-}
Consider relative pitch, intensity, and duration in [the `alternatives` dataset](#altdata).
We can make *pairwise plots* of these variables using the `pairscor.fnc()` function from `languageR`, to see if any of these variables might be a function of the other two.
```{r, fig.align='center', fig.height=5, fig.width=5}
pairscor.fnc(with(alt, cbind(rpitch, rintensity, rduration)))
```
> **Questions**:
>
> * Is this the case?
Let's examine the relationship between relative duration and relative intensity more closely:
```{r, fig.align='center', fig.height=3, fig.width=5}
alt %>% ggplot(aes(rduration, rintensity)) +
geom_point()
```
We can try to fit a line to this data, but if we compare to using a nonlinear smoother, it seems clear that the relationship is not linear:
```{r, fig.align='center', fig.height=3, fig.width=5, message=FALSE}
alt %>% ggplot(aes(rduration, rintensity)) +
geom_point() + geom_smooth(col='red', se=F) +
geom_smooth(method="lm", col="blue", se=F)
```
In particular, it looks like there is a quadratic trend. This means that we can in fact fit a linear regression; we just need to include coefficients for both `rduration` and its square, like so:
```{r}
mq <- lm(rintensity ~ rduration + I(rduration^2), alt)
summary(mq)
```
We will cover more nonlinear functions of predictors in [a later chapter](#nonlinear-effects). The important point here is that a model with just `rduration` as a predictor would have violated the linearity assumption, but a model with both `rduration` and `rduration^2` as predictors doesn't (arguably).
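To visualize what the quadratic model captures, we can overlay its fit on the data by passing the same formula to `geom_smooth()` (a quick sketch):
```{r, fig.align='center', fig.height=3, fig.width=5}
alt %>% ggplot(aes(rduration, rintensity)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE)
```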
### Assumption 2: Independence of errors {#c2ioe}
All regression equations we have considered contain an *error* or *residual* term: $\epsilon_i$ for the $i^{\text{th}}$ observation. A crucial assumption is that these $\epsilon_i$ are **independent**: knowing the error for one observation shouldn't tell you anything about the error for another observation.
Unfortunately, violations of this assumption are endemic in realistic data. The simplest example is in time series, or longitudinal data---such as pitch measurements taken every 10 msec in a speech signal.
> **Questions**:
>
> * Can you think of why this might be the case?
In linguistics and psycholinguistics, violations of the independence assumption are common because most datasets include multiple observations per participant or per item (or per word, etc.). Crucially, violations of the independence assumption are often *anti-conservative*: CIs will be too narrow and $p$-values too small if the lack of independence of errors is not taken into account by the model.
Some solutions to these issues:
* [**Paired-t-tests**](#paired-t-test), where applicable (binary predictor; two measures per participant)
* **Mixed-effects regression** (more general solution, major focus later this term)
Until we cover mixed-effects regression, we will be getting around the fact that the independence-of-errors assumption usually doesn't hold for linguistic data, in one of two ways:
1. Selectively using datasets where this assumption **does** hold, such as `regularity`.
2. Analyzing datasets where this assumption does **not** hold, such as `tapping` or `english`, using analysis methods that do assume independence of errors (such as linear regression), with the understanding that our regression models are probably giving results that are "wrong" in some sense.
### Assumption 3: Normality of errors
The next major assumption is that the errors $\epsilon_i$ are normally distributed, with mean 0 and a fixed variance. This assumption is impossible to check directly, because we never observe the true *errors* $\epsilon_i$, only the *residuals* $e_i$. The residuals do not behave exactly like the errors (in particular, they do not all have the same variance, even if the errors do), because some observations are more influential than others in determining the fitted responses $\hat{y}_i$ when fitting the least-squares estimates of the regression coefficients.
#### Example {-}
```{r, fig.align='center', fig.height=3, fig.width=7}
young <- filter(english, AgeSubject=='young')
set.seed(2903)
young_sample <- young %>% sample_n(100)
ggplot(young_sample, aes(WrittenFrequency, RTlexdec)) +
geom_point(size=0.5) +
geom_smooth(method="lm") +
ggtitle("Linear regression line and 95% CI")
```
The width of the confidence interval increases for points further from (mean of `WrittenFrequency`, mean of `RTlexdec`): points far from the center of the data have more influence (leverage) on the fitted line, and as a result the raw residuals do not all have the same variance.
To correct for this, the residuals are transformed in a way which accounts for the different influence of different observations (see e.g. @chatterjee2012regression 4.3), giving *studentized* or *standardized residuals*.^[The studentized and standardized residuals, or "externally studentized" and "internally studentized" residuals (in @chatterjee2012regression), differ slightly in how they estimate the error variance: a leave-one-out estimate versus an estimate using all observations. This difference shouldn't matter much except when certain observations are highly influential or in small datasets.] (In R, these are obtained by applying `rstudent` or `rstandard` to a fitted model.)
In general, we **check assumptions about errors by examining the distribution of standardized residuals**. This is because **if** the normality of errors assumption holds, **then** the standardized residuals will be normally distributed with mean 0 and fixed variance. So if they are not, we know the normality of errors assumption does not hold. (If they are, it's not a guarantee that the normality of errors assumption holds, but we hope for the best.)
#### Example {- #c2ex2}
This example uses the `halfrhyme` data, briefly described [here](#halfdata). Let's abstract away from what the variables actually mean, and just think of them as $Y$ and $X$:
```{r, warning=FALSE}
ggplot(aes(x=cohortSize, y=rhymeRating), data=filter(halfrhyme, conditionLabel=='bad')) +
geom_point() + geom_smooth(method='lm') +
geom_jitter() + xlab("X") + ylab("Y")
```
The distribution of the standardized residuals for the regression of $Y$ as a function of $X$ is:
```{r, message=FALSE}
halfrhyme.sub <- filter(halfrhyme, conditionLabel=='bad' & !is.na(cohortSize))
mod <- lm(rhymeRating ~ cohortSize, data=halfrhyme.sub)
halfrhyme.sub$resid <- rstandard(mod)
ggplot(aes(x=resid), data=halfrhyme.sub) + geom_histogram() + xlab("Residual (standardized)")
```
> **Questions**:
>
> * Why do the residuals have this distribution?
This example illustrates probably the most common source of non-normal residuals: a highly non-normal distribution of the predictor or response.
#### Effect and solution {-}
Non-normality of residuals is a pretty common violation of regression assumptions. How much does it actually matter? @gelman2007data (p. 46) argue "not much", at least in terms of the least-squares estimates of the regression line (i.e., the regression coefficient values), which is often what you are interested in.
However, non-normality of residuals, especially when severe, can signal other issues with the data, such as the presence of outliers, or the predictor or response being on an inappropriate scale (example: using non-log-transformed word frequency as a predictor). Non-normality of residuals can often be dealt with by **transforming the predictor or response** to have a more normal distribution (see Sec. \@ref(transforming-to-normality)).
Non-normality of residuals can also signal other errors, such as an important predictor missing.
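Besides a histogram, a quantile-quantile plot of the standardized residuals is another common visual check (a quick sketch, reusing the `mod` fit from the `halfrhyme` example above):
```{r, fig.align='center', fig.height=4, fig.width=4}
## points should fall close to the line if the residuals are approximately normal
qqnorm(rstandard(mod))
qqline(rstandard(mod))
```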
#### Exercise {-}
1. Using the `english` data, plot `RTlexdec` as a function of `WrittenFrequency`, and add a linear regression line.
<div class="fold s o">
```{r, fig.align='center', fig.height=4, fig.width=5}
ggplot(english, aes(x = WrittenFrequency, y = RTlexdec)) +
geom_point(size = 0.5) +
geom_smooth(method = "lm", se=F)
```
</div>
2. Do you think the residuals of this model are normally distributed? Why/why not?
3. Now plot a histogram of the standardized residuals of the model. Does the plot confirm your first impressions?
```{r, eval=F}
m8 <- lm(RTlexdec ~ _______, english)
m8.resid.std <- rstandard(______)
hist(______, breaks = 50)
```
<div class="fold s o">
```{r, fig.align='center', fig.height=3, fig.width=8}
day9_plt1 <- ggplot(english, aes(WrittenFrequency, RTlexdec)) +
geom_point(size=0.5) +
geom_smooth(method="lm", se=F)
m8 <- lm(RTlexdec~WrittenFrequency, english)
m8.resid.std <- rstandard(m8)