```{r setup4, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache=TRUE)
```
# Categorical data analysis: Preliminaries {#cda}
**Preliminary code**
This code is needed to make other code below work:
```{r, message=F, error=F, warning=F}
library(gridExtra) # for grid.arrange() to print plots side-by-side
library(languageR)
library(ggplot2)
library(dplyr)
library(arm)
library(boot)
## loads alternativesMcGillLing620.csv from OSF project for Wagner (2016)
## "Information structure and production planning"
alternatives <- read.delim(url("https://osf.io/6qctp/download"), stringsAsFactors = FALSE)
```
**Note**: Answers to some questions/exercises not given in the text are in [Solutions](#c3solns).
<script src="js/hideOutput.js"></script>
<!-- TODO FUTURE: actually complete solutions -->
## Introduction
So far in this book, we have mostly considered data analysis where the outcome is a continuous variable, like reaction time or vowel duration. In linear regression, the outcome ($Y$) is modeled as a function of one or more categorical and continuous predictors ($X_i$).
We now turn to *categorical data analysis*, where the outcome being modeled is a categorical variable. We assume that you have had some exposure to some basic categorical data analysis topics:
* *Contingency tables* (e.g. `xtabs` in R)
* Including the notion of **observed** and **expected** values based on a contingency table, and "observed/expected ratios" (O/E ratios).
* *Tests of independence* of categorical variables
* $\chi^2$ tests
* Fisher's exact test
which we refresh briefly below. If you haven't seen these topics before, some places to read more are given [below](#cda-other-readings).
### 2x2 contingency tables
<!-- TODO future: re-add this if we ever have an EDA chapter -->
<!-- As discussed in (TODO: update to Exploratory Data Analysis chapter once this exists), -->
Contingency tables show the number of observations for each combination of values of categorical variables. A 2x2 contingency table in particular shows the number of observations for each combination of two categorical variables, $X_1$ and $X_2$.
#### Example {-}
For this subset of [the `alternatives` data](#altdata):
* `prominence` = `Adjective` or `Noun`
* `context`= `Alternative` or `NoAlternative`
A contingency table showing the number of observations for each combination of `prominence` and `context`:
```{r}
alternatives.sub <- filter(alternatives, context%in%c('Alternative', 'NoAlternative') &
prominence %in% c('Adjective', 'Noun'))
xtabs(~prominence + context, data=alternatives.sub)
```
#### Observed and expected values {-}
In the contingency table above, it looks like prominence isn't independent of context: `Adjective` prominence is much more likely relative to `Noun` prominence in `Alternative` context. We would like to formally test this hypothesis of non-independence. We can do so using this null hypothesis:
* $H_0$: `prominence` and `context` occur independently in this data.
Under $H_0$, we can work out:
1. The estimated P(`context` = `alternative`)
2. The estimated P(`prominence` = `adjective`)
3. The expected count in each cell, given the total number of observations $n$.
For example, for the contingency table above, we can calculate the probabilities #1 and #2:
```{r}
tab <- xtabs(~prominence + context, data=alternatives.sub)
## total number of observations
n <- sum(tab)
## P(context = alternative)
pAlt <- sum(tab[,'Alternative'])/n
## p(prominence = adjective)
pAdj <- sum(tab['Adjective',])/n
## print these probabilities
pAlt
pAdj
```
The expected count in the `context`=`Alternative` & `prominence`=`Adjective` cell is then:
\begin{align}
n \cdot P(\text{context} = \text{Alternative}) \cdot P(\text{prominence} = \text{Adjective}) & = `r n` \cdot `r pAlt` \cdot `r pAdj` \\
& = `r n*pAlt*pAdj`
\end{align}
#### Exercise 1 {-}
Calculate the expected counts for the other three cells.
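Once you have done the calculation by hand, one way to check your answers is against the expected counts that `chisq.test()` (used in the next section) computes internally. This is just a minimal sketch of a sanity check, not part of the exercise:
```{r}
## expected counts under independence, as computed by chisq.test()
chisq.test(tab)$expected
```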
### The chi-squared test
For a 2x2 contingency table, we have:
* The *observed* counts in each cell, denoted $O_i$
* The *expected* counts in each cell, denoted $E_i$, calculated assuming independence ($H_0$).
We can then define *Pearson's test statistic*:
\begin{equation*}
X^2 = \sum^{k}_{i = 1} \frac{(O_i - E_i)^2}{E_i}
(\#eq:pearson)
\end{equation*}
where the sum runs over the $k$ cells of the contingency table ($k = 4$ for a 2x2 table). $X^2$ measures how much the observed and expected values differ across the whole contingency table: it is 0 when the observed and expected values are exactly the same, and increases the more they differ.
Pearson's test statistic **approximately** follows a $\chi^2$ distribution (pronounced "chi-squared") with one degree of freedom, denoted $\chi^2(1)$, under conditions described below. Thus, we can test the null hypothesis that $X_1$ and $X_2$ are independent by comparing the value of $X^2$ to the $\chi^2(1)$ distribution.
The same methodology applies for testing independence of two categorical variables with any number of levels. For $X_1$ and $X_2$ with $p$ and $q$ levels, $X^2$ can be calculated
using Eq. \@ref(eq:pearson), and the null hypothesis tested using a $\chi^2((p-1)(q-1))$ distribution.
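As a sanity check on Eq. \@ref(eq:pearson), here is a minimal sketch that computes $X^2$ by hand for the 2x2 `prominence`/`context` table `tab` defined above and compares it to the $\chi^2(1)$ distribution. (Note that `chisq.test()` applies Yates' continuity correction to 2x2 tables by default, so its reported statistic matches this hand calculation only with `correct = FALSE`.)
```{r}
## expected counts under independence: (row total x column total) / n
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
## Pearson's test statistic
X2 <- sum((tab - expected)^2 / expected)
X2
## p-value from the chi-squared distribution with (2-1)*(2-1) = 1 df
pchisq(X2, df = 1, lower.tail = FALSE)
## same statistic, without Yates' continuity correction
chisq.test(tab, correct = FALSE)$statistic
```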
#### Examples {-}
1. Are `prominence` and `context` independent for the contingency table above?
```{r}
chisq.test(tab)
```
2. Are `Regularity` and `Auxiliary` independent in the Dutch [`regularity` data](#dregdata)?
```{r}
xtabs(~Auxiliary + Regularity, regularity)
```
```{r}
chisq.test(xtabs(~Auxiliary + Regularity, regularity))
```
---
Note that in the `regularity` example, there is a warning message, related to the "approximately" condition above. The "$\chi^2$ approximation" (that the test statistic $X^2$ follows a $\chi^2$ distribution) is only valid when the expected count in each cell is "big enough"---otherwise, $p$ values are anti-conservative. Common rules of thumb for "big enough" are expected counts of at least 5 or 10 per cell.
It is common to have fewer than 10 observations in some cell of a contingency table, making $\chi^2$ tests frequently inappropriate. Widespread use of the $\chi^2$ test is to some extent a holdover from when it was computationally difficult to carry out "exact" or simulation-based tests (which make no approximation to the distribution of the test statistic), and there is little reason to rely on the $\chi^2$ approximation today. For example, you can run a "chi-squared test" without assuming the $\chi^2$ approximation by using the `simulate.p.value` argument to `chisq.test()`:
```{r}
chisq.test(xtabs(~Auxiliary + Regularity, regularity), simulate.p.value=TRUE)
```
which gets rid of the warning.
It is still worth knowing about the $\chi^2$ test and its limitations, because it remains frequently used today, and in older literature it is very widely used. For example, if you are reading a paper where the crucial result relies on a $\chi^2$ test with $p=0.02$ for a contingency table with 5 observations in one cell, you should be suspicious. If the contingency table has 20+ observations per cell, you shouldn't be.
### Fisher's exact test
A good alternative to the $\chi^2$ test is *Fisher's exact test*, which, as an "exact test", makes no assumptions about the counts per cell. Fisher's test asks a slightly different question from a $\chi^2$ test:
* Given the *marginal counts* (row and column totals) in the contingency table, if $X_1$ and $X_2$ were independent, how likely would an arrangement of the data at least this extreme be?
For example, for testing independence of `Auxiliary` and `Regularity` for the [Dutch `regularity` data](#dregdata):
```{r}
xtabs(~Auxiliary + Regularity, regularity)
```
Fisher's test asks: for 159 irregular verbs, assuming independence, how likely would we be to have $\geq$ 108 *hebben*, $\leq$ 12 *zijn*, $\leq$ 39 *zijnheb*, and so on.
To perform Fisher's exact test in R:
```{r}
fisher.test(xtabs(~Auxiliary + Regularity, regularity))
```
As expected, whether a verb is irregular and which `Auxiliary` it takes are not independent.
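As an aside: for a 2x2 table, `fisher.test()` also reports an estimated odds ratio and confidence interval---a quantity we introduce below. A minimal sketch, applied to the `prominence`/`context` table `tab` from above:
```{r}
## Fisher's exact test on a 2x2 table also returns an odds ratio estimate and 95% CI
fisher.test(tab)
```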
## Towards logistic regression
Intuitively, in logistic regression ([next chapter](#logistic-regression)) we will predict the probability that some event happens (e.g. tapping) as a function of predictors, given data where each observation records a single instance of the event happening or not ($Y = 1$ or $Y = 0$). It would be tempting to just apply the tool we've learned so far---linear regression---to this task. However, it's not obvious **what to predict** in such a model:
* We can't do a linear regression using $Y$ as the response, because $Y$ only takes on the values 0 and 1. We want to model a **probability**.
* However, we can't do a linear regression using a probability as a response, because (among other reasons) probabilities are bounded by 0 and 1 and linear regression assumes that the response variable can be any number. (We don't want the model to be able to predict "probability = 2".)
In order to predict probabilities using a regression model, we need:
**Goal 1**: A way to think of probabilities on a continuous, unbounded scale.
**Goal 2**: A way to estimate these probabilities, such that the sample statistic is normally distributed.
Goal 2 is necessary so that we can apply all the statistical inference machinery we have used so far to do things like conduct hypothesis tests.
<!-- Roughly speaking: we will want our measure of probability to be approximately linear as a function of $p$, for it to make sense for us to apply similar machinery to that used for linear regression. -->
<!-- not sure that's true.. -->
The answer turns out to be to use "log-odds", for which we must first discuss "odds".
### Odds
*Odds* are a way of thinking about probability, as in "how likely is this event to happen versus not happen?"
> **Questions**:
>
> Intuitively, what does it mean to say the odds are 3:1 ("three-to-one") that it will be sunny tomorrow?
The relationship between probability ($p \in [0,1]$) and odds is:
$$
\text{Odds}(p) = \frac{p}{1 - p}
$$
For the example above: if the probability that it is sunny tomorrow is 0.75, then the odds of it being sunny tomorrow are 3 ($0.75/(1-0.75) = 3$). The odds of it *not* being sunny tomorrow are 1/3 ($0.25/(1-0.25)$), pronounced "one-to-three", or "three-to-one against sun tomorrow."
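These conversions are easy to carry out directly in R; a minimal sketch (the variable name `p_sunny` is just illustrative):
```{r}
p_sunny <- 0.75
## probability -> odds
odds_sunny <- p_sunny / (1 - p_sunny)
odds_sunny
## odds -> probability
odds_sunny / (1 + odds_sunny)
```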
Figure \@ref(fig:odds1) shows odds as a function of probability.
```{r odds1, echo=FALSE, fig.align='center', fig.height=3, fig.width=3, fig.cap = 'Odds as a function of probability.'}
p <- seq(0.01, 0.99, by = 0.01)
qplot(x = p, y = p/(1-p), geom = 'line') +
xlab("Probability") +
ylab("Odds")
```
Odds may be more intuitive than probabilities, but they do not meet Goal 1 (unbounded scale) because odds are always positive.
### Log-odds
The *log-odds* corresponding to a probability $p$ are:^[Note that $\log$ here is base $e$, where $e \approx 2.718$ is the base of the *natural logarithm*.]
$$
\text{log-odds}(p) = \log \frac{p}{1-p}
$$
The following figure shows log-odds as a function of probability.
```{r logodds, echo=FALSE, fig.align='center', fig.height=3, fig.width=3, fig.cap = 'Log-odds as a function of probability'}
qplot(x = p, y = log(p/(1-p)), geom = 'line') +
xlab("Probability") +
ylab("Log-odds")
```
Log-odds meet Goal 1: they range from $-\infty$ to $\infty$, and are nearly linear as a function of $p$, so long as $p$ isn't too close to 0 or 1. It turns out that log-odds also satisfy Goal 2 (normally-distributed sample statistic), as discussed below.
Log-odds do have an important drawback: they are unintuitive, compared to odds or probability. With some practice you can learn to think in log-odds.
#### Exercise 2: log-odds practice {-}
What are the log-odds corresponding to these probabilities?
* $p = 0.5$
* $p = 0.73$
* $p = 0.88$
* $p = 0.953$
<div class="fold o">
```{r, echo=FALSE}
log(0.5/(1-0.5))
log(0.73/(1-0.73))
log(0.88/(1-0.88))
log(0.953/(1-0.953))
```
</div>
This example illustrates:
* Probability space is contracted in log-odds as we approach 0 or 1: a change of 1 in log-odds corresponds to a smaller and smaller change in $p$.
* Log-odds of -3 to 3 encompass about 90\% of probability space ($\approx p \in (0.05, 0.95)$).
Note that a log-odds of 4 is already huge ($p \approx 0.982$).
> **Questions**:
>
> What is $p$ corresponding to log-odds of -4?
<div class="fold o">
```{r, echo=FALSE}
invlogit(-4)
```
</div>
### Odds ratios
Let's start with an example from [the `tapping` data](#ctapdata), looking at how the likelihood of tapping depends on `syntax`:
* $Y =$ `tapped` (1 = yes, 0 = no)
* $X =$ `syntax` (1 = transitive, 0 = intransitive)
Suppose that the proportions of cases in each cell for our data are:
\begin{array}{c|cc}
& Y = 1 & Y = 0 \\
\hline
X = 1 & 0.462 & 0.037 \\
X = 0 & 0.43 & 0.07
\end{array}
We would like a measure of how much more likely tapping is when $X=1$ than when $X=0$. One way to do this is to calculate the odds of tapping in each case, then take their ratio:
* when $X = 1$: $\frac{0.462}{0.037} = 12.49$
* when $X = 0$: $\frac{0.43}{0.07} = 6.14$
* Ratio: $\frac{12.49}{6.14}=$ **2.03**
Thus, the odds of tapping are roughly doubled for transitive verbs, relative to intransitive verbs.
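A minimal sketch of this calculation in R, with the cell proportions hard-coded from the table above:
```{r}
## odds of tapping in each syntax condition
odds_transitive   <- 0.462 / 0.037
odds_intransitive <- 0.43 / 0.07
## odds ratio: odds of tapping for transitive relative to intransitive verbs
odds_transitive / odds_intransitive
```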
To define this *odds ratio* more generally, suppose that two binary random variables $X$ and $Y$ co-occur with these (population) probabilities:
\begin{array}{c|cc}
& Y = 1 & Y = 0 \\
\hline
X = 1 & p_{11} & p_{10} \\
X = 0 & p_{01} & p_{00}
\end{array}
The odds that $Y=1$ are then:
* When $X = 1$: $\frac{p_{11}}{p_{10}}$
* When $X = 0$: $\frac{p_{01}}{p_{00}}$
and the *odds ratio*, as a population statistic, is defined as:
$$
OR = \frac{p_{11}/p_{10}}{p_{01}/p_{00}}
$$
The odds ratio describes how much the occurrence of one (binary) event depends on another (binary) event, and is often used as an intuitive measure:
* "Your odds of developing lung cancer are halved if you quit smoking."
* "Marie's odds of winning the election tripled from January to March."
Note that odds ratios are interpreted multiplicatively: it makes sense to say "$X$ raises the odds of $Y$ by a factor of 2", but not "$X$ raises the odds of $Y$ by 2".
Odds ratios are in part a way of thinking about changes in probability. For the tapping example:
$$
P(\,\text{tapped}~|~\text{transitive}\,) = 0.926, \quad
P(\,\text{tapped}~|~\text{intransitive}\,) = 0.86
$$
Thus, the change in probability from intransitive to transitive sentences is 0.066. However, odds ratios are *not* equivalent to changes in probability: the odds ratio corresponding to a change from $p$ to $p + \Delta p$ depends on the starting probability $p$. For example, Figure \@ref(fig:oddsratio) shows the odds ratio corresponding to a change in probability of 0.1, as a function of the starting probability.
```{r oddsratio, echo=FALSE, warning=FALSE, fig.align='center', fig.height=3, fig.width=3.5, fig.cap="Odds ratio when a probability ($p$) is increased by 0.1, as a function of $p$"}
p <- seq(0.001, 0.899, by = 0.01)
oddsRat <- ((p+0.1)/(1-p-0.1))/(p/(1-p))
qplot(p, oddsRat, geom = "line") +
xlab("Probability") +
ylab("Odds ratio when p -> p+0.1") + xlim(0,1) + ylim(1.4,4)
```
For a change in probability from 0.5 to 0.6, the odds increase by about 1.5x, while for a change in probability from 0.85 to 0.95 the odds increase by 3.35x.
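To verify these two numbers (a minimal sketch):
```{r}
## odds ratio when p changes from 0.5 to 0.6
(0.6 / 0.4) / (0.5 / 0.5)
## odds ratio when p changes from 0.85 to 0.95
(0.95 / 0.05) / (0.85 / 0.15)
```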
To summarize:
* Odds have a one-to-one relationship with probabilities
* **Changes** in odds don't straightforwardly correspond to **changes** in probabilities.
The non-equivalence between changes in odds and changes in probability is important to remember when interpreting logistic regressions, which are in log-odds space.
### Log odds: sample and population
The odds ratio is a useful *summary statistic* of the degree of dependence between two binary variables, analogous to correlations ($r$, $\rho$) for summarizing dependence between two continuous variables.
<!-- TODO FUTURE: if a chapter including correlations ever added in, add cross-ref -->
Because we never actually observe the population value of the odds ratio---which is based on probabilities of each combination of $X$ and $Y$---we need a *sample statistic* to estimate it from observed counts for each $X$/$Y$ combination.
Suppose we observe the following counts:
\begin{array}{c|cc}
& Y = 1 & Y = 0 \\
\hline
X = 1 & n_{11} & n_{10} \\
X = 0 & n_{01} & n_{00}
\end{array}
To calculate the empirical odds ratio, we estimate each probability from the observed counts. For example:
\begin{align}
P(Y=1|X=1) \approx \frac{n_{11}}{n_{11}+n_{10}}, \quad P(Y=0|X=1) \approx\frac{n_{10}}{n_{11}+n_{10}} \\
\implies Odds(Y=1|X=1) = \frac{n_{11}}{n_{10}}
\end{align}
Similarly estimating $Odds(Y=1|X=0) = n_{01}/n_{00}$, then taking the logarithm of $Odds(Y=1|X=1)/Odds(Y=1|X=0)$, gives the sample log odds ratio:
$$
\hat{L} = \log \left[ \frac{n_{11}/n_{10}}{n_{01}/n_{00}} \right]
$$
It turns out that **$\hat{L}$ is approximately normally distributed with mean $\log(OR)$** (for large enough samples), where $OR$ is the population value of the odds ratio. Thus, if we want to do inferential statistics on the effect of binary $X$ on binary $Y$, it makes sense to work with **log-odds**, rather than probabilities.
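For illustration, here is a minimal sketch computing $\hat{L}$ for the 2x2 `prominence`/`context` table `tab` from above, together with an approximate 95% confidence interval based on the standard large-sample standard error $\sqrt{1/n_{11} + 1/n_{10} + 1/n_{01} + 1/n_{00}}$ (a result we state here without derivation):
```{r}
## observed counts for each prominence/context combination
n11 <- tab['Adjective', 'Alternative']
n10 <- tab['Adjective', 'NoAlternative']
n01 <- tab['Noun', 'Alternative']
n00 <- tab['Noun', 'NoAlternative']
## sample log odds ratio
L.hat <- log((n11 / n10) / (n01 / n00))
L.hat
## large-sample standard error and approximate 95% CI (on the log-odds scale)
se <- sqrt(1/n11 + 1/n10 + 1/n01 + 1/n00)
L.hat + c(-1.96, 1.96) * se
```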
#### Calculation and interpretation
* The function from probability to log-odds is called *logit* ($\text{logit}(p)$)
* The function from log-odds to probability is called *inverse logit* ($\text{logit}^{-1}(x)$), a.k.a. the "logistic function" ($1/(1+e^{-x})$).
Both can be calculated using R functions, such as `logit` and `invlogit` in the `arm` package.
Note that **adding** $a$ on the log-odds scale corresponds to **multiplying** on the odds scale by $e^a$.
For example, increasing a log odds ratio by 1 corresponds to multiplying the odds ratio by $e$:
\begin{eqnarray*}
\log(OR_1) = 0.69 & \implies & OR_1 = e^{0.69} = 2 \\
\log(OR_2) = 1.69 & \implies & OR_2 = e^{1.69} = 5.42 = 2{\cdot}e
\end{eqnarray*}
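A minimal sketch of these conversions, using `logit()` and `invlogit()` from the `arm` package (loaded in the preliminary code):
```{r}
## probability -> log-odds, and back
logit(0.75)
invlogit(logit(0.75))
## adding 1 on the log-odds scale multiplies the odds by e
exp(logit(0.75))        # odds corresponding to p = 0.75
exp(logit(0.75) + 1)    # three times e
```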
#### Exercise 3 {-}
Suppose that in the `tapping` data, one participant's tapping rate (i.e. probability of tapping) is 0.90 in fast speech and 0.70 in slow speech.
* What are the participant's **odds** of tapping in fast and slow speech (two numbers)?
<div class="fold o">
```{r, echo=FALSE}
0.90/(1-0.90)
0.70/(1-0.70)
```
</div>
* What is the **odds ratio** of tapping (in fast vs. slow speech) (one number)?
<div class="fold o">
```{r, echo=FALSE}
(0.90/(1-0.90))/(0.70/(1-0.70))
```
</div>
* What is the equivalent **change in log-odds**?
<div class="fold o">
```{r, echo=FALSE}
log(0.90/(1-0.90)) - log(0.70/(1-0.70))
```
</div>
## Other readings {#cda-other-readings}
Most introductory statistics books cover contingency tables, chi-squared tests, and Fisher exact tests, including:
* General audience: @dalgaard2008introductory Ch. 8, @agresti2007 Ch. 1-2
* For psychologists and language scientists: @NavarroOnline Ch. 12, @johnson2008quantitative Ch. 3.1, 5.1-5.3
The Agresti book is a particularly good and accessible resource for categorical data analysis generally. (@agresti2003categorical is Agresti's more technical/comprehensive book, which is the reference on CDA.)
Some discussion of odds and odds ratios is given by @agresti2007 Ch. 2, @gelman2007data 5.2, @crawley2015statistics Ch. 14.
## Solutions {#c3solns}
### Solutions to Exercise 1:
Expected counts for:
* `prominence` = `Noun`, `context` = `Alternative` cell = 118.5542
* `prominence` = `Adjective`, `context` = `NoAlternative` cell = 88.55422
* `prominence` = `Noun`, `context` = `NoAlternative` cell = 121.4458