-
Notifications
You must be signed in to change notification settings - Fork 66
/
13-overdispersed-glms.Rmd
172 lines (121 loc) · 6.39 KB
/
13-overdispersed-glms.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
# Overdispersion and GLMs
# Goals
- Learn to detect and deal with overdispersion in Poisson and binomial GLMs
# What is overdispersion?
"Overdispersion is the polite statistician's version of Murphy's Law: if something can go wrong, it will"
- Crawley (2007, p. 522) "The R Book"
So far, all our fitted GLM models have matched the simulated data. One common thing that can go wrong is that there is more variability in the data than allowed by a distribution.
This isn't a problem for distributions like the normal, Gamma, or negative binomial, because these distributions have a parameter that lets them be as narrow or wide as they need to be.
But some distributions, notably the Poisson and binomial (with multiple samples), assume a fixed level of variability for a given mean value. But the real world is messy and this isn't always the case. Let's take a look at what that means.
## Quasipoisson, log link
We'll start by generating count data that we know are overdispersed for a Poisson.
```{r}
library(ggplot2)
library(dplyr)
set.seed(111)
N <- 500
x <- runif(N, -1, 1)
a <- 0.5
b <- 1.3
d <- data_frame(x = x)
inverse_logit <- function(x) plogis(x)
```
```{r}
y_true <- exp(a + b * x)
rqpois <- function (n, lambda, d = 1) { # generate random quasipoisson values
if (d == 1)
rpois(n, lambda)
else
rnbinom(n, size = (lambda / (d - 1)), mu = lambda)
}
set.seed(1234)
y <- rqpois(N, lambda = y_true, d = 5)
plot(x, y)
```
Let's look at the data that we just created.
In the following, the dashed line indicates the one-to-one line (Poisson), the blue line indicates the variance scaling linearly with the mean but not one-to-one (the true relationship here, quasipoisson), and the red line indicates the variance scaling quadratically with the mean (negative binomial).
I've grouped the values along the x-axis into 15 bins in order to make this plot.
```{r}
d$y <- y
d$x_group <- findInterval(d$x, seq(min(d$x), max(d$x), length.out = 15))
group_by(d, x_group) %>%
summarise(m = mean(y), v = var(y)) %>%
ggplot(aes(m, v)) +
geom_smooth(method = "lm",
formula = y ~ x - 1, se = F, colour = "blue") +
geom_smooth(method = "lm",
formula = y ~ I(x^2) + offset(x) - 1, colour = "red", se = F) +
geom_abline(intercept = 0, slope = 1, lty = 2) +
geom_point()
```
Let's fit a GLM with a Poisson distribution and a log link even though we know that the data are overdispersed.
```{r}
(m_poisson <- glm(y ~ x, family = poisson(link = "log"), data = d))
ggeffects::ggpredict(m_poisson, "x") %>%
ggplot(aes(x, predicted)) +
geom_line(colour = "red") +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .2) +
geom_point(data = d, aes(x = x, y = y))
```
If we look at the residuals, these should be constant with the predicted mean value. It can be hard to see these patterns in the residuals. There is not a lot to see here even though we know there is overdispersion.
```{r}
plot(fitted(m_poisson), residuals(m_poisson))
```
We can also look at whether the deviance is approximately equal to the residual degrees of freedom. If the deviance is much higher then that is evidence of overdispersion.
```{r}
deviance(m_poisson)/m_poisson$df.residual
```
We can also statistically test that:
```{r}
# Cameron, A.C. & Trivedi, P.K. (1990). Regression-based tests for overdispersion in the Poisson model. Journal of Econometrics, 46, 347–364.
AER::dispersiontest(m_poisson)
```
To deal with the overdispersion, we will refit our model with the quasipoisson family. This simply estimates how overdispersed the data are and scales the standard errors on our parameter estimates appropriately.
This means we leave the world of likelihood and can't simply calculate values such as AIC. (There's qAIC etc.)
(Knowing what we discussed previously about the 2 negative binomial distributions, what would be an alternative way to fit these data?)
```{r}
(m_qp <- glm(y ~ x, family = quasipoisson(link = "log"), data = d))
confint(m_qp)
confint(m_poisson)
```
What do you notice about the confidence intervals in these 2 models?
# Quasibinomial, logit link
We can end up with overdispersed data from a binomial distribution if we have repeated trials. (There is no such thing as overdispersion if we are modeling single trials, i.e. 0s and 1s.)
When might that happen? For example, maybe you are measuring the proportion of frogs that survive in a given tank.
For this example, let's say you have 30 frogs per tank and 40 tanks.
Let's simulate the proportion of frogs that survived after some experiment in a case with overdispersed data, and in a case with no overdispersion:
```{r}
set.seed(1)
n <- 30
y <- emdbook::rbetabinom(40, 0.5, size = n, theta=1)
y2 <- rbinom(40, 0.5, size = n)
par(mfrow = c(2, 1))
plot(table(y/n)/length(y), xlim = c(0, 1), ylab = "prop.",
main = "Overdispersed")
plot(table(y2/n)/length(y2), xlim = c(0, 1), ylab = "prop.",
main = "Not overdispersed")
```
What we are looking at here is a histogram of the proportion of frogs that survived in each tank. Note how much more spread out the values are in the overdispersed scenario compared to the pure binomial distribution.
Let's plot the estimated mean proportion survived with a GLM fitted with the binomial error distribution, and a GLM that allows for overdispersion with the quasibinomial distribution.
```{r}
par(mfrow = c(1, 1))
plot(table(y/n)/length(y), xlim = c(0, 1), ylab = "prop.", col = "grey80")
abline(v = 0.5, col = "black", lwd = 10)
ss <- rep(n, length(y))
m <- glm(y/n ~ 1, family = binomial(link = "logit"),
weights = ss)
ci <- inverse_logit(confint(m))
abline(v = ci, col = "red", lwd = 5)
m2 <- glm(y/n ~ 1, family = quasibinomial(link = "logit"),
weights = rep(n, length(y)))
ci2 <- inverse_logit(confint(m2))
abline(v = ci2, col = "blue", lwd = 5)
```
In the above plot, the true value is indicated by the thick black vertical line.
The binomial GLM 95% confidence interval is indicated by the red vertical lines.
And the quasibinomial GLM 95% confidence interval is indicated by the blue vertical lines.
Note how our confidence intervals look too small if we don't allow for overdispersion.
Since this is a course on GLMMs, an alternative way to deal with over dispersion here would be to model a random intercept for each tank. But we won't get into that yet.
# More information
Ben Bolker has a wonderful vignette on quasi likelihood in R in his bbmle package:
<https://cran.r-project.org/web/packages/bbmle/vignettes/quasi.pdf>