---
title: "Chapter 11"
subtitle: "Multilevel Generalized Linear Models"
output:
  pdf_document:
    number_sections: yes
  html_document: default
editor_options:
  chunk_output_type: console
---
# Multilevel Generalized Linear Models {#ch-GLMM}
## Learning Objectives {#objectives}
After finishing this chapter, students should be able to:
- Recognize when multilevel generalized linear models (multilevel GLMs) are appropriate.
- Understand how multilevel GLMs essentially combine ideas from earlier chapters.
- Apply exploratory data analysis techniques to multilevel data with non-normal responses.
- Recognize when crossed random effects are appropriate and how they differ from nested random effects.
- Write out a multilevel generalized linear statistical model, including assumptions about variance components.
- Interpret model parameters (including fixed effects and variance components) from a multilevel GLM.
- Generate and interpret random effect estimates.
```{r load_packages11, message = FALSE}
# Packages required for Chapter 11
library(gridExtra)
library(lme4)
library(pander)
library(ggmosaic)
library(knitr)
library(kableExtra)
library(broom)
library(tidyverse)
```
```{r, include=F}
if(knitr::is_html_output()){options(knitr.table.format = "html")} else {options(knitr.table.format = "latex")}
```
## Case Study: College Basketball Referees {#cs:refs}
An article by @Anderson2009 describes empirical evidence that officials in NCAA men's college basketball tend to "even out" foul calls over the course of a game, based on data collected in 2004-2005. Using logistic regression to model the effect of foul differential on the probability that the next foul called would be on the home team (controlling for score differential, conference, and whether or not the home team had the lead), Anderson and Pierce found that "the probability of the next foul being called on the visiting team can reach as high as 0.70." More recently, Moskowitz and Wertheim, in their book *Scorecasting*, argue that the number one reason for the home field advantage in sports is referee bias. Specifically, in basketball, they demonstrate that calls over which referees have greater control---offensive fouls, loose ball fouls, ambiguous turnovers such as palming and traveling---were more likely to benefit the home team than more clear-cut calls, especially in crucial situations [@Moskowitz2011].
Data have been gathered from the 2009-2010 college basketball season for three major conferences to investigate the following questions [@Noecker2012]:
- Does evidence that college basketball referees tend to "even out" calls still exist in 2010 as it did in 2005?
- How do results change if our analysis accounts for the correlation in calls from the same game and the same teams?
- Is the tendency to even out calls stronger for fouls over which the referee generally has greater control? Fouls are divided into offensive, personal, and shooting fouls, and one could argue that referees have the most control over offensive fouls (for example, where the player with the ball knocks over a stationary defender) and the least control over shooting fouls (where an offensive player is fouled in the act of shooting).
- Are the actions of referees associated with the score of the game?
## Initial Exploratory Analyses {#explore-glmm}
### Data Organization
Examination of data for Case Study \@ref(cs:refs) reveals the following key variables in `basketball0910.csv`:
- `game` = unique game identification number
- `date` = date game was played (YYYYMMDD)
- `visitor` = visiting team abbreviation
- `hometeam` = home team abbreviation
- `foul.num` = cumulative foul number within game
- `foul.home` = indicator if foul was called on the home team
- `foul.vis` = indicator if foul was called on the visiting team
- `foul.diff` = the difference in fouls before the current foul was called (home - visitor)
- `score.diff` = the score differential before the current foul was called (home - visitor)
- `lead.vis` = indicator if visiting team has the lead
- `lead.home` = indicator if home team has the lead
- `previous.foul.home` = indicator if previous foul was called on the home team
- `previous.foul.vis` = indicator if previous foul was called on the visiting team
- `foul.type` = categorical variable if current foul was offensive, personal, or shooting
- `shooting` = indicator if foul was a shooting foul
- `personal` = indicator if foul was a personal foul
- `offensive` = indicator if foul was an offensive foul
- `time` = number of minutes left in the first half when foul called
Data were collected on 4972 fouls from 340 games in the Big Ten, ACC, and Big East conferences during the 2009-2010 season. We focus on fouls called during the first half to avoid the issue of intentional fouls by the trailing team at the end of games. Table \@ref(tab:table1chp11) illustrates key variables from the first 10 rows of the data set.
```{r, include=FALSE}
#getting-started
theme.1 <- theme(axis.title.x = element_text(size = 11),
axis.title.y = element_text(size = 11),
plot.title=element_text(hjust=.9,face="italic",size=12))
refdata <- read.csv("data/basketball0910.csv")
# refdata <- read.csv(file=file.choose()) # read in basketball0910.csv
dim(refdata) # should be 4972 x 19
head(refdata) # examine first 6 rows
```
```{r, echo=FALSE, warning=FALSE}
# Illustrate long data
#table1chp11 <- data.frame(refdata %>% slice(1:10) %>%
# dplyr::select(., 2, 4, 5, 6, 7, 9, 10, 12, 15, 19))
#if(knitr::is_html_output()){
# kable(table1chp11) %>%
# kable_styling(bootstrap_options = c("striped")) %>%
# scroll_box(width = "100%")
#} else{
# kable(table1chp11) %>%
# kable_styling(bootstrap_options = c("striped"))
#}
# Would place outside R chunk:
#<caption>(\#tab:table1chp11) Key variables from the first 10 rows of data from the College Basketball Referees case study. Each row represents a different foul called; we see all 8 first-half fouls from Game 1 followed by the first 2 fouls called in Game 2.
```
```{r table1chp11, echo=FALSE, warning=FALSE}
# Illustrate long data
table1chp11 <- data.frame(refdata %>% slice(1:10) %>%
dplyr::select(., 2, 4, 5, 6, 7, 9, 10, 12, 15, 19))
kable(table1chp11, booktabs=T,
caption="Key variables from the first 10 rows of data from the College Basketball Referees Case Study. Each row represents a different foul called; we see all 8 first-half fouls from Game 1 followed by the first 2 fouls called in Game 2.") %>%
kable_styling(latex_options = "scale_down", font_size = 9)
```
For our initial analysis, our primary response variable is `foul.home`, and our primary hypothesis concerns evening out foul calls. We hypothesize that the probability a foul is called on the home team is inversely related to the foul differential; that is, if more fouls have been called on the home team than the visiting team, the next foul is less likely to be on the home team.
The structure of these data suggests a couple of familiar attributes combined in an unfamiliar way. With a binary response variable, a generalized linear model is typically applied, especially one with a logit link function (indicating logistic regression). But with covariates at multiple levels---some at the individual foul level and others at the game level---a multilevel model would also be sensible. So what we need is a multilevel model with a non-normal response; in other words, a __multilevel generalized linear model (multilevel GLM)__. \index{multilevel GLM} We will investigate what such a model might look like in the next section, but we will still begin by exploring the data with initial graphical and numerical summaries.
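As a preview of the structure (in generic notation, not yet the chapter's final model), a multilevel logistic model attaches a logit link to the Bernoulli response at Level One and adds a game-level random effect:

$$
\log\left(\frac{p_{ij}}{1-p_{ij}}\right) = \beta_0 + \beta_1 X_{ij} + u_i, \quad u_i \sim N(0, \sigma_u^2),
$$

where $p_{ij}$ is the probability that foul $j$ in game $i$ is called on the home team, $X_{ij}$ is a Level One covariate, and $u_i$ captures game-to-game variation.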
As with other multilevel situations, we will begin with broad summaries across all 4972 foul calls from all 340 games. Most of the variables we have collected can vary with each foul called; these Level One variables include:
- whether or not the foul was called on the home team (our response variable),
- the game situation at the time the foul was called (the time remaining in the first half, who is leading and by how many points, the foul differential between the home and visiting team, and who the previous foul was called on), and
- the type of foul called (offensive, personal, or shooting).
Level Two variables, those that remain unchanged for a particular game, then include only the home and visiting teams, although we might consider attributes such as attendance, team rankings, etc.
### Exploratory Analyses {#glmm-eda}
In Figure \@ref(fig:gmu-histmat1), we see histograms for the continuous Level One covariates (time remaining, foul differential, and score differential). These plots treat each foul within a game as independent even though we expect them to be correlated, but they provide a sense for the overall patterns. We see that time remaining is reasonably uniform. Score differential and foul differential are both bell-shaped, with a mean slightly favoring the home team in both cases -- on average, the home team leads by 2.04 points (SD 7.24) and has 0.36 fewer previous fouls (SD 2.05) at the time a foul is called.
```{r, include=FALSE}
# Summarize Level 1 covariates (and responses) by ignoring
# within subject correlation and pretending all observations
# are independent
time.hist <- ggplot(data = refdata, aes(x = time)) +
geom_histogram(binwidth = 2, color = "black",
fill = "white") +
theme.1 + xlab("Time left in first half") +
ylab("Frequency") + labs(title = "(a)")
score.hist <- ggplot(data = refdata, aes(x = score.diff)) +
geom_histogram(binwidth = 5, color = "black",
fill = "white") +
theme.1 + xlab("Score difference (home-visitor)") +
ylab("Frequency") + labs(title = "(b)")
foul.hist <- ggplot(data = refdata, aes(x = foul.diff)) +
geom_histogram(binwidth = 1.5, color = "black",
fill = "white") +
theme.1 + xlab("Foul difference (home-visitor)") +
ylab("Frequency") + labs(title = "(c)")
```
```{r gmu-histmat1, fig.cap= "Histograms showing distributions of the 3 continuous Level One covariates: (a) time remaining, (b) score difference, and (c) foul difference.", fig.align="center",out.width="60%", echo=FALSE}
grid.arrange(time.hist, score.hist, foul.hist, ncol=2, nrow=2)
```
Summaries of the categorical response (whether the foul was called on the home team) and categorical Level One covariates (whether the home team has the lead and what type of foul was called) can be provided through tables of proportions. More fouls are called on visiting teams (52.1\%) than home teams, the home team is more likely to hold a lead (57.1\%), and personal fouls are most likely to be called (51.6\%), followed by shooting fouls (38.7\%) and then offensive fouls (9.7\%).
For an initial examination of Level Two covariates (the home and visiting teams), we can take the number of times, for instance, Minnesota (MN) appears in the long data set (with one row per foul called, as illustrated in Table \@ref(tab:table1chp11)) as the home team and divide by the number of unique games in which Minnesota is the home team. This ratio (12.1), found in Table \@ref(tab:table2chp11), shows that Minnesota is among the bottom three teams in the average total number of fouls in the first halves of games in which it is the home team. That is, games at Minnesota have few total fouls relative to games played elsewhere. Accounting for the effect of home and visiting team will likely be an important part of our model, since some teams tend to play in games with twice as many fouls called as others, and some teams see a noticeable disparity in the total number of fouls depending on whether they are home or away.
```{r, table2chp11, echo=FALSE, warning=FALSE}
three <- c(" ","Top 3"," "," "," ","Bottom 3"," ")
homet <- c("Duke","VaTech","Nova"," ","Mich","Ill","MN")
homes <- c("20.0","19.4","19.1"," ","10.6","11.6","12.1")
visitt <- c("WVa","Nova","Wake"," ","Wisc","Mich","PSU")
visits <- c("21.4","19.0","18.6"," ","10.4","11.1","11.3")
Difft <- c("Duke","Wisc","Pitt"," ","WVa","Mia","Clem")
Diffs <- c("4.0","2.6","2.3"," ","-6.9","-2.7","-2.6")
table2chp11 <- data.frame(three, homet, homes, visitt,
visits, Difft, Diffs)
colnames(table2chp11) <- c(" ", "Team", "Fouls", "Team",
"Fouls", "Team", "Fouls")
kable(table2chp11, booktabs=T, caption=" Average total number of fouls in the first half over all games in which a particular team is home or visitor. The left columns show the top 3 and bottom 3 teams according to total number of fouls (on both teams) in first halves of games in which they are the home team. The middle columns correspond to games in which the listed teams are the visitors, and the right columns show the largest differences (in both directions) between total fouls in games in which a team is home or visitor.") %>%
add_header_above(c(" ","Home"=2,"Visitor"=2, "Difference"=2))
```
Next, we inspect numerical and graphical summaries of relationships between Level One model covariates and our binary model response. As with other multilevel analyses, we will begin by observing broad trends involving all 4972 fouls called, even though fouls from the same game may be correlated. The conditional density plots in the first row of Figure \@ref(fig:gmu-cdelogitmat1) examine continuous Level One covariates. Figure \@ref(fig:gmu-cdelogitmat1)a provides support for our primary hypothesis about evening out foul calls, indicating a very strong trend for fouls to be more often called on the home team at points in the game when more fouls had previously been called on the visiting team. Figures \@ref(fig:gmu-cdelogitmat1)b and \@ref(fig:gmu-cdelogitmat1)c then show that fouls were somewhat more likely to be called on the home team when the home team's lead was greater and (very slightly) later in the half. Conclusions from the conditional density plots in Figures \@ref(fig:gmu-cdelogitmat1)a-c are supported with associated empirical logit plots in Figures \@ref(fig:gmu-cdelogitmat1)d-f. If a logistic link function is appropriate, these plots should be linear, and the stronger the linear association, the more promising the predictor. We see in Figure \@ref(fig:gmu-cdelogitmat1)d further confirmation of our primary hypothesis, with lower log-odds of a foul called on the home team associated with a greater number of previous fouls the home team had accumulated compared to the visiting team. Figure \@ref(fig:gmu-cdelogitmat1)e shows that game score may play a role in foul trends, as the log-odds of a foul on the home team grows as the home team accumulates a bigger lead on the scoreboard, and Figure \@ref(fig:gmu-cdelogitmat1)f shows a very slight tendency for greater log-odds of a foul called on the home team as the half proceeds (since points on the right are closer to the beginning of the game).
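For reference, the empirical logits plotted in panels (d)-(f) are computed by grouping fouls by covariate value and transforming the observed proportion of home fouls:

$$
\widehat{\mathrm{logit}} = \log\left(\frac{\hat{p}}{1-\hat{p}}\right),
$$

where $\hat{p}$ is the sample proportion of fouls called on the home team at a given foul differential, score differential, or two-minute time interval.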
Conclusions about continuous Level One covariates are further supported by summary statistics calculated separately for fouls called on the home team and those called on the visiting team. For instance, when a foul is called on the home team, there is an average of 0.64 additional fouls on the visitors at that point in the game, compared to an average of 0.10 additional fouls on the visitors when a foul is called on the visiting team. Similarly, when a foul is called on the home team, they are in the lead by an average of 2.7 points, compared to an average home lead of 1.4 points when a foul is called on the visiting team. As expected, the average time remaining in the first half at the time of the foul is very similar for home teams and visitors (9.2 vs. 9.5 minutes, respectively).
```{r, include=FALSE}
# analyses using tidyverse
with(refdata, summary(time))
with(refdata, summary(score.diff))
with(refdata, summary(foul.diff))
with(refdata, sd(time))
with(refdata, sd(score.diff))
with(refdata, sd(foul.diff))
refdata %>% count(foul.home) %>%
mutate(prop = n/sum(n))
refdata %>% count(lead.home) %>%
mutate(prop = n/sum(n))
refdata %>% count(previous.foul.home) %>%
mutate(prop = n/sum(n))
refdata %>% count(foul.type) %>%
mutate(prop = n/sum(n))
# There are no Level 2 covariates other than Home and
# Visiting Team. Tables below give overall summary of
# which teams called for more fouls.
f1 <- refdata %>% count(hometeam) %>%
mutate(prop = n/sum(n)) %>%
rename(n_fouls = n)
f1
f2 <- refdata %>% count(visitor) %>%
mutate(prop = n/sum(n)) %>%
rename(n_fouls = n)
f2
table(refdata$hometeam)
table(refdata$visitor)
# These tables tell how many games each team appears in.
onepergame <- refdata %>%
group_by(game) %>%
filter(row_number() == 1) %>%
ungroup
p1 <- onepergame %>% count(hometeam) %>%
mutate(prop = n/sum(n)) %>%
rename(n_games = n)
print(p1, n = Inf)
p2 <- onepergame %>% count(visitor) %>%
dplyr::mutate(prop = n/sum(n)) %>%
rename(n_games = n)
print(p2, n = Inf)
table(onepergame$hometeam)
table(onepergame$visitor)
# These tables tell average total fouls for both teams
# in the first half of games where a certain team is
# home or visitor
allhome <- p1 %>%
inner_join(f1, by = "hometeam") %>%
dplyr::mutate(fouls_per_game = n_fouls / n_games) %>%
arrange(desc(fouls_per_game))
print(allhome, n = Inf)
allvis <- p2 %>%
inner_join(f2, by = "visitor") %>%
dplyr::mutate(fouls_per_game = n_fouls / n_games) %>%
arrange(desc(fouls_per_game))
print(allvis, n = Inf)
allboth <- allhome %>%
rename(team = hometeam, home_fouls = fouls_per_game) %>%
dplyr::select(team, home_fouls) %>%
inner_join(allvis, by = c("team" = "visitor")) %>%
dplyr::select(team, home_fouls, fouls_per_game) %>%
rename(visitor_fouls = fouls_per_game) %>%
dplyr::mutate(diff = home_fouls - visitor_fouls) %>%
arrange(desc(diff))
print(allboth, n = Inf)
allhome0 = table(refdata$hometeam)/table(onepergame$hometeam)
allvis0 = table(refdata$visitor)/table(onepergame$visitor)
sort(allhome0)
sort(allvis0)
sort(allhome0-allvis0)
# Look at relationships among Level 1 covariates and primary
# response (again ignoring correlation). Cdplots and elogits
# for continuous covariates and segmented barcharts for
# categorical covariates.
with(refdata, by(time, foul.home, summary))
with(refdata, by(score.diff, foul.home, summary))
with(refdata, by(foul.diff, foul.home, summary))
# Find empirical logits and proportions by grouping fouls
# by foul.diff. First remove foul.diff=-8 or 6, since not
# enough data
foo1 <- refdata %>%
filter(foul.diff >= -7 & foul.diff <= 5)
foo1 %>% count(foul.diff, foul.home)
table(foo1$foul.home, foo1$foul.diff)
foul.df <- foo1 %>%
group_by(foul.diff) %>%
summarise(foul.phats = mean(foul.home)) %>%
mutate(foul.elogits = log(foul.phats/(1 - foul.phats)) )
print(foul.df)
ggplot(data = foul.df, aes(x = foul.diff, y = foul.elogits)) +
geom_point() +
geom_smooth(method = "lm", color = "black") + theme.1 +
ylab("Empirical logits") + xlab("Foul differential")
model0 = lm(foul.elogits ~ foul.diff, data = foul.df)
summary(model0)
# Repeat by grouping fouls by score.diff. First remove
# score.diff<=-12 or >=19, since not enough data
foo1 <- refdata %>%
filter(score.diff >= -11 & score.diff <= 18)
foo1 %>% count(score.diff, foul.home)
table(foo1$foul.home, foo1$score.diff)
score.df <- foo1 %>%
group_by(score.diff) %>%
summarise(score.phats = mean(foul.home)) %>%
mutate(score.elogits = log(score.phats/(1 - score.phats)) )
print(score.df)
ggplot(data = score.df, aes(x = score.diff,
y = score.elogits)) +
geom_point() +
geom_smooth(method = "lm", color = "black") + theme.1 +
ylab("Empirical logits") + xlab("Score differential")
model0 = lm(score.elogits ~ score.diff, data = score.df)
summary(model0)
# Repeat by grouping fouls by time remaining
# Must group in intervals since time truly continuous
foo1 <- refdata %>%
mutate(group = cut(time,
breaks = c(-Inf, 2, 4, 6, 8, 10, 12, 14, 16, 18, Inf),
labels = c(1, 3, 5, 7, 9, 11, 13, 15, 17, 19))) %>%
mutate(times = as.numeric(levels(group))[group]) %>%
select(-group)
foo1 %>% count(times, foul.home)
table(foo1$foul.home, foo1$times)
time.df <- foo1 %>%
group_by(times) %>%
summarise(time.phats = mean(foul.home)) %>%
mutate(time.elogits = log(time.phats/(1 - time.phats)) )
print(time.df)
ggplot(data = time.df, aes(x = times, y = time.elogits)) +
geom_point() +
geom_smooth(method = "lm", color = "black") + theme.1 +
ylab("Empirical logits") + xlab("Time remaining")
model0 = lm(time.elogits ~ times, data = time.df)
summary(model0)
refdata <- refdata %>%
mutate(foul.factor = as.factor(ifelse(foul.home == 1,
"Home", "Visitor")) )
```
```{r, include=FALSE}
#needed for next figure
foul.cd = ggplot(data = refdata, aes(x = foul.diff)) +
theme(legend.title = element_blank()) + theme.1 +
geom_density(aes(fill = foul.factor), position = "fill",
adjust = 2, alpha = 0.5) +
xlab("Foul difference (H-V)") +
ylab("Probability of Home Foul") +
labs(title="(a)") +
scale_fill_manual(values = c("grey20", "grey80"))
score.cd = ggplot(data = refdata, aes(x = score.diff)) +
theme(legend.title = element_blank()) + theme.1 +
geom_density(aes(fill = foul.factor), position = "fill",
adjust = 2, alpha = 0.5) +
xlab("Score difference (H-V)") +
ylab("Probability of Home Foul") +
labs(title="(b)") +
scale_fill_manual(values = c("grey20", "grey80"))
time.cd = ggplot(data = refdata, aes(x = time)) +
theme(legend.title = element_blank()) + theme.1 +
geom_density(aes(fill = foul.factor), position = "fill",
adjust = 2, alpha = 0.5) +
xlab("Time left in half") +
ylab("Probability of Home Foul") +
labs(title="(c)") +
scale_fill_manual(values = c("grey20", "grey80"))
foul.el <- ggplot(data = foul.df, aes(x = foul.diff,
y = foul.elogits)) +
geom_point(color="dark grey") +
theme.1 + xlab("Foul difference (H-V)") +
ylab("Empirical Log-odds of Home Foul") +
labs(title = "(d)") +
geom_smooth(se = FALSE, method = "lm", color = "black")
score.el <- ggplot(data = score.df, aes(x = score.diff,
y = score.elogits)) +
geom_point(color="dark grey") +
theme.1 + xlab("Score difference (H-V)") +
ylab("Empirical Log-odds of Home Foul") +
labs(title = "(e)") +
geom_smooth(se = FALSE, method = "lm", color = "black")
time.el <- ggplot(data = time.df, aes(x = times,
y = time.elogits)) +
geom_point(color="dark grey") +
theme.1 + xlab("Time left in half") +
ylab("Empirical Log-odds of Home Foul") +
labs(title = "(f)") +
geom_smooth(se = FALSE, method = "lm", color = "black")
```
```{r, gmu-cdelogitmat1, fig.cap= "Conditional density and empirical logit plots of the binary model response (foul called on home or visitor) vs. the three continuous Level One covariates (foul differential, score differential, and time remaining). The dark shading in a conditional density plot shows the proportion of fouls called on the home team for a fixed value of (a) foul differential, (b) score differential, and (c) time remaining. In empirical logit plots, estimated log odds of a home team foul are calculated for each distinct foul (d) and score (e) differential, except for differentials at the high and low extremes with insufficient data; for time (f), estimated log odds are calculated for two-minute time intervals and plotted against the midpoints of those intervals.", fig.align="center",out.width="60%", echo=FALSE, message=F}
grid.arrange(foul.cd, score.cd, time.cd,
foul.el, score.el, time.el, ncol = 3, nrow = 2)
```
The mosaic plots in Figure \@ref(fig:gmu-barmat1) examine categorical Level One covariates, indicating that fouls were more likely to be called on the home team when the home team was leading, when the previous foul was on the visiting team, and when the foul was a personal foul rather than a shooting foul or an offensive foul. A total of 51.8\% of calls go against the home team when it is leading the game, compared to only 42.9\% of calls when it is behind; 51.3\% of calls go against the home team when the previous foul went against the visitors, compared to only 43.8\% of calls when the previous foul went against the home team; and, 49.2\% of personal fouls are called against the home team, compared to only 46.9\% of shooting fouls and 45.7\% of offensive fouls. Eventually we will want to examine the relationship between foul type (personal, shooting, or offensive) and foul differential, examining our hypothesis that the tendency to even out calls will be even stronger for calls over which the referees have greater control (personal fouls and especially offensive fouls).
```{r, include=FALSE}
refdata <- refdata %>%
mutate(leadyes = ifelse(lead.home == 0, "No", "Yes"),
prevyes = ifelse(previous.foul.home == 0, "No", "Yes")) %>%
rename(whofoul = foul.factor)
barplot1 <- ggplot(data = refdata) +
geom_mosaic(aes(weight = 1, x = product(whofoul, foul.type),
fill = whofoul)) +
xlab("Foul Type") +
ylab("Proportion within Foul Type") +
labs(title = "(a)") + scale_fill_grey() +
theme(legend.title = element_blank()) + theme.1
barplot2 <- ggplot(data = refdata) +
geom_mosaic(aes(weight = 1, x = product(whofoul, leadyes),
fill = whofoul)) +
xlab("Home Team in Lead") +
ylab("Proportion within Leading Team") +
labs(title = "(b)") + scale_fill_grey() +
theme(legend.title = element_blank()) + theme.1
barplot3 <- ggplot(data = refdata) +
geom_mosaic(aes(weight = 1, x = product(whofoul, prevyes),
fill = whofoul)) +
xlab("Previous Foul on Home Team") +
ylab("Proportion within Previous Foul") +
labs(title = "(c)") + scale_fill_grey() +
theme(legend.title = element_blank()) + theme.1
```
```{r gmu-barmat1, fig.cap= "Mosaic plots of the binary model response (foul called on home or visitor) vs. the three categorical Level One covariates (foul type (a), team in the lead (b), and team called for the previous foul (c)). Each bar shows the percentage of fouls called on the home team vs. the percentage of fouls called on the visiting team for a particular category of the covariate. The bar width shows the proportion of fouls at each of the covariate levels.", fig.align="center",out.width="60%", echo=FALSE}
grid.arrange(barplot1, barplot2, barplot3, ncol = 2, nrow = 2)
```
The exploratory analyses presented above are an essential first step in understanding our data, seeing univariate trends, and noting bivariate relationships between variable pairs. However, our key research questions require us to (a) assess the effect of foul differential after adjusting for other significant predictors of which team is called for a foul, (b) account for potential correlation between foul calls within a game (or within a particular home or visiting team), and (c) determine whether the effect of foul differential is constant across game conditions. To address research questions such as these, we need to consider multilevel, multivariate statistical models for a binary response variable.
## Two-Level Modeling with a Generalized Response {#twolevelmodeling-glmm}
### A GLM Approach {#multregr-glmm}
One quick and dirty approach to analysis might be to run a multiple logistic regression model on the entire long data set of 4972 fouls. In fact, Anderson and Pierce ran such a model in their 2009 paper, using the results of their multiple logistic regression model to support their primary conclusions, while justifying their approach by confirming a low level of correlation within games and the minimal impact on fixed effect estimates that accounting for clustering would have. Output from one potential multiple logistic regression model is shown below; this initial modeling attempt shows significant evidence that referees tend to even out calls (i.e., that the probability of a foul called on the home team decreases as total home fouls increase compared to total visiting team fouls---that is, as `foul.diff` increases) after accounting for score differential and time remaining (Z=-3.078, p=.002). The extent of the effect of foul differential also appears to grow (in a negative direction) as the first half goes on, based on an interaction between time remaining and foul differential (Z=-2.485, p=.013). We will compare this model with others that formally account for clustering and correlation patterns in our data.
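Written out, the linear predictor of this single-level logistic regression model (matching the `glm()` call that follows) is

$$
\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1\,\mathrm{foul.diff}_i + \beta_2\,\mathrm{score.diff}_i + \beta_3\,\mathrm{lead.home}_i + \beta_4\,\mathrm{time}_i + \beta_5\,\mathrm{foul.diff}_i \cdot \mathrm{time}_i + \beta_6\,\mathrm{lead.home}_i \cdot \mathrm{time}_i,
$$

where $p_i$ is the probability that foul $i$ is called on the home team, treating all 4972 fouls as independent.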
```{r, comment = NA}
# Logistic regression model (not multilevel)
mod0 = glm(foul.home ~ foul.diff + score.diff + lead.home +
time + foul.diff:time + lead.home:time,
family = binomial, data = refdata)
```
```{r, echo=FALSE, message=FALSE}
coef(summary(mod0))
cat(" Residual deviance = ",
summary(mod0)$deviance, " on ",
summary(mod0)$df.residual, "df")
```
### A Two-Stage Modeling Approach {#twostage-glmm}
As we saw in Section \@ref(twostage), one way to handle clustered data is to fit a separate regression model to each unit of observation at Level Two (each game in this case). Since our primary response variable is binary (Was the foul called on the home team or not?), we would fit a logistic regression model to the data from each game. For example, consider the 14 fouls called during the first half of the March 3, 2010, game featuring Virginia at Boston College (Game 110) in Table \@ref(tab:table3chp11).
```{r, table3chp11, echo=FALSE, warning=FALSE}
# Model for Game 110 only (foul.diff as sole predictor)
game110 <- refdata %>%
filter(game == 110) %>% # pick off data for Game 110
select(.,4, 5, 6, 7, 9, 10, 15, 19)
# took out 2 and 12 - too wide
kable(game110, booktabs=T,
caption="Key variables from the March 3, 2010, game featuring Virginia at Boston College (Game 110).") %>%
kable_styling(latex_options = "scale_down")
```
Is there evidence from this game that referees tend to "even out" foul calls when one team starts to accumulate more fouls? Is the score differential associated with the probability of a foul on the home team? Is the effect of foul differential constant across all foul types during this game? We can address these questions through multiple logistic regression applied to only data from Game 110.
First, notation must be defined. Let $Y_{ij}$ be an indicator variable recording if the $j^{th}$ foul from Game $i$ was called on the home team (1) or the visiting team (0). We can consider $Y_{ij}$ to be a Bernoulli random variable with parameter $p_{ij}$, where $p_{ij}$ is the true probability that the $j^{th}$ foul from Game $i$ was called on the home team. As in Chapter \@ref(ch-logreg), we will begin by modeling the logit function of $p_{ij}$ for Game $i$ as a linear function of Level One covariates. For instance, if we were to consider a simple model with foul differential as the sole predictor, we could model the probability of a foul on the home team in Game 110 with the model:
\begin{equation}
\log\bigg(\frac{p_{110j}}{1-p_{110j}}\bigg)=a_{110}+b_{110}\mathrm{foul.diff}_{110j}
(\#eq:lev1glmm)
\end{equation}
where $i$ is fixed at 110. Note that there is no separate error term or variance parameter, since the variance is a function of $p_{ij}$ with a Bernoulli random variable.
Maximum likelihood estimators for the parameters in this model ($a_{110}$ and $b_{110}$) can be obtained through statistical software. $e^{a_{110}}$ represents the odds that a foul is called on the home team when the foul totals in Game 110 are even, and $e^{b_{110}}$ represents the multiplicative change in the odds that a foul is called on the home team for each additional foul for the home team relative to the visiting team during the first half of Game 110.
For Game 110, we estimate $\hat{a}_{110}=-5.67$ and $\hat{b}_{110}=-2.11$ (see output below). Thus, according to our simple logistic regression model, the odds that a foul is called on the home team when both teams have an equal number of fouls in Game 110 is $e^{-5.67}=0.0035$; that is, the probability that a foul is called on the visiting team (0.9966) is $1/0.0035 = 289$ times higher than the probability a foul is called on the home team (0.0034) in that situation. While these parameter estimates seem quite extreme, reliable estimates are difficult to obtain with 14 observations and a binary response variable, especially in a case like this where the fouls were only even at the start of the game. Also, as the gap between home and visiting fouls increases by 1, the odds that a foul is called on the visiting team increases by a multiplicative factor of more than 8 (since $1/e^{-2.11}=8.25$). In Game 110, this trend toward referees evening out foul calls is statistically significant at the 0.10 level (Z=-1.851, p=.0642).
```{r, comment = NA}
lreg.game110 <- glm(foul.home ~ foul.diff,
family = binomial, data = game110)
```
```{r, echo=FALSE, message=FALSE}
coef(summary(lreg.game110))
cat(" Residual deviance = ",
summary(lreg.game110)$deviance, " on ",
summary(lreg.game110)$df.residual, "df")
```
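As a quick check on the interpretations above, the Game 110 coefficient estimates can be moved from the log-odds scale to the odds scale with `exp()`. The short chunk below (not run here) is a sketch that assumes `lreg.game110` from the chunk above:

```{r, eval=FALSE}
# Move Game 110 estimates to the odds scale; these values should
# match the e^a and e^b interpretations discussed in the text
exp(coef(lreg.game110))          # estimated odds and odds ratio
1 / exp(coef(lreg.game110))[2]   # factor favoring the visiting
                                 # team per extra home foul
```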
As in Section \@ref(twostage), we can proceed to fit similar logistic regression models for each of the 339 other games in our data set. Each model will yield a new estimate of the intercept and the slope from Equation \@ref(eq:lev1glmm). These estimates are summarized graphically in Figure \@ref(fig:gmu-histmat2). There is noticeable variability among the 340 games in the fitted intercepts and slopes, with an IQR for intercepts ranging from -1.45 to 0.70, and an IQR for slopes ranging from -1.81 to -0.51. The majority of intercepts are below 0 (median = -0.41), so that in most games a foul is less likely to be called on the home team when foul totals are even. In addition, almost all of the estimated slopes are below 0 (median = -0.89), indicating that as the home team's foul total grows in relation to the visiting team's foul total, the odds of a foul on the home team continues to decrease.
```{r, include=FALSE}
# Examine intercepts and slopes by game for model
# with foul.diff as only predictor
regressions <- refdata %>%
group_by(game) %>%
do(fit = glm(foul.home ~ foul.diff, family = binomial,
data = .))
glm_info <- regressions %>%
tidy(fit) %>%
ungroup() %>%
dplyr::select(game, term, estimate) %>%
spread(key = term, value = estimate) %>%
rename(rate = foul.diff, int = `(Intercept)`)
# Descriptive statistics of the estimates obtained by
# fitting the logistic model by game.
summary(glm_info$int)
summary(glm_info$rate)
sd(glm_info$int)
sd(glm_info$rate)
cor(glm_info$int, glm_info$rate)
```
```{r, include=FALSE}
# histograms for ints, rates of change - gmu-histmat2.eps
int.hist2 <- ggplot(glm_info) +
geom_histogram(aes(x = int), color = "black",
fill = "white") +
theme.1 +
labs(x = "Intercepts", y = "Frequency", title = "(a)") +
scale_x_continuous(limits = c(-10, 10))
rate.hist2 <- ggplot(glm_info) +
geom_histogram(aes(x = rate), color = "black",
fill = "white") +
theme.1 +
labs(x = "Slopes", y = "Frequency", title = "(b)") +
scale_x_continuous(limits = c(-5, 5))
```
```{r gmu-histmat2, fig.align="center",out.width="60%",fig.cap= 'Histograms of (a) intercepts and (b) slopes from fitting simple logistic regression models by game. Several extreme outliers have been cut off in these plots for illustration purposes.', echo=FALSE, warning=FALSE, message=FALSE}
grid.arrange(int.hist2, rate.hist2, ncol = 1)
```
At this point, you might imagine expanding model building efforts in a couple of directions: (a) continue to improve the Level One model in Equation \@ref(eq:lev1glmm) by controlling for covariates and adding potential interaction terms, or (b) build Level Two models to explain the game-to-game variability in intercepts or slopes using covariates which remain constant from foul to foul within a game (like the teams playing). While we could pursue these two directions independently, we can accomplish our modeling goals in a much cleaner and more powerful way by proceeding as in Chapter \@ref(ch-multilevelintro) and building a unified multilevel framework under which all parameters are estimated simultaneously and we remain faithful to the correlation structure inherent in our data.
### A Unified Multilevel Approach {#unified-glmm}
As in Chapters \@ref(ch-multilevelintro) and \@ref(ch-lon), we will write out a composite model after first expressing Level One and Level Two models. That is, we will create Level One and Level Two models as in Section \@ref(twostage-glmm), but we will then combine those models into a composite model and estimate all model parameters simultaneously. Once again $Y_{ij}$ is an indicator variable recording if the $j^{th}$ foul from Game $i$ was called on the home team (1) or the visiting team (0), and $p_{ij}$ is the true probability that the $j^{th}$ foul from Game $i$ was called on the home team. Our Level One model with foul differential as the sole predictor is given by Equation \@ref(eq:lev1glmm) generalized to Game $i$:
\[ \log\bigg(\frac{p_{ij}}{1-p_{ij}}\bigg)=a_i+b_i\mathrm{foul.diff}_{ij} \]
Then we include no fixed covariates at Level Two, but we include error terms to allow the intercept and slope from Level One to vary by game, and we allow these errors to be correlated:
\begin{align*}
a_i & = \alpha_{0}+u_i \\
b_i & = \beta_{0}+v_i,
\end{align*}
where the error terms at Level Two can be assumed to follow a multivariate normal distribution:
\[ \left[ \begin{array}{c}
u_i \\ v_i
\end{array} \right] \sim N \left( \left[
\begin{array}{c}
0 \\ 0
\end{array} \right], \left[
\begin{array}{cc}
\sigma_{u}^{2} & \\
\sigma_{uv} & \sigma_{v}^{2}
\end{array} \right] \right) \]
Our composite model then looks like:
\begin{align*}
\log\bigg(\frac{p_{ij}}{1-p_{ij}}\bigg) & = a_i+b_i\mathrm{foul.diff}_{ij} \\
& = (\alpha_{0}+u_i) + (\beta_{0}+v_i)\mathrm{foul.diff}_{ij} \\
& = [\alpha_{0}+\beta_{0}\mathrm{foul.diff}_{ij}]+[u_i+v_i\mathrm{foul.diff}_{ij}]
\end{align*}
The major changes when moving from a normally distributed response to a binary response are the form of the response variable (a logit function) and the absence of an error term at Level One.
Again, we can use statistical software to obtain parameter estimates for this unified multilevel model using all 4972 fouls recorded from the 340 games. For example, the `glmer()` function from the `lme4` package in R extends the `lmer()` function to handle generalized responses and to account for the fact that fouls are not independent within games. Results are given below for the two-level model with foul differential as the sole covariate and Game as the Level Two observational unit.
```{r, include=FALSE, message=FALSE}
# Model for Boston College at home only (foul.diff as
# sole predictor)
# pick off data when BC at home
bc.home = refdata %>% filter(hometeam == "BC")
dim(bc.home)
lreg.bchm <- glm(foul.home ~ foul.diff, family = binomial,
data = bc.home)
summary(lreg.bchm)
# Examine intercepts by hometeam for model with foul.diff
# as only predictor
int <- by(refdata, refdata$hometeam, function(data)
coefficients(glm(foul.home~foul.diff,
family=binomial,data=data))[[1]])
summary(int) # summary statistics for 39 intercepts
# Examine slopes by hometeam for model with foul.diff
# as only predictor
rate <- by(refdata, refdata$hometeam, function(data)
coefficients(glm(foul.home~foul.diff,
family=binomial,data=data))[[2]])
summary(rate)
regressions <- refdata %>%
group_by(hometeam) %>%
do(fit = glm(foul.home ~ foul.diff, family = binomial,
data = .))
glm_info <- regressions %>%
tidy(fit) %>%
ungroup() %>%
dplyr::select(hometeam, term, estimate) %>%
spread(key = term, value = estimate) %>%
rename(rate = foul.diff, int = `(Intercept)`)
# Descriptive statistics of the estimates obtained by
# fitting the logistic model by hometeam.
summary(glm_info$int)
summary(glm_info$rate)
sd(glm_info$int)
sd(glm_info$rate)
cor(glm_info$int, glm_info$rate)
```
```{r, comment=NA, message=FALSE}
# Multilevel model with only foul.diff and errors on slope
# and int and 1 RE
model.b1 <- glmer(foul.home ~ foul.diff + (foul.diff|game),
family = binomial, data = refdata)
```
```{r, echo=FALSE, message=FALSE}
VCrandom <- VarCorr(model.b1)
print(VCrandom, comp = c("Variance", "Std.Dev."))
cat(" Number of Level Two groups = ",
summary(model.b1)$ngrps)
coef(summary(model.b1))
cat(" AIC = ", AIC(model.b1), "; BIC = ", BIC(model.b1))
```
When parameter estimates from the multilevel model above are compared with those from the naive logistic regression model assuming independence of all observations (below), there are noticeable differences. For instance, each additional foul for the visiting team is associated with a 33\% increase ($1/e^{-.285}=1.33$) in the odds of a foul called on the home team under the multilevel model, but the single-level model estimates the same increase as only 14\% ($1/e^{-.130}=1.14$). Also, estimated standard errors for fixed effects are larger under multilevel generalized linear modeling; this is typical after accounting for correlated observations, since clustering effectively reduces the number of independent pieces of information in the sample.
```{r, comment = NA}
# Logistic regression model (not multilevel) with
# only foul.diff
mod0a <- glm(foul.home ~ foul.diff, family = binomial,
data = refdata)
```
```{r, echo=FALSE, message=FALSE}
coef(summary(mod0a))
cat(" Residual deviance = ",
summary(mod0a)$deviance, " on ",
summary(mod0a)$df.residual, "df")
```
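To see the contrast between the two approaches side by side, the estimated odds ratios and standard errors can be placed next to each other. The sketch below (not run here) assumes `model.b1` and `mod0a` from the chunks above:

```{r, eval=FALSE}
# Fixed effects on the odds scale, multilevel vs. single-level
exp(fixef(model.b1))          # multilevel estimates
exp(coef(mod0a))              # naive single-level estimates
# Standard errors for the fixed effects under each model
sqrt(diag(vcov(model.b1)))
sqrt(diag(vcov(mod0a)))
```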
## Crossed Random Effects {#crossedre}
In the College Basketball Referees case study, our two primary Level Two covariates are home team and visiting team. In Section \@ref(glmm-eda) we showed evidence that the probability a foul is called on the home team changes if we know precisely who the home and visiting teams are. However, if we were to include an indicator variable for each distinct team, we would need 38 indicator variables for home teams and 38 more for visiting teams. That's a lot of degrees of freedom to spend! And adding those indicator variables would complicate our model considerably, despite the fact that we're not very interested in specific coefficient estimates for each team---we just want to control for teams playing to draw stronger conclusions about referee bias (focusing on the `foul.diff` variable). One way, then, of accounting for teams is by treating them as __random effects__ rather than fixed effects. For instance, we can assume the effect that Minnesota being the home team has on the log odds of a foul call on the home team is drawn from a normal distribution with mean 0 and variance $\sigma^{2}_{v}$. Maybe the estimated random effect for Minnesota as the home team is -0.5, so that Minnesota is slightly less likely than the typical home team to have fouls called on it, and the odds of a foul on the home team are somewhat correlated whenever Minnesota is the home team. In this way, we account for the complete range of team effects, while only having to estimate a single parameter ($\sigma^{2}_{v}$) rather than coefficients for 38 indicator variables. The effect of visiting teams can be similarly modeled.
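To make this contrast concrete, the two strategies differ only in how the team terms enter the model formula. The sketch below (not run here) uses the `refdata`, `hometeam`, and `visitor` variables from this chapter; the random effects version is developed formally in the models that follow:

```{r, eval=FALSE}
# Fixed effects approach: 38 + 38 indicator variable coefficients
glm(foul.home ~ foul.diff + factor(hometeam) + factor(visitor),
    family = binomial, data = refdata)
# Random effects approach: only two variance parameters for teams
glmer(foul.home ~ foul.diff + (1 | hometeam) + (1 | visitor),
      family = binomial, data = refdata)
```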
How will treating home and visiting teams as random effects change our multilevel model? Another way we might view this situation is by considering that Game is not the only Level Two observational unit we might have selected. What if we instead decided to focus on Home Team as the Level Two observational unit? That is, what if we assumed that fouls called on the same home team across all games must be correlated? In this case, we could redefine our Level One model from Equation \@ref(eq:lev1glmm). Let $Y_{hj}$ be an indicator variable recording if the $j^{th}$ foul from Home Team $h$ was called on the home team (1) or the visiting team (0), and $p_{hj}$ be the true probability that the $j^{th}$ foul from Home Team $h$ was called on the home team. Now, if we were to consider a simple model with foul differential as the sole predictor, we could model the probability of a foul on the home team for Home Team $h$ with the model:
\begin{equation*}
\log\bigg(\frac{p_{hj}}{1-p_{hj}}\bigg)=a_h+b_h\mathrm{foul.diff}_{hj}
\end{equation*}
In this case, $e^{a_{h}}$ represents the odds that a foul is called on the home team when total fouls are equal between both teams in a game involving Home Team $h$, and $e^{b_{h}}$ represents the multiplicative change in the odds that a foul is called on the home team for every extra foul on the home team compared to the visitors in a game involving Home Team $h$. After fitting logistic regression models for each of the 39 teams in our data set, we see in Figure \@ref(fig:gmu-histmat3) variability in fitted intercepts (mean=-0.15, sd=0.33) and slopes (mean=-0.22, sd=0.12) among the 39 teams, although much less variability than we observed from game-to-game. Of course, each logistic regression model for a home team was based on about 10 times more foul calls than each model for a game, so observing less variability from team-to-team was not unexpected.
```{r, include=FALSE}
# histograms for ints, rates of change
int.hist3 <- ggplot(glm_info) +
geom_histogram(aes(x = int), binwidth = .1,
color = "black", fill = "white") +
theme.1 +
labs(x = "Intercepts", y = "Frequency", title = "(a)")
rate.hist3 <- ggplot(glm_info) +
geom_histogram(aes(x = rate), binwidth = .04,
color = "black", fill = "white") +
theme.1 +
labs(x = "Slopes", y = "Frequency", title = "(b)")
```
```{r, gmu-histmat3, fig.align="center",out.width="60%", fig.cap='Histograms of (a) intercepts and (b) slopes from fitting simple logistic regression models by home team.',echo=FALSE, warning=FALSE}
grid.arrange(int.hist3, rate.hist3, ncol = 1)
```
From a modeling perspective, accounting for clustering by game and by home team (not to mention by visiting team) brings up an interesting issue we have not yet considered---can we handle random effects that are not nested? Since each foul called is associated with only one game (or only one home team and one visiting team), foul is considered nested in game (or home or visiting team). However, a specific home team is not associated with a single game; that home team appears in several games. Therefore, any effects of game, home team, and visiting team are considered __crossed random effects__. \index{crossed random effects}
A two-level model which accounts for variability among games, home teams, and visiting teams would take on a slightly new look. First, the full subscripting would change a bit. Our primary response variable would now be written as $Y_{i[gh]j}$, an indicator variable recording if the $j^{th}$ foul from Game $i$ was called on the home team (1) or the visiting team (0), where Game $i$ pitted Visiting Team $g$ against Home Team $h$. Square brackets are introduced since $g$ and $h$ are essentially at the same level as $i$, whereas we have assumed (without stating so) throughout this book that subscripting without square brackets implies a movement to lower levels as the subscripts move left to right (e.g., $ij$ indicates $i$ units are at Level Two, while $j$ units are at Level One, nested inside Level Two units). We can then consider $Y_{i[gh]j}$ to be a Bernoulli random variable with parameter $p_{i[gh]j}$, where $p_{i[gh]j}$ is the true probability that the $j^{th}$ foul from Game $i$ was called on Home Team $h$ rather than Visiting Team $g$. We will include the crossed subscripting only where necessary.
Typically, with the addition of crossed effects, the Level One model will remain familiar and changes will be seen at Level Two, especially in the equation for the intercept term. In the model formulation below we allow, as before, the slope and intercept to potentially vary by game:
- Level One:
\begin{equation}
\log\bigg(\frac{p_{i[gh]j}}{1-p_{i[gh]j}}\bigg)=a_{i}+b_{i}\mathrm{foul.diff}_{ij}
(\#eq:lev1cross)
\end{equation}
- Level Two:
\begin{align*}
a_{i} & = \alpha_{0}+u_{i}+v_{h}+w_{g} \\
b_{i} & = \beta_{0},
\end{align*}
Therefore, at Level Two, we assume that $a_{i}$, the log odds of a foul on the home team when the home and visiting teams in Game $i$ have an equal number of fouls, depends on four components:
- $\alpha_{0}$ is the population average log odds across all games and fouls (fixed)
- $u_{i}$ is the effect of Game $i$ (random)
- $v_{h}$ is the effect of Home Team $h$ (random)
- $w_{g}$ is the effect of Visiting Team $g$ (random)
where error terms (random effects) at Level Two can be assumed to follow independent normal distributions:
\begin{align*}
u_{i} & \sim N \left( 0 , \sigma_{u}^{2} \right) \\
v_{h} & \sim N \left( 0 , \sigma_{v}^{2} \right) \\
w_{g} & \sim N \left( 0 , \sigma_{w}^{2} \right).
\end{align*}
We could include terms that vary by home or visiting team in other Level Two equations, but often adjusting for these random effects on the intercept is sufficient. The advantages to including additional random effects are three-fold. First, by accounting for additional sources of variability, we should obtain more precise estimates of other model parameters, including key fixed effects. Second, we obtain estimates of variance components, allowing us to compare the relative sizes of game-to-game and team-to-team variability. Third, as outlined in Section \@ref(estimatedRE), we can obtain estimated random effects which allow us to compare the effects on the log-odds of a home foul of individual home and visiting teams.
Our composite model then looks like:
\begin{equation*}
\log\bigg(\frac{p_{i[gh]j}}{1-p_{i[gh]j}}\bigg) = [\alpha_{0}+\beta_{0}\mathrm{foul.diff}_{ij}]+[u_{i}+v_{h}+w_{g}].
\end{equation*}
We will refer to this as Model A3, where we look at the effect of foul differential on the odds a foul is called on the home team, while accounting for three crossed random effects at Level Two (game, home team, and visiting team). Parameter estimates for Model A3 are given below:
```{r, include=FALSE}
# Multilevel model with only foul.diff and 1 random effect
model.a1 <- glmer(foul.home ~ foul.diff + (1|game),
family = binomial, data = refdata)
summary(model.a1)
anova(model.b1, model.a1)
```
```{r, comment=NA}
# Model A3
model.a3 <- glmer(foul.home ~ foul.diff + (1|game) +
(1|hometeam) + (1|visitor),
family = binomial, data = refdata)
```
```{r, echo=FALSE, message=FALSE}
VCrandom <- VarCorr(model.a3)
print(VCrandom, comp = c("Variance", "Std.Dev."))
cat(" Number of games = ",
summary(model.a3)$ngrps[1], "\n",
"Number of hometeams = ",
summary(model.a3)$ngrps[2], "\n",
"Number of visitors = ",
summary(model.a3)$ngrps[3])
coef(summary(model.a3))
cat(" AIC = ", AIC(model.a3), "; BIC = ", BIC(model.a3))
```
From this output, we obtain estimates of our five model parameters:
- $\hat{\alpha}_{0}=-0.188=$ the mean log odds of a home foul at the point where total fouls are equal between teams. In other words, when fouls are balanced between teams, the probability that a foul is called on the visiting team (.547) is 20.7\% ($1/e^{-.188}=1.207$) higher than the probability a foul is called on the home team (.453).
- $\hat{\beta}_{0}=-0.264=$ the decrease in mean log odds of a home foul for each 1 foul increase in the foul differential. More specifically, the odds the next foul is called on the visiting team rather than the home team increases by 30.2\% with each additional foul called on the home team ($1/e^{-.264}=1.302$).
- $\hat{\sigma}_{u}^{2}=0.172=$ the variance in intercepts from game-to-game.
- $\hat{\sigma}_{v}^{2}=0.068=$ the variance in intercepts among different home teams.
- $\hat{\sigma}_{w}^{2}=0.023=$ the variance in intercepts among different visiting teams.
Based on the z-value (-6.80) and p-value ($p<.001$) associated with foul differential in this model, we have significant evidence of a negative association between foul differential and the odds of a home team foul. That is, we have significant evidence that the odds that a foul is called on the home team shrinks as the home team has more total fouls compared with the visiting team. Thus, there seems to be preliminary evidence in the 2009-2010 data that college basketball referees tend to even out foul calls over the course of the first half. Of course, we have yet to adjust for other significant covariates.
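As noted earlier, one benefit of the random effects formulation is that estimated effects for individual teams can be extracted from the fitted model. The sketch below (not run here) assumes `model.a3` from the chunk above and pulls out the home team effects:

```{r, eval=FALSE}
# Estimated random effects for each home team; negative values
# indicate home teams less likely than average to be called for
# fouls when foul totals are even
re.home <- ranef(model.a3)$hometeam
re.home[order(re.home[, "(Intercept)"]), , drop = FALSE]
```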
## Parametric Bootstrap for Model Comparisons {#glmm-paraboot}
Our estimates of variance components provide evidence of the relative variability in the different Level Two random effects. For example, an estimated 65.4\% of variability in the intercepts is due to differences from game-to-game, while 25.9\% is due to differences among home teams, and 8.7\% is due to differences among visiting teams. At this point, we could reasonably ask: if we use a random effect to account for differences among games, does it really pay off to also account for which team is home and which is the visitor?
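The percentages above come directly from the three estimated variance components in Model A3; for example:

```{r}
# Share of Level Two intercept variance from each random effect,
# using the estimates reported for Model A3
sigma2 <- c(game = 0.172, hometeam = 0.068, visitor = 0.023)
round(sigma2 / sum(sigma2), 3)   # 0.654, 0.259, 0.087
```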
To answer this, we could compare models with and without the random effects for home and visiting teams (i.e., Model A3, the full model, vs. Model A1, the reduced model) using a likelihood ratio test. In Model A1, Level Two now looks like:
\begin{align*}
a_{i} & = \alpha_{0}+u_{i} \\
b_{i} & = \beta_{0},
\end{align*}
The likelihood ratio test (see below) provides significant evidence (LRT=16.074, df=2, p=.0003) that accounting for variability among home teams and among visiting teams improves our model.
```{r comment=NA, message=FALSE}
drop_in_dev <- anova(model.a3, model.a1, test = "Chisq")
```
```{r comment=NA, message=F, echo=F}
did_print <- data.frame(npar=drop_in_dev$npar,
AIC=drop_in_dev$AIC, BIC=drop_in_dev$BIC,
logLik=drop_in_dev$logLik, dev=drop_in_dev$deviance,
Chisq=drop_in_dev$Chisq, Df=drop_in_dev$Df,
pval=drop_in_dev$`Pr(>Chisq)`)
row.names(did_print) <- row.names(drop_in_dev)
did_print
# Model A3 preferred, but p-value can be off (ML, boundary issue)
```
In Section \@ref(multileveltechnical) we noted that REML estimation is typically used when comparing models that differ only in random effects for multilevel models with normal responses, but with multilevel generalized linear models, full maximum likelihood (ML) estimation procedures are typically used regardless of the situation (including in the function `glmer()` in R). So the likelihood ratio test above is based on ML methods---can we trust its accuracy? In Section \@ref(longitudinal-paraboot) we introduced the parametric bootstrap \index{parametric bootstrap} as a more reliable method in many cases for conducting hypothesis tests compared to the likelihood ratio test, especially when comparing full and reduced models that differ in their variance components. What would a parametric bootstrap test say about testing $H_{0}: \sigma_{v}^{2}=\sigma_{w}^{2}=0$ vs. $H_{A}:$ at least one of $\sigma_{v}^{2}$ or $\sigma_{w}^{2}$ is not equal to 0? Under the null hypothesis, since the two variance terms are being set to a value (0) on the boundary constraint, we would not expect the chi-square distribution to adequately approximate the behavior of the likelihood ratio test statistic.
Figure \@ref(fig:gmu-lrt1) illustrates the null distribution of the likelihood ratio test statistic derived by the parametric bootstrap procedure with 100 samples as compared to a chi-square distribution. As we observed in Section \@ref(longitudinal-paraboot), the parametric bootstrap provides a more reliable p-value in this case ($p<.001$ from output below) because a chi-square distribution puts too much mass in the tail and not enough near 0. However, the parametric bootstrap is computationally intensive, and it can take a long time to run even with moderately complex models. With this data, we would select our full Model A3 based on a parametric bootstrap test.
```{r, echo=FALSE, comment=NA, message=FALSE, warning=FALSE}
# run bootstrapAnova function, then line below takes about
# 15 minutes (much quicker with B=100 rather than B=1000)
# Parametric bootstrap code for lme4-models
# from Fabian Scheipl on stack exchange
#m0 is the lmer model under the null hypothesis
# (i.e. the smaller model)
#mA is the lmer model under the alternative
bootstrapAnova <- function(mA, m0, B=1000){
oneBootstrap <- function(m0, mA){
d <- drop(simulate(m0))
m2 <-refit(mA, newresp=d)
m1 <-refit(m0, newresp=d)
return(anova(m2,m1)$Chisq[2])
}
nulldist <- replicate(B, oneBootstrap(m0, mA))
ret <- anova(mA, m0)
ret$"Pr(>Chisq)"[2] <- mean(ret$Chisq[2] < nulldist)
names(ret)[8] <- "Pr_boot(>Chisq)"
attr(ret, "heading") <- c(attr(ret, "heading")[1],
paste("Parametric bootstrap with", B,"samples."),
attr(ret, "heading")[-1])
attr(ret, "nulldist") <- nulldist
return(ret)
}
#use like this (and increase B for more believability):
# bRLRT <- bootstrapAnova(mA=<BIG MODEL>, m0=<SMALLER MODEL>)
# Since this takes a while, it was run once and results
# stored in the data folder
# Compare Model A3 vs Model A1 with parametric bootstrap
# - took 15 minutes to run just 100 bootstrap samples
#set.seed(33)
#bRLRT = bootstrapAnova(mA=model.a3, m0=model.a1, B=100)
#bRLRT
#save(bRLRT, file = "data/modelaVS0a11.rda")
```
```{r, eval=FALSE}
bootstrapAnova(mA=model.a3, m0=model.a1, B=1000)
```
```{r comment=NA, message=F, echo=F}
load(file = "data/modelaVS0a11.rda")
drop_in_dev <- bRLRT
did_print <- data.frame(Df=drop_in_dev$Df,
logLik=drop_in_dev$logLik, dev=drop_in_dev$deviance,
Chisq=drop_in_dev$Chisq, ChiDf=drop_in_dev$`Chi Df`,
pval_boot=drop_in_dev$`Pr_boot(>Chisq)`)
row.names(did_print) <- row.names(drop_in_dev)
did_print
```
```{r, gmu-lrt1, fig.align="center",out.width="60%", fig.cap='Null distribution of likelihood ratio test statistic comparing Models A3 and A1 derived using parametric bootstrap with 100 samples (histogram) compared to a chi-square distribution with 2 degrees of freedom (smooth curve). The vertical line represents the observed likelihood ratio test statistic.',echo=FALSE, warning=FALSE}
nullLRT = attr(bRLRT,"nulldist")
x=seq(0,max(nullLRT),length=100)
y=dchisq(x,2)
nullLRT.1 <- as.data.frame(cbind(nullLRT=nullLRT,x=x,y=y))
ggplot(nullLRT.1) +
geom_histogram(aes(x=nullLRT,y=..density..),
binwidth=1,color="black",fill="white")+
geom_vline(xintercept=16.074,size=1) +
theme.1 +
geom_line(aes(x=x,y=y)) +
labs(x="Likelihood Ratio Test Statistics from Null Distribution",y="Density")
# sum(nullLRT>=16.074)/100 # parametric bootstrap p-value
```
We might also reasonably ask: is it helpful to allow slopes (coefficients for foul differential) to vary by game, home team, and visiting team as well? Again, since we are comparing models that differ in random effects, and since the null hypothesis involves setting random effects at their boundaries, we use the parametric bootstrap. Formally, we are comparing Model A3 to Model B3, which has the same Level One equation as Model A3:
\[ \log\bigg(\frac{p_{i[gh]j}}{1-p_{i[gh]j}}\bigg)=a_{i}+b_{i}\mathrm{foul.diff}_{ij} \]
but 6 variance components to estimate at Level Two:
\begin{align*}
a_{i} & = \alpha_{0}+u_{i}+v_{h}+w_{g} \\
b_{i} & = \beta_{0}+z_{i}+r_{h}+s_{g},
\end{align*}
where error terms (random effects) at Level Two can be assumed to follow independent normal distributions:
\begin{align*}
u_{i} & \sim N \left( 0 , \sigma_{u}^{2} \right) \\
z_{i} & \sim N \left( 0 , \sigma_{z}^{2} \right) \\
v_{h} & \sim N \left( 0 , \sigma_{v}^{2} \right) \\
r_{h} & \sim N \left( 0 , \sigma_{r}^{2} \right) \\
w_{g} & \sim N \left( 0 , \sigma_{w}^{2} \right) \\
s_{g} & \sim N \left( 0 , \sigma_{s}^{2} \right).
\end{align*}
Thus our null hypothesis for comparing Model A3 vs. Model B3 is $H_{0}: \sigma_{z}^{2}=\sigma_{r}^{2}=\sigma_{s}^{2}=0$. We do not have significant evidence (LRT=0.349, df=3, p=.46 by parametric bootstrap) of variability among slopes, so we will only include random effects for game, home team, and visiting team for the intercept going forward. Figure \@ref(fig:gmu-lrt3) illustrates the null distribution of the likelihood ratio test statistic derived by the parametric bootstrap procedure as compared to a chi-square distribution, again showing that the tails are too heavy in the chi-square distribution.
```{r, 11verb8a, echo=FALSE, comment=NA, include=FALSE}
# Model B3 (Multilevel model with only foul.diff and
# all 3 random effects applied to both intercepts and
# slopes but no correlations)
model.b3 <- glmer(foul.home ~ foul.diff + (1|game) +
(1|hometeam) + (1|visitor) + (0+foul.diff|game) +
(0+foul.diff|hometeam) + (0+foul.diff|visitor),
family = binomial, data = refdata)
summary(model.b3)
# Compare Model A3 vs Model B3 with parametric bootstrap
# - took 45 minutes to run just 100 bootstrap samples
#set.seed(33)
#bRLRT = bootstrapAnova(mA=model.b3, m0=model.a3, B=100)
#bRLRT
#save(bRLRT, file = "data/modela3VSa11.rda")
```
```{r, eval=FALSE}
bootstrapAnova(mA=model.b3, m0=model.a3, B=1000)
```
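The `bootstrapAnova()` helper is defined earlier in the book; its core logic can be sketched with `lme4`'s `simulate()` and `refit()` (the function name `pb_lrt` and its default `B` are ours, not part of the book's code):

```{r, eval=FALSE}
# Hedged sketch of the parametric bootstrap behind bootstrapAnova():
# simulate responses under the null model, refit both models to each
# simulated data set, and compare the observed LRT statistic to the
# resulting null distribution.
pb_lrt <- function(m0, mA, B = 100) {
  obs <- as.numeric(2 * (logLik(mA) - logLik(m0)))
  null_lrt <- replicate(B, {
    y_star <- simulate(m0)[[1]]     # new responses under H0
    m0_star <- refit(m0, y_star)    # refit null model
    mA_star <- refit(mA, y_star)    # refit alternative model
    as.numeric(2 * (logLik(mA_star) - logLik(m0_star)))
  })
  c(LRT = obs, p_boot = mean(null_lrt >= obs))
}
```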
```{r comment=NA, message=F, echo=F}
load(file = "data/modela3VSa11.rda")
drop_in_dev <- bRLRT
did_print <- data.frame(Df=drop_in_dev$Df,
logLik=drop_in_dev$logLik, dev=drop_in_dev$deviance,
Chisq=drop_in_dev$Chisq, ChiDf=drop_in_dev$`Chi Df`,
pval_boot=drop_in_dev$`Pr_boot(>Chisq)`)
row.names(did_print) <- row.names(drop_in_dev)
did_print
```
```{r gmu-lrt3, fig.align="center",out.width="60%", fig.cap='Null distribution of likelihood ratio test statistic comparing Models A3 and B3 derived using parametric bootstrap with 100 samples (histogram) compared to a chi-square distribution with 3 degrees of freedom (smooth curve). The vertical line represents the observed likelihood ratio test statistic.',echo=FALSE, warning=FALSE}
nullLRT <- attr(bRLRT, "nulldist")
x <- seq(0, max(nullLRT), length = 100)
y <- dchisq(x, 3)
nullLRT.1 <- data.frame(nullLRT = nullLRT, x = x, y = y)
ggplot(nullLRT.1) +
  geom_histogram(aes(x = nullLRT, y = after_stat(density)), binwidth = 1,
    color = "black", fill = "white") +
  geom_vline(xintercept = 0.349, size = 1) +
  theme.1 +
  geom_line(aes(x = x, y = y)) +
  labs(x = "Likelihood Ratio Test Statistics from Null Distribution",
    y = "Density")
# sum(nullLRT>=0.349)/100 # parametric bootstrap p-value
```
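For comparison, the p-value that a naive chi-square reference distribution would assign to the observed statistic can be computed directly (the statistic 0.349 and df = 3 come from the test above):

```{r, eval=FALSE}
# Naive chi-square p-value for the observed LRT statistic; because the
# null hypothesis places variance components on the boundary of the
# parameter space, this p-value is inflated relative to the parametric
# bootstrap p-value of .46
pchisq(0.349, df = 3, lower.tail = FALSE)
```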
\vspace{1cm}
Note that we could also have allowed for a correlation between the error terms for the intercept and slope by game, home team, or visiting team -- e.g., assuming for the game-level effects:
\[ \left[ \begin{array}{c}
u_{i} \\ z_{i}
\end{array} \right] \sim N \left( \left[
\begin{array}{c}
0 \\ 0
\end{array} \right], \left[
\begin{array}{cc}
\sigma_{u}^{2} & \\
\sigma_{uz} & \sigma_{z}^{2}
\end{array} \right] \right) \]
while error terms associated with different games, home teams, and visiting teams remain independent. This new model would have 6 additional parameters compared with Model A3 (3 variance terms and 3 covariance terms). By the parametric bootstrap, there is no significant evidence that the model with these 6 additional parameters is necessary (LRT=6.49, df=6, p=.06). The associated p-value based on a likelihood ratio test with an approximate chi-square distribution is .370, reflecting once again the overly heavy tails of the chi-square distribution.
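In `lme4` syntax, the distinction between the two models is whether the intercept and slope appear in the same random-effects term; a sketch using the chapter's variable names (formula objects only, not fitted models):

```{r, eval=FALSE}
# Uncorrelated intercept and slope by game: separate terms fix the
# covariance sigma_uz at 0 (equivalently, (foul.diff || game))
foul.home ~ foul.diff + (1 | game) + (0 + foul.diff | game)
# Correlated: a single term per grouping factor also estimates sigma_uz
foul.home ~ foul.diff + (1 + foul.diff | game)
```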
```{r, 11verb9, echo=FALSE, comment=NA, include=FALSE}
# Model B6 (Multilevel model with only foul.diff and
# all 3 random effects applied to both intercepts and
# slopes along with correlations)
model.b6 <- glmer(foul.home ~ foul.diff + (1+foul.diff|game) +
(1+foul.diff|hometeam) + (1+foul.diff|visitor),
family = binomial, data = refdata)
summary(model.b6)
# Compare Model A3 vs Model B6 with parametric bootstrap
# - took many hours to run just 100 bootstrap samples
#set.seed(33)
#bRLRT = bootstrapAnova(mA=model.b6, m0=model.a3, B=100)
#bRLRT
#save(bRLRT, file = "data/modela6VSa11.rda")
load(file = "data/modela6VSa11.rda")
bRLRT
nullLRT = attr(bRLRT,"nulldist")
x=seq(0,max(nullLRT),length=100)
y=dchisq(x,6)