---
title: "Chapter 10"
subtitle: "Multilevel Data with More Than Two Levels"
output:
  html_document: default
  pdf_document:
    number_sections: yes
---
# Multilevel Data With More Than Two Levels {#ch-3level}

## Learning Objectives

After finishing this chapter, you should be able to:

- Extend the standard multilevel model to cases with more than two levels.
- Apply exploratory data analysis techniques specific to data from more than two levels.
- Formulate multilevel models including the variance-covariance structure.
- Build and understand a taxonomy of models for data with more than two levels.
- Interpret parameters in models with more than two levels.
- Develop strategies for handling an exploding number of parameters in multilevel models.
- Recognize when a fitted model has encountered boundary constraints and understand strategies for moving forward.
- Apply a parametric bootstrap test of significance to appropriate situations with more than two levels.
```{r load_packages10, message = FALSE, warning = FALSE}
# Packages required for Chapter 10
library(knitr)
library(gridExtra)
library(GGally)
library(mice)
library(nlme)
library(lme4)
library(mnormt)
library(boot)
library(HLMdiag)
library(kableExtra)
library(pander)
library(tidyverse)
```
```{r, include=F}
if(knitr::is_html_output()){options(knitr.table.format = "html")} else {options(knitr.table.format = "latex")}
```

## Case Studies: Seed Germination {#cs:seeds}

It is estimated that 82-99\% of historic tallgrass prairie ecosystems have been converted to agricultural use [@Baer2002]. A prime example of this large-scale conversion of native prairie to agricultural purposes can be seen in Minnesota, where less than 1\% of the prairies that once existed in the state remain [@Camill2004]. Such large-scale alteration of prairie communities has been associated with numerous problems. For example, erosion and decomposition that readily take place in cultivated soils have increased atmospheric $\textrm{CO}_2$ levels and increased nitrogen inputs to adjacent waterways (@Baer2002, @Camill2004, @Knops2000). In addition, cultivation practices are known to affect rhizosphere composition as tilling can disrupt networks of soil microbes [@Allison2005]. The rhizosphere is the narrow region of soil that is directly influenced by root secretions and associated soil microorganisms; much of the nutrient cycling and disease suppression needed by plants occurs immediately adjacent to roots. It is important to note that microbial communities in prairie soils have been implicated in plant diversity and overall ecosystem function by controlling carbon and nitrogen cycling in the soils [@Zak2003].
There have been many responses to these claims, but one response in recent years is reconstruction of the native prairie community. These reconstruction projects provide a new habitat for a variety of native prairie species, yet it is important to know as much as possible about the outcomes of prairie reconstruction projects in order to ensure that a functioning prairie community is established. The ecological repercussions resulting from prairie reconstruction are not well known. For example, all of the aforementioned changes associated with cultivation practices are known to affect the subsequent reconstructed prairie community (@Baer2002, @Camill2004), yet there are few explanations for this phenomenon. For instance, prairies reconstructed in different years (using the same seed combinations and dispersal techniques) have yielded disparate prairie communities.
Researchers at a small midwestern college decided to experimentally explore the underlying causes of variation in reconstruction projects in order to make future projects more effective. Introductory ecology classes were organized to collect longitudinal data on native plant species grown in a greenhouse setting, using soil samples from surrounding lands [@Angell2010]. We will examine their data to compare germination and growth of two species of prairie plants---leadplants (*Amorpha canescens*) and coneflowers (*Ratibida pinnata*)---in soils taken from a remnant (natural) prairie, a cultivated (agricultural) field, and a restored (reconstructed) prairie. Additionally, half of the sampled soil was sterilized to determine if rhizosphere differences were responsible for the observed variation, so we will examine the effects of sterilization as well.
The data we'll examine was collected through an experiment run using a 3x2x2 factorial design, with 3 levels of soil type (remnant, cultivated, and restored), 2 levels of sterilization (yes or no), and 2 levels of species (leadplant and coneflower). Each of the 12 treatments (unique combinations of factor levels) was replicated in 6 pots, for a total of 72 pots. Six seeds were planted in each pot (although a few pots had 7 or 8 seeds), and initially student researchers recorded days to germination (defined as when two leaves are visible), if germination occurred. In addition, the height of each germinated plant (in mm) was measured at 13, 18, 23, and 28 days after planting. The study design is illustrated in Figure \@ref(fig:seedstudy).
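As a quick check on the counts above, the treatment structure can be enumerated directly in R. This sketch is illustrative and not part of the original study code; the factor labels are those used later in `seeds2.csv`.

```{r designcheck, eval=FALSE}
# Sketch (not from the original analysis): enumerate the
# 3x2x2 factorial treatment combinations
design <- expand.grid(soil = c("CULT", "REM", "STP"),
                      sterile = c("Y", "N"),
                      species = c("L", "C"))
nrow(design)       # 12 unique treatment combinations
nrow(design) * 6   # 72 pots, with 6 replicate pots per treatment
```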
```{r, seedstudy, out.width='80%', fig.show='hold', fig.cap="The design of the seed germination study.", echo=FALSE}
knitr::include_graphics("data/StudyDesignDiagram.PNG")
# Can also put these lines outside of R chunk
# ![The design of the seed germination study](data/StudyDesignDiagram.PNG)
```

## Initial Exploratory Analyses {#explore3}

### Data Organization {#organizedata3}

Data for Case Study \@ref(cs:seeds) in `seeds2.csv` contains the following variables:

- `pot` = Pot plant was grown in (1-72)
- `plant` = Unique plant identification number
- `species` = L for leadplant and C for coneflower
- `soil` = STP for reconstructed prairie, REM for remnant prairie, and CULT for cultivated land
- `sterile` = Y for yes and N for no
- `germin` = Y if plant germinated, N if not
- `hgt13` = height of plant (in mm) 13 days after seeds planted
- `hgt18` = height of plant (in mm) 18 days after seeds planted
- `hgt23` = height of plant (in mm) 23 days after seeds planted
- `hgt28` = height of plant (in mm) 28 days after seeds planted

This data is stored in wide format, with one row per plant (see 12 sample plants in Table \@ref(tab:table1chp10)). As we have done in previous multilevel analyses, we will convert to long format (one observation per plant-time combination) after examining the missing data pattern and removing any plants with no growth data. In this case, we are almost assuredly losing information by removing plants with no height data at all four time points, since these plants did not germinate, and there may well be differences between species, soil type, and sterilization with respect to germination rates. We will handle this possibility by analyzing germination rates separately (see Chapter \@ref(ch-GLMM)); the analysis in this chapter will focus on effects of species, soil type, and sterilization on initial growth and growth rate among plants that germinate.
```{r, include=FALSE}
#Getting started
#great base theme for ggplots
theme.1 <- theme(axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
plot.title=element_text(hjust=.9,face="italic",size=12))
# seedwd <- read.csv(file=file.choose()) # seeds2.csv
seedwd <- read.csv("data/seeds2.csv")
seedwd[135:146,2:11] # illustrate wide data
```
```{r table1chp10, echo=FALSE}
table1chp10 <- seedwd[135:146,2:11] # illustrate wide data
kable(table1chp10, booktabs=T,
caption="A snapshot of data (Plants 231-246) from the Seed Germination case study in wide format.") %>%
kable_styling(latex_options = "scale_down")
```
Although the experimental design called for $72 \times 6 = 432$ plants, the wide data set has 437 plants because a few pots had more than six plants (likely because two of the microscopically small seeds stuck together when planted). Of those 437 plants, 154 had no height data (did not germinate by the 28th day) and were removed from analysis (for example, see rows 145-146 in Table \@ref(tab:table1chp10)). A total of 248 plants had complete height data (e.g., rows 135-139 and 141-143), 13 germinated later than the 13th day but had complete heights once they germinated (e.g., row 144), and 22 germinated and had measurable height on the 13th day but died before the 28th day (e.g., row 140). Ultimately, the long data set contains 1132 unique observations where plant heights were recorded; representation of plants 236-242 in the long data set can be seen in Table \@ref(tab:table2chp10).
```{r, include=FALSE}
# Investigate patterns of missingness
md.pattern(seedwd)
# Remove plants that did not germinate
seedwd <- seedwd %>%
mutate(nope = is.na(hgt13) + is.na(hgt18) + is.na(hgt23) +
is.na(hgt28)) %>%
filter(nope < 4)
# Create data frame in LONG form (one obs per plant
# measurement time)
seedlg <- seedwd %>%
gather(key = time, value = hgt, hgt13:hgt28) %>%
mutate(time = as.integer(str_sub(time, -2)),
time13 = time - 13 )
```
```{r table2chp10, echo=FALSE}
#tablelong
tab2chp10 <- seedlg %>% filter(plant %in% c(236:242)) %>%
arrange(plant, time13)
tab2chp10df <- data.frame(tab2chp10)
table2chp10 <- tab2chp10df[,c(2,3,4,5,6,7,10,11)]
kable(table2chp10, booktabs=T,
caption="A snapshot of data (Plants 236-242) from the Seed Germination case study in long format.")
```
Notice the __three-level structure__ \index{three-level structure} of this data. Treatments (levels of the three experimental factors) were assigned at the `pot` level, then multiple plants were grown in each pot, and multiple measurements were taken over time for each plant. Our multilevel analysis must therefore account for pot-to-pot variability in height measurements (which could result from factor effects), plant-to-plant variability in height within a single pot, and variability over time in height for individual plants. In order to fit such a three-level model, we must extend the two-level model which we have used thus far.
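One way to see this nesting concretely is to count units at each level. The following sketch (illustrative, not from the original analysis) assumes the `seedlg` long data frame created above.

```{r threelevelcheck, eval=FALSE}
# Sketch: count units at each level of the three-level structure
seedlg %>%
  summarise(n_pots = n_distinct(pot),      # Level Three: pots
            n_plants = n_distinct(plant),  # Level Two: plants within pots
            n_obs = sum(!is.na(hgt)))      # Level One: measurements over time
```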

### Exploratory Analyses {#explore3v2}

We start by taking an initial look at the effect of Level Three covariates (factors applied at the pot level: species, soil type, and sterilization) on plant height, pooling observations across pot, across plant, and across time of measurement within plant. First, we observe that the initial balance which existed after randomization of pot to treatment no longer holds. After removing plants that did not germinate (and therefore had no height data), more height measurements exist for coneflowers (n=704, compared to 428 for leadplants), soil from restored prairies (n=524, compared to 288 for cultivated land and 320 for remnant prairies), and unsterilized soil (n=612, compared to 520 for sterilized soil). This imbalance indicates possible factor effects on germination rate; we will take up those hypotheses in Chapter \@ref(ch-GLMM). In this chapter, we will focus on the effects of species, soil type, and sterilization on the growth patterns of plants that germinate.
Because we suspect that height measurements over time for a single plant are highly correlated, while height measurements from different plants from the same pot are less correlated, we calculate mean height per plant (over all available time points) before generating exploratory plots investigating Level Three factors. Figure \@ref(fig:boxbyspec) then examines the effects of soil type and sterilization separately by species. Sterilization seems to have a bigger benefit for coneflowers, while soil from remnant prairies seems to lead to smaller leadplants and taller coneflowers.
```{r, include=FALSE}
#getting started-2
# create indicator variables for later analyses
seedlg <- seedlg %>%
mutate(cult=ifelse(soil=="CULT",1,0),
rem=ifelse(soil=="REM",1,0),
stp=ifelse(soil=="STP",1,0),
lead=ifelse(species=="L",1,0),
cone=ifelse(species=="C",1,0),
strl=ifelse(sterile=="Y",1,0),
nostrl=ifelse(sterile=="N",1,0) )
# create separate data sets of leadplants and coneflowers
conedata <- seedlg %>%
filter(cone==1)
leaddata <- seedlg %>%
filter(lead==1)
## Exploratory data analysis ##
# Rough look at Level Three covariates
seedlg %>% count(species)
seedlg %>% count(soil)
seedlg %>% count(sterile)
# From here on, do everything separately for leadplants and coneflowers #
# Add average across all time points for each plant for EDA plots
meanplant <- seedlg %>% group_by(plant) %>%
summarise(meanplant = mean(hgt, na.rm = TRUE))
seedwd <- seedwd %>%
left_join(meanplant, by = "plant")
conewd <- seedwd %>%
filter(species=="C")
leadwd <- seedwd %>%
filter(species=="L")
```
```{r boxbyspec, fig.align="center",out.width="60%", fig.cap='Plant height comparisons of (a) soil type and (b) sterilization within species. Each plant is represented by the mean height over all measurements at all time points for that plant.',echo=FALSE, warning=FALSE}
# One obs per plant - assume plants relatively independent
# within pots
cone.soil <- ggplot(conedata,aes(x=soil,y=hgt)) +
geom_boxplot() +
theme.1 +
labs(x="Soil type",y="Plant Height (mm)",
title="Coneflowers (a)")
cone.sterile <- ggplot(conedata,aes(x=sterile,y=hgt)) +
geom_boxplot() +
theme.1 +
labs(x="Sterilized",y="Plant Height (mm)",
title="Coneflowers (b)")
lead.soil <- ggplot(leaddata,aes(x=soil,y=hgt)) +
geom_boxplot() +
theme.1 +
labs(x="Soil type",y="Plant Height (mm)",
title="Leadplants (a)")
lead.sterile <- ggplot(leaddata,aes(x=sterile,y=hgt)) +
geom_boxplot() +
theme.1 +
labs(x="Sterilized",y="Plant Height (mm)",
title="Leadplants (b)")
grid.arrange(cone.soil, cone.sterile, lead.soil,
lead.sterile, ncol=2)
```
We also use spaghetti plots to examine time trends within species to see (a) if it is reasonable to assume linear growth between Day 13 and Day 28 after planting, and (b) if initial height and rate of growth are similar in the two species. Figure \@ref(fig:spagbyspec) illustrates differences between species. While both species have similar average heights 13 days after planting, coneflowers appear to have faster early growth which slows later, while leadplants have a more linear growth rate which culminates in greater average heights 28 days after planting. Coneflowers also appear to have greater variability in initial height and growth rate, although there are more coneflowers with height data.
```{r,spagbyspec,fig.align="center",out.width="60%", fig.cap='Spaghetti plot by species with loess fit. Each line represents one plant.',echo=FALSE, warning=FALSE, message=FALSE}
# spaghetti plot by species - all plants;
# add loess smoothing curve for overall trend
seedlg <- seedlg %>%
  mutate(speciesname = ifelse(species=="C","Coneflowers",
                              "Leadplants") )
oldw <- getOption("warn")
options(warn = -1)
ggplot(seedlg,aes(x=time,y=hgt)) +
  geom_line(aes(group=plant),color="dark grey") +
  facet_wrap(~speciesname,ncol=2) +
  geom_smooth(se=FALSE,color="black") +
  theme.1 +
  labs(x="Days since seeds planted",y="Plant height (mm)")
options(warn = oldw)
```
Exploratory analyses such as these confirm the suspicions of biology researchers that leadplants and coneflowers should be analyzed separately. Because of biological differences, it is expected that these two species will show different growth patterns and respond differently to treatments such as sterilization. Coneflowers are members of the aster family, growing up to 4 feet tall with their distinctive gray seed heads and drooping yellow petals. Leadplants, on the other hand, are members of the bean family, with purple flowers, a height of 1 to 3 feet, and compound grayish green leaves which look to be dusted with white lead. Leadplants have deep root systems and are symbiotic N-fixers, which means they might experience stifled growth in sterilized soil compared with other species. For the remainder of this chapter, we will focus on __leadplants__ and how their growth patterns are affected by soil type and sterilization. You will have a chance to analyze coneflower data later in the Exercises section.
Lattice plots, illustrating several observational units simultaneously, each with fitted lines where appropriate, are also valuable to examine during the exploratory analysis phase. Figure \@ref(fig:lattlpall) shows height over time for 24 randomly selected leadplants that germinated in this study, with a fitted linear regression line. Linearity appears reasonable in most cases, although there is some variability in the intercepts and a good deal of variability in the slopes of the fitted lines. These intercepts and slopes by plant, of course, will be potential parameters in a multilevel model which we will fit to this data. Given the three-level nature of this data, it is also useful to examine a spaghetti plot by pot (Figure \@ref(fig:spaglppot)). While linearity appears to reasonably model the average trend over time within pot, we see differences in the plant-to-plant variability within pot, but some consistency in intercept and slope from pot to pot.
```{r, lattlpall, fig.align="center",out.width="60%", fig.cap='Lattice plot of linear trends fit to 24 randomly selected leadplants. Two plants with only a single height measurement have no associated regression line.',echo=FALSE, warning=FALSE, message=FALSE}
# Plot time trend for 25 randomly selected leadplants
# - linear fit
set.seed(1313)
randplant = as.vector(sample(leadwd$plant,size=25))
seedrdl <- seedlg %>%
filter(plant %in% randplant)
ggplot(seedrdl, aes(x=time, y=hgt)) +
facet_wrap(~plant,ncol=5) +
geom_point() +
geom_smooth(method="lm") +
theme.1 +
theme(strip.text.x=element_blank()) +
labs(x="Time",y="Plant height (mm)")
```
```{r, spaglppot, fig.align="center",out.width="60%", fig.cap='Spaghetti plot for leadplants by pot with loess fit.',echo=FALSE, warning=FALSE, message=FALSE}
# spaghetti plot by pot - add loess smoothing curve
# for overall trend
oldw <- getOption("warn")
options(warn = -1)
ggplot(leaddata, aes(x=time, y=hgt)) +
geom_line(aes(group=plant), color="dark grey") +
theme.1 +
facet_wrap(~pot, ncol=7) +
geom_smooth(se=FALSE, color="black") +
labs(x="Days since seeds planted",y="Plant height (mm)")
options(warn = oldw)
```
Spaghetti plots can also be an effective tool for examining the potential effects of soil type and sterilization on growth patterns of leadplants. Figure \@ref(fig:spaglpsoil) and Figure \@ref(fig:spaglpster) illustrate how the growth patterns of leadplants depend on soil type and sterilization. In general, we observe slower growth in soil from remnant prairies and soil that has not been sterilized.
```{r, spaglpsoil, fig.align="center",out.width="60%", fig.cap='Spaghetti plot for leadplants by soil type with loess fit.',echo=FALSE, warning=FALSE, message=FALSE}
leaddata <- leaddata %>%
mutate(soilname = recode(soil, STP="Reconstructed",
CULT="Cultivated", REM="Remnant") )
options(warn = -1)
ggplot(leaddata,aes(x=time,y=hgt)) +
geom_line(aes(group=plant),color="dark grey") +
facet_wrap(~soilname,ncol=3) +
geom_smooth(se=FALSE,color="black") +
theme.1 +
labs(x="Days since seeds planted",y="Plant height (mm)")
options(warn = oldw)
```
```{r, spaglpster, fig.align="center",out.width="60%", fig.cap='Spaghetti plot for leadplants by sterilization with loess fit.',echo=FALSE, warning=FALSE, message=FALSE}
leaddata <- leaddata %>%
mutate(sterilename = recode(sterile, Y="Sterilized",
N="Not Sterilized") )
options(warn = -1)
ggplot(leaddata,aes(x=time,y=hgt)) +
geom_line(aes(group=plant),color="dark grey") +
facet_wrap(~sterilename,ncol=2) +
geom_smooth(se=FALSE,color="black") +
theme.1 +
labs(x="Days since seeds planted",y="Plant height (mm)")
options(warn = oldw)
```
We can further explore the variability in linear growth among plants and among pots by fitting regression lines and examining the estimated intercepts and slopes, as well as the corresponding $R^2$ values. Figures \@ref(fig:intrateplant) and \@ref(fig:intratepot) provide just such an analysis, where Figure \@ref(fig:intrateplant) shows results of fitting lines by plant, and Figure \@ref(fig:intratepot) shows results of fitting lines by pot. Certain caveats accompany these summaries. In the case of fitted lines by plant, each plant is given equal weight regardless of the number of observations (2-4) for a given plant, and in the case of fitted lines by pot, a line is estimated by simply pooling all observations from a given pot, ignoring the plant from which the observations came, and equally weighting pots regardless of how many plants germinated and survived to Day 28. Nevertheless, the summaries of fitted lines provide useful information. When fitting regression lines by plant, we see a mean intercept of 1.52 (SD=0.66), indicating an estimated average height at 13 days of 1.5 mm, and a mean slope of 0.114 mm per day of growth from Days 13 to 28 (SD=0.059). Most R-squared values were strong (e.g., 84\% were above 0.8). Summaries of fitted regression lines by pot show similar mean intercepts (1.50) and slopes (0.107), but somewhat less variability pot-to-pot than we observed plant-to-plant (SD=0.46 for intercepts and SD=0.050 for slopes).
```{r, include=FALSE}
# Summary stats for linear models by plant - use centered time
hgt.list=lmList(hgt~time13 | plant, data=leaddata, na.action=na.exclude)
# coef(hgt.list)
int = as.matrix(coef(hgt.list))[,1]
summary(int) # summary statistics for 107 intercepts
rate = as.matrix(coef(hgt.list))[,2]
summary(rate)
rsq <- by(leaddata, leaddata$plant, function(data)
summary(lm(hgt ~ time13, data = data,na.action=na.exclude))$r.squared)
summary(rsq)
sum(rsq[!is.na(rsq)]>=.8)/length(rsq[!is.na(rsq)])
int.rate.rsq <- as.data.frame(cbind(int,rate,rsq))
int.hist <- ggplot(int.rate.rsq,aes(x=int)) +
geom_histogram(binwidth=0.5,color="black",fill="white") +
theme.1 + labs(x="Intercepts",y="Frequency",title="(a)")
rate.hist <- ggplot(int.rate.rsq,aes(x=rate)) +
geom_histogram(binwidth=0.05,color="black",fill="white") +
theme.1 + labs(x="Slopes",y="Frequency",title="(b)")
rsq.hist <- ggplot(int.rate.rsq,aes(x=rsq)) +
geom_histogram(binwidth=0.1,color="black",fill="white") +
theme.1 + labs(x="R-squared values",y="Frequency",title="(c)")
```
```{r, intrateplant, fig.align="center",out.width="60%", fig.cap=' Histograms of (a) intercepts, (b) slopes, and (c) R-squared values for linear fits across all leadplants.',echo=FALSE, warning=FALSE}
grid.arrange(int.hist,rate.hist,rsq.hist,ncol=2)
```
```{r, include=FALSE}
# Descriptive statistics of the estimates obtained by
# fitting the linear model by plant.
mean(int)
sd(int)
mean(rate, na.rm=T)
sd(rate, na.rm=T)
cor(int, rate, use="complete.obs")
# Summary stats for linear models by pot
hgt2.list=lmList(hgt~time13 | pot, data=leaddata, na.action=na.exclude)
# coef(hgt2.list)
int2 = as.matrix(coef(hgt2.list))[,1]
summary(int2) # summary statistics for 32 intercepts
rate2 = as.matrix(coef(hgt2.list))[,2]
summary(rate2)
rsq2 <- by(leaddata, leaddata$pot, function(data)
summary(lm(hgt ~ time13, data = data,na.action=na.exclude))$r.squared)
summary(rsq2)
int.rate.rsq2 <- as.data.frame(cbind(int2,rate2,rsq2))
int2.hist <- ggplot(int.rate.rsq2,aes(x=int2)) +
geom_histogram(binwidth=0.5,color="black",fill="white") +
theme.1 + labs(x="Intercepts",y="Frequency",title="(a)")
rate2.hist <- ggplot(int.rate.rsq2,aes(x=rate2)) +
geom_histogram(binwidth=0.05,color="black",fill="white") +
theme.1 + labs(x="Slopes",y="Frequency",title="(b)")
rsq2.hist <- ggplot(int.rate.rsq2,aes(x=rsq2)) +
geom_histogram(binwidth=0.2,color="black",fill="white") +
theme.1 + labs(x="R-squared values",y="Frequency",title="(c)")
```
```{r, intratepot, fig.align="center",out.width="60%", fig.cap='Histograms of (a) intercepts, (b) slopes, and (c) R-squared values for linear fits across all pots with leadplants.',echo=FALSE, warning=FALSE}
grid.arrange(int2.hist,rate2.hist,rsq2.hist,ncol=2)
```
Another way to examine variability due to plant vs. variability due to pot is through summary statistics. Plant-to-plant variability can be estimated by averaging standard deviations from each pot (.489 for intercepts and .039 for slopes), while pot-to-pot variability can be estimated by finding the standard deviation of average intercept (.478) or slope (.051) within pot. Based on these rough measurements, variability due to plants and pots is comparable.
Fitted lines by plant and pot are modeled using a centered time variable (`time13`), adjusted so that the first day of height measurements (13 days after planting) corresponds to `time13`=0. This centering has two primary advantages. First, the estimated intercept becomes more interpretable. Rather than representing height on the day of planting (which should be 0 mm, but which represents a hefty extrapolation from our observed range of days 13 to 28), the intercept now represents height on Day 13. Second, the intercept and slope are much less correlated (r=-0.16) than when uncentered time is used, which improves the stability of future models.
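The payoff from centering can be checked directly. This sketch, which assumes the `leaddata`, `int`, and `rate` objects defined earlier, refits the plant-level lines with uncentered `time` for comparison; it is illustrative rather than part of the original analysis.

```{r centeringcheck, eval=FALSE}
# Sketch: intercept-slope correlation with uncentered time ...
hgt.unc <- lmList(hgt ~ time | plant, data = leaddata,
                  na.action = na.exclude)
cor(coef(hgt.unc)[,1], coef(hgt.unc)[,2], use = "complete.obs")
# ... versus the centered fit (time13) from above,
# which the text reports as approximately -0.16
cor(int, rate, use = "complete.obs")
```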
Fitted intercepts and slopes by plant can be used for an additional exploratory examination of factor effects to complement those from the earlier spaghetti plots. Figure \@ref(fig:irboxbyspec) complements Figure \@ref(fig:spagbyspec), again showing differences between species---coneflowers tend to start smaller and have slower growth rates, although they have much more variability in growth patterns than leadplants. Returning to our focus on leadplants, Figure \@ref(fig:irboxbysoil) shows that plants grown in soil from cultivated fields tend to be taller at Day 13, and plants grown in soil from remnant prairies tend to grow more slowly than plants grown in other soil types. Figure \@ref(fig:irboxbyster) shows the strong tendency for plants grown in sterilized soil to grow faster than plants grown in non-sterilized soil. We will soon see if our fitted multilevel models support these observed trends.
```{r, include=FALSE}
# Descriptive statistics of the estimates obtained by
# fitting the linear model by pot.
mean(int2)
sd(int2)
mean(rate2, na.rm=T)
sd(rate2, na.rm=T)
cor(int2, rate2, use="complete.obs")
# Try to compare variability between plants and between pots
# Plus ANOVA-type boxplot of within pot vs within
# plant variability
seedwdplus <- seedwd %>%
filter(species=="L") %>%
mutate(int = int, rate = rate)
bypot <- seedwdplus %>%
group_by(pot) %>%
summarise(sd_int = unname(sd(int)),
mean_int = unname(mean(int)),
sd_rate = unname(sd(rate)),
mean_rate = unname(mean(rate)) )
# intercept variability between plants = .489
mean(bypot$sd_int,na.rm=T)
# rate variability between plants = .039
mean(bypot$sd_rate,na.rm=T)
# intercept variability between pots = .478
sd(bypot$mean_int,na.rm=T)
# rate variability between pots = .051
sd(bypot$mean_rate,na.rm=T)
# Boxplots to compare ints and rates by factor levels
hgt.listc=lmList(hgt~time13 | plant, data=conedata, na.action=na.exclude)
intc = as.matrix(coef(hgt.listc))[,1]
ratec = as.matrix(coef(hgt.listc))[,2]
allfits <- tibble(int = c(int, intc),
rate = c(rate, ratec),
species = c(rep("Leadplants", 107),
rep("Coneflowers", 176)) )
```
```{r, irboxbyspec,fig.align="center",out.width="60%", fig.cap='Boxplots of (a) intercepts and (b) slopes for all plants by species, based on a linear fit to height data from each plant.',echo=FALSE, warning=FALSE}
int.byspecies <- ggplot(allfits,aes(x=species,y=int)) +
geom_boxplot(aes(group=species)) +
theme.1 +
labs(x="Species",y="Intercepts",title="(a)")
rate.byspecies <- ggplot(allfits, aes(x=species,y=rate)) +
geom_boxplot(aes(group=species)) +
theme.1 +
labs(x="Species",y="Slopes",title="(b)")
grid.arrange(int.byspecies,rate.byspecies,ncol=2)
```
```{r, irboxbysoil,fig.align="center",out.width="60%", fig.cap='Boxplots of (a) intercepts and (b) slopes for all leadplants by soil type, based on a linear fit to height data from each plant.',echo=FALSE, warning=FALSE}
seedwdplus <- seedwdplus %>%
mutate(soilname = recode(soil, STP="Reconstructed",
CULT="Cultivated",
REM="Remnant") )
int.bysoil <- ggplot(seedwdplus,aes(x=soilname,y=int)) +
geom_boxplot() +
theme.1 +
labs(x="Soil type",y="Intercepts",title="(a)")
rate.bysoil <- ggplot(seedwdplus,aes(x=soilname,y=rate)) +
geom_boxplot() +
theme.1 +
labs(x="Soil type",y="Slopes",title="(b)")
grid.arrange(int.bysoil,rate.bysoil,ncol=2)
```
```{r, irboxbyster, fig.align="center",out.width="60%", fig.cap='Boxplots of (a) intercepts and (b) slopes for all leadplants by sterilization, based on a linear fit to height data from each plant.',echo=FALSE, warning=FALSE}
seedwdplus <- seedwdplus %>%
mutate(sterilename = recode(sterile, Y="Sterilized",
N="Not Sterilized") )
int.bysterile <- ggplot(seedwdplus,aes(x=sterilename,y=int)) +
geom_boxplot() +
theme.1 +
labs(x="Sterile",y="Intercepts",title="(a)")
rate.bysterile <- ggplot(seedwdplus,aes(x=sterilename,y=rate)) +
geom_boxplot() +
theme.1 +
labs(x="Sterile",y="Slopes",title="(b)")
grid.arrange(int.bysterile,rate.bysterile,ncol=2)
```
Since we have time at Level One, any exploratory analysis of Case Study \@ref(cs:seeds) should contain an investigation of the variance-covariance structure within plant. Figure \@ref(fig:corrstruct) shows the potential for an autocorrelation structure in which the correlation between observations from the same plant diminishes as the time between measurements increases. Residuals five days apart have correlations ranging from .77 to .91, while measurements ten days apart have correlations of .62 and .70, and measurements fifteen days apart have a correlation of .58.
```{r, corrstruct, fig.align="center",out.width="60%",fig.cap='Correlation structure within plant. The upper right contains correlation coefficients between residuals at pairs of time points, the lower left contains scatterplots of the residuals at time point pairs, and the diagonal contains histograms of residuals at each of the four time points.',echo=FALSE, warning=FALSE, message=FALSE}
# Examine correlation structure
seed.nona <- leaddata %>%
filter(!is.na(hgt))
hgt.lm = lm(hgt~time13, data=seed.nona)
seed.nona <- seed.nona %>%
mutate(lmres = resid(hgt.lm))
hgtw <- seed.nona %>%
dplyr::select(plant, time13, lmres) %>%
spread(key = time13, value = lmres, sep = "=")
hgtw.1 <- na.omit(hgtw)
ggpairs(hgtw.1[,2:5], lower=list(continuous="smooth"),
upper=list(), diag=list(continuous="bar", discrete="bar"),
axisLabels="show")
```
## Initial Models {#initialmodels-3level}
The structure and notation for three level models will closely resemble the structure and notation for two-level models, just with extra subscripts. Therein lies some of the power of multilevel models---extensions are relatively easy and allow you to control for many sources of variability, obtaining more precise estimates of important parameters. However, the number of variance component parameters to estimate can quickly mushroom as covariates are added at lower levels, so implementing simplifying restrictions will often become necessary (see Section \@ref(sec:explodingvarcomps)).
### Unconditional Means
We once again begin with the __unconditional means model__, \index{unconditional means model} in which there are no predictors at any level, in order to assess the amount of variation at each level. Here, Level Three is pot, Level Two is plant within pot, and Level One is time within plant. Using model formulations at each of the three levels, the unconditional means three-level model can be expressed as:
- Level One (timepoint within plant):
\begin{equation*}
Y_{ijk} = a_{ij}+\epsilon_{ijk} \textrm{ where } \epsilon_{ijk}\sim N(0,\sigma^2)
\end{equation*}
- Level Two (plant within pot):
\begin{equation*}
a_{ij} = a_{i}+u_{ij} \textrm{ where } u_{ij}\sim N(0,\sigma_{u}^{2})
\end{equation*}
- Level Three (pot):
\begin{equation*}
a_{i} = \alpha_{0}+\tilde{u}_{i} \textrm{ where } \tilde{u}_{i} \sim N(0,\sigma_{\tilde{u}}^{2})
\end{equation*}
where the heights of plants from different pots are considered independent, while plants from the same pot are correlated, as are measurements taken at different times on the same plant.
Keeping track of all the model terms, especially with three subscripts, is not a trivial task, but it's worth spending time thinking it through. Here is a quick guide to the meaning of terms found in our three-level model:
- $Y_{ijk}$ is the height (in mm) of plant $j$ from pot $i$ at time $k$
- $a_{ij}$ is the true mean height for plant $j$ from pot $i$ across all time points. This is not considered a model parameter, since we further model $a_{ij}$ at Level Two.
- $a_{i}$ is the true mean height for pot $i$ across all plants from that pot and all time points. This is also not considered a model parameter, since we further model $a_{i}$ at Level Three.
- $\alpha_{0}$ is a fixed effects model parameter representing the true mean height across all pots, plants, and time points.
- $\epsilon_{ijk}$ describes how far an observed height $Y_{ijk}$ is from the mean height for plant $j$ from pot $i$.
- $u_{ij}$ describes how far the mean height of plant $j$ from pot $i$ is from the mean height of all plants from pot $i$.
- $\tilde{u}_{i}$ describes how far the mean height of all observations from pot $i$ is from the overall mean height across all pots, plants, and time points. None of the error terms ($\epsilon, u, \tilde{u}$) are considered model parameters; they simply account for differences between the observed data and expected values under our model.
- $\sigma^2$ is a variance component (random effects model parameter) that describes within-plant variability over time.
- $\sigma_{u}^{2}$ is the variance component describing plant-to-plant variability within pot.
- $\sigma_{\tilde{u}}^{2}$ is the variance component describing pot-to-pot variability.
The three-level unconditional means model can also be expressed as a composite model:
\begin{equation*}
Y_{ijk}=\alpha_{0}+\tilde{u}_{i}+u_{ij}+\epsilon_{ijk}
\end{equation*}
and this composite model can be fit using statistical software:
```{r, comment=NA}
# Model A - unconditional means
modelal = lmer(hgt ~ 1 + (1|plant) + (1|pot),
REML=T, data=leaddata)
```
```{r, echo=FALSE, message=FALSE}
VCrandom <- VarCorr(modelal)
print(VCrandom, comp = c("Variance", "Std.Dev."))
cat(" Number of Level Two groups = ",
summary(modelal)$ngrps[1], "\n",
"Number of Level Three groups = ",
summary(modelal)$ngrps[2])
coef(summary(modelal))
```
From this output, we obtain estimates of our four model parameters:
- $\hat{\alpha}_{0}=2.39=$ the mean height (in mm) across all time points, plants, and pots.
- $\hat{\sigma}^2=0.728=$ the variance over time within plants.
- $\hat{\sigma}_{u}^{2}=0.278=$ the variance between plants from the same pot.
- $\hat{\sigma}_{\tilde{u}}^{2}=0.049=$ the variance between pots.
From the estimates of variance components, 69.0\% of total variability in height measurements is due to differences over time for each plant, 26.4\% of total variability is due to differences between plants from the same pot, and only 4.6\% of total variability is due to differences between pots. Accordingly, we will next explore whether the incorporation of time as a linear predictor at Level One can reduce the unexplained variability within plant.
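These percentages can be checked directly from the three estimated variance components; here is a quick computation using the rounded estimates reported above:

```{r, comment=NA}
# Partition total variability using the estimated variance components
# from the unconditional means model (rounded values from the output above)
sig2 <- 0.728          # within-plant (over time)
sig2_u <- 0.278        # between plants, within pot
sig2_utilde <- 0.049   # between pots
total <- sig2 + sig2_u + sig2_utilde
round(c(within_plant = sig2, between_plant = sig2_u,
        between_pot = sig2_utilde) / total, 3)   # 0.690, 0.264, 0.046
```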
### Unconditional Growth
The __unconditional growth model__ \index{unconditional growth model} introduces time as a predictor at Level One, but there are still no predictors at Levels Two or Three. The unconditional growth model allows us to assess how much of the within-plant variability (the variability among height measurements from the same plant at different time points) can be attributed to linear changes over time, while also determining how much variability we see in the intercept (Day 13 height) and slope (daily growth rate) from plant-to-plant and pot-to-pot. Later, we can model plant-to-plant and pot-to-pot differences in intercepts and slopes with Level Two and Three covariates.
The three-level unconditional growth model (Model B) can be specified either using formulations at each level:
- Level One (timepoint within plant):
\begin{equation*}
Y_{ijk} = a_{ij}+b_{ij}\textrm{time}_{ijk}+\epsilon_{ijk}
\end{equation*}
- Level Two (plant within pot):
\begin{align*}
a_{ij} & = a_{i}+u_{ij} \\
b_{ij} & = b_{i}+v_{ij}
\end{align*}
- Level Three (pot):
\begin{align*}
a_{i} & = \alpha_{0}+\tilde{u}_{i} \\
b_{i} & = \beta_{0}+\tilde{v}_{i}
\end{align*}
or as a composite model:
\begin{equation*}
Y_{ijk}=[\alpha_{0}+\beta_{0}\textrm{time}_{ijk}]+
[\tilde{u}_{i}+u_{ij}+\epsilon_{ijk}+(\tilde{v}_{i}+v_{ij})\textrm{time}_{ijk}]
\end{equation*}
where $\epsilon_{ijk}\sim N(0,\sigma^2)$,
\[ \left[ \begin{array}{c}
u_{ij} \\ v_{ij}
\end{array} \right] \sim N \left( \left[
\begin{array}{c}
0 \\ 0
\end{array} \right], \left[
\begin{array}{cc}
\sigma_{u}^{2} & \\
\sigma_{uv} & \sigma_{v}^{2}
\end{array} \right] \right), \] and
\[ \left[ \begin{array}{c}
\tilde{u}_{i} \\ \tilde{v}_{i}
\end{array} \right] \sim N \left( \left[
\begin{array}{c}
0 \\ 0
\end{array} \right], \left[
\begin{array}{cc}
\sigma_{\tilde{u}}^{2} & \\
\sigma_{\tilde{u}\tilde{v}} & \sigma_{\tilde{v}}^{2}
\end{array} \right] \right). \]
In this model, at Level One the trajectory for plant $j$ from pot $i$ is assumed to be linear, with intercept $a_{ij}$ (height on Day 13) and slope $b_{ij}$ (daily growth rate between Days 13 and 28); the $\epsilon_{ijk}$ terms capture the deviation between the true growth trajectory of plant $j$ from pot $i$ and its observed heights. At Level Two, $a_{i}$ represents the true mean intercept and $b_{i}$ represents the true mean slope for all plants from pot $i$, while $u_{ij}$ and $v_{ij}$ capture the deviation between plant $j$'s true growth trajectory and the mean intercept and slope for pot $i$. The deviations in intercept and slope at Level Two are allowed to be correlated through the covariance parameter $\sigma_{uv}$. Finally, $\alpha_{0}$ is the true mean intercept and $\beta_{0}$ is the true mean daily growth rate over the entire population of leadplants, while $\tilde{u}_{i}$ and $\tilde{v}_{i}$ capture the deviation between pot $i$'s true overall growth trajectory and the population mean intercept and slope. Note that between-plant and between-pot variability are both partitioned now into variability in initial status ($\sigma_{u}^{2}$ and $\sigma_{\tilde{u}}^{2}$) and variability in rates of change ($\sigma_{v}^{2}$ and $\sigma_{\tilde{v}}^{2}$).
Using the composite model specification, the unconditional growth model can be fit to the seed germination data:
```{r, comment=NA}
# Model B - unconditional growth
modelbl = lmer(hgt ~ time13 + (time13|plant) + (time13|pot),
REML=T, data=leaddata)
```
```{r, echo=FALSE, message=FALSE}
VCrandom <- VarCorr(modelbl)
print(VCrandom, comp = c("Variance", "Std.Dev."))
cat(" Number of Level Two groups = ",
summary(modelbl)$ngrps[1], "\n",
"Number of Level Three groups = ",
summary(modelbl)$ngrps[2])
coef(summary(modelbl))
```
From this output, we obtain estimates of our nine model parameters (two fixed effects and seven variance components):
- $\hat{\alpha}_{0}=1.538=$ the mean height of leadplants 13 days after planting.
- $\hat{\beta}_{0}=0.112=$ the mean daily change in height of leadplants from 13 to 28 days after planting.
- $\hat{\sigma}=.287=$ the standard deviation in within-plant residuals after accounting for time.
- $\hat{\sigma}_{u}=.547=$ the standard deviation in Day 13 heights between plants from the same pot.
- $\hat{\sigma}_{v}=.0346=$ the standard deviation in rates of change in height between plants from the same pot.
- $\hat{\rho}_{uv}=.280=$ the correlation in plants' Day 13 height and their rate of change in height.
- $\hat{\sigma}_{\tilde{u}}=.210=$ the standard deviation in Day 13 heights between pots.
- $\hat{\sigma}_{\tilde{v}}=.0355=$ the standard deviation in rates of change in height between pots.
- $\hat{\rho}_{\tilde{u}\tilde{v}}=-.610=$ the correlation in pots' Day 13 height and their rate of change in height.
We see that, on average, leadplants have a height of 1.54 mm 13 days after planting (pooled across pots and treatment groups), and their heights tend to grow by 0.11 mm per day, producing an average height at the end of the study (Day 28) of 3.22 mm. According to the t-values listed in R, both the Day 13 height and the growth rate are statistically significant. The estimated within-plant variance $\hat{\sigma}^2$ decreased by 88.7\% from the unconditional means model (from 0.728 to 0.082), implying that 88.7\% of within-plant variability in height can be explained by linear growth over time.
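Both the projected Day 28 height and the 88.7\% reduction in within-plant variance can be reproduced from the reported estimates:

```{r, comment=NA}
# Check the Day 28 projection and the within-plant pseudo-R^2
# using the rounded estimates reported above
1.538 + 0.112 * 15           # projected mean height on Day 28 (mm): 3.218
(0.728 - 0.082) / 0.728      # proportion of within-plant variability
                             # explained by linear time: about 0.887
```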
## Encountering Boundary Constraints {#sec:boundary}
Typically, with models consisting of three or more levels, the next step after adding covariates at Level One (such as time) is considering covariates at Level Two. In the seed germination experiment, however, there are no Level Two covariates of interest, and the treatments being studied were applied to pots (Level Three). We are primarily interested in the effects of soil type and sterilization on the growth of leadplants. Since soil type is a categorical factor with three levels, we can represent soil type in our model with indicator variables for cultivated lands (`cult`) and remnant prairies (`rem`), using reconstructed prairies as the reference level. For sterilization, we create a single indicator variable (`strl`) which takes on the value 1 for sterilized soil.
Our Level One and Level Two models will look identical to those from Model B; our Level Three models will contain the new covariates for soil type (`cult` and `rem`) and sterilization (`strl`):
\begin{align*}
a_{i} & = \alpha_{0}+\alpha_{1}\textrm{strl}_{i}+\alpha_{2}\textrm{cult}_{i}+\alpha_{3}\textrm{rem}_{i}+\tilde{u}_{i} \\
b_{i} & = \beta_{0}+\beta_{1}\textrm{strl}_{i}+\beta_{2}\textrm{cult}_{i}+\beta_{3}\textrm{rem}_{i}+\tilde{v}_{i}
\end{align*}
where the error terms at Level Three follow the same multivariate normal distribution as in Model B. In our case, the composite model can be written as:
\begin{align*}
Y_{ijk} & = (\alpha_{0}+\alpha_{1}\textrm{strl}_{i}+\alpha_{2}\textrm{cult}_{i}+\alpha_{3}\textrm{rem}_{i}+\tilde{u}_{i}+u_{ij}) + \\
& (\beta_{0}+\beta_{1}\textrm{strl}_{i}+\beta_{2}\textrm{cult}_{i}+\beta_{3}\textrm{rem}_{i}+\tilde{v}_{i}+
v_{ij})\textrm{time}_{ijk}+\epsilon_{ijk}
\end{align*}
which, after combining fixed effects and random effects, can be rewritten as:
\begin{align*}
Y_{ijk} & = [\alpha_{0}+\alpha_{1}\textrm{strl}_{i}+\alpha_{2}\textrm{cult}_{i}+\alpha_{3}\textrm{rem}_{i} +
\beta_{0}\textrm{time}_{ijk} + \\
& \beta_{1}\textrm{strl}_{i}\textrm{time}_{ijk}+\beta_{2}\textrm{cult}_{i}\textrm{time}_{ijk}+ \beta_{3}\textrm{rem}_{i}\textrm{time}_{ijk}] + \\
& [\tilde{u}_{i}+u_{ij}+\epsilon_{ijk}+\tilde{v}_{i}\textrm{time}_{ijk}+v_{ij}\textrm{time}_{ijk}]
\end{align*}
From the output below, the addition of Level Three covariates in Model C (`cult`, `rem`, `strl`, and their interactions with `time13`) appears to provide a significant improvement (likelihood ratio test statistic = 32.2 on 6 df, $p<.001$) over the unconditional growth model (Model B).
```{r, comment=NA, message=F}
# Model C - add covariates at pot level
modelcl = lmer(hgt ~ time13 + strl + cult + rem +
time13:strl + time13:cult + time13:rem + (time13|plant) +
(time13|pot), REML=T, data=leaddata)
```
```{r, echo=FALSE, message=FALSE}
VCrandom <- VarCorr(modelcl)
print(VCrandom, comp = c("Variance", "Std.Dev."))
cat(" Number of Level Two groups = ",
summary(modelcl)$ngrps[1], "\n",
"Number of Level Three groups = ",
summary(modelcl)$ngrps[2])
coef(summary(modelcl))
```
```{r comment=NA, message=FALSE}
drop_in_dev <- anova(modelbl, modelcl, test = "Chisq")
```
```{r comment=NA, message=F, echo=F}
did_print <- data.frame(npar=drop_in_dev$npar,
AIC=drop_in_dev$AIC, BIC=drop_in_dev$BIC,
logLik=drop_in_dev$logLik, dev=drop_in_dev$deviance,
Chisq=drop_in_dev$Chisq, Df=drop_in_dev$Df,
pval=drop_in_dev$`Pr(>Chisq)`)
row.names(did_print) <- row.names(drop_in_dev)
did_print
# go with Model C (although that had corr=-1)
# remember that anova() refits models with ML (REML=F)
```
However, Model C has encountered a __boundary constraint__ \index{boundary constraint} with an estimated Level Three correlation between the intercept and slope error terms of -1. "Allowable" values of correlation coefficients run from -1 to 1; by definition, it is impossible to have a correlation between two error terms below -1. Thus, our estimate of -1 is right on the boundary of the allowable values. But how did this happen, and why is it potentially problematic?
Consider a model in which we have two parameters that must be estimated: $\beta_0$ and $\sigma^2$. As the intercept, $\beta_0$ can take on any value; any real number is "allowable". But, by definition, variance terms such as $\sigma^2$ must be non-negative; that is, $\sigma^2 \geq 0$. Under the Principle of Maximum Likelihood, maximum likelihood estimators for $\beta_0$ and $\sigma^2$ will be chosen to maximize the likelihood of observing our given data. The left plot in Figure \@ref(fig:boundary) shows hypothetical contours of the likelihood function $L(\beta_0, \sigma^2)$; the likelihood is clearly maximized at $(\hat{\beta}_0 , \hat{\sigma}^2)=(4,-2)$. However, variance terms cannot be negative! A more sensible approach would have been to perform a constrained search for MLEs, considering any potential values for $\beta_0$ but only non-negative values for $\sigma^2$. This constrained search is illustrated in the right plot in Figure \@ref(fig:boundary). In this case, the likelihood is maximized at $(\hat{\beta}_0 , \hat{\sigma}^2)=(4,0)$. Note that the estimated intercept did not change, but the estimated variance is simply set at the smallest allowable value -- at the __boundary constraint__.
(ref:capboundary) Left (a): hypothetical contours of the likelihood function $L(\beta_0, \sigma^2)$ with no restrictions on $\sigma^2$; the likelihood function is maximized at $(\hat{\beta}_0, \hat{\sigma}^2)=(4,-2)$. Right (b): hypothetical contours of the likelihood function $L(\beta_0, \sigma^2)$ with the restriction that $\sigma^2 \geq 0$; the constrained likelihood function is maximized at $(\hat{\beta}_0, \hat{\sigma}^2)=(4,0)$.
```{r,boundary, fig.align="center",out.width="60%",fig.cap='(ref:capboundary)',echo=FALSE, warning=FALSE, message=FALSE}
#thr-contour-boundary
# Diagram to help explain boundary constraints
b0 <- seq(-4,12,length=51)
sigma2 <- seq(-8,4,length=51)
xy <- expand.grid(b0,sigma2)
# Include all points
Sigma <- matrix(c(12,0,0,6),2,2)
Mu <- c(4,-2)
z <- dmnorm(xy, Mu, Sigma)
zframe <- data.frame(xy, z)
MLE <- xy[z==max(z),]
con.1 <- ggplot(data=zframe,aes(x=Var1,y=Var2,z=z)) +
theme.1 +
geom_contour(stat="contour",lineend="butt",
linejoin="round",linemitre=1,
na.rm=FALSE,colour="black") +
labs(x="b0",y="sigma2",title="(a)") +
scale_y_continuous(limits=c(-8,4)) +
geom_abline(intercept=0,slope=0) +
geom_abline(intercept=0,slope=1000)
# Include all points where sigma2 >= 0
z <- ifelse(xy[,2]<0,0,dmnorm(xy, Mu, Sigma))
zframe.1 <- zframe[zframe$Var2>=0,]
MLE <- xy[z==max(z),]
con.2 <- ggplot(data=zframe.1,aes(x=Var1,y=Var2,z=z)) +
geom_contour(stat="contour",lineend="butt",
linejoin="round",linemitre=1,
na.rm=FALSE,colour="black") +
theme.1 +
scale_y_continuous(limits=c(-8,4)) +
labs(x="b0",y="sigma2",title="(b)") +
geom_abline(intercept=0,slope=0) +
geom_abline(intercept=0,slope=1000)
grid.arrange(con.1,con.2, ncol=2)
```
Graphically, in this simple illustration, the effect of the boundary constraint is to alter the likelihood function from a nice hill (in the left plot in Figure \@ref(fig:boundary)) with a single peak at $(4,-2)$, to a hill with a huge cliff face where $\sigma^2=0$. The highest point overlooking this cliff is at $(4,0)$, straight down the hill from the original peak.
In general, then, boundary constraints occur when the maximum likelihood estimator of at least one model parameter occurs at the limits of allowable values (such as estimated correlation coefficients of -1 or 1, or estimated variances of 0). Maximum likelihood estimates at the boundary tend to indicate that the likelihood function would be maximized at non-allowable values of that parameter if an unconstrained search for MLEs were conducted. Most software packages, however, will only report maximum likelihood estimates with allowable values. Therefore, boundary constraints would ideally be avoided, if possible.
What should you do if you encounter boundary constraints? Often, boundary constraints signal that your model needs to be reparameterized, i.e., you should alter your model to feature different parameters or ones that are interpreted differently. This can be accomplished in several ways:
- remove parameters, especially those variance and correlation terms which are being estimated on their boundaries.
- fix the values of certain parameters; for instance, you could set two variance terms equal to each other, thereby reducing the number of unknown parameters to estimate by one.
- transform covariates. Centering variables, standardizing variables, or changing units can all help stabilize a model. Numerical procedures for searching for and finding maximum likelihood estimates can encounter difficulties when variables have very high or low values, extreme ranges, outliers, or are highly correlated.
Although it is worthwhile attempting to reparameterize models to remove boundary constraints, sometimes they can be tolerated if (a) you are not interested in estimates of those parameters encountering boundary issues, and (b) removing those parameters does not affect conclusions about parameters of interest. For example, in the output below we explore the implications of simply removing the correlation between error terms at the pot level (i.e., assume $\rho_{\tilde{u}\tilde{v}}=0$ rather than accepting the (constrained) maximum likelihood estimate of $\hat{\rho}_{\tilde{u}\tilde{v}}=-1$ that we saw in Model C).
```{r, comment=NA, message=F}
# Try Model C without correlation between Level Three (pot) errors
modelcl.noL2corr <- lmer(hgt ~ time13 + strl + cult + rem +
time13:strl + time13:cult + time13:rem + (time13|plant) +
(1|pot) + (0+time13|pot), REML=T, data=leaddata)
```
```{r, echo=FALSE, message=FALSE}
VCrandom <- VarCorr(modelcl.noL2corr)
print(VCrandom, comp = c("Variance", "Std.Dev."))
cat(" Number of Level Two groups = ",
summary(modelcl.noL2corr)$ngrps[1], "\n",
"Number of Level Three groups = ",
summary(modelcl.noL2corr)$ngrps[2])
coef(summary(modelcl.noL2corr))
```
Note that the estimated variance components are all very similar to Model C, and the estimated fixed effects and their associated t-statistics are also very similar to Model C. Therefore, in this case we could consider simply reporting the results of Model C despite the boundary constraint.
However, when it is possible to remove boundary constraints through reasonable model reparameterizations, that is typically the preferred route. In this case, one option we might consider is simplifying Model C by setting $\sigma_{\tilde{v}}^{2}=\sigma_{\tilde{u}\tilde{v}}=0$. We can then write our new model (Model C.1) in level-by-level formulation:
- Level One (timepoint within plant):
\begin{equation*}
Y_{ijk} = a_{ij}+b_{ij}\textrm{time}_{ijk}+\epsilon_{ijk}
\end{equation*}
- Level Two (plant within pot):
\begin{align*}
a_{ij} & = a_{i}+u_{ij} \\
b_{ij} & = b_{i}+v_{ij}
\end{align*}
- Level Three (pot):
\begin{align*}
a_{i} & = \alpha_{0}+\alpha_{1}\textrm{strl}_{i}+\alpha_{2}\textrm{cult}_{i}+\alpha_{3}\textrm{rem}_{i}+\tilde{u}_{i} \\
b_{i} & = \beta_{0}+\beta_{1}\textrm{strl}_{i}+\beta_{2}\textrm{cult}_{i}+\beta_{3}\textrm{rem}_{i}
\end{align*}
Note that there is no longer an error term associated with the model for mean growth rate $b_{i}$ at the pot level. The growth rate for pot $i$ is assumed to be fixed, after accounting for soil type and sterilization; all pots with the same soil type and sterilization are assumed to have the same growth rate. As a result, our error assumption at Level Three is no longer bivariate normal, but rather univariate normal: $\tilde{u}_{i}\sim N(0,\sigma_{\tilde{u}}^{2})$. By removing one of our two Level Three error terms ($\tilde{v}_{i}$), we effectively removed two parameters: the variance for $\tilde{v}_{i}$ and the correlation between $\tilde{u}_{i}$ and $\tilde{v}_{i}$. Fixed effects remain similar, as can be seen in the output below:
```{r, comment=NA, message=F}
# Try Model C without a random slope for time at pot level
modelcl0 <- lmer(hgt ~ time13 + strl + cult + rem +
time13:strl + time13:cult + time13:rem + (time13|plant) +
(1|pot), REML=T, data=leaddata)
```
```{r, echo=FALSE, message=FALSE}
VCrandom <- VarCorr(modelcl0)
print(VCrandom, comp = c("Variance", "Std.Dev."))
cat(" Number of Level Two groups = ",
summary(modelcl0)$ngrps[1], "\n",
"Number of Level Three groups = ",
summary(modelcl0)$ngrps[2])
coef(summary(modelcl0))
```
We now have a more stable model, free of boundary constraints. In fact, we can attempt to determine whether or not removing the two variance component parameters for Model C.1 provides a significant reduction in performance. Based on a likelihood ratio test (see below), we do not have significant evidence (chi-square test statistic=2.089 on 2 df, p=0.3519) that $\sigma_{\tilde{v}}^{2}$ or $\sigma_{\tilde{u}\tilde{v}}$ is non-zero, so it is advisable to use the simpler Model C.1. However, Section \@ref(threelevel-paraboot) describes why this test may be misleading and prescribes a potentially better approach.
```{r comment=NA, message=FALSE}
drop_in_dev <- anova(modelcl0, modelcl, test = "Chisq")
```
```{r comment=NA, message=F, echo=F}
did_print <- data.frame(npar=drop_in_dev$npar,
AIC=drop_in_dev$AIC, BIC=drop_in_dev$BIC,
logLik=drop_in_dev$logLik, dev=drop_in_dev$deviance,
Chisq=drop_in_dev$Chisq, Df=drop_in_dev$Df,
pval=drop_in_dev$`Pr(>Chisq)`)
row.names(did_print) <- row.names(drop_in_dev)
did_print
# go with Model C.1 (fewer varcomps)
```
## Parametric Bootstrap Testing {#threelevel-paraboot}
As in Section \@ref(longitudinal-paraboot), when testing variance components at the boundary (such as $\sigma_{\tilde{v}}^{2} = 0$), a method like the **parametric bootstrap** \index{parametric bootstrap} should be used, since using a chi-square distribution to conduct a likelihood ratio test produces p-values that tend to be too large.
Under the parametric bootstrap, we must simulate data under the null hypothesis many times. Here are the basic steps for running a parametric bootstrap procedure to compare Model C.1 with Model C:
- Fit Model C.1 (the null model) to obtain estimated fixed effects and variance components (this is the "parametric" part).
- Use the estimated fixed effects and variance components from the null model to regenerate a new set of plant heights with the same sample size ($n=413$) and associated covariates for each observation as the original data (this is the "bootstrap" part).
- Fit both Model C.1 (the reduced model) and Model C (the full model) to the new data.
- Compute a likelihood ratio statistic comparing Models C.1 and C.
- Repeat the previous 3 steps many times (e.g., 1000).
- Produce a histogram of likelihood ratio statistics to illustrate its behavior when the null hypothesis is true.
- Calculate a p-value by finding the proportion of times the bootstrapped test statistic is greater than our observed test statistic.
Let's see how new plant heights are generated under the parametric bootstrap. Consider, for instance, $i=1$ and $j=1,2$. That is, consider Plants \#11 and \#12 as shown in Table \@ref(tab:10verb7). These plants are found in Pot \#1, which was randomly assigned to contain sterilized soil from a reconstructed prairie (STP):
```{r, 10verb7, echo=FALSE, comment=NA}
verb7 <- seedwd[1:2, c(2:11)]
kable(verb7, booktabs=T,
caption = "Original data for Plants 11 and 12 from Pot 1.") %>%
kable_styling(latex_options = "scale_down")
```
__Level Three__
One way to see the data generation process under the null model (Model C.1) is to start with Level Three and work backwards to Level One. Recall that our Level Three models for $a_{i}$ and $b_{i}$, the true intercept and slope from Pot $i$, in Model C.1 are:
\begin{align*}
a_{i} & = \alpha_{0}+\alpha_{1}\textrm{strl}_{i}+\alpha_{2}\textrm{cult}_{i}+\alpha_{3}\textrm{rem}_{i}+\tilde{u}_{i} \\
b_{i} & = \beta_{0}+\beta_{1}\textrm{strl}_{i}+\beta_{2}\textrm{cult}_{i}+\beta_{3}\textrm{rem}_{i}
\end{align*}
All the $\alpha$ and $\beta$ terms will be fixed at their estimated values, so the one term that will change for each bootstrapped data set is $\tilde{u}_{i}$. As we obtain a numeric value for $\tilde{u}_{i}$ for each pot, we will fix the subscript. For example, if $\tilde{u}_{i}$ is set to -.192 for Pot \#1, then we would denote this by $\tilde{u}_{1}=-.192$. Similarly, in the context of Model C.1, $a_{1}$ represents the mean height at Day 13 across all plants in Pot \#1, where $\tilde{u}_{1}$ quantifies how Pot \#1's Day 13 height relates to other pots with the same sterilization and soil type.
According to Model C.1, each $\tilde{u}_{i}$ is sampled from a normal distribution with mean 0 and standard deviation .240 (note that the standard deviation $\sigma_{\tilde{u}}$ is also fixed at its estimated value from Model C.1, given in Section \@ref(sec:boundary)). That is, a random component to the intercept for Pot \#1 ($\tilde{u}_{1}$) would be sampled from a normal distribution with mean 0 and SD .240; say, for instance, $\tilde{u}_{1}=-.192$. We would sample $\tilde{u}_{2},...,\tilde{u}_{72}$ in a similar manner. Then we can produce a model-based intercept and slope for Pot \#1:
\begin{align*}
a_{1} & = 1.512-.088(1)+.133(0)+.107(0)-.192 = 1.232 \\
b_{1} & = .101+.059(1)-.031(0)-.036(0) = .160
\end{align*}
Notice a few features of the above derivations. First, all of the coefficients from the above equations ($\alpha_{0}=1.512$, $\alpha_{1}=-.088$, etc.) come from the estimated fixed effects from Model C.1 reported in Section \@ref(sec:boundary). Second, "reconstructed prairie" is the reference level for soil type, so that indicators for "cultivated land" and "remnant prairie" are both 0. Third, the mean intercept (Day 13 height) for observations from sterilized reconstructed prairie soil is 1.512 - 0.088 = 1.424 mm across all pots, while the mean daily growth is .160 mm. Pot \#1 therefore has a mean Day 13 height that is .192 mm below the mean for all pots with sterilized reconstructed prairie soil, but every such pot is assumed to have the same growth rate of .160 mm/day because of our assumption that there is no pot-to-pot variability in growth rate (i.e., $\tilde{v}_{i}=0$).
__Level Two__
We next proceed to Level Two, where our equations for Model C.1 are:
\begin{align*}
a_{ij} & = a_{i}+u_{ij} \\
b_{ij} & = b_{i}+v_{ij}
\end{align*}
We will initially focus on Plant \#11 from Pot \#1. Notice that the intercept (Day 13 height = $a_{11}$) for Plant \#11 has two components: the mean Day 13 height for Pot \#1 ($a_{1}$) which we specified at Level Three, and an error term ($u_{11}$) which indicates how the Day 13 height for Plant \#11 differs from the overall average for all plants from Pot \#1. The slope (daily growth rate = $b_{11}$) for Plant \#11 similarly has two components. Since both $a_{1}$ and $b_{1}$ were determined at Level Three, at this point we need to find the two error terms for Plant \#11: $u_{11}$ and $v_{11}$. According to our multilevel model, we can sample $u_{11}$ and $v_{11}$ from a bivariate normal distribution with means both equal to 0, standard deviation for the intercept of .543, standard deviation for the slope of .036, and correlation between the intercept and slope of .194.
For instance, suppose we sample $u_{11}=.336$ and $v_{11}=.029$. Then we can produce a model-based intercept and slope for Plant \#11:
\begin{align*}
a_{11} & = 1.232+.336 = 1.568 \\
b_{11} & = .160+.029 = .189
\end{align*}
Although plants from Pot \#1 have a mean Day 13 height of 1.232 mm, Plant \#11's mean Day 13 height is .336 mm above that. Similarly, although plants from Pot \#1 have a mean growth rate of .160 mm/day (just like every other pot with sterilized restored prairie soil), Plant \#11's growth rate is .029 mm/day faster.
__Level One__
Finally we proceed to Level One, where the height of Plant \#11 is modeled as a linear function of time ($1.568 + .189\textrm{time}_{11k}$) with a normally distributed residual $\epsilon_{11k}$ at each time point $k$. Four residuals (one for each time point) are sampled independently from a normal distribution with mean 0 and standard deviation .287 -- the standard deviation again coming from parameter estimates from fitting Model C.1 to the actual data as reported in Section \@ref(sec:boundary). Suppose we obtain residuals of $\epsilon_{111}=-.311$, $\epsilon_{112}=.119$, $\epsilon_{113}=.241$, and $\epsilon_{114}=-.066$. In that case, our parametrically generated data for Plant \#11 from Pot \#1 would look like:
\[ \begin{array}{rcccl}
Y_{111} & = & 1.568+.189(0)-.311 & = & 1.257 \\
Y_{112} & = & 1.568+.189(5)+.119 & = & 2.632 \\
Y_{113} & = & 1.568+.189(10)+.241 & = & 3.699 \\
Y_{114} & = & 1.568+.189(15)-.066 & = & 4.337 \\
\end{array} \]
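The four heights above can be reproduced directly; here is a quick sketch using the intercept, slope, and residuals from the worked example:

```r
# Level One step for Plant #11: heights lie on the line
# 1.568 + .189 * time, plus the four sampled residuals from the text
a_11 <- 1.568
b_11 <- 0.189
time <- c(0, 5, 10, 15)                  # measurement days (time13 scale)
eps  <- c(-0.311, 0.119, 0.241, -0.066)  # sampled residuals

Y_11 <- a_11 + b_11 * time + eps
round(Y_11, 3)   # 1.257 2.632 3.699 4.337
```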
We would next turn to Plant \#12 from Pot \#1 ($i=1$ and $j=2$). Fixed effects would remain the same, as would coefficients for Pot \#1, $a_{1} = 1.232$ and $b_{1} = .160$, at Level Three. We would, however, sample new residuals $u_{12}$ and $v_{12}$ at Level Two, producing a different intercept $a_{12}$ and slope $b_{12}$ than those observed for Plant \#11. Four new independent residuals $\epsilon_{12k}$ would also be selected at Level One, from the same normal distribution as before with mean 0 and standard deviation .287.
Once an entire set of simulated heights for every pot, plant, and time point has been generated based on Model C.1, two models are fit to this data:
- Model C.1 -- the correct (null) model that was actually used to generate the responses
- Model C -- the incorrect (full) model that contains two extra variance components, $\sigma_{\tilde{v}}^{2}$ and $\sigma_{\tilde{u}\tilde{v}}$, that were not actually used when generating the responses
```{r, echo=FALSE, comment=NA, message=FALSE, warning=FALSE}
# run bootstrapAnova function, then line below takes about
# 15 minutes (much quicker with B=100 rather than B=1000)

# Parametric bootstrap code for lme4 models,
# from Fabian Scheipl on Stack Exchange
# m0 is the lmer model under the null hypothesis
#   (i.e., the smaller model)
# mA is the lmer model under the alternative
bootstrapAnova <- function(mA, m0, B = 1000){
  oneBootstrap <- function(m0, mA){
    d <- drop(simulate(m0))
    m2 <- refit(mA, newresp = d)
    m1 <- refit(m0, newresp = d)
    return(anova(m2, m1)$Chisq[2])
  }
  nulldist <- replicate(B, oneBootstrap(m0, mA))
  ret <- anova(mA, m0)
  ret$"Pr(>Chisq)"[2] <- mean(ret$Chisq[2] < nulldist)
  names(ret)[8] <- "Pr_boot(>Chisq)"
  attr(ret, "heading") <- c(attr(ret, "heading")[1],
    paste("Parametric bootstrap with", B, "samples."),
    attr(ret, "heading")[-1])
  attr(ret, "nulldist") <- nulldist
  return(ret)
}

# Run models with ML instead of REML to try to speed up,
# since the LRT refits with ML
modelcl0.ml <- lmer(hgt ~ time13 + strl + cult + rem +
  time13:strl + time13:cult + time13:rem + (time13|plant) +
  (1|pot), REML = F, data = leaddata)
modelcl.ml <- lmer(hgt ~ time13 + strl + cult + rem +
  time13:strl + time13:cult + time13:rem + (time13|plant) +
  (time13|pot), REML = F, data = leaddata)

# use like this (and increase B for more believability):
# bRLRT <- bootstrapAnova(mA=<BIG MODEL>, m0=<SMALLER MODEL>)

# Since this takes a while, it was run once and results
# stored in the data folder
#set.seed(33)
#bRLRT <- bootstrapAnova(mA=modelcl.ml, m0=modelcl0.ml, B=1000)
#bRLRT
#save(bRLRT, file = "data/modelclVScl0.rda")
```
```{r, comment = NA, message=FALSE}
# Generate 1 set of bootstrapped data and run chi-square test
# (will also work with REML models, but may take longer)
set.seed(3333)
d <- drop(simulate(modelcl0.ml))
m2 <- refit(modelcl.ml, newresp=d)
m1 <- refit(modelcl0.ml, newresp=d)
drop_in_dev <- anova(m2, m1, test = "Chisq")
```
```{r comment=NA, message=F, echo=F}
did_print <- data.frame(npar=drop_in_dev$npar,
AIC=drop_in_dev$AIC, BIC=drop_in_dev$BIC,
logLik=drop_in_dev$logLik, dev=drop_in_dev$deviance,
Chisq=drop_in_dev$Chisq, Df=drop_in_dev$Df,
pval=drop_in_dev$`Pr(>Chisq)`)
row.names(did_print) <- row.names(drop_in_dev)
did_print
```
A likelihood ratio test statistic is calculated comparing Model C.1 to Model C. For example, after continuing as above to generate new $Y_{ijk}$ values corresponding to all 413 leadplant height measurements, we fit both models to the "bootstrapped" data. Since the data was generated using Model C.1, we would expect the two extra terms in Model C ($\sigma^2_{\tilde{v}}$ and $\sigma_{\tilde{u}\tilde{v}}$) to contribute very little to the quality of the fit; Model C will have a slightly larger likelihood and loglikelihood since it contains every parameter from Model C.1 plus two more, but the difference in the likelihoods should be due to chance. In fact, that is what the output above shows. Model C does have a larger loglikelihood than Model C.1 (-280.54 vs. -281.24), but this small difference is not statistically significant based on a chi-square test with 2 degrees of freedom (p=.4953).
However, we are really only interested in saving the likelihood ratio test statistic from this bootstrapped sample ($2 \times (-280.54 - (-281.24)) = 1.40$). By generating ("bootstrapping") many sets of responses based on estimated parameters from Model C.1 and calculating a likelihood ratio test statistic for each, we can observe how this test statistic behaves under the null hypothesis of $\sigma_{\tilde{v}}^{2} = \sigma_{\tilde{u}\tilde{v}} = 0$, rather than making the (dubious) assumption that its behavior is described by a chi-square distribution with 2 degrees of freedom. Figure \@ref(fig:paraboot10) illustrates the null distribution of the likelihood ratio test statistic derived by the parametric bootstrap procedure as compared to a chi-square distribution. A p-value for comparing our full and reduced models can be approximated by finding the proportion of likelihood ratio test statistics generated under the null model which exceed our observed likelihood ratio test statistic (2.089). The parametric bootstrap provides a more reliable p-value in this case (.088 from the table below); a chi-square distribution puts too much mass in the tail and not enough near 0, leading to overestimation of the p-value. Based on this test, we would still choose our simpler Model C.1, but we nearly had enough evidence to favor the more complex model.
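As a quick check on that comparison, the nominal chi-square p-value for the observed statistic can be computed directly with `pchisq()` (this is arithmetic only; the .088 bootstrap p-value comes from the table below):

```r
# Nominal chi-square p-value for the observed LRT statistic;
# for df = 2 the upper tail probability is exp(-x/2)
lrt_obs <- 2.089
p_chisq <- pchisq(lrt_obs, df = 2, lower.tail = FALSE)
round(p_chisq, 3)   # 0.352 -- much larger than the bootstrap .088
```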
```{r, eval=FALSE}
bootstrapAnova(mA=modelcl.ml, m0=modelcl0.ml, B=1000)
```
```{r comment=NA, message=F, echo=F}
load(file = "data/modelclVScl0.rda")
drop_in_dev <- bRLRT
did_print <- data.frame(Df=drop_in_dev$Df,
logLik=drop_in_dev$logLik, dev=drop_in_dev$deviance,
Chisq=drop_in_dev$Chisq, ChiDf=drop_in_dev$`Chi Df`,
pval_boot=drop_in_dev$`Pr_boot(>Chisq)`)
row.names(did_print) <- row.names(drop_in_dev)