---
title: "Chapter 9"
subtitle: "Two-Level Longitudinal Data"
output:
  pdf_document:
    number_sections: yes
  html_document: default
editor_options:
  chunk_output_type: console
---
```{r,include=F}
if(knitr::is_html_output()){options(knitr.table.format = "html")} else {options(knitr.table.format = "latex")}
```
# Two-Level Longitudinal Data {#ch-lon}
## Learning Objectives
After finishing this chapter, you should be able to:
- Recognize longitudinal data as a special case of multilevel data, with time at Level One.
- Consider patterns of missingness and implications of that missing data on multilevel analyses.
- Apply exploratory data analysis techniques specific to longitudinal data.
- Build and understand a taxonomy of models for longitudinal data.
- Interpret model parameters in multilevel models with time at Level One.
- Compare models, both nested and not, with appropriate statistical tests and summary statistics.
- Consider different ways of modeling the variance-covariance structure in longitudinal data.
- Understand how a parametric bootstrap test of significance works and when it might be useful.
```{r load_packages9, message = FALSE, warning = FALSE}
# Packages required for Chapter 9
library(GGally)
library(data.table)
library(Hmisc)
library(mice)
library(lattice)
library(nlme)
library(reshape2)
library(MASS)
library(mnormt)
library(lme4)
library(gridExtra)
library(knitr)
library(kableExtra)
library(broom)
library(tidyverse)
```
## Case Study: Charter Schools {#cs:charter}
Charter schools were first introduced in the state of Minnesota in 1991 [@CharterSchools]. Since then, charter schools have begun appearing all over the United States. While publicly funded, a unique feature of charter schools is their independence from many of the regulations that are present in the public school systems of their respective city or state. Thus, charters will often extend the school days or year and tend to offer non-traditional techniques and styles of instruction and learning.
One example of this unique schedule structure is the KIPP (Knowledge Is Power Program) Stand Academy in Minneapolis, MN. KIPP stresses longer days and better partnerships with parents, and they claim that 80\% of their students go to college from a population where 87\% qualify for free and reduced lunch and 95\% are African American or Latino [@KIPP]. However, the larger question is whether or not charter schools are out-performing non-charter public schools in general. Because of the relative youthfulness of charter schools, data has just begun to be collected to evaluate the performance of charter versus non-charter schools and some of the factors that influence a school's performance. Along these lines, we will examine data collected by the Minnesota Department of Education for all Minnesota schools during the years 2008-2010.
Comparisons of student performance in charter schools versus public schools have produced conflicting results, potentially as a result of the strong differences in the structure and population of the student bodies that represent the two types of schools. A study by the Policy and Program Studies Service of five states found that charter schools are less likely to meet state performance standards than conventional public schools [@Finnigan2004]. However, @Witte2007 performed a statistical analysis comparing Wisconsin charter and non-charter schools and found that average achievement test scores were significantly higher in charter schools compared to non-charter schools, after controlling for demographic variables such as the percentage of white students. In addition, a study of California students who took the Stanford 9 exam from 1998 through 2002 found that charter schools, on average, were performing at the same level as conventional public schools [@Buddin2005]. Although school performance is difficult to quantify with a single measure, for illustration purposes in this chapter we will focus on that aspect of school performance measured by the math portion of the Minnesota Comprehensive Assessment (MCA-II) data for 6th grade students enrolled in 618 different Minnesota schools during the years 2008, 2009, and 2010 [@MNDepartmentOfEducation]. Similar comparisons could obviously be conducted for other grade levels or modes of assessment.
As described in @Green2003, it is very challenging to compare charter and public non-charter schools, as charter schools are often designed to target or attract specific populations of students. Without accounting for differences in student populations, comparisons lose meaning. With the assistance of multiple school-specific predictors, we will attempt to model sixth grade math MCA-II scores of Minnesota schools, focusing on the differences between charter and public non-charter school performances. In the process, we hope to answer the following research questions:
- Which factors most influence a school's performance in MCA testing?
- How do the average math MCA-II scores for 6th graders enrolled in charter schools differ from scores for students who attend non-charter public schools? Do these differences persist after accounting for differences in student populations?
- Are there differences in yearly improvement between charter and non-charter public schools?
## Initial Exploratory Analyses {#exploratoryanalysis}
### Data Organization {#data}
Key variables in `chart_wide_condense.csv` which we will examine to address the research questions above are:
- `schoolid` = includes district type, district number, and school number
- `schoolName` = name of school
- `urban` = is the school in an urban (1) or rural (0) location?
- `charter` = is the school a charter school (1) or a non-charter public school (0)?
- `schPctnonw` = proportion of non-white students in a school (based on 2010 figures)
- `schPctsped` = proportion of special education students in a school (based on 2010 figures)
- `schPctfree` = proportion of students who receive free or reduced lunches in a school (based on 2010 figures). This serves as a measure of poverty among school families.
- `MathAvgScore.0` = average MCA-II math score for all sixth grade students in a school in 2008
- `MathAvgScore.1` = average MCA-II math score for all sixth grade students in a school in 2009
- `MathAvgScore.2` = average MCA-II math score for all sixth grade students in a school in 2010
This data is stored in WIDE format, with one row per school, as illustrated in Table \@ref(tab:table1chp9).
```{r, include=FALSE, warning=FALSE}
#Getting started
chart.wide = read_csv("data/chart_wide_condense.csv")
```
```{r table1chp9, echo=FALSE, warning=TRUE}
table1chp9 <- head(chart.wide[2:11])
kable(table1chp9, booktabs=T,
caption = "The first six observations in the wide data set for the Charter Schools case study.") %>%
kable_styling(latex_options = "scale_down", font_size = 9)
#ifelse(knitr::is_html_output(),
# kab1chp9 %>% scroll_box(width="100%"), kab1chp9)
```
For most statistical analyses, it will be advantageous to convert WIDE format to LONG format, with one row per year per school. To make this conversion, we will have to create a time variable, which under the LONG format is very flexible---each school can have a different number of and differently-spaced time points, and they can even have predictors which vary over time. Details for making this conversion can be found in the R Markdown code for this chapter, and the form of the LONG data in this study is exhibited in the next section.
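One possible way to make this conversion is sketched below with `pivot_longer()` from **tidyr**; this is a minimal sketch rather than the exact code used for this chapter, and it assumes the wide data has been read into `chart.wide` with score columns named `MathAvgScore.0` through `MathAvgScore.2`.
```{r, eval=FALSE}
# Sketch of a wide-to-long conversion (one row per school-year); the year
# is recovered from the numeric suffix of the MathAvgScore columns
chart.long <- chart.wide %>%
  pivot_longer(cols = MathAvgScore.0:MathAvgScore.2,
               names_to = "year08",
               names_prefix = "MathAvgScore.",
               values_to = "MathAvgScore") %>%
  mutate(year08 = as.numeric(year08)) %>%
  arrange(schoolid, year08)
```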
### Missing Data {#missing}
In this case, before we convert our data to LONG form, we should first address problems with missing data. \index{missing data} Missing data is a common phenomenon in longitudinal studies. For instance, it could arise if a new school was started during the observation period, a school was shut down during the observation period, or no results were reported in a given year. Dealing with missing data in a statistical analysis is not trivial, but fortunately many multilevel packages (including the lme4 package in R) are adept at handling missing data.
First, we must understand the extent and nature of missing data in our study. Table \@ref(tab:table2chp9) is a frequency table of missing data patterns, where 1 indicates presence of a variable and 0 indicates a missing value for a particular variable; this table is a helpful starting point. Among our 618 schools, 540 had complete data (all covariates and math scores for all three years), 25 were missing a math score for 2008, 35 were missing math scores in both 2008 and 2009, etc.
The number of schools with a particular missing data pattern is listed in the left column; the remaining columns of 0's and 1's describe the missing data pattern, with 0 indicating a missing value. Some covariates that are present for every school are not listed. The bottom row gives the number of schools with missing values for specific variables; the last entry indicates that 121 total observations were missing.
```{r table2chp9,echo=FALSE, warning=FALSE}
table2chp9 <- md.pattern(chart.wide[c(5,9,10,11)], plot=FALSE)
kable(table2chp9, booktabs=T,
caption="A frequency table of missing data patterns.") %>%
kable_styling(latex_options = "scale_down")
```
Statisticians have devised different strategies for handling missing data; a few common approaches are described briefly here:
- Include only schools with complete data. This is the cleanest approach analytically; however, ignoring data from 12.6\% of the study's schools (since 78 of the 618 schools had incomplete data) means that a large amount of potentially useful data is being thrown away. In addition, this approach creates potential issues with informative missingness. Informative missingness occurs when a school's lack of scores is not a random phenomenon but provides information about the effectiveness of the school type (e.g., a school closes because of low test scores).
- Last observation carried forward. Each school's last math score is analyzed as a univariate response, whether the last measurement was taken in 2008, 2009, or 2010. With this approach, data from all schools can be used, and analyses can be conducted with traditional methods assuming independent responses. This approach is sometimes used in clinical trials because it tends to be conservative, setting a higher bar for showing that a new therapy is significantly better than a traditional therapy. Of course, we must assume that a school's 2008 score is representative of its 2010 score. In addition, information about trajectories over time is thrown away.
- Imputation of missing observations. Many methods have been developed for sensibly "filling in" missing observations, using imputation models which base imputed data on subjects with similar covariate profiles and on typical observed time trends. Once an imputed data set is created (or several imputed data sets), analyses can proceed with complete data methods that are easier to apply. Risks with the imputation approach include misrepresenting missing observations and overstating precision in final results. A brief code sketch of complete-case filtering and imputation follows this list.
- Apply multilevel methods, which use available data to estimate patterns over time by school and then combine those school estimates in a way that recognizes that time trends for schools with complete data are more precise than time trends for schools with fewer measurements. @Laird1988 demonstrates that multilevel models are valid under the fairly unrestrictive condition that the probability of missingness cannot depend on any unobserved predictors or the response. This is the approach we will follow in the remainder of the text.
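To make the first and third strategies concrete, here is a brief sketch (not the analysis pursued in this chapter) of complete-case filtering and of a basic multiple imputation using the `mice` package loaded earlier; variable names refer to the wide data set.
```{r, eval=FALSE}
# Strategy 1: complete cases only -- keep schools with scores in all 3 years
chart.complete <- chart.wide %>%
  filter(!is.na(MathAvgScore.0), !is.na(MathAvgScore.1),
         !is.na(MathAvgScore.2))

# Strategy 3: multiple imputation -- generate 5 imputed data sets from the
# numeric covariates and scores, then extract one completed copy
imp <- mice(chart.wide %>%
              select(schPctfree, schPctsped, schPctnonw,
                     MathAvgScore.0, MathAvgScore.1, MathAvgScore.2),
            m = 5, seed = 27, printFlag = FALSE)
chart.imputed <- mice::complete(imp, 1)
```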
Now, we are ready to create our LONG data set. Fortunately, many packages (including R) have built-in functions for easing this conversion, and the functions are improving constantly. The resulting LONG data set is shown in Table \@ref(tab:table3chp9), where `year08` measures the number of years since 2008.
```{r, include=FALSE, warning=FALSE}
#Getting started-2
# Create data frame in LONG form (one obs per school-year)
# chart.long is 1854x10 with 121 NAs for MathAvgScore
select <- dplyr::select
chart.long <- chart.wide %>%
gather(key = "key", value = "MathAvgScore",
MathAvgScore.0:MathAvgScore.2) %>%
separate(key, into = c("name", "year08"), sep = "\\.") %>%
select(-c("X1", "name")) %>%
arrange(schoolid, year08) %>%
mutate(year08 = as.numeric(year08))
head(chart.long)
# print first 6 rows after sorting by school and year within school
smallchart.long <- filter(chart.long, row_number() <= 72)
head(smallchart.long)
```
(ref:caplontable3chp9) \@ref(tab:table1chp9)
```{r table3chp9,echo=FALSE}
table3chp9 <- head(smallchart.long[,c(2,4,6,7,8,9)])
kable(table3chp9, booktabs=T,
caption= "The first six observations in the long data set for the Charter Schools case study; these lines correspond to the first two observations from the wide data set illustrated in Table (ref:caplontable3chp9).") %>%
kable_styling(latex_options = "scale_down")
```
### Exploratory Analyses for General Multilevel Models {#generalanalyses}
Notice the **longitudinal** \index{longitudinal} structure of our data---we have up to three measurements of test scores at different time points for each of our 618 schools. With this structure, we can address questions at two levels:
- Within school---changes over time
- Between schools---effects of school-specific covariates (charter or non-charter, urban or rural, percent free and reduced lunch, percent special education, and percent non-white) on 2008 math scores and rate of change between 2008 and 2010.
As with any statistical analysis, it is vitally important to begin with graphical and numerical summaries of important variables and relationships between variables. We'll begin with initial exploratory analyses that we introduced in the previous chapter, noting that we have no Level One covariates other than time at this point (potential covariates at this level may have included measures of the number of students tested or funds available per student). We will, however, consider the Level Two variables of charter or non-charter, urban or rural, percent free and reduced lunch, percent special education, and percent non-white. Although covariates such as percent free and reduced lunch may vary slightly from year to year within a school, the larger and more important differences tend to occur between schools, so we used percent free and reduced lunch for a school in 2010 as a Level Two variable.
As in Chapter \@ref(ch-multilevelintro), we can conduct initial investigations of relationships between Level Two covariates and test scores in two ways. First, we can use all 1733 observations to investigate relationships of Level Two covariates with test scores. Although these plots will contain dependent points, since each school is represented by up to three years of test score data, general patterns exhibited in these plots tend to be real. Second, we can calculate mean scores across all years for each of the 618 schools. While we lose some information with this approach, we can more easily consider each plotted point to be independent. Typically, both types of exploratory plots illustrate similar relationships, and in this case, both approaches are so similar that we will only show plots using the second approach, with one observation per school.
Figure \@ref(fig:lon-hist1) shows the distribution of MCA math test scores as somewhat left-skewed. MCA test scores for sixth graders are scaled to fall between 600 and 700, where scores above 650 for individual students indicate "meeting standards". Thus, schools with averages below 650 will often have increased incentive to improve their scores the following year. When we refer to the "math score" for a particular school in a particular year, we will assume that score represents the average for all sixth graders at that school. In Figure \@ref(fig:lon-box1), we see that test scores are generally higher both for schools in rural areas and for public non-charter schools. Note that in this data set there are 237 schools in rural areas and 381 schools in urban areas, as well as 545 public non-charter schools and 73 charter schools. In addition, we can see in Figure \@ref(fig:lon-scat1) that schools tend to have lower math scores if they have higher percentages of students with free and reduced lunch, with special education needs, or who are non-white.
```{r,include=FALSE, warning=FALSE}
#Getting started-3
# Add average across all years for each school for EDA plots
chart.means <- chart.long %>%
group_by(schoolid) %>%
summarise(mean3yr = mean(MathAvgScore, na.rm=T))
chart.wide <- chart.wide %>%
mutate(urban0 = ifelse(urban==1, "urban", "rural"),
charter0 = ifelse(charter==1, "charter",
"public non-charter")) %>%
left_join(chart.means, by="schoolid")
wide.charter <- chart.wide %>% filter(charter == 1)
set.seed(27) #pulls same random sample every time
samp = sample(1:length(chart.wide$charter==0), size=dim(wide.charter)[1])
samp # getting equal number of charters and non-charters
wide.public <- chart.wide %>%
filter(charter == 0) %>%
sample_n( dim(wide.charter)[1] )
sampdata <- bind_rows(wide.charter, wide.public) %>%
select(-X1) %>%
mutate(vars = row_number()) # Just use numbers 1-146 as school ids
head(sampdata)
sampdata.l <- sampdata %>%
gather(key = "key", value = "MathAvgScore", MathAvgScore.0:MathAvgScore.2) %>%
separate(key, into = c("name", "year08"), sep = "\\.") %>%
select(-name) %>%
arrange(charter, vars, year08) %>%
mutate(year08 = as.numeric(year08))
head(sampdata.l)
theme.1 <- theme(axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
plot.title=element_text(hjust=.9,face="italic",size=12))
###EDA###
# table(chart.wide$charter0)
# table(chart.wide$urban0)
chart.wide %>% count(charter0)
chart.wide %>% count(urban0)
```
```{r, lon-hist1, fig.align="center",out.width="60%",fig.cap='Histogram of mean sixth grade MCA math test scores over the years 2008-2010 for 618 Minnesota schools.',echo=FALSE, warning=FALSE}
###histogram of mean math scores (1 obs per school)
ggplot(data=chart.wide,aes(x=mean3yr)) +
geom_histogram(binwidth=5,color="black",fill="white") +
theme.1 +
xlab("Mean Math Scores by School") + ylab("Frequency")
```
```{r, include=FALSE, warning=FALSE}
#summaries (charter vs. non-charter) - 1 obs per school
# tapply(chart.wide$mean3yr, chart.wide$charter, summary)
chart.wide %>% group_by(charter0) %>%
summarise(means = mean(mean3yr),
sds = sd(mean3yr),
meds = median(mean3yr),
q1s = quantile(mean3yr, 0.25),
q3s = quantile(mean3yr, 0.75),
mins = min(mean3yr),
maxs = max(mean3yr), ns = n())
#summaries (urban vs. rural)
chart.wide %>% group_by(urban0) %>%
summarise(means = mean(mean3yr),
sds = sd(mean3yr),
meds = median(mean3yr),
q1s = quantile(mean3yr, 0.25),
q3s = quantile(mean3yr, 0.75),
mins = min(mean3yr),
maxs = max(mean3yr), ns = n())
#needed for second figure
charter.school <- ggplot(data = chart.wide,
aes(x = factor(charter0), y = mean3yr)) +
geom_boxplot() + coord_flip() +
theme.1 + ylab("Mean Math Scores by School") +
xlab("") + labs(title="(a)")
urban.school <- ggplot(data = chart.wide,
aes(x = factor(urban0), y = mean3yr)) +
geom_boxplot() + coord_flip() +
theme.1 + ylab("Mean Math Scores by School") +
xlab("") + labs(title="(b)")
lon.box1 <- grid.arrange(charter.school, urban.school,
ncol = 1, nrow = 2)
```
```{r, lon-box1,fig.align="center",out.width="60%", fig.cap='Boxplots of categorical Level Two covariates vs. average MCA math scores. Plot (a) shows charter vs. public non-charter schools, while plot (b) shows urban vs. rural schools.',echo=FALSE, warning=FALSE}
grid.arrange(charter.school,urban.school,ncol=1,nrow=2)
```
```{r, include=FALSE, warning=FALSE, message=FALSE}
#Changing percentage scale to 0 to 100
chart.long <- chart.long %>%
mutate(SchPctFree = schPctfree*100,
SchPctSped = schPctsped*100,
SchPctNonw = schPctnonw*100)
chart.wide <- chart.wide %>%
mutate(SchPctFree.wide = schPctfree*100,
SchPctSped.wide = schPctsped*100,
SchPctNonw.wide = schPctnonw*100)
# math score vs. continuous level two covariates
PctFree.school <- ggplot(data = chart.wide,
aes(x = schPctfree, y = mean3yr)) +
geom_point(color="dark grey") + theme.1 +
geom_smooth(se=FALSE,method="lm",color="black") +
xlab("Percent Free/Reduced Lunch") +
ylab("Mean Math Scores\nby School") + labs(title="(a)")
PctSped.school <- ggplot(data = chart.wide,
aes(x = schPctsped, y = mean3yr)) +
geom_point(color="dark grey") + theme.1 +
geom_smooth(se=FALSE,method="lm",color="black") +
xlab("Percent Special Ed") +
ylab("Mean Math Scores\nby School") + labs(title="(b)")
PctNonw.school <- ggplot(data = chart.wide,
aes(x = schPctnonw, y = mean3yr)) +
geom_point(color="dark grey") + theme.1 +
geom_smooth(se=FALSE,method="lm",color="black") +
xlab("Percent Non-white") +
ylab("Mean Math Scores\nby School") + labs(title="(c)")
lon.scat1 <- grid.arrange(PctFree.school, PctSped.school,
PctNonw.school, ncol = 2)
```
```{r, lon-scat1, fig.align="center",out.width="60%", fig.cap=' Scatterplots of average MCA math scores by (a) percent free and reduced lunch, (b) percent special education, and (c) percent non-white in a school.',echo=FALSE, warning=FALSE, message=FALSE}
grid.arrange(PctFree.school, PctSped.school,
PctNonw.school, ncol = 2)
```
### Exploratory Analyses for Longitudinal Data {#longitudinalanalyses}
In addition to the initial exploratory analyses above, longitudinal data---multilevel data with time at Level One---calls for further plots and summaries that describe time trends within and across individuals. For example, we can examine trends over time within individual schools. Figure \@ref(fig:lon-lat1) provides a **lattice plot** \index{lattice plot} illustrating trends over time for the first 24 schools in the data set. We note differences among schools in starting point (test scores in 2008), slope (change in test scores over the three-year period), and form of the relationship. These differences among schools are nicely illustrated in so-called **spaghetti plots** \index{spaghetti plot} such as Figure \@ref(fig:lon-spag1), which overlays the individual schools' time trends (for the math test scores) from Figure \@ref(fig:lon-lat1) on a single set of axes. In order to illustrate the overall time trend without making global assumptions about the form of the relationship, we overlaid in bold a non-parametric fitted curve through a **loess smoother**. \index{loess smoother} LOESS comes from "locally estimated scatterplot smoother", in which a low-degree polynomial is fit to each data point using weighted regression techniques, where nearby points receive greater weight. LOESS is a computationally intensive method which performs especially well with larger sets of data, although ideally there would be a greater diversity of x-values than the three time points we have. In this case, the loess smoother follows very closely to a linear trend, indicating that assuming a linear increase in test scores over the three-year period is probably a reasonable simplifying assumption. To further examine the hypothesis that linearity would provide a reasonable approximation to the form of the individual time trends in most cases, Figure \@ref(fig:lon-lat2) shows a lattice plot containing linear fits through ordinary least squares rather than connected time points as in Figure \@ref(fig:lon-lat1).
```{r, include=FALSE, warning=FALSE}
#Lattice plots
# First change names of Central and Chaska
smallchart.long$schoolName[7:9]="CENTRAL108"
smallchart.long$schoolName[37:39]="CHASKAEAST"
smallchart.long$schoolName[40:42]="CHASKAWEST"
smallchart.long$schoolName[64:66]="CENTRAL13"
```
```{r, lon-lat1, fig.align="center",out.width="60%",fig.cap='Lattice plot by school of math scores over time for the first 24 schools in the data set.',echo=FALSE, warning=FALSE, message=FALSE}
ggplot(smallchart.long, aes(x = year08, y = MathAvgScore)) +
geom_point() + geom_line() +
facet_wrap(~schoolName,ncol=6) +
scale_x_continuous(limits=c(0,2), breaks=c(0,1,2)) +
theme.1 + theme(strip.text.x=element_blank()) +
labs(x="Years since 2008",y="Math Score")
```
```{r, lon-spag1, fig.align="center",out.width="60%",fig.cap=' Spaghetti plot of math scores over time by school, for all the charter schools and a random sample of public non-charter schools, with overall fit using loess (bold).',echo=FALSE, warning=FALSE, message=FALSE}
ggplot(sampdata.l, aes(x = year08, y = MathAvgScore)) +
geom_line(aes(group = schoolid), color = "dark grey") +
geom_smooth(aes(group = 1), color = "black", size = 1) +
theme.1 +
labs(x = "Years since 2008", y = "Math Score")
```
```{r, lon-lat2, fig.align="center",out.width="60%",fig.cap=' Lattice plot by school of math scores over time with linear fit for the first 24 schools in the data set.',echo=FALSE, warning=FALSE, message=FALSE}
ggplot(smallchart.long, aes(x = year08, y = MathAvgScore)) +
geom_point() + stat_smooth(method=lm) +
facet_wrap(~schoolName,ncol=6) +
scale_x_continuous(limits=c(0,2), breaks=c(0,1,2)) +
scale_y_continuous(limits=c(640,665)) +
theme.1 + theme(strip.text.x=element_blank()) +
labs(x="Years since 2008",y="Math Scores")
```
```{r, include=FALSE, warning=FALSE}
##Spaghetti Plots
#get rid of NA data
newsampdata.l <- sampdata.l %>% na.omit()
```
```{r, lon-spag3, fig.align="center",out.width="60%",fig.cap='Spaghetti plots showing time trends for each school by school type, for a random sample of charter schools (left) and public non-charter schools (right), with overall fits using loess (bold).',echo=FALSE, warning=FALSE, message=FALSE}
ggplot(newsampdata.l, aes(x = year08, y = MathAvgScore)) +
geom_line(aes(group=schoolid),color="dark grey") +
facet_grid(.~charter0) +
geom_smooth(aes(group=1),color="black",size=1) +
labs(x="Years since 2008",y="Math Scores")
```
```{r, include=FALSE, warning=FALSE}
newsampdata.l <- newsampdata.l %>%
mutate(splitup = paste("Quartile",
as.numeric(cut2(schPctfree, g=4))))
```
```{r,lon-spagmat1, fig.align="center",out.width="60%",fig.cap='Spaghetti plots showing time trends for each school by quartiles of percent free and reduced lunch, with loess fits.', message=FALSE, echo=FALSE, warning=FALSE}
ggplot(newsampdata.l,aes(x=year08,y=MathAvgScore)) +
geom_line(aes(group=schoolid),color="dark grey") +
geom_smooth(method="loess",color="black",se=FALSE,size=.75) +
facet_grid(~splitup) +
labs(x="Years since 2008",y="Math Scores") +
scale_x_continuous(limits=c(0,2), breaks=c(0,1,2)) +
theme.1
```
Just as we explored the relationship between our response (average math scores) and important covariates in Section \@ref(generalanalyses), we can now examine the relationships between time trends by school and important covariates. For instance, Figure \@ref(fig:lon-spag3) shows that charter schools had math scores that were lower on average than public non-charter schools and more variable. This type of plot is sometimes called a **trellis graph**, \index{trellis graph} since it displays a grid of smaller charts with consistent scales, where each smaller chart represents a condition---an item in a category. Trends over time by school type are denoted by bold loess curves. Public non-charter schools have higher scores across all years; both school types show little growth between 2008 and 2009, but greater growth between 2009 and 2010, especially charter schools. Exploratory analyses like this can be repeated for other covariates, such as percent free and reduced lunch in Figure \@ref(fig:lon-spagmat1). The trellis plot automatically divides schools into four groups based on quartiles of their percent free and reduced lunch, and we see that schools with lower percentages of free and reduced lunch students tend to have higher math scores and less variability. Across all levels of free and reduced lunch, we see greater gains between 2009 and 2010 than between 2008 and 2009.
## Preliminary Two-Stage Modeling {#twostage9}
### Linear Trends Within Schools {#lineartwostage}
Even though we know that every school's math test scores were not strictly linearly increasing or decreasing over the observation period, a linear model for individual time trends is often a simple but reasonable way to model data. One advantage of using a linear model within school is that each school's data points can be summarized with two summary statistics---an intercept and a slope (obviously, this is an even bigger advantage when there are more observations over time per school). For instance, we see in Figure \@ref(fig:lon-lat2) that sixth graders from the school depicted in the top right slot slowly increased math scores over the three-year observation period, while students from the school depicted in the fourth column of the top row generally experienced decreasing math scores over the same period. As a whole, the linear model fits individual trends pretty well, and many schools appear to have slowly increasing math scores over time, as researchers in this study may have hypothesized.
Another advantage of assuming a linear trend at Level One (within schools) is that we can examine summary statistics across schools. Both the intercept and slope are meaningful for each school: the *intercept* conveys the school's math score in 2008, while the *slope* conveys the school's average yearly increase or decrease in math scores over the three-year period. Figure \@ref(fig:lon-cis1) shows that point estimates and uncertainty surrounding individual estimates of intercepts and slopes vary considerably. In addition, we can generate summary statistics and histograms for the 618 intercepts and slopes produced by fitting linear regression models at Level One, in addition to R-squared values which describe the strength of fit of the linear model for each school (Figure \@ref(fig:lon-histmat1)). For our 618 schools, the mean math score for 2008 was 651.4 (SD=7.28), and the mean yearly rate of change in math scores over the three-year period was 1.30 (SD=2.51). We can further examine the relationship between schools' intercepts and slopes. Figure \@ref(fig:lon-scat5) shows a general decreasing trend, suggesting that schools with lower 2008 test scores tend to have greater growth in scores between 2008 and 2010 (potentially because those schools have more room for improvement); this trend is supported with a correlation coefficient of $-0.32$ between fitted intercepts and slopes. Note that, with only 3 or fewer observations for each school, extreme or intractable values for the slope and R-squared are possible. For example, slopes cannot be estimated for those schools with just a single test score, R-squared values cannot be calculated for those schools with no variability in test scores between 2008 and 2010, and R-squared values must be 1 for those schools with only two test scores.
```{r, include=FALSE, warning=FALSE}
#95% CI's for slope and intercept of 24 schools (2 are filtered out since 1 obs)
regressions <- smallchart.long %>%
group_by(schoolid) %>%
do(fit = lm(MathAvgScore ~ year08, data=.))
sd_filter <- smallchart.long %>%
group_by(schoolid) %>%
summarise(sds = sd(MathAvgScore))
regressions <- regressions %>%
right_join(sd_filter, by="schoolid") %>%
filter(!is.na(sds))
lm_info1 <- regressions %>%
tidy(fit) %>%
ungroup() %>%
select(schoolid, term, estimate) %>%
spread(key = term, value = estimate) %>%
rename(rate = year08, int = `(Intercept)`)
lm_info2 <- regressions %>%
tidy(fit) %>%
ungroup() %>%
select(schoolid, term, std.error) %>%
spread(key = term, value = std.error) %>%
rename(se_rate = year08, se_int = `(Intercept)`)
lm_info <- regressions %>%
glance(fit) %>%
ungroup() %>%
select(schoolid, r.squared, df.residual) %>%
inner_join(lm_info1, by = "schoolid") %>%
inner_join(lm_info2, by = "schoolid") %>%
mutate(tstar = qt(.975, df.residual),
intlb = int - tstar * se_int,
intub = int + tstar * se_int,
ratelb = rate - tstar * se_rate,
rateub = rate + tstar * se_rate)
head(data.frame(lm_info))
#lon-cis1.eps
# plot (a): intercepts with 95% CIs by school; plot (b): slopes with 95% CIs
int.ci <- ggplot(lm_info, aes(y=int, x=1:22)) +
  geom_point() + theme.1 +
  geom_errorbar(aes(ymin=intlb, ymax=intub)) +
  coord_flip() + labs(y="Intercepts",x="Schools",title="(a)")
rate.ci <- ggplot(lm_info, aes(y=rate, x=1:22)) +
  geom_point() + theme.1 +
  geom_errorbar(aes(ymin=ratelb, ymax=rateub)) +
  coord_flip() + labs(y="Slopes",x="Schools",title="(b)")
lon.cis1 <- grid.arrange(int.ci, rate.ci, ncol=2)
```
(ref:lon-cis1-cap) Point estimates and 95% confidence intervals for (a) intercepts and (b) slopes by school, for the first 24 schools in the data set.
```{r, lon-cis1, fig.align="center",out.width="60%",fig.cap="(ref:lon-cis1-cap)",echo=FALSE, warning=FALSE}
grid.arrange(int.ci, rate.ci, ncol=2)
```
```{r, include=FALSE, warning=FALSE}
# Find slope and intercept of all 618 schools (540 after filter those with 1 obs)
regressions <- chart.long %>%
group_by(schoolid) %>%
do(fit = lm(MathAvgScore ~ year08, data=.))
sd_filter <- chart.long %>%
group_by(schoolid) %>%
summarise(sds = sd(MathAvgScore))
regressions <- regressions %>%
right_join(sd_filter, by="schoolid") %>%
filter(!is.na(sds))
lm_info1 <- regressions %>%
tidy(fit) %>%
ungroup() %>%
select(schoolid, term, estimate) %>%
spread(key = term, value = estimate) %>%
rename(rate = year08, int = `(Intercept)`)
lm_info2 <- regressions %>%
tidy(fit) %>%
ungroup() %>%
select(schoolid, term, std.error) %>%
spread(key = term, value = std.error) %>%
rename(se_rate = year08, se_int = `(Intercept)`)
lm_info <- regressions %>%
glance(fit) %>%
ungroup() %>%
select(schoolid, r.squared, df.residual) %>%
inner_join(lm_info1, by = "schoolid") %>%
inner_join(lm_info2, by = "schoolid") %>%
mutate(tstar = qt(.975, df.residual),
intlb = int - tstar * se_int,
intub = int + tstar * se_int,
ratelb = rate - tstar * se_rate,
rateub = rate + tstar * se_rate)
head(data.frame(lm_info))
# summary stats for intercepts
summary(lm_info$int)
sd(lm_info$int)
# summary stats for fitted rate of change
summary(lm_info$rate)
sd(lm_info$rate,na.rm=T)
# summary stats for R sq
summary(lm_info$r.squared)
# histograms for ints, rates of change, and Rsq values - lon-histmat1.eps
int.hist1 <- ggplot(lm_info) +
geom_histogram(aes(x=int),
binwidth=4, color="black", fill="white") +
theme.1 +
labs(x="Intercepts", y="Frequency", title="(a)")
rate.hist1 <- ggplot(lm_info) +
geom_histogram(aes(x=rate),
binwidth=2, color="black", fill="white") +
theme.1 +
labs(x="Slopes", y="Frequency", title="(b)")
rsq.hist1 <- ggplot(lm_info) +
geom_histogram(aes(x=r.squared),
binwidth=0.2, color="black", fill="white") +
theme.1 +
labs(x="Rsquared values", y="Frequency", title="(c)")
# correlation between slopes and intercepts for subjects with slope
with(lm_info, cor(int, rate, use="complete.obs"))
```
```{r, lon-histmat1, fig.align="center",out.width="60%",fig.cap=' Histograms for (a) intercepts, (b) slopes, and (c) R-squared values from fitted regression lines by school.',echo=FALSE, warning=FALSE}
lon.histmat1 <- grid.arrange(int.hist1, rate.hist1,
rsq.hist1, ncol=2)
```
```{r, lon-scat5, fig.align="center",out.width="60%",fig.cap='Scatterplot showing the relationship between intercepts and slopes from fitted regression lines by school.',echo=FALSE, warning=FALSE, message=FALSE}
# correlation between slopes and intercepts for subjects with slope
ggplot(data = lm_info, aes(x = int, y = rate)) +
geom_point(color = "dark grey") +
geom_smooth(se=FALSE, method="lm", color="black") +
theme.1 + xlab("Fitted Intercepts") +
ylab("Fitted Slopes")
```
### Effects of Level Two Covariates on Linear Time Trends {#lineartwostageL2effects}
Summarizing trends over time within schools is typically only a start, however. Most of the primary research questions from this study involve comparisons among schools, such as: (a) are there significant differences between charter schools and public non-charter schools, and (b) do any differences between charter schools and public schools change with percent free and reduced lunch, percent special education, or location? These are Level Two questions, and we can begin to explore these questions by graphically examining the effects of school-level variables on schools' linear time trends. By school-level variables, we are referring to those covariates that differ by school but are not dependent on time. For example, school type (charter or public non-charter), urban or rural location, percent non-white, percent special education, and percent free and reduced lunch are all variables which differ by school but which don't change over time, at least as they were assessed in this study. Variables which would be time-dependent include quantities such as per pupil funding and reading scores.
Figure \@ref(fig:lon-box2) shows differences in the average time trends by school type, using estimated intercepts and slopes to support observations from the spaghetti plots in Figure \@ref(fig:lon-spag3). Based on intercepts, charter schools have lower math scores, on average, in 2008 than public non-charter schools. Based on slopes, however, charter schools tend to improve their math scores at a slightly faster rate than public schools, especially at the 75th percentile and above. By the end of the three-year observation period, we would nevertheless expect charter schools to have lower average math scores than public schools. For another exploratory perspective on school type comparisons, we can examine differences between school types with respect to math scores in 2008 and math scores in 2010. As expected, boxplots by school type (Figure \@ref(fig:lon-box3)) show clearly lower math scores for charter schools in 2008, but differences are slightly less dramatic in 2010.
```{r, include=FALSE, warning=FALSE}
# Boxplots to compare school types
chart.wide <- lm_info %>%
select(schoolid, int, rate , r.squared) %>%
right_join(chart.wide, by = "schoolid")
#lon-box2.eps
int.box1 <- ggplot(chart.wide) +
geom_boxplot(aes(x = factor(charter0), y = int)) +
theme.1 + coord_flip() +
labs(x="School Type", y="Fitted Intercepts", title="(a)")
rate.box1 <- ggplot(chart.wide) +
geom_boxplot(aes(x = factor(charter0), y = rate)) +
theme.1 + coord_flip() +
labs(x="School Type", y="Fitted Slopes", title="(b)")
lon.box2 <- grid.arrange(int.box1, rate.box1, nrow=2)
```
```{r, lon-box2, fig.align="center",out.width="60%",fig.cap='Boxplots of (a) intercepts and (b) slopes by school type (charter vs. public non-charter).',echo=FALSE, warning=FALSE}
grid.arrange(int.box1,rate.box1,nrow=2)
```
```{r, include=FALSE, warning=FALSE}
#lon-box3.eps
year08.box <- chart.long %>% filter(year08 == 0) %>%
ggplot() +
geom_boxplot(aes(x = factor(charter), y = MathAvgScore)) +
theme.1 + coord_flip() +
labs(x="School Type", y="Math Score in 2008", title="(a)")
year10.box <- chart.long %>% filter(year08 == 2) %>%
ggplot() +
geom_boxplot(aes(x = factor(charter), y = MathAvgScore)) +
theme.1 + coord_flip() +
labs(x="School Type", y="Math Score in 2010", title="(b)")
lon.box3 <- grid.arrange(year08.box, year10.box, nrow=2)
```
```{r, lon-box3, fig.align="center",out.width="60%",fig.cap='Boxplots of (a) 2008 and (b) 2010 math scores by school type (charter (1) vs. public non-charter (0)).',echo=FALSE, warning=FALSE}
grid.arrange(year08.box,year10.box, nrow=2)
```
Any initial exploratory analyses should also investigate effects of potential confounding variables such as school demographics and location. If we discover, for instance, that those schools with higher levels of poverty (measured by the percentage of students receiving free and reduced lunch) display lower test scores in 2008 but greater improvements between 2008 and 2010, then we might be able to use percentage of free and reduced lunch in statistical modeling of intercepts and slopes, leading to more precise estimates of the charter school effects on these two outcomes. In addition, we should also look for any interaction with school type---any evidence that the difference between charter and non-charter schools changes based on the level of a confounding variable. For example, do charter schools perform better relative to non-charter schools when there is a large percentage of non-white students at a school?
With a confounding variable such as percentage of free and reduced lunch, we will treat this variable as continuous to produce the most powerful exploratory analyses. We can begin by examining boxplots of free and reduced lunch percentage against school type (Figure \@ref(fig:lon-boxcatmat1)). We observe that charter schools tend to have greater percentages of free and reduced lunch students as well as greater school-to-school variability. Next, we can use scatterplots to graphically illustrate the relationships between free and reduced lunch percentages and significant outcomes such as intercept and slope (also Figure \@ref(fig:lon-boxcatmat1)). In this study, it appears that schools with higher levels of free and reduced lunch (i.e., greater poverty) tend to have lower math scores in 2008, but there is little evidence of a relationship between levels of free and reduced lunch and improvements in test scores between 2008 and 2010. These observations are supported with correlation coefficients between percent free and reduced lunch and intercepts (r=-0.61) and slopes (r=-0.06).
A less powerful but occasionally informative way to look at the effect of a continuous confounder on an outcome variable is by creating a categorical variable out of the confounder. For instance, we could classify any school with a percentage of free and reduced lunch students above the median as having a high percentage of free and reduced lunch students, and all other schools as having a low percentage of free and reduced lunch students. Then we could examine a possible interaction between percent free and reduced lunch and school type through a series of four boxplots (Figure \@ref(fig:lon-boxmat1)). In fact, these boxplots suggest that the gap between charter and public non-charter schools in 2008 was greater in schools with a high percentage of free and reduced lunch students, while the difference in rate of change in test scores between charter and public non-charter schools appeared similar for high and low levels of free and reduced lunch. We will investigate these trends more thoroughly with statistical modeling.
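One way to form this two-level grouping variable is sketched below; this mirrors the grouping used in Figure \@ref(fig:lon-boxmat1).
```{r, eval=FALSE}
# Classify schools as high or low percent free/reduced lunch at the median
chart.wide <- chart.wide %>%
  mutate(highpctfree = ifelse(schPctfree > median(schPctfree),
                              "High Pct Free/Reduced Lunch",
                              "Low Pct Free/Reduced Lunch"))
```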
```{r, include=FALSE, warning=FALSE}
# OLS estimates plotted against the predictor Pct Free and Reduced Lunch
box1 <- ggplot(chart.wide) + geom_boxplot(aes(x=factor(charter0),y=100*schPctfree)) +
  theme.1 + coord_flip() +
  labs(x="School Type",
       y="% Free/Reduced Lunch",title="(a)")
int.scat1 <- ggplot(chart.wide,aes(x=100*schPctfree,y=int)) +
  geom_point() + theme.1 +
  labs(x="% Free/Reduced Lunch",
       y="Fitted Intercepts",title="(b)") +
  geom_smooth(se=FALSE,method="lm",color="black",size=.75)
rate.scat1 <- ggplot(chart.wide,aes(x=100*schPctfree,y=rate)) +
  geom_point() + theme.1 +
  labs(x="% Free/Reduced Lunch",
       y="Fitted Slopes",title="(c)") +
  geom_smooth(se=FALSE,method="lm",color="black",size=.75)
lon.boxscatmat1 <- grid.arrange(box1, int.scat1,
rate.scat1, ncol=2)
```
```{r, lon-boxcatmat1,fig.align="center",out.width="60%",fig.cap='(a) Boxplot of percent free and reduced lunch by school type (charter vs. public non-charter), along with scatterplots of (b) intercepts and (c) slopes from fitted regression lines by school vs. percent free and reduced lunch.',echo=FALSE, message=FALSE, warning=FALSE}
grid.arrange(box1,int.scat1,rate.scat1,ncol=2)
```
```{r, include=FALSE, warning=FALSE}
# Divide subjects into low and high percent FRL at median
# in order to illustrate charter school effect by percent FRL
# correlations cited in the text: percent free/reduced lunch vs. fitted
# intercepts and slopes
with(chart.wide, cor( cbind( schPctfree, int, rate),
  use="pairwise.complete.obs"))
medpctfree <- median(chart.wide$schPctfree)
chart.wide <- chart.wide %>%
mutate(highpctfree = ifelse(schPctfree > medpctfree,
"High Pct Free/Reduced Lunch",
"Low Pct Free/Reduced Lunch"))
#lon-boxmat1.eps
int.box <- ggplot(chart.wide) + theme.1 +
geom_boxplot(aes(x=factor(charter0),y=int)) + coord_flip() +
labs(x="School Type",y="Fitted Intercepts",title="(a)") +
facet_grid(highpctfree~.)
rate.box <- ggplot(chart.wide) + theme.1 +
geom_boxplot(aes(x=factor(charter0),y=rate)) + coord_flip() +
labs(x="School Type",y="Fitted Slopes",title="(b)") +
facet_grid(highpctfree~.)
lon.boxmat1 <- grid.arrange(int.box,rate.box,ncol=2)
```
```{r, lon-boxmat1, fig.align="center",out.width="60%",fig.cap='Boxplots of (a) intercepts and (b) slopes from fitted regression lines by school vs. school type (charter vs. public non-charter), separated by high and low levels of percent free and reduced lunch.',echo=FALSE, warning=FALSE}
grid.arrange(int.box,rate.box,ncol=2)
```
The effect of other confounding variables (e.g., percent non-white, percent special education, urban or rural location) can be investigated in a similar fashion to free and reduced lunch percentage, both in terms of main effect (variability in outcomes such as slope and intercept which can be explained by the confounding variable) and interaction with school type (ability of the confounding variable to explain differences between charter and public non-charter schools). We leave these explorations as an exercise.
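As a starting point for that exercise, one possible sketch repeats the free and reduced lunch exploration for percent special education; here `int` and `rate` are the fitted intercepts and slopes by school computed earlier, and `theme.1` is the plotting theme used throughout this chapter.
```{r, eval=FALSE}
# Exploratory plots for percent special education, analogous to those for
# percent free/reduced lunch: (a) boxplot by school type, (b) and (c)
# scatterplots against fitted intercepts and slopes
sped.box <- ggplot(chart.wide) +
  geom_boxplot(aes(x = factor(charter0), y = 100*schPctsped)) +
  theme.1 + coord_flip() +
  labs(x = "School Type", y = "% Special Ed", title = "(a)")
sped.int <- ggplot(chart.wide, aes(x = 100*schPctsped, y = int)) +
  geom_point() + theme.1 +
  geom_smooth(se = FALSE, method = "lm", color = "black") +
  labs(x = "% Special Ed", y = "Fitted Intercepts", title = "(b)")
sped.rate <- ggplot(chart.wide, aes(x = 100*schPctsped, y = rate)) +
  geom_point() + theme.1 +
  geom_smooth(se = FALSE, method = "lm", color = "black") +
  labs(x = "% Special Ed", y = "Fitted Slopes", title = "(c)")
grid.arrange(sped.box, sped.int, sped.rate, ncol = 2)
```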
### Error Structure Within Schools {#lineartwostageerror2}
Finally, with longitudinal data it is important to investigate the error variance-covariance structure of data collected within a school (the Level Two observational unit). In multilevel data, as in the examples we introduced in Chapter \@ref(ch-corrdata), we suspect observations within group (like a school) to be correlated, and we strive to model that correlation. When the data within group is collected over time, we often see distinct patterns in the residuals that can be modeled---correlations which decrease systematically as the time interval increases, variances that change over time, correlation structure that depends on a covariate, etc. A first step in modeling the error variance-covariance structure is the production of an exploratory plot such as Figure \@ref(fig:lon-cor1). To generate this plot, we begin by modeling MCA math score as a linear function of time using all 1733 observations and ignoring the school variable. This population (marginal) trend is illustrated in Figure \@ref(fig:lon-spag1) and is given by:
\begin{equation*}
\hat{Y}_{ij}=651.69+1.20\textrm{Time}_{ij},
\end{equation*}
where $\hat{Y}_{ij}$ is the predicted math score of the $i^{th}$ school at time $j$, and time $j$ is the number of years since 2008. In this model, the predicted math score will be identical for all schools at a given time point $j$. Residuals $Y_{ij}-\hat{Y}_{ij}$ are then calculated for each observation, measuring the difference between actual math score and the average overall time trend. Figure \@ref(fig:lon-cor1) then combines three pieces of information: the upper right triangle contains correlation coefficients for residuals between pairs of years, the diagonal contains histograms of residuals at each time point, and the lower left triangle contains scatterplots of residuals from two different years. In our case, we see that correlation between residuals from adjacent years is strongly positive (0.81-0.83) and does not drop off greatly as the time interval between years increases.
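A minimal sketch of this marginal fit, ignoring school, appears below; `lm()` drops the school-years with missing scores, leaving the 1733 complete observations.
```{r, eval=FALSE}
# Population (marginal) trend ignoring school; residuals from this fit feed
# the correlation plot that follows
marginal.lm <- lm(MathAvgScore ~ year08, data = chart.long)
coef(marginal.lm)             # roughly 651.69 and 1.20, as above
lmres <- resid(marginal.lm)   # one residual per non-missing school-year
```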
```{r, include=FALSE, warning=FALSE}
##Correlation structure
score.nonm <- chart.long %>%
filter(is.na(MathAvgScore) == FALSE)
hgtm.lm <- lm(MathAvgScore~year08, data=score.nonm)
score.nonm <- score.nonm %>%
mutate(lmres = resid(hgtm.lm))
hgtwm <- score.nonm %>%
select(schoolid, lmres, year08) %>%
mutate(name = rep("lmres", n())) %>%
unite(newcol, name, year08, sep = ".") %>%
spread(key = newcol, value = lmres)
```
```{r, lon-cor1, fig.align="center",out.width="60%",fig.cap='Correlation structure within school. The upper right contains correlation coefficients between residuals at pairs of time points, the lower left contains scatterplots of the residuals at time point pairs, and the diagonal contains histograms of residuals at each of the three time points.',echo=FALSE, warning=FALSE, message=FALSE}
ggpairs(hgtwm[,c(2:4)],upper=list(),
lower=list(continuous="smooth"),
diag=list(continuous="bar", discrete="bar"),
axisLabels="show")
```
## Initial Models {#lineartwostageerror}
Throughout the exploratory analysis phase, our original research questions have guided our work, and now with modeling we return to familiar questions such as:
- are differences between charter and public non-charter schools (in intercept, in slope, in 2010 math score) statistically significant?
- are differences between school types statistically significant, even after accounting for school demographics and location?
- do charter schools offer any measurable benefit over non-charter public schools, either overall or within certain subgroups of schools based on demographics or location?
As you might expect, answers to these questions will arise from proper consideration of variability and properly identified statistical models.
As in Chapter \@ref(ch-multilevelintro), we will begin model fitting with some simple, preliminary models, in part to establish a baseline for evaluating larger models. Then, we can build toward a final model for inference by attempting to add important covariates, centering certain variables, and checking assumptions.
### Unconditional Means Model {#modela}
In the multilevel context, we almost always begin with the **unconditional means model**, \index{unconditional means model} in which there are no predictors at any level. The purpose of the unconditional means model is to assess the amount of variation at each level, and to compare variability within school to variability between schools. Define $Y_{ij}$ as the MCA-II math score from school $i$ and year $j$. Using the composite model specification from Chapter \@ref(ch-multilevelintro):
\begin{equation*}
Y _{ij} = \alpha_{0} + u_{i} + \epsilon_{ij} \textrm{ with } u_{i} \sim N(0, \sigma^2_u) \textrm{ and } \epsilon_{ij} \sim N(0, \sigma^2)
\end{equation*}
the unconditional means model can be fit to the MCA-II data:
```{r, comment=NA}
#Model A (Unconditional means model)
model.a <- lmer(MathAvgScore~ 1 + (1|schoolid),
REML=T, data=chart.long)
```
```{r, echo=FALSE, message=FALSE}
VCrandom <- VarCorr(model.a)
print(VCrandom, comp = c("Variance", "Std.Dev."))
cat(" Number of Level Two groups = ",
summary(model.a)$ngrps)
coef(summary(model.a))
```
From this output, we obtain estimates of our three model parameters:
- $\hat{\alpha}_{0}$ = 652.7 = the mean math score across all schools and all years
- $\hat{\sigma}^2$= 10.6 = the variance in within-school deviations between individual yearly scores and the school mean across all years
- $\hat{\sigma}^2_u$= 41.9 = the variance in between-school deviations between school means and the overall mean across all schools and all years
Based on the intraclass correlation coefficient:
\begin{equation*}
\hat{\rho}=\frac{\hat{\sigma}^2_u}{\hat{\sigma}^2_u + \hat{\sigma}^2} = \frac{41.869}{41.869+10.571}= 0.798
\end{equation*}
79.8\% of the total variation in math scores is attributable to differences among schools rather than changes over time within schools. We can also say that the average correlation for any pair of responses from the same school is 0.798.
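The same intraclass correlation can be recovered directly from the fitted variance components; a minimal sketch using the `model.a` object fit above:
```{r, eval=FALSE}
# Extract variance components from model.a and recompute the ICC
vc <- as.data.frame(VarCorr(model.a))
sigma2.u <- vc$vcov[vc$grp == "schoolid"]   # between-school variance
sigma2   <- vc$vcov[vc$grp == "Residual"]   # within-school variance
sigma2.u / (sigma2.u + sigma2)              # approximately 0.798
```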
### Unconditional Growth Model {#modelb9}
The second model in most multilevel contexts introduces a covariate at Level One (see Model B in Chapter \@ref(ch-multilevelintro)). With longitudinal data, this second model introduces time as a predictor at Level One, but there are still no predictors at Level Two. This model is then called the **unconditional growth model**. \index{unconditional growth model} The unconditional growth model allows us to assess how much of the within-school variability can be attributed to systematic changes over time.
At the lowest level, we can consider building individual growth models over time for each of the 618 schools in our study. First, we must decide upon a form for each of our 618 growth curves. Based on our initial exploratory analyses, assuming that an individual school's MCA-II math scores follow a linear trend seems like a reasonable starting point. Under the assumption of linearity, we must estimate an intercept and a slope for each school, based on their 1-3 test scores over a period of three years. Compared to time series analyses of economic data, most longitudinal data analyses have relatively few time periods for each subject (or school), and the basic patterns within subject are often reasonably described by simpler functional forms.
Let $Y_{ij}$ be the math score of the $i^{th}$ school in year $j$. Then we can model the linear change in math test scores over time for School $i$ according to Model B:
\begin{equation*}
Y_{ij} = a_{i} + b_{i}\textrm{Year08}_{ij} + \epsilon_{ij} \textrm{ where } \epsilon_{ij} \sim N(0, \sigma^2)
\end{equation*}
The parameters in this model $(a_{i}, b_{i},$ and $\sigma^2)$ can be estimated through LLSR methods. $a_{i}$ represents the true intercept for School $i$---i.e., the expected test score level for School $i$ when time is zero (2008)---while $b_{i}$ represents the true slope for School $i$---i.e., the expected yearly rate of change in math score for School $i$ over the three-year observation period. Here we use Roman letters rather than Greek for model parameters since models by school will eventually be a conceptual first step in a multilevel model. The $\epsilon_{ij}$ terms represent the deviation of School $i$'s actual test scores from the expected results under linear growth---the part of school $i$'s test score at time $j$ that is not explained by linear changes over time. The variability in these deviations from the linear model is given by $\sigma^2$. In Figure \@ref(fig:lon-scat3), which illustrates a linear growth model for Norwood Central Middle School, $a_{i}$ is estimated by the $y$-intercept of the fitted regression line, $b_{i}$ is estimated by the slope of the fitted regression line, and $\sigma^2$ is estimated by the variability in the vertical distances between each point (the actual math score in year $j$) and the line (the predicted math score in year $j$).
```{r, lon-scat3, fig.align="center",out.width="60%",fig.cap='Linear growth model for Norwood Central Middle School.',echo=FALSE, message=FALSE, warning=FALSE}
#lon-scat3.eps
Norwood <- chart.long %>% slice(7:9)
model0 <- lm(MathAvgScore ~ year08, data = Norwood)
ggplot(Norwood, aes(x = year08, y = MathAvgScore)) +
theme.1 + geom_point() +
geom_smooth(se=FALSE, method="lm", color="black", size=.75) +
scale_y_continuous(limits=c(654,661)) +
labs(x="Years since 2008", y="Math Score",
title="Norwood Central") +
geom_segment(aes(x = year08[1], y = MathAvgScore[1],
xend = year08[1], yend = model0$fitted.values[1]),
linetype=2) +
geom_segment(aes(x = year08[2], y = MathAvgScore[2],
xend = year08[2], yend = model0$fitted.values[2]),
linetype=2) +
geom_segment(aes(x = year08[3], y = MathAvgScore[3],
xend = year08[3], yend = model0$fitted.values[3]),
linetype=2)
```
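As a conceptual first step, we could estimate an intercept and a slope for every school by fitting a separate LLSR line to each school's observed scores, just as we did for Norwood Central. A minimal sketch of this step, assuming `chart.long` is loaded as before (schools observed in only a single year will have missing slope estimates):
```{r, comment=NA, message=FALSE, warning=FALSE}
# Separate LLSR fits by school: estimated intercepts (a_i) and slopes (b_i).
# Keep only school-years with an observed math score before fitting.
scores <- chart.long %>%
  filter(!is.na(MathAvgScore))
by.school <- lapply(split(scores, scores$schoolid, drop = TRUE),
                    function(d) coef(lm(MathAvgScore ~ year08, data = d)))
head(do.call(rbind, by.school))
```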
In a multilevel model, we let intercepts ($a_{i}$) and slopes ($b_{i}$) vary by school and build models for these intercepts and slopes using school-level variables at Level Two. An unconditional growth model features no predictors at Level Two and can be specified either using formulations at both levels:
- Level One:
\begin{equation*}
Y_{ij}=a_{i}+b_{i}\textrm{Year08}_{ij} + \epsilon_{ij}
\end{equation*}
- Level Two:
\begin{align*}
a_{i}&=\alpha_{0} + u_{i}\\
b_{i}&=\beta_{0} + v_{i}
\end{align*}
or as a composite model:
\begin{equation*}
Y_{ij}=\alpha_{0} + \beta_{0}\textrm{Year08}_{ij}+u_{i}+v_{i}\textrm{Year08}_{ij} + \epsilon_{ij}
\end{equation*}
where $\epsilon_{ij}\sim N(0,\sigma^2)$ and
\[ \left[ \begin{array}{c}
u_{i} \\ v_{i}
\end{array} \right] \sim N \left( \left[
\begin{array}{c}
0 \\ 0
\end{array} \right], \left[
\begin{array}{cc}
\sigma_{u}^{2} & \\
\sigma_{uv} & \sigma_{v}^{2}
\end{array} \right] \right) . \]
As before, $\sigma^2$ quantifies the within-school variability (the scatter of points around schools' linear growth trajectories), while now the between-school variability is partitioned into variability in initial status $(\sigma^2_u)$ and variability in rates of change $(\sigma^2_v)$.
Using the composite model specification, the unconditional growth model can be fit to the MCA-II test data:
```{r, comment=NA}
#Model B (Unconditional growth)
model.b <- lmer(MathAvgScore~ year08 + (year08|schoolid),
REML=T, data=chart.long)
```
```{r, echo=FALSE, message=FALSE}
VCrandom <- VarCorr(model.b)
print(VCrandom, comp = c("Variance", "Std.Dev."))
cat(" Number of Level Two groups = ",
summary(model.b)$ngrps)
coef(summary(model.b))
cat(" AIC = ", AIC(model.b), "; BIC = ", BIC(model.b))
```
From this output, we obtain estimates of our six model parameters:
- $\hat{\alpha}_{0}$ = 651.4 = the mean math score for the population of schools in 2008.
- $\hat{\beta}_{0}$ = 1.26 = the mean yearly change in math test scores for the population during the three-year observation period.
- $\hat{\sigma}^2$ = 8.82 = the variance in within-school deviations.
- $\hat{\sigma}^2_u$ = 39.4 = the variance between schools in 2008 scores.
- $\hat{\sigma}^2_v$ = 0.11 = the variance between schools in rates of change in math test scores during the three-year observation period.
- $\hat{\rho}_{uv}$ = 0.72 = the correlation between schools' 2008 math scores and their rates of change in scores between 2008 and 2010.
We see that schools had a mean math test score of 651.4 in 2008 and their mean test scores tended to increase by 1.26 points per year over the three-year observation period, producing a mean test score at the end of three years of 653.9. According to the t-value (14.1), the increase in mean test scores noted during the three-year observation period is statistically significant.
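These fitted values come straight from the estimated fixed effects; a quick check, assuming `model.b` has been fit as above:
```{r, comment=NA}
# Estimated mean score in 2008 and mean yearly change (fixed effects of model.b),
# then the implied mean score in 2010 (two years after 2008)
fixef(model.b)
fixef(model.b)[[1]] + 2 * fixef(model.b)[[2]]   # roughly 653.9
```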
The estimated within-school variance $\hat{\sigma}^2$ decreased by about 17\% from the unconditional means model, implying that 17\% of within-school variability in test scores can be explained by a linear increase over time:
\begin{align*}
\textrm{Pseudo }R^2_{L1} & = \frac{\hat{\sigma}^2(\textrm{uncond means}) - \hat{\sigma}^2(\textrm{uncond growth})}{\hat{\sigma}^2(\textrm{uncond means})} \\
& = \frac{10.571-8.820}{10.571}= 0.17
\end{align*}
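This calculation can be reproduced from the two fitted models; a brief sketch, assuming `model.a` and `model.b` have been fit as above:
```{r, comment=NA}
# Pseudo R-squared at Level One: proportional reduction in within-school variance
sigma2.a <- sigma(model.a)^2   # residual variance, unconditional means model
sigma2.b <- sigma(model.b)^2   # residual variance, unconditional growth model
(sigma2.a - sigma2.b) / sigma2.a   # about 0.17
```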
### Modeling Other Trends over Time {#othertimetrends}
While modeling linear trends over time is often a good approximation of reality, it is by no means the only way to model the effect of time. One alternative is to model the quadratic effect of time, which implies adding terms for both time and the square of time. Typically, to reduce the correlation between the linear and quadratic components of the time effect, the time variable is centered first; we have already "centered" on 2008. Modifying Model B to produce an **unconditional quadratic growth model** would take the following form:
- Level One:
\begin{equation*}
Y_{ij}=a_{i}+b_{i}\textrm{Year08}_{ij}+c_{i}\textrm{Year08}^{2}_{ij} + \epsilon_{ij}
\end{equation*}
- Level Two:
\begin{align*}
a_{i} & = \alpha_{0} + u_{i}\\
b_{i} & = \beta_{0} + v_{i}\\
c_{i} & = \gamma_{0} + w_{i}
\end{align*}
where $\epsilon_{ij}\sim N(0,\sigma^2)$ and
\[ \left[ \begin{array}{c}
u_{i} \\ v_{i} \\ w_{i}
\end{array} \right] \sim N \left( \left[
\begin{array}{c}
0 \\ 0 \\ 0
\end{array} \right], \left[
\begin{array}{ccc}
\sigma_{u}^{2} & & \\
\sigma_{uv} & \sigma_{v}^{2} & \\
\sigma_{uw} & \sigma_{vw} & \sigma_{w}^{2}
\end{array} \right] \right) . \]
With the extra term at Level One for the quadratic effect, we now have 3 equations at Level Two, and 6 variance components at Level Two (3 variance terms and 3 covariance terms). However, with only a maximum of 3 observations per school, we lack the data for fitting 3 equations with error terms at Level Two. Instead, we could model the quadratic time effect with fewer variance components---for instance, by only using an error term on the intercept at Level Two:
\begin{align*}
a_{i} & = \alpha_{0} + u_{i}\\
b_{i} & = \beta_{0}\\
c_{i} & = \gamma_{0}
\end{align*}
where $u_{i}\sim N(0,\sigma^2_u)$. Models like this are frequently used in practice---they allow for a separate overall effect on test scores for each school, while minimizing parameters that must be estimated. The tradeoff is that this model does not allow linear and quadratic effects to differ by school, but we have little choice here without more observations per school. Thus, using the composite model specification, the unconditional quadratic growth model with random intercept for each school can be fit to the MCA-II test data:
```{r include=FALSE}
chart.long <- chart.long %>%
mutate(yearc = year08 - 1, yearc2 = yearc ^ 2)
#Model B1 (Unconditional growth but only random intercepts)
model.b1 <- lmer(MathAvgScore~ year08 + (1|schoolid),
REML=T, data=chart.long)
summary(model.b1)
cat(" AIC = ", AIC(model.b1), "; BIC = ", BIC(model.b1))
#model.b3 <- lmer(MathAvgScore~ yearc + yearc2 + (yearc+yearc2|schoolid),
# REML=T, data=chart.long)
#summary(model.b3) # won't run - too many random effects
```
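The fit below uses `yearc`, the time variable centered at 2009 (the middle year of the study), and `yearc2`, its square. A minimal sketch of that preprocessing step, assuming `chart.long` is loaded as before:
```{r, comment=NA}
# Center time at 2009 and form its square to reduce the correlation
# between the linear and quadratic terms
chart.long <- chart.long %>%
  mutate(yearc = year08 - 1,
         yearc2 = yearc^2)
```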
```{r, comment=NA}
# Modeling quadratic time trend
model.b2 <- lmer(MathAvgScore~ yearc + yearc2 + (1|schoolid),
REML=T, data=chart.long)
```
```{r, echo=FALSE, message=FALSE}
VCrandom <- VarCorr(model.b2)
print(VCrandom, comp = c("Variance", "Std.Dev."))
cat(" Number of Level Two groups = ",
summary(model.b2)$ngrps)
coef(summary(model.b2))
cat(" AIC = ", AIC(model.b2), "; BIC = ", BIC(model.b2))
```
From this output, we see that the quadratic effect is positive and significant (t=7.1), in this case indicating that increases in test scores are greater between 2009 and 2010 than between 2008 and 2009. Based on AIC and BIC values, the quadratic growth model outperforms the linear growth model with random intercepts only at Level Two (AIC: 10308 vs. 10354; BIC: 10335 vs. 10375).
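These information criteria can be pulled directly from the two fits; a quick check, assuming `model.b1` (the linear growth model with random intercepts only) and `model.b2` are available:
```{r, comment=NA}
# AIC and BIC for the linear (random intercepts only) and quadratic growth models
c(AIC(model.b1), AIC(model.b2))
c(BIC(model.b1), BIC(model.b2))
```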
Another frequently used approach to modeling time effects is the **piecewise linear model**. In this model, the complete time span of the study is divided into two or more segments, with a separate slope relating time to the response in each segment. In our case study there is only one piecewise option---fitting separate slopes in 2008-09 and 2009-10. With only 3 time points, creating a piecewise linear model is a bit simplified, but this idea can be generalized to segments with more than two years each.
```{r include=FALSE}
# Modeling piecewise linear time trend with 3 time points
# (won't work in general)
chart.long <- chart.long %>%
mutate(year0809 = ifelse(year08==1, 1, 0),
year0810 = ifelse(year08==2, 1, 0))
# tiny bit better than quadratic but same story:
# slope of 0.2 in 0809 but 2.5-0.2=2.3 in 0910
model.b6 <- lmer(MathAvgScore~ year0809 + year0810 +
(1|schoolid), REML=T, data=chart.long)
summary(model.b6)
AIC(model.b6)
BIC(model.b6)
```
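One way to code this piecewise structure with three time points is with indicator variables for 2009 and 2010: the coefficient on the 2009 indicator captures the 2008-to-2009 change, and the difference between the two coefficients captures the 2009-to-2010 change. A minimal sketch of such a fit, assuming `chart.long` is loaded as before:
```{r, comment=NA}
# Piecewise (separate-slope) time trend coded with year indicators
chart.long <- chart.long %>%
  mutate(year0809 = ifelse(year08 == 1, 1, 0),   # 1 only in 2009
         year0810 = ifelse(year08 == 2, 1, 0))   # 1 only in 2010
model.b6 <- lmer(MathAvgScore ~ year0809 + year0810 + (1 | schoolid),
                 REML = T, data = chart.long)
fixef(model.b6)
```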
The performance of this model is very similar to that of the quadratic growth model by AIC and BIC measures, and the fixed effects estimates tell much the same story. While the mean yearly increase in math scores was 0.2 points between 2008 and 2009, it was 2.3 points between 2009 and 2010.
Despite the good performances of the quadratic growth and piecewise linear models on our three-year window of data, we will continue to use linear growth assumptions in the remainder of this chapter. Not only is a linear model easier to interpret and explain, but it is probably a more reasonable assumption in years beyond 2010. Predicting future performance is riskier if we assume that a steep one-year rise or a nonlinear trend will continue than if we use the average increase over the two-year period.
## Building to a Final Model {#finalmodel}
### Uncontrolled Effects of School Type {#sec:modelc9}
Initially, we can consider whether there are significant differences in individual school growth parameters (intercepts and slopes) based on school type. From a modeling perspective, we would build a system of two Level Two models:
\begin{align*}
a_{i} & = \alpha_{0} + \alpha_{1}\textrm{Charter}_i + u_{i} \\
b_{i} & = \beta_{0} + \beta_{1}\textrm{Charter}_i + v_{i}
\end{align*}
where $\textrm{Charter}_i=1$ if School $i$ is a charter school, and $\textrm{Charter}_i=0$ if School $i$ is a non-charter public school. In addition, the error terms at Level Two are assumed to follow a multivariate normal distribution:
\[ \left[ \begin{array}{c}
u_{i} \\ v_{i}
\end{array} \right] \sim N \left( \left[
\begin{array}{c}
0 \\ 0
\end{array} \right], \left[
\begin{array}{cc}
\sigma_{u}^{2} & \\
\sigma_{uv} & \sigma_{v}^{2}
\end{array} \right] \right) . \]
With a binary predictor at Level Two such as school type, we can write out what our Level Two model looks like for public non-charter schools and charter schools.
- Public schools
\begin{align*}
a_{i} & = \alpha_{0} + u_{i}\\
b_{i} & = \beta_{0} + v_{i},
\end{align*}
- Charter schools
\begin{align*}
a_{i} & = (\alpha_{0} + \alpha_{1}) + u_{i}\\
b_{i} & = (\beta_{0}+ \beta_{1}) + v_{i}
\end{align*}
Writing the Level Two model in this manner helps us interpret the model parameters from our two-level model. We can use statistical software (such as the `lmer()` function from the `lme4` package in R) to obtain parameter estimates using our $1733$ observations, after first converting our Level One and Level Two models into a composite model (Model C) with fixed effects and random effects separated:
\begin{align*}
Y_{ij} & = a_{i} + b_{i}\textrm{Year08}_{ij}+ \epsilon_{ij} \\
& = (\alpha_{0} + \alpha_{1}\textrm{Charter}_i +u_{i}) + (\beta_{0} + \beta_{1}\textrm{Charter}_i + v_{i})\textrm{Year08}_{ij} + \epsilon_{ij} \\
& = [\alpha_{0} + \beta_{0}\textrm{Year08}_{ij} +\alpha_{1}\textrm{Charter}_i+ \beta_{1}\textrm{Charter}_i\textrm{Year08}_{ij}] + \\
& \quad [u_{i} + v_{i}\textrm{Year08}_{ij} + \epsilon_{ij}]
\end{align*}
```{r, comment=NA}
#Model C (uncontrolled effects of school type on
# intercept and slope)
model.c <- lmer(MathAvgScore~ charter + year08 +
charter:year08 + (year08|schoolid),
REML=T, data=chart.long)
```
```{r, echo=FALSE, message=FALSE}
VCrandom <- VarCorr(model.c)
print(VCrandom, comp = c("Variance", "Std.Dev."))
cat(" Number of Level Two groups = ",
summary(model.c)$ngrps)
coef(summary(model.c))
cat(" AIC = ", AIC(model.c), "; BIC = ", BIC(model.c))
```
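Before interpreting individual coefficients, it can help to combine the fixed effects into fitted 2008 means and yearly growth rates for each school type; a brief sketch, assuming `model.c` has been fit as above and `charter` is coded 0/1 as described:
```{r, comment=NA}
# Fitted 2008 mean and yearly growth rate by school type from the fixed effects
# (assumes charter enters the model as a 0/1 indicator)
fe <- fixef(model.c)
unname(c(fe["(Intercept)"], fe["year08"]))             # non-charter public schools
unname(c(fe["(Intercept)"] + fe["charter"],
         fe["year08"] + fe["charter:year08"]))         # charter schools
```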
Armed with our parameter estimates, we can offer concrete interpretations:
- Fixed effects:
- $\hat{\alpha}_{0} = 652.1.$ The estimated mean test score for 2008 for non-charter public schools is 652.1.
- $\hat{\alpha}_{1}= -6.02.$ Charter schools have an estimated test score in 2008 which is 6.02 points lower than public non-charter schools.
- $\hat{\beta}_{0}= 1.20.$ Public non-charter schools have an estimated mean increase in test scores of 1.20 points per year.
- $\hat{\beta}_{1}= 0.86.$ Charter schools have an estimated mean increase in test scores of 2.06 points per year over the three-year observation period, 0.86 points higher than the mean yearly increase among public non-charter schools.
- Variance components:
- $\hat{\sigma}_u= 5.99.$ The estimated standard deviation of 2008 test scores is 5.99 points, after controlling for school type.