-
Notifications
You must be signed in to change notification settings - Fork 2
/
06-covariance.Rmd
166 lines (121 loc) · 7.03 KB
/
06-covariance.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
---
output:
pdf_document: default
html_document: default
---
# Covariance Matrix
The current `convey` software does not account for the covariance across groups for linearization-based variance estimators (as the objects created using `svydesign`). We acknowledge that, under ideal circumstances, the covariance would be properly accounted for in the software. Modifying our software to account for the covariance using linearization remains a future goal, but users can readily use resampling-based variance estimation methods for such purpose. Transforming a `svydesign` object to a `svrepdesign` with `survey::as.svrepdesign` might be a useful approach around this limitation.
Accounting for the covariance between estimates is particularly important for inference regarding net changes, for instance.
Many countries use rotating panel schemes with overlapping samples for their labour surveys,
meaning that that part of the sample is interviewed again in the next round.
For instance, the PNADC and the Basic Monthly CPS both use a rotating panel scheme, where a household is interviewed a number of times before dropping from the sample.
For net changes, the overlapping samples tend to produce positive covariances over time, so that accounting for
the covariance between the estimates can produce improved inferences through less conservative confidence intervals.
The practical implications can be made clear by studying the variance of the difference.
Consider the two estimates $\widehat{T}_1$ and $\widehat{T}_2$, where we are interested in making inferences about $T_1 - T_2$.
We can estimate this difference using $\widehat{T}_1 - \widehat{T}_2$.
Put simply, the variance of this difference is given by
$$
Var \big( \widehat{T}_1 - \widehat{T}_2 \big) =
Var \big( \widehat{T}_1 \big) + Var \big( \widehat{T}_2 \big) - 2 Cov \big( \widehat{T}_1 , \widehat{T}_2 \big)
$$
\noindent where: $Var \big( \widehat{T}_1 \big)$ and $Var \big( \widehat{T}_2 \big)$
are the variances of the estimators $\widehat{T}_1$ and $\widehat{T}_2$;
$Cov \big( \widehat{T}_1 , \widehat{T}_2 \big)$ is the covariance of these estimators.
Usually, the estimators are assumed to be independent and the covariance term can be ignored. But, when the covariance
is strongly positive, this results in a overly conservative variance estimator for the difference.^[The covariance can also be negative; then, the variance of the difference would be larger than the sum of the variances. While theoretically possible, we have not observed it in practice yet.]
## Using the `survey` package
Influence functions and "resampling" replicates can be used to improve the inference
about differences and changes between estimates.
The `survey` package already provides an
approach for estimating the variance-covariance matrix using the `svyby` function.
Based on the `?survey::svyby` examples, we have:
```{r}
# load the survey library
library(survey)
# load data set
data( api )
# declare sampling design
dclus1 <- svydesign( id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc )
# estimate means
mns <-svyby(~api99, ~stype, dclus1, svymean,covmat=TRUE)
# collect variance-covariance matrix of estimates
( m <- vcov( mns ) )
# compute variance terms
var.naive <- sum( diag( m[c(1,3),c(1,3)] ) )
cov.term <- sum( diag( m[ c(1,3),c(3,1)] ) )
# "naive" SE of the difference
sqrt( var.naive )
# SE of the difference
sqrt( var.naive - cov.term )
#... or using svycontrast
svycontrast( mns , c(E = 1, M = -1) )
```
Notice that, because the covariance terms are positive, the (actual) variance of
the difference is smaller than the "naive" variance.
A similar idea can be implemented with other estimators, such as inequality and poverty measures.
However, this has not yet been implemented for linearization/influence function methods.
In the next section, we show an example with the Gini index.
## Example Calculation
The current `convey` software does not account for the covariance across `svyby` groups for linearization-based estimation. The example below describes this calculation and presents what is, in this case, a limited impact on the final error terms.
We start by showing what has been implemented --- the resampling based approach:
```{r}
# load the convey package
library(convey)
# load the survey library
library(survey)
# load the laeken library
library(laeken)
# load the synthetic EU statistics on income & living conditions
data(eusilc)
# make all column names lowercase
names(eusilc) <- tolower(names(eusilc))
# construct a survey.design
# using our recommended setup
des_eusilc <-
svydesign(
ids = ~ rb030 ,
strata = ~ db040 ,
weights = ~ rb050 ,
data = eusilc
)
# create bootstrap-based design object
des_eusilc_rep <- as.svrepdesign( des_eusilc , type = "bootstrap" )
# immediately run the convey_prep function on it
des_eusilc <- convey_prep( des_eusilc )
des_eusilc_rep <- convey_prep(des_eusilc_rep)
# estimate gini by gender, with and without covariance matrices
ginis.rep <- svyby( ~ py010n , ~rb090 , des_eusilc_rep , svygini , na.rm = TRUE , covmat = TRUE )
ginis.rep.nocov <- svyby( ~ py010n , ~rb090 , des_eusilc_rep , svygini , na.rm = TRUE , covmat = FALSE )
# variance of the gini difference: naive
svycontrast( ginis.rep.nocov , quote( `male` - `female` ) )
# variance of the gini difference: accounting for covariance
svycontrast( ginis.rep , quote( `male` - `female` ) )
```
Notice that the warning `Only diagonal elements of vcov() available` means that
the covariance terms are null for the `vcov( ginis.rep.nocov )` result.
We highlight that, for net changes under rotating panels, the resampling procedures
should be modified to account for the covariance induced by the overlapping samples.
An `R` based solution is presented by @cillia2021.
While not implemented yet, a linearization-based approach can be applied using:
```{r}
# estimate gini by gender, with and without covariance matrices
ginis.lin.nocov <- svyby( ~ py010n , ~rb090 , des_eusilc , svygini , na.rm = TRUE , covmat = FALSE )
gini.female <- svygini( ~ py010n , subset( des_eusilc , rb090 == "female" ), na.rm = TRUE , influence = TRUE )
gini.male <- svygini( ~ py010n , subset( des_eusilc , rb090 == "male" ), na.rm = TRUE , influence = TRUE )
# estimate full variance-covariance matrix
linmat <- rep( 0 , nrow( des_eusilc$variables ) )
linmat[ attr( gini.female , "index" ) ] <- attr( gini.female , "influence" )
linmat[ attr( gini.male , "index" ) ] <- attr( gini.male , "influence" )
linmat <- linmat * des_eusilc$prob
lintot <- svyby( linmat , ~rb090 , des_eusilc , svytotal , covmat = TRUE )
# compare the variance-covariance matrices
vcov( ginis.lin.nocov ) # naive, with null covariance terms
vcov( lintot ) # full, with non-null covariance terms
# SE of the gini difference: naive
SE( svycontrast( ginis.lin.nocov , quote( `male` - `female` ) ) )
# SE of the gini difference: accounting for covariance
SE( svycontrast( lintot , quote( `male` - `female` ) ) )
```
Notice that the warning `Only diagonal elements of vcov() available` means that
the covariance terms are null for the `vcov( ginis.lin.nocov )` result.