# Analysis {#analysis}
## Working with survey data
We'll use survey [data](https://osf.io/79t2k/) from a study of an approach to engaging Christians on the issue of global warming. Using these data allows us to compare our results with those of the original study. In this exercise, we'll replicate the application of sampling weights and the standardization of variables, two common treatments of raw survey data.
```{r fix-raster-dplyr-conflicts, include=F}
detach('package:raster', unload=TRUE)
```
```{r survey-data-1, echo=F, warning=F, message=F}
library(ggplot2)
library(gganimate)
library(haven)
library(stringr)
library(rvest)
library(survey)
library(dplyr)
options(scipen=999)
rm(list=ls())
data <- read_sav('https://files.osf.io/v1/resources/79t2k/providers/osfstorage/5c6c390e82a3950018c77ec2?action=download&direct&version=1')
# getting raw data
data <- data %>% select(-starts_with('Z')) %>% filter(`filter_$`==1)
# getting subset of variables
data.pre <- data %>% select(ends_with('_pre'))
data.post <- data %>% select(ends_with('_post'))
data.others <- data %>% select(-ends_with('_pre'), -ends_with('_post'))
selected.cols <- c(names(data.others)[c(1,2,4,10,11,21,13)], names(data.pre)[1:3], names(data.post)[1:3])
data <- data %>% select(one_of(selected.cols))
cat('Selected variables')
data.variables <- cbind(
names(data), sapply(data, function(x)attr(x, 'label'))
) %>% as_tibble()
data.variables$V2 <- format(data.variables$V2 , justify = "left")
data.variables %>% print()
```
### Variable Standardization
Standardizing variables is typically done before running regressions or other statistical analyses on the data. It consists of rescaling each variable with the z-score formula below, so that the standardized variable has a mean of 0 and a standard deviation of 1.
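For a variable $x$ with sample mean $\bar{x}$ and sample standard deviation $s_x$, the z-score of observation $i$ is:

$$z_i = \frac{x_i - \bar{x}}{s_x}$$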
```{r survey-data-2, warning=F, message=F}
# z-score formula
get.z.score <- function(X){
(X-mean(X, na.rm=T))/sd(X, na.rm=T)
}
# standardizing the pretest and posttest variables
Zdata <- data %>% mutate(across(ends_with('_pre')|ends_with('_post'), get.z.score, .names='Z{col}'))
Zdata
```
### Sample weighting
Per Josep Espasa Reig's [book](https://bookdown.org/jespasareig/Book_How_to_weight_a_survey/introduction.html), there are four steps in survey weighting:
- Base/design weights: if participants have different probabilities of being sampled;
- Non-response weights: if some participants didn't respond to the survey;
- Use of auxiliary data/calibration: adjusting weights to the total population;
- Analysis of weight variability/trimming: checking the variability of the computed weights.
In this example, we will estimate the percentage of the US Christian population who believe global warming (GW) is happening or who are worried about GW. We will describe the distribution of these two variables and then perform a simple extrapolation to compute totals for the whole US Christian population.
Our variables of interest are:
- **GWHap**: On a scale from 1 to 7, how strongly do you believe that global warming is or is not happening?
- **GWworry**: How worried are you about global warming?
#### Step 1: Design weights
In the [paper](https://www.researchgate.net/publication/334298965), it is mentioned that participants were recruited via Prime Panels, and only those who had previously identified as Christians were selected. It is reasonable to assume that each participant had an equal probability of being sampled; therefore, _we can skip Step 1 of the weighting_. If we had to compute the design weight $d_i$, it would be equal to the inverse of the probability of being sampled: $1/p_i$. Per Josep Espasa Reig's [book](https://bookdown.org/jespasareig/Book_How_to_weight_a_survey/introduction.html), design weights can be interpreted as "the number of units in our population that each unit in our sample represents. [...] The sum of all design weights should be equal to the total number of units in our population". The toy sketch below illustrates this.
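A minimal sketch, using purely hypothetical inclusion probabilities (not taken from the study), of how design weights would be computed:
```{r design-weight-sketch}
# Toy example: design weights as the inverse of (hypothetical) inclusion probabilities
p_i <- c(0.001, 0.002, 0.001)  # probability that each sampled unit was selected
d_i <- 1 / p_i                 # design weights
sum(d_i)                       # approximates the number of population units represented
```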
#### Step 2: Non-response weights
The paper also doesn't mention any information about non-respondents, so we will assume that there is no non-response bias and _we can skip Step 2_. In short, the goal of Step 2 is to account for differences in the propensity to respond among the people invited to the survey. For example, people from certain neighborhoods or income levels may have lower response rates, so the responses we do have would over-represent specific demographic groups and bias the analysis.
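If non-response weights were needed, a common approach is to model the probability of responding and weight by its inverse. A minimal sketch on simulated data (the variables and model are illustrative, not the study's):
```{r nonresponse-weight-sketch}
# Toy example: inverse-propensity non-response weights on simulated data
set.seed(42)
frame.toy <- data.frame(
  age = sample(18:80, 200, replace = TRUE),
  responded = rbinom(200, 1, 0.7)
)
resp.model <- glm(responded ~ age, family = binomial, data = frame.toy)
nr.weight <- 1 / predict(resp.model, type = 'response')  # inverse of estimated response propensity
summary(nr.weight)
```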
#### Step 3: Using auxiliary data for weight calibration
The goal here is to calibrate the survey data to the general population, so we need information on both the survey respondents and the population. In this case, we will use demographic variables (gender, age, and race). One of the most common methods for calibration weighting is _raking_, which we will use in this example; the toy sketch below illustrates the idea.
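Raking (iterative proportional fitting) repeatedly rescales the weights so that each weighted margin matches its population total. A toy illustration on made-up margins (the real calibration further below uses the `survey` package):
```{r raking-toy-sketch}
# Toy raking: alternate between matching the gender margin and the age margin
toy <- data.frame(gender = c('F', 'F', 'M', 'M'),
                  age    = c('young', 'old', 'young', 'old'),
                  w      = 1)
pop.gender <- c(F = 60, M = 40)        # made-up population totals by gender
pop.age    <- c(young = 55, old = 45)  # made-up population totals by age
for (i in 1:10) {
  toy$w <- toy$w * pop.gender[toy$gender] / tapply(toy$w, toy$gender, sum)[toy$gender]
  toy$w <- toy$w * pop.age[toy$age]       / tapply(toy$w, toy$age, sum)[toy$age]
}
toy  # weighted margins now match both sets of population totals
```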
As auxiliary data we'll use the [Pew datasets](https://www.pewforum.org/data/), which provide statistics on the US Christian population. The weighting below uses the downloaded microdata (`data/pew-data.sav`); a helper for scraping summary tables directly from the Pew website is also defined at the end of this section.
```{r survey-data-3, warning=F, message=F}
# PEW data
# getting only the christian respondent and their weights
pew.data <- read_sav('data/pew-data.sav') %>%
select(RELTRAD, agerec, SEX, racethn, WEIGHT) %>%
mutate(across(where(is.labelled), as_factor)) %>%
mutate(across(where(is.factor), as.character)) %>%
mutate(
religion = RELTRAD %>% recode(
`Evangelical Protestant Tradition`='christian',
`Mainline Protestant Tradition`='christian',
`Historically Black Protestant Tradition`='christian',
`Catholic`='christian',
`Mormon`='christian',
`Orthodox Christian`='christian',
`Jehovah's Witness`='christian',
`Other Christian`='christian',
`Jewish`='other',
`Muslim`='other',
`Buddhist`='other',
`Hindu`='other',
`Other World Religions`='other',
`Other Faiths`='other',
`Unaffiliated (religious "nones")`='other',
`Don't know/refused - no information on religious identity`='other'
),
pew.age = agerec %>% recode(
`Age 24 or younger`='18-29',
`Age 25-29`='18-29',
`30-34`='30-49',
`35-39`='30-49',
`40-44`='30-49',
`45-49`='30-49',
`50-54`='50-64',
`55-59`='50-64',
`60-64`='50-64',
`65-69`='65+',
`70-74`='65+',
`75-79`='65+',
`80-84`='65+',
`85-89`='65+',
`Age 90 or older`='65+',
`Don't know/refused`='NA'
),
pew.race = racethn %>% recode(
`White non-Hispanic`='White',
`Black non-Hispanic`='Black',
`Hispanic`='Hispanic',
`Other`='Other',
`Don’t know/Refused (VOL.)`='NA'
)
) %>%
filter(religion=='christian') %>%
select(SEX, pew.age, pew.race, WEIGHT) %>%
na.omit()
names(pew.data)[1] <- 'pew.gender'
head(pew.data,10)
pew.gender_age <- pew.data %>%
select(pew.age, pew.gender, WEIGHT) %>%
group_by(pew.gender, pew.age) %>% summarise(n=sum(WEIGHT)) %>%
tidyr::pivot_wider(names_from=pew.age, values_from=n) %>%
select(-`NA`)
pew.race <- pew.data %>% select(pew.race, WEIGHT) %>%
group_by(pew.race) %>% summarise(n=sum(WEIGHT)) %>%
filter(pew.race!='NA')
```
Now we'll recode our age and race variables to match the PEW data.
```{r survey-data-4, warning=F, message=F}
# interaction between gender and age
our_data.recoded <- data %>%
select(ex_ppage, ex_ppgender, raceCat) %>%
mutate(across(where(is.labelled), as_factor)) %>%
mutate(
age_group=cut(ex_ppage %>% as.numeric(), breaks=c(18, 30, 50, 65, 100), right=F)
) %>%
mutate(across(where(is.factor), as.character)) %>%
mutate(
age_group=age_group %>% recode(
`[18,30)`='18-29',
`[30,50)`='30-49',
`[50,65)`='50-64',
`[65,100)`='65+'
),
race_cat=raceCat %>% recode(
`White Non-hispanic`='White',
`Black Non-hispanic`='Black',
`Hispanic`='Hispanic',
`Other Non-hispanic`='Other',
`Two+ races non-hispanic`='Other'
)
) %>%
tidyr::unite(col=gender_age, ex_ppgender, age_group, remove=F) %>%
mutate(gender_age=replace(x=gender_age, list=gender_age %in% c('Female_NA', 'Male_NA'), values=NA)) %>%
na.omit()
# number of participants/weights by units
# this would be the sum of the final weights from Step 1 and 2 if they were necessary
our_data.gender_age <- our_data.recoded %>%
group_by(age_group, ex_ppgender) %>%
summarise(n = n()) %>%
tidyr::pivot_wider(names_from=age_group, values_from=n)
our_data.race <- our_data.recoded %>%
group_by(race_cat) %>%
summarise(n = n())
rm(list=setdiff(ls(), c('data', 'data.variables', 'Zdata', 'pew.data', 'pew.gender_age', 'pew.race', 'our_data.recoded', 'our_data.gender_age', 'our_data.race')))
```
Now, we'll scale the PEW data to our sample size.
```{r survey-data-5, warning=F, message=F}
# our total (weighted) observations
our_data.wgt <- nrow(our_data.recoded)
# scaling PEW to our data
# for age and gender
pew.gender_age.scaled <- pew.gender_age %>%
tidyr::pivot_longer(-pew.gender, names_to='pew.age', values_to='WEIGHT') %>%
tidyr::unite(col=gender_age, pew.gender, pew.age)
total.pew <- pew.gender_age.scaled$WEIGHT %>% sum()
pew.gender_age.scaled <- pew.gender_age.scaled %>%
mutate(Freq = round(WEIGHT/total.pew * our_data.wgt, 0) ) %>%
select(-WEIGHT)
pew.gender_age.scaled
# for race
pew.race.scaled <- pew.race %>%
mutate(
Freq = round(n/total.pew * our_data.wgt, 0),
race_cat=pew.race
) %>%
select(-n, -pew.race)
pew.race.scaled
```
**Implementing calibration**
We'll follow the procedure explained in this [book](https://bookdown.org/jespasareig/Book_How_to_weight_a_survey/calibration.html), which uses the `survey` package.
```{r survey-data-6, warning=F, message=F}
# No weights or probabilities supplied, assuming equal probability
our.svydesign <- svydesign(ids = ~ 0, data=our_data.recoded)
# the variable used for calibration are: gender/age interaction and race
our.raked <- rake(our.svydesign, sample.margins = list(~race_cat, ~gender_age), population = list(pew.race.scaled, pew.gender_age.scaled))
# collecting the weights
raked.weight <- our.raked$postStrata[[1]][[1]] %>% attributes() %>% .[["weights"]]
weighted_data <- our_data.recoded
weighted_data$raked.weight <- raked.weight
head(weighted_data, 10)
```
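With the raked design in hand, the weighted estimates described earlier can be obtained with `svymean()`/`svytotal()` from the `survey` package. This is only a sketch (not run): the belief and worry items would first have to be joined back onto the data behind `our_data.recoded`, and the variable names and cutoff used here are assumptions.
```{r survey-data-estimates, eval=F}
# Not run: assumes the GW belief/worry items (names assumed) were kept alongside the calibration variables
svymean(~GWHap + GWworry, design = our.raked, na.rm = TRUE)   # weighted means
svytotal(~I(GWHap >= 5), design = our.raked, na.rm = TRUE)    # weighted count of believers (cutoff assumed)
```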
```{r survey-data-7, include=F}
# scraping data from PEW
url = 'https://www.pewforum.org/religious-landscape-study/christians/christian/'
pew.page <- read_html(url)
get_pew_table <- function(page, table_i){
header <- html_text(html_nodes(page, "table.highchart thead.right-aligned")[table_i])
content <- html_text(html_nodes(page, "table.highchart tbody")[table_i])
h.rm.t <- gsub('\\t+',';',header); h <- strsplit(gsub('\\n+','',h.rm.t), ';')[[1]]
c.rm.t <- gsub('\\t+',';',content); c <- strsplit(gsub('\\n+','',c.rm.t), ';')[[1]]
hl <- length(h); cl <- length(c); r <- cl/hl;
c.split <- split(c, rep(1:r, rep(hl, r)))
pew.table <- as_tibble(do.call(rbind, c.split)); names(pew.table) <- h
pew.table <- pew.table %>%
    mutate(across(everything(), ~ str_replace_all(., '%|,', ''))) %>%
    mutate(across(everything(), as.integer)) %>%
    mutate(across(everything(), ~ . * .01 * 320e6 * .706))  # percent -> counts: ~320M US population x ~70.6% Christian share
return(pew.table)
}
```
## Working with COVID-19 data
In this example, we'll use daily updated data from the [JHU Covid Resource Center](https://coronavirus.jhu.edu/data) to create animated time series of coronavirus cases and deaths in the US.
The data is available in [their GitHub](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data). We'll use the daily reports for the US from 04-12-2020 to 06-22-2020.
```{r covid-1, message=F, warning=F}
repo.url <- 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports_us/'
# importing all csv
us.covid.files <- list()
date.format <- '%m-%d-%Y'
dates <- format(seq(as.Date('2020/04/12'), as.Date('2020/06/22'), by='day'), date.format)
i <- 1
for (date in dates){
us.covid.files[[i]] <- read.csv(url(paste0(repo.url, date, '.csv'))) %>%
select(Province_State, Confirmed, Deaths, Recovered, Mortality_Rate) %>%
mutate(Date=as.Date(date, date.format), Confirmed.rank=rank(-Confirmed),
Deaths.rank=rank(-Deaths), Mortality.rank=rank(-Mortality_Rate))
i <- i+1
}
head(us.covid.files[[1]])
us.covid.df <- do.call(rbind, us.covid.files)
```
```{r covid-2, message=F, warning=F}
# animated ranked bar chart by state
last.df <- us.covid.files[[length(us.covid.files)]]
us.covid.plot <- ggplot(us.covid.df %>% filter(Confirmed.rank <=10),
aes(y=Confirmed.rank, group=Province_State, fill=Province_State)) +
geom_tile(aes(x=Confirmed/2, width=Confirmed, height=0.9), alpha=0.8, color=NA) +
geom_text(aes(x=0, label=paste(Province_State, ' ')), hjust=1, vjust=1) +
geom_text(aes(x=Confirmed, label=paste0(' ', round(Confirmed/1000), 'k')), hjust=0) +
scale_y_reverse() + coord_cartesian(clip='off') +
theme_bw() +
# custom ggtheme
theme(legend.position='none', axis.line=element_blank(),
axis.title=element_blank(), axis.ticks=element_blank(),
axis.text=element_blank(), panel.grid.major.y=element_blank(),
panel.grid.minor.y=element_blank(), panel.border=element_blank(),
plot.margin = margin(2,1,2,3,'cm'))
us.covid.anim <- us.covid.plot + transition_states(Date, transition_length=4, state_length=1) +
view_follow(fixed_x=T) +
labs(title = 'Daily coronavirus cases : {closest_state}',
subtitle = 'Top 10 States',
caption ='Data Source: Johns Hopkins Coronavirus Resource Center')
# animate(us.covid.anim, nframes=400, fps=10, renderer=gifski_renderer('images/us-covid-rank.gif'))
```
![us-covid-rank](images/us-covid-rank.gif)
The previous animated plot shows the evolution of the number of confirmed cases; switching to other variables such as `Mortality_Rate` or `Deaths` follows the same steps.
If instead we want to show how cases evolve over space, we can map the data.
```{r covid-3, message=F, warning=F}
library(sf)
# first we'll join with a geometry feature of the US
us.states.geo <- read_sf('data/us-states/us-states.shp')
# adding geometry
us.covid.sf <- us.states.geo %>% right_join(us.covid.df, by=c('state_name'='Province_State'))
```
```{r covid-4, message=F, warning=F}
# mapping
max_val <- max(us.covid.df$Mortality_Rate, na.rm=T)
mid_val <- mean(us.covid.df$Mortality_Rate, na.rm=T)
map <- ggplot(us.covid.sf) +
geom_sf(aes(fill=Mortality_Rate), size=.1) +
scale_fill_gradient2(high='#6E2C00', mid='#FDFEFE', low='#28B463', midpoint=mid_val, limits = c(0,max_val)) +
coord_sf(crs=5070) + theme_bw() +
# custom ggtheme
theme(legend.position='bottom', axis.line=element_blank(),
axis.title=element_blank(), axis.ticks=element_blank(),
axis.text=element_blank(), panel.grid.major.y=element_blank(),
panel.grid.minor.y=element_blank(), panel.border=element_blank(),
plot.margin = margin(1,1,1,1,'cm'))
map.anim <- map + transition_states(Date, transition_length=4, state_length=1) +
labs(title = 'Coronavirus mortality rate : {closest_state}',
caption ='Data Source: Johns Hopkins Coronavirus Resource Center')
# animate(map.anim, nframes=100, fps=5, renderer=gifski_renderer('images/us-covid-map.gif'))
```
![us-covid-map](images/us-covid-map.gif)