-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcovid-19-rainier.Rmd
461 lines (339 loc) · 16 KB
/
covid-19-rainier.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
---
title: "Exploration of COVID-19 tracking data from multiple resources"
author: "Wei Sun"
date: "`r Sys.Date()`"
output:
pdf_document:
toc: yes
toc_depth: '3'
html_document:
highlight: tango
number_sections: yes
theme: journal
toc: yes
toc_depth: 3
toc_float:
collapsed: yes
smooth_scroll: no
editor_options:
chunk_output_type: console
---
# Introduction
Coronavirus disease 2019 (COVID-19) is an infectious disease caused by a new type of coronavirus: severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The outbreak first started in Wuhan, China in December 2019. The first kown case of COVID-19 in the U.S. was confirmed on January 20, 2020, in a 35-year-old man who teturned to Washington State on January 15 after traveling to Wuhan. Starting around the end of Feburary, evidence emerge for community spread in the US.
We, as all of us, are indebted to the heros who fight COVID-19 across the whole world in different ways. For this data exploration, I am grateful to many data science groups who have collected detailed COVID-19 outbreak data, including the number of tests, confirmed cases, and deaths, across countries/regions, states/provnices (administrative division level 1, or admin1), and counties (admin2). Specifically, I used the data from these three resources:
* JHU (https://coronavirus.jhu.edu/)
+ The Center for Systems Science and Engineering (CSSE) at John Hopkins University.
+ World-wide counts of coronavirus cases, deaths, and recovered ones.
+ https://github.com/CSSEGISandData/COVID-19
* NY Times (https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html)
+ The New York Times
+ ``cumulative counts of coronavirus cases in the United States, at the state and county level, over time''
+ https://github.com/nytimes/covid-19-data
* COVID Trackng (https://covidtracking.com/)
+ COVID Tracking Project
+ ``collects information from 50 US states, the District of Columbia, and 5 other US territories to provide the most comprehensive testing data''
+ https://github.com/COVID19Tracking/covid-tracking-data
```{r setup, include=FALSE, warning = FALSE, message = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(ggpubr)
theme_set(theme_bw())
library(httr)
library(RColorBrewer)
```
# JHU
Assume you have cloned the JHU Github repository on your local machine at ``../COVID-19''.
## time series data
The time series provide counts (e.g., confirmed cases, deaths) starting from Jan 22nd, 2020 for 253 locations. Currently there is no data of individual US state in these time series data files.
```{r eval = TRUE, echo = FALSE, results="hide"}
dir0 = "../COVID-19/csse_covid_19_data"
dir1 = file.path(dir0, "csse_covid_19_time_series")
cases = read.csv(file.path(dir1, "time_series_covid19_confirmed_global.csv"),
as.is=TRUE)
death = read.csv(file.path(dir1, "time_series_covid19_deaths_global.csv"),
as.is=TRUE)
dim(cases)
cases[1:2,c(1:6,ncol(cases))]
dim(death)
death[1:2,c(1:6,ncol(cases))]
cases[which(cases$Country.Region=="US"),c(1:6,ncol(cases))]
death[which(cases$Country.Region=="US"),c(1:6,ncol(cases))]
table(death$Country.Region == cases$Country.Region)
cases$Region = cases$Country.Region
w2 = which(cases$Province.State != "")
cases$Region[w2] = paste(cases$Region[w2], cases$Province.State[w2], sep="-")
death$Region = death$Country.Region
w2 = which(death$Province.State != "")
death$Region[w2] = paste(death$Region[w2], death$Province.State[w2], sep="-")
```
Here is the list of 10 records with the largest number of cases or deaths on the most recent date.
```{r echo = FALSE, warning = FALSE, message = FALSE, fig.asp = 0.5}
od1 = order(cases[,ncol(cases)-1], decreasing=TRUE)
d10 = cases[od1[1:10],]
current.date = names(d10)[ncol(d10)-1]
d10$cases = d10[[current.date]]
current.date = gsub("^X", "", current.date)
d10$Region = factor(d10$Region, levels=d10$Region)
p = ggplot(d10, aes(x=Region, y=cases, fill=Region)) +
geom_bar(stat="identity") + coord_flip() +
ggtitle(sprintf("cumulative number of cases on %s", current.date))
p + theme(legend.position = "none")
```
```{r echo = FALSE, warning = FALSE, message = FALSE, fig.asp = 0.5}
od1 = order(death[,ncol(death)-1], decreasing=TRUE)
d10 = death[od1[1:10],]
current.date = names(d10)[ncol(d10)-1]
d10$death = d10[[current.date]]
current.date = gsub("^X", "", current.date)
d10$Region = factor(d10$Region, levels=d10$Region)
p = ggplot(d10, aes(x=Region, y=death, fill=Region)) +
geom_bar(stat="identity") + coord_flip() +
ggtitle(sprintf("cumulative number of deaths on %s", current.date))
p + theme(legend.position = "none")
```
Next, I check for each country/region, what is the number of new cases/deaths? This data is important to understand what is the trend under different situations, e.g., population density, social distance policies etc. Here I checked the top 10 countries/regions with the highest number of deaths.
```{r echo = FALSE, warning = FALSE, message = FALSE, fig.asp = 0.4}
od1 = order(death[,ncol(death)-1], decreasing=TRUE)
d10 = death[od1[1:10],]
c10 = cases[match(d10$Region, cases$Region),]
stopifnot(all(c10$Region == d10$Region))
for(i in 1:nrow(d10)){
d1 = as.numeric(d10[i,-c(1:4,ncol(d10))])
c1 = as.numeric(c10[i,-c(1:4,ncol(c10))])
date1 = gsub("^X", "", names(d10)[-c(1:4,ncol(d10))])
us = data.frame(date=date1, case=c1, death=d1)
us$date = gsub("\\.20$", "", us$date)
us$month = substr(us$date, 1, 1)
us$day = 1:nrow(us)
dim(us)
us[1:5,]
us = us[which(us$month >= 3),]
us$new.case = c(us$case[1], diff(us$case))
us$new.death = c(us$death[1], diff(us$death))
gcn = ggplot(us, aes(x=day, y=log10(new.case + 1))) +
geom_point(aes(color=month)) + geom_smooth(method=loess)
gdn = ggplot(us, aes(x=day, y=log10(new.death + 1))) +
geom_point(aes(color=month)) + geom_smooth(method=loess)
figure = ggarrange(gcn, gdn, ncol = 2, nrow = 1)
str1 = d10$Region[i]
str2 = "data source: https://github.com/CSSEGISandData/COVID-19"
figure = annotate_figure(figure,
top = text_grob(str1, color = "brown", face = "bold"),
bottom = text_grob(str2, color = "blue", hjust = 1, x = 1))
print(figure)
}
```
## daily reports data
The raw data from Hopkins are in the format of daily reports with one file per day. More recent files (since March 22nd) inlcude information from individual states of US or individual counties, as shown in the following figure. So I turn to NY Times data for informatoin of individual states or counties.
```{r echo = FALSE, warning = FALSE, message = FALSE, results="hide", fig.asp = 0.4}
dir1 = file.path(dir0, "csse_covid_19_daily_reports")
flist = list.files(".csv", path=dir1, full.names=TRUE)
length(flist)
flist[c(1:2, length(flist))]
dat.list = list()
for(f1 in flist){
day1 = gsub("\\.csv", "", basename(f1))
day1 = gsub("-2020", "", day1)
dat.list[[day1]] = read.csv(f1, as.is=TRUE)
names(dat.list[[day1]]) = gsub("\\.", "_", names(dat.list[[day1]]))
}
nr = sapply(dat.list, nrow)
nc = sapply(dat.list, ncol)
nr.df = data.frame(md=names(nr), records=as.numeric(nr))
nr.df$month = substring(nr.df$md, 1, 2)
nr.df$date = substring(nr.df$md, 4, 5)
nr.df$day = 1:nrow(nr.df)
p = ggplot(nr.df, aes(x=day, y=records, fill=month)) +
geom_bar(stat="identity") +
ggtitle(sprintf("number of records in Hopkins daily reports"))
str2 = "data source: https://github.com/CSSEGISandData/COVID-19, "
str2 = paste0(str2, "day 1 is 1/22/2020")
annotate_figure(p,
bottom = text_grob(str2, color = "blue",
hjust = 1, x = 1))
```
# NY Times
The data from NY Times are saved in two text files, one for state level information and the other one for county level information.
```{r echo = FALSE, warning = FALSE, message = FALSE, results="hide"}
dir1 = "../covid-19-data"
counties = read.csv(file.path(dir1, "us-counties.csv"), as.is=TRUE)
states = read.csv(file.path(dir1, "us-states.csv"), as.is=TRUE)
dim(counties)
counties[1:2,]
dim(states)
states[1:2,]
```
The currente date is
```{r echo=FALSE}
current.date = counties$date[nrow(counties)]
current.date
```
## state level data
First check the 30 states with the largest number of deaths.
```{r echo = FALSE, warning = FALSE, message = FALSE}
state20 = states[which(states$date==current.date),]
state20 = state20[order(state20$deaths, decreasing = TRUE)[1:30],]
print(state20)
```
For these 30 states, I check the number of new cases and the number of new deaths. Part of the reason for such checking is to identify whether there is any similarity on such patterns. For example, could you use the pattern seen from Italy to predict what happen in an individual state, and what are the similarities and differences across states.
```{r echo = FALSE, warning = FALSE, message = FALSE, fig.asp = 0.4}
for(k in 1:nrow(state20)){
st1 = state20$state[k]
w2check = which(states$state==st1)
df1 = states[w2check,]
df1$new.case = c(df1$cases[1], diff(df1$cases))
df1$new.death = c(df1$deaths[1], diff(df1$deaths))
df1$date = gsub("2020-", "", df1$date)
df1$month = substr(df1$date, 1, 2)
df1$day = 1:nrow(df1)
dim(df1)
df1[1:2,]
table(df1$month)
df1=df1[which(as.numeric(df1$month) >=3),]
gcn = ggplot(df1, aes(x=day, y=log10(new.case + 1))) +
geom_point(aes(color=month)) + geom_smooth(method=loess)
gdn = ggplot(df1, aes(x=day, y=log10(new.death + 1))) +
geom_point(aes(color=month)) + geom_smooth(method=loess)
figure = ggarrange(gcn, gdn, ncol = 2, nrow = 1)
str1 = sprintf("%s", st1)
str2 = "data source: https://github.com/nytimes/covid-19-data"
str3 = sprintf("%s, day 1 is %s", str2, df1$date[1])
figure = annotate_figure(figure, top = text_grob(str1, col="brown"),
bottom = text_grob(str3, color = "blue",
hjust = 1, x = 1, face = "italic"))
print(figure)
}
```
Next I check the relation between the **cumulative** number of cases and deaths for these 10 states, starting on March
```{r echo = FALSE, warning = FALSE, message = FALSE}
states$date = gsub("2020-", "", states$date)
states$month = as.numeric(substr(states$date, 1, 2))
s10 = states[which(states$month >= 3),]
s10 = s10[which(s10$state %in% state20$state),]
figure = ggplot(data=s10,
aes(x=log10(cases+1), y=log10(deaths+1), colour=state)) +
geom_line()
str2 = "data source: https://github.com/nytimes/covid-19-data"
annotate_figure(figure, bottom = text_grob(str2, color = "blue",
hjust = 1, x = 1))
```
## county level data
First check the 50 counties with the largest number of deaths.
```{r echo = FALSE, warning = FALSE, message = FALSE}
county20 = counties[which(counties$date==current.date),]
county20 = county20[order(county20$deaths, decreasing = TRUE)[1:50],]
print(county20)
```
For these 50 counties, I check the number of new cases and the number of new deaths.
```{r echo = FALSE, warning = FALSE, message = FALSE, fig.asp = 0.4, results="hide"}
for(k in 1:nrow(county20)){
ct1 = county20$county[k]
st1 = county20$state[k]
w2check = which(counties$county==ct1 & counties$state==st1)
df1 = counties[w2check,]
df1$new.case = c(df1$cases[1], diff(df1$cases))
df1$new.death = c(df1$deaths[1], diff(df1$deaths))
df1$date = gsub("2020-", "", df1$date)
df1$month = substr(df1$date, 1, 2)
df1$day = 1:nrow(df1)
dim(df1)
df1[1:2,]
table(df1$month)
df1=df1[which(as.numeric(df1$month) >=3),]
gcn = ggplot(df1, aes(x=day, y=log10(new.case + 1))) +
geom_point(aes(color=month)) + geom_smooth(method=loess)
gdn = ggplot(df1, aes(x=day, y=log10(new.death + 1))) +
geom_point(aes(color=month)) + geom_smooth(method=loess)
figure = ggarrange(gcn, gdn, ncol = 2, nrow = 1)
str1 = sprintf("%s_%s", ct1, st1)
str2 = "data source: https://github.com/nytimes/covid-19-data"
str3 = sprintf("%s, day 1 is %s", str2, df1$date[1])
figure = annotate_figure(figure, top = text_grob(str1, col="brown"),
bottom = text_grob(str3, color = "blue",
hjust = 1, x = 1))
print(figure)
}
```
# COVID Trackng
The positive rates of testing can be an indicator on how much the COVID-19 has spread. However, they can be much more noisy data since the negative testing resutls are often not reported and the tests are almost surely taken on a non-representative random sample of the population. The COVID traking project proides a grade per state: ``If you are calculating positive rates, it should only be with states that have an A grade. And be careful going back in time because almost all the states have changed their level of reporting at different times.'' (https://covidtracking.com/about-tracker/). The data are also availalbe for both counties and states, here I only look at state level data.
The grades of the states may change over timea and I strongly recommend checking their webiste before puting serious interpretation on the following plot.
```{r echo = FALSE, warning = FALSE, message = FALSE, fig.asp = 1, results="hide"}
data.dir = "../covid-tracking-data/data"
states2check = sort(c("WA", "TX", "NY", "NC", "FL", "GA",
"MI", "CA", "IL", "PA", "AZ", "OH"))
states = read.csv(file.path(data.dir, "states_daily_4pm_et.csv"))
# states = read.csv(file.path(data.dir, "daily.csv"))
dim(states)
states[1:2,]
table(states$state)
states$date = gsub("2020", "", states$date)
states$month = as.numeric(substring(states$date,1,2))
states$Month = as.factor(states$month)
table(states$month)
# check results starting at May
states = states[which(states$month >= 5),]
states$dayom = as.numeric(substring(states$date,3,4))
states$day = (states$month - 5)*31 + states$dayom # just for ranking
states[1:2,]
summary(states$day)
table(states$month)
states$positive.rate = states$positiveIncrease/states$totalTestResultsIncrease
summary(states$totalTestResultsIncrease)
summary(states$positiveIncrease)
states$positive.rate[which(states$positiveIncrease < 0)] = NA
states$positive.rate[which(states$totalTestResultsIncrease < 100)] = NA
summary(states$positive.rate)
states[which(states$positive.rate==1)[1:2],]
states$positive.rate[which(states$positive.rate > 0.9)] = NA
cbp1 <- c("#E69F00", "#56B4E9", "#009E73",
"#0072B2", "#D55E00", "#CC79A7")
cbp1 = brewer.pal(n = 8, name = "Dark2")
cbp1 = c(cbp1, brewer.pal(n = 4, name = "Set1"))
w2kp = which(states$state %in% states2check)
total = ggplot(states[w2kp,], aes(x=day, y=log10(totalTestResults),
color=state)) +
labs(y= "cumulative # of tests") +
geom_point() + geom_smooth(method=loess) +
scale_colour_manual(values=cbp1)
fig.list= list()
i = 0
for(si in states2check){
i = i + 1
w2kp = which(states$state %in% si)
fig.list[[i]] =
ggplot(states[w2kp,], aes(x=day, y=positive.rate, color=state)) +
geom_point() + geom_smooth(method=loess) +
labs(y= "positive rate") +
scale_colour_manual(values=cbp1[i])
}
summary(states$posNeg - states$totalTestResults)
summary(states$total - states$totalTestResults)
total
figure = ggarrange(fig.list[[1]], fig.list[[2]], fig.list[[3]],
fig.list[[4]], fig.list[[5]], fig.list[[6]],
ncol = 2, nrow = 3)
figure2 = ggarrange(fig.list[[7]], fig.list[[8]], fig.list[[9]],
fig.list[[10]], fig.list[[11]], fig.list[[12]],
ncol = 2, nrow = 3)
figure
figure2
# current.date = states$date[which(states$state=="WA")][1]
#
# str0 = "github.com/COVID19Tracking/"
# str1 = sprintf("%s, positive rate on %s:", str0, current.date)
#
# for(s1 in states2check){
# posR = states$positive.rate[which(states$state==s1)][1]
# str1 = sprintf("%s %.2f(%s)", str1, posR, s1)
# }
#
# str1
#
# annotate_figure(figure,
# bottom = text_grob(str1, color = "blue", hjust = 1,
# x = 1, face = "italic", size = 10))
```
# Session information
```{r}
sessionInfo()
```