# Exercise 1
The given table lists the volume, area, length, and maximum and mean depths of some Scottish lakes. Create vectors holding the lakes' names and all the parameters, and build a dataframe called `scottish.lakes` from them:
1. evaluate the highest and lowest volume and area lake;
2. order the frame with respect to the area and determine the two largest area lakes;
3. by summing up the areas occupied by the lakes, determine the area of Scotland covered by water.
## Solution
The first step is to create the dataframe `scottish.lakes`, which contains all the information in the table. There are two ways to do so: by rows or by columns. The column-wise method is more practical in this case, so it is the one used.
```{r lakes}
Loch <- c('Loch Ness', 'Loch Lomond', 'Loch Morar', 'Loch Tay', 'Loch Awe',
          'Loch Maree', 'Loch Ericht', 'Loch Lochy', 'Loch Rannoch',
          'Loch Shiel', 'Loch Katrine', 'Loch Arkaig', 'Loch Shin')
Volume <- c(7.45, 2.6, 2.3, 1.6, 1.2, 1.09, 1.08, 1.07, 0.97, 0.79, 0.77, 0.75, 0.35)
Area <- c(56, 71, 27, 26.4, 39, 28.6, 18.6, 16, 19, 19.5, 12.4, 16, 22.5)
Length <- c(39, 36, 18.8, 23, 41, 20, 23, 16, 15.7, 28, 12.9, 19.3, 27.8)
Max_depth <- c(230, 190, 310, 150, 94, 114, 156, 162, 134, 128, 151, 109, 49)
Mean_depth <- c(132, 37, 87, 60.6, 32, 38, 57.6, 70, 51, 40, 43.4, 46.5, 15.5)
scottish.lakes <- data.frame(Loch, Volume, Area, Length, Max_depth, Mean_depth)
scottish.lakes
```
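For comparison, a row-wise construction would stack one-row data frames with `rbind`; a minimal sketch with just the first two lochs (the `by.rows` name is only illustrative):
```{r}
# row-wise sketch: each lake is a one-row data frame, stacked with rbind
by.rows <- rbind(
  data.frame(Loch='Loch Ness',   Volume=7.45, Area=56, Length=39,
             Max_depth=230, Mean_depth=132),
  data.frame(Loch='Loch Lomond', Volume=2.6,  Area=71, Length=36,
             Max_depth=190, Mean_depth=37)
  # ...the remaining lakes follow the same pattern
)
by.rows
```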
**1.** To find the lakes with the highest and lowest volume and area, the function `order` is applied to the columns `Volume` and `Area`, once in decreasing and once in increasing order; the first row of each sorted frame is then selected with the following code.
```{r}
scottish.lakes[order(scottish.lakes$Volume, decreasing=TRUE),][1,] #largest volume
scottish.lakes[order(scottish.lakes$Volume),][1,] #smallest volume
scottish.lakes[order(scottish.lakes$Area, decreasing=TRUE),][1,] #largest area
scottish.lakes[order(scottish.lakes$Area),][1,] #smallest area
```
Hence, the lake with the largest volume is **Loch Ness** and the one with the smallest is **Loch Shin**; the one with the largest area is **Loch Lomond** and the one with the smallest is **Loch Katrine**.
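Equivalently, `which.max` and `which.min` return the row index directly, without sorting the whole frame:
```{r}
scottish.lakes[which.max(scottish.lakes$Volume),] #largest volume
scottish.lakes[which.min(scottish.lakes$Area),]   #smallest area
```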
**2.** Similarly, the two largest-area lakes are found, which turn out to be **Loch Lomond** and **Loch Ness**: here the first two rows of the sorted dataframe are selected instead of only the first one.
```{r}
scottish.lakes[order(Area, decreasing=TRUE),][1:2,]
```
**3.** Finally, to determine the area of Scotland covered by water, the function `sum` is applied to the column `Area`, obtaining **$372$ km$^2$** of surface.
```{r}
sum(scottish.lakes$Area)
```
# Exercise 2
Install the DAAG and tibble packages with `install.packages(c('DAAG','tibble'), type='source')`; after loading the libraries, get information on the package contents and on the `ais` data frame, and create a tibble from the `ais` dataframe. Perform the following analyses:
1. create a table grouping the data by gender and by sport; produce a barplot with the table adding a legend;
2. determine if any of the columns holds missing values;
3. produce boxplots of the main blood variables ('red blood cell counts', 'white blood cell counts', 'hematocrit' and 'hemoglobin concentration') for the different kinds of sports;
4. make some scatter plot correlations of the same blood variables, using different colors and symbols for the two genders in the sample.
```{r}
#install.packages(c("DAAG","tibble"), type="source") #install the required packages
#install.packages("RColorBrewer") #this is for fancy colouring
library("lattice") #this is for plotting the correlation matrix
library("tidyverse") #load the libraries
library("DAAG")
library("RColorBrewer") #this is for fancy colouring
```
```{r}
data <- as_tibble(ais) #create a tibble from the DAAG::ais data frame
head(data)
```
## Solution
**1.** `t_ais` is a table that groups the data by gender and by sport. It is the basis for the barplot below.
```{r}
t_ais <- table(data$sex, data$sport)
t_ais
```
```{r}
# extra margin room at the bottom and left for the rotated axis labels
par(mar=c(6,6,5,0), mgp=c(3.5, 0.5, 0))
barplot(t_ais, names.arg=colnames(t_ais), beside=TRUE, space=c(0,2),
        col=c("hotpink1", "cadetblue3"), legend=c("Women","Men"),
        main="How many men/women play a sport?", ylab="Counts", xlab="Sports",
        cex.main=2, cex.lab=2, las=2)
```
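Since the tidyverse is already loaded, the same counts can also be drawn with `ggplot2`; a sketch mirroring the base-R plot above:
```{r}
# grouped bar chart of sport counts by gender, built from the raw data
ggplot(data, aes(x=sport, fill=sex)) +
  geom_bar(position="dodge") +
  scale_fill_manual(values=c("hotpink1", "cadetblue3"),
                    labels=c("Women", "Men"), name="") +
  labs(title="How many men/women play a sport?", x="Sports", y="Counts") +
  theme(axis.text.x=element_text(angle=90, vjust=0.5))
```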
**2.** To determine if any of the columns holds missing values, the function `na.omit` is applied to the whole dataset: its number of rows does not change, hence there are no missing values in any column. As further proof, the missing values are counted column by column, obtaining $0$ for each one.
```{r}
nrow(data) == nrow(na.omit(data)) #TRUE: no rows are dropped
apply(apply(data, 2, is.na), 2, sum) #NA count per column
```
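The nested `apply` can be written more tersely with `colSums`:
```{r}
colSums(is.na(data)) #equivalent per-column NA count
```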
Another way to obtain the same result is to apply the `any` function to the whole dataframe: it returns `TRUE` if there is at least one missing value and `FALSE` otherwise, as in this case.
```{r}
any(is.na(data))
```
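Base R also provides `anyNA` as a direct shortcut:
```{r}
anyNA(data) #shortcut for any(is.na(data))
```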
**3.** The following boxplots show the main blood variables ('red blood cell counts' **rcc**, 'white blood cell counts' **wcc**, 'hematocrit' **hc** and 'hemoglobin concentration' **hg**) for all 10 sports reported in the `ais` dataframe. Each boxplot is devoted to a single blood variable and compares all the sports.
```{r}
par( mfrow=c(2,2), mar=c(6,6,4,0), mgp=c(4, 0.5, 0) )
colors <- brewer.pal(n = length(unique(data$sport)), name = "Paired")
boxplot(rcc~sport, data=data, col=colors, border="black",
horizontal = TRUE, las=2, cex.main=1.5, cex.lab=1.5,
main="Red Blood Cells", xlab="Counts [10^12/l]", ylab="Sport")
boxplot(wcc~sport, data=data, col=colors, border="black",
horizontal = TRUE, las=2, cex.main=1.5, cex.lab=1.5,
main="White Blood Cells", xlab="Counts [10^12/l]", ylab="Sport")
boxplot(hc~sport, data=data, col=colors, border="black",
horizontal = TRUE, las=2, cex.main=1.5, cex.lab=1.5,
main="Hematocrit", xlab="Hc [%]", ylab="Sport")
boxplot(hg~sport, data=data, col=colors, border="black",
horizontal = TRUE, las=2, cex.main=1.5, cex.lab=1.5,
main="Hemoglobin", xlab="Concentration [g/daL]", ylab="Sport")
```
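As a numeric complement to the boxplots, the per-sport medians can be tabulated with `aggregate`:
```{r}
# median of each blood variable, by sport
aggregate(cbind(rcc, wcc, hc, hg) ~ sport, data=data, FUN=median)
```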
**4.** The scatter plot matrix (SPLOM) is generated with the `splom` function of the `lattice` package. In this way, it is possible to compare the $4$ main blood variables with each other and, within the same panels, to see the similarities/differences between the two genders.
```{r}
splom(data[c('rcc','wcc','hc','hg')], groups=data$sex,
cex.lab=5, pch=c(1,4), col=c("hotpink1", "cadetblue3"),
panel=panel.superpose,
key = list( title="M/F Blood Variables", cex=1.5,
points=list(pch=c(1,4), col=c("hotpink1", "cadetblue3")),
columns=2, text=list(c("Females","Males") ) )
)
```
It can be seen that:
- **rcc** is correlated with both **hc** and **hg**, as suggested by the roughly linear shape of their panels (checked numerically below);
- on the other hand, **wcc** does not appear related to any of the other variables;
- men in general have higher values of **rcc**, **hc** and **hg** than women, while **wcc** is distributed similarly for the two genders.
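The first two observations can be verified with the correlation matrix of the four variables:
```{r}
# linear correlation coefficients between the four blood variables
round(cor(data[c('rcc','wcc','hc','hg')]), 2)
```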
# Exercise 3
After having loaded the required data:
1. examine the loaded tibble structure `covid`;
1. create a sub-tibble containing only the last day and produce a table with all the countries with number of deaths or number of new cases greater than 200;
1. select the top 10 countries, in terms of cases, and plot the total number of cases as a function of time. Plot the total number of deaths as a function of time. In order to compare the different curves, normalize the first date-time plot to the same $t_0$ value.
```{r}
# needed_packages <- c('lubridate', 'readxl', 'curl')
# already_installed <- needed_packages %in% installed.packages()
# for ( pack in needed_packages [! already_installed ]) {
# message( paste("To be installed :", pack, sep =" "))
# install.packages( pack )
# }
library('lubridate')
library('readxl')
library('curl')
url <- "https://www.ecdc.europa.eu/sites/default/files/documents/"
fname <- "COVID-19-geographic-disbtribution-worldwide-" #the misspelling is in the actual ECDC filename
date <- lubridate::today() - 1
ext <- ".xlsx"
target <- paste(url, fname, date, ext, sep="")
message("target: ", target)
tmp_file <- tempfile("data", tempdir(), fileext=ext) #portable temp directory
curl::curl_download(target, destfile=tmp_file)
covid <- readxl::read_xlsx(tmp_file)
covid
```
## Solution
**1.** The loaded tibble `covid` has $10$ columns: data is classified by time (**dateRep**, **day**, **month**, **year**) and by geographic area (**countriesAndTerritories**, **geoId**, **countryterritoryCode**). For each country, the 2018 population (**popData2018**) is also reported. Daily cases and deaths are given in the columns **cases** and **deaths**. Data is sorted by geographic area in alphabetical order.
```{r}
head(covid)
```
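The column types can also be inspected directly:
```{r}
str(covid) #types and dimensions of the tibble
```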
```{r}
apply ( apply (covid ,2,is.na), 2, sum)
print("Missing countryterritoryCode")
unique(covid[is.na(covid$countryterritoryCode)==TRUE,'countriesAndTerritories'])
print("Missing popData2018")
unique(covid[is.na(covid$popData2018)==TRUE,'countriesAndTerritories'])
```
There is some missing data: in particular, *Anguilla*, *Bonaire, Saint Eustatius and Saba*, *Czechia* and *Falkland_Islands_(Malvinas)* lack both the three-letter geographic code and the 2018 population figure, while *Eritrea* lacks only the latter. This was checked with the functions `is.na` and `unique`, which produce the two tibbles above.
**2.** `tb_covid` is a sub-tibble of `covid` containing only the last day's data, ordered by number of cases and deaths in decreasing order. A further selection yields a table with all the countries whose number of deaths or number of new cases is greater than 200.
```{r}
tb_covid <- covid[covid$dateRep==date,] #tibble with only yesterday's data
tb_covid <- tb_covid[order(tb_covid$cases, tb_covid$deaths, decreasing=TRUE),]
tb_cov_200 <- tb_covid[(tb_covid$deaths >=200 | tb_covid$cases >=200), ]
tb_cov_200[,c('countriesAndTerritories','geoId', 'deaths', 'cases')]
```
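The same selection can also be written with dplyr verbs (the tidyverse is already loaded); a sketch:
```{r}
# dplyr equivalent of the base-R subsetting above
covid %>%
  filter(dateRep == date, deaths >= 200 | cases >= 200) %>%
  arrange(desc(cases), desc(deaths)) %>%
  select(countriesAndTerritories, geoId, deaths, cases)
```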
**3.** The two-letter ID `geoId` of the top 10 countries by number of cases on the last day is stored in the `states` vector. It will be used as a legend in the final plots.
```{r}
top10 <- tb_cov_200[1:10,]
states <- top10$geoId
states
```
To obtain the total number of deaths/cases as a function of time, the `cumsum` function is used. The `na.omit` function is applied to avoid errors due to missing values, while `rev` puts the vectors in chronological order (the raw data lists the most recent date first).
```{r}
#total deaths
colors <- brewer.pal(n = length(states), name = "Paired") #colors
par(mgp=c(3, 0.5, 0))
#common y range computed up front, so all overlaid plots share the same scale
y_max <- max(sapply(states, function(s)
  max(cumsum(na.omit(rev(covid$deaths[covid$geoId==s]))))))
for (i in 1:10){
  ys <- cumsum(na.omit(rev(covid$deaths[covid$geoId==states[i]])))
  xs <- as.Date(na.omit(rev(covid$dateRep[covid$geoId==states[i]])))
  plot(xs, ys, col=colors[i], pch=i, las=2,
       xlim=c(as.Date("2020-02-28"), date), ylim=c(0, y_max+100),
       xlab="", ylab="")
  par(new=TRUE, xaxt='n', yaxt='n')
}
legend('topleft', legend=states, col=colors, pch=c(1:10))
title(main="Number of deaths vs. time", cex.main=1.5,
xlab="Time [date]", ylab="Total number of deaths", cex.lab=1.5)
```
```{r}
#total cases
colors <- brewer.pal(n = length(states), name = "Paired") #colors
par(mgp=c(3, 0.5, 0))
#common y range computed up front, so all overlaid plots share the same scale
y_max <- max(sapply(states, function(s)
  max(cumsum(na.omit(rev(covid$cases[covid$geoId==s]))))))
for (i in 1:10){
  ys <- cumsum(na.omit(rev(covid$cases[covid$geoId==states[i]])))
  xs <- as.Date(na.omit(rev(covid$dateRep[covid$geoId==states[i]])))
  plot(xs, ys, col=colors[i], pch=i, las=2,
       xlim=c(as.Date("2020-02-28"), date), ylim=c(0, y_max+100),
       xlab="", ylab="")
  par(new=TRUE, xaxt='n', yaxt='n')
}
legend('topleft', legend=states, col=colors, pch=c(1:10))
title(main="Number of cases vs. time", cex.main=1.5,
xlab="Time [date]", ylab="Total number of cases", cex.lab=1.5)
```
In order to compare the different curves, a normalization is performed by translating each curve by a time offset $t_0$, so that they all share the same origin; the result is shown below. A threshold of 10 cases/deaths is used to define the start of the epidemic in each country and hence to choose the offset $t_0$.
```{r}
#normalized total cases
colors <- brewer.pal(n = length(states), name = "Paired") #colors
par(mgp=c(3, 0.5, 0))
threshold <- 10 #threshold on the cumulative count that defines t0
#first pass: build the shifted series and the common axis ranges,
#so that all overlaid plots share the same scale
ys <- list(); xs <- list()
for (i in 1:10){
  y <- cumsum(na.omit(rev(covid$cases[covid$geoId==states[i]])))
  x <- as.POSIXlt(na.omit(rev(covid$dateRep[covid$geoId==states[i]])))
  t0 <- x[y > threshold][1] #first day above threshold
  ys[[i]] <- y
  xs[[i]] <- difftime(x, t0, units="days")
}
x_max <- max(sapply(1:10, function(i) length(xs[[i]][ys[[i]] > threshold])))
y_max <- max(sapply(ys, max))
#second pass: overlay the curves
for (i in 1:10){
  plot(xs[[i]], log(ys[[i]]), col=colors[i], pch=i, las=2,
       xlim=c(0, x_max), ylim=c(0, log(y_max+100)),
       xlab="", ylab="")
  par(new=TRUE, xaxt='n', yaxt='n')
}
legend('topleft', legend=states, col=colors, pch=c(1:10))
title(main="Number of cases vs. normalized time", cex.main=1.5,
xlab="Time [days]", ylab="Total number of cases [logscale]", cex.lab=1.5)
```
```{r}
#normalized total deaths
colors <- brewer.pal(n = length(states), name = "Paired") #colors
par(mgp=c(3, 0.5, 0))
threshold <- 10 #threshold on the cumulative count that defines t0
#first pass: build the shifted series and the common axis ranges
ys <- list(); xs <- list()
for (i in 1:10){
  y <- cumsum(na.omit(rev(covid$deaths[covid$geoId==states[i]])))
  x <- as.POSIXlt(na.omit(rev(covid$dateRep[covid$geoId==states[i]])))
  t0 <- x[y > threshold][1] #first day above threshold
  ys[[i]] <- y
  xs[[i]] <- difftime(x, t0, units="days")
}
x_max <- max(sapply(1:10, function(i) length(xs[[i]][ys[[i]] > threshold])))
y_max <- max(sapply(ys, max))
#second pass: overlay the curves
for (i in 1:10){
  plot(xs[[i]], log(ys[[i]]), col=colors[i], pch=i, las=2,
       xlim=c(0, x_max), ylim=c(0, log(y_max+100)),
       xlab="", ylab="")
  par(new=TRUE, xaxt='n', yaxt='n')
}
legend('topleft', legend=states, col=colors, pch=c(1:10))
title(main="Number of deaths vs. normalized time", cex.main=1.5,
xlab="Time [days]", ylab="Total number of deaths [logscale]", cex.lab=1.5)
```
For these last graphs, a logarithmic scale is chosen for the y-axis to improve readability: shifting the curves to a common origin inevitably makes them overlap. Moreover, on a log scale exponential growth $y(t) = y_0 e^{rt}$ appears as a straight line of slope $r$, so the log scale makes it easier to see which curves grow faster than the others.
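As a closing illustration (a sketch beyond the assignment text), the slope of one of these log curves over its early days estimates the exponential growth rate $r$; the choice of `states[1]` and of a 20-day window is arbitrary:
```{r}
# estimate the early exponential growth rate for the first country in `states`
y <- cumsum(na.omit(rev(covid$cases[covid$geoId==states[1]])))
x <- as.POSIXlt(na.omit(rev(covid$dateRep[covid$geoId==states[1]])))
t0 <- x[y > threshold][1] #same threshold used for the normalization above
t <- as.numeric(difftime(x, t0, units="days"))
sel <- t >= 0 & t <= 20 #fit only the first 20 days after t0 (arbitrary window)
fit <- lm(log(y[sel]) ~ t[sel])
coef(fit)[2] #slope = estimated daily growth rate r
```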