-
Notifications
You must be signed in to change notification settings - Fork 370
/
causeofDeath.Rmd
151 lines (115 loc) · 5.87 KB
/
causeofDeath.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
---
output:
html_document:
smart: false
---
Causes of Death, USA
======================================================================
By Susan Li
March 9 2017
Want to know how other people die? The [Centers of Disease Control and Prevention (CDC)](https://wonder.cdc.gov/ucd-icd10.html) provides estimates for the number of people who die due to each of the causes, from 1999 to 2015. The following are the causes of death at a glance.
```{r global_options, include=FALSE}
knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)
```
```{r}
url_causes <- "https://ibm.box.com/shared/static/souzhfxe3up2hrh23phciz18pznbtqxp.csv"
df <- read.csv(url_causes)
unique(df$Cause)
sum(df$Deaths)
unique(df$Year)
unique(df$Gender)
unique(df$Age)
df[!complete.cases(df), ]
```
This mortality data contains 51 causes and 6540835 deaths for the year 2005, 2010 and 2015. The gender are male and female, and the age from 0 to 100. And there is no missing value in the dateset.
### So, what are the top causes of death in the United States?
```{r}
library(ggplot2)
library(ggthemes)
ggplot(aes(x = reorder(Cause,-Deaths), y = Deaths), data = df) +
geom_bar(stat = 'identity') +
ylab('Total Deaths') +
xlab('') +
ggtitle('Causes of Deaths') +
theme_tufte() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
Heart disease and cancer are far away the most important causes of death in the US. Let’s take these top 10 causes of death and make a new data frame for some plotting, as follows:
```{r}
library(dplyr)
cause_group <- group_by(df, Cause)
df.death_by_cause <- summarize(cause_group,
sum = sum(Deaths))
df.death_by_cause <- arrange(df.death_by_cause, desc(sum))
top_10 <- head(df.death_by_cause, 10)
top_10
ggplot(aes(x = reorder(Cause, sum), y = sum), data = top_10) +
geom_bar(stat = 'identity') +
theme_tufte() +
theme(axis.text = element_text(size = 12, face = 'bold')) +
coord_flip() +
xlab('') +
ylab('Total Deaths') +
ggtitle("Top 10 Causes of Death")
```
Heart desease remains the leading cause of death in the US, followed by cancer(Malignant neoplasms), Chronic lower respiratory desease takes the third place but by a wide margin.
```{r}
options("scipen"=100, "digits"=4)
```
### Are men more likely to die than women? Or vice versa? Does it depend on cause of death or age?
```{r}
df_top_10 <- df[df$Cause %in% unique(top_10$Cause), ]
ggplot(aes(x = reorder(Cause, Deaths), y = Deaths, fill = Gender), data = df_top_10) +
geom_bar(stat = 'identity', position = position_dodge()) +
theme_tufte() +
coord_flip() +
ggtitle("Top 10 Causes of Death") +
facet_wrap(~Year) +
theme_fivethirtyeight() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
The picture does not change much when we compare 2005, 2010 and 2015. And more Ameican women than men dying for heart disease. Also Stroke(cerebrovascular diseases), Alzheimer's disease and Influenza and pneumonia seem affect more women than men.
```{r}
ggplot(aes(x = Age, y = Deaths, color = Cause),
data = df_top_10[df_top_10$Gender == 'M',]) +
geom_smooth(method = 'loess') +
ggtitle('Causes of Death by Age for Male')
```
In general, cancer affects male a little bit earlier than heart disease, at peak around 70 years old, heart disease affects more people at older age peak around 75 years old.
Only two diseases affect more yonger people than older people - Accidents and Suicide.
```{r}
ggplot(aes(x = Age, y = Deaths, color = Cause),
data = df_top_10[df_top_10$Gender == 'F',]) +
geom_smooth(method = 'loess') +
ggtitle('Causes of Death by Age for Female')
```
Female developes heart disease more than 10 years later than male, at peak almost at 100 years old. Men appear more vulnerable to heart disease than the inaptly-named 'weaker sex'. As a result, they succumb earlier.
### What the trend look like over time?
A stacked area plot may help to display the trend and development of each cause of deaths over time.
```{r}
ggplot(aes(x = Age, y = Deaths, color = Cause, fill = Cause),
data = df_top_10) +
geom_area() +
ggtitle('Causes of Death by Age and Gender') +
facet_wrap(~Gender) +
theme_solarized()
```
To see the trend over time, I downloaded a new dataset from [Centers for Disease Control and Prevention(CDC)](https://wonder.cdc.gov/ucd-icd10.html) and did some cleaning up. This dataset contains deaths, population, crude rate per 100,000 nationwide from 1990 to 2015.
```{r}
library(xlsx)
mydata <- read.xlsx("mydata.xlsx", 1)
mydata$NA. <- NULL
ggplot(aes(x = Year, y = Deaths), data = mydata) +
geom_line(size = 2.5, alpha = 0.7, color = "mediumseagreen") +
geom_point(size = 0.5) + xlab("Year") + ylab("Number of deaths") +
ggtitle("Deaths in the US")
```
Oh no! This is terrible. The number of death are going up! But the population of US has been growing. We should look at the per capita number of deaths. These things are typically measured per 100,000 population.
```{r}
ggplot(aes(x = Year, y = Crude_rate_per_100k), data = mydata) +
geom_line(size = 2.5, alpha = 0.7, color = "mediumseagreen") +
geom_point(size = 0.5) + xlab("Year") + ylab("Number of deaths per 100,000 population") +
ggtitle("Deaths in the US")
```
The death rate had been declining for years, an effect of improvement for health and disease control and technology. Until a few years ago, it has been raising in the recent years. If we want more accurate data, we will need the age-adjusted death rate for the total population because [the birth rate in the US drops to the lowest point](http://www.cnn.com/2016/08/11/health/us-lowest-fertility-rate/). The population 10 years ago was yonger than the population now.
That's all for now, next time, probably state by state.