-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathT6-2.Rmd
301 lines (215 loc) · 17.4 KB
/
T6-2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
---
title: 'Tutorial 6: Linear Regression'
author: 'REPLACE WITH YOUR NAME and ID#'
date: 'Due by 18/03/2022 8:00 AM'
output: html_document
---
## Introduction to R Markdown
In the first half of the course, you were getting familiar with writing R code (in individual `xxx.R` files) and also starting on using RMarkdown for professional-grade text processing with embedded R computations.
In this second half of the course, we will use RMarkdown exclusively to formalize all tutorial assignment submission. Just a very brief recap of RMarkdown below.
You can open R Markdown documents in RStudio as well. You should see a little command called "Knit", which allows you to "knit" the entire R Markdown file into a HTML document, or a PDF document, or a MS Word Document (note that for PDF, Tex distribution is needed; for MS Word, MS Word needs to be installed on your system).
R Markdown is handy because it allows you to embed code and writeup into the same document, and it produces presentable output, so you can use it to generate reports from your homework, and, when you eventually go out to work in a company, for your projects.
Here's how you embed a "chunk" of R code.
```{r example-chunk, echo=TRUE}
1+1
```
After the three apostrophes, you'll need `r`, then you can give the chunk a name. Please note that **names have to be single-word and no space is allowed**. Also, names have to be unique, that is, every chunk needs a **different** name. You can give chunks names like:
- `chunk1`
- `read-in-data`
- `run-regression`
Or, those that help you with homework:
- `q1a-read-in-data`
- `q1b-regression`
These names are for you to help organize your code. (In practice it will be very useful when you have files with thousands of lines of code...). After the name of the chunk, you can give it certain options, separated by commas. I will highlight one important option.
- `echo=TRUE` means the code chunk will be copied into the output file. For homework purposes, **always** set `echo=TRUE` so we know what code you wrote. When you go out to work in a company and you want to produce professional-looking reports, feel free to set it to FALSE.
There is a bit more syntax to learn using the R Markdown, but we don't need you to be an expert in R Markdown (although we do expect proficiency in R!). Hopefully, you can copy all the R Markdown syntax you need from the templates we provide.
Note about **working directory** in R Markdown. If you do not specify your working directory via `setwd('...')`, and you hit "Knit", the document will assume that the working directory is the directory that the `.rmd` file is in. Thus, if your rmd is in `XYZ/folder1/code.rmd` and your dataset is `XYZ/folder1/data.csv`, then you can simply run `d0 <- read.csv('data.csv')` without running `setwd()`.
## Submission Instructions
- Select `output: html_document`.
- I would also recommend you to play with PDF file using pdf_document for your own benefit (LaTex needs to be installed).
- Include all code chunks, so include `echo=TRUE` in all chunks.
- Replace the placeholder text, "Type your answer here.", with your own.
- Submit *only* the required question for grading (Part 2: Submission). You can delete everything else for that submission. Remember to include any `library('package_name')` statements that you'll need to run your code and future reproduction.
- Rename your R Markdown file `T[X]_[MatricNumber].rmd`, and the output will automatically be `T[X]_[MatricNumber].html`.
- for example, `T6_12345.html`
- X is the Tutorial number at the top of this file. For example, Linear Regression is "T6".
- Submit your both R Markdown file (.rmd) and HTML (.html) to Luminus for tutorial assignments (upload to Luminus under the correct Submission Folder). We shall do the same for practical exam.
- **It is important to be able to code and produce your Rmarkdown output file *independently*.** Keep in mind that you are responsible for de-bugging and programming in the final exam.
## Preparation
```{r load-libraries, echo=TRUE}
# load required packages
library(dplyr)
library(tidyr)
library(ggplot2) # optional. we expect you to know base graphics, but allow ggplot if you find it easier
```
## Part Two: Assignment Submission
### Question 3 (Total 15 points)
- Dataset required: `data('recid')`
Recidivism rate in Singapore is 24% in 2016 (https://data.gov.sg/dataset/recidivism-rate). Criminals tend to relapse into criminal offense after the release from the prison. Recidivism is costly and causes serious social and economical problem. It is not only wasteful with the resources invested in prison, including staffing, infrastructure investment, daily operation cost, and economic opportunity cost for both prisoners and staffs (i.e., labor values that could be generated elsewhere other than being locked up and guarding in prison, respectively), but also harms the community for the second time due to crime recommitment. Recidivism is thus a critical evaluating metric for prison performance, e.g., rehabilitation or training program. `recid.csv` contains 1445 observations of recidivism cases in United States where it currently has the largest prison population in the world (about one out every five peope imprisoned in the world is incarcerated in the United States).
```{r q3-data-input, echo=TRUE}
recid = read.csv(file = 'recid.csv', header= TRUE)
```
C.-F. Chung, P. Schmidt, and A.D. Witte (1991), “Survival Analysis: A Survey,” Journal of Quantitative Criminology 7, 59-98.
Data Source: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041. A quick description of the data:
Obs: 1445
1. black =1 if black
2. alcohol =1 if alcohol problems
3. drugs =1 if drug history
4. super =1 if release supervised
5. married =1 if married when incarc.
6. felon =1 if felony sentence
7. workprg =1 if in N.C. pris. work prg.
8. property =1 if property crime
9. person =1 if crime against person
10. priors # prior convictions
11. educ years of schooling
12. rules # rules violations in prison
13. age in months
14. tserved time served, rounded to months
15. follow length follow period, months
16. durat max(time until return, follow) in month
17. cens =1 if duration right censored
18. ldurat log(durat)
(3a) Based on previous experience, prison staffs suggest that the recidivism period `durat` (duration between release and follow-up offense) is related with the length o f time served in the prison `tserved`. If we are interested to explain recidivism, which is the dependent (or outcome) variable? Before jumping into the numbers, visualize the relationship between the variables of interest with a scatterplot. (2 point)
<p style="color:red">
I would choose `tserved` as the independent variable and `durat` as the dependent variable. This is because the recidivism period is likely to be an outcome variable. The length of time served in the prison also occurs before the follow-up offense, when the recidivism period is available. Therefore, it makes more sense to see the if `tserved` predicts `durat`, not the other way around.
</p>
<p style="color:blue">
Variable 16 `durat` is the time until return or FOLLOW. Some criminals may never commit second crime until the end of the follow / observation period. Thus there is a cluster of points on the top left hand corner of the graph.
</p>
```{r q3a, echo=TRUE}
ggplot(recid,
aes(x=tserved,
y=durat)) +
geom_point() +
theme_bw() +
ggtitle("Scatter plot of recidivism period against length of time served in the prison") +
xlab("Length of time served in the prison") +
ylab("Revidivism period")
```
(3b) Run a simple linear regression model between `durat` and `tserved` and interpret the coefficient before `tserved`. (2 points) Is the coefficient statistically significant? (1 point)
Remark: the intercept is by default included in the regression model for this module, unless otherwise specified.
<p style="color:red">
The intercept (b0) is 60.23062, which suggests that the average recidivism period of the criminal would be 60.23062 months if the length of time the criminal served in the prison is 0 month. The t-value here is large, and the p-value is very small, less than 0.05, thus we can reject the null hypothesis that the intercept beta0 = 0, i.e., the intercept beta0 is statistically significant, and significantly different from zero.
</p>
<p style="color:red">
The slope (b1), which is the coefficient before `tserved` is -0.25327. This means that, when the length of time the criminal served in the prison increases by 1 month, then we will expect to see an average decrease of the recidivism period by 0.25327 months. The magnitude of t-value is large, and the p-value here is 8.74e-14, which is very small, less than 0.05, thus we can reject the null hypothesis that this slope beta1 = 0, i.e., the coefficient before `tserved` is statistically significant, and significantly different from zero.
</p>
<p style="color:blue">
Should consider not just whether the effect is sgatistically event, but whether the **effect size** is large enough to be meaningful in practice.
</p>
```{r q3b, echo=TRUE}
model1 <- lm(durat ~ tserved, recid)
summary(model1)
```
(3c) In addition to prison serving time `tserved`, few other variables are suggested to predict the recidivism by prison staff, including previous drug use `drugs`, alcohol abuse `alcohol`, number of prior convictions `priors` and attendance of prison training program `workprg`. If we are to have a multivariate linear regression model to explain `durat` with all the mentioned variables, please write down explicitly the population linear regression model below. If you use $Y$, and $X_1$, $X_2$, etc., clearly define or label your notations, e.g., $X_1 = ?$ (2 points)
<p style="color:red">
Multivariable linear regression model:
</p>
<p style="color:red">
$$Y = \beta_0 + \beta_1 * X_1 + \beta_2 * X_2 + \beta_3 * X_3 + \beta_4 * X_4 + \beta_5 * X_5$$
</p>
<p style="color:red">
Y = recidivism period `durat`
</p>
<p style="color:red">
X1 = prison serving time `tserved`
</p>
<p style="color:red">
X2 = previous drug use `drugs`
</p>
<p style="color:red">
X3 = alcohol abuse `alcohol`
</p>
<p style="color:red">
X4 = number of prior convitions `priors`
</p>
<p style="color:red">
X5 = attendence of prison training programme `workprg`
</p>
<p style="color:blue">
$$Y = \beta_0 + \beta_1 * X_1 + \beta_2 * X_2 + \beta_3 * X_3 + \beta_4 * X_4 + \beta_5 * X_5 + \epsilon$$
</p>
(3d) Prison staff are evaluating the effectiveness of the working program. Run a multivariate linear regression: `durat ~ tserved + drugs + alcohol + priors + workprg`. Interpret the meaning of coefficient before `workprg` and discuss if prison working program should be retained from your observation to the regression result and why? (3 points)
<p style="color:red">
The slope (b5), which is the coefficient before `workprg` is 2.32777. This means that, the average difference in recidivism period when the criminal attends the prison training programme compared to when the criminal does not attend is 2.377 months, holding all other factors constant, i.e., the average recidivism period for a criminal that has attended the prison training programme is higher by 2.32777 months than that for a criminal that has not attended the programme, holding all other factors constant.
</p>
<p style="color:blue">
Two key points: 1) **on average**; 2) **holding all other independent variables constant**
</p>
<p style="color:red">
The corresponding t-value of 1.618 is rather small. The p-value is 0.105816, which is large, greater than 0.05. Thus we cannot reject the null hypothesis that this slope beta5 = 0, i.e., there is no sufficient evidence to suggest that this value is statistically significantly different from zero. Therefore, we cannot state that the coefficient before `workprg` is statistically significant. Therefore, prison working programme should not be retained from my observation to the regression result.
</p>
<p style="color:blue">
The result implies that prison working programme does not work effectively as expected to improve situation of recidivism.
</p>
```{r q3d, echo=TRUE}
model2 <- lm(durat ~ tserved + drugs + alcohol + priors + workprg, recid)
summary(model2)
```
(3e) The warden of prison decided to terminate the working program. Do you agree with his/her decision? Why or why not? (3 points)
Hint: One of most effective way to be critical is to check the assumptions used to reach the conclusion and decision.
<p style="color:red">
Since the aforementioned conclusion is arrived from the previous multivariate linear regression model, I am to check whether the assumptions for linear regression are fulfilled for this multivariate linear regression model.
</p>
<p style="color:red">
At first, I plotted the residual plot of residuals against attendance of prison training programme. However, since the attendance is a categorical variable, it is hard to observe from the residual graph plotted.
</p>
<p style="color:red">
Then, I obtained the residual plot of residuals against fitted values y hat. Firstly, there seems to be a linear relationship between residuals and fitted y hat values, suggesting that there are biased errors. Thus, the first assumption that there is no biased errors is not fulfilled.
</p>
<p style="color:red">
Secondly, the variance of errors does not seem to change significantly across changing y hat values. Thus the second assumption of homoskedasticity (constant variance) is fulfilled.
</p>
<p style="color:red">
There is a non-random pattern in the residual plot, which is the residuals becomes lower when y hat values get larger. Thus there is an autocorrelation abnormality. The third assumption that there is no autocorrelation is not fulfilled.
</p>
<p style="color:red">
Also, the red line deviates from the line where residual = 0, indicating that there are more positive residuals, and thus the mean of errors is not zero. Thus the assumption of mean-zero error is not met.
</p>
<p style="color:red">
Then, I plotted the Q-Q Plot of the residuals. Many of the data points do not follow the 45 degree dotted line, suggesting that the errors are not normally distributed. Therefore there are non-normal errors. This confirms that the assumption that the errors are normally distributed is fulfilled.
</p>
<p style="color:red">
I have also plotted scatterplot of recidivism duration and attendance of prison training programme. However, since the X5 is a categorical variable, there is no linearity shown from the graph. There is no need to check linearity for Y against categorical variable X.
</p>
<p style="color:red">
Since many of the assumptions are not fulfilled, the regression model is not reliable in reaching the conclusion that the coefficient before `workprg` is not statistically significant, and the decision that the working programme should be terminated. Thus, the working programme should not be terminated. Thus I do not agree with the warden's decision.
</p>
<p style="color:blue">
From the residual plot, we can see a significant decreasing pattern of residuals when fitted value gets larger.
</p>
```{r q3e, echo=TRUE}
# residual plot against workprg
residuals2 = resid(model2)
plot(recid$workprg,
residuals2,
main="Plot of residuals v.s. workprg",
xlab="Attendance of prison training programme",
ylab="Residuals")
abline(0,0,lty="longdash")
# residual plot against fitted values y^hat
plot(model2, 1)
#Q-Q plot of residuals
plot(model2, 2)
# check for nonlinearity and outliers with scatterplot
plot(recid$workprg,
recid$durat,
main="Attendance of prison training programme against recidivism duration",
xlab="Attendance of prison training programme",
ylab="Recidivism duration")
```
(3f) Offenders after prison release usually are under state supervised parole program and assessment for the length of the follow period, `follow`, is crucial since it is costly. Given five variables in `recid` data set: `follow`, `rules`, `age`, `tserved`, and `married`, how could you help the staffs to predict the outcome of a recent case given that a 32-year old married offender has been serving the jail time for 3 years and 7 months and during which broken no rule in the prison? Run a proper model with those 5 variables and predict the length of follow period. (2 point)
<p style="color:red">
I have run a model with `follow` as the dependent variable and other four variables as independent variables. Then I predicted the outcome using the model obtained.
</p>
<p style="color:red">
The predicted length of follow period is 74.73948 months. The prediction interval for the individual value for this particular person is [67.83254, 81.64642] in months. The confidence interval for mean predicted value for all people sharing the same characteristics is [73.97238, 75.50658] in months. Since the prediction here concerns with a particular offender, the prediction interval is used, and the prediction interval in months is [67.83254, 81.64642].
</p>
```{r q3f, echo=TRUE}
model3 <- lm(follow ~ rules + age + tserved + married, recid)
summary(model3)
to.predict <- data.frame(rules=0, age=32*12, tserved=43, married=1)
predict(model3, newdata=to.predict, interval="prediction")
predict(model3, newdata=to.predict, interval="confidence")
```