---
title: "Assignment 4"
author: "Eveline Surbakti"
date: "4/18/2020"
output: word_document
---
```{r setup, include=FALSE}
#setwd("D:/2020 - Spring/STAT40850 Bayesian Analysis (online)/Assignment/Assignment 4")
```
I model the number of successful putts (Nsucc) out of the number of attempts (Ntrys) as Binomially distributed, with a success probability parameter (say p) that depends on some function of the distance.
Since the Binomial probability parameter p must lie between 0 and 1, we first transform it through a logit link so that the linear predictor in distance is unconstrained.
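Concretely, the logit link used in the models below maps the unit interval onto the whole real line, so any value of the linear predictor maps back to a valid probability:

$$\operatorname{logit}(p_i) = \log\!\left(\frac{p_i}{1-p_i}\right) = \alpha + \beta\, x_i \quad\Longleftrightarrow\quad p_i = \frac{1}{1+e^{-(\alpha+\beta x_i)}}.$$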
# (a) Plot the ratio of number of successes to number of attempts (y-axis) against each distance (x-axis).
Answer:
From the data, we can see that the ratio of the number of successes to the number of attempts is highest at the closest distances and decreases as the distance increases.
```{r}
#PART A: empirical success proportion at each distance
data <- read.table("putting.dat", header = TRUE)
prob <- data$Nsucc / data$Ntrys
plot(data$dist, prob, main = "Probability of Success vs Distance",
     xlab = "Distance (feet)", ylab = "Proportion of successes", ylim = c(0, 1))
```
# (b) Write JAGS code to produce 10,000 samples from the posterior for the linear model above for all unknown quantities. You may set uniform priors for alpha and beta with limits between -10 and 10. What 95% HDR intervals do you get for alpha and beta?
Answer:
```{r}
library(rjags)
#PART B
#data passed to JAGS
K <- nrow(data)
n <- data$Ntrys
X <- data$dist
Y <- data$Nsucc
golf <- list(K = K, n = n, X = X, Y = Y)
#the model: Binomial likelihood with a logit-linear success probability
model_string <- "model{
  for (i in 1:K){
    Y[i] ~ dbin(p[i], n[i])
    logit(p[i]) <- alpha + beta*X[i]
  }
  #vague uniform priors on the unknown intercept and slope
  alpha ~ dunif(-10, 10)
  beta ~ dunif(-10, 10)
}"
#compile the model
putting.model <- jags.model(textConnection(model_string), golf, n.chains = 1)
#draw 10,000 posterior samples
putting.samps <- coda.samples(putting.model, variable.names = c("alpha", "beta"), 1e4)
#95% HDR interval for alpha
quantile(putting.samps[[1]][, "alpha"], c(0.025, 0.975))
```
```{r}
#95% HDR interval for beta
quantile(putting.samps[[1]][, "beta"], c(0.025, 0.975))
```
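As a quick sanity check (a minimal sketch, not required by the assignment), the posterior means of alpha and beta imply a fitted success curve that can be overlaid on the part (a) plot; `alpha_hat`, `beta_hat`, and `curve_p` are names introduced here for illustration.
```{r}
#fitted success probability implied by the posterior means
post <- as.matrix(putting.samps[[1]])
alpha_hat <- mean(post[, "alpha"])
beta_hat  <- mean(post[, "beta"])
curve_p <- function(d) 1 / (1 + exp(-(alpha_hat + beta_hat * d)))
plot(data$dist, prob, ylim = c(0, 1),
     xlab = "Distance (feet)", ylab = "Proportion of successes")
curve(curve_p(x), add = TRUE, col = 2)
```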
# (c) Add code to produce samples from the posterior predicted number of successes at the observed distances and at 25 feet for 100 attempts at each distance.
Answer:
```{r}
#PART C
#Hint: the easiest way to do this is to add a row to the data with
#distance = 25, number of tries = 100, number of successes = NA.
new_row <- list(dist = 25, Ntrys = 100, Nsucc = NA)
data_mod <- rbind(data, new_row)
#data passed to JAGS
K <- nrow(data_mod)
n <- data_mod$Ntrys
X <- data_mod$dist
Y <- data_mod$Nsucc
golf <- list(K = K, n = n, X = X, Y = Y)
model_string2 <- "model{
  for (i in 1:K){
    Y[i] ~ dbin(p[i], n[i])
    logit(p[i]) <- alpha + beta*X[i]
    #posterior predictive number of successes out of 100 attempts
    PredY[i] ~ dbin(p[i], 100)
  }
  alpha ~ dunif(-10, 10)
  beta ~ dunif(-10, 10)
}"
#compile the model
putting.model.mod <- jags.model(textConnection(model_string2), golf, n.chains = 1)
#sample the posterior predictive counts
putting.samps.mod <- coda.samples(putting.model.mod, variable.names = c("PredY"), 1e4)
#posterior mean predicted successes at each distance
pred_mean <- colMeans(putting.samps.mod[[1]][, grep("PredY", colnames(putting.samps.mod[[1]]))])
interval <- apply(putting.samps.mod[[1]], 2, quantile, c(0.025, 0.975))
#Plot the data as per part (a) multiplied by 100, with the mean and
#HDR number of successes out of 100 at each distance.
plot(data$dist, prob * 100, ylim = c(0, 100), xlim = c(2, 26), #extend the limit so the 25-foot prediction is visible
     xlab = "Distance (feet)", ylab = "Successes out of 100")
lines(data_mod$dist, pred_mean, col = 2)
lines(data_mod$dist, interval[1, ], col = 3)
lines(data_mod$dist, interval[2, ], col = 3)
```
# (d) What is the 95% HDR for the successes out of 100 at 25 feet? Does this seem reasonable based on your plot of the data?
Answer:
It does not seem reasonable: the trend of the plotted data falls outside the interval predicted at 25 feet.
```{r}
#PART D
#PredY[20] is the appended 25-foot row of data_mod
quantile(putting.samps.mod[[1]][, "PredY[20]"], c(0.025, 0.975))
```
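To avoid hard-coding the row index, the appended prediction row can also be located programmatically; a small sketch (`idx` is a name introduced here for illustration):
```{r}
#locate the appended prediction row (the one with missing Nsucc)
idx <- which(is.na(data_mod$Nsucc))
quantile(putting.samps.mod[[1]][, paste0("PredY[", idx, "]")], c(0.025, 0.975))
```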
# (e) Next you will try a model where the probability of a miss depends both linearly and quadratically on distance.
Answer:
```{r}
#PART E
#data passed to JAGS (same augmented data as part (c))
K <- nrow(data_mod)
n <- data_mod$Ntrys
X <- data_mod$dist
Y <- data_mod$Nsucc
golf <- list(K = K, n = n, X = X, Y = Y)
#Code a model file for this model and again produce a 95% HDR for successes
#out of 100 at all observed distances and 25 feet. Create a plot similar to
#that obtained in part (c).
model_string3 <- "model{
  for (i in 1:K){
    Y[i] ~ dbin(p[i], n[i])
    logit(p[i]) <- alpha + beta*X[i] + lambda*X[i]^2
    PredY[i] ~ dbin(p[i], 100)
  }
  alpha ~ dunif(-10, 10)
  beta ~ dunif(-10, 10)
  lambda ~ dunif(-10, 10)
}"
#compile the model
putting.model.mod2 <- jags.model(textConnection(model_string3), golf, n.chains = 1)
#sample the posterior predictive counts
putting.samps.mod2 <- coda.samples(putting.model.mod2, variable.names = c("PredY"), 1e4)
pred_mean <- colMeans(putting.samps.mod2[[1]][, grep("PredY", colnames(putting.samps.mod2[[1]]))])
interval <- apply(putting.samps.mod2[[1]], 2, quantile, c(0.025, 0.975))
plot(data$dist, prob * 100, ylim = c(0, 100), xlim = c(2, 26),
     xlab = "Distance (feet)", ylab = "Successes out of 100")
lines(data_mod$dist, pred_mean, col = 2)
lines(data_mod$dist, interval[1, ], col = 3)
lines(data_mod$dist, interval[2, ], col = 3)
```
# (f) Does this model lead to more believable predictions for successes at 25 feet? What is the HDR at 25 feet for successes out of 100 attempts?
Answer:
Now the model gives more believable predictions.
The HDR at 25 feet for successes out of 100 attempts is:
```{r}
#PART F
#PredY[20] is the appended 25-foot row of data_mod
quantile(putting.samps.mod2[[1]][, "PredY[20]"], c(0.025, 0.975))
```
# (g) Now try a model based on the log of the distance. How does this compare (with reference to a plot similar to those created for the linear and quadratic models)?
Answer:
```{r}
#PART G
#model based on the log of the distance
model_string4 <- "model{
  for (i in 1:K){
    Y[i] ~ dbin(p[i], n[i])
    logit(p[i]) <- alpha + beta*log(X[i])
    PredY[i] ~ dbin(p[i], 100)
  }
  alpha ~ dunif(-10, 10)
  beta ~ dunif(-10, 10)
}"
#compile the model
putting.model.mod3 <- jags.model(textConnection(model_string4), golf, n.chains = 1)
#sample the posterior predictive counts
putting.samps.mod3 <- coda.samples(putting.model.mod3, variable.names = c("PredY"), 1e4)
pred_mean <- colMeans(putting.samps.mod3[[1]][, grep("PredY", colnames(putting.samps.mod3[[1]]))])
interval <- apply(putting.samps.mod3[[1]], 2, quantile, c(0.025, 0.975))
plot(data$dist, prob * 100, ylim = c(0, 100), xlim = c(2, 26),
     xlab = "Distance (feet)", ylab = "Successes out of 100")
lines(data_mod$dist, pred_mean, col = 2)
lines(data_mod$dist, interval[1, ], col = 3)
lines(data_mod$dist, interval[2, ], col = 3)
```
Based on the graphs, the part (g) model with the log of the distance gives the better prediction: logically, as the distance increases for a professional golfer, the probability of success should keep decreasing, whereas in part (e) the quadratic model's predictions tend to increase again beyond about 20 feet.
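To make this comparison concrete, the three posterior mean prediction curves can be overlaid on one plot; a minimal sketch reusing the samples drawn above (`get_mean` is a helper introduced here for illustration):
```{r}
#overlay the posterior mean predicted successes from all three models
get_mean <- function(s) colMeans(s[[1]][, grep("PredY", colnames(s[[1]]))])
plot(data$dist, prob * 100, ylim = c(0, 100), xlim = c(2, 26),
     xlab = "Distance (feet)", ylab = "Successes out of 100")
lines(data_mod$dist, get_mean(putting.samps.mod),  col = 2)  #linear
lines(data_mod$dist, get_mean(putting.samps.mod2), col = 3)  #quadratic
lines(data_mod$dist, get_mean(putting.samps.mod3), col = 4)  #log distance
legend("topright", legend = c("linear", "quadratic", "log"), col = 2:4, lty = 1)
```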
# (h) Use DIC to choose the best model. Comment on the number of parameters in each of the 3 models.
Answer:
```{r}
#PART H
#refit each model with 4 chains for DIC estimation
#linear regression - 1st model
putting.model.mod <- jags.model(textConnection(model_string2), golf, n.chains = 4)
#linear and quadratic - 2nd model
putting.model.mod2 <- jags.model(textConnection(model_string3), golf, n.chains = 4)
#log of the distance - 3rd model
putting.model.mod3 <- jags.model(textConnection(model_string4), golf, n.chains = 4)
#the DIC
DIC1 <- dic.samples(putting.model.mod, 1e4)
DIC2 <- dic.samples(putting.model.mod2, 1e4)
DIC3 <- dic.samples(putting.model.mod3, 1e4)
DIC1; DIC2; DIC3
```
Among the DIC values, the log-of-distance model has the lowest penalized deviance. The penalty for the first and third models is about the same because they have the same number of parameters; the third model uses the log of distance, which makes its mean deviance much smaller than the first model's.
Looking at the number of parameters in each of the 3 models, the second model has the higher penalty because it has one more parameter (lambda) than the other two.
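rjags also provides `diffdic()` for a direct pairwise comparison of two `dic` objects; a quick sketch (a negative difference favours the first argument):
```{r}
#pairwise penalized-deviance differences against the log-distance model
diffdic(DIC3, DIC1)
diffdic(DIC3, DIC2)
```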
# (i) Check your chosen model for mixing and convergence and comment on what you find.
Answer:
```{r}
#refit the chosen model with the additional arguments n.adapt and n.chains
putting.model.mod3A <- jags.model(textConnection(model_string4), golf, n.chains = 4, n.adapt = 1e5)
#sample with thinning to reduce autocorrelation (note: sample from mod3A, not mod3)
putting.samps.mod3A <- coda.samples(putting.model.mod3A, variable.names = c("alpha", "beta", "PredY"), 1e5, thin = 25)
effectiveSize(putting.samps.mod3A)
dim(putting.samps.mod3A[[1]])
dim(putting.samps.mod3A[[2]])
dim(putting.samps.mod3A[[3]])
dim(putting.samps.mod3A[[4]])
```
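Trace plots give a visual check of mixing; a minimal sketch, restricted to alpha and beta to keep the output readable:
```{r}
#trace and density plots for the two regression parameters
plot(putting.samps.mod3A[, c("alpha", "beta")])
```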
Approximate convergence is diagnosed when the upper limit of the Gelman-Rubin statistic is close to 1, as below.
```{r}
gelman.diag(putting.samps.mod3A)
```
The autocorrelation looks ideal: after thinning there is essentially no autocorrelation beyond lag 1.
```{r}
autocorr.plot(putting.samps.mod3A)
```