-
Notifications
You must be signed in to change notification settings - Fork 26
/
Copy pathstan.Rmd
235 lines (193 loc) · 9.47 KB
/
stan.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
# Validating a small hierarchical model with Stan {#stan}
```{r, message = FALSE, warning = FALSE, echo = FALSE}
knitr::opts_knit$set(root.dir = fs::dir_create(tempfile()))
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = FALSE)
options(
drake_make_menu = FALSE,
drake_clean_menu = FALSE,
warnPartialMatchArgs = FALSE,
crayon.enabled = FALSE,
readr.show_progress = FALSE,
tidyverse.quiet = TRUE
)
```
```{r, message = FALSE, warning = FALSE, echo = FALSE, eval = TRUE}
options(tidyverse.quiet = TRUE)
library(drake)
library(tidyverse)
tmp <- suppressWarnings(drake_plan(x = 1, y = 2))
```
The goal of this example workflow is to validate a small Bayesian hierarchical model.
```{r, eval = FALSE}
y_i ~ iid Normal(alpha + x_i * beta, sigma^2)
alpha ~ Normal(0, 1)
beta ~ Normal(0, 1)
sigma ~ Uniform(0, 1)
```
We simulate multiple datasets from the model and fit the model on each dataset. For each model fit, we determine if the 50% credible interval of the regression coefficient `beta` contains the true value of `beta` used to generate the data. If we implemented the model correctly, roughly 50% of the models should recapture the true `beta` in 50% credible intervals.
## The `drake` project
Because of the long computation time involved, this chapter of the manual does not actually run the analysis code. The complete code can be found at <https://github.com/wlandau/drake-examples/tree/main/stan> and downloaded with `drake::drake_example("stan")`, and we encourage you to try out the code yourself. This chapter serves to walk through the functions and plan and explain the overall thought process.
The file structure is that of a [typical `drake` project](#projects) with some additions to allow optional [high-performance computing](#hpc) on a cluster.
```{r, eval = FALSE}
├── run.sh
├── run.R
├── _drake.R
├── sge.tmpl
├── R/
├──── packages.R
├──── functions.R
├──── plan.R
├── stan/
├──── model.stan
└── report.Rmd
```
File | Purpose
---|---
`run.sh` | Shell script to run `run.R` in a persistent background process. Works on Unix-like systems. Helpful for long computations on servers.
`run.R` | R script to run `r_make()`.
`_drake.R` | The special R script that powers functions `r_make()` and friends ([details here]()).
`sge.tmpl` | A [`clustermq`](https://github.com/mschubert/clustermq) template file to deploy targets in parallel to a Sun Grid Engine cluster.
`R/packages.R` | A custom R script loading the packages we need.
`R/functions.R` | A custom R script with user-defined functions.
`R/plan.R` | A custom R script that defines the `drake` plan.
`stan/model.stan` | The specification of our Stan model.
`report.Rmd` | An R Markdown report summarizing the results of the analysis.
The following sections walk through the functions and plan.
## Functions
Good functions have meaningful inputs and outputs that are easy to generate. For data anlaysis, good inputs and outputs are typically datasets, models, and summaries of fitted models. The functions below for our Stan workflow follow this pattern.
First, we need a function to compile the model. It accepts a Stan model specification file (a `*.stan` text file) and returns a paths to both the model file and the compiled RDS file. (We need to set `rstan_options(auto_write = TRUE)` to make sure `stan_model()` generates the RDS file.) We return the file paths because the target that uses this function will be a [dynamic file target](https://books.ropensci.org/drake/plans.html#dynamic-files). `drake` will reproducibly watch these files for changes and automatically recompile and run the model if changes are detected.
```{r}
compile_model <- function(model_file) {
stan_model(model_file, auto_write = TRUE, save_dso = TRUE)
c(model_file, path_ext_set(model_file, "rds"))
}
```
Next, we need a function to simulate a dataset from the hierarchical model.
```{r}
simulate_data <- function() {
alpha <- rnorm(1, 0, 1)
beta <- rnorm(1, 0, 1)
sigma <- runif(1, 0, 1)
x <- rbinom(100, 1, 0.5)
y <- rnorm(100, alpha + x * beta, sigma)
tibble(x = x, y = y, beta_true = beta)
}
```
Lastly, we write a function to fit the compiled model to a simulated dataset. We pass in the `*.stan` model specification file, but `rstan` is smart enough to use the compiled RDS model file instead if available.
In Bayesian data analysis workflows with many runs of the same model, we need to make a conscious effort to conserve computing resources. That means we should not save all the posterior samples from every single model fit. Instead, we compute summary statistics on the chains such as posterior quantiles, coverage in credible intervals, and convergence diagnostics.
```{r}
fit_model <- function(model_file, data) {
# From https://github.com/stan-dev/rstan/issues/444#issuecomment-445108185,
# we need each stanfit object to have its own unique name, so we create
# a special new environment for it.
tmp_envir <- new.env(parent = baseenv())
tmp_envir$model <- stan_model(model_file, auto_write = TRUE, save_dso = TRUE)
tmp_envir$fit <- sampling(
object = tmp_envir$model,
data = list(x = data$x, y = data$y, n = nrow(data)),
refresh = 0
)
mcmc_list <- As.mcmc.list(tmp_envir$fit)
samples <- as.data.frame(as.matrix(mcmc_list))
beta_25 <- quantile(samples$beta, 0.25)
beta_median <- quantile(samples$beta, 0.5)
beta_75 <- quantile(samples$beta, 0.75)
beta_true <- data$beta_true[1]
beta_cover <- beta_25 < beta_true && beta_true < beta_75
psrf <- max(gelman.diag(mcmc_list, multivariate = FALSE)$psrf[, 1])
ess <- min(effectiveSize(mcmc_list))
tibble(
beta_cover = beta_cover,
beta_true = beta_true,
beta_25 = beta_25,
beta_median = beta_median,
beta_75 = beta_75,
psrf = psrf,
ess = ess
)
}
```
## Plan
Our [`drake` plan](#plans) is defined in the `R/plan.R` script.
```{r}
plan <- drake_plan(
model_file = target(
compile_model("stan/model.stan"),
format = "file",
hpc = FALSE
),
index = target(
seq_len(10), # Change the number of simulations here.
hpc = FALSE
),
data = target(
simulate_data(),
dynamic = map(index),
format = "fst_tbl"
),
fit = target(
fit_model(model_file[1], data),
dynamic = map(data),
format = "fst_tbl"
),
report = target(
render(
knitr_in("report.Rmd"),
output_file = file_out("report.html"),
quiet = TRUE
),
hpc = FALSE
)
)
```
The following subsections describe the strategy and practical adjustments behind each target.
### Model
The `model_files` target is a [dynamic file target](https://books.ropensci.org/drake/plans.html#dynamic-files) to reproducibly track our Stan model specification file (`stan/model.stan`) and compiled model file (`stan/model.rds`). Below, `format = "file"` indicates that the target is a dynamic file target, and `hpc = FALSE` tells `drake` not to run the target on a parallel worker in [high-performance computing](#hpc) scenarios.
```{r}
model_files = target(
compile_model("stan/model.stan"),
format = "file",
hpc = FALSE
),
```
### Index
The `index` target is simply a numeric vector from 1 to the number of simulations. To fit our model multiple times, we are going to [dynamically map](#dynamic) over `index`. This is a small target and we do not want to waste expensive computing resources on it, so we set `hpc = FALSE`.
```{r}
index = target(
seq_len(1000), # Change the number of simulations here.
hpc = FALSE
)
```
### Data
`data` is a [dynamic target](#dynamic) with one sub-target per simulated dataset, so we write `dynamic = map(index)` below. In addition, these datasets are data frames, so we choose `format = "fst_tbl"` below to increase read/write speeds and conserve storage space. [Read here](https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets) for more on specialized storage formats.
```{r}
data = target(
simulate_data(),
dynamic = map(index),
format = "fst_tbl"
)
```
### Fit
We want to fit our model once for each simulated dataset, so our `fit` target dynamically maps over the datasets with `dynamic = map(data)`. We also pass in the path to the `stan/model.stan` specification file, but `rstan` is smart enough to use the compiled model `stan/model.rds` instead if available. And since `fit_model()` returns a data frame, we also choose `format = "fst_tbl"` here.
```{r}
fit = target(
fit_model(model_files[1], data),
dynamic = map(data),
format = "fst_tbl"
)
```
### Report
R Markdown reports should never do any heavy lifting in `drake` pipelines. They should simply leverage the computationally expensive work done in the previous targets. If we follow this good practice and our report renders quickly, we should not need heavy computing resources to process it, and we can set `hpc = FALSE` below.
The [`report.Rmd` file itself](https://github.com/wlandau/drake-examples/blob/main/stan/report.Rmd) has `loadd()` and `readd()` statements to refer to these targets, and with the `knitr_in()` keyword below, `drake` knows that it needs to update the report when the models or datasets change. Similarly, `file_out("report.html")` tells `drake` to rerun the report if the output file gets corrupted.
```{r}
report = target(
render(
knitr_in("report.Rmd"),
output_file = file_out("report.html"),
quiet = TRUE
),
hpc = FALSE
)
```
## Try it out!
The complete code can be found at <https://github.com/wlandau/drake-examples/tree/main/stan> and downloaded with `drake::drake_example("stan")`, and we encourage you to try out the code yourself.