---
title: "My Example Computed Manuscript"
subtitle: Created in R Markdown
titlerunning: Example computed manuscript
date: "`r format(Sys.time(), '%d %b %Y %H:%M:%S %Z')`"
author: "Jeffrey M. Perkel, Technology Editor, Nature"
output:
  bookdown::html_document2: default
  pdf_document: default
  bookdown::word_document2: default
  bookdown::pdf_book:
    base_format: rticles::springer_article
    extra_dependencies: booktabs
abstract: "A mock computed manuscript created in RStudio using {rmarkdown}. The {bookdown}
  and {rticles} packages were used to output the text in Springer Nature's desired
  manuscript format. \n"
bibliography: bibliography.bib
biblio-style: spbasic
authors:
- name: Jeffrey M. Perkel
  address: Springer Nature, 1 New York Plaza, New York, NY
  email: jeffrey.perkel@nature.com
csl: nature.csl
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE,
                      message = FALSE,
                      echo = FALSE)
```
```{r load-libraries, include=FALSE}
# load libraries
library(tidyverse)
library(ggbeeswarm)
library(bookdown)
```
# Introduction {#intro}
"Literate programming" is a style of programming that uses computational notebooks to weave together code, explanatory text, data and results into a single document, enhancing scientific communication and computational reproducibility.[@shen2014; @perkel2018a; @perkel2018] (These references were added into the document using RStudio's integration with the open-source Zotero reference manager [@perkel2020] plus the [Better BibTeX](https://retorque.re/zotero-better-bibtex/) Zotero plugin.)
Several platforms for creating such documents exist.[@perkel2021] Typically, these documents interleave code and text 'blocks' to build a computational narrative. But some, including [R Markdown](https://rmarkdown.rstudio.com/), [Observable](https://www.observablehq.com), and the [Jupyter Book](https://jupyterbook.org/intro.html) extension to the Jupyter ecosystem, also allow authors to include and execute code "inline" -- that is, within the text itself.
This makes it possible to create fully executable manuscripts in which the document itself computes and inserts values and figures into the text, rather than requiring authors to enter them manually. This is in many ways the 'killer feature' of computed manuscripts: it removes the risk that an author will enter an incorrect number, or forget to update a figure or value when new data arrive. Among other uses, it allows authors to time-stamp their documents automatically, or to insert the current version number of the software they use into their methods. For instance, this document was built at **`r format(Sys.time(), "%d %b %Y %H:%M:%S %Z")`** and calls the following R packages: `{tidyverse}` ver. **`r packageVersion("tidyverse")`**, `{ggbeeswarm}` ver. **`r packageVersion("ggbeeswarm")`** and `{bookdown}` ver. **`r packageVersion("bookdown")`**.
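
As a rough illustration of what those inline expressions evaluate to, the sketch below formats a fixed timestamp and looks up a package version. The timestamp value and the variable names are hypothetical and chosen only so the result is reproducible; the manuscript itself calls `Sys.time()` and queries the packages it loads.

```r
# a fixed, hypothetical timestamp so the result is reproducible;
# the manuscript itself uses Sys.time()
built_at <- as.POSIXct("2021-03-01 12:34:56", tz = "UTC")

# same format string as in the text:
# day, abbreviated month, year, time, time zone
stamp <- format(built_at, "%d %b %Y %H:%M:%S %Z")

# packageVersion() returns a version object; as.character() gives its printable form
# {stats} ships with R, so this runs without installing anything extra
stats_version <- as.character(packageVersion("stats"))
```

The abbreviated month name produced by `%b` depends on the locale of the machine that builds the document.
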
In this manuscript, created in RStudio using the R Markdown language, we will demonstrate a more practical example. (An Observable version is [also available](https://observablehq.com/@jperkel/example-executable-observable-notebook).)
# Results {#results}
## Inline computation {#sec:1}
Imagine we are analyzing data from a clinical trial. We have divided the subjects into three groups and measured the concentration of some metabolite in each subject. (These data are simulated.)
```{r initial-data}
# read in some initial data
df1 <- read_csv('data/example-data-1.csv')
```
```{r radius}
# radius of a circle
r = 10
```
Rather than analyzing those data and then copying the results into our manuscript, we can use the programming language `R` to perform the analysis in the manuscript itself. Simply enclose the code in backticks, with the letter `r` immediately after the opening backtick. For instance, we could calculate the area and circumference of a circle:
$$A = \pi r^2, C = 2 \pi r$$
You could write "A = `` `r
pi * r^2` `` and C = `` `r
2 * pi * r` ``". Plugging in the radius *r* = **`r r`**, that evaluates to "A = **`r round(pi * r^2, 2)`** and C = **`r round(2 * pi * r, 2)`**".
Returning to our dataset, we can count the rows in our table to determine the number of samples, and insert that into the text. Thus, we have **`r nrow(df1)`** (simulated) subjects in our study (see Table \@ref(tab:show-table-1); see [`R/mock_data.R`](https://github.com/jperkel/computed_manuscript/blob/main/R/mock_data.R) in the GitHub repository for code to generate a mock dataset). Note that the tables, figures and sections in this document are numbered automatically thanks to the `{bookdown}` package.
The average metabolite concentration in this dataset is **`r round(mean(df1$conc), 2)`** (range: **`r paste(min(df1$conc), max(df1$conc), sep = ' to ')`**). We have **`r df1 %>% filter(class == 'Group 1') %>% nrow()`** subjects in Group 1, **`r df1 %>% filter(class == 'Group 2') %>% nrow()`** subjects in Group 2, and **`r df1 %>% filter(class == 'Group 3') %>% nrow()`** subjects in Group 3. (The numbers in **boldface type** throughout this document are computed values.)
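
The per-group counts above each filter the table separately. As a minimal sketch, the same counts could instead be tabulated in one pass with base R's `table()`. The data frame `demo` below is hypothetical: it borrows only the `class` and `conc` column names from the real data, and its values are made up.

```r
# hypothetical stand-in for df1: same column names (class, conc), made-up values
demo <- data.frame(
  class = c("Group 1", "Group 2", "Group 1", "Group 3", "Group 2", "Group 1"),
  conc  = c(10.5, 12.1, 9.8, 15.2, 11.4, 10.1)
)

# count the subjects in every group with a single call
group_counts <- table(demo$class)

# look up one group's count by name, as an inline expression could
n_group1 <- group_counts[["Group 1"]]
```

Whether this is preferable to the repeated `filter()` calls is a matter of taste; the tabulated form avoids retyping the group names but hides the individual comparisons.
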
```{r new-data}
# read new dataset
df2 <- read_csv('data/example-data-2.csv')
```
## Incorporating new data {#sec:2}
Now suppose we get another tranche of data (Table \@ref(tab:show-table-2)). There are **`r nrow(df2)`** subjects in this new dataset, with an average concentration of **`r round(mean(df2$conc), 2)`** (range: **`r paste(min(df2$conc), max(df2$conc), sep = ' to ')`**).
```{r combine-tables}
# merge datasets
final_data <- rbind(df1, df2)
```
Combining the two datasets, we have a total of **`r nrow(final_data)`** subjects with an average metabolite concentration of **`r round(mean(final_data$conc), 2)`** (range: **`r paste(min(final_data$conc), max(final_data$conc), sep = ' to ')`**). We now have **`r final_data %>% filter(class == 'Group 1') %>% nrow()`** subjects in Group 1, **`r final_data %>% filter(class == 'Group 2') %>% nrow()`** in Group 2, and **`r final_data %>% filter(class == 'Group 3') %>% nrow()`** in Group 3. The concentration distribution for each group in this joint dataset is shown graphically in Figure \@ref(fig:plot-data-1).
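
To make the merge step concrete, here is a minimal sketch using two small hypothetical data frames (`batch1` and `batch2`, with made-up values) in place of the CSV files. It assumes, as the text does, that both tables share the same columns.

```r
# hypothetical batches with made-up values; the real data come from the CSV files
batch1 <- data.frame(class = c("Group 1", "Group 2"), conc = c(10, 14))
batch2 <- data.frame(class = c("Group 1", "Group 3"), conc = c(12, 20))

# rbind() stacks the rows; both tables must have the same columns
combined <- rbind(batch1, batch2)

# recompute the summary values from the combined table
n_total    <- nrow(combined)                                               # 4
mean_conc  <- round(mean(combined$conc), 2)                                # 14
conc_range <- paste(min(combined$conc), max(combined$conc), sep = " to ")  # "10 to 20"
```

Because every summary is derived from `combined`, adding a further batch only requires extending the `rbind()` call; nothing downstream needs to be edited by hand.
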
```{r plot-function}
# create a box-plot with overlaid points
create_plot <- function(mytable) {
  p <- mytable %>%
    ggplot(aes(x = class, y = conc, fill = class, color = class)) +
    geom_boxplot(outlier.shape = NA, alpha = 0.2) +
    ggbeeswarm::geom_quasirandom(width = 0.25) +
    xlab("") +
    ylab("Metabolite concentration") +
    theme_minimal() +
    theme(legend.position = "none")
  p
}
```
```{r plot-data-1, fig.cap="Metabolite concentration of clinical trial subjects", fig.height=3, fig.width=4}
# plot the data
create_plot(final_data)
```
```{r get-child, child="child_doc.Rmd"}
# import the text from child_doc.Rmd
```
# Code {#code}
The following code was used to load, merge, and plot the (simulated) clinical trial data in Figure \@ref(fig:plot-data-1):
```{r show-code-1, echo=TRUE, eval=FALSE, ref.label='load-libraries'}
```
```{r show-code-2, echo=TRUE, eval=FALSE, ref.label='initial-data'}
```
```{r show-code-3, echo=TRUE, eval=FALSE, ref.label='new-data'}
```
```{r show-code-4, echo=TRUE, eval=FALSE, ref.label='combine-tables'}
```
```{r show-code-5, echo=TRUE, eval=FALSE, ref.label='plot-function'}
```
```{r show-code-6, echo=TRUE, eval=FALSE, ref.label='plot-data-1'}
```
```{r make_3col_table}
# a generic function to print an arbitrary table 3 cols wide
make_3col_table <- function(mytable) {
  input_rows <- nrow(mytable)
  # final_rows is the number of rows in the final table -- ie, nrow(mytable)/3
  # ceiling returns input_rows/3, rounded up to the nearest integer if it's a fraction
  final_rows <- ceiling(input_rows / 3)
  # if input_rows is not evenly divisible by 3, pad with extra rows
  if (input_rows %% 3) {
    for (i in 1:(3 - (input_rows %% 3))) mytable <- rbind(mytable, rep('', 3))
  }
  tmp <- cbind(mytable[1:final_rows,], rep('|', final_rows),
               mytable[(final_rows+1):(2*final_rows),], rep('|', final_rows),
               mytable[((2*final_rows)+1):(3*final_rows),])
  names(tmp) <- c('ID', 'Class', 'Conc', '|', 'ID', 'Class', 'Conc',
                  '|', 'ID', 'Class', 'Conc')
  return (tmp)
}
```
```{r show-table-1}
knitr::kable(make_3col_table(df1), booktabs = TRUE,
             caption = "Initial subject data")
```
```{r show-table-2}
knitr::kable(make_3col_table(df2), booktabs = TRUE,
             caption = "Second batch of subject data")
```
```{r show-table-3}
knitr::kable(make_3col_table(df3), booktabs = TRUE,
             caption = "Third batch of subject data")
```
# Colophon
This manuscript was built at **`r format(Sys.time(), "%d %b %Y %H:%M:%S %Z")`** using the following computational environment and dependencies:
```{r colophon}
sessionInfo()
```
The current Git commit details are:
```{r git-info}
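# per Marwick, this code runs only if {git2r} is installed and we are inside a Git repository
# `&&` short-circuits, so git2r is never called when the package is missing
if (requireNamespace("git2r", quietly = TRUE) && git2r::in_repository(path = '.'))
  git2r::commits(here::here())[[1]]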
```
# References