# Reproduction
The whole concept behind open science is that other researchers or students will be able to reproduce,
replicate, and extend scientific studies after they are published.
This chapter is devoted to that epilogue: how to engage with a published open science study.
The repurposing of a scientific study will normally start with reproducing the original study results - a process that will differ depending on the computational environment and original materials.
We hope that by standardizing a research template, we will have eased the work of reproducing the study.
This chapter provides some guidance for reproducing a study that has been published with an executable research compendium.
We first review procedures for studies conforming to version 2.0 or greater of the [HEGSRR-Template](https://www.github.com/HEGSRR/HEGSRR-Template).
Then, we provide some options for studies published in other formats.
For each scenario, we review options for two commonly used computational notebook formats:
R with R Markdown, and Python with Jupyter.
## Our template
Our reproducibility template, which can be found at [HEGSRR/HEGSRR-Template](https://github.com/HEGSRR/HEGSRR-Template) on GitHub, is designed to reduce common frustrations when trying to execute a compendium for computational geography.
Here are some tips on setting up the computational environments equivalent to those used in our template.
Refer to the `procedure/environment/` folder for a README.md that may contain quick start instructions.
### R
R is "a free software environment for statistical computing and graphics".
To install R, navigate to <https://cloud.r-project.org/>.
Additionally, install RStudio (<https://posit.co/downloads/>),
the most popular IDE (integrated development environment) for R.^[JetBrains, [The State of Developer Ecosystem in 2020](https://www.jetbrains.com/lp/devecosystem-2020/r/) survey]
Those looking for a "cloud" solution may want to try out [Posit Cloud](https://posit.cloud/).
#### The `groundhog` package
Our template ([HEGSRR/HEGSRR-Template](https://github.com/HEGSRR/HEGSRR-Template)) uses the package `groundhog` ([CredibilityLab/groundhog](https://github.com/CredibilityLab/groundhog)) to achieve reproducibility of R scripts.
Given a date and a list of required packages, `groundhog` installs and loads the versions of those packages that were current on that date, and checks whether your version of R matches the one that was current then.
To recover the list of packages, in the relevant `.Rmd` file, look for references to `groundhog`.
This is typically in a code chunk that deals with setup tasks.
As [recommended](https://groundhogr.com/using/) by `groundhog`'s developers, a sensible first step is to change the specified `groundhog.day` to a reasonably recent date - a date corresponding to your version of R, or a very recent date such as last week.
Then, we can execute the main function, `groundhog.library()`.
Depending on the date you selected, you may need to install another version of R that was up-to-date on that date.
This is expected, because changes in R itself may impact computational results as well.
You may also need to give `groundhog` permission to install its packages to a certain folder; for our template, this has been set to `data/scratch/groundhog/`.
If the code functions as expected, then you are all set!
If not, change the groundhog day back to the original date;
once the matching version of R is installed, `groundhog` will install the package versions from that date.
See the article [Changing R versions for the RStudio Desktop IDE](https://support.posit.co/hc/en-us/articles/200486138-Changing-R-versions-for-the-RStudio-Desktop-IDE) for how to tell RStudio to use a specific version of R.
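As a rough aid for locating the `groundhog` references mentioned above, the following Python sketch scans the text of an `.Rmd` file for lines that mention `groundhog` and extracts the date. It assumes the date is assigned to a variable named `groundhog.day` as a quoted `YYYY-MM-DD` string; the sample lines and the function names are hypothetical and are not taken from the template.

```python
import re

# Assumption: the date is assigned as groundhog.day <- "YYYY-MM-DD" (or with =).
GROUNDHOG_DAY = re.compile(
    r"""groundhog\.day\s*(?:<-|=)\s*["'](\d{4}-\d{2}-\d{2})["']"""
)


def find_groundhog_lines(rmd_text):
    """Return (line_number, line) pairs for every line that mentions groundhog."""
    return [
        (number, line.strip())
        for number, line in enumerate(rmd_text.splitlines(), start=1)
        if "groundhog" in line
    ]


def find_groundhog_day(rmd_text):
    """Return the first groundhog day found as a string, or None if absent."""
    match = GROUNDHOG_DAY.search(rmd_text)
    return match.group(1) if match else None


# Hypothetical setup-chunk contents, for illustration only.
sample_rmd = "\n".join([
    "library(groundhog)",
    'packages <- c("here", "tidyverse")',
    'groundhog.day <- "2023-06-26"',
    "groundhog.library(packages, groundhog.day)",
])

for number, line in find_groundhog_lines(sample_rmd):
    print(number, line)
print("groundhog day:", find_groundhog_day(sample_rmd))
```

For a real compendium, read the file first, for example with `pathlib.Path("analysis.Rmd").read_text()`, where the file name is a placeholder.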
### Python
Installing Python locally can be somewhat complicated.
In addition to Python, [Jupyter](https://jupyter.org/) is also needed to have interactive notebooks.
Tutorials found on the Internet also frequently mention [conda](https://docs.conda.io/en/latest/), a package manager;
[additional packages](https://stackoverflow.com/a/43197286) such as `ipykernel` or `nb_conda_kernels` may also be needed.
As an alternative, [many platforms](https://datasciencenotebook.org/) offer a "cloud" notebook that requires minimal setup.
Additional benefits of these cloud notebooks *may* include:
- Disposable environment (see the [Disposable environment](#disposable-environment) section below)
- No hardware requirement for the user
- Access to premium hardware for more computational power, usually at a cost
- Share and collaborate with others
#### Disposable environment
By default, the Python package manager, `pip`, installs packages into a single library shared by every project that uses the same Python installation.
This may lead to conflicts, such as when different projects require different versions of the same package.
Fortunately, if a machine is disposable such that it never runs more than one project, then one does not have to worry about managing packages - one can simply install-and-forget.
One example of such a "disposable environment" is [Google Colab](https://colab.google/), where one is given a fresh machine whenever one connects to a Jupyter runtime.
Another example is <https://mybinder.org/>, which provides a free virtual machine from a repository of Jupyter notebooks.
#### Package management
However, if you are working on a personal laptop, departmental machine, or another non-disposable environment, then keeping the system tidy is desirable.
In this case, before installing all the packages listed inside `requirements.txt`, a "virtual environment" should be created.
With a virtual environment, no changes are made at the system level.
Python's own [venv](https://docs.python.org/3/library/venv.html) and [conda](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) are two common approaches.
The Real Python tutorial [Python Virtual Environments: A Primer](https://realpython.com/python-virtual-environments-a-primer/) explains why virtual environments are useful and walks you through using `venv`.
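Before installing anything, it can help to confirm whether the running interpreter is already inside a virtual environment. The sketch below relies on the standard library only: for environments created with `venv`, `sys.prefix` differs from `sys.base_prefix`. The function name is hypothetical, and this check does not detect `conda` environments, which are managed differently.

```python
import sys


def in_virtual_environment(prefix=None, base_prefix=None):
    """Return True when the interpreter runs inside a venv-style environment."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    # Inside a venv, sys.prefix points at the environment folder, while
    # sys.base_prefix still points at the Python installation it was made from.
    return prefix != base_prefix


if in_virtual_environment():
    print(f"Virtual environment active: {sys.prefix}")
else:
    print("No virtual environment detected; pip would install outside a project-specific environment.")
```

The optional arguments exist only so that the logic can be checked without switching interpreters.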
#### The `pipenv` package
For Python, our template ([HEGSRR/HEGSRR-Template](https://github.com/HEGSRR/HEGSRR-Template)) may have used the package `pipenv` for a virtual environment.
This is indicated by a `Pipfile` and a `Pipfile.lock` file, found under the `procedure/environment/` folder.
To recover the computational environment, first install `pipenv`:
```python
!pip install --user pipenv
```
Then, navigate to the `procedure/environment/` folder.
If a `Pipfile` and a `Pipfile.lock` file are present, run:
```python
# Use the %cd magic: a directory change made with !cd does not persist to the next command.
%cd ../environment
!pipenv sync
```
If a `Pipfile.lock` file is not present, run:
```python
# Use the %cd magic: a directory change made with !cd does not persist to the next command.
%cd ../environment
!pipenv install
!pipenv sync
```
#### The `pigar` package
If no `Pipfile` or `Pipfile.lock` file is found under `procedure/environment/`,
then the project in question may have used `pigar` ([damnever/pigar](https://github.com/damnever/pigar)) to generate `requirements.txt`,
a list of the packages needed.
If you are using a disposable environment, then simply run the following code (in Jupyter) to install the required packages:
```python
!pip install -r /path/to/requirements.txt
```
For non-disposable environments, we recommend installing `pipenv` for virtual environments:
```python
!pip install --user pipenv
```
Then, utilize `pipenv` to import the requirements file:
```python
!pipenv install -r path/to/requirements.txt
```
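The choice between the commands in the `pipenv` and `pigar` sections depends on which files are present in `procedure/environment/`. The following sketch encodes that decision as a small function that only returns the commands and does not run them; the function name and the `disposable` flag are hypothetical conveniences, not part of the template.

```python
from pathlib import Path


def choose_restore_commands(file_names, disposable=False):
    """Map the files found in the environment folder to the commands described above."""
    names = set(file_names)
    if "Pipfile" in names and "Pipfile.lock" in names:
        return [["pipenv", "sync"]]
    if "Pipfile" in names:
        return [["pipenv", "install"], ["pipenv", "sync"]]
    if "requirements.txt" in names:
        if disposable:
            return [["pip", "install", "-r", "requirements.txt"]]
        return [["pipenv", "install", "-r", "requirements.txt"]]
    return []  # nothing recognizable: consult the project's README instead


# List the folder if it exists relative to the current working directory.
environment_dir = Path("procedure/environment")
found = [path.name for path in environment_dir.iterdir()] if environment_dir.is_dir() else []
for command in choose_restore_commands(found):
    print(" ".join(command))
```

Returning the commands as lists makes it straightforward to pass them to `subprocess.run(command, cwd=environment_dir, check=True)` once you are satisfied with the choice.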
## Other compendiums
Here are some general tips for recovering the computational environment of a research compendium one might find on the Internet.
### R
If you encounter a list of packages and their respective versions, you will have to install these versions manually.
You can do this with the `remotes::install_version()` function documented [here](https://search.r-project.org/CRAN/refmans/remotes/html/install_version.html).
However, older versions of packages on CRAN are usually only available in "source" form, and not in an immediately usable "binary" form.
This means that you will need to have an R development environment installed for the package to be compiled from source to binary.
You can check if you do by running `devtools::has_devel()`.
On Windows, the "development environment" is [RTools](https://cran.r-project.org/bin/windows/Rtools/);
on macOS, [the process](https://mac.r-project.org/tools/) is slightly more complicated.
One further complication is that packages often require system-level dependencies.
These dependencies are additional software that is not always available, or consistent, across different machines.
Thus, installing "source" packages is often prone to failure.
#### Recovering with `groundhog`
Fortunately, the `groundhog` package ([CredibilityLab/groundhog](https://github.com/CredibilityLab/groundhog)) can be used to attempt recovery of a computational environment, especially when the original versions are unknown.
Given a list of packages, the next step is to guess a "groundhog day": a date on which the then-current versions of these packages work with the code.
A good guess is when the research in question was conducted, or the date of publication.
Once a date (for which the packages function correctly) is identified, that date can be recorded, and a reproducible compendium is produced.
#### The `renv` package
Analogous to `venv` in Python, `renv` records the versions of R and its packages for a certain project, and isolates these packages in a separate virtual environment.
Per [the documentation](https://rstudio.github.io/renv/articles/renv.html#collaboration), a project with `renv` will contain a `renv` directory, as well as a `renv.lock` file.
When you open the project, `renv` will automatically activate itself and prompt you to run `renv::restore()`, which restores the recorded packages.
Note that `renv` usually has to install older package versions from "source" when restoring, so this step can fail for the reasons described above.
See `renv`'s [Caveats page](https://rstudio.github.io/renv/articles/renv.html#caveats) for details.
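Because `renv.lock` is a JSON file, it can be inspected before attempting a restore, for instance to see which version of R the project expects. The Python sketch below assumes the usual lockfile layout with a top-level `R` entry and a `Packages` entry; the lockfile contents shown are made up for illustration, and the function name is hypothetical.

```python
import json


def summarize_renv_lock(lock_text):
    """Return the recorded R version and a {package: version} mapping."""
    lock = json.loads(lock_text)
    r_version = lock.get("R", {}).get("Version")
    packages = {
        name: info.get("Version")
        for name, info in sorted(lock.get("Packages", {}).items())
    }
    return r_version, packages


# Illustrative lockfile contents, not taken from a real project.
example_lock = """
{
  "R": {"Version": "4.2.1"},
  "Packages": {
    "here": {"Package": "here", "Version": "1.0.1", "Source": "Repository"},
    "sf": {"Package": "sf", "Version": "1.0-9", "Source": "Repository"}
  }
}
"""

r_version, packages = summarize_renv_lock(example_lock)
print("R version:", r_version)
for name, version in packages.items():
    print(f"{name} {version}")
```

For a real project, pass the contents of the project's `renv.lock` file instead of `example_lock`.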
### Python
Projects using virtual environments (`venv`, `virtualenv`, `conda`, Poetry, etc.) will usually at least mention the method used.
One can also look for hints in the folder structure, for example a `venv` folder.
Once the particular solution is located, the documentation for restoring that environment can then be readily found online.
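Looking for hints in the folder structure can be partly automated. The sketch below maps a few conventional file names to the tool that usually produces them; the mapping is an assumption based on common conventions, projects may deviate from it, and the function name and sample listing are hypothetical.

```python
from pathlib import PurePosixPath

# Conventional file names; projects may deviate from them.
MARKERS = {
    "Pipfile": "pipenv",
    "Pipfile.lock": "pipenv",
    "poetry.lock": "Poetry",
    "environment.yml": "conda",
    "requirements.txt": "pip with a requirements file",
    "pyvenv.cfg": "venv or virtualenv",
}


def guess_environment_tools(relative_paths):
    """Return the sorted list of tools hinted at by the given file paths."""
    tools = set()
    for path in relative_paths:
        name = PurePosixPath(path).name
        if name in MARKERS:
            tools.add(MARKERS[name])
    return sorted(tools)


# Hypothetical listing of a compendium's files.
listing = ["README.md", "analysis.ipynb", "venv/pyvenv.cfg", "requirements.txt"]
print(guess_environment_tools(listing))
```

An empty result does not mean that no environment was recorded; it only means that none of these particular markers was found.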
#### Manually create `requirements.txt`
If you encounter a list of package names in Python, it is possible to convert that list into a reproducible `requirements.txt` file.
Simply find a package version that works, and record it into the requirements file.
[Here is an example](https://pip.pypa.io/en/stable/reference/requirements-file-format/) requirements file.
Using the package `scipy` as an example:
- `scipy`: install [the latest version](https://pypi.org/project/scipy/) from the Python Package Index (PyPI).
- `scipy == 1.9.1`: install version 1.9.1 from PyPI.
- `scipy == 1.9.*`: install the latest version that starts with 1.9. This installs 1.9.3, unless `scipy` decides to release version 1.9.4.
To ensure repeatability, it is best practice to specify an exact version of packages; this is called "version pinning".
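Once a working set of packages is installed, the pinned lines can be generated rather than typed by hand. This is a minimal sketch that uses the standard library's `importlib.metadata` to look up the installed version of each package; the function name and the `lookup` parameter are hypothetical, and the package names in the example call are placeholders.

```python
from importlib import metadata


def pin_requirements(package_names, lookup=metadata.version):
    """Return requirements lines that pin each package to its installed version."""
    lines = []
    for name in package_names:
        try:
            lines.append(f"{name}=={lookup(name)}")
        except metadata.PackageNotFoundError:
            # Keep a visible note instead of silently dropping the package.
            lines.append(f"# {name}: not installed here, version unknown")
    return lines


print("\n".join(pin_requirements(["scipy", "numpy"])))
```

The printed lines can be saved as `requirements.txt` after checking that the code runs correctly with the versions that were found.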
### Docker
You may encounter a compendium in the form of a Docker container or Dockerfile.
In this case, you will need to run Docker.
Docker is software that packages other software, together with its dependencies, into "images", and runs those images as isolated "containers".
Containers run on a "host" machine that runs the Docker program; on Windows and macOS, Docker Desktop runs them inside a lightweight Linux virtual machine.
Docker images can be distributed directly, usually through a "container registry" that you can "pull" from, and occasionally as a large file.
They can also come in the form of a Dockerfile, which is the "recipe" for building a Docker image.
The Dockerfile will need to be accompanied by the files (data, software, etc.) that Docker uses as "ingredients" to build the image.
Installing Docker (Desktop) on [Windows](https://docs.docker.com/desktop/install/windows-install/) or [macOS](https://docs.docker.com/desktop/install/mac-install/) is possible, but complications and caveats exist.
The preferred way to use Docker (Server) is through a server running [a supported Linux distribution](https://docs.docker.com/engine/install/#server).
A Dockerfile contains instructions on how to create ("build") a Docker image; run [docker build](https://docs.docker.com/engine/reference/commandline/build/) to turn a project-with-Dockerfile into an image.
Alternatively, the image may have already been built; in this case, [docker pull](https://docs.docker.com/engine/reference/commandline/pull/) the image onto the host machine.
Once we have an image, we can use [docker run](https://docs.docker.com/engine/reference/commandline/run/) to start a container from it.
Note that a running container is isolated from the rest of your system, much like a virtual machine, so you usually do not interact with its contents directly.
Often, the communication happens through a browser.
For example, the Docker image may run RStudio Server (or JupyterLab), in which case you can access the service with a browser.
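The build, pull, and run steps can be sketched as command lines assembled in Python. The image names `my-compendium` and `example/compendium` are hypothetical, and the port mapping assumes that the software inside the container is RStudio Server listening on its default port 8787 (JupyterLab would typically use 8888). The sketch only prints the commands; it does not execute Docker.

```python
import shlex


def docker_build_command(image_tag, context_dir="."):
    """Command that builds an image from the Dockerfile in context_dir."""
    return ["docker", "build", "-t", image_tag, context_dir]


def docker_pull_command(image_reference):
    """Command that downloads an already built image from a registry."""
    return ["docker", "pull", image_reference]


def docker_run_command(image_tag, host_port, container_port):
    """Command that starts a container and publishes one port to the host."""
    # --rm removes the container when it stops; -p maps host:container ports.
    return ["docker", "run", "--rm", "-p", f"{host_port}:{container_port}", image_tag]


# Hypothetical image names, for illustration only.
print(shlex.join(docker_build_command("my-compendium")))
print(shlex.join(docker_pull_command("example/compendium")))
print(shlex.join(docker_run_command("my-compendium", 8787, 8787)))
```

With a port published this way, the service would be reachable in a browser at `http://localhost:8787` on the host machine. Some images also expect settings such as a password to be supplied at startup, so check the compendium's README.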
#### Cloud services
Alternatively, you can utilize an online service to run Docker.
Many platforms offer "virtual machines" that run Linux distributions, so you can install and run Docker inside these machines.
Another commonly seen approach is for the platform to run Docker for you.
When using cloud services, especially "virtual machines", it is important to note that there is a risk of someone hacking into your cloud machine, possibly from anywhere in the world.
Therefore, when using a cloud service, it is desirable to follow common best practices (e.g. use a strong password, use SSH keys, set up a firewall), and take extra precautions to guard against unauthorized access to your machine.
## Acknowledgement
Special thanks to Yifei Luo for contributions to this chapter.