Do not edit this README by hand. See CONTRIBUTING.md.

CRAN Task View: Computational Environments and Reproducibility

Maintainer: Daniel Nüst
Contact: daniel.nuest at uni-muenster.de
Version: 2019-01-11

This Task View contains information about controlling and documenting computational environments in R. The base version of R does not provide features to easily manage different versions of R or collections of packages, so a number of approaches and packages exist to simplify computational environments for the sake of development, testing, bug-fixing, and reproducibility. The ReproducibleResearch Task View provides further discussion of packages around scientific reproducibility.

If you have any comments or suggestions for additions or improvements for this Task View, go to GitHub and submit an issue, or make some changes and submit a pull request. If you can’t contribute on GitHub, send Daniel an email. If you have an issue with one of the packages discussed below, please contact the maintainer of that package.

Contributors: [@nuest](https://github.com/nuest/), [@jdblischak](https://github.com/jdblischak/)

Virtual Machines and Containers

Virtual machines (VMs) are a straightforward way to encapsulate your runtime environment around the actual data and code.

In computing, a virtual machine (VM) is an emulation of a computer system. Virtual machines are based on computer architectures and provide functionality of a physical computer. https://en.wikipedia.org/wiki/Virtual_machine

For many users, their advantage is the graphical user interface they provide. VirtualBox is a Free and Open Source Software (FOSS) virtualization product that you can install on most operating systems (OS); the OS it is installed on is then known as the “host” OS. It supports a number of “guest” operating systems, many of which can install and run R just like a non-virtual OS. VMs must be booted like a regular OS and have a virtual disk, which you can archive or share with collaborators, although the file can be large. VMs can share directories, network, and other devices with their host.

Containers are an effective way to apply virtual environments at the system level. Compared to virtual machines, their most important advantages for controlled computational environments are performance and transparency. Containers share the host's operating system kernel, can “boot” within milliseconds, and have negligible computational overhead. They can be created with the help of scripts or “recipes”, which are simple text files. These recipes can be included in code repositories and easily shared online.

The most widespread container solution is Docker. It is available for recent versions of common operating systems. Dockerfiles are the recipes from which Docker images are built; running an image creates a Docker container. Using Docker requires some proficiency with a command line interface (CLI). The Docker Hub is an image repository with a large number of pre-built images for different use cases.
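As a minimal sketch of the life cycle described above (Dockerfile, then image, then container), the following shell script writes a two-line recipe and, only if requested, builds and runs it. The directory `demo`, the image tag `my-analysis`, and the `RUN_DOCKER` switch are hypothetical names chosen for illustration; `r-base` is the official image mentioned in the Rocker section.

```shell
#!/bin/sh
# Sketch of the Dockerfile -> image -> container life cycle.
# "demo", "my-analysis", and RUN_DOCKER are hypothetical names.
set -eu

mkdir -p demo

# 1. The recipe: a plain text file named Dockerfile.
cat > demo/Dockerfile <<'EOF'
FROM r-base
CMD ["Rscript", "-e", "print(sessionInfo())"]
EOF

# 2. Build the image and 3. run a container from it, but only if
#    Docker is installed and the RUN_DOCKER switch is set to 1.
if [ "${RUN_DOCKER:-0}" = "1" ] && command -v docker >/dev/null 2>&1; then
  docker build -t my-analysis demo
  docker run --rm my-analysis
else
  echo "Recipe written to demo/Dockerfile"
fi
```

Because the recipe is a plain text file, it can be committed to a code repository next to the analysis it describes.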

Rocker

The Rocker project provides a number of Docker images for R, including the official r-base image. All Rocker images are available on the Docker Hub. Rocker images are a stable and widespread tool for running R in local and cloud environments and have established useful best practices around containers with R.

Bioconductor provides a collection of images based on rocker/rstudio.

Since images can extend existing ones, using a suitable Rocker image as a base for your own computations is a very good approach to control your computational environment. The simplest way is to run the rocker/rstudio container and work with the RStudio IDE in your web browser. Alternatively, you can develop your analysis on your computer and “package” it in a container only when preparing for a software release or scientific publication.
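The second approach, packaging a finished analysis on top of a Rocker base image, might look like the sketch below. The directory `project`, the script `analysis.R`, the image tag `my-paper`, the `RUN_DOCKER` switch, and the pinned tag `4.3.1` are all assumptions made for illustration; pick the tag that matches the R version you actually used.

```shell
#!/bin/sh
# Sketch: extend a Rocker image with your own analysis script.
# "project", "analysis.R", "my-paper", RUN_DOCKER, and the tag 4.3.1
# are hypothetical choices, not part of the Rocker project itself.
set -eu

mkdir -p project

# A stand-in for the real analysis.
cat > project/analysis.R <<'EOF'
print(summary(cars))
print(sessionInfo())
EOF

# The recipe pins the R version via the image tag and adds the script.
cat > project/Dockerfile <<'EOF'
FROM rocker/r-ver:4.3.1
COPY analysis.R /analysis/analysis.R
CMD ["Rscript", "/analysis/analysis.R"]
EOF

# Build and run only if Docker is installed and RUN_DOCKER is set to 1.
if [ "${RUN_DOCKER:-0}" = "1" ] && command -v docker >/dev/null 2>&1; then
  docker build -t my-paper project
  docker run --rm my-paper
else
  echo "Recipe written to project/Dockerfile"
fi
```

For the interactive route, a command along the lines of `docker run --rm -p 8787:8787 -e PASSWORD=<your password> rocker/rstudio` should make RStudio available at http://localhost:8787; check the Rocker documentation for the options supported by the image version you use.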

Other R distributions and operating systems

Docker images for R distributions other than the “regular” one, and for operating systems other than the Debian-based Rocker images, are available on Docker Hub, though none at the level of maturity and features of Rocker.

  • MRO images are available as an independent contribution (i.e. not by the MRO team) on Docker Hub, nuest/mro, and as CentOS-based Dockerfiles on GitHub, jlisic/R-docker-centos.
  • Renjin images are available as an independent contribution on Docker Hub as nuest/renjin.
  • pqR images are available as an independent contribution on Docker Hub as nuest/pqr.

Tools for working with containers

Docker

  • harbor (not on CRAN) provides all Docker commands as R functions. It may be used to control Docker containers that run either locally or remotely.
  • docker is an alternative to harbor, which is plain R: it provides Docker CLI commands using the Docker SDK for Python via the package reticulate, and consequently runs on various operating systems including Windows. The package is best suited for experienced Docker users, i.e. if you know the Docker commands and life cycle. Source code is on GitHub.
  • dockermachine (not on CRAN) provides a convenient R interface to the docker-machine command, so you can easily provision local or remote/cloud instances of containers.
  • analogsea is a general purpose client for the Digital Ocean v2 API. In addition, the package includes functions to install various R tools including base R, RStudio server, and more. The package also offers an evolving interface for interacting with Docker on your remote droplets. (GitHub)
  • rize (not on CRAN) dockerises Shiny applications.
  • containerit (not on CRAN) automatically creates Dockerfiles for arbitrary R sessions, script files, or workspace directories.
  • dockertest (not on CRAN) is a proof of concept for using the isolated environments of Docker containers to run tests.
  • liftr partially automates rendering R Markdown documents with Docker by adding YAML metadata (example), see http://liftr.me/.
  • googleComputeEngineR (website) provides an R interface to the Google Cloud Compute Engine API, for example for creating an RStudio VM, also using Docker to configure the environment.
  • batchtools (repository, JOSS paper) provides a parallel implementation of Map for high-performance computing (HPC) systems with different schedulers, including Docker Swarm.

Deployment

Another way to share a well-defined computational environment is to set up R on a server.

Interactive development environments

Apps and APIs

Package management

Structure, templates and workflows

A good project structure is essential for knowing which computational environment was actually used. That environment includes locally defined functions and data, not just the packages used or the R version.

Tracking and provenance

A computational environment evolves as an analysis is developed. These packages help to observe these changes; they complement code versioning systems, which are always recommended.

  • freezer (not on CRAN) supports data analysis by capturing executions of analyses, including the code used, results, and metadata.
  • recordr (not on CRAN) provides an automated way to capture data provenance of “runs” for R scripts and console commands.

CRAN packages:

Related links: