Let Renv and Git Hooks Manage R Project Environment for You

How we set up automatic tracing, versioning, and synchronization of project environment

Anastasia Lebedeva
DataSentics
Jun 20, 2020


Reproducibility and traceability are among the crucial requirements of a reliable system. Neither continuous delivery nor efficient collaboration is possible without them. In fact, any kind of delivery is hardly imaginable without a precise specification of the environment.

Fortunately, various tools and methods exist to address the issue, such as package managers, environment managers, and virtualization tools. These methods are widely applied among programmers but seem much less appreciated among data scientists and statisticians, partly because scientific projects often have no software delivery phase, and even when they do, the responsibility mostly falls well outside the scientists’ role. Nonetheless, to produce reliable, easy-to-reproduce results, scientists have to manage their computational environments.

A properly configured, versioned environment can save scientists plenty of time and nerves, even at the research stage. An isolated, traced environment allows multiple analyses to be conducted safely in parallel, with the environment of each project preserved for the entire project duration. It also brings a project much closer to production, regardless of whether that means a software release, a paper publication, or anything in between.

Photo by Steve Harvey on Unsplash

But let's move from theory to practice. Within one of our projects, we tackled the issue: we designed and implemented a workflow that automatically initiates an isolated environment for each project. Furthermore, we set up automation that traces, versions, and synchronizes the environment among collaborators.

In fact, we implemented the described workflow for both Python and R — the two most popular languages used for data analysis. While providing very similar capabilities, the two workflows support different development environments and use different environment managers. In this blog series, we cover both solutions. This post, the first in the series, introduces the workflow for R.

Package managers

There are multiple options when it comes to package managers for R. Below we discuss some of the most popular ones as of March 2020.

Conda

Since we aimed to create solutions for both R and Python, we first considered Conda, which is applicable to many programming languages, including Python, R, Ruby, Lua, Scala, Java, JavaScript, C/C++, FORTRAN, and more [source]. Unfortunately, there is no canonical way of using Conda within RStudio Server — the IDE we were constrained to.

Checkpoint

We then considered Checkpoint. Given a date, it accesses the corresponding CRAN snapshot, stored on the Microsoft R Application Network (MRAN) server. It exposes only one function, used as follows:
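A minimal usage sketch (the snapshot date here is purely illustrative):

```r
library(checkpoint)

# Scan the project for package dependencies and install them
# from the MRAN snapshot taken on the given date
checkpoint("2020-03-15")
```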

Checkpoint then overwrites the project’s options so that packages are downloaded from the specific snapshot at MRAN, and also creates a per-snapshot library. This means that you cannot manipulate your library package by package, and also that there is no easy migration from (and back to) more standard approaches.

Packrat and Renv

Then we turned to Packrat, which implements a more conventional way of library management than Checkpoint.

The package documentation states: “Use Packrat to make your R projects more isolated, portable and reproducible.” We soon discovered that there is another package, Renv, developed by the same authors, which is claimed to be “a robust, stable replacement for the Packrat package, with fewer surprises and better default behaviors” [source]. The documentation also notes that Packrat has been soft-deprecated and superseded by Renv.

Automatic library initiation and management

After weighing the pros and cons, we picked Renv as the package manager. Next, we wanted to automate its usage. The package does provide an intuitive interface and is well documented. Our goal, however, was to automate environment management as much as possible, so that scientists could concentrate on their actual work. All in all, we sought a way to guarantee that:

  1. The package manager will be applied to all projects developed within the platform
  2. The lockfile will be systematically updated and versioned
  3. Users working on the same project will have identical environments
  4. Scientists will spend minimum time and effort on environment management

Having the requirements in mind, we came up with the following solution.

Initialization

We implemented a shell script that clones a Git repo and initiates an isolated library for the project using renv::init(). We ask users to execute this script instead of the git clone command; this way, Renv is applied to each project. The shell script also configures Git hooks for the project, which automate library management, as described in the following section.
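A simplified sketch of such a script, assuming the hook templates live in a directory pointed to by a (hypothetical) GLOBAL_HOOKS_DIR variable; the script name and argument handling are illustrative, not our exact implementation:

```shell
#!/bin/sh
# init-project.sh (name illustrative): clone a repo and set up Renv
set -e

init_project() {
    repo_url="$1"
    target_dir="$2"

    # Users run this instead of a plain `git clone`, so every
    # project goes through the same environment setup
    git clone "$repo_url" "$target_dir"
    cd "$target_dir"

    # Install the shared Git hooks into the fresh clone
    cp "$GLOBAL_HOOKS_DIR"/pre-commit \
       "$GLOBAL_HOOKS_DIR"/post-merge \
       "$GLOBAL_HOOKS_DIR"/post-checkout .git/hooks/
    chmod +x .git/hooks/pre-commit .git/hooks/post-merge \
             .git/hooks/post-checkout

    # Initiate the isolated, project-local library
    Rscript -e 'renv::init()'
}

# Run only when invoked with both arguments
if [ "$#" -ge 2 ]; then
    init_project "$@"
fi
```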

After initialization, a project structure looks as follows:
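With Renv initiated, the project directory contains roughly the following (details vary with the Renv version):

```
project/
├── .Rprofile        # auto-generated; activates Renv for every R session
├── renv.lock        # the lockfile: repositories and package versions
├── renv/
│   ├── activate.R
│   └── library/     # the isolated, project-local package library
├── .git/
│   └── hooks/       # pre-commit, post-merge, post-checkout
└── ...              # the project sources themselves
```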

Tracking, versioning, synchronization

To achieve automatic environment management, we set up the following Git hooks:

  • a pre-commit hook, which executes renv::snapshot() to capture the state of the environment; the hook then attaches the updated lockfile to the commit.
  • post-merge and post-checkout hooks, which execute renv::restore() to restore the environment as described in the current version of the lockfile.
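The hooks themselves can be thin wrappers around the two Renv calls. A sketch follows; the non-interactive prompt flag and the explicit git add are assumptions, not necessarily our exact scripts:

```shell
#!/bin/sh
# .git/hooks/pre-commit: capture the library state and stage the lockfile
Rscript -e 'renv::snapshot(prompt = FALSE)'
git add renv.lock
```

```shell
#!/bin/sh
# .git/hooks/post-merge and .git/hooks/post-checkout:
# bring the library to the state recorded in the current renv.lock
Rscript -e 'renv::restore(prompt = FALSE)'
```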

We placed the hook scripts on the server and saved their location in an environment variable. The hooks actually added to each project then simply source the global scripts. This way, we ensured that the hooks can be updated, versioned, and moved easily and safely.
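In practice, each per-project hook can then be a one-line stub that sources the shared script; for example (the environment variable name is hypothetical):

```shell
#!/bin/sh
# .git/hooks/pre-commit: delegate to the centrally maintained script
. "$RENV_HOOKS_DIR/pre-commit.sh"
```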

Workflow illustrating functionality of Git hooks

Discussion

In this post, we introduced a solution capable of automatically setting up and managing R project environments. It makes each project’s environment fully isolated and traceable. It also synchronizes the environment automatically, ensuring that users working on the same project have identical environments. It requires no additional effort from scientists — thanks to the automation, but also because the process of installing and removing packages remains unchanged.

The described approach is relatively easy to implement, and you can simplify it even further by dropping some of the proposed automation. For instance, you may configure the Git hooks globally, if appropriate for your system, and ask users to initiate the Renv library manually instead of providing a shell script. The rest of the environment management would still be handled automatically by the Git hooks.

The proposed solution is also rather generic, as the second post in the series, which describes an implementation of the same idea for Python, confirms.

Thank you for reading. If you have any further questions or suggestions, feel free to leave a response. Also, if you have a topic in mind that you would like DataSentics to cover in future posts, let us know.
