R from Research to Production

Daniel Giterman
NI Tech Blog
Jun 9, 2020

R is often regarded as a research-only language used mainly in academia, where it provides an easy way to slice and dice structured data, calculate statistics and visualize the results on plots, all in a few lines of code or a single script. But in addition to being the traditional statistical Swiss Army knife, the R ecosystem today also includes a wide variety of frameworks and tools that make it fully production-grade and allow a unique transition between quick iterative research and robust production data pipelines.

This post will focus on how we at Natural Intelligence built a full production-grade development process for R code, streamlining the transition between research, development and production phases.

Code Organization

I shall begin with the architecture of our code. First of all, we don’t write scripts; we develop R packages. We design our code much as we would in other programming languages and try to stick to best practices (type validation, code modularization, vectorization rather than loops, etc.). Our packages are segmented into two groups: the first contains generic infrastructure and utilities used across the whole project, while the second contains specialized, tailored solutions.

It is important to understand that R allows a relatively large variety of coding styles, and some packages even introduce their own flavor of syntax (e.g. data.table, tidyverse), so one has to maintain a high level of discipline to keep an R project coherent (this is also true for Python).

In spite of that, an R package has a very strict structure that consists of the following (a minimal sketch of the first three items follows the list):

  1. R code
  2. Markdown-based documentation (using roxygen2)
  3. Tests (using testthat)
  4. Logger with various handlers (using logging)
  5. Package-specific configurations, implemented as R6 objects
  6. Compiled external dependencies (C++/Java/Stan etc.)
  7. List of dependency packages (Imports, Suggests, Depends)
  8. List of exported objects and methods (NAMESPACE)
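
To make items 1-3 concrete, here is a minimal, hypothetical sketch of an exported, roxygen2-documented function with simple type validation, together with a matching testthat test; the function and file names are illustrative, not our actual code:

    # R/safe_divide.R: roxygen2 comments generate the docs and the NAMESPACE export
    #' Divide two numeric vectors, guarding against division by zero
    #'
    #' @param x Numeric vector of numerators.
    #' @param y Numeric vector of denominators.
    #' @return Element-wise ratios, NA where y == 0.
    #' @export
    safe_divide <- function(x, y) {
      stopifnot(is.numeric(x), is.numeric(y))   # simple type validation
      ifelse(y == 0, NA_real_, x / y)           # vectorized, no explicit loop
    }

    # tests/testthat/test-safe_divide.R
    test_that("safe_divide handles zero denominators", {
      expect_equal(safe_divide(c(4, 1), c(2, 0)), c(2, NA_real_))
      expect_error(safe_divide("a", 1))
    })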

A package, once developed, is built and compiled into bytecode (R >= 3.5.0), which makes later execution more efficient.

Although R is primarily a functional programming language, we frequently use the R6 class system to implement a classical object-oriented approach where applicable.
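
As an illustration, a package-specific configuration object (item 5 above) might be modeled roughly like this; the class, fields and defaults are purely illustrative:

    library(R6)

    # Hypothetical package-specific configuration object
    PackageConfig <- R6Class("PackageConfig",
      public = list(
        env = NULL,
        n_workers = NULL,
        initialize = function(env = "dev", n_workers = 4L) {
          stopifnot(env %in% c("dev", "staging", "prod"), is.integer(n_workers))
          self$env <- env
          self$n_workers <- n_workers
        },
        is_prod = function() identical(self$env, "prod")
      )
    )

    cfg <- PackageConfig$new(env = "prod", n_workers = 8L)
    cfg$is_prod()   # TRUE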

In our projects, we additionally keep a dedicated directory of R scripts that we call main. Each script is essentially a gateway to a specific procedure/solution used in our Airflow operators: it parses arguments and ultimately calls functions or classes from our developed packages.
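
A gateway script of this kind could look roughly as follows, assuming the optparse package for argument parsing; the package and function names are placeholders:

    #!/usr/bin/env Rscript
    # main/run_forecast.R: hypothetical gateway script invoked by an Airflow operator
    library(optparse)

    opts <- parse_args(OptionParser(option_list = list(
      make_option("--date", type = "character", help = "Run date (YYYY-MM-DD)"),
      make_option("--env",  type = "character", default = "prod")
    )))

    # All real logic lives in our packages; the script only wires arguments to it.
    library(ourForecastPkg)                       # placeholder package name
    run_forecast(run_date = as.Date(opts$date),   # placeholder function
                 env = opts$env)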

R&D with RStudio Server

RStudio Server provides a browser based interface to a version of R running on a remote Linux server, bringing the power and productivity of the RStudio IDE to server-based deployments of R.

We created a custom Docker image of our projects with RStudio Server to streamline the research and development funnel, ultimately providing better code sharing and eliminating the need to set up the whole R development environment on local machines.

Another advantage is that any code developed and any package installed in this environment is sure to run exactly as intended in production, because the environment is based on the production R Docker image (same OS, architecture and version).

But probably the most prominent advantage is the significant boost it gives our research activities. We simply estimate the CPU (or GPU) and RAM requirements for our research and choose an appropriate on-demand Amazon EC2 instance. We prepared a mini-deployment process that runs on instance initialization, so the service is always up to date with our newest Docker image. The instance type may depend on the size of the data we want to work with and on the number of parallelized iterative processes we plan to run.

To support efficient research, RStudio provides a powerful feature for running code in separate jobs, i.e. separate R processes, while optionally sharing variables with the main environment we are working in, so we can continue analyzing results on the go. This is very useful, for example, when testing an algorithm’s sensitivity to various parameters, different distributions and sampling schemes, or its stability over multiple runs.
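
For instance, a background job can be launched programmatically through the rstudioapi package (it can also be started from the IDE’s Jobs pane); the script path and job name below are illustrative:

    # Launch a script as a separate background R process from the RStudio console.
    # importEnv copies current variables into the job; exportEnv brings the job's
    # results back into the global environment when it finishes.
    rstudioapi::jobRunScript(
      path      = "analysis/param_sensitivity.R",
      name      = "sensitivity-run",
      importEnv = TRUE,
      exportEnv = "R_GlobalEnv"
    )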

Dependency Management

Package dependency management is accomplished with packrat, a tool that makes R projects more isolated, portable and reproducible. It creates a private library inside the project and, similarly to venv in Python, manages all dependency packages and their specific versions for the project.

Moreover, it snapshots the state of the private library, saving to the project directory whatever information is needed to recreate that same private library on another machine.
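
A typical packrat workflow, run from the project root, looks roughly like this:

    packrat::init()                    # create the private library and lockfile
    install.packages("data.table")     # installs into the private library
    packrat::snapshot()                # record exact versions in packrat/packrat.lock
    packrat::status()                  # compare the library against the lockfile
    packrat::restore()                 # on another machine: rebuild the same library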

To accommodate our special needs, and because most dependency packages don’t need to be reinstalled on every deployment, we developed a custom script we call packratReserve. It manages a local repository of already built packages (according to the packrat snapshot) and copies them into the dedicated private library of an R project.

This way, when we resolve dependencies with packrat, only the changed packages need attention and the build runtime decreases substantially.
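
The idea behind packratReserve can be sketched roughly as follows; this is not the actual script, and the cache path is illustrative:

    # Copy already-built packages from a local cache into the project's private
    # library, so packrat::restore() only has to build what is missing or changed.
    cache_dir   <- "/opt/r-package-cache"   # illustrative cache location
    private_lib <- .libPaths()[1]           # with packrat mode on, the private library
    for (pkg in list.dirs(cache_dir, recursive = FALSE)) {
      file.copy(pkg, private_lib, recursive = TRUE, overwrite = FALSE)
    }
    packrat::restore()                      # resolve remaining gaps from the snapshot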

In addition, we introduced a few more capabilities to this script (a rough sketch of all three follows the list):

  1. Retrieval of external (mostly niche) R packages hosted in repositories other than CRAN, such as R-Forge
  2. Retrieval of jar artifacts from Maven (when we have to use some Java methods)
  3. Retrieval of packages from various online sources via system command-line calls (helpful when installation of a package is complicated)
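
The calls behind these capabilities look roughly like the following; repository URLs, Maven coordinates and package names are all illustrative:

    # 1. Install a niche package from a non-CRAN repository such as R-Forge
    install.packages("somePkg", repos = "http://R-Forge.R-project.org")

    # 2. Fetch a jar artifact from a Maven repository
    download.file(
      "https://repo1.maven.org/maven2/org/example/lib/1.0/lib-1.0.jar",
      destfile = "inst/java/lib-1.0.jar", mode = "wb"
    )

    # 3. Install a package from an arbitrary online source via system commands
    system("curl -sSL -o /tmp/somePkg_1.0.tar.gz https://example.com/somePkg_1.0.tar.gz")
    system("R CMD INSTALL /tmp/somePkg_1.0.tar.gz")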

CI/CD Funnel

Finally, our integration and deployment funnel in R is composed of the following steps:

[Diagram: CI/CD funnel overview]

CI — Docker Packaging and Tests

Step 1: Pushing developed R source code to the Git repository.

Step 2: TeamCity triggers the R build configuration on a VCS change event in Git. In this step an intermediate Docker image is built, in which all required R packages and dependencies are installed (incl. JVM, gcc, etc.).

Additionally, the packratReserve script is launched to install/update the private library inside the image and, eventually, to update the local repository of installed packages on our TeamCity server by copying the changed libraries back from inside the Docker image.

This local repository is essentially a local cache of installed packages. Hence, we avoid downloading and installing all of the required packages on every build, which saves a lot of time.

Step 3: Building the final R Docker image from the intermediate Docker image, basically removing redundant artifacts, cleaning it up and adding external dependencies if needed. Thereafter, tests for each package are run from inside a Docker container.
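
Assuming testthat-based suites and placeholder package names, the per-package test run inside the container could look roughly like this:

    # Run each package's testthat suite inside the final container; a failing
    # test stops the run and therefore fails the CI build.
    pkgs <- c("niCoreUtils", "niComparisonModels")   # placeholder package names
    for (pkg in pkgs) {
      testthat::test_package(pkg, stop_on_failure = TRUE)
    }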

CD — Airflow Deployment

Step 4: Only after all relevant tests have passed is the R Docker image pushed to DockerHub and then deployed to all our relevant servers (e.g. Swarm managers) in all environments.

Step 5: R tasks are scheduled from Airflow to run at scale on Docker Swarm/K8s clusters.

CI — RStudio Server Packaging

As a side process, TeamCity also triggers a build configuration for the RStudio Server Docker image after every successful R Docker image build and ultimately updates this image on DockerHub. This image differs slightly from the production R Docker image: it adds the source code of all packages (for development purposes), Git access (so code can be committed from RStudio Server) and, of course, the RStudio Server service itself.

Summary

In this post we presented a methodology for using R and RStudio Server to provide a convenient funnel for data analysis, research and development in data projects, along with a robust way to deploy R in production. After more than two years of working according to this methodology across a multitude of data science projects, we can clearly say that it saves us a lot of time (especially in research and testing), makes the R ecosystem more accessible, and has resolved the technical gaps we encountered along the way.
