R and Python got you in a Bind?

Configuring mybinder.org to interactively share data science analyses in R and Python from the same GitHub repository.

Kevin Donkers

Published in

Met Office Informatics Lab

9 min read · Aug 22, 2022

Motivation

A team in the Informatics Lab is working on applying machine learning to the thorny question of how to predict weather in a region of the world where seasonal weather forecasts have low skill. The region in question is North East China, where 5% of global maize is grown. Weather in the region is heavily influenced by monsoons and weather year-to-year can be highly variable, impacting the success of maize growth and harvests.

The project depends upon separate activities written in both Python and R, reflecting its diverse requirements and interdisciplinary nature. The purpose of this blogpost is to share our experiences of making analysis codes openly available and reproducible by the public, a key requirement underpinning open science. It will discuss how to configure a Binder to handle both R and Python code, using literate programming to encapsulate each analysis with a narrative justifying its purpose in the overall research.

Prior art

Many will be familiar with Jupyter Notebooks, and by extension Binder. For the uninitiated, Jupyter Notebooks are documents containing text, images and executable code that explain an analysis through a narrative and working code. A notebook is a front end to a live backend which executes the code cells in the selected language (Python, R, Julia, etc.).

Binder is an automatic service for launching a live, executable Jupyter Notebook (and other frontends) on the cloud, with the infrastructure gracefully handled by Docker. The generous people at Binder provide free, limited compute resources for the purposes of sharing research, demonstrators, educational material and even documentation. This cloud computing is provided for free by supporting organisations such as Google Open Source and the Alan Turing Institute. (The instance used in this example and blogpost runs on the “default” Google Cloud infrastructure.) The whole project is run by volunteers, and financially supported by donations and grants. The underlying technology can even be deployed on your own (organisation’s) infrastructure. Usage guidelines can be found here.

Source: https://binderhub.readthedocs.io/en/latest/overview.html#a-diagram-of-the-binderhub-architecture

From a user perspective, it works by simply pointing the service at an online git repository (e.g. hosted on GitHub) and Binder handles the rest. It has been around for a few years and has been continually developed to now include other user interfaces such as Jupyter Lab (an IDE-like implementation of Jupyter Notebook), RStudio and RShiny.

mybinder.org homepage: Specify a Git repository and other details to launch as a live session
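To make "pointing the service at a repository" a little more concrete, here is a minimal Python sketch of how a mybinder.org launch link can be assembled. The /v2/gh/<org>/<repo>/<ref> pattern and the urlpath query parameter follow the links that the mybinder.org form generates. The helper name binder_url, the HEAD default and the notebook path in the usage note are illustrative assumptions, not something taken from the example repository.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Public mybinder.org endpoint for repositories hosted on GitHub.
BINDER_BASE = "https://mybinder.org/v2/gh"


def binder_url(org: str, repo: str, ref: str = "HEAD",
               urlpath: Optional[str] = None) -> str:
    """Build a launch link for a GitHub repository (hypothetical helper).

    `ref` is a branch, tag or commit. `urlpath` optionally selects what
    opens after launch; the values used with it are assumptions to check
    against the links generated by the mybinder.org form.
    """
    # Percent-encode the ref so branch names containing "/" stay one path segment.
    url = f"{BINDER_BASE}/{org}/{repo}/{quote(ref, safe='')}"
    if urlpath:
        # urlencode percent-encodes "/" inside the query value.
        url += "?" + urlencode({"urlpath": urlpath})
    return url


if __name__ == "__main__":
    print(binder_url("informatics-lab", "binder_rstudio_jupyterlab_example",
                     urlpath="rstudio"))
```

Passing urlpath="rstudio" is intended to open RStudio, while something like urlpath="lab/tree/analysis-4.ipynb" (path assumed) would open a notebook in Jupyter Lab. It is worth confirming the exact values by generating a link on the mybinder.org homepage.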

RStudio will be familiar to those who write R code as the go-to IDE for developing R code. Traditionally used as a desktop application, RStudio now supports being run in a browser with RStudio Server and is available as a service through RStudio Cloud.

Source: https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/RStudio_IDE_screenshot.png/1024px-RStudio_IDE_screenshot.png

For this project I followed many guides on how to deploy a Binder for both R and Python code, since (maybe unsurprisingly) there is more than one way to configure the infrastructure of a Binder. I started with Zero-to-Binder, the canonical guide hosted by The Alan Turing Institute as part of The Turing Way. While this is a very useful guide, and setup for Binder with Python “Just Works™️”, I found that its recommendations around setting up a Binder for R were not as functional as implied. Following the method for R outlined in the main guide leads to a working Binder for (just) R. However, it didn’t support a Python Binder as well and I was not satisfied with how the R package installation was handled. So I added renv to the postBuild script to handle package specification more explicitly (see Working Configuration below for more details) and searched for some guidance on Binderising R and Python together.

As an aside, the guide recommends using the holepunch package, an R package which automatically configures a Binder instance for your R code. Unfortunately, it has not been maintained for over two years and I found it produced a completely blank Dockerfile that did not produce a working Binder instance, so I abandoned that line of action. The guide also recommends using Rocker, a repository of R-specific Docker images on DockerHub. While this works, this level of manual configuration is unnecessary with the now excellent automagic of repo2docker, so I would not recommend using this method for setting up a Binder either (although very useful for setting up a remote RStudio/RShiny instance!).

I ended up taking inspiration from the binder-examples/r_with_python and binder-examples/r repositories, which led to the final configuration described below.

Working configuration

The configuration that I found to work well for RStudio, Jupyter Lab and RShiny is demonstrated in this GitHub repository:
https://github.com/informatics-lab/binder_rstudio_jupyterlab_example

The infrastructural specifications are found in the .binder/ folder and consist of:

.binder/
├─ environment.yml   # Define *additional* conda packages to install
├─ install.R         # Define R packages to install (not used)
├─ postBuild         # Where renv::restore() is executed
├─ renv.lock         # Manifest of R packages to install using renv
├─ runtime.txt       # R version (from the MRAN archive)

environment.yml is used by conda to install additional packages for Python, including from PyPI using pip. Binder automatically includes a bunch of base Python packages such as jupyter, python, IPython, etc., so you do not need to define these yourself. It is possible to install R packages here too but not recommended as the conda channels for R are not well maintained.
install.R is the “standard” way of installing R packages for a Binder using install.packages() or devtools::install_version(), but I chose to use renv instead since it is more reproducible and allows you to explicitly specify package versions and sources in a single manifest file, renv.lock.
postBuild is run after the Docker image has been built, and this is where renv::restore() is executed. There is discussion about integrating renv into the Binder build process but that hasn’t been implemented so for now this is the easiest solution. It adds a little overhead to the Binder spin-up time, since the renv environment is not cached with the Docker image, but it is fast enough in my experience.
renv.lock defines the list of R packages to install using renv::restore().
runtime.txt is used to define the version of R (from MRAN) to install. It uses the format r-<version>-<year>-<month>-<day> as the only line of the file (I tried adding comments but that broke the system).
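As a rough illustration of the two plain-text manifests described above, the following Python sketch checks a runtime.txt string against the r-<version>-<year>-<month>-<day> format and lists the package versions pinned in an renv.lock file (which renv writes as JSON with a top-level "Packages" object). The function names and the sample values are hypothetical and are not the contents of the example repository. The strict "exactly one line" check simply mirrors the experience described above and is not a statement about how repo2docker parses the file.

```python
import json
import re
from datetime import date
from typing import Dict, Tuple

# r-<version>-<year>-<month>-<day>, e.g. "r-4.1-2022-08-01" (made-up value).
RUNTIME_PATTERN = re.compile(r"^r-(\d+(?:\.\d+){1,2})-(\d{4})-(\d{2})-(\d{2})$")

# A made-up, minimal lockfile in the JSON layout that renv writes.
SAMPLE_LOCK = """
{
  "R": {"Version": "4.1.3"},
  "Packages": {
    "ggplot2": {"Package": "ggplot2", "Version": "3.3.6", "Source": "Repository"}
  }
}
"""


def parse_runtime(text: str) -> Tuple[str, date]:
    """Return (R version, snapshot date) from the contents of runtime.txt."""
    lines = text.splitlines()
    if len(lines) != 1:
        raise ValueError("runtime.txt should contain exactly one line")
    match = RUNTIME_PATTERN.match(lines[0].strip())
    if match is None:
        raise ValueError(f"unexpected runtime format: {lines[0]!r}")
    version, year, month, day = match.groups()
    # date() raises ValueError for impossible dates such as month 13.
    return version, date(int(year), int(month), int(day))


def locked_packages(lock_text: str) -> Dict[str, str]:
    """Map each package named in an renv.lock document to its pinned version."""
    lock = json.loads(lock_text)
    return {name: entry["Version"]
            for name, entry in lock.get("Packages", {}).items()}


if __name__ == "__main__":
    print(parse_runtime("r-4.1-2022-08-01\n"))
    print(locked_packages(SAMPLE_LOCK))
```

A check like this could be run locally before pushing, so that a malformed runtime.txt is caught before waiting for a Binder build.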

This relatively minimal configuration allows you to deploy a Binder with Jupyter Lab, Jupyter Notebook, RStudio and RShiny as the interfaces, all without having to define or edit a Dockerfile. This is all down to the magic of repo2docker and binderhub, so kudos to the JupyterHub team!*

*The ease of setting up such a system might raise the question of how secure Binder is. The Binder team have taken precautions, as discussed in their docs.

Examples

There are four demonstration analyses included in the binder_rstudio_jupyterlab_example repository. All four plot example data for maize yields in three Chinese provinces, acquired from the National Bureau of Statistics of China, and ERA5 reanalysis weather data for those same provinces.

Jupyter(Lab) with Python

The orange JUPYTERLAB badge in the link above will open a Jupyter Lab session with the Jupyter Notebook analysis-4.ipynb open. This notebook plots data in the data/ directory as scatter plots with linear regression lines fitted using Python. You can execute the cells in the notebook which will be run using the live Python kernel in the background. You can add code, edit the notebook, create new notebooks and files, and add your own data to analyse (using the file browser on the left hand pane).
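For readers unfamiliar with what "linear regression lines fitted" involves, here is a minimal pure-Python sketch of an ordinary least-squares fit of a straight line. It is not the code in analysis-4.ipynb, which may well use a plotting or statistics library instead, and the numbers in the example are synthetic values chosen to lie exactly on a line, not maize or weather data.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Synthetic points lying exactly on y = 2x + 1.
    print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))
```

The fitted slope and intercept are what a plotting library would use to draw the regression line over the scatter points.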

Jupyter(Notebook) with R

The orange JUPYTERNOTEBOOK badge in the link above will open the Jupyter Notebook analysis-2.ipynb in a standalone Jupyter Notebook session. You can run the cells like a regular Jupyter Notebook, which plots data in the data/ directory as scatter plots with linear regression lines fitted, using an R kernel. You can add code and edit the notebook but you cannot easily add data to analyse. The JupyterLab example above has a better user interface for adding and navigating files. This use case is not specific to an R-based Jupyter Notebook; it is simply a demonstration of another interface.

RStudio

The blue RSTUDIO badge in the link above will launch an RStudio session, from which you can navigate the file browser to open the R script analysis-1.R and RMarkdown notebook analysis-3.Rmd. These can be run as if you were using RStudio Desktop. You can edit these scripts, create new files, and upload your own data for analysis.

RMarkdown notebook in RShiny

The blue RMARKDOWN badge in the link above will open the RMarkdown notebook analysis-3.Rmd in an interactive RShiny instance. It simply plots data in the data/ directory as scatter plots with linear regression lines fitted, and embeds a couple of RShiny widgets. Cells cannot be executed or edited like a Jupyter notebook, but the widgets can be interacted with like in an RShiny dashboard.

It took a lot of digging to find a working implementation of this configuration. The functionality is well established within R, but its implementation within the Binder project was unclear. These two pull requests (#799 & #891) in the repo2docker GitHub repo eventually led to a solution. Please note that for this setup to work the .Rmd file must be located in the base directory of the repo; subfolders are not supported for this functionality. The name of the file does not matter.
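The base-directory constraint is easy to trip over, so a small pre-flight check can help. The sketch below is a hypothetical helper (the name rmd_in_repo_root is made up for illustration) that takes a path relative to the repository root and reports whether it is an .Rmd file sitting directly in that root.

```python
from pathlib import PurePosixPath


def rmd_in_repo_root(relative_path: str) -> bool:
    """True if the path names an .Rmd file directly in the repository root.

    `relative_path` is expected to be relative to the root of the repo,
    written with forward slashes.
    """
    path = PurePosixPath(relative_path)
    # A single path component means there is no enclosing subfolder.
    return path.suffix.lower() == ".rmd" and len(path.parts) == 1


if __name__ == "__main__":
    print(rmd_in_repo_root("analysis-3.Rmd"))
    print(rmd_in_repo_root("notebooks/analysis-3.Rmd"))
```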

RShiny dashboards are also supported in MyBinder using a server.R and ui.R configuration. This setup is not demonstrated in this repo.

Future work

As a result of doing this project there are a few things that could improve the system:

Saving the final iteration of the Binder Docker image out to a container registry like DockerHub, where fully built images are saved and then downloaded and run for quick spin-up times (since the building stage has already been done). There does not seem to be an easy way to do this with repo2docker or binderhub, but it would be useful for an organisation like the Met Office to provide pre-compiled Docker images for the finalised software environments used to recreate analyses for specific papers/projects.
Including renv in the R Binder stack. This is discussed in the following GitHub issue on the repo2docker repository.
Support for RMarkdown notebooks could be better. At present the implementation seen in the example above requires that the .Rmd file be located in the base directory of the repository. Support for locating the files in a subdirectory would allow more flexibility for repository configurations.
Using flexdashboard to arrange RShiny dashboards. It looks like a cool technology that builds on the ability of RMarkdown to be used with RShiny.

Conclusion

Having been aware of and used Binder for over four years now, it’s exciting to see it go from strength to strength. It’s an invaluable resource for creating demonstrations and reduces the barrier to reproducible research. While the configurations demonstrated in this blogpost took a bit of digging and trial and error, it was a surprisingly painless (and fun) week where very little broke completely or just didn’t work. Kudos to the community and Jupyter devs for working so openly that there are plenty of breadcrumbs to follow for an answer.

Given the continued ease of setting up a Binder instance, I encourage all scientists, project maintainers and research software engineers to have a go with Binder — it might be just the (free) resource you’ve been looking for.