R and Python got you in a Bind?
Configuring mybinder.org to interactively share data science analyses in R and Python from the same GitHub repository.
Motivation
A team in the Informatics Lab is working on applying machine learning to the thorny question of how to predict weather in a region of the world where seasonal weather forecasts have low skill. The region in question is North East China, where 5% of global maize is grown. Weather in the region is heavily influenced by monsoons and weather year-to-year can be highly variable, impacting the success of maize growth and harvests.
The project depends upon separate activities written in both Python and R, reflecting its diverse requirements and interdisciplinary nature. The purpose of this blogpost is to share our experiences of making analysis codes openly available and reproducible by the public, a key requirement underpinning open science. It will discuss how to configure a Binder to handle both R and Python code, using literate programming to encapsulate each analysis with a narrative justifying its purpose in the overall research.
Prior art
Many will be familiar with Jupyter Notebooks, and by extension Binder. For the uninitiated, Jupyter Notebooks are documents containing text, images and executable code that explain an analysis through a narrative and working code. A notebook is a front end to a live backend which executes the code cells in the selected language (Python, R, Julia, etc.).
Binder is an automatic service for launching a live, executable Jupyter Notebook (and other frontends) in the cloud, with the infrastructure gracefully handled by Docker. The generous people at Binder provide free, limited compute resources for the purposes of sharing research, demonstrators, educational material and even documentation. This cloud computing is provided for free by supporting organisations such as Google Open Source and the Alan Turing Institute. (The instance used in this example and blogpost runs on the “default” Google Cloud infrastructure.) The whole project is run by volunteers, and financially supported by donations and grants. The underlying technology can even be deployed on your own (organisation’s) infrastructure. Usage guidelines can be found here.
From a user perspective, it works by simply pointing the service at an online git repository (e.g. hosted on GitHub) and Binder handles the rest. It has been around for a few years and has been continually developed to now include other user interfaces such as Jupyter Lab (an IDE-like implementation of Jupyter Notebook), RStudio and RShiny.
RStudio will be familiar to those who write R code as the go-to IDE for developing R code. Traditionally used as a desktop application, RStudio now supports being run in a browser with RStudio Server and is available as a service through RStudio Cloud.
For this project I followed many guides on how to deploy a Binder for both R and Python code, since (maybe unsurprisingly) there is more than one way to configure the infrastructure of a Binder. I started with Zero-to-Binder, the canonical guide hosted by The Alan Turing Institute as part of The Turing Way. While this is a very useful guide, and setup for Binder with Python “Just Works™️”, I found that its recommendations around setting up a Binder for R were not as functional as implied. Following the method for R outlined in the main guide leads to a working Binder for (just) R. However, it didn’t support a Python Binder as well and I was not satisfied with how the R package installation was handled. So I added renv to the postBuild script to handle package specification more explicitly (see Working Configuration below for more details) and searched for some guidance on Binderising R and Python together.
As an aside, the guide recommends using the holepunch package, an R package which automatically configures a Binder instance for your R code. Unfortunately, it has not been maintained for over two years and I found it produced a completely blank Dockerfile that did not produce a working Binder instance, so I abandoned that line of action. The guide also recommends using Rocker, a repository of R-specific Docker images on DockerHub. While this works, this level of manual configuration is unnecessary with the now excellent automagic of repo2docker, so I would not recommend using this method for setting up a Binder either (although very useful for setting up a remote RStudio/RShiny instance!).
I ended up taking inspiration from the binder-examples/r_with_python and binder-examples/r repositories, which led to the final configuration described below.
Working configuration
The configuration that I found to work well for RStudio, Jupyter Lab and RShiny is demonstrated in this GitHub repository:
https://github.com/informatics-lab/binder_rstudio_jupyterlab_example
The infrastructural specifications are found in the .binder/ folder and consist of:
.binder/
├─ environment.yml # Define *additional* conda packages to install
├─ install.R # Define R packages to install (not used)
├─ postBuild # Where renv::restore() is executed
├─ renv.lock # Manifest of R packages to install using renv
├─ runtime.txt # R version (from the MRAN archive)
- environment.yml is used by conda to install additional packages for Python, including from PyPI using pip. Binder automatically includes a set of base Python packages such as jupyter, python, IPython, etc., so you do not need to define these yourself. It is possible to install R packages here too, but this is not recommended as the conda channels for R are not well maintained.
- install.R is the “standard” way of installing R packages for a Binder using install.packages() or devtools::install_version(), but I chose to use renv instead since it is more reproducible and allows you to explicitly specify package versions and sources in a single manifest file, renv.lock.
- postBuild is run after the Docker image has been built, and this is where renv::restore() is executed. There is discussion about integrating renv into the Binder build process but that hasn’t been implemented, so for now this is the easiest solution. It adds a little overhead to the Binder spin-up time, since the renv environment is not cached with the Docker image, but it is fast enough in my experience.
- renv.lock defines the list of R packages to install using renv::restore().
- runtime.txt is used to define the version of R (from MRAN) to install. It uses the format r-<version>-<year>-<month>-<day> as the only line of the file (I tried adding comments but it broke the system).
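Since a stray comment in runtime.txt was enough to break the build, it can help to check the file locally before pushing. The following is a minimal Python sketch of such a check, assuming only the r-<version>-<year>-<month>-<day> format described above. The function name parse_runtime and the example value are hypothetical, and the sketch does not check which R versions or snapshot dates Binder actually accepts.

```python
import re
from datetime import date

# Pattern for the format described in the post: r-<version>-<year>-<month>-<day>.
# Assumption: the version is two or three dot-separated numbers (e.g. 4.1 or 4.1.0).
RUNTIME_PATTERN = re.compile(
    r"^r-(?P<version>\d+\.\d+(?:\.\d+)?)"
    r"-(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})$"
)


def parse_runtime(text):
    """Return (r_version, snapshot_date) for the contents of a runtime.txt file.

    Raises ValueError if there is more than one non-empty line (for example
    an added comment), if the line does not match the expected format, or
    if the date is not a real calendar date.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 1:
        raise ValueError("runtime.txt should contain exactly one line")
    match = RUNTIME_PATTERN.match(lines[0])
    if match is None:
        raise ValueError(f"unexpected runtime.txt format: {lines[0]!r}")
    snapshot = date(int(match["year"]), int(match["month"]), int(match["day"]))
    return match["version"], snapshot


# Hypothetical example value, not taken from the example repository.
version, snapshot = parse_runtime("r-4.1-2021-09-01\n")
print(version, snapshot.isoformat())
```

A file with a comment line added fails the single-line check here, which mirrors the behaviour reported above.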
This relatively minimal configuration setup allows you to deploy a Binder with Jupyter Lab, Jupyter Notebook, RStudio and RShiny as the interfaces, all without having to define or edit a Dockerfile. This is all down to the magic of repo2docker and binderhub. Kudos to the JupyterHub team!*
*The ease of setting up such a system might raise the question of how secure Binder is. The Binder team have taken precautions, as discussed in their docs.
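To make the renv.lock manifest from the configuration above more concrete: it is a JSON file that records the R version and, for each package, its version and source. The Python sketch below summarises such a file. The helper name summarise_lockfile and the sample lockfile contents (including the package names and version numbers) are illustrative assumptions, not the lockfile from the example repository.

```python
import json


def summarise_lockfile(lock_text):
    """Return (r_version, packages) from the text of an renv.lock file.

    packages maps each package name to a (version, source) tuple.
    Missing fields are returned as None rather than raising.
    """
    lock = json.loads(lock_text)
    r_version = lock.get("R", {}).get("Version")
    packages = {
        name: (record.get("Version"), record.get("Source"))
        for name, record in sorted(lock.get("Packages", {}).items())
    }
    return r_version, packages


# Illustrative lockfile with made-up entries, for demonstration only.
SAMPLE_LOCK = """
{
  "R": {
    "Version": "4.1.0",
    "Repositories": [{"Name": "CRAN", "URL": "https://cran.r-project.org"}]
  },
  "Packages": {
    "ggplot2": {"Package": "ggplot2", "Version": "3.3.5",
                "Source": "Repository", "Repository": "CRAN"},
    "shiny": {"Package": "shiny", "Version": "1.7.1",
              "Source": "Repository", "Repository": "CRAN"}
  }
}
"""

r_version, packages = summarise_lockfile(SAMPLE_LOCK)
print(f"R {r_version}")
for name, (version, source) in packages.items():
    print(f"  {name} {version} ({source})")
```

Reading the manifest this way is only a convenience for reviewing what renv::restore() will be asked to install; renv itself remains the tool that writes and restores the file.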
Examples
There are four demonstration analyses included in the binder_rstudio_jupyterlab_example repository. All four plot example data for maize yields in three Chinese provinces, acquired from the National Bureau of Statistics of China, and ERA5 reanalysis weather data for those same provinces.
Jupyter(Lab) with Python
The orange JUPYTERLAB badge in the link above will open a Jupyter Lab session with the Jupyter Notebook analysis-4.ipynb open. This notebook plots data in the data/ directory as scatter plots with linear regression lines fitted using Python. You can execute the cells in the notebook, which are run by the live Python kernel in the background. You can add code, edit the notebook, create new notebooks and files, and add your own data to analyse (using the file browser in the left-hand pane).
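The notebook's own code is not reproduced in this post, so the following is only a stand-in for the kind of fit it describes: an ordinary least-squares line through (x, y) points, written in plain Python so it runs without extra packages. The function name fit_line and the numbers are placeholders chosen to make the arithmetic easy to follow; they are not the maize yield or ERA5 data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Placeholder points lying exactly on y = 2x + 1 (not real project data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

In a notebook the fitted line would then be drawn over the scatter plot with whichever plotting library the environment provides.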
Jupyter(Notebook) with R
The orange JUPYTERNOTEBOOK badge in the link above will open the Jupyter Notebook analysis-2.ipynb in a standalone Jupyter Notebook session. You can run the cells like a regular Jupyter Notebook, which plots data in the data/ directory as scatter plots with linear regression lines fitted, using an R kernel. You can add code and edit the notebook but you cannot easily add data to analyse. The JupyterLab example above has a better user interface for adding and navigating files. This use case is not specific to an R-based Jupyter Notebook; it is simply a demonstration of another interface.
RStudio
The blue RSTUDIO badge in the link above will launch an RStudio session, from which you can navigate the file browser to open the R script analysis-1.R and RMarkdown notebook analysis-3.Rmd. These can be run as if you were using RStudio Desktop. You can edit these scripts, create new files, and upload your own data for analysis.
RMarkdown notebook in RShiny
The blue RMARKDOWN badge in the link above will open the RMarkdown notebook analysis-3.Rmd in an interactive RShiny instance. It simply plots data in the data/ directory as scatter plots with linear regression lines fitted, and embeds a couple of RShiny widgets. Cells cannot be executed or edited like a Jupyter notebook, but the widgets can be interacted with like in an RShiny dashboard.
It took a lot of digging to find a working implementation of this configuration. The functionality is well established within R, but its implementation within the Binder project was unclear. These two pull requests (#799 & #891) in the repo2docker GitHub repo eventually led to a solution. Please note that in order for this setup to work the .Rmd file must be located in the base directory of the repo; subfolders are not supported for this functionality. The name of the file does not matter.
RShiny dashboards are also supported in MyBinder using a server.R and ui.R configuration. This setup is not demonstrated in this repo.
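The four badges described above are ordinary mybinder.org links that differ only in which interface and file they open. As a rough Python sketch of how such links can be assembled: the urlpath values and the HEAD ref below are my assumptions based on common Binder usage, not values copied from the example repository's badges, and the exact path for the classic notebook interface can depend on the Jupyter version in the image. The helper names binder_url and launch_urls are hypothetical.

```python
from urllib.parse import quote, urlencode

BINDER_BASE = "https://mybinder.org/v2/gh"


def binder_url(org, repo, ref="HEAD", urlpath=None):
    """Build a mybinder.org launch URL for a GitHub repository.

    ref is a branch, tag or commit; urlpath selects the interface and file.
    """
    url = f"{BINDER_BASE}/{org}/{repo}/{quote(ref, safe='')}"
    if urlpath:
        # Keep slashes readable in the query string.
        url += "?" + urlencode({"urlpath": urlpath}, safe="/")
    return url


def launch_urls(org="informatics-lab", repo="binder_rstudio_jupyterlab_example"):
    """Return one launch URL per interface discussed in the Examples section."""
    # Assumed urlpath values; check them against the repository's README badges.
    paths = {
        "jupyterlab": "lab/tree/analysis-4.ipynb",
        "notebook": "tree/analysis-2.ipynb",
        "rstudio": "rstudio",
        "rmarkdown": "shiny/analysis-3.Rmd",
    }
    return {name: binder_url(org, repo, urlpath=path) for name, path in paths.items()}


for name, url in launch_urls().items():
    print(f"{name}: {url}")
```

Note that the shiny path points at a file in the repository root, in line with the restriction on .Rmd location mentioned above.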
Future work
As a result of doing this project, I found a few things that could improve the system:
- Saving the final iteration of the Binder Docker image out to a container registry like DockerHub, where fully built images are saved and then downloaded and run for quick spin-up times (since the building stage has already been done). There does not seem to be an easy way to do this with repo2docker or binderhub, but it would be useful for an organisation like the Met Office to provide pre-compiled Docker images for the finalised software environments used to recreate analyses for specific papers/projects.
- Include renv in the R Binder stack. This is discussed in the following GitHub issue on the repo2docker repository.
- Support for RMarkdown notebooks could be better. At present the implementation seen in the example above requires that the .Rmd file be located in the base directory of the repository. Support for locating the files in a subdirectory would allow more flexibility for repository configurations.
- Using flexdashboard to arrange RShiny dashboards. It looks like a cool technology that builds on the ability of RMarkdown to be used with RShiny.
Conclusion
Having been aware of and used Binder for over four years now, it’s exciting to see it go from strength to strength. It’s an invaluable resource for creating demonstrations and reduces the barrier to reproducible research. While the configurations demonstrated in this blogpost took a bit of digging and trial and error, it was a surprisingly painless (and fun) week where very little broke completely or just didn’t work. Kudos to the community and Jupyter devs for working so openly that there are plenty of breadcrumbs to follow for an answer.
Given the continued ease of setting up a Binder instance, I encourage all scientists, project maintainers and research software engineers to have a go with Binder — it might be just the (free) resource you’ve been looking for.