A Polyglot’s Dream: Reticulated Python & R

Published in Data at NewDay, Jan 15, 2019

Written by: James Hickey (Data Scientist @ NewDay)

A common inconvenience…

When starting a new data science project, the initial focus is on scoping the work and generating initial hypotheses to explore and test. One then moves on to the technicalities of executing the project and often faces a decision about which language and libraries to use. This decision frequently involves two of data science’s heavyweights… R and Python. It is made on several grounds, including the desired visualizations, preferred IDEs, and data ingestion/machine learning libraries. Once work on the project begins, it is rare to convert from one language to the other, so the assets of the unselected stack go unused. This is primarily due to the inconvenience of integrating or converting between the languages on the fly.

Here at NewDay, our data scientists are polyglots with varying strengths across many languages, including R and Python. Our aspirations are simple: always do things the best way (no shortcuts!) and break the 80–20 curse of data science. The problem described above proves an unexpected blocker to both of these goals. First, if a data science project is primarily done in one language, the strengths of the alternative languages and the experience of some of our team are not fully utilised. Second, if our initial hypotheses prove incorrect and we require functionality that is not available in our initial language choice, progress can slow significantly. More dangerously, certain hypotheses may not be tested as effectively as they could be if one chooses a language with an inferior set of libraries for that subset of tasks. The constraint of choosing a primary data science language thus results in additional re-working and conversion, disrupts standards for presentations and upwards communication, and reduces both the efficiency and quality of our deliveries.

Enter reticulate, an R package that provides a comprehensive set of tools for weaving Python directly into R. This article covers some of the basic uses of reticulate and aims to highlight some of the reasons we are using it more and more! We begin by providing an overview of reticulate and the basics of how it works before discussing two great benefits of using it: integrating Python with R Markdown and R Shiny!

What is reticulate?

Concisely, it is an R package that allows one to interoperate between R and Python with ease. It works by embedding a Python session within an R session, thus providing a seamless interface between the two. The package is installed from CRAN in the usual manner, by calling install.packages("reticulate").

Following installation, the Python interpreter of your choice is set using one of use_python, use_virtualenv or use_condaenv, or by setting the RETICULATE_PYTHON environment variable. See https://rstudio.github.io/reticulate/articles/versions.html for more details.
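One way to find the path to hand to use_python or RETICULATE_PYTHON is to ask the interpreter itself. The sketch below is an illustration, not something taken from the reticulate documentation, and the helper name interpreter_info is made up for this example. Run it with the Python installation you intend to use from R.

```python
import sys


def interpreter_info():
    """Return the path and version string of the interpreter running this code."""
    version = "{}.{}.{}".format(*sys.version_info[:3])
    return sys.executable, version


path, version = interpreter_info()
print(path)     # candidate value for use_python() or RETICULATE_PYTHON
print(version)  # major.minor.patch of this interpreter
```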

Once set up, the library allows for translation between R and Python objects and the calling of Python scripts/modules from R in numerous settings, including R Markdown. One notable benefit is that one may use Python within RStudio in the same way one would use R, leveraging the console for a combined Python + R REPL. This is particularly useful during the EDA phase of a project, where one can leverage the analysis packages of both languages. Objects created in the Python session are accessible as attributes of the py object from the R session, while R session objects are accessible as attributes of the r object in the Python session.

**Figure 1: A simple example showing how to create an object in the Python REPL and plot using ggplot2 in R.**

Reticulate facilitates integration beyond the REPL, allowing R to source Python scripts/modules using a series of helper functions including import, source_python, py_run_file etc. Additionally, Python objects can be created directly in the R session and can be converted to R objects manually using the py_to_r function when desired.
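As a sketch of the kind of script that source_python could pull into an R session, consider the small module below. The file name summary_stats.py and the function column_means are hypothetical, chosen purely for illustration. The code itself is plain Python and runs without reticulate.

```python
# summary_stats.py (hypothetical file name, for illustration only)


def column_means(table):
    """Return the mean of each column in a dict mapping names to numeric lists."""
    means = {}
    for name, values in table.items():
        if not values:
            raise ValueError(f"column {name!r} is empty")
        means[name] = sum(values) / len(values)
    return means


example = {"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0]}
print(column_means(example))  # {'x': 2.0, 'y': 15.0}
```

After sourcing such a file with source_python, the function should be callable from R like any other R function, with reticulate handling the conversion of arguments and return values.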

R Markdown with Python

One of the key strengths of R is its integration with Markdown for the presentation of results. It is simple to convert an R script to a presentation using R Markdown and thus spend less time on formatting and more on optimising your results. Reticulate provides a Python engine for Markdown with no set-up required if you’re using knitr >= 1.18. If you’re using an older version of knitr, it is necessary to enable the reticulate Python engine by running knitr::knit_engines$set(python = reticulate::eng_python) in a setup chunk.

Once set up, Python chunks behave in a very similar manner to the more standard R chunks, with R and Python objects available to both engines through the attributes of the r and py objects respectively.

**Figure 2: The output from R Markdown using both Python and R to perform kmeans clustering on the Iris dataset.**

A “Prython” Server

Finally, whether it be a fully operationalized machine learning pipeline or simply insights as a service, at the end of every project we offer dashboards so that business end-users can monitor the performance of the data-science-driven system or business process and leverage the outputs of the project. Due to its ease of build and deployment, a dockerized R Shiny server with apps became the go-to for nearly all our dashboard requirements. The one snag was that these apps frequently required interaction with in-house Python packages and modules, resulting in additional Python scripts being written for execution in the app. This process is streamlined by using reticulate within our Docker container. In essence, we now employ a “prython” server: a Docker image consisting of R Shiny server, base R with reticulate and the tidyverse stack, and Miniconda.

To utilize our in-house Python packages, we simply mount the appropriate volumes and then either source the desired modules using source_python or import the packages directly. This approach reduces the need to produce additional scripts for execution in an R Shiny app and allows us to focus on disseminating data science in the most effective manner to the wider business at NewDay.

Some Gotchas

Unfortunately, as much as we love hopping between R and Python, there are still some fundamental differences that even reticulate cannot smooth over and that must be borne in mind.

· Python arrays are indexed from 0 while R array indices start at 1. Although slightly annoying, this isn’t the biggest thing in the world to remember; it just means you have to keep track of which arrays belong to which language during analysis.

· R arrays are column-oriented (column-major) by default while Python arrays can be row-oriented (the default for numpy) or column-oriented. This can cause some confusion, particularly as the display of numpy and R arrays is very different and can lead one to assume that two arrays are identical when in fact they’re not. For example, an array created in numpy is laid out row by row by default, whereas R fills arrays column by column, so flattening or reshaping the same values in each language can give results that look transposed relative to one another.
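Both gotchas can be illustrated with a few lines of plain Python. The sketch below uses neither reticulate nor numpy, and the helper names are invented for the example. It shows the off-by-one shift between R-style and Python-style indices, and how the same 2 x 3 matrix flattens differently in row-major order (numpy's default) and column-major order (R's layout).

```python
def r_index_to_python(i):
    """Convert a 1-based (R-style) index to a 0-based (Python-style) index."""
    if i < 1:
        raise ValueError("R indices start at 1")
    return i - 1


def flatten_row_major(matrix):
    """Flatten a list-of-rows matrix row by row (C order, numpy's default)."""
    return [value for row in matrix for value in row]


def flatten_column_major(matrix):
    """Flatten a list-of-rows matrix column by column (Fortran order, as in R)."""
    n_cols = len(matrix[0])
    return [row[j] for j in range(n_cols) for row in matrix]


values = ["a", "b", "c"]
print(values[r_index_to_python(1)])   # a  (the first element in R terms)

matrix = [[1, 2, 3],
          [4, 5, 6]]
print(flatten_row_major(matrix))      # [1, 2, 3, 4, 5, 6]
print(flatten_column_major(matrix))   # [1, 4, 2, 5, 3, 6]
```

The two flattened lists contain the same values in a different order, which is why reshaping the "same" data in each language can quietly produce different arrays.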