RStudio and Databricks Better Together

All you need to know for running RStudio and Databricks side-by-side

Patrick Pichler
Creative Data
7 min read · Feb 23, 2022


Photo by Wil Stewart on Unsplash

Introduction

R and Python are still the most popular programming languages for advanced analytics, even though newcomers such as Julia are on the rise. Among data scientists, R together with the RStudio IDE has been by far the most popular tool-set for statistical computing and graphics over the last few years. It offers a very easy way to get started and is therefore taught at most universities, which results in a vast talent pool of RStudio experts.

Although this combination is extremely powerful by itself, it is limited by the resources of a single machine, which often makes it impractical for big data, machine learning and, in general, running advanced analytics at scale. To address this scalability issue, Spark provides a convenient way to distribute R computations across a cluster of machines working together. In this article, we will see how this works with Databricks, the leading fully managed platform built on open-source Apache Spark. This combination not only allows extensive scalability, but also provides persistent storage including off-memory data structures. RStudio and Databricks also partnered in 2018 to bring together the best of both worlds: Spark's computing power with the feature-rich and familiar RStudio IDE.

Deployment

One outcome of this partnership was the ability to install RStudio directly on a Databricks cluster. In this setup, the R runtime and the required packages are installed on every node in the cluster, while RStudio Server runs only on the driver node, which also serves the web UI.

Yet, despite this local installation option, it is often preferable, and common practice, to deploy RStudio in a separate environment, as this avoids resource contention and also allows you to connect to any other remote storage or compute resources if required¹. Following this path and working remotely with the Databricks cluster, you can either use a client (Python) library called Databricks Connect or go through a JDBC/ODBC connection using the Databricks Spark SQL driver. However, only the first option really gets the best out of this combination. The latter is similar to how you would interact with any other external database: it requires you to transfer data before working with it, which isn't always necessary with Databricks Connect, as we will see further down.
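To make the contrast concrete, the ODBC route might look roughly like the sketch below in R, using the DBI and odbc packages. The driver name, host, HTTP path, token and table name are placeholders, and the exact connection attributes depend on the installed Databricks/Simba ODBC driver.

# Sketch of the JDBC/ODBC route: every query result is pulled into
# local R memory before you can work with it.
library(DBI)
library(odbc)

con <- dbConnect(
  odbc::odbc(),
  Driver          = "Simba Spark ODBC Driver",   # assumes the Databricks ODBC driver is installed
  Host            = "<workspace-host>.cloud.databricks.com",
  Port            = 443,
  HTTPPath        = "<cluster-http-path>",
  SSL             = 1,
  ThriftTransport = 2,
  AuthMech        = 3,
  UID             = "token",
  PWD             = "<personal-access-token>"
)

# the full result set is transferred to the local machine before R can touch it
trips <- dbGetQuery(con, "SELECT * FROM <schema>.<table> LIMIT 1000")
dbDisconnect(con)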

Of course, installing RStudio locally on the Databricks cluster also has its advantages and is a perfectly viable option. Firstly, it doesn't require you to provision additional infrastructure, and secondly, no network latency occurs when transferring data between R and Databricks, as the JVM processes run on the same machine. So, if your Databricks cluster is constantly running and you don't have much going on besides the interactive workloads, this option might even be more convenient.

RStudio Databricks Connect (Edited by Author)

Sparklyr vs. SparkR

The second question you will face when evaluating RStudio together with Databricks is whether to use Sparklyr or SparkR to interact with the computing cluster. While they may differ slightly in available functionality, it is mostly the syntax and usage structure that make the difference, as in most cases both are capable of doing the job.

Both can be used within RStudio and basically share the same concept of providing data through a Spark DataFrame stored on the Databricks side. Likewise, they both benefit from Apache Arrow processing optimizations when translating Spark DataFrames into local R DataFrames, which allows you to take full advantage of your "local" R environment. Furthermore, both allow you to run any other R functionality that is not natively available on Databricks at scale. However, these functions should be used with caution since they introduce additional overhead in terms of complexity and performance³. For instance, in the case of spark_apply() with Sparklyr, each involved Spark worker node launches an R process, which needs to transfer data back and forth between Spark and R to process the submitted R function. At the time of this writing, you also need to install the arrow R package manually on the Databricks cluster to make use of it with spark_apply(), otherwise you will receive an error message.
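A minimal sketch of what this looks like with Sparklyr is shown below. It assumes an existing connection object sc to the Databricks cluster and the arrow package installed on both sides; the derived column is purely illustrative.

# Sketch: distributing an arbitrary R function across the Spark workers.
# Each worker launches an R process, so keep the applied function lightweight.
library(sparklyr)

# assumes `sc` is an existing sparklyr connection to the Databricks cluster
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# apply a plain R function to every partition; arrow speeds up the
# R <-> Spark (de)serialization if the arrow package is available on the cluster
scaled <- spark_apply(mtcars_tbl, function(df) {
  df$hp_per_cyl <- df$hp / df$cyl   # illustrative derived column
  df
})

head(scaled)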

With regard to package management and installation in general, Sparklyr itself is available through CRAN like any other R library, while SparkR is part of the Spark installation itself. In contrast to Sparklyr, SparkR is basically a tool for "natively" running R on Spark and as such is similar to PySpark, the Python API for Spark. Choosing between these two interfaces therefore greatly depends on the use case and the available expertise. However, since this article assumes plenty of RStudio experts are available, Sparklyr will most likely be more familiar to them. It gives you the capability to interact with Spark using well-known R interfaces such as dplyr, broom and DBI, while at the same time giving you access to Spark's distributed machine learning libraries, streaming API and much more. The community is also very active and steadily growing.
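As an illustration of that familiar interface, the sketch below chains ordinary dplyr verbs on a Spark DataFrame. It assumes an existing connection sc and a hypothetical table called flights registered on the cluster; the verbs are translated to Spark SQL and executed remotely on Databricks.

# Sketch: dplyr verbs are translated to Spark SQL and run on the cluster;
# nothing is pulled into R until collect() is called.
library(sparklyr)
library(dplyr)

flights_tbl <- tbl(sc, "flights")   # hypothetical table on the Databricks side

delay_summary <- flights_tbl %>%
  group_by(carrier) %>%
  summarise(
    n_flights = n(),
    avg_delay = mean(dep_delay, na.rm = TRUE)
  ) %>%
  arrange(desc(avg_delay))

show_query(delay_summary)    # inspect the generated Spark SQL
delay_summary %>% collect()  # bring only the aggregated result into R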

Awesome Sparklyr

Databricks Connect and Sparklyr

The installation and usage of Databricks Connect and Sparklyr are well documented. Nevertheless, it is often necessary to have some knowledge of what is happening behind the scenes (at least at a high level) to ease troubleshooting and fine-tune configuration settings.

A connection between RStudio using Sparklyr and Databricks is established through a locally available Spark instance, which comes as part of the Databricks Connect installation. Using Sparklyr in this local connection mode, Spark starts a single process with a default configuration that runs everything inside the driver application and is triggered via spark-submit in the background. This way, the local Spark instance can talk to the Databricks cluster by sending Remote Procedure Calls (RPCs) via a secure TCP tunnel over HTTPS using the Netty framework. On Linux, for instance, you should be able to find this running Spark application via the bash command ps -ef | grep spark.
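Assuming Databricks Connect is installed and configured (so that databricks-connect get-spark-home resolves to its bundled Spark distribution), the connection from RStudio might be established roughly as follows:

# Sketch: connecting Sparklyr through the local Spark instance that ships
# with Databricks Connect. Assumes `databricks-connect configure` has
# already been run for the target cluster.
library(sparklyr)

sc <- spark_connect(
  method     = "databricks",
  spark_home = system("databricks-connect get-spark-home", intern = TRUE)
)

# from here on, commands are translated and sent to the remote cluster via RPC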

Spark Local Mode & RPC (Edited by Author)

All the translated commands sent via RPCs through this tunnel run inside Databricks, so no data transfer happens unless it is explicitly requested. What you get back is only a variable object pointing to the location where the Spark session loaded the DataFrame, which allows you to run any Sparklyr function on it.
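A small sketch of that behaviour, assuming an existing connection sc and a hypothetical table called trips:

library(sparklyr)
library(dplyr)

trips_ref <- tbl(sc, "trips")   # hypothetical table living on the Databricks cluster
class(trips_ref)                # "tbl_spark" - a reference to remote data, not the data itself

# transformations stay lazy and remote until results are explicitly requested
long_trips <- trips_ref %>% filter(trip_distance > 10)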

Only when you trigger the collect() function does Spark actually return the data/result through the local Spark instance into an in-memory R DataFrame. For larger objects, collecting results can take some time and requires quite a lot of resources on the local driver node. Therefore, you usually need to adjust the default Spark configuration settings (see the sketch below) to your needs before establishing the connection with Databricks, otherwise OOM errors are almost guaranteed. Altogether, this means the most efficient way of working is to run heavy computations involving huge amounts of data on the Databricks side and to bring only results into your R environment for further visualization and documentation.
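A minimal sketch of such a configuration tweak is shown below; the memory values are purely illustrative assumptions and need to be tuned to the machine running RStudio and to the size of the results being collected.

# Sketch: raising the local driver resources before connecting, so that
# collect() on larger results does not immediately run out of memory.
library(sparklyr)

conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "8g"   # memory for the local driver JVM (illustrative)
conf$`spark.driver.maxResultSize`   <- "4g"   # cap on the size of collected results (illustrative)

sc <- spark_connect(method = "databricks", config = conf)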

Conclusion

The overall experience of using RStudio as an IDE for Databricks is very smooth. Thanks to Apache Arrow, the data transfer overhead, including the time spent on serializing/de-serializing, can also be reduced significantly.

Using them in combination is especially interesting when it comes to making data warehouse processes work together with advanced analytics processes, a topic that has gained increasing importance over the last couple of years. I believe using RStudio together with Databricks is a very good choice to tackle it, not only to close the gap from a technology perspective but also to bring the different kinds of personas within organizations, Data Scientists, Engineers and Analysts, closer together.

If you want to learn more about how to use Apache Spark with R, I can recommend reading "Mastering Spark with R" by Javier Luraschi, Kevin Kuo and Edgar Ruiz.

Resources

[1] Sparklyr. 2022. Using sparklyr with Databricks. [ONLINE] Available at: https://spark.rstudio.com/deployment/databricks-cluster.html. [Accessed 16 February 2022].

[2] Databricks. 2022. Migration Guide: Hadoop to Databricks. [ONLINE] Available at: https://databricks.com/wp-content/uploads/2021/08/Migrating-Hadoop-to-Databricks-Ebook-FINAL.pdf. [Accessed 19 February 2022].

[3] therinspark. 2022. Chapter 11 Distributed R. [ONLINE] Available at: https://therinspark.com/distributed.html. [Accessed 22 February 2022].
