Setting up Isolated Virtual Environments in SparkR

Shubham Raizada
Dec 5, 2020 · 4 min read

Motivation

With the increasing adoption of Spark for scaling ML pipelines, being able to install and deploy our own R libraries becomes especially important if we want to use UDFs.
In my previous post, I talked about scaling our ML pipelines in R with the use of SparkR UDFs.
Today I am going to discuss setting up a virtual environment for our SparkR runs, ensuring that the runtime dependencies and libraries are available on the cluster.

Constraints

For any Spark cluster, we can either install R and the required libraries on all the nodes of the cluster in a one-size-fits-all fashion, or create virtual environments as required.
In my case, we have a Cloudbreak cluster with [non-sudo] access only to the edge node for submitting Spark jobs; all the other cluster nodes are inaccessible.
Due to these constraints, I cannot install R or any of its dependencies on either the edge node or the rest of the cluster.

Generating the environment

Since we were already running our ML algorithms in R, we had a Docker image with R and all the ML libraries installed on it. I created a new image with Spark (v2.3.0, the same version as the Cloudbreak cluster) installed on top of it.

Successfully executing the SparkR implementation of the ML algorithms [with a smaller dataset] on this container ensures that I can use this R installation directory to set up the virtual environment on the Cloudbreak cluster.
Since we cannot install R directly on the Cloudbreak cluster due to permission constraints, I intended to ship the R installation directory from the container to the edge node.

install_spark.sh: Shell script for installing Spark.

yum -y install wget
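
The post shows only the first line of install_spark.sh. A plausible completion, assuming the Spark 2.3.0 binary distribution built for Hadoop 2.7 and the /usr/local/spark location referenced in the Dockerfile below, could look like this:

#!/bin/bash
set -e

# Download and unpack the Spark 2.3.0 binary distribution
# (URL and build are assumptions; the full original script is not shown)
yum -y install wget
wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
tar -xzf spark-2.3.0-bin-hadoop2.7.tgz -C /usr/local
ln -s /usr/local/spark-2.3.0-bin-hadoop2.7 /usr/local/spark

# Make Spark available on the PATH inside the image
echo 'export SPARK_HOME=/usr/local/spark' >> /etc/profile.d/spark.sh
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> /etc/profile.d/spark.sh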

Dockerfile: creates a new image with Spark installed, based on the existing image with R and the ML libraries installed.

FROM <image_with_R_and_ML_libs_installed>:latest
COPY install_spark.sh ./
RUN bash install_spark.sh
ENV SPARK_EXAMPLES_JAR="/usr/local/spark/examples/jars/spark-examples_2.11-2.3.0.jar"

Bootstrapping the environment

SparkR running in Local Mode

I created a folder sparkr_packages in the edge node's home directory and copied the R installation directory and the installed packages from the container into it.
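
The copy commands themselves are not shown in the post; assuming a running container named r_spark and R installed under /usr/lib64/R inside it (both placeholders), something like the following would do it:

# Pull the R installation out of the container (container name and R path are placeholders)
docker cp r_spark:/usr/lib64/R ./R

# Ship it to the sparkr_packages folder on the edge node
scp -r ./R <user>@<edge_node>:~/sparkr_packages/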

We also need to set the required environment variables on the edge node:

export PATH=$HOME/sparkr_packages/R/bin:$PATH
export R_LIBS=$HOME/sparkr_packages/R/library
export RHOME=$HOME/sparkr_packages/R
export R_HOME=$HOME/sparkr_packages/R

Installing R requires certain compile-time dependencies that are not needed after the installation. Since we have already installed and validated R on the container, we will not need these dependencies on the edge node.

We will still need the runtime dependencies that are required during Rscript execution. Without these libs present, starting up the R console fails with an error like this:

$HOME/sparkr_packages/R/bin/exec/R: error while loading shared libraries: libtre.so.5: cannot open shared object file: No such file or directory

In my case, I needed libtre.so.5 and libpcre2-8.so.0 on the edge node.
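
One quick way to discover which runtime libraries are missing (not shown in the original post) is to run ldd against the copied R binary on the edge node:

# Shared libraries the R executable needs but the edge node cannot resolve
ldd $HOME/sparkr_packages/R/bin/exec/R | grep "not found"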

These libs are also present in the container at /usr/lib64/. Just like the R installation directory, I copied them to the edge node under sparkr_packages.
We need to set LD_LIBRARY_PATH to point to this location so that the R runtime can access these libs. Alternatively, we can add these libs to R/libs to make them available to the R runtime.

export LD_LIBRARY_PATH=$HOME/sparkr_packages:$LD_LIBRARY_PATH

We can now start the SparkR console in local mode and run the UDF to validate the installation on the edge node.
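
The original post validates with the ML UDF from the previous article; a minimal stand-in, assuming spark-submit is already on the edge node's PATH, could look like this:

# validate_udf.R: a trivial SparkR UDF run in local mode (a sketch, not the actual ML UDF)
cat > validate_udf.R <<'EOF'
library(SparkR)
sparkR.session(master = "local[2]")
df <- createDataFrame(mtcars)
# dapplyCollect runs the R closure inside R worker processes,
# so it exercises the shipped R installation end to end
print(head(dapplyCollect(df, function(pdf) { pdf$mpg_x2 <- pdf$mpg * 2; pdf })))
sparkR.session.stop()
EOF

spark-submit validate_udf.R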

SparkR running in Cluster Mode

To use UDFs with SparkR running in cluster mode, the R installation directory and the runtime dependencies must be present on all the executors. We also need to set the corresponding environment variables on each of the executors.

We can use the spark-submit runtime param --archives to ship the zipped sparkr_packages directory to all the executors.

--archives: takes a comma-separated list of archives to be extracted into the working directory of each executor.

We can set environment variables such as R_HOME, LD_LIBRARY_PATH, and PATH for each executor by passing the config spark.executorEnv.<property_name>=<property_value> during spark-submit.
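
The archive passed to --archives is simply a zip of the sparkr_packages folder. Assuming it is created from the home directory, so that the top-level sparkr_packages/ path is preserved inside the archive:

# After extraction on each executor, the contents appear under
# ./environment/sparkr_packages/... (the #environment alias used below)
cd $HOME && zip -r sparkr_packages.zip sparkr_packages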

Finally, start the SparkR session:

sparkR --master yarn \
  --conf spark.executorEnv.RHOME=./environment/sparkr_packages/R \
  --conf spark.executorEnv.R_HOME_DIR=./environment/sparkr_packages/R \
  --conf spark.executorEnv.PATH=./environment/sparkr_packages/R/bin:$PATH \
  --conf spark.executorEnv.LD_LIBRARY_PATH=./environment/sparkr_packages:$LD_LIBRARY_PATH \
  --num-executors 10 --executor-cores 3 --executor-memory 10g \
  --archives sparkr_packages.zip#environment

Conclusion

Setting up a virtual environment like this is a bit cumbersome, as we have to manually maintain the R executables and modules.
Still, this approach served us quite well and allowed us to set up a virtual environment without access to the cluster's nodes.
