Installing Spark w/ SparkR on Cloudera/CDH
Below are instructions for installing Spark 1.6.x on a Cloudera CDH cluster so that you can experiment with SparkR. This is not currently a supported component. I would not recommend installing this on nodes that already have supported Spark installed on them — instead, install it on a gateway node that has the HDFS and YARN gateway roles.
1). Make sure R is installed on the gateway node — yum install R
2). Download Spark 1.6.x on the gateway node (http://spark.apache.org/downloads.html — it’s the “Pre-built with user-provided Hadoop” choice)
wget http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1-bin-without-hadoop.tgz
3). Extract it: tar -zxvf spark-1.6.1-bin-without-hadoop.tgz
4). Make sure you have the JAVA_HOME env variable set: echo $JAVA_HOME. Mine is set to "/usr/java/jdk1.7.0_67-cloudera"
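If you want to guard against a missing JAVA_HOME before going further, a quick check like this works (the CDH JDK path is just the example from this walkthrough — substitute your own if it differs):

```shell
# Fall back to the example CDH JDK path only if JAVA_HOME isn't already set.
if [ -z "$JAVA_HOME" ]; then
  export JAVA_HOME="/usr/java/jdk1.7.0_67-cloudera"
fi
echo "Using JAVA_HOME=$JAVA_HOME"
```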
5). Set SPARK_DIST_CLASSPATH to your hadoop conf dir (should exist on gateway node with HDFS and YARN gateway roles installed):
For me, it’s set like this: export SPARK_DIST_CLASSPATH=$(hadoop --config /etc/hadoop/conf/ classpath)
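Since this step silently produces an empty classpath when the hadoop CLI isn't available, here is a guarded version of the same export that fails soft on machines that aren't gateway nodes:

```shell
# Populate SPARK_DIST_CLASSPATH from the Hadoop client config, but only if
# the hadoop CLI actually exists on this machine.
if command -v hadoop >/dev/null 2>&1; then
  SPARK_DIST_CLASSPATH=$(hadoop --config /etc/hadoop/conf/ classpath)
else
  SPARK_DIST_CLASSPATH=""
  echo "hadoop CLI not found; run this on a gateway node" >&2
fi
export SPARK_DIST_CLASSPATH

# One entry per line is easier to eyeball than one long string:
echo "$SPARK_DIST_CLASSPATH" | tr ':' '\n'
```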
6). Launch the SparkR shell to get started: ./spark-1.6.1-bin-without-hadoop/bin/sparkR
If you want to launch it using YARN, you’ll have to set the following:
export SPARK_HOME="/home/cloudera/spark-1.6.1-bin-without-hadoop" (this is the location of the Spark you extracted above)
export HADOOP_CONF_DIR="/etc/hadoop/conf"
You would then launch sparkR like this: ./spark-1.6.1-bin-without-hadoop/bin/sparkR --master yarn
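Putting the YARN launch together in one runnable snippet (the paths are the ones used earlier in this walkthrough — adjust SPARK_HOME to wherever you extracted Spark):

```shell
# Environment for launching SparkR against YARN.
export SPARK_HOME="/home/cloudera/spark-1.6.1-bin-without-hadoop"
export HADOOP_CONF_DIR="/etc/hadoop/conf"

# Launch only if the sparkR script exists at that path, so a wrong
# SPARK_HOME fails with a clear message instead of a shell error.
if [ -x "$SPARK_HOME/bin/sparkR" ]; then
  "$SPARK_HOME/bin/sparkR" --master yarn
else
  echo "sparkR not found under $SPARK_HOME; check the path" >&2
fi
```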
Keep in mind that while this works on a single-node cluster (i.e., the quickstart VM), running on YARN might not work unless you follow all of the above steps on every node. Right now, I wouldn’t necessarily recommend doing this on nodes with CDH-supported versions of Spark running on them.
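Repeating the steps on every node can be scripted. A rough dry-run sketch (the hostnames below are placeholders, and it assumes passwordless SSH plus sudo on each node):

```shell
# Placeholder hostnames -- replace with your actual worker nodes.
NODES="node1.example.com node2.example.com"
TARBALL="spark-1.6.1-bin-without-hadoop.tgz"

for node in $NODES; do
  # Dry run: print each command instead of executing it. Drop the echos
  # once you've confirmed the plan looks right for your cluster.
  echo "scp $TARBALL $node:~/"
  echo "ssh $node 'sudo yum install -y R && tar -zxvf ~/$TARBALL'"
done
```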