Installing Spark w/ SparkR on Cloudera/CDH
Below are instructions for installing Spark 1.6.x on a Cloudera CDH cluster so that you can experiment with SparkR. This is not currently a supported component. I would not recommend installing this on nodes that already have supported Spark installed on them — instead, install it on a gateway node that has the HDFS and YARN gateway roles.
1). Make sure R is installed on the gateway node — yum install R
2). Download Spark 1.6.x on the gateway node (http://spark.apache.org/downloads.html — it’s the “Pre-built with user-provided Hadoop” choice)
wget http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1-bin-without-hadoop.tgz
3). Extract it: tar -zxvf spark-1.6.1-bin-without-hadoop.tgz
4). Make sure you have the JAVA_HOME env variable set: echo $JAVA_HOME. Mine is set to "/usr/java/jdk1.7.0_67-cloudera"
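If you want to guard against a missing JAVA_HOME before going further, a quick check like this works (the CDH JDK path is just the example from this walkthrough — substitute your own if it differs):

```shell
# Fall back to the example CDH JDK path only if JAVA_HOME isn't already set.
if [ -z "$JAVA_HOME" ]; then
  export JAVA_HOME="/usr/java/jdk1.7.0_67-cloudera"
fi
echo "Using JAVA_HOME=$JAVA_HOME"
```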
5). Set SPARK_DIST_CLASSPATH to your hadoop conf dir (should exist on gateway node with HDFS and YARN gateway roles installed):
For me, it’s set like this: export SPARK_DIST_CLASSPATH=$(hadoop --config /etc/hadoop/conf/ classpath)
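Since this step silently produces an empty classpath when the hadoop CLI isn't available, here is a guarded version of the same export that fails soft on machines that aren't gateway nodes:

```shell
# Populate SPARK_DIST_CLASSPATH from the Hadoop client config, but only if
# the hadoop CLI actually exists on this machine.
if command -v hadoop >/dev/null 2>&1; then
  SPARK_DIST_CLASSPATH=$(hadoop --config /etc/hadoop/conf/ classpath)
else
  SPARK_DIST_CLASSPATH=""
  echo "hadoop CLI not found; run this on a gateway node" >&2
fi
export SPARK_DIST_CLASSPATH

# One entry per line is easier to eyeball than one long string:
echo "$SPARK_DIST_CLASSPATH" | tr ':' '\n'
```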
6). Launch the SparkR shell to get started: ./spark-1.6.1-bin-without-hadoop/bin/sparkR
If you want to launch it using YARN, you’ll have to set the following:
export SPARK_HOME="/home/cloudera/spark-1.6.1-bin-without-hadoop" (this is the location of the Spark you extracted above)
export HADOOP_CONF_DIR="/etc/hadoop/conf"
You would then launch sparkR like this: ./spark-1.6.1-bin-without-hadoop/bin/sparkR --master yarn
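Putting the YARN launch together in one runnable snippet (the paths are the ones used earlier in this walkthrough — adjust SPARK_HOME to wherever you extracted Spark):

```shell
# Environment for launching SparkR against YARN.
export SPARK_HOME="/home/cloudera/spark-1.6.1-bin-without-hadoop"
export HADOOP_CONF_DIR="/etc/hadoop/conf"

# Launch only if the sparkR script exists at that path, so a wrong
# SPARK_HOME fails with a clear message instead of a shell error.
if [ -x "$SPARK_HOME/bin/sparkR" ]; then
  "$SPARK_HOME/bin/sparkR" --master yarn
else
  echo "sparkR not found under $SPARK_HOME; check the path" >&2
fi
```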
Keep in mind that while this works on a single-node cluster (i.e., the quickstart VM), running on YARN might not work unless you follow all of the above steps on every node. Right now, I wouldn’t necessarily recommend doing this on nodes with CDH-supported versions of Spark running on them.
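Repeating the steps on every node can be scripted. A rough dry-run sketch (the hostnames below are placeholders, and it assumes passwordless SSH plus sudo on each node):

```shell
# Placeholder hostnames -- replace with your actual worker nodes.
NODES="node1.example.com node2.example.com"
TARBALL="spark-1.6.1-bin-without-hadoop.tgz"

for node in $NODES; do
  # Dry run: print each command instead of executing it. Drop the echos
  # once you've confirmed the plan looks right for your cluster.
  echo "scp $TARBALL $node:~/"
  echo "ssh $node 'sudo yum install -y R && tar -zxvf ~/$TARBALL'"
done
```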