Spark on cluster (Standalone/Yarn)

Shehan Fernando
5 min read · Jun 8, 2020


Photo by Jakub Skafiriak on Unsplash

Here I’m going to run a standalone Spark cluster and then use my existing YARN cluster to run a Spark application.

Feel free to check my previous posts on configuring the Hadoop cluster.

SPARK standalone cluster

In a standalone cluster, we can run Spark in either client or cluster deploy mode.

Basically, the deploy mode decides where the driver program runs. In client deploy mode, the driver runs in the same process as the client that submits the application, but in cluster deploy mode, the driver starts on one of the worker nodes in the cluster.

Google it, there are plenty of resources that explain how this works.

I’m using my Hadoop master node VM as the Spark master node and all the Hadoop data node VMs as Spark worker nodes.

Install Spark on our master node

Download the Spark distribution
wget https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop3.2.tgz
Extract it to any location. I extracted it to the /usr/local/spark/ folder.
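A minimal sketch of that extract step, assuming the tarball was downloaded to the current directory and that the extracted folder is renamed so Spark ends up directly under /usr/local/spark/:

tar -xzf spark-3.0.0-preview2-bin-hadoop3.2.tgz
sudo mv spark-3.0.0-preview2-bin-hadoop3.2 /usr/local/spark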

Set environment variables

nano .bash_profile
Then add these entries:
export SPARK_HOME=/usr/local/spark/
export HADOOP_CONF_DIR=/home/shehan/hadoop/etc/hadoop
export LD_LIBRARY_PATH=/home/shehan/hadoop/lib/native:$LD_LIBRARY_PATH
My .bash_profile
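The post later calls spark-shell and spark-submit without a full path, so presumably $SPARK_HOME/bin is on the PATH as well; a small addition along those lines (the PATH line is my assumption, not shown in the original) is to append this to .bash_profile too:

export PATH=$PATH:$SPARK_HOME/bin   # assumption: lets us call spark-shell/spark-submit directly

Then reload the profile and sanity-check:

source ~/.bash_profile
echo $SPARK_HOME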

Configure Spark

Create a copy of spark-defaults.conf.template
cp /usr/local/spark/conf/spark-defaults.conf.template /usr/local/spark/conf/spark-defaults.conf
Open spark-defaults.conf & add properties.
nano /usr/local/spark/conf/spark-defaults.conf
My spark-defaults.conf
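The file itself isn’t reproduced here, so as a rough sketch based only on the settings referenced elsewhere in this post (event logging for the history server, dynamic allocation, the external shuffle service), it might look something like this. The hdfs:/// URIs assume fs.defaultFS points at this cluster:

# optional default; the spark-submit examples below also pass --master explicitly
spark.master                     spark://master.hadoop.smf:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-events
spark.history.fs.logDirectory    hdfs:///spark-events
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true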

Note

Make sure you add these entries; they are required when Spark runs on the YARN cluster.
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
Now, copy the shuffle library spark-3.0.0-preview2-yarn-shuffle.jar (this jar lives in $SPARK_HOME/yarn/) into the hadoop/share/hadoop/yarn/ folder.
cp /usr/local/spark/yarn/spark-3.0.0-preview2-yarn-shuffle.jar hadoop/share/hadoop/yarn
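The YARN shuffle service runs inside each NodeManager, so this jar ultimately needs to be on the YARN classpath of every data node, not only the master. A sketch of pushing it out, assuming the same Hadoop layout on the worker hosts used later in this post:

for node in node01.hadoop.smf node02.hadoop.smf node03.hadoop.smf; do
  scp /usr/local/spark/yarn/spark-3.0.0-preview2-yarn-shuffle.jar $node:hadoop/share/hadoop/yarn/
done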

Configure spark-env.sh

Open spark-env.sh (create it from spark-env.sh.template if it doesn’t exist yet) & add these properties
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/
export HADOOP_CONF_DIR=/home/shehan/hadoop/etc/hadoop
export SPARK_MASTER_HOST=master.hadoop.smf
Create a folder in HDFS for Spark event logs
hadoop fs -mkdir /spark-events
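A quick check that the directory exists, plus an optional permission change if several OS users will submit jobs; the chmod is my addition, not something the original post requires:

hadoop fs -ls /
hadoop fs -chmod 777 /spark-events   # optional: avoid event-log write failures for other users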

Verify

Run spark-shell --version to check that Spark is installed correctly and to see the version.

Run Spark Master.

/usr/local/spark/sbin/start-master.sh

http://master.hadoop.smf:8080/
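A quick way to confirm the master process actually started, assuming jps from the JDK is available on the node:

jps   # should list a "Master" process for the Spark standalone master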

Add workers.

Create a copy of “slaves.template” as “slaves”, then add the node names
cp /usr/local/spark/conf/slaves.template /usr/local/spark/conf/slaves
My entries
node01.hadoop.smf
node02.hadoop.smf
node03.hadoop.smf

Finally, copy the Spark folder to all the worker nodes (copy it to the same location as on the master so no configuration changes are needed), and make sure .bash_profile is updated accordingly on each node.
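A sketch of that copy, assuming passwordless SSH to the workers and that the remote user can write to /usr/local (scp -r would work just as well as rsync):

for node in node01.hadoop.smf node02.hadoop.smf node03.hadoop.smf; do
  rsync -a /usr/local/spark/ $node:/usr/local/spark/
done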

Run Spark Slaves.

/usr/local/spark/sbin/start-slaves.sh

Everything seems okay, and all the workers are registered. Now it’s time to run some samples and see whether the cluster is working as expected.

Example (deploy-mode=cluster)

spark-submit --master spark://master.hadoop.smf:7077 --deploy-mode cluster --class org.apache.spark.examples.SparkPi /usr/local/spark/examples/jars/spark-examples_2.12-3.0.0-preview2.jar
Since we run in “cluster” deploy mode, spark-submit exits immediately and the driver program runs on one of our worker nodes.

We can see the results in the driver logs.

Example (deploy-mode=client)

spark-submit --master spark://master.hadoop.smf:7077 --deploy-mode client --class org.apache.spark.examples.SparkPi /usr/local/spark/examples/jars/spark-examples_2.12-3.0.0-preview2.jar
Instead of exiting immediately, the driver runs in the same process, and once the application finishes we can see the results on the command line.

While an application is running we can see its progress via the Spark UI at http://<nodename>:4040/jobs/. However, once the job finishes that UI disappears, so we can run the Spark history server to see the status and progress even after a job has completed. The history server runs on port 18080 (as defined in spark-defaults.conf).

/usr/local/spark/sbin/start-history-server.sh

http://master.hadoop.smf:18080/

SPARK On YARN

Spark supports two modes for running on a YARN cluster: “yarn-cluster” mode and “yarn-client” mode. They behave the same way as the deploy modes on the standalone Spark cluster. In “yarn-cluster” mode the driver runs on one of the data nodes, inside what YARN calls the “application master”. The application master is just another Java process, and its purpose is to start and drive the given application.

Open yarn-site.xml & add/update these properties:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle,mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
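These aux-service entries have to be present in yarn-site.xml on every NodeManager, so the updated file needs to reach the data nodes too; a sketch, assuming the same Hadoop path on each host:

for node in node01.hadoop.smf node02.hadoop.smf node03.hadoop.smf; do
  scp /home/shehan/hadoop/etc/hadoop/yarn-site.xml $node:hadoop/etc/hadoop/
done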

We don’t need the Spark master and its worker nodes running here, since YARN provides the cluster resources.

Stop the Spark cluster
/usr/local/spark/sbin/stop-slaves.sh
/usr/local/spark/sbin/stop-master.sh
Start the YARN daemons
start-yarn.sh

Example

Cluster mode
spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi /usr/local/spark/examples/jars/spark-examples_2.12-3.0.0-preview2.jar
Client mode
spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi /usr/local/spark/examples/jars/spark-examples_2.12-3.0.0-preview2.jar
The Spark job now runs on YARN: once the application is submitted, spark-submit prints the tracking URL, so we can check the results after the application has finished.
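In cluster mode the SparkPi output lands in the driver’s container log rather than on the submitting console. One way to pull it back, assuming YARN log aggregation is enabled and substituting the application ID that spark-submit printed (the ID below is a placeholder):

yarn logs -applicationId application_1591600000000_0001 | grep "Pi is roughly"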
