Multi Node Spark Setup on Hadoop with YARN

Harshil Raval
Whatfix Engineering Blog
5 min read · Jul 26, 2021

Date : 20th July 2021


Introduction :

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning. [Ref : https://spark.apache.org/faq.html ]

In this article, we will discuss how to set up a Spark cluster on top of an existing Hadoop cluster, using Hadoop’s YARN as the resource manager for Spark.

Prerequisites :

  • 3 Node Hadoop cluster (we are using HADOOP VERSION 3.2.1)
  • Identify the name node and data nodes (we will use the following for this article):
Namenode : 192.168.0.1
Datanode1 : 192.168.0.2
Datanode2 : 192.168.0.3
  • ssh into the 3 machines in separate terminals to check access :
$ ssh user@192.168.0.1
$ ssh user@192.168.0.2
$ ssh user@192.168.0.3
  • Check if Hadoop daemon processes are running :
$ jps
Namenode :
19042 Jps
17669 NameNode
17910 SecondaryNameNode
18199 ResourceManager
18623 JobHistoryServer
Datanode1 :
18518 DataNode
24702 Jps
18670 NodeManager
Datanode2 :
17521 DataNode
17673 NodeManager
15628 Jps
  • Check list of yarn nodes (Execute only on Namenode):
$ yarn node -list
  • Make sure Java is installed on all of them (Java is a prerequisite for Hadoop) :
$ java -version
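The per-node checks above can be combined into one pass. This is a sketch, assuming the three node IPs from this article and key-based ssh access for "user"; it needs a live cluster to run.

```shell
# Run jps on every cluster node to confirm the Hadoop daemons are up.
# Assumes passwordless ssh for "user" to the IPs listed above.
for host in 192.168.0.1 192.168.0.2 192.168.0.3; do
  echo "== $host =="
  ssh "user@$host" jps
done
```

On the namenode you should see NameNode, SecondaryNameNode, ResourceManager and JobHistoryServer in the output; on each datanode, DataNode and NodeManager.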

Spark Setup Steps :

[NOTE : Execute all commands on the namenode (master node) unless specified to execute on a datanode (worker node).]

  1. Download Spark (Spark 3.1.1, pre-built for Hadoop 3.2) : https://spark.apache.org/downloads.html
$ cd /opt/hadoop
$ wget https://mirrors.estointernet.in/apache/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
$ tar -xvf spark-3.1.1-bin-hadoop3.2.tgz
$ mv spark-3.1.1-bin-hadoop3.2 spark

[We are extracting Spark into the /opt/hadoop directory.]

2. Set up environment variables :

Put the following in the /home/<user>/.bash_profile file (or in the file that contains the environment setup for Hadoop):

export SPARK_HOME=/opt/hadoop/spark
export PATH=$PATH:$SPARK_HOME/bin
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export LD_LIBRARY_PATH=/opt/hadoop/lib/native:$LD_LIBRARY_PATH

Explanation of the above statements :

a> Set SPARK_HOME
$ export SPARK_HOME=/opt/hadoop/spark
b> Add the Spark binaries directory to your PATH
$ export PATH=$PATH:$SPARK_HOME/bin
c> Integrate Spark with YARN
$ export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
$ export LD_LIBRARY_PATH=/opt/hadoop/lib/native:$LD_LIBRARY_PATH

3. Restart your session by logging out and in again, or execute the following command :

$ source .bash_profile
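After reloading the profile, a quick sanity check confirms the variables are active. The snippet re-creates the exports from step 2 inline so it is self-contained; the paths are the ones chosen earlier in this article.

```shell
# Re-create the exports from step 2, then confirm the value is set
# and that the Spark binaries directory is on the PATH.
export SPARK_HOME=/opt/hadoop/spark
export PATH="$PATH:$SPARK_HOME/bin"
echo "$SPARK_HOME"   # prints /opt/hadoop/spark
```

On the cluster, spark-submit --version should now print the Spark 3.1.1 banner as well.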

4. Edit $SPARK_HOME/conf/spark-defaults.conf and set spark.master to yarn so that Spark can interact with the YARN cluster :

$ mv $SPARK_HOME/conf/spark-defaults.conf.template \
    $SPARK_HOME/conf/spark-defaults.conf
$ vi $SPARK_HOME/conf/spark-defaults.conf
(insert following line at the end)
spark.master yarn

5. Configure Memory Allocation

Spark jobs may fail to start in YARN containers if memory allocation is not configured properly. For nodes with less than 4 GB of RAM, the default configuration is not adequate: it may trigger swapping and poor performance, or even cause application initialisation to fail due to lack of memory.

NOTE : Be sure to understand how Hadoop YARN manages memory allocation before editing Spark memory settings so that your changes are compatible with your YARN cluster’s limits.

Give Your YARN Containers Maximum Allowed Memory

If the memory requested is above the maximum allowed, YARN will reject creation of the container, and your Spark application won’t start.

Get the value of yarn.scheduler.maximum-allocation-mb from $HADOOP_CONF_DIR/yarn-site.xml (4096 in our cluster). This is the maximum allowed value, in MB, for a single container.

Make sure that the values for Spark memory allocation stay below this maximum (4096 MB).

In cluster mode, the Spark driver runs inside the YARN Application Master. The amount of memory requested by Spark at initialisation is configured either in spark-defaults.conf or through the command line.

Set the default amount of memory allocated to the Spark driver in cluster mode via spark.driver.memory (this value defaults to 1 GB). To set it to 2 GB, edit the file :

$ vi $SPARK_HOME/conf/spark-defaults.conf
(insert following line at the end)
spark.driver.memory 2g

6. Configure Spark Executors’ Memory Allocation

The Spark Executors’ memory allocation is calculated based on two parameters inside $SPARK_HOME/conf/spark-defaults.conf:

spark.executor.memory: sets the base memory used in calculation

spark.executor.memoryOverhead: is added to the base memory. It defaults to 10% of the base memory, with a minimum of 384 MB. (This property replaces the older spark.yarn.executor.memoryOverhead name, which has been deprecated since Spark 2.3.)

[NOTE : Make sure that the executor’s requested memory, including overhead memory, is below the YARN container maximum size (4096 MB), otherwise the Spark application won’t initialize.

Example : for spark.executor.memory of 1 GB, the required memory is 1024+384=1408 MB. For 2 GB, the required memory will be 2048+384=2432 MB.]
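The arithmetic above can be reproduced with a small shell snippet, using the default overhead rule of 10% of the base memory with a 384 MB floor (the 2048 value is illustrative):

```shell
# Compute the container size YARN must grant for a given spark.executor.memory,
# using the default overhead rule: max(10% of executor memory, 384 MB).
executor_mb=2048
overhead_mb=$(( executor_mb / 10 ))
if [ "$overhead_mb" -lt 384 ]; then
  overhead_mb=384
fi
echo "required container size: $(( executor_mb + overhead_mb )) MB"  # 2432 MB for 2 GB
```

With executor_mb=1024 the same snippet yields 1408 MB, matching the example above; in all cases keep the result under yarn.scheduler.maximum-allocation-mb.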

To set executor memory to 2GB, edit $SPARK_HOME/conf/spark-defaults.conf:

$ vi $SPARK_HOME/conf/spark-defaults.conf
(add the following line at the end)
spark.executor.memory 2g
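Putting steps 4, 5 and 6 together, the lines appended to $SPARK_HOME/conf/spark-defaults.conf at this point are:

```
spark.master             yarn
spark.driver.memory      2g
spark.executor.memory    2g
```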

7. Test the setup with an example app (replace thequeue with a queue defined in your YARN scheduler, or drop the --queue option to use the default queue) :

$ spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--queue thequeue \
/opt/hadoop/spark/examples/jars/spark-examples*.jar \
10
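In cluster mode the driver’s output does not appear on the submitting terminal; spark-submit only reports the final status (SUCCEEDED on success). To see the computed value, pull the aggregated YARN logs for the application id that spark-submit prints (the id below is a placeholder):

```shell
$ yarn logs -applicationId application_1626789510910_0001 | grep "Pi is roughly"
```

This assumes YARN log aggregation is enabled; otherwise the driver output can be found under the NodeManager’s local log directory on the node that ran the Application Master.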


Known Issues :

1>

2021-04-10 07:52:41,108 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (4096 MB per container)

Exception in thread "main" java.lang.IllegalArgumentException: Required AM memory (4096+409 MB) is above the max threshold (4096 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.

Solution :

In this case,

yarn.scheduler.maximum-allocation-mb = 4096

and

spark.driver.memory 4g

In cluster mode the driver runs inside the YARN Application Master, so the requested AM memory is the driver memory plus its overhead (10% of 4096 MB = 409 MB), and 4096+409 MB > yarn.scheduler.maximum-allocation-mb.

Either set a lower spark.driver.memory :

spark.driver.memory 3g

Or increase yarn.scheduler.maximum-allocation-mb :

yarn.scheduler.maximum-allocation-mb = 5120
