Multi Node Spark Setup on Hadoop with YARN

Harshil Raval
Whatfix Engineering Blog
5 min read · Jul 26, 2021

Date : 20th July 2021


Introduction :

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning. [Ref : https://spark.apache.org/faq.html ]

In this article, we will discuss how to set up a Spark cluster on top of an existing Hadoop cluster, using Hadoop’s YARN as the resource manager for Spark.

Prerequisites :

  • 3 Node Hadoop cluster (we are using HADOOP VERSION 3.2.1)
  • Identify the name node and data nodes (we will use the following for this article):
Namenode : 192.168.0.1
Datanode1 : 192.168.0.2
Datanode2 : 192.168.0.3
  • ssh into the 3 machines in separate terminals to check access :
$ ssh user@192.168.0.1
$ ssh user@192.168.0.2
$ ssh user@192.168.0.3
  • Check if Hadoop daemon processes are running :
$ jps
Namenode :
19042 Jps
17669 NameNode
17910 SecondaryNameNode
18199 ResourceManager
18623 JobHistoryServer
Datanode1 :
18518 DataNode
24702 Jps
18670 NodeManager
Datanode2 :
17521 DataNode
17673 NodeManager
15628 Jps
  • Check list of yarn nodes (Execute only on Namenode):
$ yarn node -list
  • Make sure Java is installed on all of them (Java is a prerequisite for Hadoop) :
$ java -version
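The per-node checks above can be combined into one pass. This is a sketch, assuming the three node IPs from this article and key-based ssh access for "user"; it needs a live cluster to run.

```shell
# Run jps on every cluster node to confirm the Hadoop daemons are up.
# Assumes passwordless ssh for "user" to the IPs listed above.
for host in 192.168.0.1 192.168.0.2 192.168.0.3; do
  echo "== $host =="
  ssh "user@$host" jps
done
```

On the namenode you should see NameNode, SecondaryNameNode, ResourceManager and JobHistoryServer in the output; on each datanode, DataNode and NodeManager.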

Spark Setup Steps :

[NOTE : Execute all commands on the namenode (master node) unless specified to execute on a datanode (worker node).]

  1. Download Spark (Spark 3.1.1, pre-built for Hadoop 3.2) : https://spark.apache.org/downloads.html
$ cd /opt/hadoop
$ wget https://mirrors.estointernet.in/apache/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
$ tar -xvf spark-3.1.1-bin-hadoop3.2.tgz
$ mv spark-3.1.1-bin-hadoop3.2 spark

[We are extracting Spark into the /opt/hadoop directory.]

2. Set up environment variables :

Put the following in the /home/<user>/.bash_profile file (or in the file that contains the environment setup for Hadoop):

export SPARK_HOME=/opt/hadoop/spark
export PATH=$PATH:$SPARK_HOME/bin
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export LD_LIBRARY_PATH=/opt/hadoop/lib/native:$LD_LIBRARY_PATH

Explanation of the above statements :

a> Set SPARK_HOME
$ export SPARK_HOME=/opt/hadoop/spark
b> Add the Spark binaries directory to your PATH
$ export PATH=$PATH:$SPARK_HOME/bin
c> Integrate Spark with YARN
$ export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
$ export LD_LIBRARY_PATH=/opt/hadoop/lib/native:$LD_LIBRARY_PATH

3. Restart your session by logging out and in again, or execute the following command :

$ source .bash_profile
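After reloading the profile, a quick sanity check confirms the variables are active. The snippet re-creates the exports from step 2 inline so it is self-contained; the paths are the ones chosen earlier in this article.

```shell
# Re-create the exports from step 2, then confirm the value is set
# and that the Spark binaries directory is on the PATH.
export SPARK_HOME=/opt/hadoop/spark
export PATH="$PATH:$SPARK_HOME/bin"
echo "$SPARK_HOME"   # prints /opt/hadoop/spark
```

On the cluster, spark-submit --version should now print the Spark 3.1.1 banner as well.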

4. Edit $SPARK_HOME/conf/spark-defaults.conf and set spark.master to yarn so that Spark can interact with the YARN cluster :

$ mv $SPARK_HOME/conf/spark-defaults.conf.template \
    $SPARK_HOME/conf/spark-defaults.conf
$ vi $SPARK_HOME/conf/spark-defaults.conf
(insert following line at the end)
spark.master yarn

5. Configure Memory Allocation

Spark jobs may fail to start in YARN containers if memory allocation is not configured properly. For nodes with less than 4 GB of RAM, the default configuration is not adequate: it may trigger swapping and poor performance, or even cause application initialisation to fail due to lack of memory.

NOTE : Be sure to understand how Hadoop YARN manages memory allocation before editing Spark memory settings so that your changes are compatible with your YARN cluster’s limits.

Give Your YARN Containers Maximum Allowed Memory

If the memory requested is above the maximum allowed, YARN will reject creation of the container, and your Spark application won’t start.

Get the value of yarn.scheduler.maximum-allocation-mb from $HADOOP_CONF_DIR/yarn-site.xml (4096 in our cluster). This is the maximum allowed value, in MB, for a single container.

Make sure that the values for Spark memory allocation stay below this maximum (4096 MB).

In cluster mode, the Spark driver runs inside the YARN Application Master. The amount of memory requested by Spark at initialisation is configured either in spark-defaults.conf or through the command line.

Set the default amount of memory allocated to the Spark driver in cluster mode via spark.driver.memory (this value defaults to 1 GB). To set it to 2 GB, edit the file :

$ vi $SPARK_HOME/conf/spark-defaults.conf
(insert following line at the end)
spark.driver.memory 2g

6. Configure Spark Executors’ Memory Allocation

The Spark Executors’ memory allocation is calculated based on two parameters inside $SPARK_HOME/conf/spark-defaults.conf:

spark.executor.memory: sets the base memory used in calculation

spark.executor.memoryOverhead: is added to the base memory. It defaults to 10% of the base memory, with a minimum of 384 MB. (This property replaces the older spark.yarn.executor.memoryOverhead name, which has been deprecated since Spark 2.3.)

[NOTE : Make sure that the executor’s requested memory, including overhead memory, is below the YARN container maximum size (4096 MB), otherwise the Spark application won’t initialize.

Example : for spark.executor.memory of 1 GB, the required memory is 1024+384=1408 MB. For 2 GB, the required memory will be 2048+384=2432 MB.]
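The arithmetic above can be reproduced with a small shell snippet, using the default overhead rule of 10% of the base memory with a 384 MB floor (the 2048 value is illustrative):

```shell
# Compute the container size YARN must grant for a given spark.executor.memory,
# using the default overhead rule: max(10% of executor memory, 384 MB).
executor_mb=2048
overhead_mb=$(( executor_mb / 10 ))
if [ "$overhead_mb" -lt 384 ]; then
  overhead_mb=384
fi
echo "required container size: $(( executor_mb + overhead_mb )) MB"  # 2432 MB for 2 GB
```

With executor_mb=1024 the same snippet yields 1408 MB, matching the example above; in all cases keep the result under yarn.scheduler.maximum-allocation-mb.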

To set executor memory to 2GB, edit $SPARK_HOME/conf/spark-defaults.conf:

$ vi $SPARK_HOME/conf/spark-defaults.conf
(add the following line at the end)
spark.executor.memory 2g
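Putting steps 4, 5 and 6 together, the lines appended to $SPARK_HOME/conf/spark-defaults.conf at this point are:

```
spark.master             yarn
spark.driver.memory      2g
spark.executor.memory    2g
```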

7. Test the setup with an example app (replace thequeue with a queue defined in your YARN scheduler, or drop the --queue option to use the default queue) :

$ spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--queue thequeue \
/opt/hadoop/spark/examples/jars/spark-examples*.jar \
10
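In cluster mode the driver’s output does not appear on the submitting terminal; spark-submit only reports the final status (SUCCEEDED on success). To see the computed value, pull the aggregated YARN logs for the application id that spark-submit prints (the id below is a placeholder):

```shell
$ yarn logs -applicationId application_1626789510910_0001 | grep "Pi is roughly"
```

This assumes YARN log aggregation is enabled; otherwise the driver output can be found under the NodeManager’s local log directory on the node that ran the Application Master.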


Known Issues :

1>

2021-04-10 07:52:41,108 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (4096 MB per container)

Exception in thread "main" java.lang.IllegalArgumentException: Required AM memory (4096+409 MB) is above the max threshold (4096 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.

Solution :

In this case,

yarn.scheduler.maximum-allocation-mb = 4096

and

spark.driver.memory 4g

In cluster mode the driver runs inside the YARN Application Master, so the requested AM memory is the driver memory plus its overhead (10% of 4096 MB = 409 MB), and 4096+409 MB > yarn.scheduler.maximum-allocation-mb.

Either set a lower spark.driver.memory :

spark.driver.memory 3g

Or increase yarn.scheduler.maximum-allocation-mb :

yarn.scheduler.maximum-allocation-mb = 5120
