Easily Set Up a Multi-Node Hadoop Cluster in YARN Mode on CentOS/Ubuntu

Yogesh Darji
5 min read · Mar 14, 2017


First, let’s understand what Hadoop YARN is and how it differs from Hadoop 1.x. YARN is one of the key features of the next-generation Hadoop 2 release, and it is often characterized as a large-scale, distributed operating system for big data applications.


YARN (Yet Another Resource Negotiator) is a cluster management technology that separates resource management from data processing. Where Hadoop 1.x scaling was limited to about 4,000 nodes per cluster, 2.x can scale up to 10,000 nodes per cluster. Hadoop 2.x also lets you run MapReduce alongside other distributed computing models such as Spark, MPI, and HBase coprocessors. It works on the concept of containers, and with HDFS high availability we can have multiple NameNodes, so the NameNode is no longer a single point of failure.

Hadoop 2.x architecture

The architecture consists of two major components: HDFS and YARN. As the names suggest, HDFS is used for storage and YARN for processing. In the architecture above, the NameNode and ResourceManager act as masters, while the DataNodes and NodeManagers act as slaves.

Let’s begin with creating a multi-node Hadoop YARN cluster. Here I will show how you can create your own 3-node cluster (1 master and 2 slave nodes).

1. Install CentOS:

I will be using 3 CentOS 6.5 minimal machines in VirtualBox; let’s name them nn1 (NameNode and ResourceManager), dn1 (DataNode 1 and NodeManager 1), and dn2 (DataNode 2 and NodeManager 2). You can also use Ubuntu here instead of CentOS.

You can set the network adapter to Bridged or Host-Only depending on your requirement. Here are my 3 machines:

nn1 (192.168.1.200): HDFS NameNode and YARN ResourceManager
dn1 (192.168.1.201): HDFS DataNode and YARN NodeManager
dn2 (192.168.1.202): HDFS DataNode and YARN NodeManager

VirtualBox

2. Edit /etc/hosts :

Now let’s SSH into nn1, dn1, and dn2 and edit the /etc/hosts file on each machine, so that we can use hostnames instead of IPs whenever we want to reach or ping any of these machines:

vi /etc/hosts
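The file should contain the same three entries on every machine, mapping the hostnames to the IPs listed above:

192.168.1.200 nn1
192.168.1.201 dn1
192.168.1.202 dn2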

3. Download and Install Java:

[root@nn1 /]# yum install wget

[root@nn1 /]# wget -P /opt https://mirror.its.sfu.ca/mirror/CentOS-Third-Party/NSG/common/x86_64/jdk-7u45-linux-x64.rpm

[root@nn1 /]# rpm -ivh /opt/jdk-7u45-linux-x64.rpm

Verify the installation:

[root@nn1 /]# java -version

4. Install Hadoop 2.7.3

[root@nn1 ~]# wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
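The tar command below extracts into /home/hadoop, so create that directory first if it doesn’t already exist:

[root@nn1 ~]# mkdir -p /home/hadoop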

[root@nn1 ~]# tar -xzvf /root/hadoop-2.7.3.tar.gz -C /home/hadoop/

5. Set environment variables in ~/.bash_profile (CentOS) or ~/.bashrc (Ubuntu)

export JAVA_HOME=/usr/java/jdk1.7.0_45
export PATH=$PATH:$JAVA_HOME/bin

# Set Hadoop-related environment variables
export HADOOP_HOME=/home/hadoop/hadoop-2.7.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME

# Add Hadoop bin/ and sbin/ directories to PATH
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Add the lines above with vi ~/.bash_profile, then source the file so the changes take effect in the current shell:

source ~/.bash_profile

6. Modify core-site.xml

All of the XML files in steps 6 through 9 live in $HADOOP_CONF_DIR (/home/hadoop/hadoop-2.7.3/etc/hadoop). In core-site.xml, point the default filesystem at the NameNode (fs.default.name is the legacy key; its newer name is fs.defaultFS, and either works on 2.7.x):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.200:9000</value>
  </property>
</configuration>

core-site.xml

7. Modify hdfs-site.xml

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hadoop/hadoop-2.7.3/hadoop_store/hdfs/namenode2</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hadoop/hadoop-2.7.3/hadoop_store/hdfs/datanode2</value>
  </property>
</configuration>

hdfs-site.xml

8. Modify mapred-site.xml

cp mapred-site.xml.template mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

mapred-site.xml

9. Modify yarn-site.xml

<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>192.168.1.200:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>192.168.1.200:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>192.168.1.200:8050</value>
  </property>
</configuration>

yarn-site.xml

10. Create the NameNode directory on nn1

mkdir -p /home/hadoop/hadoop-2.7.3/hadoop_store/hdfs/namenode2

11. Modify the slaves file and add both DataNodes

vi /home/hadoop/hadoop-2.7.3/etc/hadoop/slaves
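Since /etc/hosts already maps the names, the file just lists the two slaves, one per line:

dn1
dn2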

12. Copy all the modified files to both DataNodes, dn1 and dn2, as shown below

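Here is a minimal sketch using scp, assuming root SSH access to the slaves; run the same two commands again with dn2 in place of dn1:

scp -r /home/hadoop/hadoop-2.7.3 root@dn1:/home/hadoop/
scp ~/.bash_profile root@dn1:~/.bash_profile

Remember to source ~/.bash_profile on each DataNode afterwards.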

13. Make the DataNode directory on both dn1 and dn2 and give it the right permissions
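A sketch matching the dfs.datanode.data.dir path configured in hdfs-site.xml above; run this on each of dn1 and dn2:

mkdir -p /home/hadoop/hadoop-2.7.3/hadoop_store/hdfs/datanode2
chmod 755 /home/hadoop/hadoop-2.7.3/hadoop_store/hdfs/datanode2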

14. Disable the firewall (so the daemons on the three nodes can reach each other)

service iptables stop

service ip6tables stop
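These commands only stop the firewall until the next reboot; on CentOS 6 you can also disable it permanently:

chkconfig iptables off
chkconfig ip6tables off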

15. Format the NameNode on nn1
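Run the format command once, on first setup only (it wipes any existing HDFS metadata):

hdfs namenode -format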

16. Start the NameNode and DataNodes (start-dfs.sh), then the ResourceManager and NodeManagers (start-yarn.sh)
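Both scripts are run from nn1 and use the slaves file to start the daemons on dn1 and dn2 over SSH; if you haven’t set up passwordless SSH from nn1 to the slaves, you will be prompted for each node’s password:

start-dfs.sh
start-yarn.sh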

17. Run jps on nn1; you should see the NameNode, SecondaryNameNode, and ResourceManager processes

18. Run jps on dn1 and dn2; each should show DataNode and NodeManager

19. Check that we get 2 live DataNodes

hdfs dfsadmin -report

20. Check the ResourceManager web UI in a browser at http://192.168.1.200:8088 (the default port); it should report 2 active nodes

If you have any issues installing the VMs, run into network-related problems, or have trouble with any of the commands, please comment below.
