Easily Setup Multi-Node Hadoop Cluster in YARN Mode on CentOS/Ubuntu
First, let’s understand what is Hadoop YARN and how it’s different from Hadoop 1.x. YARN is one the key features in next gen Hadoop 2 version. YARN is now characterized as a large-scale, distributed operating system for big data applications.
YARN(Yet Another Resource Negotiator) is a cluster management technology where resource management and processing is done using different processing models. When compared to Hadoop 1.X where scaling was limited to 4000 nodes per cluster, in 2.x we can scale up to 10,000 nodes per cluster. Hadoop 2.x allows to work in MapReduce as well as other distributed computing models like Spark, MP and Hbase coprocessor. It works on the concept of containers. Here we can have multiple namenodes and hence there is no single point of failure.
The architecture consists of 2 major components i.e. HDFS and YARN. As the name suggests HDFS is used for Storage and YARN is used for Processing. In the above architecture, the NameNode and ResourceManager acts as Master and DataNode and NodeManager acts as slave.
Let’s begin with creating a multinode hadoop yarn cluster. Here I will show, how you can create your own 3 Nodes(1 Master and 2 Slaves nodes cluster)
1. Install CentOS:
I will be using 3 CentOS 6.5 minimal machines on my virtual box and lets name them as nn1(consisting of namenode and resourcemanager), dn1(datanode1 and nodemanager1), dn2(datanode2 and nodemanager2). You can also use ubuntu here instead of CentOS
You can make your network adapter as Bridge or Host-Only depending upon your requirement. Here are my 3 machines
nn1–192.168.1.200 HDFS — Namenode and YARN — Resourcemanager
dn1–192.168.1.201 HDFS — Datanode and YARN — Nodemanager
dn2–192.168.1.202 HDFS — Datanode and YARN — Nodemanager
2. Edit /etc/hosts :
Now lets ssh into nn1, dn1, dn2 and change the /etc/hosts file, so that we can use hostname instead of IP everytime we wish to use or ping any of these machines
3. Download and Install Java:
[root@nn1 /]yum install wget
[root@nn1 /]wget https://mirror.its.sfu.ca/mirror/CentOS-Third-Party/NSG/common/x86_64/jdk-7u45-linux-x64.rpm
[root@nn1 /]# rpm -ivh /opt/jdk-7u45-linux-x64.rpm
4. Install Hadoop 2.7.3
[root@nn1 ~]# wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
[root@nn1 ~]# tar -xzvf /root/hadoop-2.7.3.tar.gz -C /home/hadoop/
5. Set Environment variables in ~/.bash_profile(centos) or .bashrc(ubuntu)
export JAVA_HOME=/usr/java/jdk1.7.0_45
PATH=$PATH:$JAVA_HOME/bin:/home/hadoop/hadoop-2.7.3/bin:$PATH
export PATH
# Set Hadoop-related environment variables
export HADOOP_HOME=/home/hadoop/hadoop-2.7.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
source it to reflect changes
6. Modify core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.1.200:9000</value>
</property>
</configuration>
7. Modify hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/hadoop-2.7.3/hadoop_store/hdfs/namenode2</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/hadoop-2.7.3/hadoop_store/hdfs/datanode2
</value>
</property>
</configuration>
8. Modify mapred-site.xml
cp mapred-site-template.xml mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
9. Modify yarn-site.xml
<configuration>
<! — Site specific YARN configuration properties — >
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>192.168.1.200:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>192.168.1.200:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>192.168.1.200:8050</value>
</property>
</configuration>
10. Create Namenode directory in nn1
mkdir -p /home/hadoop/hadoop-2.7.3/hadoop_store/hdfs/namenode2
11. Modify slaves files and add both datanodes ip’s
12. Copy all modified files to both the datanodes dn1 and dn2
13. Make datanode directory on both dn1 and dn2 and give permissions
14. Disable Firewall
service iptables stop
service ip6tables stop
15. Format Namenode on nn1
16. Start Namenode, Datanode(start-dfs.sh) and Resourcemanager, Nodemanager(start-yarn.sh)
17. jps on nn1
18. jps on dn1 and dn2
19. Check if we get 2 live data-nodes
hdfs dfsadmin -report
20. Check ResourceManager on browser
If you have any issues installing vm, network related issues or any commands, please comment below.