A Quick Guide to Multi-Node Hadoop Cluster Setup for Beginners

Snigdha Sen
7 min read · Oct 8, 2020


Data, data and data. Across every sector, people are dealing with huge, colossal amounts of data, which is also termed big data. Hadoop is a very well known and widespread distributed framework for big data processing. But when it comes to Hadoop installation, most of us feel it is quite a cumbersome job. This article will give you some easy and quick steps for a multi-node Hadoop cluster setup.

Multi-Node Cluster in Hadoop 3.x (3.1.3)

A multi-node cluster in Hadoop contains two or more data nodes in a distributed Hadoop environment. Organisations use this to store and analyse their massive amounts of data. So knowing how to set up a multi-node Hadoop cluster is an important skill.

Prerequisites

We will need the following software and hardware as prerequisites:

  • Ubuntu 18.04.3 LTS (Long Term Support)
  • Hadoop 3.1.3
  • Java 8 (OpenJDK)
  • SSH
  • At least two laptops/desktops connected over LAN/Wi-Fi

Installation Steps

Step 01: Installation of Ubuntu/OS in the machines

This step is self-explanatory: as a first step we need to install Ubuntu, or any other flavor of Linux you have chosen, on both the nodes (the laptops/desktops, referred to as nodes from here on). You can also install Lubuntu, a lightweight version of Ubuntu, if you are using old hardware where you have difficulty installing Ubuntu.

In my case I was using an old laptop of mine as the slave node, so I installed Lubuntu, and it worked without any issues.

Please create an admin user on both the nodes, preferably with the same username (this becomes important later, when the startup scripts SSH between nodes).

Step 02: Configuring host names

Once the OS is installed, the next step is to set the hostname on both the nodes. In my case I named the nodes as follows:

  • masternode
  • slave
sudo vi /etc/hostname 

A reboot of the node is required after the hostname is updated.

* This step is optional if you have already set the hostnames during OS installation.
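If you prefer not to edit the file directly, an alternative on systemd-based Ubuntu releases is hostnamectl. This is only a sketch of the same idea, run on each node with its own name (masternode shown here):

sudo hostnamectl set-hostname masternode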

Step 03: Configuring IP address in the hosts file of the nodes

Next, we need to add the IPs of the masternode and the slave node to the /etc/hosts file on both the nodes.

Command:

sudo vi /etc/hosts

In my case I added:

192.168.1.4 masternode
192.168.1.23 slave

Comment out all other entries you have in the hosts file in both the nodes.

Command to see the IP of the node:

ip addr show

Step 04: Restart the sshd service in both the nodes

Command:

service sshd restart
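Note: on Ubuntu the OpenSSH server may not be installed by default, and the service is usually named ssh rather than sshd. If the command above fails, the following (adjust to your setup) should get it running:

sudo apt install openssh-server
sudo systemctl restart ssh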

Step 05: Create the SSH Key in the master node and publish it in the slave node.

For this activity, follow the steps below:

  • Command to generate the SSH key on the masternode: ssh-keygen
  • It will ask for the file location where it will save the keys; I entered /home/username/.ssh/id_rsa
  • It will ask for a passphrase; keep it empty for simplicity.
  • Next, copy the newly generated public key to the authorized_keys file in your user's home/.ssh directory. Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
  • Next, execute ssh localhost to check if the key is working.
  • Next, we need to publish the key to the slave node. Command: ssh-copy-id -i $HOME/.ssh/id_rsa.pub <username>@slave
  • The first time, it will prompt you for the password and publish the key.
  • Execute ssh <username>@slave again to check if you are able to log in without a password. This is very important: without passwordless SSH working, the slave node cannot be added to the cluster later. (A consolidated sketch of these commands follows this list.)
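For convenience, here is the whole sequence again as a single sketch, run from the masternode (replace <username> with the admin user you created on both nodes):

ssh-keygen
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
ssh localhost
ssh-copy-id -i $HOME/.ssh/id_rsa.pub <username>@slave
ssh <username>@slave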

Step 06: Download and install Java

Download and install OpenJDK 8 and set the JAVA_HOME path in the .bashrc file of the user under which you are installing Hadoop.
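On Ubuntu 18.04, a typical way to do this is shown below; the JAVA_HOME path is an assumption for an amd64 machine, so check /usr/lib/jvm for the actual directory on your system:

sudo apt install openjdk-8-jdk
# then add to ~/.bashrc:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin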

Step 07: Download the Hadoop 3.1.3 package in all nodes.

Log in to each node, then download and untar the Hadoop package.

wget http://apache.cs.utah.edu/hadoop/common/current/hadoop-3.1.3.tar.gz
tar -xzf hadoop-3.1.3.tar.gz
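The configuration paths used in the later steps assume the Hadoop directory lives at ~/hadoop, so it helps to move the extracted folder there (an optional convenience, assuming you extracted it in your home directory):

mv hadoop-3.1.3 ~/hadoop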

Step 08: Add the Hadoop and Java paths in the bash file (.bashrc) on all nodes.

Command: sudo vi .bashrc

Environment Variables to Set in .bashrc
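A typical set of entries, assuming Hadoop lives at ~/hadoop and JAVA_HOME points at the OpenJDK 8 install from the previous step, looks like this:

export HADOOP_HOME=$HOME/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin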

Step 09: Set NameNode Location

Update your ~/hadoop/etc/hadoop/core-site.xml file to set the NameNode location to masternode on port 9000:
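A minimal core-site.xml matching that description, using the masternode hostname configured earlier, would look like this:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://masternode:9000</value>
  </property>
</configuration>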

Step 10: Set path for HDFS

Edit ~/hadoop/etc/hadoop/hdfs-site.xml to add the following on the masternode:

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/data/dataNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

For the data node (slave), please put the following config:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/data/datanode</value>
  </property>
</configuration>

Please note the difference between the configuration properties of masternode and slave.

Step 11: Set YARN as Job Scheduler
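
To set YARN as the job scheduler, edit ~/hadoop/etc/hadoop/mapred-site.xml on the nodes. A minimal sketch of this file, assuming only the framework needs to be set, is:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- On Hadoop 3.x, running MapReduce jobs may additionally require
       mapreduce.application.classpath to point at the MapReduce jars. -->
</configuration>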

Step 12: Configure YARN

Edit ~/hadoop/etc/hadoop/yarn-site.xml, which contains the configuration options for YARN. In the value field for yarn.resourcemanager.hostname, replace 192.168.1.4 with the IP address of your masternode:

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.1.4</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Step 13: Configure Workers

The workers file is used by the startup scripts to start the required daemons on all nodes. Edit ~/hadoop/etc/hadoop/workers on the masternode to include the hostnames of both nodes.
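With the hostnames used in this article, the workers file would contain:

masternode
slave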

Step 14: Update the JAVA_HOME in hadoop-env.sh

Edit ~/hadoop/etc/hadoop/hadoop-env.sh and update the value of JAVA_HOME to that of your installation on both the nodes.
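For example, with the OpenJDK 8 package location assumed earlier, the line would look like:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64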

Step 15: Format HDFS namenode

Execute the following command in your masternode.

Command: hdfs namenode -format

Step 16: Start and Stop HDFS

OK, so now you are almost there. The only thing left is starting the daemons. To start all the daemons and bring up your Hadoop cluster, use the command below:

Command: start-all.sh

Once the command prompt is back, check which daemons are running with the following command:

Command: jps
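When you want to shut the cluster down again later, the matching script is:

Command: stop-all.sh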

On the masternode, the jps output should show daemons such as NameNode, SecondaryNameNode and ResourceManager, plus DataNode and NodeManager, since the masternode is also listed in the workers file.

On the slave node, jps should show DataNode and NodeManager.

If you are not seeing the above daemons running, something has gone wrong in your configuration, so you need to check the previous steps again.

So, your cluster is up and running. To check the NameNode information from the browser, use the following URL after replacing the IP with that of your masternode: http://192.168.1.4:9870/dfshealth.html#tab-overview

To check the nodes of the cluster, you can use the following URL, again with the IP of your masternode: http://192.168.1.4:8088/cluster/nodes

Step 17: Put and Get Data to HDFS

To start with, you have to create a user directory in your HDFS cluster. This directory should use the same username under which you installed and are running the cluster. Use the following command:

Command: hdfs dfs -mkdir /user/username

Once the user directory is created, you can use any of the hdfs dfs commands and start using your HDFS cluster.
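For example, to put a local file into HDFS, list the directory, and get the file back (sample.txt here is just a hypothetical local file):

hdfs dfs -put sample.txt /user/username/
hdfs dfs -ls /user/username
hdfs dfs -get /user/username/sample.txt /tmp/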

So I hope you were able to set up your Hadoop cluster without much trouble.

Issues I faced

Well, as with any project, small or big, it wasn't as smooth sailing as the documentation above makes it look. I initially started with the aim of creating a multi-node cluster using Windows machines, but the SSH server and client did not work out with the username convention that Windows follows. So I used the next best option available to me: creating a cluster using Ubuntu on my old laptops. While setting up the configuration I faced a couple of major hurdles:

  • JDK version: I had installed OpenJDK 11, and it looks like there is some incompatibility between OpenJDK 11 and Hadoop 3.1.3. The NodeManager daemon was not coming up because of this. After googling around I settled on OpenJDK 8.
  • Username used for the Hadoop installation on the master and slave nodes: initially I created two different usernames (master and slave) on the masternode and slave, due to which the ssh command triggered by the Hadoop start-all script was failing: it was trying to use master@slave while no 'master' user existed on the slave node. This was resolved once I switched to the same username on both nodes.

This brings us to the end of this blog. Do let me know in the comments whether this helped you in setting up your own multi-node Hadoop cluster.

Thanks for reading. :)


Snigdha Sen

I am currently pursuing PhD in Machine Learning and Big Data analytics from IIIT, Allahabad. I am a Data Science enthusiast.