Apache Hadoop-3.1.3 Multinode Cluster Installation Guide

Sankara Subramanian V
8 min read · Aug 28, 2020


First of all, I thank everyone for supporting my first article on LinkedIn (https://www.linkedin.com/pulse/graduates-here-hear-10-lessons-from-last-years-my-life-venkatraman/).

This article is for Data Engineering/Data Science aspirants who want to work with Big Data.

Image courtesy: https://data-flair.training/blogs/hadoop-ecosystem-components/

I’m going to explain how to set up a Hadoop multinode cluster in a distributed environment, with the required changes to the configuration files explained in detail. The Hadoop environment is provisioned on Ubuntu 18.04 with a 1 master and 2 workers configuration. The entire setup is done on an OpenStack private cloud platform, and the same installation applies to other cloud providers such as AWS, Azure, and GCP as well.

The following sections walk through the Hadoop multinode setup step by step.

Master Machine General Configuration

1) Start with the 1 master node setup. Launch an instance in OpenStack and associate a floating IP to access the cloud instance from your local machine.

Log in to the machine using the .pem file that was created while launching the instance, with the command:

ssh -i "hadoop_demo_master.pem" ubuntu@master-floating-ip

2) Next, update the machine using the command:

sudo apt update

3) Hadoop runs on Java, so install it using the below commands. Java 8 is preferred over Java 11 because it is more stable and better suited for Hadoop.

sudo apt install openjdk-8-jre-headless
sudo apt install openjdk-8-jdk-headless

If you encounter the below error:

E: Could not get lock /var/lib/dpkg/lock-frontend -open (11: Resource temporarily unavailable)

E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

Use the following commands and then rerun the Java installation commands.

sudo rm /var/lib/apt/lists/lock
sudo rm /var/cache/apt/archives/lock
sudo rm /var/lib/dpkg/lock*
sudo dpkg --configure -a
sudo apt install openjdk-8-jre-headless
sudo apt install openjdk-8-jdk-headless
java -version

4) The next step is to create a new group and a user with root (sudo) permissions to install Hadoop components.

The group name is hadoop and the username is hadoop_user.

sudo groupadd hadoop
sudo useradd -g hadoop hadoop_user -m -s /bin/bash
sudo passwd hadoop_user
sudo usermod -aG sudo hadoop_user

5) Enable password authentication on the master machine by editing /etc/ssh/sshd_config, changing the PasswordAuthentication option from 'no' to 'yes', saving the file, and restarting sshd.

sudo vim /etc/ssh/sshd_config
PasswordAuthentication yes
sudo service sshd restart

Note: 'yes' should be in lowercase.

6) Finally, log in to the master instance as the new user using the command:

su hadoop_user

Worker Machine General Configuration

Initially, start with a single worker setup; once the multinode cluster is running with 1 master and 1 worker node, replicate the worker machine image using the Create Snapshot option.

Repeat steps 1 to 6 from the master machine general configuration.

Combined Machine General Configuration

This step applies to both master and worker machines.

1) Add the master machine's private IP and the worker's private IP to the /etc/hosts file on both the master and the worker using the below commands, and save the file.

sudo vim /etc/hosts
master-private-ip hadoop-master
worker-1-private-ip hadoop-worker-1

Note: Don’t change the remaining IPv6-related entries in /etc/hosts.

2) The next step is to set up SSH on every node to establish passwordless communication between the nodes.

Execute the below commands in both master and worker nodes.

ssh-keygen -t rsa
cd /home/hadoop_user/.ssh
ls -l
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop_user@hadoop-master
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop_user@hadoop-worker-1
chmod 0600 ~/.ssh/authorized_keys

Finally, to check whether the machines can communicate without password authentication, use the below command from the master machine.

ssh hadoop_user@hadoop-worker-1
exit

3) Hadoop does not currently run over IPv6 networks, so we need to disable IPv6 using the command below.

sudo vim /etc/sysctl.conf

Add the following lines to the end of the file, then apply the changes:

net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1

sudo sysctl -p

Hadoop Installation in Master and Worker Machine

As hadoop_user, create a directory under /opt/ using the below commands and set the required permissions.

sudo mkdir /opt/hadoop
sudo chown -R hadoop_user:hadoop /opt/
cd /opt/hadoop
curl -O http://ftp.heanet.ie/mirrors/www.apache.org/dist/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz

If curl is not found on your system, you can install it with the command below and retry the download.

sudo apt install curl

Extract the Hadoop archive and then remove the downloaded tarball.

sudo tar xvf hadoop-3.1.3.tar.gz
rm hadoop-3.1.3.tar.gz

Create a symbolic link. This helps us to have multiple versions of Hadoop on the same machines.

sudo ln -s hadoop-3.1.3 hadoop
sudo chown -R hadoop_user:hadoop hadoop*

In the following Hadoop configuration, the NameNode runs only on the master machine, and the DataNodes are configured on the worker machines.

Configuring Hadoop in Master Node

1. After successfully installing Hadoop on the master machine, the next step is configuring the NameNode, DataNode, and replication factor settings.

cd /opt/hadoop/hadoop/etc/hadoop

2. Edit the hdfs-site.xml file and add the following properties. Type the lines rather than copy-pasting them, as stray blank spaces can cause issues while initializing Hadoop. Save the file once done.

vim hdfs-site.xml
Configuration of hdfs-site.xml in Master Node
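The original screenshot is not reproduced here, so the sketch below is only an illustrative hdfs-site.xml for the master in this layout; the name directory path and the replication factor of 2 are assumptions and should be adjusted to your setup.

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop/hadoop/data/nameNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>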

3. Edit the core-site.xml file located at /opt/hadoop/hadoop/etc/hadoop to tell the Hadoop distribution where the NameNode is located.

vim core-site.xml
Configuration of core-site.xml in Master Node
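As an illustration (the original screenshot is not shown here), core-site.xml typically points fs.defaultFS at the master host; the port 9000 below is an assumption and must match on every node.

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
</configuration>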

4. Then update yarn-site.xml so that resources are managed from the master.

vim yarn-site.xml
Configuration of yarn-site.xml in Master Node
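A minimal yarn-site.xml sketch for this layout is shown below; pointing the ResourceManager at hadoop-master and enabling the mapreduce_shuffle auxiliary service are the usual entries, though the original screenshot may contain additional properties.

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop-master</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>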

5. Next, update mapred-site.xml to make sure MapReduce jobs pick up the appropriate configuration.

vim mapred-site.xml
Configuration of mapred-site.xml in Master Node
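A minimal mapred-site.xml sketch is shown below; running MapReduce on YARN is the standard choice, and the HADOOP_MAPRED_HOME entries are commonly required on Hadoop 3.x (the path is an assumption based on the install location used in this guide).

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop</value>
  </property>
</configuration>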

6. Update hadoop-env.sh and add the below lines at the end of the file. To find the Java installation path, use the below command.

sudo update-alternatives --config java
Configuration of hadoop-env.sh in Master Node
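For example, on Ubuntu 18.04 with OpenJDK 8 the line added to hadoop-env.sh usually looks like the following; the exact JAVA_HOME path is an assumption and should match the output of the update-alternatives command above.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64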

7. Finally, remove localhost and add the list of worker IPs to the workers file.

vim workers
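At this stage the workers file needs only the first worker's entry, matching the /etc/hosts mapping added earlier; the second worker is appended later when it joins the cluster. For example:

worker-1-private-ip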

Configuring Hadoop in Worker Node

1. After successfully installing Hadoop on the worker machine, the next step is configuring the DataNode and replication factor settings on the worker node.

cd /opt/hadoop/hadoop/etc/hadoop

2. Edit the hdfs-site.xml file and add the following properties. Type the lines rather than copy-pasting them, as stray blank spaces can cause issues while initializing Hadoop. Save the file once done.

vim hdfs-site.xml
Configuration of hdfs-site.xml in Worker Node
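Since the worker hosts a DataNode rather than the NameNode, its hdfs-site.xml typically declares a data directory instead of a name directory; the sketch below is illustrative only, and the path is an assumption.

<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop/hadoop/data/dataNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>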

3. Edit the core-site.xml file located at /opt/hadoop/hadoop/etc/hadoop to tell the Hadoop distribution where the NameNode is located.

vim core-site.xml
Configuration of core-site.xml in Worker Node

4. Then update yarn-site.xml so that resources are managed from the master.

vim yarn-site.xml
Configuration of yarn-site.xml in Worker Node

5. Next, update mapred-site.xml to make sure MapReduce jobs pick up the appropriate configuration.

vim mapred-site.xml
Configuration of mapred-site.xml in Worker Node

6. Finally, update the hadoop-env.sh, and add the below lines at the end of the file.

Configuration of hadoop-env.sh in Worker Node
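In practice, core-site.xml, yarn-site.xml, mapred-site.xml and the hadoop-env.sh additions on the worker are usually identical to the master versions shown earlier, since every node must point at the same NameNode, ResourceManager and Java installation; only hdfs-site.xml differs, as noted above.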

Hadoop Common Configuration

The following steps need to be performed on both the master and worker nodes. They centralize the environment settings of the Hadoop components installed on the machines.

sudo vim /etc/profile
Configuration of /etc/profile file
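As a reference, the export lines added to /etc/profile typically look like the sketch below; the JAVA_HOME path is an assumption and should match your installation.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/opt/hadoop/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin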

The file should contain export statements like those shown above. Then link /etc/profile to /root/.bashrc to make sure these global settings apply to all users.

sudo ln -sf /etc/profile /root/.bashrc
source /etc/profile

To check the Hadoop version, exit the current session, log back in as hadoop_user, and use the below commands:

exit
su hadoop_user
$HADOOP_HOME/bin/hadoop version

If you face any issues, run the below command and check the version.

source /etc/profile

Once the configuration is complete on both machines, format the NameNode on the master machine using the following command.

$HADOOP_HOME/bin/hdfs namenode -format

Start Hadoop Components

Use the below commands to start Hadoop across the master and worker nodes. Use the jps command to check the Java services running on the master and worker nodes.

$HADOOP_HOME/sbin/start-dfs.sh
1. The following services should be running on the master machine after starting dfs.
Master Node after starting dfs service

2. Similarly, the following services should be running on the worker node.

Worker Node after starting dfs service
$HADOOP_HOME/sbin/start-yarn.sh

3. The following services should be running on the master machine after starting yarn.

Master Node after starting yarn service

4. Similarly, the following services should be running on the worker node.

Worker Node after starting yarn service
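For reference, with this layout jps is expected to show NameNode and SecondaryNameNode on the master and DataNode on the worker after start-dfs.sh, plus ResourceManager on the master and NodeManager on the worker after start-yarn.sh (the exact list can vary with your configuration).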

If the above services are up and running, you have completed the hadoop multinode cluster setup with 1 master and 1 worker node.

You can stop the services using the below commands and can restart the cluster whenever you want to use it. The data will be available in the worker nodes.

$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/stop-yarn.sh

Adding Second DataNode in the Hadoop Cluster

As mentioned earlier, take a snapshot of the worker instance and reuse the same SSH key. Since Hadoop was already configured on the first worker node, there is no need to repeat the whole configuration. Use the Create Snapshot option shown in the figures below to replicate the machine settings and launch the snapshot as a new instance.

Snapshot of Worker Node-1
Launch Worker Node-2 using Worker Node-1 Snapshot

Add the new worker machine's private IP to the /etc/hosts file on the master and both worker nodes using the below command.

sudo vim /etc/hosts
worker-2-private-ip hadoop-worker-2

Then add the worker-2-private-ip in /opt/hadoop/hadoop/etc/hadoop/workers. The file should contain the following information.

vim /opt/hadoop/hadoop/etc/hadoop/workers
worker-1-private-ip
worker-2-private-ip

Now start the dfs and yarn services and check for the hadoop components in the new worker node.

Worker Node-2 after starting dfs and yarn services

If the above services are running on worker node 2, you have successfully configured the multinode cluster with 1 master and 2 worker nodes.

References

1) https://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm

2) https://towardsdatascience.com/installing-hadoop-3-1-0-multi-node-cluster-on-ubuntu-16-04-step-by-step-8d1954b31505

3) https://askubuntu.com/questions/1109982/e-could-not-get-lock-var-lib-dpkg-lock-frontend-open-11-resource-temporari
