Hadoop Multi-Node Cluster Setup Using Ansible

This blog explains how to create a multi-node Hadoop cluster using Ansible, a walkthrough that is rarely available elsewhere. Hadoop version 1.2.1 is used in this blog, but you can choose a different version if you prefer.

Harshit Dawar
The Startup
Dec 29, 2020


In this fast-moving world, where automation is crucial for every business, it is very important to provide the right education to everyone. The sad reality, however, is that the way education is delivered is often absurd: most sources are very abstract and do not cover their content in depth.

The tool used in this blog, Hadoop, is a prime example of this abstraction. Most sources hand you a configuration script that works only after a lot of effort, because the script is hard to understand at first. Very rarely will you find an explanation of what those scripts actually mean.

To stand out from the crowd, it is very important to know the correct concepts, their applications & their implementation.

With that said, let’s start the blog.

If you are not familiar with Ansible & Hadoop, I would request you to go through the blogs linked below. They briefly explain both tools and will help you understand the concepts in this blog better.

Ansible Blog link:

Hadoop Blog Link:

This blog uses a fully automated strategy to deploy the Hadoop multi-node cluster using Ansible.

Note: This blog covers the complete code for RedHat-based Linux distributions; I will add other distributions as soon as I create the code for them. (If you are slightly familiar with the commands, you can customize the scripts to your own requirements.) Also, you should have Ansible installed on your machine to run this script. If you want to know more about the installation, check out this link.

The resources used for the setup are listed below (all the files except the JDK are present in my GitHub repo, which is linked at the end of this blog):

1. Oracle JDK version 1.8

When you visit the above link, you will see multiple JDKs listed there. To continue with this blog, download the one that is highlighted.


2. Hadoop 1.2.1

3. hdfs-site.xml file for the DataNodes & the NameNode.

4. core-site.xml file for the DataNodes & the NameNode.

5. Network connectivity between all the DataNodes & the NameNode (master).

Variables used in the Ansible Script!

A separate variable file is present in the GitHub repository; the ansible-playbook uses it to implement the cluster.

All of the variables can be left untouched except one. The variable that must be set is:

1. “jdk”: This variable points to the JDK path on the managed nodes (where the ansible-playbook will make changes). To assign it a value, first download the JDK mentioned above to the exact same path on every DataNode & the NameNode, then assign that path to this variable.
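As a rough illustration, a variable file of this kind might look like the sketch below. Only the variable name “jdk” comes from this blog; the file name, the path, and every other variable name are assumptions made for the example, so check the actual file in the GitHub repository.

```yaml
# vars.yml (hypothetical file name; the real file lives in the GitHub repo)

# Path to the downloaded JDK package. It must be the same path on every
# managed node. The value below is a placeholder, and the .rpm extension is
# an assumption based on the RedHat-based target systems.
jdk: /root/jdk-package.rpm

# The remaining names are hypothetical examples of values that could be
# left at their defaults.
namenode_dir: /nn       # where the NameNode would keep its metadata
datanode_dir: /dn       # where each DataNode would keep its blocks
namenode_port: 9001     # port the NameNode would listen on
```

A playbook would typically load such a file through the vars_files keyword of a play.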

Inventory Groups!

In the Ansible inventory, make sure to assign the group name “MasterNode” to all the machines on which the NameNode has to be configured.

The group name for all the DataNodes should be “WorkerNodes”.
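The two groups described above could be declared as in the following sketch, written in Ansible's YAML inventory format. The group names are the ones this blog requires; the IP addresses are placeholders from a documentation-only range and the file name is an assumption.

```yaml
# inventory.yml (hypothetical file name)
all:
  children:
    MasterNode:            # machines that will run the NameNode
      hosts:
        192.0.2.10:        # placeholder address
    WorkerNodes:           # machines that will run the DataNodes
      hosts:
        192.0.2.21:        # placeholder address
        192.0.2.22:        # placeholder address
```

Running “ansible-inventory -i inventory.yml --graph” prints the group tree, which is a quick way to confirm that every machine landed in the intended group.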

Configuration Files for the Cluster!

The most important configuration files which are required for this cluster setup are:

1. hdfs-site.xml (For NameNode)!

HDFS-Site.XML for NameNode!

2. hdfs-site.xml (For DataNode)!

HDFS-Site.XML for DataNode!

3. core-site.xml (For both DataNode & NameNode, as it is the same for both the nodes)!

core-Site.XML for DataNode & NameNode!
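To give an idea of what these three files contain, here is a minimal sketch that writes them with Ansible's copy module. It is not the exact content from the repository. The property names (dfs.name.dir, dfs.data.dir, fs.default.name) are the ones Hadoop 1.x uses, while the directories, the port, the variable names, and the /etc/hadoop location are assumptions for the example.

```yaml
# Sketch only: paths, port, and variable names are assumptions.
- name: Write the HDFS configuration files
  hosts: MasterNode:WorkerNodes
  become: true                      # the config directory is usually root-owned
  vars:
    hadoop_conf_dir: /etc/hadoop    # assumed location of the Hadoop config
    namenode_dir: /nn               # hypothetical NameNode metadata directory
    datanode_dir: /dn               # hypothetical DataNode storage directory
    namenode_port: 9001             # hypothetical NameNode port
    # Assumes the inventory lists the NameNode by a reachable name or IP.
    namenode_host: "{{ groups['MasterNode'] | first }}"
  tasks:
    - name: hdfs-site.xml for the NameNode (where metadata is stored)
      when: inventory_hostname in groups['MasterNode']
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/hdfs-site.xml"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.name.dir</name>
              <value>{{ namenode_dir }}</value>
            </property>
          </configuration>

    - name: hdfs-site.xml for the DataNodes (where blocks are stored)
      when: inventory_hostname in groups['WorkerNodes']
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/hdfs-site.xml"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.data.dir</name>
              <value>{{ datanode_dir }}</value>
            </property>
          </configuration>

    - name: core-site.xml, identical on every node (where the NameNode listens)
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/core-site.xml"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://{{ namenode_host }}:{{ namenode_port }}</value>
            </property>
          </configuration>
```

The only difference between the two hdfs-site.xml files is the property name: the NameNode is told where to keep metadata, and each DataNode is told where to keep data blocks. core-site.xml is shared because every node needs the same NameNode address.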

Ansible-Playbook to configure everything!

When the ansible-playbook given below is executed, it always asks whether to format the NameNode directory. You should enter “Yes” only the first time, because formatting erases all the metadata present in the NameNode. The first format is required for the NameNode to work properly, but once there is data in the Hadoop cluster, formatting the NameNode directory destroys every link to that data.

Therefore, be cautious with this prompt!
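A prompt like this can be built with Ansible's vars_prompt keyword. The sketch below shows one possible shape, not the playbook's actual code: the variable name, the prompt wording, and the default are assumptions.

```yaml
# Sketch only: variable name, prompt text, and default are assumptions.
- name: Optionally format the NameNode
  hosts: MasterNode
  become: true
  vars_prompt:
    - name: format_namenode          # hypothetical variable name
      prompt: "Format the NameNode directory? (Yes/No)"
      private: false                 # show the answer while typing
      default: "No"                  # the safe choice when Enter is pressed
  tasks:
    - name: Format the NameNode metadata directory
      # Hadoop 1.x asks for a capital Y before re-formatting an existing
      # directory, so the answer is piped in.
      ansible.builtin.shell: echo Y | hadoop namenode -format
      when: format_namenode | lower == "yes"
```

Defaulting to “No” means that an accidental Enter on a later run skips the format task and leaves the existing metadata untouched.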

Complete Ansible-Playbook:

Ansible-Playbook to configure the Hadoop-Cluster!
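For orientation, the overall flow of such a playbook can be sketched as below. This is an outline under stated assumptions, not the playbook from the repository: it assumes both the JDK and Hadoop are installed from RPM files already present on each node, and the “hadoop_rpm” variable, all paths, and the “creates” locations are hypothetical. Writing the configuration files and formatting the NameNode would happen before the daemons are started.

```yaml
# Sketch only: package paths, directories, and variable names are assumptions.
- name: Install Java and Hadoop on every node
  hosts: MasterNode:WorkerNodes
  become: true
  vars:
    jdk: /root/jdk-package.rpm            # placeholder path
    hadoop_rpm: /root/hadoop-1.2.1.rpm    # hypothetical variable and path
  tasks:
    - name: Install the JDK
      ansible.builtin.command:
        cmd: rpm -ivh {{ jdk }}
        creates: /usr/java                # assumed install location; skips re-runs
    - name: Install Hadoop
      ansible.builtin.command:
        cmd: rpm -ivh --force {{ hadoop_rpm }}
        creates: /usr/bin/hadoop          # assumed install location; skips re-runs

- name: Bring up the NameNode
  hosts: MasterNode
  become: true
  tasks:
    - name: Create the metadata directory
      ansible.builtin.file:
        path: /nn                         # hypothetical directory
        state: directory
    # The configuration files are written and the NameNode is (optionally)
    # formatted at this point.
    - name: Start the NameNode daemon
      # Not idempotent: the script fails if the daemon is already running.
      ansible.builtin.command: hadoop-daemon.sh start namenode

- name: Bring up the DataNodes
  hosts: WorkerNodes
  become: true
  tasks:
    - name: Create the storage directory
      ansible.builtin.file:
        path: /dn                         # hypothetical directory
        state: directory
    - name: Start the DataNode daemon
      # Not idempotent: the script fails if the daemon is already running.
      ansible.builtin.command: hadoop-daemon.sh start datanode
```

Once the daemons are up, running “hadoop dfsadmin -report” on any node lists the DataNodes that have registered with the NameNode.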

I hope my article explains everything related to the topic, with all the underlying concepts and explanations. Thank you so much for investing your time in reading my blog & boosting your knowledge. If you like my work, I request you to applaud this blog & follow me on Medium & GitHub!
