Hadoop Multi-Node Cluster Setup Using Ansible

This blog explains how to set up a multi-node Hadoop cluster using Ansible, a walkthrough that is rarely available elsewhere. Hadoop version 1.2.1 is used here, but you can choose a different version if you prefer.

Harshit Dawar
Dec 29, 2020 · 4 min read
Image by Author

In this fast-moving world, automation is a crucial aspect of every business, so it is very important that everyone receives the right education. But the sad reality is that the way education is delivered is often absurd. Most learning sources are very abstract and do not cover their content in depth.

The biggest example of this abstraction is the tool used in this blog, Hadoop. Most sources give you a configuration script that works only after a lot of effort, because the script is challenging to understand at first. Very rarely will you find an explanation of what those scripts actually mean.

To stand out from the crowd, it is very important to know the correct concepts, their applications & their implementation.

Now, that being said, let’s start the blog.

If you are not familiar with Ansible & Hadoop, I would request you to go through the blogs linked below. They explain both tools in brief, which will help you understand the concepts used here.

Ansible Blog link:

Hadoop Blog Link:

This blog will use the complete automation strategy to deploy the Hadoop multi-node cluster using Ansible.

Note: This blog covers the complete code for RedHat-based Linux distributions. I will add support for other distributions as soon as I create the code for them. (If you are even slightly familiar with the commands, you can customize the scripts to your requirements.) Also, Ansible must be installed on your machine to run this script. If you want to know more about the installation, check out this link.

The resources used for the setup are listed below (all the files except the JDK are present in my GitHub repo, whose link is at the end of this blog):

1. Oracle JDK version 1.8

When you visit the above link, you will see multiple JDKs listed there. To continue with this blog, download the one that is highlighted.

Image by Author!

2. Hadoop 1.2.1

3. hdfs-site.xml file for the DataNode & NameNode.

4. core-site.xml file for the DataNode & NameNode.

5. Network connectivity between all the DataNodes & the NameNode (master node).

Variables used in the Ansible Script!

A separate variable file is present in the GitHub repository, which the ansible-playbook uses to implement the cluster.

All of the variables can be left untouched except one. The variable that must be set is:

1. “jdk”: This variable points to the JDK path on the managed nodes (where the ansible-playbook will make changes). To assign its value, first download the JDK mentioned above to the exact same path on every DataNode & the NameNode, then assign that path to this variable.
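As a rough illustration, a variable file for this setup might look like the sketch below. Only the variable name “jdk” comes from this blog. The file name and the path are placeholders assumed for illustration, so check the actual file in the repository before relying on it.

```yaml
# vars.yml (hypothetical file name; the real file lives in the GitHub repo)

# Path of the downloaded JDK on every managed node.
# The value below is a placeholder: use the location where you saved the JDK,
# and keep it identical on the NameNode and all DataNodes.
jdk: /root/JDK_FILE_NAME
```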

Inventory Groups!

In the Ansible inventory, make sure to assign the group name “MasterNode” to all the machines on which the NameNode has to be configured.

The group name for all the DataNodes should be “WorkerNodes”.
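A minimal inventory with these two groups could look like the sketch below, written in Ansible's YAML inventory format. The group names are the ones required above. The IP addresses are made-up examples from the documentation range and must be replaced with the addresses of your own machines.

```yaml
# inventory.yml (hypothetical file name)
all:
  children:
    MasterNode:          # machines that will run the NameNode
      hosts:
        192.0.2.10:      # example address, replace with your NameNode
    WorkerNodes:         # machines that will run the DataNodes
      hosts:
        192.0.2.21:      # example addresses, replace with your DataNodes
        192.0.2.22:
```

The same grouping can also be written in the INI inventory format, with `[MasterNode]` and `[WorkerNodes]` section headers followed by the host addresses.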

Configuration Files for the Cluster!

The most important configuration files which are required for this cluster setup are:

1. hdfs-site.xml (For NameNode)!

HDFS-Site.XML for NameNode!

2. hdfs-site.xml (For DataNode)!

HDFS-Site.XML for DataNode!

3. core-site.xml (For both DataNode & NameNode, as it is the same for both the nodes)!

core-Site.XML for DataNode & NameNode!
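To give an idea of what these three files contain, the sketch below writes minimal versions of them from an Ansible playbook. It is not the exact content of the files in the repository. The property names `dfs.name.dir`, `dfs.data.dir` and `fs.default.name` are the standard Hadoop 1.x ones, while the directories (`/nn`, `/dn`), the configuration directory, the port and all variable names are assumptions chosen for illustration.

```yaml
# Sketch only: writes minimal Hadoop 1.x configuration files from inline
# content. All directories, the port, and the variable names are assumptions.
- name: Configure the NameNode
  hosts: MasterNode
  vars:
    hadoop_conf_dir: /etc/hadoop   # assumption: adjust to your install layout
    namenode_dir: /nn              # hypothetical metadata directory
  tasks:
    - name: Create the NameNode metadata directory
      ansible.builtin.file:
        path: "{{ namenode_dir }}"
        state: directory

    - name: Write hdfs-site.xml for the NameNode
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/hdfs-site.xml"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.name.dir</name>
              <value>{{ namenode_dir }}</value>
            </property>
          </configuration>

- name: Configure the DataNodes
  hosts: WorkerNodes
  vars:
    hadoop_conf_dir: /etc/hadoop   # assumption: adjust to your install layout
    datanode_dir: /dn              # hypothetical block storage directory
  tasks:
    - name: Create the DataNode storage directory
      ansible.builtin.file:
        path: "{{ datanode_dir }}"
        state: directory

    - name: Write hdfs-site.xml for the DataNodes
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/hdfs-site.xml"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.data.dir</name>
              <value>{{ datanode_dir }}</value>
            </property>
          </configuration>

- name: Point every node at the NameNode
  hosts: MasterNode:WorkerNodes
  vars:
    hadoop_conf_dir: /etc/hadoop   # assumption: adjust to your install layout
    namenode_port: 9001            # hypothetical port, any free port works
  tasks:
    - name: Write core-site.xml (identical on all nodes)
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/core-site.xml"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://{{ groups['MasterNode'][0] }}:{{ namenode_port }}</value>
            </property>
          </configuration>
```

Here `groups['MasterNode'][0]` resolves to the first host listed in the “MasterNode” inventory group, which is why core-site.xml can be the same on every node: each DataNode only needs to know where the NameNode is.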

Ansible-Playbook to configure everything!

When the ansible-playbook given below is executed, it always asks “Whether to format the NameNode Directory or not”. You should enter “Yes” the first time only, because formatting wipes all the metadata present in the NameNode. The first format is required for the NameNode to work properly, but once there is data in the Hadoop cluster, formatting the NameNode directory again will destroy every link to that data.

Therefore, be cautious with the prompt of this script!
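A prompt of this kind can be built with Ansible's `vars_prompt`, as in the sketch below. This is an illustration under assumptions rather than the playbook from the repository: the variable name `format_namenode` and the default answer are hypothetical, and it assumes the `hadoop` command is already on the PATH of the NameNode.

```yaml
# Sketch only: ask once per run, and format the NameNode only on "Yes".
- name: Optionally format the NameNode
  hosts: MasterNode
  vars_prompt:
    - name: format_namenode        # hypothetical variable name
      prompt: "Whether to format the NameNode Directory or not (Yes/No)"
      private: false               # show the typed answer on screen
      default: "No"                # safe default: never format by accident
  tasks:
    - name: Format the NameNode metadata directory
      # Hadoop 1.x asks for confirmation (capital Y) when the directory
      # already holds metadata, so the answer is piped in.
      ansible.builtin.shell: echo Y | hadoop namenode -format
      when: format_namenode | lower == "yes"
```

Defaulting to “No” means that simply pressing Enter on a later run leaves the existing metadata untouched.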

Complete Ansible-Playbook:

Ansible-Playbook to configure the Hadoop-Cluster!
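For orientation, the final stage of such a playbook, starting the daemons and checking the cluster, could be sketched as follows. The commands `hadoop-daemon.sh` and `hadoop dfsadmin -report` ship with Hadoop 1.x; the play layout and task names are assumptions and may differ from the playbook in the repository.

```yaml
# Sketch only: start the Hadoop 1.x daemons and print a cluster report.
- name: Start the NameNode
  hosts: MasterNode
  tasks:
    - name: Start the NameNode daemon
      ansible.builtin.command: hadoop-daemon.sh start namenode

- name: Start the DataNodes
  hosts: WorkerNodes
  tasks:
    - name: Start the DataNode daemon
      ansible.builtin.command: hadoop-daemon.sh start datanode

- name: Verify the cluster
  hosts: MasterNode
  tasks:
    - name: Ask the NameNode which DataNodes have joined
      ansible.builtin.command: hadoop dfsadmin -report
      register: report
      changed_when: false          # read-only check, never a change

    - name: Show the report
      ansible.builtin.debug:
        var: report.stdout_lines
```

Assuming the inventory is saved as `inventory.yml` and the playbook as `cluster.yml` (both hypothetical names), it would be run with `ansible-playbook -i inventory.yml cluster.yml`. Note that the start tasks are not idempotent, so rerunning them while the daemons are already up may report an error.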

I hope my article explains everything related to the topic, with all the deep concepts and explanations. Thank you so much for investing your time in reading my blog & boosting your knowledge. If you like my work, then I request you to applaud this blog & follow me on Medium & GitHub!

The Startup

Medium's largest active publication, followed by +752K people. Follow to join our community.

Harshit Dawar

Written by

Big Data Enthusiast, have a demonstrated history of delivering large and complex projects. Interested in working in the field of AI and Data Science.
