
Deploying Multi-Node Hadoop Cluster on AWS Using Ansible Automation

In today's world, all big companies have millions of users & billions of pieces of their data. But have you ever wondered how these companies store such a huge amount of data daily? Let's see what tools they use & how we can also set up these tools on our own…

Raktim Midya
Published in Geek Culture
Apr 30, 2021 · 13 min read

Big Data :


We all know there are millions of people using the Internet daily & most of them use Social Media platforms like Facebook, Instagram, YouTube etc. If we take a second & think about it, we don't pay anything to these Social Media platforms, so how do these companies do their business? The answer is simple…

“If you're not paying for the product, then you are the product”

These companies have billions of pieces of data about their users & using that data, they do their business. Now the big challenge is: how are they able to store such a huge amount of data? They also need to store this data effectively, so that in future they can read it very quickly. Here comes the role of the Distributed Storage Cluster.

A few days back I wrote one blog where I discussed how a Distributed Storage Cluster works to solve the challenge of Big Data. For reference I have attached the link below…

Hadoop :

  • In my above-mentioned blog, I noted that there are lots of technological tools available to implement a Distributed Storage Cluster. One of the most popular tools in this context is Hadoop.

Hadoop is a framework that allows us to store Big Data in a distributed environment so that we can process it in parallel. There are basically two components in Hadoop: HDFS & YARN.

HDFS (Hadoop Distributed File System) helps us store data of various formats across the cluster. YARN handles resource management in Hadoop. It allows parallel processing over the data stored in HDFS.

In this blog, I don't want to dive deep into Hadoop; instead I want to show how to create Ansible Roles to set up a Hadoop Distributed Storage Cluster on AWS EC2 Instances. But if you're curious to know more about Hadoop, you can refer to this link…

Let's see the Problem Statement :

  1. Create Ansible Role to launch 4 AWS EC2 Instances.
  2. Dynamically fetch the IPs & create the Inventory to run the further Ansible Roles on those Instances.
  3. Create Role to configure Hadoop Name Node (Master), Data Node (Worker) & Client Node.
  4. Finally, configure the 1st & 2nd Instances as Name Node & Client Node respectively, & configure the other 2 systems as Data Nodes.

Pre-requisite :

Looking at the problem statement, it might seem very small & simple, but there are a lot of components we need to take care of to achieve this setup. But don't worry, because as always I will discuss each & every small point so that by the end of the blog, you feel comfortable with this practical.

There are a few pre-requisites to understand this practical :

  • Although I will show how to write the code to complete this practical, you definitely need basic knowledge of AWS EC2 instances & Ansible Roles. If you don't have this knowledge, you can refer to the blogs mentioned below…
  • I’m using my local Windows machine, where I use Oracle VirtualBox to run a Linux OS. Inside my VM I have Ansible version 2.10.4 installed. Also, to make things easier, I connected my RHEL8 VM to VS Code on my Windows system via SSH.
  • Lastly, we need one AWS account, & there, using IAM, we need to create an AWS Access Key & Secret Key so that Ansible can log in to our AWS account to launch the Instances.

Let’s start running the commands & writing the codes…

Creating four Ansible Roles :

I will upload my Ansible Roles & code to GitHub; the link is at the end of this blog. In this blog I will discuss each bit of that code, so keep on reading to learn it.

  • Create one workspace, let’s say “hadoop-ws”. Go inside this workspace & create one folder called “roles”. Now go inside this folder & run the below mentioned four commands…
ansible-galaxy init ec2
ansible-galaxy init hadoop_master
ansible-galaxy init hadoop_slave
ansible-galaxy init hadoop_client
  • Remember that this will create four Ansible Roles inside the “hadoop-ws/roles/” folder. You can give your Roles whatever names you want, but I suggest giving them logical names.

Setting up Ansible Configuration File :

Now inside the “hadoop-ws” workspace we are going to create one local configuration file. Every ansible command we use from now on will be run inside this workspace (“hadoop-ws”), so that ansible reads this local configuration file & works accordingly.

So, inside “hadoop-ws” folder create one file called “ansible.cfg” & put the below mentioned content in it…
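A minimal “ansible.cfg” along these lines should work; the exact option set is an assumption reconstructed from the keywords discussed below, & the key-pair file name matches the one created in a later section:

```ini
[defaults]
; don't prompt to verify new SSH host keys (fresh EC2 instances)
host_key_checking = False
; suppress warnings when we use raw commands instead of modules
command_warnings = False
; default login user on Amazon Linux / RHEL EC2 instances
remote_user = ec2-user
; the AWS key pair downloaded into this workspace
private_key_file = hadoop_instance.pem

[privilege_escalation]
become = True
become_method = sudo
become_user = root
become_ask_pass = False
```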

  • Here we can see some common keywords like “host_key_checking”, “command_warnings” etc. As I already mentioned, basic Ansible knowledge is required to understand this practical, so I believe you know these keywords.
  • Let me explain some new keywords, like “private_key_file”, which points to the AWS key pair. When Ansible logs in to the AWS instances via SSH to set up Hadoop, it needs this private key file. Also, the default remote user of an EC2 Instance is “ec2-user”.

Creating AWS Key-pair & putting it in the Workspace :

Go to AWS => EC2 => Key-pairs & create one key pair there, let’s say “hadoop_instance.pem”. Then download the key into your VM workspace “hadoop-ws”. Finally run…

chmod 400 hadoop_instance.pem

Creating Ansible Vault to store the AWS Credentials :

Lastly on your workspace run…

ansible-vault create cred.yml
  • It will ask you to provide a vault password & then it will open the vi editor on Linux. Create two variables in this file & put your AWS access key & secret key as their values. For example…
access_key: ABCDEFGHIJK
secret_key: abcdefghijk12345
  • Save the file. Now you are finally ready to write Ansible Roles.

Just for reference run these below mentioned commands & observe the outputs in the below screenshot…

Screenshot of terminal

Writing Code for ec2 Role :


Task YML file :

Go inside the folder “hadoop-ws/roles/ec2/tasks/” & start editing the “main.yml” file. In this file write the below mentioned code…

  • This file tells Ansible to go to AWS & launch 4 Instances & 1 Security Group there. Let's try to understand, one by one, what's happening in this file.

Remember that I will run this file on my localhost & Ansible will contact the AWS API to send the requests. Ansible is written on top of Python, & it needs the Boto & Boto3 Python libraries to contact the AWS API. That's why, using the pip module, we first install those two libraries on our localhost. Here I used the “python_pkgs” variable & its values are stored in the variable file, which I will show afterwards.

  • Next we use the “ec2_group” module to create the AWS security group. To keep things simple I just allowed all ports in both the inbound & outbound rules, but in a real scenario we always configure the best possible security. Here I'm using two variables called “sg_name” & “region_name”. The values of these variables will go in the variables file.
  • Next we use the “ec2” module to launch the EC2 instances. Here I'm using a loop to call this module multiple times with different tags. Again, I'm using some variables, & those variables & their values are mentioned in the variables file.
  • A very important thing to note here is that I'm using “register” to store the metadata in one variable called “ec2”, so that in future I can retrieve important information from this metadata. If you know the basics of AWS, you definitely know why we are using those options under the ec2 module. Lastly, we set “wait” to true because I want Ansible to move to the next step only once the Instance has launched successfully.
  • Next I'm using the “add_host” module, which has the capability to create dynamic host groups. Note that I haven't created any “inventory” file in my system, because I want this “add_host” module to create that information. We can't see what “add_host” is doing, but you can think of it as creating variables that contain the IP addresses we will use as our inventory. If you want to see what “add_host” is creating, you can use the “debug” module & print the host group.
  • Finally, we do JSON parsing on that metadata to find the public IP of each instance & then we add those IPs to our dynamic host groups. We create 3 host groups, called “namenode”, “datanode” & “clientnode”. “namenode” contains one Instance IP, “datanode” contains 2 Instance IPs & “clientnode” contains 1 Instance IP.
  • Lastly, I'm using the “wait_for” module to make sure the SSH service has started & is available for connection.
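The flow described above can be sketched roughly like this; the variable names (“python_pkgs”, “sg_name”, “region_name”, “instance_flavour”, “ami_id”, “key_name”, “os_names”) & the exact result-parsing paths are assumptions reconstructed from the description, using the classic “ec2” & “ec2_group” modules of the Ansible 2.9/2.10 era:

```yaml
# roles/ec2/tasks/main.yml -- a sketch, not the author's exact file
- name: Install boto & boto3 so Ansible can talk to the AWS API
  pip:
    name: "{{ python_pkgs }}"
    state: present

- name: Create a security group (all ports open -- demo only)
  ec2_group:
    name: "{{ sg_name }}"
    description: Security group for the Hadoop cluster
    region: "{{ region_name }}"
    aws_access_key: "{{ access_key }}"
    aws_secret_key: "{{ secret_key }}"
    rules:
      - proto: all
        cidr_ip: 0.0.0.0/0
    rules_egress:
      - proto: all
        cidr_ip: 0.0.0.0/0

- name: Launch the four EC2 instances, one per tag name
  ec2:
    key_name: "{{ key_name }}"
    instance_type: "{{ instance_flavour }}"
    image: "{{ ami_id }}"
    wait: true
    group: "{{ sg_name }}"
    count: 1
    instance_tags:
      Name: "{{ item }}"
    region: "{{ region_name }}"
    aws_access_key: "{{ access_key }}"
    aws_secret_key: "{{ secret_key }}"
  register: ec2
  loop: "{{ os_names }}"

- name: Put the 1st instance's public IP in the namenode host group
  add_host:
    name: "{{ ec2.results[0].instances[0].public_ip }}"
    groups: namenode

- name: Put the 2nd & 3rd instances' public IPs in the datanode host group
  add_host:
    name: "{{ item.instances[0].public_ip }}"
    groups: datanode
  loop: "{{ ec2.results[1:3] }}"

- name: Put the 4th instance's public IP in the clientnode host group
  add_host:
    name: "{{ ec2.results[3].instances[0].public_ip }}"
    groups: clientnode

- name: Wait until SSH is reachable on every new instance
  wait_for:
    host: "{{ item.instances[0].public_ip }}"
    port: 22
    state: started
  loop: "{{ ec2.results }}"
```

Because “register” is used inside a loop, the metadata lands in “ec2.results”, one entry per launched instance; that is why the JSON parsing indexes into “results” before reaching “instances[0].public_ip”.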

Vars YML file :

Open the “hadoop-ws/roles/ec2/vars/main.yml” file & store all the variables that we mentioned in the “tasks/main.yml” file along with their respective values. For reference I have attached the file…
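An illustrative vars file could look like this; the AMI ID, region, instance type & key name below are placeholders, so substitute the values from your own AWS account:

```yaml
# roles/ec2/vars/main.yml -- illustrative values only
python_pkgs:
  - boto
  - boto3
sg_name: hadoop_sg
region_name: ap-south-1          # placeholder region
instance_flavour: t2.micro
ami_id: ami-xxxxxxxxxxxxxxxxx    # placeholder AMI (e.g. an Amazon Linux 2 image)
key_name: hadoop_instance        # the key pair created earlier
os_names:                        # tag names, in launch order
  - namenode
  - datanode1
  - datanode2
  - clientnode
```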

Writing Code for hadoop_master Role :


Task YML file :

Just like last time, open the “hadoop-ws/roles/hadoop_master/tasks/main.yml” file & put the below mentioned code there…

  • This entire file sets up the Hadoop Name Node (Master Node). Let's understand the modules written in this file one by one…
  • First I’m using the “get_url” module to download the Hadoop & Java software, & then using the “command” module I install the downloaded software. The variable “pkgs_name” is stored in the “vars/main.yml” file, which I will show afterwards.

Next we need to configure our Hadoop Name Node, & for that we need to write some data inside the “/etc/hadoop/core-site.xml” & “/etc/hadoop/hdfs-site.xml” files.
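The two configuration files can be sketched like this; the port number 9001, the property names (which follow Hadoop 1.x conventions) & the “master_dir” variable are assumptions, so adapt them to your Hadoop version:

```xml
<!-- roles/hadoop_master/files/core-site.xml (sketch):
     declares that this very system runs the Name Node service.
     0.0.0.0 lets the daemon listen on all interfaces; 9001 is an
     assumed port, any free port works if every node uses the same. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>

<!-- roles/hadoop_master/templates/hdfs-site.xml.j2 (sketch):
     the one Jinja2 variable is filled in by the template module. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>{{ master_dir }}</value>
  </property>
</configuration>
```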

  • Inside the “core-site.xml” file we need to mention some keywords & declare that this current system is our Name Node. Using the copy module I copy the file “hadoop-ws/roles/hadoop_master/files/core-site.xml” to the Name Node as “/etc/hadoop/core-site.xml”, because Hadoop reads its configuration from the “/etc/hadoop/” folder. For reference I have mentioned the file below…
  • Next I used the template module to copy the “hdfs-site.xml.j2” file to the Name Node. The template module does three things in one step: first, it picks the “hdfs-site.xml.j2” file from the “hadoop-ws/roles/hadoop_master/templates” folder, inside which I have mentioned one variable. Second, it replaces that variable with its value. Lastly, it copies the file to the Name Node & renames it to “hdfs-site.xml”. For reference I have mentioned the file below…
  • Next our code creates the directory on the Name Node where it's going to store the metadata of the cluster. For that I used the “file” module.
  • Finally, in the last 3 steps we format the Name Node & then start the Name Node service. To make the service permanent, we put the service start command in the “/etc/rc.d/rc.local” file.
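Putting the steps above together, the tasks file can be sketched like this; the rpm file names, the use of “hadoop-daemon.sh”, & the “pkgs_name”/“master_dir” variables are assumptions based on a typical Hadoop 1.2.1 + JDK 8 rpm setup:

```yaml
# roles/hadoop_master/tasks/main.yml -- a sketch, not the author's exact file
- name: Download the Hadoop & Java rpm packages
  get_url:
    url: "{{ item.url }}"
    dest: "{{ item.dest }}"
  loop: "{{ pkgs_name }}"

- name: Install Java (rpm exits non-zero if already installed)
  command: rpm -ivh /root/jdk-8u171-linux-x64.rpm
  ignore_errors: true

- name: Install Hadoop
  command: rpm -ivh /root/hadoop-1.2.1-1.x86_64.rpm --force
  ignore_errors: true

- name: Copy core-site.xml, marking this node as the Name Node
  copy:
    src: core-site.xml
    dest: /etc/hadoop/core-site.xml

- name: Render hdfs-site.xml from the Jinja2 template
  template:
    src: hdfs-site.xml.j2
    dest: /etc/hadoop/hdfs-site.xml

- name: Create the directory that will hold the cluster metadata
  file:
    path: "{{ master_dir }}"
    state: directory

- name: Format the Name Node directory (answer the Y/N prompt)
  shell: echo Y | hadoop namenode -format
  ignore_errors: true

- name: Start the Name Node daemon
  command: hadoop-daemon.sh start namenode
  ignore_errors: true

- name: Make the Name Node service start on every boot
  lineinfile:
    path: /etc/rc.d/rc.local
    line: hadoop-daemon.sh start namenode
```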

Vars YML file :

Open the “hadoop-ws/roles/hadoop_master/vars/main.yml” file & store the variables. For reference I attached the file…
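It could look roughly like this; the download URLs are placeholders for wherever you host the two rpm files, & “/nn” is an assumed metadata directory:

```yaml
# roles/hadoop_master/vars/main.yml -- illustrative values only
pkgs_name:
  - url: https://example.com/hadoop-1.2.1-1.x86_64.rpm   # placeholder URL
    dest: /root/hadoop-1.2.1-1.x86_64.rpm
  - url: https://example.com/jdk-8u171-linux-x64.rpm     # placeholder URL
    dest: /root/jdk-8u171-linux-x64.rpm
master_dir: /nn
```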

Writing Code for hadoop_slave Role :

Task YML file :

Just like last time, open the “hadoop-ws/roles/hadoop_slave/tasks/main.yml” file & put the below mentioned code there…

  • Here the first three steps are exactly the same as in the previous Name Node task file. Let's understand what comes after that…

We use the “template” module twice to copy both the “core-site.xml.j2” & “hdfs-site.xml.j2” files to the Data Node. I have already discussed what the template module does in the previous code, but there are slight changes in these two template files. Let’s understand them…

  • In this “core-site.xml” file we need to tell the Data Node where our Name Node is, so that the Data Node can connect to it. That’s why we need to give the Data Node the IP address & the port number on which the Name Node service is running. Here I’m using the “hostvars” keyword, which has the capability to go inside a host group & fetch all the metadata collected by Gathering Facts.
  • Now there are multiple host groups & hosts, but here we need to fetch information from the “namenode” host group. That's why we use the “groups” keyword. Next we use “groups[‘namenode’][0]” to pick the first host from the “namenode” host group. Now “hostvars” knows which host it needs to fetch the metadata from.
  • Next it's time to filter the metadata & search for the IP. For that we use the “ansible_all_ipv4_addresses” fact. This Ansible fact contains all the private IPs of the Name Node host. I used “[0]” to pick only the 1st private IP from the list. This is the method of dynamic configuration using Ansible Facts.
  • The “hdfs-site.xml.j2” file uses the same concept as the “hdfs-site.xml” file from the “hadoop_master” role; the only difference is that here I'm declaring that the current node is a Data Node.
  • Then in the main code I used the “file” module to create the folder that we are going to use to store data on the Data Node. On the Data Node we don't need to format it; we just need to start the Data Node service. Then I make the service permanent like last time.
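The dynamic lookup described above boils down to one Jinja2 expression in the template; this is a sketch, assuming port 9001 to match the master's core-site.xml:

```xml
<!-- roles/hadoop_slave/templates/core-site.xml.j2 (sketch):
     hostvars -> the "namenode" group -> its first host -> its first
     private IPv4 address, resolved at play time from gathered facts. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ hostvars[groups['namenode'][0]]['ansible_all_ipv4_addresses'][0] }}:9001</value>
  </property>
</configuration>
```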

Vars YML file :

Open the “hadoop-ws/roles/hadoop_slave/vars/main.yml” file & store the variables. For reference I attached the file…

Writing Code for hadoop_client Role :

Task YML file :

Just like last time, open the “hadoop-ws/roles/hadoop_client/tasks/main.yml” file & put the below mentioned code there…

  • On the client system we again download & install the Hadoop & Java software, like we previously did on the master & slave systems.
  • On the client system we only need to set up the “core-site.xml” file so that the client knows where the Name Node is located. Again I'm using the template module & applying the same concept as in the slave role. For reference I have attached the file below…
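The client role is therefore the shortest of the three; a sketch, with the same assumed rpm names & “pkgs_name” variable as in the master role:

```yaml
# roles/hadoop_client/tasks/main.yml -- a sketch, not the author's exact file
- name: Download the Hadoop & Java rpm packages
  get_url:
    url: "{{ item.url }}"
    dest: "{{ item.dest }}"
  loop: "{{ pkgs_name }}"

- name: Install Java
  command: rpm -ivh /root/jdk-8u171-linux-x64.rpm
  ignore_errors: true

- name: Install Hadoop
  command: rpm -ivh /root/hadoop-1.2.1-1.x86_64.rpm --force
  ignore_errors: true

- name: Point the client at the Name Node via core-site.xml
  template:
    src: core-site.xml.j2     # same hostvars lookup as in the slave role
    dest: /etc/hadoop/core-site.xml
```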

Vars YML file :

Open the “hadoop-ws/roles/hadoop_client/vars/main.yml” file & store the variables. For reference I attached the file…

Finally Create the Setup file :

  • Now it’s finally time to create the “setup.yml” file, which we are going to run to create this entire infrastructure on AWS. Remember that we need to create this file inside the “hadoop-ws” folder. For reference I have attached the file below…
  • As you can see, we run the “ec2” role on our localhost first, because it contacts the AWS API from our localhost. Also, using “vars_files” I included the “cred.yml” file in this play so that the “ec2” role can access it.
  • In the next three plays we run the “hadoop_master”, “hadoop_slave” & “hadoop_client” roles on the “namenode”, “datanode” & “clientnode” dynamic host groups respectively.
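The four plays described above can be sketched as follows; the host group names come from the “add_host” tasks in the ec2 role:

```yaml
# hadoop-ws/setup.yml -- a sketch, not the author's exact file
- hosts: localhost            # launch the infrastructure from here
  vars_files:
    - cred.yml                # AWS access & secret keys from the vault
  roles:
    - role: ec2

- hosts: namenode             # dynamic group created by add_host
  roles:
    - role: hadoop_master

- hosts: datanode
  roles:
    - role: hadoop_slave

- hosts: clientnode
  roles:
    - role: hadoop_client
```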

GitHub Repository for Reference :

That’s all for the coding part. Now it’s time to run the playbook. For that run the below mentioned command in “hadoop-ws” folder.

ansible-playbook setup.yml --ask-vault-pass
  • Next it will prompt you for the password of your Ansible Vault (the cred.yml file). Provide the password & then you will see the power of automation…
Ansible Playbook Workflow

Checking your cluster :

Just log in to the client node & run the below mentioned Hadoop command to check whether your cluster is working fine or not. For reference I have shared the output of that command below…

hadoop dfsadmin -report
Client node terminal screenshot

Final Words :

  • There are endless future possibilities for learning Ansible, AWS & Hadoop. This is just a simple demo of Ansible Roles, but if you want, you can create a much bigger infrastructure by adding more modules. We can achieve every kind of configuration using Ansible.

Note : This is the 1st part of this practical. In the next blog we are going to extend this practical & add a Hadoop MapReduce Cluster (Compute Cluster) setup using Ansible Roles. I have also published this entire practical as a video, so if you want, you can check out the video mentioned below.

  • I tried to make it as simple as possible. Hope you learned something from here. I keep writing blogs on Machine Learning, DevOps Automation, Cloud Computing, Big Data Analysis etc. So, if you want to read my future blogs, follow me on Medium. You can also ping me on LinkedIn; check out my LinkedIn profile below…

Thanks Everyone for reading. That’s all… Signing Off… 😊


