
Hadoop MapReduce Multi-node Cluster over AWS using Ansible Automation

Analysis is coordinated back and forth by the Job Tracker and Task Tracker nodes that are part of the MR Cluster, giving us the benefit of running the analysis program on distributed computing resources.

Akanksha Singh
15 min read · May 27, 2021


Hello Readers, this blog is an extended version of my previous blog, in which I discussed the HDFS cluster setup (the distributed storage file system provided by Hadoop); we did that setup on AWS using the configuration management tool "Ansible".

🤔 What's new that we are going to discuss here?

  • Hadoop Distributed Computing Cluster
  • Working of Job Tracker Node
  • Working of Task Tracker Node
  • How does Hadoop provide an internal sorting program?
  • How to set up the Job Tracker?
  • How to set up the Task Tracker?

Hadoop, being a Big Data storage and analysis tool, is rarely used with anything less than a multi-node setup. Practically, when we implement Hadoop DFS (Distributed File System) or Hadoop DCC (Distributed Computing Cluster), the data file is striped and then distributed across the Slave Nodes, and the metadata, or say the formatted file system, keeps track of each stored piece of data, which later helps in fast retrieval of the raw data.

Hadoop Distributed Computing Cluster:

Fig 1. Hadoop Distributed Computing Cluster

The MapReduce framework is there just to define the data processing task: it focuses on the logic that has to be applied to the raw data. The Distributed Computing Cluster then runs that data processing task across multiple machines, managing memory, managing processing, etc. The user defines the mapper and reducer tasks using the MapReduce API and submits them to the Job Tracker Node; the API lets us work with Java and other supported languages. The mapper and reducer programs are packed into jobs which are carried out by the cluster in Hadoop.

As we know, HDFS is a file system that is used to manage the storage of data across the machines in a cluster, whereas MapReduce is a framework to process the data across multiple servers. So it takes data from the HDFS cluster and populates the mapper and reducer programs onto the Task Tracker Nodes, which work on the raw data set and give the required output. The output is again data, so it is stored back in the Hadoop Distributed Storage Cluster.

Hadoop is used for Big Data processing and business analytical operations. To get business reports, we continue with business integration and analysis by solving use cases with Java (or any other language) programs; the cluster distributes the data and the program, sorts the intermediate output, and then the reducer functions produce the final results. This is called a "MapReduce Multi-Node Hadoop Cluster".

Working of Job Tracker Node:

The Job Tracker's functions are resource management, tracking resource availability, and tracking task progress for fault tolerance. The Job Tracker communicates with the Name Node to determine the location of the data and finds Task Tracker nodes to execute the tasks on the given nodes. It tracks the execution of MapReduce jobs from the client to the Slave Nodes, and it is the point of communication between the HDFS cluster and the Task Tracker nodes for data distribution along with the processing units.

Fig 2. Flow Diagram of Hadoop Data Flow

Working of Task Tracker Node:

Every Task Tracker is configured with a set of slots; these indicate the number of tasks that it can accept. When the Job Tracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the Data Node containing the data, and if there is none, it looks for an empty slot on a machine in the same rack.

How does Hadoop provide an internal sorting program?

While researching how the Mapper and Reducer programs actually work, I concluded the following: the mapper program imports data from the HDFS cluster and also regulates sorting, and that sorting is what makes the reducer program run faster and more accurately over a big dataset. In other words, there are built-in algorithms that handle both filtering and managing the data alongside the reducer programs.

The reducer mainly works on an R-way merge sort algorithm, while behind the scenes the Hadoop mapper internally provides a built-in sorting algorithm, Quick Sort, that runs on the Task Tracker Nodes and, before pushing the data to the reducer program, filters and manages the data according to the use case of the user or organization.

Reducer: merge sort is used on the reduce side and is the default in MapReduce. One cannot change the MapReduce sorting method; the reason is that data comes from different nodes to a single point, so the best algorithm that can be used here is merge sort.

How to set up the Job Tracker?

The Job Tracker talks to the Name Node to determine the location of the data. The Job Tracker locates Task Tracker nodes with available slots at or near the data. The Job Tracker submits the work to the chosen Task Tracker nodes. The Task Tracker nodes are monitored; if they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different Task Tracker. A Task Tracker will notify the Job Tracker when a task fails. The Job Tracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the Task Tracker as unreliable. When the work is completed, the Job Tracker updates its status. Client applications can poll the Job Tracker for information.

How to set up the Task Tracker?

The Task Tracker spawns separate JVM processes to do the actual work; this is to ensure that a process failure does not take down the Task Tracker itself. The Task Tracker monitors these spawned processes, capturing the output and exit codes. When a process finishes, successfully or not, the tracker notifies the Job Tracker. The Task Trackers also send out heartbeat messages to the Job Tracker, usually every few minutes, to reassure the Job Tracker that it is still alive. These messages also inform the Job Tracker of the number of available slots, so the Job Tracker can stay up to date with where in the cluster work can be delegated. Here the mapper programs run in parallel, whereas the reducer program acts as a combiner of their outputs.

Let me direct you to the practical part now.

We will create two clusters here: an MR Cluster (MapReduce Cluster) and an HDFS (Hadoop Distributed File System) Cluster, which we will use for client data processing and storage respectively. To perform any operation on data we need a program and, most importantly, the data in one place; but as discussed earlier, the Task Tracker (TT) Nodes are assigned the processing part by the Job Tracker (JT) Node, and they evaluate the task in two phases, i.e. with the Mapper program and with the Reducer program, to get the required output. The output is then stored in the Data Nodes (DN), as the JT Node and NN (Name Node) depend on each other for this whole Big Data process. Hence these are called the Hadoop Distributed Computing and Hadoop Distributed Storage File System features.

In this practical we are including 9 live instances: we will configure 3 nodes as Data Nodes, 3 nodes as Task Tracker Nodes, and the remaining 3 will be one Name Node, one Job Tracker Node and, lastly, a Hadoop Client Node where we will mention the addresses of both clusters. These 9 instances will be launched over the public cloud (AWS) using the EC2 service, dynamically, with the help of Ansible roles. We will further configure the cluster part using dynamic inventory concepts, and those host groups will divide the instances into 5 different categories, i.e.

Fig 3. Practical Setup of our HDFS and MR Cluster
  1. Name Node: Name Node is also known as the Master. Name Node only stores the metadata of HDFS — the directory tree of all files in the file system, and tracks the files across the cluster.
  2. Job Tracker Node: The Job Tracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack. Client applications submit jobs to the Job tracker.
  3. Hadoop Client Node: The client communicates with the Name Node for storing data and with the Job Tracker for processing it with a set of programs to get the desired output.
  4. Task Tracker Node: A Task Tracker is a node in the cluster that accepts tasks — Map, Reduce and Shuffle operations — from a Job Tracker. Every Task Tracker is configured with a set of slots; these indicate the number of tasks that it can accept.
  5. Data Node: Data Node is also known as the Slave. Name Node and Data Node are in constant communication. When a Data Node starts up it announces itself to the Name Node along with the list of blocks it is responsible for.

My Local Machine Configurations:
> RHEL 8 VM on top of VBox Manager with 2 CPU, 4GB RAM.
> Ansible version 2.10.4 installed.
> Proper Network connectivity using Bridge Adapter.

Step 1: Create Ansible Configuration file:

Ansible, being an agentless automation tool, needs a configuration file and an inventory on the controller node, which I mentioned will be our local system. The configuration file can either be created globally on the controller node at the path /etc/ansible/ansible.cfg, or created in the workspace where we are going to run our playbooks/roles.

Create Workspace for this Project:

# mkdir hadoop-ws
# cd hadoop-ws
# mkdir role

Configuration file:

Fig 4. ansible.cfg File
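
Since the screenshot above is not reproduced here, the following is a minimal sketch of what ansible.cfg could contain for this setup; the inventory file name, remote user and paths are assumptions that you would adapt to your own environment.

[defaults]
# inventory file in the workspace (filled in later by the ec2 role / dynamic inventory)
inventory         = ./ip.txt
host_key_checking = False
remote_user       = ec2-user
private_key_file  = ./hadoop_instance.pem
roles_path        = ./role

[privilege_escalation]
become          = true
become_method   = sudo
become_user     = root
become_ask_pass = false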

For explanations of the above file Visit!

Step 2: Create two files

  • First:

An Ansible Vault file named cred.yml, which contains the IAM access key and secret key for authentication to your AWS account. In your directory 'hadoop-ws', create it using ansible-vault create cred.yml (give a password when prompted).

file format:
access_key: GUJGWDUYGUEWVVFEWGVFUYV
secret_key: huadub7635897^%&hdfqt57gvhg
  • Second:

Create a file named hadoop_instance.pem, which is the key pair that we use to create the EC2 instances remotely on our AWS account.

Steps:

  • Go to AWS Management Console
  • EC2 dashboard
  • Key-pairs
  • Create new Key-Pair
  • Give name as hadoop_instance
  • Select ‘.PEM’ file format
  • Download the key to your local system
  • Transfer the key to the Ansible controller node, into the same workspace directory as your roles, hadoop-ws

Step 3: Creating Ansible Roles

Next, we will create six main roles, i.e.
◼ AWS-EC2 instance Creation and dynamic IP retrieval Role,
◼ Hadoop-Master Configuration and Starting Name Node Service Role,
◼ Hadoop-Slave Configuration and Starting Data Node Service Role,
◼ Hadoop-Client Configuration and Registration to Cluster,
◼ Hadoop-JobTracker Configuration and Starting Job Tracker Service Role,
◼ Hadoop-TaskTracker Configuration and Starting Task Tracker Service Role.

# cd role
# ansible-galaxy init ec2
# ansible-galaxy init hadoop_master
# ansible-galaxy init hadoop_slave
# ansible-galaxy init hadoop_client
# ansible-galaxy init hadoop_jobtracker
# ansible-galaxy init hadoop_tasktracker

Now we have all the tasks, templates, vars files and other Ansible artifacts laid out in the known file structure that is pre-created inside each role; we just need to write the declaration/description (in YAML) of everything we need, using modules and Jinja attribute annotations.

To know more about roles visit!

Step 4: Writing EC2 role:

The ec2 module of Ansible provides the ability to launch and provision instances over the AWS cloud. We have preferred t2.micro as the instance_type and an Amazon Linux 2 image as the AMI. We also have a security group allowing all traffic from anywhere; instead of that, you may go with only SSH and the Hadoop-related inbound/outbound rules, like ports 9001, 50070, 9002, 10020 and a few more.

# cd role/ec2/tasks
# vim main.yml
Fig 5. Task File for EC2 Role
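
Because the screenshot is not reproduced here, here is a rough sketch of what the tasks/main.yml of the ec2 role could look like, assuming the classic ec2 and add_host modules, the variable names listed below, and a host group name like 'namenode' that I am inventing for illustration (the exact file is in the GitHub repo).

# roles/ec2/tasks/main.yml (sketch, not the exact file from the screenshot)
- name: Install the Python AWS SDK so the ec2 module can talk to AWS
  pip:
    name: "{{ Python_pkg }}"

- name: Launch the Name Node instance
  ec2:
    key_name: "{{ key_pair }}"
    instance_type: "{{ instance_flavour }}"
    image: "{{ ami_id }}"
    region: "{{ region_name }}"
    group: "{{ sg_name }}"
    vpc_subnet_id: "{{ subnet_name }}"
    count: 1
    wait: yes
    state: present
    instance_tags:
      Name: namenode
    aws_access_key: "{{ access_key }}"
    aws_secret_key: "{{ secret_key }}"
  register: namenode_out

- name: Add the new instance to an in-memory host group
  add_host:
    hostname: "{{ item.public_ip }}"
    groups: namenode
  loop: "{{ namenode_out.instances }}"

# similar blocks (with count: 3 for the datanode and tasktracker groups) create
# the jobtracker, datanode, tasktracker and client host groups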

We have included some variables in the vars folder, like instance_tags, Python_pkg, sg_name, region_name, subnet_name, ami_id, key_pair and instance_flavour. These variables can be called directly in the roles/ec2/tasks/main.yml file, as per the Ansible role layout.

# cd role/ec2/vars
# vim main.yml
Fig 6. Variable File for EC2 Role
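
A sketch of the corresponding vars/main.yml, with placeholder values that are purely illustrative; replace them with your own AMI ID, subnet, region and security group.

# roles/ec2/vars/main.yml (illustrative values only)
Python_pkg: boto3
sg_name: hadoop-sg
region_name: ap-south-1
subnet_name: subnet-xxxxxxxx
ami_id: ami-xxxxxxxxxxxxxxxxx
key_pair: hadoop_instance
instance_flavour: t2.micro
instance_tags: hadoop-node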

Step 5: Writing Hadoop_master role:

The Hadoop master holds the main metadata of the cluster, like the cluster ID and the ports, written in the core-site.xml file in XML format. The critical thing that we need to perform on this master node is the formatting of our master's shared folder, which needs to be done only once. Further, we have to give the port through which the master will communicate with its slaves and client, i.e. port 9001.

Consider the following code, which needs to be written inside main.yml of the hadoop_master/tasks folder.

# cd role/hadoop_master/tasks
# vim main.yml
Fig 7. Tasks for Master Ansible Role
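
The screenshot is missing, so here is a hedged sketch of what these tasks could look like for a Hadoop 1.x setup. The rpm file names and the structure of pkgs_name (a list of url/dest pairs) are my assumptions, not necessarily the original role.

# roles/hadoop_master/tasks/main.yml (sketch)
- name: Download JDK and Hadoop 1.2.1 rpms (URLs held in pkgs_name, assumed here)
  get_url:
    url: "{{ item.url }}"
    dest: "/root/{{ item.dest }}"
  loop: "{{ pkgs_name }}"

- name: Install JDK and Hadoop (forced rpm install, as Hadoop 1.x has old deps)
  command: "rpm -ivh --force /root/{{ item.dest }}"
  loop: "{{ pkgs_name }}"
  ignore_errors: yes

- name: Create the Name Node metadata directory
  file:
    path: "{{ hadoop_folder }}"
    state: directory

- name: Render hdfs-site.xml and core-site.xml from the Jinja2 templates
  template:
    src: "{{ item }}.j2"
    dest: "/etc/hadoop/{{ item }}"
  loop:
    - hdfs-site.xml
    - core-site.xml

- name: Format the Name Node directory (needed only once)
  shell: echo Y | hadoop namenode -format
  ignore_errors: yes

- name: Start the Name Node daemon
  command: hadoop-daemon.sh start namenode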

The Jinja2-format file that we include inside the templates folder is hdfs-site.xml.j2, which gets rendered at the time of master configuration:

# cd role/hadoop_master/templates
# vim hdfs-site.xml.j2
Fig 8. hdfs-site.xml.j2 file
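
As the figure is not reproduced, a minimal hdfs-site.xml.j2 for a Hadoop 1.x Name Node would look roughly like this, with the directory path coming from the hadoop_folder variable:

<?xml version="1.0"?>
<!-- hdfs-site.xml.j2 (sketch): where the Name Node keeps its metadata -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>{{ hadoop_folder }}</value>
  </property>
</configuration>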

We have a variable file that contains the values which are directly called/substituted at the time of role execution. These variables include pkgs_name and hadoop_folder.

# cd role/hadoop_master/vars
# vim main.yml
Fig 9. Variable file for Hadoop-master Role

There are two files in Hadoop that we need to configure: one is core-site.xml and the other is hdfs-site.xml. To configure both of these files through Ansible we have two strategies: either just copy them using the copy module, or copy them while making some changes using the template module. Usually we use template when the data has to be processed at copy time, and these template files are written in the Jinja2 format.

What you need to do is put the core-site.xml.j2 file in the templates folder of our role. So just go to hadoop_master/templates and put the following file.

# cd role/hadoop_master/templates
# vim core-site.xml.j2
Fig 10. core-site.xml.j2 file
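
A minimal core-site.xml.j2 for the Name Node, listening on all interfaces on port 9001 as mentioned above (again a sketch, since the screenshot is not shown here):

<?xml version="1.0"?>
<!-- core-site.xml.j2 on the Name Node (sketch): address on which HDFS is served -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>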

Step 6: Writing Hadoop_slave role:

Now we have to configure the Hadoop slaves; there can be as many Hadoop slaves as we want, depending on how much distributed storage power and computing resources we want to gain. Following is the tasks/main.yml file where all the steps are declared in YAML format, starting with installing dependencies and ending with the final task on the slave that runs the Data Node service.

# cd role/hadoop_slave/tasks
# vim main.yml
Fig 11. Hadoop-Slave Task File
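
The slave tasks mirror the master role, so here is only a trimmed sketch (assuming the same download/install steps as in the master sketch above); the differences are that nothing is formatted and the Data Node daemon is started instead.

# roles/hadoop_slave/tasks/main.yml (sketch): same install steps as the master role
- name: Create the Data Node storage directory
  file:
    path: "{{ hadoop_folder }}"
    state: directory

- name: Render core-site.xml and hdfs-site.xml pointing at the Name Node
  template:
    src: "{{ item }}.j2"
    dest: "/etc/hadoop/{{ item }}"
  loop:
    - core-site.xml
    - hdfs-site.xml

- name: Start the Data Node daemon
  command: hadoop-daemon.sh start datanode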

Following is the vars/main.yml file from our hadoop_slave role, where we have two vars, pkgs_name and hadoop_folder, which are referenced for their respective values.

# cd role/hadoop_slave/vars
# vim main.yml
Fig 12. Hadoop-Slave Variable File

We use the template module in the tasks/main.yml file to copy both files to the slave node after doing some processing (or say, changes) on them; the files are core-site.xml and hdfs-site.xml, which are kept inside the templates folder of the role in Jinja2 format:

# cd role/hadoop_slave/templates
# vim core-site.xml.j2
Fig 13. core-site.xml.j2 File
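
Since the screenshot is missing: on the slave, core-site.xml.j2 has to point at the Name Node's IP, which we can pull from the dynamic host group created by the ec2 role (the group name 'namenode' is my assumption):

<?xml version="1.0"?>
<!-- core-site.xml.j2 on the Data Node (sketch): points at the Name Node -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ groups['namenode'][0] }}:9001</value>
  </property>
</configuration>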
# cd role/hadoop_slave/templates
# vim hdfs-site.xml.j2
Fig 14. hdfs-site.xml.j2 File

Step 7: Writing Hadoop_jobtracker role:

The Job Tracker Node configuration is done by this role, which follows a task flow of downloading the JDK and then installing Hadoop and Java. This role sets up the master of the MR Cluster, which loads the Mapper and Reducer programs and distributes them to its slaves, i.e. the Task Tracker Nodes, for distributed computation. For this we write the following tasks in the main.yml file inside the roles/hadoop_jobtracker/tasks directory:

# cd role/hadoop_jobtracker/tasks
# vim main.yml
Fig 15. Task File for Job Tracker Role
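
A hedged sketch of the Job Tracker tasks, assuming Hadoop 1.x and the same install pattern (and pkgs_name structure) as the HDFS roles above:

# roles/hadoop_jobtracker/tasks/main.yml (sketch)
- name: Download JDK and Hadoop rpms (same URLs as in the master role)
  get_url:
    url: "{{ item.url }}"
    dest: "/root/{{ item.dest }}"
  loop: "{{ pkgs_name }}"

- name: Install JDK and Hadoop
  command: "rpm -ivh --force /root/{{ item.dest }}"
  loop: "{{ pkgs_name }}"
  ignore_errors: yes

- name: Render the HDFS address and the Job Tracker address files
  template:
    src: "{{ item }}.j2"
    dest: "/etc/hadoop/{{ item }}"
  loop:
    - core-site.xml
    - mapred-site.xml

- name: Start the Job Tracker daemon
  command: hadoop-daemon.sh start jobtracker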

There are many tasks, including installation of the Hadoop software, configuration of the mapred-site.xml file which makes this node the MR Cluster master (JT), then starting the Job Tracker service and making the service daemon permanent, with the help of Ansible modules like command, template, lineinfile, shell, and get_url.

Next, these variables are mentioned in the vars directory inside the hadoop_jobtracker role: the variable named pkgs_name holds the names of the packages we need for the Hadoop installation.

# cd role/hadoop_jobtracker/vars
# vim main.yml
Fig 16. Variable File for Job Tracker Role

There are two main configuration files that we keep inside the templates folder of the role for further configuration of the Job Tracker Node. The file named core-site.xml.j2 manages the connectivity of the HDFS cluster with the Job Tracker, and the other file, mapred-site.xml.j2, is for the MR Cluster setup.

# cd role/hadoop_jobtracker/templates
# vim core-site.xml.j2
Fig 17. Core-Site.xml.j2 File
# cd role/hadoop_jobtracker/templates
# vim mapred-site.xml.j2
Fig 18. Mapred-Site.xml.j2 File
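
As the figure is not shown, here is a minimal mapred-site.xml.j2 sketch for the Job Tracker itself; the port 9002 is taken from the ports mentioned later in this post and is an assumption.

<?xml version="1.0"?>
<!-- mapred-site.xml.j2 on the Job Tracker (sketch): the MR master's own address -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>0.0.0.0:9002</value>
  </property>
</configuration>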

Step 8: Writing Hadoop_tasktracker role:

Task Tracker nodes are greater in number. They are configured to communicate with the Job Tracker node and to get data and the Mapper and Reducer programs as input. The main idea of the Task Tracker nodes is parallel computing on a large amount of data. So we have written the following tasks in the hadoop_tasktracker role for node configuration:

# cd role/hadoop_tasktracker/tasks
# vim main.yml
Fig 19. Task File for Task Tracker Role

Here we have many tasks, including installation of the Hadoop software, configuration of the mapred-site.xml file which will connect these nodes to their MR Cluster master (JT), then starting the Task Tracker service and making the service daemon permanent, with the help of Ansible modules like command, template, lineinfile, shell, and get_url.

Following is the variable file (main.yml) in the vars folder of the hadoop_tasktracker role. It contains only one variable, pkgs_name, which holds the required package names for the Hadoop installation on our Task Tracker node.

# cd role/hadoop_tasktracker/vars
# vim main.yml
Fig 20. Variable File for Task Tracker Role

Next comes the templates folder of the role, which contains the main address file for master connectivity of our Task Tracker node. The file mapred-site.xml.j2 will be processed and copied to the desired location with the help of the template module.

# cd role/hadoop_tasktracker/templates
# vim mapred-site.xml.j2
Fig 21. Mapred-Site.xml.j2 File
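
Again a sketch, since the figure is not reproduced: on the Task Tracker, mapred-site.xml.j2 points at the Job Tracker's IP taken from the dynamic host group built by the ec2 role (the group name 'jobtracker' and port 9002 are assumptions):

<?xml version="1.0"?>
<!-- mapred-site.xml.j2 on the Task Tracker (sketch): points at the Job Tracker -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>{{ groups['jobtracker'][0] }}:9002</value>
  </property>
</configuration>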

Step 9: Writing Hadoop_client role:

The client's target is to reach the cluster using the master IP and then put/read/write/edit/delete (or perform any other operation) to store large files with the desired block size, and with further replication that increases the availability of the files.

Also, for the MR (MapReduce) operation code, Hadoop provides a separate computing cluster where those program files are distributed; by taking the data blocks from the HDFS cluster, it gives the appropriate output to the client, which is then stored back in our HDFS cluster. Following is the roles/hadoop_client/tasks/main.yml file of our role, where the declarative code of the client configuration is given:

# cd role/hadoop_client/tasks
# vim main.yml
Fig 22. Tasks for Hadoop Client Role
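
A rough sketch of the client tasks, under the same assumptions as the earlier roles: the client only installs the Hadoop software and renders the two address files; no daemon is started on it.

# roles/hadoop_client/tasks/main.yml (sketch)
- name: Install JDK and Hadoop rpms (downloaded the same way as in the other roles)
  command: "rpm -ivh --force /root/{{ item.dest }}"
  loop: "{{ pkgs_name }}"
  ignore_errors: yes

- name: Point the client at the Name Node and the Job Tracker
  template:
    src: "{{ item }}.j2"
    dest: "/etc/hadoop/{{ item }}"
  loop:
    - core-site.xml
    - mapred-site.xml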

The variables are to be mentioned in the vars/main.yml file, which includes pkgs_name containing the required package names for installation of the Hadoop software.

# cd role/hadoop_client/vars
# vim main.yml
Fig 23. Variable File for Hadoop Client Role

Since we only need to mention the master node addresses while configuring the client node, the only files we need to change are core-site.xml (for the Name Node) and mapred-site.xml (for the Job Tracker Node), both in Jinja2 format, which we put in the templates folder of our role:

# cd role/hadoop_client/templates
# vim core-site.xml.j2
Fig 24. Core-Site.xml.j2 File
# cd role/hadoop_client/templates
# vim mapred-site.xml.j2
Fig 25. Mapred-Site.xml.j2 File

Here we are retrieving the master IPs through hostvars facts from our dynamic inventory, which we get from the provisioning output of the master nodes, along with the ports on which the masters listen/respond, i.e. 9001, 9002 and others.

Finally, we are done with the creation of all the roles; now we just need to run them, for which we have to create a playbook that includes these roles one after the other in a logical order. We know that first the nodes need to be launched over AWS, then comes the master configuration followed by the slave configuration, then the Job Tracker and Task Tracker configurations, and finally the client registration to both the HDFS and MR clusters.

Step 10: Create Setup Playbook to run all the roles and Create the Cluster over AWS:

Following is the playbook file, setup.yml, which needs to be inside our working directory, hadoop-ws. Here we call all the roles by name using the include_role module, and pass the related variable files via the vars_files parameter.

Fig 26. setup.yml File
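
The playbook screenshot is not shown, so here is a sketch of setup.yml under the assumption that the ec2 role creates host groups named namenode, datanode, jobtracker, tasktracker and client (those names are illustrative):

# setup.yml (sketch)
- hosts: localhost
  vars_files:
    - cred.yml
  tasks:
    - name: Provision the 9 EC2 instances and build the host groups
      include_role:
        name: ec2

- hosts: namenode
  tasks:
    - include_role:
        name: hadoop_master

- hosts: datanode
  tasks:
    - include_role:
        name: hadoop_slave

- hosts: jobtracker
  tasks:
    - include_role:
        name: hadoop_jobtracker

- hosts: tasktracker
  tasks:
    - include_role:
        name: hadoop_tasktracker

- hosts: client
  tasks:
    - include_role:
        name: hadoop_client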

Hence, we have called all six roles that we created here: the ec2 role, hadoop_master role, hadoop_slave role, hadoop_jobtracker role, hadoop_tasktracker role and hadoop_client role, along with one variable file named cred.yml, which is a vault file containing the credentials to log in to the AWS public cloud platform.

Step 11: Running the Playbook Setup.yml

Run the playbook using the ansible-playbook command, giving the vault password to authenticate and log in to the AWS account (refer to the vault file creation in the steps above).

# ansible-playbook setup.yml --ask-vault-pass

Hence we have achieved our target of “Hadoop MapReduce Multi-node Cluster over AWS using Ansible Automation”.

You can find this project on my GitHub; just fork it and let's make the project more system-independent and reliable for its users.

To contribute to the project, or for further queries or opinions, you may connect with me on LinkedIn:

(: — )

Thanks for reading. I hope this blog has given you some valuable input!!
