Big Data Series | Article 2
Hadoop Components and Architecture — Hadoop v1.0
We already know from the previous introductory article on big data, that it is efficiently handled by a cluster of machines. But, wouldn’t you be interested to know how these machines interact with each other? In this article, we will look into the these details by learning about Hadoop definition, basics and framework in v1.0.
Hadoop is an open source framework that is used to efficiently process and store huge volumes of data. It talks about the details of the data storage and processing across different machines in the cluster.
1. Hadoop Master/Slave Architecture:
master/slave architecture with an example of a study class. Typically, a study class has a group of a professor and a group of students and their duties are as follow:
Professor:
* Entry point of contact for anyone outside the class.
* Distributes work to all the students.
* Receives frequent updates about the progress of tasks assigned to each student periodically.
* Keeps track of the whereabouts of students
Student:
* Gets the guidance from professor
* Accepts the share of work and perform the task
* Continuously updates the ProfessorIn the Hadoop world, Professor can be related to Name/Master Node and Student can be related to Data/Slave Node.
2. Hadoop v1.0 Components:
As seen in P2, Hadoop v1.0 had 2 major components:
HDFS: The layer at which the data is stored on each machine of a cluster.
MapReduce: The layer at which the data across all the machines is processed and resources are managed across all the machines in the cluster.
Let’s combine both the pictures (P1,P2) above and try to understand further.
As shown above (in P3), all the nodes (including Master/Name and Slave/Data) are machines that has HDFS Layer and MapReduce Layer.
2.1) Hadoop Distributed File System (HDFS) Layer:
Data Node: Contains actual data in blocks of files (more on this later).
Name Node: Preserves the information of the data stored in Data Nodes i.e. metadata of the data present on data nodes.
2.2) MapReduce (MR) Layer:
Task Tracker: Primarily responsible for the completion and execution of tasks on individual data node the task tracker exists on. Each task tracker is responsible to report the status of the task to the job tracker on master node.
Job Tracker: Exists on Name Node and coordinates with the task trackers on each data node. It is responsible for scheduling and monitoring of jobs and tasks.
3. HDFS Architecture:
So far we know that the data is stored on Data Nodes. But, how exactly does this happen?
* Block Size: A file is split into multiple blocks based on the block size (default size is 64MB for Hadoop v1.0 and 128MB for Hadoop v2.0) which are pushed to different Data Nodes. The address of all these blocks is stored on Name Node.
Q1: How many unique data blocks would there be for a file of size of 5GB
Ans: (5 GB* 1024 MB) / 64 MB= 80 blocks
***Assumption: The block size is 64 MB.* Replication Factor: All the blocks of files are replicated based on the replication Factor. Default replication factor is set to 3. So, any data block is present 3 times on the cluster unless changed.
Q2: How many total blocks (including replicas) of the same file above would be present on the cluster.
Ans: 80 blocks *3 = 240 blocks
*** Assumption: The replication factor is 3.
As seen in P4, when a user tries to store a file on HDFS, the request is sent to Name Node and the file is split and replicated based on block size and replication factor together to be stored on various Data Nodes. Metadata/address of these data blocks is saved in Name Node.
4. Failure Management:
Why do we need a failure management?
It is obvious that we are dealing with a bunch of machines and anything could break at anytime. So, Hadoop v1.0 framework has built-in failure management as well.
4.1) Data Node Failure Management: Data Nodes are in continuous contact with Name Node and sends “heartbeats” every 3 seconds. If Name Node does not receive heartbeats continuously for more than 10 seconds, it considers the data node is down and removes the data block replica from the effected data node which in turn brings down the replication factor of the data block. Name Node then creates new copy of the data block lost due to the failed data node to maintain the replication factor. Address of this new data block is now updated on Name Node.
4.2) Name Node Failure Management: Assume there is a huge volumes of data across different data nodes, but, don’t know the metadata. Would it be of any use? Of course not! This is exactly what happens if the Name Node goes down. Unfortunately, Hadoop v1.0 does not have a built-in solution for this (Except that the Hadoop admins do their magic to get the data from backups) which results in the cluster down for a few hours. However, this is efficiently handled in Hadoop v2.0, to be discussed in the next article.
5. Drawbacks of Hadoop v1.0:
5.1) Job Tracker: Job Tracker is responsible to coordinate with the task managers across all the data nodes. What happens if the cluster size is real big? Yes, Job Tracker will be a bottle neck in this case. As per industry, if the cluster size increases beyond 4000, Job Tracker will be a real bottle neck and the work slows down even though the data nodes and their task managers have the capacity as they are waiting on instructions from Job Tracker.
5.2) Name Node Failure Management: As discussed earlier, there is no automatic mechanism built in Hadoop framework to take care of name node failure.
These drawbacks are gracefully tackled in Hadoop v2.0. Stay tuned to the series of the articles!
Note: All the images embedded here are retrieved from internet and the author has no intention of copyright infringement.