Introduction to Hadoop Architecture

Uthpala Pitawela · Published in Geek Culture · May 26, 2021

What is Hadoop?

Apache Hadoop is an open-source software framework used to manage data storage and processing in big data applications. Hadoop makes it possible to analyze large amounts of data in parallel and therefore much more quickly. Apache Hadoop was introduced to the public in 2012 by The Apache Software Foundation (ASF).

Why Apache Hadoop?

Before the digital era, data was gathered at a slow pace and could be stored and analyzed using a single storage format, since data collected for similar purposes tended to share the same format. However, with the evolution of the Internet and digital platforms such as social media, data began to arrive in various formats (structured, semi-structured, and unstructured) and its velocity increased massively. This data came to be known as big data, and handling it created the need for multiple processors and storage units. Hadoop was introduced as a solution to this problem.

Hadoop components

Hadoop consists of three main components.

  1. HDFS: the storage unit of Hadoop
  2. MapReduce: the processing method
  3. YARN: the resource negotiator

1. HDFS

HDFS follows a master/slave architecture. It consists of a single namenode and multiple datanodes. In the HDFS architecture, a file is divided into one or more blocks, which are stored on separate datanodes. Datanodes are responsible for operations such as block creation, deletion, and replication according to the namenode's instructions, and they also serve read and write requests from the file system's clients.

The namenode acts as the master server and the central controller for HDFS. It holds the file system metadata, maintains the file system namespace, oversees the condition of the datanodes, and coordinates access to data.
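
As a rough sketch of how a client program talks to HDFS, the snippet below uses Hadoop's Java FileSystem API to write and then read a small file. The namenode URI and the file path are placeholders chosen for illustration, not values from this article.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; in practice this comes from fs.defaultFS in core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        Path file = new Path("/demo/hello.txt"); // hypothetical path

        // The namenode records the file's metadata; the datanodes store its blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // On read, the namenode supplies block locations and the datanodes serve the data.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}
```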

Data replication in HDFS

In HDFS, all blocks of a file are the same size except for the last block. An application can specify the number of replicas for a file; the default is 3. The replication factor can be specified when the file is created and changed later on. With the default of three replicas, two copies are stored on different nodes within the same rack, and the third copy is stored on a node in a different rack. The namenode is responsible for all decisions related to block replication. It periodically receives a heartbeat from each datanode, which indicates whether the node is functioning properly.
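
The replication factor can also be changed programmatically after a file has been created. Below is a minimal sketch using the same FileSystem API; the namenode URI and file path are again placeholder values.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf); // placeholder namenode URI

        Path file = new Path("/demo/hello.txt"); // hypothetical file

        // Ask the namenode to change the target replication factor for this file.
        fs.setReplication(file, (short) 2);

        // Read back the replication factor stored in the file's metadata.
        short replication = fs.getFileStatus(file).getReplication();
        System.out.println("Replication factor: " + replication);

        fs.close();
    }
}
```

The same change can be made from the command line with `hdfs dfs -setrep 2 /demo/hello.txt`.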

HDFS Architecture

2. MapReduce

Traditionally, data was processed on a single computer. However, in a big data context this becomes a tedious and time-consuming task. Hadoop therefore has its own processing method, called MapReduce. MapReduce has two tasks, namely Map and Reduce: the mapper is responsible for splitting and mapping the data, while the reducer shuffles and reduces it.

Splitting phase: Input data is split into smaller chunks.

Mapping phase: Each chunk is converted into <key, value> pairs. If a word occurs multiple times, the mapper does not emit it as one accumulated count; it emits a separate pair with a count of 1 for each occurrence.

Shuffling and sorting phase: The framework sorts the mapper output and sends it to the reducer. Sorting is done by key, not by value, so the values passed to the reducer for a given key may arrive in any order.

Reducing phase: Once shuffling and sorting are done, the reducer combines the intermediate results and writes the final output to HDFS.

MapReduce workflow
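
To make these phases concrete, here is the classic word-count example written against Hadoop's Java MapReduce API. The mapper emits a separate <word, 1> pair for every occurrence, and after the shuffle and sort the reducer sums the counts for each key.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: emit <word, 1> for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // each occurrence is emitted separately, not accumulated
            }
        }
    }

    // Reduce phase: the shuffle and sort groups pairs by key, then the reducer sums the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result); // final <word, total> pairs are written to HDFS
        }
    }
}
```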

3. YARN

YARN architecture

YARN stands for “Yet Another Resource Negotiator”. YARN is responsible for resource management and job scheduling in Hadoop. The YARN architecture has the following components.

  1. Client: Submits MapReduce jobs (a job-submission sketch follows this list).
  2. Resource manager: Responsible for resource assignment and management across the applications. When a MapReduce job is received, it forwards the job to the respective node manager and allocates the resources accordingly. The resource manager consists of two main components, namely the scheduler and the application manager.
  • Scheduler: Performs scheduling based on the requirements of the applications and the available resources; it is a pure scheduler and does not monitor or track application status.
  • Application manager: Responsible for accepting job submissions and negotiating the first container for executing the application master; it also restarts the application master container if it fails.

  3. Node manager: Takes care of an individual node in the Hadoop cluster and its application workflow. It is responsible for creating container processes and killing them as per the instructions given by the resource manager. In addition, it monitors resource usage and performs log management.

  4. Application master: Responsible for negotiating resources with the resource manager. It tracks the status and monitors the progress of a single application. The application master requests a container from the node manager by sending it the details required to run the application, and once the container is running it reports its health to the resource manager from time to time.

  5. Container: A collection of physical resources such as RAM, CPU cores, and disk on a single node.
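
To tie these components together, the sketch below submits the word-count job from the previous section to YARN. The input and output paths are hypothetical, and setting mapreduce.framework.name in code is only for illustration; on a real cluster it normally lives in mapred-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn"); // run on YARN instead of the local runner

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/demo/input"));    // hypothetical HDFS paths
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));

        // The client submits the job to the resource manager; the resource manager launches an
        // application master in a container, which then negotiates further containers for the
        // map and reduce tasks and reports progress back.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```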

Practical applications of Hadoop

  1. Financial companies use Hadoop for analytics purposes such as assessing risk, building investment models, and developing trading algorithms.
  2. Retailers use Hadoop to analyze different types of data (structured, semi-structured, and unstructured) to understand their customers and serve them better.
  3. Hadoop is heavily used in the telecommunications industry, for example to run predictive analytics on the performance of network infrastructure.

References

  1. https://www.geeksforgeeks.org/hadoop-yarn-architecture/
  2. https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
  3. https://www.guru99.com/learn-hadoop-in-10-minutes.html
