Hadoop Architecture
HDFS (Hadoop Distributed File System):
HDFS is a storage system that divides large files into smaller blocks and distributes them across a cluster of computers for efficient, fault-tolerant storage and retrieval; it is the foundation of Hadoop’s big data processing.
Demystifying Hadoop Architecture:
Hadoop, a powerful framework for handling vast datasets, is composed of three essential core components: HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator), and MapReduce. In this blog, we’ll delve into these components and uncover how they work together to manage and process big data.
The Three Pillars of Hadoop
1. HDFS (Hadoop Distributed File System)
HDFS serves as the backbone of Hadoop. Think of it as a well-organized library with a Librarian-in-Chief (Master) and many librarians (Slaves) that efficiently organize, store, and fetch data, ensuring your big data is always accessible. It consists of two main roles:
- NameNode (Master): Think of the NameNode as the head librarian, meticulously keeping track of where each “book” (data block) is located. This ensures efficient data organization and swift access.
- DataNodes (Slaves): DataNodes are like the library shelves, where the actual “books” (data blocks) are stored. They safeguard your data, making sure it’s readily available whenever needed.
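To make the library analogy concrete, here is a minimal sketch using Hadoop’s Java FileSystem API. It assumes a reachable HDFS cluster and a hypothetical file path, and simply asks the NameNode which DataNodes hold each block of the file:

```java
// Sketch: ask the NameNode (the "head librarian") where each block of a file lives.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // connects to the NameNode
        Path file = new Path("/data/sample.txt");      // hypothetical file path

        FileStatus status = fs.getFileStatus(file);
        // The metadata comes from the NameNode; the blocks themselves sit on DataNodes.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```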
2. YARN (Yet Another Resource Negotiator)
YARN functions as the event coordinator within Hadoop, ensuring cluster resources are allocated efficiently:
- Picture YARN as a resource manager, efficiently assigning resources to various “librarians” (processing frameworks) as needed.
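For a feel of what YARN keeps track of, the small sketch below (assuming a running ResourceManager and the standard YarnClient API) lists every active NodeManager along with the memory and CPU it offers:

```java
// Sketch: ask the "event coordinator" what resources each node in the cluster provides.
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterResources {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());   // reads yarn-site.xml for the ResourceManager address
        yarn.start();

        // Each report describes one NodeManager: its total memory and vcores.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " capability: " + node.getCapability());
        }
        yarn.stop();
    }
}
```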
3. MapReduce
MapReduce is a team of “librarians” tasked with reading and summarizing “books” (data) efficiently:
- Mapper: Each mapper reads and processes data concurrently, akin to librarians simultaneously exploring different chapters of books.
- Reducer: Reducers collaborate to consolidate information gathered by the mappers, much like librarians gathering to summarize data cohesively.
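The canonical WordCount job shows both roles in a few lines of the standard MapReduce Java API. This condensed sketch omits the job driver and assumes the input and output paths are wired up elsewhere:

```java
// Sketch: the classic WordCount mapper and reducer.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // The "librarian reading chapters": each mapper scans its split of the input
    // and emits (word, 1) for every word it sees.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The "summarizing librarian": each reducer receives all counts for one word
    // and adds them up into a single total.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```

Because mappers run in parallel over separate HDFS blocks, the same job scales from a single machine to thousands of nodes without any change to this code.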
Understanding Data Storage in HDFS
In the realm of HDFS, data is organized into blocks. For example, consider storing a 500 MB file in a 4-node cluster:
- In Hadoop 1.x, the default block size is 64 MB.
- In Hadoop 2.x, the default block size is 128 MB.
- With the 128 MB default of Hadoop 2.x, your 500 MB file would be divided into four blocks: three 128 MB blocks and one 116 MB block.
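The arithmetic behind that split is just integer division plus a remainder; here is a quick sanity check, assuming the Hadoop 2.x default of 128 MB:

```java
// Sketch: back-of-the-envelope block math for a 500 MB file with 128 MB blocks.
public class BlockMath {
    public static void main(String[] args) {
        long fileSizeMb = 500;
        long blockSizeMb = 128;

        long fullBlocks = fileSizeMb / blockSizeMb;    // 3 full 128 MB blocks
        long lastBlockMb = fileSizeMb % blockSizeMb;   // 500 - 3*128 = 116 MB in the final block

        System.out.println(fullBlocks + " blocks of " + blockSizeMb + " MB");
        System.out.println("1 block of " + lastBlockMb + " MB");
    }
}
```

Note that the final block only occupies the bytes it actually needs; it does not consume a full 128 MB on disk.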
The Role of Replication and NameNode
Hadoop emphasizes data reliability through replication:
- DataNodes, also known as Slave Nodes, store multiple copies of data blocks to ensure data safety.
- The NameNode maintains metadata, including the mapping of data blocks to specific DataNodes.
The Significance of the Replication Factor
Hadoop introduces the concept of a replication factor, typically set at 3 by default. This factor determines the number of copies maintained for each data block, bolstering fault tolerance. It can be customized based on project requirements.
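As a sketch of how that customization might look (with a hypothetical file path), the snippet below sets the client-side default via the dfs.replication property and overrides the replication for a single file through the FileSystem API:

```java
// Sketch: configuring the replication factor.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication for files created by this client
        // (the cluster-wide default normally lives in hdfs-site.xml as dfs.replication).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        // Per-file override: ask HDFS to keep three copies of this particular file.
        fs.setReplication(new Path("/data/important.csv"), (short) 3);
        fs.close();
    }
}
```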
Client Node
The Client Node represents a person sitting at their system who wants to access data. Here’s how it works:
- When this person requests data, the request is sent to the NameNode, acting as the chief librarian.
- The NameNode checks its metadata table (the catalog) and tells the client where to find the data, much like a librarian giving directions to the specific shelves that hold the requested books.
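From the client’s side, that whole exchange is hidden behind a single open() call. The sketch below (again with a hypothetical path) opens a file, which triggers the metadata lookup on the NameNode, and then streams the contents directly from the DataNodes:

```java
// Sketch: a client read — metadata from the NameNode, data from the DataNodes.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() asks the NameNode for block locations ("ask the librarian");
        // the returned stream then reads the blocks from the DataNodes ("walk to the shelves").
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```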
Wrapping Up
Hadoop’s architecture, encompassing HDFS, YARN, and MapReduce, empowers organizations to effectively navigate, store, and process vast datasets. Understanding these core components serves as the cornerstone for harnessing Hadoop’s capabilities in the realm of data analytics.
For a more in-depth exploration of Hadoop, refer to the official Hadoop documentation.
#Hadoop #BigData #DataManagement #DataAnalytics #HDFS #MapReduce #YARN #DataScience #TechExplained #DataProcessing #TechInnovation #DataStorage #DataArchitecture #DigitalTransformation #HadoopEcosystem