Hadoop: Modules & Process Flow
Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
Hadoop is designed to scale from a single server to thousands of machines, each offering local computation and storage.
Why is Hadoop required?
As data grows at a large scale and everyone wants to analyze it and extract meaningful information for business use cases, certain challenges come with big data:
- Capturing Data
- Curation
- Storage
- Searching
- Sharing
- Transfer
- Analysis
- Presentation
To address the above challenges, we use the Hadoop modules.
Hadoop Modules
- HDFS — the Hadoop Distributed File System, in which files are broken into blocks and stored on nodes across the distributed architecture (a minimal client sketch follows this list).
- YARN — Yet Another Resource Negotiator, used for job scheduling and for managing cluster resources.
- MapReduce — a framework that helps programs perform parallel computation on data using key-value pairs (more details in a separate article).
- Hadoop Common — the Java libraries that are used to start Hadoop and are shared by the other Hadoop modules.
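To make the HDFS and Hadoop Common pieces concrete, here is a minimal Java sketch of an HDFS client that writes a file and reads it back through the org.apache.hadoop.fs.FileSystem API. The NameNode address (hdfs://localhost:9000) and the path /user/demo/hello.txt are illustrative assumptions for a local cluster; dfs.blocksize and dfs.replication are set explicitly only to show where the block size (128 MB) and replication factor (3) mentioned later are configured.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address -- replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        // 128 MB block size and replication factor of 3 are the usual defaults;
        // they are set here only to show where these knobs live.
        conf.set("dfs.blocksize", "134217728");
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt"); // hypothetical path

        // Write a small file; HDFS splits larger files into blocks
        // and replicates each block across DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hadoop".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back through the same FileSystem handle.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}
```

Compile it against the hadoop-client dependency and run it on a machine that can reach the cluster, for example via the hadoop jar command.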
Hadoop Process Flow
Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:
- Data is initially divided into directories and files. Files are split into uniformly sized blocks of 128 MB (64 MB in older Hadoop versions)
- These blocks are then distributed across the cluster nodes for further processing
- HDFS, which sits on top of the local file system, supervises the processing
- Blocks are replicated across multiple nodes to handle hardware failure
- Checking that the code was executed successfully
- Performing the sort that takes place between the map and reduce stages (see the WordCount sketch after this list)
- Sending the sorted data to the appropriate reducer node
- Writing the debugging logs for each job
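The interplay between these steps (splitting the input, running map tasks, sorting and shuffling the intermediate key-value pairs, and reducing them) is easiest to see in the classic WordCount job from the Hadoop MapReduce examples. The sketch below follows that standard example; the input and output paths are supplied on the command line and are assumptions about your HDFS layout.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce stage: the framework has already sorted and grouped the
    // intermediate (word, 1) pairs by key before this method is called.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A typical submission looks like `hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output`, where both directories are hypothetical HDFS paths; YARN schedules the map and reduce tasks on the cluster nodes holding the input blocks where possible.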
Hadoop MapReduce in detail: https://medium.com/@abhisheksukhwal9/hadoop-map-reduce-in-detail-ac1c8429b03b