Hadoop: Modules & Process Flow

Abhishek Sukhwal
Mar 25, 2024

Hadoop Modules

Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.

Hadoop is designed to scale from a single server to thousands of machines, each offering local computation and storage.

Why is Hadoop required?

As data keeps growing and organizations want to analyze it and extract meaningful information for business use cases, big data brings a number of challenges:

  1. Capturing Data
  2. Curation
  3. Storage
  4. Searching
  5. Sharing
  6. Transfer
  7. Analysis
  8. Presentation

To address these challenges, we use the Hadoop modules.

Hadoop Modules

  1. HDFS (Hadoop Distributed File System): files are broken into blocks, and the blocks are stored on nodes across the distributed architecture.
  2. YARN (Yet Another Resource Negotiator): handles job scheduling and cluster resource management.
  3. MapReduce: a framework that lets programs perform parallel computation on data using key-value pairs (more details in a separate article).
  4. Hadoop Common: the Java libraries and utilities that are used to start Hadoop and are shared by the other Hadoop modules.
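
To make the key-value idea behind MapReduce concrete, here is a minimal word-count sketch in plain Python. It does not use any Hadoop API; the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle/sort, and reduce one after another, in memory."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
    # [('big', 2), ('cluster', 1), ('data', 2), ('node', 1)]
```

In a real cluster, the map calls would run in parallel on the nodes that hold the input blocks, and the grouped pairs would be sent over the network to the nodes running the reduce tasks.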

Hadoop Process Flow

Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:

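  1. Data is initially organized into directories and files. Files are divided into uniform-sized blocks of 128 MB (the default in Hadoop 2.x and later) or 64 MB (the default in Hadoop 1.x); only the last block of a file may be smaller.
  2. These blocks are then distributed across various cluster nodes for further processing.
  3. HDFS, which sits on top of each node's local file system, supervises how the blocks are stored and accessed during processing.
  4. Blocks are replicated to multiple nodes (three copies by default) so that the data survives hardware failure.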
  5. Hadoop checks that the code was executed successfully.
  6. It performs the sort that takes place between the map and reduce stages.
  7. It sends the sorted data to the node that runs the corresponding reduce task.
  8. It writes debugging logs for each job.
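
The block-splitting and replication steps at the start of this list can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the helper names (`split_into_blocks`, `place_replicas`) and the node names are hypothetical, and the round-robin placement stands in for HDFS's real placement policy, which is rack-aware.

```python
BLOCK_SIZE_MB = 128  # default block size in Hadoop 2.x and later
REPLICATION = 3      # default HDFS replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block; only the last one may be smaller."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes, round-robin.

    Simplified assumption: real HDFS placement also considers racks.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(300)  # a 300 MB file
    print(blocks)                    # [128, 128, 44]
    nodes = ["node1", "node2", "node3", "node4"]  # hypothetical node names
    for block, holders in place_replicas(len(blocks), nodes).items():
        print(block, holders)
```

With four nodes and three replicas, losing any single node still leaves at least two copies of every block, which is the point of step 4.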

Hadoop MapReduce in detail: https://medium.com/@abhisheksukhwal9/hadoop-map-reduce-in-detail-ac1c8429b03b