Hadoop

ABDO
May 20, 2024

What is Apache Hadoop?

According to IBM, Apache Hadoop is an open-source software framework, originally developed by Doug Cutting (then at Yahoo), that provides highly reliable, distributed processing of large data sets using simple programming models.

Before Apache Hadoop

Social media platforms, websites, and IoT devices all produce large amounts of data daily, which makes it hard to store, transform, or run analytical tasks on such massive amounts of unstructured data.

The storage problem becomes clear when developing search-engine algorithms, which work by downloading huge numbers of web pages in order to rank them.

Back in 2002, Doug Cutting and Mike Cafarella were working on Apache Nutch, an open-source web search engine, and its developers were struggling to deal with big data.

From 2002 to 2006:
1. Google published its Google File System (GFS) paper, describing a solution for big-data storage and access.
2. Google published its MapReduce paper, describing a cluster-based data-processing model.
3. Doug Cutting and the Nutch team implemented the Nutch Distributed File System (NDFS), inspired by those papers.

In 2006, Doug Cutting (by then at Yahoo) split the storage and processing parts out of the Nutch project to create Hadoop, with HDFS (Hadoop Distributed File System) as its storage layer.

Hadoop Components

  1. Common: the shared Java libraries and utilities on which the other Hadoop modules depend.
  2. Hadoop Distributed File System (HDFS): a distributed file system that offers high-throughput data access and stores large amounts of data across multiple nodes.
  3. YARN (Yet Another Resource Negotiator): the cluster resource manager in Hadoop; it also acts as the job scheduler.
  4. MapReduce: a parallel data-processing model used to process data in parallel across the nodes of a cluster.

Hadoop Architecture

Hadoop follows a master-slave architecture, which means there is a Master Node (also called the NameNode) and Slave Nodes (also called DataNodes).

Regardless of its type, each node is split into two layers: a MapReduce layer and an HDFS layer.

Master Node: a single node that controls the file system namespace.
It runs the JobTracker in the MapReduce layer and the NameNode in the HDFS layer.

Slave Node: one of many nodes that store blocks of data.
It runs a TaskTracker in the MapReduce layer and a DataNode in the HDFS layer.

Secondary NameNode: a helper node that periodically checkpoints the NameNode's metadata (merging the edit log into the file-system image); despite its name, it is not an automatic standby that takes over when the NameNode goes down.

JobTracker: part of the Master Node; it is responsible for managing cluster resources, scheduling jobs, and keeping metadata about running jobs.

TaskTracker: part of the slave (data) nodes; it is responsible for executing tasks, reporting node health to the JobTracker through periodic heartbeats, and managing the resources for each task on its node.

Since the introduction of YARN, the JobTracker and TaskTracker are no longer needed; their roles are split across the ResourceManager, the NodeManagers, and per-application ApplicationMasters.

HDFS (Hadoop Distributed File System)

HDFS is inspired by the Google File System (GFS). It is a distributed file system in which data is stored across multiple machines and replicated to ensure high availability.

Block

A block is the smallest chunk of data stored on a DataNode (128 MB by default in current Hadoop versions, 64 MB in older ones). The block size is kept large to minimize seek time relative to transfer time.

HDFS is well suited to very large files and streaming data access.
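
To make this concrete, here is a minimal sketch (not part of the original article) that uses Hadoop's Java FileSystem API to write a small file and print the block size and replication factor HDFS assigned to it. It assumes fs.defaultFS already points at a running cluster, and the /tmp/hello.txt path is just an illustration.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockExample {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath;
        // assumes fs.defaultFS points at your cluster, e.g. hdfs://namenode:9000
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path used only for this example
        Path file = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hadoop".getBytes(StandardCharsets.UTF_8));
        }

        // Ask the NameNode for the file's metadata
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize() + " bytes");
        System.out.println("Replication: " + status.getReplication());
    }
}
```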

MapReduce

MapReduce is the processing model used in Hadoop:

  • Map: the input data is split into independent chunks that are processed by the map tasks in a completely parallel manner.
  • Reduce: the framework first sorts the map outputs and then runs the reduce tasks on them. Typically, the output of the reduce tasks is the final output of the job.

Example: finding the word with the highest number of occurrences (see the sketch after this list):

  • Map: reads the text and emits a key-value pair for each word, e.g. ("word", 1).
  • Reduce: sums the counts for each word and returns the word with the maximum count.
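
Below is a minimal sketch of this word-count logic using the standard Hadoop MapReduce Java API. It produces a count per word; picking the single word with the highest count would be a small extra step (for example a second job, or a local scan of the output).

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all the 1s emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```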

YARN (Yet Another Resource Negotiator)

YARN is a resource manager and job scheduler that solves the problems of the JobTracker and TaskTracker by splitting their responsibilities across the following components:

  1. ResourceManager: manages the resources across the cluster and allocates them to jobs.
  2. NodeManager: manages the resources on a single node and reports to the ResourceManager.
  3. ApplicationMaster (MapReduce): manages a running MapReduce job, requesting containers for its map and reduce tasks.
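
As an illustration of how a MapReduce job reaches these components, here is a minimal driver sketch for the word-count classes above: when mapreduce.framework.name is set to yarn, submitting the job hands it to the ResourceManager, a NodeManager launches the MapReduce ApplicationMaster, and that ApplicationMaster requests containers for the map and reduce tasks. The input and output paths come from the command line and are assumptions here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Reuse the mapper/reducer from the WordCount sketch above;
        // the reducer also works as a combiner since summing is associative.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS input/output paths passed on the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Blocks until the ApplicationMaster reports the job finished
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```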

References

https://www.javatpoint.com/hadoop-tutorial
