HADOOP : Basic Architecture (Hadoop 2.x)

Anupam Majhi
Sep 5, 2018 · 3 min read

Apache Hadoop has gained considerable traction in the recent past. It is a framework for processing large datasets in a distributed computing environment.

All this started back in 2003, when Google released its paper on the Google File System. The Hadoop project took that paper as a blueprint and has evolved into what it is today.

Hadoop 1.x had a quite different structure and certain limitations, which have been addressed in the Hadoop 2.x versions with a revised architecture and new components that enable more reliable, distributed, parallel processing of Big Data.



Hadoop comprises two core components:

HDFS (Hadoop Distributed File System) — Storage Unit
YARN (Yet Another Resource Negotiator) — Processing Unit

HDFS

HDFS comprises three components.



Name Node

The Name Node is the ‘master’ component of the HDFS system. It stores the metadata of HDFS, which is essentially the directory tree of the filesystem, and keeps track of all the files across the cluster.

The name node does not store the actual data; it knows the list of blocks, and their locations, for any given file.

When the name node is down, the cluster itself is considered down, as it becomes inaccessible.

It stores the fsimage (a snapshot of the filesystem at start-up) and the edit logs (the sequence of changes made to the filesystem after start-up).
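The fsimage/edit-log split can be sketched conceptually as follows. This is a minimal toy model, not Hadoop's actual implementation; all class and method names here are illustrative.

```python
# Conceptual sketch: the name node keeps the namespace in memory,
# persists a snapshot (fsimage), and appends every later change to
# an edit log that is replayed on restart. Names are illustrative.
import copy

class NameNodeMetadata:
    def __init__(self):
        self.tree = {}      # path -> list of block IDs (in-memory namespace)
        self.fsimage = {}   # snapshot of the namespace
        self.editlog = []   # changes recorded after the snapshot

    def checkpoint(self):
        """Persist the current namespace as fsimage and clear the edit log."""
        self.fsimage = copy.deepcopy(self.tree)
        self.editlog = []

    def add_file(self, path, blocks):
        self.tree[path] = blocks
        self.editlog.append(("add", path, blocks))

    def restart(self):
        """Rebuild the namespace: load fsimage, then replay the edit log."""
        self.tree = copy.deepcopy(self.fsimage)
        for op, path, blocks in self.editlog:
            if op == "add":
                self.tree[path] = blocks

nn = NameNodeMetadata()
nn.add_file("/logs/a.txt", ["blk_1", "blk_2"])
nn.checkpoint()                         # fsimage now holds /logs/a.txt
nn.add_file("/logs/b.txt", ["blk_3"])   # recorded only in the edit log
nn.restart()
print(sorted(nn.tree))                  # both files are recovered
```

The key point the sketch shows: after a restart, the namespace is the fsimage snapshot plus a replay of the edit log, which is why a long edit log slows start-up.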



Secondary Name Node

The Secondary Name Node is NOT a backup name node, as the name might suggest. It is more of an assistive node.

It is responsible for taking checkpoints of the filesystem metadata held by the name node. At regular intervals it fetches the edit logs from the name node, merges them with the fsimage to create a new checkpoint, and then copies the resulting fsimage back to the name node.

The name node uses this fsimage at the next start-up, which reduces start-up time.
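The merge step at the heart of that checkpoint cycle can be sketched as a pure function: take the old snapshot, apply the logged changes, and return a fresh snapshot. This is a hypothetical simplification; the function name and the tuple format of the edit log entries are assumptions for illustration.

```python
# Hedged sketch of the Secondary Name Node's checkpoint: apply the
# edit log entries to a copy of the fsimage snapshot, producing a new
# fsimage so the name node has nothing to replay at next start-up.
def merge_checkpoint(fsimage, editlog):
    """Apply logged (op, path, blocks) changes to a copy of fsimage."""
    merged = dict(fsimage)
    for op, path, blocks in editlog:
        if op == "add":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
    return merged

# State fetched from the name node (illustrative values):
fsimage = {"/data/old.csv": ["blk_10"]}
editlog = [("add", "/data/new.csv", ["blk_11"]),
           ("delete", "/data/old.csv", None)]

new_fsimage = merge_checkpoint(fsimage, editlog)
print(new_fsimage)  # {'/data/new.csv': ['blk_11']}
```

Doing this merge on a separate node keeps the expensive work off the name node, which only has to receive the finished fsimage.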



Data Node

This is where the actual data is stored. The name node keeps track of each block stored on these data nodes. The data node is also known as the ‘slave’.

When it starts up, it announces its presence to the name node along with the data blocks it is responsible for. This is the workhorse of Hadoop HDFS.
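That start-up announcement is often called a block report. A minimal sketch of the idea, with invented class and method names, might look like this:

```python
# Illustrative sketch (not the real protocol): on start-up each data
# node registers with the name node by reporting the blocks it holds;
# the name node builds its block -> locations map from these reports.
class NameNode:
    def __init__(self):
        self.block_locations = {}   # block ID -> set of data node IDs

    def receive_block_report(self, datanode_id, blocks):
        for blk in blocks:
            self.block_locations.setdefault(blk, set()).add(datanode_id)

class DataNode:
    def __init__(self, node_id, blocks):
        self.node_id = node_id
        self.blocks = blocks        # blocks stored on local disk

    def start(self, namenode):
        # Announce presence and report stored blocks on start-up.
        namenode.receive_block_report(self.node_id, self.blocks)

nn = NameNode()
DataNode("dn1", ["blk_1", "blk_2"]).start(nn)
DataNode("dn2", ["blk_2", "blk_3"]).start(nn)
print(nn.block_locations["blk_2"])  # blk_2 is replicated on both nodes
```

This also shows why the name node holds only metadata: the blocks themselves never leave the data nodes, only their IDs are reported.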

YARN

YARN comprises two components:



Resource Manager

The Resource Manager is responsible for keeping an inventory of available resources and running critical services, the most critical of which is the Scheduler.

The Scheduler allocates resources: it negotiates the available resources in the cluster and manages the distributed processing. It works along with the Node Managers to achieve this.
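The core of what the Scheduler does can be sketched as matching resource requests against what each node has free. This toy greedy placement is far simpler than YARN's real schedulers (Capacity or Fair), and the function and variable names are invented for illustration:

```python
# Toy sketch of scheduling, not YARN's actual algorithm: greedily
# place each (app, memory) request on the first node with enough
# free memory, deducting the allocation from that node.
def schedule(requests, nodes):
    """Place (app, memory_mb) requests onto nodes with spare memory."""
    placements = []
    for app, mem in requests:
        for node, free in nodes.items():
            if free >= mem:
                nodes[node] = free - mem      # reserve the memory
                placements.append((app, node))
                break                          # request satisfied
    return placements

nodes = {"node1": 4096, "node2": 2048}         # free memory per node (MB)
requests = [("app1", 3072), ("app2", 2048), ("app3", 2048)]
placements = schedule(requests, nodes)
print(placements)   # app3 finds no node with room and stays pending
```

Even this toy version shows the essential trade-off: once a request is granted, that capacity is unavailable until the container finishes, so later requests may have to wait.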



Node Manager

Node manager acts as the ‘slave’ to the Resource Manager.

It keeps track of the tasks and jobs deployed to the data nodes. It helps the Resource Manager keep track of available storage, processing power, memory, bandwidth, etc., so that tasks can be distributed across the data nodes.
