HADOOP: Basic Architecture (Hadoop 2.x)

Apache Hadoop has gained a lot of traction in the recent past. It is a framework for processing large datasets in a distributed computing environment.
It all started back in 2003, when Google released its paper on the Google File System. The Hadoop project later took that paper as a blueprint and has evolved into what it is today.
Hadoop 1.x had quite a different structure and certain limitations, which have been addressed in Hadoop 2.x with a revised architecture and new components that enable more reliable, distributed, parallel processing of Big Data.
Hadoop comprises two core components:
HDFS (Hadoop Distributed File System) — Storage Unit
YARN (Yet Another Resource Negotiator) — Processing Unit

HDFS
HDFS comprises three components:
Name Node
Name Node is the ‘master’ component of the HDFS system. It stores the metadata of HDFS, which is essentially the directory tree of the filesystem, and keeps track of all the files across the cluster.
The name node does not store the actual data; it only knows the list of blocks and their locations for any given file.
When the name node is down, the cluster itself is considered down, as the filesystem becomes inaccessible.
It stores the fsimage (a snapshot of the filesystem at start-up) and edit logs (the sequence of changes made to the filesystem after start-up).
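
To make this concrete, here is a minimal sketch using the standard HDFS Java API (the path /data/sample.txt is a hypothetical example). Notice that it only asks the name node for metadata — the blocks of a file and the data nodes holding them — and never reads any file data.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlockLocations {
        public static void main(String[] args) throws Exception {
            // fs.defaultFS in the loaded configuration points at the name node
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/sample.txt");   // hypothetical file
            FileStatus status = fs.getFileStatus(file);

            // Answered from the name node's metadata; no data node is contacted
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }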
Secondary Name Node
The Secondary Name Node is NOT a backup name node, as the name might suggest. It is more of an assistive node.
It is responsible for taking checkpoints of the filesystem metadata present on the name node. It does this by checkpointing the fsimage: at regular intervals it fetches the edit logs from the name node, merges them into the fsimage to create a checkpoint, and then copies this updated fsimage back to the name node.
The name node uses this fsimage on its next start-up, which reduces start-up time.
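
The merge itself can be pictured roughly like the sketch below. It is purely conceptual and does not use Hadoop’s real classes: a list of paths stands in for the fsimage, a list of operations stands in for the edit log, and replaying the edits over the snapshot yields the new checkpoint.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class CheckpointSketch {
        public static void main(String[] args) {
            // stand-in for the fsimage: the namespace as of the last checkpoint
            List<String> namespace = new ArrayList<>(Arrays.asList("/data", "/data/a.txt"));
            // stand-in for the edit log: changes made since that checkpoint
            List<String> editLog = Arrays.asList("CREATE /data/b.txt", "DELETE /data/a.txt");

            // replay the edits over the snapshot
            for (String edit : editLog) {
                String[] op = edit.split(" ");
                if (op[0].equals("CREATE")) namespace.add(op[1]);
                if (op[0].equals("DELETE")) namespace.remove(op[1]);
            }
            // 'namespace' now represents the new fsimage copied back to the name node
            System.out.println("new fsimage: " + namespace);
        }
    }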
Data Node
This is where the actual data is stored. The name node keeps track of each block stored on these data nodes. The data node is also known as the ‘slave’.
When it starts up, it announces its presence to the name node, along with the list of data blocks it is responsible for. This is the workhorse of Hadoop HDFS.
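
The sketch below (again using the standard HDFS Java API, with a hypothetical path /tmp/hello.txt) shows how a client writes and reads a file: the name node decides where the blocks go, but the actual bytes are streamed directly to and from the data nodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadWriteExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/tmp/hello.txt");   // hypothetical path

            // Write: block data is streamed to data nodes chosen by the name node
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hdfs");
            }

            // Read: the client fetches block locations once, then reads from data nodes
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }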
YARN
YARN comprises two components:
Resource Manager
The Resource Manager is responsible for taking inventory of the available resources in the cluster and running critical services, the most important of which is the Scheduler.
The Scheduler allocates resources: it negotiates the resources available in the cluster and manages the distributed processing, working along with the Node Managers to achieve this.
Node Manager
The Node Manager acts as the ‘slave’ to the Resource Manager.
It keeps track of the tasks and jobs deployed to its node and helps the Resource Manager keep track of the available storage, processing power, memory, bandwidth, etc., so that tasks can be distributed appropriately to the data nodes.
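
As a rough illustration of this inventory (assuming a running YARN cluster reachable through the configuration on the classpath), the sketch below uses YarnClient to ask the Resource Manager for the per-node reports fed by the Node Managers: the total and used memory and vcores on each node.

    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ClusterInventory {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());
            yarn.start();

            // One report per live node, as tracked by the Resource Manager
            for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
                System.out.println(node.getNodeId()
                        + " capacity=" + node.getCapability()   // total memory/vcores
                        + " used=" + node.getUsed());           // currently allocated
            }
            yarn.stop();
        }
    }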
