Published in MLearning.ai

Introduction to Hadoop Part 2

(Core stack components of the Hadoop ecosystem)

Hey guys,

In the previous article (Introduction to Hadoop Part 1), I set the stage with an introduction to Hadoop and its ecosystem. The Hadoop ecosystem is divided into three categories, namely the core stack, the data manipulation stack, and the coordination stack. In this article, I will dig deep into the core stack components of the Hadoop ecosystem. I am sure it will enrich your knowledge of the core stack components.

So, let's dive deep into the core stack components of the Hadoop ecosystem.

Following are the core stack components of the Hadoop ecosystem:

1. HDFS (Hadoop Distributed File System)

2. YARN (Yet Another Resource Negotiator)

3. MapReduce

Let’s explore each core component in detail.

1. HDFS (Hadoop Distributed File System):

HDFS is the core component of the Hadoop ecosystem; it is often called the heart of Hadoop. HDFS allows us to store data in a distributed environment on commodity computers, while from the outside it looks to the user as if the data is stored on a single machine. HDFS maintains multiple copies of the data on different nodes to make the system highly fault tolerant: if any machine in the cluster fails, HDFS can fetch the data from a replica on another node. By default, HDFS maintains replicas in three places; however, an HDFS administrator can set the replication factor as per their requirements. HDFS is built on a master/slave architecture. This means each cluster contains a single master node, called the name node, and multiple slave nodes, called data nodes. The name node handles the Hadoop file system namespace, client access to files, and metadata, while the data nodes store the actual blocks of the files. For more information, you can refer to the official Apache Hadoop web page.
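As a small illustration of how replication shows up at the API level, here is a minimal sketch using Hadoop's Java FileSystem API; the name node URI and file path are hypothetical placeholders, not values from this article, and your cluster's configuration will differ.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical name node address; replace with your cluster's URI.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS transparently splits it into blocks
        // and replicates each block across data nodes.
        Path file = new Path("/user/demo/sample.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Override the default replication factor (3) for this one file.
        fs.setReplication(file, (short) 2);

        System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}
```

The client never needs to know which data nodes hold the blocks; the name node resolves that, which is exactly the "looks like a single machine" behavior described above.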

2. YARN (Yet Another Resource Negotiator):

YARN stands for Yet Another Resource Negotiator. YARN was introduced in Hadoop 2.0 to overcome the bottleneck of the Job Tracker, a component in Hadoop 1.0 that acted as both resource manager and application manager. YARN is often described as a large-scale distributed operating system that mediates all cluster resources. It is also based on a master/slave architecture. YARN essentially separates the resource management layer from the processing layer. YARN is empowered to decide who gets to run tasks; it manages when and which nodes are available for workload allocation. YARN keeps account of the jobs running on the various nodes and controls their usage of memory and CPU. YARN comprises the following main modules (a small client-side sketch follows this list):

· Client: It submits the map-reduce job request.

· Resource Manager: It is the master daemon in the YARN architecture. It has a complete view of the total CPU and memory (RAM) utilization in the cluster, and it assigns resources to the applications in the system. It is responsible for resource allocation and management among all the applications. Whenever the resource manager receives a request, it forwards it to the corresponding node manager and performs the resource allocation for that request. The resource manager has two main components:

  • Scheduler: It performs scheduling based on the resources available to the allocated applications. It does not accomplish any other tasks, such as monitoring or tracking of tasks, and in case of a failure it does not guarantee a restart of the process.
  • Application manager: It is responsible for accepting the application and negotiating the first container from the resource manager. It also restarts the Application Master container if a task fails.

· Node Manager: It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to keep up with the resource manager: it registers with the resource manager and sends heartbeats with the health status of the node. It monitors resource usage, performs log management, and kills containers based on directions from the resource manager. It is also responsible for creating container processes and starting them at the request of the application master.

· Application Master: An application is a single job submitted to the framework. The application master is accountable for negotiating resources with the resource manager, monitoring the progress of the application, and keeping track of its status. The application master requests containers from the node manager by posting a Container Launch Context (CLC), and it periodically reports the application's health to the resource manager.

· Container: It is a collection of physical resources, i.e. CPU, memory (RAM), cores, and disk space, on a single node. Containers are invoked via the Container Launch Context (CLC), a record that comprises information such as security tokens, environment variables, and dependencies.
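As a concrete illustration of talking to the resource manager, here is a minimal sketch using the YarnClient API from org.apache.hadoop.yarn.client.api; it only queries cluster state, and it assumes a reachable cluster whose resource manager address is picked up from the usual yarn-site.xml configuration on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterSketch {
    public static void main(String[] args) throws Exception {
        // Reads the resource manager address etc. from yarn-site.xml.
        Configuration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the resource manager how many node managers are registered.
        int nodeManagers = yarnClient.getYarnClusterMetrics().getNumNodeManagers();
        System.out.println("Node managers in cluster: " + nodeManagers);

        // List the applications the resource manager currently knows about.
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId() + " -> " + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}
```

Note that the client only ever talks to the resource manager here; the node managers' heartbeats are what keep the metrics and application reports up to date.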

For more details about YARN, you can refer to the official Apache Hadoop YARN web page: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

YARN Features:

YARN earned recognition because of the following characteristics:

· Scalability: The YARN architecture permits Hadoop to extend to and manage hundreds to thousands of nodes and clusters.

· Compatibility: YARN (Hadoop 2.0) supports backward compatibility. It runs existing map-reduce applications without any interruption, making it compatible with Hadoop 1.0.

· Cluster Utilization: YARN supports dynamic allocation of cluster resources in Hadoop, which facilitates optimized cluster usage.

3. MapReduce:

It is a programming paradigm that allows users to process data across the Hadoop cluster. MapReduce consists of Mappers and Reducers, which are the scripts we write when building a MapReduce program. MapReduce is built on two functions (see the word-count sketch after this list):

· Map(): Its main job is to organize or group the data; it also performs sorting and filtering. Map produces key-value pairs as output, which are passed to Reduce() for further processing.

· Reduce(): It takes the output generated by Map() as its input and combines those tuples to produce the output tuples.
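To make the two functions concrete, here is a minimal word-count sketch using Hadoop's Java MapReduce API, closely following the style of the official tutorial linked below; the class names are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Map(): split each input line into words and emit a (word, 1) pair per word.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // key-value pair handed to Reduce()
            }
        }
    }

    // Reduce(): combine all counts for the same word into one total.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);    // output tuple: (word, total count)
        }
    }
}
```

A Job driver class that sets the input/output paths and registers these two classes would complete the program; the tutorial linked below walks through that part.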

For more detailed information about Hadoop MapReduce, you may visit the official Apache Hadoop web page: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

Conclusion:

In this article, I have dug deep into the core stack components of the Hadoop ecosystem: HDFS, YARN, and MapReduce. I am sure it has enriched your knowledge of the core stack components. In next week's article, I will discuss the data manipulation stack of the Hadoop ecosystem in detail.

On a winding-up note, feel free to share your comments. Your likes and comments will help me present the content in a better way. See you next week.
