A Quick Intro to Hadoop

Berselin C R
6 min read · Apr 25, 2022


Hello everyone! In this blog, we will learn about the Hadoop architecture.

Let's get started :)

Before we start with Hadoop, let's get to know Big Data.

Big Data:

→ Big Data is a collection of data that is huge in volume and keeps growing exponentially with time.

→ It is so large and complex that traditional data management tools cannot store and process it.

Example: trying to enter and track a company’s daily transaction records in a spreadsheet quickly becomes unmanageable as the data grows.

Some of the use cases of Big data are:

  1. Companies like Facebook ingest 500+ terabytes of unstructured data almost every day.
  2. Big data is used by businesses to gain valuable insights into their data and improve their marketing campaigns.

What is ETL?

ETL stands for ‘Extract, Transform, and Load’. ETL is the process of moving your data from a source system to a data warehouse. It is one of the most crucial steps in the data analysis process.

Example: a university portal, which follows the client-server model.

Client-server refers to a relationship between cooperating programs in an application, in which clients initiate service requests and servers provide that function or service. Systems like this employ ETL to move their data into a warehouse for analysis.
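To make the three steps concrete, here is a minimal, purely illustrative Java sketch (the file names and cleaning rules are hypothetical): it extracts rows from a source file, transforms them, and loads the result into a staging file.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class TinyEtl {
    public static void main(String[] args) throws Exception {
        // Extract: read raw rows from the source system (hypothetical CSV file)
        List<String> rows = Files.readAllLines(Path.of("transactions.csv"));

        // Transform: drop blank rows and normalise the remaining ones
        List<String> cleaned = rows.stream()
                .filter(row -> !row.isBlank())
                .map(String::trim)
                .map(String::toUpperCase)
                .collect(Collectors.toList());

        // Load: write the cleaned rows to the warehouse staging area (hypothetical file)
        Files.write(Path.of("warehouse_staging.csv"), cleaned);
    }
}
```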

What is OLTP?

OLTP stands for Online Transaction Processing. It is an operational system that supports transaction-oriented applications, typically in a 3-tier architecture, and administers the day-to-day transactions of an organization. OLTP focuses on fast query processing and on maintaining data integrity in multi-access environments; its effectiveness is measured by the number of transactions per second.

Examples of using OLTP include:

  • Online banking
  • Adding items to cart in web shops
  • Booking a ticket
  • Sending a text message
  • Order entry
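What all of these have in common is the transaction: a small group of updates that must either all succeed or all fail, so the data stays consistent even with many concurrent users. Here is a hedged JDBC sketch of that idea, assuming a hypothetical accounts table and made-up connection details:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class TransferExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical JDBC URL and credentials for an OLTP database
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/bank", "app", "secret")) {
            conn.setAutoCommit(false); // group both updates into one transaction
            try (PreparedStatement debit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setLong(1, 100);
                debit.setLong(2, 1);
                credit.setLong(1, 100);
                credit.setLong(2, 2);
                debit.executeUpdate();
                credit.executeUpdate();
                conn.commit();   // both updates are applied together...
            } catch (Exception e) {
                conn.rollback(); // ...or neither is applied
                throw e;
            }
        }
    }
}
```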

On the other hand, we have Facebook, which millions of people can access simultaneously… why?

Because Facebook is cluster-oriented.

Data is divided into groups (clusters of machines) in such a way that objects in each group have more in common than objects in other groups, so each group can serve its share of requests in parallel. OLTP is employed for the individual transactions.

Thus, to process data at this scale, we use Hadoop.

What is Hadoop?

→ Hadoop is an open source framework that is used for storing, processing and analyzing Big Data.

→ Hadoop allows you to store Big Data in a distributed environment, so it can be processed in parallel.

Evolution of Hadoop:

Hadoop’s story begins in 2002 with Doug Cutting and Mike Cafarella, who were both working on the Apache Nutch project at the time. The Apache Nutch project aimed to create a search engine system capable of indexing a billion pages. After years of hard work, Hadoop emerged from that effort in 2006, and it has been updated continuously ever since.

ARCHITECTURE OF HADOOP:

1. HDFS:

The first component is HDFS (Hadoop Distributed File System), the storage layer, which lets you store data in a variety of formats across a cluster.

Each node in a Hadoop cluster has its own disk space, memory, bandwidth, and processing power. Incoming data is split into individual data blocks, which are then saved in the HDFS distributed storage layer. HDFS assumes that every disk drive and slave node in the cluster is unreliable, so as a precaution it stores three copies of each data block throughout the cluster by default. The metadata for each data block and all of its replicas is kept on the HDFS master node (the NameNode).

Thus the main purpose of HDFS is to store data across the nodes of a distributed architecture by splitting it into blocks. The NameNode manages the many DataNodes, maintains data block metadata, and controls client access, while the DataNodes store the data blocks and serve read/write requests.
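As a small, hedged illustration of the client’s view, here is a sketch that writes a file into HDFS with the Java FileSystem API; the client only needs the NameNode’s address, while the blocks themselves end up on DataNodes (the URL and path below are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/sample.txt");    // hypothetical target path

        // HDFS itself splits the file into blocks and replicates them across DataNodes;
        // the client code only writes a stream.
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("hello hdfs\n");
        }
        fs.close();
    }
}
```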

2. YARN:

YARN stands for Yet Another Resource Negotiator.

HDFS is a storage framework, and YARN is a generic job scheduling framework.

In a nutshell, YARN has a master (the Resource Manager) and workers (the Node Managers).

The Resource Manager creates containers on the workers for MapReduce jobs, Spark jobs, and so on.
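As a rough sketch of that master/worker split, a client can ask the ResourceManager for a report on every NodeManager it knows about via the YarnClient API (cluster addresses are assumed to come from the usual yarn-site.xml configuration):

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
        yarn.start();

        // Ask the ResourceManager (master) for a report on each NodeManager (worker)
        for (NodeReport node : yarn.getNodeReports()) {
            System.out.println(node.getNodeId() + " -> " + node.getCapability());
        }
        yarn.stop();
    }
}
```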

3. MAPREDUCE:

MapReduce is a Java-based processing technique and programming model for distributed computing.

The MapReduce algorithm consists of two key tasks: Map and Reduce.

Map converts one set of data into another, where individual elements are broken down into tuples (key/value pairs).

The Reduce task takes the output of a Map as input and merges those data tuples into a smaller set of tuples. The Reduce task is always performed after the Map job, as the name MapReduce implies.

Thus we are able to process huge amounts of data in parallel.
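The classic illustration of this Map → Reduce flow is word count. The sketch below follows the standard Hadoop MapReduce API: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: break each line into words and emit (word, 1) pairs
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```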

4. COMMON UTILITIES:

Common utilities are simply the Java libraries and files that are required by all of the other components in a Hadoop cluster.

HDFS, YARN, and MapReduce all use these utilities to run the cluster. Hadoop Common works on the assumption that hardware failures in a Hadoop cluster are common and must be handled automatically in software by the Hadoop framework.

Daemons in Hadoop:

A daemon is simply a process. Hadoop daemons are the set of processes that run on a Hadoop cluster:

  • NameNode
  • DataNode
  • Secondary Name Node
  • Resource Manager
  • Node Manager

🔱 NameNode:

→ It never stores the actual data contained in the files.

→ It stores DataNode information such as block IDs and the number of blocks.

🔱DataNode:

→ DataNode is a program that runs on the slave systems and serves the clients’ read/write requests.

→ The NameNode always instructs the DataNodes on where to store the data.
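For illustration, a client can also ask the NameNode which DataNodes it is currently managing. Here is a small, hedged sketch using DistributedFileSystem (the NameNode address is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListDataNodes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical NameNode address

        FileSystem fs = FileSystem.get(conf);
        if (fs instanceof DistributedFileSystem) {
            // The NameNode tracks every live DataNode; ask it for their reports
            for (DatanodeInfo node : ((DistributedFileSystem) fs).getDataNodeStats()) {
                System.out.println(node.getHostName()
                        + " capacity=" + node.getCapacity()
                        + " remaining=" + node.getRemaining());
            }
        }
        fs.close();
    }
}
```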

🔱Secondary NameNode:

→ The Secondary NameNode takes periodic (by default, hourly) checkpoints of the NameNode’s metadata.

→ If the NameNode crashes, this checkpoint can be used to restore the metadata and recover the cluster.

🔱Resource Manager:

→ The Resource Manager is also known as the global master daemon; it runs on the master system.

→ It manages resource allocation across the cluster.

It consists of two things:

  1. Application Manager: accepts requests from clients and also negotiates memory resources on the slaves in Hadoop.
  2. Scheduler: manages the resource needs of individual applications.

🔱Node Manager:

→ The Node Manager runs on the slave systems and manages the memory and disk resources within its own node.

→ A single NodeManager daemon runs on each slave node in a Hadoop cluster, and it reports this monitoring data back to the Resource Manager.

Job Tracker:

Manages MapReduce jobs and distributes individual tasks to the machines running a Task Tracker.

Task Tracker:

Responsible for instantiating and monitoring individual Map and Reduce tasks.

That’s it, guys… I guess you now have an idea about Hadoop.

Let me catch you guys in my next blog……

Bye…..^_^

BERSELIN C R

Resources:

https://www.geeksforgeeks.org/hadoop-architecture/
