Hadoop Architecture? Big Data?

R RAMYA
7 min read · Apr 23, 2022


Before going to Hadoop, let's first ask: what is Big Data? And is there any connection between Hadoop and Big Data? Let's find out.

Big Data

→ Big data is a collection of data that is huge in volume. By analyzing it, you can assess production, customer feedback, returns, and other factors to reduce outages and anticipate future demand.

→ Big data can also be used to improve decision-making in line with current market demand.

→ Big data is a concept, not a particular piece of software.


Types Of Big Data

1. Structured

Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data. A relational database table, with its fixed rows and columns, is the classic example.

2. Unstructured

Any data with an unknown form or structure is classified as unstructured data. Images, videos, and free-form text are common examples.

3. Semi-structured

Semi-structured data contains elements of both forms: it carries named fields or tags, but without a rigid schema. JSON and XML documents are typical examples.
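To make the three types concrete, here is a small Java sketch (the names and fields are hypothetical) contrasting a fixed-schema record, a flexible key-value document, and raw bytes:

```java
import java.util.HashMap;
import java.util.Map;

public class DataShapes {
    // Structured: every record has the same fixed fields and types,
    // like a row in a relational table.
    record Customer(int id, String name, String city) {}

    public static void main(String[] args) {
        Customer structured = new Customer(1, "Asha", "Chennai");

        // Semi-structured: records carry named fields, but the set of
        // fields can differ from record to record (as in JSON or XML).
        Map<String, Object> semiStructured = new HashMap<>();
        semiStructured.put("id", 2);
        semiStructured.put("name", "Ravi");
        semiStructured.put("lastLogin", "2022-04-20T10:15:00Z"); // an extra field

        // Unstructured: raw bytes with no declared fields at all,
        // e.g. an image, an audio clip, or free-form text.
        byte[] unstructured = "any free-form text or media".getBytes();

        System.out.println(structured + " | " + semiStructured
                + " | " + unstructured.length + " raw bytes");
    }
}
```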

Characteristics Of Big Data

Big data can be described by the following characteristics:

Volume:

→ The size and amounts of big data that companies manage and analyze

Variety:

→ The diversity and range of different data types, including unstructured data, semi-structured data and raw data

Value:

→ The value of big data usually comes from insight discovery and pattern recognition that lead to more effective operations, stronger customer relationships and other clear and quantifiable business benefits

Velocity:

→ The speed at which companies receive, store and manage data

Veracity:

→ The “truth” or accuracy of data and information assets, which often determines executive-level confidence

Note:

As we know, Facebook has about 2.9 billion users, while a typical university portal serves only around 7.5 lakh students.

So why does the university portal go down under load, while the Facebook application doesn't?

Facebook is cluster-oriented.

→ Data is divided into groups in such a way that objects in each group share more similarity with one another than with objects in other groups, and these groups are distributed across a cluster of machines.

OLTP is used.


A university portal, by contrast, is built on a plain client-server relationship.

→ Client-server denotes a relationship between cooperating programs in an application: clients initiate requests for services, and servers provide that function or service.

ETL is used.
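To make the client-server idea concrete, here is a minimal, hypothetical Java sketch: a toy server bound to port 9090 answers a single request from a client. A real portal would of course use a web server and handle many concurrent connections.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class MiniClientServer {
    public static void main(String[] args) throws Exception {
        ServerSocket listener = new ServerSocket(9090); // bind before the client connects

        // Server: waits for one request and answers it (single-threaded for brevity).
        Thread server = new Thread(() -> {
            try (Socket conn = listener.accept();
                 BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
                 PrintWriter out = new PrintWriter(conn.getOutputStream(), true)) {
                out.println("grades for " + in.readLine()); // provide the service
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        server.start();

        // Client: initiates the request and waits for the server's reply.
        try (Socket socket = new Socket("localhost", 9090);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {
            out.println("student-123");
            System.out.println(in.readLine()); // prints: grades for student-123
        }
        server.join();
        listener.close();
    }
}
```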

Top Companies that use Big Data


Below are some companies that use many different software components to run a single application:

• Google

• Amazon

• Facebook

• Twitter

• IRCTC


How Does Big Data Help in Decision Making?

1. Real-time Data to Improve Customer Engagement and Retention

2. Enhanced Operational Efficiency

3. Increased Capacity Without Extra Investment


→ One way big data and business analytics help improve decision making is by identifying patterns.

→ Identifying problems and providing data to back up the solution is valuable because you can then track whether the solution is solving the problem, improving the situation, or having an insignificant effect.

OLTP

OLTP, or online transactional processing, enables the real-time execution of large numbers of database transactions by large numbers of people, typically over the internet.


→ A database transaction is a change, insertion, deletion, or query of data in a database.

→ OLTP (online transactional processing) enables the rapid, accurate data processing behind ATMs and online banking, cash registers and ecommerce, and scores of other services we interact with each day.
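As an illustration, here is a minimal OLTP-style transaction sketch in plain JDBC. The connection URL, credentials, and the accounts/withdrawals tables are all hypothetical; the point is that the debit and its log record either both commit or neither does.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class AtmWithdrawal {
    public static void main(String[] args) throws SQLException {
        // Hypothetical JDBC URL, credentials, and schema; substitute your own.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/bank", "app", "secret")) {
            conn.setAutoCommit(false); // group the statements into one transaction
            try (PreparedStatement debit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement log = conn.prepareStatement(
                     "INSERT INTO withdrawals (account_id, amount) VALUES (?, ?)")) {
                debit.setInt(1, 100);
                debit.setInt(2, 42);
                debit.executeUpdate();   // change the balance...
                log.setInt(1, 42);
                log.setInt(2, 100);
                log.executeUpdate();     // ...and record the withdrawal
                conn.commit();           // both changes become visible together
            } catch (SQLException e) {
                conn.rollback();         // on any failure, neither change is applied
                throw e;
            }
        }
    }
}
```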


Examples of OLTP systems

  • ATMs (the classic, most often-cited example) and online banking applications
  • Credit card payment processing (both online and in-store)
  • Order entry (retail and back-office)
  • Online bookings (ticketing, reservation systems, etc.)
  • Record keeping (including health records, inventory control, production scheduling, claims processing, customer service ticketing, and many other applications)

What is ETL?


ETL (Extract, Transform, Load) is the process of extracting data from disparate sources, transforming it into a clean and analysis-ready format, and loading it into a data warehouse for analysis.


Data Warehouse: to bring data from all these sources into one common dataset, we use a data warehouse.
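Here is a minimal sketch of the three ETL steps in plain Java. The sales_raw.csv input is hypothetical, and the "load" step just writes a staging file; a real pipeline would load the cleaned rows into warehouse tables.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class MiniEtl {
    public static void main(String[] args) throws IOException {
        // EXTRACT: read raw rows from a hypothetical CSV export ("name,city,amount").
        List<String> raw = Files.readAllLines(Path.of("sales_raw.csv"));

        // TRANSFORM: drop blank lines, trim fields, normalize city names
        // to upper case (assumes well-formed three-field rows).
        List<String> clean = raw.stream()
                .filter(line -> !line.isBlank())
                .map(line -> {
                    String[] f = line.split(",");
                    return f[0].trim() + "," + f[1].trim().toUpperCase() + "," + f[2].trim();
                })
                .collect(Collectors.toList());

        // LOAD: write the cleaned rows to a staging file; a real pipeline
        // would insert them into warehouse tables instead.
        Files.write(Path.of("sales_clean.csv"), clean);
    }
}
```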

Hope you now have some idea of big data…

Now let's move on to an interesting topic: Hadoop…

What is Hadoop?

Hadoop is an open-source framework from Apache, used to store, process, and analyze data that is very huge in volume.

Hadoop is written in Java and is not OLAP (online analytical processing).

It is used for batch/offline processing.

It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster.


Why is Hadoop important?

Big Data brings challenges of storing and processing data at scale, and Hadoop, with its modules, is one of the solutions to those challenges.

Modules of Hadoop

1. HDFS:

→ Hadoop Distributed File System. Google published its GFS (Google File System) paper, and HDFS was developed on the basis of that design. Files are broken into blocks and stored across nodes in the distributed architecture.

2. YARN:

→ Yet Another Resource Negotiator (YARN) is used for job scheduling and for managing the cluster's resources.

3. MapReduce:

→ This is a framework that helps Java programs do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a dataset that can be computed over as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the Reducer gives the desired result (see the WordCount sketch after this list).

4. Hadoop Common:

→ These Java libraries are used to start Hadoop and are used by other Hadoop modules.
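To see the Map and Reduce tasks in code, here is the classic WordCount example written against the Hadoop MapReduce API (a minimal sketch; the input and output paths come from the command line):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce task: sum the counts for each word key.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, it runs with something like hadoop jar wordcount.jar WordCount /input /output, where the two paths are HDFS directories of your choosing.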

Hadoop Architecture

The Hadoop architecture packages together the MapReduce engine and HDFS (the Hadoop Distributed File System).

→ A Hadoop cluster consists of a single master and multiple slave nodes.

→ The master node includes the NameNode and JobTracker, whereas each slave node includes a DataNode and TaskTracker.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop.

→ It contains a master/slave architecture.

→ In this architecture, a single NameNode performs the role of master, and multiple DataNodes perform the role of slaves.

→ Both the NameNode and DataNodes can run on commodity machines. HDFS is developed in the Java language.
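As a small taste of that Java API, here is a hedged sketch that writes a file to HDFS and reads it back through org.apache.hadoop.fs.FileSystem. It assumes a reachable cluster whose address (fs.defaultFS) is available via core-site.xml on the classpath; the path is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath,
        // so this assumes a reachable HDFS cluster.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/hello.txt"); // hypothetical path

            // Write: the NameNode picks DataNodes; the client streams blocks to them.
            try (FSDataOutputStream out = fs.create(file, true)) { // overwrite if present
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode supplies block locations; data comes from DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
    }
}
```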

NameNode

  • It is the single master server in the HDFS cluster.
  • Because it is a single node, it can become a single point of failure.
  • It manages the file system namespace by executing operations such as opening, renaming, and closing files.
  • Its single-master design simplifies the architecture of the system.

DataNode

  • The HDFS cluster contains multiple DataNodes.
  • Each DataNode contains multiple data blocks.
  • These data blocks are used to store data.
  • DataNodes serve read and write requests from the file system’s clients.
  • It performs block creation, deletion, and replication upon instruction from the NameNode.

Job Tracker

  • The role of the JobTracker is to accept MapReduce jobs from clients and to process the data, using the NameNode to locate it.
  • In response, the NameNode provides metadata to the JobTracker.

Task Tracker

  • It works as a slave node to the JobTracker.
  • It receives tasks and code from the JobTracker and applies that code to the file. This process can also be called a Mapper.

MapReduce Layer

→ MapReduce processing begins when a client application submits a MapReduce job to the JobTracker. In response, the JobTracker sends the request to the appropriate TaskTrackers.

→ Sometimes a TaskTracker fails or times out. In such a case, that part of the job is rescheduled.


End of the Blog.

Hope you now have some idea of what Hadoop and big data are.

Catch you in the next blog with more interesting tech concepts.

Have a great day !!!

Always with love …🎈

Ramya …❤
