Have You Learned about Hadoop Architecture?

Vaishali S
9 min read · Apr 25, 2022


Before learning about the Hadoop architecture, let's first learn about Big Data…

What is Big Data?

Big data is a combination of structured, semi-structured and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modeling and other advanced analytics applications.

Systems that process and store big data have become a common component of data management architectures in organizations, combined with tools that support big data analytics use cases.

Big data characteristics:

The 3 V's of Big Data

  • the large volume of data in many environments;
  • the wide variety of data types frequently stored in big data systems; and
  • the velocity at which much of the data is generated, collected and processed.

These characteristics were first identified in 2001 by Doug Laney, then an analyst at consulting firm Meta Group Inc.; Gartner further popularized them after it acquired Meta Group in 2005. More V's have since been added to some descriptions of big data, including veracity, value and variability.

Although big data doesn't equate to any specific volume of data, big data deployments often involve terabytes, petabytes and even exabytes of data created and collected over time.

Why is big data important?

Companies use big data in their systems to improve operations, provide better customer service, create personalized marketing campaigns and take other actions that, ultimately, can increase revenue and profits. Businesses that use it effectively hold a potential competitive advantage over those that don’t because they’re able to make faster and more informed business decisions.

For example, big data provides valuable insights into customers that companies can use to refine their marketing, advertising and promotions in order to increase customer engagement and conversion rates.

Here are some more examples of how big data is used by organizations:

  • In the energy industry, big data helps oil and gas companies identify potential drilling locations and monitor pipeline operations; likewise, utilities use it to track electrical grids.
  • Financial services firms use big data systems for risk management and real-time analysis of market data.
  • Manufacturers and transportation companies rely on big data to manage their supply chains and optimize delivery routes.
  • Governments use big data for emergency response, crime prevention and smart city initiatives.

Types of Big Data

Structured Data:

Structured data is data that resides in a fixed field within a record. It is bound by a certain schema, so all records have the same set of properties. Structured data is also called relational data.

Examples of structured data include numbers, dates, strings, etc.

Semi-Structured Data:

Semi-structured data is not bound by any rigid schema for data storage and handling. The data is not in the relational format and is not neatly organized into rows and columns like that in a spreadsheet.

Unstructured Data:

Unstructured data is the kind of data that doesn’t adhere to any definite schema or set of rules. Its arrangement is unplanned and haphazard. Photos, videos, text documents, and log files can be generally considered unstructured data.
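To make the distinction concrete, here is a small illustrative Java sketch (the sample values and names are invented for this example): a fixed-schema record for structured data, a JSON document for semi-structured data, and a raw log line for unstructured data.

```java
// Illustrative only: the sample values below are invented for this example.
public class DataTypesExample {

    // Structured: every record has the same fixed fields (a rigid schema),
    // just like a row in a relational table.
    record Customer(int id, String name, String signupDate) {}

    public static void main(String[] args) {
        Customer structured = new Customer(101, "Asha", "2022-04-25");

        // Semi-structured: self-describing JSON; fields can vary per document
        // and the data is not organized into relational rows and columns.
        String semiStructured = """
                {"id": 101, "name": "Asha", "tags": ["premium", "mobile"]}
                """;

        // Unstructured: free-form text (a log line); no schema at all.
        String unstructured = "2022-04-25 10:31:02 INFO user=Asha action=login ok";

        System.out.println(structured);
        System.out.println(semiStructured.strip());
        System.out.println(unstructured);
    }
}
```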

We all know that Facebook, WhatsApp and Twitter serve billions of users. So why does a university portal go down under far less load?

Facebook, WhatsApp and Twitter are cluster oriented: the workload is spread across a group of machines that work together as a single system. They rely on OLTP.

A university portal, on the other hand, follows a client-server model, in which a single server provides resources or services to one or more clients. ETL is used there.

Let’s look more about OLTP and ETL…

Online Transaction Processing (OLTP)

OLTP is an operational system that supports transaction-oriented applications, typically in a 3-tier architecture. It administers the day-to-day transactions of an organization. OLTP focuses on query processing and on maintaining data integrity in multi-access environments, and its effectiveness is measured by the total number of transactions it can handle per second.

Characteristics

  • High availability: OLTP systems have strict availability requirements and integrate well with the high-availability features that database servers such as SQL Server provide.
  • Data storage: Data in OLTP systems is stored at the individual transaction level.
  • Normalized database design: OLTP systems use a normalized database design.
  • Useful for small transactions: OLTP works on small amounts of data at a time and is suited to short, frequent transactions.
  • Low response time: OLTP systems have low response times, which is essential for interactive use.

Architecture of OLTP

[Diagram: the 3-tier OLTP architecture]

Example of OLTP Transaction

An example of an OLTP system is an ATM network. Assume a couple has a joint account with a bank. One day both reach different ATMs at precisely the same time and each wants to withdraw the total amount present in the account.

However, the person who completes the authentication process first will get the money. In this case, the OLTP system makes sure that the amount withdrawn can never exceed the amount present in the account. The key point is that OLTP systems are optimized for transactional correctness rather than for data analysis.
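A minimal sketch of how such a withdrawal might be protected in code, assuming a JDBC connection to a hypothetical `accounts` table (the URL, credentials and schema are invented for illustration, and a PostgreSQL driver is assumed on the classpath): the balance check and the debit happen inside one transaction, so two concurrent withdrawals cannot both succeed against the same funds.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class AtmWithdrawal {

    // Hypothetical JDBC URL and credentials, for illustration only.
    static final String URL = "jdbc:postgresql://localhost:5432/bank";

    public static boolean withdraw(long accountId, double amount) throws SQLException {
        try (Connection conn = DriverManager.getConnection(URL, "bank_user", "secret")) {
            conn.setAutoCommit(false);           // start an explicit transaction
            try {
                // Lock the row so a concurrent withdrawal has to wait its turn.
                PreparedStatement check = conn.prepareStatement(
                        "SELECT balance FROM accounts WHERE id = ? FOR UPDATE");
                check.setLong(1, accountId);
                ResultSet rs = check.executeQuery();
                if (!rs.next() || rs.getDouble("balance") < amount) {
                    conn.rollback();             // not enough money: undo everything
                    return false;
                }
                PreparedStatement debit = conn.prepareStatement(
                        "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                debit.setDouble(1, amount);
                debit.setLong(2, accountId);
                debit.executeUpdate();
                conn.commit();                   // both steps succeed or neither does
                return true;
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
```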

Other examples of OLTP systems are:

  • Online banking
  • Online airline ticket booking
  • Sending a text message
  • Order entry
  • Adding a book to a shopping cart

What is ETL?

In the world of data warehousing, if you need to bring data from multiple different data sources into one, centralized database, you must first:

  • EXTRACT data from its original source
  • TRANSFORM data by deduplicating it, combining it, and ensuring quality, to then
  • LOAD data into the target database

ETL tools enable data integration strategies by allowing companies to gather data from multiple data sources and consolidate it into a single, centralized location. ETL tools also make it possible for different types of data to work together.
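Here is a minimal ETL sketch in plain Java, assuming a hypothetical customers.csv source file and a cleaned output file standing in for the centralized target database; a real pipeline would use a dedicated ETL tool, but the three steps are the same.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SimpleEtl {
    public static void main(String[] args) throws IOException {
        // EXTRACT: read raw rows from the source (the file path is illustrative).
        List<String> rawRows = Files.readAllLines(Path.of("customers.csv"));

        // TRANSFORM: trim whitespace, drop empty rows, lower-case values and
        // deduplicate (a stand-in for "ensuring quality" and "combining").
        Set<String> cleaned = rawRows.stream()
                .map(String::trim)
                .filter(row -> !row.isEmpty())
                .map(String::toLowerCase)
                .collect(Collectors.toCollection(LinkedHashSet::new));

        // LOAD: write the cleaned rows to the target location.
        Files.write(Path.of("customers_clean.csv"), cleaned);
    }
}
```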

I hope you now have an idea about Big Data. Let's look into Hadoop…

What is Hadoop?

Hadoop is an open source framework from Apache used to store, process and analyze data that is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing); it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster.

History of Hadoop

Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper, published by Google.

Hadoop Architecture in Detail — HDFS, MapReduce & YARN

Hadoop has a master-slave topology. In this topology, we have one master node and multiple slave nodes. The master node's function is to assign tasks to the various slave nodes and to manage resources. The slave nodes do the actual computing. Slave nodes store the real data, whereas the master stores the metadata.

Hadoop Application Architecture in Detail

Hadoop Architecture comprises three major layers. They are:-

  • HDFS (Hadoop Distributed File System)
  • YARN
  • MapReduce

1. HDFS

HDFS stands for Hadoop Distributed File System. It provides data storage for Hadoop. HDFS splits each data unit into smaller units called blocks and stores them in a distributed manner. It has two daemons running: NameNode on the master node and DataNode on the slave nodes.

a) NameNode and DataNode

HDFS has a master-slave architecture. The daemon called NameNode runs on the master server. It is responsible for namespace management and regulates file access by clients.

The DataNode daemon runs on the slave nodes. It is responsible for storing the actual business data. Internally, a file gets split into a number of data blocks, which are stored on a group of slave machines.

The NameNode manages modifications to the file system namespace, such as opening, closing and renaming files or directories. The NameNode also keeps track of the mapping of blocks to DataNodes.

DataNodes serve read/write requests from the file system's clients. A DataNode also creates, deletes and replicates blocks on demand from the NameNode.

b) Block in HDFS

A block is nothing but the smallest unit of storage on a computer system; it is the smallest contiguous storage allocated to a file. In Hadoop, the default block size is 128 MB, and it is commonly configured to 256 MB.

c) Replication Management

To provide fault tolerance, HDFS uses a replication technique: it makes copies of the blocks and stores them on different DataNodes. The replication factor decides how many copies of each block get stored. It is 3 by default, but we can configure it to any value.
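As a quick sketch of how a client interacts with HDFS, here is a small program using the HDFS Java client API (the NameNode address and file path are assumptions for illustration, and the hadoop-client library is assumed to be on the classpath): it writes a file into HDFS and then asks for a replication factor of 3 for that file.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The NameNode address is an assumption; in practice it usually
        // comes from core-site.xml on the client machine.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // The client talks to the NameNode for metadata only; the bytes
            // themselves are streamed to DataNodes as blocks (128 MB by default).
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Ask HDFS to keep 3 copies of each block of this file.
            fs.setReplication(file, (short) 3);
        }
    }
}
```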

2. MapReduce

MapReduce is the data processing layer of Hadoop. It is a software framework that allows you to write applications that process large amounts of data. MapReduce runs these applications in parallel on a cluster of low-end machines, in a reliable and fault-tolerant manner.

A MapReduce job comprises a number of map tasks and reduce tasks. Each task works on a part of the data. The input file for a MapReduce job lives on HDFS. The InputFormat decides how to split the input file into input splits. An input split is nothing but a byte-oriented view of a chunk of the input file. Each input split is loaded by a map task, which runs on the node where the relevant data is present, so the data does not need to move over the network and is processed locally.
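As a concrete example, here is the classic word-count map and reduce pair written against the Hadoop MapReduce Java API (the standard textbook example, not code from this article): each map task emits (word, 1) for every word in its input split, and the reduce task sums the counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: runs on the node holding the input split and emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sums the counts for each word produced by all map tasks.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```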

3. YARN

YARN (Yet Another Resource Negotiator) takes Hadoop beyond Java-only MapReduce programming and lets other applications such as HBase and Spark work on the cluster. Different YARN applications can co-exist on the same cluster, so MapReduce, HBase and Spark can all run at the same time, bringing great benefits for manageability and cluster utilization.

Components Of YARN

  • Client: Submits MapReduce jobs (see the job-submission sketch after this list).
  • Resource Manager: Manages the use of resources across the cluster.
  • Node Manager: Launches and monitors the compute containers on machines in the cluster.
  • MapReduce Application Master: Coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the Resource Manager and managed by the Node Managers.
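Here is a minimal driver sketch showing how the client role above submits a MapReduce job, reusing the WordCount classes from the earlier sketch (the input and output paths are illustrative). On a YARN cluster, submitting the job hands it to the Resource Manager, which schedules the application master and task containers.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations on HDFS (illustrative paths).
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submitting the job hands it to YARN: the Resource Manager allocates
        // containers, Node Managers launch them, and the application master
        // coordinates the map and reduce tasks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```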

Let's catch up in the next blog… Any questions? Please ping me in the comment section and I'll get back to you :)

