Hadoop

Erandachamith · Sep 13, 2023

Welcome, everyone! This is my first article on Hadoop. I am planning to write a series of blogs about Hadoop, so please follow me to see those articles. Let’s embark on our Hadoop journey from scratch.

Before moving to Hadoop, let's first see what Big Data is.

Big data is a term used to describe data sets that are so large and complex that they are difficult or impossible to process using traditional data processing methods.

Three V’s of Big Data

Volume → Describes the amount of data generated by organizations or individuals.

Velocity → Describes the frequency at which data is generated, captured, and shared.

Variety → Big data can come in a variety of formats, including structured, semi-structured, and unstructured data.

Big data can be used to solve a variety of problems, including:

Predictive analytics → Big data can be used to predict future events, such as customer churn or fraud.

Fraud detection → Big data can be used to detect fraudulent activity, such as credit card fraud or insurance fraud.

Recommendation engines → Big data can be used to recommend products or services to customers, such as what products to show on an e-commerce website or what movies to recommend on a streaming service.

Natural language processing → Big data can be used to understand natural language, such as text or speech. This can be used for tasks such as machine translation or sentiment analysis.

Healthcare → Big data can be used to improve healthcare by, for example, identifying patients at risk of disease or developing new treatments.

Big data is a rapidly growing field, and new applications for big data are being developed all the time. As the amount of data continues to grow, big data is becoming increasingly important for businesses and organizations of all sizes.

Why Do We Analyze Big Data?

Effective analysis of big data provides many business advantages, as organizations learn which areas to focus on and which areas are less important.

Why can't we use traditional systems to analyze Big Data?

1 ) Data Size → When the data size becomes huge, it is hard to analyze using traditional systems.

2 ) Unstructured Data → An RDBMS can't categorize unstructured data.

3 ) Growth Rate → An RDBMS is built for steady data retention rather than rapid growth.

Great! Now you know what big data is, why we need to analyze it, and why traditional methods are not suitable for analyzing it. Let's discuss Hadoop.

What is Hadoop?

Hadoop is an open-source, Java-based software framework and parallel data processing engine. It enables a big data analysis task to be broken down into smaller tasks that can be performed in parallel using an algorithm such as MapReduce. It stores and processes large datasets ranging in size from gigabytes to petabytes. Instead of using one large computer to store and process the data, Hadoop clusters multiple computers together to analyze massive datasets in parallel more quickly.

Hadoop consists of four main modules:

1 ) Hadoop Distributed File System (HDFS) → A distributed file system that can run on standard or low-end hardware (e.g., dual-core machines with 8 GB of RAM). HDFS provides better data throughput than traditional file systems, in addition to high fault tolerance and native support for large datasets.

2 ) Yet Another Resource Negotiator (YARN) → Manages and monitors cluster nodes and resource usage, and schedules jobs and tasks.

3 ) MapReduce → A framework that helps programs perform parallel computations on data. The map task breaks the input data into smaller chunks and processes each chunk independently. The reduce task then combines the results of the map tasks and produces the final output (see the sketch after this list).

4 ) Hadoop Common → Provides the common Java libraries that can be used across all modules.
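To make the MapReduce module concrete, here is a minimal word-count sketch in Java, closely modeled on the classic WordCount example from the Hadoop documentation. It assumes the input and output HDFS paths are passed as command-line arguments; treat it as a sketch rather than production code.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: split each input line into words and emit (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce task: sum the counts emitted for each word across all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar (the name wordcount.jar here is just an example), it would typically be submitted with hadoop jar wordcount.jar WordCount /input /output, where /input holds text files in HDFS and /output does not yet exist.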

Why Hadoop and Its Use Cases?

Features of Hadoop

Hadoop can run applications on clusters of commodity hardware. Thanks to its distributed computing model, Hadoop can handle practically limitless amounts of data.

Flexibility of Hadoop

Hadoop can process structured and unstructured data. Hadoop runs on multiple operating systems, such as Windows and Linux. Hadoop is a scalable platform; new nodes can be added and faulty nodes removed very easily.

Use cases

Facebook → Hadoop helps Facebook store all its profiles along with related data such as posts, comments, videos, images, and so on.

The Hadoop Ecosystem

HDFS → Spreads the data across the cluster's nodes (see the HDFS client sketch after this list).

MapReduce → The core programming model for Hadoop.

Hadoop Streaming → A utility that enables MapReduce code to be written in any language: C, Python, C++, etc.

Hive and Hue → Hive converts SQL-like queries into MapReduce jobs; Hue provides a web interface for working with Hive.

Pig → A high-level programming environment for MapReduce coding. Pig's language is called Pig Latin.

Sqoop → Provides bidirectional data transfer between Hadoop and relational databases (RDBMS).

Oozie → Manages Hadoop workflows.

HBase → A scalable key-value store. It is not a relational database.

Flume → A real-time loader for streaming data into Hadoop. It stores data in HDFS or HBase.

Mahout → Used for predictive analytics and other advanced analytics.

Fuse → Makes HDFS look like a regular file system.

ZooKeeper → Manages synchronization for the cluster.
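Two of these components are easy to illustrate with short, hedged Java sketches. First, a minimal HDFS client using Hadoop's FileSystem API. The NameNode address hdfs://localhost:9000 and the path /demo/hello.txt are assumptions for this example; in practice the fs.defaultFS setting comes from your core-site.xml.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address; normally read from core-site.xml.
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

    Path file = new Path("/demo/hello.txt"); // assumed example path

    // Write a small file; HDFS splits it into blocks and replicates them
    // across the cluster's DataNodes behind the scenes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back and print it to stdout.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }

    fs.close();
  }
}
```

Second, a query against Hive over JDBC. The HiveServer2 address, the hive user, and the words table are hypothetical, and the hive-jdbc driver must be on the classpath. The point is that Hive compiles the SQL into jobs (classically MapReduce) that run on the cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // Register the Hive JDBC driver (requires hive-jdbc on the classpath).
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Assumed HiveServer2 address and database; adjust for your cluster.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement();
         // Hypothetical table; Hive turns this SQL into cluster jobs.
         ResultSet rs = stmt.executeQuery(
             "SELECT word, count(*) FROM words GROUP BY word")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
```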

Now it is time to discuss structured, unstructured, and semi-structured data.

Structured data

The term structured data refers to any data that conforms to a certain format or schema.

Ex → A popular example of structured data is a spreadsheet. In a spreadsheet there are usually clearly labeled rows and columns, and the information within those rows and columns follows a certain format.

Unstructured data

Information that lacks a predefined format.

Ex → Documents, images, audio files, and social media posts.

Semi-structured data

Semi-structured data fits somewhere between structured and unstructured data. It does not reside in tables, but it has some organization.

Ex → HTML code

Relationship between Hadoop and Big Data

Types of big data projects

1 ) On-premises data center: A company owns and operates its own data center, which is located on its premises. This gives the company full control over the infrastructure, but it also requires a significant upfront investment and ongoing maintenance costs.

2 ) Cloud-based data center: A company leases data center resources from a third-party provider. This is a more cost-effective option than owning an on-premises data center, but it gives the company less control over the infrastructure.

Figure: A summary of on-premises vs. cloud-based data centers.

What is a Cluster Environment?

A cluster environment is a group of multiple server instances spanning more than one node, where all running nodes have an identical configuration. All instances in a cluster work together to provide high availability, reliability, and scalability. In a cluster, each computer is referred to as a node, and each node can perform different tasks.

What is a Hadoop Cluster?

A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform these kinds of parallel computations on big data sets. Unlike other computer clusters, Hadoop clusters are designed specifically to store and analyze massive amounts of structured and unstructured data in a distributed computing environment.

This is the end of my first article on Hadoop; see you in the next article, which will cover Hadoop installation and configuration.

Thank you!
