The Hadoop Ecosystem

Heena Rijhwani
Published in Analytics Vidhya · Feb 17, 2021

Hadoop is a Java-based big data framework that fills the gaps left by the traditional approach when data becomes voluminous. It is an open-source framework for storing data and running applications on clusters of commodity hardware, offering massive storage and enormous processing power. It is built on the assumption that hardware failure is common and must be handled by the framework itself. Hadoop splits large files into blocks, distributes them across the nodes of a cluster, and ships code to the data so it can be processed in parallel. Because the data is available locally, less time is spent on inter-process communication and processing is faster. Another important feature of Hadoop is data redundancy, which makes node failures easy to tolerate.

The Hadoop Ecosystem is a suite of services (storing, querying, analyzing, maintaining, and more) that are collectively used to solve big data problems. It includes components like HDFS, Oozie, Zookeeper, Spark, Sqoop, and more. Here we will study some of these components in detail.

1. Hadoop Common- It includes the libraries needed by Hadoop modules.

2. HDFS (Hadoop Distributed File System)- It is a distributed file system modeled after the Google File System (GFS) paper. It is the storage component of Hadoop and provides data replication and fault tolerance. It consists of a NameNode (master), which keeps the file system metadata and records how each file is broken into blocks (typically 64 MB or 128 MB), and DataNodes (slaves), which store the replicated copies of those blocks.
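As a minimal sketch, the HDFS Java client below writes a small file and then asks the NameNode for its block size and replication factor; the path and file contents are purely illustrative, and the reported values come from whatever the cluster is configured with.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Cluster settings come from core-site.xml / hdfs-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; the NameNode records the metadata,
        // the DataNodes store the replicated blocks
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");
        }

        // Inspect the block size and replication factor chosen for the file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size: " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());
    }
}
```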

3. MapReduce- It is a distributed computing framework modeled after Google's MapReduce paper. It is the processing component of Hadoop and allows work to be parallelized over large amounts of raw data. A job is decomposed into map and reduce tasks: Map performs sorting and filtering of the data, while Reduce aggregates the mapped data. Each worker node runs a TaskTracker that executes these Map and Reduce tasks. Map tasks convert the input elements into key-value pairs, which are then grouped by key and processed by the reduce function to produce the final output of zero or more key-value pairs.
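To make this concrete, the classic word-count example below follows exactly this pattern: the Mapper emits a (word, 1) pair for every word it sees, the framework groups the pairs by key, and the Reducer sums the counts. The class names are just for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, 1) for every word in a line of input
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // key-value pair sent to the shuffle
            }
        }
    }
}

// Reduce: all values for the same word arrive grouped together; sum them
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // final (word, count) pair
    }
}
```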

How is a MapReduce-based program executed?

The user program creates a master controller process using the fork system call, and also forks a number of worker processes. The master creates map and reduce tasks, assigns them to the workers, and keeps track of their progress. Every map task writes an intermediate file on its local node for each of the reduce tasks, and these files are passed to the reduce tasks as input. The outputs of all the reduce tasks together form the final output of the MapReduce program.
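In practice, the user-facing part of this flow is a small driver program that describes the job and submits it to the cluster. The hedged sketch below wires up the word-count Mapper and Reducer from the previous section; the input and output paths are taken from the command line and are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The framework launches map tasks (one per input split) and reduce tasks
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS (illustrative)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit to the cluster and wait; exit non-zero on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```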

4. Hadoop YARN (Yet Another Resource Negotiator)- It is a resource management platform that manages computing resources in the cluster and uses them to schedule users' applications. It consists of the ResourceManager, NodeManagers, and per-application ApplicationMasters.
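As a small illustration (not part of the original article), YARN's Java client can ask the ResourceManager what resources each NodeManager is reporting. The sketch below assumes a reachable cluster whose address is configured in yarn-site.xml on the classpath.

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        // Connect to the ResourceManager configured in yarn-site.xml
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Each NodeManager reports its identity and available memory/vcores
        for (NodeReport node : yarn.getNodeReports()) {
            System.out.println(node.getNodeId() + " -> " + node.getCapability());
        }
        yarn.stop();
    }
}
```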

5. Hive- It is a Hadoop ecosystem component used for querying and analyzing large datasets stored in the Hadoop system. It is used for data summarization, querying, and analysis. It uses HiveQL (Hive Query Language), an SQL-like language whose queries are translated into MapReduce jobs. Let us have a look at how it works.

The Hive Web UI or Hive Command Line sends a query to the Driver for execution. The Driver checks the syntax and the requirements of the query with the help of the Compiler. The Compiler sends a metadata request to the Metastore, which returns the required metadata. The Compiler then sends the plan to the Driver, which passes it on to the Execution Engine, followed by the JobTracker and TaskTracker. The Execution Engine receives the results from the data nodes and sends them to the Driver, and finally the Driver returns them to the interfaces. Hive data models consist of databases, tables, partitions, and buckets.
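For illustration, a HiveQL query can also be submitted programmatically to HiveServer2 over JDBC; behind the scenes Hive compiles it into MapReduce (or Tez/Spark) jobs. The connection URL, credentials, and the sales table below are assumptions made up for this example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Older driver versions may need explicit registration
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 JDBC endpoint; host, port, user and table are illustrative
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL query: summarize rows per category
            ResultSet rs = stmt.executeQuery(
                "SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString("category") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```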

6. HBase- It is a NoSQL database responsible for managing distributed datasets and is designed to store structured data in tables. It is a scalable, distributed database that provides real-time read and write access to data stored in HDFS. HBase has two main components: the HBase Master and the RegionServers.

You might wonder how it is different from a relational database. Here is how:

HBase is column-oriented, unlike an RDBMS, which is row-oriented. It is schema-less, whereas an RDBMS uses a fixed schema. Moreover, it holds de-normalized data and is suited to Online Analytical Processing (OLAP), in contrast to an RDBMS, which is suited to Online Transaction Processing (OLTP).

The HBase data model consists of tables, rows identified by row keys, column families comprising one or more columns, cells, and multiple versions of the data.
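The sketch below shows the HBase Java client writing and reading a single cell in real time. The users table, its info column family, the row key, and the stored value are hypothetical and assume the table has already been created.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Connects via hbase-site.xml on the classpath; ZooKeeper locates the RegionServers
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key -> column family "info", qualifier "email"
            Put put = new Put(Bytes.toBytes("user-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("alice@example.com"));
            table.put(put);

            // Read the same cell back
            Result result = table.get(new Get(Bytes.toBytes("user-001")));
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}
```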

7. Pig- It was developed by Yahoo and uses Pig Latin, a query language similar to SQL. It is a platform for structuring data flows and for processing and analyzing huge data sets. Pig executes these commands, and in the background all of the MapReduce activity is taken care of. After processing, Pig stores the result in HDFS. It offers extensibility, handling of all data types, optimization, and more.
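As a rough illustration, Pig Latin statements can be run from the Grunt shell or embedded in Java through the PigServer API, as sketched below. The input path, schema, aliases, and output location are assumptions for the example; Pig compiles the statements into MapReduce jobs behind the scenes.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Local mode for a quick test; ExecType.MAPREDUCE runs on the cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin: load, group, and aggregate (paths and schema are illustrative)
        pig.registerQuery("logs = LOAD '/data/access_logs' AS (user:chararray, bytes:long);");
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH by_user GENERATE group AS user, SUM(logs.bytes) AS total;");

        // Pig turns these statements into MapReduce jobs and stores the result
        pig.store("totals", "/data/output/bytes_per_user");
        pig.shutdown();
    }
}
```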

8. Zookeeper- Apache ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization. It can also provide group services. Large clusters of machines can be managed through ZooKeeper.
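For illustration, the ZooKeeper Java client below stores a small piece of shared configuration in a znode and reads it back, which is the kind of coordination other ecosystem components (such as HBase) rely on. The connection string, znode path, and value are assumptions for this sketch.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to the ZooKeeper ensemble (address is illustrative)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a piece of shared configuration under a znode
        String path = "/config-demo";
        if (zk.exists(path, false) == null) {
            zk.create(path, "on".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any machine in the cluster can read (and watch) the same value
        byte[] data = zk.getData(path, false, null);
        System.out.println("config-demo = " + new String(data));
        zk.close();
    }
}
```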

9. Apache Mahout- Mahout is an open-source framework for building scalable machine learning algorithms and data mining libraries. Once data is stored in HDFS, Mahout provides data science tools that automatically find meaningful patterns in those big data sets. It provides algorithms for clustering, frequent pattern mining, classification, and more.

10. Apache Flume- Flume efficiently collects, aggregates, and moves large amounts of data from its origin into HDFS. It is a fault-tolerant and reliable mechanism that lets data flow from the source into the Hadoop environment. Using Flume, one can ingest data from multiple servers into Hadoop almost immediately.

11. Sqoop- It is a tool for transferring data between Hadoop and RDBMS. It allows importing data from RDBMS to Hadoop and exporting data from Hadoop to RDBMS.

12. Apache Spark- It is designed for fast computation and extends the MapReduce model to cover complex computations and stream processing. It offers high speed and supports multiple languages, including Java and Python. It supports streaming data, machine learning, SQL, and Map and Reduce operations. It can be installed on top of HDFS or run standalone without a prior Hadoop installation.

Its architecture includes Spark SQL, Spark Streaming for streaming analytics, MLlib (the machine learning library), and GraphX, a distributed graph processing framework.
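A minimal sketch of the Spark Java API is shown below, running in local mode (no Hadoop installation needed) and answering a Spark SQL aggregation; the sample data, column name, and view name are made up for the example.

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkExample {
    public static void main(String[] args) {
        // Local mode needs no Hadoop cluster; on a cluster, point master at YARN
        SparkSession spark = SparkSession.builder()
                .appName("spark-sketch")
                .master("local[*]")
                .getOrCreate();

        // Build a small Dataset and run a Spark SQL aggregation over it
        Dataset<Row> df = spark.createDataset(
                Arrays.asList("hadoop", "spark", "hive", "spark"),
                Encoders.STRING()).toDF("tool");

        df.createOrReplaceTempView("tools");
        spark.sql("SELECT tool, COUNT(*) AS cnt FROM tools GROUP BY tool").show();

        spark.stop();
    }
}
```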
