Apache Hadoop & HDFS

Çağatay Tüylü
5 min read · Oct 5, 2021
  • The development of Big Data has created new areas and solutions previously unknown in the world of technology. In recent years, servers have increasingly needed to process, sort, and store enormous amounts of data in near real time. Apache Hadoop was born as a solution to this need: a framework that can easily handle very large data sets.
  • Apache Hadoop is an open-source platform that enables highly reliable, scalable, and distributed processing of large datasets using simple programming models. Running on clusters of commodity hardware, Hadoop provides a cost-effective solution for storing and processing large volumes of structured, semi-structured, and unstructured data with no strict formatting requirements. This makes Hadoop ideal for supporting big data analytics initiatives.
  • Hadoop can ingest emerging data formats (such as audio, video, social media data, and clickstream data) alongside the semi-structured and unstructured data not typically kept in a data warehouse. More comprehensive data enables more accurate analytical decisions and supports new technologies such as artificial intelligence and the Internet of Things.
  • Hadoop helps provide real-time, self-service access for data scientists, line-of-business owners, and developers. Hadoop drives the future of data science as an interdisciplinary field combining machine learning, statistics, advanced analytics, and programming.

What is the impact of Hadoop?

Hadoop was a major development in the big data space. In fact, it is credited with being the foundation for the modern cloud data lake. Hadoop democratized computing power and made it possible for companies to analyze and query big data sets in a scalable manner using free, open-source software and inexpensive, off-the-shelf hardware. This was a significant development because it offered a viable alternative to the proprietary data warehouse (DW) solutions and closed data formats that had ruled the day until then. With the introduction of Hadoop, organizations quickly gained the ability to store and process huge amounts of data, along with increased computing power, fault tolerance, flexibility in data management, lower costs than DWs, and greater scalability (just keep adding more nodes). Ultimately, Hadoop paved the way for later developments in big data analytics, such as Apache Spark™.

Hadoop consists of four main modules.

HDFS (Hadoop Distributed File System)

  • HDFS is Hadoop’s distributed file system, used to store and process big data on clusters of ordinary servers. It provides better data throughput than traditional file systems, and it combines the disks of those ordinary servers into one large virtual disk, which makes it possible to store and process very large files (see the usage sketch below).
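
To make this concrete, here is a minimal sketch of writing and then reading a small file through the HDFS Java API. It is an illustration rather than part of the original article; the namenode address, path, and file contents are placeholder assumptions, and on a real cluster the configuration usually comes from core-site.xml instead.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsQuickstart {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode address; replace with your cluster's fs.defaultFS.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/hello.txt");   // hypothetical path

            // Write a small file; HDFS splits it into blocks and replicates them across datanodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back through the same FileSystem abstraction.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}
```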

YARN (Yet Another Resource Negotiator)

  • The main purpose of YARN is to split the resource management and job scheduling/monitoring functions into separate daemons. The idea is to have a global ResourceManager and a per-application ApplicationMaster.
  • The ResourceManager and the NodeManager together form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all applications in the system. The NodeManager is a per-machine agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network), and reporting it to the ResourceManager/Scheduler.
  • The per-application ApplicationMaster is, in fact, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor tasks (a small client sketch follows this list).
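
As a rough illustration of this split, the sketch below uses the YARN client API to ask the ResourceManager for the NodeManagers it currently tracks. It assumes a yarn-site.xml with the ResourceManager address is available on the classpath; the class name and printed fields are illustrative choices, not anything prescribed by the article.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodeList {
    public static void main(String[] args) throws Exception {
        // Assumes yarn-site.xml (with the ResourceManager address) is on the classpath.
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        try {
            // Ask the ResourceManager for the NodeManagers it currently tracks.
            List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.printf("%s: %d containers, %s used / %s total%n",
                        node.getNodeId(),
                        node.getNumContainers(),
                        node.getUsed(),
                        node.getCapability());
            }
        } finally {
            yarnClient.stop();
        }
    }
}
```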

MapReduce

  • The Hadoop MapReduce module lets programs process data in parallel: tasks are distributed across the cluster and run concurrently. The Map task transforms input data into key-value pairs; the Reduce task takes that intermediate output, aggregates it, and produces the final result. This is where much of Hadoop’s power comes into play. Each node reads the data it processes from its own local disk whenever possible, so network traffic stays low, and performance grows roughly in proportion to the number of nodes in the cluster (the classic word-count example is sketched below).
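
The canonical illustration of the Map and Reduce phases is the word-count job: the mapper emits a (word, 1) pair for every token it sees, and the reducer sums those counts per word. The sketch below follows that standard tutorial pattern; the input and output HDFS paths are supplied on the command line and are assumptions of this example.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: turn each line of input into (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] = input directory in HDFS, args[1] = output directory (must not exist yet).
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```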

Hadoop Common

  • Hadoop Common is the set of shared libraries and utilities that support all of the other Hadoop modules.

What are the challenges with Hadoop architectures?

  • Complexity — Hadoop is a low-level, Java-based framework that can be overly complex and difficult for end-users to work with. Hadoop architectures can also require significant expertise and resources to set up, maintain, and upgrade.
  • Performance — Hadoop uses frequent reads and writes to disk to perform computations, which is time-consuming and inefficient compared to frameworks that aim to store and process data in memory as much as possible, like Apache Spark™.
  • Long-term viability — In 2019, the world saw a massive unraveling within the Hadoop sphere. Google, whose seminal 2004 paper on MapReduce underpinned the creation of Apache Hadoop, stopped using MapReduce altogether, as tweeted by Google SVP of Technical Infrastructure, Urs Hölzle. There were also some very high-profile mergers and acquisitions in the world of Hadoop. Furthermore, in 2020, a leading Hadoop provider shifted its product set away from being Hadoop-centric, as Hadoop is now thought of as “more of a philosophy than a technology.” Lastly, 2021 has been a year of interesting changes. In April 2021, the Apache Software Foundation announced the retirement of ten projects from the Hadoop ecosystem. Then in June 2021, Cloudera agreed to go private; the impact of this decision on Hadoop users is still to be seen. This growing collection of concerns, paired with the accelerated need to digitize, has encouraged many companies to re-evaluate their relationship with Hadoop.

What are the benefits of Hadoop?

  • Scalability — Unlike traditional systems that limit data storage, Hadoop is scalable as it operates in a distributed environment. This allowed data architects to build early data lakes on Hadoop.
  • Resilience — The Hadoop Distributed File System (HDFS) is fundamentally resilient. Data stored on any node of a Hadoop cluster is also replicated on other nodes of the cluster to prepare for the possibility of hardware or software failures. This intentionally redundant design ensures fault tolerance. If one node goes down, there is always a backup of the data available in the cluster (a small sketch of checking and setting a file’s replication factor follows this list).
  • Flexibility — Unlike traditional relational database management systems, when working with Hadoop you can store data in any format, including semi-structured or unstructured formats. Hadoop enables businesses to easily access new data sources and tap into different types of data.
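
To relate the resilience point to the API, here is a minimal sketch, assuming an already-configured cluster and a hypothetical file path, that reads a file’s current replication factor and then requests a higher one. It is illustrative only; the defaults normally come from dfs.replication in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS is configured (e.g. via core-site.xml on the classpath).
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path file = new Path("/tmp/hello.txt");   // hypothetical file

            // Each HDFS block of this file is stored on this many datanodes.
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Current replication factor: " + status.getReplication());

            // Request a higher replication factor for extra fault tolerance;
            // the namenode schedules the additional copies in the background.
            fs.setReplication(file, (short) 3);
        }
    }
}
```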

What are some examples of popular Hadoop-related software?

Other popular packages that are not strictly a part of the core Hadoop modules but that are frequently used in conjunction with them include:

  • Apache Hive is data warehouse software that runs on Hadoop and enables users to work with data in HDFS using a SQL-like query language called HiveQL (see the sketch after this list).
  • Apache Zookeeper is a centralized service for enabling highly reliable distributed processing.
  • Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
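
As a rough sketch of how the Hive bullet above might look in practice, the snippet below submits a HiveQL query to HiveServer2 over JDBC. It assumes the hive-jdbc driver is on the classpath, and the host name, port, credentials, and the clicks table are hypothetical placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 address and database; adjust for your cluster.
        String url = "jdbc:hive2://hiveserver-host:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL but is compiled into jobs that run on the cluster.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page")) {
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }
}
```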
