What is Apache Hadoop?

computerlearning · 3 min read · Oct 4, 2022


Hadoop is a collection of open-source software that makes it possible to store, distribute, and process data and run programs on a set of servers. This framework, written in Java and licensed under the Apache License, is designed to perform distributed processing across thousands of machines with high fault tolerance.

History of Hadoop

As the World Wide Web expanded in the late 1990s and early 2000s, search engines and indexes were created to find relevant information among textual content. To keep up with the growing volume of information, Google looked for a way to increase the speed and efficiency of its servers and invented a distributed storage system called GFS, which stands for Google File System. Inspired by this success, Doug Cutting, a developer in the Apache open-source community, set out to bring the same kind of technology to a wider audience and named the project after his child's toy elephant.

Doug Cutting and his team originally built the tool for Nutch, an Apache open-source search engine. The goal was to return web search results faster by distributing the data and the computation across several computers so that many tasks could run at the same time.

The approach quickly proved useful beyond Nutch, and many large companies such as Yahoo and Facebook later adopted it as well.

In 2006, the project was split: the web-crawling part remained Nutch, while the distributed storage and processing part became Hadoop. In 2008, Yahoo released Hadoop as an open-source project. Today, the Hadoop framework and its ecosystem are managed and maintained by the non-profit Apache Software Foundation (ASF).

Introduction to Hadoop

Imagine that you need to store and analyze far more data than a single server can hold or process, for example the logs and user data of a busy website. Keeping everything on one machine quickly becomes a bottleneck. This is exactly where Apache Hadoop comes into play.

Hadoop spreads the data (text, images, logs, or any other files) over a cluster of ordinary servers and runs the processing on the same machines that store the data. The nodes cooperate over the network, and Hadoop is responsible for keeping that cooperation reliable: data is replicated across machines, so if one node fails, its data and its share of the work are taken over by other nodes in the cluster.

Now, to look at the matter a little more specifically, the core of Hadoop consists of a storage part (the Hadoop Distributed File System, or HDFS) and a processing part (MapReduce). Hadoop breaks files into large blocks and distributes them among the nodes of a cluster. To process the data, the MapReduce part ships the processing code to those nodes so the work runs in parallel.

This approach takes advantage of data locality: each node works on the part of the data that is stored locally on it. As a result, data is processed faster and more efficiently than in a supercomputer-style architecture that uses a parallel file system and moves data to the computation over a high-speed network.
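
To make the model concrete, here is a minimal sketch of the kind of code that gets shipped to the nodes: the classic word-count example written against Hadoop's Java MapReduce API. The class names are illustrative, and the two classes are shown together only for brevity.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: runs on the node that holds a block of input and emits (word, 1) pairs.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: receives all counts emitted for one word and sums them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```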

As mentioned, the Hadoop framework is written in Java, although C and shell scripts are used in parts of it. Users can nevertheless implement the “map” and “reduce” functions in any programming language when working with Hadoop, for example through Hadoop Streaming.

Hadoop framework

The main framework of Hadoop consists of the following modules:

Hadoop Common

This section contains libraries and utilities required by other Hadoop modules.

Hadoop Distributed File System (HDFS)

A distributed file system that stores data on the machines of the cluster and provides very high aggregate bandwidth across the cluster.
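
As a rough illustration of how an application stores and reads data on HDFS, the following sketch uses Hadoop's Java FileSystem API. It assumes the cluster address comes from the usual core-site.xml on the classpath, and the file path is purely hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Cluster settings (e.g. fs.defaultFS) are read from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits large files into blocks and replicates them.
        Path file = new Path("/tmp/hello.txt");  // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        // Inspect the block size and replication factor reported by the NameNode.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size: " + status.getBlockSize()
            + ", replication: " + status.getReplication());
    }
}
```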

Hadoop YARN

A resource-management platform responsible for managing the cluster's computing resources and scheduling users' applications on them.
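
As a small, hedged example of interacting with YARN from Java, the sketch below uses the YarnClient API to ask the ResourceManager which applications are currently using the cluster's resources. It assumes a reachable ResourceManager configured in yarn-site.xml.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnApps {
    public static void main(String[] args) throws Exception {
        // Connect to the ResourceManager described in yarn-site.xml.
        Configuration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for the applications it knows about.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "  "
                + app.getName() + "  " + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}
```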

Hadoop MapReduce

It is a programming model for large-scale data processing. In practice, Hadoop provides a distributed file system that can store data on thousands of servers and distributes the work (map and reduce tasks) to those machines so that it runs alongside the data.
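
To tie this back to the word-count sketch shown earlier, here is an illustrative driver program that describes the job and submits it to the cluster; Hadoop then schedules the map and reduce tasks on the machines that hold the input blocks. The input and output paths are hypothetical and are passed on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper and reducer classes from the earlier sketch.
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS (hypothetical paths from the command line).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, such a job would typically be launched with something like hadoop jar wordcount.jar WordCountDriver /data/input /data/output, where the output directory must not already exist.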
