The Art of Apache Hadoop

Amjad El Baba
4 min read · Jun 29, 2022


Big data is a sensitive topic. Should it be?

As we all know, data is endless and comes from every resource we use in our daily lives. Data on its own is useless unless we can draw conclusions from it, and to do that we first need to store it in a convenient, easy-to-reach way so we can analyze and clean it, turning the information it carries into knowledge.

If you have been wondering about distributed computing, Hadoop is exactly that kind of framework, so bear with me throughout this article while I explain, in simple words, the logic behind Hadoop.

This might not make much sense to you yet, so let’s clear it up!

What is Hadoop?

Hadoop is a framework for distributed computing: it consists of multiple computers, or nodes, that are physically separate yet work together as a single system, so storage and computation are distributed across many machines.

Why Hadoop?

Time is priceless, and we need tools that make analyzing the tremendous amounts of data we call Big Data much faster and more efficient. That is why we need multiple processors and storage units working in parallel, rather than storing and processing everything on a single machine, which would take far longer.

Ingredients of Hadoop

Hadoop consists of two essential parts:

  • HDFS, aka the Hadoop Distributed File System
  • MapReduce

HDFS

HDFS stands for Hadoop Distributed File System. First, we need a place to store the data in a suitable way, and this is where HDFS comes into play: it handles all the complexities, such as splitting data into storage blocks (128 MB each by default) so we are not limited by the hard disk capacity of a single machine, replicating each block to other nodes (3 copies per block by default), and keeping track of which node holds which blocks.

So, thanks to block replication, we have a backup of our data in case a failure occurs.

A node is simply a computer used for storing and processing data.
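
To make this more concrete, here is a minimal sketch (not from the article) that writes a small file to HDFS through Hadoop’s Java FileSystem API. It assumes a running HDFS cluster whose settings are on the classpath; the path, the sample text, and the class name are hypothetical, and the block size and replication values simply restate the defaults mentioned above.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Restating the defaults mentioned above; both can be overridden.
        conf.set("dfs.blocksize", "134217728"); // 128 MB in bytes
        conf.set("dfs.replication", "3");       // 3 copies of each block

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/words.txt"); // hypothetical path

        // HDFS splits this file into blocks and replicates them transparently;
        // the client code never has to deal with individual blocks.
        try (FSDataOutputStream out = fs.create(path)) {
            out.write("Deer Bear River\n".getBytes(StandardCharsets.UTF_8));
        }

        System.out.println("Stored " + path
                + " with block size " + fs.getFileStatus(path).getBlockSize()
                + " bytes and replication " + fs.getFileStatus(path).getReplication());
    }
}
```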

MapReduce

Once the data is stored properly, the computational work starts, and MapReduce handles all of its complexities.

MapReduce is a programming model for processing and generating big data sets with parallel, distributed algorithms running across the cluster.

A great example of how MapReduce works is the classic word-count job, where we count how many times each word appears in a text file:

[Diagram: the word-count MapReduce flow (split → map → shuffle → reduce), via todaysoftmag.com]

MapReduce phases:

  • Map phase: the input data, already chunked into blocks, is split among the mappers (the splitting part) and then transformed into key-value pairs (the mapping part).

Please note that an input split is not a physical chunk of data; it is simply a class that points to the start and end locations of records within blocks. A record may be larger than 128 MB, so it can span two or more blocks depending on its size.

  • Shuffle phase: the key-value pairs are grouped based on their key.
  • Reduce phase: finally, the grouped values are sent to the reducers, and each reducer produces a result set such as (Bear, 2). By default each job has one reducer, but the user can change this. In our word-count example, this is the aggregation step that tells us how frequently each word appears (see the sketch after this list).
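
Below is a sketch of that word-count job using the standard Hadoop MapReduce Java API (the org.apache.hadoop.mapreduce classes). It is the classic example shipped with Hadoop rather than code from this article; the input and output paths are whatever you pass on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each input line is split into words, emitted as (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after the shuffle groups pairs by key, sum the counts,
    // producing results such as (Bear, 2).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        // job.setNumReduceTasks(2);                // the single default reducer can be changed
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```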

What we learned in this article:

  • What Hadoop is & why it is used.
  • Hadoop components.
  • How each component of Hadoop works with a practical example.

Some useful resources:

Thanks for your time and let’s boost our knowledge!

