HADOOP — MAP REDUCE

Shehryar Mallick
Sep 16, 2022


INTRO TO MAP REDUCE:

MapReduce is one of the core components of the Hadoop framework. As discussed in the previous articles, Hadoop is an open-source framework for processing big data, which raises the question: which component actually carries out this processing? It is none other than MapReduce.

MapReduce is a software framework that processes large datasets by dividing them into small portions and applying the computation logic to all those small chunks in parallel. Once the specified processing has been carried out on the individual chunks, the results are combined to give the final output.

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of the appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job configuration. A detailed explanation follows as we dive into this article.

BENEFITS OF MAPREDUCE:

1) Scalability. Businesses can process petabytes of data stored in the Hadoop Distributed File System (HDFS).

2) Flexibility. Hadoop enables easier access to multiple sources of data and multiple types of data.

3) Speed. With parallel processing and minimal data movement, Hadoop offers fast processing of massive amounts of data.

4) Simple. Developers can write code in a choice of languages, including Java, C++ and Python

ARCHITECTURE OF MAPREDUCE:

We will use the classic word-count example to walk through each stage of the MapReduce process.

1. INPUT:

In this example, the input is a file containing two sentences.

2. SPLIT:

In the split step the input is divided into independent chunks (input splits), each of which is sent to the map stage.
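As a rough sketch (in plain Python, not Hadoop's actual API), the split step can be thought of as breaking the input text into independent records:

```python
# A toy stand-in for the split step: each line of the input becomes one
# record handed to a mapper. (Real Hadoop splits by HDFS block boundaries,
# not by line; this is only an illustration.)
text = "hello there\nhello hadoop"
splits = text.split("\n")
print(splits)  # ['hello there', 'hello hadoop']
```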

3. MAP:

3.1. MAPPER:

A separate mapper is spawned for each input split. The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

The mapper reads the input and generates intermediate results: it splits each input sentence at the spaces and associates a count with every word. For instance, if the input sentence were “hello there”, the mapper would emit <hello, 1> and <there, 1>.
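A minimal Python sketch of such a mapper (an illustration of the idea, not Hadoop's Java Mapper API):

```python
def mapper(sentence):
    # Split the sentence at whitespace and emit a (word, 1) pair per token.
    return [(word, 1) for word in sentence.split()]

print(mapper("hello there"))  # [('hello', 1), ('there', 1)]
```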

3.2. LOCAL COMBINER (OPTIONAL):

An interesting thing in the map stage is the local combiner, which performs the same task as the reducer. It is optional and can be enabled by the user. If enabled, it reduces the output of its respective map before sending it as input to the reduce stage. This helps cut down the amount of data transferred from the mapper to the reducer.
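Sketched in Python, a combiner is just reducer logic applied locally to one mapper's output (illustrative only):

```python
from collections import Counter

def combiner(pairs):
    # Pre-aggregate the (word, 1) pairs produced by a single mapper, so
    # fewer records cross the network to the reducers.
    counts = Counter()
    for word, count in pairs:
        counts[word] += count
    return list(counts.items())

# A mapper that saw "hello hello there" emits three pairs; the combiner
# collapses the duplicate key locally.
print(combiner([("hello", 1), ("hello", 1), ("there", 1)]))
# [('hello', 2), ('there', 1)]
```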

4. REDUCE:

In the reducing stage there are three core aspects: shuffle, sort, and the reducers. Shuffle and sort run simultaneously.

4.1. SHUFFLE:

Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers.

4.2. SORT:

In this step, which runs simultaneously with shuffle, the intermediate pairs are sorted by key so that all values with the same key are grouped together and sent as input to the reducers.
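The shuffle-and-sort behaviour can be sketched in Python as a sort followed by a group-by on the key (a rough approximation of the framework's merge sort):

```python
from itertools import groupby

def shuffle_and_sort(pairs):
    # Sort all intermediate pairs by key, then group equal keys so each
    # reducer receives (key, [values]).
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return [(key, [v for _, v in group])
            for key, group in groupby(ordered, key=lambda kv: kv[0])]

print(shuffle_and_sort([("there", 1), ("hello", 1), ("hello", 1)]))
# [('hello', [1, 1]), ('there', [1])]
```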

4.3. REDUCERS:

This is the final stage: the reducers count all the instances of each word received from the shuffle and sort stage and produce the final output. Note that the combined output across reducers is not globally sorted.
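For word count, each reducer simply sums the grouped values for its key; a Python sketch:

```python
def reducer(key, values):
    # Sum all counts for one word to produce its final tally.
    return (key, sum(values))

print(reducer("hello", [1, 1]))  # ('hello', 2)
```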

PARTITIONER:

The Partitioner partitions the key space: it controls which reducer each key of the intermediate map output is sent to, and the total number of partitions equals the number of reduce tasks.
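Hadoop's default HashPartitioner assigns a key to the partition hash(key) mod numReduceTasks. A Python sketch of that idea (a stable byte-sum stands in for the hash here, since Python randomizes string hashing between runs):

```python
def partition(key, num_reducers):
    # Route every occurrence of the same word to the same reducer by
    # hashing the key and taking it modulo the number of reduce tasks.
    h = sum(key.encode())  # toy stand-in for a real hash function
    return h % num_reducers

print(partition("hello", 4))  # 0
```

Because the assignment depends only on the key, all <hello, 1> pairs from every mapper end up at the same reducer, which is what makes the per-key summation correct.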



Shehryar Mallick

I am a Computer Systems Engineer with a keen interest in a variety of subjects, including Data Science, Machine Learning, Programming, and Data Engineering.