Hadoop: MapReduce In Detail

Abhishek Sukhwal
Mar 31, 2024


MapReduce is a programming model or pattern within the Hadoop framework that is used to access big data stored in HDFS (Hadoop Distributed File System).

MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing them in parallel on Hadoop commodity servers. Finally, it aggregates all the data from multiple servers to return a consolidated output back to the application.

MapReduce Architecture

The data goes through the following phases in a MapReduce job:

  1. Input Splits — The input to a MapReduce job is divided into fixed-size pieces called input splits. Each split is a chunk of the input that is consumed by a single map task.
  2. Mapping — In this phase, the data in each split is passed to a mapping function to produce output values.
  3. Shuffling — This phase consumes the output of the mapping phase. Its task is to consolidate the relevant records from the mapping phase output.
  4. Reducing — In this phase, the output values from shuffling are aggregated. It combines the values from shuffling into a single output value, summarizing the complete dataset.
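The four phases above can be sketched as a toy, single-process simulation in plain Python (this is purely illustrative; it is not the Hadoop API, and the function names are made up for this example):

```python
from itertools import groupby

def map_phase(split):
    # Mapping: emit a (word, 1) pair for every word in the split
    return [(word, 1) for line in split for word in line.split()]

def shuffle_phase(mapped_outputs):
    # Shuffling: consolidate records from all mappers, sort by key, group
    all_pairs = sorted(pair for output in mapped_outputs for pair in output)
    return [(key, [v for _, v in group])
            for key, group in groupby(all_pairs, key=lambda kv: kv[0])]

def reduce_phase(shuffled):
    # Reducing: aggregate the grouped values into a single value per key
    return {key: sum(values) for key, values in shuffled}

# Input splits: the input is divided into chunks, one per map task
splits = [["hello world"], ["hello hadoop"]]
result = reduce_phase(shuffle_phase([map_phase(s) for s in splits]))
print(result)  # {'hadoop': 1, 'hello': 2, 'world': 1}
```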

Detailed Explanation

  1. Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits. It creates one map task for each split, which runs the user-defined map function for each record in the split.

Note: Having many splits means the time taken to process each split is small compared to the time to process the whole input. On the other hand, if the splits are too small, the overhead of managing the splits and creating map tasks begins to dominate the total job execution time.
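This tradeoff is easy to see with some hypothetical numbers (128 MB is the common HDFS block size default, used here only for illustration):

```python
import math

def num_map_tasks(input_size_bytes, split_size_bytes):
    # One map task is created per input split
    return math.ceil(input_size_bytes / split_size_bytes)

GB = 1024 ** 3
MB = 1024 ** 2

# With a 128 MB split, a 10 GB input yields 80 map tasks; with a 1 MB
# split it yields 10,240 tasks, whose scheduling and creation overhead
# would dominate the job.
print(num_map_tasks(10 * GB, 128 * MB))  # 80
print(num_map_tasks(10 * GB, 1 * MB))    # 10240
```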

2. Map Side

Each map task has a circular memory buffer that it writes output to. When the contents of the buffer reach a certain threshold size, a background thread starts to spill the contents to disk.

Map outputs continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map blocks until the spill is complete.

Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function (which helps cut down the amount of data shuffled between the mappers and reducers), it is run on the output of the sort.
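For word count, the combiner can be the same aggregation as the reducer, applied early on the map side. A toy sketch (the `combine` function is hypothetical, not part of Hadoop):

```python
from collections import Counter

def combine(map_output):
    # Combiner: pre-aggregate (word, 1) pairs on the map side so that
    # fewer records travel across the network to the reducers
    return sorted(Counter(word for word, _ in map_output).items())

# Four records shrink to two before being shuffled
map_output = [("hello", 1), ("hello", 1), ("hello", 1), ("world", 1)]
print(combine(map_output))  # [('hello', 3), ('world', 1)]
```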

Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record, there could be several spill files. Before the task finishes, the spill files are merged into a single partitioned and sorted output file.
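Because each spill file is already sorted, the final step is a merge of sorted streams rather than a re-sort. In Python this idea can be sketched with `heapq.merge` (the spill data below is made up for illustration):

```python
import heapq

# Three spill files, each already sorted by key when it was written
spill_1 = [("apple", 2), ("dog", 1)]
spill_2 = [("banana", 1), ("egg", 3)]
spill_3 = [("cat", 1), ("fig", 2)]

# Merge the sorted spills into one sorted output without re-sorting
merged = list(heapq.merge(spill_1, spill_2, spill_3, key=lambda kv: kv[0]))
print(merged)
# [('apple', 2), ('banana', 1), ('cat', 1), ('dog', 1), ('egg', 3), ('fig', 2)]
```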

3. Reduce Side

Copy phase — The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. Map outputs are copied to the reduce task JVM’s memory if they are small enough; otherwise they are copied to disk.

Note: As map tasks complete successfully, they notify the application master using the heartbeat mechanism. A thread in the reducer periodically asks the master for map output hosts until it has retrieved them all.

Note: Hosts do not delete map outputs from disk as soon as the reducer has retrieved them, as the reducer may subsequently fail. Instead, they wait until they are told to delete them by the application master, which happens after the job has completed.

Sort phase — When all map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side). It merges the map outputs, maintaining their sort ordering. This is done in rounds: if there are 50 map outputs and the merge factor is 10, there are 5 rounds.
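The round count in that example is just a ceiling division, as a quick sketch shows (the function name is made up for this illustration):

```python
import math

def sort_phase_rounds(num_map_outputs, merge_factor):
    # Each round merges up to merge_factor sorted map outputs into one
    # file; the files from the last round are fed into the reduce phase.
    return math.ceil(num_map_outputs / merge_factor)

print(sort_phase_rounds(50, 10))  # 5
```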

Reduce phase — During this phase, the reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output filesystem, typically HDFS.

MapReduce In Action

# Word count mapper

import sys

for line in sys.stdin:  # input is read from STDIN; output is written to STDOUT
    line = line.strip()  # remove leading and trailing whitespace
    words = line.split()  # split the line into words

    for word in words:
        print('%s\t%s' % (word, 1))  # emit each word (key) individually with the value 1
# Word count reducer

import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:  # input comes from STDIN
    line = line.strip()  # remove leading and trailing whitespace

    word, count = line.split('\t', 1)  # parse the mapper output on the tab separator

    try:
        count = int(count)  # convert count from string to int
    except ValueError:
        continue  # if the count is not a number, discard the line

    if current_word == word:  # input is sorted by key (word), so equal words are adjacent
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

if current_word == word:  # do not forget to output the last word!
    print('%s\t%s' % (current_word, current_count))
# Running steps

---Creating the test file (add some text to words.txt before uploading it)

touch words.txt

---Making Directory in hdfs

hdfs dfs -mkdir -p /wordcount

---Copying test file from local directory to hdfs

hdfs dfs -copyFromLocal /usr/local/words.txt /wordcount

---Check for file listing on hdfs:

hdfs dfs -ls /wordcount

---Running the mapreduce job

/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -file /usr/local/mapper.py -mapper mapper.py -file /usr/local/reducer.py -reducer reducer.py -input /wordcount/words.txt -output /wordcount/output

---Print the output

hdfs dfs -cat /wordcount/output/part-00000

Advantages Of MapReduce

MapReduce offers the following advantages:

  1. Parallel Processing — In MapReduce, the job is divided among multiple nodes, and each node works on its part of the job simultaneously. MapReduce is based on the Divide and Conquer paradigm.
  2. Data Locality — Instead of moving data to the processing unit, the MapReduce framework moves the processing unit to the data. This minimizes network congestion and increases the overall throughput of the system.

Hadoop modules & process flow link : https://medium.com/@abhisheksukhwal9/hadoop-modules-process-flow-73f6a971d043
