Introduction to MapReduce

James Han
Published in Analytics Vidhya
May 31, 2021 · 2 min read

What is MapReduce?

MapReduce is a programming framework for distributed parallel processing of large jobs. It was first introduced by Google in 2004 and popularized by Hadoop. The primary motivation behind MapReduce was the need for a simple way to manage computationally intensive jobs distributed across large clusters of machines in a parallelized, scalable, and fault-tolerant way.

The main idea of MapReduce is that a complex job can be distributed and processed in parallel by splitting it into multiple tasks across map and reduce stages. Each stage utilizes a number of workers, and each worker executes a user-defined map function or reduce function.

What are Map and Reduce Functions?

A map function takes a chunk of the job’s input and executes some user-defined logic to output a list of key-value pairs.

A reduce function takes a group of key-value pairs that share the same key and executes some user-defined logic to output a single value for that key.
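The two functions above can be sketched in Python for the letter-counting example that follows; the function names and the `(letter, 1)` pair convention are illustrative, not part of any specific framework's API.

```python
def map_fn(chunk):
    """Emit a (letter, 1) key-value pair for each letter in the input chunk."""
    return [(letter, 1) for letter in chunk if letter.isalpha()]

def reduce_fn(key, values):
    """Sum all the counts emitted for a single letter."""
    return (key, sum(values))
```

For instance, `map_fn("aab")` yields `[("a", 1), ("a", 1), ("b", 1)]`, and after the framework groups those pairs by key, `reduce_fn("a", [1, 1])` yields `("a", 2)`.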

Example

Here is a simple MapReduce example where the map function calculates the number of occurrences of each letter in the input string, and the reduce function calculates the total number of occurrences of a particular letter.

Each gray box indicates the input or output data of the map and reduce tasks. In MapReduce, the job's input and final output are typically stored in a distributed file system such as HDFS, while the intermediate map output is usually written to the workers' local disks.

Notice the splitting of the initial input and the shuffling and sorting of the output of the map function into the input of the reduce function. These steps are done automatically by the MapReduce framework.
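The whole flow, splitting the input, mapping, shuffling and sorting by key, and reducing, can be simulated end-to-end in a single process. This is a toy sketch of the letter-counting example; in a real framework each split, map task, and reduce task would run on a different worker, and all names here are illustrative.

```python
from collections import defaultdict

def map_fn(chunk):
    # Map: emit a (letter, 1) pair for each letter in the chunk.
    return [(letter, 1) for letter in chunk]

def reduce_fn(letter, counts):
    # Reduce: sum the counts for one letter.
    return (letter, sum(counts))

def run_job(text, num_splits=3):
    # Split: divide the input among the map tasks.
    size = max(1, -(-len(text) // num_splits))  # ceiling division
    splits = [text[i:i + size] for i in range(0, len(text), size)]

    # Map: run each split through the map function.
    mapped = [pair for split in splits for pair in map_fn(split)]

    # Shuffle and sort: group the pairs by key, done by the framework.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: one call per distinct key.
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

print(run_job("abcab"))  # -> {'a': 2, 'b': 2, 'c': 1}
```

The shuffle step is the only part the user never writes: the framework guarantees that every pair with the same key ends up at the same reduce task.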

Fault Tolerance

One of the biggest advantages of MapReduce is its fault tolerance. Every map and reduce task is executed independently, and if one task fails, it is automatically retried a few times without causing the entire job to fail. That way, if transient issues such as hardware failures occur, or if tasks are intentionally stopped to free up resources for more important jobs, MapReduce tasks can easily be retried using the same input data.
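The per-task retry behavior can be sketched as a simple loop, assuming transient failures surface as exceptions; the function name and the retry count are illustrative, not taken from any real framework.

```python
def run_with_retries(task, input_data, max_attempts=3):
    """Re-run a failed task on the same input a few times before giving up."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return task(input_data)
        except Exception as error:
            last_error = error  # transient failure: retry with the same input
    raise last_error  # retries exhausted: the whole job fails
```

Because each task is retried with exactly the same input, a task that fails once or twice but eventually succeeds contributes the same output as one that succeeded on its first attempt.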

In order to ensure that the retry does not have any unintended consequences, MapReduce functions must produce outputs without altering the inputs, and they cannot have any side effects. If implemented correctly, a MapReduce job that has failed tasks but eventually succeeded after retries should produce the exact same output as that of a job that executed successfully without any failed tasks.
