What is MapReduce?


Visit systemdesign.us for System Design Interview Questions tagged by companies and their Solutions. Follow us on YouTube, LinkedIn, Twitter, Medium.

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

The MapReduce algorithm is composed of two parts: the map function and the reduce function.

The map function takes an input key/value pair and produces a set of intermediate key/value pairs. The reduce function takes an intermediate key and the list of values associated with that key, and merges those values into the final output key/value pairs.

The MapReduce programming model can be used to solve various problems such as counting the occurrences of each word in a document, finding the average age of users in a social network, or building an inverted index for a search engine.

In general, a MapReduce job has three main steps (a rough sketch of the corresponding function signatures follows this list):

1) Map: The map function is applied to each input key/value pair and emits a set of intermediate key/value pairs.

2) Shuffle: The framework groups the intermediate pairs by key, so that all values belonging to the same key are delivered to the same reducer.

3) Reduce: The reduce function takes an intermediate key and its list of values and produces the output key/value pairs.

An optional combine function can also be applied to each mapper’s output, pre-aggregating the values for a key locally before they are sent over the network to the reducers.
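To make the data flow concrete, here is a minimal Java sketch of the three function signatures. These interfaces are illustrative only and are not part of any real MapReduce library; K1/V1 are the input types, K2/V2 the intermediate types, and K3/V3 the output types.

```java
import java.util.List;
import java.util.Map;

// Illustrative signatures only; real frameworks (e.g. Hadoop) use their own classes.
interface MapFunction<K1, V1, K2, V2> {
    // e.g. (offset, line) -> [(word, 1), (word, 1), ...]
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);
}

interface CombineFunction<K2, V2> {
    // Optional local pre-aggregation on the mapper, before the shuffle.
    List<Map.Entry<K2, V2>> combine(K2 key, List<V2> values);
}

interface ReduceFunction<K2, V2, K3, V3> {
    // e.g. (word, [1, 1, 1]) -> [(word, 3)]
    List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
}
```

For word count, K2 would be the word, V2 a count, and the combine and reduce functions would both simply sum the counts for a word.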

The MapReduce model can be implemented in any programming language, but it is most commonly written in Java, because Hadoop, the most widely used open-source implementation, exposes a Java API.

MapReduce is a powerful tool for processing large data sets in parallel and distributed environments. However, it has some limitations. For example, MapReduce is a poor fit for iterative algorithms such as shortest-path computation, which must be expressed as a long chain of jobs, and for problems that require random access to data. In addition, MapReduce is not well suited for real-time processing or interactive applications.

MapReduce Architecture

(Architecture diagram: Google’s MapReduce OSDI ’04 slides, https://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0008.html)

The MapReduce architecture consists of a master node and a set of worker nodes. The master node is responsible for managing the job, while the worker nodes are responsible for executing the tasks.

The MapReduce architecture is designed to be scalable and fault-tolerant. When a job is submitted, the master node distributes the tasks to the workers. If a worker fails, the master node can reschedule the task to another worker.
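The retry idea behind this fault tolerance can be illustrated with a toy, single-process Java sketch. This is not how Hadoop or Google’s implementation is actually written; it only shows the pattern of a “master” resubmitting a failed task so another “worker” can pick it up.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MasterRetryDemo {
    public static void main(String[] args) throws InterruptedException {
        // The thread pool plays the role of the worker nodes.
        ExecutorService workers = Executors.newFixedThreadPool(4);

        // A task that randomly "crashes" to simulate a worker failure.
        Callable<String> task = () -> {
            if (Math.random() < 0.5) {
                throw new RuntimeException("worker crashed");
            }
            return "task result";
        };

        String result = null;
        for (int attempt = 1; result == null; attempt++) {
            Future<String> future = workers.submit(task);   // master assigns the task
            try {
                result = future.get();                       // wait for the worker to finish
            } catch (ExecutionException e) {
                System.out.println("attempt " + attempt + " failed, rescheduling");
            }
        }
        System.out.println(result);
        workers.shutdown();
    }
}
```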

MapReduce Example

Let’s consider a simple example: counting how many times each word appears in a document. We will use the MapReduce algorithm to compute these word counts for a large document in parallel.

First, we need to write two functions: the map function and the reduce function.

The map function takes a line of text as input and outputs a list of (word, 1) pairs. For example, if the input is “Hello world”, the output would be (“Hello”, 1), (“world”, 1).
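One way to write this map function in Java is against Hadoop’s MapReduce API. The sketch below follows the classic WordCount example from the Hadoop documentation; the class name TokenizerMapper is just a convention borrowed from that example.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split the line into tokens and emit (word, 1) for each one.
        StringTokenizer tokenizer = new StringTokenizer(line.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}
```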

The reduce function takes a word together with the list of counts emitted for that word and outputs a single (word, totalCount) pair. For example, if the intermediate pairs are (“Hello”, 1), (“world”, 1), (“Hello”, 1), (“world”, 1), the reduce function is called once with (“Hello”, [1, 1]) and once with (“world”, [1, 1]), producing (“Hello”, 2) and (“world”, 2).
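The matching Hadoop reducer, again following the standard WordCount example, receives one word together with all of its 1s and emits the total:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Sum all counts for this word and emit (word, totalCount).
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        result.set(sum);
        context.write(word, result);
    }
}
```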

Next, we need to create a MapReduce job. A MapReduce job consists of two phases: the map phase and the reduce phase.

In the map phase, the Map function is applied to each input key/value pair to produce a list of intermediate key/value pairs. In our example, the map function would be applied to each line of text in the document.

Between the two phases, the framework groups the intermediate pairs by key. In the reduce phase, the Reduce function is then applied to each intermediate key and its list of values to produce the output key/value pairs. In our example, the reduce function would be applied to each word and the list of 1s produced for it by the map function.

Finally, we need to submit the MapReduce job to a cluster. The cluster will execute the job and return the results.
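With Hadoop, creating and submitting the job is done by a small driver program. The sketch below assumes the TokenizerMapper and IntSumReducer classes from the previous snippets; the input and output paths are taken from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a JAR, the job would typically be launched with something like hadoop jar wordcount.jar WordCount /input /output, after which the cluster runs the map and reduce tasks and writes the word counts to the output directory.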

MapReduce Summary

MapReduce is a powerful tool for processing large data sets in parallel and distributed environments.

MapReduce has three main steps: map, shuffle, and reduce; an optional combine step can pre-aggregate map output locally on each mapper.

The MapReduce architecture consists of a master node and a set of worker nodes. The master node is responsible for managing the job, while the worker nodes are responsible for executing the tasks.

MapReduce is designed to be scalable and fault-tolerant. When a job is submitted, the master node distributes the tasks to the workers. If a worker fails, the master node can reschedule the task to another worker.

Visit systemdesign.us for System Design Interview Questions tagged by companies and their Solutions. Follow us on YouTube, LinkedIn, Twitter, Medium.
