MapReduce White Paper — Part 2

Saumya Bhatt
6 min read · Apr 25, 2024


In this article, we shall be going through the MapReduce white paper: a legendary framework that paved the way for Spark, Hadoop and many other distributed systems that form the backbone of the current tech scene! So, let's get into it!

Preview

In the earlier article (if you missed it, you can read it here), we understood what MapReduce is and why it is important.

In this section, we shall uncover the black box and take a look at how MapReduce does what it does best, distributed computing!

Managing Nodes: The master-worker setup

So, if you remember, we used the below image for MapReduce to signify that there is not just one single machine but multiple nodes that make distributed computing possible. Well, let's modify the image a bit:

There is one master node, and the rest are worker nodes. Note that all the nodes communicate with each other. And since they generally do so over the same network, the “chatter” is pretty quick!

Generally, when working in a distributed setup, you need someone to orchestrate the tasks: someone who decides who should do what, assigns (or even re-assigns) tasks, and keeps track of what is done. (Kind of like Zookeeper in Kafka.)

In MapReduce too, there is a master node that does the following (a small sketch of this bookkeeping follows the list):

  1. Assigns tasks to the worker nodes.
  2. Re-assigns tasks to a healthy node if any of the worker nodes is unhealthy.
  3. Keeps track of which tasks are done and which are left.
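
To make this a bit more concrete, here is a minimal Python sketch of the kind of bookkeeping the master might keep. The names and structure here are illustrative, not taken from the paper:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class TaskState(Enum):
    IDLE = "idle"                 # not yet handed to any worker
    IN_PROGRESS = "in_progress"   # currently running on some worker
    COMPLETED = "completed"       # worker reported success


@dataclass
class Task:
    task_id: int
    kind: str                     # "map" or "reduce"
    state: TaskState = TaskState.IDLE
    worker: Optional[str] = None  # which worker currently holds this task


@dataclass
class Master:
    tasks: dict = field(default_factory=dict)   # task_id -> Task

    def assign(self, task_id: int, worker: str) -> None:
        task = self.tasks[task_id]
        task.state, task.worker = TaskState.IN_PROGRESS, worker

    def mark_done(self, task_id: int) -> None:
        self.tasks[task_id].state = TaskState.COMPLETED

    def reassign_from(self, dead_worker: str, healthy_worker: str) -> None:
        # A dead worker's unfinished tasks simply go back out to a healthy node.
        for task in self.tasks.values():
            if task.worker == dead_worker and task.state is TaskState.IN_PROGRESS:
                self.assign(task.task_id, healthy_worker)

    def phase_done(self, kind: str) -> bool:
        # e.g. the reduce phase may start only once phase_done("map") is True.
        return all(t.state is TaskState.COMPLETED
                   for t in self.tasks.values() if t.kind == kind)
```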

Getting things ready

As we know, MapReduce has 2 phases, Map and Reduce! These are basically atomic functions (and more often than not deterministic in nature) that are defined by the user. As soon as the user passes them to the master, the master node sends a copy to all the worker nodes:

The user passes the map and reduce functions to the master node. Note that these functions should be atomic in nature!
Each node keeps a copy of both functions. This makes it possible for any node to perform either a Map task or a Reduce task whenever needed.
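
To make this concrete, here is what such a pair of user-defined functions could look like, using the word-count example from the original paper (sketched in Python here rather than the paper's C++-style pseudocode):

```python
from typing import Iterator, List, Tuple


def map_fn(filename: str, contents: str) -> Iterator[Tuple[str, int]]:
    # Map: for every word in the input chunk, emit an intermediate (word, 1) pair.
    # The key (filename) is unused here, mirroring the paper's word-count example.
    for word in contents.split():
        yield (word, 1)


def reduce_fn(key: str, values: List[int]) -> int:
    # Reduce: sum all the counts emitted for a given word.
    return sum(values)


# e.g. list(map_fn("chunk-0", "the quick brown fox")) -> [("the", 1), ("quick", 1), ...]
```

The important property is exactly what the article notes: these are small, self-contained functions, so the master can simply ship a copy of both to every worker.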

We do this so that, at any given time, a node can perform either a map task or a reduce task. Say a node that was earlier performing a map task is now asked by the master to perform a reduce task because one of the other nodes that was supposed to perform it died!

This way, even if nodes die, the work doesn’t stop. This sounding like fault tolerance to you guys?

But what are these bloody tasks!?

Let’s cut to the chase and explain what map and reduce tasks are! Up till now, we have understood that:

  1. There is a master node which assigns tasks.
  2. Each node has the knowledge of how to perform a map and a reduce operation.

Based on this, let’s deep dive into it:

Splitting of Input Data

When the data (say a file of ‘x’ GB) arrives, it is split into M pieces (smaller files, each of size ‘y’ MB, where y<x).

The constant M is specified by the user.
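
As a rough sketch, the splitting could look something like the below. The byte-based chunking and file naming are purely illustrative; a real splitter would respect record boundaries, and the paper suggests pieces of roughly 16–64 MB each:

```python
import os


def split_input(path: str, m: int, out_dir: str = "chunks") -> list:
    """Split one large input file into M roughly equal pieces on local disk."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_size = os.path.getsize(path) // m + 1   # roughly equal byte ranges
    chunk_paths = []
    with open(path, "rb") as f:
        for i in range(m):
            data = f.read(chunk_size)
            if not data:                          # nothing left to split
                break
            chunk_path = os.path.join(out_dir, f"chunk-{i:04d}")
            with open(chunk_path, "wb") as out:
                out.write(data)                   # each chunk becomes one map task's input
            chunk_paths.append(chunk_path)
    return chunk_paths
```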

Assigning of tasks

Now that our data is broken down into M chunks, the master node assigns the M map tasks to worker nodes, sending each its chunk of data, and assigns R reduce tasks to worker nodes as well. Note that the reduce phase will happen only after all the mapping is done! The master node keeps track of when the mapping is done.

Note that M can be more than the number of nodes! There can be 100 nodes and the value of M can still go into the 1000s! This is because M here is basically a partition (a task), not a machine. Hence, we can send multiple chunks of data to a single node, or have one node perform both a map and a reduce task. The one-task-per-node picture here is shown for simplicity of understanding.
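
Here is a tiny illustrative sketch of spreading M tasks over a smaller pool of workers. Round-robin is used purely for simplicity; the real master hands tasks to whichever workers happen to be idle:

```python
from collections import defaultdict


def schedule_round_robin(num_tasks: int, workers: list) -> dict:
    # M (num_tasks) can exceed the number of workers,
    # so each worker may end up with several map tasks.
    assignment = defaultdict(list)
    for task_id in range(num_tasks):
        assignment[workers[task_id % len(workers)]].append(task_id)
    return dict(assignment)


# e.g. 10 map tasks over 3 workers -> each worker gets 3 or 4 tasks
print(schedule_round_robin(10, ["worker-a", "worker-b", "worker-c"]))
```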

What happens inside a Map Task?

Once the assignment is done, let’s get inside a node which does the map task and see what happens!

The node already has the map function with it. It gets the chunked data, performs the map function on it and stores the data on its disk.

The intermediate key-value data is first buffered in memory and periodically written to the disk of the worker node that created it!
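
A small sketch of that buffer-and-spill behaviour; the spill threshold and the file naming scheme are my own, used only for illustration:

```python
import json
import os


class IntermediateBuffer:
    """Buffer intermediate (key, value) pairs in memory; spill to local disk periodically."""

    def __init__(self, map_task_id: int, spill_threshold: int = 10_000,
                 out_dir: str = "intermediate"):
        self.map_task_id = map_task_id
        self.spill_threshold = spill_threshold   # how many pairs to hold before writing
        self.out_dir = out_dir
        self.pairs = []
        self.spill_count = 0
        os.makedirs(out_dir, exist_ok=True)

    def emit(self, key: str, value: int) -> None:
        # Called for every (key, value) pair the map function produces.
        self.pairs.append((key, value))
        if len(self.pairs) >= self.spill_threshold:
            self.spill()

    def spill(self) -> None:
        # Write the buffered pairs to this worker's *local* disk, then clear the buffer.
        path = os.path.join(self.out_dir,
                            f"map-{self.map_task_id}-spill-{self.spill_count}.json")
        with open(path, "w") as f:
            json.dump(self.pairs, f)
        self.pairs.clear()
        self.spill_count += 1
```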

But wait! Remember that we said one node can receive multiple chunks of data, as the value of M is often much larger than the number of nodes! Hence the true picture would look like:

Also, remember that M and R are both defined by the user. M tells us how many chunks we want to split the input data into: the larger the value of M, the smaller the chunks and the faster the computation (at the cost of possibly under-utilizing your resources). R defines how many chunks (basically files) we want the output to be split into! It can even be set to 1 if we want a single output, but for cases where it doesn't matter, we can have the output in multiple “chunks”.

Once again, M and R are defined by the user depending on their use case.

So now step back for a minute and imagine MapReduce to be working on a single node. Our input is split into M pieces and our output into R pieces. Hence, the intermediate key-value data that our map tasks compute should also model that!

Hence, we will modify that diagram a little bit!

We have defined M as 3 and R as 2. As you can see, the output of the map task is now stored in 2 partitions on the disk of that worker!
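
The rule that decides which of the R partitions a given key lands in is, by default in the paper, hash(key) mod R. A small sketch (using a stable hash here, since Python's built-in hash() is randomized across processes):

```python
import hashlib


def partition_for(key: str, r: int) -> int:
    # Default partitioning from the paper: hash(key) mod R.
    # A stable hash (md5 here) ensures every map worker sends a given key
    # to the same partition number.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % r


# With R = 2, every key deterministically lands in partition 0 or 1,
# so all the values for "apple" end up at the same reduce worker.
print(partition_for("apple", 2), partition_for("banana", 2))
```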

Our worker node has now successfully performed the map phase on the data and stored the intermediate data in its partitions. It will now send an update to the master!

Note that the master has information of which tasks were done by which nodes and where the data lies.

The worker nodes that performed the map tasks store the intermediate data on their disks in partitions specified by R and relay this information to the master. The overall image would now look like:

The master is the one who assigned the tasks. So even if one of the nodes dies, it can assign the map task to another node. Only when all the map tasks are done shall it move forward.

The Reduce Phase begins!

Now that the map phase is done, the master knows exactly which node, and which partition on that node, each piece of intermediate data resides in, and it passes this information to the workers tasked with the reduce operation.

The master “tells” the workers with reduce tasks where they can get their data from.

The reduce task worker makes a remote procedure call to the map task workers to get all of its intermediate data. The data it gets comes from multiple map workers, and since the reduce function will be applied to data having the same keys, the worker first sorts the data.

Sorting is needed because the reduce task worker is getting data from multiple map workers. Since the reduce function will be applied to data having the same keys, sorting has to be done: either internally, or via an external sort (maybe another MapReduce!).
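
A sketch of what a reduce worker could do once it has pulled its partition's files from every map worker; the “fetch” here is just reading local files standing in for the real RPC:

```python
import json
from itertools import groupby
from operator import itemgetter


def run_reduce_task(partition_files: list, reduce_fn, out_path: str) -> None:
    # 1. Gather this partition's intermediate pairs from every map worker's output
    #    (a real worker would fetch these over RPC; here we just read local files).
    pairs = []
    for path in partition_files:
        with open(path) as f:
            pairs.extend((k, v) for k, v in json.load(f))

    # 2. Sort by key so that all values for the same key sit next to each other.
    pairs.sort(key=itemgetter(0))

    # 3. Group by key and apply the user's reduce function to each group,
    #    writing one line per key to this reduce task's output file.
    with open(out_path, "w") as out:
        for key, group in groupby(pairs, key=itemgetter(0)):
            values = [v for _, v in group]
            out.write(f"{key}\t{reduce_fn(key, values)}\n")
```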

Depending on the infrastructure, the reduce task nodes will perform the reduce function on the intermediate values and write the data to an output file. When Google introduced this, the output was R files (one per reduce task) stored on GFS.

The User is informed about the job completion. The output files are stored in a file storage system.

Conclusion

In this section, we learned how MapReduce works internally (a toy end-to-end sketch follows the summary):

  1. There is a master which coordinates work amongst the nodes.
  2. The data is first split into M chunks. The master assigns these map tasks to the worker nodes.
  3. The worker nodes perform the map operation on these chunks, store the intermediate data on their disks and inform the master.
  4. Once all map tasks are done and the master has the information as to where the intermediate data is stored, it relays this information to the workers who are supposed to perform the reduce task.
  5. These workers make RPC calls to fetch the intermediate data, sort it by key and perform the reduce function on it.
  6. The output is then stored in a file system and the master is informed about the same.
  7. Once all the reduce tasks are done, the master wakes up the user code and returns the result.
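
To tie these steps together, here is a toy, single-process simulation of the whole flow: no real master, workers, or RPC, just the data movement made concrete under those simplifying assumptions:

```python
from collections import defaultdict


def hash_partition(key: str, r: int) -> int:
    # Toy stand-in for the paper's hash(key) mod R partitioning.
    return sum(key.encode("utf-8")) % r


def toy_mapreduce(chunks: list, map_fn, reduce_fn, r: int) -> list:
    # Map phase: map every chunk, routing each intermediate pair to one of R partitions.
    partitions = [defaultdict(list) for _ in range(r)]
    for chunk in chunks:
        for key, value in map_fn(chunk):
            partitions[hash_partition(key, r)][key].append(value)

    # Reduce phase: one output "file" (here just a dict) per partition,
    # with keys processed in sorted order as a real reduce worker would.
    return [{key: reduce_fn(key, values) for key, values in sorted(p.items())}
            for p in partitions]


# Word count over two "chunks", with R = 2 output partitions.
chunks = ["the quick brown fox", "the lazy dog jumps over the fox"]
word_map = lambda text: [(w, 1) for w in text.split()]
word_reduce = lambda key, values: sum(values)
print(toy_mapreduce(chunks, word_map, word_reduce, r=2))
```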

In the next section, we shall understand how this design is fault tolerant, how it can be further refined and ultimately what its shortcomings are. Till then, stay tuned folks!
