MapReduce — how Google operates on large files (in GBs or TBs)

The HDFS file system, in combination with MapReduce, is used to simplify data processing on large clusters. The topic is taught in MIT's distributed systems course 6.824, Lecture 2.

Siddharth Gangwar
6 min read · May 18, 2022

Why do we need MapReduce?

It is just map and reduce applied at a large scale, which here means across many machines working together. Why multiple machines? Suppose you have a 100 GB file and you want it processed within the next hour: that is simply not feasible on a single machine, even one with multiple cores and lots of CPU power.

This problem is solved by MapReduce. To make it work, we use the HDFS file system, which is described under the next sub-heading.

What is HDFS?

The Hadoop Distributed File System (HDFS) was developed using a distributed file system design and runs on commodity hardware. Unlike some other distributed systems, HDFS is highly fault-tolerant and is designed for low-cost hardware.

HDFS holds very large amounts of data and provides easy access to it. To store such huge data, files are spread across multiple machines and stored redundantly, so that the system can recover from data loss when a machine fails. HDFS also makes the data available to applications for parallel processing.

What is MapReduce?

As the name suggests, MapReduce consists of two tasks: map and reduce. To understand the concept, let's take an example: counting the number of occurrences of each word in a file. The file is passed in chunks to the map function (which takes a key/value pair as input), and the map function binds the data to a key. In our context, each word becomes a key, and the value is the count of that word within that particular chunk of the file.

Now we have intermediate key/value output from map, which is stored on the local disk of the machine where the map ran. This buffered data is then processed by the reduce tasks, which take the map output as their input and generate the desired result. In our case, the keys (words) remain the same, while the values are added together whenever the same word is found.
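To make the example concrete, here is a minimal sketch in Python (not the paper's actual C++ interface): a word-count map and reduce function, plus a toy in-memory driver that wires them together.

```python
from collections import defaultdict

def map_fn(chunk_id, text):
    """Map: emit (word, 1) for every word in this chunk of the file."""
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum all the counts emitted for the same word."""
    return word, sum(counts)

# Tiny in-memory driver to illustrate the flow on two chunks.
chunks = {0: "the quick brown fox", 1: "the lazy dog the end"}
intermediate = defaultdict(list)
for chunk_id, text in chunks.items():
    for word, count in map_fn(chunk_id, text):
        intermediate[word].append(count)

results = dict(reduce_fn(w, c) for w, c in intermediate.items())
print(results)  # e.g. {'the': 3, 'quick': 1, 'brown': 1, 'fox': 1, ...}
```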

Reduce tasks generally run on different machines (the reason behind this is discussed further below), so to speed up the process the reduce workers read the intermediate state over RPC (Remote Procedure Calls). A link is attached in the reference section if you want to read more about RPC.
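As an illustrative sketch only (the paper does not describe this exact interface), a map worker could expose its buffered partitions through a simple RPC mechanism such as Python's built-in xmlrpc, and a reduce worker could then fetch its partition remotely:

```python
# Hypothetical map-worker side: serve intermediate partitions over RPC.
from xmlrpc.server import SimpleXMLRPCServer

partitions = {0: [("the", 1), ("fox", 1)], 1: [("dog", 1)]}  # toy intermediate data

def read_partition(reduce_id):
    """Return the buffered intermediate pairs for one reduce partition."""
    return partitions.get(reduce_id, [])

server = SimpleXMLRPCServer(("localhost", 9000), allow_none=True)
server.register_function(read_partition)
# server.serve_forever()  # run this on the map worker

# Hypothetical reduce-worker side: pull the partition it is responsible for.
# from xmlrpc.client import ServerProxy
# proxy = ServerProxy("http://localhost:9000")
# pairs = proxy.read_partition(0)
```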

Implementation

The implementation can differ depending on your requirements, and the right choice depends on your environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines.

How Google targets its computing environment: implementation & execution

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.
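As a minimal sketch of such a partitioning function (R = 4 and the use of MD5 are assumptions for illustration, not values from the paper):

```python
import hashlib

R = 4  # number of reduce partitions, chosen by the user

def partition(key: str, r: int = R) -> int:
    """Route an intermediate key to one of R reduce partitions: hash(key) mod R."""
    # Use a stable hash so every map worker sends the same key to the same
    # partition (Python's built-in hash() is randomized per process).
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % r

print(partition("the"), partition("fox"))
```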

The figure in the paper shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in Figure 1 of the paper correspond to the numbers in the list below):

  1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.
  2. One of the copies of the program is special — the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
  3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
  4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
  5. When a reduce worker is notified by the master about these locations, it uses RPC calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.
  6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition (a simplified sketch of this sort-and-reduce step follows the list).
  7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns to the user code.
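Here is a hedged sketch of steps 5 and 6: sorting the intermediate pairs, grouping equal keys, and applying the user's reduce function, assuming the data fits in memory (otherwise an external sort would be needed, as noted above).

```python
from itertools import groupby
from operator import itemgetter

def run_reduce_task(intermediate_pairs, reduce_fn):
    """Sort intermediate pairs by key, group equal keys, and apply the user's reduce."""
    intermediate_pairs.sort(key=itemgetter(0))  # an external sort is needed if this is too large
    output = []
    for key, group in groupby(intermediate_pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        output.append(reduce_fn(key, values))
    return output  # in the real system this is appended to the partition's output file

# Example using a word-count reduce:
pairs = [("the", 1), ("fox", 1), ("the", 1), ("dog", 1)]
print(run_reduce_task(pairs, lambda k, vs: (k, sum(vs))))
# [('dog', 1), ('fox', 1), ('the', 2)]
```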

The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).
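For illustration only (the paper gives no code for this), the master's bookkeeping might be modelled roughly like this:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None   # identity of the worker machine (non-idle tasks only)
    # For completed map tasks the master also records the locations of the
    # R intermediate regions, to forward to the reduce workers.
    intermediate_locations: list = field(default_factory=list)

# M map tasks and R reduce tasks, all idle at the start.
M, R = 8, 4
tasks = [Task("map") for _ in range(M)] + [Task("reduce") for _ in range(R)]
```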

Fault tolerance

Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.

There are two scenarios:

  1. Everything is managed by the master; what if the master goes down?
  2. Map and reduce tasks are performed on multiple machines; what if a worker fails?

We do have a single point of failure: the master. If the master goes down, all the local disk addresses and task mappings are lost. To overcome this, the master can periodically write checkpoints of its state. If the master dies, a new copy can be started from the last checkpointed state.
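A minimal sketch of such periodic checkpointing (the checkpoint path, format, and use of pickle are illustrative assumptions, not details from the paper):

```python
import pickle

CHECKPOINT_PATH = "master_checkpoint.pkl"  # hypothetical path

def checkpoint(master_state: dict) -> None:
    """Periodically persist the master's task table so a replacement can resume from it."""
    with open(CHECKPOINT_PATH, "wb") as f:
        pickle.dump(master_state, f)

def recover() -> dict:
    """Start a new master from the last checkpointed state."""
    with open(CHECKPOINT_PATH, "rb") as f:
        return pickle.load(f)
```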

To detect worker failure, the master pings each worker periodically. If no response is received from a worker, that worker is marked as failed. We need to handle both map worker machines and reduce worker machines. Handling a reduce worker is a bit easier: if a reduce task completed before the machine failed, nothing needs to be re-executed, since its output has already been written to the global file system.
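An illustrative sketch of that failure-detection loop (the ping interval, timeout, and callback names are assumptions for the example):

```python
import time

PING_INTERVAL_S = 5     # assumption: how often the master pings its workers
FAILURE_TIMEOUT_S = 30  # assumption: how long with no response before a worker is marked failed

def monitor_workers(workers, ping, mark_failed):
    """Periodically ping every worker; mark unresponsive ones as failed."""
    last_seen = {w: time.time() for w in workers}
    failed = set()
    while True:
        now = time.time()
        for worker in workers:
            if worker in failed:
                continue
            if ping(worker):                      # ping() stands in for a real RPC health check
                last_seen[worker] = now
            elif now - last_seen[worker] > FAILURE_TIMEOUT_S:
                failed.add(worker)
                mark_failed(worker)               # the master then reschedules that worker's tasks
        time.sleep(PING_INTERVAL_S)
```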

In the case of a map worker, the intermediate data is stored on that machine's local disk, so even if the map task completed and the machine then crashed, that map task has to be re-executed. If some reduce worker was reading the intermediate output of that map task, it stops and notifies the master; reduce workers that have not yet read the data can pick it up again from the machine that re-executes the map task, once it finishes.

One of the reasons atomic commits and periodic checkpoints are enough here is determinism: if the map and reduce functions are deterministic, re-executing a task produces exactly the same output, so failures and re-executions do not affect the final result.

If you want to read more, follow up with the research paper; this was just a gist.

If you found this article helpful, please clap, share, and follow me for more such advanced design articles.
