MapReduce White Paper — Part 1

Saumya Bhatt
5 min read · Apr 25, 2024

--

In this article, we shall go through the MapReduce white paper: the legendary framework that paved the way for technologies like Hadoop, Spark, and many other distributed systems that form the backbone of the current tech scene! So, let's get into it!

What even is MapReduce?

Well, the official definition based on the white paper that Google published is as follows:

MapReduce is a programming model and an associated implementation for processing and generating large data sets.

What this means in simple terms is that it helps us perform computations. Fast. And fast is the key word here. Imagine it as a black box: you define what kind of data you will pass to it and what to do with that data. The black box then accepts the inputs you defined, performs the computations you defined, and gives an output.

User defines some operation that MapReduce will do on the data that it receives.
The process is asynchronous hence represented with dotted lines.
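The black-box idea above can be sketched in a few lines of Python. This is a minimal, single-machine illustration of the programming model, not the real distributed implementation; the names `run_mapreduce`, `mapper`, and `reducer` are my own, chosen for clarity:

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    # Map phase: the user-defined mapper turns each input record
    # into a list of (key, value) pairs.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)
    # Reduce phase: the user-defined reducer combines all values
    # collected under each key into a final value.
    return {key: reducer(key, values) for key, values in intermediate.items()}

# Example use: word counting, the classic example from the paper.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    return sum(values)

counts = run_mapreduce(["to be or not to be"], mapper, reducer)
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

Notice that the user only writes `mapper` and `reducer`; everything else (grouping, and in the real system, distribution and fault tolerance) is the framework's job.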

Why is it required?

Well, because as easy as it sounds, computation is an expensive process. Even with advanced CPUs and chips, for huge datasets that go into petabytes, it can take days or even longer to run any kind of logic over that data!

The only real solution to this problem is more compute power. But there’s a limit to our hardware. So wouldn’t it be better if, instead of being restricted to the compute power of just 1 machine, we used 2 machines? Why 2, we can use 4! Wait, why not 8? 16 is also possible! What about 32? And on and on…

So, you can see, we can horizontally scale (add more machines) instead of just vertically scaling (getting a more powerful machine) to perform our computations! This is what MapReduce does! It distributes the computation across different machines. The computation speed now depends on the number of machines you have, which theoretically can be any number! So, our new picture would look as follows:

All the machines in MapReduce model are called nodes.

But why call it MapReduce?

Well, because that is what most computations involve! Or at least the ones MapReduce handles! When your data first comes in, you map it to some intermediate value, and then you reduce it to a final value!

The overall flow of MapReduce

I think explaining this with an example would be better. Imagine you have a log of users mapped to which websites they visited, and at what time of day, as below:

This can be in any format, CSV, JSON, etc. It just needs to be a Key-Value pair

Now, if I want to rank the websites based on the number of visits it had, how would I do that?

Well, first we “map” the website to the user count as follows:

we map it as key-value of website (key) to user count (value)

Now that we have our intermediate data, we “reduce” it (basically, sum it):

And voila! We can now see that google.com and facebook.com were visited the most, followed by youtube.com! We have effectively mapped and reduced our data into a form that is useful to us!
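The whole map-then-reduce flow above can be sketched in plain Python. The log entries here are made up for illustration:

```python
from collections import defaultdict

# Hypothetical visit log: (user, website, time-of-day) tuples.
log = [
    ("alice", "google.com", "09:00"),
    ("bob", "google.com", "09:05"),
    ("alice", "facebook.com", "10:00"),
    ("carol", "facebook.com", "11:00"),
    ("bob", "youtube.com", "12:00"),
]

# Map phase: emit the intermediate pair (website, 1) for every visit.
mapped = [(website, 1) for _, website, _ in log]

# Group the intermediate values by key (website).
grouped = defaultdict(list)
for website, count in mapped:
    grouped[website].append(count)

# Reduce phase: sum the counts for each website.
visits = {website: sum(counts) for website, counts in grouped.items()}

# Rank websites by total visits, most visited first.
ranking = sorted(visits.items(), key=lambda kv: kv[1], reverse=True)
# visits == {"google.com": 2, "facebook.com": 2, "youtube.com": 1}
```

In the real framework, the map and reduce phases would each run in parallel across many machines; this sketch just shows the data transformation.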

Okay, but where can we use it?

Even though it sounds simple, MapReduce powers a surprising number of things!

Distributed Grep:

We all know that we can use grep to find a particular keyword in a document. But what if the document is terabytes or petabytes large? It might take us quite some time and resources to find what we are looking for! Why don’t we split the document into chunks, find the keyword in each chunk (map phase), and then combine all the matches together (reduce phase)?
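Here is a tiny sketch of that idea. The chunks and function names are illustrative; in the paper, the reduce function for distributed grep is essentially the identity, just concatenating the intermediate matches:

```python
# Map phase: for one chunk, emit the lines that contain the keyword.
def grep_map(chunk, keyword):
    return [line for line in chunk if keyword in line]

# Reduce phase: concatenate the matches from all chunks.
def grep_reduce(match_lists):
    return [line for matches in match_lists for line in matches]

# Two hypothetical chunks of a large log file.
chunks = [
    ["error: disk full", "info: ok"],
    ["warn: slow", "error: timeout"],
]

matches = grep_reduce(grep_map(chunk, "error") for chunk in chunks)
# matches == ["error: disk full", "error: timeout"]
```

Because each chunk is searched independently, the map calls could run on different machines at the same time.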

Count of URL access frequency:

This one we saw above in our example. Funnily enough, similar counting over the web’s link structure forms the backbone of the PageRank algorithm used to rank websites. The very thing that started Google!

Reverse Web-Link Graph:

Say there is a webpage that references 10 other webpages. We have a one-to-many relationship here. What if I want to go backwards and see which webpages link to my current webpage? Can you see the need to invert our one-to-many data?
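Inverting the graph fits the model neatly: the map phase flips each edge, and the reduce phase groups the flipped edges by target. A small sketch with made-up URLs:

```python
from collections import defaultdict

# Forward links: page -> list of pages it references (hypothetical URLs).
links = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
}

# Map phase: for each (source, target) edge, emit the inverted pair (target, source).
mapped = [(target, source)
          for source, targets in links.items()
          for target in targets]

# Reduce phase: collect every source that points at each target.
reverse = defaultdict(list)
for target, source in mapped:
    reverse[target].append(source)
# reverse["c.com"] == ["a.com", "b.com"]
```

The output answers exactly the backtracking question: for any page, `reverse` lists who links to it.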

Distributed Sort:

No need to explain why sorting is important! But when the data we are dealing with is this huge, it is better to break it up into chunks (sounds like merge sort?)
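One way to sketch the chunking idea: partition keys into ranges (map phase), sort each partition independently (reduce phase), then read the partitions back in range order. This is my own simplified illustration, not the paper's exact sort:

```python
from collections import defaultdict

data = [42, 7, 99, 15, 3, 60]

# Map phase: assign each key to a range partition.
# Here, two hypothetical partitions: keys in [0, 50) and keys in [50, 100).
partitions = defaultdict(list)
for key in data:
    partitions[key // 50].append(key)

# Reduce phase: sort within each partition; each partition could be
# sorted on a different machine.
result = []
for bucket in sorted(partitions):
    result.extend(sorted(partitions[bucket]))
# result == [3, 7, 15, 42, 60, 99]
```

Because every key in partition 0 is smaller than every key in partition 1, concatenating the sorted partitions gives a globally sorted result, no merge step needed.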

Inverted Index:

Say you have a book. Now you want to write a program that takes in a word, scans through the entire book, and returns the pages on which that word appears. I think we can use MapReduce here! MapReduce will take in each page of the book and emit a <word, documentId> pair, recording that “this” word was used in “this” document. In the reduce phase, we collect all the occurrences of each word into a list as <word, list(documentId)> and voila! We have our word lookup ready!
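The inverted index can be sketched the same way as the earlier examples. The pages here are made up; each page plays the role of a document with an id:

```python
from collections import defaultdict

# Hypothetical book: documentId (page number) -> text on that page.
pages = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick thinking dog",
}

# Map phase: emit a (word, documentId) pair for every word on every page.
mapped = [(word, doc_id)
          for doc_id, text in pages.items()
          for word in text.split()]

# Reduce phase: collect the document ids for each word into a list.
index = defaultdict(list)
for word, doc_id in mapped:
    if doc_id not in index[word]:
        index[word].append(doc_id)
# index["quick"] == [1, 3]; index["dog"] == [2, 3]
```

A lookup is now just `index["quick"]`, which is exactly the <word, list(documentId)> shape described above.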

Conclusion

Based on the examples above, we can conclude that we can use MapReduce in cases where:

The dataset is huge!

When we want to leverage the power of distributed computing

In this section, we understood what MapReduce is and what it can be used for. Now that we have the black-box definition out of the way, in the coming articles we shall understand how MapReduce works under the hood! So stay tuned, folks!
