Learning Data Science: Day 14 - MapReduce
We are going slightly away from machine learning. Today, we are going to talk about working in parallel. As a starter, we are going to talk about MapReduce.
Introduction to MapReduce
Originally developed by Google, MapReduce is a programming model for huge data sets. When we are talking about huge data sets, we don’t mean in gigabytes. What we mean is terabytes of data or even petabytes. MapReduce runs in parallel and distributed.
The good thing about MapReduce is that it is also a cluster framework. Usually, we have to deal with a cluster with different computer nodes. We have to make sure no instances crash, no instances running too slow, etc. But, MapReduce already takes care of them. What a good time to live!
If we are using Python, the easiest way to run it on Hadoop (I know we haven’t talked about Hadoop, we will talk about it later) is using mrjob. mrjob take its flexibility to another level by giving the capability to test the code on your computer without installing Hadoop.
Going back to MapReduce. As the name suggests, MapReduce consists of two main functions that are Map and Reduce. The basic idea of MapReduce is that we pair key and value. Then those key value pairs are brought to mappers.
The mapper’s function can be anything. Basically, the multiple instances of the mapper created to make sure the data can be distributed to different mappers. This will allow mappers working parallelly.
After the data processed on mappers, it comes to the process of shuffle and sort. The nice thing about the framework, the shuffle and sort process already handled by the framework. Internally, it will do a shuffle and sort, then aggregate all the values based on keys. For an example let’s say we have 2 mappers.
First mapper output (key, values): A, 1
Second mapper output (key, values): A, 5
And during the shuffle and sort, it’s all sorted according to the key. So, the result will be something like below.
List sorted by (key = A): 1, 5
One thing to remember, after mappers give output values the framework will write to disk. So, MapReduce is not efficient with the I/O.
The next process after that will be the reduce process. The reduce is the function that runs after the shuffle and sort have finished. The function itself would depend on what you want to do. Let’s use the previous example and let’s say the reduce function is summing up all the values for each key. So, the result will be something like below.
Reduce function (A): 1 + 5
Reduce function (A): 6
Now we are going to make a hello world that is not an ordinary hello world. This hello world script will be a word count script.
In mapper function, the script will read every line of the file and loop through each line and split it into words. We will assign the lowercase version of each word for key, and put a ‘1’ for the value.
Then in reducer function, the script will sum the total values for a specific key.
Try to run the script on your own PC. Basically, the input can be anything. If you are too lazy to make the input file you can take one from here.
After you define the input file, we can run the script by running
python hello-world.py < input.txt > output.txt. The output.txt is the file name for the output file which counts the words of the input file. The output will be something like the snippet below.
You may be wondered why there exist the first line even though we have split the words by using whitespace as the delimiter. It is actually the newline, not the white space.
You might wonder, how the script will perform if we only applying Map and Reduce of the word count script. Let’s say we have multiple occurrences of words in a mapper.
Words: I love what I love
If we only use Map and Reduce, we are getting these kinds of output from the mapper.
Key, value: (i, 1), (love, 1), (what, 1), (i, 1), (love, 1)
You may see that the key ‘i’ and ‘love’ is redundant. What if we have 1 billion lines of words, it can be a huge overhead. It can be optimized by using Combiner.
We are not going to talk about partition at the moment. Basically, combiner function exists to help the reducer function. Combiner placed between mapper phase with shuffle and sort. Since it’s trying to help reducer, the function usually similar to what the reducer do.
So, we came back to the last example where we have this list of words.
Words: I love what I love
Instead of giving ‘1’ value to each word multiple times, it would be better if the mapper already combines the values together for the exact same keys. That’s the function of the combiner. So, instead of the previous output by applying combiner we can get output like below.
Key, value: (i, 2), (love, 2), (what, 1)
Now we can try it on the previous example script. So, what we have to do is to actually implement the combiner method that summing the values for each key. So, this is what we come up with.
Even though it help to reduce traffic on reducer, the problem with combiner is that the combiner isn’t guaranteed to run at all or multiple times. So, you need to make sure that the code should be able to run without problem even though the combiner didn’t run at all or multiple times.
Today, we have talked about a (simple) introduction to MapReduce. In the future, we may discuss Spark or even Hadoop. Hopefully, this story may help you in learning data science. Have a nice weekend!