Running a MapReduce Job in Pseudo-Distributed Mode

In my previous article (https://medium.com/@anmol.ganju81/configuring-hadoop-on-linux-rhel-7-cent-os-fedora-23-machine-3dc8caf57ec9) I explained how to configure Hadoop in pseudo-distributed mode on an RHEL 7 or CentOS 7 machine, so anyone reading this article should already have a basic idea of configuring Hadoop on a Linux machine (there are other flavours like SUSE and Ubuntu, but I covered RHEL 7, CentOS 7 and Fedora 23). Hadoop can also be configured in a Windows environment, whether Windows 10, 8 or 7; I will talk about that later. My focus today is to briefly explain how to run a MapReduce job in pseudo-distributed mode. This procedure does not involve as many steps as you may have seen in my previous blog; it is a very straightforward tutorial. First let me explain what MapReduce is, who runs it, how it functions, on which nodes MapReduce jobs run, and a few other questions. So let's get started.

What Is MapReduce?

MapReduce is a framework with the help of which we process huge amounts of data. To run a MapReduce task we first have to write a MapReduce program. This program can be written in several languages; the three I know are Java, C++ and Python. In this tutorial I will explain how to run the MapReduce job using Java only.

How does MapReduce function, and which nodes are responsible for it?

MapReduce works on the basis of key/value pairs and follows a fixed architecture. The framework consists of two important tasks, namely Map and Reduce. The map task takes a set of data and converts it into another set of data, where individual elements are broken down into key/value pairs. The reduce task then takes the output of the map as its input and combines those key/value pairs into a smaller set of key/value pairs. As the name MapReduce implies, the reduce task is always performed after the map task. Under the MapReduce model, the data processing primitives are called mappers and reducers. There is also a combiner, which runs between the map and reduce tasks, but it is optional; it is typically used when the amount of data coming out of the mappers is much larger than it needs to be.

The second thing is the code: we have to write a mapper, a reducer and a driver class. The driver class is the main class of the MapReduce job; it executes first and then launches the map and reduce tasks that are wired up inside it. If the combiner needs to run, we only have to add one line of code for that. This is the basic recipe for writing a MapReduce job. I will explain how to write the code in another article; for now let's just focus on understanding the concepts rather than going into depth.
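
Just to make the shape of the driver class concrete, here is a rough sketch of what a WordCount-style driver could look like. This is my own illustration, not the exact code inside the examples jar we run below, and the class names WordCountMapper and WordCountReducer are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);     // map phase
        job.setCombinerClass(WordCountReducer.class);  // the one optional line that enables the combiner
        job.setReducerClass(WordCountReducer.class);   // reduce phase

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not already exist)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}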

The services responsible for the MapReduce task are the JobTracker and the TaskTrackers. I hope you know what they are: the JobTracker is the service within Hadoop that distributes MapReduce tasks to the different DataNodes (the DataNodes store the actual data in HDFS) and schedules the MapReduce jobs over them, while a TaskTracker keeps track of the MapReduce tasks running on its DataNode and keeps sending heartbeat signals to the JobTracker until it finishes. (In Hadoop 2.x, which is what we run below, this role is handled by YARN's ResourceManager and NodeManagers, but the idea is the same.)

How do we run a MapReduce task in pseudo-distributed mode?

To achieve this we will use an RHEL 7 / CentOS 7 / Fedora 23 machine. We are going to copy a text file (.txt) into HDFS and then run a jar against the directory that contains our text file. This jar contains the mapper, reducer and driver code required to run the job. So let's do this; I will also include the screenshots I took during the procedure.

Creating a file.txt:-

You have to create a file (or you can just copy any existing file into an HDFS directory, i.e. /user/(your hadoop username)/(your input directory), with the help of the -put command) while logged in as the hadoop user. Remember, the file gets created on the local machine first, not in HDFS.

$ vi file.txt
This is the file.txt which I have created inside the hadoop user's home directory; the file is created under /opt/hadoop, the directory that was specified at the time of user creation (read the previous blog).
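
For this walkthrough, assume file.txt contains the same few lines that the architecture example later in this article is based on:

Saturn is a planet
Earth is a planet
Pluto is not a planet anymore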

1. Create a directory inside HDFS:-

Execute the command below.

$ hdfs dfs -mkdir /user/dead/anmol 
(dead is my hadoop username and anmol is the directory I am going to create)

2. Now copy the file from the local directory to HDFS:-

This can be done with the help of the -put command; execute the command below.

$ hdfs dfs -put file.txt /user/dead/anmol 
(what to put, and where to put it).
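
To confirm the file actually landed in HDFS, you can list the directory (same path as above):

$ hdfs dfs -ls /user/dead/anmol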

3. Now run the command to start the MapReduce job:-

We specify the jar, the name of the example program inside it (wordcount), the input folder on which we want to run the job, and then the output directory, which is created automatically and will hold a part-r-00000 file. Run the command:-

$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount anmol output 
(we specified the path of the jar first, then the name of the example program, after that the input directory, and then the output directory, which gets created automatically)
The screenshot above shows the command I ran to start the MapReduce job; the console output confirms that the map and reduce operations have completed.

And when we navigate to the output directory we can see the part-r-00000 file.

By doing an ls we can see the files inside the output directory:
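
Assuming the relative paths used above (so the job wrote its results to output under the HDFS home directory), the listing command would be:

$ hdfs dfs -ls output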

And when we use the cat command we can see the following output:-
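
Again assuming the same output location, the result file can be printed with:

$ hdfs dfs -cat output/part-r-00000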

The output shows us how many times each word appears.

So this was how to perform a MapReduce operation on a Linux machine. For a more detailed explanation of how this whole program works, read along. I am not explaining the code itself here; I am just explaining how the whole thing works, i.e. the general architecture of a MapReduce job.

How MapReduce functions: a more elaborate explanation

I have made an image that shows the basic flow of the wordcount MapReduce program above. I will try to explain it as simply as I can; just follow along.

An image explaining the MapReduce job we performed earlier

In every MapReduce job there are basically three phases, and apart from them one phase is optional, i.e. the combiner phase. A combiner does not have a predefined interface of its own; it must implement the Reducer interface's reduce() method. The combiner phase runs only when the output of the mappers is very large, i.e. when the amount of data that needs to be reduced to extract some valuable information is large.

Now let me explain the wordcount example:-

Record Reader:- The record reader is the built-in interface that reads the file present in HDFS and breaks it into key/value pairs as its output; this phase runs before the mapper phase. Our input file was file.txt, so the input will look like this:-

Saturn is a planet
Earth is a planet
Pluto is not a planet anymore

And the output given by the record reader will be in key/value pair format, where the key is the byte offset at which each line starts:-

<0, Saturn is a planet>
<19, Earth is a planet>
<36, Pluto is not a planet anymore>

Then comes the mapper phase. Based on the code written for the mapper, the map phase takes its input from the record reader, processes it, and produces another set of key/value pairs as output, as you can see in the image I added above that shows the map output.

Mapper Input (from the Record Reader):-

<0, Saturn is a planet>
<19, Earth is a planet>
<36, Pluto is not a planet anymore>

Mapper Output:-

<Saturn,1> <is,1> <a,1> <a,1> <planet,1> <planet,1>
<planet,1> <is,1> <is,1> <not,1> <a,1>
<anymore,1> <Pluto,1> <Earth,1>
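
A mapper that produces this kind of output could look roughly like the sketch below. This is a typical WordCount mapper written for illustration, not the exact class from the examples jar; the record reader hands it the byte offset as the key and the line as the value:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line itself
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit <word, 1> for every word in the line
        }
    }
}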

After this the combiner phase comes into play. The combiner reads each key/value pair and groups the common words as keys with their values gathered into a collection. Usually the code and operation of a combiner are similar to those of a reducer. These are the three classes that get wired up inside the driver class when the MapReduce job is initialised (the class names below are just placeholders):-

job.setMapperClass(WordCountMapper.class); (mapper code)
job.setCombinerClass(WordCountReducer.class); (combiner, here the same class as the reducer)
job.setReducerClass(WordCountReducer.class); (reducer code)

Combiner Input (the Mapper output):-

<Saturn,1> <is,1> <a,1> <a,1> <planet,1> <planet,1>
<planet,1> <is,1> <is,1> <not,1> <a,1>
<anymore,1> <Pluto,1> <Earth,1>

Combiner phase output:-

<Saturn,1> <is,1,1,1> <a,1,1,1> <planet,1,1,1> <not,1> <anymore,1>
<Pluto,1> <Earth,1>

After this the reducer phase comes in. The reducer takes each key/value-collection pair from the combiner phase, processes it, and emits the output as key/value pairs. Note that in this example the combiner's functionality is the same as the reducer's.

Reducer Input (the Combiner output):-

<Saturn,1> <is,1,1,1> <a,1,1,1> <planet,1,1,1> <not,1> <anymore,1>
<Pluto,1> <Earth,1>

Reducer Output:-

<Saturn,1> <is,3> <a,3> <planet,3> <not,1> <anymore,1>
<Pluto,1> <Earth,1>
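
A reducer that turns those grouped values into the final counts could look roughly like this (again a sketch for illustration; in a WordCount job the same class is commonly reused as the combiner):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get(); // add up all the 1s recorded for this word
        }
        result.set(sum);
        context.write(key, result); // emit <word, total count>
    }
}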

After all these phases have run, the last step is the record writer, which writes the output to the part-r-00000 text file, so the output is the same as the final result I added in the image:-

planet    3
is        3
a         3
anymore   1
not       1
Saturn    1
Earth     1
Pluto     1

So, this was the architecture of what goes on behind the scenes when a MapReduce job runs inside a Hadoop cluster. Feel free to comment if you liked this tutorial, or to correct me if I am wrong anywhere in it; I will be highly obliged.