Image: http://www.sdtimes.com/images/hadoop.jpg

Ryu should say: HADOOP…ER

Nuno Barbosa
bitmaker

--

For some time I've wanted to build a (very) basic Hadoop cluster and write about how it should be done, but it was you, my dear reader/follower/friend/no-better-thing-to-do-person, that made me embark on this awesome odyssey. We will fight giants, kill huge dogs with three heads, defeat pseudo-gods and sing Somewhere Over the Rainbow while doing it… Well, almost… But still, I'll do my best to explain step by step how I created the clusters, help you somehow, and we can both learn together.

I'll first explain a bit about the technologies involved and state the requirements. After this, we start building a single-slave cluster, which will be our base system, and then improve the cluster by adding more nodes. We also provide helpful test code for you to check whether the cluster is running smoothly or not.

I hope you enjoy! Let’s do it!

Tech List

This adventure requires us to understand some core concepts. Here is the small list of techie goodies we use:

  • Hadoop: The heart of our joint adventure. Hadoop is an open-source software framework for storing data and running applications on clusters. It is widely used to process huge amounts of data in a multitude of formats;
  • HDFS: If Hadoop were a brain, this is where its memories would be stored. HDFS stands for Hadoop Distributed File System and it's optimized to handle distributed data in Hadoop;
  • YARN: Yet Another Resource Negotiator, literally. It's used to split the functionalities of resource management and job scheduling/monitoring into separate daemons;
  • MapReduce: A programming model used to process and generate large quantities of data in a parallel and distributed fashion.

Of course, we could explain the listed technologies further, but then we would lose focus on our fun task: building the Hadoop single/multi-node cluster.

Requirements

We will use virtual machines to create the cluster since we don't have physical resources to build an actual one. I'll try to explain as much as I can along the way about each development toy we use. Here is the list of what you must have on your computer:

  • VirtualBox (VBox): the VM manager. You can get it here;
  • Vagrant: an awesome tool to quickly deploy VMs using VirtualBox (or actually any other manager). Grab it!

Of course you’ll need even more tools but those will be installed in the VMs (AKA Guest machines) and not on your computer (AKA Host machine).

Please install these tools… I’ll wait… I’m not going anywhere! Continue when you’re ready.

We start by being Single…

Yes… We all start by being single, as will this short adventure. So, assuming that everything is set up, we start by deploying one VM using Vagrant and VBox. Vagrant runs based on the configuration declared in a configuration file, usually called Vagrantfile.

Go to the directory in which you want to work and create the Vagrantfile with the following content:
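In case the embedded gist doesn't load for you, here is a minimal sketch consistent with the description in the next paragraphs (one "master" machine, the ubuntu/trusty64 box, a private IP, a forwarded port for Hadoop's web UI, a public network bridge, 1024 MB of RAM, one CPU core and a shell provisioning block; the provisioning commands are only placeholders). Note that the line numbers mentioned below refer to the original gist, not to this sketch:

# -*- mode: ruby -*-
Vagrant.configure("2") do |config|
  # Master node of the Hadoop cluster
  config.vm.define "master" do |master|
    master.vm.box = "ubuntu/trusty64"

    master.vm.network "private_network", ip: "192.168.2.100"
    master.vm.network "forwarded_port", guest: 50070, host: 50070
    master.vm.network "public_network"

    master.vm.provider "virtualbox" do |vb|
      vb.memory = 1024
      vb.cpus = 1
    end

    # Placeholder provisioning block: put your own bash commands here
    master.vm.provision "shell", inline: <<-SHELL
      apt-get update
    SHELL
  end
end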

Let's try to explain a bit of what is happening. The main configuration block spans lines 1 to 23. Line 2 is a comment: when a line starts with #, Vagrant interprets it as a comment. From line 3 to 22 we configure a machine named “master”; every line inside this block deploys and configures the first machine.

Line 4 tells Vagrant which box we want to use (it will be downloaded from the HashiCorp repository); we will use the Ubuntu 14.04 Trusty Tahr 64-bit release. Line 6 sets the machine IP on the Vagrant private network; Line 7 forwards a port from the host to the guest machine (this port will be used to monitor Hadoop); and Line 8 enables the public network bridge.

The provider entries for this VM (lines 10–13) configure the physical resources to be used by the guest machine. In this case, 1024 MB of the host's memory will be reserved for this VM (line 11), as well as one CPU core (line 12).

The last set of instructions (lines 15–21) runs regular bash commands in the guest's shell to provision the machine.

You can now boot up your machine by running vagrant up in the working directory. Be aware that you will have to select the network interface you want Vagrant to bridge. In this case, select the one you're using to access the internet.

You now have my permission to grab a cup of coffee, because this normally takes a bit. Why, you may ask… Because if you have never used the selected box before, Vagrant will download it from the repository, install it and configure it.

After finishing this step, let's move on to the next one…

Getting your hands dirty
We have our VM running and begging to be used. Access is done by calling ssh, and the awesome Vagrant has a helper for it: vagrant ssh. You are now logged in to the VM and we can start configuring the cluster.

Step 1. Install Oracle Java 8
There isn't any special reason why I use Oracle Java instead of OpenJDK or another flavor… So, you can use another version if you prefer. To install it, follow these steps:

sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update && sudo apt-get -y upgrade
echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | sudo /usr/bin/debconf-set-selections
sudo apt-get -y install oracle-java8-set-default

The only line that requires some explanation is the third one: it pre-accepts the Oracle Java 8 license terms so the installation does not block waiting for input. You can confirm the installation by running java -version.

Step 2. Add Hadoop User “hadooper” and Group “hadoop”
Hadoop will run on a specific group and under a specific user. You could install using the default user but this keeps the house tidy:

sudo addgroup hadoop
sudo adduser --ingroup hadoop hadooper
sudo adduser hadooper sudo

Step 3. Setting up SSH
Let’s setup SSH and the RSA keys. Hadoop uses SSH to communicate between nodes (even the master node with itself).

sudo apt-get -y install ssh
su hadooper
cd ~
ssh-keygen -t rsa -P ""
cat ./.ssh/id_rsa.pub >> ./.ssh/authorized_keys

Step 4. Grabbing Hadoop
The latest Hadoop version available at the time of writing is 2.7.2. We use one of the available mirrors to download the package (it was automatically assigned when I went to the download page):

wget http://mirrors.fe.up.pt/pub/apache/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
tar -xzvf hadoop-2.7.2.tar.gz
sudo mv hadoop-2.7.2 /opt/hadoop

We install the package in the /opt directory, but you are free to place it anywhere else.

Step 5. We need a nice Environment
Now, we need our hadooper to be able to run the commands related to Hadoop anywhere. We are going to edit hadooper’s ~/.bashrc and add the following lines to the end of it:
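In case the gist with the exact lines doesn't load, a minimal set of exports consistent with the paths used in this guide (Hadoop in /opt/hadoop, and Oracle Java 8 in /usr/lib/jvm/java-8-oracle, its usual install location) would be something like:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin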

From now on, we will use these variables whenever needed, even during the writing of this story. Don't forget to load the new configuration: source ~/.bashrc.

Step 6. Hadoken… Errr… Hadooper
This step configures the Hadoop deployment. We will edit three different files that support the Hadoop environment and execution. Let's start by editing $HADOOP_HOME/etc/hadoop/hadoop-env.sh and changing the JAVA_HOME entry to:
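Assuming the Oracle installer placed Java in its usual location (the same path used in ~/.bashrc above), the entry becomes something like:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle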

Now we copy and edit the MapReduce site template. To copy:

cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml

And then we dig into the files $HADOOP_HOME/etc/hadoop/core-site.xml and $HADOOP_HOME/etc/hadoop/mapred-site.xml and set the configuration tag in each as follows:
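If the gist doesn't load, a single-node configuration consistent with this guide would be roughly the following (the port numbers are illustrative defaults, not gospel). For core-site.xml, pointing clients at the namenode:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

And for mapred-site.xml, setting the job tracker address:

<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>localhost:54311</value>
  </property>
</configuration>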

Step 7. We need storage
We are now going to set up Hadoop's file storage. HDFS requires that we select a directory to hold the file system; I've selected /var/hadoop/hdfs. We need to declare two sub-folders: namenode, where Hadoop keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept (it does not store the data itself); and datanode, where the actual data is stored.

The creation is done by:

sudo mkdir -p /var/hadoop/hdfs/namenode
sudo mkdir -p /var/hadoop/hdfs/datanode
sudo chown -R hadooper /var/hadoop

And the configuration is done by editing $HADOOP_HOME/etc/hadoop/hdfs-site.xml and changing the configuration tag to:
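If the gist doesn't load, a configuration tag consistent with the explanation below would be roughly:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///var/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///var/hadoop/hdfs/datanode</value>
  </property>
</configuration>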

If you check the first property, dfs.replication, it is set to 1. This means the data will be placed on only one node, in this case the master/single node. The following properties define the namenode and datanode directories.

Finally, we format the file storage by running:

hdfs namenode -format

Step 8. Start the Single Node Cluster
Watch the results of your hard work running by typing the following commands (type yes if the SSH connection asks for confirmation):

start-dfs.sh
start-yarn.sh
jps

Where the first line boots the HDFS, the second starts the resource negotiator and the last line just gives you feedback on the services running.

Congrats, young Padawan! Single Node Cluster running you have now! (Ohhhh… Good ol' Yoda). If you want to test it right now, check Annex A, where I placed an example to be compiled and executed in your cluster (single or multi node, it works on both). You can also check the cluster status by opening a browser on your host machine and going to:

http://localhost:50070/

We can now jump to next big thing: Going Multi Node!

…But we grow and go Multi

By now you should have a working Hadoop cluster with a single node. To transform it into a multi-node cluster, you need more machines; we will use two more virtual machines. More than three VMs is overkill unless you have an awesome computer. Adding the VM declarations to the Vagrantfile will be our first step. After that, we will reconfigure our first VM, master, to behave as the lead node. The following move is to set up each node. This is probably the most boring task, because you'll need to repeat in each new node several steps you already performed on master.

Get set! Ready! Go!!!!!

Step 9. Add more machines
We will now add two VMs to the Vagrantfile. Edit the Vagrantfile on your host machine and add the slave1 and slave2 machines:
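If the gist doesn't load, a minimal sketch of the two extra machine blocks, added inside the same Vagrant.configure block and mirroring the master definition (their own IPs, no forwarded port), would be:

  config.vm.define "slave1" do |slave1|
    slave1.vm.box = "ubuntu/trusty64"
    slave1.vm.network "private_network", ip: "192.168.2.101"
    slave1.vm.network "public_network"
    slave1.vm.provider "virtualbox" do |vb|
      vb.memory = 1024
      vb.cpus = 1
    end
  end

  config.vm.define "slave2" do |slave2|
    slave2.vm.box = "ubuntu/trusty64"
    slave2.vm.network "private_network", ip: "192.168.2.102"
    slave2.vm.network "public_network"
    slave2.vm.provider "virtualbox" do |vb|
      vb.memory = 1024
      vb.cpus = 1
    end
  end

The line numbers below refer to the original gist.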

The two machines are declared in:

  • Lines 24–43: slave1 machine with IP 192.168.2.101
  • Lines 45–64: slave2 machine with IP 192.168.2.102

Every time you want to execute a Vagrant command directed at one machine, specify the machine name after the command itself; e.g., if you want to boot only the master machine, you execute vagrant up master. Another example is SSH access: vagrant ssh slave1 to access the slave1 machine. If you do not specify a name, the command runs over all machines, as long as the command supports it, i.e., vagrant up will boot all machines but vagrant ssh will fail.

In the next steps I'll place "Master" or "Slaves" in the title so that you don't lose track of where we stand at each point.

Step 10. Master: Stop HDFS and YARN
Simple. Run the following commands:

stop-dfs.sh
stop-yarn.sh

Step 11. Master: Declare companions
We need to teach master (pun intended) which machines are its companions, AKA slaves. Edit the /etc/hosts file (requires sudo) and add the following lines after the initial localhost line (if that line is there):

127.0.0.1 localhost # After this line
192.168.2.100 master
192.168.2.101 slave1
192.168.2.102 slave2

Master now knows the IP addresses of the two slaves, which enables it to execute remote tasks on them.

Step 12. Master: Adapt configuration
Edit the following files according to the gist (a sketch of the three files follows the list below):

  • $HADOOP_HOME/etc/hadoop/hdfs-site.xml
  • $HADOOP_HOME/etc/hadoop/mapred-site.xml
  • $HADOOP_HOME/etc/hadoop/yarn-site.xml
  1. hdfs-site.xml — We now have a replication value of 2 and removed the datanode entry from master. We also need to clean the /var/hadoop folder to fit the new configuration:
    sudo rm -r /var/hadoop
    sudo mkdir -p /var/hadoop/hdfs/namenode
    sudo chown -R hadooper /var/hadoop
  2. mapred-site.xml — Besides the obvious update to the job tracker address, we also define that YARN is now responsible for distributing MapReduce tasks across the entire cluster.
  3. yarn-site.xml — We declare that we want to use the resource tracker, scheduler and resource manager at the specified addresses. You can assign other ports, but be careful not to assign ports that are already in use.
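If the gist doesn't load, sketches of the three files consistent with the descriptions above would be roughly as follows (the hostnames come from /etc/hosts; the ports are illustrative). hdfs-site.xml keeps only the namenode directory and raises the replication factor:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///var/hadoop/hdfs/namenode</value>
  </property>
</configuration>

mapred-site.xml points the job tracker at master and hands MapReduce over to YARN:

<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>master:54311</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

And yarn-site.xml declares where the resource tracker, the scheduler and the resource manager live:

<configuration>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8050</value>
  </property>
</configuration>

Depending on your Hadoop version, you may also need yarn.nodemanager.aux-services set to mapreduce_shuffle in yarn-site.xml for MapReduce jobs to run on YARN.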

Step 13. Master: Identify slaves (nodes)
We will now tell Hadoop which machines are its slaves. Edit the file $HADOOP_HOME/etc/hadoop/slaves and add the following two lines:

slave1
slave2

And also inform him that he is the master by adding to $HADOOP_HOME/etc/hadoop/masters the line:

master

Step 14. Slaves: Redo steps
In each slave, execute the following steps you also did on master, keeping the same password for the hadooper user:

  • Step 1;
  • Step 2;
  • Step 3;

Step 15. Slaves: Reuse Master’s pair of SSH keys
This is a bit of a trick, but it helps later on. From each slave, execute:

scp hadooper@master:~/.ssh/* ~/.ssh/

You need to be the hadooper user on the slave. This will make it easier to access every node in the near future.

Step 16. Slaves: Get Hadoop directly from master
This will save us a lot of time. In each slave, execute:

scp -r hadooper@master:/opt/hadoop/ ~/hadoop
scp -r hadooper@master:~/.bashrc ~
source ~/.bashrc
sudo mv ~/hadoop /opt/hadoop

Step 17. Slaves: Configure HDFS
Since each slave is a datanode, we must adapt the file system to that role by running:

sudo mkdir -p /var/hadoop/hdfs/datanode
sudo chown -R hadooper /var/hadoop/

And edit the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file in each slave:
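If the gist doesn't load, and since a slave only acts as a datanode, its configuration tag would be roughly:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///var/hadoop/hdfs/datanode</value>
  </property>
</configuration>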

Step 18. Slaves: Setting the hosts in each slave
Our odyssey has almost ended. For each slave N, we must edit the /etc/hosts file and place the lines:

127.0.0.1 localhost # After this line
192.168.2.100 master
192.168.2.10<N> slave<N>

Where <N> = {1, 2} (since we are only using two machines). Now the slaves are ready.

Step 19. Master: Almost there…
This is the final step in this amazing adventure you came across. In master, execute as hadooper:

hdfs namenode -format
start-dfs.sh
start-yarn.sh
jps

If you have problems connecting to either one of the slaves, try to connect to it directly over SSH to troubleshoot the issue. Sometimes it may complain that one of the slaves is not safe and you'll only need to add the key to the authorized_keys file, which connecting to that troublesome node will do for you… Also, I had to reboot all the machines once after the installation because the Hadoop master couldn't connect to slave2, possibly because SSH wasn't running correctly or the keys weren't loaded.

And voilà: you have your cluster… if everything went OK! :D To check the status, point the browser on your host machine to http://localhost:50070/.

You can test the cluster again using the example in Annex A. The code should be executed on master.

One thing that we left out is running the cluster automatically at startup. In Ubuntu you can use init.d scripts. More info on this here.

Conclusion

We went on an incredible adventure to build a powerful Hadoop cluster (or not)… We were introduced to a set of technologies and the requirements to follow the story. Then we built a single-slave cluster by following a set of steps, and went on to build a cluster with three machines, one master and two slaves, where the master handles the work distribution and the slaves handle the processing and data storage. We used virtual machines just as a test case but, in the real world, we should use physical machines or machines/services with advanced virtualization capabilities.

Hadoop has a lot of companion tools, and tools that improve Hadoop itself. The next step would be to integrate Hadoop with a big data database and use tools to handle the interaction between Hadoop and the selected big data solution. Also, exploring other technologies built on top of Hadoop, like Spark, is a good path to follow.

Drop me a line if you find any errors or get stuck anywhere in the story and I'll try to help you as best as I can.

Hope you enjoyed!

More References

https://medium.com/@markobonaci/the-history-of-hadoop-68984a11704#.b17jz23m0

https://medium.com/@nikantvohra/hadoop-82e96891022c#.hfbpyg7tu

http://hortonworks.com/hadoop-tutorial/using-commandline-manage-files-hdfs/

https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

https://www.vagrantup.com/docs/

Annex A

In any programming language or new programming framework there is a typical introductory program, which we commonly call Hello World!. The same happens in Hadoop. The most basic example used in Hadoop with MapReduce is a word counting program. You could write it in Python, R, Scala or Java but, since we at Bitmaker Software mostly use the latter, the example we give is in Java.

Also, the credit for this code is not ours: it was taken from the official Apache Hadoop documentation. You can create the file WordCount.java in hadooper's home directory (cd ~) and paste the following code:
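In case the embedded gist doesn't load, the word count example from the Apache MapReduce tutorial (linked in the references above) is essentially the following:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sums the counts for each word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}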

The following lines are to be run in the command line and will compile your code using Hadoop libraries:

hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
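If the first command complains that it cannot find com.sun.tools.javac.Main, the Apache tutorial first points HADOOP_CLASSPATH at the JDK's tools.jar, i.e. export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar.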

As we explained during the tutorial, Hadoop uses HDFS as its file storage system. Every file you want to process must be placed in an HDFS folder. You must first create the directory and then copy the files to be processed. This can be achieved using Hadoop's fs commands. Begin with the directory creation:

hadoop fs -mkdir -p /user/hadooper/wordcount/input

If it fails, make sure HDFS and YARN are running by executing the jps command. If you see something like:

4674 Jps

Instead of:

5185 Jps
5075 SecondaryNameNode
4838 NameNode
(… or with more information)

You're probably not running the file storage and/or the resource negotiator. Just run the next couple of lines in the command line and try again:

start-dfs.sh
start-yarn.sh

If for some reason the namenode entered safe mode, you can disable it by running:

hdfs dfsadmin -safemode leave

We also need at least one test file (but you can use multiple…). I grabbed a UTF-8 text copy of the King James Bible and put it in the folder we created, so we can do the word counting:

wget https://www.gutenberg.org/ebooks/10.txt.utf-8 -O kjb.txt
hadoop fs -put kjb.txt /user/hadooper/wordcount/input

Finally we run:

hadoop jar wc.jar WordCount /user/hadooper/wordcount/input /user/hadooper/wordcount/output

And the results can be shown by:

hadoop fs -cat /user/hadooper/wordcount/output/part-r-00000
