Install a Hadoop cluster on AWS EC2

edouard · nibbleai · Apr 10, 2019

A small cluster, by Zbynek Burival on Unsplash

This article is the second part of a series of posts describing how to build a 3-node Hadoop cluster on AWS.
Part 1: Setup EC2 instances with AWS CloudFormation

Following our previous article about setting up three EC2 instances, we can now install our Hadoop 3.x.y cluster.

We will assume that the appropriate Security Groups are associated with the instances to enable the Hadoop processes (HDFS and YARN) and to allow SSH access. If this is not the case, please refer to our previous post.

1. Overview

Our cluster will consist of 3 nodes: one Master and two Slaves.

NOTE Starting with Hadoop 3, slaves are called workers. Keep in mind that both names refer to the same Hadoop role.

1.1 — Master node roles

The role of the Master node is to manage the distributed file system and to schedule resource allocation. To this end, the Master node runs two Hadoop daemons:

  • A NameNode to manage the HDFS and keep track of the location of the data blocks
  • A ResourceManager to orchestrate the distribution of tasks between slaves via YARN

1.2 — Slave nodes roles

The slave nodes are the ones that actually store the HDFS data (sending information to the NameNode so it can keep track of the global state) and execute the processing tasks. The daemons running on the slave nodes are:

  • A DataNode to take care of the HDFS operations and store the data blocks on the local filesystem
  • A NodeManager to manage task execution

2. Prerequisites

2.1 — Install Java 8 on your cluster

You need Java 8 installed on every machine of the cluster. At the time of writing, Hadoop does not support Java 11.

sudo apt install openjdk-8-jdk-headless -y

NOTE We installed the Java Development Kit (JDK) instead of the Java Runtime Environment (JRE), as it can be useful to have some of the extras provided by the JDK. Strictly speaking, the JRE is enough to run Hadoop.
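
To quickly confirm that the expected Java version is picked up on each machine, you can check it from the command line:

java -version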

2.2 — Map each node’s IP with a hostname

To simplify the reference of nodes, we will give them hostnames: hadoop-master, hadoop-worker-1, and hadoop-worker-2.

In your EC2 web console (or via a terminal with the ip addr command), get the private IP addresses for each machine.

Then, on each machine, open /etc/hosts with your favorite editor and edit its content to match the following (you need sudo rights to modify this file):

# /etc/hosts
172.31.0.1 hadoop-master
172.31.0.2 hadoop-worker-1
172.31.0.3 hadoop-worker-2

Do not forget to replace the values with your own IPs 😉

IMPORTANT NOTE Use your EC2 instances’ private IPs: Hadoop will later bind specific services to ports on these addresses. Since the instance has no control over the public network interface, using public IPs would eventually lead to an error such as java.net.BindException: Address already in use.
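
To check that the mapping is taken into account, you can resolve the hostnames locally on each node; getent reads /etc/hosts through the standard resolver, so no network traffic is involved:

# Each hostname should resolve to the expected private IP
getent hosts hadoop-master hadoop-worker-1 hadoop-worker-2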

2.3 — Create a dedicated user for Hadoop

For clarity, we recommend naming this user hadoop. Again, you need to create this user on every node of your cluster.

In Debian/Ubuntu:

sudo adduser hadoop

Then log in to this new user:

su - hadoop

Starting now, every command has to be entered as the hadoop user, unless specified otherwise.

2.4 — Authorize SSH connections between nodes

When starting the distributed file system or the resource manager, the Master node needs to connect to the Workers. Also, during regular operation, the Workers have to constantly update the Master on their state.

We need to authorize those SSH connections by 1) creating an SSH key pair for the Master and 2) adding the public key to the list of authorized keys on all our machines.

2.4.1 — Create a SSH key-pair for the Master node

The key-pair must be free of any kind of passphrase so it can be used automatically by the Hadoop daemons.

On the Master instance only, enter the following, leaving the passphrase prompt blank:

ssh-keygen -t rsa -b 2048
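
If you prefer to skip the interactive prompts entirely, the same key can be generated non-interactively; the -N "" option sets an empty passphrase and -f chooses the output file:

# Non-interactive alternative: empty passphrase, default key location
ssh-keygen -t rsa -b 2048 -N "" -f ~/.ssh/id_rsa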

Now, print out the content of the public key, and copy it somewhere:

cat ~/.ssh/id_rsa.pub

2.4.2 — Add the Master’s public key to the authorized hosts of the Workers

For all nodes — Master included — add the content of the public key you just copied to the `authorized_keys` file.

WARNING Be sure to be logged in as the hadoop user.

echo "ssh-rsa AAAA2EA…== hadoop@master" >> ~/.ssh/authorized_keys

From now on, your Master node will be able to execute tasks on Workers.
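
To make sure the key exchange worked, you can try a passwordless connection from the Master to every node, including the Master itself (accept the host fingerprint when prompted the first time):

# From hadoop-master, as the hadoop user
ssh hadoop-master hostname
ssh hadoop-worker-1 hostname
ssh hadoop-worker-2 hostname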

3. Hadoop installation

We are now ready to install Hadoop 3 on our cluster.

From now on, it’s a pretty straightforward process. What we need to do is:

1. Download the Hadoop tarball
2. Untar it in the home directory of the hadoop user
3. Update the $PATH to include Hadoop binaries and scripts
4. Set up some environment variables

As a good engineer, you are probably embracing laziness — or let’s call it efficiency. This is why we made you a Bash script (curl must be installed on your machine).
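
A minimal sketch of what such a script might look like is shown below. It assumes Hadoop 3.1.1, the Apache archive mirror, and the usual OpenJDK 8 location on Ubuntu; adapt the version, URL, and paths to your setup.

#!/usr/bin/env bash
# Sketch of an installation script: download Hadoop 3.1.1 and set up the
# environment for the hadoop user. Run it as the hadoop user on each node.
set -e

HADOOP_VERSION=3.1.1
HADOOP_URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"

# 1. Download the Hadoop tarball
cd ~
curl -O "$HADOOP_URL"

# 2. Untar it in the home directory
tar -xzf "hadoop-${HADOOP_VERSION}.tar.gz"

# 3. & 4. Update the PATH and declare some environment variables
# (JAVA_HOME below is the usual OpenJDK 8 path on Ubuntu; check yours)
cat >> ~/.bashrc << 'EOF'
export HADOOP_HOME="$HOME/hadoop-3.1.1"
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
EOF

# Log out and back in (or run: source ~/.bashrc) so the new variables take effect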

Thank you for trusting us, but do not forget to read before executing!

Now, to complete the installation on every node, you have two options:

  • Run the same script on the Worker nodes

Or

  • Synchronize your Master’s Hadoop directory onto the Workers’ filesystems. If you do so, do not forget to also copy the .bashrc file with the new environment variables. This is probably the fastest option.

If you choose the latter, the following command is all you need:

# To run from the Master node
for worker in hadoop-worker-1 hadoop-worker-2; do
    rsync -avz ~/hadoop-3.1.1 "$worker":/home/hadoop
    scp ~/.bashrc "$worker":/home/hadoop
done

And that’s it! We now have Hadoop installed on three different machines.

To check if everything is set up as expected, you can run

$ hadoop version

on every machine (as the hadoop user, of course…). This should respond with some information about your Hadoop installation.


We now have to update the configuration of each node so that it can fulfill the role we assigned to it: one Master, two Workers. This will be covered in the next article.

Any feedback on what you just read? Or other topics you would like to learn about? Leave us a comment or drop us a message at contact [at] nibble [dot] ai.
