How To Set Up a Hadoop 3.2.1 Multi-Node Cluster on Ubuntu 18.04 (2 Nodes)

João Torres
Feb 3, 2020 · 8 min read


To start: What is Hadoop?

Apache Hadoop is a collection of open-source software utilities that allows large amounts of data to be distributed across clusters of computers and processed using simple programming models.

It provides a software framework for distributed storage and for processing big data using the MapReduce programming model. Although it was originally designed for computer clusters (“a set of tightly connected computers that work together, so they can be viewed as a single system”) built from commodity hardware, it has also found use on clusters of higher-end hardware.

The base Apache Hadoop framework is composed of the following modules:

  • Hadoop Common: libraries and utilities needed by the other Hadoop modules;
  • Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
  • Hadoop YARN (introduced in 2012): a platform responsible for managing computing resources in clusters and using them for scheduling users’ applications;
  • Hadoop MapReduce: an implementation of the MapReduce programming model for large-scale data processing.

Pre-requirements

  • Ubuntu 18.04 installed on a virtual machine.

What are we going to install in order to create the Hadoop Multi-Node Cluster?

  • Java 8;
  • SSH;
  • PDSH;

1st Step: Configuring our Network

Go to the Network Settings of your Virtual Machine and enable Adapter 2. Then, instead of NAT, choose Host-only Adapter and, where it says “Promiscuous Mode”, select the option “Allow All”.

2nd Step:

Install SSH using the following command:

sudo apt install ssh

It will ask you for your password; enter it, and confirm the installation when prompted.

3rd Step:

Install PDSH using the following command:

sudo apt install pdsh

Just as before, give confirmation when needed.

4th Step:

Open the .bashrc file in your home directory with the following command:

nano ~/.bashrc

At the end of the file just write the following line:

export PDSH_RCMD_TYPE=ssh
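
For the change to take effect in your current terminal, reload the file:

source ~/.bashrc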

5th Step:

Now let’s configure SSH and create a new key pair using the following command:

ssh-keygen -t rsa -P ""

Just press Enter at each prompt.

6th Step:

Now we need to copy the public key to the authorized_keys file with the following command:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
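
If SSH later complains about permissions, you can also restrict access to this file (usually not necessary on a fresh Ubuntu install, but it doesn’t hurt):

chmod 0600 ~/.ssh/authorized_keys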

7th Step:

Now we can verify the SSH configuration by connecting to localhost:

ssh localhost

Just type “yes” and press Enter when needed.

8th Step:

This is the step where we install Java 8. We use this command:

sudo apt install openjdk-8-jdk

Just as previously, give confirmation when needed.

9th Step:

This step isn’t really a step; it’s just to check that Java is now correctly installed:

java -version

10th Step:

Download Hadoop using the following command:

sudo wget -P ~ https://mirrors.sonic.net/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz

11th Step:

We need to extract the hadoop-3.2.1.tar.gz file with the following command:

tar xzf hadoop-3.2.1.tar.gz

12th Step:

Change the hadoop-3.2.1 folder name to hadoop (this makes it easier to work with). Use this command:

mv hadoop-3.2.1 hadoop

13th Step:

Open the hadoop-env.sh file in the nano editor to edit JAVA_HOME:

nano ~/hadoop/etc/hadoop/hadoop-env.sh

Find the JAVA_HOME line and set it to the following:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

(I forgot to take a screenshot for this step, but the line is really easy to find. Once you find it, just remove the # comment marker and paste the line above.)

14th Step:

Move the hadoop folder to /usr/local/hadoop. This is the command:

sudo mv hadoop /usr/local/hadoop

Provide the password when needed.

15th Step:

Open the environment file on nano with this command:

sudo nano /etc/environment

Then, add the following configurations:

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"

16th Step:

Now we will add a user called hadoopuser, and we will set up its permissions:

sudo adduser hadoopuser

Provide a password for the new user; you can leave the rest blank, just press Enter.

Now type these commands:

sudo usermod -aG hadoopuser hadoopuser
sudo chown hadoopuser:root -R /usr/local/hadoop/
sudo chmod g+rwx -R /usr/local/hadoop/
sudo adduser hadoopuser sudo

17th Step:

Now we need to check the machine’s IP address:

ip addr
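
If you only care about the host-only adapter, you can ask for that interface directly. On my VirtualBox setup it was called enp0s8, but the name may be different on yours:

ip addr show enp0s8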

In my case, as you can see, the IP is 192.168.205.7. Just remember this will be different for you, so act accordingly wherever the IP addresses are used later.

My network will be as follows:

master: 192.168.205.7

slave1: 192.168.205.8

slave2: 192.168.205.9

In your case, just keep adding 1 to the last number of the IP you get on your machine, just as I did for mine.

18th Step:

Open the hosts file and insert your network configuration:

sudo nano /etc/hosts
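
Using the addresses above, the lines added to /etc/hosts look like this (replace the IPs with your own; the hostnames must match the ones we use in the following steps):

192.168.205.7 hadoop-master
192.168.205.8 hadoop-slave1
192.168.205.9 hadoop-slave2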

19th Step:

Now is the time to create the Slaves.

Shut down your Master virtual machine and clone it twice, naming one Slave1 and the other Slave2.

Make sure the “Generate new MAC addresses for all network adapters” option is chosen.

Also, make a Full Clone.

Clone it for Slave1, then do the same for Slave2.

20th Step:

On the master VM, open the hostname file on nano:

sudo nano /etc/hostname

Insert the name of your master virtual machine. (Note: it’s the same name you entered previously in the hosts file.)

Now do the same on the slaves:

Also, you should reboot all of them so this configuration takes effect:

sudo reboot

21st Step:

Configure SSH on hadoop-master as the hadoopuser. Switch to that user with this command:

su - hadoopuser

22nd Step

Create an SSH key:

ssh-keygen -t rsa

23rd Step:

Now we need to copy the SSH key to all the nodes. Use these commands:

ssh-copy-id hadoopuser@hadoop-master
ssh-copy-id hadoopuser@hadoop-slave1
ssh-copy-id hadoopuser@hadoop-slave2
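
To confirm that the passwordless login works, try connecting to one of the slaves; it should drop you into a shell without asking for a password:

ssh hadoop-slave1
exit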

24th Step:

On hadoop-master, open the core-site.xml file in nano:

sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

Then add the following configurations:

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000</value>
</property>
</configuration>

25th Step:

Still on hadoop-master, open the hdfs-site.xml file.

sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Add the following configurations:

<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>

26th Step:

We’re still on hadoop-master; let’s open the workers file:

sudo nano /usr/local/hadoop/etc/hadoop/workers

Add these two lines: (the slave names, remember the hosts file?)

hadoop-slave1
hadoop-slave2

27th Step:

We need to copy the Hadoop master configuration to the slaves. To do that, we use these commands:

scp /usr/local/hadoop/etc/hadoop/* hadoop-slave1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/* hadoop-slave2:/usr/local/hadoop/etc/hadoop/

28th Step:

Now we need to format the HDFS file system. Run these commands:

source /etc/environment
hdfs namenode -format

29th Step:

Start HDFS with this command:

start-dfs.sh

To check if this worked, run the following command. It will tell you which Hadoop daemons have been started:

jps

Now run the same command on the slaves.
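
On the master, jps should list a NameNode and a SecondaryNameNode (plus Jps itself), while each slave should show a DataNode. The exact process IDs will of course differ from machine to machine.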

30th Step:

Let’s see if this worked:

Open your browser and go to http://hadoop-master:9870.

Hopefully yours shows the same thing as mine: both DataNodes listed and operational!

31st Step:

Let’s configure YARN. On hadoop-master, just execute the following commands:

export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
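
These exports only last for the current shell session. If you want them to survive a reboot, you can append the same lines to hadoopuser’s .bashrc, for example:

cat >> ~/.bashrc << 'EOF'
export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
EOF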

32nd Step:

On both slaves, open yarn-site.xml in nano:

sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

You have to add the following property (inside the <configuration> tags) on both slaves:

<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>

33rd Step:

On the master, let’s start YARN. Use this command:

start-yarn.sh
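
If you run jps again now, the master should additionally show a ResourceManager process, and each slave a NodeManager.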

34th Step:

Open your browser and go to http://hadoop-master:8088/cluster

The cluster should show 2 active nodes!
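
If you want to go one step further and confirm that the cluster can actually run a job, Hadoop ships with a set of example MapReduce programs. As a quick smoke test you could run the bundled pi estimator as hadoopuser on the master (the jar path below assumes the default layout of the 3.2.1 tarball):

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 2 10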

35th Step:

Just kidding, there are no more steps. Hopefully you managed to do it all correctly, and if so, congratulations on building a Hadoop Multi-Node Cluster!

If not, try again and again until you get it right!

Cheers.
