How To Set Up a Hadoop 3.2.1 Multi-Node Cluster on Ubuntu 18.04 (2 Nodes)
To start: What is Hadoop?
Apache Hadoop is a collection of open-source software utilities that facilitates distributing large data sets across clusters of computers using simple programming models.
It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Although it was originally designed for computer clusters (sets of tightly connected computers that work together, so they can be viewed as a single system) built from commodity hardware, it has also found use on clusters of higher-end hardware.
The base Apache Hadoop framework is composed of the following modules:
- Hadoop Common — contains libraries and utilities needed by other Hadoop modules;
- Hadoop Distributed File System (HDFS) — a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
- Hadoop YARN — (introduced in 2012) a platform responsible for managing computing resources in clusters and using them for scheduling users’ applications;
- Hadoop MapReduce — an implementation of the MapReduce programming model for large-scale data processing.
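To give a flavor of the MapReduce model that the last module implements, here is a minimal word-count sketch in plain Python. It only illustrates the map and reduce phases running in one process; it is not Hadoop code, and the function names are my own:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct word
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data big clusters", "big data"]
print(reduce_phase(map_phase(lines)))  # {'big': 3, 'data': 2, 'clusters': 1}
```

In real Hadoop, the map tasks run in parallel across the cluster and the framework shuffles the pairs to the reducers, but the logic per phase is the same idea.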
As a prerequisite, you need Ubuntu 18.04 installed on a virtual machine.
What are we going to install in order to create the Hadoop Multi-Node Cluster?
- Java 8;
- SSH and PDSH;
- Hadoop 3.2.1.
1st Step: Configuring our Network
Go to the Network Settings of your Virtual Machine and enable Adapter 2. Then, instead of NAT, choose Host-only Adapter, and where it says “Promiscuous Mode” select the option “Allow All”.
Install SSH using the following command:
sudo apt install ssh
Enter your password when prompted, and confirm the installation when asked.
Install PDSH using the following command:
sudo apt install pdsh
Just as before, give confirmation when needed.
Open the .bashrc file with the following command:
nano .bashrc
At the end of the file, add the following line:
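The screenshot with the exact line did not survive; since pdsh defaults to rsh for remote commands, the usual line at this point (an assumption based on the standard Hadoop-with-pdsh setup) tells it to use ssh instead:

```shell
# Make pdsh run remote commands over ssh instead of its rsh default
export PDSH_RCMD_TYPE=ssh
```

Without this, the Hadoop start scripts that rely on pdsh will fail to reach the other nodes.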
Now let’s configure SSH. Let’s create a new key using the following command:
ssh-keygen -t rsa -P ""
Just press Enter whenever prompted.
Now we need to copy the public key to the authorized_keys file with the following command:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now we can verify the SSH configuration by connecting to localhost:
ssh localhost
Just type “yes” and press Enter when needed.
This is the step where we install Java 8. We use this command:
sudo apt install openjdk-8-jdk
Just as previously, give confirmation when needed.
This step isn’t really a step, it’s just to check that Java is now correctly installed:
java -version
Download Hadoop using the following command:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
We need to extract the hadoop-3.2.1.tar.gz archive with the following command:
tar xzf hadoop-3.2.1.tar.gz
Rename the hadoop-3.2.1 folder to hadoop (this makes it easier to use). Use this command:
mv hadoop-3.2.1 hadoop
Open the hadoop-env.sh file in the nano editor to edit JAVA_HOME:
nano hadoop/etc/hadoop/hadoop-env.sh
Find the commented-out JAVA_HOME line in the file, remove the # comment tag, and set it to your Java installation path.
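Assuming the default install location of the openjdk-8-jdk package on 64-bit Ubuntu 18.04, the line should read:

```shell
# Default OpenJDK 8 location on amd64 Ubuntu; adjust if yours differs
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```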
Change the hadoop folder directory to /usr/local/hadoop. This is the command:
sudo mv hadoop /usr/local/hadoop
Provide the password when needed.
Open the environment file in nano with this command:
sudo nano /etc/environment
Then, add the following configurations:
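The screenshot with the exact contents is gone; the usual additions here append Hadoop’s bin and sbin directories to PATH and define JAVA_HOME system-wide (the paths assume the locations used elsewhere in this guide):

```shell
# Append Hadoop's bin and sbin directories to the system-wide PATH
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
# System-wide Java home (default OpenJDK 8 location on amd64 Ubuntu)
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"
```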
Now we will add a user called hadoopuser, and we will set up its permissions:
sudo adduser hadoopuser
Provide the password and you can leave the rest blank, just press Enter.
Now type these commands:
sudo usermod -aG hadoopuser hadoopuser
sudo chown hadoopuser:root -R /usr/local/hadoop/
sudo chmod g+rwx -R /usr/local/hadoop/
sudo adduser hadoopuser sudo
Now we need to check the machine’s IP address:
ip addr
As you can see, my IP is 192.168.205.7. Remember that yours will be different, so adjust accordingly wherever IP addresses are used later.
My network gives each node the next consecutive address: in your case, just keep adding 1 to the last number of the IP you get on your machine, just as I did for mine.
Open the hosts file and insert your Network configurations:
sudo nano /etc/hosts
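The screenshot of the hosts entries is missing; following the IPs and hostnames used throughout this guide, the lines to add would be (substitute your own IPs):

```
192.168.205.7 hadoop-master
192.168.205.8 hadoop-slave1
192.168.205.9 hadoop-slave2
```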
Now is the time to create the Slaves.
Shut down your Master Virtual Machine and clone it twice, naming one Slave1 and the other Slave2.
Make sure the “Generate new MAC addresses for all network adapters” option is chosen.
Also, make a Full Clone.
On the master VM, open the hostname file on nano:
sudo nano /etc/hostname
Insert the name of your master virtual machine, hadoop-master. (Note: it’s the same name you entered in the hosts file earlier.)
Now do the same on the slaves, using hadoop-slave1 and hadoop-slave2 respectively:
Also, you should reboot all of them so this configuration takes effect:
sudo reboot
Configure SSH on hadoop-master as hadoopuser. Switch to that user with the following command:
su - hadoopuser
Create an SSH key:
ssh-keygen -t rsa
Now we need to copy the SSH key to all the nodes. Use these commands:
ssh-copy-id hadoopuser@hadoop-master
ssh-copy-id hadoopuser@hadoop-slave1
ssh-copy-id hadoopuser@hadoop-slave2
On hadoop-master, open core-site.xml file on nano:
sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
Then add the following configurations:
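The configuration screenshot is missing; the standard setting here points the default file system at the NameNode on the master. Using this guide’s hostname (port 9000 is the conventional choice), the property to place inside the `<configuration>` element would be:

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hadoop-master:9000</value>
</property>
```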
Still on hadoop-master, open the hdfs-site.xml file.
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Add the following configurations:
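The original configuration is missing; a typical setup for this guide defines where the NameNode and DataNodes store their data (the directory paths below are my assumption) and sets the replication factor to 2 to match the two slaves. These properties go inside the `<configuration>` element:

```xml
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```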
Still on hadoop-master, let’s open the workers file:
sudo nano /usr/local/hadoop/etc/hadoop/workers
Add these two lines (the slave names, remember the hosts file?):
hadoop-slave1
hadoop-slave2
We need to copy the Hadoop Master configurations to the slaves. To do that, we use these commands:
scp /usr/local/hadoop/etc/hadoop/* hadoop-slave1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/* hadoop-slave2:/usr/local/hadoop/etc/hadoop/
Now we need to format the HDFS file system. Run this command:
hdfs namenode -format
Start HDFS with this command:
start-dfs.sh
To check if this worked, run the following command. It will show which Hadoop daemons are running:
jps
Now we need to do the same on the slaves:
Let’s see if this worked:
Open your browser and type hadoop-master:9870.
This is what mine shows, hopefully yours is showing the same thing!
As you can see, both nodes are operational!
Let’s configure YARN. Just execute the following commands:
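The screenshot of the commands is missing; this step typically exports the Hadoop environment variables in the hadoopuser session so the YARN scripts can find everything (the paths assume the install location used in this guide):

```shell
# Point the Hadoop components at the install location used in this guide
export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
```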
In both slaves, open yarn-site.xml on nano:
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
You have to add the following configurations on both slaves:
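The exact properties are missing from the original; at minimum the slaves need to know where the ResourceManager runs, which with this guide’s hostnames would be the following property inside the `<configuration>` element:

```xml
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>hadoop-master</value>
</property>
```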
On the master, let’s start YARN. Use this command:
start-yarn.sh
Open your browser. Now you will type http://hadoop-master:8088/cluster
As you can see, the cluster shows 2 active nodes!
Just kidding, there are no more steps. Hopefully you managed to do it all correctly, and if so, congratulations on building a Hadoop Multi-Node Cluster!
If not, try again and again until you get it right!