Building a Distributed Hadoop Cluster with HBase on Amazon EC2 from Scratch

Sameer Kulkarni · Published in Analytics Vidhya · Apr 2, 2020

If you want to build a distributed Hadoop cluster on AWS EC2 with HBase, the easiest option is to use AWS EMR. But if you are like me and want to build your cluster from scratch, you are in the right place. There can be many reasons to build your own cluster; for me, it is to understand how the connections happen between the master and the slaves and to dig deeper into the system. You may also want to tweak some config or code that is not available in EMR and run a production load with it.

Hadoop Prerequisites

First things first, let us launch 4 EC2 instances (1 master and 3 slaves). For simplification, I am not using a bastion host, have enabled public IPs, and have restricted the security group to my IP only. As soon as the instances are ready, label them as master, slave1, slave2 and slave3. I have used the Ubuntu 18.04 AMI and the following is my ~/.ssh/config file:

#OnLocalHost
Host master
  User ubuntu
  IdentityFile ~/.aws/key.pem
  ProxyCommand ssh -q -W %h:%p ubuntu@ec2-x-x-x-x.us-west-2.compute.amazonaws.com -i ~/.aws/key.pem
  StrictHostKeyChecking no
  UserKnownHostsFile=/dev/null
  HostName 10.0.x.x
Host slave1
  User ubuntu
  IdentityFile ~/.aws/key.pem
  ProxyCommand ssh -q -W %h:%p ubuntu@x-x-x-x.us-west-2.compute.amazonaws.com -i ~/.aws/key.pem
  StrictHostKeyChecking no
  UserKnownHostsFile=/dev/null
  HostName 10.0.x.x
Host slave2
  User ubuntu
  IdentityFile ~/.aws/key.pem
  ProxyCommand ssh -q -W %h:%p ubuntu@x-x-x-x.us-west-2.compute.amazonaws.com -i ~/.aws/key.pem
  StrictHostKeyChecking no
  UserKnownHostsFile=/dev/null
  HostName 10.0.x.x
Host slave3
  User ubuntu
  IdentityFile ~/.aws/key.pem
  ProxyCommand ssh -q -W %h:%p ubuntu@x-x-x-x.us-west-2.compute.amazonaws.com -i ~/.aws/key.pem
  StrictHostKeyChecking no
  UserKnownHostsFile=/dev/null
  HostName 10.0.x.x

Remember to replace ubuntu@x-x-x-x.us-west-2.compute.amazonaws.com with the actual public DNS of your EC2 instances, and change each HostName to the corresponding private IPv4 address.

Now, with ~/.ssh/config configured, I can simply SSH into a box using ssh master, ssh slave1, and so on.
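As an optional sanity check, you can loop over the aliases from your local machine and confirm each host answers. This is a small sketch, assuming the config above is in place:

#OnLocalHost
# Each host should print its own hostname if SSH works end to end
for h in master slave1 slave2 slave3; do ssh "$h" hostname; done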

Let us update the packages before we proceed with the installation of Hadoop.

#OnAll
sudo apt update && sudo apt dist-upgrade -y

Next, set the static hostname for the master.

#OnMaster
sudo hostnamectl set-hostname --static master

Similarly, set the static hostnames for the slaves (run each command on its corresponding node).

#OnSlaves
sudo hostnamectl set-hostname --static slave1
sudo hostnamectl set-hostname --static slave2
sudo hostnamectl set-hostname --static slave3

Now open /etc/cloud/cloud.cfg (for example with sudo vim /etc/cloud/cloud.cfg) and set the below property.

#OnAll
preserve_hostname: true

Now update the /etc/hosts file with the private IPs of the master and slave instances. Here is my list of IPs. Remember to remove the first line 127.0.0.1 localhost from the /etc/hosts file.

#OnAll
sudo vim /etc/hosts

10.0.6.80     master
10.0.6.174    slave1
10.0.6.252    slave2
10.0.6.35     slave3
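A quick optional check that the names now resolve to the right private IPs:

#OnAll
# Should print one line per host with its private IP
getent hosts master slave1 slave2 slave3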

Next, let’s install OpenJDK 8 on the instances.

#OnAll
sudo apt install openjdk-8-jdk openjdk-8-jre -y

Now reboot the instances.

#OnAll
sudo reboot

Now let us enable password authentication and set a password on the instances. This is done to make it easier to transfer files between the master and slave nodes. We can always disable it later.

Now, make the below change on all the nodes.

#OnAll
sudo vim /etc/ssh/sshd_config

# Set the below value in the file
PasswordAuthentication yes

Then restart sshd and set a password for the ubuntu user.

#OnAll
sudo service ssh restart
sudo passwd ubuntu
# Enter the password and remember it for future use

Execute the next instructions only on the master node. We will generate an SSH key pair on the master and copy the public key to all the slaves.

#OnMaster
ssh-keygen -b 4096
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@master
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave2
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave3

A note: if the ssh-copy-id command above hangs, take a look at the security group to make sure that you can SSH from the master to the slaves. One way to do it is to attach a common security group (SG) to all the instances and add that SG to itself as an inbound rule, as shown below.
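For example, with the AWS CLI you could add an inbound SSH rule that references the security group itself. Here sg-xxxxxxxx is a placeholder for your SG ID; you can equally do this from the console:

#OnLocalHost
# Allow port 22 between all instances that share this security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxx \
  --protocol tcp --port 22 \
  --source-group sg-xxxxxxxx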

Let’s install and setup Hadoop!

On Master Node,

Run the below commands to download, untar and copy Hadoop to /usr/local/hadoop. Let us also give ownership of that directory to the ubuntu user.

#OnMaster
wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
sudo mkdir /usr/local/hadoop
tar -xzf hadoop-2.7.3.tar.gz
sudo mv hadoop-2.7.3/* /usr/local/hadoop/
ls -ltr /usr/local/hadoop/
sudo chown -R ubuntu:ubuntu /usr/local/hadoop

Now let us add some paths to the ~/.bashrc file to make navigation easier going forward. Append the below config to your ~/.bashrc file.

#OnMaster
# Java
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export JRE_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
# Hadoop
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"

Once saved, run the command source ~/.bashrc to update it in the current session.
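If the PATH is picked up correctly, the hadoop binary should now resolve; a quick optional check:

#OnMaster
# Prints the Hadoop version and the install paths in use
hadoop version
echo $HADOOP_INSTALL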

Now let us make the changes to the Hadoop configuration to enable fully distributed mode. There is a long list of changes coming ahead.

First, let’s cd into the /usr/local/hadoop/etc/hadoop directory.

1. Set the JAVA_HOME in hadoop-env.sh.

#OnMaster
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

2. Add the YARN config in yarn-site.xml.

#OnMaster
<property>
  <name>yarn.acl.enable</name>
  <value>0</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>master</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2826</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2726</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>128</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>2</value>
</property>

Make sure that the above properties are added between the <configuration> and </configuration> tags.
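For reference, a minimal sketch of how yarn-site.xml should be laid out (the same layout applies to the other *-site.xml files below):

#OnMaster
<?xml version="1.0"?>
<configuration>
  <!-- the <property> blocks from step 2 go here -->
</configuration>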

3. Add the below config to core-site.xml.

#OnMaster
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>

4. Add the HDFS config to hdfs-site.xml.

#OnMaster
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>
    Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
  </description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>

5. Now, let us add the MapReduce config to mapred-site.xml. But first, we need to copy the template. This can be done by:

#OnMaster
cp mapred-site.xml.template mapred-site.xml

Then open mapred-site.xml and add the below config.

#OnMaster
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1024</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>1024</value>
</property>

If you are interested in understanding what each property does, please refer to the official Hadoop documentation.

6. Let us create the directories for the HDFS store.

#OnMaster
sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
sudo chown -R ubuntu:ubuntu /usr/local/hadoop_store/

7. Now that we have made all the configuration changes we need on the master, let us make the changes on the slave nodes. But first, let us copy hadoop-2.7.3.tar.gz to the slave nodes.

#OnMaster
cd ~
scp hadoop-2.7.3.tar.gz ubuntu@slave1:/home/ubuntu/
scp hadoop-2.7.3.tar.gz ubuntu@slave2:/home/ubuntu/
scp hadoop-2.7.3.tar.gz ubuntu@slave3:/home/ubuntu/

Once we have copied the files to slave nodes, it is time to make similar changes there as well.

On Slaves,

#OnSlaves
sudo mkdir /usr/local/hadoop
tar -xzf hadoop-2.7.3.tar.gz
sudo mv hadoop-2.7.3/* /usr/local/hadoop/
ls -ltr /usr/local/hadoop/
sudo chown -R ubuntu:ubuntu /usr/local/hadoop

Now let us add the same paths to the ~/.bashrc file on the slaves as well.

#OnSlaves
# Java
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export JRE_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
# Hadoop
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"

Once saved, run the command source ~/.bashrc to update it in the current session.

Let us create the directories for the HDFS store on the slaves as well.

#OnSlaves
sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
sudo chown -R ubuntu:ubuntu /usr/local/hadoop_store/

Now that we have configured slaves, let us move back to the master node.

On Master,

Let us copy all the changes we made to the Hadoop configuration to the slaves as well.

#OnMaster
scp /usr/local/hadoop/etc/hadoop/* ubuntu@slave1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/* ubuntu@slave2:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/* ubuntu@slave3:/usr/local/hadoop/etc/hadoop/

Let us add the slave names to the /usr/local/hadoop/etc/hadoop/slaves file.

#OnMaster
slave1
slave2
slave3

Now we are done with the Hadoop setup. Let us start Hadoop and test it!

On Master again,

Let us format the namenode.

#OnMaster
hdfs namenode -format

Then we will start DFS.

#OnMaster
start-dfs.sh

To check if everything is configured properly, let us run jps.

#OnMaster
jps

On the Master node, the output should be

#OnMaster
4448 SecondaryNameNode
4572 Jps
4175 NameNode

And on the slave nodes,

#OnSlaves
2632 DataNode
2713 Jps

Of course, the PIDs will be different.
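You can also confirm from the master that all three DataNodes have registered with the NameNode; hdfs dfsadmin -report prints a capacity and node summary (the exact output format varies by version):

#OnMaster
# Should list 3 live datanodes once DFS is healthy
hdfs dfsadmin -report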

Now let us start YARN as well.

#OnMaster
start-yarn.sh

Now when we run jps, we should see ResourceManager added to the list on the master and NodeManager on the slaves.

On Master

#OnMaster
4448 SecondaryNameNode
4631 ResourceManager
4895 Jps
4175 NameNode

On Slaves

#OnSlaves
2832 NodeManager
2632 DataNode
2943 Jps

You can also check that everything is running fine in the web UIs. First, find the public IP of the master node. The DFS UI is on port 50070 and the YARN UI is on port 8088.

DFS: http://<master_ip>:50070

YARN: http://<master_ip>:8088/cluster

Now we can run a small MapReduce job that counts the words in a text. Execute the below commands on the master (of course) to run the job.

#OnMaster
cd ~
mkdir sample_data
cd sample_data/
wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
cd ~
hadoop fs -copyFromLocal sample_data/alice.txt hdfs://master:9000/
hdfs dfs -ls hdfs://master:9000/
yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount "hdfs://master:9000/alice.txt" hdfs://master:9000/output/

While the job is running, you can also visit the YARN UI to track the job.
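Once the job finishes, you can inspect the output directory. With the default single reducer the counts typically land in a file named part-r-00000, though the exact file name can differ:

#OnMaster
hdfs dfs -ls hdfs://master:9000/output/
# Peek at the first few word counts
hdfs dfs -cat hdfs://master:9000/output/part-r-00000 | head -n 20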

HBase

1. Let us download HBase, then untar it and copy the files to the /usr/local/hbase directory. The official download page for HBase can be found here. From there you have to pick a mirror and download it. Feel free to replace the link below based on your location. Also, I have tested with HBase version 1.4.13.
#OnMaster
wget https://downloads.apache.org/hbase/1.4.13/hbase-1.4.13-bin.tar.gz
tar -zxvf hbase-1.4.13-bin.tar.gz
sudo mv hbase-1.4.13 /usr/local/hbase
sudo chown -R ubuntu:ubuntu /usr/local/hbase

2. Add the below lines to the ~/.bashrc file.

#OnMaster
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin

Then, as usual, source the file: source ~/.bashrc.
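A quick way to confirm the binary is now on the PATH:

#OnMaster
# Prints the HBase version and build info
hbase version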

3. Update the HBase config files.

#OnMaster
cd /usr/local/hbase/conf/

Set the JAVA_HOME in the hbase-env.sh file.

#OnMaster
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Then, proceed to add the properties to the hbase-site.xml file.

#OnMaster
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://master:9000/hbase</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>hdfs://master:9000/zookeeper</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>slave1,slave2,slave3</value>
</property>

Then add the slave names to the regionservers file.

#OnMaster
slave1
slave2
slave3

Now, on the slave nodes, create the /usr/local/hbase directory and give permissions so that we can copy the files from the master to the slave nodes. Make the following changes on the slaves.

#OnSlaves
sudo mkdir -p /usr/local/hbase
sudo chown -R ubuntu:ubuntu /usr/local/hbase

Now, from the master again, we need to copy the HBase files to the slaves. Run the below commands to do that.

#OnMaster
scp -rp /usr/local/hbase/* ubuntu@slave1:/usr/local/hbase/
scp -rp /usr/local/hbase/* ubuntu@slave2:/usr/local/hbase/
scp -rp /usr/local/hbase/* ubuntu@slave3:/usr/local/hbase/

Now it is time to start HBase. Make sure that Hadoop is running before you start HBase.

#OnMaster
start-hbase.sh

When you run the jps command,

On Master you see

#OnMaster
7616 NameNode
7891 SecondaryNameNode
8851 Jps
8581 HMaster
8056 ResourceManager

And on the slaves

#OnSlaves
4741 DataNode
5270 HQuorumPeer
5614 Jps
5438 HRegionServer
4927 NodeManager

HBase also has a UI where you can see RegionServer information, among other things.

HBase UI: http://<master_ip>:16010/master-status
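To verify the cluster end to end, you can run a small smoke test from the HBase shell; the table name test and the column family cf below are arbitrary placeholders:

#OnMaster
hbase shell <<'EOF'
status
create 'test', 'cf'
put 'test', 'row1', 'cf:greeting', 'hello'
scan 'test'
EOF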

Hadoop and HBase are up and running in the distributed mode with a replication factor of 3 on the AWS EC2 instances!
