Make your PC into a Hadoop cluster (Windows Hyper-V)

Shehan Fernando
Jun 8, 2020 · 5 min read



Hortonworks Sandbox is awesome. It helps you learn a lot about Hadoop and related technologies. But what if you could create your own, without spending money on EMR or Databricks?

These are my goals, so let’s try to build one.

  • Build a Hadoop cluster with YARN (one Namenode & 3 Datanodes).
  • Build a standalone Spark cluster and run Spark on YARN.
  • Install Zeppelin and play with Spark.

VM Choice

I picked Hyper-V because I’m on Windows 10 with virtualization enabled. Being a type 1 hypervisor, Hyper-V gives some performance advantage over hosted hypervisors such as VirtualBox.

OS Choice

I would suggest CentOS 7 Minimal for this, because CentOS and RHEL are the distributions most commonly used in Hadoop clusters. CentOS 7 Minimal also lets each node run with minimal memory & CPU. So my choice is CentOS 7 (“CentOS-7-x86_64-Minimal-2003.iso”).

First, install CentOS in Hyper-V (google it 😇) and make sure to pick an external network switch for these VMs.

Install required software

sudo yum install -y nano
sudo yum install -y wget
sudo yum install -y hyperv-daemons

Additional configuration to make VM run smoothly

Disable SELinux.
sudo nano /etc/selinux/config
Change SELINUX=enforcing to SELINUX=disabled
Disable Firewalld (I know, but this is not a production cluster).
sudo systemctl stop firewalld
sudo systemctl disable firewalld
sudo systemctl mask --now firewalld
Set the disk I/O scheduler to noop (let Hyper-V handle I/O scheduling).
su root
echo noop > /sys/block/sda/queue/scheduler
Enable Hyper-V dynamic memory support.
sudo nano /etc/udev/rules.d/100-balloon.rules
Add this entry
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}="online"
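
After a reboot, you can quickly confirm these tweaks took effect; assuming your system disk is sda as above, getenforce should print Disabled, firewalld should show as masked, and noop should be the selected scheduler:

getenforce
systemctl is-enabled firewalld
cat /sys/block/sda/queue/scheduler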

Configure Static IP

It’s better to have a static IP for each node. (If you selected manual network configuration instead of DHCP during installation, you can skip this part.)

Run nmcli

I’m connected through the Ethernet device (eth0), and DHCP has assigned a dynamic IP to this VM.

So we’ll stop the DHCP client from assigning a dynamic IP and define a static IP instead.

sudo nano /etc/sysconfig/network-scripts/ifcfg-eth0
Then change
BOOTPROTO=static and add the desired IPADDR, NETMASK and GATEWAY entries.
Restart the network service.
sudo systemctl restart network
My Namenode settings
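
For reference, my Namenode’s ifcfg-eth0 ends up looking roughly like this (the gateway and DNS values are assumptions for a typical home network; use your own):

BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.1.200 << master node IP
NETMASK=255.255.255.0
GATEWAY=192.168.1.1 << assumption; use your own gateway
DNS1=192.168.1.1 << assumption; use your own DNS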

Update “hosts” file

Even though I haven’t created the Datanodes yet, these are the IPs & hostnames I’d like to have on my cluster.

sudo nano /etc/hosts
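
These are the entries I added (the same IPs and hostnames that go into my Windows hosts file later):

192.168.1.200 master.hadoop.smf
192.168.1.210 node01.hadoop.smf
192.168.1.220 node02.hadoop.smf
192.168.1.230 node03.hadoop.smf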

All the commands below are run from my home directory. I’m showing the configuration with my own values as an example, so please use your own values for hostnames, paths, etc.

Setup passwordless SSH

Generate keys
ssh-keygen -b 4096 (do not enter a passphrase)
Set the permissions on the .ssh folder and the private key.
chmod 700 .ssh
chmod 600 .ssh/id_rsa

Install software

Java

sudo yum -y update
sudo yum install java-1.8.0-openjdk
sudo update-alternatives --config java (then enter the selection)
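
A quick sanity check that the right Java is active (it should report a 1.8.0 build):

java -version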

Hadoop

wget <URL>
Ex: wget https://downloads.apache.org/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz
Unzip
tar -xzf hadoop-3.1.3.tar.gz
Rename
mv hadoop-3.1.3 hadoop

Set environment variables

nano .bash_profile
Then add these entries
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/ << My JRE Location
export HADOOP_HOME=/home/shehan/hadoop << My Hadoop root folder
Update PATH
PATH=$PATH:/home/shehan/hadoop/bin:/home/shehan/hadoop/sbin
Then
source .bash_profile
My .bash_profile
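
After sourcing, a quick check that the variables took effect (echo should print /home/shehan/hadoop, and hadoop version should report 3.1.3):

echo $HADOOP_HOME
hadoop version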

Set Hadoop configuration

Open hadoop-env.sh
nano ~/hadoop/etc/hadoop/hadoop-env.sh
Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/
Open core-site.xml & add these properties inside the <configuration> element
nano ~/hadoop/etc/hadoop/core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://master.hadoop.smf:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/shehan/data/tmp</value>
</property>

Open hdfs-site.xml & add these properties inside the <configuration> element
nano ~/hadoop/etc/hadoop/hdfs-site.xml
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/shehan/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/shehan/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value> << number of copies kept of each block; pick what you want (default is 3)
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>

Make sure all the folder paths above exist and that you have read/write permission on them.
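
For example, with my paths from core-site.xml and hdfs-site.xml:

mkdir -p ~/data/tmp ~/data/nameNode ~/data/dataNode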

Open ~/hadoop/etc/hadoop/workers
Remove all the hostnames (you will probably only see localhost for now; remove it).

Now we have the master node configured.

First, format the HDFS file system.
hdfs namenode -format
Start the Hadoop DFS daemons.
start-dfs.sh

That’s it. I also put all the IP addresses in my Windows hosts file (C:\Windows\System32\drivers\etc\hosts) so I can access all the web UIs from my browser.

192.168.1.200 master.hadoop.smf
192.168.1.210 node01.hadoop.smf
192.168.1.220 node02.hadoop.smf
192.168.1.230 node03.hadoop.smf
http://master.hadoop.smf:9870

Configure Datanodes.

The easiest way to do this is to clone the VM by exporting and importing it (again, google it 😇).
Import it 3 times (or however many Datanodes you want).

Since these are cloned VMs, the hostname, IP address, etc. are exactly the same, so we need to change these values in order to run the cluster. (To do this, start the VMs one by one and change the configurations below.)

Change Static IP

As mentioned above, edit /etc/sysconfig/network-scripts/ifcfg-eth0 and change IPADDR.

Change Hostname

hostnamectl set-hostname node01.hadoop.smf
Then reboot the node.

Delete Folders

.ssh folder
rm -r .ssh
And remove the data folders mentioned in hdfs-site.xml.
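
With my paths from earlier, that is (hadoop.tmp.dir from core-site.xml is worth clearing as well):

rm -rf ~/data/nameNode ~/data/dataNode ~/data/tmp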

Do the above steps for each Datanode.

Copy public key to every node.

From master node
cat .ssh/id_rsa.pub >> .ssh/authorized_keys
chmod 600 .ssh/authorized_keys
ssh-copy-id shehan@192.168.1.210 << do this for every datanode

Try SSHing to the data nodes from the master node; it should succeed without a password prompt. Ex: ssh shehan@192.168.1.210
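
One more thing on the master: start-dfs.sh starts the DataNode daemons over SSH using the workers file we emptied earlier, so list the datanode hostnames in ~/hadoop/etc/hadoop/workers now (assuming you want start-dfs.sh to launch them for you):

node01.hadoop.smf
node02.hadoop.smf
node03.hadoop.smf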

These are my running master node and slave nodes.

Now everything is set. Start the Hadoop daemons and check the relevant Java processes.

Start the Hadoop DFS daemons.
start-dfs.sh
On the master node
[shehan@master ~]$ jps
1876 NameNode
2068 SecondaryNameNode
2246 Jps
On all Datanodes
[shehan@node1 ~]$ jps
1323 Jps
1213 DataNode
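
You can also confirm from the master that HDFS sees every Datanode; the report should list the live nodes and their capacity:

hdfs dfsadmin -report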

So this is it. You have your own basic Hadoop cluster.
