Turn your PC into a Hadoop cluster (Windows Hyper-V)
Hortonworks Sandbox is awesome. It helps you learn a lot about Hadoop and related technologies. But what if you could build your own, without spending money on EMR or Databricks either?
These are my goals, so let's try to build one:
- Build a Hadoop cluster with YARN (one namenode & 3 datanodes).
- Build a standalone Spark cluster and run Spark on YARN.
- Install Zeppelin and play with Spark.
VM Choice
I picked Hyper-V because I'm on Windows 10 with virtualization enabled. Being a type 1 hypervisor, Hyper-V gives some performance advantage over type 2 hypervisors such as VirtualBox.
OS Choice
I suggest CentOS 7 Minimal for this, because CentOS and RHEL are the distributions most commonly used in Hadoop clusters. The Minimal install lets each node run with minimal memory and CPU. So my choice is CentOS 7 ("CentOS-7-x86_64-Minimal-2003.iso").
First, install CentOS in Hyper-V (google it 😇) and make sure to pick an external network switch for these VMs.
Install required software
sudo yum install -y nano
sudo yum install -y wget
sudo yum install -y hyperv-daemons
Additional configuration to make the VMs run smoothly
Disable SELinux.
sudo nano /etc/selinux/config
Change SELINUX=enforcing to SELINUX=disabled.
Disable firewalld (I know, but this is not a production cluster).
systemctl stop firewalld
systemctl disable firewalld
systemctl mask --now firewalld
Set the disk I/O scheduler to noop (the hypervisor handles disk scheduling anyway).
su root
echo noop > /sys/block/sda/queue/scheduler
Enable Hyper-V dynamic memory support.
sudo nano /etc/udev/rules.d/100-balloon.rules
Add this entry:
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}="online"
Configure Static IP
It's better to have a static IP for each node. (If you chose manual network configuration instead of DHCP during installation, skip this part.)
Run nmcli to see the current network configuration.
We'll stop getting a dynamic IP from the DHCP client and define a static IP instead.
sudo nano /etc/sysconfig/network-scripts/ifcfg-eth0
Then change BOOTPROTO=dhcp to BOOTPROTO=static and add the desired IP address, netmask, and gateway.
Restart the network service.
sudo systemctl restart network
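As an example, the edited file for the master node might look like this (a sketch: the IP is the one used in this article, but the GATEWAY/DNS1 values are assumptions for a typical home network, so use your own):

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0 (sketch)
TYPE=Ethernet
NAME=eth0
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.1.200
NETMASK=255.255.255.0
GATEWAY=192.168.1.1    # assumption: your router's address
DNS1=192.168.1.1       # assumption: use your own DNS server
```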
Update “hosts” file
Even though I haven't created the datanodes yet, these are the IPs and hostnames I'd like to have on my cluster.
sudo nano /etc/hosts
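With the cluster layout used throughout this article, the entries to add look like this:

```shell
# /etc/hosts entries for the cluster (same IPs and hostnames as in this article)
192.168.1.200 master.hadoop.smf
192.168.1.210 node01.hadoop.smf
192.168.1.220 node02.hadoop.smf
192.168.1.230 node03.hadoop.smf
```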
All the commands below are run from my home directory. The configuration shown uses my own values as an example, so please substitute your own hostnames, paths, etc.
Set up passwordless SSH
Generate keys
ssh-keygen -b 4096 (do not enter a passphrase)
Set the permissions on the .ssh directory and the private key.
chmod 700 .ssh
chmod 600 .ssh/id_rsa
Install software
Java
sudo yum -y update
sudo yum install -y java-1.8.0-openjdk
update-alternatives --config java (then enter the selection)
Hadoop
wget <URL>
Ex: wget https://downloads.apache.org/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz
Extract:
tar -xzf hadoop-3.1.3.tar.gz
Rename:
mv hadoop-3.1.3 hadoop
Set environment variables
nano .bash_profile
Then add these entries:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/ << my JRE location
export HADOOP_HOME=/home/shehan/hadoop << my Hadoop root folder
Update PATH:
PATH=$PATH:/home/shehan/hadoop/bin:/home/shehan/hadoop/sbin
Then reload the profile:
source .bash_profile
Set Hadoop configuration
Open hadoop-env.sh
nano ~/hadoop/etc/hadoop/hadoop-env.sh
Set JAVA_HOME:
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/
Open core-site.xml & add properties
nano ~/hadoop/etc/hadoop/core-site.xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master.hadoop.smf:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/shehan/data/tmp</value>
</property>
Open hdfs-site.xml & add properties
nano ~/hadoop/etc/hadoop/hdfs-site.xml
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/shehan/data/nameNode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/shehan/data/dataNode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value> << how many replicas of each block to keep; pick as you want (default is 3)
</property>
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
Make sure all the folder paths above have read/write permission for the Hadoop user.
Open ~/hadoop/etc/hadoop/workers
Remove all the hostnames (you'll probably only see localhost for now; remove it).
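Once the datanodes are created (next section), this file should list their hostnames, one per line, so start-dfs.sh knows where to launch the DataNode processes. With the hostnames used in this article:

```shell
# ~/hadoop/etc/hadoop/workers
node01.hadoop.smf
node02.hadoop.smf
node03.hadoop.smf
```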
Now we have the master node configured.
First format HDFS file system
hdfs namenode -format
Start the Hadoop DFS daemon.
start-dfs.sh
That's it. I also put all the IP addresses in my Windows "hosts" file, so I can access all the web UIs from my browser.
192.168.1.200 master.hadoop.smf
192.168.1.210 node01.hadoop.smf
192.168.1.220 node02.hadoop.smf
192.168.1.230 node03.hadoop.smf
Configure Datanodes.
The easiest way is to clone this VM by exporting and importing it (again, google it 😇).
Import it 3 times (or however many datanodes you want).
Since these are cloned VMs, the hostname, IP address, etc. are exactly the same, so we need to change those values for the cluster to run. (To do this, start the VMs one by one and change the configuration below.)
Change Static IP
As mentioned above, go to /etc/sysconfig/network-scripts/ifcfg-eth0 and change IPADDR.
Change Hostname
hostnamectl set-hostname node01.hadoop.smf
then reboot the node.
Delete Folders
.ssh folder
rm -r .ssh
Also remove the data folders mentioned in hdfs-site.xml.
Do the above steps on each datanode.
Copy public key to every node.
From master node
cat .ssh/id_rsa.pub >> .ssh/authorized_keys
chmod 600 .ssh/authorized_keys
ssh-copy-id shehan@192.168.1.210 << do this for every datanode
Try SSH to the datanodes from the master node; it should succeed without a password prompt. Ex: ssh shehan@192.168.1.210
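To check all the datanodes in one go, a small helper like this works (a sketch; the usernames and hostnames in the usage line are the ones from this article):

```shell
# Sketch: try passwordless SSH to each node and report the result.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_ssh() {
  for host in "$@"; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
      echo "$host ok"
    else
      echo "$host failed"
    fi
  done
}
# Usage: check_ssh shehan@node01.hadoop.smf shehan@node02.hadoop.smf shehan@node03.hadoop.smf
```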
Now everything is set. Start the Hadoop daemons and check the relevant Java processes.
Start the Hadoop DFS daemon.
start-dfs.sh
On the master node:
[shehan@master ~]$ jps
1876 NameNode
2068 SecondaryNameNode
2246 Jps
In all datanodes:
[shehan@node1 ~]$ jps
1323 Jps
1213 DataNode
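A quick way to script this check is to grep the jps output for the daemons you expect on each node (a sketch; daemon names are the ones shown in the outputs above):

```shell
# Sketch: verify that the expected Hadoop daemons appear in `jps` output.
check_daemons() {
  jps_output="$1"; shift
  for daemon in "$@"; do
    # -w matches whole words, so "NameNode" won't match "SecondaryNameNode"
    if ! echo "$jps_output" | grep -qw "$daemon"; then
      echo "missing: $daemon"
      return 1
    fi
  done
  echo "all daemons running"
}
# On the master:   check_daemons "$(jps)" NameNode SecondaryNameNode
# On a datanode:   check_daemons "$(jps)" DataNode
```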
So this is it. You have your own basic Hadoop cluster.
Next: setting up the Spark cluster and Zeppelin.