Turn your PC into a Hadoop cluster (Windows Hyper-V)
Hortonworks Sandbox is awesome. It helps you learn a lot about Hadoop and related technologies. But what if you could build your own, without spending money on EMR or Databricks either?
These are my goals, so let's try to build one:
- Build a Hadoop cluster with YARN (one namenode & 3 datanodes).
- Build a standalone Spark cluster and run Spark on YARN.
- Install Zeppelin and play with Spark.
VM Choice
I picked Hyper-V because I'm on Windows 10 with virtualization enabled. Being a type 1 hypervisor, Hyper-V gives some performance advantage over type 2 hypervisors such as VirtualBox.
OS Choice
I suggest CentOS 7 Minimal for this, because CentOS and RHEL are the distributions most commonly used in Hadoop clusters. The Minimal install lets each node run with minimal memory and CPU. So my choice is CentOS 7 ("CentOS-7-x86_64-Minimal-2003.iso").
First, install CentOS in Hyper-V (google it 😇) and make sure to pick an external network switch for these VMs.
Install required software
sudo yum install -y nano
sudo yum install -y wget
sudo yum install -y hyperv-daemons
Additional configuration to make the VMs run smoothly
Disable SELinux.
sudo nano /etc/selinux/config
Change SELINUX=enforcing to SELINUX=disabled.
Disable firewalld (I know, but this is not a production cluster).
systemctl stop firewalld
systemctl disable firewalld
systemctl mask --now firewalld
Set the disk I/O scheduler to noop (the hypervisor handles disk scheduling anyway).
su root
echo noop > /sys/block/sda/queue/scheduler
Enable Hyper-V dynamic memory support.
sudo nano /etc/udev/rules.d/100-balloon.rules
Add this entry:
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}="online"
Configure Static IP
It's better to have a static IP for each node. (If you chose manual network configuration instead of DHCP during installation, skip this part.)
Run nmcli to see the current network configuration.
We'll stop getting a dynamic IP from the DHCP client and define a static IP instead.
sudo nano /etc/sysconfig/network-scripts/ifcfg-eth0
Then change BOOTPROTO=dhcp to BOOTPROTO=static and add the desired IP address, netmask, and gateway.
Restart the network service.
sudo systemctl restart network
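As an example, the edited file for the master node might look like this (a sketch: the IP is the one used in this article, but the GATEWAY/DNS1 values are assumptions for a typical home network, so use your own):

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0 (sketch)
TYPE=Ethernet
NAME=eth0
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.1.200
NETMASK=255.255.255.0
GATEWAY=192.168.1.1    # assumption: your router's address
DNS1=192.168.1.1       # assumption: use your own DNS server
```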
Update “hosts” file
Even though I haven't created the datanodes yet, these are the IPs and hostnames I'd like to have on my cluster.
sudo nano /etc/hosts
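With the cluster layout used throughout this article, the entries to add look like this:

```shell
# /etc/hosts entries for the cluster (same IPs and hostnames as in this article)
192.168.1.200 master.hadoop.smf
192.168.1.210 node01.hadoop.smf
192.168.1.220 node02.hadoop.smf
192.168.1.230 node03.hadoop.smf
```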
All the commands below are run from my home directory. The configuration shown uses my own values as an example, so please substitute your own hostnames, paths, etc.
Set up passwordless SSH
Generate keys
ssh-keygen -b 4096 (do not enter a passphrase)
Set the permissions on the .ssh directory and the private key.
chmod 700 .ssh
chmod 600 .ssh/id_rsa
Install software
Java
sudo yum -y update
sudo yum install -y java-1.8.0-openjdk
update-alternatives --config java (then enter the selection)
Hadoop
wget <URL>
Ex: wget https://downloads.apache.org/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz
Extract:
tar -xzf hadoop-3.1.3.tar.gz
Rename:
mv hadoop-3.1.3 hadoop
Set environment variables
nano .bash_profile
Then add these entries:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/ << my JRE location
export HADOOP_HOME=/home/shehan/hadoop << my Hadoop root folder
Update PATH:
PATH=$PATH:/home/shehan/hadoop/bin:/home/shehan/hadoop/sbin
Then reload the profile:
source .bash_profile
Set Hadoop configuration
Open hadoop-env.sh
nano ~/hadoop/etc/hadoop/hadoop-env.sh
Set JAVA_HOME:
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/
Open core-site.xml & add properties
nano ~/hadoop/etc/hadoop/core-site.xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master.hadoop.smf:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/shehan/data/tmp</value>
</property>
Open hdfs-site.xml & add properties
nano ~/hadoop/etc/hadoop/hdfs-site.xml
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/shehan/data/nameNode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/shehan/data/dataNode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value> << how many replicas of each block to keep; pick as you want (default is 3)
</property>
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
Make sure all the folder paths above have read/write permission for the Hadoop user.
Open ~/hadoop/etc/hadoop/workers
Remove all the hostnames (you'll probably only see localhost for now; remove it).
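Once the datanodes are created (next section), this file should list their hostnames, one per line, so start-dfs.sh knows where to launch the DataNode processes. With the hostnames used in this article:

```shell
# ~/hadoop/etc/hadoop/workers
node01.hadoop.smf
node02.hadoop.smf
node03.hadoop.smf
```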
Now we have the master node configured.
First format HDFS file system
hdfs namenode -format
Start the Hadoop DFS daemon.
start-dfs.sh
That's it. I also put all the IP addresses in my Windows "hosts" file, so I can access all the web UIs from my browser.
192.168.1.200 master.hadoop.smf
192.168.1.210 node01.hadoop.smf
192.168.1.220 node02.hadoop.smf
192.168.1.230 node03.hadoop.smf
Configure Datanodes.
The easiest way is to clone this VM by exporting and importing it (again, google it 😇).
Import it 3 times (or however many datanodes you want).
Since these are cloned VMs, the hostname, IP address, etc. are exactly the same, so we need to change those values for the cluster to run. (To do this, start the VMs one by one and change the configuration below.)
Change Static IP
As mentioned above, go to /etc/sysconfig/network-scripts/ifcfg-eth0 and change IPADDR.
Change Hostname
hostnamectl set-hostname node01.hadoop.smf
then reboot the node.
Delete Folders
.ssh folder
rm -r .ssh
Also remove the data folders mentioned in hdfs-site.xml.
Do the above steps on each datanode.
Copy public key to every node.
From master node
cat .ssh/id_rsa.pub >> .ssh/authorized_keys
chmod 600 .ssh/authorized_keys
ssh-copy-id shehan@192.168.1.210 << do this for every datanode
Try SSH to the datanodes from the master node; it should succeed without a password prompt. Ex: ssh shehan@192.168.1.210
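To check all the datanodes in one go, a small helper like this works (a sketch; the usernames and hostnames in the usage line are the ones from this article):

```shell
# Sketch: try passwordless SSH to each node and report the result.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_ssh() {
  for host in "$@"; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
      echo "$host ok"
    else
      echo "$host failed"
    fi
  done
}
# Usage: check_ssh shehan@node01.hadoop.smf shehan@node02.hadoop.smf shehan@node03.hadoop.smf
```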
Now everything is set. Start the Hadoop daemons and check the relevant Java processes.
Start the Hadoop DFS daemon.
start-dfs.sh
On the master node:
[shehan@master ~]$ jps
1876 NameNode
2068 SecondaryNameNode
2246 Jps
In all datanodes:
[shehan@node1 ~]$ jps
1323 Jps
1213 DataNode
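A quick way to script this check is to grep the jps output for the daemons you expect on each node (a sketch; daemon names are the ones shown in the outputs above):

```shell
# Sketch: verify that the expected Hadoop daemons appear in `jps` output.
check_daemons() {
  jps_output="$1"; shift
  for daemon in "$@"; do
    # -w matches whole words, so "NameNode" won't match "SecondaryNameNode"
    if ! echo "$jps_output" | grep -qw "$daemon"; then
      echo "missing: $daemon"
      return 1
    fi
  done
  echo "all daemons running"
}
# On the master:   check_daemons "$(jps)" NameNode SecondaryNameNode
# On a datanode:   check_daemons "$(jps)" DataNode
```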
So this is it. You have your own basic Hadoop cluster.
Next: setting up the Spark cluster and Zeppelin.