Raspberry Pi Hadoop Cluster Guide
This guide explores how you can set up a Hadoop cluster running on three Raspberry Pis.
Apache Hadoop is open-source software that coordinates storage and processing across multiple computers, which allows you to store and work with large amounts of data. Hadoop is fault tolerant and runs on commodity hardware, so clusters have traditionally been relatively cheap and easy to expand for more storage and processing power. These features led to significant adoption, and the framework was one of the catalysts for the emergence of Big Data in the late ’00s.
Before we begin, the natural question is: why would someone build a Hadoop cluster in 2020? A couple of reasons:
- While the overall influence of Hadoop is decreasing, understanding how it works can serve as a basis and a comparison point for more relevant Big Data components like Spark, Kafka, Kubernetes, etc.
- Building a Hadoop cluster can teach useful general computing concepts including using the Terminal, Secure Shell, networking, etc.
- The cluster’s hardware can be repurposed for continuous learning and development of other projects.
- It is really fun!
Hardware parts:
- 3x Raspberry Pi B+ £99.78
- 3x Micro SD card 32GB £17.07
- 3x USB C to USB 2.0 cables (pack of 5) £7.99
- 6-Port USB charger £17.49
- 5-Port Ethernet Switch £6.78
- USB 2.0 to Ethernet cable adapter £10.99
- 4x Ethernet cables £13.96
- USB 2.0 to 1.35mm DC 5 volt adapter £5.99
- Acrylic case £16.59
- Double sided sticky tape £5.99
TOTAL: £202.63
Step-by-step guide
This specific guide is for a “headless” Raspberry Pi setup driven from a MacBook. This means that the Pis do not have their own dedicated screens and keyboards; instead, they will be controlled from the MacBook (referred to simply as the machine in this guide).
1. Flash the Raspbian OS image onto the SD cards.
- First, download an Operating System (OS) for your Pis. There is a variety to choose from, but I am using the official Raspberry Pi OS called Raspbian that comes with a desktop and recommended software (download here). It is light, easy to use and recommended by the Raspberry Pi Foundation as the preferred OS because it is optimized for the Pi hardware. Once the download is complete and unzipped, you should have a single file with a `.img` extension on your machine.
- Next, you need to write the OS image to your SD card. For this purpose I downloaded and used the balenaEtcher flashing software (download here). It is very intuitive to use: simply insert the SD card into your machine, open balenaEtcher and select the `.img` file you want to flash.
- Repeat the flashing for the remaining two SD cards using the same OS image.
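balenaEtcher is the easiest route, but the same flash can also be done from the Terminal with `dd`. The sketch below copies between plain files so it is safe to run anywhere and only illustrates the mechanics; the device identifier `/dev/rdisk2` in the comments is an assumption you must verify with `diskutil list` before attempting a real flash:

```shell
#!/bin/sh
# Illustration of what dd does, using plain files so it is safe to run
# anywhere. On macOS the real flash would look like:
#   diskutil list                      # find the SD card identifier
#   diskutil unmountDisk /dev/disk2    # the identifier is an assumption!
#   sudo dd if=raspbian.img of=/dev/rdisk2 bs=1m
# dd overwrites its target, so triple-check the identifier first.
printf 'fake-os-image' > raspbian.img
dd if=raspbian.img of=sdcard.img 2>/dev/null
cmp -s raspbian.img sdcard.img && echo "images identical"
```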
2. Set up the hardware
This one is straightforward. Mount the Pis in the acrylic case, tape the USB charger and the Ethernet switch to the case, and connect the cables.
3. SSH into the individual Pis and expand the file system
- After the flashing is complete, the boot SD card will be automatically ejected. Reinsert the card into your machine and run `touch /Volumes/boot/ssh` in your Terminal. This creates an empty file named `ssh` on the boot partition, which tells the Pi to enable SSH on first boot.
- Eject the SD card from your machine, insert it into your Pi, and connect the Pi to your machine and a power source.
- You now need to find out the IP address of the Pi. You can do this by running `ping raspberrypi.local` in your Terminal (press `ctrl + c` to stop the pinging).
- Use the information gained from the ping to identify the IP address and run `ssh pi@169.254.63.20` (your IP address will be different). Type `yes` when asked whether you want to continue connecting. You will then be asked to type in a password; the default password for a Pi is `raspberry`.
- Once you have successfully SSHed into the Pi, run `sudo raspi-config`. This should take you to the Configuration Tool. Navigate to `7 Advanced Options`, select `A1 Expand Filesystem`, then navigate to `Finish` and reboot when prompted.
- Repeat these steps for the other two Pis.
- NOTE: you might also want to consider setting up a Graphical User Interface (GUI) for your Pis for ease of control. You can follow this video guide to do so.
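If you want to grab the Pi's address programmatically rather than reading it off the screen, the IP can be extracted from the ping output with `grep`. A sketch — the sample reply line below is illustrative stand-in output, and your address will differ:

```shell
#!/bin/sh
# On the Mac you would feed this from: ping -c 1 raspberrypi.local
# Here a sample reply line stands in for real ping output.
sample='64 bytes from 169.254.63.20: icmp_seq=0 ttl=64 time=0.42 ms'
ip=$(printf '%s\n' "$sample" | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | head -n 1)
echo "$ip"
```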
4. Configure the network, users and SSH keys
- Once you have expanded the file system of the last Pi, do not SSH back into it. Instead, run `sudo nano /etc/hosts` in your machine’s Terminal. Navigate to the bottom of the file, add the lines below, and save the file by pressing `ctrl + x` and confirming (the IP addresses will be different for you):

```
169.254.24.25 master
169.254.25.188 worker1
169.254.63.20 worker2
```
- You can now SSH into the master Pi using `ssh pi@master` instead of using the IP address (replace `master` with `worker1` or `worker2` for the other two Pis). Try this out, and once connected use the same `sudo nano /etc/hosts` command as before to append the Pi’s `hosts` file with the lines below (it is recommended that you replace `raspberrypi` on the `127.0.1.1` line with the Pi’s new name, as shown, to avoid name-resolution errors):

```
127.0.1.1 master # worker1, worker2 for the other two Pis
169.254.24.25 master
169.254.25.188 worker1
169.254.63.20 worker2
```
- The next step is to rename each Pi’s `hostname`. To do this, run `sudo nano /etc/hostname` while connected and replace `raspberrypi` with `master`, `worker1` and `worker2` respectively for each Pi.
- For each of the Pis, create a dedicated Hadoop group and user by running the commands below (you will need to set a password, but can keep the rest of the fields empty):

```
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo
```
- The last step is to generate SSH keys. This will ensure that the Pis can seamlessly communicate with each other when the cluster is formed. Repeat the steps below on each of the Pis as the new `hduser`, copying the key to all three hostnames:

```
ssh-keygen -t ed25519 # You are free to use another algorithm
ssh-copy-id hduser@master
ssh-copy-id hduser@worker1
ssh-copy-id hduser@worker2
```
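If you prefer to script the `hosts` edits from step 4 instead of opening nano on every Pi, a small helper can append the entries idempotently. This is a sketch: `HOSTS_FILE` defaults to a local file so it can be dry-run safely, and on the Pis you would point it at `/etc/hosts` and run it with `sudo`:

```shell
#!/bin/sh
# Idempotently append cluster entries to a hosts file. HOSTS_FILE
# defaults to a local copy for a safe dry-run; on the Pis set
# HOSTS_FILE=/etc/hosts and run with sudo.
HOSTS_FILE="${HOSTS_FILE:-hosts.local}"
touch "$HOSTS_FILE"

add_host() {
  # $1 = IP address, $2 = hostname; skip if the name is already present
  grep -qw "$2" "$HOSTS_FILE" || printf '%s %s\n' "$1" "$2" >> "$HOSTS_FILE"
}

add_host 169.254.24.25  master
add_host 169.254.25.188 worker1
add_host 169.254.63.20  worker2
```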
5. Create the Hadoop folder structure and environment variables
- Create the required HDFS (Hadoop Distributed File System) folder structure using the commands below:

```
sudo mkdir /opt/hadoop_tmp/
sudo mkdir /opt/hadoop_tmp/hdfs
sudo mkdir /opt/hadoop_tmp/hdfs/namenode # Master only
sudo mkdir /opt/hadoop_tmp/hdfs/datanode # Workers only
```
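The directory creation above can also be done in one shot with `mkdir -p`, which creates any missing parent directories. A sketch with a configurable root so it can be dry-run without `sudo` (on the Pis you would set `ROOT=/opt` and run with `sudo`):

```shell
#!/bin/sh
# HDFS directory layout. ROOT defaults to a scratch directory for a
# safe dry-run; on the Pis use ROOT=/opt with sudo.
ROOT="${ROOT:-hdfs_dryrun}"
mkdir -p "$ROOT/hadoop_tmp/hdfs/namenode"   # master only
mkdir -p "$ROOT/hadoop_tmp/hdfs/datanode"   # workers only
```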
- Before you continue, make sure to modify the user permissions so that Hadoop can write to the necessary folders (I had the most trouble with this when building the cluster and had to review the Hadoop logs to spot the errors). Note that `/opt/hadoop` is only created in the next step, so rerun the first command after unpacking Hadoop if it errors here:

```
sudo chown hduser:hadoop -R /opt/hadoop
sudo chown hduser:hadoop -R /opt/hadoop_tmp
sudo chown hduser:hadoop -R /opt/hadoop_tmp/hdfs/datanode
```
- Lastly, you will need to add Java and Hadoop environment variables to your Bash profile. Run `nano ~/.bashrc` and add the lines below to the bottom of the file (after you make the changes and save the profile, run `source ~/.bashrc` to refresh it):

```
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_INSTALL/bin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
```
6. Download, install and configure Hadoop
- You need to choose a Hadoop version to run (download here). For this project I am running Hadoop 3.2.1. Once the download is complete, use the Terminal to navigate to the folder containing the downloaded `.tar.gz` file and run `scp hadoop-3.2.1.tar.gz hduser@master:/opt/`. This will copy the Hadoop tarball onto the master Pi. Repeat this for the two workers as well.
- SSH into each of the Pis and run the commands below. They unpack the Hadoop tarball, rename the folder, and finally delete the `.tar.gz` file itself, as it will no longer be needed:

```
sudo tar -xvzf /opt/hadoop-3.2.1.tar.gz -C /opt/
sudo mv /opt/hadoop-3.2.1 /opt/hadoop
sudo rm /opt/hadoop-3.2.1.tar.gz
```
- On the master Pi only, run `nano /opt/hadoop/etc/hadoop/workers` and replace any existing text with `worker1` and `worker2` on separate lines.
- Edit the four Hadoop `.xml` configuration files as follows.
Run `nano /opt/hadoop/etc/hadoop/core-site.xml` and add the configuration below:

```
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000/</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000/</value>
  </property>
</configuration>
```

Run `nano /opt/hadoop/etc/hadoop/hdfs-site.xml` and add the configuration below:
```
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop_tmp/hdfs/datanode</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop_tmp/hdfs/namenode</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>master:50070</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <!-- Based on the number of workers you have -->
    <value>2</value>
  </property>
</configuration>
```

Run `nano /opt/hadoop/etc/hadoop/yarn-site.xml` and add the configuration below:
```
<configuration>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8035</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8050</value>
  </property>
</configuration>
```

Run `nano /opt/hadoop/etc/hadoop/mapred-site.xml` and add the configuration below:
```
<configuration>
  <property>
    <name>mapreduce.job.tracker</name>
    <value>master:5431</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```
- Lastly, you need to point Hadoop at Java. Run `nano /opt/hadoop/etc/hadoop/hadoop-env.sh` and replace the commented-out `JAVA_HOME` line with `export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")`. NOTE: do not replace the first `JAVA_HOME` line, `JAVA_HOME=/usr/java/testing hdfs dfs -ls` — it is only an example.
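If you would rather not open nano on each Pi, the same configuration files can be written non-interactively with heredocs. A sketch for the `workers` file and a minimal `core-site.xml`; the conf directory is parameterized so this can be dry-run off the Pi, whereas on the cluster it would be `/opt/hadoop/etc/hadoop`:

```shell
#!/bin/sh
# Write the workers file and core-site.xml without an editor.
# CONF_DIR defaults to a local directory for a safe dry-run.
CONF_DIR="${CONF_DIR:-hadoop-conf-dryrun}"
mkdir -p "$CONF_DIR"

printf 'worker1\nworker2\n' > "$CONF_DIR/workers"

cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000/</value>
  </property>
</configuration>
EOF
```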
7. Format and test the cluster
- The last thing you need to do is format the NameNode, which initializes the HDFS filesystem and its directories. All of this can be done with a single command on your master Pi: `hdfs namenode -format`.
- The Hadoop Raspberry Pi cluster should now be ready for action. To test it out, start the Hadoop Distributed File System daemons with `/opt/hadoop/sbin/start-dfs.sh` and YARN with `/opt/hadoop/sbin/start-yarn.sh`.
- Assuming you did not run into trouble in the last step, you can check the status of your cluster with `hdfs dfsadmin -report` (observe the number of live DataNodes).
What’s next?
The next step is to write some simple MapReduce programs for the cluster to run. Afterwards, the intention is to enhance the cluster by installing additional components of the Hadoop ecosystem like Spark or Hue.
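As a warm-up for those MapReduce programs, the classic first job is word count, and its stages can be simulated locally with a shell pipeline, with no Hadoop involved, to build intuition for what the cluster will do at scale:

```shell
#!/bin/sh
# Word count as a pipeline, mirroring the MapReduce stages:
#   map:     emit one word per line (tr)
#   shuffle: bring identical keys together (sort)
#   reduce:  count each group of identical words (uniq -c)
echo "red green red blue red" | tr ' ' '\n' | sort | uniq -c
```

In a real Hadoop job the shuffle happens across the network between the worker Pis, but the key-grouping idea is the same.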
The cluster’s computing and storage capabilities cannot compete with even the budget laptops of 2020; however, it contains most of the physical and software components that have powered companies like Google and Facebook over the last decade. Overall, building a Raspberry Pi Hadoop cluster is a highly enjoyable and engaging educational experience.
A special thank you to Jason I. Carter, oliver hu and Alan Verdugo for their guides that served as inspiration.