Raspberry Pi Hadoop Cluster Guide
This guide explores how you can set up a Hadoop cluster running on three Raspberry Pis.
Apache Hadoop is open-source software that coordinates storage and processing across multiple computers, which allows you to store and work with large amounts of data. Hadoop is fault tolerant and runs on commodity hardware, so clusters have traditionally been relatively cheap and easy to expand for more storage and processing power. These features led to significant adoption, and the framework was one of the catalysts for the emergence of Big Data in the late ’00s.
Before we begin, the natural question is: why would someone build a Hadoop cluster in 2020? A couple of reasons:
- While the overall influence of Hadoop is decreasing, understanding how it works can serve as a basis and a comparison point for more relevant Big Data components like Spark, Kafka, Kubernetes, etc.
- Building a Hadoop cluster can teach useful general computing concepts including using the Terminal, Secure Shell, networking, etc.
- The cluster’s hardware can be repurposed for continuous learning and development of other projects.
- It is really fun!
Hardware parts:
- 3x Raspberry Pi B+ £99.78
- 3x Micro SD card 32GB £17.07
- 3x USB C to USB 2.0 cables (pack of 5) £7.99
- 6-Port USB charger £17.49
- 5-Port Ethernet Switch £6.78
- USB 2.0 to Ethernet cable adapter £10.99
- 4x Ethernet cables £13.96
- USB 2.0 to 1.35mm DC 5 volt adapter £5.99
- Acrylic case £16.59
- Double sided sticky tape £5.99
TOTAL: £202.63
Step-by-step guide
This specific guide is for a “headless” Raspberry Pi setup driven from a MacBook. This means that the Pis do not have their own dedicated screens and keyboards; instead, they will be controlled from the MacBook (referred to simply as the machine in this guide).
1. Flash the Raspbian OS image onto the SD cards.
- First, download an Operating System (OS) for your Pis. There is a variety to choose from, but I am using the official Raspberry Pi OS called Raspbian that comes with a desktop and recommended software (download here). It is light, easy to use and recommended by the Raspberry Pi Foundation as the preferred OS because it is optimized for the Pi hardware. Once the download is complete and unzipped, you should have a single file with a `.img` extension on your machine.
- Next, you need to write the OS image to your SD card. For this purpose I downloaded and used the balenaEtcher flashing software (download here). It is very intuitive to use: simply insert the SD card into your machine, open balenaEtcher and select the `.img` file you want to flash.
- Repeat the flashing for the remaining two SD cards using the same OS image.
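balenaEtcher is the easiest route, but the same flash can also be done from the Terminal with `dd`. The sketch below copies between plain files so it is safe to run anywhere and only illustrates the mechanics; the device identifier `/dev/rdisk2` in the comments is an assumption you must verify with `diskutil list` before attempting a real flash:

```shell
#!/bin/sh
# Illustration of what dd does, using plain files so it is safe to run
# anywhere. On macOS the real flash would look like:
#   diskutil list                      # find the SD card identifier
#   diskutil unmountDisk /dev/disk2    # the identifier is an assumption!
#   sudo dd if=raspbian.img of=/dev/rdisk2 bs=1m
# dd overwrites its target, so triple-check the identifier first.
printf 'fake-os-image' > raspbian.img
dd if=raspbian.img of=sdcard.img 2>/dev/null
cmp -s raspbian.img sdcard.img && echo "images identical"
```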
2. Set up the hardware
This one is straightforward. Mount the Pis in the acrylic case, tape the USB charger and the Ethernet switch to the case, and connect the cables.
3. SSH into the individual Pis and expand the file system
- After the flashing is complete, the boot SD card will be automatically ejected. Reinsert the card into your machine and run `touch /Volumes/boot/ssh` in your Terminal. This creates an empty file named `ssh` on the boot partition, which tells the Pi to enable SSH on first boot.
- Eject the SD card from your machine, insert it into your Pi, and connect the Pi to your machine and a power source.
- You now need to find out the IP address of the Pi. You can do this by running `ping raspberrypi.local` in your Terminal (press `ctrl + c` to stop the pinging).
- Use the information gained from the ping to identify the IP address and run `ssh pi@169.254.63.20` (your IP address will be different). Type `yes` when asked whether you want to continue connecting. You will then be asked to type in a password; the default password for a Pi is `raspberry`.
- Once you have successfully SSHed into the Pi, run `sudo raspi-config`. This should take you to the Configuration Tool. Navigate to `7 Advanced Options`, select `A1 Expand Filesystem`, then navigate to `Finish` and reboot when prompted.
- Repeat these steps for the other two Pis.
- NOTE: you might also want to consider setting up a Graphical User Interface (GUI) for your Pis for ease of control. You can follow this video guide to do so.
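If you want to grab the Pi's address programmatically rather than reading it off the screen, the IP can be extracted from the ping output with `grep`. A sketch — the sample reply line below is illustrative stand-in output, and your address will differ:

```shell
#!/bin/sh
# On the Mac you would feed this from: ping -c 1 raspberrypi.local
# Here a sample reply line stands in for real ping output.
sample='64 bytes from 169.254.63.20: icmp_seq=0 ttl=64 time=0.42 ms'
ip=$(printf '%s\n' "$sample" | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | head -n 1)
echo "$ip"
```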
4. Configure the network, users and SSH keys
- Once you have expanded the file system of the last Pi, do not SSH back into it. Instead, run `sudo nano /etc/hosts` in your machine’s Terminal. Navigate to the bottom of the file, add the lines below, and save the file by pressing `ctrl + x` and confirming (the IP addresses will be different for you):

```
169.254.24.25 master
169.254.25.188 worker1
169.254.63.20 worker2
```
- You can now SSH into the master Pi using `ssh pi@master` instead of using the IP address (replace `master` with `worker1` or `worker2` for the other two Pis). Try this out, and once connected use the same `sudo nano /etc/hosts` command as before to append the Pi’s `hosts` file with the lines below (it is recommended that you replace `raspberrypi` on the `127.0.1.1` line with the Pi’s new name, as shown, to avoid name-resolution errors):

```
127.0.1.1 master # worker1, worker2 for the other two Pis
169.254.24.25 master
169.254.25.188 worker1
169.254.63.20 worker2
```
- The next step is to rename each Pi’s `hostname`. To do this, run `sudo nano /etc/hostname` while connected and replace `raspberrypi` with `master`, `worker1` and `worker2` respectively for each Pi.
- For each of the Pis, create a dedicated Hadoop group and user by running the commands below (you will need to set a password, but can keep the rest of the fields empty):

```
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo
```
- The last step is to generate SSH keys. This will ensure that the Pis can seamlessly communicate with each other when the cluster is formed. Repeat the steps below on each of the Pis as the new `hduser`, copying the key to all three hostnames:

```
ssh-keygen -t ed25519 # You are free to use another algorithm
ssh-copy-id hduser@master
ssh-copy-id hduser@worker1
ssh-copy-id hduser@worker2
```
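If you prefer to script the `hosts` edits from step 4 instead of opening nano on every Pi, a small helper can append the entries idempotently. This is a sketch: `HOSTS_FILE` defaults to a local file so it can be dry-run safely, and on the Pis you would point it at `/etc/hosts` and run it with `sudo`:

```shell
#!/bin/sh
# Idempotently append cluster entries to a hosts file. HOSTS_FILE
# defaults to a local copy for a safe dry-run; on the Pis set
# HOSTS_FILE=/etc/hosts and run with sudo.
HOSTS_FILE="${HOSTS_FILE:-hosts.local}"
touch "$HOSTS_FILE"

add_host() {
  # $1 = IP address, $2 = hostname; skip if the name is already present
  grep -qw "$2" "$HOSTS_FILE" || printf '%s %s\n' "$1" "$2" >> "$HOSTS_FILE"
}

add_host 169.254.24.25  master
add_host 169.254.25.188 worker1
add_host 169.254.63.20  worker2
```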
5. Create the Hadoop folder structure and environment variables
- Create the required HDFS (Hadoop Distributed File System) folder structure using the commands below:

```
sudo mkdir /opt/hadoop_tmp/
sudo mkdir /opt/hadoop_tmp/hdfs
sudo mkdir /opt/hadoop_tmp/hdfs/namenode # Master only
sudo mkdir /opt/hadoop_tmp/hdfs/datanode # Workers only
```
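The directory creation above can also be done in one shot with `mkdir -p`, which creates any missing parent directories. A sketch with a configurable root so it can be dry-run without `sudo` (on the Pis you would set `ROOT=/opt` and run with `sudo`):

```shell
#!/bin/sh
# HDFS directory layout. ROOT defaults to a scratch directory for a
# safe dry-run; on the Pis use ROOT=/opt with sudo.
ROOT="${ROOT:-hdfs_dryrun}"
mkdir -p "$ROOT/hadoop_tmp/hdfs/namenode"   # master only
mkdir -p "$ROOT/hadoop_tmp/hdfs/datanode"   # workers only
```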
- Before you continue, make sure to modify the user permissions so that Hadoop can write to the necessary folders (I had the most trouble with this when building the cluster and had to review the Hadoop logs to spot the errors). Note that `/opt/hadoop` is only created in the next step, so rerun the first command after unpacking Hadoop if it errors here:

```
sudo chown hduser:hadoop -R /opt/hadoop
sudo chown hduser:hadoop -R /opt/hadoop_tmp
sudo chown hduser:hadoop -R /opt/hadoop_tmp/hdfs/datanode
```
- Lastly, you will need to add Java and Hadoop environment variables to your Bash profile. Run `nano ~/.bashrc` and add the lines below to the bottom of the file (after you make the changes and save the profile, run `source ~/.bashrc` to refresh it):

```
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_INSTALL/bin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
```
6. Download, install and configure Hadoop
- You need to choose a Hadoop version to run (download here). For this project I am running Hadoop 3.2.1. Once the download is complete, use the Terminal to navigate to the folder containing the downloaded `.tar.gz` file and run `scp hadoop-3.2.1.tar.gz hduser@master:/opt/`. This will copy the Hadoop tarball onto the master Pi. Repeat this for the two workers as well.
- SSH into each of the Pis and run the commands below. They unpack the Hadoop tarball, rename the folder, and finally delete the `.tar.gz` file itself, as it will no longer be needed:

```
sudo tar -xvzf /opt/hadoop-3.2.1.tar.gz -C /opt/
sudo mv /opt/hadoop-3.2.1 /opt/hadoop
sudo rm /opt/hadoop-3.2.1.tar.gz
```
- On the master Pi only, run `nano /opt/hadoop/etc/hadoop/workers` and replace any existing text with `worker1` and `worker2` on separate lines.
- Edit the four Hadoop `.xml` configuration files as follows.
Run `nano /opt/hadoop/etc/hadoop/core-site.xml` and add the configuration below:

```
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000/</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000/</value>
  </property>
</configuration>
```

Run `nano /opt/hadoop/etc/hadoop/hdfs-site.xml` and add the configuration below:
```
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop_tmp/hdfs/datanode</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop_tmp/hdfs/namenode</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>master:50070</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <!-- Based on the number of workers you have -->
    <value>2</value>
  </property>
</configuration>
```

Run `nano /opt/hadoop/etc/hadoop/yarn-site.xml` and add the configuration below:
```
<configuration>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8035</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8050</value>
  </property>
</configuration>
```

Run `nano /opt/hadoop/etc/hadoop/mapred-site.xml` and add the configuration below:
```
<configuration>
  <property>
    <name>mapreduce.job.tracker</name>
    <value>master:5431</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```
- Lastly, you need to point Hadoop at Java. Run `nano /opt/hadoop/etc/hadoop/hadoop-env.sh` and replace the commented-out `JAVA_HOME` line with `export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")`. NOTE: do not replace the first `JAVA_HOME` line, `JAVA_HOME=/usr/java/testing hdfs dfs -ls` — it is only an example.
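If you would rather not open nano on each Pi, the same configuration files can be written non-interactively with heredocs. A sketch for the `workers` file and a minimal `core-site.xml`; the conf directory is parameterized so this can be dry-run off the Pi, whereas on the cluster it would be `/opt/hadoop/etc/hadoop`:

```shell
#!/bin/sh
# Write the workers file and core-site.xml without an editor.
# CONF_DIR defaults to a local directory for a safe dry-run.
CONF_DIR="${CONF_DIR:-hadoop-conf-dryrun}"
mkdir -p "$CONF_DIR"

printf 'worker1\nworker2\n' > "$CONF_DIR/workers"

cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000/</value>
  </property>
</configuration>
EOF
```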
7. Format and test the cluster
- The last thing you need to do is format the NameNode, which initializes the HDFS filesystem and its directories. All of this can be done with a single command on your master Pi: `hdfs namenode -format`.
- The Hadoop Raspberry Pi cluster should now be ready for action. To test it out, start the Hadoop Distributed File System daemons with `/opt/hadoop/sbin/start-dfs.sh` and YARN with `/opt/hadoop/sbin/start-yarn.sh`.
- Assuming you did not run into trouble in the last step, you can check the status of your cluster with `hdfs dfsadmin -report` (observe the number of live DataNodes).
What’s next?
The next step is to write some simple MapReduce programs for the cluster to run. Afterwards, the intention is to enhance the cluster by installing additional components of the Hadoop ecosystem like Spark or Hue.
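As a warm-up for those MapReduce programs, the classic first job is word count, and its stages can be simulated locally with a shell pipeline, with no Hadoop involved, to build intuition for what the cluster will do at scale:

```shell
#!/bin/sh
# Word count as a pipeline, mirroring the MapReduce stages:
#   map:     emit one word per line (tr)
#   shuffle: bring identical keys together (sort)
#   reduce:  count each group of identical words (uniq -c)
echo "red green red blue red" | tr ' ' '\n' | sort | uniq -c
```

In a real Hadoop job the shuffle happens across the network between the worker Pis, but the key-grouping idea is the same.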
The cluster’s computing and storage capabilities cannot compete with even the budget laptops of 2020; however, it contains most of the physical and software components that have powered companies like Google and Facebook over the last decade. Overall, building a Raspberry Pi Hadoop cluster is a highly enjoyable and engaging educational experience.
A special thank you to Jason I. Carter, oliver hu and Alan Verdugo for their guides that served as inspiration.