Build Raspberry Pi Hadoop/Spark Cluster from scratch

Henry Liang
Published in Analytics Vidhya · 12 min read · Dec 23, 2019

Introduction

To get a better understanding of how cloud computing works, my classmate Andy Lin and I decided to dig into the world of data engineering. Our goal was to build a Hadoop/Spark cluster on Raspberry Pis from scratch. We will walk you through the steps we took and address the errors you might encounter along the way.

Picture above shows the components we purchased: 3 Raspberry Pi 4 boards, 3 microSD cards, 3 Ethernet cables, 3 charging cables, and 1 network switch.
Picture above shows the assembled Raspberry Pi cluster.

Phase 1: Raspberry Pi Setup

Install Operating System on Raspberry Pi

We chose a Debian-based OS called Raspbian. Raspbian NOOBS is the simplest way we found to get the OS running on the Raspberry Pi. We formatted each microSD card as FAT, extracted the NOOBS*.zip archive, and copied the files onto the card. Finally, we inserted the microSD card into the Pi and the operating system installed successfully.

Picture above shows Raspbian OS running.

Configuring Static IP Addresses

We set a static IP address for each Pi on the network switch to make networking between the Pis simpler. We named them raspberrypi1, raspberrypi2, and raspberrypi3. Then, on each Pi, we edit the file /etc/dhcpcd.conf and uncomment/edit the following lines:
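A sketch of the block we edit in /etc/dhcpcd.conf (the router and DNS addresses are assumptions; match them to your own network):

    interface eth0
    static ip_address=192.168.1.10X/24
    static routers=192.168.1.1
    static domain_name_servers=192.168.1.1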

Where X should be replaced by 1 for Pi1, 2 for Pi2, etc.

Note that the static ip_address you set for each Pi must be consistent with your router's addressing. For example, our router hands out addresses of the form 192.168.1.10X/24, so we assigned each Pi an ip_address in that range, as shown below.

Picture above shows the ip_address we set for raspberrypi1. (Accordingly, we use 192.168.1.102/24 for raspberrypi2 and 192.168.1.103/24 for raspberrypi3.)

Enabling SSH

We then enable SSH by following the instructions at this link.

Setting Hostnames

At first, all the Pis are known as raspberrypi and have a single user. This has the potential to become very confusing if we’re constantly moving back and forth between the different Pis on the network. To solve this problem, we will assign each Pi a hostname based on its position in the case / on the network switch. Pi #1 will be known as raspberrypi1, Pi #2 as raspberrypi2, and so on.

Two files must be edited: /etc/hosts and /etc/hostname. Within each of those files there should be a single occurrence of the default hostname raspberrypi. We change it to raspberrypiX, where X is the appropriate number 1-3. Lastly, in /etc/hosts only, we also add the IPs and hostnames of all the other Pis at the end of the file, like the picture shown below.
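For reference, the entries we append to /etc/hosts follow the IP plan above:

    192.168.1.101    raspberrypi1
    192.168.1.102    raspberrypi2
    192.168.1.103    raspberrypi3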

From now on, the terminal prompt we see looks like the picture shown below:

Simplifying SSH

To connect from one Pi to another, it can be quite tedious to type a command sequence like the one shown in the picture below.

We can simplify this process by setting up ssh aliases and passwordless ssh connections with public/private key pairs.

SSH Aliases

Edit the ~/.ssh/config file on a particular Pi and add the following lines:
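A sketch of one such block, assuming the default pi user:

    Host raspberrypiX
        User pi
        Hostname 192.168.1.10X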

Replace X with the number of each Pi. This is done on a single Pi, so that one Pi's ~/.ssh/config contains one such block for every Pi on the network, identical except for the X.
(In our case, because the cluster has three Raspberry Pis with the hostnames we chose, the file contains three such blocks, as shown above.)

After this, our ssh command becomes as short as the one shown in the picture above:

But we still need to enter a password at this point. We can simplify this further by setting up public/private key pairs.

Public/Private Key Pair

First, on each Pi, run the following command:
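Since the key files mentioned below are id_ed25519, we generate an ed25519 pair with the standard OpenSSH key generator:

    ssh-keygen -t ed25519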

This will generate a public/private key pair within the directory ~/.ssh/, which can be used to ssh securely without entering a password. One of these files is id_ed25519, the private key; the other, id_ed25519.pub, is the public key. No passphrase is necessary to protect access to the key pair. The public key is used to communicate with the other Pis, while the private key never leaves its host machine and should never be moved or copied to any other device.

Then each public key will need to be concatenated to the ~/.ssh/authorized_keys file on every other Pi. Let’s assume that Pi #1 will contain the “master” record, which is then copied to the other Pis.

To begin, on Pi #2 (and #3, etc.), run the following command:
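A sketch of that command, which pipes the local public key into Pi #1's authorized_keys over ssh:

    cat ~/.ssh/id_ed25519.pub | ssh raspberrypi1 'cat >> ~/.ssh/authorized_keys'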

This concatenates Pi #2’s public key file to Pi #1’s list of authorized keys, giving Pi #2 permission to ssh into Pi #1 without a password (the public and private keys are instead used to validate the connection). We need to do this for each machine, concatenating each public key file to Pi #1’s list of authorized keys.

We should also do this for Pi #1, so that when we copy the completed authorized_keys file to the other Pis, they all have permission to ssh into Pi #1, as well. Run the following command on Pi #1:
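On Pi #1 itself the public key can simply be appended locally:

    cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys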

Once this and the previous section are done, ssh-ing is as easy as:
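For example, from any Pi:

    ssh raspberrypi2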

Replicate the Configuration

Finally, to replicate the passwordless ssh across all Pis, simply copy the two files mentioned above from Pi #1 to each other Pi using scp:
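A sketch of the copy, run from Pi #1 once for each other Pi:

    scp ~/.ssh/authorized_keys raspberrypiX:~/.ssh/authorized_keys
    scp ~/.ssh/config raspberrypiX:~/.ssh/config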

where piX is one of the hostnames you chose (in our case raspberrypi1, raspberrypi2, raspberrypi3). You should now be able to ssh into any Pi on the cluster from any other Pi with just ssh piX.

For Ease of Use

Edit the ~/.bashrc file and add the functions below. (Note that whenever you edit ~/.bashrc, you must source the file, or log out and log back in, for the changes to take effect.) Then you can call the new functions from the shell. The functions below are the ones we highly recommend; a sketch of their definitions follows the list.

1. Get the hostname of every Pi except this one
2. Send the same command to all Pis
3. Reboot the cluster
4. Shutdown the cluster
5. Send the same file to all Pis
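A sketch of these functions, along the lines of the reference tutorial [1] (the exact bodies in our screenshots may differ slightly):

    # 1. list the hostname of every Pi except this one
    function otherpis {
      grep "raspberrypi" /etc/hosts | awk '{print $2}' | grep -v $(hostname)
    }
    # 2. run the same command on every other Pi, then locally
    function clustercmd {
      for pi in $(otherpis); do ssh $pi "$@"; done
      $@
    }
    # 3. reboot every Pi in the cluster
    function clusterreboot {
      clustercmd sudo shutdown -r now
    }
    # 4. shut down every Pi in the cluster
    function clustershutdown {
      clustercmd sudo shutdown now
    }
    # 5. copy a file to the same path on every other Pi
    function clusterscp {
      for pi in $(otherpis); do
        cat $1 | ssh $pi "sudo tee $1" > /dev/null 2>&1
      done
    }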

Then we can copy all of the ease-of-use functions we've defined in ~/.bashrc to every other Pi on the cluster by using the command shown below:
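Assuming the clusterscp helper sketched above, that is simply:

    source ~/.bashrc && clusterscp ~/.bashrc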

Phase 2: Hadoop & Spark

Single-Node Setup

Hadoop Installation

Before installing Hadoop, we make sure our Pis have an acceptable version of Java. We start by building a single-node setup on the master node. On Pi #1, get Hadoop with this command (this is a shortened link to hadoop-3.2.0.tar.gz):
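The shortened link points at hadoop-3.2.0.tar.gz; the equivalent download from the Apache archive is:

    cd && wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz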

Next, extract the archive with the command shown below:
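This unpacks the archive into /opt and renames the directory to /opt/hadoop, the path used in the rest of this guide; a sketch:

    sudo tar -xvf hadoop-3.2.0.tar.gz -C /opt/
    cd /opt && sudo mv hadoop-3.2.0 hadoop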

Then, make sure to change the permissions on this directory:
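Assuming the default pi user:

    sudo chown pi:pi -R /opt/hadoop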

Finally, add this directory to the $PATH by editing ~/.bashrc file and putting the following lines at the end of the file:

Picture above shows the lines we use for editing the ~/.bashrc file.
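The lines are roughly as follows (JAVA_HOME is resolved from the system java binary, and the paths match the /opt/hadoop install above):

    export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
    export HADOOP_HOME=/opt/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin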

Then edit /opt/hadoop/etc/hadoop/hadoop-env.sh file to add the following line:

Picture above shows the line we use for editing the /opt/hadoop/etc/hadoop/hadoop-env.sh file.
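The added line sets JAVA_HOME for Hadoop's own scripts, for example:

    export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")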

Verify that Hadoop has been installed correctly by checking the version:
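For example:

    hadoop version | grep Hadoop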

Spark Installation

We download Spark in a similar manner to Hadoop. Run the following command (this is a shortened link to spark-2.4.3-bin-hadoop2.7.tgz):
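The shortened link points at spark-2.4.3-bin-hadoop2.7.tgz; the equivalent download from the Apache archive is:

    cd && wget https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz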

Then extract it with the command shown below:
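As with Hadoop, this unpacks the archive into /opt and shortens the directory name to /opt/spark; a sketch:

    sudo tar -xvf spark-2.4.3-bin-hadoop2.7.tgz -C /opt/
    cd /opt && sudo mv spark-2.4.3-bin-hadoop2.7 spark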

Then, make sure to change the permissions on this directory:
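Again assuming the default pi user and the /opt/spark path above:

    sudo chown pi:pi -R /opt/spark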

Finally, add this directory to your $PATH by editing ~/.bashrc and putting the following lines at the end of the file:
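A sketch of the lines, assuming the /opt/spark path above:

    export SPARK_HOME=/opt/spark
    export PATH=$PATH:$SPARK_HOME/bin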

You can verify that Spark has been installed correctly by checking the version:
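For example:

    spark-shell --version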

HDFS (Hadoop Distributed File System)

To get the Hadoop Distributed File System (HDFS) up and running, we need to modify some configuration files.

  • All of these files are within /opt/hadoop/etc/hadoop
  • Including core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml
Edit core-site.xml so it looks like this.
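A sketch of core-site.xml, pointing the default filesystem at the master node on port 9000 (the port used by the HDFS URLs later in this guide):

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://raspberrypi1:9000</value>
      </property>
    </configuration>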
Edit hdfs-site.xml so it looks like this.
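A sketch of hdfs-site.xml; the /opt/hadoop_tmp paths are the directories created in the next step:

    <configuration>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///opt/hadoop_tmp/hdfs/datanode</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///opt/hadoop_tmp/hdfs/namenode</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>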

The lines shown above configure where the DataNode and NameNode information is stored and also set the replication factor (the number of times a block is copied across the cluster) to 1 (we can change this later depending on the number of DataNodes you have built).

Create these directories with the commands shown below:
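Assuming the paths used in the hdfs-site.xml sketch above:

    sudo mkdir -p /opt/hadoop_tmp/hdfs/datanode
    sudo mkdir -p /opt/hadoop_tmp/hdfs/namenode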

Then adjust the owner of these directories:
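Again assuming the default pi user:

    sudo chown pi:pi -R /opt/hadoop_tmp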

Add lines below to mapred-site.xml so it looks like the following:
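A minimal sketch that tells MapReduce to run on YARN:

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>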

Finally, add lines below to yarn-site.xml so it looks like the following:
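A minimal sketch enabling the MapReduce shuffle service in YARN:

    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
    </configuration>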

Once these four files are edited, we can format the HDFS by using the command shown below. (WARNING: DO NOT DO THIS IF YOU ALREADY HAVE DATA IN THE HDFS! IT WILL BE LOST!):
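The standard format command (the -force flag skips the confirmation prompt):

    hdfs namenode -format -force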

Then we boot the HDFS with the following two commands:
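These are the standard Hadoop start scripts, on the $PATH thanks to the sbin entry added to ~/.bashrc above:

    start-dfs.sh
    start-yarn.sh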

And test if it’s working by creating a temporary directory:
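For example:

    hadoop fs -mkdir /tmp
    hadoop fs -ls /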

Or by running the command jps:

Picture above shows that the HDFS is up and running, at least on Pi #1. To check that Spark and Hadoop are working together, we can use the command lines below:
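A simple check, assuming we use Spark's bundled README.md as a test file:

    hadoop fs -put $SPARK_HOME/README.md /
    spark-shell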

It will then open the Spark shell, with the scala> prompt:
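From that prompt we can read the test file back out of HDFS, for example:

    scala> val textFile = sc.textFile("hdfs://raspberrypi1:9000/README.md")
    scala> textFile.first()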

Cluster Setup

At this point, we have a single-node cluster and that single node acts as both a master and a worker node. To set up the worker nodes (and distribute computing to the entire cluster), we take the following steps.

Create the Directories

Create the required directories on all other Pis using the command lines shown below:
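Assuming the clustercmd helper from Phase 1 and the same paths as on Pi #1:

    clustercmd sudo mkdir -p /opt/hadoop_tmp/hdfs
    clustercmd sudo chown pi:pi -R /opt/hadoop_tmp
    clustercmd sudo mkdir -p /opt/hadoop
    clustercmd sudo chown pi:pi /opt/hadoop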

Copy the Configuration

Copy the files in /opt/hadoop to each other Pi using:
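One way to do this is rsync over ssh, looping over the otherpis helper (this assumes rsync is installed on all Pis; scp -r works as well):

    for pi in $(otherpis); do rsync -avxP $HADOOP_HOME $pi:/opt; done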

After a while, you can verify that the files copied correctly by querying the Hadoop version on each node with the following command:
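Using the clustercmd helper (if the remote shells report that hadoop is not found, move the PATH exports near the top of ~/.bashrc so non-interactive shells pick them up):

    clustercmd hadoop version | grep Hadoop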

Configuring Hadoop on the Cluster

To get HDFS running across the cluster, we need to modify the configuration files that we edited earlier. All of these files are within /opt/hadoop/etc/hadoop

First, edit core-site.xml so the default filesystem points at the master node.
Edit hdfs-site.xml as well. (We did not change the replication factor, keeping it at 1 as before, because we decided to have only two DataNodes.)
Edit mapred-site.xml in the same way.
Finally, edit yarn-site.xml. A sketch of the settings we changed across the four files follows.
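A sketch of the properties we change across the four files (the property names are standard Hadoop/YARN settings; the memory values are placeholders that must be tuned to your Pi's RAM):

    <!-- core-site.xml: point the default filesystem at the master node -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://raspberrypi1:9000</value>
    </property>

    <!-- hdfs-site.xml: same data directories as before, replication kept at 1 -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>

    <!-- mapred-site.xml: memory limits for MapReduce jobs (placeholder values) -->
    <property>
      <name>yarn.app.mapreduce.am.resource.mb</name>
      <value>512</value>
    </property>
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>256</value>
    </property>
    <property>
      <name>mapreduce.reduce.memory.mb</name>
      <value>256</value>
    </property>

    <!-- yarn-site.xml: tell the NodeManagers where the ResourceManager lives, plus memory limits -->
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>raspberrypi1</value>
    </property>
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>3072</value>
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>3072</value>
    </property>
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>128</value>
    </property>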

Make these changes to these files, then remove all of the old HDFS data files from all Pis. You can clean up all the Pis with:
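Assuming the clustercmd helper and the data directories above:

    clustercmd rm -rf /opt/hadoop_tmp/hdfs/datanode/*
    clustercmd rm -rf /opt/hadoop_tmp/hdfs/namenode/*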

Next, we need to create two files in $HADOOP_HOME/etc/hadoop/ which tell Hadoop which Pis to use as worker nodes and which Pi should be the master (NameNode) node.

First, we create a file named master in the aforementioned directory and add only a single line containing the master Pi (raspberrypi1). Second, we create a file named workers in the same directory and add all of the other Pis (raspberrypi2, raspberrypi3), like the picture shown below:
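One way to create the two files:

    echo "raspberrypi1" > $HADOOP_HOME/etc/hadoop/master
    printf "raspberrypi2\nraspberrypi3\n" > $HADOOP_HOME/etc/hadoop/workers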

Then, you’ll need to edit /etc/hosts again. On any Pi, remove the line which looks like:
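On Raspbian this is the loopback entry for the Pi's own hostname:

    127.0.1.1    raspberrypiX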

where X is the index of that particular Pi.

Then, copy this file to all other Pis with:
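Using the clusterscp helper:

    clusterscp /etc/hosts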

We can do this now because this file is no longer Pi-specific. Finally, reboot the cluster for these changes to take effect. When all Pis have rebooted, on Pi #1, run the command:
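Following the reference tutorial [1], this is the point where the NameNode is reformatted (again, this erases anything already stored in HDFS):

    hdfs namenode -format -force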

We then boot the HDFS with the following two commands:
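As before, on Pi #1:

    start-dfs.sh
    start-yarn.sh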

We can test the cluster by putting files into HDFS from any Pi (using hadoop fs -put) and making sure they show up on other Pis (using hadoop fs -ls). You can also check that the cluster is up and running by opening a web browser and navigating to http://192.168.1.101:9870/ (in our case). This web interface gives you a file explorer as well as information about the health of the cluster.
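For example, with test.txt standing in for any local file:

    hadoop fs -put ~/test.txt /
    # then, on another Pi:
    hadoop fs -ls /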

Hadoop web UI running on port 9870.
Hadoop web UI showing DataNode statistics.

Configuring Spark on the Cluster

For Spark to be able to communicate with YARN, we need to configure two more environment variables in Pi #1’s ~/.bashrc. Previously, we defined:
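These are the ~/.bashrc entries from the single-node setup, roughly:

    export HADOOP_HOME=/opt/hadoop
    export SPARK_HOME=/opt/spark
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin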

In ~/.bashrc. Just beneath this, we will now add two more environment variables:
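The first is $HADOOP_CONF_DIR, described below; for the second we follow the reference tutorial [1], which points Spark at Hadoop's native libraries:

    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH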

$HADOOP_CONF_DIR is the directory which contains all of the *-site.xml configuration files that we edited above. Next, we create the Spark configuration file by using the command lines shown below:
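Spark ships a template for this file, so one way to create it is:

    cd $SPARK_HOME/conf
    cp spark-defaults.conf.template spark-defaults.conf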

Then we add the following lines to the end of this file:
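A sketch of those lines; the memory and core values are placeholders and, as noted below, very machine-specific:

    spark.master            yarn
    spark.driver.memory     512m
    spark.yarn.am.memory    512m
    spark.executor.memory   512m
    spark.executor.cores    4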

The meaning of these values is explained at this link. But note that the above is very machine-specific. Once all of this has been configured, reboot the cluster. Note that, when you reboot, you should NOT format the HDFS NameNode again. Instead, simply stop and restart the HDFS service with:
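That is:

    stop-dfs.sh && stop-yarn.sh
    start-dfs.sh && start-yarn.sh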

Now, you can submit a job to Spark on the command line! We use the command lines shown below to create a simple RDD from a text file.
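A sketch of such a session, reusing the README.md we put into HDFS earlier:

    spark-shell
    scala> val textFile = sc.textFile("hdfs://raspberrypi1:9000/README.md")
    scala> textFile.count()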

The output should look like the picture shown below.

Conclusion

Finally, we have successfully completed this project of building a Raspberry Pi Hadoop/Spark cluster from scratch. It has been a long process of configuring and exploring, but we enjoyed it. This is the first data engineering side project of our careers, and we look forward to learning more about the world of big data analysis.

Reference

[1]: Andrew. (July 22, 2019). Building a Raspberry Pi Hadoop / Spark Cluster
