Spark & Hadoop Raspberry Pi 4 Cluster + Macbook Pro as Master

Ekin Gün Öncü · Published in Testinium Tech · Jun 1, 2021 · 5 min read

Hello hello hello! I have been using Spark for quite some time, and I think a local development environment is the best way to write Spark projects in Scala. I use the Scala plugin for IntelliJ, and the project I work on is an SBT project. (SBT is a build tool for Scala, similar to Maven.)

libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-mllib" % "2.4.7",
"org.apache.spark" %% "spark-sql" % "2.4.7",
"org.apache.hadoop" % "hadoop-common" % "2.10.1" % "provided",
"org.apache.hadoop" % "hadoop-aws" % "2.10.1" % "provided",
"org.scala-lang" % "scala-library" % scalaVersion.value,
"org.scalatest" %% "scalatest-funsuite" % "3.2.2" % "test",
"com.github.scopt" %% "scopt" % "4.0.0-RC2",
"org.mongodb.spark" %% "mongo-spark-connector" % "2.4.3",
"org.mongodb.scala" %% "mongo-scala-driver" % "2.4.2",
"log4j" % "log4j" % "1.2.17"
)
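
For context, that Seq sits inside a build.sbt. The sketch below shows roughly how the surrounding file could look; everything outside the dependency list (the project name, the Scala version, the main class setting) is an assumption of mine rather than a copy of the real file, with Scala 2.11 assumed because that is what the Spark 2.4.x prebuilt binaries target, and the fat jar for spark-submit typically built with the sbt-assembly plugin.

// build.sbt -- a minimal sketch, not the exact file
name := "miner"                   // assumed, to match the miner.jar used below
version := "0.1.0"
scalaVersion := "2.11.12"         // assumed; Spark 2.4.x prebuilt targets Scala 2.11

libraryDependencies ++= Seq(/* the dependency list above */)

// assumed entry point; packaging the fat jar itself would go through sbt-assembly
Compile / mainClass := Some("com.testinium.analytics.AppS3Downloader")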

Those are my SBT dependencies. I can trigger all my pipelines from the test suite of the project as well as through the CLI. This is important because most Spark environments expect you to submit your app through the CLI, so most developers also run the app through the CLI in their local environments. If you ask me, there is a better way.

spark-submit --deploy-mode cluster \
  --class com.testinium.analytics.AppS3Downloader \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=2G \
  --conf spark.sql.shuffle.partitions=1000 \
  --conf spark.default.parallelism=20 \
  --conf spark.executor.memory=8G \
  --conf spark.driver.memory=9G \
  --conf spark.eventLog.enabled=true \
  --conf spark.executor.extraJavaOptions=-Xss4m \
  --conf spark.driver.extraJavaOptions=-Xss4m \
  */miner.jar \
  --domainId * --submitId * --startTime 1621555200000 --endTime 1621900800000 \
  --accessKey * --secretKey * --region * --bucketName * --prefix *

For example, the command above is what my EMR (Amazon's managed Hadoop service) connector API generates on our prod environment.

test("S3DownloadTest") {
  val cfg = AppS3DownloaderConfig(
    domainId = *, startTime = 1587369461000L, endTime = 1650441461000L,
    accessKey = *, secretKey = *, bucketName = *, prefix = *
  )
  AppS3DownloaderPipeline.run(cfg, sparkSession("S3DownloadTest"))
}

And this triggers the same pipeline. Running pipelines as scalatest-funsuite tests like this makes my development process much easier to iterate on and to debug.
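
The sparkSession(...) helper in that test is not shown here; as a minimal sketch of what it could look like (the builder settings are assumptions, not my exact code), it only needs to spin up a local SparkSession named after the test:

import org.apache.spark.sql.SparkSession

// Hypothetical helper trait mixed into the test suites; local[*] keeps everything in the test JVM.
trait SparkTestSupport {
  def sparkSession(appName: String): SparkSession =
    SparkSession.builder()
      .appName(appName)
      .master("local[*]")
      .getOrCreate()
}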

As I mentioned before, I am using EMR on AWS to run the Spark pipelines, and everything has been great so far. But when I wanted to scale the cluster on EMR, I realized it is harder than I thought: there are a lot of tuning configs that affect how long the pipelines take, and I have to learn more about the Hadoop and Spark environments to tune them better.

Wait a second… I really like Raspberry Pis and I already own two of them, so why don't I try to build my own Spark and Hadoop cluster?

Let's talk about my environment. I have a simple unmanaged switch that puts all the components on the same local network, and my MacBook Pro acts as the router, sharing its internet connection with the Pis.

This armored guy is a Raspberry Pi 4. I really like the case, and since I only have two Pis I am using a switch without PoE.

As I said before, I share the internet connection from my MacBook through a USB-C to Ethernet adapter.

This is how the machines are addressed:

192.168.2.2 pi1    --> Raspberry Pi 4
192.168.2.3 pi2    --> Raspberry Pi 4
192.168.2.1 master --> MacBook

I chose Ubuntu as the OS for my Pis, but any other distro should be fine.
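
I have not shown how the Pis get their fixed 192.168.2.x addresses. One way to do it on Ubuntu is a netplan file like the sketch below; the file name, interface name, and DNS servers are assumptions for my setup, not something copied from the Pis.

# /etc/netplan/99-cluster.yaml (hypothetical file name), applied with: sudo netplan apply
network:
  version: 2
  ethernets:
    eth0:                            # assumed interface name on the Pi
      dhcp4: false
      addresses: [192.168.2.2/24]    # 192.168.2.3 on pi2
      gateway4: 192.168.2.1          # the MacBook sharing its internet connection
      nameservers:
        addresses: [8.8.8.8, 1.1.1.1]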

Hadoop uses SSH to communicate between nodes, so to run it as a cluster the machines have to be able to connect to each other without passwords. To achieve this I executed the following:

usermod -l ekin -d /home/ekin -m ubuntu

I changed the default username and home directory because, if you type just "ssh host" without a username, SSH connects with your current user's name. Since the username on my MacBook is ekin, the first step toward connecting with a plain "ssh pi1" was to change the default username on the Pis.
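
If you would rather keep the default ubuntu user, an alternative I did not use is to pin the username per host in ~/.ssh/config on the MacBook:

# ~/.ssh/config
Host pi1 pi2
    User ubuntu

Either way, a bare "ssh pi1" then connects as the right user.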

Then I added the IP addresses of the machines to the /etc/hosts file on every machine. One of the hosts files looks like this; the MacBook one is a little different, but the only important rows are the last three.

127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
192.168.2.2 pi1
192.168.2.3 pi2
192.168.2.1 master

To connect without a password, you need to generate a key using the command

ssh-keygen -t rsa -P ""

then you need to execute the following

cat ~/.ssh/id_rsa.pub | ssh ekin@192.168.2.1 'cat >> ~/.ssh/authorized_keys'

After these steps, running "ssh master" from a Raspberry Pi should log you straight in. Repeat these steps for every node in your cluster.
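
As a side note, ssh-copy-id does the same append in one step if it is available on your machine:

ssh-copy-id -i ~/.ssh/id_rsa.pub ekin@192.168.2.1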

  • Installing Hadoop

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz

Hadoop needs Java, so you need to install Java 8 in your environment. On a Raspberry Pi you can use the command below.

sudo apt install openjdk-8-jre-headless

You also need to set the JAVA_HOME environment variable.

export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-arm64"

After the installation, execute the following.

sudo tar -xvf hadoop-3.2.0.tar.gz -C /opt
cd /opt/
sudo mv hadoop-3.2.0 hadoop
sudo chown -R ekin:ubuntu /opt

Set environment variables for Hadoop.

export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
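
These exports only live in the current shell. To keep them across reboots, you can append them to ~/.bashrc (assuming bash is the login shell on the Pis); the same goes for JAVA_HOME above and SPARK_HOME below.

echo 'export HADOOP_HOME=/opt/hadoop' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
source ~/.bashrc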

Check Hadoop installation.

hadoop version | grep Hadoop

It should print "Hadoop 3.2.0" as the result.

  • Installing Spark

Execute the following.

wget https://kozyatagi.mirror.guzel.net.tr/apache/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
sudo tar -xvf spark-2.4.8-bin-hadoop2.7.tgz -C /opt
cd /opt/
sudo mv spark-2.4.8-bin-hadoop2.7 spark
sudo chown -R ekin:ubuntu /opt

Set environment variables for Spark.

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin

Check Spark installation.

spark-shell --version

You have to repeat this installation on every node. Since I only have three nodes, I did it all manually. 🙈
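
One thing I have skipped is the Hadoop configuration itself; the two guides linked at the end walk through it in detail. As a minimal sketch, with hostnames matching my /etc/hosts and everything else treated as an assumption rather than my exact settings, the files under /opt/hadoop/etc/hadoop would look roughly like this on every node:

# hadoop-env.sh: Hadoop wants JAVA_HOME set here as well
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64

<!-- core-site.xml: every node talks to the NameNode on the master -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: only two Pis store blocks, so cap replication at 2 -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>

# workers: the nodes that run the DataNode and NodeManager daemons
pi1
pi2

For YARN, yarn-site.xml needs at least yarn.resourcemanager.hostname set to master so the Pis know where the ResourceManager lives.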

After all this setup, execute the commands below on the master node.

hdfs namenode -format -force
start-dfs.sh && start-yarn.sh

After these commands, the NameNode web UI at master:9870 should be up and running.
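
To check that Spark can actually use the cluster, one quick smoke test is to run the bundled SparkPi example on YARN. The jar path below assumes the Scala 2.11 build that ships inside spark-2.4.8-bin-hadoop2.7, and HADOOP_CONF_DIR tells Spark where to find the YARN configuration.

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.8.jar 100

If the application finishes as SUCCEEDED in the YARN UI at master:8088, the cluster is ready for real pipelines.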

For this write-up I was inspired by these two great sources: https://www.linode.com/docs/guides/how-to-install-and-set-up-hadoop-cluster/ and https://dev.to/awwsmm/building-a-raspberry-pi-hadoop-spark-cluster-8b2

Thanks for reading this far. I will write more as I learn more.
