How to Install and Set Up an Apache Spark Cluster on Ubuntu 18.04

João Torres
7 min read · Feb 3, 2020

In this article I will explain how to install Apache Spark on a multi-node cluster, providing step-by-step instructions.

Spark Architecture

Apache Spark follows a master/slave architecture with two main daemons and a cluster manager.

  • Master Daemon — (Master/Driver Process)
  • Worker Daemon — (Slave Process)
  • Cluster Manager

A Spark cluster has a single master and any number of slaves/workers. The driver and the executors run in their own Java processes; they can run on a single machine or be spread horizontally across separate machines in the cluster.

Prerequisites

  • Ubuntu 18.04 installed on a virtual machine.

1st Step:

Create 2 clones of the Virtual Machine you’ve previously created.

Make sure that you have the option “Generate new MAC addresses for all network adapters” selected. Also, choose the option “Full Clone”.

2nd Step:

Make sure all the VMs have the same network configuration on Adapter 2, so that the machines can reach each other.

3rd Step:

Let’s change the hostname on each virtual machine. Open the hostname file and type the machine’s name. Use this command:

sudo nano /etc/hostname
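
For example, on the master the file contains a single line with the machine’s name. This tutorial assumes the hostnames pd-master, pd-slave1 and pd-slave2, matching the SSH and Spark configuration further down:

pd-master

Repeat this on each slave with its own name, then save and exit.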

4th Step:

Now let’s figure out what our IP address is. To do that, just type the command:

ip addr

This is on the master VM; as you can see, our IP is 192.168.205.10. For you this will be different.

This means that our IPs are:

master: 192.168.205.10

slave1: 192.168.205.11

slave2: 192.168.205.12
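
These addresses need to stay the same across reboots, otherwise the hosts file we set up next will stop matching. If your VMs got their addresses via DHCP, one option is to pin a static address with netplan (the Ubuntu 18.04 network tool). A minimal sketch for the master, assuming the second adapter shows up as enp0s8 and that your netplan file is /etc/netplan/01-netcfg.yaml (both names may differ on your system):

network:
  version: 2
  ethernets:
    enp0s8:
      addresses: [192.168.205.10/24]

Use .11 and .12 on the slaves, then apply the change with sudo netplan apply.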

5th Step:

We need to edit the hosts file. Use the following command:

sudo nano /etc/hosts

and add your network information:
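
With the IPs from the previous step and the hostnames used throughout this tutorial, the extra entries would look like this:

192.168.205.10 pd-master
192.168.205.11 pd-slave1
192.168.205.12 pd-slave2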

6th Step:

For the machines to pick up the previous changes, we need to reboot them. Use the following command on all of them:

sudo reboot

7th Step:

Do this step on all the machines, master and slaves.

Now, to install Java, follow these commands and confirm when prompted (OpenJDK 11 is available directly from the standard Ubuntu 18.04 repositories, so no extra PPA is needed):

$ sudo apt-get install software-properties-common
$ sudo apt-get update
$ sudo apt-get install openjdk-11-jdk

To check that Java is installed, run the following command.

$ java -version
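
You will also need the path of the Java installation later, when setting JAVA_HOME in the 11th step. Assuming the openjdk-11-jdk package from above, you can find it with:

$ readlink -f $(which java)

Strip the trailing /bin/java from the output; on a default install this leaves something like /usr/lib/jvm/java-11-openjdk-amd64.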

8th Step:

Now let’s install Scala on the master and the slaves. Use this command:

$ sudo apt-get install scala

To check if Scala was correctly installed run this command:

$ scala -version

As you can see, Scala version 2.11.12 is now installed on my machine.

9th Step:

We will now configure SSH. This step is on the master only.

We need to install the OpenSSH server and client. Use the command:

$ sudo apt-get install openssh-server openssh-client

Now generate key pairs:

$ ssh-keygen -t rsa -P ""

Use the following command in order to make this key an authorized one:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
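
If SSH complains about permissions at any point, the key files may be too open. On a fresh Ubuntu install this is usually fine already, but you can tighten them with:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys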

Now we need to copy the content of .ssh/id_rsa.pub (of the master) to .ssh/authorized_keys (of all the slaves as well as the master). Use these commands, replacing user with your own username:

ssh-copy-id user@pd-master
ssh-copy-id user@pd-slave1
ssh-copy-id user@pd-slave2

Let’s check that everything went well by trying to connect to the slaves:

$ ssh pd-slave1
$ ssh pd-slave2

As you can see, everything went well. To exit, just type the command:

exit

10th Step:

Now we download the latest version of Apache Spark.

NOTE: Everything inside this step must be done on all the virtual machines.

Use the following command :

$ wget http://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz

This was the most recent version at the time of writing this article; it might have changed by the time you try it. Either way, you should still be fine using this one.

Extract the Apache Spark file you just downloaded

Use the following command to extract the Spark tar file:

$ tar xvf spark-2.4.4-bin-hadoop2.7.tgz

Move Apache Spark software files

Use the following command to move the Spark software files to their own directory (/usr/local/spark):

$ sudo mv spark-2.4.4-bin-hadoop2.7 /usr/local/spark

Set up the environment for Apache Spark

Edit the bashrc file using this command:

$ gedit ~/.bashrc

Add the following line to the file. This adds the directory where the Spark binaries are located to the PATH variable (note there are no spaces around the = sign):

export PATH=$PATH:/usr/local/spark/bin

Now source the ~/.bashrc file so the change takes effect:

$ source ~/.bashrc
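
To confirm that the new PATH entry works, ask Spark for its version from any directory:

$ spark-submit --version

If this prints version 2.4.4 instead of a “command not found” error, the environment is set up correctly.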

11th Step:

Apache Spark Master Configuration (do this step on the Master VM only)

Edit spark-env.sh

Move to the Spark conf folder, create a copy of the spark-env.sh template and rename it:

$ cd /usr/local/spark/conf
$ cp spark-env.sh.template spark-env.sh

Now edit the configuration file spark-env.sh.

$ sudo vim spark-env.sh

And add the following parameters:

export SPARK_MASTER_HOST='<MASTER-IP>'
export JAVA_HOME=<Path_of_JAVA_installation>
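
As a concrete example, on the master used in this tutorial (IP 192.168.205.10), and assuming the OpenJDK 11 path from the 7th step, the two lines would look like this (your JAVA_HOME may differ):

export SPARK_MASTER_HOST='192.168.205.10'
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64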

Add Workers

Edit the configuration file slaves in /usr/local/spark/conf.

$ sudo nano slaves

And add one line per worker machine (the two slaves):

pd-slave1
pd-slave2

12th Step:

Let’s try to start our Apache Spark cluster; hopefully everything is OK!

To start the Spark cluster, run the following commands on the master:

$ cd /usr/local/spark
$ ./sbin/start-all.sh

I won’t stop it, but in case you want to stop the cluster, this is the command:

$ ./sbin/stop-all.sh

13th Step:

To check if the services started, we use the command:

$ jps
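
On the master you should see a Master process in the list, and on each slave a Worker process. You can check the slaves from the master session over SSH, for example:

$ ssh pd-slave1 jps
$ ssh pd-slave2 jps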

14th Step:

Browse the Spark UI to see the state of your cluster. To do this, go to your browser and type:

http://<MASTER-IP>:8080/

As you can see, we have 2 Alive Workers (our slaves), which means it’s all done!
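
As a final check, you can submit one of the bundled examples to the cluster from the master. A minimal run, assuming the examples jar shipped with this Spark 2.4.4 download (the jar name differs for other versions):

$ cd /usr/local/spark
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master spark://192.168.205.10:7077 \
    ./examples/jars/spark-examples_2.11-2.4.4.jar 100

If the job finishes and prints an approximation of Pi, the workers are accepting and executing tasks.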

Final considerations:

Hopefully you managed to successfully follow this tutorial and have a perfectly working Apache Spark Cluster.

If you have any doubts, feel free to ask me.

See you later!
