Set up Apache Spark on a Multi-Node Cluster

This blog explains how to install Apache Spark on a multi-node cluster, with step-by-step instructions to deploy and configure Spark on a real multi-node cluster.

Spark is a fast and powerful framework that provides an API for massive distributed processing over resilient distributed datasets.

Recommended Platform

OS - Linux is supported as a development and deployment platform. Windows is supported only as a development platform.

For an Apache Spark installation on a multi-node cluster, we need multiple nodes; for that we can use multiple physical machines or AWS EC2 instances.

Spark Architecture

Apache Spark follows a master/slave architecture with two main daemons and a cluster manager –

  • Master Daemon – (Master/Driver Process)
  • Worker Daemon – (Slave Process)
  • Cluster Manager

A Spark cluster has a single master and any number of slaves/workers. The driver and the executors run as their own Java processes, and users can run them on the same machine (a horizontal Spark cluster), on separate machines (a vertical Spark cluster), or in a mixed machine configuration.
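
For example, once the standalone cluster described in this guide is running, an application can be submitted to the master, which schedules executors on the workers. A minimal sketch, assuming Spark is installed under /usr/local/spark as in the steps below (7077 is the default standalone master port, and the examples jar name may vary slightly with the exact download):

$ /usr/local/spark/bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master spark://<MASTER-IP>:7077 \
    /usr/local/spark/examples/jars/spark-examples_2.11-2.3.0.jar 100

In the default client deploy mode the driver runs on the machine where spark-submit is invoked, while the executors run on the worker nodes.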

Prerequisites

Create a user with the same name on the master and all slaves to make your tasks easier during SSH, and switch to that user on the master.
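
A minimal sketch of what that looks like on Ubuntu, run on every node (the username sparkuser is only an example; use any name, but keep it identical on all nodes):

$ sudo adduser sparkuser
$ su - sparkuser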

Add entries in hosts file (master and slaves)

Edit the hosts file.

$ sudo vim /etc/hosts

Now add the entries for the master and slaves to the hosts file.

<MASTER-IP> master
<SLAVE01-IP> slave01
<SLAVE02-IP> slave02
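
For example, with hypothetical private IP addresses (substitute your own):

192.168.1.10 master
192.168.1.11 slave01
192.168.1.12 slave02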

Install Java 8 (master and slaves)

Spark 2.3.0 requires Java 8 or later. On Ubuntu, OpenJDK 8 can be installed from the standard repositories.

$ sudo apt-get update
$ sudo apt-get install openjdk-8-jdk

To check whether Java is installed, run the following command.

$ java -version

Install Scala (master and slaves)

$ sudo apt-get install scala

To check whether Scala is installed, run the following command.

$ scala -version

Configure SSH (only master)

Install the OpenSSH server and client

$ sudo apt-get install openssh-server openssh-client

Generate key pairs

$ ssh-keygen -t rsa -P ""

Configure passwordless SSH

Copy the content of .ssh/id_rsa.pub (of master) to .ssh/authorized_keys (of all the slaves as well as master).
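
One way to do this is with ssh-copy-id, run from the master for each node (including the master itself); it appends the master's public key to the remote authorized_keys file. The user name sparkuser below is the example user from the prerequisites:

$ ssh-copy-id sparkuser@master
$ ssh-copy-id sparkuser@slave01
$ ssh-copy-id sparkuser@slave02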

Check passwordless SSH to all the slaves

$ ssh slave01
$ ssh slave02

Install Spark

Download Spark

Use the following command to download Apache Spark 2.3.0, pre-built for Hadoop 2.7.

$ wget http://www-us.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
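
If the mirror above no longer hosts this release (older versions are removed from the mirrors over time), the same file is available from the Apache archive:

$ wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz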

Extract Spark tar

Use the following command to extract the Spark tar file.

$ tar xvf spark-2.3.0-bin-hadoop2.7.tgz

Move Spark software files

Use the following command to move the Spark software files to the /usr/local/spark directory.

$ sudo mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark
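
Optionally, give the cluster user ownership of the Spark directory so the configuration files can be edited later without sudo (sparkuser is the example user from the prerequisites):

$ sudo chown -R sparkuser:sparkuser /usr/local/spark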

Set up the environment for Spark

Edit the .bashrc file.

$ sudo vim ~/.bashrc

Add the following line to the ~/.bashrc file. This adds the location of the Spark binaries to the PATH variable.

export PATH=$PATH:/usr/local/spark/bin

Use the following command to source the ~/.bashrc file.

$ source ~/.bashrc

Note: The whole Spark installation procedure must be done on the master as well as on all slaves.
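
To confirm that Spark is on the PATH on each node, print the version; it should report 2.3.0:

$ spark-submit --version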

Spark Master Configuration

Do the following procedures only on the master.

Edit spark-env.sh

Move to the Spark conf folder, create a copy of the spark-env.sh template, and rename it to spark-env.sh.

$ cd /usr/local/spark/conf
$ cp spark-env.sh.template spark-env.sh

Now edit the configuration file spark-env.sh.

$ sudo vim spark-env.sh

And set the following parameters.

export SPARK_MASTER_HOST='<MASTER-IP>'
export JAVA_HOME=<Path_of_JAVA_installation>
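
For example, with the hypothetical master IP used earlier and OpenJDK 8 in its default Ubuntu location (adjust both values for your machines):

export SPARK_MASTER_HOST='192.168.1.10'
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64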

Add Workers

Edit the configuration file slaves in /usr/local/spark/conf.

$ sudo vim slaves

And add the following entries. Listing master here as well means a worker daemon will also run on the master node.

master
slave01
slave02

Start Spark Cluster

To start the Spark cluster, run the following commands on the master. The start-all.sh script starts a master daemon on this machine and, using the slaves file and passwordless SSH, a worker daemon on every listed node.

$ cd /usr/local/spark
$ ./sbin/start-all.sh

To stop the Spark cluster, run the following commands on the master.

$ cd /usr/local/spark
$ ./sbin/stop-all.sh
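
If you prefer to start the daemons individually instead of using start-all.sh, the standalone scripts can be run directly: start-master.sh on the master and start-slave.sh on each worker, pointing it at the master URL.

$ ./sbin/start-master.sh
$ ./sbin/start-slave.sh spark://<MASTER-IP>:7077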

Check whether services have been started

To check the daemons on the master and slaves, use the following command.

$ jps
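
On the master you should see both a Master and a Worker process (since master is also listed in the slaves file), while each slave shows only a Worker. The output looks roughly like this (the PIDs are hypothetical):

20462 Master
20602 Worker
20934 Jps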

Spark Web UI

Browse the Spark UI to see the worker nodes, the running applications, and the cluster resources.

Spark Master UI

http://<MASTER-IP>:8080/

Spark Application UI

http://<MASTER_IP>:4040/

The application UI is only available while an application (for example, a spark-shell session) is running. You can proceed further with Spark shell commands to play with Spark.
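
A quick way to try the cluster is to launch spark-shell against the standalone master and run a small job; the sum below is computed across the executors and should come back as 500500.0:

$ spark-shell --master spark://<MASTER-IP>:7077

scala> sc.parallelize(1 to 1000).sum()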