Set up Apache Spark on a Multi-Node Cluster
This blog explains how to install Apache Spark on a multi-node cluster. This guide provides step by step instructions to deploy and configure Apache Spark on the real multi-node cluster.
Spark is a fast and powerful framework that provides an API to perform massive distributed processing over resilient sets of data.
OS - Linux is supported as development and deployment platform. Windows is supported as a dev platform.
For Apache Spark installation on a multi-node cluster, we will be needing multiple nodes, for that we can use multiple machines or AWS instances.
Apache Spark follows a master/slave architecture with two main daemons and a cluster manager –
- Master Daemon — (Master/Driver Process)
- Worker Daemon –(Slave Process)
- Cluster Manager
A spark cluster has a single Master and any number of Slaves/Workers. The driver and the executors run their individual Java processes and users can run them on the same horizontal spark cluster or on separate machines i.e. in a vertical spark cluster or in mixed machine configuration.
Create a user of same name in master and all slaves to make your tasks easier during ssh and also switch to that user in master.
Add entries in hosts file (master and slaves)
Edit hosts file.
$ sudo vim /etc/hosts
Now add entries of master and slaves in hosts file.
Install Java 7 (master and slaves)
$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
To check if java is installed, run the following command.
$ java -version
Install Scala (master and slaves)
$ sudo apt-get install scala
To check if scala is installed, run the following command.
$ scala -version
Configure SSH (only master)
Install Open SSH Server-Client
$ sudo apt-get install openssh-server openssh-client
Generate key pairs
$ ssh-keygen -t rsa -P ""
Configure passwordless SSH
Copy the content of .ssh/id_rsa.pub (of master) to .ssh/authorized_keys (of all the slaves as well as master).
Check by SSH to all the slaves
$ ssh slave01
$ ssh slave02
Download latest version of Spark
Use the following command to download latest version of apache spark.
Extract Spark tar
Use the following command for extracting the spark tar file.
$ tar xvf spark-2.3.0-bin-hadoop2.7.tgz
Move Spark software files
Use the following command to move the spark software files to respective directory (/usr/local/bin)
$ sudo mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark
Set up the environment for Spark
Edit bashrc file.
$ sudo vim ~/.bashrc
Add the following line to ~/.bashrc file. It means adding the location, where the spark software file are located to the PATH variable.
export PATH = $PATH:/usr/local/spark/bin
Use the following command for sourcing the ~/.bashrc file.
$ source ~/.bashrc
Note: The whole spark installation procedure must be done in master as well as in all slaves.
Spark Master Configuration
Do the following procedures only in master.
Move to spark conf folder and create a copy of template of spark-env.sh and rename it.
$ cd /usr/local/spark/conf
$ cp spark-env.sh.template spark-env.sh
Now edit the configuration file spark-env.sh.
$ sudo vim spark-env.sh
And set the following parameters.
Edit the configuration file slaves in (/usr/local/spark/conf).
$ sudo vim slaves
And add the following entries.
Start Spark Cluster
To start the spark cluster, run the following command on master.
$ cd /usr/local/spark
To stop the spark cluster, run the following command on master.
$ cd /usr/local/spark
Check whether services have been started
To check daemons on master and slaves, use the following command.
Spark Web UI
Browse the Spark UI to know about worker nodes, running application, cluster resources.
Spark Master UI
Spark Application UI
You can proceed further with Spark shell commands to play with Spark.