Install Apache Spark on Ubuntu

This blog explains how to install Apache Spark on a single-node Ubuntu machine.

Amit Patil
3 min read · May 31, 2023

Apache Spark is an open-source, general-purpose, multi-language analytics engine for large-scale data processing. It runs on a single node or across a cluster, keeping data in memory (RAM) to execute queries over large datasets quickly.

Prerequisites

Create an Ubuntu EC2 m4.xlarge instance on AWS and open the SSH and HTTP ports in its security group (open port 8080 as well if you want to reach the Spark web UI from a browser later).

OS: Linux/Ubuntu is supported as a development and deployment platform.
Storage: at least 20 GB of free disk space.
RAM: at least 8 GB.
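
If you want to confirm the instance meets these requirements before continuing, two standard commands are enough (nothing Spark-specific here):

$ df -h /     # free space on the root filesystem
$ free -h     # total and available RAM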

Spark Architecture

Apache Spark follows a master/slave architecture with two main daemons and a cluster manager:

  • Master Daemon (Master/Driver Process)
  • Worker Daemon (Slave Process)
  • Cluster Manager

Fig. Spark architecture.

A Spark cluster has a single master and any number of slaves/workers. The driver and the executors each run in their own Java process; they can all run on the same machine (a horizontal Spark cluster), on separate machines (a vertical Spark cluster), or in any mixed configuration.
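
To make these roles concrete, here is a minimal sketch of submitting an application to the standalone cluster once it is running (set up in the steps below). The memory and core values are only illustrative, and the examples jar name assumes the Spark 3.2.0 / Scala 2.12 build downloaded later:

# The driver registers with the master at spark://<hostname>:7077 (the default);
# the cluster manager then starts executors on the workers with the requested resources.
$ spark-submit \
    --master spark://$(hostname):7077 \
    --executor-memory 1G \
    --total-executor-cores 2 \
    --class org.apache.spark.examples.SparkPi \
    /opt/spark/examples/jars/spark-examples_2.12-3.2.0.jar 100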

Steps to install Apache Spark

1. Install Java

Update system packages.

$ sudo apt update

Install Java.

$ sudo apt install default-jdk -y

Verify the Java version.

$ java -version

2. Install Apache Spark

Install the required packages, then download the Spark 3.2.0 release from the Apache archive.

$ sudo apt install curl mlocate git scala -y
$ curl -O https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz

Extract the Spark tarball.

$ sudo tar xvf spark-3.2.0-bin-hadoop3.2.tgz

Create the installation directory /opt/spark, move the extracted files into it, and adjust the permissions.

$ sudo mkdir /opt/spark
$ sudo mv spark-3.2.0-bin-hadoop3.2/* /opt/spark
$ sudo chmod -R 777 /opt/spark
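
As an optional sanity check, the Spark launcher scripts should now be visible in the installation directory:

$ ls /opt/spark/bin     # should list spark-shell, spark-submit, pyspark, etc.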

Edit the .bashrc configuration file to add the Apache Spark installation directory to the system PATH.

$ vim ~/.bashrc

Add the lines below at the end of the file, then save and exit:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Reload the file so the changes take effect in the current shell.

$ source ~/.bashrc
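
To confirm the new PATH entries are picked up, ask Spark for its version (it should report 3.2.0 here):

$ spark-submit --version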

Start the standalone master server.

$ start-master.sh
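
By default the standalone master accepts workers on port 7077 and serves its web UI on port 8080, so a quick local check that it came up could be:

$ curl -sI http://localhost:8080 | head -1     # expect an HTTP 200 status line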

Start the Apache Spark worker process and point it at the master URL, which defaults to spark://<hostname>:7077 and is shown at the top of the master web UI (on Spark 3.x the same script is also available as start-worker.sh).

$ start-slave.sh spark://$(hostname):7077

Testing Installation:
Create an RDD to test from the CLI.

Fig. Spark CLI
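
A minimal sketch of what such a test can look like in spark-shell (Scala; the example RDD is only illustrative):

$ spark-shell

scala> val data = sc.parallelize(1 to 100)    // distribute a local range as an RDD
scala> data.count()                           // action: returns 100
scala> data.filter(_ % 2 == 0).take(5)        // transformation + action on the RDD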

Spark Web UI

Browse the Spark web UI to see the worker nodes, running applications, and cluster resources.

Open http://<server-public-IP>:8080 in your browser.

Fig. Spark Master UI.

Thank You!!
