Install Apache Spark on Ubuntu
This blog explains how to install Apache Spark on a single-node Ubuntu machine.
Apache Spark is an open-source, general-purpose, multi-language analytics engine for large-scale data processing. It runs on a single node or across a cluster, keeping working data in memory (RAM) so that queries over large datasets stay fast.
Prerequisites
Create an Ubuntu EC2 m4.xlarge instance on AWS and open the SSH and HTTP ports.
OS: Linux/Ubuntu is supported as a development and deployment platform.
Storage: at least 20 GB of free space.
RAM: at least 8 GB of RAM (a quick check for both is shown below).
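You can confirm the available disk space and memory on the instance with standard Linux commands:
$ df -h /     # free space on the root filesystem
$ free -h     # total and available RAM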
Spark Architecture
Apache Spark follows a master/slave architecture with two main daemons and a cluster manager:
- Master Daemon (Master/Driver process)
- Worker Daemon (Slave/Worker process)
- Cluster Manager
A Spark cluster has a single Master and any number of Slaves/Workers. The driver and the executors each run in their own Java processes, and they can run on the same machine (a horizontal Spark cluster), on separate machines (a vertical Spark cluster), or in any mixed configuration. How the driver, executors, and cluster manager fit together is easiest to see in a spark-submit invocation, sketched below.
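In this sketch the master hostname, application class, and jar path are only placeholders, not part of this guide's setup:
$ spark-submit \
    --master spark://<server-hostname>:7077 \
    --driver-memory 2g \
    --executor-memory 2g \
    --total-executor-cores 2 \
    --class com.example.MyApp \
    /path/to/my-app.jar
Here --driver-memory sizes the driver process, while --executor-memory and --total-executor-cores size the executor processes that the cluster manager launches on the workers.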
Steps to install Apache Spark
1. Install Java
Update system packages.
$ sudo apt update
Install Java.
$ sudo apt install default-jdk -y
Verify the Java version.
$ java -version
2. Install Apache Spark
Install required packages.
$ sudo apt install curl mlocate git scala -y
Download the Apache Spark 3.2.0 binary package from the Apache archive.
$ curl -O https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
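Optionally, verify the download. The Apache archive publishes a SHA-512 checksum next to each tarball (the .sha512 URL below assumes that convention); compute the local checksum and compare it with the published value:
$ curl -O https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz.sha512
$ sha512sum spark-3.2.0-bin-hadoop3.2.tgz
$ cat spark-3.2.0-bin-hadoop3.2.tgz.sha512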
Extract the Spark tarball.
$ sudo tar xvf spark-3.2.0-bin-hadoop3.2.tgz
Create an installation directory /opt/spark, move the extracted files into it, and change the permissions.
$ sudo mkdir /opt/spark
$ sudo mv spark-3.2.0-bin-hadoop3.2/* /opt/spark
$ sudo chmod -R 777 /opt/spark
Edit the .bashrc configuration file to add the Apache Spark installation directory to the system path.
$ vim ~/.bashrc
Add the lines below at the end of the file, then save and exit:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Source the file to bring the changes into effect.
$ source ~/.bashrc
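To confirm that the new path is active, check the environment variable and the Spark version:
$ echo $SPARK_HOME
$ spark-submit --version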
Start the standalone master server.
$ start-master.sh
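The master listens on port 7077 and prints its URL (of the form spark://<hostname>:7077) both in its log file and at the top of its web UI; the log file name below follows Spark's usual naming pattern:
$ grep "Starting Spark master" /opt/spark/logs/spark-*-org.apache.spark.deploy.master.Master-*.out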
Start the Apache Spark worker process, passing it the master URL (in Spark 3.x, start-slave.sh has been renamed to start-worker.sh).
$ start-worker.sh spark://<server-hostname>:7077
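Both daemons run as Java processes, so you can confirm they are up with jps (included with the JDK installed earlier); the process IDs below are only placeholders:
$ jps
12345 Master
12346 Worker
12400 Jps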
Testing the Installation
Create an RDD from the Spark shell to test the installation, as in the sketch below.
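A minimal check, assuming the standalone master started above (replace <server-hostname> with your server's hostname); the numbers are arbitrary sample data:
$ spark-shell --master spark://<server-hostname>:7077
scala> val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))  // distribute a small local collection
scala> rdd.count()                                    // action: returns 5
scala> rdd.map(_ * 2).collect()                       // returns Array(2, 4, 6, 8, 10)
scala> :quit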
Spark Web UI
Browse the Spark master web UI to see the worker nodes, running applications, and cluster resources:
http://<server-public-IP>:8080
Thank You!!