Install Apache Spark on Ubuntu
This blog explains how to install Apache Spark on a single-node Ubuntu machine.
Apache Spark is an open-source, general-purpose, multi-language analytics engine for large-scale data processing. It runs on a single node or across a cluster, keeping working data in memory (RAM) so that queries over large datasets stay fast.
Prerequisites
Create an Ubuntu EC2 m4.xlarge instance on AWS and open the SSH and HTTP ports.
OS: Linux/Ubuntu is supported as a development and deployment platform.
Storage: at least 20 GB of free space.
RAM: at least 8 GB of RAM (a quick check for both is shown below).
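You can confirm the available disk space and memory on the instance with standard Linux commands:
$ df -h /     # free space on the root filesystem
$ free -h     # total and available RAM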
Spark Architecture
Apache Spark follows a master/slave architecture with two main daemons and a cluster manager:
- Master Daemon (Master/Driver process)
- Worker Daemon (Slave/Worker process)
- Cluster Manager
A Spark cluster has a single Master and any number of Slaves/Workers. The driver and the executors each run in their own Java processes, and they can run on the same machine (a horizontal Spark cluster), on separate machines (a vertical Spark cluster), or in any mixed configuration. How the driver, executors, and cluster manager fit together is easiest to see in a spark-submit invocation, sketched below.
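In this sketch the master hostname, application class, and jar path are only placeholders, not part of this guide's setup:
$ spark-submit \
    --master spark://<server-hostname>:7077 \
    --driver-memory 2g \
    --executor-memory 2g \
    --total-executor-cores 2 \
    --class com.example.MyApp \
    /path/to/my-app.jar
Here --driver-memory sizes the driver process, while --executor-memory and --total-executor-cores size the executor processes that the cluster manager launches on the workers.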
Steps to install Apache Spark
1. Install Java
Update system packages.
$ sudo apt update
Install Java.
$ sudo apt install default-jdk -y
Verify the Java version.
$ java -version
2. Install Apache Spark
Install required packages.
$ sudo apt install curl mlocate git scala -y
Download the Apache Spark 3.2.0 binary package from the Apache archive.
$ curl -O https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
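Optionally, verify the download. The Apache archive publishes a SHA-512 checksum next to each tarball (the .sha512 URL below assumes that convention); compute the local checksum and compare it with the published value:
$ curl -O https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz.sha512
$ sha512sum spark-3.2.0-bin-hadoop3.2.tgz
$ cat spark-3.2.0-bin-hadoop3.2.tgz.sha512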
Extract the Spark tarball.
$ sudo tar xvf spark-3.2.0-bin-hadoop3.2.tgz
Create an installation directory /opt/spark, move the extracted files into it, and change the permissions.
$ sudo mkdir /opt/spark
$ sudo mv spark-3.2.0-bin-hadoop3.2/* /opt/spark
$ sudo chmod -R 777 /opt/spark
Edit the .bashrc configuration file to add the Apache Spark installation directory to the system path.
$ vim ~/.bashrc
Add the lines below at the end of the file, then save and exit:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Source the file to bring the changes into effect.
$ source ~/.bashrc
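To confirm that the new path is active, check the environment variable and the Spark version:
$ echo $SPARK_HOME
$ spark-submit --version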
Start the standalone master server.
$ start-master.sh
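The master listens on port 7077 and prints its URL (of the form spark://<hostname>:7077) both in its log file and at the top of its web UI; the log file name below follows Spark's usual naming pattern:
$ grep "Starting Spark master" /opt/spark/logs/spark-*-org.apache.spark.deploy.master.Master-*.out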
Start the Apache Spark worker process, passing it the master URL (in Spark 3.x, start-slave.sh has been renamed to start-worker.sh).
$ start-worker.sh spark://<server-hostname>:7077
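Both daemons run as Java processes, so you can confirm they are up with jps (included with the JDK installed earlier); the process IDs below are only placeholders:
$ jps
12345 Master
12346 Worker
12400 Jps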
Testing the Installation
Create an RDD from the Spark shell to test the installation, as in the sketch below.
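A minimal check, assuming the standalone master started above (replace <server-hostname> with your server's hostname); the numbers are arbitrary sample data:
$ spark-shell --master spark://<server-hostname>:7077
scala> val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))  // distribute a small local collection
scala> rdd.count()                                    // action: returns 5
scala> rdd.map(_ * 2).collect()                       // returns Array(2, 4, 6, 8, 10)
scala> :quit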
Spark Web UI
Browse the Spark master web UI to see the worker nodes, running applications, and cluster resources:
http://<server-public-IP>:8080
Thank You!!