Setting Up Spark Cluster and Submitting Your First Spark Job
Before diving into the technical discussion, we first need to understand what Apache Spark is and what can be done with it.
Apache Spark is a distributed processing system developed to handle Big Data workloads, much like other Big Data tools such as Hadoop, Hive, and Cassandra. Apache Spark can be used for use cases such as data integration and ETL, high-performance batch processing, machine learning, and real-time stream processing.
Spark Architecture
The Apache Spark architecture comprises three main components: the Driver Program, the Cluster Manager, and the Worker Node(s).
The SparkContext, which resides in the Driver Program, connects to the Cluster Manager to allocate resources across applications. The cluster manager can be Spark's own standalone cluster manager, Mesos, YARN, or Kubernetes. Once the connection is established, the SparkContext gets access to the Worker Nodes, which store and process data.
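To make this concrete, below is a minimal sketch of how a driver program creates a SparkSession (which wraps the SparkContext) and tells it which cluster manager to connect to. The host master-host:7077 and the application name are placeholders, not values from this article.
from pyspark.sql import SparkSession

# Placeholder master URL; replace with your own standalone master, YARN, etc.
spark = (
    SparkSession.builder
    .appName("my-first-app")
    .master("spark://master-host:7077")
    .getOrCreate()
)
print(spark.sparkContext.uiWebUrl)  # the driver's own web UI
spark.stop()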
A detailed explanation of the Apache Spark architecture and its modules can be read here.
Spark APIs
At the heart of Apache Spark there are four libraries.
Spark SQL
Over the years the Apache Spark project has matured, and Spark SQL has become increasingly important as the interface for accessing structured data. Spark SQL uses a DataFrame-based approach borrowed from R and Python (pandas).
In Spark SQL we can use standard SQL to fetch and filter data, and we can also use DataFrame functions similar to those found in pandas.
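As a quick illustration, here is a small self-contained PySpark sketch; the table name, column names, and data are made up for the example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 45), ("Cathy", 29)], ["name", "age"])
df.createOrReplaceTempView("people")

# The same query expressed in SQL and with DataFrame functions
spark.sql("SELECT name, age FROM people WHERE age > 30").show()
df.filter(df.age > 30).select("name", "age").show()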
Detailed documentation of Spark SQL can be read here.
Spark Streaming
The Spark Streaming API is used to handle streaming data from source systems to provide real-time analytics. Earlier, processing data streams in real time was a cumbersome task in frameworks like Hadoop and required multiple third-party tools.
Spark, however, processes streams in micro-batches and performs in-memory computation, which reduces the overhead significantly compared to other stream-processing libraries.
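For example, the sketch below uses Spark's Structured Streaming API (the newer, DataFrame-based streaming interface) to count words arriving on a local socket; the host localhost and port 9999 are assumptions, e.g. a stream opened with nc -lk 9999.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Read a text stream from a local socket (assumed to be opened with: nc -lk 9999)
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

# Split each line into words and maintain a running count per word
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Each micro-batch updates the counts and prints them to the console
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()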
Detailed documentation on Spark Streaming can be read here.
Spark MLlib
Spark MLlib comprises almost all the components of an ML pipeline: data pre-processing, feature extraction, feature selection, feature transformations, and almost all machine learning models (except for deep learning models). The advantage MLlib has over other machine learning libraries is that it runs on Spark's built-in distributed ecosystem, so the same pipeline can scale to terabytes of data across a cluster.
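To give a feel for the API, here is a minimal sketch of an MLlib pipeline on a tiny made-up dataset; the feature columns, label, and choice of model are only illustrative.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Toy dataset: two numeric features and a binary label
data = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 1.0), (3.0, 4.0, 0.0), (5.0, 2.0, 1.0)],
    ["f1", "f2", "label"],
)

# Assemble raw columns into a feature vector, then fit a classifier
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(data)
model.transform(data).select("label", "prediction").show()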
Detailed documentation of the MLlib library can be seen here.
Spark GraphX
Apache Spark also provides various algorithms for processing graph-based structures. The GraphFrames package helps in performing graph operations on DataFrames.
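As a small illustration, the sketch below builds a GraphFrame from two DataFrames. Note that GraphFrames is an external package that has to be pulled in with --packages; the package version string in the comment is only an example and depends on your Spark and Scala build.
# Launch PySpark with the GraphFrames package, for example:
#   pyspark --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12
#   (the version string above is an example; pick the one matching your Spark/Scala build)
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graph-demo").getOrCreate()
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame([("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()  # number of incoming edges per vertex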
Detailed documentation of the Spark GraphX library can be seen here.
Deployment of Spark Cluster Locally
Over the years, various Big Data tools have been made available as managed services by major cloud providers, such as Google Cloud Dataproc, Amazon EMR, and Azure HDInsight. However, as a newbie you may not have the resources to pay for those subscriptions. Therefore, we will focus on setting up a Spark cluster on a local (Ubuntu) machine with the Spark standalone cluster manager.
Install Dependencies
Before installing Apache Spark, there are a few dependencies that need to be installed:
- JDK
- Scala
- Git
To install the dependencies run the following command in the terminal:
sudo apt install default-jdk scala git -y
Once the installation is complete, verify it using the following command:
java -version; javac -version; scala -version; git --version
Download and Setup Spark Binaries
First, we need to download the Spark binaries, which are compressed in .tgz format. To download the compressed file, run the following command in the terminal:
wget https://downloads.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz
Once the download is complete, extract the compressed file using the command:
tar xvf spark-*
We then move the decompressed files to the /opt/spark directory using the command:
sudo mv spark-3.2.1-bin-hadoop2.7 /opt/spark
Configure Spark Environment Variables
Before starting the Spark nodes, we first need to configure environment variables so that the Spark utilities are accessible from the command line. Run the commands below to set up the environment variables:
echo "export SPARK_HOME=/opt/spark" >> ~/.profile
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.profile
echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile
(The single quotes on the PATH line keep $PATH and $SPARK_HOME from being expanded until the profile is loaded.)
Once the environment variables are set up, reload the profile using the command:
source ~/.profile
Start the Master Server (Standalone Cluster Manager)
As we discussed at the start of the article, the SparkContext in the Driver Program connects to a cluster manager. We will now start Spark's standalone cluster manager, the master server, using the command:
start-master.sh
Once the master node has started, you can have a look at its web interface at:
http://127.0.0.1:8080/
The interface would look like this:
Start Worker Node (Slave Server)
Now that the master node is set up, we move on to the Worker Node, which loads and processes the data. Since we are using our local hardware, we will only set up one worker (slave) node. To start the worker node, run the command below (in recent Spark releases this script is also available as start-worker.sh):
start-slave.sh spark://master:port
The master hostname and port in my case are nauyan-hp-probook-640-g2:7077, so the above command becomes:
start-slave.sh spark://nauyan-hp-probook-640-g2:7077
Once you have run the above command, you will see in the web interface that the number of workers has changed from 0 to 1.
You can read more about master and slave node configuration in this article.
Running Code on Spark Local Cluster
As mentioned earlier, Apache Spark gives users the flexibility to write code in various languages. For simplicity, we will run Python code to demonstrate how work is distributed to the worker node for execution. First, we open a PySpark shell using the command:
pyspark --master spark://nauyan-HP-ProBook-640-G2:7077
In the above command, we launch PySpark, the Python shell bundled with Spark, and point it at the host and port where the master node is running. Once connected, any Spark operation we execute in the shell is automatically distributed across the cluster.
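As a quick sanity check, you can run a small computation inside the shell; in the PySpark shell the variables sc (SparkContext) and spark (SparkSession) are already defined, so a snippet like the one below should work:
rdd = sc.parallelize(range(1, 1000001), numSlices=8)  # split the data into 8 partitions
print(rdd.map(lambda x: x * x).sum())                 # the tasks are executed on the worker node
print(sc.master)                                      # confirms which master URL the shell is using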
The screenshot below shows that a PySpark application was run on Spark cluster.
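Besides the interactive shell, you can also package the same logic in a script and submit it as a job with spark-submit; the file name first_job.py below is just a hypothetical example.
# first_job.py (hypothetical example script)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("first-job").getOrCreate()
evens = spark.sparkContext.parallelize(range(1, 1000001)).filter(lambda x: x % 2 == 0).count()
print("Even numbers counted:", evens)
spark.stop()
Submit it to the cluster with the same master URL used above:
spark-submit --master spark://nauyan-HP-ProBook-640-G2:7077 first_job.py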
Concluding Remarks
I am currently working on a project that involves a large volume of data. I had no prior experience with Big Data tools, as my core expertise is in Computer Vision, but I was keen on learning new technologies, so I watched numerous videos and tutorials to understand how Big Data tools work and what their core components are.
In this article, I have focused on setting up a Spark cluster on your local machine so that you can play around with your data pipelines without paying anything; once you are done, you can deploy them in cloud-based environments.
Since I am new to this domain, I would welcome constructive criticism from the audience.
References
- https://medium.com/@ashish1512/what-is-apache-spark-e41700980615
- https://spark.apache.org/docs/3.2.1/cluster-overview.html#components
- https://spark.apache.org/sql/
- https://spark.apache.org/streaming/
- https://spark.apache.org/mllib/
- https://spark.apache.org/graphx/
- https://phoenixnap.com/kb/install-spark-on-ubuntu
About Syed Nauyan Rashid
- Linkedin Profile: https://www.linkedin.com/in/nauyan/
- Github Profile: https://github.com/nauyan
About Red Buffer
- Website — redbuffer.net
- Linkedin Page — linkedin.com/company/red-buffer