Getting started with Apache Spark II

Udbhav Pangotra · Geek Culture · Jan 5, 2022 · 6 min read

The second installment in your Spark journey!

Now that we have a brief understanding of distributed computing, we move on to Hadoop. Hadoop is an open-source framework for building applications that process large amounts of data in parallel, distributed across multiple nodes. It uses the MapReduce programming model to process these large amounts of data in parallel.

MapReduce
MapReduce has three execution stages, and the input of each stage is a key-value pair <key,value>

  1. Map
    The map (or mapper) job processes the input data, which is stored in HDFS as files or directories. The mapper function processes the files line by line.
  2. Shuffle/Sort
    This phase consists of two main steps: sorting and merging the key-value pairs based on the key. The output of this phase is each key together with its grouped values.
  3. Reduce
    The reduce phase aggregates the intermediate values into a smaller set of output values.

Now we will go through one of the most common examples to understand how MapReduce works: the word count problem.
Problem statement: we need to count the occurrences of each word in an input file and display each word along with its count.

So we have an input file at one end and an output file at the other, and we need to count how many times each word occurs. Hadoop first splits the input file into blocks (128 MB by default). The Map function emits a key-value pair for every word in each partition. After this comes the shuffle stage, where pairs with the same key are moved to the same partition. We then reduce based on the key, aggregating the values for each key into a single count per word. Finally the partial results are merged and collated into one output file. A minimal sketch of the mapper and reducer follows.
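To make the stages concrete, here is a minimal sketch of what the mapper and reducer could look like, written as two Hadoop Streaming style Python scripts (the file names mapper.py and reducer.py are just placeholders):

# mapper.py (hypothetical file name): emit a (word, 1) pair for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py (hypothetical file name): input arrives sorted by key, so counts can be summed per word
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

With Hadoop Streaming, these two scripts would be passed via the -mapper and -reducer options, and Hadoop takes care of the splitting, shuffling and merging in between.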

Hadoop Ecosystem

The architecture of Hadoop can be divided into four distinctive layers:
1. Storage layer
2. Resource Management layer
3. Compute
4. APIs

HDFS is the storage layer; YARN acts as the resource management layer; Tez, Spark and MapReduce form the compute layer; Hive, HBase, Pig and others provide the APIs.

Why do we need Spark?

Apache Spark is an in-memory data processing engine that allows workers to process large-scale data efficiently in standalone or clustered configurations. It is widely used as an industry standard because of the following features:
1. In-Memory processing
2. Low Latency
3. Stream Processing
4. Availability of APIs
5. Supports input from multiple sources

Spark vs MapReduce
MapReduce requires data to be stored on HDFS, but this is not required with Spark, since Spark can keep data in memory. MapReduce writes intermediate data to disk, while Spark keeps most of the data in memory and spills over to disk only when memory fills up. The sketch below illustrates this.
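As a small, hypothetical PySpark sketch of this behaviour (the path data.csv is a placeholder), a dataset can be persisted with the MEMORY_AND_DISK storage level so that Spark keeps it in memory and spills to disk only if it does not fit:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Read a dataset once, then keep it around; MEMORY_AND_DISK spills to disk when memory fills up
df = spark.read.csv("data.csv", header=True)   # "data.csv" is a placeholder path
df.persist(StorageLevel.MEMORY_AND_DISK)

# Both actions reuse the persisted data instead of re-reading it from storage
print(df.count())
print(df.distinct().count())

spark.stop()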

Apache Spark Ecosystem

1. Spark Core
Responsible for the I/O functionality, fault recovery, and job and cluster monitoring
2. Spark SQL
Used for structured data analysis: running queries on data and creating tables and views (a short sketch follows this list)

3. Spark Streaming
Used for building analytical and interactive applications on live streaming data
4. Spark MLlib
A machine learning library that provides implementations of many machine learning algorithms
5. GraphX
The graph computation engine for graphs and graph-parallel computations
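As a minimal sketch of the Spark SQL idea (the data here is made up purely for illustration), a DataFrame can be registered as a view and queried with plain SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Build a tiny DataFrame in code so the example is self-contained
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cathy", 29)],
    ["name", "age"],
)

# Register it as a temporary view and query it with SQL
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()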

Spark Language APIs
The Spark language APIs enable you to run Spark code using various programming languages
Scala: Spark itself is primarily written in Scala
Other supported languages are Python, R, SQL and Java

Apache Spark Architecture

Spark is a distributed processing engine, and it follows a master-slave architecture. For each Spark application, Spark creates:

Master — Driver
Slave — Executors

The basic Spark terminology revolves around two concepts: SparkContext and SparkSession.

Spark Session : It serves as the entry point for all Spark functionality
All functionality available with SparkContext is also available with SparkSession. However, if someone prefers to use SparkContext, they can continue to do so. A minimal sketch follows.
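As a minimal sketch (the application name and local master are placeholder choices), this is how a SparkSession is typically created and how the underlying SparkContext can still be reached from it:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession: the entry point to Spark functionality
spark = (
    SparkSession.builder
    .appName("entry-point-demo")
    .master("local[*]")   # local mode here; on a cluster this points to the cluster manager
    .getOrCreate()
)

# The older SparkContext is still accessible through the session
sc = spark.sparkContext
print(sc.appName)

spark.stop()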

Spark Jobs, Stages and Tasks

Application : A user program built on Spark using its APIs. It consists of a driver program and executors on the cluster
Job : A parallel computation consisting of multiple tasks that get spawned in response to a Spark action
Task : The smallest unit of work, performed by an executor
Stage : Each job is divided into stages based on whether its operations run serially or in parallel (see the sketch after this list)
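As a rough, hypothetical illustration of how these terms map onto code (the exact number of stages and tasks depends on the data and configuration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["spark is fast", "spark is simple"])

# Transformations: these only build up the execution plan, nothing runs yet
counts = (
    rdd.flatMap(lambda line: line.split())
       .map(lambda word: (word, 1))
       .reduceByKey(lambda a, b: a + b)   # the shuffle here introduces a stage boundary
)

# Action: collect() triggers a job, the job is split into stages,
# and each stage runs as tasks on the executors
print(counts.collect())

spark.stop()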

Spark Execution Types

There are two ways of executing programs on Spark:
1. Interactive Clients
This includes the Spark shell, the PySpark shell, notebooks, etc. (a short session follows this list)
2. Spark Submit Utility
spark-submit is a utility to submit your Spark program (or job) to a Spark cluster. This is when you submit your whole application as a file to Spark for execution
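For instance, a short interactive session with the PySpark shell could look like this (the shell creates a SparkSession named spark for you):

$ pyspark
>>> spark.range(1, 6).show()    # run Spark code line by line against the ready-made session
>>> spark.range(1, 6).count()
>>> exit()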

Spark Job Types

If we want to deploy an application, Spark offers the following deployment modes:
1. Client Mode: the ‘Driver’ component of the Spark job runs on the machine from which the job is submitted. Hence, this Spark mode is called ‘client mode’.
2. Cluster Mode: here the ‘Driver’ component of the Spark job does not run on the local machine from which the job is submitted. Hence, this Spark mode is called ‘cluster mode’. Cluster mode is the most commonly used mode of operation.
3. Local Mode: both the Driver and the Executors run on a single node. It is usually used for training and educational purposes.

Spark Submit Command —
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
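For example, a hypothetical PySpark application (the file name wordcount.py and the input path are placeholders) could be submitted to a YARN cluster in cluster mode like this:

./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  wordcount.py hdfs:///data/input.txt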

Cluster manager working

On our local system, when we run spark-submit with YARN as the cluster manager, the submission is sent to YARN's Resource Manager. The Resource Manager then starts an Application Master, which works out the application's requirements, such as how many executors are needed, and requests those resources.

The Application Master then launches the executors, each of which runs inside a JVM (Java Virtual Machine).
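As a hypothetical example of how those requirements can be expressed, the executor count and sizing may be passed directly to spark-submit (the values and the file name my_app.py are placeholders):

./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  my_app.py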

In the next article, we will cover topics like Spark data structures.

Other articles you might be interested in:
- Part of this series : Getting started with Apache Spark — I | by Sam | Geek Culture | Jan, 2022 | Medium
- Misc: Streamlit and Palmer Penguins. Binged Atypical last week on Netflix… | by Sam | Geek Culture | Medium
- Getting started with Streamlit. Use Streamlit to explain your EDA and… | by Sam | Geek Culture | Medium

Cheers and do follow for more such content! :)
