Understanding Spark Internal Architecture

- Internal Working components

Divya Chandana
The AI Guide
4 min read · Jan 17, 2023


Introduction

Spark is a general-purpose distributed data processing engine used for big data applications. It uses optimized query execution and in-memory caching to handle data of any size.

Understanding the Terminology

RDD (Resilient Distributed Dataset), the core data structure in Spark, is a fault-tolerant collection of elements that can be operated on in parallel. It is a distributed collection of immutable objects. Each dataset in an RDD is split into logical partitions that can be computed on different nodes of the cluster. In PySpark, an RDD is essentially a set of partitions.

An RDD supports two kinds of operations (a minimal example follows the list):
Transformations are steps performed on the input data, such as map, filter, or sortBy, that create a new dataset from the original one; they are lazy and are not computed until an action is called.
Actions are performed to obtain the desired results from the newly created dataset and return a value to the driver program.
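A minimal sketch of the difference (the names and values here are just illustrative): filter is a transformation that only records the intended computation, while collect and count are actions that run it and return results to the driver.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('rddOpsDemo').getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])

# Transformation: lazily describes a new dataset, nothing is computed yet
evens = nums.filter(lambda n: n % 2 == 0)

# Actions: trigger the computation and return values to the driver
print(evens.collect())   # [2, 4]
print(nums.count())      # 5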

High-Level Architecture of Spark

The driver program collaborates with the cluster manager to assign tasks to the worker nodes. Once all tasks have finished running, the driver node receives the results.

High Level Architecture of Spark
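As an illustrative sketch (the standalone master URL below is a placeholder, not a real cluster), the driver declares which cluster manager it should talk to when the SparkSession is built:

from pyspark.sql import SparkSession

# The master URL tells the driver which cluster manager to contact
spark = (SparkSession.builder
         .appName('architectureDemo')
         .master('local[*]')                 # local mode, useful for experimenting
         # .master('spark://<host>:7077')    # hypothetical standalone cluster manager
         .getOrCreate())

print(spark.sparkContext.master)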

How Spark Components Communicate

Spark calls the main method of the program. The main class, which is the program's entry point, is specified when the application code is written, and the Spark driver executes it.

Following the creation of the Spark driver, the SparkContext creates the RDDs, DataFrames, and Datasets. The driver breaks the program into tasks, which run in the executors and perform the various transformations and actions.

The two primary functions of a driver (a sketch follows the list):
1. creating tasks from the user program
2. scheduling tasks on the executors with the cluster manager's help
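A rough way to see both functions at work (a minimal sketch; the numbers are arbitrary): the driver turns the program below into one task per partition, and those tasks are only scheduled onto executors when the action count() is called.

sc = spark.sparkContext  # assumes a SparkSession named spark already exists (see the earlier sketch)

# 8 partitions -> the driver will create 8 tasks for the action below
rdd = sc.parallelize(range(100), 8).map(lambda x: x * 2)

print(rdd.getNumPartitions())  # 8
print(rdd.count())             # action: the driver schedules the tasks on executors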

Cluster Manager

The cluster manager schedules and launches the executors. It allocates resources for task execution and, depending on the workload, can dynamically change the resources used by the Spark application: it can increase or decrease the number of executors according to the type of workload that needs to be processed.
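For instance (an illustrative configuration sketch, not a tuned production setup), dynamic allocation can be enabled so the cluster manager grows and shrinks the executor pool with the workload:

from pyspark.sql import SparkSession

# The limits below are arbitrary example values
spark = (SparkSession.builder
         .appName('dynamicAllocationDemo')
         .config('spark.dynamicAllocation.enabled', 'true')
         .config('spark.dynamicAllocation.minExecutors', '1')
         .config('spark.dynamicAllocation.maxExecutors', '10')
         # shuffle tracking lets dynamic allocation work without an external shuffle service (Spark 3.0+)
         .config('spark.dynamicAllocation.shuffleTracking.enabled', 'true')
         .getOrCreate())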

Spark Executors

When a job is submitted, the executors are launched at the start of the Spark application and run for the lifetime of the application. The executors run the individual tasks, which carry out the actual data-processing logic of the program.

The two roles of executors are (a small sketch follows the list):

1. run the tasks and return the results to the driver

2. provide in-memory storage for RDDs, DataFrames, and Datasets that are cached
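A small sketch of the second role (assuming a SparkSession named spark already exists, as in the example later in this post): caching an RDD asks the executors to keep its partitions in memory so later actions can reuse them.

from pyspark import StorageLevel

sc = spark.sparkContext  # assumes an existing SparkSession named spark

squares = sc.parallelize(range(1000)).map(lambda x: x * x)

# Ask the executors to keep the computed partitions in memory
squares.persist(StorageLevel.MEMORY_ONLY)

squares.count()   # first action computes and caches the partitions on the executors
squares.sum()     # later actions read the cached partitions instead of recomputing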

Example Explained

PySpark: parallelizing an existing collection in your driver program

from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession; this is the driver program's entry point
spark = SparkSession.builder.appName('pySparkSetup').getOrCreate()
sc = spark.sparkContext
# Distribute the local Python list across the cluster as an RDD
rdd = sc.parallelize(['Divya', 'Chandana', 'DC', 'Div', 'D', 'C'])

SparkContext is used to create an RDD from a list collection. Each dataset in the RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

# Transformation: keep only the names containing a capital 'D' (lazy, nothing runs yet)
d_in_names = rdd.filter(lambda name: 'D' in name)

# Actions: these trigger the computation and return results to the driver
d_in_names.collect()      # ['Divya', 'DC', 'Div', 'D']
rdd.getNumPartitions()    # number of logical partitions of the RDD
rdd.first()               # 'Divya'
rdd.collect()             # the full list returned to the driver
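The number of partitions can also be set explicitly when parallelizing a collection (a small sketch; glom() returns each partition as a separate list so you can see how the data was split):

# Split the same collection into 3 partitions explicitly
rdd3 = sc.parallelize(['Divya', 'Chandana', 'DC', 'Div', 'D', 'C'], 3)

rdd3.getNumPartitions()   # 3
rdd3.glom().collect()     # [['Divya', 'Chandana'], ['DC', 'Div'], ['D', 'C']]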

Quick Summary

When the main function of the user program is launched, the driver program starts running and requests resources from the cluster manager.

The various transformations and actions are processed through the Spark context. Until an action is reached, all transformations are recorded in the Spark context in the form of a DAG, which builds up the RDD lineage.
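A quick way to look at that lineage (a small sketch reusing the rdd from the example above): toDebugString() prints the chain of transformations the DAG has recorded so far.

# Each transformation adds a step to the lineage; nothing has executed yet
pipeline = rdd.filter(lambda name: 'D' in name).map(lambda name: name.upper())

# Inspect the recorded lineage (PySpark returns it as bytes)
print(pipeline.toDebugString().decode())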

The job is created after the action is called. A job is a collection of tasks that are created and launched on the worker nodes with the help of the cluster manager, and all of this is handled by the task scheduler.
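As a small sketch (the group id and description below are made up), each action submits its own job, which you can label and then inspect in the Spark UI (http://localhost:4040 by default when running locally):

# Label the jobs triggered by the next actions so they are easy to spot in the Spark UI
sc.setJobGroup('demo-group', 'summary example jobs')

rdd.count()      # action #1 -> job #1
rdd.collect()    # action #2 -> job #2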

