Understanding Apache Spark Architecture

Shobhit Tulshain
8 min read · Jan 31, 2024


The demand for Data Engineering keeps growing as the volume of data flowing into systems increases, so a data engineer needs to be proficient in their data engineering stack. Thanks to the many learning platforms available, almost anyone can write code nowadays; what matters most is how efficiently that code is written and how well the systems are designed. At the end of the day, only systems that are robust, scalable, and dynamic survive in the long run.

There are several tools for big data processing and data engineering, and Apache Spark is one of the most widely used. In this article, we will discuss why Spark is so popular for big data. We will also take a close look at the Spark architecture, because designing efficient systems is only possible once you have a clear picture of that architecture. The article is divided into the following sections:

  • Introduction
  • Terminologies
  • Spark Cluster Internal Working
  • Architecture of a Worker in Spark Cluster
  • On-Heap and Off-Heap memory
  • Jobs, Stages and Tasks
  • Life Cycle of a Spark Session

Introduction

Apache Spark

Apache Spark is an open-source distributed computing engine and unified analytics engine used for processing and analyzing large-scale data. Like Hadoop, Spark works in a distributed manner. However, Spark is built around in-memory processing, whereas Hadoop and traditional systems rely on disk-based processing. It is up to ~100 times faster for in-memory workloads and ~10 times faster for on-disk processing compared to Hadoop and traditional systems.

How is Spark so fast and how does it handle the processes at lightning speed?

Its lightning-fast processing comes from In-Memory and Parallel Processing, both of which stem from its distributed computing nature. We will dive into these two concepts in detail below. Spark has a well-designed layered architecture that follows a parent/child model: a Driver node acts as the parent and several Worker nodes act as the children, with the Cluster Manager sitting between them. These layers, Driver, Worker, and Cluster Manager, operate well within their own boundaries and are loosely coupled to each other.
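As a minimal sketch of how an application attaches to this layered setup (assuming a local PySpark installation; with local[*] a single JVM stands in for the driver, cluster manager, and workers):

    from pyspark.sql import SparkSession

    # The master URL tells the driver which cluster manager to talk to;
    # "local[*]" runs everything inside a single JVM using all local cores.
    spark = (
        SparkSession.builder
        .appName("architecture-demo")
        .master("local[*]")
        .getOrCreate()
    )
    print(spark.version)
    spark.stop()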

Terminologies

There are certain terminologies associated with Spark that are important to understand as they form the fundamentals.

RDD — Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. When Spark reads data as a DataFrame or a Dataset, it is internally represented as RDDs, with the data partitioned and distributed across the nodes.

Partition — An RDD/DataFrame is stored in the memory of the Spark cluster in chunks of data called Partitions.

Transformation — Transforms an input RDD into another RDD using certain functions. Transformations follow the Lazy Evaluation Model, i.e., when the code is received, execution/data processing does not start immediately; instead, logical plans are created for optimal execution. These logical plans are tracked by the DAG. Joins, dropping and renaming columns, etc., are examples of transformations.

Action — Unlike transformations, Actions are commands that perform the actual operations described by the DAG, e.g. count, show, etc.

DAG — A Directed Acyclic Graph (DAG) keeps track of all transformations. It creates a logical plan for each transformation and maintains the lineage graph. One important thing to note here is that when a transformation is called, only the lineage graph is updated and no operation is actually performed. Only when an action is called are the operations executed based on the created graph, which may include data shuffling.

Application — A piece of code that is to be executed. It can be a single command or a combination of notebooks implementing complex logic. An Application starts when this code is submitted to Spark for execution.
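To make the transformation/action distinction concrete, here is a small PySpark sketch (the dataset size and names are illustrative, not from the article): the transformations only extend the lineage, and nothing executes until the action at the end.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lazy-demo").master("local[*]").getOrCreate()

    df = spark.range(1_000_000)               # backed by an RDD under the hood
    print(df.rdd.getNumPartitions())          # how many partitions the data is split into

    # Transformations: only the lineage/DAG is updated, nothing runs yet.
    evens = df.filter(df.id % 2 == 0)
    renamed = evens.withColumnRenamed("id", "even_id")

    # Action: triggers the actual computation described by the DAG.
    print(renamed.count())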

Spark Cluster Internal Working

Spark Driver-Worker Architecture

In the Introduction, I already highlighted the driver, worker, and cluster manager; now let's discuss their roles when the actual computation happens. When code is submitted to the Spark cluster for execution, it first goes to the Driver, which hands it over to the Cluster Manager. The Cluster Manager checks and maintains the availability and health of the Workers and confirms this back to the Driver. It is also responsible for allocating the resources needed to execute the task.

Upon confirmation from the Cluster Manager regarding the resources, the Driver establishes a direct connection with the Workers, starts assigning tasks to them, and receives their responses directly as well. In case of a failure, the Cluster Manager comes into play: it detects node failures, automatically reschedules the affected tasks on other available nodes, and re-initializes the node for the next task.
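As a hedged sketch of how this driver/cluster-manager handshake looks from the application side, assuming a configured YARN cluster (a standalone or Kubernetes cluster manager would be addressed the same way through the master URL; the resource numbers are arbitrary examples):

    from pyspark.sql import SparkSession

    # Assumption: a configured YARN cluster; a standalone master URL such as
    # spark://host:7077 or a Kubernetes endpoint would play the same role.
    spark = (
        SparkSession.builder
        .appName("cluster-demo")
        .master("yarn")
        # The driver asks the cluster manager for these executor resources.
        .config("spark.executor.instances", "2")
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "8g")
        .getOrCreate()
    )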

Architecture of a Worker in Spark Cluster

Now that we know the high-level working of a cluster, let us dive into the architecture of a worker. Executors are associated with each Worker: each Worker can have multiple Executors, which in turn have Cores. A worker is capable of working independently.

Worker Architecture

The above image is an example showcasing the architecture of a single Worker with 32 Cores and 64 GB of memory. Let us say that the number of executors assigned to this worker is 4. The cores per executor can then be calculated as (total_cores_per_worker / executors_per_worker), which in this case is 32/4 = 8 cores per executor. The same goes for memory, i.e. (total_memory_per_worker / executors_per_worker), which comes out to 64/4 = 16 GB of memory per executor. You can do the same simple math for your own setup.

Note — It is not that the entire 16 GB is assigned to an Executor. A portion goes to the overhead memory assigned to an Executor, and a certain amount of memory is also allocated for the Cluster Manager; for simplicity, I have eased the calculations for understanding purposes. Feel free to check this link, which discusses memory management in detail.
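Putting that sizing arithmetic into a short Python sketch (the worker numbers are the assumed example values from above):

    # Assumed example values for the worker described above.
    total_cores_per_worker = 32
    total_memory_per_worker_gb = 64
    executors_per_worker = 4

    cores_per_executor = total_cores_per_worker // executors_per_worker            # 8 cores
    memory_per_executor_gb = total_memory_per_worker_gb / executors_per_worker     # 16 GB (before overhead)

    print(cores_per_executor, memory_per_executor_gb)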

On-Heap vs Off-Heap Memory

There are two storage concepts that are important to discuss, as they play a key role in optimizing an Application. Every Executor has access to two types of memory: On-Heap and Off-Heap memory. Each Executor has its own on-heap memory, while off-heap memory is shared by all executors within a worker node. On-heap memory is managed by the Java Virtual Machine (JVM), while off-heap memory is managed by the Operating System; the latter is used for caching and storing serialized data. Garbage Collection (GC) applies to on-heap memory, since it is managed by the JVM, but it does not apply to off-heap memory, which therefore gives you more control over memory management when using off-heap memory.
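Off-heap memory is disabled by default. As a small sketch, it can be enabled and sized with the following configuration (the 2g size is an arbitrary example):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("offheap-demo")
        .master("local[*]")
        # Enable off-heap memory and give it a fixed size outside the JVM heap.
        .config("spark.memory.offHeap.enabled", "true")
        .config("spark.memory.offHeap.size", "2g")     # arbitrary example size
        .getOrCreate()
    )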

On-Heap Memory Architecture

Let us discuss the On-Heap Memory architecture in detail. It is divided into three parts:

  1. Reserved Memory — This is fixed memory that cannot be used for storage or execution and cannot be altered. About 300 MB of heap memory is taken up by it. It is set aside for Spark's internal objects and acts as a safety margin so the system can keep functioning even when the rest of the heap is under pressure.
  2. User Memory — User Memory refers to the storage dedicated to user-defined data structures, internal metadata within Spark, and any User-Defined Functions (UDFs) created by the user. Its share is controlled by the "spark.memory.fraction" parameter in the Spark settings, and its size is calculated as:
(Executor Memory - Reserved Memory) * (1 - spark.memory.fraction)

3. Unified Memory — This memory is further divided into 2 segments whose split can be altered via the "spark.memory.storageFraction" parameter in the Spark settings. By default, it is set to 0.5. These segments are:

  • Storage Memory — It is used for storing cached data, broadcast variables, etc., and can be calculated by:
(Executor Memory - Reserved Memory) * spark.memory.fraction * spark.memory.storageFraction
  • Execution Memory — This memory is used to hold intermediate state for execution processes such as joins, aggregations, and shuffles. It can be calculated by:
(Executor Memory - Reserved Memory) * spark.memory.fraction * (1.0 - spark.memory.storageFraction)

The performance of an application depends a lot on both of the above-mentioned fraction values, i.e. "spark.memory.fraction" and "spark.memory.storageFraction". Tweaking these values based on the share of storage versus execution your workload needs can have a drastic effect on the performance of the application.
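As a rough worked example for a 16 GB executor, assuming the default values spark.memory.fraction = 0.6 and spark.memory.storageFraction = 0.5:

    executor_memory_mb = 16 * 1024
    reserved_mb = 300
    memory_fraction = 0.6         # default spark.memory.fraction
    storage_fraction = 0.5        # default spark.memory.storageFraction

    usable_mb = executor_memory_mb - reserved_mb                 # 16084 MB
    user_memory_mb = usable_mb * (1 - memory_fraction)           # ~6434 MB
    unified_mb = usable_mb * memory_fraction                     # ~9650 MB
    storage_mb = unified_mb * storage_fraction                   # ~4825 MB
    execution_mb = unified_mb * (1 - storage_fraction)           # ~4825 MB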

Jobs, Stages and Tasks

Spark Jobs, Stages and Tasks

In this section, we will walk through the flow of an Application, step by step, when it is sent for execution and how it is converted into jobs, stages, and tasks.

  • When an Application is submitted, the code is converted into a Job.
  • Jobs are divided into Stages, and the number of stages is determined by the number of times the data is shuffled; the more shuffling there is, the more stages are created. A join is an example of an operation that shuffles data.
  • Each stage can have several Tasks. Every task within a stage executes the same operation, with each task handling one Partition, so the number of partitions determines the number of tasks within each stage. A small example follows this list.
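Here is a small PySpark sketch of a single job containing a shuffle boundary (the numbers are illustrative): the groupBy forces a shuffle, so the job splits into two stages, and the task count per stage follows the partition count (you can confirm this in the Spark UI).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stages-demo").master("local[*]").getOrCreate()

    df = spark.range(0, 1_000_000, numPartitions=8)

    # Narrow transformation: no shuffle, stays within the same stage.
    doubled = df.selectExpr("id * 2 AS doubled")

    # Wide transformation: groupBy forces a shuffle, ending one stage and
    # starting another.
    counts = doubled.groupBy((doubled.doubled % 10).alias("bucket")).count()

    counts.collect()   # the action that actually launches the job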

Life Cycle of a Spark Session

Spark Session Life Cycle

Finally, let us understand the entire life cycle of a Spark session, which can be broken down into simple steps. Once the user is ready with the Application, it is submitted to Spark for execution. The application goes to the Driver, which initiates the Spark session. After the application is compiled, the DAG builds the logical plans in the form of lineage graphs. Based on the computation required and the worker nodes available, the Driver requests resources from the Cluster Manager, which then allocates the resources needed to perform the task. The Driver establishes a direct connection with the Workers, assigns tasks to them, and receives the results directly as well. The Driver returns the result to the user and the Application ends.
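As a minimal end-to-end sketch of that life cycle, assuming a local PySpark installation (the app and column names are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # 1. Submit the application: the driver initializes the Spark session.
    spark = SparkSession.builder.appName("lifecycle-demo").master("local[*]").getOrCreate()

    # 2. Transformations only extend the DAG / lineage graph.
    squares = spark.range(100).withColumn("square", col("id") * col("id"))

    # 3. An action makes the driver schedule tasks on the executors,
    #    using resources obtained from the cluster manager.
    result = squares.agg({"square": "sum"}).collect()
    print(result)

    # 4. The driver returns the result to the user and the application ends.
    spark.stop()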

I hope this article gave you a clear picture and understanding of Spark's architecture and helps you write more efficient applications. To give a brief introduction of myself, I work as a Data Engineer and a Backend Developer with 2+ years of experience. I work on building, optimizing, and streamlining processes in both of the above-mentioned domains, and I also develop Deep Learning and Machine Learning applications.

I'm always open to a good tech discussion, so if you want to talk about tech topics, do hit me up on LinkedIn and follow my GitHub for new projects, articles, and updates.
