Understanding Apache Spark - Part 1: Spark Architecture

A beginner’s guide to Apache Spark architecture

Tom Corbin
10 min read · Aug 7, 2023

What is Spark?

At its core, Apache Spark is a distributed processing system. This means it takes large data sets, breaks them down into smaller, manageable parts and then processes these parts across multiple computers at the same time. This network of computers is known as a ‘cluster’, and it allows Spark to handle vast amounts of data much faster than if the data were processed on a single machine.

The primary idea behind Spark is to carry out computations in parallel. Imagine you had to fold 1000 shirts. If you did it by yourself, it would take quite a bit of time. But if you gave some shirts out to your friends and everyone folded simultaneously, the job would be much faster. That’s the basic principle behind Spark — distributing tasks to achieve quick, efficient results. When you are dealing with extremely large datasets, Spark is a lifesaver because it allows you to process vast amounts of data extremely quickly. The more computers (nodes) you have on your cluster, the faster you can process your data.

When you tell Spark to do something, it divides your data into smaller partitions and sends them to the “worker nodes” in the cluster. It also divides the job into smaller tasks, and each worker performs these tasks with the data it has been given. Once all the tasks have been completed, the worker nodes send the results to the “driver node”, which is responsible for gathering all these individual results and compiling them into the final output that fulfills the purpose of your Spark application. This could be a transformed dataset, the result of an aggregate calculation, a machine learning model, or whatever else you have asked Spark to do.

An “application” refers to a program written by a user to perform a specific task or set of tasks using Spark.
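
To make the idea of an “application” concrete, here is a minimal sketch of what one might look like in PySpark. It uses the SparkSession entry point (covered later in this article), and the file path is just a placeholder:

```python
# word_count.py - a minimal sketch of a Spark application (illustrative only)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The entry point to Spark, covered in more detail later in this article
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read a text file into a DataFrame with a single "value" column
lines = spark.read.text("data/documents.txt")  # placeholder path

# Split each line into words and count how often each word appears;
# this work is broken into tasks and spread across the worker nodes
words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
counts = words.groupBy("word").count()

# The driver gathers the final, aggregated result
counts.show()

spark.stop()
```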

Driver Program: The Conductor

The Driver Program is a crucial component of Spark’s architecture. It’s essentially the control centre of your Spark application, organising the various tasks that need to be carried out, and returning the final result.

To help picture this, think of an orchestra. The Driver Program is like the conductor. It doesn’t play an instrument itself but directs the musicians to play the symphony. The Driver Program doesn’t itself process any data (for the most part), but it orders the workers to process the data in a synchronised fashion.

The Driver Program’s responsibilities can be broadly categorised into three core tasks:

  1. Maintaining Application Information: The Driver Program holds the essential application details. It knows the classes to be executed, values, variables, and their types within the application. Think of it as a blueprint holder for the application. Without it, the workers wouldn’t know what tasks they need to perform.
  2. Responding to User Input: The Driver Program isn’t a static component; it’s interactive. When a user submits a program, the driver program springs into action, analysing the user’s input and translating it into executable tasks. Imagine asking the conductor to change the music’s tempo; the conductor communicates your request to the musicians, ensuring the music’s pace changes according to your wish.
  3. Distributing and Scheduling Tasks: One of the most significant responsibilities of the Driver Program is to take the application and break it down into tasks that can be performed in parallel. It then assigns these tasks to Executors. Picture our conductor; they must decide which musician plays what part, when they come in, and coordinate the players so they all contribute harmoniously to the performance.
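
To ground the analogy: in PySpark, the driver program is simply the process that runs your script. A minimal sketch, assuming a local PySpark installation:

```python
from pyspark.sql import SparkSession

# This script *is* the driver program: it holds the application's
# variables and plan, and coordinates the work done by the executors.
spark = SparkSession.builder.appName("DriverDemo").getOrCreate()

df = spark.range(0, 1000)        # runs on the driver: only builds up a plan
total = df.count()               # triggers distributed work on the executors
print(f"Counted {total} rows")   # back on the driver, working with the result
```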

SparkContext: The Baton

Think of the SparkContext as the baton in the hands of the conductor (the Driver Program). It’s the tool through which the conductor communicates and directs the orchestra.

When the Driver Program initiates a Spark application, it’s like the conductor raising the baton to signal the start of the performance. This is when the SparkContext comes into play, establishing the connection to the ‘orchestra’ — the Spark Cluster.

Its responsibilities can be broken down into two key parts:

  1. Initiating Rehearsals: Before the actual performance (the execution of tasks) can begin, the orchestra needs to rehearse, which requires a suitable space and time. Similarly, the SparkContext arranges resources like CPU cores and memory across the various nodes in the cluster. The better the allocation of resources, the more harmonious the performance.
  2. Directing the Performance: Once the resources are allocated, the conductor starts directing the orchestra using their baton. Similarly, the SparkContext starts distributing the tasks to the worker nodes. It determines where tasks are sent, in much the same way that the conductor might signal the brass section of the orchestra to play with a wave of the baton.
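
In modern PySpark you rarely create a SparkContext by hand; it is created for you and exposed on the session. A small sketch of inspecting it (the printed values will depend on your setup):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ContextDemo").getOrCreate()

# The SparkContext: the "baton" connecting the driver to the cluster
sc = spark.sparkContext

print(sc.master)              # where the application runs, e.g. "local[*]" or "yarn"
print(sc.defaultParallelism)  # how many tasks Spark aims to run in parallel
print(sc.applicationId)       # the unique ID of this Spark application
```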

Cluster Manager: The Stage Manager

Continuing with our orchestra analogy, the cluster manager can be likened to a stage manager in a concert. Just as the stage manager makes sure the orchestra has everything it needs to perform (coordinating with the musicians, managing the logistics, even taking care of the lighting and sound systems), the cluster manager in Spark ensures that the cluster has everything it needs. It’s responsible for allocating resources across the cluster and managing them to get the best performance. It accepts resource requests from the SparkContext (much like a stage manager would take requests from the conductor) and provides the necessary resources, be it CPU, memory, or disk space, for Spark to run its operations smoothly.

You might be wondering which bit actually allocates the resources. Is it the Driver Program? The SparkContext? Or is it the cluster manager? Let’s clarify the role each component plays in resource allocation:

  1. Driver Program: While the Driver Program is responsible for managing the Spark application and dividing it into tasks, it doesn’t directly handle the allocation of resources (like CPU, memory, etc.). Its primary role is controlling the flow of the application.
  2. SparkContext: The SparkContext acts as a liaison between the Driver Program and the cluster manager. It requests resources from the cluster manager based on the tasks that the Driver Program has outlined. However, the SparkContext itself doesn’t allocate these resources; it merely requests them.
  3. Cluster Manager: The cluster manager is the component that is directly responsible for the allocation of resources in Spark. It oversees the resources available in the cluster and, upon receiving a request from SparkContext, assigns the requested resources accordingly. The Cluster Manager also keeps track of the health of its workers and allocates new ones if a worker fails.

In essence, the Driver Program identifies what resources are required (based on the tasks), SparkContext requests these resources, and the Cluster Manager allocates them. It’s a team effort, with each component playing its specific role to ensure efficient resource management and task execution within the Spark environment.
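
In code, you tell Spark which cluster manager to connect to through the “master” setting (in practice this is often supplied on the command line rather than hard-coded). The host names and ports below are placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ClusterManagerDemo")
    # .master("spark://master-host:7077")    # Spark's standalone cluster manager
    # .master("yarn")                        # Hadoop YARN
    # .master("k8s://https://k8s-host:443")  # Kubernetes
    .master("local[*]")                      # no cluster manager: run locally for testing
    .getOrCreate()
)
```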

Workers and Executors: The Musicians and Their Instruments

To recap, a worker is a node in the cluster with resources like CPU and memory that can be used to execute tasks. The actual execution of those tasks, however, is carried out by executors, which run on the workers. In other words, the workers provide the resources and environment, while the executors do the actual work.

A computer can run multiple processes at the same time, and each executor is a separate process running on a node (computer) in the Spark cluster. But rather than performing a single task in that process, an executor can run multiple tasks at the same time. Each executor is essentially a mini-worker, contributing to the overall computation by performing its own set of tasks independently of the other executors.

To extend our orchestra analogy, you can think of the workers as the musicians and the executors as their instruments. The musicians (workers) are vital for the performance, but they can’t produce music without their instruments (executors). Each musician can have one or multiple instruments (because each worker can have multiple executors).

Here are the key responsibilities of executors:

  1. Executing Tasks: Just like an instrument plays a song, the executor’s primary role is to execute tasks. These tasks are essentially portions of the application code that the Driver Program has assigned to them.
  2. Reporting Results: After an Executor has completed its task, it needs to report the result. The Executor either sends the result back to the Driver Program or stores the data in memory or on disk for future tasks.

In a nutshell, while workers are the machines that make up the Spark cluster and provide the resources, Executors are the processes that actually execute the tasks.
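
When submitting to a cluster manager, you can control how many executors you get and how big they are. The numbers below are purely illustrative, and some settings (such as spark.executor.instances) only apply on certain cluster managers:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ExecutorConfigDemo")
    .config("spark.executor.instances", "4")  # how many executor processes to launch
    .config("spark.executor.cores", "2")      # tasks each executor can run concurrently
    .config("spark.executor.memory", "4g")    # memory given to each executor
    .getOrCreate()
)
```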

Tasks: The Musical Notes of the Symphony

In the grand symphony of your Spark application, tasks are like the individual notes played by the musicians. They are the smallest units of work that get sent to the executors (or instruments in our analogy).

A task represents a single unit of work sent to one executor. When we speak about tasks, we’re referring to a specific operation performed on a data partition. For instance, if you have a dataset of a million records and want to increment each record by one, Spark might break this down into a thousand tasks, each task incrementing a thousand records.

Here’s a closer look at the nature of tasks in Spark:

  1. Individual Operation: Each task represents a single operation that needs to be performed on a slice of data. This operation could be as simple as reading data or as complex as performing a machine learning algorithm.
  2. Executed by Executors: Tasks are executed by executors. Each executor can process multiple tasks at a time. You can think of each executor as an instrument that can play multiple notes (tasks) simultaneously.
  3. Parallel Execution: Tasks are designed to be executed in parallel. This means multiple tasks can run at the same time on different data partitions across different executors. Just like multiple notes in a symphony are played simultaneously to create harmonious music.
  4. Failure Recovery: If a task fails due to an issue in the executor node, Spark’s resilience kicks in. The failed task can be reassigned to another executor for processing. This resilience is similar to how a symphony won’t stop if a musician misses a note; instead, the other musicians carry on, ensuring the performance continues.

In simple terms, if you think of your Spark application as a big project, like painting a house, then the tasks are like the individual brush strokes applied by different painters. Each brush stroke is part of the larger job of painting the house, and all the strokes together result in the house being fully painted. If your Spark application aims to count all the words in a collection of documents, for example, Spark might split the collection so that each executor counts the words in its own subset of documents.
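
The link between partitions and tasks is easy to see for yourself: Spark launches one task per partition in each stage. A small sketch, run locally:

```python
from pyspark.sql import SparkSession

# local[4] simulates a cluster with 4 task slots on a single machine
spark = SparkSession.builder.master("local[4]").appName("TaskDemo").getOrCreate()

# A million records split into 8 partitions means 8 tasks per stage,
# with up to 4 of them running at the same time here
df = spark.range(0, 1_000_000).repartition(8)
print(df.rdd.getNumPartitions())  # 8

# The "increment each record by one" example: applied partition by partition
incremented = df.selectExpr("id + 1 AS id")
print(incremented.count())
```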

SparkSession: The One-stop Entry Point

In older versions of Spark, different contexts were needed to interact with the different functionalities of Spark. For example, a SQLContext was needed to perform SQL operations, a HiveContext was needed to enable Hive support, and the SparkContext was used for core Spark functionalities and to communicate with the Spark cluster. However, with the advent of Spark 2.0, all these complexities have been simplified into a single, unified interface — the SparkSession.

The SparkSession acts as a gateway to all the functionalities that Spark offers. Whether you need to work with core Spark operations, perform SQL queries, or read and write data in a variety of structured formats, the SparkSession is a one-stop-shop. It encapsulates the SparkContext, SQLContext, and HiveContext.

In essence, the SparkSession makes it easier to use Spark’s capabilities by eliminating the need for different contexts.
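
Creating a SparkSession is typically the first thing a Spark application does. A minimal sketch, with an illustrative configuration value:

```python
from pyspark.sql import SparkSession

# One unified entry point for SQL, DataFrames, and the underlying SparkContext
spark = (
    SparkSession.builder
    .appName("UnifiedEntryPoint")
    .config("spark.sql.shuffle.partitions", "200")  # example configuration setting
    .getOrCreate()
)

spark.sql("SELECT 1 AS answer").show()  # SQL without a separate SQLContext
print(spark.sparkContext.appName)       # the SparkContext is still there underneath
```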

Logical and Physical Execution Plans

When you ask Spark to perform an operation, it doesn’t immediately rush into action. Instead, it takes a step back and formulates a plan of attack. This planning phase involves two key steps: the creation of a logical plan and a physical plan.

The Logical Plan

The logical plan is a series of transformations that represent the abstract computations to be carried out. At this stage, Spark isn’t concerned with the ‘how’ but rather the ‘what’. For instance, if you want to filter a dataset based on certain criteria, and then count the number of items, Spark’s logical plan would be something like “filter and then count”. It doesn’t worry about how it’s going to filter or count at this point. It’s a high-level view of the tasks required to meet the user’s demand, regardless of the execution details.

The Physical Plan

After determining what needs to be done, Spark then figures out how to do it. This is where the physical plan comes in. It represents a sequence of stages required to accomplish the tasks outlined in the logical plan. The physical plan takes into account the realities of execution, such as cluster layout, data partitioning, and memory management. Going back to our previous example, in the physical plan, Spark decides how it will filter and count, where these operations will take place, how data will move around, and so on. The physical plan is like a detailed roadmap, guiding Spark through the necessary steps to achieve the desired outcome.

By formulating these plans, Spark can optimize its operations for efficiency and speed. It’s a key part of why Spark can handle big data operations so quickly and effectively. It doesn’t just dive in headfirst; it thinks about what it needs to do and then formulates the best way to do it. Like a skilled chess player, Spark plans its moves in advance, ensuring it plays the game as efficiently as possible.
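
You can inspect both plans yourself with explain(). The example below is a sketch; the exact plan output depends on your Spark version:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("PlanDemo").getOrCreate()

# Filter a dataset and count the remaining rows
df = spark.range(0, 1_000_000)
result = df.filter(F.col("id") % 2 == 0).groupBy().count()

# Prints the parsed and optimized logical plans (the "what") followed by
# the physical plan (the "how"), including execution details
result.explain(True)
```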

Conclusion

Spark’s orchestra of various components — the driver program, SparkContext, cluster manager, executors, and tasks — all work in harmony, just like a well-conducted symphony, to ensure that big data computations are carried out efficiently and effectively.

In this first part of our series on how Spark works, we’ve taken a high-level look at the architecture of Spark, the roles of its various components, and the way they interact to execute tasks. In upcoming articles, we’ll dive deeper into the other mechanics of Spark, like its computing model, scheduler system, memory management, and more.
