Distribute & Conquer: A Playful Guide to Spark’s Distributed Execution

Apache Spark’s Distributed Architecture Simplified

Humberto Rendon
Byte-Sized Data
4 min read · Mar 13, 2023


Apache Spark’s logo with the Spark stars as a background

If you are reading this article, you probably know that Spark has something to do with big data. Spark is a really good tool for working with large amounts of data, but why? Spark has a particular way of doing things: it partitions the data and parallelizes the work. In other words, it “divides & conquers” large amounts of data. Keep reading to learn about Spark’s components and how they fit into its distributed architecture.

Components

The components of Apache Spark’s distributed architecture, scattered

It’s important to understand beforehand what each component does. This lets us maximize performance on every aspect of Spark when needed. When reading about each component, focus on the one at hand so you don’t get lost trying to understand the ones you haven’t gotten to yet.

Spark Driver

The Spark Driver is responsible for instantiating a Spark Session and communicating with the Cluster Manager. The Spark Driver also has the very important task of asking for resources (things like CPU, memory, etc.): it requests them from the Cluster Manager and hands them to the Spark Executors.

The Spark Driver is also responsible for transforming operations into DAG (Directed Acyclic Graph) computations. It then schedules the DAGs and distributes their execution as tasks across the Spark Executors. Once resources are allocated, the Spark Driver communicates directly with the Spark Executors (no need for a middleman anymore).

The Spark Driver

Spark Session

The Spark Session acts as a conduit to all Spark operations and data. It subsumes earlier entry points like SparkContext, SQLContext, StreamingContext, etc. You can still access the individual contexts and their methods, but it’s no longer required. This makes everything simpler and easier for developers.

A small example of creating a session, reading data, and issuing a query:

The Spark Session is also used to define DataFrames and Datasets, issue Spark SQL queries, read from data sources, set JVM runtime parameters, and access catalog metadata. It’s not bound to a single programming language, so you can use any of the high-level APIs Spark offers.

While using the Spark Shell, the session is created for you, and it can be accessed through a global variable called spark.

The Spark Session joined with the Spark Driver

Cluster Manager

The cluster manager is mostly responsible for allocating resources. The allocated resources are for the cluster of nodes where your application runs. Spark supports four types of cluster managers: the built-in standalone manager, Apache Hadoop YARN, Apache Mesos, and Kubernetes.
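Which cluster manager an application uses is chosen through the master URL passed to `spark-submit`. The host names and ports below are placeholders, not real endpoints:

```shell
# Standalone: point at the standalone master's host and port
spark-submit --master spark://host:7077 app.py

# YARN: the cluster location comes from the Hadoop configuration
spark-submit --master yarn app.py

# Mesos: point at the Mesos master
spark-submit --master mesos://host:5050 app.py

# Kubernetes: point at the Kubernetes API server
spark-submit --master k8s://https://host:6443 app.py
```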

The Spark Driver communicating directly with the Cluster Manager

Spark Executor

Executors run on worker nodes in the cluster. They communicate directly with the Spark Driver (as we saw earlier, this only happens once resources have been allocated). Executors are also responsible for executing tasks on the workers. Usually only one executor runs per node.

The Spark Executors communicating with the Cluster Manager, Spark Driver and Spark Session

Distributed Data & Partitions

The data itself is partitioned and distributed across storage partitions, which can live in HDFS or in cloud storage such as Amazon S3 or Azure Blob Storage. Even though the data is spread out across the physical cluster, Spark treats every partition as a DataFrame in memory.

When possible, executors are allocated tasks that require reading the partition closest to them.

File being distributed across data partitions

The benefit of partitioning the data is efficient parallelism. Breaking the data into chunks allows the executors to work only with data that is close to them, reducing network traffic. In the following code snippet we can see how easy it is to make partitions in Spark. The code generates 5 partitions with 10,000 numbers each, covering the numbers 0 through 49,999. When we print how many partitions the DataFrame has, we get 5 (as specified).

5

