An introduction to Spark

Pipitz
5 min read · Aug 17, 2022

In this post, we will explore the basic concepts behind Spark. Workload optimization, machine learning, autoscaling, deployment modes, and so on are all nice topics, but without a solid understanding of the underlying concepts it is impossible to fully grasp how to use this technology.

This post draws on the paper that introduced the concept of RDDs; I strongly recommend that anyone interested in Spark read it: nsdi_spark.pdf (mit.edu)

The post will explore the following topics:

  • MapReduce
  • RDDs
  • Spark architecture
  • Transformations vs Actions
  • Wide and narrow transformations
  • DAGs

Without further ado, let’s get started.

MapReduce

The MapReduce framework draws on functional programming, in particular the map and fold (reduce) operations. Long story short, the idea is to divide the problem into 3 steps. Given a collection of elements, they are:

  • Mapped into their categories
  • Shuffled, bringing the elements of the same categories together
  • Reduced by applying a fold logic

The picture below provides an example of applying MapReduce to compute the total number of occurrences of each shape.

MapReduce example
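
To make the three steps concrete, here is a minimal sketch of the same counting idea in plain Python (no Spark involved; the shape values are made up for illustration):

    from collections import defaultdict
    from functools import reduce

    shapes = ["circle", "square", "circle", "triangle", "square", "circle"]

    # 1. Map: emit a (category, 1) pair for every element
    mapped = [(shape, 1) for shape in shapes]

    # 2. Shuffle: bring the pairs of the same category together
    groups = defaultdict(list)
    for category, value in mapped:
        groups[category].append(value)

    # 3. Reduce: fold the values of each category into a single count
    counts = {category: reduce(lambda a, b: a + b, values)
              for category, values in groups.items()}

    print(counts)  # {'circle': 3, 'square': 2, 'triangle': 1}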

While the MapReduce framework has proven itself to be extremely useful in a variety of situations, its usage requires advanced knowledge of either C++ or Java.

Spark was introduced as a form of middleware that abstracts the user away from the MapReduce logic, and as an implementation of RDDs.

RDD

RDD stands for Resilient Distributed Dataset. To understand what RDDs are, let's go through the acronym (in reverse order):

  • Dataset: RDDs are an abstraction over data
  • Distributed: Data is divided into partitions and distributed over the computation framework
  • Resilient: Data is treated following the functional paradigm: each operation does not modify the data it is applied to; it creates new data instead. Furthermore, a certain level of fault tolerance is provided

While the definition is still vague, everything will fall into place as we explore Spark’s architecture and the transformation-action model.
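
As a first, concrete taste of the definition, here is a minimal sketch assuming a local PySpark installation (the variable names and the local[4] master are purely illustrative):

    from pyspark import SparkContext

    # "local[4]" runs Spark locally, using 4 CPU cores
    sc = SparkContext("local[4]", "rdd-intro")

    # Dataset: an RDD built from a Python collection
    # Distributed: the data is split into 4 partitions
    numbers = sc.parallelize(range(10), numSlices=4)
    print(numbers.getNumPartitions())  # 4

    # Resilient (functional side): filter() does not modify `numbers`,
    # it returns a brand new RDD
    evens = numbers.filter(lambda x: x % 2 == 0)
    print(numbers.count(), evens.count())  # 10 5

    sc.stop()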

Spark architecture

Spark follows a master-slave architecture. The Driver, which acts as master, defines the operations to be performed. The Workers, instead, act as slaves and perform the actions defined by the Driver. Once the tasks are completed the results are sent back to the Driver.

Spark’s architecture

Both Driver and Workers have their resources in terms of primary memory (RAM) and CPU cores.

As already mentioned, data is partitioned and distributed. To be more specific, the partitions are distributed across the Workers. Each Worker can perform operations, in a parallel fashion, on as many partitions as it has available CPU cores.

On a practical level, the Driver needs far fewer resources than the Workers. While the Driver only has to define the operation plan, the Workers do the actual heavy lifting.

Another relevant difference between Driver and Workers is that when a Worker dies, another one can be brought up to take its place. If, instead, the Driver crashes, the application crashes too.
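
A hedged configuration sketch of this setup, again assuming PySpark (the memory and core values are just examples; on a real cluster they are typically passed to spark-submit rather than hard-coded):

    from pyspark import SparkConf, SparkContext

    # This script plays the Driver; "local[4]" uses 4 local cores as if
    # they were Worker cores
    conf = (
        SparkConf()
        .setAppName("architecture-demo")
        .setMaster("local[4]")
        .set("spark.executor.memory", "4g")  # RAM per Worker (executor)
        .set("spark.executor.cores", "2")    # cores per Worker = parallel tasks
    )
    sc = SparkContext(conf=conf)

    print(sc.defaultParallelism)  # 4 with local[4]

    sc.stop()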

Transformations vs actions

Two kinds of operations can be performed by Spark:

  • Transformations: do not materialize data and are evaluated lazily. Examples are map, filter, join, …
  • Actions: materialize data and are evaluated eagerly. Save and show are good examples of actions

On a practical note, therefore, as long as we only define transformations Spark will not execute anything. Once an action is added to the operations to be performed on an RDD, the computation begins and the output is, eventually, materialized.
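
A minimal sketch of this lazy behavior (PySpark assumed, as in the previous snippets; the variable names are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "lazy-vs-eager")

    numbers = sc.parallelize(range(1_000_000))

    # Transformations: nothing runs yet, Spark only records the plan
    squares = numbers.map(lambda x: x * x)
    small = squares.filter(lambda x: x < 100)

    # Action: only now is the computation actually executed
    print(small.count())  # 10

    sc.stop()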

The series of transformations and actions that are to be performed on an RDD forms the lineage of the RDD. The concept of lineage, united with the partitioning of the RDD and the functional approach to data management, guarantees fault tolerance at the Worker level. If a Worker, for any reason, crashes, Spark will only need to re-execute a fraction of the computations. The only data lost is the data generated by the partitions assigned to that particular Worker. Given the functional paradigm used by Spark, the original RDD is unaltered and, given the lineage, the new Worker knows which transformations and actions are to be applied to its RDD partitions.

As such, only the new worker will need to compute its new RDD partitions. As already mentioned, however, if the Driver crashes the whole application crashes too.
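
The lineage can actually be inspected: in PySpark, toDebugString() prints the chain of transformations Spark would replay to rebuild lost partitions. A small sketch (output omitted here):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "lineage-demo")

    words = sc.parallelize(["spark", "rdd", "spark", "lineage"])
    pairs = words.map(lambda w: (w, 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Lineage: parallelize -> map -> reduceByKey. If a Worker dies,
    # only its partitions are recomputed by replaying this recipe
    print(counts.toDebugString().decode("utf-8"))

    sc.stop()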

Wide and Narrow transformations

Another extremely relevant distinction is the one between narrow and wide transformations. The difference lies in the mapping of partitions from the original RDD (the parent) to the one obtained from the transformation (the child):

  • Narrow transformations: each partition in the parent RDD is linked to at most one partition in the child RDD. An example of this kind of transformation is filter()
  • Wide transformations: at least one partition in the parent RDD is linked to two or more partitions in the child RDD. An example can be found in the sort() transformation

A visual example can be found in the image below:

Partitions mapping in narrow and wide transformations

As mentioned, partitions are distributed across Workers. As long as the transformations are narrow, each Worker can proceed independently, as it has all the data it needs. If, however, a wide transformation is involved, a Worker will need data that other Workers have. As such, Workers will exchange data among themselves as needed. This stage is known as the shuffle stage (remember MapReduce? :) ).

On a practical note, the shuffle stage takes time: if we compare the time needed to access data stored in main memory against the time needed to send data over a network, it is easy to see that shuffling is a costly operation.
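
A sketch contrasting the two kinds of transformation (PySpark assumed; the sales data is made up):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "narrow-vs-wide")

    sales = sc.parallelize(
        [("apples", 3), ("pears", 5), ("apples", 2), ("pears", 1)], 4
    )

    # Narrow: each Worker filters its own partitions, no network traffic
    big_sales = sales.filter(lambda kv: kv[1] > 1)

    # Wide: summing by key needs data from other partitions -> shuffle
    totals = sales.reduceByKey(lambda a, b: a + b)

    print(big_sales.collect())  # [('apples', 3), ('pears', 5), ('apples', 2)]
    print(totals.collect())     # [('apples', 5), ('pears', 6)] (order may vary)

    sc.stop()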

DAGs

Last, but not least, how is a workload organized?

The whole lineage is abstracted as a Directed Acyclic Graph (DAG). The DAG is then divided into Stages, which are the longest streaks of transformations that can be executed without shuffling data across Workers. Shuffle phases define the boundaries of Stages: as already mentioned, as long as a shuffle is not involved, each Worker can act independently.

Stages, in turn, are divided into tasks. As mentioned above, each task corresponds to a partition and is handled by a CPU core on the Worker that partition was assigned to.
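
For example, the toy job below (PySpark assumed, data made up) produces a DAG with two Stages, split at the shuffle introduced by reduceByKey; each Stage runs one task per partition, and the Spark UI (http://localhost:4040 by default while the application is running) visualizes the resulting DAG.

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "dag-demo")

    # 2 partitions -> 2 tasks per Stage
    text = sc.parallelize(["spark builds a dag", "a dag has stages"], 2)

    words = text.flatMap(lambda line: line.split())  # narrow (first Stage)
    pairs = words.map(lambda w: (w, 1))              # narrow (still first Stage)
    counts = pairs.reduceByKey(lambda a, b: a + b)   # wide -> shuffle -> second Stage

    counts.collect()  # the action that actually triggers both Stages

    sc.stop()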

Conclusions

This post has introduced the main concepts behind Spark. Starting from MapReduce, we have talked about RDDs, architecture, and lineage. While there are a lot of other topics to be covered (such as SparkSQL and query optimization, streaming, deployment, worker scaling, proper Spark configuration, …), this information is the foundation for understanding and properly using Spark.

I once again urge you to read the original paper and to take a look at all the other research papers on Spark that can be found here.

That’s it for today.

Cheers, see you next post.
