Spark Concepts Simplified: Lazy Evaluation

The what and how

John Tringham
7 min read · Nov 18, 2023

Hi there — welcome to my blog! This is one of hopefully many articles aimed at addressing concepts in data engineering (but not exclusively) in formats that are easy to digest.

As someone who’s transitioned into Data Engineering from Economics, I’m well aware of how some of these concepts may prove a daunting bridge to cross at first (Shrek and Donkey crossing that bridge comes to mind).

This guide will aim to NOT follow in the teaching footsteps of Shrek

This series aims to help you cross that bridge!

This article will:

  • Explain what lazy evaluation is
  • Delve into how lazy evaluation can speed up your data pipelines
  • Provide an example of lazy evaluation in action
  • (Bonus) Explain how a catalyst optimizer works

Do not worry if the concepts are difficult to grasp at first; we'll use analogies and code examples along the way to explain how everything works.

Key definitions

Before we get into lazy evaluation, it’s important to first understand the following Spark concepts:

Transformation

A method on a dataframe which returns another dataframe.

Action

A method on a dataframe which returns a value.

The Spark documentation has a great breakdown of which operations are classified as transformations and which as actions.
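To make that concrete, here's a quick sketch (assuming a dataframe called df with a price column already exists) of a few common methods from each camp:

# Transformations - each returns a new dataframe; nothing is computed yet
filtered = df.filter(df.price > 10)
doubled = df.withColumn('price', df.price * 2)
picked = df.select('price')

# Actions - each returns a value (or writes output) and triggers execution
n_rows = df.count()    # a number
rows = df.collect()    # a list of Row objects on the driver
df.show(5)             # prints the first 5 rows to the console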

Catalyst Optimizer

A Spark mechanism which assesses all the transformations it has to run, and figures out the most efficient way to run them together.

This concept is very important, as when paired with lazy evaluation, it speeds up your data pipelines! More on that later…

Note: The catalyst optimizer does not apply to Resilient Distributed Datasets (RDDs) - only dataframes and datasets. We’ll be focusing on dataframes in this article.

Now that we’ve covered the key definitions, let’s get into lazy evaluation.

What is Lazy Evaluation?

Lazy evaluation is a feature of Spark whereby it holds off on executing transformations until an action is called.

For example, you can run a transformation to filter your dataframe, such as df.filter(). But Spark won't actually filter your dataframe until you run an action, e.g. showing your dataframe with df.show().
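Here's a minimal sketch of that behaviour (assuming you have a SparkSession available; the column names and data are made up for the example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 'apple'), (2, 'bread'), (3, 'eggs')], ['id', 'item'])

# Transformation: recorded in the plan, but no data is touched yet
filtered = df.filter(df.id > 1)

# Action: only now does Spark actually scan the data and apply the filter
filtered.show()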

Why do we use lazy evaluation?

You might be thinking - why don’t we just evaluate every transformation immediately? Why the wait?

By not executing transformations the moment they're defined, Spark avoids materializing intermediate dataframes it may never need. This saves cluster capacity, as computing and holding a dataframe in memory can be a resource-intensive operation.

Another crucial optimization comes from how lazy evaluation works in unison with the catalyst optimizer. How so?

  • Because of lazy evaluation, a whole batch of transformations ends up queued to run together.
  • Then, thanks to the catalyst optimizer, Spark doesn't just run the pending transformations one by one. It looks at all the transformations it has to run and figures out the most efficient way to combine them.

Here’s an analogy to better illustrate the point:

Analogy: Grocery Shopping

Let’s say you’re forming a list of groceries you need to buy from the supermarket. Every time you think of buying something, you head straight to the supermarket to obtain it.

You end up going to the supermarket multiple times a week - you’re best friends with the cashier at this point.

But you want to try something new. You think about waiting until the end of the week to go to the supermarket. Since you don’t need some groceries immediately, when you think of an item you need, you can just add it to your list.

Finally it’s Saturday. You go and get all the food and household products from that list you’ve been adding to. You realize you have everything you need, and you’ve saved yourself all those excessive trips to the supermarket!

Grouping all your shopping lists into one. Source: Author

You realize that sometimes you were going to the supermarket to buy eggs on a Monday AND Thursday, when you could’ve instead just gone once on a Saturday.

The above example maps directly onto lazy evaluation. By holding off on going to the supermarket (running an action in Spark) every time you add an item to the grocery list (a transformation), you're more efficient!

Following this analogy, you might be asking:

That’s great! If I want to save even more time, what if I just held off on buying groceries for a whole month? Or longer?

You could! But it is all dependent on how urgent your groceries are. You might not be able to wait for some food items, so you can’t keep delaying your trip.

Equally, you might want to hold off on running an action, but sometimes a requirement forces you to evaluate your code earlier in the data pipeline (for example, to write out an intermediate result).

Now that we’ve gone over our analogy, let’s go through a code example.

Code example

Let’s say you have an existing dataframe with an integer column named price. We perform 3 transformations on the dataframe:

# Coded in pyspark

# Here we assume we are working with a dataframe which
# already has an integer column called 'price'

# Transformation 1: return a dataframe where the price column is multiplied by 2
df = df.withColumn('price', df.price * 2)
# Transformation 2: return a dataframe where the price column is multiplied by 3
df = df.withColumn('price', df.price * 3)
# Transformation 3: return a dataframe where the price column is multiplied by 5
df = df.withColumn('price', df.price * 5)
# Action: collect() triggers execution of all three transformations at once
df_collect = df.collect()

Lazy evaluation means we hold off on performing any of the 3 transformations, so the price column has not been modified yet.

Then you run an action: df.collect()

Now thanks to the catalyst optimizer, Spark combines the three transformations into one: multiply the price column by 30.

Lazy evaluation and catalyst optimizer in action. Source: Author

The above example was aimed at illustrating the concept; in most real scenarios, this feature will be applied to far more complex sequences of transformations.
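If you want to see this for yourself, you can ask Spark for its plan after queuing up the three transformations. A quick sketch, reusing the df from above (the exact output format differs between Spark versions):

# Prints the physical plan Spark intends to run. In the optimized plan,
# the three withColumn steps are typically collapsed into a single
# projection equivalent to price * 30.
df.explain()

# Passing True also prints the parsed, analyzed and optimized logical plans
df.explain(True)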

Bonus: How the catalyst optimizer actually works

We’ve explained how lazy evaluation works, and how with the catalyst optimizer, it can speed up your pipeline.

If you recall, we stated the catalyst optimizer:

Figures out the most efficient way to combine them.

But how does it figure that out?

Let’s take a step back and start from when your code is evaluated. The catalyst optimizer then has 3 key steps:

Resolving your logical plan

The first step for the catalyst optimizer is to make sure all your references make sense. Do the column names your query refers to actually exist? What columns are you trying to filter on? It’s a series of verifications that turns your unresolved logical plan into a resolved one.

Optimizing your logical plan

Now that the catalyst optimizer has confirmed and validated the operations you intend to perform in Spark, it rewrites your logical plan into a more efficient one.

By efficient, we mean things like deciding when to filter your data (predicate pushdown), pre-computing constant expressions (constant folding), and applying other rule-based rewrites.
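As a hedged illustration of the kind of rewrite this step can make (the orders and customers dataframes and their columns are made up for the example), consider a filter written after a join:

# Written this way, the filter appears after the join...
joined = orders.join(customers, 'customer_id')
recent = joined.filter(joined.year == 2023)

# ...but with predicate pushdown the optimizer is free to apply the filter
# to the orders data before the join, so less data is shuffled and joined,
# while the result stays the same. Constant folding is similar: an
# expression like price * (2 * 3 * 5) can be pre-computed to price * 30
# at planning time instead of being evaluated row by row.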

Translating to a physical plan

We then need to turn the optimized logical plan into something Spark can actually execute. Catalyst generates one or more candidate physical plans (for example, using different join strategies), estimates how much resource and execution time each would require on your cluster, and selects the most cost-efficient one given your capacity.

Note: In PySpark, you can call df.explain() to see the physical plan, or df.explain(True) to also see the logical plans.

These concepts may feel a bit foreign, so let’s once again refer to our grocery analogy:

Analogy: Grocery shopping (revisited)

Let’s say that, before the catalyst optimizer got involved, we had 4 separate grocery lists. We then evaluate our lists and let the catalyst optimizer work its magic:

The catalyst optimizer, explained. Source: Author

Resolving the logical plan would be checking the grocery lists and making sure the names and quantities of the food items make sense.

Optimizing the logical plan would then be combining the grocery lists into one. It would also involve checking whether any discounts exist for certain products, and whether any bulk offers apply (e.g. 2 for 1!).

For example, if we had a dozen eggs on Monday’s grocery list and a dozen on Thursday’s, we would combine them into one order of 2 dozen eggs. The supermarket might also have a buy-1-get-1-free offer, so the cost per egg would be cheaper too!

Finally, the physical plan would take your capacity into account. For example, your order might end up so big that walking would mean two trips, while driving means just one. If you’re trying to minimize the cost of your time, the car option is the cheapest, and so it’s the one selected.

Success! You’ve gone from 4 grocery orders to 1 grocery order a week, saving yourself both money and time.

Conclusion

So to summarize: lazy evaluation is a feature built into Spark which, when paired with the catalyst optimizer, brings about great efficiency gains.

Lazy evaluation is not something you need to actively implement, as Spark automatically does it for you, but it’s a very good concept to understand.

Thanks for reading!

Relevant resources:

  • Databricks definition of a catalyst optimizer: https://www.databricks.com/glossary/catalyst-optimizer
  • Databricks’s deep dive into the catalyst optimizer (more technical): https://www.databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html


John Tringham

Data Engineer with a background in Economics. Passionate about everything data and improving processes https://www.linkedin.com/in/john-tringham/