Mastering Lazy Evaluation: A Must-Know for PySpark Pros!

Think Data
3 min read · Jul 30, 2023

Lazy evaluation is a fundamental concept in Apache Spark, the open-source distributed data processing framework. Spark does not execute a transformation at the moment you define it; it defers the work until a result is actually needed. Because Spark sees the whole chain of operations before running any of it, it can optimize the job as a whole, for example by pipelining consecutive operations into a single pass over the data and skipping work the final result does not require.


Transformations: In Spark, transformations are operations applied to a distributed collection of data, such as an RDD (Resilient Distributed Dataset) or a DataFrame, that produce a new distributed dataset. Examples include map, filter, groupBy, and join. When you call a transformation on an RDD or DataFrame, Spark does not perform the computation right away. Instead, it records the operation in an execution plan: a directed acyclic graph (DAG) representing the sequence of transformations to be applied.
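To make the idea of "recording instead of running" concrete, here is a toy model in plain Python. It is not how Spark is implemented: the `LazyDataset` class, its `describe` method, and the `double` helper are hypothetical names invented for this sketch, and a flat list of steps stands in for Spark's DAG.

```python
class LazyDataset:
    """Toy stand-in for an RDD/DataFrame: records steps instead of running them."""

    def __init__(self, source, plan=()):
        self._source = source
        self._plan = plan  # tuple of (kind, function) steps, in call order

    def map(self, fn):
        # Transformation: return a NEW dataset with one more recorded step.
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, pred):
        # Transformation: again, only the plan grows; no data is touched.
        return LazyDataset(self._source, self._plan + (("filter", pred),))

    def describe(self):
        # Show the recorded plan without executing it.
        return [kind for kind, _ in self._plan]

    def collect(self):
        # Only here is the recorded plan actually applied to the data.
        rows = list(self._source)
        for kind, fn in self._plan:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows


calls = []  # log of every value the mapped function actually sees


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)

plan_before = ds.describe()   # the plan exists...
calls_before = len(calls)     # ...but double() has not run yet
result = ds.collect()         # now the plan executes

print(plan_before, calls_before, result, len(calls))
# ['map', 'filter'] 0 [6, 8] 4
```

Chaining `map` and `filter` only grows the plan; `double` is not called a single time until `collect` asks for the result.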

Lazy Evaluation: This is where lazy evaluation comes into play. Building the plan does not execute it; Spark defers all computation until an action is called.

Actions are operations that trigger the actual computation and return results to the driver program or write data to an external storage system. Examples of actions include count, collect

