Mastering Lazy Evaluation: A Must-Know for PySpark Pros!

Think Data · Jul 30, 2023

Lazy evaluation is a fundamental concept in Apache Spark, the open-source distributed data processing framework. It refers to the way Spark defers the execution of data transformations until an action requires their results, rather than running each operation immediately when it is defined. This approach offers several advantages in terms of optimization and performance.


Transformations: In Spark, transformations are operations that are applied to a distributed collection of data (like RDD — Resilient Distributed Dataset or DataFrame) to produce a new distributed dataset. Examples of transformations include map, filter, groupBy, join, etc. When you call a transformation on an RDD or DataFrame, Spark does not perform the computation right away. Instead, it builds up a logical execution plan, which is a directed acyclic graph (DAG) representing the sequence of transformations to be applied.
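Here is a minimal sketch of that behavior using the DataFrame API (the sample data, column names, and the filter/groupBy pipeline are illustrative assumptions, not part of a real workload):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-evaluation-demo").getOrCreate()

# Small in-memory DataFrame so the example needs no external files.
data = [("alice", "error"), ("bob", "ok"), ("alice", "error")]
df = spark.createDataFrame(data, ["user_id", "status"])

# Each call below is a transformation: it returns a new DataFrame immediately
# and only extends the logical plan (the DAG); no Spark job is launched yet.
errors = df.filter(F.col("status") == "error")
per_user = errors.groupBy("user_id").count()
```

Both filter and groupBy().count() return instantly, because Spark has only recorded what to do, not actually done it.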

Lazy Evaluation: Spark’s lazy evaluation comes into play at this point. The logical execution plan is not immediately executed, and Spark defers the computation until an action is called.
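Continuing the sketch above, one way to inspect the deferred plan is DataFrame.explain(), which prints the logical and physical plans Spark has built so far without executing them:

```python
# Prints the parsed, analyzed, and optimized logical plans plus the physical
# plan for the pending computation; still no job is run at this point.
per_user.explain(True)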

Actions: Actions are operations that trigger the actual computation and return results to the driver program or write data to an external storage system. Examples of actions include count, collect, take, and show.
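Continuing the same sketch, calling an action is what finally launches Spark jobs and materializes results on the driver:

```python
# Each action below triggers execution of the entire pending plan.
n_groups = per_user.count()   # action: returns the number of rows as an int
rows = per_user.collect()     # action: brings the rows back to the driver as a list of Row objects
print(n_groups, rows)

spark.stop()
```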
