Spark AQE: Dynamic Coalescing

Vivek Chaudhary
Plumbers Of Data Science
3 min read · Sep 27, 2023

The objective of this article is to understand AQE (Adaptive Query Execution), a feature added in Spark 3.0 to enhance Spark's optimization capabilities.

Before diving into AQE, let's review the history of Spark optimization and how it was implemented across versions of Spark.

  • Spark 1.x: Catalyst Optimizer (e.g., predicate pushdown and projection pushdown) and Tungsten Project (CPU, cache, and memory efficiency, eliminate the overhead of JVM objects and garbage collection).
  • Spark 2.x: Cost-Based Optimizer (CBO) to improve queries with multiple joins, using table statistics to determine the most efficient query execution plan.
  • Spark 2.x optimization took into consideration various factors such as: total size of a table/table partition (in bytes), row count of a table/table partition, and column statistics such as min, max, distinct_count, avg_col_length, max_col_length, and histogram. Basically, job optimizations were done at planning time only, where Spark checked the application code syntactically and performed logical optimizations.
  • Spark 3.x: Adaptive Query Execution (AQE) to speed up Spark SQL at runtime, based on runtime statistics collected during the execution of the query. It resolved the biggest drawback of CBO by balancing the stats-collection overhead against estimation accuracy.

By default, AQE is disabled in version 3.0 and can be enabled with spark.conf.set("spark.sql.adaptive.enabled", "true").

What is Dynamic Coalesce?

Dynamic coalescing refers to coalescing the post-shuffle partitions that are created as a result of the shuffle operation that happens during execution of a wide transformation.

So basically, whenever a Spark developer or a data engineer applies a wide transformation, e.g. a groupBy operation on a dataset, Spark internally creates shuffle partitions, and the default number of shuffle partitions is 200. To counter this, the engineer has to repartition, increasing or decreasing the number of partitions. More often than not it is very hard to decide the exact number of partitions up front, and it takes several iterations to find the optimal count.
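For reference, that fixed default comes from the spark.sql.shuffle.partitions setting, and manual tuning means picking a static value in advance (a minimal sketch against an existing SparkSession; the value 50 is just an illustrative guess):

```python
# The fixed default every shuffle falls back to when AQE coalescing is off:
spark.conf.get("spark.sql.shuffle.partitions")   # "200"

# Manual tuning means committing to one static number for all shuffles:
spark.conf.set("spark.sql.shuffle.partitions", "50")
```

The pain point is that one static number has to serve every shuffle in the job, regardless of how much data each one actually moves.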

In such scenarios, AQE dynamically computes statistics on the data during the shuffle/sort to determine the optimal number of partitions.
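Under the hood, the coalescing target is governed by a few related configs. A sketch of the main knobs (values shown are the documented Spark 3.x defaults; exact availability of each key varies slightly by minor version):

```python
# Both flags must be on for dynamic coalescing to kick in:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Advisory target size for each coalesced partition (default 64MB):
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")

# Initial partition count before coalescing; when unset, Spark falls
# back to spark.sql.shuffle.partitions:
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "200")
```

Roughly speaking, AQE starts from the initial partition count, looks at the actual shuffle output sizes at runtime, and merges adjacent small partitions until each one is close to the advisory size.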

Let’s try to understand this by implementing a piece of code.

BTW, I have used Spark 3.3, where AQE is enabled by default.

spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")

e_df=spark.read.csv('<path>/emp_big.csv',header=True,inferSchema=True)
e_df.show()

To demonstrate I have taken dataset size of around 800 MB.

  1. When AQE is disabled.

Apply a groupBy() operation to check the number of shuffle partitions created by the wide transformation.

sum_df = e_df.groupBy('DEPTNO').count()
sum_df.show()
sum_df.rdd.getNumPartitions()

Output: getNumPartitions() returns 200.

As we can see, with AQE disabled Spark created 200 shuffle partitions, even though that many partitions may not be required at all. So let's enable AQE and see how Spark behaves.

2. When AQE is enabled.

# now set AQE back to true
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
sum1_df = e_df.groupBy('DEPTNO').count()
sum1_df.show()
sum1_df.rdd.getNumPartitions()

Output: getNumPartitions() returns 1.

We can clearly see that with AQE disabled, Spark followed its default approach and created 200 partitions, but with AQE enabled it created only 1 shuffle partition. Yes, dataset size matters here, but the point stands: 199 of those partitions were unnecessary, and Spark dynamically coalesced them into a single partition.
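You can also see the coalescing in the physical plan itself, continuing from the session above (a sketch; the exact node name depends on the Spark version, and the comments below describe what the plan typically shows rather than exact output):

```python
# Inspect the physical plan of the AQE-enabled aggregation:
sum1_df.explain()

# With AQE on, the plan contains an AQEShuffleRead node (named
# CustomShuffleReader in Spark 3.0/3.1) marked "coalesced", which is
# where the post-shuffle partitions get merged at runtime.
```

This is a handy sanity check when tuning: if the coalesced read node is missing, AQE coalescing did not actually apply to that shuffle.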

This use case simply demonstrates how AQE dynamically coalesces the number of shuffle partitions.

Summary:

  • Spark 3.0's new feature, AQE (Adaptive Query Execution), and how it works behind the scenes.
  • How Dynamic Coalescing works in Spark 3.x.

I hope you like my content. Please connect with me on Medium and LinkedIn at linkedin.com/in/vivek-chaudhary90.

Thanks for the support.
