How Apache Spark decides on the join strategy

Venkatakrishnan
3 min read · Sep 18, 2023

Apache Spark's Catalyst optimizer decides the join strategy using a combination of rules and cost-based estimates. It takes into account a number of factors, including the estimated size of each dataset, the type of join, whether the join condition is an equality condition, the distribution of the data, and any join hints supplied in the query.

Based on these factors, the optimizer will choose the join strategy that it estimates will be the most efficient.
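To see which strategy the optimizer actually picked for a given query, the easiest place to look is the physical plan. Below is a minimal PySpark sketch, assuming a local SparkSession and two synthetic DataFrames (orders and customers, names purely illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-strategy-demo").getOrCreate()

# Two synthetic DataFrames: a large "fact" table and a small "dimension" table.
orders = spark.range(1_000_000).withColumnRenamed("id", "customer_id")
customers = spark.range(1_000).withColumnRenamed("id", "customer_id")

joined = orders.join(customers, on="customer_id", how="inner")

# The physical plan names the strategy the optimizer settled on,
# e.g. BroadcastHashJoin, SortMergeJoin, or ShuffledHashJoin.
joined.explain()
```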

Steps in Decision Making:

  1. Check if a broadcast join is possible: Spark first checks whether one side of the join is estimated to be smaller than the spark.sql.autoBroadcastJoinThreshold configuration parameter, which defaults to 10 MB. If it is, a broadcast join, where the smaller dataset is sent to every executor, becomes the chosen strategy. This avoids a shuffle entirely, but it relies on the smaller dataset fitting in memory on the driver and on each executor (see the first sketch after this list).
  2. Choose between shuffle hash join and sort merge join: If a broadcast join is not possible, Spark chooses between shuffle hash join and sort merge join for equi-joins. Sort merge join is preferred by default (spark.sql.join.preferSortMergeJoin is true) because it does not require building an in-memory hash table; shuffle hash join can be faster when one side is small enough per partition to hash and the extra sort is not worthwhile (see the configuration sketch after this list).
  3. Adaptive Query Execution (AQE):
  • Introduced in Spark 3.0, AQE allows Spark to dynamically switch join strategies during runtime based on the actual size of shuffled partitions.
  • For instance, if AQE finds that after shuffling, one side of the join is smaller than the broadcast threshold, it can convert the planned sort merge join into a broadcast join at runtime, skipping the more expensive sort and merge.
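
For step 1, the broadcast path can also be influenced explicitly. A short sketch, reusing the illustrative orders and customers DataFrames from above (the 50 MB threshold is just an example value, not a recommendation):

```python
from pyspark.sql.functions import broadcast

# Raise the auto-broadcast size limit (default is 10485760 bytes = 10 MB);
# setting it to -1 disables automatic broadcasting entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

# Or force a broadcast of the smaller side regardless of the size estimate.
hinted = orders.join(broadcast(customers), on="customer_id")
hinted.explain()  # the plan should now show BroadcastHashJoin
```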

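For steps 2 and 3, the relevant knobs are the sort-merge preference flag, per-query join hints, and the AQE settings. A hedged configuration sketch (the values shown are the defaults in recent Spark 3.x releases, shown only to make the behaviour explicit):

```python
# Sort merge join is preferred over shuffle hash join by default.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")

# Per-query hints can nudge the planner toward a specific strategy.
sm = orders.join(customers.hint("merge"), on="customer_id")         # sort merge join
sh = orders.join(customers.hint("shuffle_hash"), on="customer_id")  # shuffle hash join

# AQE re-plans joins at runtime from actual shuffle statistics; if one side
# turns out to be small enough, a planned sort merge join can be converted
# to a broadcast join after the shuffle.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")
```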