Join Strategies in Apache Spark: A Deep Dive

Siddharth Ghosh

Joins are inevitable when dealing with data, and we have all come across them when writing queries, be it inner, outer, left, right, or semi-joins. But when it comes to Apache Spark, these familiar joins are handled very differently under the hood. Apache Spark employs several distinct strategies to perform the join operation, and knowing the internals of these strategies can help us utilize the power of Spark and optimize our code for better job performance. Before delving into Spark's strategies, we first need to understand the factors that affect join operations.

What Factors Affect Join Operations?

  1. Dataset Size: The size of the datasets participating in the join directly affects join performance.
  2. Join Condition: The join condition derives from the logical comparison of fields in the datasets. Conditions are categorized as equivalence conditions (=) or non-equivalence conditions (>, <, ≥, ≤, <>).
  3. Join Type: After settling on the condition, we need to consider the type of join, categorized as inner, outer, semi, or cross join.

Based on these factors, Apache Spark implements the strategies below for join execution.

  1. Broadcast Hash Join
  2. Shuffle Hash Join
  3. Shuffle Sort Merge Join
  4. Cartesian Join
  5. Broadcast Nested Loop Join

Before moving further into the details of these strategies, let us first understand what a Hash Join is.

Hash Join

In a Hash Join, a hash table is built on the join key of the smaller dataset; Spark then loops over the larger dataset and matches each row's join key against that table. It supports only the equivalence join condition, and the strategy is applied at the per-node level (across all partitions of the dataset available on a node). Building the hash table is what speeds up the search: once the hash table exists for the smaller dataset, looking up a join key from the larger dataset is an O(1) operation.
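To make this concrete, here is a minimal sketch of the hash join idea in plain Scala (this is an illustration, not Spark's internal code; the datasets and names are made up):

// Hash-join sketch: build a hash table on the smaller dataset,
// then probe it once per row of the larger dataset.
object HashJoinSketch extends App {
  val small = Seq((1, "dept-A"), (2, "dept-B"))            // (deptId, deptName)
  val large = Seq((101, 1), (102, 2), (103, 1), (104, 3))  // (empId, deptId)

  // Phase 1: hash table keyed on the join key of the smaller dataset.
  val hashTable: Map[Int, String] = small.toMap

  // Phase 2: scan the larger dataset; each probe is an O(1) lookup.
  val joined = for {
    (empId, deptId) <- large
    deptName        <- hashTable.get(deptId) // inner join: unmatched rows are dropped
  } yield (empId, deptId, deptName)

  joined.foreach(println) // (101,1,dept-A), (102,2,dept-B), (103,1,dept-A)
}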

Broadcast Hash Join

Broadcast Hash Join employs a simple strategy: broadcast the smaller dataset to the worker nodes, thereby saving the shuffle cost. This join is quite useful when one of the datasets is huge and the other is small, by default smaller than 10 MB, though the threshold is configurable. It is also called a Map-Side Join (a usage sketch follows the notes below).

spark.sql.autoBroadcastJoinThreshold (default value: 10485760 bytes, i.e., 10 MB)

  1. The broadcast dataset must fit in memory on the driver as well as on each executor node. The driver first collects the dataset from the executor side and then broadcasts it to all worker nodes where partitions of the larger dataset are present.
  2. Only the equivalence join condition is supported, and all join types except Full Outer Join are supported.
  3. The threshold property is configurable and has a maximum limit of 8 GB.
  4. Because the table is cached on the driver node as well as on the executor nodes, broadcasting a large table is a network-intensive operation that can degrade performance.
Figure: Broadcast Hash Join implementation
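As an illustration, here is a small, self-contained sketch using the DataFrame API (the dataset names and sizes are made up for the example):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("JoinStrategies")
  .master("local[*]")
  .getOrCreate()

// Size threshold (in bytes) below which Spark broadcasts automatically; -1 disables it.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

val largeDf = spark.range(0, 1000000).withColumnRenamed("id", "key")
val smallDf = spark.range(0, 100).withColumnRenamed("id", "key")

// Explicitly request that the smaller side be broadcast.
val broadcastJoined = largeDf.join(broadcast(smallDf), Seq("key"))
broadcastJoined.explain() // the physical plan should show BroadcastHashJoin

Checking the explain() output is a quick way to verify which strategy the planner actually chose.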

Shuffle Hash Join

Shuffle Hash Join is a two-phase process: a shuffle phase followed by a hash join phase. Rows with the same join key are shuffled to the same executor node; on each executor, Spark builds a hash table for the smaller side and applies a Hash Join, as in the sketch after the notes below.

  1. The join keys do NOT need to be sortable.
  2. Only the equivalence join condition is supported, and all join types except Full Outer Join are supported.
  3. This is an expensive join, as it involves both shuffling the datasets and building a hash table on one side.
  4. To make this join eligible, we need to disable the preference for Shuffle Sort Merge Join:

spark.sql.join.preferSortMergeJoin=false (Default value is true)

Figure: Shuffle Hash Join
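Continuing with the spark session and the largeDf/smallDf DataFrames from the broadcast sketch above, something like the following should steer the planner toward a Shuffle Hash Join (the shuffle_hash hint requires Spark 3.0+):

// Disable the sort-merge preference and automatic broadcasting so that
// Shuffle Hash Join becomes a candidate.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

// On Spark 3.0+, the shuffle_hash hint requests this strategy directly.
val shuffleHashJoined = largeDf.join(smallDf.hint("shuffle_hash"), Seq("key"))
shuffleHashJoined.explain() // look for ShuffledHashJoin in the physical plan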

Shuffle Sort Merge Join

Shuffle Sort Merge Join, as the name suggests, involves a shuffle phase and a sort-merge phase. Rows with the same join key are shuffled to the same executor node; on each executor, the dataset partitions are sorted by the join keys and then merged on matching keys (see the sketch after the notes below).

  1. This is the default join strategy in Apache Spark since Spark 2.3. It can be disabled using spark.sql.join.preferSortMergeJoin=false.
  2. The join keys need to be sortable.
  3. All join types are supported.
Figure: Shuffle Sort Merge Join
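Reusing the same DataFrames, the merge hint (Spark 3.0+) makes the default choice explicit:

// Sort merge is the default for equi-joins on sortable keys; the hint makes it explicit.
val sortMergeJoined = largeDf.join(smallDf.hint("merge"), Seq("key"))
sortMergeJoined.explain() // look for SortMergeJoin preceded by Exchange (shuffle) and Sort steps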

Cartesian Join

If the participating datasets do not specify a join key (an ON condition), the cartesian product strategy is picked (see the sketch after the notes below).

  1. Only the inner join type is supported.
  2. Supports equivalence and non-equivalence join conditions.
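For example, an explicit crossJoin on the DataFrames from above triggers this strategy (note that Spark 2.x requires spark.sql.crossJoin.enabled=true, while Spark 3.x allows cross joins by default):

// A cross join has no join key, so Spark picks the Cartesian Product strategy.
val cartesian = largeDf.crossJoin(smallDf)
cartesian.explain() // look for CartesianProduct in the physical plan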

Broadcast Nested Loop Join

This strategy is selected when there is no other suitable join mechanism to choose. In other words, if neither the join condition nor a hint singles out one of the strategies above, this join is chosen (see the sketch after the notes below).

  1. Supports equivalence and non-equivalence join conditions.
  2. All join types are supported.
  3. This is a very expensive join and Spark automatically optimizes the join by looking for an appropriate dataset that can be broadcasted.
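As a sketch, a non-equivalence condition with no applicable hint typically falls back to this strategy (again reusing the earlier DataFrames):

// A non-equi condition on a left outer join leaves Broadcast Nested Loop Join
// as the only applicable strategy; Spark broadcasts the side it deems cheaper.
val nonEquiJoined = largeDf.join(smallDf, largeDf("key") > smallDf("key"), "left")
nonEquiJoined.explain() // look for BroadcastNestedLoopJoin in the physical plan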

Spark Join Selection Strategy

Case-I) Equivalence Join Condition

Spark considers the join hints in the following order:

  1. Broadcast Hint: Pick Broadcast Hash Join if the join type is supported.
  2. Sort Merge Hint: Pick Sort Merge Join if the join keys are sortable.
  3. Shuffle Hash Hint: Pick Shuffle Hash Join if the join type is supported.
  4. Shuffle Replicate NL Hint: Pick Cartesian Product Join if the join type is inner-like.
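The same hints are available in SQL syntax on Spark 3.0+. A sketch, registering the earlier DataFrames as temp views:

largeDf.createOrReplaceTempView("large")
smallDf.createOrReplaceTempView("small")

// Hints go in a /*+ ... */ comment; BROADCAST, MERGE, SHUFFLE_HASH and
// SHUFFLE_REPLICATE_NL correspond to the four hint types listed above.
spark.sql("""
  SELECT /*+ BROADCAST(small) */ *
  FROM large JOIN small ON large.key = small.key
""").explain() // BroadcastHashJoin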

If no hints are applicable:

  1. If one of the datasets is small enough to be broadcasted and join type is supported then pick Broadcast Hash Join.
  2. If join keys are sortable then pick Sort Merge Join.
  3. If one of the datasets is small enough to build a Hash table, and spark.sql.join.preferSortMergeJoin=false, choose Shuffle Hash Join.
  4. If the join type is inner type, then pick Cartesian Product Join.
  5. Fall back to Broadcast Nested Loop Join if none of the above options applies.

Case-II) Non-Equivalence Join Conditions

Spark considers the join hints in the following order:

  1. Broadcast Hint — Pick the Broadcast Nested Loop Join.
  2. Shuffle Replicate NL Hint — Pick Cartesian Product if the join type is inner like.
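For instance, with the temp views registered earlier, a broadcast hint on a non-equi join should produce a Broadcast Nested Loop Join:

// Non-equi condition plus a broadcast hint: Broadcast Nested Loop Join.
spark.sql("""
  SELECT /*+ BROADCAST(small) */ *
  FROM large JOIN small ON large.key < small.key
""").explain() // look for BroadcastNestedLoopJoin in the physical plan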

Conclusion

Whatever strategy Spark chooses, if we as developers are cognizant of the datasets participating in our join operations, we can enforce the strategy that fits our needs and improve the performance of our Spark jobs. With that, let me summarize the whole article in a table and conclude.

Summary
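
The following table is compiled from the points above.

Strategy | Join condition | Join types | Notes
Broadcast Hash Join | Equivalence only | All except Full Outer | Smaller side broadcast, no shuffle; threshold set by spark.sql.autoBroadcastJoinThreshold
Shuffle Hash Join | Equivalence only | All except Full Outer | Keys need not be sortable; requires spark.sql.join.preferSortMergeJoin=false
Shuffle Sort Merge Join | Equivalence only | All | Default since Spark 2.3; keys must be sortable
Cartesian Join | Equivalence and non-equivalence | Inner only | Picked when no join key is specified
Broadcast Nested Loop Join | Equivalence and non-equivalence | All | Fallback when nothing else applies; very expensive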
