Joins in Spark — Part 2

Tharun Kumar Sekar · Published in Analytics Vidhya · 2 min read · Jan 12, 2020

This article explains the types of joins that happen in your Spark application.

Sort Merge Join

A Sort Merge join is the most basic type of join and goes back to MapReduce fundamentals.

In the first step of a sort merge join, Spark maps through both tables and creates an output key. The output key is the field on which you are joining the tables. Using the output key, Spark shuffles the data.

Spark hashes each key to decide which partition that key should go to, and then sorts, shuffles, and loads the data into the respective partitions.
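The hash-based routing described above can be sketched in plain Python. This is a simplified illustration of the idea, not Spark's actual implementation; the function name and partition count are made up for the example:

```python
# Sketch of hash partitioning: each (key, row) pair is routed to a
# partition based on the hash of its join key, so equal keys from
# both tables always land in the same partition.
def hash_partition(rows, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        # hash(key) % num_partitions decides the target partition
        partitions[hash(key) % num_partitions].append((key, value))
    # within each partition, rows are sorted by key for the merge phase
    for p in partitions:
        p.sort(key=lambda kv: kv[0])
    return partitions
```

Because the same hash function is applied to both tables, all rows for a given key end up on the same machine, which is what makes the later merge phase possible.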

Sort Merge Join

In the reduce phase, rows from both tables with the same key land on the same machine and are already sorted. Reduce is the phase where the join actually happens, and since the data from both tables for a given key is on the same machine, the joins for different keys can run in parallel.
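The merge that happens on each machine can be illustrated with two key-sorted lists. This is a plain-Python sketch of the merge logic under the assumption that both sides are already sorted by key, not Spark code:

```python
# Sketch of the merge phase of a sort merge join: both inputs are
# sorted by key, so a single coordinated pass joins matching keys.
def merge_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk = right[j][0]
        if lk < rk:
            i += 1          # left key too small, advance left
        elif lk > rk:
            j += 1          # right key too small, advance right
        else:
            # emit all right-side matches for this left row
            # (handles duplicate keys on the right)
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                out.append((lk, lv, right[j2][1]))
                j2 += 1
            i += 1
    return out
```

Because each input is scanned only once, the merge is linear in the size of the sorted partitions, which is why the sort happens first.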

Performance:

Sort Merge Join works best when the data you have:

  • Is distributed evenly across the key you are joining on.
  • Has an adequate number of distinct keys for parallelism.

Broadcast Hash Join

When one of the data frames is small enough to fit in the working memory of a single executor, you can force Spark to broadcast that data. 10 MB is the default value of spark.sql.autoBroadcastJoinThreshold.

You can increase this threshold based on your executor memory.
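The mechanics of a broadcast hash join boil down to building an in-memory hash table from the small table once, then probing it from each partition of the large table. The sketch below illustrates the idea in plain Python; it is not Spark's implementation, and the names are chosen just for this example:

```python
# Sketch of a broadcast hash join: the small side becomes a hash map
# (the "broadcast" copy shipped to every executor), and each partition
# of the large side probes it locally -- no shuffle of the large table.
def broadcast_hash_join(large_partitions, small_rows):
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)
    joined = []
    for partition in large_partitions:  # in Spark, these probes run in parallel
        for key, value in partition:
            for small_value in lookup.get(key, []):
                joined.append((key, value, small_value))
    return joined
```

The shuffle disappears because the small table travels to the data instead of the large table's rows being repartitioned by key.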

Broadcast Join

After the small dataset has been broadcast to every executor, the join runs in parallel on each partition of the large table. This can speed up Spark processing significantly, as it eliminates the shuffle phase.

For more info on Broadcasting, refer to this article.
