Spark Join Types Visualized

Shyamal Akruvala · Published in Nerd For Tech · 5 min read · Jul 23, 2021

Joins are an integral part of any data analysis or integration process. Two sets of data, left and right, are brought together by comparing one or more columns (read: keys) along with a join condition to determine the final output, which can contain data from the left, the right, or both, depending on the join type.

Apache Spark provides the following join types:

  1. Inner Joins (Records with keys matched in BOTH left and right datasets)
  2. Outer Joins (Records with keys matched in EITHER left or right datasets)
  3. Left Outer Joins (Records with keys in the left dataset)
  4. Right Outer Joins (Records with keys in the right dataset)
  5. Left Semi Joins (Records from left dataset with matching keys in right dataset)
  6. Left Anti Joins (Records from left dataset without matching keys in right dataset)
  7. Natural Joins (done using implicit column matching between datasets)
  8. Cross Joins (Cartesian product of the left and right datasets)

Let’s create some data to demonstrate the various joins.

I’ve created two DataFrames (namesDF, typesDF) that can be linked using the ‘Id’ column. Here’s how they look:
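
Since the original screenshots aren’t reproduced here, below is a minimal PySpark sketch of how such DataFrames could be created; the actual column values are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-types").getOrCreate()

# Hypothetical sample data; the two DataFrames link on the 'Id' column.
# Ids 3 and 5 exist only on the left, Id 4 only on the right.
namesDF = spark.createDataFrame(
    [(1, "Apple"), (2, "Banana"), (3, "Cherry"), (5, "Elderberry")],
    ["Id", "Name"])
typesDF = spark.createDataFrame(
    [(1, "Fruit"), (2, "Fruit"), (4, "Vegetable")],
    ["Id", "Type"])

namesDF.show()
typesDF.show()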

I’ll focus more on the DataFrame APIs since in Spark SQL the syntax is similar to any other SQL. The syntax for joining using the DataFrame APIs is:

DataFrame1.join(DataFrame2, [on=keys], [how=joinType])

The join condition (on) and join type (how) are optional. DataFrame1 is termed the left dataset and DataFrame2 the right.
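
Assuming the namesDF and typesDF sketched above, a reusable join expression might look like this:

# Join expression comparing the 'Id' columns of the two DataFrames
joinExpr = namesDF["Id"] == typesDF["Id"]

# Both 'on' and 'how' passed explicitly
namesDF.join(typesDF, on=joinExpr, how="inner").show()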

Inner Joins

This is the default join, i.e. if no join type is mentioned then Spark performs an inner join. Only those records where the keys match in both datasets, left and right, are pulled into the output. Keys that do not exist in both datasets are ignored, i.e. not included in the output.
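
For example, with the sample DataFrames sketched above (only Ids 1 and 2 exist in both):

# 'inner' is the default, so the join type can be omitted
namesDF.join(typesDF, joinExpr).show()
namesDF.join(typesDF, joinExpr, "inner").show()  # equivalent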

Outer Joins aka Full Outer Joins

Records from both the left and right datasets are included in the output. Where the keys don’t match, null values are inserted.
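
Using the same sketch:

# Also accepts "full" or "full_outer" as the join type string
namesDF.join(typesDF, joinExpr, "outer").show()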

Left Joins aka Left Outer Joins

All records from the left dataset are part of the output. Records from the right dataset are included where the keys match; for the rest, null values are inserted.
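
For example:

# Also accepts "leftouter" or "left_outer"
namesDF.join(typesDF, joinExpr, "left").show()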

Right Outer Joins aka Right Joins

All records from the right dataset are part of the output. Records from the left dataset are included where the keys match; for the rest, null values are inserted.
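
For example:

# Also accepts "rightouter" or "right_outer"
namesDF.join(typesDF, joinExpr, "right").show()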

Left Semi Joins

Only records from the left dataset that have a matching key in the right dataset are included. If the left dataset has duplicates then the output will contain duplicates. It can be looked upon as a filter rather than a join: we filter the left dataset based on matching keys from the right dataset.
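
Continuing the sketch, note that only the left-hand columns appear in the output:

# Output contains only columns from namesDF
namesDF.join(typesDF, joinExpr, "left_semi").show()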

Left Anti Joins

The opposite of Left Semi Joins.
Only records from the left dataset that do not have a matching key in the right dataset are included. Similar to a NOT IN filter in SQL.
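
For example:

# Records of namesDF whose Id has no match in typesDF
namesDF.join(typesDF, joinExpr, "left_anti").show()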

Natural Joins

When using a Natural Join, Spark implicitly guesses the columns to join on by matching column names. Implicit behaviour is always risky, so this join should be used with caution. Also, it’s available only in Spark SQL and not in the DataFrame APIs.
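
A sketch in Spark SQL, assuming the DataFrames are first registered as temp views:

namesDF.createOrReplaceTempView("names")
typesDF.createOrReplaceTempView("types")

# Spark implicitly matches on the identically named 'Id' column
spark.sql("SELECT * FROM names NATURAL JOIN types").show()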

Cross Join aka Cartesian Product

Every record from the left dataset is joined to every record in the right dataset. This results in an explosion 🎆🧨 of records. If we cross two datasets with 1,000 records each then the resultant output will be 1,000 * 1,000 = 1,000,000 records. This can easily result in OOM errors.
But there can be valid cases where you want to generate every possible combination of two datasets, so to call it out explicitly you can use the crossJoin method.
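
For example:

# Explicit cartesian product: 4 x 3 = 12 rows with the sample data
namesDF.crossJoin(typesDF).show()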

Note: I believe, as a precautionary step to avoid accidentally creating a cartesian product which may crash the application, Spark versions before 3.x have the cross join config setting set to false by default. You’ll get an error like the one below (checked on v2.4.4) asking you to explicitly set spark.sql.crossJoin.enabled to true. This setting is true by default from v3.x onwards.

pyspark.sql.utils.AnalysisException: 'Detected implicit cartesian product for INNER join between logical plans
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these relations,
or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=True;'
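
If you do need the implicit behaviour on such versions, the setting can be flipped at runtime, e.g.:

# Enable implicit cartesian products (Spark 2.x; default in 3.x)
spark.conf.set("spark.sql.crossJoin.enabled", "true")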

Notes

  1. One thing the many Venn diagrams on SQL joins fail to communicate is that if there are duplicates in your source, they will be in your join output.
  2. Usually the keys you join on would not include null values. But in case there is a scenario where you’d like to join on null keys, you can use the eqNullSafe option in the join condition, as shown below.
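
A sketch of a null-safe join condition using eqNullSafe, which treats two nulls as equal (SQL’s <=> operator):

# Rows where both Ids are null will also match
namesDF.join(typesDF, namesDF["Id"].eqNullSafe(typesDF["Id"])).show()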

Thank you for spending your time in reading this article. Have a good one 😃

