Exploring the Different Join Types in Spark SQL: A Step-by-Step Guide

Understand the Key Concepts and Syntax of Cross, Outer, Anti, Semi, and Self Joins

Ahmed Uz Zaman
Plumbers Of Data Science


This article goes over all the join types that Spark SQL offers in PySpark, with their syntax and simple examples.


The list of joins provided by Spark SQL is:

  • Inner Join
  • Left / Left Outer Join
  • Right / Right Outer Join
  • Outer / Full Join
  • Cross Join
  • Left Anti Join
  • Left Semi Join
  • Self Join

Inner Join

Returns only the rows from both dataframes that have matching values in the columns specified as the join keys.

Syntax

df1.join(df2, df1['key'] == df2['key'], 'inner').show()

In the above example, df1 and df2 are the two dataframes that are being joined, and key is the column that is used as the join key. The last parameter, 'inner', specifies the type of join to perform.

Example

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("InnerJoinExample").getOrCreate()

# Create the first dataframe
df1 = spark.createDataFrame([("A", 1), ("B", 2), ("C", 3)], ["letter", "number"])

# Create the second dataframe
df2 = spark.createDataFrame([("A", 4), ("B", 5), ("D", 6)], ["letter", "value"])

# Perform the inner join
inner_join = df1.join(df2, df1['letter'] == df2['letter'], "inner")

# Show the result of the join
inner_join.show()

# Output:
# +------+------+------+-----+
# |letter|number|letter|value|
# +------+------+------+-----+
# |     A|     1|     A|    4|
# |     B|     2|     B|    5|
# +------+------+------+-----+

In this example, two dataframes, df1 and df2, are created with columns letter and number, and letter and value, respectively. An inner join is performed between df1 and df2 using the column letter as the join key. The result of the inner join is a new dataframe that contains only the rows from both df1 and df2 where the values in the letter column match. The resulting dataframe contains two rows, for letters "A" and "B", which are the only letters that are present in both dataframes.
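Note that because the join condition references df1 and df2 explicitly, both letter columns appear in the result. If you want a single key column, you can pass the column name instead of an expression; a minimal sketch using the same dataframes:

# Joining on the column name deduplicates the key column in the result
df1.join(df2, "letter", "inner").show()

# +------+------+-----+
# |letter|number|value|
# +------+------+-----+
# |     A|     1|    4|
# |     B|     2|    5|
# +------+------+-----+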

Left / Left Outer Join

Returns all the rows from the left dataframe and the matching rows from the right dataframe. If a row in the left dataframe has no match in the right dataframe, the columns from the right dataframe are filled with nulls.

Syntax

df1.join(df2, df1['key'] == df2['key'], 'left').show()
(OR)
df1.join(df2, df1['key'] == df2['key'], 'leftouter').show()

In the above example, df1 and df2 are the two dataframes that are being joined, and key is the column that is used as the join key. The last parameter, 'left', specifies the type of join to perform.

Example

Using the same dataframes as in the inner join example:

# Perform the left outer join
left_outer_join = df1.join(df2, df1['letter'] == df2['letter'], "left_outer")

# Show the result of the join
left_outer_join.show()

# Output:
# +------+------+------+-----+
# |letter|number|letter|value|
# +------+------+------+-----+
# |     A|     1|     A|    4|
# |     B|     2|     B|    5|
# |     C|     3|  null| null|
# +------+------+------+-----+

In this example, two dataframes, df1 and df2, are created with columns letter and number, and letter and value, respectively. A left outer join is performed between df1 and df2 using the column letter as the join key. The result is a new dataframe that contains all the rows from the left dataframe (df1) and the matching rows from the right dataframe (df2); the row for letter "C" has no match in df2, so its df2 columns are null.
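The same join can also be written as a SQL statement by registering the dataframes as temporary views; a minimal sketch (the view names t1 and t2 are arbitrary):

# Register the dataframes as temporary views so they can be queried with SQL
df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")

spark.sql("""
    SELECT t1.letter, t1.number, t2.value
    FROM t1 LEFT OUTER JOIN t2
    ON t1.letter = t2.letter
""").show()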

Right / Right Outer Join

Returns all the rows from the right dataframe and the matching rows from the left dataframe. If a row in the right dataframe has no match in the left dataframe, the columns from the left dataframe are filled with nulls.

Syntax

df1.join(df2, df1['key'] == df2['key'], 'right').show()
(OR)
df1.join(df2, df1['key'] == df2['key'], 'rightouter').show()

In the above example, df1 and df2 are the two dataframes that are being joined, and key is the column that is used as the join key. The last parameter, 'right', specifies the type of join to perform.

Example

Using the same dataframes as in the inner join example:

# Perform the right outer join
right_outer_join = df1.join(df2, df1['letter'] == df2['letter'], "right_outer")

# Show the result of the join
right_outer_join.show()

# Output:
# +------+------+------+-----+
# |letter|number|letter|value|
# +------+------+------+-----+
# |     A|     1|     A|    4|
# |     B|     2|     B|    5|
# |  null|  null|     D|    6|
# +------+------+------+-----+

In this example, two dataframes, df1 and df2, are created with columns letter and number, and letter and value, respectively. A right outer join is performed between df1 and df2 using the column letter as the join key. The result is a new dataframe that contains all the rows from the right dataframe (df2) and the matching rows from the left dataframe (df1). The row for letter "D" has no match in df1, so the columns from df1 are filled with nulls.
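A right outer join can also be expressed as a left outer join with the dataframes swapped; the resulting rows are the same, only the column order differs:

# Equivalent to the right outer join above, with df2 on the left
df2.join(df1, df2['letter'] == df1['letter'], "left").show()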

Outer / Full Join

Returns all the rows from both dataframes, matching and non-matching alike. Where there is no match, the result contains a NULL in place of the missing data.

Syntax

df1.join(df2, df1['key'] == df2['key'], 'outer').show()
(OR)
df1.join(df2, df1['key'] == df2['key'], 'full').show()
(OR)
df1.join(df2, df1['key'] == df2['key'], 'fullouter').show()

In the above example, df1 and df2 are the two dataframes that are being joined, and key is the column that is used as the join key. The last parameter, 'outer', specifies the type of outer join to perform.

Example

Using the same dataframes as in the inner join example:

# Perform the full outer join
full_outer_join = df1.join(df2, df1['letter'] == df2['letter'], "full_outer")

# Show the result of the join
full_outer_join.show()

# Output:
# +------+------+------+-----+
# |letter|number|letter|value|
# +------+------+------+-----+
# |     A|     1|     A|    4|
# |     B|     2|     B|    5|
# |     C|     3|  null| null|
# |  null|  null|     D|    6|
# +------+------+------+-----+

In this example, two dataframes, df1 and df2, are created with columns letter and number, and letter and value, respectively. A full outer join is performed between df1 and df2 using the column letter as the join key. The result of the full outer join is a new dataframe that contains all the rows from both dataframes. Rows with no match in the other dataframe ("C" and "D" here) are still included, with nulls in the columns that come from the other dataframe.
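Since both letter columns appear in the result, a common follow-up is to merge them into a single key column with coalesce; a minimal sketch (row order may vary):

from pyspark.sql.functions import coalesce

# Take df1's letter where present, otherwise fall back to df2's
full_outer_join.select(
    coalesce(df1['letter'], df2['letter']).alias('letter'),
    df1['number'],
    df2['value']
).show()

# +------+------+-----+
# |letter|number|value|
# +------+------+-----+
# |     A|     1|    4|
# |     B|     2|    5|
# |     C|     3| null|
# |     D|  null|    6|
# +------+------+-----+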

Cross Join

Returns all possible combinations of rows from both the dataframes. In other words, it takes every row from one dataframe and matches it with every row in the other dataframe. The result is a new dataframe with all possible combinations of the rows from the two input dataframes.

A cross join is useful when you genuinely need every pairing of rows, for example to generate all combinations of two sets of values. It is not a shortcut for a full outer join. Cross joins are not recommended for large datasets: the result has as many rows as the product of the two input row counts, which can quickly lead to memory issues and poor performance.

Syntax:

df1.crossJoin(df2).show()

In the above example, df1 and df2 are the two dataframes that are being joined using the crossJoin method; no join condition is needed.

Example

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("CrossJoinExample").getOrCreate()

# Create the first dataframe
df1 = spark.createDataFrame([("A", 1), ("B", 2), ("C", 3)], ["letter", "number"])

# Create the second dataframe
df2 = spark.createDataFrame([("X", 4), ("Y", 5), ("Z", 6)], ["symbol", "value"])

# Perform the cross join
cross_join = df1.crossJoin(df2)

# Show the result of the join
cross_join.show()

# Output:
# +------+------+------+-----+
# |letter|number|symbol|value|
# +------+------+------+-----+
# |     A|     1|     X|    4|
# |     A|     1|     Y|    5|
# |     A|     1|     Z|    6|
# |     B|     2|     X|    4|
# |     B|     2|     Y|    5|
# |     B|     2|     Z|    6|
# |     C|     3|     X|    4|
# |     C|     3|     Y|    5|
# |     C|     3|     Z|    6|
# +------+------+------+-----+

In this example, two dataframes, df1 and df2, are created with columns letter and number, and symbol and value, respectively. A cross join is performed between df1 and df2. The result is a new dataframe that contains every possible combination of rows from both dataframes: 3 * 3 = 9 rows here. Because the row count is the product of the input sizes, this join should be used carefully.
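The multiplicative growth is easy to verify by comparing counts:

# The cross join has df1.count() * df2.count() rows: 3 * 3 = 9
print(df1.count() * df2.count())  # 9
print(cross_join.count())         # 9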

Left Anti Join

A left anti join in Spark SQL is a type of left join operation that returns only the rows from the left dataframe that do not have matching values in the right dataframe. It is used to find the rows in one dataframe that do not have corresponding values in another dataframe.

The result of a left anti join is a dataframe that contains only the rows from the left dataframe that do not have matching values in the right dataframe. If a row from the left dataframe has matching values in the right dataframe, it will not be included in the result.

Syntax:

df1.join(df2, df1['key'] == df2['key'], 'left_anti').show()

In the above example, df1 and df2 are the two dataframes that are being joined, and key is the column that is used as the join key. The last parameter, 'left_anti', specifies that this is a left anti join.

Example

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("LeftAntiJoinExample").getOrCreate()

# Create the first dataframe
df1 = spark.createDataFrame([("A", 1), ("B", 2), ("C", 3)], ["letter", "number"])

# Create the second dataframe
df2 = spark.createDataFrame([("A", 4), ("B", 5)], ["letter", "value"])

# Perform the left anti join
left_anti_join = df1.join(df2, df1['letter'] == df2['letter'], "left_anti")

# Show the result of the join
left_anti_join.show()

# Output:
# +------+------+
# |letter|number|
# +------+------+
# |     C|     3|
# +------+------+

In this example, two dataframes, df1 and df2, are created with columns letter and number, and letter and value, respectively. A left anti join is performed between df1 and df2 using the column letter as the join key. The result of the left anti join is a new dataframe that contains only the rows from the first dataframe (df1) that do not match with any rows in the second dataframe (df2). The result of this type of join only includes the columns from the first dataframe, as the purpose of the left anti join is to find the rows that are unique to the first dataframe.
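Spark SQL also accepts LEFT ANTI JOIN directly in a SQL statement; a minimal sketch using temporary views (the view names are arbitrary):

# Register the dataframes as temporary views
df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")

# Same result as above: rows in t1 that have no match in t2
spark.sql("""
    SELECT * FROM t1 LEFT ANTI JOIN t2
    ON t1.letter = t2.letter
""").show()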

Left Semi Join

A left semi join in Spark SQL is a type of join operation that returns only the rows from the left dataframe that have matching values in the right dataframe. It is used to filter one dataframe down to the rows that have corresponding values in another dataframe.

The result of a left semi join is a dataframe that contains only the matching rows from the left dataframe; the columns from the right dataframe are not included in the result.

Syntax

df1.join(df2, df1['key'] == df2['key'], 'leftsemi').show()

In the above example, df1 and df2 are the two dataframes that are being joined, and key is the column that is used as the join key. The last parameter, 'leftsemi', specifies that this is a left semi join.

Example

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("LeftSemiJoinExample").getOrCreate()

# Create the first dataframe
df1 = spark.createDataFrame([("A", 1), ("B", 2), ("C", 3)], ["letter", "number"])

# Create the second dataframe
df2 = spark.createDataFrame([("A", 4), ("B", 5)], ["letter", "value"])

# Perform the left semi join
left_semi_join = df1.join(df2, df1['letter'] == df2['letter'], "leftsemi")

# Show the result of the join
left_semi_join.show()

# Output:
# +------+------+
# |letter|number|
# +------+------+
# |     A|     1|
# |     B|     2|
# +------+------+

In this example, two dataframes, df1 and df2, are created with columns letter and number, and letter and value, respectively. A left semi join is performed between df1 and df2 using the column letter as the join key. The result is a new dataframe that contains only the rows from the first dataframe (df1) that have a match in the second dataframe (df2). Only the columns from the first dataframe are included, as the purpose of the left semi join is to filter the first dataframe to its matching rows.
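A left semi join behaves like an IN or EXISTS filter in SQL, and Spark SQL accepts LEFT SEMI JOIN directly in a SQL statement; a minimal sketch using temporary views:

df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")

# Same result as above: rows in t1 that have a match in t2
spark.sql("""
    SELECT * FROM t1 LEFT SEMI JOIN t2
    ON t1.letter = t2.letter
""").show()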

Self Join

A self join in Spark SQL is a join operation in which a dataframe is joined with itself. It is used to compare the values within a single dataframe and return the rows that match specified criteria.

For example, a self join could be used to find all pairs of rows in a dataframe where the values in two columns are equal. The result would be a new dataframe that contains only the rows that meet the specified criteria.

Syntax

df.alias("df1").join(df.alias("df2"), df1['key'] == df2['key']).show()

In the above example, df is the dataframe that is being joined with itself, and key is the column that is used as the join key. To perform a self join, the dataframe must be aliased with two different names, in this case df1 and df2, and the join condition must reference the aliases (for example, with col("df1.key")) so Spark can tell the two sides apart.

Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a Spark session
spark = SparkSession.builder.appName("SelfJoinExample").getOrCreate()

# Create a dataframe
df = spark.createDataFrame([("A", 1), ("B", 2), ("C", 3)], ["letter", "number"])

# Perform the self join, referencing the aliases in the join condition
self_join = df.alias("df1").join(df.alias("df2"), col("df1.letter") == col("df2.letter"))

# Show the result of the join
self_join.show()

# Output:
# +------+------+------+------+
# |letter|number|letter|number|
# +------+------+------+------+
# |     A|     1|     A|     1|
# |     B|     2|     B|     2|
# |     C|     3|     C|     3|
# +------+------+------+------+

In this example, a single dataframe df is created with columns letter and number. A self join is performed between the dataframe and itself, with the column letter used as the join key. The two instances of the dataframe are given unique aliases (df1 and df2) so the join condition can distinguish them. The result contains the pairs of rows whose letter values match; since letter is unique here, each row simply matches itself.
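A more realistic use of a self join is resolving a hierarchy stored in a single table, such as looking up each employee's manager. A minimal sketch with hypothetical data (row order may vary):

from pyspark.sql.functions import col

# Hypothetical employee table: (id, name, manager_id)
emp = spark.createDataFrame(
    [(1, "Alice", None), (2, "Bob", 1), (3, "Carol", 1)],
    ["id", "name", "manager_id"],
)

# Join the table with itself to attach each employee's manager name
emp.alias("e").join(
    emp.alias("m"), col("e.manager_id") == col("m.id"), "left"
).select(
    col("e.name").alias("employee"),
    col("m.name").alias("manager"),
).show()

# +--------+-------+
# |employee|manager|
# +--------+-------+
# |   Alice|   null|
# |     Bob|  Alice|
# |   Carol|  Alice|
# +--------+-------+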

Conclusion

In this article, you have learned the Spark SQL join types INNER, LEFT OUTER, RIGHT OUTER, OUTER, CROSS, LEFT ANTI, LEFT SEMI, and SELF joins, their usage, and examples with Python. Follow for more PySpark-related articles, and do check out my PySpark 101 series, which covers all the basics of Spark with Python.

Enjoy Reading..!!
