Exploring the Different Join Types in Spark SQL: A Step-by-Step Guide
Understand the Key Concepts and Syntax of Cross, Outer, Anti, Semi, and Self Joins
This article will go over all the different types of joins that PySpark SQL has to offer, with their syntax and simple examples.
The list of joins provided by Spark SQL is:
- Inner Join
- Left / Left Outer Join
- Right / Right Outer Join
- Outer / Full Join
- Cross Join
- Left Anti Join
- Left Semi Join
- Self Join
Inner Join
Returns only the rows that have matching values in the join key column(s) of both dataframes.
Syntax
df1.join(df2, df1['key'] == df2['key'], 'inner').show()
In the above example, df1 and df2 are the two dataframes being joined, and key is the column used as the join key. The last parameter, 'inner', specifies the join type to perform. Note that joining on an expression like df1['key'] == df2['key'] keeps both key columns in the result; passing just the column name instead (on='key') keeps a single key column.
Example
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("InnerJoinExample").getOrCreate()
# Create the first dataframe
df1 = spark.createDataFrame([("A", 1), ("B", 2), ("C", 3)], ["letter", "number"])
# Create the second dataframe
df2 = spark.createDataFrame([("A", 4), ("B", 5), ("D", 6)], ["letter", "value"])
# Perform the inner join; joining on the column name keeps a single "letter" column
inner_join = df1.join(df2, on="letter", how="inner")
# Show the result of the join
inner_join.show()
# Output:
# +------+------+-----+
# |letter|number|value|
# +------+------+-----+
# |     A|     1|    4|
# |     B|     2|    5|
# +------+------+-----+
In this example, two dataframes, df1 and df2, are created with columns letter and number, and letter and value, respectively. An inner join is performed between df1 and df2 using the column letter as the join key. The result of the inner join is a new dataframe that contains only the rows from both df1 and df2 where the values in the letter column match. The resulting dataframe contains two rows, for letters "A" and "B", which are the only letters present in both dataframes.
Left / Left Outer Join
Returns all the rows from the left dataframe and the matching rows from the right dataframe. If there is no matching value in the right dataframe, the columns from the right dataframe are filled with null.
Syntax
df1.join(df2, df1['key'] == df2['key'], 'left').show()
(OR)
df1.join(df2, df1['key'] == df2['key'], 'leftouter').show()
In the above example, df1 and df2 are the two dataframes being joined, and key is the column used as the join key. The last parameter, 'left', specifies that a left outer join should be performed.
Example
Using the same dataframes as in the inner join example:
# Perform the left outer join; joining on the column name keeps a single "letter" column
left_outer_join = df1.join(df2, on="letter", how="left_outer")
# Show the result of the join
left_outer_join.show()
# Output:
# +------+------+-----+
# |letter|number|value|
# +------+------+-----+
# |     A|     1|    4|
# |     B|     2|    5|
# |     C|     3| null|
# +------+------+-----+
In this example, two dataframes, df1 and df2, are created with columns letter and number, and letter and value, respectively. A left outer join is performed between df1 and df2 using the column letter as the join key. The result of the left outer join is a new dataframe that contains all the rows from the left dataframe (df1) and the matching rows from the right dataframe (df2).
Right / Right Outer Join
Returns all the rows from the right dataframe and the matching rows from the left dataframe. If there is no matching value in the left dataframe, the columns from the left dataframe are filled with null.
Syntax
df1.join(df2, df1['key'] == df2['key'], 'right').show()
(OR)
df1.join(df2, df1['key'] == df2['key'], 'rightouter').show()
In the above example, df1 and df2 are the two dataframes being joined, and key is the column used as the join key. The last parameter, 'right', specifies that a right outer join should be performed.
Example
Using the same dataframes as in the inner join example:
# Perform the right outer join; joining on the column name keeps a single "letter" column
right_outer_join = df1.join(df2, on="letter", how="right_outer")
# Show the result of the join
right_outer_join.show()
# Output:
# +------+------+-----+
# |letter|number|value|
# +------+------+-----+
# |     A|     1|    4|
# |     B|     2|    5|
# |     D|  null|    6|
# +------+------+-----+
In this example, two dataframes, df1 and df2, are created with columns letter and number, and letter and value, respectively. A right outer join is performed between df1 and df2 using the column letter as the join key. The result of the right outer join is a new dataframe that contains all the rows from the right dataframe (df2) and the matching rows from the left dataframe (df1). If a row in the right dataframe has no match in the left dataframe, it is still retained, and the columns from the left dataframe are filled with null.
Outer / Full Join
Returns all the rows from both the dataframes, including the matching and non-matching rows. If there are no matching values, then the result will contain a NULL value in place of the missing data.
Syntax
df1.join(df2, df1['key'] == df2['key'], 'outer').show()
(OR)
df1.join(df2, df1['key'] == df2['key'], 'full').show()
(OR)
df1.join(df2, df1['key'] == df2['key'], 'fullouter').show()
In the above example, df1 and df2 are the two dataframes being joined, and key is the column used as the join key. The last parameter, 'outer', specifies that a full outer join should be performed.
Example
Using the same dataframes as in the inner join example:
# Perform the full outer join; joining on the column name keeps a single "letter" column
full_outer_join = df1.join(df2, on="letter", how="full_outer")
# Show the result of the join
full_outer_join.show()
# Output:
# +------+------+-----+
# |letter|number|value|
# +------+------+-----+
# |     A|     1|    4|
# |     B|     2|    5|
# |     C|     3| null|
# |     D|  null|    6|
# +------+------+-----+
In this example, two dataframes, df1 and df2, are created with columns letter and number, and letter and value, respectively. A full outer join is performed between df1 and df2 using the column letter as the join key. The result of the full outer join is a new dataframe that contains all the rows from both dataframes. If a row in one dataframe has no match in the other, it is still retained, and the columns from the dataframe with no match are filled with null.
Cross Join
Returns all possible combinations of rows from both the dataframes. In other words, it takes every row from one dataframe and matches it with every row in the other dataframe. The result is a new dataframe with all possible combinations of the rows from the two input dataframes.
A cross join is useful when you genuinely need every pairwise combination of rows, for example to generate candidate pairs for comparison. Cross joins are not recommended for large datasets, as they produce a number of records equal to the product of the two input row counts, leading to memory issues and poor performance.
Syntax:
df1.crossJoin(df2).show()
In the above example, df1 and df2 are the two dataframes being joined using the crossJoin method.
Example
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("CrossJoinExample").getOrCreate()
# Create the first dataframe
df1 = spark.createDataFrame([("A", 1), ("B", 2), ("C", 3)], ["letter", "number"])
# Create the second dataframe
df2 = spark.createDataFrame([("X", 4), ("Y", 5), ("Z", 6)], ["symbol", "value"])
# Perform the cross join
cross_join = df1.crossJoin(df2)
# Show the result of the join
cross_join.show()
# Output:
# +------+------+------+-----+
# |letter|number|symbol|value|
# +------+------+------+-----+
# |     A|     1|     X|    4|
# |     A|     1|     Y|    5|
# |     A|     1|     Z|    6|
# |     B|     2|     X|    4|
# |     B|     2|     Y|    5|
# |     B|     2|     Z|    6|
# |     C|     3|     X|    4|
# |     C|     3|     Y|    5|
# |     C|     3|     Z|    6|
# +------+------+------+-----+
In this example, two dataframes, df1 and df2, are created with columns letter and number, and symbol and value, respectively. A cross join is performed between df1 and df2. The result of the cross join is a new dataframe that contains all possible combinations of rows from both dataframes. This type of join produces a dataframe whose row count is the product of the two input row counts, so it should be used carefully.
Left Anti Join
A left anti join in Spark SQL is a type of left join operation that returns only the rows from the left dataframe that do not have matching values in the right dataframe. It is used to find the rows in one dataframe that do not have corresponding values in another dataframe.
The result of a left anti join is a dataframe that contains only the rows from the left dataframe that do not have matching values in the right dataframe. If a row from the left dataframe has matching values in the right dataframe, it will not be included in the result.
Syntax:
df1.join(df2, df1['key'] == df2['key'], 'left_anti').show()
In the above example, df1 and df2 are the two dataframes being joined, and key is the column used as the join key. The last parameter, 'left_anti', specifies that this is a left anti join.
Example
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("LeftAntiJoinExample").getOrCreate()
# Create the first dataframe
df1 = spark.createDataFrame([("A", 1), ("B", 2), ("C", 3)], ["letter", "number"])
# Create the second dataframe
df2 = spark.createDataFrame([("A", 4), ("B", 5)], ["letter", "value"])
# Perform the left anti join
left_anti_join = df1.join(df2, df1['letter'] == df2['letter'], "left_anti")
# Show the result of the join
left_anti_join.show()
# Output:
# +------+------+
# |letter|number|
# +------+------+
# |     C|     3|
# +------+------+
In this example, two dataframes, df1 and df2, are created with columns letter and number, and letter and value, respectively. A left anti join is performed between df1 and df2 using the column letter as the join key. The result of the left anti join is a new dataframe that contains only the rows from the first dataframe (df1) that do not match any rows in the second dataframe (df2). The result includes only the columns from the first dataframe, as the purpose of the left anti join is to find the rows that are unique to the first dataframe.
Left Semi Join
A left semi join in Spark SQL is a type of join operation that returns only the rows from the left dataframe that have matching values in the right dataframe. It is used to find the rows in one dataframe that have corresponding values in another dataframe.
The result of a left semi join is a dataframe that contains only the rows from the left dataframe that have a match in the right dataframe. The columns from the right dataframe are not included in the result.
Syntax
df1.join(df2, df1['key'] == df2['key'], 'leftsemi').show()
In the above example, df1 and df2 are the two dataframes being joined, and key is the column used as the join key. The last parameter, 'leftsemi', specifies that this is a left semi join.
Example
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("LeftSemiJoinExample").getOrCreate()
# Create the first dataframe
df1 = spark.createDataFrame([("A", 1), ("B", 2), ("C", 3)], ["letter", "number"])
# Create the second dataframe
df2 = spark.createDataFrame([("A", 4), ("B", 5)], ["letter", "value"])
# Perform the left semi join
left_semi_join = df1.join(df2, df1['letter'] == df2['letter'], "leftsemi")
# Show the result of the join
left_semi_join.show()
# Output:
# +------+------+
# |letter|number|
# +------+------+
# |     A|     1|
# |     B|     2|
# +------+------+
In this example, two dataframes, df1 and df2, are created with columns letter and number, and letter and value, respectively. A left semi join is performed between df1 and df2 using the column letter as the join key. The result of the left semi join is a new dataframe that contains only the rows from the first dataframe (df1) that match rows in the second dataframe (df2). The result includes only the columns from the first dataframe, as the purpose of the left semi join is to find the rows in the first dataframe that have matches in the second.
Self Join
A self join in Spark SQL is a join operation in which a dataframe is joined with itself. It is used to compare the values within a single dataframe and return the rows that match specified criteria.
For example, a self join could be used to find all pairs of rows in a dataframe where the values in two columns are equal. The result would be a new dataframe that contains only the rows that meet the specified criteria.
Syntax
df.alias("df1").join(df.alias("df2"), col("df1.key") == col("df2.key")).show()
In the above example, df is the dataframe being joined with itself, and key is the column used as the join key. To perform a self join, the dataframe must be aliased with two different names, in this case df1 and df2, so that the join condition can distinguish the two sides (col is imported from pyspark.sql.functions).
Example
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("SelfJoinExample").getOrCreate()
# Create a dataframe
df = spark.createDataFrame([("A", 1), ("B", 2), ("C", 3)], ["letter", "number"])
# Perform the self join; col() references columns through the aliases
from pyspark.sql.functions import col
self_join = df.alias("df1").join(df.alias("df2"), col("df1.letter") == col("df2.letter"))
# Show the result of the join
self_join.show()
# Output:
# +------+------+------+------+
# |letter|number|letter|number|
# +------+------+------+------+
# |     A|     1|     A|     1|
# |     B|     2|     B|     2|
# |     C|     3|     C|     3|
# +------+------+------+------+
In this example, a single dataframe df is created with columns letter and number. A self join is performed between the dataframe and itself, with the column letter used as the join key. The result of the self join is a new dataframe that pairs each row of one aliased copy with every row of the other that satisfies the join condition; here, each row matches only itself. The two instances of the original dataframe are given unique aliases to distinguish them in the join.
Conclusion
In this article, you have learned the Spark SQL join types INNER, LEFT OUTER, RIGHT OUTER, OUTER, CROSS, LEFT ANTI, LEFT SEMI, and SELF, their usage, and examples with Python. Follow for more PySpark-related articles, and do check out my PySpark 101 series, which covers all the basics of Spark with Python.
Enjoy Reading..!!