Joins in PySpark
In PySpark, join() is used to combine two DataFrames.
It supports all of the basic join types available in traditional SQL: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, LEFT ANTI, LEFT SEMI, CROSS, and SELF JOIN.
PySpark joins are wide transformations that involve shuffling data across the network.
Syntax for Join in PySpark:
join(self, other, on=None, how=None)
The join() operation takes the parameters below and returns a DataFrame.
- other: the right side of the join
- on: a string for the join column name, a list of column names, or a join expression (Column)
- how: the join type; the default is inner. Must be one of inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, and left_anti.
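All of the examples below join two small DataFrames, DF_01 and DF_02, on an id column. A minimal setup sketch (the sample names, departments, and values are illustrative assumptions, not part of the original):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JoinExamples").getOrCreate()

# Left-side dataset; id 4 has no match on the right
DF_01 = spark.createDataFrame(
    [(1, "Smith"), (2, "Rose"), (3, "Williams"), (4, "Jones")],
    ["id", "name"])

# Right-side dataset; id 5 has no match on the left
DF_02 = spark.createDataFrame(
    [(1, "Sales"), (2, "Finance"), (3, "Marketing"), (5, "IT")],
    ["id", "dept"])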
PySpark Inner Join DataFrame:
Inner join is the default and the most commonly used join in PySpark. It joins two datasets on key columns; rows whose keys don't match are dropped from both datasets.
DF_01.join(DF_02, DF_01.id == DF_02.id, "inner").show(truncate=False)
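When both DataFrames name the key column the same way, you can also pass the column name itself (or a list of names) as on; Spark then keeps a single id column in the result instead of two:

# Equivalent inner join keyed by column name; avoids a duplicated id column
DF_01.join(DF_02, on="id", how="inner").show(truncate=False)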
PySpark Full Outer Join:
Full outer join returns all rows from both datasets; where the join expression doesn't match, it returns null for that record's columns from the other side.
DF_01.join(DF_02, DF_01.id == DF_02.id, "outer").show(truncate=False)
PySpark Left Outer Join:
Left outer join returns all rows from the left dataset regardless of whether a match is found on the right dataset. When the join expression doesn't match, it assigns null for the right-side columns of that record, and right-side records with no match are dropped.
DF_01.join(DF_02, DF_01.id == DF_02.id, "leftouter").show(truncate=False)
PySpark Right Outer Join:
Right outer join is the opposite of left join: it returns all rows from the right dataset regardless of whether a match is found on the left dataset. When the join expression doesn't match, it assigns null for the left-side columns of that record, and left-side records with no match are dropped.
DF_01.join(DF_02, DF_01.id == DF_02.id, "rightouter").show()
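The left semi and left anti types listed in the syntax section never return columns from the right dataset: leftsemi keeps only left rows that have a match, while leftanti keeps only left rows that have no match. For example:

# Left rows whose id exists in DF_02 (only DF_01's columns are returned)
DF_01.join(DF_02, DF_01.id == DF_02.id, "leftsemi").show(truncate=False)

# Left rows whose id does not exist in DF_02
DF_01.join(DF_02, DF_01.id == DF_02.id, "leftanti").show(truncate=False)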
PySpark Self Join:
There is no dedicated self-join type; we can use any of the join types explained above to join a DataFrame to itself, as in the sketch below.
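Because both sides carry the same column names, alias the DataFrame before joining so the columns can be disambiguated. A minimal sketch using the DF_01 defined earlier (with real data you would typically join something like an employee's superior_id to a manager's id; the column choices here are purely illustrative):

from pyspark.sql.functions import col

DF_A = DF_01.alias("a")
DF_B = DF_01.alias("b")

# Inner self join on id; select through the aliases to avoid ambiguous columns
DF_A.join(DF_B, col("a.id") == col("b.id"), "inner") \
    .select(col("a.id"), col("a.name").alias("a_name"), col("b.name").alias("b_name")) \
    .show(truncate=False)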