Little known Spark DataFrame join types

Probably most of you know the basic join types from SQL: left, right, inner and outer. Since these are supported by most of the data-oriented frameworks like pandas, SQL or R, it is easy to restrict oneself to these 4 approaches during the data flow design stage.

However, Spark DataFrame API (since version 2.0) has 2 other types: left_semi (alias leftsemi), and left_anti. In the following we will explore what they do.

Let’s define some data to work with:

case class Article(author: String, title: String, id: Int)
case class ArticleView(articleId: Int, viewCount: Int)
val articles = Seq(
Article("Gergely", "Spark joins", 0),
Article("Gergely", "In preparation", 3),
Article("Kris", "Athena with DataGrip", 1),
Article("Kris", "Something Else", 2)
).toDF
val views = Seq(
ArticleView(0, 1), //my article is not very popular :(
ArticleView(1, 123),//Kris' article has been viewed 123 times
ArticleView(2, 24), //Kris' not so popular article
ArticleView(10, 10) //deleted article
).toDF

The left_semi join type is essentially a filter on the left table based on the join condition:

articles
.join(views, articles("id") === views("articleId"), "left_semi")
+-------+--------------------+---+
| author| title| id|
+-------+--------------------+---+
|Gergely| Spark joins| 0|
| Kris|Athena with DataGrip| 1|
| Kris| Something Else| 2|
+-------+--------------------+---+

Notice that my second article (“In preparation”) is missing from the output: it has been filtered out, because it has no corresponding entry in the views DataFrame. The output schema is identical to the left table’s schema.

The left_anti does the exact opposite of left_semi: it filters out all entries from the left table which have a corresponding entry in the right table. In our case, only my article in preparation remains:

articles
.join(views, articles("id") === views("articleId"), "left_anti")
+-------+--------------+---+
| author| title| id|
+-------+--------------+---+
|Gergely|In preparation| 3|
+-------+--------------+---+

Unfortunately, these join types are only available for DataFrames, the Datasets do not support these.