Little known Spark DataFrame join types

Published in

datamindedbe

2 min readNov 29, 2018

Probably most of you know the basic join types from SQL: left, right, inner and outer. Since these are supported by most of the data-oriented frameworks like pandas, SQL or R, it is easy to restrict oneself to these 4 approaches during the data flow design stage.

However, Spark DataFrame API (since version 2.0) has 2 other types: left_semi (alias leftsemi), and left_anti. In the following we will explore what they do.

Let’s define some data to work with:

case class Article(author: String, title: String, id: Int)
case class ArticleView(articleId: Int, viewCount: Int)val articles = Seq(
    Article("Gergely", "Spark joins", 0),
    Article("Gergely", "In preparation", 3),
    Article("Kris", "Athena with DataGrip", 1),
    Article("Kris", "Something Else", 2)
).toDFval views = Seq(
    ArticleView(0, 1),  //my article is not very popular :(
    ArticleView(1, 123),//Kris' article has been viewed 123 times
    ArticleView(2, 24), //Kris' not so popular article
    ArticleView(10, 10) //deleted article
).toDF

The left_semi join type is essentially a filter on the left table based on the join condition:

articles
  .join(views, articles("id") === views("articleId"), "left_semi")+-------+--------------------+---+
| author|               title| id|
+-------+--------------------+---+
|Gergely|         Spark joins|  0|
|   Kris|Athena with DataGrip|  1|
|   Kris|      Something Else|  2|
+-------+--------------------+---+

Notice that my second article (“In preparation”) is missing from the output: it has been filtered out, because it has no corresponding entry in the views DataFrame. The output schema is identical to the left table’s schema.

The left_anti does the exact opposite of left_semi: it filters out all entries from the left table which have a corresponding entry in the right table. In our case, only my article in preparation remains:

articles
  .join(views, articles("id") === views("articleId"), "left_anti")+-------+--------------+---+
| author|         title| id|
+-------+--------------+---+
|Gergely|In preparation|  3|
+-------+--------------+---+

Unfortunately, these join types are only available for DataFrames, the Datasets do not support these.

Little known Spark DataFrame join types

Written by Gergely Soti