Spark Dataset/DataFrame IN/NOT_IN expression

Mark Andreev
2 min readJul 2, 2022

--

The gist of the article

In this post we will explain how to write IN/NOT_IN expression in dataset/dataframe api:
- SELECT * FROM main WHERE index IN (SELECT index FROM dict)
- SELECT * FROM main WHERE index NOT IN (SELECT index FROM dict)

Spark Dataset/DataFrame API has equal expression for IN/NOT_IN. You can use join for these needs. Spark supports extra types of join like semi join and anti join.
- Use “left_semi” for IN
- Use “left_anti” for NOT IN

We can summarise all knownloadge of this post in this price of code:

“LEFT SEMI” join is synonym for “IN” expression

LEFT_SEMI join example

“LEFT_ANTI” join is synonym for “NOT IN” expression

LEFT_ANTI join example

What difference between left semi join and inner join? When you use inner join records in left table can be repeated when one key relates to multiple records in right table.

INNER join issue when dict contains duplicates

Links:

--

--

No responses yet