Spark Dataset/DataFrame IN/NOT_IN expression
In this post we will explain how to write IN/NOT_IN expression in dataset/dataframe api:
- SELECT * FROM main WHERE index IN (SELECT index FROM dict)
- SELECT * FROM main WHERE index NOT IN (SELECT index FROM dict)
Spark Dataset/DataFrame API has equal expression for IN/NOT_IN. You can use join for these needs. Spark supports extra types of join like semi join and anti join.
- Use “left_semi” for IN
- Use “left_anti” for NOT IN
We can summarise all knownloadge of this post in this price of code:
“LEFT SEMI” join is synonym for “IN” expression
“LEFT_ANTI” join is synonym for “NOT IN” expression
What difference between left semi join and inner join? When you use inner join records in left table can be repeated when one key relates to multiple records in right table.
Links: