Exclude a list of items in PySpark DataFrame

Jun Wan
2 min readFeb 25, 2020

--

list exclusion

Sometimes you have two dataframes, and want to exclude from one dataframe all the values in the other dataframe. For example, you want to calculate the word count for a text corpus, but want to skip a set of stop words. In Python, you can simply do:

list_c = [x for x in list_a if x not in list_b]

How to do that in PySpark? Let’s set up an example:

from pyspark.sql.types import StringType
from pyspark.sql.functions import explode, lower, split, length, col
df = spark.createDataFrame(["Be not afraid of greatness.",
"To be, or not to be, that is the question"],
StringType())
stop_words_df = spark.createDataFrame("not of or the".split(),
StringType())

We want to get the top 5 words in the df, but exclude the stop words listed in stop_words_df.

First let’s convert the df to a plain word list, using the split and explode functions.

all_words_df = df.select(explode(split(lower(col("value")), 
'[^a-z]')).alias("word"))\
.where(length("word") > 1)
display(all_words_df)

all_words_df now contains the word list:

  word
----------
be
not
afraid
of
greatness
to
be
or
not
to
be
that
is
the
question

Now we will use the all_words_df to left join with the stop_words_df, and the words in all_words_df but without a match in stop_words_df will result in a null value, with which we can filter:

remain_df = all_words_df.join(stop_words_df, 
col("word") == stop_words_df.value,
how="left")\
.filter(col("value").isNull())
display(remain_df)
word | value
---------|----------
afraid null
be null
be null
be null
is null
question null
greatness null
that null
to null
to null

With this remaining word list, we can then apply groupBy and count and sort to get the top words:

top5_df = remain_df.groupBy("word")\
.count()\
.sort(f.col("count"), ascending=False)\
.limit(5)
display(top5_df)
word | count
---------|---------
be 3
to 2
greatness 1
question 1
is 1

Here you get the top words excluding the stop words. To summarize, useful PySpark functions used in this exercise:

  • split
  • explode
  • (left) join
  • groupBy
  • count
  • sort

--

--

Jun Wan

Software Builder in Boston Area. Occasionally write about technologies and life.