Sometimes you have two DataFrames and want to exclude from one all the values that appear in the other. For example, you want to calculate the word count for a text corpus but skip a set of stop words. In plain Python, you can simply do:
list_c = [x for x in list_a if x not in list_b]
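A minor tweak for large lists: turn list_b into a set first, so each membership test is O(1) instead of a linear scan:
stop = set(list_b)
list_c = [x for x in list_a if x not in stop]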
How to do that in PySpark? Let’s set up an example:
from pyspark.sql.types import StringType
from pyspark.sql.functions import explode, lower, split, length, col

df = spark.createDataFrame(["Be not afraid of greatness.",
                            "To be, or not to be, that is the question"],
                           StringType())

stop_words_df = spark.createDataFrame("not of or the".split(),
                                      StringType())
We want to get the top 5 words in df, but exclude the stop words listed in stop_words_df. (Both DataFrames have a single string column named value, the default column name when you create a DataFrame from a list of strings with StringType.)
First, let's convert df to a plain word list using the split and explode functions.
all_words_df = df.select(explode(split(lower(col("value")),
                                       '[^a-z]')).alias("word")) \
                 .where(length("word") > 1)
display(all_words_df)
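Note: display() is a Databricks notebook helper; in a plain PySpark session, show() gives the same effect:
all_words_df.show(truncate=False)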
all_words_df now contains the word list:
word
----------
be
not
afraid
of
greatness
to
be
or
not
to
be
that
is
the
question
Now we left join all_words_df with stop_words_df. Words in all_words_df that have no match in stop_words_df end up with a null in the joined value column, which we can filter on:
remain_df = all_words_df.join(stop_words_df,
                              col("word") == stop_words_df.value,
                              how="left") \
                        .filter(col("value").isNull())
display(remain_df)

word      | value
----------|------
afraid    | null
be        | null
be        | null
be        | null
is        | null
question  | null
greatness | null
that      | null
to        | null
to        | null
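As an aside, PySpark also supports a left anti join, which expresses this exclusion in a single step: it keeps only the left-side rows with no match, and returns only the left-side columns, so no null filter is needed. A minimal sketch, equivalent to the join-plus-filter above:
remain_df = all_words_df.join(stop_words_df,
                              col("word") == stop_words_df.value,
                              how="left_anti")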
With this remaining word list, we can then apply groupBy, count, and sort to get the top words:
top5_df = remain_df.groupBy("word") \
                   .count() \
                   .sort(col("count"), ascending=False) \
                   .limit(5)
display(top5_df)

word      | count
----------|------
be        | 3
to        | 2
greatness | 1
question  | 1
is        | 1
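A small variation: instead of the ascending keyword, you can sort on a descending column expression, which reads a bit more naturally in a method chain:
top5_df = remain_df.groupBy("word") \
                   .count() \
                   .sort(col("count").desc()) \
                   .limit(5)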
Here you get the top words, excluding the stop words. To summarize, the useful PySpark functions used in this exercise are:
- split
- explode
- (left) join
- groupBy
- count
- sort
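Putting it all together, here is the whole pipeline as a reusable function. This is a sketch under the same assumptions as above (a single string column named value in both DataFrames); the name top_words is mine, and it uses the left anti join variant:
def top_words(df, stop_words_df, n=5):
    # Tokenize: lowercase, split on non-letters, drop empty and 1-letter tokens.
    words = df.select(explode(split(lower(col("value")),
                                    '[^a-z]')).alias("word")) \
              .where(length("word") > 1)
    # Exclude stop words, then count and rank.
    return words.join(stop_words_df,
                      col("word") == stop_words_df.value,
                      how="left_anti") \
                .groupBy("word") \
                .count() \
                .sort(col("count").desc()) \
                .limit(n)

display(top_words(df, stop_words_df))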