Sometimes you have two DataFrames and want to exclude from one all the values that appear in the other. For example, you want to calculate the word count for a text corpus but skip a set of stop words. In plain Python, you can simply do:
list_c = [x for x in list_a if x not in list_b]
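A minor tweak for large lists: turn list_b into a set first, so each membership test is O(1) instead of a linear scan:
stop = set(list_b)
list_c = [x for x in list_a if x not in stop]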
How to do that in PySpark? Let’s set up an example:
from pyspark.sql.types import StringType
from pyspark.sql.functions import explode, lower, split, length, col

df = spark.createDataFrame(["Be not afraid of greatness.",
                            "To be, or not to be, that is the question"],
                           StringType())

stop_words_df = spark.createDataFrame("not of or the".split(),
                                      StringType())
We want to get the top 5 words in df, but exclude the stop words listed in stop_words_df. (Both DataFrames have a single string column named value, the default column name when you create a DataFrame from a list of strings with StringType.)
First, let's convert df to a plain word list using the split and explode functions.
all_words_df = df.select(explode(split(lower(col("value")),
                                       '[^a-z]')).alias("word")) \
                 .where(length("word") > 1)
display(all_words_df)
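Note: display() is a Databricks notebook helper; in a plain PySpark session, show() gives the same effect:
all_words_df.show(truncate=False)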
all_words_df now contains the word list:
word
----------
be
not
afraid
of
greatness
to
be
or
not
to
be
that
is
the
question
Now we left join all_words_df with stop_words_df. Words in all_words_df that have no match in stop_words_df end up with a null in the joined value column, which we can filter on:
remain_df = all_words_df.join(stop_words_df,
                              col("word") == stop_words_df.value,
                              how="left") \
                        .filter(col("value").isNull())
display(remain_df)

word      | value
----------|------
afraid    | null
be        | null
be        | null
be        | null
is        | null
question  | null
greatness | null
that      | null
to        | null
to        | null
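As an aside, PySpark also supports a left anti join, which expresses this exclusion in a single step: it keeps only the left-side rows with no match, and returns only the left-side columns, so no null filter is needed. A minimal sketch, equivalent to the join-plus-filter above:
remain_df = all_words_df.join(stop_words_df,
                              col("word") == stop_words_df.value,
                              how="left_anti")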
With this remaining word list, we can then apply groupBy, count, and sort to get the top words:
top5_df = remain_df.groupBy("word") \
                   .count() \
                   .sort(col("count"), ascending=False) \
                   .limit(5)
display(top5_df)

word      | count
----------|------
be        | 3
to        | 2
greatness | 1
question  | 1
is        | 1
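A small variation: instead of the ascending keyword, you can sort on a descending column expression, which reads a bit more naturally in a method chain:
top5_df = remain_df.groupBy("word") \
                   .count() \
                   .sort(col("count").desc()) \
                   .limit(5)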
Here you get the top words, excluding the stop words. To summarize, the useful PySpark functions used in this exercise are:
- split
- explode
- (left) join
- groupBy
- count
- sort
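Putting it all together, here is the whole pipeline as a reusable function. This is a sketch under the same assumptions as above (a single string column named value in both DataFrames); the name top_words is mine, and it uses the left anti join variant:
def top_words(df, stop_words_df, n=5):
    # Tokenize: lowercase, split on non-letters, drop empty and 1-letter tokens.
    words = df.select(explode(split(lower(col("value")),
                                    '[^a-z]')).alias("word")) \
              .where(length("word") > 1)
    # Exclude stop words, then count and rank.
    return words.join(stop_words_df,
                      col("word") == stop_words_df.value,
                      how="left_anti") \
                .groupBy("word") \
                .count() \
                .sort(col("count").desc()) \
                .limit(n)

display(top_words(df, stop_words_df))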