[Project] Once Upon a Spark: Analyzing Shrek’s Script with PySpark

A beginner-level project to start using PySpark

Humberto Rendon
Byte-Sized Data
5 min read · Apr 7, 2023


[Image: the Shrek movie logo with Spark’s logo on top]

Counting things is a great way to get started with Spark. Tutorials usually count the words in a book as an introduction to Spark. In my case, I felt it would be a little more fun to use the script for the movie Shrek.

Let’s map out this project before we begin. Outlining a project is always a good idea, especially when working with big data. For this project we will focus on five stages:

  • Reading: Read the script from a text file. To do this we might need to search for the script and create the file ourselves.
  • Tokenizing: Tokenize every word.
  • Cleaning: Remove punctuation, character indications, and anything else that isn’t really a word.
  • Counting: Count how many times each word occurs in the script.
  • Answering: Show which were the most common words (top 10, 50, 100).

Before starting our process, let’s import the necessary libraries and create our Spark Session.

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, explode, lower, regexp_extract

# Create (or reuse) the Spark session that will drive the whole job
spark = (SparkSession
    .builder
    .appName("ShrekWordCounter")
    .getOrCreate())
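
If the console fills up with INFO logs while you follow along, you can optionally turn the verbosity down. This is just a quality-of-life tweak, not part of the word count itself.

# Optional: only show warnings and errors so the show() output is easier to read
spark.sparkContext.setLogLevel("WARN")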

Reading

Let’s get our data by loading the text file we filled with Shrek’s script. If you haven’t already, I suggest grabbing the script from this website and saving it to a text file before continuing. Now let’s load it into a DataFrame.

# The text source reads the file line by line into a single string column named "value";
# the header and inferSchema options only apply to sources like CSV, so the text reader doesn't need them
shrek_script = (spark.read.format("text")
    .load("./shrek_script.txt")
)
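
Before moving on, it doesn’t hurt to peek at what we just loaded. The text reader always produces a single string column called value, one row per line in the file, so a quick sanity check looks like this:

# Confirm the schema (a single "value" string column) and how many lines were read
shrek_script.printSchema()
print(shrek_script.count())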

Tokenizing

OK, reading the script wasn’t that hard, right? Well, tokenizing the whole script might be a little more involved. First we should break the script down into lines.

# split() breaks each line into an array of space-separated tokens
lines = shrek_script.select(split(col("value"), " ").alias("line"))
lines.show(10)

When I created my file, for some reason the whole script ended up in the text file as one giant line. If the same thing happened to you, don’t worry; the code works just the same.

+--------------------+
| line|
+--------------------+
|[{Man}, Once, upo...|
+--------------------+
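
Splitting on a single space works fine here, but if your copy of the script happens to contain tabs or repeated spaces you may end up with empty tokens. A slightly more forgiving drop-in replacement (an assumption about your file, not something the original script needs) splits on any run of whitespace:

# split() accepts a regular expression, so "\s+" treats any run of whitespace as one separator
lines = shrek_script.select(split(col("value"), "\\s+").alias("line"))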

Now the next logical step is to break the lines into words. Giving the resulting column an alias is good practice, since the auto-generated column names can get confusing if you don’t keep track of them.

# explode() flattens each array of tokens into one row per word
words = lines.select(explode(col("line")).alias("word"))
words.show(15)
+-----------+
| word|
+-----------+
| {Man}|
| Once|
| upon|
| a|
| time|
| there|
| was|
| a|
| lovely|
| princess.|
| But|
| she|
| had|
| an|
|enchantment|
+-----------+

Cleaning

Now that we have the whole script broken down into words, we need to standardize everything a bit. To do that, let’s start by making everything lowercase.

words_lower = words.select(lower(col("word")).alias("word_lower"))
words_lower.show(15)
+-----------+
| word_lower|
+-----------+
| {man}|
| once|
| upon|
| a|
| time|
| there|
| was|
| a|
| lovely|
| princess.|
| but|
| she|
| had|
| an|
|enchantment|
+-----------+
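
As a side note, Spark’s ML library ships a Tokenizer transformer that lowercases and splits on whitespace in one step. We don’t need it for this project, but a minimal sketch working straight from the raw shrek_script DataFrame would look like this:

from pyspark.ml.feature import Tokenizer

# Tokenizer lowercases each line and splits it on whitespace into an array column
tokenizer = Tokenizer(inputCol="value", outputCol="tokens")
tokenized = tokenizer.transform(shrek_script)
tokenized.select(explode(col("tokens")).alias("word_lower")).show(5)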

Great! Now to continue our cleaning process, we should get rid of symbols and tokens that aren’t really words. For example, the website where I got the script lays out the scene by writing things between curly braces. Even though they add context, for this analysis they’re just noise. To remove them, we’ll use regexp_extract() to keep only the part of each token that matches our regular expression, which matches continuous runs of lowercase letters.

# Keep the leading run of lowercase letters (tokens like "{man}" become empty strings)
words_clean = words_lower.select(
    regexp_extract(col("word_lower"), "[a-z]*", 0).alias("word")
)
words_clean.show(20)
+-----------+
| word|
+-----------+
| |
| once|
| upon|
| a|
| time|
| there|
| was|
| a|
| lovely|
| princess|
| but|
| she|
| had|
| an|
|enchantment|
| upon|
| her|
| of|
| a|
| fearful|
+-----------+
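
One side effect of this regex, which you’ll spot in the counts later, is that contractions like “don’t” lose everything after the apostrophe and show up as “don”. If you’d rather keep them in one piece, a small tweak to the pattern (my variation, assuming the apostrophes in your file are plain ASCII quotes) also allows apostrophes:

# "[a-z']*" keeps apostrophes, so "don't" survives as a single token
words_clean = words_lower.select(
    regexp_extract(col("word_lower"), "[a-z']*", 0).alias("word")
)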

Notice how tokens that didn’t match our regular expression got replaced by empty strings. We don’t want those either. To clean them up, we’ll filter the column with where(), just like a SQL WHERE clause.

words_without_null = words_clean.where(col("word") != "")
words_without_null.show()
+-----------+
| word|
+-----------+
| once|
| upon|
| a|
| time|
| there|
| was|
| a|
| lovely|
| princess|
| but|
| she|
| had|
| an|
|enchantment|
| upon|
| her|
| of|
| a|
| fearful|
| sort|
+-----------+
only showing top 20 rows

Great! Now we have our data sparkly clean ;)

Counting

Now we’ve gotten to the fun part. First we group the identical words together, and once they’ve been grouped we can count how many rows fall into each group.

# Group identical words together, then count the rows in each group
groups = words_without_null.groupBy(col("word"))
results = groups.count()
# orderBy() is still a transformation; show() is the action that triggers the work
results.orderBy("count", ascending=False).show(50)
+--------+-----+
| word|count|
+--------+-----+
| i| 406|
| you| 406|
| the| 230|
| a| 225|
| to| 168|
| it| 160|
| that| 135|
| me| 134|
| and| 120|
| no| 90|
| is| 87|
| of| 87|
| my| 86|
| don| 77|
| this| 77|
| what| 76|
| on| 75|
| in| 73|
| know| 70|
| do| 63|
| not| 62|
| your| 61|
...
| he| 31|
+--------+-----+
only showing top 50 rows
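
If the SQL comparison from the cleaning step resonated with you, the same grouping can be written as an actual SQL query by registering the cleaned words as a temporary view. It’s just an equivalent formulation, not an extra step the project needs:

# Register the cleaned words as a view and run the same aggregation with Spark SQL
words_without_null.createOrReplaceTempView("words")
spark.sql("""
    SELECT word, COUNT(*) AS count
    FROM words
    GROUP BY word
    ORDER BY count DESC
""").show(10)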

It’s important to remember how Spark handles actions and transformations. In this particular case, groupBy() and count() are transformations, meaning Spark queues them lazily and nothing happens until an action is called. If we didn’t add the show(50) action at the end, Spark wouldn’t trigger the chain of computation. By the way, just so we have a little more clarity, this is how that chain looks:

  • Our data is scattered across 3 different partitions in our cluster.
  • Each worker performs the groupBy() and count() on its own partition.
  • Finally, every worker sends its summarized data to the master node.
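
If you want to see the plan Spark has queued up without actually running it, explain() will print it out; it’s not an action, so nothing gets computed:

# Print the plan Spark has built so far (no computation is triggered)
results.orderBy("count", ascending=False).explain()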

Answering

From what we were able to observe, the most common words in the script were “I” and “you”. Awww, how cute, but not surprising: any written content leans on pronouns and other very common words. Even with these underwhelming revelations, we can still pull out some interesting insights. For example, looking at the 50 most common words in the script, we can notice that Shrek (49 times) is mentioned more than Princess (39 times), and that swamp (20 times) is mentioned more than Fiona (17 times). At least we know Shrek’s priorities :)
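
Those numbers came from eyeballing the top 50, but you can also ask for them directly with a quick filter on the characters and places we care about:

# Pull out just the words we want to compare
(results
    .where(col("word").isin("shrek", "princess", "swamp", "fiona"))
    .orderBy("count", ascending=False)
    .show())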

Wrapping Things Up

Just to finish the project with a gold star, let’s export the results as a CSV. We don’t really have to, but it will be handy in case we need them in the future.

results.coalesce(1).write.csv("./shrek_word_counts.csv")
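
Keep in mind that write.csv() creates a directory with a part file inside, not a single shrek_word_counts.csv file. If you want a header row, sorted output, and the ability to rerun the export without a “path already exists” error, a variation along these lines should do the trick:

# coalesce(1) pulls everything into one partition so the output lands in a single part file;
# mode("overwrite") lets us rerun the export over the same path
(results
    .orderBy("count", ascending=False)
    .coalesce(1)
    .write
    .mode("overwrite")
    .option("header", "true")
    .csv("./shrek_word_counts.csv"))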
