Text Analysis with South Park — Part 1: TF-IDF
I noticed recently that Kaggle has an interesting dataset: 70,000 lines of South Park dialogue, nicely labelled by episode and character. I figured it would make a good practical test for the TF-IDF tools in scikit-learn that I’ve been wanting to try.
There’s been a lot written about the theory behind TF-IDF, but the gist is: calculate the term frequency (TF), the number of times a word appears in the text you’re interested in analysing. This is intuitive: more important words appear more often, right?
Then offset each term by how widely it appears across a collection of related texts (think pages vs. a whole book); this is the Inverse Document Frequency (IDF). Words like “and” or “I”, which don’t carry much weight informationally, are downweighted because they appear in nearly every document in the overall corpus.
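To make the arithmetic concrete, here is a tiny worked sketch on a toy corpus, using the common log-ratio variant of IDF (scikit-learn’s exact formula differs slightly, with smoothing):

```python
import math

# Toy corpus: three tiny "documents".
docs = [
    "cartman hates kyle and kyle hates cartman",
    "stan and kyle walk to school",
    "cartman and the chili",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(term, doc_tokens):
    # Term frequency: how often the term appears in this document.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    # Inverse document frequency: penalise terms found in many documents.
    df = sum(term in doc for doc in tokenized)
    return math.log(n_docs / df)

# "and" appears in every document, so its IDF (and hence TF-IDF) is zero.
print(idf("and"))  # log(3/3) = 0.0
# "chili" appears in only one document, so it scores highly there.
print(tf("chili", tokenized[2]) * idf("chili"))  # ≈ 0.27
```

The product TF × IDF is high only for words that are frequent in one document yet rare across the collection, which is exactly the “important keyword” behaviour we want.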
What can we do with this?
As a first step we’ll calculate the most important keywords for any given episode, to see if that gives a reasonable approximation of what it’s about. The next article in this series uses that information to take a search query like “The episode where Cartman turns someone’s parents into chili” and identify the episode it describes.
Step 1 — Collecting the Data
Once you’ve downloaded the dataset, the first step is to collect all the lines into the episodes they belong to, so our overall corpus will be a collection of strings, one per episode (a document). Note that the data can be sliced up any number of ways depending on the question we want to answer; we could collect each character’s lines together and find the terms that most represent them, for example.
We append the lines together in a dict initially, then convert it to a list, as order will matter later in the process.
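A minimal version of this step might look like the following; the column names (Season, Episode, Character, Line) are assumed to match the Kaggle CSV, so adjust them to your download. A small in-memory sample stands in for the real file here:

```python
import csv
import io
from collections import defaultdict

def collect_episodes(csv_file):
    """Group each line of dialogue under its (season, episode) key."""
    episode_lines = defaultdict(list)
    for row in csv.DictReader(csv_file):
        episode_lines[(row["Season"], row["Episode"])].append(row["Line"])
    # Convert the dict to a list so each episode gets a stable numeric index.
    return list(episode_lines.values())

# A few sample rows standing in for the real dataset file; note the
# trailing newline inside each quoted Line field, as in the real data.
sample = io.StringIO(
    "Season,Episode,Character,Line\n"
    '1,1,Cartman,"Screw you guys, I\'m going home.\n"\n'
    '1,1,Kyle,"Dude.\n"\n'
    '1,2,Stan,"Oh my god.\n"\n'
)
list_form = collect_episodes(sample)
print(len(list_form))  # 2 episodes in the sample
```

With the real file you’d pass `open("All-seasons.csv", newline="", encoding="utf-8")` (a hypothetical filename; use whatever the download is called) instead of the sample.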
Step 2 — Pre Processing the Data
To get good results from this process we need to transform the existing text in a few important ways. The result is great for feature extraction but not very readable, so we’ll create a new list, corpus, which is aligned numerically with list_form (i.e. list_form[n] corresponds to corpus[n]).
- We use .rstrip() to get rid of newline characters.
- .lower() because “Bob” and “bob” mean the same thing here.
- The .translate(…) is a slightly complicated way of removing punctuation, so that when we split up the words, “bob” and “bob.” aren’t treated differently.
- The whole thing is wrapped in a list comprehension and .join() to combine all the individual lines into one long string representing the whole episode.
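Put together, the cleaning step might be sketched like this (a small literal list stands in for the real list_form):

```python
import string

# Translation table that maps every punctuation character to None.
punct_table = str.maketrans("", "", string.punctuation)

# Stand-in for the real data: one episode consisting of two lines.
list_form = [["Hi, I'm Bob!\n", "Hello Bob.\n"]]

corpus = [
    # Per line: strip trailing newline, lowercase, drop punctuation;
    # then join all the lines into one long string per episode.
    " ".join(line.rstrip().lower().translate(punct_table) for line in episode)
    for episode in list_form
]
print(corpus[0])  # "hi im bob hello bob"
```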
The result is something that looks like this for each episode:
“‘mkay kids as your counselor i know it can sometimes be difficult to talk about subjects like drugs and sex and alcohol mkay so as you remember last week i told you to write down any difficult questions you had and put them in this box anonymously so we could discuss in class mkay i got a lot o responses so lets read some aloud mr mackey is gay…..”
Step 3 — The Actual Calculation
This part is very easy thanks to scikit-learn. It’s relatively easy to implement TF-IDF yourself as an exercise, but here we’re focused on quick results, so we’ll use the efficient implementation provided.
Let’s dig in to this a bit — we initialise a vectorizer which takes instructions on exactly how we want to break down the document.
- The tokenizer defines how we split the big episode strings down into individual words. We’re borrowing one from the nltk package which you might have to download.
- We also define a set of stop words. These are words like “a” or “is” which appear so often in a language that we know they won’t provide useful information and so can be ignored.
- The min_df specifies that a word must appear in at least two documents to be considered. In practice this is useful for removing one-off artifacts like URLs from the text.
- The max_df is a float value; it tells the vectorizer to ignore words which appear in more than 50% of the documents in the corpus. This generally catches common words not already in the stop words set.
Finally we use fit_transform() to train the vectorizer using the corpus we defined above.
Let’s see how well it defines keywords for random episodes:
This gets the index of a random episode and lists the top 5 keywords and their relative importance.
feature_names is a list of the keywords the vectorizer has extracted from the corpus.
We then ask the vectorizer to consider a single document from the corpus based on the random index.
An empty list of keywords is defined, and we iterate over the response looking for non-zero values, which represent words found both in the corpus vocabulary and in the document we specified. For each one we append the keyword itself and its importance score to the keywords list.
The last step is to sort the keywords and slice off the top 5. Sorting would normally give ascending order (least to most relevant), but since we want the most relevant first, we pass a lambda function as the key to negate the importance.
Let’s look at the output for three well-known episodes: the first ever “Cartman Gets an Anal Probe” (Season 1, Episode 1), “Scott Tenorman Must Die” (Season 5, Episode 4) and the Canadian royal wedding episode “Royal Pudding” (Season 15, Episode 3).
Looks like a pretty good representation of the episodes!
There are some problems. “Canada” and “Canadian” are effectively the same word here; we could address this by introducing “stemming”, which reduces words to a common root, removing plurals and other inflections. We also see that words a human wouldn’t use when describing the episode, but which are repeated heavily and exclusively within it, such as “moo” or “oink”, get upweighted heavily.
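For a quick feel of what stemming does, here is a sketch with nltk’s PorterStemmer (no extra data download needed); note that whether a particular pair like “canada”/“canadian” actually collapses depends on the stemmer chosen, so more aggressive stemmers or lemmatizers may be needed in practice:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of the same root collapse to a shared stem.
words = ["probe", "probes", "probing", "running", "runs"]
print([stemmer.stem(w) for w in words])  # the first three share one stem
```

To use this with the vectorizer, the stemmer would typically be wrapped inside the tokenizer function passed to TfidfVectorizer, so every token is stemmed before counting.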
So while it’s obvious that these are generally good examples of words important to the episode, it’s less clear that they would be useful when trying to search for episodes. It turns out that they are, thanks to some techniques we’ll discover in Part 2 of this series!