Data Loves Comedy: Analysis of a Standup Act
I started by downloading the subtitles from YouTube and, after some minor editing, did a basic count:
- 5,259 characters
- 972 words
- 326 unique words
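A count like this can be sketched in a few lines of Python. This is only an illustration, not the script behind the numbers above: the function name `basic_counts`, the tokenizing regex, and the sample sentence are all assumptions, and a different tokenizer would give slightly different totals.

```python
import re

def basic_counts(text: str) -> dict:
    """Return character, word, and unique-word counts for a transcript."""
    # Lowercase, then keep runs of letters; apostrophes and hyphens inside
    # a word (e.g. "ne'er-do-well") are kept as part of that word.
    words = re.findall(r"[a-z]+(?:['’-][a-z]+)*", text.lower())
    return {
        "characters": len(text),          # includes spaces and punctuation
        "words": len(words),
        "unique_words": len(set(words)),
    }

# Hypothetical sample sentence, not taken from the routine
sample = "The boss and the guys. The boss!"
print(basic_counts(sample))
```

Whether the character count should include spaces is a choice; here it does.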
To analyze it further, I turned to Natural Language Processing, a domain mixing computing, statistics and linguistics.
I first needed to figure out which words were more ‘meaningful’. For that, it is common practice to filter out the most common words (e.g. you, when, at, any, before, …). Those are called stop words.
As shown in the chart below, the first word that is not a stop word among the 20 most frequent words in the routine is ‘Dottie’ (the wisecracking secretary).
As there is no ‘standard’ set of stop words, I started with an initial list of 322, to which I added a few more words and punctuation marks, bringing the total to 375 entries.
The count went down from 326 unique words including stop words, to 235 unique words without.
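Here is a minimal sketch of the filtering step, assuming the text is already split into lowercase words. The tiny `STOP_WORDS` set and the sample words are placeholders for illustration only; they are not the 375-entry list used for the counts above.

```python
from collections import Counter

# Tiny illustrative stop list (an assumption, far shorter than a real one)
STOP_WORDS = {"the", "a", "and", "of", "to", "you", "when", "at", "any", "before"}

def top_content_words(words, stop_words=STOP_WORDS, n=20):
    """Count the words that are not stop words and return the n most frequent."""
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)

# Hypothetical sample, not the actual transcript
words = "the boss and the guys and dottie and the boss".split()
print(top_content_words(words, n=3))
```

Comparing `len(set(words))` before and after the filter gives the drop in unique words.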
The most frequent ‘non stop words’ are:
Interestingly, we can use those words to get a fairly good summary of the routine:
‘A boss, guys and Dottie abbreviate States. And there’s an omelet chef.’
What makes this joke a great piece of writing? One aspect is the unique flavor given by rare words.
It is possible to gauge how unusual the words are by looking at their frequency in the English language. As a reference, here is a list of the top 100 most frequent words in English.
One way is to use a metric called the Zipf Frequency (named after the linguist George Kingsley Zipf). It measures the frequency of a word across a broad collection of English sources, on a logarithmic scale.
- The lower the number, the rarer the word.
- A Zipf of 7 means a word appears every 100 words.
- A Zipf of 3 means once every 1 million words.
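The two bullet points follow from how the scale is defined: the Zipf value is the base-10 logarithm of how often a word occurs per billion words. Here is a small sketch of the conversion in both directions (the function names are mine, for illustration):

```python
import math

def words_per_occurrence(zipf: float) -> float:
    """Roughly how many words of text go by, on average, per occurrence."""
    # Zipf = log10(occurrences per billion words), so invert that relation.
    return 10 ** (9 - zipf)

def zipf_from_rate(occurrences: float, total_words: float) -> float:
    """Zipf value for a word seen `occurrences` times in `total_words` words."""
    return math.log10(occurrences / total_words * 1e9)

print(words_per_occurrence(7))   # about one in a hundred words
print(words_per_occurrence(3))   # about one in a million words
```

Each step down the scale makes a word ten times rarer.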
I picked some examples below as references to give a sense of what those rarity levels mean in practice.
Here are the words with a Zipf lower than 3 in the routine:
Note: The Zipf function got a bit confused by the words ‘abbreviators’ and ‘apostrophizing’ and returned a zero for both. Despite its overall comedic quality, “ne’er-do-well” scored a Zipf of 3.83 and didn’t do well enough to make the cut. If you want to check the Zipf of some words yourself, I made a basic checker here.
This second series can offer a very similar but more flavorful summary:
‘Rogues, misfits and the wisecracking Dottie abbreviate.’
…and something about a ‘hollandaise omelet’?
Checking with Google
As I was wondering if Google knew any better regarding the two missing words, I thought the number of search results could shed some light.
I looked up some basic words first, and calculated the Log10 of their number of search results.
Note: The reference material behind Google results is entirely ‘online stuff’. While it includes scanned books, it probably has a recency bias.
As the difference between the Log10 of the search results and the Zipf frequency seemed to be a constant, I averaged the differences and found about 4.26 (I called it ‘B’). The updated graphic is below and fits quite well.
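The averaging step can be sketched as follows. It assumes the difference is taken as Log10(results) minus Zipf, which is my reading of the text. The two sample pairs are made-up placeholders chosen to keep the arithmetic easy to follow; they are not the words or values behind the 4.26 figure.

```python
import math

def estimate_offset(samples):
    """Average of log10(search results) - Zipf over (results, zipf) pairs."""
    diffs = [math.log10(results) - zipf for results, zipf in samples]
    return sum(diffs) / len(diffs)

# Made-up placeholder pairs: (number of search results, Zipf value)
samples = [(10**10, 6.0), (10**9, 4.5)]
b = estimate_offset(samples)
print(round(b, 2))
```

With real data, the spread of the individual differences around the average shows how constant the offset really is.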
Side note: In mathematics, the Dottie number is a constant that is the unique real root of the equation cos(x) = x. Apparently a professor of French discovered it by pressing the ‘cos’ key repeatedly on a calculator and seeing the number converge every time. It is a transcendental number (alongside Pi and e) and, for those versed in chaos theory, it is also a universal attractor (a ‘randy minx’?). Seems quite fitting to Gary’s routine!
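The calculator experiment is easy to reproduce. Here is a minimal sketch, assuming radians mode, with an arbitrary starting value and number of key presses:

```python
import math

def dottie(start: float = 1.0, presses: int = 100) -> float:
    """Apply cos repeatedly, like pressing the 'cos' key over and over."""
    x = start
    for _ in range(presses):
        x = math.cos(x)  # math.cos works in radians
    return x

print(dottie())            # settles near 0.739085
print(dottie(start=-5.0))  # a different start ends up at the same value
```

The result satisfies cos(x) = x to within floating-point precision, whatever the starting point.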
Let’s see how our ‘google results formula’ applies to the rare words in the routine:
- It’s not perfect but it’s ‘within range’.
- Maybe the Dottie.vn Vietnamese online store is skewing the results a bit?
- While ‘abbreviators’ and ‘apostrophizing’ scored a zero with the Zipf module, Google didn’t drop the ball: the former had about 40,000 results, the latter about 77,000. That is very small compared to the many millions of results for other words. Even ‘wisecracking’ had almost 2 million hits!
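As a rough illustration, here is the formula applied to the two words the Zipf module missed. It uses the result counts quoted above and B = 4.26, and it assumes the formula has the form Zipf ≈ Log10(results) − B (the function name is mine):

```python
import math

B = 4.26  # offset reported earlier in the article

def zipf_from_results(results: int, b: float = B) -> float:
    """Estimate a Zipf value from a search result count (assumed formula)."""
    return math.log10(results) - b

for word, results in [("abbreviators", 40_000), ("apostrophizing", 77_000)]:
    print(word, round(zipf_from_results(results), 2))
```

Both estimates land below 1, which would make them rarer than any of the words that did get a Zipf score.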
Finally, out of curiosity, I wanted to check the lexical diversity.
As it turns out, it is tricky to measure and compare as:
- Shorter texts have less space to introduce varied words,
- The reference texts are skewed toward modern speeches,
- Some complex terms might be repeated.
The method that seems least affected by the above is called MTLD (“Measure of Textual Lexical Diversity”).
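For the curious, here is a simplified sketch of how such a score can be computed. It is not necessarily the tool behind the scores in this post. It follows one common formulation: walk through the text and count a ‘factor’ each time the ratio of unique words to total words falls below a threshold (0.72 is the conventional default), give partial credit to the leftover words, then divide the text length by the number of factors. The score is the average of a forward and a backward pass. Implementations differ in small details, such as whether the comparison is strict.

```python
def _mtld_pass(tokens, threshold):
    """One directional pass: text length divided by the number of factors."""
    factors = 0.0
    types, count = set(), 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count < threshold:
            factors += 1              # a full factor is complete, start over
            types, count = set(), 0
    if count:                         # partial credit for the leftover segment
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    # A text that never repeats a word has no factors at all
    return len(tokens) / factors if factors else float("inf")

def mtld(tokens, threshold=0.72):
    """Average of the forward and backward passes."""
    tokens = list(tokens)
    forward = _mtld_pass(tokens, threshold)
    backward = _mtld_pass(tokens[::-1], threshold)
    return (forward + backward) / 2

print(mtld(["a"] * 10))  # extreme repetition gives a very low score
```

A higher score means the text goes on for longer, on average, before its vocabulary starts repeating.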
Note that lexical diversity is not a measure of the quality of a text. Each text serves a particular purpose, and repetition and simple words can help both humor and persuasion.
I selected a few reference texts and extracted their MTLD:
- Martin Luther King — I Have a Dream: 41.9
- Maya Angelou — Address at Wellesley College: 47.7
- Winston Churchill — We shall fight on the beaches: 49.8
- Dr. Seuss — Oh The Places You’ll Go!: 51.0
- Gary Gulman — State Abbreviations: 52.7
- JFK — We Choose to Go to the Moon: 57.7
- Marc Antony — Bury Caesar: 58.5
- Abraham Lincoln — Gettysburg Address: 59.4
- Greta Thunberg — UN climate speech: 89.8
I was initially surprised that Greta Thunberg’s ‘How dare you’ UN speech turned out more lexically diverse than iconic speeches by MLK and JFK. Upon re-reading, I think it is due to her use of specialized vocabulary to briefly cover several complex topics.
Funny sounding words
Gary tweeted that, to him, words with the sounds “buh”, “puh” and “kuh” sound funnier than the alternatives. Maybe “guh” too?
Those consonants didn’t stand out overall in the routine, but here are some notable uses:
- ‘Crack squad of abbreviators’
- ‘ragtag outfit of rogues’
- back / track / work
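A crude way to look for these sounds is to check spelling. The sketch below flags words containing the letters b, p, k or g. That is only a rough proxy and an assumption on my part: spelling is not sound, so a "kuh" spelled with c or q is missed and a silent k is counted.

```python
import re

# Letters standing in for the "buh", "puh", "kuh", "guh" sounds (a rough proxy)
FUNNY_LETTERS = set("bpkg")

def words_with_funny_letters(text: str) -> list:
    """Return the words that contain at least one of the target letters."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if FUNNY_LETTERS & set(w)]

print(words_with_funny_letters("Crack squad of abbreviators"))
print(words_with_funny_letters("ragtag outfit of rogues"))
```

A proper check would need a pronunciation dictionary rather than letters.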
It is not easy to explain humor with data. My takeaway on the lexical aspects is that repetition, along with varied, rare, and funny-sounding words, all contribute to comedy.
For more, check out what the man himself says here on Twitter, where he shared 366 tips. His tour dates, albums and more are available on https://garygulman.com. I had a chance to see him live at the Comedy Cellar in 2019 and he was the best!
PS: Folks at Oxford looked into ways to use machine learning to classify one-liners and tell the difference between articles from The Onion and Reuters. They got impressive results but it is still limited to those formats, and they don’t have automated joke-writing quite yet :)