Published in The Startup

Data Loves Comedy: Analysis of a Standup Act

One of my favorite standup routines is Gary Gulman’s ‘State Abbreviations’, which lasts about 6 minutes. Here is what I found when doing a ‘quick’ analysis of its text.

Basic Set

  • 5,259 characters
  • 972 words
  • 326 unique words
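For the curious, counts like these can be reproduced with a few lines of Python. This is only a minimal sketch: the tokenizing rule (runs of letters, digits and apostrophes, case-insensitive) and the sample sentence are my assumptions, not necessarily what was used for the numbers above.

```python
import re

def basic_stats(text: str) -> dict:
    """Count characters, words and unique (case-insensitive) words."""
    # Assumption: a "word" is a run of letters, digits or apostrophes.
    words = re.findall(r"[a-z0-9'’]+", text.lower())
    return {
        "characters": len(text),
        "words": len(words),
        "unique_words": len(set(words)),
    }

# Made-up sample text, not the actual transcript.
sample = "Dottie, you randy minx! Dottie, you abbreviate."
stats = basic_stats(sample)
print(stats)
```

A different tokenizing rule (for instance, splitting hyphenated words) would shift the word counts slightly.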

Filtering

I first needed to figure out which words were more ‘meaningful’. For that, it is common practice to filter out the most common words (e.g. you, when, at, any, before, …). Those are called stop words.

As shown in the chart below, the first word that is not a stop word among the 20 most frequent words in the routine is ‘Dottie’ (the wisecracking secretary).

As there is no ‘standard’ set of stop words, I started with an initial list of 322, to which I added a few more words plus punctuation, for a total of 375 elements.

The count went down from 326 unique words including stop words, to 235 unique words without.
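The filtering step can be sketched roughly as follows. The stop list here is a tiny illustrative one (the real list had 375 entries), the sample sentence is made up, and the function name is hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop list; an assumption, not the 375-element list used above.
STOP_WORDS = {"you", "when", "at", "any", "before", "the", "a", "and", "of"}

def content_word_counts(text: str, stop_words: set) -> Counter:
    """Count the words of `text` that are not in `stop_words`."""
    # The regex keeps only word-like runs, so punctuation is dropped here too.
    words = re.findall(r"[a-z0-9'’]+", text.lower())
    return Counter(w for w in words if w not in stop_words)

# Made-up sample sentence.
sample = "The boss and the guys and Dottie abbreviate the states"
counts = content_word_counts(sample, STOP_WORDS)
print(counts.most_common(3))   # most frequent remaining words
print(len(counts))             # unique words left after filtering
```

Calling `most_common(20)` on the full transcript would give the kind of ranking used for the chart.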

Zooming in

Interestingly, we can use those words to get a fairly good summary of the routine:

‘A boss, guys and Dottie abbreviate States. And there’s an omelet chef.’

Omelet or Omelette? I’d go with the latter.

Rare words

It is possible to gauge how unusual the words in the routine are by looking at their frequency in the English language. As a reference, here is a list of the top 100 most frequent words in English.

Zipf frequency

  • The lower the number, the rarer the word.
  • A Zipf of 7 means a word appears every 100 words.
  • A Zipf of 3 means once every 1 million words.
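Both bullets are consistent with the usual definition of the Zipf scale, which, as far as I know, is the base-10 logarithm of a word's frequency per billion words. Under that assumption, a small sketch to convert between the two views (the function names are mine):

```python
import math

def zipf_to_one_in(zipf: float) -> float:
    """Return N such that the word shows up about once every N words."""
    # Assumes Zipf = log10(occurrences per billion words).
    return 10 ** (9 - zipf)

def frequency_to_zipf(per_billion: float) -> float:
    """Convert occurrences per billion words to a Zipf value."""
    return math.log10(per_billion)

print(zipf_to_one_in(7))        # once every 100 words
print(zipf_to_one_in(3))        # once every 1,000,000 words
print(frequency_to_zipf(1000))  # 1,000 per billion words is a Zipf of 3
```

By the same arithmetic, a Zipf of 3.83 works out to roughly once every 150,000 words.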

I picked some examples below as references to give a sense of what those rarity levels mean in practice.

Words with Zipf < 2 like “gobbledygook”, “mizpah” or “ingrate” are at risk of losing the crowd.

Here are the words with a Zipf lower than 3 in the routine:

Note: The words ‘abbreviators’ and ‘apostrophizing’ got the Zipf function a bit confused and returned a zero. Despite its overall comedic quality, “ne’er-do-well” scored a Zipf of 3.83 and didn’t do well enough to make the cut. If you want to check the Zipf of some words yourself, I made a basic checker here.

This second series can offer a very similar but more flavorful summary:

‘Rogues, misfits and the wisecracking Dottie abbreviate.’

…and something about a ‘hollandaise omelet’?

Checking with Google

I looked up some basic words first, and calculated the log10 of their number of search results.

Note: The reference material of Google results is entirely ‘online stuff’. While it includes scanned books, it probably has a recency bias.

As the difference between the two values (the log10 of the result count and the Zipf frequency) seemed to be roughly constant, I averaged the differences and found about 4.26 (I called it ‘B’). The updated graphic is below and fits quite well.
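Here is a rough sketch of that fit. Two things are assumptions on my part: that the difference is taken as log10(results) minus Zipf, and the calibration pairs, which are invented purely for illustration. Only the default offset of 4.26 comes from the analysis above.

```python
import math

def fit_offset(pairs):
    """Average of log10(result_count) - zipf over (result_count, zipf) pairs."""
    diffs = [math.log10(count) - zipf for count, zipf in pairs]
    return sum(diffs) / len(diffs)

def zipf_from_results(result_count, offset=4.26):
    """Estimate a Zipf value from a search result count (assumed direction)."""
    return math.log10(result_count) - offset

# Hypothetical (result_count, zipf) pairs, NOT the words actually looked up.
calibration = [(10_000_000_000, 5.7), (1_000_000_000, 4.8), (100_000_000, 3.7)]
b = fit_offset(calibration)
print(round(b, 2))
print(zipf_from_results(1_000_000, offset=b))
```

A single averaged constant only makes sense if the differences really are close to each other, which is what the graphic is there to check.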

Side note: In mathematics, the Dottie number is a constant that is the unique real root of the equation cos(x) = x. Apparently a professor of French discovered it by pressing the ‘cos’ key repeatedly on a calculator and seeing the number converge every time. It is a transcendental number (alongside Pi and e) and, for those versed in chaos theory, it is also a universal attractor (a ‘randy minx’?). Seems quite fitting to Gary’s routine!

‘Dottie, you randy minx!’ (angles in radians). x = cos(x) = cos(cos(cos(…cos(x)…)))
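The ‘press cos repeatedly’ experiment is easy to replay. A minimal sketch (the function name, starting value and tolerance are arbitrary choices of mine):

```python
import math

def dottie(start: float = 1.0, tol: float = 1e-12, max_iter: int = 1000) -> float:
    """Iterate x -> cos(x), in radians, until the value stops changing."""
    x = start
    for _ in range(max_iter):
        nxt = math.cos(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    raise RuntimeError("did not converge")

d = dottie()
print(d)                    # about 0.739085
print(dottie(start=-3.0))   # a different start lands on the same value
```

Whatever the starting point, the iteration settles on the same number, which is the ‘attractor’ behaviour mentioned above.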

Let’s see how our ‘google results formula’ applies to the rare words in the routine:

  • It’s not perfect but it’s ‘within range’.
  • Maybe the Dottie.vn Vietnamese online store is skewing the results a bit?
  • While ‘abbreviators’ and ‘apostrophizing’ scored a zero with the Zipf module, Google didn’t drop the ball: the former had about 40,000 results, the latter about 77,000. That is very small compared to the many millions of results for other words. Even ‘wisecracking’ had almost 2 million hits!

Lexical diversity

As it turns out, lexical diversity is tricky to measure and compare because:

  • Shorter texts have less space to introduce varied words,
  • The reference texts are skewed toward modern speeches,
  • Some complex terms might be repeated.

The method that seems least affected by the above is called MTLD (“Measure of Textual Lexical Diversity”).
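To give an idea of what the measure named above does, here is a simplified sketch of how it is commonly described: walk through the text, count a ‘factor’ each time the type-token ratio (unique words divided by words so far) falls to a threshold, add a partial factor for the leftover words, divide the text length by the number of factors, and average a forward and a backward pass. The 0.72 threshold is the conventional default rather than something taken from this analysis, and the function names are mine. Library implementations may differ in details, so scores from this sketch won't necessarily match the ones listed here.

```python
def _one_pass(tokens, threshold):
    """Text length divided by the number of factors, in one direction."""
    factors = 0.0
    seen = set()
    count = 0
    for token in tokens:
        count += 1
        seen.add(token)
        if len(seen) / count <= threshold:
            # Type-token ratio dropped to the threshold: close a factor.
            factors += 1.0
            seen.clear()
            count = 0
    if count:
        # Partial factor for the leftover segment.
        ttr = len(seen) / count
        factors += (1.0 - ttr) / (1.0 - threshold)
    # No factor at all (e.g. every word unique): the score is unbounded.
    return len(tokens) / factors if factors else float("inf")

def textual_lexical_diversity(tokens, threshold=0.72):
    """Average of the forward and backward passes."""
    tokens = [t.lower() for t in tokens]
    forward = _one_pass(tokens, threshold)
    backward = _one_pass(tokens[::-1], threshold)
    return (forward + backward) / 2

repetitive = ["a"] * 10
varied = "a b c d a b c d".split()
print(textual_lexical_diversity(repetitive))
print(textual_lexical_diversity(varied))
```

The score reads as the average number of words it takes for the vocabulary to ‘run out’, so higher means more diverse.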

Note that lexical diversity is not a measure of the quality of a text. Each text serves a particular purpose, and repetition and simple words can help both humor and persuasion.

I selected a few reference texts and extracted their MTLD:

  • Martin Luther King — I Have a Dream: 41.9
  • Maya Angelou — Address at Wellesley College: 47.7
  • Winston Churchill — We shall fight on the beaches: 49.8
  • Dr. Seuss — Oh The Places You’ll Go!: 51.0
  • Gary Gulman — State Abbreviations: 52.7
  • JFK — We Choose to Go to the Moon: 57.7
  • Marc Antony — Bury Caesar: 58.5
  • Abraham Lincoln — Gettysburg Address: 59.4
  • Greta Thunberg — UN climate speech: 89.8
Can you figure out which is which?

I was initially surprised that Greta Thunberg’s ‘How dare you’ UN speech turned out more lexically diverse than iconic speeches by MLK and JFK. Upon re-reading, I think it is due to her use of specialized vocabulary to briefly cover several complex topics.

Funny sounding words

Hard consonants such as ‘k’ and ‘g’, often said to make words sound funnier, didn’t stand out overall in the routine, but here are some notable uses:

  • ‘Crack squad of abbreviators’
  • ‘ragtag outfit of rogues’
  • back / track / work

Conclusion

For more, check what the man himself says on Twitter, where he shared 366 tips. His tour dates, albums and more are available at https://garygulman.com. I had a chance to see him live at the Comedy Cellar in 2019 and he was the best!

PS: Folks at Oxford looked into ways to use machine learning to classify one-liners and tell the difference between articles from The Onion and Reuters. They got impressive results but it is still limited to those formats, and they don’t have automated joke-writing quite yet :)

A minx.