
Data Loves Comedy: Analysis of a Standup Act

Benjamin Joffe
Dec 8, 2020 · 6 min read

One of my favorite standup routines is Gary Gulman’s ‘State Abbreviations’, which lasts about 6 minutes. Here is what I found when doing a ‘quick’ analysis of its text.

Basic Set

I started by downloading the subtitles from YouTube and, after some minor editing, did a basic count:

  • 5,259 characters
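For anyone who wants to reproduce this kind of count, here is a minimal sketch. The function name, the word-matching rule, and the sample string are my own assumptions for illustration, not the exact steps used for the figure above.

```python
import re

def basic_counts(text):
    """Return simple size statistics for a transcript."""
    # Assumption: a word is a run of letters, optionally joined by
    # apostrophes or hyphens (so "ne'er-do-well" counts as one word).
    words = re.findall(r"[a-z]+(?:['’-][a-z]+)*", text.lower())
    return {
        "characters": len(text),        # includes spaces and punctuation
        "words": len(words),
        "unique_words": len(set(words)),
    }

# Placeholder text; swap in the cleaned subtitle file contents.
sample = "Dottie, you randy minx! Dottie abbreviates."
counts = basic_counts(sample)
print(counts)
```

Whether spaces and punctuation belong in the character count is a choice; here they are included.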

Filtering

To analyze it further, I turned to Natural Language Processing, a domain mixing computing, statistics and linguistics.

I first needed to figure out which words were more ‘meaningful’. For that, it is common practice to filter out the most common words (e.g. you, when, at, any, before, …). Those are called stop words.

As shown in the chart below, the first word that is not a stop word among the 20 most frequent words in the routine is ‘Dottie’ (the wisecracking secretary).

Image for post

As there is no ‘standard’ set of stop words, I started with an initial list of 322 and added a few more words plus punctuation marks, for a total of 375 elements.

The count went down from 326 unique words including stop words, to 235 unique words without.
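The filtering step can be sketched in a few lines. The stop-word list below is a tiny made-up stand-in (a real list has a few hundred entries), and the helper names are hypothetical.

```python
import re
from collections import Counter

# Illustrative stop words only; not the 375-element list described above.
STOP_WORDS = {"a", "an", "and", "the", "you", "when", "at", "any", "before", "of", "to"}

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z]+(?:['’-][a-z]+)*", text.lower())

def top_words(text, stop_words=STOP_WORDS, n=20):
    """Most frequent words once stop words are dropped."""
    kept = [t for t in tokenize(text) if t not in stop_words]
    return Counter(kept).most_common(n)

def unique_counts(text, stop_words=STOP_WORDS):
    """Unique words with and without stop words."""
    vocab = set(tokenize(text))
    return len(vocab), len(vocab - set(stop_words))

sample = "The boss and the guys abbreviate the states, and Dottie abbreviates too."
print(unique_counts(sample))
print(top_words(sample, n=5))
```

Note that this sketch treats ‘abbreviate’ and ‘abbreviates’ as different words; grouping them would need stemming or lemmatization.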

Zooming in

The most frequent ‘non stop words’ are:

Image for post

Interestingly, we can use those words to get a fairly good summary of the routine:

‘A boss, guys and Dottie abbreviate States. And there’s an omelet chef.’

Image for post
Omelet or Omelette? I’d go with the latter.

Rare words

What makes this joke a great piece of writing? One aspect is the unique flavor given by rare words.

One way to gauge how unusual the words are is to look at their frequency in the English language. As a reference, here is a list of the top 100 most frequent words in English.

Zipf frequency

One way is to use a metric called the Zipf Frequency (named after the linguist George Kingsley Zipf). It measures the frequency of a word across a broad collection of English sources, on a logarithmic scale.

  • The lower the number, the rarer the word.
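To make the scale concrete: the Zipf value is usually defined as the base-10 logarithm of a word's occurrences per billion words. Below is a minimal sketch of that arithmetic. The function name and the ‘return 0 for unseen words’ convention are assumptions on my part, and the corpus numbers in the usage lines are invented.

```python
import math

def zipf_from_counts(occurrences, corpus_size):
    """Zipf value: log10 of occurrences per billion words."""
    if occurrences <= 0:
        return 0.0  # assumed convention for words never seen in the corpus
    per_billion = occurrences / corpus_size * 1e9
    return math.log10(per_billion)

# Invented example: a word seen once per million words sits at Zipf 3,
# and one seen once per thousand words sits at Zipf 6.
rare = zipf_from_counts(1, 1_000_000)
common = zipf_from_counts(1_000, 1_000_000)
print(round(rare, 2), round(common, 2))
```

In practice you would not count a corpus yourself; packages such as wordfreq provide a `zipf_frequency(word, lang)` lookup built on large word lists.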

I picked some examples below as references to give a sense of what those rarity levels mean in practice.

Image for post
Words with Zipf < 2 like “gobbledygook”, “mizpah” or “ingrate” are at risk of losing the crowd

Here are the words with a Zipf lower than 3 in the routine:

Image for post

Note: The words ‘abbreviators’ and ‘apostrophizing’ got the Zipf function a bit confused and returned a zero. Despite its overall comedic quality, “ne’er-do-well” scored a Zipf of 3.83 and didn’t do well enough to make the cut. If you want to check the Zipf of some words yourself, I made a basic checker here.

This second series can offer a very similar but more flavorful summary:

‘Rogues, misfits and the wisecracking Dottie abbreviate.’

…and something about a ‘hollandaise omelet’?

Checking with Google

Wondering whether Google knew any better about the two missing words, I thought the number of search results could shed some light.

I looked up some basic words first, and calculated the log10 of their number of search results.

Note: The reference material of Google results is entirely ‘online stuff’. While it includes scanned books, it probably has a recency bias.

Image for post
Image for post

As the difference between the log10 of the search results and the Zipf frequency seemed to be roughly constant, I averaged the differences and found about 4.26 (I called it ‘B’). The updated graphic is below and fits quite well.
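The averaging step can be sketched as follows. I am assuming the offset is taken as log10(search results) minus Zipf, and the reference pairs below are invented placeholders, not the values behind the charts.

```python
import math
from statistics import mean

def fit_offset(pairs):
    """Average of log10(result_count) - zipf over (result_count, zipf) pairs."""
    return mean(math.log10(count) - zipf for count, zipf in pairs)

def estimate_zipf(result_count, offset):
    """Estimate a Zipf value from a search result count and a fitted offset."""
    return math.log10(result_count) - offset

# Made-up reference points for illustration only.
reference = [(10**10, 6.0), (10**9, 5.0), (10**8, 3.5)]
B = fit_offset(reference)
guess = estimate_zipf(10**7, B)
print(round(B, 2), round(guess, 2))
```

A single constant is the simplest possible fit; it only works if the gap really is about the same for common and rare words.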

Image for post

Side note: In mathematics, the Dottie number is a constant that is the unique real root of the equation cos(x) = x. Apparently a professor of French discovered it by pressing the ‘cos’ key repeatedly on a calculator and seeing the number converge every time. It is a transcendental number (alongside Pi and e) and, for those versed in chaos theory, it is also a universal attractor (a ‘randy minx’?). Seems quite fitting to Gary’s routine!

Image for post
‘Dottie, you randy minx!’ (angles in radians). x = cos(x) = cos(cos(…cos(x)…))
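The calculator experiment is easy to replay. This is a small sketch of repeatedly pressing ‘cos’ (in radians) until the value stops moving; the function name, tolerance, and iteration cap are arbitrary choices of mine.

```python
import math

def dottie(start=1.0, tol=1e-12, max_iter=1000):
    """Iterate x -> cos(x) until successive values differ by less than tol."""
    x = start
    for _ in range(max_iter):
        nxt = math.cos(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    return x  # best estimate if the tolerance was not reached

# Very different starting points end up at the same value.
print(dottie(1.0), dottie(100.0), dottie(-3.0))
```

Whatever the starting value, the result settles near 0.739085, which is what makes the number an attractor.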

Let’s see how our ‘Google results formula’ applies to the rare words in the routine:

Image for post
  • It’s not perfect but it’s ‘within range’.

Lexical diversity

Finally, out of curiosity, I wanted to check the lexical diversity.

As it turns out, it is tricky to measure and compare as:

  • Shorter texts have less space to introduce varied words,

The method that seems least affected by the above is called MTLD (“Measure of Textual Lexical Diversity”).
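For a sense of how this measure works: it walks through the text, tracks the type-token ratio (unique words divided by total words) of the current stretch, and counts how often that ratio drops to a threshold, conventionally 0.72. The score is the text length divided by that count, averaged over a forward and a backward pass. Below is a simplified sketch under those assumptions; it is not the exact tool used for the scores in this post, and details such as tokenization and the handling of edge cases vary between implementations.

```python
import math

def _mtld_one_way(tokens, threshold):
    """One directional pass: text length divided by the number of 'factors'."""
    factors = 0.0
    types = set()
    count = 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count <= threshold:
            factors += 1      # diversity dropped to the threshold: close a factor
            types = set()
            count = 0
    if count > 0:
        # Partial factor for the leftover stretch at the end of the text
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    if factors == 0:
        return math.inf       # assumed convention: no repetition at all
    return len(tokens) / factors

def mtld(tokens, threshold=0.72):
    """Average of the forward and backward passes."""
    if not tokens:
        return 0.0
    forward = _mtld_one_way(tokens, threshold)
    backward = _mtld_one_way(list(reversed(tokens)), threshold)
    return (forward + backward) / 2

repetitive = ["a"] * 10
varied = "a b c d e f g h a b".split()
print(mtld(repetitive), mtld(varied))
```

The more repetitive the text, the sooner each stretch hits the threshold and the lower the score.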

Note that lexical diversity is not a measure of the quality of a text. Each text serves a particular purpose, and repetition and simple words can help both humor and persuasion.

I selected a few reference texts and extracted their MTLD:

  • Martin Luther King — I Have a Dream: 41.9
Image for post
Can you figure out which is which?

I was initially surprised that Greta Thunberg’s ‘How dare you’ UN speech turned out more lexically diverse than iconic speeches by MLK and JFK. Upon re-reading, I think it is due to her use of specialized vocabulary to briefly cover several complex topics.

Funny sounding words

Gary tweeted that, to him, words with the sound “buh”, “puh”, “kuh” sound funnier (than alternates). Maybe “guh” too?

Those consonants didn’t stand out overall in the routine, but here are some notable uses:

  • ‘Crack squad of abbreviators’

Conclusion

It is not easy to explain humor with data. My takeaway on the lexical aspects is that repetition, along with varied, rare, and funny-sounding words, all contribute to comedy.

For more, check what the man himself says here on Twitter, where he shared 366 tips. His tour dates, albums and more are available on https://garygulman.com. I had a chance to see him live at the Comedy Cellar in 2019 and he was the best!

PS: Folks at Oxford looked into ways to use machine learning to classify one-liners and tell the difference between articles from The Onion and Reuters. They got impressive results but it is still limited to those formats, and they don’t have automated joke-writing quite yet :)

Image for post
A minx.

The Startup

Medium's largest active publication, followed by +773K people. Follow to join our community.

Benjamin Joffe

Written by

Partner @ SOSV — $700m VC fund for Deep Tech (biology, robotics, etc.) | Digital Naturalist | Keynote Speaker | Angel Investor
