Generating jokes with machine learning: An Irishman walks out of a bar. He was brought to the closest hospital quickly
Recently I came across a public dataset on GitHub containing about 200,000 jokes from different sources. Well, mostly it’s reddit, but also a few other sites.
From this dataset, I put together a small Python script in less than 100 lines to create machine-generated jokes, like the title of this post (for a much larger list of generated jokes, I’ve put some up on this page).
In this post, I will go through using my code and get you started generating your own jokes.
Before actually looking at code, I will briefly go over the background of how the jokes are being generated. If you don’t care about this, feel free to skip this section.
The model used is a Markov chain. It looks at the current sentence and then picks the next word at random, weighted by how probable each candidate word is.
Alright, that may sound odd, but given the sentence “Yo mama is so”, what’s likely to be the next word? Given a lot of jokes in a dataset (corpus, from here on), in perhaps 70% of the jokes containing that sentence the next word will be “fat” (giving the new sentence “Yo mama is so fat”), and the word “ugly” may occur 30% of the time, giving the new sentence “Yo mama is so ugly”.
Thus, if we are given a sentence, we can append a new word to it based on prior likelihoods. With this way of thinking, we can create a new sentence by picking a random starting word (weighted by the probability of starting a sentence with that word), then continuing to pick words by weighted probability until the sentence is finished.
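To make the idea concrete, here is a minimal sketch of a chain that only looks at the current word. It is not the code from my script or from markovify; the sentinel tokens START and END and the three-sentence toy corpus are made up for illustration.

```python
import random
from collections import Counter, defaultdict

# Hypothetical sentinel tokens marking where a sentence begins and ends.
START, END = "<start>", "<end>"

def build_chain(sentences):
    """Count, for every word, how often each other word follows it."""
    chain = defaultdict(Counter)
    for sentence in sentences:
        words = [START] + sentence.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current][following] += 1
    return chain

def generate(chain, rng=random):
    """Walk from START, picking each next word weighted by its observed count."""
    word, output = START, []
    while True:
        candidates = chain[word]
        word = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        if word == END:
            return " ".join(output)
        output.append(word)

# Toy corpus, invented for this example.
corpus = ["yo mama is so fat", "yo mama is so ugly", "yo mama is so fat she fell"]
chain = build_chain(corpus)
print(chain["so"])      # how often each word follows "so"
print(generate(chain))  # one randomly generated sentence
```

In this toy corpus “fat” follows “so” twice and “ugly” once, so “fat” is picked about two times out of three.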
This obviously means that the jokes are never completely original but will always be somewhat based on the sentences in the corpus. However, that doesn’t mean that generated text will be completely derivative; for instance, the joke in the title is not found in the corpus.
The simplest way of creating chains is to only look at the current word and pick the next one. This gives only very little context. For instance, what’s the most likely word to succeed “is”? We can of course pick between many words, but most likely the sentence will make close to zero sense.
To overcome this, we may look at the latest two words. Now we get the words “mama is”, and this gives a much better basis for picking the next word.
However, what if we were given the words “did the”? Now it will also be very difficult to create a meaningful sentence.
We may fix this by choosing to look at the latest 10 (or another large number of) words. But how many different sentences in a corpus will contain the sentence “how do you find blind man on a nude beach”? Likely not that many. This means that the set of words that can be picked next is very small, so the model will likely reproduce the same joke that is in the corpus (by the way, the answer is: it’s not hard). Also, the probability table required for the generation has num_different_words_in_corpus^num_look_back_words possible entries, so every additional look-back word multiplies the potential memory requirement by the size of the vocabulary.
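The growth can be checked with a bit of arithmetic. The sketch below compares the theoretical table size with the number of word windows that actually occur in a tiny, made-up corpus; the function names are my own for this example. The theoretical figure is an upper bound, since an implementation that only stores the windows it has seen needs far fewer entries.

```python
def max_states(vocabulary_size, state_size):
    """Upper bound on distinct states: every possible combination of words."""
    return vocabulary_size ** state_size

def observed_states(sentences, state_size):
    """Distinct word windows of length state_size that actually occur."""
    seen = set()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - state_size + 1):
            seen.add(tuple(words[i:i + state_size]))
    return seen

# Toy corpus, invented for this example.
corpus = ["yo mama is so fat", "yo mama is so ugly"]
vocabulary = {word for sentence in corpus for word in sentence.split()}

for n in (1, 2, 3):
    print(n, max_states(len(vocabulary), n), len(observed_states(corpus, n)))
```

With a larger state size the number of possible states explodes, while the number of observed windows stays small, which is exactly why the model starts repeating the corpus.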
Picking how many words to use will be a trade-off between originality (and memory usage) and how much sense the sentence should make. For the joke-generation, I found that looking at the previous 3 or 4 words is a decent choice.
Getting and running code
When coding in Python, there’s a brilliant Markov chain library named markovify.
Start by cloning my code and installing markovify using the terminal commands:
git clone https://github.com/Thiele/markovjokes.git
sudo pip install markovify
The repository doesn’t contain the joke sources. So download the files reddit_jokes.json, stupidstuff.json and wocka.json and put them inside the folder you just cloned markovjokes to.
Now you can run the joke-generator with:
This will generate generated_jokes.json (overwriting any existing copy), which will contain a JSON array of jokes.
If you look through generate.py, you will see some loading and preprocessing. I will now go through some of the details.
The reddit jokes contain both a title and a body. There seems to be a (sort of) common consensus that the titles are part of the jokes. So the reddit jokes in the corpus will be [title]+” “+[body].
Stupidstuff and wocka jokes are represented by their body alone, so for these sources, only the body is used.
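As a rough sketch of this step (not a copy of generate.py), the corpus could be assembled like this. I am assuming here that each JSON file holds a list of objects and that the fields are called "title" and "body"; check the files you downloaded before relying on those names.

```python
import json

def load_jokes(path):
    """Read one of the downloaded JSON files (assumed to hold a list of objects)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

def joke_text(entry, use_title):
    """Return title + " " + body for reddit entries, the body alone otherwise."""
    body = entry.get("body", "")  # "body" and "title" are assumed field names
    if use_title:
        return entry.get("title", "") + " " + body
    return body

def build_corpus(reddit, stupidstuff, wocka):
    """Combine the three sources into one list of joke strings."""
    corpus = [joke_text(entry, use_title=True) for entry in reddit]
    corpus += [joke_text(entry, use_title=False) for entry in stupidstuff]
    corpus += [joke_text(entry, use_title=False) for entry in wocka]
    return corpus

# In-memory sample entries, invented for this example.
sample = build_corpus(
    [{"title": "An Irishman walks out of a bar.", "body": "It could happen."}],
    [{"body": "Why did the chicken cross the road?"}],
    [{"body": "Knock knock."}],
)
print(sample)
```

With the real files, the three arguments would come from load_jokes("reddit_jokes.json") and so on.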
Handling surplus dots
How often will we see the word “chicken......”? I’m guessing not often. This means that the set of distinct words that can follow it is very small. The author might as well just have written “chicken...” (three dots are more commonly used in language), and the meaning would be the same. However, people also tend to write “..”, so I have a rewriting rule: while the joke contains “...”, replace “...” with “..”.
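The rewriting rule can be written as a short loop. This is a sketch of the rule as described above; the function name collapse_dots is my own for this example.

```python
def collapse_dots(joke):
    """Shrink any run of three or more dots down to exactly two."""
    while "..." in joke:
        joke = joke.replace("...", "..")
    return joke

print(collapse_dots("chicken......"))  # chicken..
print(collapse_dots("wait... what"))   # wait.. what
```

Single dots and existing “..” are left untouched, so afterwards every run of dots in the corpus is at most two dots long.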
Handling special characters
Some jokes will contain characters that do nothing but mess with the data. For instance, we don’t care about carriage returns and line-breaks. So these are removed.
I’m also removing ‘ and , from jokes to simplify the corpus (“hi ” and “hi,” will be the same).
The question mark and dot have special meaning in the language, so these need to be handled. But to keep the corpus as small as possible, I don’t want to have both “chicken” and “chicken?” in the corpus.
To handle this, I’m encoding dots as the new word “ DOT ” (note the spaces) and question marks as “ QUESTIONMARK ” (again, spaces), so that the model treats them as words of their own.
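Put together, the character handling might look like the sketch below. It is an illustration, not the exact code from generate.py: replacing line-breaks with a space (so that words don’t get glued together) and collapsing repeated spaces at the end are my assumptions, as are the function names.

```python
def strip_special(joke):
    """Drop characters that only add noise to the corpus."""
    for char in ("\r", "\n"):
        joke = joke.replace(char, " ")  # assumption: a space keeps words apart
    for char in ("'", "\u2019", ","):   # straight and curly apostrophes, commas
        joke = joke.replace(char, "")
    return joke

def encode_punctuation(joke):
    """Turn dots and question marks into stand-alone words."""
    joke = joke.replace(".", " DOT ").replace("?", " QUESTIONMARK ")
    return " ".join(joke.split())       # assumption: collapse repeated spaces

raw = "Why, did the chicken cross?\nIt didn't."
print(encode_punctuation(strip_special(raw)))
# Why did the chicken cross QUESTIONMARK It didnt DOT
```

After this step “chicken” and “chicken?” share the same word “chicken”, and the punctuation lives on as its own token.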
Other smaller preprocessing steps:
- Remove jokes longer than 80 characters
- Make all jokes lower-case
- Remove empty jokes (some may happen to be there after removing characters)
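These three steps fit in one small function. This is a sketch under my own assumptions: the name clean_corpus is made up, and I measure the length on the joke as it looks at this point, which may differ from where generate.py applies the limit.

```python
MAX_LENGTH = 80  # character limit taken from the list above

def clean_corpus(jokes, max_length=MAX_LENGTH):
    """Lower-case every joke and drop the empty or overly long ones."""
    cleaned = []
    for joke in jokes:
        joke = joke.strip().lower()
        if joke and len(joke) <= max_length:
            cleaned.append(joke)
    return cleaned

# Sample jokes, invented for this example.
jokes = ["Knock Knock DOT", "   ", "x" * 81, "Why QUESTIONMARK"]
print(clean_corpus(jokes))  # ['knock knock dot', 'why questionmark']
```

Note that lower-casing also turns the DOT and QUESTIONMARK tokens into “dot” and “questionmark”.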
The Markov chain is built using the command in Python:
text_model = markovify.NewlineText(jokes, state_size=4)
Note the state_size=4. This means “look at the previous 4 words when picking the next”. Play around with this parameter to get different results.
A sentence can now be created in Python using:
After creating sentences, I replace “ dot” and “ questionmark” (lowercased by the preprocessing step) with . and ? respectively to create proper looking sentences.
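The decoding step can be sketched as follows, assuming the raw sentence is a string such as the one markovify’s make_sentence() returns (it can also return None when no acceptable sentence was found, so that case is passed through). Matching on a word boundary instead of doing a plain string replace is my own choice here, so that a word like “dots” is left alone; the function name is made up for this example.

```python
import re

def decode_punctuation(sentence):
    """Turn the lower-cased tokens back into '.' and '?'."""
    if sentence is None:  # nothing was generated
        return None
    sentence = re.sub(r" questionmark\b", "?", sentence)
    sentence = re.sub(r" dot\b", ".", sentence)
    return sentence

print(decode_punctuation("why did the chicken cross questionmark it didnt dot"))
# why did the chicken cross? it didnt.
```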
Hopefully you now have a basic understanding of Markov chains for content generation and how to use markovify.
If you have read through some of the jokes, you will by now also have realized that they often really aren’t that good and don’t make that much sense. And once in a while, a rather grim joke pops up, but I guess that’s just from reddit’s sense of humor. But sometimes an actually good and sort-of original joke is generated.
If you are curious as to what else people have been doing with Markov chains, a small list can be found on markovify’s GitHub page.