A Primer on Markov chains

Using Markov chains to generate sentences

Jiayu Yi
6 min read · Apr 19, 2018

Yesterday, after spending some time on the r/singapore random discussion and small questions thread, I suddenly had the idea of generating fake comments which looked like they came from there.

What about GANs?

At first, I thought that it would be a good opportunity to try playing with GANs (generative adversarial networks) myself. GANs are best known for generating realistic-looking fake images, but have also been used for other applications such as creating encryption algorithms.

However, while reading up I came across a post on Reddit by Ian Goodfellow, the inventor of GANs, which suggested that text processing perhaps wasn't the best fit for GANs.

Old is gold

I then decided to try my luck with a more traditional method of text generation: Markov chains!

I first encountered Markov chains used to generate text a long time ago when I played the open-source, turn-based strategy game The Battle for Wesnoth, where it was used to generate unit names.

https://www.wesnoth.org/

Markov chains

A Markov chain describes a system that moves through a set of states, transitioning from one state to the next with certain probabilities. For example, below is a Markov chain with two states, from the Wikipedia article Examples of Markov chains:

By Pmdusso — I drew it on LucidChart., CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=25300524

In this example, the states are weather conditions, and each state has a certain probability of transitioning to another one in the next time period. Let each time period be a single day. If it is currently sunny, then there's a 10% chance that it will rain the next day, and a 90% chance it will be sunny again.
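We can simulate a chain like this directly. The sketch below hard-codes the sunny row from the description above; the rainy row (a 50/50 split) is an assumption for illustration, not taken from the diagram.

```python
import random

# Transition probabilities per state. The sunny row is from the text;
# the rainy row (50/50) is assumed for illustration.
transitions = {
    'sunny': {'sunny': 0.9, 'rainy': 0.1},
    'rainy': {'sunny': 0.5, 'rainy': 0.5},
}

def simulate(start, days):
    """Simulate the weather for a number of days, one transition per day."""
    state = start
    history = [state]
    for _ in range(days):
        nexts = transitions[state]
        # pick the next state, weighted by its transition probability
        state = random.choices(list(nexts), weights=nexts.values())[0]
        history.append(state)
    return history

history = simulate('sunny', 7)
print(history)
```

Running this prints one possible week of weather; mostly sunny days, as the probabilities suggest.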

Memorylessness

The most important property that defines Markov chains is the Markov property, or memorylessness: the next state depends only on the current state. In our example above, given that today is sunny, it doesn't matter whether the past 10 days were also sunny or rainy; the probability of tomorrow's weather depends only on today's.

Generating sentences

Applying this to text generation, we can model sentences as Markov chains as well, by letting each word be a different state, for example:

What dessert do you like?
START -> what -> dessert -> do -> you -> like -> END

We also have separate states for the start and end of a sentence. But where did the probabilities go? To find out, we need some existing sentences to analyse. This set of sentences is our corpus, or training data:

What dessert do you like?
I like fruit tarts.
I like fruit tarts too.
What kind of fruit tart do you like?
I like mixed fruit tarts.

Breaking everything down into state transitions (ignoring case and punctuation for simplicity):

START -> what -> dessert -> do -> you -> like -> END
START -> i -> like -> fruit -> tarts -> END
START -> i -> like -> fruit -> tarts -> too -> END
START -> what -> kind -> of -> fruit -> tart -> do -> you -> like -> END
START -> i -> like -> mixed -> fruit -> tarts -> END

Now we can determine the probabilities for each state transition by counting how often they happen:

START -> what (2/5)
START -> i (3/5)
what -> dessert (1/2)
what -> kind (1/2)
i -> like (3/3)
like -> END (2/5)
like -> fruit (2/5)
like -> mixed (1/5)
fruit -> tarts (3/4)
fruit -> tart (1/4)
...

After calculating the transition probabilities, we can now try to generate some new sentences! Starting from the START state, we randomly move to one of the possible next states with our previously calculated probabilities.
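This step, picking a next state according to the transition probabilities, is just a weighted random choice. A one-line sketch in Python (the variable names are my own):

```python
import random

# transition probabilities out of the START state, from the table above
next_states = {'what': 0.4, 'i': 0.6}

# sample one next state, weighted by its probability
state = random.choices(list(next_states), weights=next_states.values())[0]
print(state)
```

This prints "what" roughly 40% of the time and "i" roughly 60% of the time.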

Chain: START
Current state: START
Possible next states: what (40%), i (60%)

Let’s say we randomly picked what for our next state:

Chain: START -> what
Current state: what
Possible next states: dessert (50%), kind (50%)

We’ll continue on like this until we reach the END state.

Chain: START -> what -> kind
Current state: kind
Possible next states: of (100%)
Chain: START -> what -> kind -> of
Current state: of
Possible next states: fruit (100%)
Chain: START -> what -> kind -> of -> fruit
Current state: fruit
Possible next states: tarts (75%), tart (25%)
Chain: START -> what -> kind -> of -> fruit -> tarts
Current state: tarts
Possible next states: END (67%), too (33%)
Chain: START -> what -> kind -> of -> fruit -> tarts -> END
Current state: END
Possible next states: None

In this way, we've generated a new sentence, "what kind of fruit tarts", which was not in our original set of sentences, yet (sort of) looks like it came from there.

Here’s a short Python script which calculates the transition probabilities for a set of sentences and generates new sentences based on it:
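A minimal version might look like this (a sketch; the function names build_chain and generate are my own, and it ignores case and punctuation as above):

```python
import random
import re
from collections import defaultdict

def build_chain(sentences):
    """Count transitions between consecutive states, then normalise to probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        # lowercase and keep only word characters, ignoring punctuation
        words = re.findall(r"[a-z']+", sentence.lower())
        states = ['START'] + words + ['END']
        for current, nxt in zip(states, states[1:]):
            counts[current][nxt] += 1
    # convert raw counts into probabilities that sum to 1 per state
    return {
        state: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for state, nexts in counts.items()
    }

def generate(chain):
    """Walk the chain from START to END, sampling each next state."""
    state = 'START'
    output = [state]
    while state != 'END':
        nexts = chain[state]
        state = random.choices(list(nexts), weights=nexts.values())[0]
        output.append(state)
    return ' '.join(output)

corpus = [
    'What dessert do you like?',
    'I like fruit tarts.',
    'I like fruit tarts too.',
    'What kind of fruit tart do you like?',
    'I like mixed fruit tarts.',
]
chain = build_chain(corpus)
print(chain)
for _ in range(5):
    print(generate(chain))
```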

You can run it to see the full set of transition probabilities for this example and play with it by changing the sentences used. Here’s an example run and some generated sentences:

{'START': {'what': 0.4, 'i': 0.6}, 'what': {'dessert': 0.5, 'kind': 0.5}, 'dessert': {'do': 1.0}, 'do': {'you': 1.0}, 'you': {'like': 1.0}, 'like': {'END': 0.4, 'fruit': 0.4, 'mixed': 0.2}, 'i': {'like': 1.0}, 'fruit': {'tarts': 0.75, 'tart': 0.25}, 'tarts': {'END': 0.6666666666666666, 'too': 0.3333333333333333}, 'too': {'END': 1.0}, 'kind': {'of': 1.0}, 'of': {'fruit': 1.0}, 'tart': {'do': 1.0}, 'mixed': {'fruit': 1.0}}
START i like fruit tarts END
START i like mixed fruit tart do you like fruit tarts END
START what dessert do you like fruit tarts END
START i like fruit tart do you like END
START what kind of fruit tarts too END

Because our original set of sentences was so small, we see a lot of repetition in our generated sentences. If we have a much larger body of text to use as training data, we’ll see much more diversity in the state transitions and more interesting output.

While we ignored punctuation in this example, we could easily have included it by treating punctuation marks as states as well:

START -> what -> dessert -> do -> you -> like -> ? -> END
START -> i -> like -> fruit -> tarts -> . -> END
START -> i -> like -> fruit -> tarts -> too -> . -> END
START -> what -> kind -> of -> fruit -> tart -> do -> you -> like -> ? -> END
START -> i -> like -> mixed -> fruit -> tarts -> . -> END
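The only change needed for this is in tokenisation: instead of discarding punctuation, emit each punctuation mark as its own token. One way to do this with a simple regex (a sketch, handling only a few common marks):

```python
import re

def tokenize(sentence):
    """Split into lowercase words and punctuation marks, each its own state."""
    return re.findall(r"[a-z']+|[.,!?]", sentence.lower())

tokens = tokenize('What dessert do you like?')
print(tokens)  # ['what', 'dessert', 'do', 'you', 'like', '?']
```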

Generating Reddit comments

Have you heard of r/SubredditSimulator? It's a subreddit which is entirely populated by bots — every submission and comment on it is actually generated with Markov chains. Each bot represents a particular subreddit and is trained on submissions and comments from that subreddit, so the things the bot posts also resemble content you would expect to see on the subreddit it represents.

As for myself, as mentioned at the beginning of this post, I’ll just be using comments from the daily r/singapore random discussion and small questions threads to generate some more, uniquely r/singaporean comments.

Next up

In a following post, we'll use the Reddit API to download all the comments from a few r/singapore random discussion and small questions threads to use as our training data. While the Reddit API is pretty approachable on its own, rather than reinvent the wheel, we'll make use of the popular Python Reddit API Wrapper, or PRAW, to do this.

Similarly for the Markov chain building and generation: although we do have our own rudimentary implementation above, we'll make use of the excellent Markovify library instead. Markovify is also used by many other generative text experiments around the internet (Markovify in the wild).
