Making a Text Generator Using Markov Chains

Akshay Sharma
Published in Analytics Vidhya
Aug 26, 2019

Having a computer auto-generate text that sounds like a person feels like the future. There have already been small proofs of concept, such as Gmail’s predictive text as you write an email, and the point where this concept is fully realized may arrive sooner rather than later.

For a small project, I used Markov Chains, rather than neural networks, to generate new reviews from a corpus of Yelp reviews. The text generated, although not perfect, does an impressive job of capturing the context, sentiment, and style of a review written by a typical user.

What is a Markov Chain?

Markov Chains are mathematical systems that move from one state to another. Two rules govern this broad statement: the next state depends only on the current state, not on the full history of states before it, and the next state is chosen on a probabilistic basis.
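Those two rules can be seen in a tiny simulation. This sketch uses made-up states and probabilities (they are not from the article) purely to show that each step looks only at the current state:

```python
import random

# A toy two-state Markov chain. The next state depends only on the
# current state and is drawn according to fixed probabilities.
# (Illustrative states and numbers, not from the article.)
transitions = {
    "sunny": [("sunny", 0.8), ("rainy", 0.2)],
    "rainy": [("sunny", 0.4), ("rainy", 0.6)],
}

def step(state):
    """Pick the next state using the current state's probabilities."""
    states, weights = zip(*transitions[state])
    return random.choices(states, weights=weights)[0]

state = "sunny"
walk = [state]
for _ in range(10):
    state = step(state)
    walk.append(state)

print(walk)
```

Each call to `step` ignores everything except the state passed in, which is exactly the Markov property.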

Example Image of Markov Chain from Brilliant.org

To put this in the context of a text generator, imagine an article you recently read. That article contains some number of words, many of which probably appear multiple times. For each word in the article, the words that directly follow it are grouped together, so words that occur more often after it carry more weight. To generate text, a random starting word is chosen; then a word is repeatedly drawn from the current word’s follower list until the desired word count is reached.

An Example Using Python

In Python terms, you create a dictionary keyed by every unique word in your corpus. The value for each key is a list of the words that appear after that word. The more often a word follows another, the more copies of it appear in that list, and the higher the probability that it will be selected by the text generator.
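A minimal sketch of that dictionary-based approach might look like this (the function names and the tiny sample corpus are my own, not the article’s code):

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the corpus.
    Repeated successors appear multiple times in the list, so a plain
    random.choice naturally favors frequent followers."""
    words = text.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, length=20):
    """Start from a random word, then repeatedly hop to a random follower."""
    word = random.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end (e.g. the corpus's last word): restart
            word = random.choice(list(chain))
        else:
            word = random.choice(followers)
        out.append(word)
    return " ".join(out)

corpus = "the food was great and the service was great and the staff was friendly"
chain = build_chain(corpus)
print(generate(chain, 10))
```

Note that `chain["was"]` here contains `"great"` twice and `"friendly"` once, so `"great"` is twice as likely to be picked after `"was"`.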

Sample text generated

As you can see above, at a high level the generated text looks as if a human being wrote it. It is only when you look more closely at the words that it becomes evident the text doesn’t make sense holistically. One thing that could help is making the weights on the word probabilities explicit so they can be tuned.
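One way to make that weighting explicit is to store follower counts instead of duplicate list entries, and sample with `random.choices`. This is a sketch of the idea, not the article’s implementation, and the sample sentence is invented:

```python
import random
from collections import Counter, defaultdict

def build_weighted_chain(text):
    """Map each word to a Counter of the words that follow it."""
    words = text.split()
    chain = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        chain[current][nxt] += 1
    return chain

def next_word(chain, word):
    """Draw a follower with probability proportional to its count.
    Storing counts separately makes it easy to adjust the weights later."""
    counter = chain[word]
    followers = list(counter)
    weights = list(counter.values())
    return random.choices(followers, weights=weights)[0]

chain = build_weighted_chain("the cat sat on the mat and the cat slept")
# chain["the"] is Counter({'cat': 2, 'mat': 1}), so 'cat' is twice as
# likely as 'mat' to follow 'the'.
print(next_word(chain, "the"))
```

Because the counts live in a `Counter` rather than being baked into a list, you could, for example, smooth or rescale them before sampling.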

While the concept is simple, the hardest part of building a generator with Markov Chains is ensuring your corpus contains enough text that the generated output doesn’t end up repeating the same words over and over.

Conclusion

Markov Chains can work wonderfully for generating text that mimics a human being’s style. However, to generate text effectively, your corpus needs to be filled with similar documents. In the example above, I captured 3-star reviews from Yelp, yet the output contains terms like “manure,” “office buildings,” “NFL,” and “theater.” These are generally unrelated and would not appear together in a typical review. To correct this, you would need to keep documents discussing similar topics (e.g., pizza parlors) in the same corpus and build the Markov Chain from that, so all of the generated text stays pizza related. That said, for such a simple technique, the quality of the generated text is remarkable and much easier to obtain than with heavily trained neural networks!


Data scientist with consulting and public accounting experience, and a CPA background at one of the largest accounting firms.