Making a Text Generator Using Markov Chains
Having a computer sound like a person through auto-generated text feels like the future. There have been small proofs of concept, such as Gmail’s predictive text as you write an email, and the point where this concept is truly realized may arrive sooner rather than later.
A small project I completed, which doesn’t use neural networks to generate text, used Yelp reviews to generate new reviews with Markov Chains. The generated text, although not perfect, does an incredible job of capturing the context, sentiment, and style of a review written by a typical user.
What is a Markov Chain?
Markov Chains are mathematical systems that move from one state to another. Two rules follow from this broad statement: the next state depends only on the current state, and the next state is chosen on a probabilistic basis.
To put this into the context of a text generator, imagine an article you recently read. The article contains some number of words, many of which are probably used multiple times. For each word in the article, the words that directly follow it are grouped together, so words that occur more often after it are weighted more heavily. When generating text, a random starting word is chosen, then a random word from its follower list, and so on until the desired word count is reached.
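As a sketch of that grouping step, here is how a tiny one-sentence corpus maps each word to the list of words that follow it (the sentence is my own toy example, not from the Yelp data):

```python
# Build successor lists for a toy sentence.
words = "the cat sat on the mat near the cat".split()

transitions = {}
for current, following in zip(words, words[1:]):
    # Append (rather than deduplicate) so repeated followers are kept,
    # which is what weights the more frequent words more heavily.
    transitions.setdefault(current, []).append(following)

print(transitions)
# {'the': ['cat', 'mat', 'cat'], 'cat': ['sat'], 'sat': ['on'],
#  'on': ['the'], 'mat': ['near'], 'near': ['the']}
```

Note that "the" is followed by "cat" twice, so a random pick from its list is twice as likely to produce "cat" as "mat".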
An Example Using Python
In Python terms, you create a dictionary with every unique word in your corpus as a key. The value for each key is a list of the words that appear after that word, with repeats kept. The more often a word appears after another, the more copies it has in the list, and the higher the probability that it will be selected by the generator.
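A minimal sketch of that approach (the function names and the one-line corpus are my own stand-ins for the Yelp review data):

```python
import random

def build_chain(text):
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain = {}
    for current, following in zip(words, words[1:]):
        chain.setdefault(current, []).append(following)
    return chain

def generate(chain, length=20, seed=None):
    """Walk the chain: pick a random start, then repeatedly pick a
    random follower until the desired word count is reached."""
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            # Dead end (the corpus's last word): restart at random.
            word = rng.choice(list(chain))
        else:
            word = rng.choice(followers)
        output.append(word)
    return " ".join(output)

corpus = "the pizza was great and the service was great and the pizza was hot"
chain = build_chain(corpus)
print(generate(chain, length=10, seed=1))
```

Because "great" follows "was" twice and "hot" only once, the generator is twice as likely to continue "was" with "great" without any extra bookkeeping.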
The text generated this way looks, at a high level, as if a human being wrote it. It is only when you look more closely at the words that it becomes evident it doesn’t make sense holistically. One way to improve this could be adding explicit weights to the probabilities of the words selected.
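One way to sketch that weighting idea is to store follower counts instead of raw lists and sample with explicit weights (this variation is my own, not part of the original project):

```python
import random
from collections import Counter

def build_weighted_chain(text):
    """Map each word to a Counter of follower frequencies."""
    words = text.split()
    chain = {}
    for current, following in zip(words, words[1:]):
        chain.setdefault(current, Counter())[following] += 1
    return chain

def next_word(chain, word, rng=random):
    """Sample the next word with probability proportional to its count."""
    counts = chain[word]
    followers = list(counts)
    weights = [counts[f] for f in followers]
    return rng.choices(followers, weights=weights, k=1)[0]

chain = build_weighted_chain(
    "the pizza was great and the service was great and the pizza was hot"
)
print(chain["was"])  # Counter({'great': 2, 'hot': 1})
```

Storing counts makes the weights explicit and easy to adjust, for example by dampening very common words before sampling.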
While the concept is simple, the hardest part of building a generator with Markov Chains is ensuring your corpus contains enough text that the generated output doesn’t repeat the same words over and over.
Markov Chains can work wonderfully for generating text that mimics a human being’s style. However, to generate text effectively, your corpus needs to be filled with similar documents. In the example above, I captured three-star reviews from Yelp, yet the corpus contains words like manure, office buildings, NFL, and theater. These are generally unrelated and would not appear in a typical review. To correct this, you would need to keep documents discussing similar topics (e.g., pizza parlors) in the same corpus and build the Markov Chain from that, so the generated text is all pizza related. That being said, for such a simple technique, the quality of the generated text is remarkable, and it is much easier to obtain than with heavily trained neural networks!