Next Word Prediction using Markov Model

Ashwin M J
Published in YML Innovation Lab
Mar 16, 2018

If you ever ask a machine learning engineer how to go about generating text or building a predictive model, Recurrent Neural Networks (RNNs), and specifically Long Short-Term Memory (LSTM) networks, would be the most obvious answer. Yet traditional models are simpler and, in certain cases, perform better than deep learning models¹. That’s what we will be exploring in this article: we will learn how to make use of a Markov model for word prediction.

Markov Model

Let’s understand what a Markov model is before we dive in. Simply stated, a Markov model is a model that obeys the Markov property.

So, what is the Markov property? A process in which the next state depends only on the current state is said to follow the Markov property. For example, if tomorrow’s weather depends only on today’s weather, or today’s stock price depends only on yesterday’s stock price, then such processes are said to exhibit the Markov property. Mathematically speaking, the conditional probability distribution of the next state depends only on the current state and not on the past states. That is, s(t) depends only on s(t-1), where s(t) is the state at time t. This is what is called a first-order Markov model.
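In terms of conditional probabilities, the first-order Markov property can be written as -

P(s(t) | s(t-1), s(t-2), ..., s(1)) = P(s(t) | s(t-1))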

In general, if the next state of a system depends on the previous n states, then it is called an n-th order Markov model.

A sequence of events that follows the Markov model is referred to as a Markov chain.

Next word prediction

Now let’s take our understanding of the Markov model and do something interesting. Suppose we want to build a system that, given an incomplete sentence, predicts the next word in the sentence. So, how do we take a word prediction problem like this one and model it as a Markov model problem? Treat every word as a state and predict the next word based on the previous state, as simple as that. This case is a perfect fit for a Markov chain.

Wait, but how do you do that? Enter the probability distribution. Let’s understand this better with a simple example. Consider these three simple sentences -

  • I like Photography.
  • I like Science.
  • I love Mathematics.

All the unique words from the above sentences, that is ‘I’, ‘like’, ‘love’, ‘Photography’, ‘Science’ and ‘Mathematics’, form the different states. The probability distribution is all about determining the probability of transition from one state to another, which in our case means from one word to another. It is clear from the above examples that every sentence starts with the word ‘I’. So there is a 100% chance that the first word of the sentence will be ‘I’. For the second state, we have to choose between the words ‘like’ and ‘love’. The probability distribution here is about determining the probability that the next word will be ‘like’ or ‘love’ given that the previous word is ‘I’. In our example, the word ‘like’ appears after ‘I’ in 2 of the 3 sentences, whereas the word ‘love’ appears only once. Hence there is approximately a 67% (2/3) probability that ‘like’ will succeed ‘I’ and a 33% (1/3) probability for ‘love’. Similarly, there is a 50-50 chance for ‘Photography’ and ‘Science’ to succeed ‘like’. And ‘love’ will always be followed by ‘Mathematics’ in our case.

[Figure: state transition diagram for our sample data]

Representing the above mathematically as conditional probabilities -

P(like | I) = 0.67
P(love | I) = 0.33
P(Photography | like) = P(Science | like) = 0.5
P(Mathematics | love) = 1

This is how we build a probability distribution from sample data.
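Here is a minimal sketch of this computation in Python (my own illustration rather than the repository’s code; punctuation is dropped for simplicity) -

from collections import defaultdict

sentences = [
    "I like Photography",
    "I like Science",
    "I love Mathematics",
]

# Count how often each word follows each other word
counts = defaultdict(lambda: defaultdict(int))
for sentence in sentences:
    words = sentence.split()
    for prev, curr in zip(words, words[1:]):
        counts[prev][curr] += 1

# Normalise the counts into probabilities
probabilities = {
    prev: {curr: n / sum(nxt.values()) for curr, n in nxt.items()}
    for prev, nxt in counts.items()
}

print(probabilities["I"])     # {'like': 0.67, 'love': 0.33} (rounded)
print(probabilities["like"])  # {'Photography': 0.5, 'Science': 0.5}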

Generating an Eminem-style song

Now let’s build something bigger. We will train a Markov model on a collection of Eminem song lyrics and then try to generate new lyrics from the model, a typical use case for a Markov chain. All the code and data for this post can be found on GitHub.

For the song generation, we will make use of a 2nd-order Markov model. First, we need to clean up the data and then train the Markov model on the cleaned-up data.

The training of the Markov model can be divided into the following stages -

  1. Tokenisation
  2. Building the state pairs
  3. Determining the probability distribution

Let’s understand the procedure with a simple sentence -

The quick brown fox jumps over the lazy dog.

First, we perform tokenisation. Tokenisation is nothing but breaking a sentence down into words.
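A rough sketch of this step (one simple way to tokenise; the actual cleanup in the repository may differ) -

import re

def tokenise(text):
    # Lowercase the text and split it into words, dropping punctuation
    return re.findall(r"[a-z']+", text.lower())

tokenise("The quick brown fox jumps over the lazy dog.")
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']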

The second stage consists of forming the previous-state and current-state pairs. Since we are building a 2nd-order Markov model, the previous state will consist of two words. For our example sentence, the pairs will look something like this -

(the, quick) --> brown
(quick, brown) --> fox
(brown, fox) --> jumps ...

Additionally, we have to consider two special cases, namely the first word and the second word of a sentence. Neither of them has two previous words, so we have to handle them differently. For the first word, we simply compute the initial state distribution. For the second word, we treat it as a 1st-order Markov model, since it has only one previous word. Finally, to mark the end of a sentence, we add a special ‘END’ token and form pairs like

(lazy, dog) --> END
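A minimal sketch of stages 1 and 2 together, assuming the tokenise function from the previous snippet (the dictionary names here are my own, not the repository’s) -

from collections import defaultdict

# Counts for the three distributions described above
initial = defaultdict(int)                           # first word of a line
second_word = defaultdict(lambda: defaultdict(int))  # 1st-order: first word -> second word
transitions = defaultdict(lambda: defaultdict(int))  # 2nd-order: (w1, w2) -> next word

def train_line(line):
    tokens = tokenise(line)
    for i in range(len(tokens)):
        if i == 0:
            initial[tokens[0]] += 1                 # first word: initial distribution
        elif i == 1:
            second_word[tokens[0]][tokens[1]] += 1  # second word: one previous word
        else:
            transitions[(tokens[i - 2], tokens[i - 1])][tokens[i]] += 1
    if len(tokens) >= 2:
        # Mark the end of the line with a special END token
        transitions[(tokens[-2], tokens[-1])]['END'] += 1

train_line("The quick brown fox jumps over the lazy dog.")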

Once we have formed the state pairs, all we need to do in stage 3 is perform simple counts and calculate the probability of each possible next state for a given current state, just as before. For example, in our sentence the word ‘the’ can be followed by the words ‘quick’ or ‘lazy’. We build a probability distribution as follows -

the --> [quick, lazy]
{quick: 1/2, lazy: 1/2}
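Stage 3 then reduces to normalising those counts; a short sketch, reusing the count dictionaries from the snippet above -

def normalise(counts):
    # Convert a dictionary of raw counts into a probability distribution
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

initial_dist = normalise(initial)
second_word_dist = {word: normalise(nxt) for word, nxt in second_word.items()}
transition_dist = {pair: normalise(nxt) for pair, nxt in transitions.items()}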

Once we have completed the training, we will have the initial word distribution, the second-word distribution and the state transition distributions. Then, to generate a song, all we need is a function that samples from the distributions created above.
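A sketch of such a sampling function, reusing the distributions built above (random.choices draws a word in proportion to its probability; again, the names are mine) -

import random

def sample_word(dist):
    # Pick one word at random, weighted by its probability
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs)[0]

def generate_line():
    # Assumes every training line had at least two words, so every
    # reachable state has a distribution to sample from
    w1 = sample_word(initial_dist)
    w2 = sample_word(second_word_dist[w1])
    line = [w1, w2]
    while True:
        nxt = sample_word(transition_dist[(w1, w2)])
        if nxt == 'END':
            break
        line.append(nxt)
        w1, w2 = w2, nxt
    return ' '.join(line)

for _ in range(4):
    print(generate_line())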

Tada! That’s it. We are now ready to test out our song generator. Here is one sample output generated by our Markov model -

and i should not be a king when you feel em
while he play piano
you better lose yourself and make her fall on her
and its absurd how people hang on every word
off a plank and
look i was but im
teeter totter caught up between being a father and a primadonna
at least once in a while
now who thinks their arms are long enough to one day grow up
but for me to try to get em motivated

Yeah, I know you tried to hum it like Eminem and it didn’t make much sense. It is senseless because neither I nor the code is Eminem 😂. Jokes apart, on a serious note, the individual lines kind of make sense but the whole piece doesn’t connect properly. This is mainly because the Markov model considers only the previous state and neglects everything that came before it, which results in a loss of information. This is what we refer to as the memoryless property of a stochastic process. It is this memory, which Markov models lack, that lets LSTMs outperform them in such cases.

Conclusion

As we can see, Markov models do provide decent results, so they should not be completely written off. It is advisable to try Markov models before jumping to more complex models such as LSTMs. It would also be interesting to see how a combination of Markov models and LSTMs would play out.

Happy exploring!

References:

  1. M. Panzner and P. Cimiano, “Comparing Hidden Markov Models and Long Short Term Memory Neural Networks for Learning Action Representations” (link)
  2. Unsupervised Machine Learning: Hidden Markov Models in Python by Lazy Programmer (link)
  3. Visual explanation of Markov Chains by Victor Powell and Lewis Lehe (link)

Developed at Innovation Labs @ Y Media
