Markov Chain Model

Prerana
2 min read · Aug 30, 2021


A detailed implementation of an n-gram language model.

This article is related to the article titled Next Word Prediction using Swiftkey Data.

Feature Development

Creating n-gram Dictionaries

I have created n-gram dictionaries with the n-gram words as the key and, as the value, each word that follows them along with its frequency of occurrence after those n-gram words.

Unigram Dictionary

Code Sample

cleaned_corpus is the corpus containing all the tokens. ngram_dict1 is the unigram dictionary.
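
The exact code is in the notebook linked below; a minimal sketch of how such a unigram dictionary could be built, assuming cleaned_corpus is a flat list of tokens, looks like this:

from collections import defaultdict

# ngram_dict1 maps each single word (the unigram key) to a dictionary of
# {following word: frequency with which it occurs right after the key}
ngram_dict1 = defaultdict(dict)
for i in range(len(cleaned_corpus) - 1):
    key = cleaned_corpus[i]
    nxt = cleaned_corpus[i + 1]
    ngram_dict1[key][nxt] = ngram_dict1[key].get(nxt, 0) + 1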

Bigram Dictionary

Code Sample

ngram_dict2 is the bigram dictionary, and I have used the first 2000000 bigram pairs to create it.
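
A similar sketch for bigram keys, using the 2000000-pair cutoff mentioned above:

from collections import defaultdict

# ngram_dict2 maps a pair of consecutive words to the words that follow it,
# built from only the first 2000000 bigram pairs
ngram_dict2 = defaultdict(dict)
for i in range(min(len(cleaned_corpus) - 2, 2000000)):
    key = (cleaned_corpus[i], cleaned_corpus[i + 1])
    nxt = cleaned_corpus[i + 2]
    ngram_dict2[key][nxt] = ngram_dict2[key].get(nxt, 0) + 1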

Trigram Dictionary

Code Sample

ngram_dict3 is the trigram dictionary, which is built on only a part of cleaned_corpus.
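
A corresponding sketch for trigram keys; the exact slice of cleaned_corpus used in the notebook is not stated here, so the cutoff below is only a placeholder:

from collections import defaultdict

# ngram_dict3 maps a triple of consecutive words to the words that follow it
corpus_part = cleaned_corpus[:2000000]  # placeholder cutoff; the notebook uses "a part" of the corpus
ngram_dict3 = defaultdict(dict)
for i in range(len(corpus_part) - 3):
    key = (corpus_part[i], corpus_part[i + 1], corpus_part[i + 2])
    nxt = corpus_part[i + 3]
    ngram_dict3[key][nxt] = ngram_dict3[key].get(nxt, 0) + 1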

Generating Probabilities from the dictionary values

In this step, I have converted the frequencies of occurrence into probabilities, i.e. the frequency of a word following an n-gram divided by the total frequency of all words following that n-gram.

I have also applied Laplace smoothing, in case a frequency is 0.

Code Sample
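
A minimal sketch of this conversion, assuming add-one (Laplace) smoothing over a vocabulary of size vocab_size (to_probabilities and prob_dict1/2/3 are illustrative names, not necessarily those used in the notebook):

# P(word | n-gram) = (count + 1) / (total count for that n-gram + vocab_size)
def to_probabilities(ngram_dict, vocab_size):
    prob_dict = {}
    for key, followers in ngram_dict.items():
        total = sum(followers.values())
        prob_dict[key] = {word: (count + 1) / (total + vocab_size)
                          for word, count in followers.items()}
    return prob_dict

vocab_size = len(set(cleaned_corpus))  # assumed vocabulary size
prob_dict1 = to_probabilities(ngram_dict1, vocab_size)
prob_dict2 = to_probabilities(ngram_dict2, vocab_size)
prob_dict3 = to_probabilities(ngram_dict3, vocab_size)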

Developing Markov Chain

Key Idea:

  1. The sequence of words (history) whose next word has to be predicted is taken.
  2. If the length of the history is 1, we look for it in the unigram dictionary keys. The word with the highest probability corresponding to the history key in the dictionary is the next word. If the history is not found in the keys, I generate a random next word.
  3. If the length of the history is 2, we look for it in the bigram dictionary keys. The word with the highest probability corresponding to the history key in the dictionary is the next word. If the history is not found in the keys, I take history = history[-1:] and repeat step 2.
  4. If the length of the history is 3, we look for it in the trigram dictionary keys. The word with the highest probability corresponding to the history key in the dictionary is the next word. If the history is not found in the keys, I take history = history[-2:] and repeat step 3.
  5. If the length of the history is greater than 3, I take history = history[-3:] and repeat step 4.

Code Sample
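
A rough sketch of the back-off logic described in the steps above, assuming the prob_dict1/2/3 dictionaries from the previous step and a list vocabulary of all known words (the notebook's actual implementation may differ):

import random

def predict_next_word(history):
    # Step 5: keep at most the last 3 words of the history
    history = tuple(history[-3:])
    # Step 4: try the trigram dictionary, otherwise back off to 2 words
    if len(history) == 3:
        if history in prob_dict3:
            return max(prob_dict3[history], key=prob_dict3[history].get)
        history = history[-2:]
    # Step 3: try the bigram dictionary, otherwise back off to 1 word
    if len(history) == 2:
        if history in prob_dict2:
            return max(prob_dict2[history], key=prob_dict2[history].get)
        history = history[-1:]
    # Step 2: try the unigram dictionary, otherwise pick a random word
    if history and history[0] in prob_dict1:
        return max(prob_dict1[history[0]], key=prob_dict1[history[0]].get)
    return random.choice(vocabulary)

For example, predict_next_word(["i", "would", "like", "to"]) considers only the last three tokens of the history, as in step 5.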

Github Link : https://github.com/kurchi1205/Next-word-Prediction-using-Swiftkey-Data/blob/main/Markov%20Model%20.ipynb

Test Results:

I have used perplexity to test this model.
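
Roughly, perplexity is the exponential of the average negative log-probability the model assigns to a held-out token sequence; a sketch, with next_word_prob as an assumed helper returning the model's smoothed probability of a word given its history:

import math

def perplexity(tokens):
    log_prob = 0.0
    for i in range(1, len(tokens)):
        history = tokens[max(0, i - 3):i]
        log_prob += math.log(next_word_prob(history, tokens[i]))
    return math.exp(-log_prob / (len(tokens) - 1))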

For a vocabulary of 40241 words, it gives a perplexity of 82351.45367405319, which is not satisfactory.

As a result, I have built classification-based models to improve the performance. Please refer to the main article for the same.
