Detailed implementation of an n-gram language model.
This article is a companion to the article titled Next Word Prediction using Swiftkey Data.
Feature Development
Creating n-gram Dictionaries
I have created n-gram dictionaries where the n-gram is the key and the value maps each word that follows the n-gram to its frequency of occurrence after that n-gram.
Unigram Dictionary
Code Sample
cleaned_corpus is the corpus containing all the tokens; ngram_dict1 is the unigram dictionary.
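A minimal sketch of how such a unigram dictionary could be built, assuming cleaned_corpus is a list of token strings (the helper name build_unigram_dict is my own, not from the original code):

```python
from collections import defaultdict

def build_unigram_dict(cleaned_corpus):
    """Map each word to a dict of {following word: frequency}."""
    ngram_dict1 = defaultdict(lambda: defaultdict(int))
    for i in range(len(cleaned_corpus) - 1):
        key = cleaned_corpus[i]          # single word acts as the "unigram" key
        next_word = cleaned_corpus[i + 1]
        ngram_dict1[key][next_word] += 1
    return ngram_dict1
```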
Bigram Dictionary
Code Sample
ngram_dict2 is the bigram dictionary; I have used the first 2,000,000 bigram pairs to create it.
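A sketch of the same idea for bigram keys, with the 2,000,000-pair cap from the text expressed as a parameter (build_bigram_dict and max_pairs are illustrative names):

```python
from collections import defaultdict

def build_bigram_dict(cleaned_corpus, max_pairs=2_000_000):
    """Map each (word1, word2) bigram to {following word: frequency},
    using at most max_pairs bigram pairs from the corpus."""
    ngram_dict2 = defaultdict(lambda: defaultdict(int))
    for i in range(min(max_pairs, len(cleaned_corpus) - 2)):
        key = (cleaned_corpus[i], cleaned_corpus[i + 1])
        next_word = cleaned_corpus[i + 2]
        ngram_dict2[key][next_word] += 1
    return ngram_dict2
```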
Trigram Dictionary
Code Sample
ngram_dict3 is the trigram dictionary, which is built on only a part of cleaned_corpus.
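The trigram version follows the same pattern; here the "part of cleaned_corpus" is modeled as a simple slice limit (the function and parameter names are my own):

```python
from collections import defaultdict

def build_trigram_dict(cleaned_corpus, limit=None):
    """Map each (w1, w2, w3) trigram to {following word: frequency},
    optionally using only the first `limit` tokens of the corpus."""
    corpus = cleaned_corpus[:limit] if limit else cleaned_corpus
    ngram_dict3 = defaultdict(lambda: defaultdict(int))
    for i in range(len(corpus) - 3):
        key = tuple(corpus[i:i + 3])
        ngram_dict3[key][corpus[i + 3]] += 1
    return ngram_dict3
```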
Generating Probabilities from the dictionary values
In this step, I have converted each frequency of occurrence into a probability, i.e. the frequency of a word following an n-gram divided by the total frequency of all words following that n-gram.
I have also applied Laplace smoothing so that a frequency of 0 does not produce a zero probability.
Code Sample
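One way this conversion with Laplace (add-one) smoothing could look, applied to any of the dictionaries above (to_probabilities and vocab_size are assumed names, not from the original):

```python
def to_probabilities(ngram_dict, vocab_size):
    """Convert follower frequencies to Laplace-smoothed probabilities:
    P(w | ngram) = (count(w) + 1) / (total_count + vocab_size)."""
    prob_dict = {}
    for key, followers in ngram_dict.items():
        total = sum(followers.values())
        prob_dict[key] = {
            word: (count + 1) / (total + vocab_size)
            for word, count in followers.items()
        }
    return prob_dict
```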
Developing Markov Chain
Key Idea:
1. Take the sequence of words (history) whose next word has to be predicted.
2. If the length of history is 1, look it up in the unigram dictionary keys. The word with the highest probability for that history key is the next word. If the key is not found, I generate a random next word.
3. If the length of history is 2, look it up in the bigram dictionary keys. The word with the highest probability for that history key is the next word. If the key is not found, I take history = history[-1:] and repeat step 2.
4. If the length of history is 3, look it up in the trigram dictionary keys. The word with the highest probability for that history key is the next word. If the key is not found, I take history = history[-2:] and repeat step 3.
5. If the length of history is > 3, I take history = history[-3:] and repeat step 4.
Code Sample
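The back-off steps above can be sketched as a single prediction function; the dictionaries are assumed to hold probabilities keyed by a word (unigram) or a tuple (bigram/trigram), and predict_next is an illustrative name:

```python
import random

def predict_next(history, ngram_dict1, ngram_dict2, ngram_dict3, vocabulary):
    """Pick the highest-probability next word, backing off from trigram
    to bigram to unigram keys, then to a random word from the vocabulary."""
    history = list(history)
    if len(history) > 3:
        history = history[-3:]           # step 5: truncate long histories
    while history:
        n = len(history)
        table = {1: ngram_dict1, 2: ngram_dict2, 3: ngram_dict3}[n]
        key = history[0] if n == 1 else tuple(history)
        if key in table:
            followers = table[key]
            return max(followers, key=followers.get)  # highest probability
        history = history[1:]            # back off to a shorter history
    return random.choice(vocabulary)     # no match even for a single word
```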
Github Link : https://github.com/kurchi1205/Next-word-Prediction-using-Swiftkey-Data/blob/main/Markov%20Model%20.ipynb
Test Results :
I have used perplexity to test this model.
For a vocabulary of 40,241 words, it gives a perplexity of 82,351.45, which is not satisfactory.
As a result, I have built classification-based models to improve the performance. Please refer to the main article for those.