Next Word Prediction Using SwiftKey Data

Detailed Implementation of Next Word Prediction

Prerana
Analytics Vidhya
10 min read · Aug 30, 2021


In this blog, I will walk you through the probabilistic and classification approaches to predicting the next word given a sequence of words.

Table of Contents

  • Business Problem
  • ML Formulation
  • Existing Work
  • First Cut Approach
  • Data Visualization
  • Probabilistic Approach
  • Classification Approach
  • Result Comparison
  • Examples
  • Future Works
  • References

Business Problem

This problem falls in the area known as natural language processing (NLP). NLP is a branch of AI that helps computers understand, interpret and manipulate human language. Text mining is an AI technique that uses NLP to convert unstructured text from documents into structured data that can be used by ML algorithms. The next word prediction problem uses both of these technologies to accomplish its results.

Why is next word prediction important?

It helps minimize keystrokes, which in turn saves time while typing, helps catch spelling errors and aids word recall. Students who are still working on their language skills can take its help, and people with dyslexia can benefit from it as well.

GITHUB LINK
https://github.com/kurchi1205/Next-word-Prediction-using-Swiftkey-Data

LINKED BLOGS

https://medium.com/@prerana1298/markov-chain-model-e642050e6c6f
https://medium.com/@prerana1298/next-word-prediction-using-gpt-1-ae999acfe3de
https://medium.com/@prerana1298/next-word-prediction-using-lstms-98a1edec8594
https://medium.com/@prerana1298/next-word-prediction-using-albert-4e2bf5372132

ML Formulation

In this section, I will discuss how to convert the above business problem into a machine learning problem.

Data Overview

For solving this problem, we need a corpus of text data. This data is provided by SwiftKey.

Data source: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

This zip file contains 4 folders, each containing data from a different language: English, Russian, Finnish and German.

I will be using the data from the English folder. It contains the following files.

  • en_US.twitter.txt — This text file contains English-language data from Twitter.
  • en_US.news.txt — This text file contains English-language data from news websites and channels.
  • en_US.blogs.txt — This text file contains English-language data from blog sites.

Mapping Real-World to ML Problem

Next word prediction involves predicting the next word. So, given a sequence of words generated from the corpus, I have to predict the word that has the highest probability of occurring next. Thus it is a predictive modeling problem for language, also known as Language Modeling. This is the probabilistic model.

We can also approach this problem in another way: we can treat each possible next word as a class, so it becomes a Multi-Class Classification problem.

Business Objectives and Constraints

Objectives

  • Cleaning text data from the corpus
  • Creating Sequences from the cleaned data
  • Building Statistical or Neural Language Models to predict next word

Constraints

  • Memory Constraint — The corpus may be too large, which might cause a memory error.
  • Latency Constraint — This is a low-latency problem, as the entire system is designed to enable fast typing.
  • OOV words — It is important to handle out-of-vocabulary words, because not all words will be present in the corpus, yet the model should still be able to handle them.

Performance Metric

For Probabilistic Model

Perplexity: It is defined as the inverse probability of the test set, normalized by the number of words.

Perplexity Equation
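
Written out for a test set W = w1 w2 … wN under a language model P, the standard form of this equation is:

```latex
% Perplexity of a test set W = w_1 w_2 ... w_N under a language model P
\[
  PP(W) \;=\; P(w_1 w_2 \dots w_N)^{-1/N} \;=\; \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}
\]
```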

For Classification Model

Categorical Cross Entropy: Each next word to be predicted is considered a category, so I will be using categorical cross-entropy for this.

Formula for Categorical CrossEntropy
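
Written out for a single prediction over a vocabulary of C candidate next words, with one-hot target y and predicted probabilities ŷ, the loss is:

```latex
% Categorical cross-entropy for one example with C classes,
% one-hot target y and predicted probabilities \hat{y}
\[
  L(y, \hat{y}) \;=\; -\sum_{i=1}^{C} y_i \log \hat{y}_i
\]
```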

Existing Work

  1. Paper doi:10.1109/icdse47409.2019.8971796

This research paper is on next word prediction in Hindi using deep learning techniques. I have summarised the techniques that have been used for next word prediction for the Hindi language.

  • The text is cleaned and unique words are generated.
  • These unique words are stored in a dictionary and mapped to indices, as neural networks work better with indices.
  • Sequence lengths are set; for example, if the sequence length is 6, the sentence is divided in a 6:1 ratio, i.e. the first six words go into input_x and the following word into input_y.
  • Tensors are created from input_x based on the sequence length and stored in data_x.
  • For each of the tensor sequences in data_x, the corresponding next word from input_y is stored as a tensor in data_y.
  • LSTM and Bi-LSTM models are used for prediction, choosing the next word with the highest probability.
  • The softmax activation function is used to assign probabilities to the candidate next words.
  • As this problem is treated as a multi-class classification problem, the loss function is categorical cross-entropy.
  • The LSTM model ran for 197 epochs and the Bi-LSTM model for 121 epochs.
  • This has been repeated for all sequence lengths.

2. Paper DOI: https://doi.org/10.1007/s00521-020-05245-3

This paper develops a next word prediction model for the Kurdish language. Here the authors discuss the N-gram language model for this problem.

N-Gram Language Model

In language models, probabilities are assigned either to an entire series of words or to the next word given some preceding words.

Here 2 scenarios are discussed:

1st Scenario — In the case of a sentence (w1 w2 w3 w4 w5), the probability of the sentence is given by P(w1)*P(w2)*P(w3)*P(w4)*P(w5), where P(wn) is the probability of wn.

P(wn) = (number of occurrences of wn in the corpus) / (total number of words in the corpus)

2nd Scenario — In the case of a sentence (w1 w2 w3 w4 w5), the probability of the sentence is given by P(w1)*P(w2|w1)*P(w3|w1w2)*P(w4|w1w2w3)*P(w5|w1w2w3w4), where P(wn|w1…wn-1) is the probability of wn given all of the preceding words.

P(wn|wn-1) = P(wn-1 wn) / P(wn-1), i.e. the probability of occurrence of the pair (wn-1 wn) in the corpus divided by the probability of occurrence of wn-1.

N-gram models can be of different types: unigram, bigram, trigram and so on.

Unigram Model — In this model, the probability of each word in the sentence is given by P(wi) = (count of wi in the text corpus) / (total number of words).

Bigram Model — This is similar to the 2nd scenario discussed above, but here we take into account only the previous word.

In the case of a sentence (w1 w2 w3 w4 w5), the probability of the sentence is given by P(w1)*P(w2|w1)*P(w3|w2)*P(w4|w3)*P(w5|w4), where P(wn|wn-1) is the probability of wn given that the preceding word is wn-1.

Trigram Model — This is similar to the bigram model, but here we take into account the 2 previous words.

Thus the expression for the n-gram model is

P(wi | w1 w2 … wi-1) ≈ P(wi | wi-n+1 … wi-1)

This is also called the Markov assumption.
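
To make the counting concrete, here is a minimal Python sketch (my own illustration, not code from the paper) that estimates bigram probabilities from a toy corpus using raw counts:

```python
from collections import Counter

corpus = "this is a very beautiful day this is a good day".split()

# Count unigrams and bigrams from the token stream
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    """P(word | prev_word) = count(prev_word word) / count(prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("this", "is"))  # 1.0 -- "this" is always followed by "is" here
print(bigram_prob("a", "very"))   # 0.5 -- "a" is followed by "very" once and "good" once
```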

Now, to deal with zero probabilities, the Stupid Backoff algorithm is used.

If a higher-order n-gram has a count of 0, we back off to a lower-order n-gram.

It is given by the following formula.

S(wi | wi-k+1 … wi-1) = f(wi-k+1 … wi) / f(wi-k+1 … wi-1)   if f(wi-k+1 … wi) > 0

S(wi | wi-k+1 … wi-1) = α · S(wi | wi-k+2 … wi-1)   otherwise

For example, let us consider the sequence of words “this is a very beautiful”.

Here we have to find the probability of “beautiful” given “this is a very”. Let’s say “beautiful” never occurred in the context “this is a very”, so the higher-order n-gram model assigns it probability 0. We therefore back off to the shorter context and compute the score as α*P(“beautiful”|”is a very”).
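
A minimal Python sketch of Stupid Backoff over simple n-gram counts (my own illustration; the paper does not provide code, and α = 0.4 is the value commonly used with this algorithm):

```python
from collections import Counter

def ngram_counts(tokens, max_n):
    """Count all n-grams up to length max_n, keyed by tuples of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

def stupid_backoff(word, context, counts, total_words, alpha=0.4):
    """Score of `word` given a context tuple, backing off to shorter contexts."""
    if not context:
        # Base case: relative frequency of the word itself
        return counts[(word,)] / total_words
    numer = counts[context + (word,)]
    denom = counts[context]
    if numer > 0 and denom > 0:
        return numer / denom
    # Back off: drop the left-most context word and discount the score by alpha
    return alpha * stupid_backoff(word, context[1:], counts, total_words, alpha)

tokens = "this is a very beautiful day this is a very good day".split()
counts = ngram_counts(tokens, max_n=4)
print(stupid_backoff("beautiful", ("is", "a", "very"), counts, len(tokens)))  # seen context
print(stupid_backoff("good", ("this", "is", "a"), counts, len(tokens)))       # backs off
```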

3. Article on Smoothing

Link: https://towardsdatascience.com/n-gram-language-models-af6085435eeb

Smoothing is required to deal with words that appear in the test set in an unknown context. There are different smoothing techniques.

  • Laplace Smoothing: To calculate the probability, we simply add 1 to the word count in the numerator and the vocabulary size V to the denominator. P(word) = (word count + 1) / (total number of words N + V)
  • P(new_word) = 1 / (N + V)
  • Add-k Smoothing: Instead of adding 1 to the frequency of the words, we add some fraction k to the count (a small sketch of these formulas follows this list).
  • P(word) = (word count + k) / (N + kV)
  • P(new_word) = k / (N + kV)
  • Backoff and Interpolation: Backoff is the same as the Stupid Backoff algorithm explained earlier. Interpolation is a way of combining different n-gram models, such as the unigram, bigram and trigram models, by giving a weight to each of them.
Backoff and Interpolation
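
A small Python sketch of the add-one and add-k formulas above (my own illustration, using toy unigram counts; `N` is the total word count and `V` the vocabulary size):

```python
def laplace_prob(word, counts, n_words, vocab_size):
    """Add-1 (Laplace) smoothing: (count + 1) / (N + V)."""
    return (counts.get(word, 0) + 1) / (n_words + vocab_size)

def add_k_prob(word, counts, n_words, vocab_size, k=0.1):
    """Add-k smoothing: (count + k) / (N + k*V)."""
    return (counts.get(word, 0) + k) / (n_words + k * vocab_size)

counts = {"the": 5, "cat": 2, "sat": 1}        # toy unigram counts
N, V = sum(counts.values()), len(counts)       # N = 8 words, V = 3 vocabulary size
print(laplace_prob("cat", counts, N, V))       # (2 + 1) / (8 + 3)
print(laplace_prob("dog", counts, N, V))       # unseen word: 1 / (8 + 3)
print(add_k_prob("dog", counts, N, V, k=0.1))  # 0.1 / (8 + 0.3)
```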

First-Cut Approach

In this section, I discuss the first-cut approach that I decided upon for this project.

  • Reading the data from the 3 text files and then appending them together into one.
  • Cleaning the text, which involves removing punctuation, numbers, special characters and extra whitespace.
  • Stopwords won’t be removed, as they play an important role in next word prediction.
  • Converting all words to lower case.

Visualizations

  • Generating unigrams , bigrams , trigrams along with their frequency.
  • Visualizing wordclouds of n grams.

Featurization

For Markov Chain Model using ngrams

  • I will create a nested dictionary: the first key will be a tuple of preceding words and the second key will be the next word.
  • The value of this dictionary will be the probability of occurrence of the 2nd key given the 1st key, computed with Laplace smoothing.
  • This will be done for bigrams and trigrams (a minimal sketch of this structure is shown after this list).
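
A minimal sketch of the intended structure, assuming `tokens` is the cleaned word list (the actual notebook code is in the GitHub repo linked above):

```python
from collections import defaultdict

def build_ngram_probs(tokens, context_size, vocab_size, alpha=1):
    """Map each context tuple to {next_word: Laplace-smoothed probability}."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        next_word = tokens[i + context_size]
        counts[context][next_word] += 1
    probs = {}
    for context, next_counts in counts.items():
        total = sum(next_counts.values())
        probs[context] = {
            word: (c + alpha) / (total + alpha * vocab_size)  # Laplace smoothing
            for word, c in next_counts.items()
        }
    return probs

tokens = "this is a very beautiful day this is a good day".split()
vocab_size = len(set(tokens))
bigram_model = build_ngram_probs(tokens, context_size=1, vocab_size=vocab_size)
trigram_model = build_ngram_probs(tokens, context_size=2, vocab_size=vocab_size)
print(max(bigram_model[("this",)], key=bigram_model[("this",)].get))  # -> 'is'
```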

For neural network model

  • Encode the entire corpus as integer indices.
  • I will generate sequences of length 2, 3 and 4 from the encoded data, maintaining the order of the indices.
  • I will then split each sequence into x and y such that only the right-most index goes into y and the rest into x (see the sketch after this list).
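
A minimal sketch of this featurization, assuming the Keras `Tokenizer` is used for the integer encoding (the exact pipeline in the repo may differ in details):

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["this is a very beautiful day", "this is a good day"]

# Map every word to an integer index
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
encoded = tokenizer.texts_to_sequences(texts)  # one list of indices per sentence

def make_sequences(encoded_texts, seq_len):
    """Sliding windows of seq_len indices; the right-most index becomes y, the rest x."""
    windows = []
    for sent in encoded_texts:
        for i in range(seq_len, len(sent) + 1):
            windows.append(sent[i - seq_len:i])
    windows = np.array(windows)
    return windows[:, :-1], windows[:, -1]

x, y = make_sequences(encoded, seq_len=3)  # inputs of 2 words, one target word
print(x.shape, y.shape)                    # (7, 2) (7,)
```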

Modelling

Markov Chain Model

  • I will develop a Markov model with Bigram and Trigram probabilities using the entire data.

Neural Network Model

  • Divide x and y into train and test sets for each sequence length.
  • Train a separate LSTM model for each sequence length, so that different models handle different input lengths (a sketch of one such model follows this list).
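
A minimal Keras sketch of one such model for a single input length (the layer sizes and vocabulary size here are placeholders, not the values used in the final notebook):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 10000  # placeholder: number of words in the tokenizer index + 1

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=100),  # word indices -> dense vectors
    LSTM(128),                                        # summarises the input sequence
    Dense(vocab_size, activation="softmax"),          # one class per candidate next word
])
# Sparse variant of categorical cross-entropy, since y holds integer word indices
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=..., batch_size=...)
```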

Data Visualization

To get an idea about the corpus, I carried out certain visualizations after cleaning the data.

Data Cleaning

For cleaning I have carried out the following steps:

  • Removing extra spaces: All the extra spaces between words are removed.
  • Removing special characters: Any special characters, including punctuation, are removed.
  • Tokenizing text data: The news and blogs data are tokenized using the nltk tokenizer.
  • Tokenizing Twitter data: The Twitter data is tokenized using the nltk TweetTokenizer.

Code Sample:
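
The notebook’s full cleaning code is in the GitHub repo linked above; below is a representative sketch of the steps described, assuming nltk:

```python
import re
from nltk.tokenize import word_tokenize, TweetTokenizer
# import nltk; nltk.download("punkt")  # may be needed once for word_tokenize

tweet_tokenizer = TweetTokenizer()

def clean_text(text):
    """Lower-case, strip special characters/numbers/punctuation, collapse extra spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)     # keep only letters and spaces
    text = re.sub(r"\s+", " ", text).strip()  # remove extra whitespace
    return text

def tokenize(text, source):
    """TweetTokenizer for the Twitter data, the standard nltk tokenizer otherwise."""
    cleaned = clean_text(text)
    if source == "twitter":
        return tweet_tokenizer.tokenize(cleaned)
    return word_tokenize(cleaned)

print(tokenize("This is   a very beautiful day!!!", source="blogs"))
# ['this', 'is', 'a', 'very', 'beautiful', 'day']
```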

Visualization 1

Top 50 words based on frequency
  • Visualising the top 50 words based on their frequency of occurrence
  • Most of them are stopwords, as they are used frequently.

Visualization 2

Bottom 50 words based on their frequency
  • Visualizing the 50 least frequent words in the corpus
  • All of them occur only once in the entire corpus

Visualization 3

Unigram WordCloud
  • Generated a word cloud of unigrams.
  • There are no stopwords here, as I was trying to visualize the most commonly used non-stopwords.

Visualization 4

Bi-Gram Wordcloud
  • Here I have included the stopwords, otherwise the meaning of the phrases would change.
  • There is heavy use of prepositions, as observed from the word cloud.

Visualization 5

Trigram Wordcloud

Most commonly used phrases

  • ‘one of the’
  • ‘out of the’
  • ‘be able to’
  • ‘some of the’
  • ‘going to be’

Visualization 6

Quadgram Wordcloud

Most commonly used phrases

  • ‘in the middle of’
  • ‘one of the most’
  • ‘the rest of the’

In most of the cases, there is heavy use of stopwords.

There are a total of 101,098,262 words in the entire corpus. From the visualizations, it was clear that most of the frequent words are stopwords, so I can’t exclude them while developing the model. Also, I couldn’t work on the entire corpus, as I did not have enough RAM to support operations on such a large dataset.

Probabilistic Approach

In the probabilistic approach, the next word is predicted based on the probability of its occurrence given a set of input words.

Under this section, we have 2 models.

  1. Markov Model — A detailed explanation is available here.
  2. GPT-1 Model — A detailed explanation is available here.

Classification Approach

In the classification approach, the next word is treated as a class to be predicted. Thus the problem turns into a Multi-Class Classification problem.

Under this section, we have 2 models.

  1. Stacked LSTM with and without Attention Mechanism — A detailed explanation is available here.
  2. ALBERT Model — A detailed explanation is available here.

Result Comparison

Performance of Different Models

Although the test loss for ALBERT is the lowest, it produces very poor predictions, as we have seen earlier. So I decided to use the GPT model as the final model for predictions, as it is the second best.

Next Word Predictions

Following are a few examples of next word predictions by the GPT model:

Prediction Examples
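
For reference, here is a minimal sketch of how a GPT-style model can be queried for the next word with Hugging Face transformers; this loads the pretrained openai-gpt weights rather than the fine-tuned checkpoint from the repo:

```python
import torch
from transformers import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
model.eval()

def predict_next_word(text, top_k=3):
    """Return the top_k most probable next tokens for the given prefix."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits    # shape: (1, seq_len, vocab_size)
    next_token_logits = logits[0, -1]      # distribution over the next token
    top_ids = torch.topk(next_token_logits, top_k).indices
    return [tokenizer.decode([int(i)]) for i in top_ids]

print(predict_next_word("this is a very"))
```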

Github Link : https://github.com/kurchi1205/Next-word-Prediction-using-Swiftkey-Data

Future Works

  • Here I have worked with only 1% of the entire data because of memory issues; working on the entire data should give much better results.
  • The best model was obtained by fine-tuning GPT-1. If we fine-tune GPT-2 or GPT-3, we may get better results, as they have more parameters.
  • The SwiftKey data also contains corpora for several other languages; we can work on them as well.

References

  • Next Word Prediction in Hindi using Deep Learning Techniques — doi:10.1109/icdse47409.2019.8971796
  • Next word prediction model for the Kurdish language — https://doi.org/10.1007/s00521-020-05245-3
  • N-Gram Language Models (Towards Data Science) — https://towardsdatascience.com/n-gram-language-models-af6085435eeb
  • SwiftKey dataset — https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip