Code diaries: Text Prediction (n-grams)

Jadesse Chan
Apr 16 · 5 min read

Natural Language Processing, otherwise known as NLP, is a subfield of computer science that allows computers to understand human language. Do you use the suggested words above your keyboard when you’re texting? That’s thanks to NLP!

NLP is all around us: information extraction, sentiment analysis, machine translation (e.g. Google Translate), spam detection, autofill, and chatbots, to name a few.

This article will give a brief introduction to NLP and describe my process of building a text prediction program using an n-gram language model.

Human language is constantly evolving and is full of nuance and ambiguity.

“The woman hit a man with an umbrella.” can be interpreted in two ways:

Courtesy of my artistic skills

The sentence above is ambiguous in the same way as a crash blossom, a news headline that can be read in more than one way: it is unclear whether “with an umbrella” describes how the woman hit the man or which man she hit. Other factors that make NLP difficult include non-standard English, such as tweets on Twitter, and neologisms like ‘unfriend’ and ‘unfollow’ that have developed with social media.

A core building block of NLP is the language model, which assigns a probability to a sentence or sequence of words.

For example, the phrase “The cat is sleeping” is more likely to appear than “The cat is friendly” (at least in my opinion). Just kidding, let’s try another one: “The cat is sleeping” or “The cat is dancing”. I think most of us can agree on this one ;)

The probability of these sentences can be shown as:

P(“The cat is sleeping”) > P(“The cat is dancing”)

Here, P denotes probability: based on our general knowledge, the sentence “The cat is sleeping” is more probable than the sentence “The cat is dancing”.

The same idea underlies an n-gram language model, which estimates the probability of a sequence of n tokens (words) from how often that sequence occurs in a training corpus.

I built my text prediction program on a trigram (3-gram) language model.

I used the Markov assumption, where the probability of each word depends only on the previous 2 words (for a trigram model).
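Written out, the trigram Markov assumption replaces the full history of each word with just the two words before it. The notation below is my own shorthand for illustration, with w_i standing for the i-th word of a sentence of n words:

```latex
P(w_1, \dots, w_n)
  = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
  \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})
```

The first equality is the chain rule of probability; the approximation is where the Markov assumption comes in.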

The re library for Python is a great tool for regex parsing the corpus before training the language model on it.

How I pre-processed my corpus using the re and unicode libraries

Pre-processing the text is crucial for removing unnecessary characters, like numbers and punctuation, and for normalizing capitalization. Depending on your NLP task, you may also need to remove stopwords (e.g. articles like ‘a’ and ‘the’). The NLTK library has a list of them that you can use. However, I did not remove stopwords from my corpus, because keeping them makes the text prediction output more coherent.
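As a rough illustration of this kind of clean-up, here is a minimal sketch using re together with the standard-library unicodedata module. It is not the exact code from this project: the function name preprocess, the choice of unicodedata, and the decision to keep apostrophes are all assumptions made for the example.

```python
import re
import unicodedata


def preprocess(raw_text):
    """Lowercase the text and keep only letters, apostrophes, and single spaces."""
    # Decompose accented characters, then drop anything that is not plain ASCII.
    text = unicodedata.normalize("NFKD", raw_text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    # Replace digits, punctuation, and other symbols with a space.
    text = re.sub(r"[^a-z'\s]", " ", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("Alice was beginning to get VERY tired; chapter 1, café!"))
```

One side effect worth knowing about: curly apostrophes are not ASCII, so this sketch drops them entirely (“don’t” becomes “dont”).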

Left: no stopwords removed. Right: stopwords removed

Now the text is ready for tokenization! The NLTK library provides a plethora of functions to aid in various NLP processes. In my case, I used ngrams() to split my text into trigrams:

trigrams = list(nltk.ngrams(text, 3, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))

The pad_left and pad_right arguments add symbols that mark the start and end of a sentence.

<s>I love computer science!</s>

A trigram is a sequence of 3 words or tokens of a sentence.

“<s> I love”, “computer science </s>”

Padding sentences helps the model capture how likely a certain word is to appear at the start or end of a sentence.
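To see every padded trigram for the example sentence, here is a small pure-Python sketch. The helper padded_ngrams is hypothetical and written only for this illustration; it pads each end with n - 1 symbols, which is my understanding of how NLTK's ngrams() behaves when padding is switched on.

```python
def padded_ngrams(tokens, n=3, left_pad="<s>", right_pad="</s>"):
    """Return the n-grams of tokens, padding both ends with n - 1 symbols."""
    padded = [left_pad] * (n - 1) + list(tokens) + [right_pad] * (n - 1)
    # Slide a window of width n across the padded token list.
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


for gram in padded_ngrams(["i", "love", "computer", "science"]):
    print(gram)
```

Because each end gets two padding symbols, the first trigram contains only the first real word, and the last trigram contains only the last one.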

Because I was working with a large dataset, the Alice in Wonderland corpus, I performed exploratory data analysis (EDA) to summarize the main characteristics of my text.

n-gram information about corpus

Here, I used NLTK’s FreqDist() to get the frequency distribution of my trigrams and visualized them on a line graph. Then I printed the top 5 trigrams.
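The counting step can be sketched with collections.Counter from the standard library, which FreqDist is built on. The toy sentence below is made up for the example and stands in for the real corpus; the plotting step is left out.

```python
from collections import Counter

# A toy token list standing in for the pre-processed corpus.
tokens = "the cat sat on the mat and the cat sat on the rug".split()

# Unpadded trigrams: every run of three consecutive tokens.
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
trigram_counts = Counter(trigrams)

# most_common(5) returns the five most frequent trigrams with their counts.
for trigram, count in trigram_counts.most_common(5):
    print(" ".join(trigram), count)
```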

However, while the NLTK library is incredibly helpful, its documentation can be challenging to follow. I often found myself spending more time trying to understand NLTK than coding. This happened when I used NLTK’s Language Model interface to compute the Maximum Likelihood Estimate, which estimates each word’s probability from its relative frequency in the training data. I ended up building a conditional frequency distribution instead, because I understood the underlying concept and the data structures NLTK uses to implement it.

Calculating the probability of a word (w3) preceded by 2 words (w1 and w2)

A default dictionary (Python’s defaultdict) is the underlying data structure of ConditionalFreqDist(), which counts how often each outcome occurs under each condition. Here, it stores the count of each word (w3) that follows the previous 2 words (w1, w2) in a trigram. If a key is not present, a default value is used instead of raising an error. A default dictionary is similar to calling getOrDefault() on a Map in Java.
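The same idea can be sketched without NLTK by nesting a Counter inside a defaultdict. The names build_model and probability are hypothetical helpers for this example, not NLTK functions, and the toy sentence is made up. The probability is the plain maximum likelihood estimate, count(w1, w2, w3) / count(w1, w2), with no smoothing.

```python
from collections import Counter, defaultdict


def build_model(trigrams):
    """Map each (w1, w2) context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for w1, w2, w3 in trigrams:
        model[(w1, w2)][w3] += 1
    return model


def probability(model, w1, w2, w3):
    """Maximum likelihood estimate: count(w1, w2, w3) / count(w1, w2)."""
    followers = model.get((w1, w2))
    if not followers:
        return 0.0  # unseen context; a fuller model would smooth or back off
    return followers[w3] / sum(followers.values())


tokens = "the cat is sleeping and the cat is dancing and the cat is sleeping".split()
model = build_model(zip(tokens, tokens[1:], tokens[2:]))
print(probability(model, "cat", "is", "sleeping"))
print(probability(model, "cat", "is", "dancing"))
```

Using model.get() for lookups, rather than model[...], avoids silently adding empty entries for contexts that never appeared in the corpus.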

To make my text prediction program emulate real-world scenarios, I appended predictions to the user’s input based on a weighted random probability, where each predicted word is weighted based on its frequency and randomly chosen by its relative weight.
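One way to sketch this weighted choice is with random.choices from the standard library. The helper predict_next and the follower counts below are invented for the example; in a real run the counts would come from the trigram model.

```python
import random
from collections import Counter


def predict_next(followers, rng=random):
    """Pick one word at random, weighted by how often it followed the context."""
    words = list(followers)
    weights = [followers[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


# Made-up counts of words seen after some two-word context.
followers = Counter({"sleeping": 6, "eating": 3, "dancing": 1})

# Draw many predictions with a seeded generator to see the weighting at work.
rng = random.Random(0)
sample = Counter(predict_next(followers, rng) for _ in range(1000))
print(sample.most_common())
```

Frequent words are chosen most often, but rarer ones still show up occasionally, which keeps the generated text from repeating the same phrase every time.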

And that’s it! The user can decide to keep appending words to the initial phrase or terminate the program.

Text prediction is just one small example of how NLP is applied to real-world scenarios. Other use cases include sentiment analysis, spam detection, and chatbots! Many NLP tasks are now implemented with machine learning techniques, but statistical models are a great foundation for understanding the underlying computations. Probability, linguistics, and computer science come together to form the evolving and fascinating field of NLP!

Analytics Vidhya
