A quick introduction to Language Models in Natural Language Processing

Devyanshu Shukla
5 min read · Mar 16, 2020


In this post, we’ll discuss everything about language models (LMs). We’ll cover:

  • What a language model is
  • Applications of language models
  • How to build a language model
  • How to evaluate a language model

Introduction

A language model in NLP is a model that computes the probability of a sentence (a sequence of words) or the probability of the next word in a sequence, i.e.

Probability of a sentence: P(w1, w2, …, wn) = P(w1) · P(w2 | w1) · … · P(wn | w1, …, wn-1) (Source: [1])
Probability of the next word: P(wn+1 | w1, w2, …, wn) (Source: [1])

Language Models vs. Word Embeddings

Language models are often confused with word embeddings. The major difference is that in a language model, word order matters, because the model tries to capture the context between words, whereas word embeddings capture only semantic similarity, since they are trained by predicting words within a window regardless of their order.

Applications of Language Models

Language models are a major part of NLP and are used in many places, such as:

  • Sentiment Analysis
  • Question-Answering
  • Summarization
  • Machine Translation
  • Speech recognition

Making a Language model

There are different ways to build a language model; let’s look at them one by one.

Using N-grams

N-grams are sequences of N words from a given corpus. For the sentence ‘I like pizza very much’, the bigrams are ‘I like’, ‘like pizza’, ‘pizza very’ and ‘very much’.
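As a quick illustration, here is a minimal Python sketch (the variable names are just for this example) that extracts those bigrams:

```python
# Slide a window of two words over the sentence to get its bigrams (N = 2).
sentence = "I like pizza very much"
tokens = sentence.split()

n = 2
bigrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
print(bigrams)
# [('I', 'like'), ('like', 'pizza'), ('pizza', 'very'), ('very', 'much')]
```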

Let’s say we have the context ‘students opened their’ and we want to find the next word, say w. Using a 4-gram model, we can represent this problem with the following equation, which gives the probability of ‘w’ being the next word.

P(w | students opened their) = count(students opened their w) / count(students opened their) (Source: [3])

Here, count(X) denotes the number of times X appears in the corpus.

For our LM, we have to compute and store counts for all the n-grams in the corpus, which requires more and more storage as the corpus grows. Once the LM gives us a list of candidate words and the probability of each being the next word, we can use sampling to choose a word from that list.
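Below is a rough sketch of what such a count-based model plus sampling might look like in Python; the toy corpus and all names here are purely illustrative:

```python
import random
from collections import Counter, defaultdict

# Tiny toy corpus; in practice this would be a large collection of text.
corpus = "the students opened their books . the students opened their minds .".split()

n = 4  # 4-gram model: condition on the previous 3 words
counts = defaultdict(Counter)
for i in range(len(corpus) - n + 1):
    context, word = tuple(corpus[i:i + n - 1]), corpus[i + n - 1]
    counts[context][word] += 1

context = ("students", "opened", "their")
total = sum(counts[context].values())
probs = {w: c / total for w, c in counts[context].items()}
print(probs)  # {'books': 0.5, 'minds': 0.5}

# Sampling: pick the next word in proportion to its probability.
next_word = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(next_word)
```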

It can be seen that for an N-gram model, the next word depends only on the last N-1 words of the sentence. Therefore, longer-range context and dependencies are lost as the sentence grows.

“Today the price of gold per ton,while production of shoe lasts and shoe industry,the bank intervened just after it considered and rejected an IMF demand to rebuild depleted European stocks, sept 30th end primary 76 cts a share.’’

The above text (Source: [3]) was generated using N-grams (N=3) from a corpus of business and financial news. It is grammatical but incoherent, because only the last two words are considered when predicting each next word.

This method is also susceptible to the sparsity problem: if the word ‘w’ never occurs after the given context in the corpus, its estimated probability will always be 0.

Using Neural Networks

To build a LM using a neural network, we consider a fixed window, i.e. a fixed number of words at a time. As shown in the diagram below, the words are represented as one-hot vectors and multiplied with the word-embedding matrix, and the resulting embeddings are concatenated to create a vector e. This vector is then passed through a hidden layer, and the output distribution over the vocabulary is produced by a softmax.

Architecture for Neural Language model (Source: [3])
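As a rough PyTorch sketch of such a fixed-window neural LM (the vocabulary size, dimensions and example word ids below are made up, and an embedding lookup stands in for the explicit one-hot multiplication):

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Embed a fixed window of words, concatenate, hidden layer, softmax over the vocab."""

    def __init__(self, vocab_size, embed_dim=64, window=3, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # lookup replaces one-hot x embedding matrix
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, window_ids):                  # (batch, window) word indices
        e = self.embed(window_ids)                  # (batch, window, embed_dim)
        e = e.flatten(start_dim=1)                  # concatenate the window's embeddings
        h = torch.tanh(self.hidden(e))
        return torch.log_softmax(self.out(h), dim=-1)   # distribution over the next word

model = FixedWindowLM(vocab_size=10_000)
window = torch.tensor([[21, 57, 304]])              # e.g. ids for "students opened their"
log_probs = model(window)                           # shape: (1, 10_000)
```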

This method solves the sparsity problem and needs far less storage than N-grams, but it has problems of its own. Since the network uses a fixed window of input, the amount of context it can use is fixed, which makes it inflexible. Increasing the window size also increases the size of the model, so it becomes inefficient.

Using Long Short Term Memory Networks (LSTM)

To solve the fixed-input-length problem, we use Recurrent Neural Networks (RNNs). As we saw with the N-gram approach, long-term dependencies are missing. If we use vanilla RNNs, we still have the same problem with long-term dependencies because of the vanishing gradient problem in RNNs. However, a special kind of RNN known as the LSTM solves this problem.

LSTMs are capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in subsequent work.

All RNNs have the form of a chain of repeating modules of a neural network. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer. In LSTMs, the repeating module has a different structure: instead of a single neural network layer, there are four, interacting in a very special way. Read about LSTMs in more detail in [4].

An LSTM chain (Source: [4])
Workings of an LSTM (Source: [3])
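A minimal PyTorch sketch of an LSTM language model could look like the following; it is not the exact model from the figures, and all sizes and token ids are illustrative:

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Predicts a next-word distribution at every position of an arbitrarily long sequence."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):              # (batch, seq_len), any sequence length
        e = self.embed(token_ids)
        h, _ = self.lstm(e)                    # hidden state carries context across time steps
        return torch.log_softmax(self.out(h), dim=-1)

model = LSTMLanguageModel(vocab_size=10_000)
tokens = torch.tensor([[21, 57, 304, 888]])    # arbitrary token ids
log_probs = model(tokens)                      # shape: (1, 4, 10_000)
```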

Evaluation of a Language model

Any model needs to be evaluated to improve it or to compare it with other models. Perplexity is used to evaluate Language models. It is a measurement of how well a probability model predicts test data.

Like the word ‘perplexed’, which means puzzled or confused, perplexity measures how confused our model is: low perplexity means the model generates coherent, well-formed sentences, whereas high perplexity denotes incoherent, jumbled sentences. Therefore, low perplexity is good and high perplexity is bad.

Mathematically, Perplexity is the inverse probability of the test set, normalized by the number of words.

Perplexity of a LM: PP(W) = P(w1, w2, …, wN)^(-1/N) (Source: [2])
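A small Python sketch of this computation, working with per-token log probabilities to avoid numerical underflow (the toy values are illustrative):

```python
import math

def perplexity(token_log_probs):
    """Perplexity from the log probability the model assigns to each test-set token."""
    n = len(token_log_probs)
    avg_log_prob = sum(token_log_probs) / n
    return math.exp(-avg_log_prob)   # equals P(test set) ** (-1 / n)

# Toy example: the model assigns probability 0.25 to each of 4 test tokens.
print(perplexity([math.log(0.25)] * 4))  # 4.0
```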

Conclusion

Language models are an important part of NLP and can be used in many NLP tasks. We saw how to make our own language models and what problems occur with each method. We came to the conclusion that the LSTM is the best of these methods for making a language model, as it takes care of long-term dependencies.

Reference and Image sources
