An Overview of Language Models

Specifically for the sentence completion task


Article by CHAUHAN JAINISH NILESHKUMAR 17110038 and Manas Satish Bedmutha, IIT Gandhinagar.

In this article, we explain the advancements in language modelling with a focus on the sentence completion task. Starting from the early N-gram language model, we present the explanations and transitions up to recent models. We explain what an N-gram model is and derive the probability expression for predicting future words. We then show the leap from classical to neural approaches and explain the RNN language model. We also discuss recent improvements to the RNN LM through dependency RNNs and labelling. We demonstrate the utility of these models for sentence completion. Finally, we summarize the journey of language models and list the current state of the art.

What is a language model?

“A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability to the whole sequence (w1, w2, …, wm).”

Language modelling has several applications in NLP, such as speech recognition, sentence completion, grammar and spelling checking, machine translation, and information extraction. There are two main categories of language models: count-based language models and continuous space language models.

N-Gram Language Model

For sentence prediction, count-based models can be used by modelling the joint probability distribution of the given words and the future words. Given a word sequence (w1, w2, …, wt), we want to predict the word sequence (wt+1, wt+2, …, wt+T). The probability of the latter given the former is the conditional probability below:

Equation (1): P(wt+1, wt+2, …, wt+T | w1, w2, …, wt)

The best possible sequence of predicted words is the one for which the above probability is maximum. This task can be modelled as:

Equation (2): (w*t+1, …, w*t+T) = argmax over (wt+1, …, wt+T) of P(wt+1, …, wt+T | w1, …, wt)

By the multiplication (chain) rule for joint probability, we can write Equation (2) as:

Equation (3): (w*t+1, …, w*t+T) = argmax over (wt+1, …, wt+T) of ∏ i=1..T P(wt+i | w1, …, wt+i−1)

Here, for prediction, we have used all previous words: we assumed that the next word depends on everything that came before it. But this is not the case in many practical applications. For tasks such as word prediction or weather prediction, it is often observed that the next sample depends only on the previous few samples. Making this assumption not only simplifies the computation but often gives better results. It is called the (N−1)th order Markov assumption, and the appropriate value of N differs from application to application. A language model built on this assumption is called an N-gram model. Under it, Equation (3) simplifies to:

Equation (4): (w*t+1, …, w*t+T) = argmax over (wt+1, …, wt+T) of ∏ i=1..T P(wt+i | wt+i−N+1, …, wt+i−1)

Here, the probability values are defined by the equation below:

Equation (5): P(wi | wi−N+1, …, wi−1) = count(wi−N+1, …, wi−1, wi) / count(wi−N+1, …, wi−1)

The N-gram counts are obtained from the given corpus. If N = 1 we call it a unigram model, if N = 2 a bigram model, if N = 3 a trigram model, and similarly for higher orders.
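To make Equation (5) concrete, here is a minimal Python sketch of the count-based estimate; the helper names (ngram_counts, ngram_probability) and the toy corpus are our own illustrative choices, not from the article:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_probability(tokens, context, word, n):
    """Maximum-likelihood estimate of P(word | context), as in Equation (5):
    count(context + word) / count(context), with a context of length n-1."""
    num = ngram_counts(tokens, n)[tuple(context) + (word,)]
    den = ngram_counts(tokens, n - 1)[tuple(context)]
    return num / den if den > 0 else 0.0

# Toy corpus, for illustration only
corpus = "the cat sat on the mat the cat ate the fish".split()
print(ngram_probability(corpus, ["the"], "cat", 2))         # bigram P(cat | the) = 2/4
print(ngram_probability(corpus, ["the", "cat"], "sat", 3))  # trigram P(sat | the, cat) = 1/2
```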

The graph below shows the Google Books N-gram frequencies of the phrases "Albert Einstein", "Sherlock Holmes", and "Frankenstein".

Using the N-gram language model for sentence completion

For efficient sentence completion using the N-gram language model, we combine it with a Viterbi-style search. Starting from the most recently entered word (wt), we iteratively predict future words until the word with the highest probability turns out to be a period (.). However, it is not certain after how many words the period will become the most probable word, or whether it will at all. To handle this, we introduce a threshold: if the highest probability for the next predicted word falls below the threshold, we stop adding words. Also, for higher-order N-gram models many of the probability values become zero, so we need to choose the order wisely and apply smoothing techniques to the model. To lower the computational cost, we take the log of Equation (4): summations are less expensive than multiplications, and since log is monotonic, the maximizer in log space is the same as in the original space.
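The following is a rough Python sketch of this completion loop under some simplifying assumptions: it performs a greedy search rather than a full Viterbi search, and the bigram table probs, the threshold value, and the toy vocabulary are invented for illustration:

```python
import math

def complete_sentence(prefix, probs, vocab, threshold=0.05, max_words=20):
    """Greedily extend `prefix` one word at a time using bigram probabilities.
    Stops on a period, when the best probability falls below `threshold`,
    or after `max_words` words. The score is accumulated in log space,
    mirroring the log form of Equation (4)."""
    sentence = list(prefix)
    log_score = 0.0
    for _ in range(max_words):
        context = sentence[-1]
        # Pick the vocabulary word with the highest conditional probability.
        best_word, best_p = max(
            ((w, probs.get((context, w), 0.0)) for w in vocab),
            key=lambda pair: pair[1],
        )
        if best_p < threshold:
            break
        sentence.append(best_word)
        log_score += math.log(best_p)
        if best_word == ".":
            break
    return sentence, log_score

# Toy bigram table, for illustration only
probs = {("I", "am"): 0.6, ("am", "happy"): 0.4, ("happy", "."): 0.7}
vocab = ["I", "am", "happy", "."]
print(complete_sentence(["I"], probs, vocab))
# (['I', 'am', 'happy', '.'], log(0.6) + log(0.4) + log(0.7))
```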

Interesting fact: if we build the N-grams from a corpus such as textbooks or digests, we can use the above approach to let a machine solve MCQs and fill-in-the-blank questions based on the given book.
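One way to make that concrete is to score each option by the probability the N-gram model assigns to the sentence with the blank filled in. The sketch below assumes a hypothetical sentence_log_prob helper that implements Equation (4) with smoothing; the question and options are invented:

```python
def pick_option(template, options, sentence_log_prob):
    """Fill the blank with each option, score the full sentence with the
    N-gram model, and return the highest-scoring option."""
    scores = {opt: sentence_log_prob(template.replace("___", opt)) for opt in options}
    return max(scores, key=scores.get)

# Invented example; `sentence_log_prob` would come from the trained N-gram model.
# pick_option("The powerhouse of the cell is the ___",
#             ["mitochondria", "membrane", "nucleus"],
#             sentence_log_prob)
```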

Example of sentence completion

Transition to neural models

Count-based language models assign probabilities to sentences based on the likelihood of their N-grams. However, N-grams suffer from sparsity and are not very efficient at modelling higher-order dependencies. That is where the idea of continuous space language models emerged, which embed each word into a lower-dimensional vector. This representation is built using either classical or neural techniques. In 2003, Bengio et al. proposed a neural language model. A pioneer among neural language models, it learns word representations during training and enables us to define a word with respect to others in a quantitative space. This model, however, is a feedforward network, so the context depth is limited. With the advent of faster computing resources and large-scale datasets, the approach to language modelling has shifted towards neural network frameworks that either extend the feedforward context or use recurrence.

The neural language model by Bengio et al.
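As a rough illustration of the feedforward architecture described above, here is a minimal NumPy forward pass in the spirit of Bengio et al.'s model; the dimensions, random weights, and tanh hidden layer are our own assumptions, and the direct word-to-output connections of the original paper are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, n, h = 1000, 50, 3, 100        # vocab size, embedding dim, context length, hidden units
C = rng.normal(0, 0.1, (V, d))       # shared word embedding matrix
H = rng.normal(0, 0.1, (h, n * d))   # hidden layer weights
U = rng.normal(0, 0.1, (V, h))       # output (softmax) weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_distribution(context_ids):
    """Forward pass: embed the previous n words, concatenate the embeddings,
    apply a tanh hidden layer, and return a distribution over the vocabulary."""
    x = np.concatenate([C[i] for i in context_ids])   # shape (n*d,)
    hidden = np.tanh(H @ x)                           # shape (h,)
    return softmax(U @ hidden)                        # shape (V,)

p = next_word_distribution([12, 7, 42])
print(p.shape, p.sum())   # (1000,) 1.0
```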

Recurrent Neural Network models

In 2010, Mikolov et al. proposed a model that records the previous context more compactly than the earlier models. They used Recurrent Neural Networks (RNNs) to capture long-range dependencies within a sentence. This enhances the embedding generated for each word by capturing as much context as required.

Assume a sentence to be a series of signals w. The word at the tth instance from the beginning is called w(t), and we take w(t) to be a one-hot encoding of that word over the entire vocabulary. Let the output representation be another series of signals s. Then the representation for the word w(t) is given by: s(t) = f(U·w(t) + W·s(t−1))

Here, s(t−1) is the representation of the previous t−1 words, W is the matrix connecting s(t−1) with the current word vector w(t), and U is the dictionary matrix containing the embeddings of all the words in the vocabulary. The activation f is usually taken to be the sigmoid, but other activations are also valid.

The basic structure of an LSTM model

With respect to this mapping, the RNN generates the word probability vector y(t) for the next word w(t+1) using the output word embedding matrix V. A softmax nonlinearity g(xi) = exp(xi) / ∑j exp(xj) is applied at the end, so y(t) = g(V·s(t)). The output vector y(t) is the probability distribution over the next word.
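Putting the two equations together, here is a minimal NumPy sketch of one step of the recurrence; the dimensions and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, hidden_size = 1000, 128

U = rng.normal(0, 0.1, (hidden_size, vocab_size))   # input (dictionary) matrix
W = rng.normal(0, 0.1, (hidden_size, hidden_size))  # recurrent matrix
V = rng.normal(0, 0.1, (vocab_size, hidden_size))   # output embedding matrix

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(word_id, s_prev):
    """One RNN LM step: w(t) is a one-hot vector, s(t) the new hidden state,
    y(t) the probability distribution over the next word."""
    w = np.zeros(vocab_size)
    w[word_id] = 1.0
    s = sigmoid(U @ w + W @ s_prev)   # s(t) = f(U w(t) + W s(t-1))
    y = softmax(V @ s)                # y(t) = g(V s(t))
    return s, y

s = np.zeros(hidden_size)
for word_id in [3, 17, 256]:             # a toy sequence of word indices
    s, y = rnn_step(word_id, s)
print(int(y.argmax()), float(y.sum()))   # predicted next-word index, 1.0
```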

Mikolov further combined a Maximum Entropy model over the vocabulary with the RNN. He used an N-gram context w(t−n+1, …, t) as input to the RNN and predicted w(t+1) as the index with the maximum probability in the distribution over the next word. An additional optimization is to supply a hash of the context instead of the direct vectors: since the vocabulary can be huge, the vectors become enormous and the computations cumbersome, and hashing the context reduces the number of computations needed. Thus, based on Recurrent Neural Networks, we are able to predict contextually aware words.

In recent years, many improvements have been made to this baseline RNN-based language model. One of the most popular is the use of Dependency RNNs. This network represents the words of each sentence as a graph, which is traversed recursively in a breadth-first fashion to build multiple unrolled sequences for a given sentence. Using these sequences improves the training efficiency of the RNN. Adding labels to the RNN equation enhances the learning process further; the labels can be any learnable metadata for the given corpus.

Using RNN Language Models for sentence completion

Sentence completion requires generating words that provide the correct ending to a given set of initial words. Since RNNs depend largely on context, they are heavily used to predict words based on the previous context.

The sentence is first padded with two special tokens, <s> at the beginning and </s> at the end, which help the program identify sentence boundaries. The RNN LM is applied to our task as follows: the first few words of the sentence, starting from <s>, are provided as input, and the output vector gives the probability of the next word. The word with the highest probability is chosen, appended to the input sentence, and passed through the network again, so the added context is used to predict the following word. This process continues until the word with the maximum probability is a sentence boundary tag (<s> or the end-of-sentence tag </s>).

Example of sentence completion
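A minimal sketch of this decoding loop is shown below; it reuses the rnn_step function from the earlier RNN sketch, and the greedy argmax choice, hidden_size, eos_id, and max_words cap are our illustrative assumptions:

```python
import numpy as np

def complete_with_rnn(prefix_ids, rnn_step, hidden_size, eos_id, max_words=30):
    """Feed the prefix (starting from <s>) through the RNN to build up context,
    then repeatedly append the most probable next word until the
    end-of-sentence tag </s> is predicted or max_words is reached."""
    s = np.zeros(hidden_size)
    for word_id in prefix_ids:
        s, y = rnn_step(word_id, s)
    generated = []
    for _ in range(max_words):
        next_id = int(y.argmax())     # greedy choice: highest-probability word
        if next_id == eos_id:
            break
        generated.append(next_id)
        s, y = rnn_step(next_id, s)   # feed the chosen word back in as context
    return generated

# Usage (illustrative): ids for "<s> the" as prefix, with eos_id for "</s>".
# completion = complete_with_rnn([bos_id, word_to_id["the"]], rnn_step, 128, eos_id)
```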

Summary

The count-based approaches gave the first insight into mapping the words we speak into numbers. Using probability distributions and statistical approaches, we were able to capture only short-range context effectively. With the introduction of RNN LMs, we have been able to capture long-range dependencies too. These models have been highly successful at the task of sentence completion, and multiple improvements and parameter changes can further improve the accuracy of predicting the correct words.

A survey of the state of the art for each dataset, along with its corresponding details, can be found below:

A comparison of the state of the art language models for different datasets

As more and more datasets are produced, we will have access to more training data. In addition to a large amount of training data, the advancements in deep learning techniques and implementations will bring forth newer models that are faster and more efficient. The day may not be far when AI will not just write our emails for us, but also capture our style of writing and chat with other people as if it is a human.

References

Mikolov, T., Karafiát, M., Burget, L., Černocký, J. and Khudanpur, S., 2010. Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association.

Bengio, Y., Ducharme, R., Vincent, P. and Jauvin, C., 2003. A neural probabilistic language model. Journal of machine learning research, 3(Feb), pp.1137–1155.

Bickel, S., Haider, P. and Scheffer, T., 2005. Predicting sentences using n-gram language models. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing.

Mirowski, P. and Vlachos, A., 2015. Dependency recurrent neural language models for sentence completion. arXiv preprint arXiv:1507.01193.

Slides from NLP 2019, IIT Gandhinagar, by Prof. Mayank Singh.
