Language Models and their Evolution

Jasmine Padhye
Machine Learning Reply DACH
Dec 6, 2022

Natural Language Processing (NLP) is one of the hottest research areas, dealing with extracting information from and understanding text data. The field has grown significantly in recent years, and one of the major contributing factors to this growth is the emergence of language models. This blog post gives an overview of what a language model is and how it has evolved over the years to become more powerful.

In simple words, a Language Model (LM) assigns a probability to a sentence, i.e., a sequence of words. It determines how likely a sequence of words is to occur. For example, the sentence It is going to rain today is more likely to occur than the sentence Today is rain. This example can be mathematically represented as follows:

P(It is going to rain today) > P(Today is rain)
But why do we need such probabilities? They are useful in many different use cases: text summarisation, in which an input text is shortened without losing its meaning; question answering, in which answers to a given question are extracted from the web or an input text; chatbots, in which a computer communicates with humans and provides assistance; text recognition, which converts text in an image into textual data; and many more. The following examples show how these probability distributions can be useful.

1. Machine translation: Translate input sentence from one language to another

When a sentence is translated from one language into another, the order of words in the output sentence might need to be rearranged, or some words might need to be replaced with more plausible ones for the output to make sense. For example, heavy rain is more plausible than large rain, and this probability helps in choosing the correct synonym during translation.

2. Autocorrect: Suggest a correction for misspelled words or incorrect grammar

When a word is misspelled, the probability of the given sequence drops. When this happens, the word that fits best with the given sequence can be suggested as the correct replacement. For example, 15 minutes is much more probable than 15 minuets. Autocorrect also considers which of the most probable words requires the fewest character changes compared to the misspelled word.

3. Speech recognition: Convert input voice signal to text

In speech recognition, many similar-sounding words can be confused with each other, so the likelihood of word combinations is needed to recognize which words were actually spoken. For example, the probability of someone saying I saw a cow is much higher than that of Eyes of a bow.

4. Sentence completion: Suggest next words in the input sentence

For sentence completion, the model needs to suggest the word most likely to occur given the current sequence of words. For example, a cheese pizza is much more likely to be ordered with olives than with rice.

The methods of finding these probability distributions have improved tremendously in the last few years.

Our intuition tells us to find this probability by applying the chain rule to the joint probability of a sentence. For example, to find the probability of the sentence I will not be able to attend the meeting, we do the following:

P(I will not be able to attend the meeting) = P(I) × P(will|I) × P(not|I will) × P(be|I will not) × … × P(meeting|I will not be able to attend the)

And we can estimate each of these individual conditional probabilities from counts in a corpus, for example:

P(meeting|I will not be able to attend the) = count(I will not be able to attend the meeting) / count(I will not be able to attend the)
However, the issue with this approach is that we can never have enough data to estimate these conditional probabilities for long word histories. Such long-term dependencies can be handled by simply limiting the number of words on which the next word depends. In other words, instead of computing P(meeting|I will not be able to attend the), the dependency can be reduced to two words to compute P(meeting|attend the). In NLP, this kind of prediction based on only a short history is called the Markov assumption.

This assumption implies that the probability of the n-th word depends only on the previous n-1 words, and the words before that do not affect it. The N-gram language model is based on this assumption (a toy bigram example is sketched right after the list below).

  • Unigram model: the probability of the n-th word depends on 1 - 1 = 0 prior words
  • Bi-gram model: the probability of the n-th word depends on 2 - 1 = 1 prior word
  • N-gram model: the probability of the n-th word depends on n - 1 prior words
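
As a toy illustration (not taken from the original post; the corpus and function name below are made up), a bigram model can be estimated by simply counting how often word pairs occur in a corpus:

```python
from collections import Counter

# Toy corpus; in practice the counts come from a large text collection.
corpus = [
    "i will not be able to attend the meeting",
    "i will attend the meeting today",
    "it is going to rain today",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """P(word | prev) estimated from raw counts (no smoothing)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "meeting"))  # 2/2 = 1.0 in this toy corpus
```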

With increasing n, looking back becomes more and more expensive. In addition, setting such a hard limit on how far back to look means the beginning of a sentence is eventually forgotten completely. To overcome this limitation, RNN-based language models were developed, allowing information to persist across steps.

Unrolled Recurrent Neural Network

In an RNN model, words are processed one after another in a loop by the same neural network, which gives the model its recurrent nature, as shown in the image above. At every step, the network processes the new input together with the information obtained from the previous step and passes the result on to the next step. This helps the network connect previous information with current information, which is useful for tasks such as sentence completion. For example, to predict the next word in The cat drank ___, an RNN-based language model can easily connect cat and drank to predict milk.

However, this model comes with its own limitations. When a sentence gets long, the most recent information alone is not enough, and context from earlier steps is required for the prediction. For example, in the sentence I went to India to visit my parents and I ate ___, the model can use the recent word ate to predict that the next word is a food item, but it needs to look much further back, to India, to predict a dish from Indian cuisine. As the gap between the current step and the relevant information grows, the performance of RNNs drops.
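
The recurrence itself is compact. Below is a minimal sketch of a single RNN step in Python with NumPy; the dimensions and weight names are illustrative assumptions, not a full trainable model:

```python
import numpy as np

# Illustrative sizes, not from the post.
vocab_size, embed_dim, hidden_dim = 1000, 32, 64
rng = np.random.default_rng(0)

W_xh = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden (recurrent) weights
W_hy = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))  # hidden-to-output weights
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """One recurrence step: combine the current input with the previous hidden state."""
    h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)
    logits = h_t @ W_hy          # scores over the vocabulary for the next word
    return h_t, logits

# Process a stand-in embedded sequence of 5 words, one step at a time.
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, embed_dim)):
    h, logits = rnn_step(x_t, h)
```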

As with a simple neural network, the weights of an RNN are updated in the backward pass during training. The gradients of backpropagation flow backward across time steps, and as sentences get longer, the risk of exploding or vanishing gradients increases. These shortcomings of RNN-based language models were addressed by LSTM networks.

LSTMs are a special kind of RNN equipped with a so-called long-term memory. Unlike a vanilla RNN, in which each cell consists of a single neural network layer, an LSTM cell consists of four interacting layers (gates) that decide which information to keep, which to forget, and how much to pass on. LSTMs can therefore perform better than simple RNNs, as the network can hold relevant information for longer. However, training an LSTM network can be time-consuming. Researchers later developed another breakthrough approach, simpler and faster to train than LSTMs: the Attention mechanism, which is at the core of the Transformer architecture proposed in the paper Attention Is All You Need.
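
For concreteness, here is a minimal sketch of an LSTM language model using PyTorch's nn.LSTM; the class name, vocabulary size, and layer dimensions are arbitrary illustrative choices, not the setup of any particular published model:

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Minimal LSTM language model: embed tokens, run the LSTM, score the next word."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))  # gates decide what to keep, forget, and pass on
        return self.out(h)                       # next-word logits at every position

model = LSTMLanguageModel()
logits = model(torch.randint(0, 1000, (2, 10)))  # batch of 2 sequences, 10 tokens each
print(logits.shape)                              # torch.Size([2, 10, 1000])
```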

LSTMs use gates to forget information. An Attention mechanism, instead of actively deciding which information is unimportant, concentrates on only a few things at every step while ignoring the others. For example, to predict the pronoun after Jane is writing ___, we only need to focus on Jane to predict her, and the other information, is writing, can be ignored. Similarly, to predict the next word of Jane is writing her ___, the attention mechanism focuses only on is writing, outputs essay, and ignores the less relevant information Jane and her.
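
The core computation behind this focusing behaviour is often written as scaled dot-product attention, the form used in Attention Is All You Need. The sketch below, with made-up toy dimensions, weights each value by how well its key matches the query:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weight each value by how well its key matches the query (softmax of scaled dot products)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity between query and every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # attention weights sum to 1
    return weights @ V, weights

# Toy example: 1 query attending over 4 positions with 8-dimensional representations.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(1, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
context, weights = scaled_dot_product_attention(Q, K, V)
print(weights)   # shows which of the 4 positions the query focuses on
```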

The idea behind the attention mechanism can be visualized quite nicely in image captioning, as shown below. The different colors show the correspondence between the underlined words and the attended regions in the attention maps.

Source: arXiv:1612.01887

This attention mechanism was then adopted in a deep learning model with an encoder-decoder structure called the Transformer. The encoder encodes the input sentence, as in any encoder-decoder architecture, but the attention mechanism allows it to also look at other parts of the input sentence while encoding a particular word. Similarly, it allows the decoder to attend to other relevant information while decoding.

The Transformer architecture contains no recurrence and processes the entire input sequence at once, which makes the computation easy to parallelize. This allows models with a huge number of parameters to be trained faster on large corpora. Because of these advantages, Transformers have had a massive impact on the NLP and Computer Vision domains. BERT by Google, GPT by OpenAI, and their variants are examples of such Transformer-based models in NLP, with up to billions of parameters, trained on large corpora using massive parallel computation. Nowadays, when we talk about language models, we usually mean these powerful Transformer-based language models.
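
As a rough illustration of how such pretrained Transformer language models are used in practice (the model choice and prompt here are arbitrary examples), the Hugging Face transformers library wraps them behind a simple pipeline API:

```python
from transformers import pipeline

# Downloads a small pretrained Transformer LM (GPT-2) on first use.
generator = pipeline("text-generation", model="gpt2")
print(generator("The cat drank", max_new_tokens=5)[0]["generated_text"])
```

Under the hood, this runs exactly the kind of next-word prediction described above for sentence completion, generating one most probable token at a time.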

