Evolution of Natural Language Generation

An article to draw attention towards the evolution of Language Generation Models

Abhishek Sunnak, Sri Gayatri Rachakonda, Oluwaseyi Talabi

Since the dawn of Sci-Fi cinema, society has been fascinated with Artificial Intelligence. Whenever we hear the term “AI”, our first thought is typically one of a futuristic robot from movies such as Terminator, The Matrix and I, Robot.

Although we might still be a few years away from robots that can think for themselves, there have been significant developments in the fields of machine learning and natural language understanding over the past few years. Applications such as Personal Assistants (Siri/Alexa), chatbots and Question-Answering bots are truly revolutionizing the way we interface with machines and go about our daily lives.

Natural Language Understanding (NLU) and Natural Language Generation (NLG) are among the fastest growing applications of AI due to the increasing need to understand and derive meaning from language, with its numerous ambiguities and varied structure. According to Gartner, “By 2019, natural-language generation will be a standard feature of 90 percent of modern BI and Analytics platforms”. In this post, we will discuss a brief history of NLG since the early days of its inception, and where it is headed in the coming years.

What is Natural Language Generation?

The goal of language generation is to convey a message by predicting the next word in a sentence. The problem of which word to predict (among millions of possibilities) can be tackled by using language models, which define a probability distribution over sequences of words. Language models can be constructed at a character level, n-gram level, sentence level or even paragraph level. For example, to predict the next word that comes after “I need to learn how to ___”, the model assigns a probability to each candidate next word, such as “write” or “drive”. Recent advances in neural networks such as RNNs and LSTMs have allowed processing of long sentences, significantly improving the accuracy of language models.

Markov Chains

Markov chains are among the earliest algorithms used for language generation. They predict the next word in a sentence by just using the current word. For example, if a model were trained using only the sentences “I drink coffee in the morning” and “I eat sandwiches with tea”, there is a 100% chance it would predict “coffee” to follow “drink”, while “I” has a 50% chance of being followed by “drink” and a 50% chance of being followed by “eat”. A Markov chain takes the relationship between each unique word into consideration to calculate the probability of the next word. Markov chains were used in earlier versions of smartphone keyboards to generate suggestions for the next word in the sentence.
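The two-sentence example can be reproduced with a few lines of Python. This is only a minimal sketch of a word-level (bigram) Markov chain; the function names `train_bigram_chain` and `next_word_probabilities` are made up for illustration and do not come from any library.

```python
from collections import Counter, defaultdict

def train_bigram_chain(sentences):
    """Count, for every word, how often each other word follows it."""
    transitions = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current_word, next_word in zip(words, words[1:]):
            transitions[current_word][next_word] += 1
    return transitions

def next_word_probabilities(transitions, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    counts = transitions[word]
    total = sum(counts.values())
    return {candidate: count / total for candidate, count in counts.items()}

corpus = ["I drink coffee in the morning", "I eat sandwiches with tea"]
chain = train_bigram_chain(corpus)
print(next_word_probabilities(chain, "drink"))  # {'coffee': 1.0}
print(next_word_probabilities(chain, "I"))      # {'drink': 0.5, 'eat': 0.5}
```

Because the table is keyed on the current word alone, nothing earlier in the sentence can influence the prediction.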

Markov Model for an example sentence (Source: Hackernoon)

However, by just focusing on the current word, Markov models lose all context and structure of the preceding words in the sentence which can lead to incorrect predictions, limiting their applicability in many generative scenarios.

Recurrent Neural Network (RNN)

Neural networks are models that are inspired by the workings of a human brain, offering an alternate method for computing by modeling non-linear relationships between inputs and outputs — their use for language modeling is known as neural language modeling.

An RNN is a type of neural network that can exploit the sequential nature of the input. It passes each item of the sequence through a feedforward network and gives the output of the model as an input to the next item in the sequence, allowing for the storage of information from the previous steps. The “memory” possessed by RNNs makes them great for language generation, as they can remember the context of the conversation over time. RNNs differ from Markov chains in that they also take previously seen words into account when making predictions, whereas Markov chains look only at the current word.
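To make the idea of “memory” concrete, here is a rough, dependency-free sketch of a recurrent step. It assumes the common textbook update h_t = tanh(W_xh x_t + W_hh h_(t-1) + b), uses random untrained weights, and introduces the hypothetical helpers `matvec`, `rnn_step`, `run_rnn` and `one_hot` purely for illustration.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: mix the current input with the previous hidden state."""
    from_input = matvec(W_xh, x_t)
    from_memory = matvec(W_hh, h_prev)
    return [math.tanh(a + b + c) for a, b, c in zip(from_input, from_memory, b_h)]

def run_rnn(inputs, hidden_size, seed=0):
    """Run the same step over a whole sequence, carrying the hidden state along."""
    rng = random.Random(seed)
    input_size = len(inputs[0])
    # Random, untrained weights: this only illustrates the data flow.
    W_xh = [[rng.gauss(0, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
    W_hh = [[rng.gauss(0, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
    b_h = [0.0] * hidden_size
    h = [0.0] * hidden_size
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states

def one_hot(index, size):
    """Encode a word index as a vector with a single 1.0 entry."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Two sequences that end in the same "word" (index 1) but start differently.
states_a = run_rnn([one_hot(0, 4), one_hot(1, 4)], hidden_size=3)
states_b = run_rnn([one_hot(3, 4), one_hot(1, 4)], hidden_size=3)
# The final hidden states differ because each one still carries the first word.
print(states_a[-1] != states_b[-1])
```

A Markov chain would treat both sequences identically at the last step, since their current word is the same.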

Unrolled Architecture of an RNN module (Source: Github)

RNNs for Language Generation

In every iteration of the RNN, the model stores in its memory the previous words encountered and calculates the probability of the next word. For example, if the model generated the text “We need to rent a ___ ”, it now has to figure out the next word in the sentence. For every word in the dictionary, the model assigns the probability based on the previous words it has seen. In our example, the words “house” or “car” will have a higher probability than words like “river” or “dinner”. The word with the highest probability is selected and stored in the memory, and the model then proceeds with the next iteration.
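The selection step described above can be sketched as a softmax followed by picking the most probable word. The scores below are invented for illustration and do not come from a trained model; `softmax` and `pick_next_word` are hypothetical helper names.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores.values())
    exps = {word: math.exp(score - highest) for word, score in scores.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}

def pick_next_word(scores):
    """Greedy choice: return the most probable word and the full distribution."""
    probabilities = softmax(scores)
    return max(probabilities, key=probabilities.get), probabilities

# Hypothetical scores a model might assign after "We need to rent a".
# The numbers are made up for this example.
scores = {"house": 3.1, "car": 2.8, "river": -0.5, "dinner": -1.2}
word, probabilities = pick_next_word(scores)
print(word)  # house
```

In a full generation loop, the chosen word would be appended to the text and fed back in as the next input.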

Sentence Generation through unrolling of an RNN

RNNs suffer from a major limitation — the vanishing gradient problem. As the length of the sequence increases, RNNs cannot store words encountered far back in the sentence, and only make predictions based on recent words. This limits the application of RNNs towards generating long sentences that sound coherent.
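A toy calculation shows why the gradient vanishes. The sketch below assumes a one-dimensional recurrence h_t = tanh(w * h_(t-1)) with no inputs, which is far simpler than a real RNN, and tracks how strongly the final state depends on the initial one. The function name `gradient_through_time` is hypothetical.

```python
import math

def gradient_through_time(weight, steps, h0=0.5):
    """Return d h_T / d h_0 for the scalar recurrence h_t = tanh(weight * h_{t-1})."""
    h = h0
    gradient = 1.0
    for _ in range(steps):
        h = math.tanh(weight * h)
        # Chain rule: each step multiplies by weight * tanh'(pre-activation),
        # and tanh'(z) equals 1 - tanh(z)^2.
        gradient *= weight * (1.0 - h * h)
    return gradient

for steps in (1, 10, 50):
    print(steps, gradient_through_time(weight=0.9, steps=steps))
```

Each step multiplies the gradient by a factor below 1 in this setting, so the influence of early words shrinks roughly exponentially with distance.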

Long Short-Term Memory (LSTM)

Architecture of an LSTM module (Source: Github)

LSTM-based neural networks are a variant of RNNs designed to handle long-range dependencies in the input sequence more accurately than vanilla RNNs. They are used in a wide variety of problems. LSTMs have a similar chain-like structure to RNNs; however, each repeating module contains four interacting neural network layers instead of the single layer found in a vanilla RNN. An LSTM is composed of 4 components: a cell, an input gate, an output gate and a forget gate. These allow LSTMs to remember or forget words over arbitrary time intervals by regulating the flow of information in and out of the cell.
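The interplay of the cell and the three gates can be sketched as follows. This is a minimal, untrained illustration that assumes the standard LSTM gate equations; the parameter layout and the names `affine`, `lstm_step` and `init_params` are hypothetical choices made for this example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, vector):
    """Compute weights @ vector + bias using plain lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: four layers (three gates plus a candidate) update the cell."""
    combined = h_prev + x_t  # list concatenation of previous hidden state and input
    forget_gate = [sigmoid(z) for z in affine(*params["forget"], combined)]
    input_gate = [sigmoid(z) for z in affine(*params["input"], combined)]
    output_gate = [sigmoid(z) for z in affine(*params["output"], combined)]
    candidate = [math.tanh(z) for z in affine(*params["candidate"], combined)]
    # Cell state: keep part of the old memory, add part of the new candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(forget_gate, c_prev, input_gate, candidate)]
    # Hidden state: expose a gated view of the cell state.
    h_t = [o * math.tanh(c) for o, c in zip(output_gate, c_t)]
    return h_t, c_t

def init_params(input_size, hidden_size, seed=0):
    """Create small random (untrained) weights and zero biases for each layer."""
    rng = random.Random(seed)
    def layer():
        weights = [[rng.gauss(0, 0.1) for _ in range(hidden_size + input_size)]
                   for _ in range(hidden_size)]
        return weights, [0.0] * hidden_size
    return {name: layer() for name in ("forget", "input", "output", "candidate")}

params = init_params(input_size=2, hidden_size=3)
h, c = [0.0] * 3, [0.0] * 3
for x_t in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h, c = lstm_step(x_t, h, c, params)
print(len(h), len(c))  # 3 3
```

When the forget gate outputs values near 0 the old cell contents are erased, and when it outputs values near 1 they are carried forward unchanged.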

LSTMs for Language Generation

Sentence Generation through unrolling of an LSTM

Consider the following sentence as an input to the model: “I am from Spain. I am fluent in ____.” To correctly predict the next word as “Spanish”, the model focuses on the word “Spain” in an earlier sentence and “remembers” it using the cell’s memory. This information is stored by the cell while processing the sequence and is then used when predicting the next word. When the full stop is encountered, the forget gate realizes that there may be a change in the context of the sentence, and the current cell state information can be overlooked. This allows the network to selectively keep track of only relevant information while also mitigating the vanishing gradient problem, which in turn lets the model remember information over a more extended period.

LSTMs and their variations seemed to be the answer to the problem of vanishing gradients when generating coherent sentences. However, there is a limit to how much information can be saved, as there is still a complex sequential path from previous cells to the current cell. This limits the length of sequences that an LSTM can remember to just a few hundred words. An additional pitfall is that LSTMs are very difficult to train due to high computational requirements. Due to their sequential nature, they are hard to parallelize, limiting their ability to take advantage of modern computing devices such as GPUs and TPUs.


Transformer

The Transformer was first introduced in the 2017 Google paper “Attention Is All You Need”, which proposed a novel method called the “self-attention mechanism”. Transformers are currently being used across a wide variety of NLP tasks, such as language modeling, machine translation and text generation. A transformer consists of a stack of encoders to process an input of arbitrary length and another stack of decoders to output the generated sentence.

Animation showing the use of a transformer for machine translation (Source: GoogleBlog)

In the above example, the encoder processes the input sentence and generates a representation for it. The decoder uses this representation to create an output sentence word by word. The initial representation (embedding) of each word is shown as an unfilled circle. The model then aggregates information from all other words using self-attention to generate a new representation per word, shown as a filled circle, informed by the entire context. This step is then repeated multiple times in parallel for all words, successively generating new representations. Similarly, the decoder generates one word at a time, from left to right. It attends not only to the other previously created words but also to the final representations developed by the encoder.

In contrast to LSTMs, a transformer only performs a small, constant number of steps while applying a self-attention mechanism which directly models relationships between all words in a sentence, regardless of their respective position. As a model processes each word in an input sequence, self-attention allows the model to look at other relevant parts of the input sequence for better encoding of the word. It uses multiple attention heads, which expand the model’s ability to focus on different positions regardless of their distance in the sequence.
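The core computation can be sketched as scaled dot-product self-attention. To stay short, this simplified version uses the word vectors themselves as queries, keys and values and has a single head, whereas a real transformer applies learned projections and several heads. The vectors and the function name `self_attention` are made up for illustration.

```python
import math

def softmax(scores):
    """Convert a list of raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(embeddings):
    """Single-head self-attention with queries = keys = values = embeddings."""
    scale = math.sqrt(len(embeddings[0]))
    outputs, all_weights = [], []
    for query in embeddings:
        # Compare this word with every word in the sequence, near or far.
        weights = softmax([dot(query, key) / scale for key in embeddings])
        # New representation: weighted average of all word vectors.
        mixed = [sum(w * value[d] for w, value in zip(weights, embeddings))
                 for d in range(len(query))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional word vectors (invented values).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(embeddings)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row of weights is computed independently of the others, which is what lets the work for all words run in parallel.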

In recent times, there have been a few modifications made to vanilla transformer architectures which significantly improved their speed and accuracy. In 2018, Google released a paper on Bidirectional Encoder Representations from Transformers (BERT) which produced state-of-the-art results for a variety of NLP tasks. Similarly, in 2019, OpenAI released a transformer-based language model with around 1.5 billion parameters to generate long, coherent articles using just a few lines of input text as a prompt.

Language generation using OpenAI’s GPT-2 model (Source: Venture Beat)

Transformers for language generation

Recently, transformers have also been used for language generation. One of the most well-known examples is OpenAI’s GPT-2 language model. The model learns to predict the next word in a sentence by using attention to focus on the previously seen words that are relevant to predicting the next word.
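Attending only to previously seen words is usually described as causal (masked) self-attention. The sketch below assumes that behavior and shows just the masking step on a small matrix of invented attention scores; `causal_attention_weights` is a hypothetical name, and this is not OpenAI’s implementation.

```python
import math

def causal_attention_weights(scores):
    """Row i holds raw scores from position i to every position.

    Positions after i are hidden before the softmax, so each word can only
    attend to itself and to the words that came before it.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        highest = max(visible)
        exps = [math.exp(s - highest) for s in visible]
        total = sum(exps)
        # Future positions receive zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Made-up raw attention scores for a three-word sequence.
scores = [[0.2, 1.5, 0.3],
          [0.7, 0.1, 2.0],
          [1.0, 0.4, 0.6]]
masked = causal_attention_weights(scores)
for row in masked:
    print([round(w, 2) for w in row])
```

The first word can only attend to itself, while the last word can draw on the whole sequence generated so far.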

Relationships determined by the self-attention mechanism in transformers (Source: Medium)

Text generation with transformers is based on a similar structure to the one followed for machine translation. Take the example sentence “Her gown with the dots that are pink, white and ___.” The model would predict “blue” by using self-attention to recognize the previous words in the list as colors (white and pink) and understanding that the expected word also needs to be a color. Self-attention allows the model to selectively focus on different parts of the sentence for each word instead of just remembering a few features across recurrent blocks (in RNNs and LSTMs) which mostly will not be used for several blocks. This helps the model recall more characteristics of the preceding sentence and leads to more accurate and coherent predictions. Unlike previous models, transformers can use representations of all words in context without needing to compress all information into a single fixed-length representation. This architecture allows transformers to retain information across much longer sentences without significantly increasing the computation requirements. They also perform better than previous models across domains without the need for domain-specific modifications.

The future of Language Generation

In this post, we saw the evolution of language generation from using simple Markov chains for sentence generation to using self-attention models for generating longer-range coherent text. However, we are just at the dawn of generative language modeling, and transformers are just one step towards truly autonomous text generation. Generative models are also being developed for other types of content such as images, videos, and audio. This opens the possibility of integrating these models with generative text models to develop advanced personal assistants with audio/visual interfaces.

However, we, as a society, need to be careful with the application of generative models as they open several possibilities for their exploitation in generating fake news, fake reviews and impersonating people online. OpenAI’s decision to withhold release of their GPT-2 language model due to the potential for its misuse is a testament to the fact that we have now entered an age where language models are powerful enough to cause concern.

Generative models have the potential to transform our lives; however, they are a double-edged sword. If these models are put through appropriate levels of scrutiny, whether by the research community or through government regulation, there is certainly going to be a lot more progress in this domain over the coming years. Regardless of the outcome, there should be exciting times ahead!