Neural text generation: How to generate text using conditional language models

Neil Yager
Published in Phrasee
May 4, 2018

Here is a toy project: build a Twitter bot that generates dialog in the style of Simpsons characters. The process is straightforward:

  • scrape the dialog from the scripts for all 635+ episodes
  • build a generative language model (e.g. using an open source tool such as Markovify)
  • let it loose and see what it comes up with

It wouldn’t take long, and the Twitter account might even get a small handful of followers (e.g. Parry and Jasper from Phrasee HQ).

This has been done for everything. There is even a service that will make online comments for you (with your trademark wit and charm) after you are dead and gone.

These models are fun to play with and demonstrate some key concepts in natural language processing (NLP). However, there aren’t many practical uses for random, undirected text.

What if we want to control the output? For example, how would we tell the generator to create something in the style of Marge or Grampa Simpson? Let’s take a look.

Text generation

Language models

It all starts with a language model. A language model is at the core of many NLP tasks, and is simply a probability distribution over a sequence of words:

P(w_1, w_2, …, w_n)

It can also be used to estimate the conditional probability of the next word in a sequence:

P(w_n | w_1, w_2, …, w_{n-1})

Let’s assume we have the sequence [my, cat's, breath, smells, like, cat, ____] and we want to guess the final word. A language model would estimate the probability for every word in the vocabulary:

P(food | my, cat's, breath, smells, like, cat) = 62%
P(toys | my, cat's, breath, smells, like, cat) = 14%
…
P(aardvark | my, cat's, breath, smells, like, cat) = 0.001%

There are several ways to create a language model. The most straightforward is an n-gram model that counts occurrences to estimate frequencies. A bare-bones implementation requires only a dozen lines of Python code and can be surprisingly powerful.
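
As an illustration, here is a toy bigram model in plain Python (a bare-bones sketch for this post, not Markovify itself): it counts word pairs, turns the counts into conditional probabilities, and samples new sentences word by word.

    import random
    from collections import defaultdict, Counter

    def train_bigram_model(sentences):
        """Count word pairs to estimate P(next word | previous word)."""
        counts = defaultdict(Counter)
        for sentence in sentences:
            words = ["<s>"] + sentence.lower().split() + ["</s>"]
            for prev, nxt in zip(words, words[1:]):
                counts[prev][nxt] += 1
        return counts

    def next_word_distribution(counts, prev):
        """Normalise the counts for prev into a probability distribution."""
        total = sum(counts[prev].values())
        return {word: n / total for word, n in counts[prev].items()}

    def generate(counts):
        """Sample a sentence word by word, starting from the start token."""
        word, output = "<s>", []
        while True:
            dist = next_word_distribution(counts, word)
            word = random.choices(list(dist), weights=list(dist.values()))[0]
            if word == "</s>":
                return " ".join(output)
            output.append(word)

    corpus = ["my cat's breath smells like cat food",
              "my dog plays with cat toys"]
    print(generate(train_bigram_model(corpus)))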

Neural language models are built using recurrent neural networks (RNNs). Two popular variations of RNNs are Long Short Term Memory (LSTM) networks and Gated Recurrent Unit (GRU) networks. RNNs have several advantages. Firstly, RNNs are able to take arbitrary length sequences as input. Secondly, and more importantly, they can learn long term dependencies in the data. For example, a neural language model trained on C source code can generate properly indented code and remember to close open brackets over long distances (examples here). You can’t do that with an n-gram model.
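
For a sense of what this looks like in code, here is a minimal word-level RNN language model sketched in Keras (the framework used later in this post). The layer sizes are arbitrary placeholders rather than recommendations; note the shape=(None,) input, which lets the model accept sequences of any length.

    from tensorflow.keras.layers import Input, Embedding, GRU, Dense
    from tensorflow.keras.models import Model

    VOCAB_SIZE = 10000  # placeholder vocabulary size

    # An integer-encoded word sequence of arbitrary length
    tokens = Input(shape=(None,), dtype="int32")
    x = Embedding(VOCAB_SIZE, 64)(tokens)                   # map word ids to vectors
    x = GRU(128, return_sequences=True)(x)                  # carry context across the sequence
    next_word = Dense(VOCAB_SIZE, activation="softmax")(x)  # P(next word | words so far)

    language_model = Model(tokens, next_word)
    language_model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")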

Generating new text

Given a language model, how do we generate text? It is an iterative process: select a word based on the sequence so far, add this word to the sequence, and repeat. Therefore, we just need to know how to pick the next word. There are a few strategies (each sketched in code after the list):

  • Sampling: Sample from the conditional word probability distribution. Words that are a better fit are more likely to be selected. For the example above, we would select the word “food” with probability 62%, “toys” with probability 14%, “aardvark” with probability 0.001%, etc.
  • Greedy: Always pick the word with the highest probability (aka argmax). Select “food”.
  • Beam search: The greedy approach doesn’t always result in the final sequence with the highest overall probability. A beam search keeps track of several probable variants at each step to avoid being led astray by local maxima. Select “food” and “toys”, and reassess what is better when more words have been added.
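
Here is a rough sketch of the three strategies. The helper dist_for(sequence), which returns the conditional distribution over the next word given the sequence so far, is a hypothetical stand-in for a trained language model.

    import heapq
    import math
    import random

    def sample_next(dist):
        """Sampling: draw the next word in proportion to its probability."""
        words = list(dist)
        return random.choices(words, weights=[dist[w] for w in words])[0]

    def greedy_next(dist):
        """Greedy: always take the single most probable word (argmax)."""
        return max(dist, key=dist.get)

    def beam_search(dist_for, beam_width=3, max_len=20):
        """Keep the beam_width most probable partial sequences at each step."""
        beams = [(0.0, ["<s>"])]                      # (log probability, sequence so far)
        for _ in range(max_len):
            candidates = []
            for log_p, seq in beams:
                if seq[-1] == "</s>":                 # finished sequences carry over unchanged
                    candidates.append((log_p, seq))
                    continue
                for word, p in dist_for(seq).items():
                    candidates.append((log_p + math.log(p), seq + [word]))
            beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        return max(beams, key=lambda c: c[0])[1]

    # With the cat-breath distribution from above, greedy selection picks "food"
    print(greedy_next({"food": 0.62, "toys": 0.14, "aardvark": 0.00001}))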

The output

Let’s look at an excerpt from Andrej Karpathy’s classic post on character level RNNs. It is from a sampled Wikipedia article:

Copyright was the succession of independence in the slop of Syrian influence that was a famous German movement based on a more popular servicious, non-doctrinal and sexual power post.

At a distance it has the appearance of English text. However, upon closer inspection, it is at best devoid of meaning, and at worst it will give you a splitting headache. The “sampling” method is partly to blame. There is too much randomness. It is like a drunkard managing to stay upright, but stumbling from one word to the next with no particular direction. However, even if we use a beam search instead of sampling, we still wouldn’t have any control over the semantics.

How can we generate text that means something?

Conditional language models

Recall that a language model assigns a probability to a sequence of words. A conditional language model is a generalization of this idea: it assigns probabilities to a sequence of words given some conditioning context (x):

P(w_1, w_2, …, w_n | x)

Let’s look at two examples.

Example 1: Neural Machine Translation

Machine translation is exactly what it sounds like: automatically translating from one language to another. In 2014 the field was rocked by a new approach. In the blink of an eye, decades of research were overturned by a new technique known as neural machine translation (NMT).

NMT uses a single neural network comprised of two RNNs:

  • Encoder RNN: Extracts all of the pertinent information from the source sentence to produce an encoding
  • Decoder RNN: A language model that generates the target sentence conditioned with the encoding created by the encoder

This architecture is known as a sequence-to-sequence model (or simply seq2seq for those studious of brevity). It is trained on sample pairs of the source language and its translation. Crucially, it is trained “end to end” via backpropagation as a single system — no more need for hand-crafted rules and intricate linguistic knowledge. (To the Phrasee linguistics team: this is an oversimplification of course! We need you more than ever.)

The “decoder” is a conditional language model. The output is based on the sequence generated so far and the original text to be translated:

Credit: this picture has been adapted from the excellent lecture notes from the Stanford course Natural Language Processing with Deep Learning (CS224n)

The decoder is trained with a method called “teacher forcing”. The target sequence is the input sequence offset by one. It is learning to predict the word that comes next.
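
In other words, at every position the decoder sees the true previous word and is asked to predict the word that follows it. A tiny illustration (token strings only, ignoring the integer encoding):

    # Teacher forcing: the target sequence is the input sequence offset by one
    sentence = ["<start>", "the", "cat", "sat", "down", "<end>"]

    decoder_input = sentence[:-1]   # ['<start>', 'the', 'cat', 'sat', 'down']
    decoder_target = sentence[1:]   # ['the', 'cat', 'sat', 'down', '<end>']

    for given, predict in zip(decoder_input, decoder_target):
        print(f"given '{given}' -> predict '{predict}'")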

Note that we are no longer generating random nonsense! NMT generates text with meaning. Text with purpose. The kind of text you could see yourself reading.

Example 2: Image captioning

For the image captioning problem we have:

  • Input: An image
  • Output: Text describing the image

The encoder extracts key features from the image, for example using a convolutional neural network (CNN). These features are stored as a compact encoding that is used to condition the language model:

How is information transferred from the encoder to the decoder?

In NMT and image captioning the encoder creates a fixed-length encoding (a vector of real numbers) that encapsulates information about the input. This representation has several names:

  • embedding
  • latent vector
  • meaning vector
  • thought vector

Here is the key: the embedding becomes the initial state of the decoder RNN. Read that again. When the decoding process starts it has, in theory, all of the information that it needs to generate the target sequence.
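
In Keras, for example, that single step comes down to one argument. The dimensions below are illustrative (a 50-d encoding conditioning a 50-unit GRU decoder), not a specific model.

    from tensorflow.keras.layers import Input, GRU, Dense

    encoding = Input(shape=(50,))             # the embedding / thought vector from the encoder
    decoder_inputs = Input(shape=(None, 64))  # the embedded target-sequence tokens

    # The key step: the encoding becomes the decoder's initial hidden state,
    # so the number of GRU units must match the encoding's dimensionality.
    decoder_outputs = GRU(50, return_sequences=True)(decoder_inputs,
                                                     initial_state=encoding)
    word_probs = Dense(15000, activation="softmax")(decoder_outputs)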

Once you understand this the sky is the limit. Here are a few more examples of applications for conditional language models:

Conditioning with word vectors

At Phrasee we do something a little different: we condition our language models with word embeddings. A word embedding is a dense vector of real numbers:

Word embeddings have the following desirable qualities:

  • they are fixed-length, which is convenient for machine learning algorithms
  • word embeddings capture semantic information about words (e.g. synonyms will be close in the vector space)
  • word embeddings are easy and fast to compute
  • word embeddings can be combined using vector arithmetic. The classic example: “king” − “man” + “woman” = a vector that is pretty close to “queen” (see the code sketch after this list).
  • word embeddings can be combined to build up more complex concepts that don’t correspond to a single word
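
You can try the vector arithmetic yourself with Gensim and a small set of publicly available pretrained vectors (a quick sketch for illustration; it downloads the vectors on first use and is not Phrasee's own embedding model):

    import gensim.downloader as api

    # 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
    vectors = api.load("glove-wiki-gigaword-50")

    # king - man + woman is close to queen
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

    # semantically related words sit close together in the vector space
    print(vectors.similarity("deal", "bargain"))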

Here is what our model looks like:

Note that in this case there is no encoder; the decoder is conditioned directly with the word embedding. It would be an easy modification to turn this into an end to end system where encoder embeddings are also learned. However, in our case we have an external source of embeddings that we would like to use.

Experiments

Data

Phrasee’s core business is using natural language generation to write marketing language (that outperforms human language). We will demonstrate conditional language models by generating email subject lines. We have a data set of about 2 million subject lines. Some were harvested internally, and the rest supplied by Notablist. They are predominantly promotional emails. A typical example would be an online retailer advertising their latest deal:

This week only: two pairs of running shoes for the price of one!

The decoder model

The decoder model has two inputs:

  1. The word embedding. For training, we select 1–3 words at random from the subject line and add their embeddings (see the sketch after the outputs below).
  2. The email subject line. This is integer-encoded as a sequence, with each number corresponding to a word in the vocabulary. An embedding layer is used to learn custom embeddings for the decoding process. (We could feed in the same embeddings used for #1 above; however, we found that allowing the decoder to learn custom embeddings improved the results.)

The decoder model output:

  1. The probability distributions for the subsequent words in the sequence.
  2. The hidden state of the RNN. This is not needed during training since the hidden state is only specified once at the start. However, during inference (text generation) the state is fed back to the decoder after each word is selected and the sequence is updated.
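
To make the inputs and targets above concrete, here is a rough sketch of how a single training example might be assembled. The names (word_vectors, word_to_id, the special tokens) are illustrative assumptions, and padding and one-hot encoding of the targets are omitted.

    import random
    import numpy as np

    def make_training_example(subject_line, word_vectors, word_to_id, max_len=22):
        """Build (conditioning vector, decoder input, decoder target) for one subject line."""
        words = subject_line.lower().split()

        # Input 1: add the embeddings of 1-3 randomly chosen in-vocabulary words
        in_vocab = [w for w in words if w in word_vectors]  # assumes at least one match
        chosen = random.sample(in_vocab, k=random.randint(1, min(3, len(in_vocab))))
        conditioning_vector = np.sum([word_vectors[w] for w in chosen], axis=0)

        # Input 2: the integer-encoded subject line wrapped in start/end tokens
        unk = word_to_id["<unk>"]
        token_ids = ([word_to_id["<start>"]]
                     + [word_to_id.get(w, unk) for w in words]
                     + [word_to_id["<end>"]])[:max_len]

        # Teacher forcing: the target is the same sequence offset by one
        decoder_input, decoder_target = token_ids[:-1], token_ids[1:]
        return conditioning_vector, decoder_input, decoder_target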

The decoder is a two-layer GRU. During development, a GRU worked slightly better than an LSTM, and two layers performed better than a single layer. Here is the model:

Implementation

For our implementation we used the following:

  • 50-dimensional word embeddings. They were built using Gensim’s word2vec implementation and the subject line database. Since the embeddings are used as the initial state for the first GRU layer, for simplicity both GRU layers have 50 hidden units.
  • The vocabulary is the 15,000 most common words in the subject line database.
  • We limit the sequence length to 22. This is 20 words for the subject line and two more for the “start of sequence” and “end of sequence” tokens.

We built our model using Keras, which is a high-level API for defining and training neural networks. Keras is incredibly expressive. The code for the conditional language model (which was originally based on the Keras seq2seq tutorial) is concise:
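
The original code is not reproduced here, so below is a minimal sketch in the same spirit, adapted from the Keras seq2seq tutorial and using the dimensions described above (two 50-unit GRU layers, a 15,000-word vocabulary, sequences of length 22). Layer names and minor details are assumptions rather than Phrasee's exact implementation.

    from tensorflow.keras.layers import Input, Embedding, GRU, Dense
    from tensorflow.keras.models import Model

    VOCAB_SIZE = 15000   # most common words in the subject line database
    EMBED_DIM = 50       # conditioning word vectors and GRU hidden units
    MAX_LEN = 22         # 20 words plus start and end of sequence tokens

    # Input 1: the conditioning word embedding (the sum of 1-3 word vectors)
    conditioning_vector = Input(shape=(EMBED_DIM,), name="conditioning_vector")

    # Input 2: the integer-encoded subject line
    decoder_tokens = Input(shape=(MAX_LEN,), name="decoder_tokens")
    token_vectors = Embedding(VOCAB_SIZE, EMBED_DIM)(decoder_tokens)

    # Two-layer GRU decoder; the conditioning vector is the initial state of the
    # first layer, and return_state exposes the hidden states for inference
    x, state_1 = GRU(EMBED_DIM, return_sequences=True, return_state=True)(
        token_vectors, initial_state=conditioning_vector)
    x, state_2 = GRU(EMBED_DIM, return_sequences=True, return_state=True)(x)

    # Output: a probability distribution over the vocabulary at every position
    next_word_probs = Dense(VOCAB_SIZE, activation="softmax")(x)

    model = Model([conditioning_vector, decoder_tokens], next_word_probs)
    model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
    model.summary()

At inference time a second model built from the same layers would also return the GRU states, so that after each generated word the updated states can be fed back in along with the growing sequence.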

Training the model

Here are a few details about the training process. Since this is for demonstration purposes, we didn’t put a great deal of effort into tuning the hyperparameters.

  • Loss function: categorical_crossentropy
  • Learning rate: 0.001
  • Optimizer: RMSProp
  • Batch size: 128
  • Backend: TensorFlow v1.7.0
  • Platform: We used FloydHub (please don’t tell me you are still managing your own deep learning AWS instances!). Tesla K80 GPU with 4 cores.
  • Runtime: We trained the model for about 10 hours (5 epochs), during which time we were standing around like a couple of Rory Calhouns. The performance was still improving but we deemed it to be sufficient for our purposes.

Sample output

Now for the fun part! Let’s generate some subject lines using the beam search strategy. Here are some good (and not so good) results:

Note 1: UNK is the token for “unknown”. This is a word that is not in the vocabulary.

Note 2: The output has been post-processed by an in-house algorithm that handles things like capitalization, spacing, plurality, inflection, etc. It also replaces generic fields (like “first_name” and “percent_val”) with specific values.

The first thing to notice is that the subject lines aren’t a random jumble of marketing buzzwords. They are broadly fluent and coherent. (Although, to be fair, if we use the sampling method for generation instead of beam search they make less sense.) More impressively, the model has successfully learned semantically related concepts. When conditioned with the “food” vector it generates a subject line about a 2 course meal with wine at a restaurant. Good stuff! Similarly:

  • “hurry” → “don’t miss out”
  • “won’t” + “last” → “last chance”
  • “best” + “deals” → “top 10 deals of the day”

Other observations:

  • The UNK token works as expected: it is used in place of specific products, locations, restaurant names, etc. These aren’t common enough to be included in the vocabulary.
  • When the model is conditioned with UNK we get a list of UNK’s in the output. This is probably because there are many subject lines in the training set that are simply lists of products, all of which are outside of the vocabulary.
  • Certain phrases (“and more”, “free shipping”, “30% off”) are over-represented in the output. In a production system the decoder would penalize overly generic phrases to enforce more diversity.
  • When conditioned on “free” + “shipping” the decoder got a little bit too excited. This paper suggests ways to deal with repetition.
  • A somewhat confusing result is “Happy Christmas!” when conditioned with “halloween”. The model has produced a seasonal subject line, which is cool. However, it has selected the wrong season, which is weird.

Future directions

There are many ways to enhance conditional language models. For example, it is possible to control the sentiment of the generated text as well (see here and here). Consider the subject line generator: not only could it generate a subject line about an upcoming sale, it could also be directed to convey a sense of urgency.

The main limitation is the length of text. Subject lines and tweets are in the sweet spot. However, it is challenging to generate lengthy text that is fluent and coherent, while maintaining control over the semantics of the output.

Further reading

By necessity, a lot of details have been glossed over, and some of the finer points may be confusing. Here are some helpful links to fill in the gaps.

Deep learning for NLP:

Recurrent neural networks:

seq2seq models:

Neural text generation
