Sequence to sequence learning: sequence generation

Vina Chang
Sep 2, 2018 · 6 min read

These are my notes for the lecture on sequence-to-sequence learning by Prof. Hung-yi Lee.


Simple generation:

Feed the “<BOS>” (beginning-of-sentence) token into an RNN model, sample (or take the argmax) from the output distribution, and repeatedly feed the previous output back into the RNN until the “<EOS>” (end-of-sentence) token is generated.

The output at each timestep can be regarded as the conditional probability of each word given all the words generated previously.
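A minimal sketch of this loop, using a toy vocabulary and randomly initialized weights as stand-ins for a trained model (all names and sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; indices 0 and 1 are the special tokens.
vocab = ["<BOS>", "<EOS>", "a", "b", "c"]
V, H = len(vocab), 8

# Randomly initialized toy RNN parameters (a trained model would learn these).
W_xh = rng.normal(size=(V, H)) * 0.1
W_hh = rng.normal(size=(H, H)) * 0.1
W_hy = rng.normal(size=(H, V)) * 0.1

def step(token_id, h):
    """One RNN step: returns the next hidden state and a word distribution."""
    x = np.zeros(V)
    x[token_id] = 1.0
    h = np.tanh(x @ W_xh + h @ W_hh)
    logits = h @ W_hy
    p = np.exp(logits - logits.max())
    return h, p / p.sum()

def generate(max_len=10):
    """Feed <BOS>, then repeatedly feed the sampled word back in until <EOS>."""
    h, token = np.zeros(H), vocab.index("<BOS>")
    out = []
    for _ in range(max_len):
        h, p = step(token, h)            # p is P(w_t | w_1 .. w_{t-1})
        token = int(rng.choice(V, p=p))  # sample (argmax also works)
        if vocab[token] == "<EOS>":
            break
        out.append(vocab[token])
    return out

print(generate())
```

With untrained weights the output is gibberish, of course; the point is only the shape of the loop.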

What if we want more than random generation?

Conditional sequence generation. By providing conditions to the RNN, we can get it to do tasks like machine translation (input: a sentence in one language; output: a sentence in another), video captioning (input: video frames; output: a caption), chatbots (input: a conversation; output: a response), etc. In this case, the RNN is usually referred to as the decoder.

By getting a vector representation of the input, we can feed it into the RNN. For image inputs, we can use a CNN model to get the vector encoding (the output right before the flatten layer); for text inputs, we can use an RNN model to get the encoding (the final hidden state of the model after feeding in the whole input text).

Image caption generation with RNN
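The text-input case can be sketched as a toy RNN encoder that compresses a variable-length sequence into one fixed-size vector (weights here are random stand-ins for trained parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
V, H = 5, 8  # toy vocabulary size and hidden size

W_xh = rng.normal(size=(V, H)) * 0.1
W_hh = rng.normal(size=(H, H)) * 0.1

def encode(token_ids):
    """Run an RNN over the input and return the final hidden state
    as the fixed-size vector representation of the whole sequence."""
    h = np.zeros(H)
    for t in token_ids:
        x = np.zeros(V)
        x[t] = 1.0
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h

z = encode([2, 3, 4, 1])
print(z.shape)  # one H-dimensional vector, regardless of input length
```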

But two problems result from this approach: 1) for long inputs, it’s hard to compress all the information into one vector embedding; 2) the RNN might forget which words it has already generated, because its hidden state cannot hold that much information.

Ref: Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau, “Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models”, 2015

Attention comes to the rescue

What if we let the RNN “take a look” at the input itself, instead of only the input embedding, while it’s generating words? That’s the idea of attention. At each timestep, the RNN computes the similarity between its current hidden state and all the hidden states of the input sequence, and decides which parts of the input to look at. The similarity can be computed in many different ways, such as cosine similarity, or by using a small NN (input: the RNN hidden state and an input hidden state; output: a similarity score). If an NN is used, it’s learned jointly with the RNN.

Attention calculation of a decoder at timestep 2

The calculated similarities are normalized to give a distribution. The inner product of this distribution and the input hidden states (i.e. their weighted sum) is the input to the decoder at the current timestep.

*Note: z0 is learned
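The whole step can be sketched with simple dot-product similarity (one of the many choices mentioned above; sizes are made up):

```python
import numpy as np

def attention(decoder_h, encoder_hs):
    """Dot-product attention: score each input hidden state against the
    decoder state, softmax the scores into a distribution, and return
    the weighted sum of the input hidden states."""
    scores = encoder_hs @ decoder_h        # one similarity per input position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # normalize to a distribution
    context = weights @ encoder_hs         # weighted sum = decoder input
    return context, weights

rng = np.random.default_rng(0)
encoder_hs = rng.normal(size=(4, 8))  # 4 input positions, hidden size 8
z0 = rng.normal(size=8)               # initial decoder state (learned in practice)
context, weights = attention(z0, encoder_hs)
print(weights.sum())  # 1.0
```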

For image captioning tasks, the input image is usually split into regions, each with its own embedding, and the similarity is calculated against these region embeddings.

Attention calculation in image captioning
The decoder learns to look at the right input (the frisbee in the pic) when generating the word “frisbee”

For video captioning, the similarity is calculated with the embedding of each frame.

Ref:
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015

Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville, “Describing Videos by Exploiting Temporal Structure”, ICCV, 2015


Some issues with conditional generation

The decoder keeps attending to the same part of the input and repeats output words

The decoder looks at frame 2 at timesteps 2 and 4, thus generating the word “woman” twice.

Ideally, we would want the decoder to attend to each part of the input evenly, without ignoring any part or focusing too much on some. Thus it’s common to add a regularization term on the attention weights.

Add a regularization term that forces the sum of the attention weights across timesteps for each component to approach τ, a hyperparameter to be chosen
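A minimal sketch of this regularizer: sum each component’s attention across all decoding timesteps and penalize the squared deviation from τ (the quadratic penalty is an assumption; the lecture only says the sums should approach τ):

```python
import numpy as np

def attention_regularizer(weights, tau=1.0):
    """weights: (T, N) array of attention weights over N input components
    across T decoding timesteps. Penalize components whose total attention
    across all timesteps deviates from tau."""
    per_component = weights.sum(axis=0)  # total attention each component got
    return float(((tau - per_component) ** 2).sum())

# Perfectly even attention over 4 timesteps and 4 components: zero penalty.
uniform = np.full((4, 4), 0.25)
print(attention_regularizer(uniform, tau=1.0))  # 0.0

# All attention piled on component 0: large penalty.
peaked = np.zeros((4, 4))
peaked[:, 0] = 1.0
print(attention_regularizer(peaked, tau=1.0))
```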

Exposure bias: the mismatch between training and testing

During training, at each timestep the correct answer from the previous timestep is provided as input, regardless of whether the decoder got the previous step right. At test time, however, if the previous output is wrong, it is still fed into the decoder, since there is no “correct answer” available. This creates a mismatch between training and testing, and once the decoder gets one step wrong, all of the following output is likely to be wrong, because that sequence may never have been explored during training.

One might try to make the training process work like the testing setting; that is, not provide the correct answer as input but use the previous output. Yet this results in slow training, or makes training extremely hard, because in the beginning the model is mostly learning the wrong thing. Say the correct sequence is ABB, but initially the decoder generates B at timestep 1. It then tries to learn sequence generation beginning with B. When it finally learns to generate A at timestep 1, everything it has learned becomes useless and it has to learn all over again.

There are two ways to mitigate the issue: 1) scheduled sampling; 2) beam search.

Scheduled sampling is a mixture of the original training process and the test-time process. It uses randomness to mitigate the issue: at each training step, it either uses the correct label or the previously generated output as input. In the early stages of training, the probability of using the correct answer is high, so the model gets to learn from the ground truth; as training continues, the probability of using the previous output is raised, so that training and testing match better.
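The sampling decision can be sketched as follows (the linear decay schedule is an assumption; the original paper also proposes exponential and inverse-sigmoid schedules):

```python
import numpy as np

rng = np.random.default_rng(0)

def next_input(ground_truth_token, model_token, step, total_steps):
    """Scheduled sampling: early in training, feed the ground-truth token;
    later, increasingly feed the model's own previous output instead."""
    p_truth = 1.0 - step / total_steps  # probability of using the correct answer
    if rng.random() < p_truth:
        return ground_truth_token
    return model_token

# At step 0 the ground truth is always used; at the final step, never.
print(next_input("cat", "dog", step=0, total_steps=100))    # "cat"
print(next_input("cat", "dog", step=100, total_steps=100))  # "dog"
```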

Beam search keeps the K best sequences (those with the highest probability) at each timestep, so one mistake might not affect the final result.

Beam search with K = 3
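A minimal sketch of the search itself, over a made-up two-word model (`step_probs` and its probabilities are invented for illustration; a real decoder would score tokens with the RNN):

```python
import math

def beam_search(step_probs, K, max_len):
    """step_probs(prefix) returns a dict token -> P(token | prefix).
    At each timestep, expand every beam by every token and keep the K
    sequences with the highest total log-probability."""
    beams = [((), 0.0)]  # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            for tok, p in step_probs(seq).items():
                candidates.append((seq + (tok,), lp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]
    return beams

# Toy model: the distribution over the next word depends on the prefix.
def step_probs(prefix):
    if not prefix:
        return {"A": 0.6, "B": 0.4}
    return {"A": 0.3, "B": 0.7}

best = beam_search(step_probs, K=3, max_len=2)[0]
print(best[0])  # ('A', 'B'): P = 0.6 * 0.7 = 0.42, the highest of the four
```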

Mismatch between the object level and the component level

When training a sequence model, the loss is summed over all the components (i.e. the sum of the cross entropy at each step), but a sequence model doing well at the component level might not be good when you look at the whole output. For example, the sentence “the dog is is fast” might score well at the component level, but since it repeats “is” twice, it’s not a good sentence.
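Concretely, the component-level objective is just a sum of per-step cross-entropy terms, −log P(correct word at step t); nothing in it looks at the output as a whole:

```python
import math

def sequence_loss(step_probs_of_correct_word):
    """Component-level loss: sum over timesteps of the cross entropy
    -log P(correct word at step t). It never inspects the whole sequence,
    so an output scoring well per step can still read badly overall."""
    return sum(-math.log(p) for p in step_probs_of_correct_word)

# High per-step probabilities give a low loss, even if the decoded
# sentence repeats a word and reads badly as a whole.
print(sequence_loss([0.9, 0.9, 0.8, 0.8, 0.9]))
```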

However, if we use some criterion that measures quality at the object level, it might not be differentiable, and thus we cannot use traditional training methods.

Reinforcement learning can be used in this case, by regarding the previously generated output as the observation, all possible words y as the action set, and the output word as the action taken.

Ref: Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba, “Sequence Level Training with Recurrent Neural Networks”, ICLR, 2016
