Sequence-to-sequence learning: sequence generation
This is my note for the lecture on sequence-to-sequence learning by Prof. Hung-yi Lee.
Simple generation:
Feed the “<BOS>” (begin-of-sentence) token into an RNN, sample (or take the argmax) from the output distribution, and repeatedly feed the previous output back into the RNN until the “<EOS>” (end-of-sentence) token is generated.
The output at each time step can be regarded as the conditional probability of each word given all the words generated so far.
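As a minimal sketch of this loop (in PyTorch; VOCAB_SIZE, BOS_ID, EOS_ID, and the GRU-based model below are my own hypothetical choices, not something from the lecture):

```python
import torch
import torch.nn as nn

# Hypothetical sizes and special-token ids; real values depend on your vocabulary.
VOCAB_SIZE, EMB_DIM, HID_DIM = 1000, 64, 128
BOS_ID, EOS_ID = 0, 1

class RNNGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.rnn = nn.GRUCell(EMB_DIM, HID_DIM)
        self.out = nn.Linear(HID_DIM, VOCAB_SIZE)

    def generate(self, max_len=50):
        token = torch.tensor([BOS_ID])                    # start from <BOS>
        h = torch.zeros(1, HID_DIM)
        result = []
        for _ in range(max_len):
            h = self.rnn(self.embed(token), h)
            probs = torch.softmax(self.out(h), dim=-1)    # P(w_t | w_1, ..., w_{t-1})
            token = torch.multinomial(probs, 1).squeeze(1)  # sample; use argmax for greedy
            if token.item() == EOS_ID:                    # stop when <EOS> is generated
                break
            result.append(token.item())
        return result
```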

What if we want more than random generation?
Conditional sequence generation. By providing a condition to the RNN, we can get it to do tasks like machine translation (input: a sentence in one language; output: a sentence in another language), video captioning (input: video frames; output: a caption), chatbots (input: the conversation so far; output: a response), etc. In this setting, the RNN is usually referred to as the decoder.
By getting a vector representation of the input, we can feed it into the RNN. For image inputs, we can use a CNN to get the vector encoding (the flattened feature vector before the final classification layer); for text inputs, we can use another RNN to get the encoding (its final hidden state after feeding in the whole input text).
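A rough sketch of this encoder-decoder setup for text input (hypothetical shapes; the encoder's final hidden state becomes the condition vector that initializes the decoder):

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, HID_DIM = 1000, 64, 128  # hypothetical sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.rnn = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len)
        _, h = self.rnn(self.embed(src))     # h: (1, batch, HID_DIM)
        return h                             # final hidden state = condition vector

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.rnn = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)
        self.out = nn.Linear(HID_DIM, VOCAB_SIZE)

    def forward(self, tgt, h0):              # tgt: shifted target tokens (teacher forcing)
        out, _ = self.rnn(self.embed(tgt), h0)
        return self.out(out)                 # logits for every time step

# usage: logits = Decoder()(tgt_tokens, Encoder()(src_tokens))
# For images, replace Encoder with a CNN whose output vector has size HID_DIM.
```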

But two problems result from this approach: 1) for long inputs, it is hard to compress all the information into one vector embedding; 2) the RNN might forget which words it has already generated, because its hidden state cannot hold that much information.
Ref: Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, Joelle Pineau, “Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models”, 2015
Attention comes to the rescue
What if we let the RNN “take a look” at the input itself, instead of only at the input embedding, while it is generating words? That is the idea of attention. At each time step, the RNN computes the similarity between its current hidden state and all the hidden states of the input sequence, and decides which parts of the input to look at. The similarity can be computed in many different ways, such as cosine similarity, or with a small NN (input: the decoder hidden state and one input hidden state; output: a similarity score). If an NN is used, it is learned jointly with the RNN.

The calculated similarities are normalized to give a distribution over input positions. The weighted sum of the input hidden states, with this distribution as the weights, is the input to the decoder at the current time step.
*Note: the initial decoder state z0 is a learned parameter.
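A minimal sketch of one attention step (assuming the small-NN scoring variant; the shapes and the additive scorer below are my own choices, not the lecture's exact formulation):

```python
import torch
import torch.nn as nn

HID_DIM = 128  # hypothetical size

class Attention(nn.Module):
    def __init__(self):
        super().__init__()
        # small NN that scores a (decoder state, input hidden state) pair; learned jointly
        self.score = nn.Sequential(
            nn.Linear(2 * HID_DIM, HID_DIM), nn.Tanh(), nn.Linear(HID_DIM, 1))

    def forward(self, z, enc_states):
        # z: (batch, HID_DIM) decoder state; enc_states: (batch, src_len, HID_DIM)
        z_exp = z.unsqueeze(1).expand_as(enc_states)
        alpha = self.score(torch.cat([z_exp, enc_states], dim=-1)).squeeze(-1)
        alpha = torch.softmax(alpha, dim=-1)                      # normalized weights
        context = (alpha.unsqueeze(-1) * enc_states).sum(dim=1)   # weighted sum of inputs
        return context, alpha                                     # context feeds the decoder

# The initial decoder state z0 is itself a learned parameter:
z0 = nn.Parameter(torch.zeros(1, HID_DIM))
```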
For image captioning, the input image is usually split into regions, and each region gets its own embedding; the similarities are computed against these region embeddings.


For video captioning, the similarity is calculated with the embedding of each frame.
Ref:
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015
Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville, “Describing Videos by Exploiting Temporal Structure”, ICCV, 2015
Some issues with conditional generation
The decoder keeps attending to the same part of the input and repeats output words

Ideally we would want the decoder to attend to every part of the input evenly, without ignoring some parts or focusing too much on others. Thus it is common to add a regularization term on the attention weights, as shown below.
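One common form of this regularization (from Show, Attend and Tell; τ and λ are hyperparameters, and α_{t,i} is the attention weight on input position i at decoding step t) pushes the total attention each input position receives, summed over decoding steps, towards a target value:

```latex
\mathcal{L} = -\sum_t \log P(y_t \mid y_{<t}, x)
              + \lambda \sum_i \Big(\tau - \sum_t \alpha_{t,i}\Big)^2
```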

Exposure bias: the mismatch between training and testing
During training, at each time step the correct answer from the previous time step is provided as input, regardless of whether the decoder got the previous step right. At test time, however, if the previous output is wrong, it is still fed into the decoder, since there is no “correct answer” at test time. This creates a mismatch between training and testing, and once the decoder gets one step wrong, all of the following output is likely to be wrong too, because that sequence may never have been explored during training.

One might try to make the training process work like the test setting; that is, not provide the correct answer as input but use the previous output instead. Yet this results in slow training, or makes training extremely hard, because in the beginning the model is mostly learning the wrong thing. Say the correct sequence is ABB, and initially the decoder generates B at time step 1. It then tries to learn how to continue a sequence that begins with B. When it finally learns to generate A at time step 1, everything it has learned becomes useless and it has to learn all over again.
There are two ways to mitigate the issue: 1) scheduled sampling; 2) beam search.
Scheduled sampling is a compromise between the original training process and the test-time setting. It uses randomness to mitigate the issue: at each training step, the decoder input is either the correct label or the previously generated output. In the initial stages of training, the probability of using the correct answer is high, so the model gets to learn from the ground truth; as training continues, the probability of using the previous output is raised, so training and testing match better.
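A sketch of how this could look inside a training loop (the helpers rnn_cell, embed, and out_proj are hypothetical; teacher_forcing_prob is decayed over training, e.g. linearly or with an inverse-sigmoid schedule):

```python
import random
import torch

def decode_with_scheduled_sampling(rnn_cell, embed, out_proj, tgt, h, teacher_forcing_prob):
    # tgt: (batch, tgt_len) ground-truth tokens; h: initial decoder hidden state
    inp = tgt[:, 0]                          # <BOS>
    logits_per_step = []
    for t in range(1, tgt.size(1)):
        h = rnn_cell(embed(inp), h)          # one decoder step (e.g. an nn.GRUCell)
        logits = out_proj(h)                 # (batch, vocab_size)
        logits_per_step.append(logits)
        if random.random() < teacher_forcing_prob:
            inp = tgt[:, t]                  # feed the correct previous word
        else:
            inp = logits.argmax(dim=-1)      # feed the model's own prediction
    return torch.stack(logits_per_step, dim=1)  # train with cross entropy as usual
```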

Beam search keeps the K best sequences (those with the highest probability) at each time step, so one mistake might not affect the final result.
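A compact sketch of the idea, written against a hypothetical step_fn(token, h) that returns log-probabilities over the vocabulary and the new hidden state:

```python
import torch

def beam_search(step_fn, h0, bos_id, eos_id, beam_size=3, max_len=50):
    beams = [([bos_id], 0.0, h0)]                      # (tokens, log-prob score, hidden)
    for _ in range(max_len):
        candidates = []
        for tokens, score, h in beams:
            if tokens[-1] == eos_id:                   # finished beams carry over unchanged
                candidates.append((tokens, score, h))
                continue
            log_probs, h_new = step_fn(torch.tensor([tokens[-1]]), h)
            top_lp, top_idx = log_probs.squeeze(0).topk(beam_size)
            for lp, idx in zip(top_lp.tolist(), top_idx.tolist()):
                candidates.append((tokens + [idx], score + lp, h_new))
        # keep only the K best partial sequences by total log-probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(t[-1] == eos_id for t, _, _ in beams):
            break
    return beams[0][0]                                 # best-scoring sequence
```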

Mismatch between object level and component level
When training a sequence model, the loss is summed over all the components (i.e. the sum of the cross entropy at each step), but a sequence model that does well at the component level might not be good when you look at the whole output. For example, the sentence “the dog is is fast” might score well at the component level, but since it repeats “is” twice it is not a good sentence.
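Concretely, the component-level loss is just the per-step cross entropy summed over the whole output sequence:

```latex
C = \sum_t C_t, \qquad C_t = -\log P(y_t \mid y_{<t}, x)
```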
However, if we use a criterion that measures quality at the object level, it might not be differentiable, and thus we cannot use the traditional training methods.
Reinforcement learning can be used in this case: regard the previously generated output as the observation, all the possible words y as the action set, and the generated word as the action taken; the object-level criterion can then serve as the reward.
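A rough REINFORCE-style sketch under this framing (reward_fn and step_fn are hypothetical; the object-level criterion, e.g. BLEU against the reference, supplies the reward; this follows the spirit of the approach rather than the exact recipe of the paper cited below):

```python
import torch

def reinforce_loss(step_fn, h0, bos_id, eos_id, reward_fn, max_len=50):
    # Sample a sequence, accumulating the log-probability of each chosen action (word).
    token, h = torch.tensor([bos_id]), h0
    log_prob_sum, generated = 0.0, []
    for _ in range(max_len):
        log_probs, h = step_fn(token, h)              # log P(word | observation so far)
        dist = torch.distributions.Categorical(logits=log_probs)
        token = dist.sample()                         # action: the word to emit
        log_prob_sum = log_prob_sum + dist.log_prob(token).sum()
        if token.item() == eos_id:
            break
        generated.append(token.item())
    reward = reward_fn(generated)                     # object-level criterion as reward
    return -reward * log_prob_sum                     # minimizing this ascends the reward
```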
Ref: Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba, “Sequence Level Training with Recurrent Neural Networks”, ICLR, 2016
