A Paper A Day: #2 Sequence-to-Sequence Learning as Beam-Search Optimization

Today I’ll summarize a paper by Sam Wiseman and Alexander Rush on seq2seq learning with beam-search optimization.

The dominant approach to training seq2seq models treats them as conditional language models: training maximizes the likelihood of each target word conditioned on the input sequence and the ground-truth history of target words. The objective is usually a cross-entropy loss, which minimizes the KL-divergence between the true next-word distribution and the distribution produced by the model.
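
To make this concrete, here is a minimal sketch of that per-word objective (teacher-forced cross-entropy) in PyTorch; the `decoder`, `enc_state`, and `tgt` names are placeholders of mine, not anything from the paper:

```python
import torch.nn as nn

# Standard seq2seq training sketch: feed the gold history (teacher forcing) and
# apply a per-word cross-entropy loss on the next gold word.
criterion = nn.CrossEntropyLoss()

def mle_loss(decoder, enc_state, tgt):
    # tgt: (batch, T) gold target word indices
    logits, _ = decoder(tgt[:, :-1], enc_state)             # (batch, T-1, vocab)
    return criterion(logits.reshape(-1, logits.size(-1)),   # per-word softmax + NLL
                     tgt[:, 1:].reshape(-1))
```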

While this approach is effective for training language models, the objective doesn’t align with the metric we care about at test time. In machine translation, for example, we care about producing a fully formed sequence in the target language given a source sentence, so we would rather optimize a “sequence-level” score (e.g. BLEU) than a per-word objective.

Another issue is exposure bias between training and test time: at training time the model is fed the ground-truth history, while at test time it is exposed to its own predictions, since previously predicted words are fed back in place of ground-truth words.

The authors also highlight the importance of controlling for label bias: a locally normalized objective like cross-entropy forces the probabilities of each history’s successors to sum to one, so the successors of an incorrect history receive the same total mass as the successors of the true history.
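
To make the label-bias point concrete, here is the locally normalized factorization used in standard training next to an unnormalized sequence score; the notation is mine, not taken from the paper:

```latex
% Locally normalized conditional LM: each history's successors must sum to one,
% whether the history is correct or not.
p(w_{1:T} \mid x) = \prod_{t=1}^{T} p(w_t \mid w_{1:t-1}, x),
\qquad \sum_{w} p(w \mid w_{1:t-1}, x) = 1 \ \text{for every history } w_{1:t-1}

% Unnormalized alternative: a sequence score built from per-step scores, with no
% per-history normalization constraint.
f(w_{1:T}, x) = \sum_{t=1}^{T} s(w_t \mid w_{1:t-1}, x)
```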

The question, then, is how to design an effective approach that addresses all three problems: training by scoring sequences rather than words, exposing the model to its own predictions, and controlling for label bias.

The authors offer a solution to all three problems. The main idea is to learn a model that assigns a score to any possible target sequence, trained with a procedure inspired by the learning-as-search-optimization (LaSO) framework of Daumé III and Marcu (2005). The framework also includes an efficient algorithm for backpropagating through beam search during training, so the standard seq2seq model architecture can be kept as-is.

Beam-search Training Scheme

I’ll leave the details of LaSO for a future post (maybe tomorrow?), but the main idea of beam-search training can be summarized as follows. The first trick is to predict a ranking score for a sequence rather than a probability for the next word. This can be done simply by keeping the recurrent LM architecture and dropping the final softmax layer, so the model emits unnormalized scores. Removing the local normalization controls for label bias, and scoring sequences rather than words addresses the mismatch between training and test time. A sketch of what such a scorer might look like is below.
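
This is my own simplification in PyTorch, not the authors’ code; the encoder side and attention are omitted:

```python
import torch.nn as nn

# A recurrent scorer: same shape as a recurrent LM decoder, but the final
# softmax is dropped, so each step emits unnormalized scores per candidate word.
class SequenceScorer(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)   # raw scores, no softmax

    def forward(self, prefix, state=None):
        # prefix: (batch, t) word indices of a partial hypothesis
        h, state = self.rnn(self.embed(prefix), state)
        return self.out(h[:, -1]), state            # (batch, vocab_size) scores
```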

The tricky part is doing this tractably. Ideally we would train by comparing the gold sequence to the highest-scoring complete sequence, but finding the argmax sequence under this model is intractable. The authors therefore adopt a LaSO-like training scheme, defining a loss that penalizes the gold sequence for falling off the beam during training; a simplified sketch of such a loss follows.
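
This is a rough illustration in my own notation; I omit the paper’s cost scaling and the LaSO-style step of resetting the beam to the gold prefix after a violation. Here `score_gold[t]` is the model’s score for the gold prefix at step t, and `score_kth[t]` is the score of the K-th (last) hypothesis kept on the beam at that step:

```python
def beam_violation_loss(score_gold, score_kth, margin=1.0):
    # score_gold, score_kth: 1-D torch tensors of per-step prefix scores.
    loss = score_gold.new_zeros(())
    for t in range(score_gold.size(0)):
        violation = margin - score_gold[t] + score_kth[t]
        if violation.item() > 0:     # gold prefix fell off (or nearly off) the beam
            loss = loss + violation
    return loss
```

In the paper this violation term is also scaled by a cost on the offending hypothesis, which is how a sequence-level penalty can enter the training signal.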


Experiments

Experiments span three tasks: word ordering, syntactic parsing, and machine translation. The results are compared against a highly tuned seq2seq system with attention (Luong et al., 2015). The version with beam-search optimization shows significant improvements on all three tasks, with especially large gains on tasks that require more difficult search.

Source Code

The source code is available on GitHub.