Beam Search in Seq2Seq Model

Dharti Dhami
5 min read · Dec 19, 2018


We learned about the basic seq2seq model, a type of conditional RNN with an encoder/decoder network, here. When we use this architecture we need a way to find the best combination of words for a translation. The most common algorithm for doing this is called beam search. Let’s take a look at how it works.

The beam search algorithm has a parameter B, called the beam width. Let’s see how beam search works with a beam width of 3.

With a beam width of 3, beam search keeps three candidates for the first word. To perform this first step, we run the input French sentence through the encoder network; the first output of the decoder network is then a softmax over all 10,000 words in the vocabulary, and we store the top 3 words because the beam width is 3.
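As a rough sketch of this first step, here is a toy example. The `decoder_step` function below is only a stand-in for the real decoder softmax (a trained model would compute it from the encoded source sentence), and the vocabulary size and beam width are the ones used in this post:

```python
import numpy as np

V = 10000   # vocabulary size
B = 3       # beam width

rng = np.random.default_rng(0)

def decoder_step(prefix):
    """Stand-in for the decoder: returns a softmax over the vocabulary
    given the encoded source sentence and the words chosen so far."""
    logits = rng.normal(size=V)
    return np.exp(logits) / np.exp(logits).sum()

# Step 1: keep the B most likely first words and store their log probabilities.
probs = decoder_step(prefix=[])
top_ids = np.argsort(probs)[-B:]
beams = [([int(w)], float(np.log(probs[w]))) for w in top_ids]
```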

Next, for each of the three choices, beam search calculates the probability of the second word given the first word and the input French sentence. The output of this step is the pairs of first and second words that are most likely. By the rules of conditional probability, the probability of a pair is P(first word | x) times P(second word | x, first word).

So for this second step, since we have 10,000 words in our vocabulary, we end up considering 3 × 10,000 = 30,000 possibilities and then pick the top three.

Because the beam width is three, at every step we instantiate three copies of the network to evaluate the partial sentence fragments. These three copies use different choices for the first word, but together they can very efficiently evaluate all 30,000 options for the second word.

This process repeats, adding one word at a time, until beam search selects EOS as the best next symbol, as sketched below.
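Continuing the toy sketch from above, the whole loop might look roughly like this: at every step each of the B hypotheses is expanded over all V words (B × V candidates), only the B best extended hypotheses survive, and hypotheses that end in a hypothetical EOS token are set aside as finished:

```python
EOS = 1          # hypothetical end-of-sentence token id
MAX_LEN = 20     # safety cap on output length

finished = []
while beams and len(beams[0][0]) < MAX_LEN:
    # Expand each surviving hypothesis over the whole vocabulary.
    candidates = []
    for prefix, logp in beams:
        probs = decoder_step(prefix)
        for w in range(V):
            candidates.append((prefix + [w], logp + np.log(probs[w])))
    # Keep only the B most likely extended hypotheses.
    candidates.sort(key=lambda c: c[1], reverse=True)
    beams = []
    for prefix, logp in candidates[:B]:
        if prefix[-1] == EOS:
            finished.append((prefix, logp))   # this hypothesis is done
        else:
            beams.append((prefix, logp))
```

The finished hypotheses are then compared, ideally after the length normalization described below, and the most likely one is returned.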

Notice that if the beam width were 1, this would essentially become the greedy search algorithm.

Length normalization is a small change to the beam search algorithm that can help get much better results.

Beam search maximizes the probability in the first formula below: the product of the conditional probabilities of each output word, where T_y is the total number of words in the output.

Since these probabilities are all less than 1, multiplying many of them produces a tiny number, which can cause numerical underflow. So in practice, instead of maximizing the product, we take logs; the log of a product becomes a sum of logs, which is the second formula. This gives a numerically stable algorithm that is less prone to rounding errors. Because the log function is strictly monotonically increasing, maximizing log P(y | x) gives the same result as maximizing P(y | x).
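Writing y^<t> for the t-th output word and T_y for the output length, the two objectives are (as usually written in the course material this post follows):

First formula, the product of probabilities:

$$\arg\max_{y}\; \prod_{t=1}^{T_y} P\left(y^{\langle t\rangle} \mid x, y^{\langle 1\rangle}, \ldots, y^{\langle t-1\rangle}\right)$$

Second formula, the sum of log probabilities:

$$\arg\max_{y}\; \sum_{t=1}^{T_y} \log P\left(y^{\langle t\rangle} \mid x, y^{\langle 1\rangle}, \ldots, y^{\langle t-1\rangle}\right)$$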

This objective function has an undesirable side effect: it unnaturally prefers very short translations, because multiplying fewer probabilities gives a higher value. The same is true for the log probability, since the log of a value less than 1 is negative, so every additional word makes the sum more negative. To solve this, we normalize the result by dividing it by the number of tokens in the output.
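In the form used in the course material this post follows, the normalized objective divides the sum of log probabilities by T_y raised to a hyperparameter α between 0 and 1 (often around 0.7); α = 1 fully normalizes by length and α = 0 turns normalization off:

$$\arg\max_{y}\; \frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P\left(y^{\langle t\rangle} \mid x, y^{\langle 1\rangle}, \ldots, y^{\langle t-1\rangle}\right)$$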

Error Analysis in Beam Search

Beam search is an approximate search algorithm, so it doesn’t always output the most likely sentence. We can improve its performance by increasing the beam width, to some extent, but we first need to know whether it is the beam search algorithm that’s causing problems or the RNN model itself.

Let’s use the example “Jane visite l’Afrique en septembre.” Say that in our machine translation dev set the human-provided translation y* is “Jane visits Africa in September,” but when we run beam search on the learned translation model it outputs ŷ = “Jane visited Africa last September,” which is a much worse translation of the French sentence.

The seq2seq model computes P(y | x). So let’s compute P(y* | x) and P(ŷ | x) using our RNN model and see which of the two is bigger. If P(y* | x) > P(ŷ | x), the model actually prefers the better translation but beam search failed to find it, so beam search is at fault. If P(y* | x) ≤ P(ŷ | x), the model itself assigns the worse translation at least as high a probability, so the RNN model is at fault.

So the error analysis process looks as follows.

We go through the examples in the development set and find the mistakes the algorithm made. Applying the comparison above to each mistake, we can figure out what fraction of errors are due to beam search versus the seq2seq model. If we find that beam search is responsible for a lot of errors, we increase the beam width. If, in contrast, we find that the seq2seq model is at fault, we do a deeper level of analysis to figure out whether to add regularization, get more training data, try a different network architecture, or something else.
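A rough sketch of this bookkeeping, assuming a hypothetical `model_logprob(x, y)` helper that returns log P(y | x) under the seq2seq model, and a list of dev-set mistakes given as (x, y_star, y_hat) triples:

```python
def attribute_errors(mistakes, model_logprob):
    """Return the fractions of mistakes attributable to beam search vs. the model."""
    beam_at_fault = 0
    for x, y_star, y_hat in mistakes:
        if model_logprob(x, y_star) > model_logprob(x, y_hat):
            # The model prefers the human translation, but beam search
            # failed to find it: beam search is at fault.
            beam_at_fault += 1
        # Otherwise the model itself scores the worse translation at least
        # as high, so the seq2seq model is at fault.
    total = len(mistakes)
    return beam_at_fault / total, 1 - beam_at_fault / total
```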
