Understanding greedy search and beam search

Jessica López Espejel
Feb 20, 2022


Greedy search and beam search are well-known algorithms for language generation tasks in NLP (Natural Language Processing). Neural machine translation (NMT) and automatic summarization (AS) are among the most popular automatic language generation tasks.

Greedy search and beam search aim to generate the output sequence of tokens from a neural network model. Both approaches focus on sequence-to-sequence models, which map an input sequence to a target sequence (Figure 1).

Figure 1. Sequence-to-sequence model [3]

The encoder accepts the input sequence of tokens (1), where N is the number of tokens in the input.

1. Input sequence: X = (x_(1), x_(2), …, x_(N))

The decoder generates the output sequence (2), producing one token y_(t) at each time step t, where T is the maximum number of tokens in the sequence.

2. Output sequence: Y = (y_(1), y_(2), …, y_(T))

The probability of each decoder output y_(t) is conditioned on the previously generated tokens. Mathematically, we represent it as follows:

P(y_(t) | y_(1), y_(2), …, y_(t−1))

At each time step t, the decoder computes the probability of every token in the vocabulary. Consequently, the larger the vocabulary, the higher the computational cost.
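To make this concrete, here is a minimal sketch (not from the original post) of what one decoder step looks like: a softmax turns a vector of logits, one per vocabulary token, into the conditional distribution P(y_(t) | y_(1), …, y_(t−1)). The random logits are only a stand-in for a real decoder, and the toy vocabulary is the 16-word one used in the example below.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<EOS>", "<UNK>", "the", "war", "is", "to", "from", "as",
         "was", "last", "second", "global", "abbreviated", "WWII", "1939", "1945"]

logits = rng.normal(size=len(vocab))   # stand-in for the decoder output at time step t
probs = np.exp(logits - logits.max())  # numerically stable softmax
probs /= probs.sum()

# P(y_t = token | y_1, ..., y_(t-1)) for every token in the vocabulary
for token, p in zip(vocab, probs):
    print(f"P({token}) = {p:.3f}")
```

The loop over the whole vocabulary is exactly why a larger vocabulary makes each decoding step more expensive.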

Greedy Search

Greedy search consists of taking, at each time step t, the token with the highest conditional probability from the vocabulary V.

For example, consider a vocabulary V with 16 words: <EOS>, <UNK>, the, war, is, to, from, as, was, last, second, global, abbreviated, WWII, 1939, 1945. Moreover, the maximum number of tokens in the sequence is T = 8.

According to Figure 2, the squares highlighted in red correspond to the words with the highest conditional probability at each time step t. In the first time step the word with the highest conditional probability is the, in the second it is last, and so on. Therefore, the output sequence predicted by the decoder is: the last global war is abbreviated as WWII.

Figure 2. Greedy search algorithm
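As an illustration, here is a minimal greedy-decoding sketch in Python. It is not the code behind the figure: step_probs is a hypothetical callable standing in for the decoder, returning a probability distribution over the vocabulary given the tokens generated so far.

```python
import numpy as np

def greedy_search(step_probs, eos_id, max_len):
    """Greedy decoding: at every time step, keep only the single most
    probable token. `step_probs(prefix)` is assumed to return the
    probability distribution over the vocabulary given the prefix."""
    prefix = []
    for _ in range(max_len):
        probs = step_probs(prefix)
        next_id = int(np.argmax(probs))  # token with the highest conditional probability
        prefix.append(next_id)
        if next_id == eos_id:            # stop as soon as <EOS> is produced
            break
    return prefix
```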

Main drawback:

  • The greedy search algorithm overlooks high probabilities that may appear in later tokens. Therefore, it does not always generate the optimal output sequence.

Beam Search

The beam search algorithm is an improved version of greedy search. Beam search has a parameter called beam_size: the number of candidates with the highest conditional probabilities kept at each time step t. In Figure 3, beam_size = 3.

Figure 3. Beam search algorithm

Sequence 1 — the last global war — (0.35 * 0.4 * 0.1 * 0.21) = 0.0029

Sequence 2 — the second war was — (0.35 * 0.2 * 0.25 * 0.2) = 0.0035

Sequence 3 — the war was the — (0.35 * 0.1 * 0.15 * 0.17) = 0.00089

Following the greedy search algorithm, we would select Sequence 1, because greedy search commits to the token with the highest probability at the second time step (last = 0.4) and continues the generation from that single branch. However, with the beam search algorithm, the sequence with the highest probability is Sequence 2. This is a clear example of the greedy search algorithm discarding sequences of tokens with a higher overall probability.
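The comparison can be verified with a few lines of Python that simply reproduce the products above:

```python
# Reproducing the three sequence scores from Figure 3
seqs = {
    "the last global war": [0.35, 0.4, 0.1, 0.21],
    "the second war was":  [0.35, 0.2, 0.25, 0.2],
    "the war was the":     [0.35, 0.1, 0.15, 0.17],
}
for text, probs in seqs.items():
    score = 1.0
    for p in probs:
        score *= p
    print(f"{text}: {score:.4f}")
# Greedy search commits to 'last' (0.4) at step 2 and never revisits that
# choice, while beam search keeps the lower-scoring prefix alive and ends
# up with the sequence that has the highest overall probability.
```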

Drawbacks:

  • Increasing the beam_size improves the quality of the output sequence significantly, but it reduces the decoding speed.
  • There is a saturation point beyond which increasing the beam_size no longer improves the quality of the decoding [1].
  • It does not guarantee finding the output sequence with the highest score [2].
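For reference, here is a minimal beam-search sketch along the lines described above. It is an illustration rather than a production implementation: step_probs is again a hypothetical callable standing in for the decoder, and scores are accumulated as log-probabilities (see the next section).

```python
import numpy as np

def beam_search(step_probs, eos_id, max_len, beam_size=3):
    """Minimal beam search. `step_probs(prefix)` is assumed to return a
    probability distribution (array) over the vocabulary given the prefix.
    Log-probabilities are summed to avoid numerical underflow."""
    beams = [([], 0.0)]                       # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            probs = np.asarray(step_probs(prefix))
            # expand each beam with its beam_size most probable tokens
            for token_id in np.argsort(probs)[-beam_size:]:
                candidates.append((prefix + [int(token_id)],
                                   score + np.log(probs[token_id])))
        # keep only the beam_size best candidates overall
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:                         # every surviving beam ended with <EOS>
            break
    return max(finished + beams, key=lambda c: c[1])
```

Setting beam_size = 1 reduces this procedure to the greedy search sketched earlier.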

From products to sums

Imagine we have an output sequence of length T = 100. To obtain the final conditional probability we multiply P(y_1) * P(y_2) * P(y_3) * … * P(y_100), and the result is a tiny number. Computers are limited in how precisely they can represent it, so we lose precision (numerical underflow). To tackle this issue, we use the product property of logarithms to turn the multiplication into an addition.

Therefore, we compute the score of an output sequence as the sum of log-probabilities:

log P(y_1) + log P(y_2) + … + log P(y_T)

However, the longer an output sequence is, the more logarithmic terms are summed, which biases the scores with respect to sequence length. Therefore, one common way to account for this, often called the length penalty, is to normalize the sum by T^α:

score = (1 / T^α) * (log P(y_1) + log P(y_2) + … + log P(y_T))

The most common values for α in the literature vary from 0.6 to 0.8 [5, 6].
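As a sketch, assuming the simple T^α normalization written above (the cited works use variants of this idea), the length-normalized score can be computed as follows; the token probabilities here are made up for illustration.

```python
import numpy as np

def sequence_score(token_probs, alpha=0.6):
    """Length-normalized log-probability: sum of log P(y_t) divided by T**alpha."""
    log_probs = np.log(np.asarray(token_probs))
    return log_probs.sum() / (len(token_probs) ** alpha)

short_seq = [0.35, 0.2, 0.25, 0.2]            # 4 tokens
long_seq  = [0.35, 0.2, 0.25, 0.2, 0.3, 0.3]  # 6 tokens

print(sequence_score(short_seq))  # scores stay comparable across lengths
print(sequence_score(long_seq))
```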

Criteria to stop the generation of tokens

  • When the sequence of tokens exceeds the maximum number of tokens T
  • When the total generation time exceeds some time budget
  • When the predicted token is <eos>, or whatever token we are using to indicate the end of the sequence (the three criteria are sketched below)
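A minimal sketch of a decoding loop combining the three criteria might look as follows; step_fn is a hypothetical placeholder for one greedy or beam-search decoding step that returns the next token id.

```python
import time

def decode_with_stopping(step_fn, eos_id, max_tokens, time_budget_s):
    """Stops on any of: maximum length T, time budget, or <eos> prediction."""
    prefix, start = [], time.monotonic()
    while True:
        if len(prefix) >= max_tokens:                  # criterion 1: max number of tokens T
            break
        if time.monotonic() - start > time_budget_s:   # criterion 2: time budget exceeded
            break
        next_id = step_fn(prefix)
        prefix.append(next_id)
        if next_id == eos_id:                          # criterion 3: <eos> predicted
            break
    return prefix
```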

Conclusion

Greedy search and beam search have been used extensively in NLP tasks. Meanwhile, various improvements have emerged to prune the search graph. These approaches manage to speed up the decoding process and/or improve the quality of the output sequences. Moreover, algorithms such as PICARD [4], through slight modifications to beam search, help to find valid output sequences of SQL code.

References

[1] Cohen, Eldan and Beck, J. Christopher. Empirical Analysis of Beam Search Performance Degradation in Neural Sequence Models. In Proceedings of the First Workshop on Neural Machine Translation, ACL 2017, Vancouver, Canada, August 4, 2017, pages 56–60

[2] Meister, Clara and Vieira, Tim and Cotterell, Ryan. Best-First Beam Search. In Transactions of the Association for Computational Linguistics, 2020, Cambridge, MA, vol. 8, pages 795–809

[3] Lopez Espejel, Jessica. Automatic abstractive summarization of long medical texts with multi-encoders Transformer and general-domain summary evaluation with wikiSERA. Computers and Society [cs.CY]. Université Paris-Nord — Paris XIII, 2021. English. NNT: 2021PA131010. tel-03376172

[4] Scholak, Torsten and Schucher, Nathan and Bahdanau, Dzmitry. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pages 9895–9901

[5] Zhang, Jingqing and Zhao, Yao and Saleh, Mohammad and Liu, Peter J. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020

[6] Raffel, Colin et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 2020, vol. 21, pages 1–67
