Automatic speech recognition (ASR) is one of the most difficult tasks in natural language processing. Traditionally it has been necessary to break down the process into a series of subtasks such as speech segmentation, acoustic modelling, and language modelling. Each of these subtasks was then solved by separate, individually trained models. However, in 2006, Alex Graves and his colleagues introduced Connectionist Temporal Classification (CTC). The CTC loss function allows for training deep neural networks end-to-end for tasks like ASR. The previously unavoidable task of segmenting the sound into chunks representing words or phones was now redundant. Alas, CTC-based models struggled to outperform traditional systems. But a lot of errors could easily be avoided with a little prior linguistic knowledge. Traditional ASR systems use a language model to overcome such issues. So when Hannun and his colleagues in 2014 proposed a search strategy for decoding CTC output with a language model, denoted prefix beam search, it was a logical and important step forward. They demonstrated that a BRNN-based acoustic model could go from a WER of 35.8% to 14.1%. Today, this method is used in some of the best performing ASR systems, such as Baidu’s Deep Speech 2 (Amodei et al., 2016).
Beam search in itself is not very complicated. However, when combined with CTC, the procedure can be somewhat mind boggling. This blog post aims to clarify the details of prefix beam search by going through the algorithm step-by-step.
In Figure 1 we have sketched out a typical ASR pipeline. As you can see, the decoding portion, which is where prefix beam search belongs, is dependent on both a CTC network and a language model. It is not necessary for you to understand exactly how to build and train a CTC network, but it is necessary to understand what the CTC output looks like. Also, you don’t need to understand every single detail about language models, but you must have a basic understanding of what these models are good for. We will start by briefly introducing these concepts below.
In order to keep it as simple as possible, just think of a language model as a function taking a sentence as input, which is often only partly constructed, and returning the probability of the last word given all the previous words. What is this good for? Imagine we have the following partly constructed sentence: “tell us a fairy”. Now we hear the next word, but are not able discern whether it should be “tale” or “tail”. These words are homophones, which means they sound alike, although they are spelled differently. A well trained language model should be able to tell us that “tale” is much more probable than “tail”, making the choice straight forward. For further background reading, you can check out the Wikipedia page on language models.
Language model - Wikipedia
In speech recognition, the computer tries to match sounds with word sequences. The language model provides context to…
CTC output and max decoding
Above, we mentioned that a CTC network could predict a transcript directly from an audio input. Well, almost. What is actually being predicted is a matrix, let’s call it
ctc for short, where columns correspond to timesteps and each row corresponds to a letter in our alphabet. Since each column sums to one and all entries in
ctc are greater than zero, we have a categorical distribution over our alphabet for each timestep － essentially, a letter prediction. Usually, we make such a letter prediction for every 20ms of audio, and our alphabet contains at least the letters A-Z, a space (_), and a blank token (-), where the latter is required by CTC networks. We will also be using an end-character (>) to be predicted after the last word. An example is sketched out in Figure 2 below. Since most values are very small numbers, we have rounded them to zero for readability.
The easiest way to decode this is to simply take the letter with the highest probability at each timestep － a method called max decoding or greedy decoding. In Figure 1, this is represented by the red highlights and will yield a path through the following letters:
This is not very readable. In order for it to become so, we need to perform two more steps: (1) Collapse repeating characters, then (2) discard blank tokens. This results in:
You have probably already figured that it was supposed to say BUGS_BUNNY>. Although we didn’t get it quite right, this example still illustrates two important functions of the blank token; it works as a “garbage token” that allows for predicting sequences of varying length, and it makes it possible to successively predict the same character. If the network had not predicted a blank between the N’s, they would have been collapsed. Thus, the two contraction rules used to arrive at the final label are at the core of CTC networks.
Problems with max decoding
However, merely taking the letter with the highest probability at each timestep is a simplification. In fact, we can trace multiple paths through
ctc all yielding the same label, given the contraction rules. In order for us find the optimal label, we need to first calculate the probability of all possible paths in
ctc, sum together the probability of paths yielding the same label, and then select the label with the highest probability. In Figure 2 we have 28³⁰ possible paths (usually the number of timesteps is much higher than 30, so this approach is clearly not tractable).
Furthermore, max decoding doesn’t give us the option to incorporate a language model. A language model would have penalized the use of a non-word like BUNNI. And if BUNNI is successfully corrected to BUNNY, the word BUGS should be a more likely preceding word than BOX. Using prefix beam search, we can alleviate both these issues.
Prefix beam search
In this post we will present a basic Python implementation of prefix beam search which is available on GitHub. We will go through this step-by-step below, but let’s start with a short intuitive high-level explanation. Starting with an empty string, we iterate over each timestep in
ctc. We then try to extend this initial candidate with each of the characters in our alphabet. We only select a fixed sized subset of the most likely extended candidates, which we will then try to extend further at the next timestep. Whenever we try to extend with a space character, we will impose a language model constraint on the last word, thus forcing it to select candidates with a high language model probability. At the last time-step, we select the most probable candidate as our final label. The candidates at each time-step are normally referred to as prefixes, which explains the name of the algorithm.
Before we start the code walk-through, we need to define a few important variables. We will present these as they are written in our Python implementation, but it should not be hard to figure out what corresponds to what in the original paper from Hannun and colleagues.
The emission probability,
ctc[t][c], is the probability of a character, c, at a specified timestep, t. Technically, this is just a look-up in the matrix Y. Returning to Figure 2, we would for example have
ctc=0.98, where 4 is the timestep index and 2 the index of the letter B.
The blank probability,
Pb[t][l], is the probability that a prefix, l, at a specific timestep, t, originates from one or more paths ending in the blank token. For example, after three timesteps, how many paths will yield the prefix B and also end in a blank token? I count three: B--, -B-, and BB-. So in an ideal world,
Pb["B"] would be exactly equal to the sum of the probability for each of these paths. But as noted above, it’s not tractable to calculate the probability for all paths at later timesteps. Often that’s fine, as the majority of paths will have a probability of essentially zero. As we will see, prefix beam search
Pb[t][l] should only capture the most probable paths yielding a given prefix.
The non-blank probability,
Pnb[t][l], is the logical counterpart of the blank probability, thus, accounting for all underlying paths ending in a non-blank token. Using the same example as above, we would now have a different set of paths (--B, -BB, and BBB), but that is also the only difference － the same logic applies. As we walk through the code, we will make clear why it is important to maintain both a blank and a non-blank probability for each prefix.
k should also be introduced.
alpha is the weight of the language model,
beta is a compensation term, and
k is the beam width determining how many prefixes will be selected at each timestep.
Remaining variables will be introduced on the go. At this point, it might be a good idea to go to the GitHub repository containing the code for this tutorial, quickly skim the
prefix_beam_search.py to get a feeling of the context of each of the steps we will go through below, and execute the
test.py to see what exactly we are aiming for here. This requires installing
kenlm for the language modeling part, but all of this is described in the repo’s
prefix-beam-search - Code for prefix beam search tutorial by @labodk
Step 1: Initialization
We start with a single empty prefix. The blank probability is set to 1 and the non-blank is set to 0. It should be pretty straightforward to see why this is appropriate; an empty prefix will by definition not be able to arise from a path going through non-blank characters. The list
A_prev holds all prefixes that we will try to extend at a given timestep and will be updated to hold the
k most probable prefixes at the end of each iteration. You can think of this as an imaginative timestep zero, where the probability of the blank token is 1 and all other characters have a probability of 0 in the
ctc matrix of emission probabilities.
Step 2: Iterations and pruning
For every timestep, we evaluate each prefix in
A_prev and try to extend it with each character in our alphabet. Well, almost. As you maybe noticed in Figure 2, most probabilities are very small numbers. If we only try to extend with characters that have a greater emission probability than a certain threshold, we can speed up the beam search a lot. We have found that setting the threshold to
0.001 doesn’t degrade performance. The if-statement right after we begin iterating over prefixes ensures that prefixes that already have an end token do not get extended any further. In theory, this works as a third contraction rule: collapse everything after the end token.
Step 3: “Extending” with a blank
In the previous step, I said that we would try to extend the prefixes in
A_prev with each of the characters in our alphabet. However, when it comes to the blank token, we actually don’t extend the prefix
l, but rather we calculate the probability of maintaining it at the next time-step. Also, notice that we don’t need to distinguish between the blank and the non-blank probability from the previous time-step in this case; adding a blank token will result in the same prefix regardless of the end-character of the prefix’s underlying path.
Step 4: Extending with the end character
By “end character” we mean the actual end character of a given prefix
l and not the end-of-sentence token >. This is where the distinction between the blank and non-blank probability becomes important. For example, if we want BOX_BUN to become BOX_BUNN, defined as
l_plus above, then it must be that the underlying set of paths for BOX_BUN all end in the blank token. If they don’t, the two consecutive N’s will be contracted. In other words, we can only build on the paths comprised by the blank probability [Step 4, l. 4]. On the other hand, if BOX_BUN is to be maintained when predicting an extra N, then we can only build on the non-blank probability [Step 4, l. 5].
Step 5: Extending with any other non-blank character and LM constraints
This is where the language model comes into the picture, but as noted earlier, it’s only applied when we extend with a space or the end-character. First, let’s look at the the more common scenario where we just want to extend the prefix with a non-blank character that is not a space or the prefix’ end-character [Step 5, l. 5]. In this case, we can again use both the blank and non-blank probability, as the extension character will not be collapsed either way. However, you may have pondered whether it is possible for the non-blank probability to be defined twice for the same prefix at the same timestep. For example, say that
A_prev contains both BO and BOX. If you first try to extend BO with the letter X, you will define a non-blank probability for the prefix BOX [Step 5, l. 5]. However, you may also try to extend BOX with X, in which case the adjacent X’s may be collapsed [Step 4, l. 5]. Thus, this will again lead you to define a non-blank probability for the prefix BOX. Should you only keep one of these two probabilities? No, you want to add the two probabilities together, since the two initial prefixes stem from two different sets of underlying paths.
When incorporating the language model constraint, we just multiply by the weighted language model probability [Step 5, l. 3]. The language model probability,
lm(l_plus.strip(' >')), should be interpreted as the probability of the last word in
l_plus conditioned on all previous words. It is usually the case that you want to downweight the language model. Setting
alpha within the range 0.2 to 0.7 is not uncommon.
Hopefully by now you have a pretty good idea about what the difference between the blank and non-blank probability means for the prefix beam search algorithm. If not, you may want to go back to step 3 and review calculations from there onwards.
Step 6: Make use of discarded prefixes
It turns out that just because a prefix has been discarded at the previous timestep, it might still be useful to us. For example, assume we have the following three prefixes: BU, BOX, and BUG. We only use a beam width of two, so we will have to discard one prefix. We find that BUG is the least likely, and so we end up with
A_prev only containing BU and BOX for the next timestep,
t. Now the prefix BU will again be extended with G, yielding the prefix BUG. Unfortunately, since we discarded BUG at
t — 1, our estimate of the prefix is less accurate than it could be. Had we not discarded BUG at
t — 1, we would have seen it add to the total probability of the prefix at
t by both being extended with the blank token [Step 3, l. 2] and with its own collapsible end-character [Step 4, l. 5]. This is exactly why we check if a new extended prefix,
l_plus, is not already in
A_prev. If it turns out that
l_plus was actually discarded at
t — 1, then we might as well try to include it by either extending it with the blank token [Step 6, l. 2] or by adding its end-character and collapsing [Step 6, l. 3]. Notice that line 2 in this step is essentially the same as line 2 in step 3, and line 3 is equivalent to line 5 in step 4.
It might seem a little counter-intuitive that we should look for prefixes not in
A_prev, because how can we be sure that the blank and non-blank probabilities have even been defined for
t — 1? Actually, we don’t know. In the majority of cases where we find that
l_plus is not in
A_prev, then the two probabilities will simply be zero as they are essentially undefined, and the if-statement in this step will have no effect.
Step 7: Select most probable prefixes
After having extended each prefix in
A_prev, we are ready to update
A_prev with the
k most probable prefixes. We do this primarily based on the sum of the blank and non-blank probability [Step 7, l. 1], but as mentioned earlier, we also need to compensate for the language model constraint. The function
W simply extracts all full words (i.e., words followed by a space or the end-character), so the length of the function output is just a word count. In order to avoid the total prefix probability becoming zero prior to appending the first space, we need to add one to the word count. Thus, we compensate for the language model constraint by multiplying the total probability with the (adjusted) word count to the power of
Unfortunately, it is not easy to set
beta through intuition. A good setting will make sure that prefixes avoiding the space character are not preferred, but we have to keep in mind that such behavior is already implicitly punished by the CTC network, which should have strong space character predictions at certain timesteps. I have found a setting in the range 4 to 14 yields good results. However, this heavily depends on the weight given to the language model － the higher the weight, the bigger the compensation. One should therefore tune these parameters on a validation set using a grid-search.
That’s all folks
And that’s how you turn BOX_BUNNI into BUGS_BUNNY. Although our simple running example is made up, it is still quite realistic. The only real discrepancy is the limited number of time steps. If you still don’t feel completely confident with what is going on here, then hopefully the code on GitHub can help you. And please leave a comment if there is something you find unclear — we’ll gladly help!