Modeling Sequences with CTC (Part 2)

A Visual Guide to Connectionist Temporal Classification (CTC)

Kushagra Bhatnagar
Aug 24, 2018

In the last part, I talked about how CTC can be used to map speech input to its corresponding transcript, described the CTC model and algorithm, and showed how it computes the probability of an output sequence given an input sequence. I ended with the CTC score; here I'll discuss how to compute it.

As mentioned in Part 1, there are two cases to compute α(s,t):

Case 1:

Here ϵ is between repeating characters

In this case, we can’t jump over Z[s-1], the previous token in Z. The first reason is that the previous token can be an element of Y, and we can’t skip elements of Y. Since every element of Y in Z is followed by an ϵ, we can identify this case when Z[s] = ϵ. The second reason is that we must have an ϵ between repeated characters in Y. We can identify this case when Z[s] = Z[s−2].

To ensure we don’t skip Z[s-1], we must either be at it at the previous time-step or have already passed through it at an earlier time-step. As a result, there are two positions we can transition from.
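Writing p_t(z_s ∣ X) for the probability the network assigns to token z_s at time-step t, this case gives the recursion:

$$\alpha(s, t) \;=\; \bigl(\alpha(s-1,\, t-1) + \alpha(s,\, t-1)\bigr)\cdot p_t\!\left(z_s \mid X\right)$$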

Case 2:

Here ϵ is between non-repeating characters

In the second case, we’re allowed to skip the previous token in Z. We have this case whenever Z[s-1]​ is an ϵ between unique characters. As a result there are three positions we could have come from at the previous step.
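Since Z[s-1] may now be skipped, the recursion gains a third term:

$$\alpha(s, t) \;=\; \bigl(\alpha(s-2,\, t-1) + \alpha(s-1,\, t-1) + \alpha(s,\, t-1)\bigr)\cdot p_t\!\left(z_s \mid X\right)$$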

Let’s take a more detailed look at how the CTC score is computed with this dynamic programming algorithm.

Every valid alignment has a path in this graph

There are two valid starting nodes and two valid final nodes since the ϵ at the beginning and end of the sequence are optional. The complete probability p(Y ∣ X) is the sum of the forward scores of the two final nodes.
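To make the recursion concrete, here is a minimal NumPy sketch of this forward pass in log-space. It assumes the network outputs are already log-probabilities of shape (T, vocabulary size) with the blank/ϵ token at index 0; the function name and interface are just for illustration.

```python
import numpy as np

def ctc_forward_score(log_probs, target, blank=0):
    """Sketch of the CTC forward (alpha) recursion.

    log_probs: (T, V) array of per-time-step log-probabilities over the
               vocabulary (blank included) -- i.e. the network outputs.
    target:    list of label indices for Y, without blanks.
    Returns log p(Y | X).
    """
    T = log_probs.shape[0]
    # Build Z by inserting a blank (epsilon) between labels and at both ends.
    z = [blank]
    for c in target:
        z += [c, blank]
    S = len(z)

    # alpha[s] holds log alpha(s, t) for the current time-step t.
    alpha = np.full(S, -np.inf)
    # Two valid starting nodes: the initial blank or the first label.
    alpha[0] = log_probs[0, z[0]]
    if S > 1:
        alpha[1] = log_probs[0, z[1]]

    for t in range(1, T):
        prev = alpha.copy()
        for s in range(S):
            # Case 1: stay at s or come from s-1.
            candidates = [prev[s]]
            if s - 1 >= 0:
                candidates.append(prev[s - 1])
            # Case 2: also allowed to skip s-1 when z[s] is not a blank
            # and is not a repeat of z[s-2].
            if s - 2 >= 0 and z[s] != blank and z[s] != z[s - 2]:
                candidates.append(prev[s - 2])
            alpha[s] = np.logaddexp.reduce(candidates) + log_probs[t, z[s]]

    # Two valid final nodes: the last label or the trailing blank.
    return np.logaddexp(alpha[S - 1], alpha[S - 2]) if S > 1 else alpha[0]
```

Negating the returned log-probability gives the per-example CTC loss.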

Now that we can efficiently compute the loss function, the next step is to compute a gradient and train the model. The CTC loss function is differentiable with respect to the per time-step output probabilities since it’s just sums and products of them. Given this, we can analytically compute the gradient of the loss function with respect to the non-normalized output probabilities and from there run backpropagation as usual.
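In practice, most deep learning frameworks already ship a differentiable CTC loss, so the gradient rarely has to be derived by hand. As a rough sketch (shapes and values here are made up), PyTorch’s nn.CTCLoss can be used like this:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 50 time-steps, batch of 4, vocabulary of 20 tokens
# (index 0 reserved for the blank / epsilon token).
T, N, V = 50, 4, 20
log_probs = torch.randn(T, N, V).log_softmax(2).detach().requires_grad_()

targets = torch.randint(1, V, (N, 10), dtype=torch.long)  # labels, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back to the network outputs as usual
```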

For a training set D, the model’s parameters are tuned to minimize the negative log-likelihood instead of maximizing the likelihood directly.
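Concretely, for input/output pairs (X, Y) in D, this objective is:

$$\sum_{(X,\,Y)\,\in\,\mathcal{D}} -\log\, p(Y \mid X)$$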

Inference

We have learned how to train the model; now we’d like to use it to find a likely output for a given input. More precisely, we need to solve:

$$Y^{*} \;=\; \underset{Y}{\mathrm{argmax}}\; p(Y \mid X)$$

One heuristic is to take the most likely output at each time-step. This gives us the alignment with the highest probability:

$$A^{*} \;=\; \underset{A}{\mathrm{argmax}}\; \prod_{t=1}^{T} p_t(a_t \mid X)$$

We can then collapse repeats and remove ϵ tokens to get Y.
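As a minimal sketch (assuming the same (T, V) matrix of per-time-step probabilities and blank index 0 as before), this greedy decoding looks like:

```python
import numpy as np

def greedy_decode(probs, blank=0):
    """Best-path decoding: pick the argmax token at every time-step,
    then collapse repeats and drop blanks (epsilon)."""
    best_path = np.argmax(probs, axis=1)           # most likely token per step
    collapsed = []
    prev = None
    for token in best_path:
        if token != prev:                          # collapse repeats
            collapsed.append(token)
        prev = token
    return [int(t) for t in collapsed if t != blank]   # remove epsilon
```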

For many applications this heuristic works well, especially when most of the probability mass is allotted to a single alignment. However, it can miss easy-to-find outputs with much higher probability, because it doesn’t take into account the fact that a single output can have many alignments.

Here’s an example. Assume the alignments [a, a, ϵ] and [a, a, a] individually have lower probability than [b, b, b]. But the sum of their probabilities is actually greater than that of [b, b, b]. The naive heuristic will incorrectly propose Y= [b] as the most likely hypothesis. It should have chosen Y= [a]. To fix this, the algorithm needs to account for the fact that [a, a, a] and [a, a, ϵ] collapse to the same output.
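To put made-up numbers on this: suppose p([a, a, ϵ]) = 0.2, p([a, a, a]) = 0.25 and p([b, b, b]) = 0.3. The heuristic commits to [b, b, b] and outputs Y = [b], but summing just those two alignments for [a] already gives 0.2 + 0.25 = 0.45 > 0.3 for Y = [a].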

We can use a modified beam search to solve this. Given limited computation, the modified beam search won’t necessarily find the most likely Y. It does, at least, have the nice property that we can trade off more computation (a larger beam size) for an asymptotically better solution.

A regular beam search computes a new set of hypotheses at each input step. The new set of hypotheses is generated from the previous set by extending each hypothesis with all possible output characters and keeping only the top candidates.

We can modify the vanilla beam search to handle multiple alignments mapping to the same output. In this case instead of keeping a list of alignments in the beam, we store the output prefixes after collapsing repeats and removing ϵ characters. At each step of the search we accumulate scores for a given prefix based on all the alignments which map to it.

A proposed extension can map to two output prefixes if the character is a repeat. For example, when ‘a’ is proposed as an extension to the prefix [a], both [a] and [a, a] are valid outputs for this proposed extension.

When we extend [a] to produce [a, a], we only want to include the part of the previous score for alignments which end in ϵ. Remember, the ϵ is required between repeat characters. Similarly, when we don’t extend the prefix and produce [a], we should only include the part of the previous score for alignments which don’t end in ϵ.

Given this, we have to keep track of two probabilities for each prefix in the beam: the probability of all alignments which end in ϵ, and the probability of all alignments which don’t. When we rank the hypotheses at each step before pruning the beam, we’ll use their combined scores.

The implementation of this algorithm doesn’t require much code, but it is dense and tricky to get right.
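Here is a minimal sketch of that prefix beam search (no language model, plain probabilities rather than log-space, and without some of the bookkeeping a production decoder needs; the function name and interface are just for illustration):

```python
import numpy as np
from collections import defaultdict

def prefix_beam_search(probs, beam_size=10, blank=0):
    """Simplified CTC prefix beam search.

    probs: (T, V) matrix of per-time-step probabilities (blank at index 0).
    Returns the most probable collapsed output as a tuple of token indices.

    For every prefix we track two scores:
      p_b  -- total probability of alignments ending in blank (epsilon)
      p_nb -- total probability of alignments ending in a non-blank
    """
    T, V = probs.shape
    # beam maps prefix -> (p_b, p_nb); start with the empty prefix.
    beam = {(): (1.0, 0.0)}

    for t in range(T):
        next_beam = defaultdict(lambda: (0.0, 0.0))

        for prefix, (p_b, p_nb) in beam.items():
            for c in range(V):
                p = probs[t, c]

                if c == blank:
                    # Extending with blank never changes the prefix.
                    nb_b, nb_nb = next_beam[prefix]
                    next_beam[prefix] = (nb_b + (p_b + p_nb) * p, nb_nb)
                    continue

                new_prefix = prefix + (c,)
                nb_b, nb_nb = next_beam[new_prefix]
                if prefix and c == prefix[-1]:
                    # Repeat character: only alignments ending in blank
                    # may extend the prefix...
                    next_beam[new_prefix] = (nb_b, nb_nb + p_b * p)
                    # ...alignments not ending in blank keep the old prefix.
                    kb_b, kb_nb = next_beam[prefix]
                    next_beam[prefix] = (kb_b, kb_nb + p_nb * p)
                else:
                    # Non-repeat: both kinds of alignments extend the prefix.
                    # (A language model score for c given the prefix would be
                    #  multiplied in here.)
                    next_beam[new_prefix] = (nb_b, nb_nb + (p_b + p_nb) * p)

        # Prune: keep only the top `beam_size` prefixes by combined score.
        beam = dict(sorted(next_beam.items(),
                           key=lambda kv: kv[1][0] + kv[1][1],
                           reverse=True)[:beam_size])

    best = max(beam.items(), key=lambda kv: kv[1][0] + kv[1][1])
    return best[0]
```

A full implementation would typically work in log-space and include extra bookkeeping (for example, recovering mass for extended prefixes that were pruned from the previous beam), which is part of what makes it tricky.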

In some problems, such as speech recognition, incorporating a language model over the outputs significantly improves accuracy. We can include the language model as a factor in the inference problem.
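With p(Y) the language model probability and L(Y), α and β as defined below, the combined inference objective takes the form:

$$Y^{*} \;=\; \underset{Y}{\mathrm{argmax}}\; p(Y \mid X)\cdot p(Y)^{\alpha}\cdot L(Y)^{\beta}$$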

The function L(Y) computes the length of Y in terms of the language model tokens and acts as a word insertion bonus. With a word-based language model L(Y) counts the number of words in Y. If we use a character-based language model then L(Y) counts the number of characters in Y. The language model scores are only included when a prefix is extended by a character (or word) and not at every step of the algorithm. This causes the search to favor shorter prefixes, as measured by L(Y), since they don’t include as many language model updates. The word insertion bonus helps with this. The parameters α and β are usually set by cross-validation.

The language model scores and word insertion term can be included in the beam search. Whenever we propose to extend a prefix by a character, we can include the language model score for the new character given the prefix so far.
