Speech Recognition — Connectionist Temporal Classification (CTC)
- Step-by-step guidance for the speech recognition loss function.
- A brief explanation of the steps behind the CTC algorithm.
Is Google Home fascinating? Does Amazon Echo impress you? Have you ever wondered what technology powers these devices? Let me guide you through it.
The core technology is automatic speech recognition (ASR), which can be trained using a neural network with the CTC loss function. This approach was published in Deep Speech 2 and showed good results, so we will dig deeper into the Deep Speech 2 neural network architecture and the CTC loss function.
For the neural network architecture, we take in speech and represent the input as a spectrogram. A spectrogram is a visual representation of the speech frequencies over time. Once we have a visual representation, we can treat speech recognition like an image classification problem by splitting the spectrogram into segments. A few layers of RNN then learn the relationship between the segments and classify each segment into a label. For example, when we speak "pool" over 6 time steps, we can split it into 6 segments [wav1, wav2, wav3, wav4, wav5, wav6]. The trained model will predict a label for each segment, e.g. [p,o,-,o,l,l]. We introduce the blank label "-" here to distinguish consecutive identical characters. If we did not introduce the "-" label, [p,o,o,o,l,l] could represent either "pool" or "pol". With the "-" label, [p,o,o,o,l,l] can only represent "pol", and [p,o,-,o,l,l] represents "pool".
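To make the collapse rule concrete, here is a minimal Python sketch (ctc_collapse is just an illustrative name, not a library function): it merges consecutive repeated labels and then drops the blank "-".

```python
# A minimal sketch of the CTC collapse rule (not the actual Deep Speech 2 code):
# merge consecutive repeated labels, then drop the blank "-" label.
def ctc_collapse(alignment, blank="-"):
    output = []
    previous = None
    for label in alignment:
        # keep a label only when it differs from the previous frame's label
        if label != previous:
            output.append(label)
        previous = label
    # remove blanks after merging repeats
    return "".join(label for label in output if label != blank)

print(ctc_collapse(["p", "o", "-", "o", "l", "l"]))  # -> "pool"
print(ctc_collapse(["p", "o", "o", "o", "l", "l"]))  # -> "pol"
```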
Sounds simple, right? When we have the input [wav1, wav2, wav3, wav4, wav5, wav6] and the label [p,o,-,o,l,l], we can train a speech recognition model. However, a person can speak "pool" with a long ending "l" sound [p,o,-,o,l,l], speak very fast [p,o,-,o,l,-], or stretch the "p" sound at the beginning [p,p,o,-,o,l]. There are many possible combinations for each word, so it is not feasible to hand-label [wav1, wav2, wav3, wav4, wav5, wav6] for every speaker and every speaking speed. To overcome this problem, we simply label [wav1, wav2, wav3, wav4, wav5, wav6] as "pool" and use the Connectionist Temporal Classification (CTC) algorithm.
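For intuition, the brute-force sketch below enumerates every length-6 labelling over {p, o, l, -} and keeps those that collapse to "pool". Real CTC never enumerates alignments like this; it sums over them with dynamic programming, but the enumeration shows how many alignments map to the same word.

```python
# A brute-force sketch (for intuition only): list every length-6 frame labelling
# over the alphabet {p, o, l, -} and keep those that collapse to "pool".
from itertools import product

alphabet = ["p", "o", "l", "-"]
valid_alignments = [
    frames for frames in product(alphabet, repeat=6)
    if ctc_collapse(frames) == "pool"   # reuses the collapse sketch above
]
print(len(valid_alignments))            # number of alignments that spell "pool"
print(valid_alignments[:3])             # a few examples
```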

First, we create all possible alignments for the label given the input spectrogram. In figure 1, we have generated 3 possible alignments for the label "pool". Based on these alignments, we count how many times each label [p,o,l,-] appears at each timestamp. At time 1, t1, we have 3 "p". At time 2, t2, we have 1 "p" and 2 "o". We perform the same counting for t3, t4, t5 and t6 to generate the counting table in figure 1.
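Figure 1 itself is not reproduced in the text, so the sketch below assumes three alignments, [p,o,o,-,o,l], [p,o,-,o,o,l] and [p,p,o,-,o,l], chosen to be consistent with the counts quoted above, and builds the per-timestep counting table from them.

```python
# A small sketch of the per-timestep counting step. The three alignments below are
# an assumption consistent with the counts quoted in the article (figure 1 is not
# reproduced here): 3 "p" at t1, 1 "p" and 2 "o" at t2, and so on.
from collections import Counter

alignments = [
    ["p", "o", "o", "-", "o", "l"],  # alignment 1
    ["p", "o", "-", "o", "o", "l"],  # alignment 2
    ["p", "p", "o", "-", "o", "l"],  # alignment 3
]

# count how often each label appears at each timestep across the alignments
counts_per_timestep = [Counter(labels) for labels in zip(*alignments)]
for t, counts in enumerate(counts_per_timestep, start=1):
    print(f"t{t}: {dict(counts)}")
# t1: {'p': 3}
# t2: {'o': 2, 'p': 1}
# t3: {'o': 2, '-': 1}
# t4: {'-': 2, 'o': 1}
# t5: {'o': 3}
# t6: {'l': 3}
```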

Next, we would like to get the conditional probabilities for each possible alignment. For alignment 1, we have:
p(y, alignment1 | X) = p(y=p|wav1) * p(y=o|wav2) * p(y=o|wav3) * p(y=-|wav4) * p(y=o|wav5) * p(y=l|wav6) = 0.296296, where:
at t1 : p(y=p|wav1) = 3/3
at t2 : p(y=o|wav2) = 2/3
at t3 : p(y=o|wav3) = 2/3
at t4 : p(y=-|wav4) = 2/3
at t5 : p(y=o|wav5) = 3/3
at t6 : p(y=l|wav6) = 3/3
Using the same method, we get p(y, alignment2 | X) = 0.074074 and p(y, alignment3 | X) = 0.148148.
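Continuing the counting sketch above, here is a small illustration (not the real CTC implementation) of how each alignment's probability comes out as the product of the per-timestep fractions:

```python
# Each frame's probability is (count of that label at t) / (number of alignments),
# following the counting table built above.
def alignment_probability(alignment, counts_per_timestep, total=3):
    probability = 1.0
    for t, label in enumerate(alignment):
        probability *= counts_per_timestep[t][label] / total
    return probability

for k, alignment in enumerate(alignments, start=1):
    p = alignment_probability(alignment, counts_per_timestep)
    print(f"p(y, alignment{k} | X) = {p:.6f}")
# p(y, alignment1 | X) = 0.296296
# p(y, alignment2 | X) = 0.074074
# p(y, alignment3 | X) = 0.148148
```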
After that, we would like to eliminate the alignment information to get only p(y | X). The term for this process is called marginalisation. This step is simple: we just sum over all possible alignments, p(y | X) = p(y, alignment1 | X) + p(y, alignment2 | X) + p(y, alignment3 | X).
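As a tiny follow-up to the sketch above, the marginalisation is just the sum of those three numbers:

```python
# Marginalise out the alignment: sum the joint probabilities over all alignments.
p_y_given_x = sum(alignment_probability(a, counts_per_timestep) for a in alignments)
print(p_y_given_x)  # 0.296296 + 0.074074 + 0.148148 ≈ 0.518519
```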
So now we know that the CTC process is simply the process of finding p(y | X) over all possible alignments, where y is the label "pool" and X is the split spectrogram (wav1, wav2, wav3, wav4, wav5, wav6). At the end of the day, the speech recognition model is just a model that learns to assign the best label (optimised path) to all kinds of variation in the input spectrogram.
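In practice you rarely compute this by hand; deep learning frameworks ship a CTC loss that performs the marginalisation with dynamic programming. Below is a minimal sketch using PyTorch's torch.nn.CTCLoss; the vocabulary mapping and tensor shapes are illustrative assumptions, not the actual Deep Speech 2 setup.

```python
# A minimal, illustrative use of PyTorch's CTC loss. The RNN outputs per-timestep
# log-probabilities; CTCLoss marginalises over all alignments internally.
import torch
import torch.nn as nn

T, N, C = 6, 1, 4            # 6 timesteps, batch of 1, 4 classes: [-, p, o, l]
S = 4                        # target length for "pool"

# per-timestep log-probabilities, e.g. the RNN output after log_softmax
log_probs = torch.randn(T, N, C).log_softmax(dim=2)

# target "pool" encoded with an illustrative mapping p=1, o=2, l=3 (blank=0)
targets = torch.tensor([[1, 2, 2, 3]])
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([S])

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())           # -log p(y | X), marginalised over all alignments
```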

By now, we have a working understanding of CTC, thanks to the paper "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin" published by Baidu Research, Silicon Valley AI Lab. There are a few other interesting papers related to speech recognition, such as Segmental RNN and "Exploring Neural Transducers for End-to-End Speech Recognition". In segmental RNNs, alignments are generated using durations, which eliminates the need for the additional "-" label; neural transducers use a language model and word counts to make predictions. These papers are interesting but require a fundamental understanding of CTC, so feel free to revisit this article whenever you have doubts about CTC. You may also try to prove the marginalisation process yourself; it is quite simple. I think marginalisation is the gist of CTC, and it is widely used in NLP too. If you need help with the proof, just comment below and I will find some time to update this article. Otherwise, I hope this article helps you understand CTC and opens a door for you into the speech recognition world!