Published in Machine Growth

Speech Recognition — Connectionist Temporal Classification (CTC)

  • Step-by-step guidance through the speech recognition loss function.
  • A brief explanation of the steps behind the CTC algorithm.

Is Google Home fascinating? Does Amazon Echo impress you? Have you ever wondered what technology powers these devices? Let me guide you through it.

The first piece is automatic speech recognition (ASR), which can be trained using a neural network with a CTC loss function. This approach was published in Deep Speech 2 and showed good results. Therefore, we will go deeper into the Deep Speech 2 network architecture and the CTC loss function.
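To make the spectrogram step concrete, here is a minimal sketch using NumPy's FFT. The frame length, hop size, and the toy 1 kHz tone are illustrative assumptions of mine, not values from Deep Speech 2.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Split a 1-D waveform into overlapping frames and take the
    FFT magnitude of each frame: a basic spectrogram."""
    frames = np.stack([signal[i:i + frame_len]
                       for i in range(0, len(signal) - frame_len + 1, hop)])
    window = np.hanning(frame_len)  # taper each frame to reduce leakage
    # rows = time steps, columns = frequency bins
    return np.abs(np.fft.rfft(frames * window, axis=1))

# toy example: a pure 1 kHz tone sampled at 8 kHz for one second
t = np.arange(8000) / 8000.0
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
print(spec.shape)  # (61, 129): 61 time steps, 129 frequency bins
```

Each row of `spec` is one time slice of the picture described below; real ASR systems usually apply a log or mel-scale transform on top, but the idea is the same.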

For the neural network architecture, we take in speech and represent the input as a spectrogram. A spectrogram is a visual representation of speech frequencies over time. With this visual representation, we can treat speech recognition like an image classification problem by splitting the spectrogram into segments. Then a few layers of RNN learn the relationship between the segments and classify each segment into a label. For example, when we say “pool” over 6 time steps, we can split it into 6 segments [wav1, wav2, wav3, wav4, wav5, wav6]. The trained model will map these spectrogram segments to the labels [p,o,-,o,l,l]. We introduce the blank label “-” to distinguish consecutive identical labels: without it, [p,o,o,o,l,l] could represent either “pool” or “pol”. With the blank label, [p,o,o,o,l,l] can only represent “pol”, while [p,o,-,o,l,l] represents “pool”.
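The collapse rule just described (merge repeated symbols, then drop the blank) can be sketched as a tiny Python helper:

```python
def ctc_collapse(labels, blank="-"):
    """Collapse a frame-level CTC labelling: first merge consecutive
    repeats, then drop the blank symbol."""
    out = []
    prev = None
    for sym in labels:
        if sym != prev and sym != blank:  # keep only new, non-blank symbols
            out.append(sym)
        prev = sym
    return "".join(out)

print(ctc_collapse(["p", "o", "-", "o", "l", "l"]))  # -> "pool"
print(ctc_collapse(["p", "o", "o", "o", "l", "l"]))  # -> "pol"
```

The order matters: merging repeats before removing blanks is what lets “-” separate the two “o” sounds in “pool”.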

Sounds simple, right? Given the input [wav1, wav2, wav3, wav4, wav5, wav6] and the label [p,o,-,o,l,l], we could train a speech recognition model. However, a person can say “pool” with a long “l” ending [p,o,-,o,l,l], speak very fast [p,o,-,o,l,-], or draw out the “p” sound at the beginning [p,p,o,-,o,l]. There are many possible alignments for each word, so it is not feasible to hand-label [wav1, wav2, wav3, wav4, wav5, wav6] for every speaker and every speaking speed. To overcome this problem, we simply label [wav1, wav2, wav3, wav4, wav5, wav6] as “pool” and use the Connectionist Temporal Classification (CTC) algorithm.

Figure 1 : Possible alignments Generation

First, we create all possible alignments for the label given the input spectrogram. In Figure 1, we have generated 3 possible alignments for the label “pool”. Based on these alignments, we count the occurrences of each symbol in [p,o,l,-] at each timestamp. At time 1 (t1), we have 3 “p”. At time 2 (t2), we have 1 “p” and 2 “o”. We repeat the same counting for t3, t4, t5 and t6 to generate the table above.
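For a toy example this small, the alignment-generation step can be brute-forced: enumerate every length-6 path over the alphabet and keep those that collapse to “pool”. Note that the exhaustive search finds 9 valid alignments, of which Figure 1 samples 3; the collapse rule is the one described earlier.

```python
from itertools import product

def collapse(path, blank="-"):
    """Merge consecutive repeats, then drop blanks (the CTC collapse rule)."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

target = "pool"
alphabet = ["p", "o", "l", "-"]
T = 6

# keep every length-6 path whose collapsed form is "pool"
alignments = [p for p in product(alphabet, repeat=T) if collapse(p) == target]

# count how often each symbol appears at each timestep, as in Figure 1's table
counts = [{s: sum(1 for p in alignments if p[t] == s) for s in alphabet}
          for t in range(T)]

print(len(alignments))  # 9
```

Of the 9 alignments, 8 start with “p” and 1 starts with the blank, so a full table would have slightly different counts from the 3-alignment sample in Figure 1.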

Figure 2 : Probabilities generation for each alignment

Next, we would like to compute the conditional probability of each possible alignment. For alignment 1, we have:

p(y, alignment1 | X) = p(y=p|wav1) * p(y=o|wav2) * p(y=o|wav3) * p(y=-|wav4) * p(y=o|wav5) * p(y=l|wav6), where

at t1 : p(y=p|wav1) = 3/3
at t2 : p(y=o|wav2) = 2/3
at t3 : p(y=o|wav3) = 2/3
at t4 : p(y=-|wav4) = 2/3
at t5 : p(y=o|wav5) = 3/3
at t6 : p(y=l|wav6) = 3/3

so p(y, alignment1 | X) = 1 * 2/3 * 2/3 * 2/3 * 1 * 1 = 8/27 ≈ 0.296296.

Using the same method, we get p(y, alignment2 | X) = 2/27 ≈ 0.074074 and p(y, alignment3 | X) = 4/27 ≈ 0.148148.

After that, we eliminate the alignment information to get p(y | X). This process is called marginalisation, and it is simple: we sum over all possible alignments, p(y | X) = p(y, alignment1 | X) + p(y, alignment2 | X) + p(y, alignment3 | X) ≈ 0.5185.
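The last two steps can be reproduced in a few lines. The per-timestep probabilities below are read off the counting table; the three alignments themselves are not spelled out in the article, so I have inferred them from the stated probabilities (0.296, 0.074, 0.148) and they should be treated as an illustrative assumption.

```python
from fractions import Fraction as F

# per-timestep symbol probabilities from the counting table
# (symbol count at each timestep, divided by the 3 sample alignments)
probs = [
    {"p": F(3, 3)},                 # t1
    {"o": F(2, 3), "p": F(1, 3)},   # t2
    {"o": F(2, 3), "-": F(1, 3)},   # t3
    {"-": F(2, 3), "o": F(1, 3)},   # t4
    {"o": F(3, 3)},                 # t5
    {"l": F(3, 3)},                 # t6
]

# three sample alignments, inferred from the article's numbers (assumption)
alignments = [
    ["p", "o", "o", "-", "o", "l"],  # alignment 1 -> 8/27
    ["p", "o", "-", "o", "o", "l"],  # alignment 2 -> 2/27
    ["p", "p", "o", "-", "o", "l"],  # alignment 3 -> 4/27
]

def path_prob(path):
    """Product of per-timestep probabilities along one alignment."""
    prob = F(1)
    for t, sym in enumerate(path):
        prob *= probs[t][sym]
    return prob

per_alignment = [path_prob(a) for a in alignments]
p_y_given_X = sum(per_alignment)          # marginalise out the alignment
print([float(p) for p in per_alignment])  # ≈ [0.296, 0.074, 0.148]
print(float(p_y_given_X))                 # ≈ 0.5185
```

Exact fractions make it easy to check the marginal: 8/27 + 2/27 + 4/27 = 14/27.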

So now we know that CTC is simply the process of finding p(y | X) over all possible alignments, where y is the label “pool” and X is the split spectrogram (wav1, wav2, wav3, wav4, wav5, wav6). At the end of the day, the speech recognition model is just a model that knows how to assign the best label (the optimal path) to all kinds of variation in the input spectrogram.

CTC loss function

Up to this point, we have a working understanding of CTC, thanks to the paper “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin” published by Baidu Research’s Silicon Valley AI Lab.

There are a few other interesting papers on speech recognition, such as Segmental RNN and Exploring Neural Transducers for End-to-End Speech Recognition. In segmental RNNs, alignments are generated using durations, which eliminates the need for the additional blank label “-”. Neural transducers use a language model and word counts to make predictions. These papers are interesting, but they require a fundamental understanding of CTC, so feel free to revisit this article whenever you have doubts. You may also try to prove the marginalisation process yourself; it is quite simple, and I think it is the gist of CTC. It has been used widely in NLP too. If you need help with the proof, just comment below and I will find some time to update this article. I hope this article helps you understand CTC and opens a door for you into the world of speech recognition!




Angel is hiding in the details
