Connectionist Temporal Classification

陳明佐
我就問一句,怎麼寫?
Apr 3, 2019

Also known as CTC, it is a popular method used in speech recognition, handwriting recognition, and even hand-gesture recognition.

Tasks that rely on context information benefit from CTC. As the amount of data increases, pre-processing the labels becomes harder, because the alignment between the data and its own label is tricky.

The proposal of CTC laid the foundation for speech recognition with deep neural networks.

CTC algorithm concept

Literally, it is used to deal with the classification of temporal data. To train the acoustic model in traditional speech recognition, we need to know the label corresponding to each piece of input data, and the two must be aligned with each other. That is time-consuming work, because we usually do not know which word or phoneme belongs to which frame of data, and it is not easy to write rules that constrain the alignment.

CTC was born to solve exactly this tricky problem. The processing is not limited by any alignment rules; in other words, you can use any model that emits per-frame label probabilities and train it with CTC.

The signal of "Ni Hao" (你好) in Chinese

As the figure above shows, the traditional method must know which phoneme every single frame corresponds to, e.g. frames 1–4 correspond to "n" and frames 5–7 correspond to "i", and so on (assuming every letter is one phoneme).

Compared with the traditional approach, an acoustic model that uses CTC as its loss function is trained "end-to-end": no pre-processing is needed to do the alignment. It only needs an input sequence and an output sequence to start training. Since CTC only concerns itself with the input sequence and the output sequence as wholes, it cares about whether the output sequence is similar to (or the same as) the ground truth, rather than whether each prediction in the output sequence is exactly aligned with the input sequence at a given point in time.
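As a rough sketch of this end-to-end setup: any model that emits per-frame label probabilities can be trained with a CTC loss on nothing but (input sequence, output sequence) pairs. The post does not name a framework, so PyTorch's torch.nn.CTCLoss is assumed below, and all shapes and values are invented for illustration:

```python
import torch
import torch.nn as nn

# Illustration shapes (made up): T frames, batch size N, C classes including the blank
T, N, C = 50, 4, 20

# Per-frame log-probabilities, as any acoustic model could produce them
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)

# Target label sequences: just the labels, with no alignment to frames at all
targets = torch.randint(low=1, high=C, size=(N, 10), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

# CTC sums over all possible alignments internally; class 0 is the blank
criterion = nn.CTCLoss(blank=0)
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # end-to-end training, no frame-level alignment needed
```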

CTC introduces a "blank" mechanism, which marks frames that carry no prediction. Each predicted label corresponds to one "spike" in the whole utterance, and the remaining parts are viewed as "blank". The final CTC output is thus a sequence of spikes, not the duration of each phoneme.

Blank

At the beginning there was no blank in the CTC formulation, but that raises another problem. Because CTC only predicts the order of the labels and collapses repeats, consecutively repeated letters would be deleted; e.g. "hello" would become "helo". Hence, the "blank" mechanism is required.

Example

To explain CTC simply, let us take the easiest possible example.

"to" could come from the paths "===tttttoooo", "=to==", or "to" (using "=" for the blank);
"too" could come from "tt=o=oo=", "=t=o=o", or "to=o", but not from "too", because the repeated "o" would be collapsed into a single one. A decoding sketch follows below.
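This collapse rule is easy to express in Python. A minimal sketch (the function name and the "=" blank symbol are illustrative, not from any particular library):

```python
def ctc_collapse(path: str, blank: str = "=") -> str:
    """Collapse a CTC path: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return "".join(out)

print(ctc_collapse("===tttttoooo"))  # -> "to"
print(ctc_collapse("tt=o=oo="))      # -> "too": the blank keeps the two o's apart
print(ctc_collapse("too"))           # -> "to": without a blank, "oo" collapses
```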

Figure from https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c

Take another example, shown in the figure above:
consider a time sequence of length 2, t0 and t1, with labels "a", "b", and the blank "-".
The lighter the color, the higher the probability.
The probability that the output is "-" is 0.6 × 0.6 = 0.36.
The probability that the output is "a" is 0.4 × 0.4 + 0.4 × 0.6 + 0.6 × 0.4 = 0.64,
since the paths "aa", "a-", and "-a" all collapse to "a".
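The arithmetic can be verified by brute force: enumerate every path, collapse it, and sum the probabilities per labelling. A small sketch (the 0.4/0.6 probabilities are read off the figure; everything else is illustrative):

```python
import itertools
from collections import defaultdict

# Per-frame probabilities at t0 and t1, read off the figure:
# P("a") = 0.4 and P("-") = 0.6 at both time steps ("b" is left out for brevity)
probs = [{"a": 0.4, "-": 0.6}, {"a": 0.4, "-": 0.6}]

def collapse(path):
    """Merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != "-":
            out.append(s)
        prev = s
    return "".join(out) or "-"  # show the empty labelling as "-"

totals = defaultdict(float)
for path in itertools.product("a-", repeat=2):  # aa, a-, -a, --
    p = probs[0][path[0]] * probs[1][path[1]]
    totals[collapse(path)] += p

print(dict(totals))  # {'a': 0.64, '-': 0.36} (up to float rounding)
```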

This means that decoding a CTC output amounts to finding the most probable labelling, where a labelling's probability is the sum over all paths that collapse to it. Note that this is not always the single most probable path: here "--" is the best individual path (0.36), yet "a" is the more probable labelling (0.64).

Output matrix of the NN. The dashed line marks the best path
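Best-path (greedy) decoding, as marked by the dashed line, can be sketched in a few lines (the matrix values are invented to echo the example above):

```python
import numpy as np
from itertools import groupby

# Invented stand-in for the NN output matrix: rows = t0..t1, columns = ("a", "b", "-")
matrix = np.array([
    [0.4, 0.0, 0.6],
    [0.4, 0.0, 0.6],
])
labels = "ab-"

# Best path: pick the single most probable label at each time step
best_path = "".join(labels[i] for i in matrix.argmax(axis=1))  # "--"

# Collapse: merge repeats, then drop blanks
decoded = "".join(k for k, _ in groupby(best_path) if k != "-")
print(decoded or "-")  # "-": the best path misses the better labelling "a" (0.64)
```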

Reference

  1. https://zhuanlan.zhihu.com/p/36488476
  2. http://www.phreedream.com/2018/03/22/ctc_algorithem.html#ctcconnectionist-temporal-classi%EF%AC%81cation
  3. https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c
