We tested the effectiveness of three neural network architectures commonly used in image recognition for automatic speech recognition. These architectures — Residual Networks, Highway Networks, and Densely Connected Networks — all use nontrivial connections, also known as skip connections. Skip connections allow networks with a very large number of layers to be trained without suffering from the vanishing gradient problem.
Read the Full Paper
Before skip-connectivity was introduced, shallow neural networks outperformed deeper models. When input is propagated forward through the network layers, information can easily be lost after only a relatively small number of layers: each layer introduces noise, and at a certain point the noise overshadows important features of the original input. This is a problem, because deeper networks are able to learn increasingly complex patterns, which could result in a better model.
As illustrated in figure 1, networks with skip-connections address this problem by adding or concatenating the output of earlier layers to that of later layers, making early-stage information accessible throughout the network. In image classification tasks, neural networks using skip-connections have led to state-of-the-art results, but benchmarks on speech recognition models are limited.
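To make the difference between the three connection types concrete, here is a minimal sketch using toy vector-valued layers in NumPy (the paper's actual models are convolutional; the layer shapes, gate bias, and function names here are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w):
    """A toy 'layer': linear map followed by ReLU."""
    return np.maximum(w @ x, 0.0)

def residual_block(x, w):
    # Residual connection (ResNet): add the layer output to its input.
    return x + layer(x, w)

def highway_block(x, w, w_gate, b_gate):
    # Highway connection: a learned sigmoid gate blends the layer
    # output with the unchanged input.
    t = 1.0 / (1.0 + np.exp(-(w_gate @ x + b_gate)))  # transform gate
    return t * layer(x, w) + (1.0 - t) * x

def dense_block(x, w):
    # Dense connection (DenseNet): concatenate input and layer output,
    # so every later layer sees all earlier features directly.
    return np.concatenate([x, layer(x, w)])

d = 4
x = rng.standard_normal(d)
w = rng.standard_normal((d, d))
w_gate = rng.standard_normal((d, d))
b_gate = np.full(d, -1.0)  # bias the gate toward carrying the input through

print(residual_block(x, w).shape)                 # (4,) — same size as input
print(highway_block(x, w, w_gate, b_gate).shape)  # (4,) — same size as input
print(dense_block(x, w).shape)                    # (8,) — feature dim grows
```

Note the key structural difference: residual and highway blocks preserve the feature dimension (they add or blend), while dense blocks grow it with each concatenation, which is why DenseNets pass all earlier feature maps forward explicitly.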
Approach and contribution
We decided to benchmark three architectures that are known to perform well on image tasks after modifying them for automatic speech recognition. Starting from a purely convolutional architecture, we adapt Residual Networks, Highway Networks, and Densely Connected Networks to an automatic speech recognition task. We train and evaluate the proposed architectures on a standard dataset and compare them to the convolutional baseline model (figure 2).
The results are encouraging: we show that skip-connections can be successfully used for automatic speech recognition. In particular, we find that densely connected networks outperform the other proposed architectures and yield significant improvements on the transcription task.