Dense LSTMs for Speech Recognition

In this post, we introduce a new neural network architecture for speech recognition: the densely connected LSTM (or dense LSTM). At Capio, we have recently achieved state-of-the-art performance on the industry-standard NIST 2000 Hub5 English evaluation data [1], which is a significant jump from our first work in the domain [5]. A combination of multiple systems, including a few that benefit from dense LSTM acoustic models [1], enabled us to achieve these results. Let’s start with what motivated this work.

Gradient vanishing is a phenomenon in which error signals shrink toward zero during back propagation as they travel through the many stacked layers of a deep neural network. This prevents deep neural networks from being trained properly. Deep residual learning [2] was proposed to mitigate gradient vanishing; it exploits skip connections between neural layers, as depicted in Fig. 1 below.

Fig. 1. Skip connection in deep residual learning [2].

The skip connection adds the original input to the processed output of a neural layer. This helps alleviate gradient vanishing as it provides more direct connections for error signals to “skip” layers during back propagation.
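
As a rough illustration of the skip connection described above, here is a minimal Python sketch. The `toy_layer` function (a linear map followed by tanh) and the example values are assumptions made for illustration only; they stand in for whatever neural layer the skip connection wraps.

```python
import math

def toy_layer(x, weights):
    """Stand-in for a neural layer: a linear map followed by tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

def residual_layer(x, weights):
    """Skip connection: add the original input to the processed output."""
    processed = toy_layer(x, weights)
    return [p + v for p, v in zip(processed, x)]

x = [0.5, -1.0]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
# With all-zero weights the layer outputs zeros, so only the skip path remains.
y = residual_layer(x, zero_weights)
```

Even when the layer itself contributes nothing, the input still reaches the output through the skip path, which is the same route error signals can take on the way back.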

Fig. 2. Dense convolutional neural network (CNN) [3].

Densely connected neural networks serve the same purpose, avoiding gradient vanishing, by providing more direct connections between layers, as shown in Fig. 2. Dense CNNs were introduced for image classification tasks, where they outperformed residual networks, which had been the best performing neural network architecture on the CIFAR-10/100 data sets [3]. Dense connections allow error signals to be back-propagated further, with less gradient vanishing between layers in a deep neural network.

One notable difference between dense networks and residual networks is the connectivity pattern. Let 𝓗₅(·) denote a general composite function of the operations in the 5th layer of a given neural network. A residual connectivity pattern for the output of the 5th layer, x₅, can be written as x₅ = 𝓗₅(x₄) + x₄, while a dense connectivity pattern can be represented as x₅ = 𝓗₅([x₁,x₂,x₃,x₄]), where [x₁,x₂,x₃,x₄] is a concatenated vector of the outputs from the first layer to the 4th layer. The dense connectivity pattern has direct connections between many layers, while the residual connectivity pattern only has connections between adjacent layers. The increased number of direct connections in the dense connectivity pattern helps keep the information flow from vanishing during back propagation while training deep neural networks.
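
To make the difference in what 𝓗₅ receives concrete, the following Python sketch builds the input to the 5th layer under each pattern. The helper names and the toy 3-dimensional outputs are hypothetical; only the inputs to the layer function are modeled here, not the layer itself.

```python
def concat(vectors):
    """Concatenate a list of vectors into one flat vector."""
    return [value for vector in vectors for value in vector]

def residual_layer_input(previous_outputs):
    """Residual pattern: the layer function sees only the preceding layer's output."""
    return previous_outputs[-1]

def dense_layer_input(previous_outputs):
    """Dense pattern: the layer function sees all earlier outputs, concatenated."""
    return concat(previous_outputs)

# Hypothetical outputs x1..x4, each of dimension 3.
x1, x2, x3, x4 = [1.0] * 3, [2.0] * 3, [3.0] * 3, [4.0] * 3
outputs = [x1, x2, x3, x4]
residual_in = residual_layer_input(outputs)  # just x4
dense_in = dense_layer_input(outputs)        # [x1, x2, x3, x4] flattened
```

With four 3-dimensional outputs, the residual input stays 3-dimensional while the dense input has 12 dimensions, one slice per earlier layer.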

Fig. 3. Word error rate (%) comparison between residual and dense connection for LSTM (with the cell dimension of 128). The models were trained with Switchboard-1 Release 2 and tested against the NIST 2000 Hub5 English evaluation data set.

Motivated by the success of the dense CNNs [3], we applied the dense connectivity pattern to LSTMs [1]. Let us first take a look at how dense connections compare to residual connections in an LSTM setting for speech recognition. Fig. 3 plots word error rates (WERs) against the number of LSTM layers in each acoustic model. In the figure, the red curve suggests that normal LSTMs gain no further speech recognition accuracy beyond the 6th layer. The performance of residual LSTMs, depicted as the orange curve, improves through to the 10th layer and then degrades as more layers are added. This confirms that residual learning allows LSTMs to be trained properly even with more layers; however, the U-shaped curve also shows that there is a clear limit. In contrast, dense LSTMs continue to improve as more layers are added, even after the 10th layer, lowering the WER to around 16% at 20 LSTM layers (light blue curve). Some of our notes on the dense LSTM experiments can be found here.

Due to the connectivity pattern of concatenating vectors coming from the previous layers, the dimension of the input vector for the 𝓁-th LSTM layer with a cell dimension of 𝒹 would be (𝓁−1)×𝒹, which keeps increasing as 𝓁 increases. Hence, we grouped LSTM layers into blocks, where dense connections are applied only within the same block, and linked the blocks with a transitional layer. This block concept was also exploited in the original paper for densely connected CNNs [3], but for a different purpose. The green curve in Fig. 3 is based on dense LSTMs where every group of 5 LSTM layers belongs to one block, while the light blue curve comes from dense LSTMs with 10 LSTM layers per block.
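
The effect of blocking on the input dimension can be sketched with a small Python helper. The function name `concat_width` is hypothetical, and it makes two assumptions not stated in the post: transitional layers are not counted in the layer index, and the input that a block's first layer receives from the transitional layer is left out because its width is not specified here.

```python
def concat_width(layer_index, cell_dim, layers_per_block=None):
    """Width of the concatenated outputs feeding the given (1-based) LSTM layer.

    Without blocks this is (layer_index - 1) * cell_dim, as in the text.
    With blocks, only earlier layers in the same block are counted; the
    transitional-layer input to a block's first layer is left out (assumption).
    """
    if layer_index < 1:
        raise ValueError("layer_index is 1-based")
    preceding = layer_index - 1
    if layers_per_block is not None:
        preceding %= layers_per_block
    return preceding * cell_dim

# 20th LSTM layer with the cell dimension of 128 used in Fig. 3.
no_blocks = concat_width(20, 128)
blocks_of_10 = concat_width(20, 128, layers_per_block=10)
blocks_of_5 = concat_width(20, 128, layers_per_block=5)
```

Under these assumptions, the 20th layer would see 19×128 = 2432 concatenated dimensions without blocks, 9×128 = 1152 with 10 layers per block, and 4×128 = 512 with 5 layers per block.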

Based on this superiority of dense LSTMs, we proposed a few dense LSTM acoustic models [1].

Fig. 4. (a) Dense TDNN-LSTM. (b) Dense CNN-bLSTM.

The first kind is dense TDNN-LSTM. As shown above (left), it consists of 7 time delay neural network (TDNN) layers combined with 3 LSTM layers, where 3 TDNNs are followed by two dense blocks and 1 LSTM in the final layer before the softmax layer. Each green-highlighted dense block contains 1 LSTM and 2 TDNNs with the dense connectivity pattern. The final layer in each block concatenates all the outputs of the neural layers within the block. The second kind is dense CNN-bLSTM, as shown above (right). As we explored several dense CNN-bLSTMs in [1], the figure is presented as generally as possible. This structure has 3 CNN layers followed by 𝑁 dense blocks (blue-highlighted), each of which contains 𝑀 LSTM layers, connected densely to one another. The final layer in each block concatenates the output vectors from all the layers inside and delivers the result to the next block.
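
The data flow inside one dense block can be sketched as follows. This is not the acoustic model from [1]: `make_toy_layer` is a hypothetical stand-in for an LSTM or TDNN layer, and the sketch assumes that the block input is also visible to every layer in the block, which the post does not spell out.

```python
def concat(vectors):
    """Concatenate a list of vectors into one flat vector."""
    return [value for vector in vectors for value in vector]

def make_toy_layer(out_dim):
    """Hypothetical stand-in for an LSTM/TDNN layer with out_dim outputs."""
    def layer(inputs):
        mean = sum(inputs) / len(inputs)
        return [mean] * out_dim
    return layer

def dense_block(block_input, layers):
    """Run layers with dense connectivity and concatenate their outputs."""
    features = [block_input]  # assumption: every layer also sees the block input
    layer_outputs = []
    for layer in layers:
        output = layer(concat(features))  # each layer sees all earlier features
        features.append(output)
        layer_outputs.append(output)
    # The block's final step concatenates the outputs of all layers inside it.
    return concat(layer_outputs)

block_input = [1.0, 2.0, 3.0]
layers = [make_toy_layer(2) for _ in range(3)]  # M = 3 layers, cell dimension 2
block_output = dense_block(block_input, layers)
```

With 𝑀 = 3 layers of dimension 2, the block output has 6 dimensions; in a stacked model this vector would be what gets passed on toward the next block.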

Table 1 shows a WER comparison between TDNN-LSTMs with and without dense connections. For this experiment, we used Fisher English Training Part 1 & 2 and Switchboard-1 Release 2, approximately 2,000 hours of data in total. The table shows a statistically meaningful improvement (around 5% relative) from the dense TDNN-LSTM on the CallHome (CH) test set. Table 2 presents the performance of a few dense CNN-bLSTMs with different configurations. In configurations (a), (b) and (c), the dense CNN-bLSTMs have a total of 15 LSTM layers. The dense CNN-bLSTM-(a) and (b) have one transitional layer between two blocks of 7 LSTM layers each (𝑁=2, 𝑀=7), while in configuration (c) three blocks of 5 LSTM layers each (𝑁=3, 𝑀=5) are tightly connected without a transitional layer. In the dense CNN-bLSTM-(c), the cell dimension of each successive block gets smaller, from 512 down to 128, so that the network becomes narrower as it goes deeper. In configuration (d), the dense CNN-bLSTM has a total of 30 LSTM layers with a smaller cell dimension of 128. The dense CNN-bLSTM-(a), (b) and (c) all exceeded the performance of the dense TDNN-LSTM in Table 1 on both the Switchboard (SWBD) and CallHome (CH) test sets. The performance gap among the dense CNN-bLSTMs seems to be largely attributable to the LSTM cell dimension.
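
For readers less familiar with the term, a relative improvement is measured against the baseline WER rather than in absolute points. The Python sketch below shows the arithmetic; the function name is hypothetical and the example WERs are made up for illustration, not taken from Table 1.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Hypothetical WERs for illustration only; these are not the values in Table 1.
example = relative_wer_reduction(20.0, 19.0)
```

In this made-up case, a drop of one absolute point from a 20% baseline corresponds to a relative reduction of about 5%.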

With the contribution from these dense LSTMs, we have achieved a milestone of 5.0% WER on the Switchboard test set, surpassing the reported human parity (5.1% WER) [4]. Dense connections can be easily applied to existing LSTM-based neural network architectures for speech recognition thanks to their simple connectivity pattern, unlocking improved performance as we add more layers, since they alleviate the vanishing gradient!

  1. Kyu J. Han, Akshay Chandrashekaran, Jungsuk Kim and Ian Lane, “The CAPIO 2017 conversational speech recognition system,” arXiv 1801.00059.
  2. Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, “Deep residual learning for image recognition,” arXiv 1512.03385.
  3. Gao Huang, Zhuang Liu, Kilian Q. Weinberger and Laurens van der Maaten, “Densely connected convolutional networks,” arXiv 1608.06993.
  4. George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi and Phil Hall, “English conversational telephone speech recognition by humans and machines,” arXiv 1703.02136.
  5. Kyu J. Han, Seongjun Hahm, Byung-Hak Kim, Jungsuk Kim and Ian Lane, “Deep learning-based telephony speech recognition in the wild,” in Proc. of Interspeech 2017.