Progressive Speech Enhancement with Residual Connections
In recent years, Speech Enhancement (SE) based on Deep Neural Networks (DNN) has become one of the most active topics in the speech processing community. Previous work demonstrates the ability of deep learning approaches to discover underlying relations between clean and corrupted signals [1–4]. However, beyond the training goal of minimizing the error between the network output and the clean reference, we do not really know why and how the transformations inside the network take place. This black-box effect is probably the main handicap of, and complaint against, deep learning solutions, because it hinders the research process and gives rise to many empirical solutions.
Interpretability of deep neural networks has recently emerged as an area of machine learning research. It aims to provide a better understanding of how models perform feature selection and derive their classification decisions, so that the findings can inform the design of DNN solutions. Recently, top scientific conferences in the field have dedicated special sessions and workshops to this aim (IRASL 2018, Interspeech 2018), evidencing the interest of the R&D community.
In this work, we present an SE architecture that follows the feature-mapping strategy. The architecture is based on the recent and powerful topology of Residual Networks (RN), using one-dimensional convolutional layers. The enhancement process can be followed step by step by means of a visualization probe at each network block. Visualizing the partial enhancement at each step allows us to supervise the process and collect relevant details on how it is performed. This information is useful to detect which steps are meaningful for the enhancement and which can be discarded during evaluation. In this way, we obtain a proper trade-off between accuracy and computational effort.
For visualization purposes, we chose four speech signals. Figure 2 shows the corresponding clean and distorted spectra. We used speech samples from the REVERB dataset, which consists of read speech with reverberation and additive stationary noise.
Constant Channel Residual Network
In order to progressively enhance the input without losing the spectral representation in each block, we have designed an RN that maintains the same number of channels throughout the residual connection. We call this architecture Constant Channel Residual Network (CCRN). Figure 1 shows the system, which uses multiple input sources to provide a variety of signal representations. We provide multiple representations of the input in order to capture as much of the reverberant impulse response as possible inside the pre-processing analysis window, without losing the temporal resolution of the acoustic events.
The network processes the input features with a first convolutional layer followed by L=14 Residual Blocks (RB). The combination of Batch Normalization (BN) and Parametric ReLU (PReLU) provides a smoother representation for regression tasks than the typical ReLU. Our goal is to estimate the logarithmic spectrum of the clean signal (i.e. the enhanced signal) from the noisy speech. Based on the experience from previous work , we process the full input signal as a sequence, instead of frame by frame. We use a loss function based on the Mean Square Error (MSE) across frames:
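A minimal way to write such a frame-averaged MSE is sketched below. The notation is an assumption made for illustration, not necessarily the authors' own: $\mathbf{y}_t$ is the clean log-spectrum at frame $t$, $\hat{\mathbf{y}}_t$ is the network output for that frame, $T$ is the number of frames in the sequence and $D$ is the number of frequency channels.

```latex
J = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{D}
    \left\lVert \mathbf{y}_t - \hat{\mathbf{y}}_t \right\rVert_2^2
```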
Based on the previous description, we place a probe output at each block so that we can inspect the evolution of the enhancement process. This is possible because the number of channels remains unaltered across all RBs.
Figure 3 shows an example of the output spectrum obtained at some of the steps.
Note that a standard convolutional layer computes each output channel as a linear combination of all input channels within a temporal context. Therefore, the resulting 512-dimensional matrix does not correspond to a proper spectrum. Note also that a few frequency channels seem to concentrate a large part of the spectral information, while the rest appear blurred. This indicates that the network focuses on certain frequency channels, consistent with the findings in . In other words, the network disregards the time-frequency structure of the spectrum and only follows the weight changes dictated by the MSE minimization. This could be related to the different levels of distortion among the frequency channels. Furthermore, the first and last steps of the network processing are remarkable: the first step transforms the signal spectrum into strong or blurred frequency channels at the convolution output, while the last step suddenly reorders the channels and recovers the signal with the enhancement included.
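The idea of probing a constant-channel residual stack can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not the authors' implementation: the helper names (`conv1d_same`, `channel_norm`, `make_block`, `ccrn_forward`), the per-utterance normalization used as a stand-in for BN, the single convolution per block and the kernel size are all hypothetical.

```python
import numpy as np

def conv1d_same(x, w, b):
    """1-D convolution over time with zero padding (odd kernel assumed).
    x: (c_in, frames), w: (c_out, c_in, kernel), b: (c_out,)."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.empty((c_out, x.shape[1]))
    for t in range(x.shape[1]):
        # Each output channel mixes all input channels within a temporal context.
        out[:, t] = np.tensordot(w, xp[:, t:t + k], axes=([1, 2], [0, 1])) + b
    return out

def channel_norm(x, eps=1e-5):
    """Per-channel standardization over time (simplified stand-in for BN)."""
    return (x - x.mean(axis=1, keepdims=True)) / (x.std(axis=1, keepdims=True) + eps)

def prelu(x, alpha):
    return np.where(x > 0, x, alpha * x)

def residual_block(x, p):
    """Additive update: the channel count of x is preserved."""
    h = prelu(channel_norm(x), p["alpha"])
    return x + conv1d_same(h, p["w"], p["b"])

def make_block(rng, channels, kernel=3):
    """Hypothetical helper: small random parameters for one block."""
    return {"w": 0.01 * rng.standard_normal((channels, channels, kernel)),
            "b": np.zeros(channels), "alpha": 0.25}

def ccrn_forward(x, blocks):
    """Run all blocks and keep a probe (copy of the output) after each one."""
    probes = []
    for p in blocks:
        x = residual_block(x, p)
        probes.append(x.copy())
    return x, probes
```

Because every probe has the same shape as the input, each one can be plotted directly as a spectrogram-like image, which is what makes the step-by-step inspection possible.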
Constant Channel Residual Network with State Path
The CCRN architecture was designed to progressively enhance the input signal without losing the log-spectral representation. However, although we provide a shortcut path that lets the input pass through with additive modifications, training the CCRN yields a disorganized spectral representation in which most of the information appears to be grouped in a few channels. In order to pass the input along the residual path without changing its representation, we add a state path between RBs. In this way, the representation of the signal created by the network has its own path to travel along. Moreover, this state path allows more channels at each layer while maintaining the same number of channels in the residual path. We call the new architecture Constant Channel Residual Network with State Path (CCRN-State).
This architecture stacks the channels of both paths at the input of the block. Inspired by Wide Residual Networks , we increase the number of channels in the first convolution of the block. Then, in order to obtain the residual path and a new state path, we use two convolutional layers at the block output. One reduces the number of channels to the dimension of the residual connection, allowing the same behavior as in the previous architecture, while the other provides the state path to be used by the next block.
Qualitative results again showed a spectrum with disorganized frequency channels, a pattern similar to the one observed in the previous section. Unfortunately, with this approach we were not able to obtain a proper spectrum reconstruction at each step.
Finally, to force the networks to provide a proper signal reconstruction at each step, we add an MSE cost term at each block output. Antecedents of this strategy can be found in classification tasks , although we use it here in a regression task. In the following equation, we add to the training cost the MSE between the clean reference and each block output.
We call this cost Progressive Supervision, because it accounts for how much the network enhances the signal at each block. Note that both architectures explored have the same output representation, because the number of channels is constant throughout the residual path. Nevertheless, the final enhancement results may differ.
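The data flow of one such block can be sketched as follows. This is a simplified illustration under assumptions that are not taken from the paper: kernel-size-1 convolutions are used for brevity, the activation is a leaky rectifier with a fixed slope, the new state replaces the old one, and the function and parameter names (`pointwise_conv`, `ccrn_state_block`, `w_in`, `w_res`, `w_state`) are hypothetical.

```python
import numpy as np

def pointwise_conv(x, w, b):
    """Kernel-size-1 convolution: per-frame linear mixing of channels.
    x: (c_in, frames), w: (c_out, c_in), b: (c_out,)."""
    return w @ x + b[:, None]

def ccrn_state_block(residual, state, p):
    """One block with a residual path and a separate state path."""
    # Stack the channels of both paths at the block input.
    stacked = np.concatenate([residual, state], axis=0)
    # First convolution widens the representation (more channels).
    h = pointwise_conv(stacked, p["w_in"], p["b_in"])
    wide = np.where(h > 0, h, 0.25 * h)
    # Output convolution 1: back to the residual dimension, added to the input.
    new_residual = residual + pointwise_conv(wide, p["w_res"], p["b_res"])
    # Output convolution 2: the state handed to the next block (assumed to replace the old state).
    new_state = pointwise_conv(wide, p["w_state"], p["b_state"])
    return new_residual, new_state
```

With this layout the residual path keeps a fixed number of channels, so it can still be probed after every block, while the width of the state path can be chosen independently.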
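One plausible form of this cost is sketched below. The notation is assumed for illustration: $\mathbf{y}_t$ is the clean log-spectrum at frame $t$, $\hat{\mathbf{y}}_t^{(l)}$ is the output of block $l$ for that frame, $T$ is the number of frames, $D$ the number of frequency channels and $L$ the number of blocks. Equal weighting of all blocks, and treating the output of the last block as the final network output, are assumptions.

```latex
J_{\mathrm{PS}} = \sum_{l=1}^{L} \frac{1}{T} \sum_{t=1}^{T} \frac{1}{D}
    \left\lVert \mathbf{y}_t - \hat{\mathbf{y}}_t^{(l)} \right\rVert_2^2
```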
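As a rough sketch, a cost of this kind can be computed from the block outputs as shown here. The function names are hypothetical, and the unweighted sum over blocks is an assumption; the paper may weight or normalize the terms differently.

```python
import numpy as np

def mse(reference, estimate):
    """Mean squared error over all frames and frequency channels."""
    return float(np.mean((reference - estimate) ** 2))

def progressive_supervision_loss(reference, block_outputs):
    """Sum of the MSE between the clean reference and every block output.
    Returns the total cost and the list of per-block costs."""
    per_block = [mse(reference, out) for out in block_outputs]
    return sum(per_block), per_block
```

The per-block costs returned alongside the total are the reconstruction errors that make the enhancement observable block by block.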
We can see that the earlier blocks take care of the most noticeably distorted areas of the spectrum, e.g. the reverberation tail. The left side of Figure 6 shows the modifications applied to the spectrum at each network step. These correspond to the network's interpretation of what is distortion and what is not.
We also note that the network mainly focuses on the spectral valleys, gradually removing the granularity in them. The spectral tail caused by reverberation is also gradually removed. In the last blocks, the network starts smoothing the spectrum in order to produce slow changes in spectral magnitude. This avoids undesirable self-generated distortions such as the annoying musical noise. However, it can also lead to over-smoothing, which produces an unrealistic final output (see the output of block 14).
During the enhancement process, the CCRN and CCRN-State architectures provide a messy spectrum that gives no clues about what the network is doing. This representation might be an encoding of the input, or the network could even be memorizing the training examples. With Progressive Supervision, the networks are able to show the evolution of the predicted signal. Moreover, this cost function regularizes the network, because it prevents it from memorizing specific examples. As a result, we can see how the signal is enhanced at each block. We can view the progressive enhancement through residual connections as a spectral power subtraction method: each RB computes the weighted spectral power, while the residual connection adds it to the corrupted signal.
An additional advantage of CCRN with Progressive Supervision is that during training we have access to the reconstruction error at each block. This allows us to train a large network with many RBs and then use only the number of RBs that actually provides a significant cost reduction. This is a desirable property when dealing with a clean signal, which needs little processing.
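The analogy can be made explicit with a short derivation. The symbols are assumptions introduced for illustration: $\hat{\mathbf{y}}^{(0)}$ is the corrupted input on the residual path, $F_l$ is the transformation computed by block $l$, and the last line is the classical power spectral subtraction rule with noisy spectrum $X$, noise estimate $\hat{N}$ and over-subtraction factor $\alpha$. Since the network operates on log-spectra, the correspondence is only approximate.

```latex
\begin{align}
\hat{\mathbf{y}}^{(l)} &= \hat{\mathbf{y}}^{(l-1)} + F_l\!\left(\hat{\mathbf{y}}^{(l-1)}\right),
    \qquad l = 1, \dots, L \\
\hat{\mathbf{y}}^{(L)} &= \hat{\mathbf{y}}^{(0)} + \sum_{l=1}^{L} F_l\!\left(\hat{\mathbf{y}}^{(l-1)}\right) \\
\lvert \hat{S}(t,f) \rvert^2 &= \lvert X(t,f) \rvert^2 - \alpha \, \lvert \hat{N}(t,f) \rvert^2
\end{align}
```

In this reading, each $F_l$ plays the role of the term $-\alpha \lvert \hat{N} \rvert^2$: a learned correction that the residual connection adds to the current estimate.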
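A simple way to choose how many blocks to keep from the per-block costs is sketched below. The selection rule, the function name `blocks_to_keep` and the threshold `min_relative_gain` are hypothetical choices made for illustration; the paper does not specify its criterion.

```python
def blocks_to_keep(block_costs, min_relative_gain=0.01):
    """Return how many leading blocks to keep, given the reconstruction
    cost measured at each block output (block 1 first).

    A block counts as useful if it lowers the cost of the previous block
    by at least `min_relative_gain` (relative). The result is the index of
    the last useful block, so later blocks can be skipped at evaluation.
    """
    if not block_costs:
        return 0
    keep = 1
    for i in range(1, len(block_costs)):
        previous = block_costs[i - 1]
        if previous <= 0:
            break  # nothing left to improve
        gain = (previous - block_costs[i]) / previous
        if gain >= min_relative_gain:
            keep = i + 1
    return keep
```

For example, with costs `[1.0, 0.5, 0.3, 0.299, 0.2989]` and the default threshold, the first three blocks are kept, since the last two reduce the cost by well under one percent each.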
We have presented a deep learning solution for Speech Enhancement based on residual networks. By means of an interpretative study of the progressive transformations performed by the network, we were able to design an improved architecture for speech enhancement. The Progressive Supervision mechanism contributed to regularizing the network parameters. It proved able to encourage the enhancement process to behave correctly, beyond the blind modification of frequency channels performed by the other alternatives evaluated (CCRN and CCRN-State).
The proposal obtained speech enhancement results beyond the state of the art, achieving a favorable trade-off between dereverberation and the amount of spectral distortion. We showed that interpretative analysis of the process inside the network can provide useful insights for developing improved solutions.
B.-Y. Xia and C.-C. Bao, “Speech enhancement with weighted denoising auto-encoder,” in Interspeech, 2013.
X. Feng, Y. Zhang, and J. Glass, “Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
M. Tu and X. Zhang, “Speech enhancement based on deep neural networks with skip connections,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5565–5569.
P. Karjol, A. Kumar, and P. K. Ghosh, “Speech enhancement using multiple deep neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5049–5053.
J. Llombart, A. Miguel, A. Ortega, and E. Lleida, “Wide residual networks 1d for automatic text punctuation,” in IberSPEECH, 2018, pp. 296–300.
J. F. Santos and T. H. Falk, “Investigating the effect of residual and highway connections in speech enhancement models,” in Conference on Neural Information Processing Systems (NIPS), 2018.
S. Zagoruyko and N. Komodakis, “Wide Residual Networks,” CoRR, vol. abs/1605.07146, 2016. [Online]. Available: http://arxiv.org/abs/1605.07146
C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in Artificial Intelligence and Statistics, 2015, pp. 562–570.