Conditional GAN: A Case Study in Speech Enhancement Using Visual Cues
Introduction
Recent research in GANs indicates that audio processing problems can benefit from their use. One problem we believe can benefit greatly from GANs is speech enhancement, also known as ‘de-noising’: separating the relevant speech from irrelevant noise to achieve greater speech clarity. For example, we would like to separate background noise from human speech in a recorded webcam video. Given that a webcam video contains visual and audio cues that are synchronized, we propose to experiment with using both audio and video cues to improve the performance of GANs for de-noising.
In this article, we describe our efforts at Crater Labs to develop a multi-modal GAN that uses both video and audio cues for speech enhancement. In particular, we propose to use the shape of the lips and other facial features to guide our model in removing background noise.
Part One: Starting with SEGAN
SEGAN[1] is a type of conditional GAN, which uses additional information to assist the generator in creating ‘good’ samples. In speech enhancement, the clean speech is embedded in the original noisy recording, so the noisy speech itself serves as the additional information needed to generate the enhanced speech.
The SEGAN generator is shown in the figure below. It contains an encoder that compresses the noisy speech into a feature vector c, to which a white noise vector z is then concatenated. In addition, the feature maps from every encoder layer are passed to the decoder through skip connections, assisting the generator in the same way.
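For illustration, here is a minimal PyTorch sketch of this encoder-decoder layout; the depth, kernel sizes, and channel widths are placeholders rather than the exact values used in [1]:

```python
import torch
import torch.nn as nn


class TinySEGANGenerator(nn.Module):
    """Toy SEGAN-style encoder-decoder with skip connections.

    Widths, kernel sizes, and depth are illustrative; the real SEGAN uses
    11 strided 1-D conv layers per half.
    """

    def __init__(self):
        super().__init__()
        enc_ch = [1, 16, 32, 64]
        self.encoder = nn.ModuleList(
            nn.Conv1d(cin, cout, kernel_size=31, stride=2, padding=15)
            for cin, cout in zip(enc_ch[:-1], enc_ch[1:])
        )
        # The first decoder layer sees c concatenated with the noise z
        # (64 + 64 channels); later layers see the previous output
        # concatenated with a skip connection from the encoder.
        self.decoder = nn.ModuleList([
            nn.ConvTranspose1d(128, 32, kernel_size=32, stride=2, padding=15),
            nn.ConvTranspose1d(64, 16, kernel_size=32, stride=2, padding=15),
            nn.ConvTranspose1d(32, 1, kernel_size=32, stride=2, padding=15),
        ])

    def forward(self, noisy):
        skips, h = [], noisy
        for enc in self.encoder:
            h = torch.relu(enc(h))
            skips.append(h)
        z = torch.randn_like(h)          # white noise z, same shape as the code c
        h = torch.cat([h, z], dim=1)     # concatenate z to c
        for i, dec in enumerate(self.decoder):
            h = dec(h)
            if i < len(self.decoder) - 1:
                # skip connection from the matching encoder layer
                h = torch.cat([torch.relu(h), skips[-(i + 2)]], dim=1)
        return torch.tanh(h)             # enhanced waveform


if __name__ == "__main__":
    wav = torch.randn(4, 1, 16384)            # a batch of ~1 s clips at 16 kHz
    print(TinySEGANGenerator()(wav).shape)    # torch.Size([4, 1, 16384])
```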
The discriminator has a similar architecture to the generator’s encoder, but it takes both the noisy speech and the clean (or generated) speech as input.
Because the noise-free speech is available during training, it is used as a target for speech enhancement. In addition to the GAN loss described above, an L1 loss, the absolute difference between the clean speech and the generated speech, is also included. While SEGAN has shown some success in speech enhancement, like most GANs it is prone to collapse.
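As a sketch of the combined generator objective, using the least-squares convention we adopt below (x is the clean speech, x̃ the noisy speech, and λ a weighting hyper-parameter; the exact formulation in [1] may differ in its details):

```latex
\mathcal{L}_G = \tfrac{1}{2}\,\mathbb{E}\big[(D(G(z,\tilde{x}),\tilde{x}) - 1)^2\big]
              + \lambda\,\mathbb{E}\big[\lVert G(z,\tilde{x}) - x \rVert_1\big]
```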
Part Two: Improving SEGAN
SEGAN, as implemented by the original authors, shows significant promise in speech enhancement. We hypothesized that if we can (a) stabilize SEGAN and (b) incorporate facial features, we can improve its ability to enhance speech. Specifically, we seek to achieve these two objectives through the incorporation of ‘spectral normalization’[2] and a few other techniques not used in the original SEGAN implementation.
Stabilizing SEGAN
Transposed convolution layers like those used in SEGAN are known to create ‘checkerboard artifacts’[3] in image generation. The corresponding artifact in the audio domain is a persistent hum at particular pitches throughout the generated speech. It turns out that the discriminator can latch onto this artifact, achieving a nearly 100% success rate in telling generated speech from real speech.
By the Power of Phase Shuffle
The trick is to make it harder for the discriminator to learn such artifacts. This is done by randomly shifting the audio input of the discriminator by a few samples, a process dubbed ‘phase shuffle’[4]. The plot below shows this [note that we are using the convention in LSGAN, where the generator loss is 0.5×(D(G(z))-1)²]:
By training SEGAN with phase shuffle, the discriminator output on generated speech converges to D(G(z)) = 0.5, the ideal scenario in which the discriminator can no longer tell generated speech from real speech. After adding phase shuffle to SEGAN, we also observed a drop in the L1 loss.
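For reference, a minimal sketch of phase shuffle as we understand it from [4] is shown below; the shift range n and the reflection padding at the edges are common choices, not necessarily the exact ones in the original implementation:

```python
import torch
import torch.nn.functional as F


def phase_shuffle(x: torch.Tensor, n: int = 2) -> torch.Tensor:
    """Randomly shift a batch of waveforms (B, C, T) by up to +/- n samples.

    Applied on the discriminator side so it cannot rely on the exact phase of
    checkerboard-style artifacts. Edges are reflection-padded to keep the
    length unchanged.
    """
    shift = int(torch.randint(-n, n + 1, (1,)))
    if shift == 0:
        return x
    if shift > 0:
        # pad on the left, drop samples on the right
        return F.pad(x, (shift, 0), mode="reflect")[..., :x.shape[-1]]
    # shift < 0: pad on the right, drop samples on the left
    return F.pad(x, (0, -shift), mode="reflect")[..., -x.shape[-1]:]


if __name__ == "__main__":
    wav = torch.randn(4, 1, 16384)
    print(phase_shuffle(wav).shape)  # torch.Size([4, 1, 16384])
```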
Having trained SEGAN with the phase shuffle technique, we are now ready to equip SEGAN with visual cues.
Incorporating Facial Features
Intuitively, we assume there must be a correspondence between lip movements and speech: as the audio passes through each layer of a neural network, there should be a corresponding feature produced as the lip frames pass through a similar network.
To obtain the lip frame features, we captured frames from the webcam video and stacked them depth-wise. The stacked frames are then passed into a deep convolutional neural network, and the output of each layer forms a set of lip frame features.
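A minimal sketch of such a lip-frame encoder is shown below; the frame count, crop resolution, and channel widths are illustrative placeholders:

```python
import torch
import torch.nn as nn


class LipFrameEncoder(nn.Module):
    """Toy CNN over depth-wise stacked lip frames.

    A clip of F grayscale frames is stacked along the channel axis into an
    (F, H, W) tensor. The output of every conv block is kept, giving one
    feature set per layer to align with the audio features later.
    """

    def __init__(self, num_frames: int = 16, widths=(32, 64, 128)):
        super().__init__()
        chans = [num_frames, *widths]
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
                nn.LeakyReLU(0.2),
            )
            for cin, cout in zip(chans[:-1], chans[1:])
        )

    def forward(self, frames: torch.Tensor):
        feats, h = [], frames
        for block in self.blocks:
            h = block(h)
            feats.append(h)          # one lip-frame feature set per layer
        return feats


if __name__ == "__main__":
    clip = torch.randn(4, 16, 64, 64)   # batch of 16 stacked 64x64 lip crops
    for f in LipFrameEncoder()(clip):
        print(f.shape)
```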
We used the ‘projection method’[5], originally proposed for the second-to-last layer of the discriminator, and observed significantly improved results. The projection is a dot product between two sets of features: if the two sets align (i.e. are similar), the dot product returns a positive number; if they do not, it returns a negative number. The projection method was shown to be superior to other ways of using additional information, such as the concatenation used in SEGAN, so we use it for our lip frames.
In the discriminator, we take the dot product between the last set of lip frame features and an audio feature set of the same dimensions. The alignment score obtained from the dot product is then added to the output of the discriminator. The figure below shows a two-layer version of our discriminator.
During preliminary experiments, a naive implementation of this method led to an order-of-magnitude increase in the loss. The culprit was the sum inside the dot product. To address this, we take the mean rather than the sum when the dot product is performed: the alignment is still measured, while the loss stays in its usual range.
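A sketch of this projection step, with the mean replacing the sum, is shown below; the feature shapes are placeholders, and audio_feat and lip_feat stand for the matched audio and lip-frame feature sets:

```python
import torch


def projection_score(audio_feat: torch.Tensor, lip_feat: torch.Tensor) -> torch.Tensor:
    """Alignment score between audio and lip-frame features of equal shape.

    A plain projection would sum the element-wise products (a dot product);
    taking the mean instead keeps the score on the same scale as the rest of
    the discriminator output.
    """
    assert audio_feat.shape == lip_feat.shape
    return (audio_feat * lip_feat).flatten(1).mean(dim=1)  # one score per sample


# The score is simply added to the unconditional discriminator output, e.g.
#   d_out = d_trunk(audio) + projection_score(audio_feat, lip_feat)
if __name__ == "__main__":
    a = torch.randn(4, 128, 32)
    v = torch.randn(4, 128, 32)
    print(projection_score(a, v).shape)  # torch.Size([4])
```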
Generator Conditioning with Continuous Class Labels
In the literature, such as cGANs with Projection Discriminator, limiting the generated samples to specific classes is typically done with conditional ‘batch normalization’[6], where a different set of scaling and shifting weights is learned for each class (sketched below). In our application, the class labels are the lip frames associated with each audio clip, so using a different set of weights for each lip frame would be impractical.
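For reference, a minimal sketch of conditional batch normalization is shown below; the class count and channel width are illustrative, and the point is that a discrete label index is required, which lip frames do not provide:

```python
import torch
import torch.nn as nn


class ConditionalBatchNorm1d(nn.Module):
    """Batch norm with one learned (gamma, beta) pair per discrete class.

    This works when labels come from a small, fixed vocabulary; with lip
    frames as the "label" there is no such vocabulary, so a separate pair
    per condition is not feasible.
    """

    def __init__(self, num_features: int, num_classes: int):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features, affine=False)
        self.gamma = nn.Embedding(num_classes, num_features)
        self.beta = nn.Embedding(num_classes, num_features)
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.beta.weight)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) activations, y: (B,) integer class labels
        g = self.gamma(y).unsqueeze(-1)
        b = self.beta(y).unsqueeze(-1)
        return g * self.bn(x) + b


if __name__ == "__main__":
    cbn = ConditionalBatchNorm1d(num_features=64, num_classes=10)
    x, y = torch.randn(8, 64, 100), torch.randint(0, 10, (8,))
    print(cbn(x, y).shape)  # torch.Size([8, 64, 100])
```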
The real issue is that batch normalization can cause SEGAN to collapse, especially when it is used in the generator. Without batch normalization, we would have to concatenate the lip frame features onto the audio features, which incurs additional memory cost.
We designed a novel way to condition the generator on lip frames without additional memory usage: we extend the projection method to every layer of the generator’s decoder. To keep things simple, we use the same implementation as in the discriminator, except that all layers of lip frame features are used rather than just the last one. The output of each decoder layer of matching dimension is added to the alignment score from the lip frame projection. The figure below shows a sizeable improvement in the model’s generalization power.
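Concretely, a minimal sketch of one decoder layer with this conditioning might look as follows; the dimensions are placeholders, and lip_feat stands for a lip-frame feature set reshaped to match the layer’s output:

```python
import torch
import torch.nn as nn


class ProjectedDecoderLayer(nn.Module):
    """One generator decoder layer conditioned on lip-frame features.

    The layer's output is nudged by an alignment score between that output
    and the matching lip-frame feature set (mean-based projection, as in our
    discriminator sketch). Dimensions are illustrative.
    """

    def __init__(self, cin: int, cout: int):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(cin, cout, kernel_size=32, stride=2, padding=15)

    def forward(self, h: torch.Tensor, lip_feat: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.deconv(h))
        # mean-based projection: one alignment scalar per sample, added back in
        score = (out * lip_feat).flatten(1).mean(dim=1)
        return out + score.view(-1, 1, 1)


if __name__ == "__main__":
    layer = ProjectedDecoderLayer(cin=64, cout=32)
    h = torch.randn(4, 64, 1024)
    lip = torch.randn(4, 32, 2048)   # lip features matched to the layer's output shape
    print(layer(h, lip).shape)       # torch.Size([4, 32, 2048])
```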
Using the projection in the generator not only improves training, it also stabilizes the GAN. We also observed that generalization deteriorates around the same time the GAN collapses. The figure below shows that only the model with projection in both the discriminator and the generator is stable: at 20,000 iterations the baseline SEGAN collapses and starts to overfit, while the model with projection only in the discriminator somewhat stalls and is overtaken by the model with projection in both.
Conclusions
A pre-trained SEGAN can enhance some audio quite well. We tried to improve the results by training SEGAN directly on our data set, using audio clips only; however, SEGAN collapses quickly and starts to overfit. We believe this is because the data is too complex for SEGAN to learn.
As it turns out, including a module that processes lip frames stabilizes SEGAN: facial features can assist in speech enhancement tasks. We extended the projection method to every layer of the generator and observed improved stability and validation loss in speech enhancement GANs.
Further, we found that the stability of the GAN is correlated with the generalization power of the model: as the GAN collapses, it becomes harder for the model to generalize.
Notes: Data Set
The data set is derived from video interviews of school applicants. Each video is about 1 to 2 minutes long, and the speakers are of various ethnicities, with foreign and regional accents. To make the problem simpler, we manually selected 540 nearly noise-free videos of fluent North American English speakers by listening to the audio only. We then added background noise to each video to form a (clean, noisy) audio pair for training, in addition to the lip frames captured from the video.
The data set at hand is much more complex than the one used to train SEGAN. For instance,
- It has 540 speakers, 30 times more than the data set SEGAN was trained on.
- All speakers have unique answers to the same question.
References
[1] Santiago Pascual et al., SEGAN: Speech Enhancement Generative Adversarial Network. https://arxiv.org/abs/1703.09452
[2] Takeru Miyato et al., Spectral Normalization for Generative Adversarial Networks. https://arxiv.org/pdf/1802.05957.pdf
[3] Odena et al., Deconvolution and Checkerboard Artifacts. https://distill.pub/2016/deconv-checkerboard/
[4] Chris Donahue et al., Synthesizing Audio with GANs. https://openreview.net/pdf?id=r1RwYIJPM
[5] Takeru Miyato et al., cGANs with Projection Discriminator. https://arxiv.org/pdf/1802.05637.pdf
[6] Vincent Dumoulin et al., A Learned Representation For Artistic Style. https://arxiv.org/abs/1610.07629