Noise Suppression Using Deep Learning

Darshan Deshpande
Published in Analytics Vidhya
6 min read · Sep 13, 2020
Credits: Zoom

With the boom in online video conferencing and virtual communication, a platform's ability to suppress background noise plays a crucial role in giving it a leading edge. Platforms like Google Meet constantly use machine learning to perform noise suppression and provide the best audio quality possible. Today I will show you how to build your own deep learning model for noise suppression.

What’s the Big Deal with Noise Suppression?

The task of noise suppression can be approached in a few different ways, including Generative Adversarial Networks (GANs), embedding-based models, residual networks, and so on. Irrespective of the approach, there are two major problems with noise suppression:

  1. Handling variable length audio sequences
  2. Slow processing time resulting in lag

In this article, I will show you some basic methods for dealing with these problems.

Let’s Start

We will first import our libraries. We will be using TensorFlow; you are free to implement a PyTorch version of the same.

The data we will be using is a combination of clean and noisy audio samples of different sizes. The dataset is provided by the University of Edinburgh and can be downloaded from here.

Loading and Visualising the data

We will use TensorFlow's tf.audio module to load our data. Using tf.audio.decode_wav() along with tf.io.read_file() has given me 50% faster loading times compared to librosa.load(), because TensorFlow can make use of the GPU.

Here we load the individual audio files using tf.audio.decode_wav() and concatenate them to get two tensors named clean_sounds_list and noisy_sounds_list. This process takes about 3–4 minutes to complete and is visually tracked using the tqdm progress bar.
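The loading loop described above can be sketched as follows. The directory paths and file lists are placeholders for wherever you unpack the Edinburgh dataset, and the helper name is my own:

```python
import tensorflow as tf

try:
    from tqdm import tqdm          # progress bar used in the article
except ImportError:                # degrade gracefully if unavailable
    tqdm = lambda x: x

def load_audio_files(file_paths):
    """Read each WAV file via the tf.io/tf.audio path and return
    one long 1-D waveform tensor."""
    waveforms = []
    for path in tqdm(file_paths):
        raw = tf.io.read_file(path)                       # raw bytes
        audio, _sr = tf.audio.decode_wav(raw, desired_channels=1)
        waveforms.append(tf.squeeze(audio, axis=-1))      # (samples,)
    return tf.concat(waveforms, axis=0)

# clean_sounds_list = load_audio_files(sorted(glob.glob('clean_trainset/*.wav')))
# noisy_sounds_list = load_audio_files(sorted(glob.glob('noisy_trainset/*.wav')))
```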

After the loading is done, we need to make uniform splits of the one big audio waveform. Although this is not compulsory, our main aim is to convert this model to a TFLite model, which currently does not support variable-length inputs. I chose an arbitrary batching_size of 12000. You are free to change it, but keep it as small as possible; it will be useful later.

Clean Audio
Noisy Audio

For the visualisation part, we use librosa's display module, which uses matplotlib in the backend to plot the data. On plotting the data as seen above, we can see that the noise is quite visible. The noise can be anything from people and cars to dish-washing sounds.

Creating a tf.data.Dataset for pipelining

We will now create a very basic helper function called get_dataset() to generate a tf.data.Dataset. We choose 40000 samples for training and the remaining 5000 for testing. Again, you are free to tweak and extend this, but I will not go into depth here, as pipeline optimization is not the main goal of this article.
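A basic version of get_dataset() might look like this. The batch size of 32 is my assumption, not a value from the article:

```python
import tensorflow as tf

def get_dataset(noisy_frames, clean_frames, batch_size=32):
    """Pair (noisy, clean) frames and build a batched, prefetched
    tf.data pipeline."""
    ds = tf.data.Dataset.from_tensor_slices((noisy_frames, clean_frames))
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)

# First 40000 frames for training, the remaining 5000 for testing:
# train_dataset = get_dataset(noisy[:40000], clean[:40000])
# test_dataset  = get_dataset(noisy[40000:], clean[40000:])
```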

Creating the Model

The code for the model architecture is as follows:

The model is purely convolutional. The goal is to learn a set of filters that help minimize the background noise, aided by residual connections that carry context from the original audio sample. The idea behind this model is derived from the SEGAN architecture [Pascual et al.] and builds on the fact that a purely convolutional network can handle inputs of multiple shapes with ease, leading to more flexibility. The convolutional structure also forces the model to focus on temporally close correlations throughout. The model can be split into two parts: the convolutions and the de-convolutions (or upsampling layers). The strided convolutional layers behave as an auto-encoder: after N layers, a reduced representation of the input is obtained. The de-convolutions then perform exactly the opposite strided procedure to recover a cleaned representation of the noisy input. The skip connections provide the required context to the de-convolution layers at every step, which improves the overall results.
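A small sketch of this encoder-decoder shape with skip connections. The filter counts, kernel sizes and depth here are illustrative stand-ins, not the exact values from the article's notebook:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(input_len=12000):
    """SEGAN-style 1-D convolutional auto-encoder with skip connections."""
    inp = layers.Input(shape=(input_len, 1))

    # Encoder: strided convolutions halve the temporal resolution.
    c1 = layers.Conv1D(16, 32, strides=2, padding='same', activation='relu')(inp)
    c2 = layers.Conv1D(32, 32, strides=2, padding='same', activation='relu')(c1)
    c3 = layers.Conv1D(64, 32, strides=2, padding='same', activation='relu')(c2)

    # Decoder: transposed convolutions mirror the encoder; skip
    # connections feed context from the matching encoder layer.
    d1 = layers.Conv1DTranspose(32, 32, strides=2, padding='same', activation='relu')(c3)
    d1 = layers.Concatenate()([d1, c2])
    d2 = layers.Conv1DTranspose(16, 32, strides=2, padding='same', activation='relu')(d1)
    d2 = layers.Concatenate()([d2, c1])
    out = layers.Conv1DTranspose(1, 32, strides=2, padding='same')(d2)

    return tf.keras.Model(inp, out)

model = build_model()
```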

The model is compiled with a mean absolute error (MAE) loss.

The choice of optimizer was difficult, as SGD, RMSprop and Adam stood pretty close. I finally went ahead with Adam because of its slightly more robust nature. Some hyperparameter tweaking gave me a pretty good learning rate of 0.002.
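The compile step then looks like this. A tiny stand-in model is defined here only so the snippet runs on its own; in the article this is the convolutional model from the previous section:

```python
import tensorflow as tf

# Stand-in for the convolutional denoiser built earlier.
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(1, 3, padding='same', input_shape=(12000, 1))
])

# Mean absolute error loss with Adam at the hand-tuned learning rate.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.002),
              loss='mean_absolute_error')

# history = model.fit(train_dataset, validation_data=test_dataset, epochs=...)
```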

The final results:

  1. Training loss: 0.0117
  2. Testing loss: 0.0117
Visible reduction of noise

The results are quite pleasing, but we are not done yet. We still have to handle the inference procedure for variable-sized inputs. We will do that next.

Handling the Variable Input Shape

Our model is trained with a very specific input shape which is dependent on our batching_size. To allow multiple shape inputs, a simple strategy will work really well.

We run our model through all the splits till the (n-1)th split. Consider the example where batching_size is 12000 and your audio array has shape (37500,). In this case we split the audio waveform into floor(37500/12000) = 3 splits. The remaining part of the array has shape (1500,). To fix this, we sample one more frame, this time from the rear end of the array, like this:

Overlapped Frames
  1. Now, we run all four splits through the model to get individual predictions.
  2. From the output predictions, we keep the first three frames as they are and clip the last frame to get only the remaining part.

At this point, some code will help with clarity.
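A sketch of the overlapped-frame inference procedure described above. Passing the model as an argument is my own choice of interface; in the article's notebook predict() closes over the trained model:

```python
import tensorflow as tf

batching_size = 12000

def predict(model, audio):
    """Denoise a 1-D waveform of any length >= batching_size by
    running fixed-size frames through the model and stitching the
    overlapped tail frame back together."""
    length = int(tf.shape(audio)[0])
    n = length // batching_size
    remainder = length - n * batching_size

    # The n full frames plus, if needed, one frame taken from the rear.
    frames = tf.reshape(audio[: n * batching_size], (n, batching_size, 1))
    if remainder > 0:
        tail = tf.reshape(audio[-batching_size:], (1, batching_size, 1))
        frames = tf.concat([frames, tail], axis=0)

    preds = tf.squeeze(model(frames), axis=-1)   # (n [+1], 12000)

    body = tf.reshape(preds[:n], (-1,))          # full frames as-is
    if remainder > 0:
        # Keep only the part of the tail frame not already covered.
        body = tf.concat([body, preds[-1, -remainder:]], axis=0)
    return body
```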

Okay but how fast is it?

%%timeit
tf.squeeze(predict(noisy_sounds[3]))
OUTPUT: 10 loops, best of 3: 31.3 ms per loop

If we specify the input shape of the model as (None,1), we can pass a variable length tensor to the model which gives even faster results. For now, we want to quantize the model for cross device compatibility.

Optimization and Creation of TFLite Model

Using the TFLiteConverter is pretty straightforward. You pass the Keras model along with an optimization strategy (the TF documentation recommends using DEFAULT only) and write the converted model to a binary file for future use.
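The conversion step in full. A tiny stand-in Keras model and the output filename are my own placeholders; in the article this is the trained denoiser:

```python
import tensorflow as tf

# Stand-in for the trained Keras denoiser.
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(1, 3, padding='same', input_shape=(12000, 1))
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # recommended default
tflite_model = converter.convert()

# Write the serialized model to a binary file for future use.
with open('noise_suppression.tflite', 'wb') as f:
    f.write(tflite_model)
```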

TFLite Model Inference

The preprocessing is similar to that of the Keras model, but since I could not find anything on batching for TFLite models (please let me know if support exists), I had to use a pythonic for loop to iterate over all the splits. The code below shows instantiation of the Interpreter and allocation of tensors, followed by invoking it to get our results.
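A sketch of the interpreter setup and per-frame loop. A stand-in model is converted in memory here so the snippet is self-contained; in the article the interpreter loads the saved .tflite file via model_path instead:

```python
import numpy as np
import tensorflow as tf

# Stand-in model, converted in memory for a self-contained example.
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(1, 3, padding='same', input_shape=(12000, 1))
])
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()

# Instantiate the Interpreter and allocate its tensors.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def predict_tflite(frames):
    """frames: array of shape (N, 12000, 1). Each frame is fed
    through the interpreter one at a time with a plain Python loop."""
    results = []
    for frame in frames:
        interpreter.set_tensor(inp['index'],
                               frame[np.newaxis].astype(np.float32))
        interpreter.invoke()
        results.append(interpreter.get_tensor(out['index'])[0])
    return np.stack(results, axis=0)
```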

Now the question arises: how much better is this than the Keras model? The answer isn't simple. Since I was unable to process all batches together, the overall inference time was affected, but the TFLite model on its own is faster than the Keras model.

%%timeit
predict_tflite(noisy_sounds[3])
OUTPUT: 10 loops, best of 3: 41.7 ms per loop

Out of all the advantages of the TFLite format, cross-platform deployment is the biggest one. The model can now be ported much more easily than the Keras model, not to mention its super small size: just 346 kB.

Plot for TFLite Model

That’s it for now!

The model can be improved further by adding filters, creating a deeper model and optimizing the pipeline, but that is for next time. As we come to the end of the article, I would like to mention some references and links.

  1. Colab Notebook for the code and audio samples: Here
  2. Dataset: Here
  3. SEGAN paper: Here

Any comments or suggestions would be much appreciated. Thanks for reading!


Darshan Deshpande

Machine Learning Engineer and a Mentor at Tensorflow UserGroup Mumbai