Addressing the Cocktail Party Problem using PyTorch

Prem Oza · Published in PyTorch · May 20, 2020

So, What is the Cocktail Party Problem?

The cocktail party effect is the phenomenon of the brain’s ability to focus one’s auditory attention (an effect of selective attention in the brain) on a particular stimulus while filtering out a range of other stimuli.

(Wikipedia)

We’ve all experienced this in our lives, usually when a lot of noise surrounds us: during lunchtime in the canteen, or while walking along a busy street full of traffic.

The cocktail party problem is about separating out the individual signals. As with many deep learning problems, the natural next question to ask is: how does the human brain do it?

Well, part of the answer is that we focus on visual cues such as the speaker’s face, movements and body language, on top of auditory attention. In this article, we try to solve the cocktail party problem using deep learning, following the paper Looking to Listen at the Cocktail Party.

The architecture of Looking to Listen at the Cocktail Party

Here is a short summary of the neural network from the paper; I recommend reading the entire paper.

The input consists of:

  • Audio
  • Faces of the speakers
Neural Network Architecture — https://arxiv.org/abs/1804.03619

The parameters for the video inputs are shared.

The total number of video inputs is a parameter that has to be chosen before constructing the network. Hence, if you train a 2-input network, you have to retrain it for, say, a 5-input network. It can be generalized, however, by training a 1-input network and simply running it once per input.

As you can see from the architecture above, the faces are converted to face embeddings using an appropriate pre-trained network. We will be using InceptionResnetV1 from facenet_pytorch.
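A minimal sketch of turning a single video frame into a face embedding with facenet_pytorch (the image path is hypothetical):

```python
from PIL import Image
import torch
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                              # face detector
resnet = InceptionResnetV1(pretrained="vggface2").eval()   # embedding network

frame = Image.open("frame_0001.jpg")                       # hypothetical video frame
face = mtcnn(frame)                                        # (3, 160, 160) cropped face, or None
if face is not None:
    with torch.no_grad():
        embedding = resnet(face.unsqueeze(0))              # (1, 512) face embedding
```

In practice, this runs on every frame of the 3-second clip, and the per-frame embeddings are stacked along the time axis before being fed to the visual stream.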

Similarly, the audio is transformed into a spectrogram. The CNNs used here employ dilated convolutions.
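Roughly, in PyTorch this step looks like the following; the STFT settings approximate the paper’s 3-second, 16 kHz clips with a 25 ms Hann window and 10 ms hop, and the channel counts are illustrative:

```python
import torch
import torch.nn as nn

waveform = torch.randn(1, 48000)                           # 3 seconds at 16 kHz
window = torch.hann_window(400)
spec = torch.stft(waveform, n_fft=512, hop_length=160,
                  win_length=400, window=window, return_complex=True)

# Real and imaginary parts become the two input channels of the audio stream.
audio_in = torch.view_as_real(spec).permute(0, 3, 1, 2)    # (1, 2, freq, time)

# A dilated 2D convolution of the kind used in the audio CNN.
dilated_conv = nn.Conv2d(2, 96, kernel_size=5, dilation=2, padding=4)
features = dilated_conv(audio_in)
```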

At the end, masks are generated and applied to the original input to separate out the individual audio streams. Finally, an inverse short-time Fourier transform (ISTFT) is used to retrieve the time-domain signal.
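A self-contained sketch of the masking and ISTFT step; the random tensors stand in for the mixture spectrogram and the network’s predicted complex mask:

```python
import torch

freq_bins, frames = 257, 301
spec = torch.randn(1, freq_bins, frames, dtype=torch.cfloat)   # mixture spectrogram
mask = torch.randn(1, freq_bins, frames, 2)                     # predicted real+imaginary mask

# Complex multiplication of mask and mixture gives one speaker's spectrogram.
separated_spec = spec * torch.view_as_complex(mask.contiguous())

# Inverse STFT (same parameters as the forward transform) recovers the waveform.
window = torch.hann_window(400)
separated_wav = torch.istft(separated_spec, n_fft=512, hop_length=160,
                            win_length=400, window=window)
print(separated_wav.shape)                                      # (1, 48000): 3 s at 16 kHz
```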

During inference, we have to chop the input into 3-second chunks and feed them to the network.
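For example:

```python
import torch

sample_rate, chunk_seconds = 16000, 3
waveform = torch.randn(1, 10 * sample_rate)                    # e.g. a 10-second recording
chunks = torch.split(waveform, chunk_seconds * sample_rate, dim=1)
# The last chunk may be shorter and can be zero-padded before being fed to the network.
```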

The main focus of the paper is to jointly use audio and visual features to better separate the input signals.

Introduction to Catalyst

We are going to use Catalyst to implement the network.

Catalyst is a high-level framework for PyTorch. It is focused on reproducibility, fast experimentation and code re-use. Here is a minimal example.

Minimal Example using Catalyst
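A minimal setup along these lines might look like this; the MNIST classifier and hyperparameters are purely illustrative:

```python
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from catalyst import dl

# Illustrative task: a linear classifier on MNIST.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

loaders = {
    "train": DataLoader(
        datasets.MNIST("./data", train=True, download=True,
                       transform=transforms.ToTensor()),
        batch_size=32,
    ),
    "valid": DataLoader(
        datasets.MNIST("./data", train=False, download=True,
                       transform=transforms.ToTensor()),
        batch_size=32,
    ),
}

# Catalyst owns the training loop: no manual epoch/batch loops.
runner = dl.SupervisedRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    loaders=loaders,
    num_epochs=1,
    logdir="./logs",
    verbose=True,
)
```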

As you can see, most of the code is there to set up the model, criterion and loaders. The main driver code is handled by Catalyst. No loops are needed, and it looks really clean!

The dl.SupervisedRunner class drives the whole process, working through the following levels in order:

  • Experiments — stores info about the task: the model, criterion, optimizer and scheduler
  • Stages — stages such as train, pretrain and infer
  • Epochs — steps repeated every epoch, e.g. the scheduler step
  • Loaders — runs through each DataLoader passed in as a dict
  • Batches — finally, the model is trained or run for inference at the batch level

For training something like MNIST, this seems straightforward. But what about our task?

As mentioned earlier, we have two inputs, unlike the more common single-input network:

  • Audio
  • Video

So, how do we pass that information to Catalyst so that it properly retrieves both inputs from the loader and passes them into the model?

Fortunately, the utils module provides a way! Let’s see how.

Handle inputs from DataLoader
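A sketch of what this can look like, assuming Catalyst’s utils.get_loader helper and a dataset whose __getitem__ returns an (audio, video, targets) tuple (all names here are hypothetical):

```python
from catalyst import utils

# train_dataset.__getitem__(i) is assumed to return
# (mixed_spectrogram, face_embeddings, clean_spectrograms).
train_loader = utils.get_loader(
    train_dataset,
    open_fn=lambda x: {"audio": x[0], "video": x[1], "targets": x[2]},
    batch_size=4,
    num_workers=2,
    shuffle=True,
)
loaders = {"train": train_loader}
```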

Here, x inside the lambda is the output of the Dataset’s __getitem__(). We assign a key to each input of the network. There’s one more thing to do, and that is informing the framework about our custom inputs.

Add input key
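One way to do that, assuming the SupervisedRunner signature of the Catalyst release current at the time of writing, is to list the custom input keys when constructing the runner:

```python
from catalyst import dl

# The runner pulls these keys from each batch dict and passes them
# to model.forward() in this order.
runner = dl.SupervisedRunner(input_key=["audio", "video"], input_target_key="targets")
```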

And that is it! It will pass the inputs to the network in the same order.

Implementation

Now, let’s use what we learned. The whole data-loading process can be summarized as follows (a rough sketch is shown after the list):

  1. Download the video from the dataset and cache it.
  2. Pre-process the audio to 16 kHz and the video to 25 fps.
  3. Synthetically mix the clean inputs from the dataset.
  4. Retrieve individual faces using a Face Detector (MTCNN).
  5. Generate Face Embedding (using InceptionResnetV1).
  6. Finally, load them using the DataLoader.
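A rough sketch of steps 3–6, assuming the clips have already been preprocessed and embedded as in the earlier snippets, and returning a tuple in the order expected by the open_fn shown above (everything here is hypothetical):

```python
import torch
from torch.utils.data import Dataset

class MixtureDataset(Dataset):
    """Mixes pairs of clean clips and returns (audio, video, targets)."""

    def __init__(self, clips):
        # clips: list of (waveform, face_embeddings) pairs, already resampled
        # to 16 kHz / 25 fps and embedded with InceptionResnetV1.
        self.clips = clips

    def _spec(self, wav):
        window = torch.hann_window(400)
        return torch.stft(wav, n_fft=512, hop_length=160,
                          win_length=400, window=window, return_complex=True)

    def __len__(self):
        return len(self.clips) - 1

    def __getitem__(self, idx):
        wav1, emb1 = self.clips[idx]
        wav2, emb2 = self.clips[idx + 1]
        mixture = wav1 + wav2                                   # step 3: synthetic mix
        return (
            torch.view_as_real(self._spec(mixture)),            # "audio": mixed spectrogram
            torch.stack([emb1, emb2]),                           # "video": face embeddings
            torch.view_as_real(torch.stack([self._spec(wav1),    # "targets": clean spectrograms
                                            self._spec(wav2)])),
        )
```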

Finally, train the whole network:

Trainer code
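A sketch of the training call, reusing the loaders and input keys from the previous snippets; AudioVisualModel, the loss and the hyperparameters are placeholders:

```python
import torch
from catalyst import dl

model = AudioVisualModel()            # hypothetical implementation of the paper's network
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

runner = dl.SupervisedRunner(input_key=["audio", "video"], input_target_key="targets")
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    loaders=loaders,
    num_epochs=10,
    logdir="./logs",
    verbose=True,
)
```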

Hence, the training code is greatly reduced, and more focus can be given to other parts of the project, such as data loading and the model itself.

Results

Finally, we can plot the resulting spectrograms:

Prediction from training set
Prediction from validation set
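For reference, a predicted spectrogram can be plotted with a few lines of matplotlib; the tensor below is a random stand-in for a real prediction:

```python
import matplotlib.pyplot as plt
import torch

spec = torch.randn(257, 301, dtype=torch.cfloat)   # stand-in for a predicted spectrogram
plt.imshow(torch.log1p(spec.abs()).numpy(), origin="lower", aspect="auto")
plt.xlabel("Time frames")
plt.ylabel("Frequency bins")
plt.title("Predicted spectrogram (log magnitude)")
plt.show()
```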

Future Work

We are still working with the Catalyst team to deliver a fast and reproducible implementation of this paper.

References

  • Source — for more information about the project.
  • Catalyst — check out the framework.
  • Looking to Listen at the Cocktail Party (https://arxiv.org/abs/1804.03619) — read the original paper.

Thank You!
