Audio Source Separation Using Non-Negative Matrix Factorization (NMF)

Zahra Benslimane
10 min read · Feb 1, 2023

01. Introduction

The main focus of this article is the task of sound source separation, which consists of isolating individual sound sources from a mixture of sounds. This can be used in several applications, such as music separation, speech separation, and speech enhancement.

In music separation, for example, we can separate the different instruments in a song, allowing the individual tracks to be edited or remixed. In speech separation, we separate different speakers in a recording, making it possible to transcribe the speech of a single speaker. Lastly, speech enhancement separates speech from background noise or music.

music separation (source)

In this article, we will introduce a simple solving approach, show how it is used for audio source separation, and implement a Python program that extracts the played notes from a short piano sequence.

02. Solving approach

This problem, also known as the Cocktail Party Problem, is mostly tackled nowadays by deep learning approaches.

Here, however, a simpler and more classical approach will be considered.

One of the most popular signal processing algorithms for this task is based on Non-Negative Matrix Factorization (NMF), a method popularized by the papers of Daniel D. Lee and H. Sebastian Seung.

The NMF algorithm seeks to factorize a given data matrix V into two non-negative unknown matrices W and H, such that the product of the two matrices approximates the original matrix: V ≈ WH.

What makes NMF different from other rank-reduction algorithms, such as PCA, is the additional constraint that W and H must be non-negative.

In the context of sound source separation:

  • The original data matrix V is typically the time-frequency representation of the audio signal, also known as a spectrogram.
  • W represents a dictionary of elementary recurring features of the data.
  • H contains the activations of the different features.

W is referred to as the dictionary matrix and H as the activation matrix.

Given V of dimension K × N, W and H are of dimensions K × S and S × N, respectively, where S is the common dimension. S can be thought of as the number of sources we wish to extract, and is referred to as the rank of the factorization.

To obtain a good reconstruction of the signal’s spectrogram, the estimates of W and H are obtained by minimizing a cost function through iterative gradient-descent updates.

This means we will look for the W and H that minimize the distance between the given data matrix V and the estimated one.
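Formally, writing D for the measure of fit (a choice discussed in the next section), applied entry-wise to the spectrogram, the problem reads:

$$\min_{W \geq 0,\; H \geq 0} D(V \mid WH), \qquad D(V \mid WH) = \sum_{i,j} d\!\left(V_{ij} \mid [WH]_{ij}\right)$$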

03. Cost functions

We can use a variety of cost functions, but in this article we will focus on a family of statistical measures called β-divergences, defined as:

$$
d_\beta(x \mid y) =
\begin{cases}
\dfrac{1}{\beta(\beta - 1)}\left(x^{\beta} + (\beta - 1)\,y^{\beta} - \beta\,x\,y^{\beta - 1}\right) & \beta \in \mathbb{R} \setminus \{0, 1\} \\[6pt]
x \log \dfrac{x}{y} - x + y & \beta = 1 \\[6pt]
\dfrac{x}{y} - \log \dfrac{x}{y} - 1 & \beta = 0
\end{cases}
$$

β-divergence (source): the limit cases β = 0 and β = 1 correspond to the Itakura-Saito (IS) and generalized Kullback-Leibler (KL) divergences, respectively.

The quadratic cost is also recovered for β = 2:

$$d_2(x \mid y) = \tfrac{1}{2}\,(x - y)^2$$

Quadratic cost: β-divergence for β = 2 (source)

04. Gradient Descent Algorithm

In mathematics, gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. [source]

For this task, we denote by θ a parameter corresponding to either W or H.

The update rule of the gradient descent algorithm looks like this:

$$\theta^{(i+1)} = \theta^{(i)} - \eta\,\nabla_\theta C\!\left(\theta^{(i)}\right)$$

General gradient descent equation

for which we require θ^(i+1) ≥ 0 at every iteration.

Each gradient of the introduced cost functions can be written as a difference of two non-negative terms, ∇θ⁺C and ∇θ⁻C:

$$\nabla_\theta C = \nabla_\theta^{+} C - \nabla_\theta^{-} C$$

Lee and Seung proposed in their paper to use an adaptive learning rate to avoid the subtraction, which could produce negative elements.

The learning rate was set to :

$$\eta = \frac{\theta^{(i)}}{\nabla_\theta^{+} C\!\left(\theta^{(i)}\right)}$$

Learning rate

When we substitute the last two equations into the general gradient descent equation, we arrive at the following update rule:

$$\theta^{(i+1)} = \theta^{(i)} \odot \frac{\nabla_\theta^{-} C\!\left(\theta^{(i)}\right)}{\nabla_\theta^{+} C\!\left(\theta^{(i)}\right)}$$

Multiplicative update rule

The updates are thus multiplicative, with a correction factor equal to the ratio of the negative and positive parts of the gradient of the criterion.

The non-negativity of the multiplying update values ensures the non-negativity of the parameters W and H along the optimization iterations.

In the next section, we will compute the gradient of the cost function with respect to W and H. It is quite math-intensive, so if you are not interested, feel free to skip it and move on to section 06 (Deriving Multiplicative Update Rules), where we obtain the final update rules for the optimization algorithm.

05. Computing the gradients

Let’s take the generalized β-divergence definition seen above, summed over all entries of the data matrix:

$$C(\theta) = D_\beta(V \mid WH) = \sum_{i,j} d_\beta\!\left(V_{ij} \mid [WH]_{ij}\right)$$

The gradient of this cost function can be computed with respect to each parameter W or H at a time, denoted θ, via the chain rule:

$$\frac{\partial C}{\partial \theta} = \sum_{i,j} \frac{\partial d_\beta\!\left(V_{ij} \mid [WH]_{ij}\right)}{\partial [WH]_{ij}}\,\frac{\partial [WH]_{ij}}{\partial \theta} = \sum_{i,j} \left([WH]_{ij}^{\,\beta-1} - V_{ij}\,[WH]_{ij}^{\,\beta-2}\right) \frac{\partial [WH]_{ij}}{\partial \theta}$$

where we used the derivative of the β-divergence with respect to its second argument, ∂d_β(x | y)/∂y = y^(β−1) − x y^(β−2), valid for every β.

We know by definition that each entry [i, j] of the estimated product matrix is the sum of the term-by-term multiplication of the i-th row of W and the j-th column of H:

$$[WH]_{ij} = \sum_{k} W_{ik}\,H_{kj}$$

The partial derivative of [WH] with respect to H, for a given W:

$$\frac{\partial [WH]_{ij}}{\partial H_{pq}} = \sum_{k} W_{ik}\,\frac{\partial H_{kj}}{\partial H_{pq}}$$

As W is considered constant, we have:

$$\frac{\partial H_{kj}}{\partial H_{pq}} = \delta_{kp}\,\delta_{jq}$$

This partial derivative is non-zero only for k = p and j = q, which translates into two Kronecker deltas at those points. The sum over k can then be removed, since all terms are zero except when k = p:

$$\frac{\partial [WH]_{ij}}{\partial H_{pq}} = W_{ip}\,\delta_{jq}$$

The gradient of the cost function with respect to H, for a given W:

$$\frac{\partial C}{\partial H_{pq}} = \sum_{i,j} \left([WH]_{ij}^{\,\beta-1} - V_{ij}\,[WH]_{ij}^{\,\beta-2}\right) \frac{\partial [WH]_{ij}}{\partial H_{pq}}$$

If we substitute the previous result into this gradient equation, the sum over j collapses onto j = q:

$$\frac{\partial C}{\partial H_{pq}} = \sum_{i} \left([WH]_{iq}^{\,\beta-1} - V_{iq}\,[WH]_{iq}^{\,\beta-2}\right) W_{ip}$$

We can then go back to a more compact matrix form (powers and the product ⊙ taken element-wise):

$$\nabla_H C = W^{\top}(WH)^{\beta-1} - W^{\top}\!\left(V \odot (WH)^{\beta-2}\right)$$

Finally, we get a gradient that can be written as a difference between two non-negative terms.

The gradient of the cost function with respect to W, for a given H:

The same process can be used to compute the gradient over W, which gives:

$$\nabla_W C = (WH)^{\beta-1} H^{\top} - \left(V \odot (WH)^{\beta-2}\right) H^{\top}$$

06. Deriving Multiplicative Update Rules

We know from section 04 that:

$$\theta^{(i+1)} = \theta^{(i)} \odot \frac{\nabla_\theta^{-} C}{\nabla_\theta^{+} C}$$

Using the above-mentioned formula and the gradients derived in the last section, we finally get the update rules of both matrices W and H (the division and powers are element-wise):

$$H \leftarrow H \odot \frac{W^{\top}\!\left(V \odot (WH)^{\beta-2}\right)}{W^{\top}(WH)^{\beta-1}}
\qquad
W \leftarrow W \odot \frac{\left(V \odot (WH)^{\beta-2}\right) H^{\top}}{(WH)^{\beta-1} H^{\top}}$$

07. Hands-on example

NMF-based audio spectral analysis (source)

Now that we have the expressions of the update rules, we can go through the different steps for extracting the audio sources by applying NMF to the spectrogram of a short piano sequence.

Unfortunately, I couldn’t embed audio into this Medium article, so I am providing the GitHub repository that goes with it, where you can run the complete code.

We will discuss the important steps to complete the task.

1. Load and display the audio signal

The audio consists of 3 consecutive different piano notes followed by the 3 notes played together.
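Below is a minimal sketch of this step. The filename piano.wav is a placeholder, and librosa is one reasonable choice for loading audio (the article’s repository may use different tools):

```python
import librosa
import matplotlib.pyplot as plt
import numpy as np

# Load the piano recording at its native sampling rate.
y, sr = librosa.load("piano.wav", sr=None)

# Plot the waveform against time.
t = np.arange(len(y)) / sr
plt.figure(figsize=(10, 3))
plt.plot(t, y)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Piano sequence")
plt.tight_layout()
plt.show()
```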

output of the code above (Image by the author)

2. Compute the Short-Time Fourier Transform of the signal

An audio recording is a signal whose frequency components vary over time. To represent such signals well, we use the Short-Time Fourier Transform (STFT), whose complex-valued coefficients provide the frequency and phase content of local sections of the signal as it evolves over time.

A spectrogram is a visual representation of the changing spectra as a function of time. Each column of a spectrogram represents a time frame of the audio, and each row represents a certain frequency.

The spectrogram corresponds to the squared magnitude of the STFT.

$$X[k, m] = \sum_{n} x[n]\, w[n - mR]\, e^{-j 2\pi k n / N}
\qquad
V[k, m] = \bigl|X[k, m]\bigr|^{2}$$

Definition of the STFT and of the spectrogram, where w is the analysis window, R the hop size, and N the FFT length (Image by the author)

We will be using this spectrogram as our data matrix V in the NMF, as it satisfies the non-negativity criterion.
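A sketch of this step; the FFT size and hop length below are illustrative choices, not values taken from the article:

```python
N_FFT = 2048  # FFT window size (illustrative choice)
HOP = 512     # hop length between frames (illustrative choice)

# Complex-valued STFT of the mixture; its phase is reused later
# when reconstructing the separated sources.
X = librosa.stft(y, n_fft=N_FFT, hop_length=HOP)

# Power spectrogram: the non-negative data matrix V fed to the NMF.
V = np.abs(X) ** 2

plt.figure(figsize=(10, 4))
plt.imshow(10 * np.log10(V + 1e-10), origin="lower", aspect="auto")
plt.xlabel("Time frame")
plt.ylabel("Frequency bin")
plt.title("Spectrogram (dB)")
plt.colorbar(label="Power (dB)")
plt.show()
```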

Spectrogram of the audio signal (Image by the author)

3. Apply the Non-Negative Matrix Factorization

The NMF algorithm is quite simple to implement once we have derived the update rules for β-divergence cost functions. The algorithm consists of 3 steps:

  1. Randomly initialize the W and H matrices with non-negative values.
  2. Update W and H alternately, considering the other one fixed.
  3. Repeat step 2 until the convergence threshold is hit.

Define a function that computes the 𝛽 Divergence
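Here is a minimal sketch of such a helper, following the β-divergence definition from section 03. The small EPS guard against divisions by zero and logs of zero is a numerical convenience, not part of the mathematical definition:

```python
import numpy as np

EPS = 1e-10  # numerical guard, not part of the beta-divergence definition

def beta_divergence(V, V_hat, beta):
    """Total beta-divergence D_beta(V | V_hat), summed over all entries."""
    V = V + EPS
    V_hat = V_hat + EPS
    if beta == 0:  # Itakura-Saito divergence
        ratio = V / V_hat
        return np.sum(ratio - np.log(ratio) - 1)
    if beta == 1:  # generalized Kullback-Leibler divergence
        return np.sum(V * np.log(V / V_hat) - V + V_hat)
    # General case; beta = 2 gives the quadratic (Euclidean) cost.
    return np.sum(V**beta + (beta - 1) * V_hat**beta
                  - beta * V * V_hat**(beta - 1)) / (beta * (beta - 1))
```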

Helper function to compute the cost for a given 𝛽

Define the main NMF function
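A sketch of the main loop, implementing the multiplicative updates derived in section 06; the stopping test on the relative decrease of the cost is one reasonable choice among several:

```python
def nmf(V, S, beta=2, n_iter=5000, tol=1e-8, seed=0):
    """Factorize V (K x N) into W (K x S) @ H (S x N) with multiplicative updates."""
    rng = np.random.default_rng(seed)
    K, N = V.shape
    # Step 1: random non-negative initialization.
    W = rng.random((K, S)) + EPS
    H = rng.random((S, N)) + EPS
    cost = [beta_divergence(V, W @ H, beta)]
    # Step 2: alternate the multiplicative updates of H and W.
    for _ in range(n_iter):
        WH = W @ H + EPS
        H *= (W.T @ (V * WH**(beta - 2))) / (W.T @ WH**(beta - 1) + EPS)
        WH = W @ H + EPS
        W *= ((V * WH**(beta - 2)) @ H.T) / (WH**(beta - 1) @ H.T + EPS)
        cost.append(beta_divergence(V, W @ H, beta))
        # Step 3: stop once the relative improvement falls below the threshold.
        if abs(cost[-2] - cost[-1]) < tol * abs(cost[-2]):
            break
    return W, H, cost
```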

Running the NMF algorithm on the data matrix.
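For instance, with S = 3 sources (the article does not state which β was used, so β = 1, the KL divergence, is an assumption here):

```python
S = 3  # number of sources to extract
W, H, cost = nmf(V, S=S, beta=1, n_iter=5000)

# Plot the loss curve over the iterations.
plt.figure(figsize=(6, 3))
plt.plot(cost)
plt.xlabel("Iteration")
plt.ylabel("Beta-divergence")
plt.show()
```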

Visualizations

In the next figure, we can see a plot of W, H, and their product WH, which represents an approximation of the data matrix V.

Left: random initialization of the W and H matrices.

Right: the last iteration; updating stopped after 5000 iterations. The convergence threshold wasn’t hit because it was set very low, but we can see that the cost function has dropped drastically.

(Image by the author)
Loss function in terms of the number of iterations (Image by the author)

Now, if we look closely at the activation matrix H, we can see that the algorithm has successfully distinguished the three consecutive notes as different audio sources (it separated the mixture into 3 sources because we set S, the common dimension of W and H, to 3).

(Image by the author)

4. Filtering the different audio sources

The spectrogram of each extracted sound source k can be calculated by multiplying the k-th column of W with the k-th row of H.
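A sketch of this step, reusing the W and H estimated above:

```python
# Rank-1 spectrogram of each source: outer product of the k-th
# column of W (spectral template) and the k-th row of H (activations).
source_spectrograms = [np.outer(W[:, k], H[k, :]) for k in range(S)]
```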

spectrograms representing the extracted sources (Image by the author)

5. Reconstructing the sources’ audio signals

The phase of each source is recovered by simply reusing the phase of the original mixture, obtained from the STFT, and applying it to the filtered sources.

The three audio source signals are then reconstructed with the inverse STFT.
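A sketch of the reconstruction, assuming V was a power spectrogram (hence the square root back to magnitudes) and using soundfile, one common choice, to write the WAV files:

```python
import soundfile as sf

phase = np.angle(X)  # phase of the original mixture STFT

for k, V_k in enumerate(source_spectrograms):
    mag_k = np.sqrt(V_k)              # back from power to magnitude
    X_k = mag_k * np.exp(1j * phase)  # attach the mixture phase
    y_k = librosa.istft(X_k, hop_length=HOP, length=len(y))
    sf.write(f"source_{k}.wav", y_k, sr)
```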

Extracted audio sources (Image by the author)

One of the challenges in applying NMF to sound source separation is determining the appropriate number of sound sources. Several strategies exist to address this issue; one popular method is an iterative algorithm that starts with a small number of sound sources and increases it until the separation is satisfactory.

In our case, if S, the number of sources to extract, is set to 4, we are also able to extract the sound of the piano’s hammer action (source 2, in green below).

Extracted piano sources and the hammer action sound (Image by the author)

In conclusion, Non-Negative Matrix Factorization (NMF) is a powerful sound source separation technique that can extract individual sound sources from a mixture of sounds. It is well suited to audio because it enforces non-negativity, a natural property of magnitude spectrograms.

To be able to listen to the reconstructed signal, check out my GitHub repository, where you can find the complete code.


Thanks for reading my first post until the end.
