An attempt at Speaker Diarisation

Pipe Runner
Published in Project Heuristics
Apr 9, 2020

I recently went on to blabber about feature extraction and speaker diarisation at a little meetup we had here at pyDelhi (a Python users' meetup based in Delhi, India). Despite the talk being an epic failure (because everyone was tired and no one really wanted to know how sounds are produced and dissected), I thought of writing a little summary article so as to hammer the final nail into the coffin.

The article comes packed with my presentation slides, two Kaggle kernels and a video link of the presentation (the recording quality is horrible, so I am not sure if it is going to be of any help). Anyway, enjoy.

Note: As this is a summarized version of the whole workshop, please don't expect all the nitty-gritty details that you would expect from a well-documented blog. But rest assured, everything you need to understand the topic is right here in this article.

TL;DR

If you are less of a words guy and more of a code guy, then just head straight to the kernel and follow along. The code is reasonably commented.

So what is speaker diarisation?

Well, it is just a fancy way of saying "segmenting an audio clip into labeled segments based on who is speaking when". This is different from source separation: in speaker diarisation, if two people speak at the same time, the method does not guarantee a well-defined output, while in source separation the two voices can be extracted out. So it goes without saying that these are two different problem statements.

The tools…

The tools used for the task are fairly simple and can be segregated based on two major sub-tasks that we will be performing.

1. Data extraction and pre-processing

Here we deal directly with audio files, and the best tool that I have found for this is Librosa. Earlier I used PyAudio, but I did not have a great time using it.

You will also need to install FFmpeg, which helps Librosa deal with audio files better. As this happens to be a system-level package and not a pip package, you will need to install it like this:

!apt install -y ffmpeg #needed by librosa to function correctly

Running this command in the kernel itself will do the job.

2. Model creation and mathematics

For dealing with most of the math and data manipulation we have numpy and pandas at our disposal.

For creating complex machine learning models we have tensorflow to help us out.

Steps to follow

We start with this:

The steps we will follow in order to achieve our goal are as follows:

  • Break the target audio clip into equal-sized chunks, each spanning an equal amount of time. After this we will have an array of equal-sized chunks of audio data (see the sketch after this list).
  • For each chunk, we apply an extensive set of feature extraction techniques and create a feature vector for each chunk. Now what we have is a matrix where each row is a feature vector.
  • Now we apply a clustering algorithm that will cluster each of these chunks into well-defined groups.
  • The chunks that belong to the same group can now be color coded, and thus we will end up with a segmented audio clip based on whoever is speaking in that particular chunk.
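To make the first step concrete, here is a minimal sketch of how the chunking could look with Librosa. The file name and the chunk duration are placeholders of my own, not the exact values used in the kernel.

import numpy as np
import librosa

# Load the audio clip; sr is the sampling rate (file name is a placeholder).
signal, sr = librosa.load("conversation.wav", sr=None)

# Split the signal into equal-sized chunks, e.g. 0.5 s each (assumed value).
chunk_duration = 0.5                   # seconds per chunk
chunk_size = int(chunk_duration * sr)  # samples per chunk

# Drop the trailing partial chunk so every chunk spans the same amount of time.
n_chunks = len(signal) // chunk_size
chunks = signal[:n_chunks * chunk_size].reshape(n_chunks, chunk_size)

print(chunks.shape)  # (number of chunks, samples per chunk)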

Following the steps above we will (hopefully) end up with this:

Understanding the Data

As easy as it may appear, feature extraction is not as simple as people think it to be. It is anything but calling some random library APIs on data and getting a set of ready-to-use feature vectors. Always remember:

“The power of an algorithm comes from the data it works on”

In our case, we are working with audio data and there are tons of features embedded in this simple looking wave plot.


To understand what a wave plot actually is, I would suggest you go through these slides: https://docs.google.com/presentation/d/1BSJzd6W5niJKA99Rf8uHhsN4MCjB_ux1djPRPP9XTKY/edit?usp=sharing

Now, let’s begin with something simple and move up the ladder to some advanced ones that we will actually make use of. I will attach relevant links for further reading.

1. Periodogram (for visualization only)

The wavy diagram that you usually see on your Spotify app is the wave plot of the song you are listening to. In a wave plot you see the amplitude changing over a period of time. Now if you apply a DFT (Discrete Fourier Transform) on the audio clip, you'll end up with the same audio clip but in the frequency domain. What this means is that you'll lose all information related to time, but now you'll see the individual frequencies of the simple component sine waves that make up the whole complex wave.

The reason why we call it the frequency domain is because we have a plot that has frequency on the x-axis and amplitude on the y-axis. What you originally had was time on x-axis and amplitude on the y-axis thus making it the time domain.

The view from the frequency domain is what we call the periodogram.
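To make this concrete, here is a minimal sketch of computing a periodogram with NumPy's real FFT. The file name is a placeholder, and this is not the exact code from the kernel.

import numpy as np
import librosa
import matplotlib.pyplot as plt

signal, sr = librosa.load("conversation.wav", sr=None)  # placeholder file name

# DFT of the whole clip: time information is lost, frequency content remains.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
power = np.abs(spectrum) ** 2 / len(signal)  # power per frequency bin

plt.plot(freqs, power)
plt.xlabel("Frequency (Hz)")
plt.ylabel("Power")
plt.show()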

2. Spectrogram (for visualization only)

This visualization is generated by taking small regular sized windows of the whole audio timeline and applying DFT on each of those windows. Stacking them horizontally with a bit of post processing would lead to a Spectrogram.

This gives you a nice-looking visualization of the whole audio clip with the frequency and time information intact. This, in fact, is the key difference between a spectrogram and a periodogram.
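In code, this windowed DFT is exactly what Librosa's short-time Fourier transform does. A minimal sketch follows; the window and hop sizes are assumptions, not the kernel's settings.

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

signal, sr = librosa.load("conversation.wav", sr=None)  # placeholder file name

# STFT = DFT applied to small, regularly spaced windows of the signal.
stft = librosa.stft(signal, n_fft=2048, hop_length=512)
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

librosa.display.specshow(spectrogram_db, sr=sr, hop_length=512, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.show()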

These two may give you a lot of clues about the audio data visually but for an algorithm to work you need numbers. So now we will talk about the advanced features that I made use of in my code.

3. Spectral Roll-Off

This feature is basically the frequency bin below which 85% of the spectral energy resides. (Computers don't see things as continuous values; they work with discrete versions of them. In our case, the frequency axis is split into discrete bins.)

In terms of the periodogram, this is a measure of the right-skewedness of the power spectrum.
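Librosa exposes this feature directly, with the 85% threshold as the default roll_percent. A minimal sketch, assuming the same placeholder file as before:

import librosa

signal, sr = librosa.load("conversation.wav", sr=None)  # placeholder file name

# One roll-off frequency per frame; 85% of the spectral energy lies below it.
rolloff = librosa.feature.spectral_rolloff(y=signal, sr=sr, roll_percent=0.85)
print(rolloff.shape)  # (1, number of frames)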

4. Spectral Centroid

This is basically the weighted mean of the frequency bins per frame of the spectrogram. The more dominant a frequency is in a frame, the more it pulls the centroid towards itself.
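Again, a minimal Librosa sketch (file name assumed):

import librosa

signal, sr = librosa.load("conversation.wav", sr=None)  # placeholder file name

# Magnitude-weighted mean of the frequency bins, one value per frame.
centroid = librosa.feature.spectral_centroid(y=signal, sr=sr)
print(centroid.shape)  # (1, number of frames)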

5. Zero Crossing Rate

The zero crossing rate tells us the rate at which the audio wave changes its sign from positive to negative and vice versa. It is particularly useful for distinguishing between percussion instruments.
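And the corresponding Librosa call, sketched under the same assumptions as above:

import librosa

signal, sr = librosa.load("conversation.wav", sr=None)  # placeholder file name

# Fraction of sign changes per frame of the raw waveform.
zcr = librosa.feature.zero_crossing_rate(signal)
print(zcr.shape)  # (1, number of frames)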

6. Mel Frequency Cepstral Coefficients (MFCC)


The following video is an absolute gem…

For a more text-oriented description, read the following two links…

Assuming that you have seen the video provided above, let's talk about a crucial decision you'd want to make here. By now you should know that MFCC is a set of 40 or more coefficients that provide a nice way of describing audio data as numbers. For speech-to-text applications, though, only the first 20 coefficients are chosen; these form what is called the spectral envelope. This is a crucial piece of information that should give you a hint as to why you should not use the spectral envelope for speaker detection.

The reason the spectral envelope works for speech-to-text is that no matter who is speaking, the phonemes that two distinct individuals utter will form similar spectral envelopes. Thus the first 20 coefficients are effectively speaker independent. The coefficients from 20 upwards are called the spectral details, and they contain the information about pitch and overtones. So to actually make use of MFCC for speaker diarisation, we will choose coefficients 11 to 40 (even though using all 40 will do you no harm, picking 11 to 40 is a good bit of feature selection that you can do from your end).
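As a rough sketch of that selection with Librosa (file name assumed, and the 0-indexed slicing is my interpretation of "coefficients 11 to 40"):

import librosa

signal, sr = librosa.load("conversation.wav", sr=None)  # placeholder file name

# Compute 40 coefficients per frame, then keep the "spectral detail" part.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)
mfcc_detail = mfcc[10:, :]  # coefficients 11 to 40 (0-indexed rows 10..39)
print(mfcc.shape, mfcc_detail.shape)  # (40, frames) (30, frames)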

So far…

With all the features extracted, let's take a peek at our feature vectors.

Each row is a feature vector of an audio frame (showing the first 5 samples)

So we are all ready with the data, now we shall take a look at the algorithm that we shall apply on the data.

The Clustering Algorithm

In the last article, we worked with K-Means, which does a decent job at clustering, but whether or not it succeeds depends entirely on the initialization of the centroids.

Thus the algorithm becomes really unpredictable and may sometimes end up with unacceptable outputs. To step up our game, let's look into an improved version of K-Means: K-Means++.

The improvement is extremely simple and, as you might have guessed, targets the step of centroid initialization. But this drastically improves the stability of the algorithm and the odds of convergence.
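The seeding trick itself fits in a few lines. Here is a minimal NumPy sketch of K-Means++ initialization (not the kernel's exact implementation): each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen.

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids from the rows of X using K-Means++ seeding."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform at random
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen centroid.
        diff = X[:, None, :] - np.asarray(centroids)[None, :, :]
        d2 = np.min((diff ** 2).sum(axis=-1), axis=1)
        # Sample the next centroid with probability proportional to that distance.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centroids)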

To understand the implementations of K-Means and K-Means++, and to grasp the difference between the two more clearly, I suggest you go through this Kaggle kernel.

Understanding the output

The expected outcome of the program is an array of labels, each of which corresponds to a cluster index. So, ideally, audio frames that contain the same speaker will end up with the same label.

Each number here is a label generated via K-Means++

I wrote a little color coding function to put colored boxes on the wave plot that we started with. Thus we end up with our final output.
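The original function lives in the kernel; a rough equivalent, assuming equal-sized chunks and one colour per cluster label, could look like this:

import librosa
import librosa.display
import matplotlib.pyplot as plt

def plot_diarisation(signal, sr, labels, chunk_duration=0.5):
    """Overlay a coloured box on the wave plot for each chunk, keyed by its cluster label."""
    colors = plt.cm.tab10.colors                 # up to 10 distinct speakers
    librosa.display.waveshow(signal, sr=sr)      # waveplot() in older Librosa versions
    for i, label in enumerate(labels):
        start = i * chunk_duration
        plt.axvspan(start, start + chunk_duration, color=colors[label % len(colors)], alpha=0.3)
    plt.xlabel("Time (s)")
    plt.show()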

But wait… This isn’t perfect

There is a good reason why we were not able to perform accurate diarisation. The complexity lies in the type of data we are dealing with, and the explanation is beyond the scope of this article. If you are interested in knowing more, go ahead and watch this video by Google.
