Understanding AI: Who Said What, When?

All about speaker diarization and how to use it practically.

Ekohe · Jan 10, 2023

Table of contents:

· Speaker Diarization in Automatic Speech Recognition
· What is Speaker Diarization?
· Practical Speaker Diarization with Pyannote
· Pyannote Features
· Voice Activity Detection
· Overlapped Speech Detection
· Getting to the Point: Speaker Diarization with ASR
· Conclusion
· References

Speaker Diarization in Automatic Speech Recognition

Automatic Speech Recognition (ASR) is the ability of a machine to transform audio data into a machine- (or human!) readable format. ASR systems generally achieve this by combining three types of algorithms: acoustic modeling, language modeling, and pronunciation modeling.

Acoustic modeling in ASR is in charge of identifying the individual units of speech (such as phonemes) and their relationship with audio signals.

Language modeling in ASR is in charge of looking for the sequence of words that best fits the identified units of speech, as many different words can come from the same sounds.

Pronunciation modeling in ASR provides a mapping between a conventional symbolic transcript of speech and an acoustically/phonetically motivated one.

However, when it comes to the primary task of transcribing audio, none of these three processes can solve an important issue on its own. If a conversation between several people is being transcribed, these three steps will not tell us who is saying what. This is where speaker diarization comes into play.

What is Speaker Diarization?

To “diarize” means to take notes, in essence keeping track of an event in a diary. Speaker diarization is, thus, nothing more than keeping records of the spoken event, to answer the key question “who said what, when?”.

Speaker diarization logs speaker-specific events on multi-participant (or multi-speaker) audio data. Throughout the diarization process, the audio is divided and clustered into groups of speech segments that share the same speaker identity label. As a result, other salient events, such as non-speech/speech transitions or speaker changes, can also be detected automatically.
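Conceptually, the end product is nothing more than a list of labelled time segments. A made-up example of what that looks like:

# Illustrative (made-up) diarization output: "who spoke when" as labelled time segments
diarization = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 12.4},
    {"speaker": "SPEAKER_01", "start": 12.4, "end": 27.9},
    {"speaker": "SPEAKER_00", "start": 27.9, "end": 31.2},
]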

In general, this process does not require any prior knowledge of the speakers, such as their real identities or the number of participants in the audio. Because it separates audio by these speaker-specific events, diarization is of great use for indexing or analyzing many types of audio data, such as:

  • recordings of meetings, conferences, or lectures
  • broadcast and pre-recorded content from media outlets
  • court proceedings
  • call-center recordings
  • and many more

Traditionally, a speaker diarization system is a complex series of independent modules:

  1. Front-end processing: audio data is pre-processed to mitigate artifacts in the audio signal, such as noise, reverberation, and background music. In addition, other techniques such as speech enhancement or target speaker extraction can be employed.
  2. Speech Activity Detection (SAD): the audio data is segmented into speech and non-speech segments.
  3. Segmentation: the speech segments are further divided into speech event segments.
  4. Speaker Embedding: the raw signals of the speech segments are converted into embedding vectors, or acoustic features.
  5. Clustering: the embedding vectors are clustered into groups of speech segments with the same speaker identity label.

The whole system outline can be seen below:

Diarization System Overview
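To make these stages concrete, below is a toy, heavily simplified sketch of the classic pipeline (an illustration only, not how Pyannote implements it). It skips front-end processing, detects speech with a simple energy threshold, segments the audio into fixed one-second windows, uses mean MFCCs as stand-in embeddings, and clusters them with k-means; scikit-learn is assumed here only for the toy clustering step.

# Toy illustration of the classic multi-stage pipeline (not how Pyannote does it)
import numpy as np
import librosa as lr
from sklearn.cluster import KMeans

def toy_diarize(waveform, sample_rate, n_speakers=2, win_sec=1.0, energy_threshold=0.01):
    win = int(win_sec * sample_rate)
    segments, embeddings = [], []
    # crude SAD + fixed-length segmentation + mean-MFCC "embeddings"
    for start in range(0, len(waveform) - win, win):
        chunk = waveform[start:start + win]
        if np.sqrt(np.mean(chunk ** 2)) < energy_threshold:  # skip quiet (non-speech) windows
            continue
        mfcc = lr.feature.mfcc(y=chunk, sr=sample_rate, n_mfcc=20)
        embeddings.append(mfcc.mean(axis=1))  # one feature vector per window
        segments.append((start / sample_rate, (start + win) / sample_rate))
    # cluster the embeddings into speaker groups
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(np.stack(embeddings))
    return [{"speaker": f"SPEAKER_{int(label):02d}", "start": s, "end": e}
            for (s, e), label in zip(segments, labels)]

Calling toy_diarize(waveform, sample_rate, n_speakers=4) on a recording gives a rough, noisy speaker map at best; Pyannote's pre-trained pipelines, which we turn to next, do all of this far more robustly.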

Practical Speaker Diarization with Pyannote

Since the process of speaker diarization is so complex, it is not surprising that there have been several developments in both Machine Learning and Deep Learning to make it more efficient. One of the most recent and promising of these is the Pyannote library.

Pyannote is a collection of state-of-the-art speaker diarization and speech activity detection models, as well as a set of tools to train and evaluate these models. Pyannote is built on top of PyTorch, a Deep Learning framework that provides several high-level features.

In order to demonstrate these features in a situation where several speakers are present, we will be using this news report, in which speakers take turns discussing recent developments in geopolitics. However, you can follow along with any audio file you want, as long as it is correctly formatted.

This use case provides an illustrative real-world example of how speaker diarization can be used to analyze audio data, as it involves several speakers with different accents and some overlapping speech.

Pyannote Features

To start, we will go over the capabilities of Pyannote, and afterwards we will learn how to use it in combination with ASR, in order to generate a transcript of an audio file that correctly assigns each speaker to each sentence.

The dependencies for this article are as follows:

  • pyannote.audio
  • torch
  • pandas
  • transformers
  • huggingface_hub
  • librosa

We can start by simply visualizing the file we’re working with:

import matplotlib.pyplot as plt
import librosa as lr
audio_filepath = "news_article.wav"
waveform, sample_rate = lr.load(audio_filepath, sr=16000)
sample_rate
16000
def plot_waveform(speech, sample_rate):
    X = [i / sample_rate for i in range(len(speech))]
    plt.plot(X, speech)
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.title("Waveform")
    plt.show()

plot_waveform(waveform, sample_rate)
Audio File Waveform

We have a few minutes of audio, but without any additional information it is difficult to tell speech from silence, and impossible to tell speakers apart from the waveform alone. However, Pyannote can help us with that.

For starters, we should be able to distinguish when someone is saying anything at all with Voice Activity Detection (VAD).

Voice Activity Detection

VAD consists of detecting speech regions in a given segment of audio. This is a very useful tool when working with audio data, as it allows us to filter out the non-speech segments, and focus on the speech segments if we so desire.

import json
from itertools import groupby
import pandas as pd
import torch
from huggingface_hub import HfApi
from pyannote.audio import Pipeline
from pyannote.core import Annotation, Segment
from pydub import AudioSegment
from transformers import HubertForCTC, Wav2Vec2Processor

pipeline = Pipeline.from_pretrained("pyannote/voice-activity-detection")
output = pipeline(audio_filepath)
output
Voice Activity Detection Output

Pyannote’s pre-trained VAD model detects and visualizes the points of the audio file where speech (from any speaker) is present. This lets us filter out the non-speech segments before any further processing.
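The pipeline returns a pyannote Annotation object; a small sketch of turning it into plain start/end times (for instance, to crop out silence) might look like this:

# List the detected speech regions as start/end times in seconds
for speech in output.get_timeline().support():
    print(f"speech from {speech.start:.1f}s to {speech.end:.1f}s")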

Overlapped Speech Detection

Another useful tool is Overlapped Speech Detection (OSD). OSD detects speech segments that overlap with other speech segments, which is useful when we want to know when two or more speakers are talking at the same time. How to handle these overlapping segments is ultimately up to the user: for example, one can split the audio into segments where only one speaker is talking at a time, keep the overlapping segments as they are, or simply skip ASR on them.

The OSD pipeline can be used in the following way:

pipeline = Pipeline.from_pretrained("pyannote/overlapped-speech-detection")
output = pipeline(audio_filepath)
output
Detected Speech Overlap

Since this is a news report, there is very little overlapping speech. But with situations such as podcasts, call center recordings, or even meetings, there will usually be much more overlap.
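As a quick sanity check, the same kind of Annotation output can be used to measure how much of the recording is overlapped speech (a small sketch):

# Total duration of the detected overlapping-speech regions, in seconds
overlap_timeline = output.get_timeline().support()
total_overlap = sum(segment.duration for segment in overlap_timeline)
print(f"{total_overlap:.1f} seconds of overlapping speech detected")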

Getting to the Point: Speaker Diarization with ASR

Now that we have seen some of the capabilities of Pyannote, we can make use of its Speaker Diarization feature and combine it with ASR to generate a transcript of the audio file that correctly assigns each speaker to each sentence. To do so, we will also use a pre-trained ASR acoustic model based on the wav2vec 2.0 architecture, called HuBERT.

The first thing we will do is convert the audio file into the input format expected by the ASR model. To do so we will use the Wav2Vec2Processor from the transformers library.

# load model & audio and run audio through model
processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft").cuda()
input_values = processor(waveform, return_tensors="pt", sampling_rate=sample_rate).input_values.cuda()

One thing to keep in mind, however, is that these models use a lot of VRAM and raw compute, so we want to run them on CUDA. If you don’t have a GPU, you can use the Google Colab service, which provides free GPU access. Either way, we will split the audio file into 10 chunks to avoid running out of memory and to speed up the process. The number of chunks was chosen based on the known duration of the content; for reference, transcribing around 90 seconds at a time with this architecture requires around 12 GB of VRAM.

def return_n_chunks_of_tensor(tensor, n):
    return torch.chunk(tensor, n, dim=1)

input_chunks = return_n_chunks_of_tensor(input_values, 10)
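A quick, optional sanity check that the chunks together still cover the whole input:

# ten chunks whose lengths sum back to the original number of samples
print(len(input_chunks), sum(chunk.shape[1] for chunk in input_chunks) == input_values.shape[1])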

We can then move on to process and concatenate the logits for each chunk. This will give us the final logits for the whole audio file.

logits = None

for chunk in input_chunks:
    with torch.no_grad():
        if logits is None:
            logits = model(chunk.cuda()).logits.cpu()
        else:
            logits = torch.cat((logits, model(chunk.cuda()).logits.cpu()), dim=1)

To get the predicted_ids (which can then be decoded into text), we take the maximum value of the logits at each time step. The predicted IDs can then be decoded into text using the decode method of the Wav2Vec2Processor.

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0]).lower()
transcription
"lebanon and israel have ended a long standing dispute over their shared maritime border the two countries are still formerly at war so leaders signe the agreement separately both nations hope to benefit from mineral resources within the formerly disputed area  wstanie kramer report explains now what the dispute was about the mediterranean sea of the coast between israel and lebanon these were contested waters both countries have long been locked in dispute over where the maritime border lies behind the scenes negotiations have gone on for several years now israel and libanon have agreed on a maritime border deal mediated by the united states this gamsho disagreement strengthens israel's security and our freedom of action against hesbla and the threats to our north there is rare consensus in the security establishment regarding the necessity of this agreement the dispute is about a relatively small triungle shaped area with each side claiming their part as exclusive economic zone the areas expected to rich in offshore gas israel and libanon have a long history of conflict the two countries fought a war in two thousand and six and there have been many security incidents between israel and the libanese sheite militant group hasbola since parts of the country's land border the blue line a demarcation line by the u n is also disputed whether the maritime border deal could be a step towards a wider peace agreement is unclear but the deal paves the way to morgaexploration a potential economic benefit for both countries while our correspondents in bayroot and jerusalem have been following developments rebecca wittas in jerusalem told me what this deal means for israel well fill this deal has been in the making for more than a decade there've been numerous rounds of negotiations all of which have failed until a couple of weeks ago when a deal was finally reached between the two sides and we have now seen it signed in by michel arun the lebonese president nd the caretaker prime minister herein israelalapede now all science and in fact it was a u s broker deal all parties involved are calling this deal historic and let's not forget that these two countries as you rightly mentioned are still technically at war er they have no diplomatic relations in fact lebanon doesn't even recognize isr israel as a sovereign state so the fact that two countries in this situation could sign a deal like this maritime boarder deal is being heralded as a historic step er the benefits will be of course for both sides and very far reaching for israel er for example of course thes political and diplomatic benefits and there's also the economic benefits and security guarantees now israel has long wanted to explore these gas wells that it is considered for a long time on its side of the maritime border now er formally agreed inside its maritime border it knows that those gas fields er have a lot of gas in them and its long wanted to extract them but it's been under threat by the eranbak hesbla on the lebanon side that if it were to explore those fields  or fact extract gas out of them without such a deal as we're seeing to day that it would come under threats and would be attacked now of course israel is free to go and explore those fields and extract the gas and in fact it's already doing so yesterday a energy on the company that is drilling on the israelie side  they announce that the gas was already flowing from one of those wells and that it would soon even be able to start delivering to its partners in the next couple 
of days in fact now that is going to be met  with smiles by e ou leaders who are desperately trying to bridge the gap left by turning off the tap frorussia since the invasion of eukran so you know the benefits from this deal agan o be very far reaching for israel for one though it really says that this is a tacit agreement by a sworn enemy in fact yalaped said words to that effect as he was signing the document as you rightly pointed out that's not exactly how it's seen in lebanon but none theless this is a very significant dealfil oh i thank you for that er rebecca mohommed traiter in bayroot so how does lebanon see this deal well i it's a difference it's a different ea perspectiv definitely this morning the lebanese president a michel laon said that the agreement is purely technical and does not have any political implications or effects that contradicts lebanon's foreign policy and a relations  with other states the two states are still technically at war however the agreement is expected to bring some stability to the area and opens the way a  for offshore energy exploration as it removes a main source of potential conflict between a israeel and a lebanon mainly  the iranian backed heavily group heavily armed group e lebanese hesbola lebanese officials are hoping that disagreement helps e elevate lebanon's economic crisis  the country's economy has been in freefall for three years now  the exploration of hydrocarbons is a huge deal for lebanon as a a significant discovery could help easily a country's stilfling financial a crisis o"

We can now start processing our data to match it with the timestamps of the audio file. For this we will use the shape of the input and divide it by the sample rate (16kHz) to get the duration of the audio file in seconds.

# this is where the logic starts to get the start and end timestamp for each word
words = [w for w in transcription.split(' ') if len(w) > 0]
predicted_ids = predicted_ids[0].tolist()
duration_sec = input_values.shape[1] / sample_rate

ids_w_time = [(i / len(predicted_ids) * duration_sec, _id) for i, _id in enumerate(predicted_ids)]
# remove entries which are just "padding" (i.e. no characters are recognized)
ids_w_time = [i for i in ids_w_time if i[1] != processor.tokenizer.pad_token_id]
# now split the ids into groups of ids where each group represents a word
split_ids_w_time = [list(group) for k, group
                    in groupby(ids_w_time, lambda x: x[1] == processor.tokenizer.word_delimiter_token_id)
                    if not k]
assert len(split_ids_w_time) == len(words) # make sure that there are the same number of id-groups as words

Having created the lists of words as well as the ids with the corresponding time stamps, we can start determining when each word starts and ends.

word_start_times = []
word_end_times = []

for cur_ids_w_time, cur_word in zip(split_ids_w_time, words):
    _times = [_time for _time, _id in cur_ids_w_time]
    word_start_times.append(min(_times))
    word_end_times.append(max(_times))

words[:9], word_start_times[:9], word_end_times[:9]
(['lebanon',
'and',
'israel',
'have',
'ended',
'a',
'long',
'standing',
'dispute'],
[0.5604111660671462,
1.0807929631294964,
1.2809398081534773,
1.6011747601918465,
1.7612922362110313,
2.001468450239808,
2.1015418727517985,
2.3417180867805754,
2.7820411458333334],
[0.9206754871103118,
1.1408370166366906,
1.5411307066846522,
1.6812334982014387,
1.921409712230216,
2.001468450239808,
2.261659348770983,
2.701982407823741,
3.182334835881295])

With every word of the transcript timestamped, we can now use the Pyannote Pipeline to generate the speaker diarization. This will give us the speaker labels for each timestep of the audio, which can be aligned with the timestamped transcript to generate a transcript with the speaker labels and answer the question “Who said what, when?”.

available_models = [m.modelId for m in HfApi().list_models(filter="pyannote")]
available_models
['julien-c/voice-activity-detection',
'pyannote/TestModelForContinuousIntegration',
'pyannote/embedding',
'pyannote/overlapped-speech-detection',
'pyannote/segmentation',
'pyannote/speaker-diarization',
'pyannote/speaker-segmentation',
'pyannote/voice-activity-detection',
'AMITKESARI2000/pyannote_SD1',
'philschmid/pyannote-speaker-diarization-endpoint',
'pyannote/brouhaha',
'anilbs/pipeline',
'anilbs/segmentation',
'philschmid/pyannote-segmentation',
'tawkit/phil-pyannote-speaker-diarization-endpoint']
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
output = pipeline("news_article.wav")
output
Diarization result for the file
speaker_timelines = [output.label_timeline(speaker) for speaker in output.labels()]
overlaps = output.get_overlap()
overlap_segments = overlaps.segments_list_
overlaps
Overlaps found in speech for the file
def get_speaker_zones(annotation):
    speaker_zones = []
    current_speaker = None
    start = None
    end = None
    for time, _, speaker in annotation.itertracks(yield_label=True):
        if current_speaker is None:
            current_speaker = speaker
            start = time.start
            end = time.end
        elif speaker != current_speaker:
            speaker_zones.append({'speaker': current_speaker, 'start': start, 'end': end})
            start = time.start
            end = time.end
            current_speaker = speaker
        else:
            end = time.end
    speaker_zones.append({'speaker': current_speaker, 'start': start, 'end': end})
    return speaker_zones
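The next step works on a version of the diarization with the overlapping regions removed. A minimal way to build such an output_without_overlaps annotation (a sketch, assuming pyannote.core's Timeline.gaps and Annotation.crop behave as documented) is to crop the diarization to the parts of the timeline where no overlap was detected:

# Restrict the diarization to regions where no overlapping speech was detected.
# This is a deliberately simple way of handling overlap; see the discussion at the end.
non_overlap_regions = overlaps.gaps(output.get_timeline().extent())
output_without_overlaps = output.crop(non_overlap_regions, mode="intersection")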


speaker_zones = get_speaker_zones(output_without_overlaps)


def create_annotation_from_speaker_zones(speaker_zones):
    annotation = Annotation()
    for zone in speaker_zones:
        annotation[Segment(zone['start'], zone['end'])] = zone['speaker']
    return annotation


optimized_output = create_annotation_from_speaker_zones(speaker_zones)


optimized_output
Simplified Diarization result, with overlaps removed
transcript_df = pd.DataFrame({'word': words, 'start': word_start_times, 'end': word_end_times})

transcript_df
for turn, _, speaker in optimized_output.itertracks(yield_label=True):
    print(f'{speaker}: {turn}')

def get_transcript_with_timestamps(timestamped_transcript_df, subtitle_length_sec=5):
    transcript_with_timestamps = []
    sentence = ''
    for i, row in timestamped_transcript_df.iterrows():
        if i == 0:
            start = row['start']
        if row['end'] - start < subtitle_length_sec:
            sentence += row['word'] + ' '
            end = row['end']
        else:
            transcript_with_timestamps.append({'start': start, 'end': end, 'sentence': sentence.strip()})
            start = row['start']
            end = row['end']
            sentence = row['word'] + ' '
    transcript_with_timestamps.append({'start': start, 'end': end, 'sentence': sentence.strip()})
    return transcript_with_timestamps

First, we generate transcripts that are timestamped in chunks of time; then we pair these with the speaker diarization to improve the transcript, defining chunks by speaker turn rather than by time.

get_transcript_with_timestamps(transcript_df, 30)[:2]
[{'start': 0.5604111660671462,
'end': 30.382291074640293,
'sentence': 'lebanon and israel have ended a long standing dispute over their shared maritime border the two countries are still formerly at war so leaders signe the agreement separately both nations hope to benefit from mineral resources within the formerly disputed area wstanie kramer report explains now what the dispute was about the mediterranean sea of the coast between israel and lebanon these were contested waters both countries have long been locked in dispute over where the'},
{'start': 30.482364497152282,
'end': 60.16414161420863,
'sentence': "maritime border lies behind the scenes negotiations have gone on for several years now israel and libanon have agreed on a maritime border deal mediated by the united states this gamsho disagreement strengthens israel's security and our freedom of action against hesbla and the threats to our north there is rare consensus in the security establishment regarding the necessity of this agreement the dispute is about a relatively small triungle shaped"}]
def get_transcript_with_speakers(transcript_with_timestamps: pd.DataFrame, speaker_annotation: Annotation):
    total_text = []
    for turn, _, speaker in speaker_annotation.itertracks(yield_label=True):
        turn_transcript = transcript_with_timestamps[(transcript_with_timestamps['start'] >= turn.start) & (transcript_with_timestamps['end'] <= turn.end)]
        total_text.append({'speaker': speaker, 'start': turn.start, 'end': turn.end, 'sentence': ' '.join(turn_transcript['word'].tolist())})
    return total_text

print(json.dumps(get_transcript_with_speakers(transcript_df, optimized_output)[:4], indent=2))
[
{
"speaker": "SPEAKER_03",
"start": 0.4978125,
"end": 18.8071875,
"sentence": "lebanon and israel have ended a long standing dispute over their shared maritime border the two countries are still formerly at war so leaders signe the agreement separately both nations hope to benefit from mineral resources within the formerly disputed area wstanie kramer report explains now what the dispute was about"
},
{
"speaker": "SPEAKER_04",
"start": 20.8996875,
"end": 43.0059375,
"sentence": "the mediterranean sea of the coast between israel and lebanon these were contested waters both countries have long been locked in dispute over where the maritime border lies behind the scenes negotiations have gone on for several years now israel and libanon have agreed on a maritime border deal mediated by the united states this"
},
{
"speaker": "SPEAKER_02",
"start": 43.0059375,
"end": 55.172812500000006,
"sentence": "gamsho disagreement strengthens israel's security and our freedom of action against hesbla and the threats to our north there is rare consensus in the security establishment regarding the necessity of this agreement"
},
{
"speaker": "SPEAKER_04",
"start": 56.894062500000004,
"end": 104.9878125,
"sentence": "the dispute is about a relatively small triungle shaped area with each side claiming their part as exclusive economic zone the areas expected to rich in offshore gas israel and libanon have a long history of conflict the two countries fought a war in two thousand and six and there have been many security incidents between israel and the libanese sheite militant group hasbola since parts of the country's land border the blue line a demarcation line by the u n is also disputed whether the maritime border deal could be a step towards a wider peace agreement is unclear but the deal paves the way to morgaexploration a potential economic benefit for both countries"
}
]

And that’s it! We have now generated a transcript of the audio file that correctly assigns each speaker to each sentence. As is plainly visible, however, there are still some issues with the transcript: the model makes grammatical errors rather often, and one of the words was assigned to the wrong speaker.

These errors can be attributed to a few different factors (see below). As a follow-on exercise, you may consider exploring the following:

  • HuBERT shares the acoustic-centric limitations of wav2vec 2.0-style models; the results can be improved grammatically by, for example, rescoring the output with an n-gram language model.
  • The handling of overlapping speech in this model was overly simplified, and more complex methods can be used to improve the results.
  • The model was trained on a dataset that is not representative of the audio file we are using; fine-tuning on more representative data would improve the results.
  • A generative approach to transcription could be attempted, instead of using an acoustic model.

Conclusion

In this article we have learned how to use Pyannote’s VAD and OSD features to filter out non-speech and overlapping speech segments, and how to use Pyannote’s Speaker Diarization feature to generate a transcript of an audio file that correctly assigns each speaker to each sentence.

With this knowledge, and such transcripts, other downstream tasks can be performed, such as sentiment analysis and emotion detection, enabling a much finer granularity than would be possible with a simple ASR model. For example, when analyzing an interview, one can determine the sentiment of each speaker, of each sentence, and of the interview as a whole.
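As a minimal sketch of that idea, assuming the speaker-labelled turns produced above and the default transformers sentiment-analysis model, per-speaker sentiment could look like this:

# Per-speaker sentiment over the diarized transcript (illustrative sketch)
from transformers import pipeline as hf_pipeline  # aliased to avoid clashing with the pyannote pipelines above

sentiment = hf_pipeline("sentiment-analysis")
speaker_turns = get_transcript_with_speakers(transcript_df, optimized_output)
for turn in speaker_turns:
    result = sentiment(turn["sentence"][:512])[0]  # crude truncation to stay within the model's input limit
    print(turn["speaker"], result["label"], round(result["score"], 3))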

We hope you’ll find this useful when building your own ASR systems, whether it is for audio data mining, transcription, or even live captioning. If you have any questions or contributions, please feel free to leave a comment!
