Deep Learning in Python

The Masked Speaker — Who are you?

Speaker Identification using WavLM

Lidor Bahar
Eleos Health


The Masked Singer UK TV show — The Doughnuts

If you stop and think about how difficult the speaker identification task is for humans, you will likely find it to be quite challenging. This article will guide you through the process of creating a basic speaker identification system with WavLM and show how to easily customize it to your own data. You will load a small dataset and train a classification model for speaker identification. Finally, you will use the model to classify unseen voice samples. I will also discuss the limitations and potential improvements of such a system.

The What and How of Speaker Identification

To begin, speaker recognition refers to the task of determining the speaker from a given speech sample. Speaker recognition systems can operate in two modes: verification and identification. Speaker verification assesses whether a given speech sample belongs to a claimed speaker, i.e. it focuses on a one-to-one matching scenario. Speaker identification aims to identify the speaker of a given speech sample among a group of known speakers, i.e. it focuses on a one-to-many matching scenario.

There are a variety of techniques employed for speaker identification, including GMM-UBM, i-vector, x-vector, and other deep learning methods. These systems can use classical features such as pitch, MFCCs, LPC coefficients, and spectrograms, as well as deep learning-based features that extract speech representations.

More recent speaker identification systems are typically text-independent, which means that the speaker can be classified from any speech sample. This differs from text-dependent systems, which require the individual to speak a specific phrase or set of words as a reference for verification.

Although it is less common in recent deep learning approaches, I believe it is important to have a basic understanding of the GMM-UBM method for speaker identification. GMM-UBM is a classical method based on a statistical model that consists of two parts: a universal background model (UBM) and a set of Gaussian mixture models (GMMs) for individual speakers.

The UBM represents the general characteristics of speech in a given population. It is trained on a large dataset of speech from multiple speakers and is intended to capture commonalities among all speakers. A GMM is then trained for each individual speaker; these models capture the unique characteristics of that speaker's speech signal. To perform speaker recognition, the log-likelihood ratio between a speaker's GMM and the UBM is calculated for the test sample, under the assumption that the true speaker's GMM will yield a higher likelihood relative to the UBM.
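
To make this concrete, here is a minimal sketch of GMM-UBM scoring using scikit-learn's GaussianMixture. The feature matrices and component count are placeholders, and a real system would typically derive the speaker GMM from the UBM via MAP adaptation rather than training it from scratch:

import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder feature matrices (e.g. MFCC frames), shape: (n_frames, n_features)
ubm_features = np.random.randn(5000, 20)     # pooled frames from many speakers
speaker_features = np.random.randn(500, 20)  # enrollment frames for one speaker
test_features = np.random.randn(300, 20)     # frames from an unknown sample

# Universal background model trained on the pooled population data
ubm = GaussianMixture(n_components=16, covariance_type='diag').fit(ubm_features)

# Speaker-specific GMM (a real system would MAP-adapt this from the UBM)
speaker_gmm = GaussianMixture(
    n_components=16, covariance_type='diag').fit(speaker_features)

# Log-likelihood ratio: average per-frame log-likelihood under the speaker
# model minus that under the UBM; higher values favor the claimed speaker
llr = speaker_gmm.score(test_features) - ubm.score(test_features)
print("Log-likelihood ratio: {:.3f}".format(llr))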

Speech Representations

Speech representations (embeddings) are a popular feature in speaker recognition systems. They are a low-dimensional representation of the speech signal. One of the main benefits of speech embeddings is that they are language-independent and text-independent. They can handle audio segments of varying length and can be used with a wide variety of input types. Additionally, this approach currently yields state-of-the-art results on benchmarks and is widely used in a variety of fields.

Another critical aspect for certain use cases, for example under HIPAA regulations, is that audio may contain sensitive textual information. With embeddings, it is possible to extract a representation that contains only information about the speaker’s voice, effectively removing the textual content.

Speaker Identification

To create the speaker identification system, you first need to gather a dataset of speakers’ audio files that fits your use case. If you don’t have one, you can use open-source datasets. In this article, we will load a dataset from Hugging Face, but if you do have your own dataset, here is a code snippet for loading it:

from pathlib import Path
import soundfile
import subprocess
import tempfile


def read_audio_file(wav_path: str, dest_fs: int = 16000):
    with tempfile.TemporaryDirectory() as tmpdir:
        temp_file_path_wav = Path(tmpdir) / 'temp.wav'
        # Convert wav file sampling rate and to single channel
        cmd = ['ffmpeg', '-v', 'quiet', '-i', wav_path,
               '-ac', '1', '-ar', str(dest_fs), str(temp_file_path_wav)]
        subprocess.call(cmd)
        audio, fs = soundfile.read(temp_file_path_wav)

    return audio, fs

wav_files_list = ... # <TODO: add list of paths to wav files>
# read_audio_file returns (audio, fs); keep only the audio array
x_train_data = [read_audio_file(f)[0] for f in wav_files_list]
y_train_data = ... # <TODO: add labels>

# TODO: duplicate for test data

Note that you will need to install ffmpeg first.

There are several datasets available for training and evaluating speaker recognition systems, such as VoxCeleb and SITW. They are quite large and usually used for training and evaluating large models. For the purpose of this article, you will use a sample dataset from VoxCeleb1.

The dataset is divided into train, test, and validation sets, each containing 300 audio files, consisting of 30 speakers with 10 audio files per speaker.

from datasets import load_dataset
import pandas as pd
import numpy as np

# Load dataset
dataset = load_dataset("s3prl/mini_voxceleb1")
SAMPLING_RATE = 16000

# Add labels for each audio file; files are ordered by speaker
# (10 files per speaker), so the label is the file index // 10
# Train data
x_train_data = [f['audio']['array'] for f in dataset['train']]
y_train_data = [str(i // 10) for i in range(len(dataset['train']))]

# Test data
x_test_data = [f['audio']['array'] for f in dataset['test']]
y_test_data = [str(i // 10) for i in range(len(dataset['test']))]

# Let's count the speakers and audio files
speakers_id, c = np.unique(y_train_data, return_counts=True)
df = pd.DataFrame({"Speaker": speakers_id, "wav_files_count": c})
print(df)

# Speaker wav_files_count
# 0 0 10
# 1 1 10
# 2 10 10
# 3 11 10
# 4 12 10
# 5 13 10
# 6 14 10
# 7 15 10
# 8 16 10
# 9 17 10
# ...

The next step is to split the audio samples into segments. Let’s split each sample into segments of N=4 seconds. Each segment will be a single voice sample for training or inference. Note that changing the segment duration parameter will affect the results, as the representation vectors become more robust for longer audio.

N_SECONDS_SEGMENT = 4  # Seconds

def segment_audio(x, y, n_sec, samp_rate):
    """Segment each array in a list of audio arrays into ~n_sec-second segments"""
    x_segment = list()
    y_segment = list()
    for audio, label in zip(x, y):
        # Use at least one segment so very short files are not dropped
        n_splits = max(1, round(audio.shape[0] / (samp_rate * n_sec)))
        segments = np.array_split(audio, n_splits)
        x_segment += segments
        y_segment += [label] * len(segments)

    return x_segment, y_segment

# Segment train and test sets
x_train, y_train = segment_audio(
    x_train_data, y_train_data, N_SECONDS_SEGMENT, SAMPLING_RATE)
x_test, y_test = segment_audio(
    x_test_data, y_test_data, N_SECONDS_SEGMENT, SAMPLING_RATE)

# Let's count the speakers and audio files for train and test sets
# after segmentation
speakers, c = np.unique(y_train, return_counts=True)
print("Training data set samples counts by speaker:")
print(pd.DataFrame({"Speakers": speakers, "examples_counts": c}))

# Training data set samples counts by speaker:
# Speakers examples_counts
# 0 0 16
# 1 1 26
# 2 10 17
# 3 11 17
# 4 12 17
# 5 13 29
# ...

speakers, c = np.unique(y_test, return_counts=True)
print("Test data set samples counts by speaker:")
print(pd.DataFrame({"Speakers": speakers, "examples_counts": c}))

# Test data set samples counts by speaker:
# Speakers examples_counts
# 0 0 20
# 1 1 22
# 2 10 14
# 3 11 23
# 4 12 17
# 5 13 24
# ...

After loading and segmenting the data, let’s extract the features for the classifier. Fortunately, there are many open-source resources available for this task. Pre-trained models that extract audio representation vectors exist for a variety of tasks; in this case we use the WavLM model, which is optimized for speaker verification. This model was trained on a large amount of data and achieves SOTA results on the SUPERB benchmark. Because it has already been fine-tuned on a large dataset for the specific task of speaker verification, you only need to train a simple classifier on top of it.

import torch
from tqdm import tqdm
from transformers import Wav2Vec2FeatureExtractor
from transformers import WavLMForXVector

# Extract embedding vector for each audio sample using pre-trained
# model for speaker verification (WavLM)
device = "cuda" if torch.cuda.is_available() else "cpu"
feature_extractor_wav2vec = Wav2Vec2FeatureExtractor.from_pretrained(
    "microsoft/wavlm-base-plus-sv")
model_wav_lm = WavLMForXVector.from_pretrained(
    "microsoft/wavlm-base-plus-sv").to(device)


def extract_embeddings(model, feature_extractor, data, device):
    """Use the WavLM model to extract embeddings for audio segments"""
    emb_list = list()
    for i in tqdm(range(len(data))):
        inputs = feature_extractor(
            data[i],
            sampling_rate=SAMPLING_RATE,
            return_tensors="pt",
            padding=True
        ).to(device)
        with torch.no_grad():
            embeddings = model(**inputs).embeddings

        # L2-normalize each embedding and collect it on the CPU
        emb_list += torch.nn.functional.normalize(embeddings.cpu(), dim=-1)

    return torch.stack(emb_list)


# Extract embeddings for train and test set
x_train_emb = extract_embeddings(
    model=model_wav_lm,
    feature_extractor=feature_extractor_wav2vec,
    data=x_train,
    device=device
)

x_test_emb = extract_embeddings(
    model=model_wav_lm,
    feature_extractor=feature_extractor_wav2vec,
    data=x_test,
    device=device
)

Using the embedding vectors, it is possible to classify voice samples based on a similarity measure (usually cosine distance). Samples from the same speaker will have smaller distances between them than samples from different speakers. As previously discussed, this is a speaker identification problem: given a voice sample, the system should classify it as one of N known speakers. The next step is to train a k-Nearest Neighbors classifier with the cosine distance on your own data:

from sklearn.neighbors import KNeighborsClassifier

# Fit k-NN model
N_NEIGHBORS = 5
model_knn = KNeighborsClassifier(
    n_neighbors=N_NEIGHBORS,
    metric='cosine',
    algorithm='brute'
)
model_knn.fit(x_train_emb, y_train)

Next, let’s predict the speaker ID for each audio segment in the test set. The macro-F1 score is used to evaluate the results.

from sklearn.metrics import confusion_matrix, f1_score

# Predict
y_pred = model_knn.predict(x_test_emb)

# Print F1 score
f1 = f1_score(y_test, y_pred, average="macro")
print("k-NN test set F1 score={:.3f}".format(f1))

# k-NN test set F1 score = 0.984

The model performs well, reaching a high F1 score (0.98). If you have a longer recording at prediction time (i.e. longer than 4 seconds), you can split it into 4-second segments (or any other chosen duration) and classify the speaker with a majority vote over the segment predictions, as sketched below.
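
As a minimal sketch of this majority-vote idea, reusing segment_audio, extract_embeddings, and model_knn from above (long_audio is just a placeholder for your own recording):

from collections import Counter

# Placeholder: a longer recording (numpy array sampled at SAMPLING_RATE)
long_audio = x_test_data[0]

# Split the recording into 4-second segments and embed each one
segments, _ = segment_audio(
    [long_audio], ["unknown"], N_SECONDS_SEGMENT, SAMPLING_RATE)
segment_emb = extract_embeddings(
    model=model_wav_lm,
    feature_extractor=feature_extractor_wav2vec,
    data=segments,
    device=device
)

# Predict a speaker for each segment and take the majority vote
segment_preds = model_knn.predict(segment_emb)
predicted_speaker, votes = Counter(segment_preds).most_common(1)[0]
print("Predicted speaker: {} ({}/{} segments)".format(
    predicted_speaker, votes, len(segment_preds)))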

Conclusions

In this article, we have shown how to create a speaker identification system in Python. We obtained good results; however, it’s important to keep in mind that the proposed system has several limitations. One of the most crucial is recognizing when the system is unsure of its predictions.

“To know what you know and what you do not know, that is true knowledge.” — Confucius

To address this limitation, one solution is to examine the prediction probabilities and identify when the model has high or low confidence. Another option is to construct a DNN architecture that incorporates the UBM concept. Both options are viable, though the second one may require more resources to develop.
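
As a minimal sketch of the first option, you can look at the k-NN vote proportions returned by predict_proba and flag predictions below a chosen threshold as uncertain (the 0.8 threshold is an arbitrary placeholder that you would tune on validation data):

CONFIDENCE_THRESHOLD = 0.8  # Placeholder value; tune on a validation set

# Fraction of the k nearest neighbors that agree with the predicted class
confidence = model_knn.predict_proba(x_test_emb).max(axis=1)
y_pred_knn = model_knn.predict(x_test_emb)

# Reject predictions the model is not confident about
y_pred_with_rejection = [
    pred if conf >= CONFIDENCE_THRESHOLD else "unknown"
    for pred, conf in zip(y_pred_knn, confidence)
]

print("Segments flagged as uncertain: {:.1%}".format(
    (confidence < CONFIDENCE_THRESHOLD).mean()))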

Another important consideration is that the human voice changes over time, both in the short and long term. Various environmental factors can affect our voice, such as illness, mood, and emotions. Our voice also changes as we age and in different medical conditions. Depending on the task, it may be necessary to regularly update the model and test it under these conditions.

There are many other ways to improve the robustness of the model, such as augmenting the training data with noise samples and adding a Speech Activity Detection (SAD) step to remove non-speech segments (a naive example is sketched below). There are also other pre-trained models for speech representations that can be considered; the Hugging Face model hub is a good place to search for them.
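
As one simple, admittedly naive illustration of the SAD idea, an energy-based filter can drop near-silent frames before segmentation; dedicated VAD models do much better, and the frame length and percentile threshold below are arbitrary placeholders:

def drop_silent_frames(audio, samp_rate, frame_sec=0.025, energy_percentile=20):
    """Naive energy-based speech activity detection: keep only frames whose
    RMS energy is above a percentile of the recording's frame energies."""
    frame_len = int(samp_rate * frame_sec)
    n_frames = len(audio) // frame_len
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    threshold = np.percentile(rms, energy_percentile)
    return frames[rms > threshold].reshape(-1)

# Example: filter an audio array before segmentation and embedding
speech_only = drop_silent_frames(x_train_data[0], SAMPLING_RATE)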

Thank you for reading. I hope you found this article informative.

If you would like to know more about Eleos Health, our technology, and how we contribute to the behavioral health environment, visit us here!
