Speaker Verification: Architecture and Results (Part 2)

Antonio Falabella
Data Reply IT | DataTech
5 min read · Jul 19, 2022

Siamese Neural Networks can be applied to Speaker Verification, but in order to understand what Speaker Verification means, we first need to present Speaker Identification.

1. Speaker Identification

Speaker Identification is the identification of a person from the characteristics of their voice. It is the task that needs to be completed to answer the question “Who is speaking?”.
Recognizing the speaker can simplify the task of translating speech in systems that have been trained on specific voices. Speaker recognition uses the acoustic features of speech that have been found to differ between individuals; these acoustic patterns derive from both anatomy and learned behaviour. Speaker Identification systems can be implemented covertly, without the user’s awareness, to identify speakers in a discussion, alert automated systems of speaker changes, check whether a user is already enrolled in a system, and so on.

In the end, Identification is the task of determining the unknown identity of a voice among a restricted group of people.

2. Speaker Verification

If a speaker claims to be a certain person and an audio sample containing a voice is used to test this claim, the task is called verification or authentication.
Speaker Verification is usually applied as a “sentry” that grants access to a secure system. Such systems operate with the users’ knowledge and usually expect their cooperation.

Basic Speaker-Verification System

In forensic applications, it is common to first perform a speaker identification process to create a list of “best matches”, and then conduct a series of verifications to determine a convincing match. Matching the speaker’s samples against the list of best matches determines whether they come from the same person, based on the number of similarities or differences. The prosecution and defence can use this as evidence to establish whether the suspect is truly the offender.
Other applications of speaker verification may include entry control to a restricted area, access to privileged information, credit card authorizations, funds transfer and similar transactions.

In the next section, I propose an architecture for the speaker verification task. The network aims to distinguish whether two audio samples contain the voice of the same person or the voices of two distinct people.

3. Proposed Architecture

Proposed Siamese Architecture

The Siamese Neural Network is composed of two branches that work in tandem on two different inputs and share the weights of each layer. The architecture proposed for each branch of the siamese network is presented in the following figure and can be schematized as follows (a minimal code sketch is given after the list):

  1. Input: a pair of audio samples, each 1 second long and recorded at a 16,000 Hz sample rate, with a True label if the two samples belong to the same person and a False label otherwise
  2. Convolutional layer: creates an abstracted feature map, also called an activation map, with shape (number of inputs × feature-map height × feature-map width × feature-map channels)
  3. Five repetitions of the combination Max Pooling, Batch Normalization, LeakyReLU and Convolutional layer
  4. Dense layer: in a fully connected layer, each neuron of the previous layer is connected to each neuron of the current layer, just as in a traditional multilayer perceptron
  5. Distance layer (Euclidean distance)
  6. Dense Layer
  7. Output Layer
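
To make the schema concrete, here is a minimal sketch in Keras of how the shared branch, the Euclidean distance layer and the final dense layers could be wired together. The filter counts, kernel sizes and embedding dimension are illustrative assumptions, not the exact values used in this work:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

SAMPLE_RATE = 16_000  # 1-second clips at 16 kHz -> 16,000 samples per input

def make_branch() -> Model:
    """One branch of the siamese network; built once so weights are shared."""
    inp = layers.Input(shape=(SAMPLE_RATE, 1))             # raw waveform
    x = layers.Conv1D(16, kernel_size=64, strides=2)(inp)  # first convolutional layer
    for filters in (32, 32, 64, 64, 128):  # 5x (MaxPool, BatchNorm, LeakyReLU, Conv)
        x = layers.MaxPooling1D(pool_size=4)(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU()(x)
        x = layers.Conv1D(filters, kernel_size=3)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128)(x)                               # dense embedding layer
    return Model(inp, x, name="branch")

branch = make_branch()

in_a = layers.Input(shape=(SAMPLE_RATE, 1))
in_b = layers.Input(shape=(SAMPLE_RATE, 1))
emb_a, emb_b = branch(in_a), branch(in_b)  # same branch applied twice -> shared weights

# Distance layer: Euclidean distance between the two embeddings.
dist = layers.Lambda(
    lambda t: tf.sqrt(
        tf.reduce_sum(tf.square(t[0] - t[1]), axis=1, keepdims=True) + 1e-9
    )
)([emb_a, emb_b])

x = layers.Dense(16, activation="relu")(dist)   # dense layer on top of the distance
out = layers.Dense(1, activation="sigmoid")(x)  # output: same-speaker probability

siamese = Model([in_a, in_b], out)
siamese.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Because `make_branch` is called once and the resulting sub-model is applied to both inputs, every layer’s weights are shared between the two branches, which is exactly what makes the network siamese.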

4. Dataset: LibriSpeech


The dataset used to train and test this network is the LibriSpeech dataset. This is a corpus of approximately 1000 hours of 16 kHz English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project and has been carefully segmented and aligned. The corpus is freely available under the very permissive CC BY 4.0 license. Each sample is saved as a FLAC file, which stands for Free Lossless Audio Codec: a free audio codec with lossless compression. This means the audio is compressed without losing quality, unlike lossy compression such as MP3 or AAC, because the process does not remove information from the audio stream.
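
As a hedged sketch, loading a 1-second clip from one of these FLAC files could look like the following. The file path is a hypothetical example, and soundfile is just one of several libraries able to decode FLAC:

```python
import numpy as np
import soundfile as sf

# Hypothetical example path into an extracted LibriSpeech split.
path = "LibriSpeech/dev-clean/84/121123/84-121123-0000.flac"

audio, sr = sf.read(path)                 # float samples in [-1, 1]
assert sr == 16_000                       # LibriSpeech audio is already 16 kHz

clip = audio[:sr]                         # take the first second
if len(clip) < sr:                        # pad shorter clips with silence
    clip = np.pad(clip, (0, sr - len(clip)))
clip = clip.astype(np.float32)[:, None]   # shape (16000, 1) for the network input
```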

In order to create the correct input for this network, each sample is paired once with a sample from the same speaker and once with a sample from a different, randomly chosen speaker from the same dataset split. Pairs of audio taken from the same speaker are labelled with a boolean flag set to True; for the other pairs, since the two samples are not from the same person, the flag is set to False.
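
A minimal sketch of this pairing step, assuming the clips have already been loaded into a dictionary mapping each speaker ID to a list of 1-second clips (the function name and data layout are my own assumptions):

```python
import random
import numpy as np

def make_pairs(clips_by_speaker):
    """Build (clip_a, clip_b) pairs with boolean labels: for every clip,
    one positive pair (same speaker, True) and one negative pair
    (different random speaker, False)."""
    speakers = list(clips_by_speaker)
    pairs, labels = [], []
    for spk in speakers:
        clips = clips_by_speaker[spk]
        for i, clip in enumerate(clips):
            # Positive pair: another clip from the same speaker
            # (falls back to the clip itself if the speaker has only one).
            others_same = clips[:i] + clips[i + 1:] or clips
            pairs.append((clip, random.choice(others_same)))
            labels.append(True)
            # Negative pair: a clip from a different, randomly chosen speaker.
            other_spk = random.choice([s for s in speakers if s != spk])
            pairs.append((clip, random.choice(clips_by_speaker[other_spk])))
            labels.append(False)
    return pairs, np.array(labels)
```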

The pairs taken into consideration are 16,000 for training and 4,000 for testing, due to the computational limits of Colab Pro: higher numbers of pairs quickly saturated the RAM, crashing the whole session.

5. Results

The Siamese Neural Network has been trained for 100 epochs, even though the maximum accuracy on the training set was reached around epoch 60 and the accuracy on the test set stopped improving around epoch 30. The accuracy reached on the test set is 80.1%, while the accuracy on the training set is 98.61%; the accuracy on the validation set is 78.20%. As shown in the figure below, the gap between training accuracy and test accuracy is around 15%. To narrow this gap, dropout has been applied to both the convolutional layers and the dense layers. Different distance measures were also tried, but did not bring significant changes in the accuracy score.

Siamese Accuracy
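
Building on the sketches above, the training call could look like the following. Only the epoch count comes from the text; the batch size and validation split are assumptions, and the dropout mentioned above would be added inside `make_branch` (e.g. a `layers.Dropout` after each LeakyReLU and after the dense embedding layer):

```python
import numpy as np

# Stack the pairs built earlier into the arrays the model expects.
a = np.stack([p[0] for p in pairs])   # shape (num_pairs, 16000, 1)
b = np.stack([p[1] for p in pairs])
y = labels.astype(np.float32)         # True/False -> 1.0/0.0

history = siamese.fit(
    [a, b], y,
    validation_split=0.2,   # held-out pairs for the validation curve (assumed)
    epochs=100,             # epoch count from the text
    batch_size=32,          # assumed batch size
)
```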

6. Conclusion

As mentioned in the previous section, Colab Pro could not handle the whole dataset for the verification task, so a direction for future work is to use more computational power and RAM. In this work, only a small part of the LibriSpeech dataset was used: the dev set, around 100 hours. Using the whole dataset of more than 960 hours of clean and noisy audio could improve performance. Testing these networks on different datasets, such as VoxCeleb2, could also lead to interesting results.

Nevertheless, the 80+% accuracy obtained is promising and makes this approach worth pursuing further.

7. References

LibriSpeech Dataset

Speaker Identification

Speaker Recognition From Raw Waveform With Sincnet
