An Overview of Speaker Recognition With SincNet

Akash Singh
Published in Saarthi.ai
Dec 5, 2019



This article will help you deep-dive into Speaker Recognition and learn the essentials of handling audio data.

We will look into "Speaker Recognition from Raw Waveform with SincNet", an approach that converges quickly even on small amounts of training data.

By training with only 15 seconds of audio per speaker, we will build a speaker identification model that can differentiate between the users it has seen during training.

Speaker Recognition models capture characteristics of the voice of an individual.

Here, we identify “Who is speaking?”.

Two crucial parts of the process of speaker recognition are Speaker Verification and Speaker Identification.

Speaker Verification: This is a one-to-one matching process. We already have a claim that the speaker is "X" (this is the verification phase), so the incoming voice is matched against speaker "X"'s voice print only (which we get from the enrollment phase). Based on the degree of similarity, we can set a threshold for accepting the match.

Speaker Identification: This is a one-to-N matching process. When a speaker comes in, their voice is matched against all N enrolled speakers, and based on that a prediction is made as to "Who is speaking?".

Deep Learning has Accelerated Speech Development


Earlier, speech systems relied on hand-crafted features such as MFCC and FBANK, but these features are not equally suited to every speech-related task.

Recently, however, speech development has been accelerated with the help of Deep Learning.

Convolutional Neural Network in the mix:

CNNs (Convolutional Neural Networks) have also come into the mix to capture important speech features, exhibiting better performance than the aforementioned hand-crafted features.

When we feed raw speech inputs into a CNN, it learns to capture low-level speech representations, which helps in learning important narrow-band speaker characteristics such as pitch and formants.

Speaker-related information is concentrated in the lower frequency region. Thus, the CNN should be designed in a way that lets it capture meaningful features there.

According to the SincNet paper, the first convolutional layer of current waveform CNNs is the most critical, as it receives high-dimensional inputs. It is also the layer most affected by vanishing gradients.

To overcome this, SincNet uses parametrized sinc functions to implement band-pass filters.

The low and high cutoff frequencies are the only parameters of the filter learned from data. This solution still offers considerable flexibility, but forces the network to focus on high-level tunable parameters with a broad impact on the shape and bandwidth of the resulting filter.

In the next section, we will see how SincNet is designed to work efficiently.

Breaking down the Architecture of SincNet

We take a standard CNN which performs time-domain convolutions (convolution in the time domain equals multiplication in the frequency domain) between the input waveform and some Finite Impulse Response (FIR) filters, as given below.
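
y[n] = x[n] * h[n] = Σ_{l=0}^{L−1} x[l] · h[n − l]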

Here,

x[n] = chunk of speech signal,

h[n] is a filter of length L,

y[n] is the output.

The chunk length can be chosen based on experiments with the dataset, and in a standard CNN all L elements of each filter are learned from the data.

SincNet, instead, performs its convolutions with a predefined function g that depends on only a few learnable parameters θ:
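
y[n] = x[n] * g[n, θ]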

In digital signal processing, g is defined such that a filter-bank composed of rectangular bandpass filters is employed. In the frequency domain, the magnitude of a generic bandpass filter can be written as the difference between two low-pass filters.
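
G[f; f1, f2] = rect(f / (2·f2)) − rect(f / (2·f1))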

Here f1 and f2 are the learned low and high cutoff frequencies, and rect(·) is the rectangular function in the magnitude frequency domain.

The sinc function is defined as sinc(x) = sin(x)/x.

After returning to the time domain (using the inverse Fourier transform), the reference function g becomes:
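
g[n; f1, f2] = 2·f2·sinc(2π·f2·n) − 2·f1·sinc(2π·f1·n)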

The major features which determine a unique speaker are captured in the lower frequency range. In the above equation, fs is the sampling frequency of the input signal, and the cut-off frequencies are initialized randomly in the range [0, fs/2].

The sampling frequency may vary with the type of data you are experimenting with. An IVR (telephony) system has a sampling frequency of 8 kHz, whereas CD-quality stereo audio has a sampling frequency of 44.1 kHz.

Alternatively, we can initialize the filters based on the cut-off frequencies of the mel-scale filter-bank. The major advantage of assigning filters this way is that it directly allocates more filters to the lower part of the spectrum, which carries the unique information of speaker voices.
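
As a rough illustration (not the official SincNet code), here is a minimal NumPy sketch of such an initialization, assuming the standard 2595·log10(1 + f/700) mel formula and an illustrative lower edge of 30 Hz:

import numpy as np

def mel_initialized_cutoffs(n_filters=80, fs=16000):
    """Initialize (f1, f2) cutoff pairs from mel-spaced band edges."""
    hz_to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    mel_to_hz = lambda mel: 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    # Equally spaced points on the mel scale, converted back to Hz
    mel_points = np.linspace(hz_to_mel(30.0), hz_to_mel(fs / 2.0), n_filters + 1)
    hz_points = mel_to_hz(mel_points)
    # Adjacent edges become the low/high cutoffs of each band-pass filter
    return hz_points[:-1], hz_points[1:]

Because the mel scale is roughly logarithmic, adjacent edges are packed much more densely below about 1 kHz, which is exactly where the speaker-specific information lives.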

To ensure f1 ≥ 0 and f2 ≥ f1, the previous equation is fed by the following parameters:
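
f1_abs = |f1|
f2_abs = f1_abs + |f2 − f1|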

Here we keep no explicit upper bound on f2, i.e., f2 is not forced to be smaller than the Nyquist frequency (half the sampling rate, the highest frequency a sampled signal can represent without aliasing), as the model learns to respect this constraint while training. The subsequent layers then decide to give more or less importance to each filter output.

An ideal bandpass filter, where the passband is perfectly flat and the attenuation in the stopband is infinite, requires an infinite number of elements L. Any truncation of g thus inevitably leads to an approximation of the ideal filter, characterized by ripples in the passband and limited attenuation in the stopband.

So, windowing is performed to solve this issue. This is done simply by multiplying the truncated function g with a window function w, which aims to smooth out the abrupt discontinuities at the ends of g:
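
g_w[n; f1, f2] = g[n; f1, f2] · w[n]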

This paper uses the popular Hamming window, defined as follows:
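
w[n] = 0.54 − 0.46 · cos(2πn / L)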

We can get high-frequency selectivity with the use of the Hamming window. We can use other windows too. One important note here is that due to the symmetry, the filters can be computed efficiently by considering one half of the filter and inheriting the results for the other half.
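
Putting the pieces together, here is a minimal NumPy sketch of how a single windowed sinc band-pass filter could be built. This is not the authors' PyTorch implementation; it assumes an odd filter length L and cutoffs given in Hz:

import numpy as np

def sinc(x):
    # sin(x) / x, with sinc(0) defined as 1 to avoid division by zero
    return np.where(x == 0, 1.0, np.sin(x) / np.where(x == 0, 1.0, x))

def sinc_bandpass_filter(f1_hz, f2_hz, L=251, fs=16000):
    # Time indices centered at zero, so the filter is symmetric
    n = np.arange(L) - (L - 1) / 2
    f1, f2 = f1_hz / fs, f2_hz / fs  # normalized cutoff frequencies
    # Difference of two low-pass sinc filters gives a band-pass filter
    g = 2 * f2 * sinc(2 * np.pi * f2 * n) - 2 * f1 * sinc(2 * np.pi * f1 * n)
    # Hamming window smooths the truncation at the filter edges
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(L) / L)
    return g * w

# Example: an illustrative band-pass filter between 100 Hz and 400 Hz
filt = sinc_bandpass_filter(100, 400)

In SincNet, f1 and f2 are learnable parameters, and this construction is repeated for every filter in the first convolutional layer.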

Training a model like this leads to fast convergence, fewer parameters to train, and interpretability.

Fast Convergence: SincNet is designed in such a way that it forces the network to focus on the filter parameters that have a major impact on performance. This filtering technique adapts to the data while incorporating prior knowledge, much like classical feature-extraction techniques for audio. This prior knowledge makes learning the filter characteristics much easier, helping SincNet converge significantly faster to a better solution. We get fast convergence within the first 10-15 epochs.

Fewer parameters for training: SincNet drastically reduces the number of parameters in the first convolutional layer. For instance, if we consider a layer composed of F filters of length L, a standard CNN employs F · L parameters, against the 2F considered by SincNet. If F = 80 and L = 100, we employ 8k parameters for the CNN and only 160 for SincNet. Moreover, if we double the filter length L, a standard CNN doubles its parameter count (e.g., we go from 8k to 16k), while SincNet has an unchanged parameter count (only two parameters are employed for each filter, regardless of its length L). This offers the possibility to derive very selective filters with many taps, without actually adding parameters to the optimization problem. Moreover, the compactness of the SincNet architecture makes it suitable in the few-sample regime.

Interpretability: The SincNet feature maps obtained in the first convolutional layer are definitely more interpretable and human-readable than other approaches. The filter bank, in fact, only depends on parameters with a clear physical meaning.

Filter Analysis

In the figures above, we can easily see how CNN and SincNet filters learn differently. These filters were learned on the Librispeech dataset (the frequency response is plotted between 0 and 4 kHz). We can see that a standard CNN learns noisy filters.

In the two figures, we can see which frequency bands are covered by the CNN and by SincNet.

We can see that the first peak corresponds to the pitch region (the average pitch is 133 Hz for a male and 234 Hz for a female).

The second peak (approximately located at 500 Hz) mainly captures the first formants, whose average value over various English vowels is indeed 500 Hz.

Finally, the third peak (ranging from 900 to 1400 Hz) captures some important second formants, such as the second formant of the vowel /a/, which is located on average at 1100 Hz.

SincNet learns filters that are, on average, more selective than CNN ones, allowing it to better capture narrow-band speaker cues.

Using the same model for Speaker Verification

Once we are done with the speaker identification part, we can use the same model for speaker verification.

As the model learns to identify the unique features of a speaker, the layer just before the classification layer produces a representation of the audio passed through the network that is unique to that speaker. We can use this representation as a speaker embedding, or d-vector.

This d-vector is used to verify whether a new utterance comes from the claimed speaker or not. Let's break this into two parts: Speaker Enrollment and Speaker Verification.

Speaker Enrollment: In this phase, when a new user comes into the system, their voice samples are stored, a d-vector is calculated for each sample, and the average is taken and stored as that user's voice print, so that the next time the same user comes, we can match against this stored voice print. Longer voice samples help to capture features better, and a larger number of samples helps to cover the variation in the user's voice. A good voice sample falls in the range of 3-5 seconds.

Speaker Verification: In the verification phase, we already have the user's stored voice print. When the user comes for verification and says something, we take the d-vector of the new utterance and compare it with the previously stored voice print. We can use cosine similarity to match the voice prints.
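
To make enrollment and verification concrete, here is a minimal NumPy sketch. Note that extract_d_vector stands in for whatever function returns the embedding from the layer before the classifier, and the 0.75 threshold is purely illustrative and should be tuned on held-out data:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def enroll(samples, extract_d_vector):
    # Average the d-vectors of all enrollment samples into one voice print
    d_vectors = np.stack([extract_d_vector(s) for s in samples])
    return d_vectors.mean(axis=0)

def verify(utterance, voice_print, extract_d_vector, threshold=0.75):
    # Accept the claimed identity if the new d-vector is close enough to the voice print
    d = extract_d_vector(utterance)
    return cosine_similarity(d, voice_print) >= threshold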

Conclusion

Dealing with audio requires careful examination of the data, and experimentation to find out what works best for your use-case.

One should understand what impact dividing the data based on gender, age, etc., could have.

I have tried separating data based on gender, and it gave me better results by reducing the search space.

Try playing with different sampling rates, noise, and other factors that can improve performance.

Remember, the best solution arrives after tedious experimentation.

Happy Playing!!
