Building a Voice Recognition System with PyTorch by Taking Advantage of Computer Vision Techniques

Contrastive learning for supervised and self-supervised tasks

Encora · Published in Encora Technology Practices · 12 min read · Jun 25, 2020

Introduction

Biometric-based authentication methods tend to increase in importance in times of social distancing, remote working, and collaboration, as they can deliver higher security and a better customer experience at the same time. One such method is Voice Recognition, that is, identifying whether a given voice input is from someone previously registered or not. Voice authentication presents one of the best user experiences among all authentication methods, so advances in that area could help improve applications’ security without impairing experience in many industries.

In this piece we describe how we built a reasonably performing Voice Recognition System with PyTorch, using deep learning Computer Vision techniques. With results as good as 90.2% accuracy using different training and testing samples, with only 25% of the original dataset size, we demonstrate how it is currently possible for different AI domains to leverage knowledge from each other to improve their techniques and outcomes.

Background

Like Computer Vision (CV) and Natural Language Processing (NLP), audio-based applications are among the areas most impacted by the recent advances in deep learning.

Problems like voice and speech recognition remained extremely challenging for decades. They are still difficult today, but classic solutions for these problems usually required a lot of hand-crafted feature design and expert domain knowledge.

But perhaps most important, deep learning also sparked a huge increase in collaboration among professionals in different areas. Before the learning-from-data paradigm really kicked in, NLP, Computer Vision, and signal processing researchers had little common ground to collaborate on, because each field’s methods were very domain-specific. Nowadays, signal processing researchers can take advantage of advances made in CV and NLP. This exchange of information can be as easy as reading a paper and applying the ideas to a different field.

Today, with a good understanding of machine learning, clean and well-behaved datasets, and deep learning libraries, it is relatively straightforward to build a simple proof-of-concept (POC).

To prove the point, in this article we are going to describe the process of taking a recently developed unsupervised learning method from CV and using it to build a POC for voice recognition. In other words, we want a system that can identify voices of specific people.

Why does it matter? Voice-based technology is widely used as a biometric factor for authentication. Many players like Google and Apple have core technology based on voice interaction. Apple Siri, Google Assistant, and Cortana are some of the most popular. Moreover, voice tech is popularly used as a multi-factor authentication step. Here, biometrics like fingerprint, voice passphrase, and face recognition can be combined to build a customized and secure authentication mechanism.

As with any data-based system, the solution presented here is not perfect. It is meant to be a POC where experimentation is the principal driver. Nevertheless, the system achieves decent performance using just a small portion of the available data.

We are going to start with some background on the unsupervised learning method used for this project and transition to our use case. The code, written in PyTorch, can be accessed on GitHub.

Preliminaries — Unsupervised Contrastive Learning

The general goal of the contrastive learning loss we will use here is to make representations from correlated scenes as close as possible (based on a given distance metric) while pushing apart representations from different scenes.

Intuitively, let’s go over the case where we have a supervised dataset with labels flagging each class. It is easy to see that in such a scenario, we could cherry-pick different samples from the same class, get their embedding vectors, and optimize a network so that this pair of embeddings is close together in the space of representations. For reasons that will be clear shortly, let’s represent this pair of embedding vectors as (𝓏ᵢ, 𝓏ⱼ). Here, 𝓏ᵢ is a representation derived from image i and 𝓏ⱼ from image j. Remember, these two images are different samples from the same class.

For image-based applications, if we have the class labels, we can pair different images of the same class and maximize agreement between them.

For voice recognition, we can maximize agreement between different sentences that were spoken by the same person.

Similarly, we could deliberately choose another image, from a different class than the pair picked above, and make its representation (let’s call it 𝓏ₖ) as far away from 𝓏ᵢ as possible. In other words, the pair (𝓏ᵢ, 𝓏ₖ) should be optimized to be far apart in the representation space.

Similarly, we can pick an image from a different class and push their representations as far away as we can.

Using the deep learning lingo, the representations (𝓏ᵢ, 𝓏ⱼ) form a positive pair, whose members are meant to be similar or near to each other. It is common to refer to 𝓏ᵢ as the anchor and to 𝓏ⱼ as the positive.

For an image-based task, say classification, that could be two different images of a cat or a pair of images of horses. In a voice recognition setup, that could be two different audio samples of the same person speaking different sentences. The whole idea is to capture different views of the same phenomenon and optimize the network to acknowledge it.

Moreover, the negative representation 𝓏ₖ is supposed to be uncorrelated with the anchor 𝓏ᵢ. In this way, (𝓏ᵢ, 𝓏ₖ) represents a negative pair whose members are meant to be dissimilar and far away from each other.

However, in an unsupervised setup, we find ourselves without class semantic annotations. This forces us to figure out a way to create positive pairs from images whose labels we do not know.

In unsupervised contrastive learning, the most popular and successful approach is called instance discrimination. The idea is simple: since we cannot pair different instances from the same class (because we do not know their labels), we take an instance and create 2 versions of it using random data transformations. These 2 views are passed through an encoder that returns representations 𝓏ᵢ and 𝓏ⱼ respectively, which we can use in the same way as we did before.

For image-based applications, from one image, we can create many pseudo-views of it using random data augmentations such as flip, scaling, and color distortions.

For a voice recognition application, we can take different portions of the same spoken sentence, corrupt them using random transforms, and use them as different views.

And for the negative 𝓏ₖ, again, since we do not know the labels, we simply pick a randomly chosen image and optimize its representation to be as far away from 𝓏ᵢ as possible.

Contrastive Learning methods have been one of the central players of the recent advances in unsupervised visual representation learning. More specifically, the Info Noise Contrastive Estimation (InfoNCE) loss, as defined below, has been used in many recent works to learn representations from unlabeled data.
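For reference, one common way to write the InfoNCE loss for an anchor 𝓏ᵢ, its positive 𝓏ⱼ, and N negatives 𝓏ₖ is sketched here. The similarity function s(·,·) (often cosine similarity) and the temperature τ are assumptions made for illustration, and published formulations differ in small details such as which terms enter the denominator.

```latex
\mathcal{L}_{i} \;=\; -\log
\frac{\exp\!\big(s(z_i, z_j)/\tau\big)}
     {\exp\!\big(s(z_i, z_j)/\tau\big) \;+\; \sum_{k=1}^{N} \exp\!\big(s(z_i, z_k)/\tau\big)}
```

The loss approaches zero when the positive similarity dominates every negative similarity, and grows as any negative becomes as similar to the anchor as the positive is.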

The core of the InfoNCE loss lies in the embedding vectors extracted from an encoder function (represented by a Neural Network).

An embedding is just a vector that we extract from the ConvNet.

As explained above, 𝓏ᵢ and 𝓏ⱼ are representations from a ConvNet that refer to the same instance.

Note that in the denominator, the anchor representation 𝓏ᵢ is put against N “negative” representations 𝓏ₖ. Remember, the representations 𝓏ₖ are supposed to be uncorrelated with the anchor. In this way, instead of just one, we have N negative pairs (𝓏ᵢ, 𝓏ₖ), aggregated in the denominator.

In simple terms, we want to maximize the agreement between positive pairs in the representation space. But we also want to keep the anchors as far as possible from all the negative representations in the denominator.

It has been shown that minimizing the InfoNCE loss maximizes a lower bound on the Mutual Information between the representations 𝓏ᵢ and 𝓏ⱼ. For this bound to be tight, the number of negatives needs to be large. That is why you see an aggregation of negative pairs in the denominator.

Another way to see what is going on in the InfoNCE loss is through the lens of classification learning. Note that the InfoNCE loss is simply a cross-entropy loss over a Softmax function. It receives a batch of N vectors (N here is the batch size), each containing the similarities between an anchor and its positive and negatives. Then, we optimize it with the goal of classifying the positive pairs among all negatives. To put it simply, given a pair of representations that we know are correlated (positive pairs), and a bunch of representations from random images (negative pairs), we want to tell the pair of positives from the negatives.
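The classification view can be made concrete with a small, dependency-free sketch. It is written in plain Python for readability rather than as the project's PyTorch code; the function names, the use of cosine similarity, and the temperature of 0.1 are all assumptions chosen for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy of a softmax whose 'correct class' is the positive pair."""
    # Index 0 holds the positive logit; the remaining entries are negatives.
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Toy 2-D embeddings (hypothetical values).
anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_aligned = info_nce(anchor, [0.9, 0.1], negatives)     # positive close to the anchor
loss_misaligned = info_nce(anchor, [0.1, 0.9], negatives)  # positive far from the anchor
```

In this toy setup the loss is lower when the positive sits close to the anchor than when it sits near one of the negatives, which is exactly the behavior the optimization rewards.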

Supervised Contrastive Learning

It is easy to see that the InfoNCE loss, as formalized above, does not handle the case where more than one positive sample correlates with the anchor. That is because the InfoNCE loss was designed for the unsupervised setup we described above, where each anchor has exactly one positive. This means that, if we have a supervised or semi-supervised dataset, or if we know how to devise soft positives from the data (maybe through clustering), we would not be able to use them as positives in the loss above.

As a workaround, the authors of Supervised Contrastive Learning proposed an adaptation of the unsupervised InfoNCE loss. This slightly different formulation is designed to handle cases where we know, with certainty, that more than one pair belongs to the same class.
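One way to write this supervised variant for an anchor i in a mini-batch is sketched here. The notation is assumed for illustration: y denotes class labels, N_{y_i} is the number of other samples in the batch that share the anchor's label, s(·,·) is a similarity function, and τ is a temperature. The exact normalization differs slightly between published versions of the loss.

```latex
\mathcal{L}^{\mathrm{sup}}_{i} \;=\; \frac{-1}{N_{y_i}}
\sum_{j \neq i} \mathbb{1}\!\left[y_i = y_j\right]\,
\log \frac{\exp\!\big(s(z_i, z_j)/\tau\big)}
          {\sum_{k \neq i} \exp\!\big(s(z_i, z_k)/\tau\big)}
```

When every anchor has exactly one same-class partner in the batch, the sum over j has a single term and the expression reduces to the single-positive contrastive form.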

Despite being very similar, note that this loss, more specifically its numerator, takes into consideration all possible positive pairs in a mini-batch; that is what the condition yᵢ = yⱼ takes care of. In other words, not only the augmentation-based representation is used as a positive, but if the mini-batch contains other records that we know are from the same class as the anchor, they will also contribute as positive pairs.
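A minimal plain-Python sketch of this idea follows. It is not the authors' implementation: the function names, the cosine similarity on normalized embeddings, the temperature of 0.1, and the choice to average over anchors that have at least one positive are assumptions made for illustration. In the voice setting, the labels would be speaker ids.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean loss over anchors that have at least one same-label partner."""
    z = [normalize(e) for e in embeddings]
    n = len(z)
    per_anchor = []
    for i in range(n):
        # Scaled similarity between anchor i and every other sample in the batch.
        sims = {k: sum(a * b for a, b in zip(z[i], z[k])) / temperature
                for k in range(n) if k != i}
        # Positives are all other samples with the anchor's label (y_i == y_j).
        positives = [k for k in sims if labels[k] == labels[i]]
        if not positives:
            continue  # no same-class partner for this anchor in the batch
        m = max(sims.values())
        log_denominator = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average the log-probability over every positive of this anchor.
        log_prob = sum(sims[k] - log_denominator for k in positives) / len(positives)
        per_anchor.append(-log_prob)
    return sum(per_anchor) / len(per_anchor) if per_anchor else 0.0

# Four toy embeddings forming two tight clusters (hypothetical values).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_by_speaker = supervised_contrastive_loss(embeddings, ["a", "a", "b", "b"])
loss_shuffled = supervised_contrastive_loss(embeddings, ["a", "b", "a", "b"])
```

When the labels agree with the clusters the loss is close to zero, and it becomes much larger when the labels pair up embeddings from different clusters.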

This loss has been used with success to achieve state-of-the-art results for classification tasks in computer vision. Here, we are going to use it in the domain of audio recognition.

Methodology

We want to create a voice recognition system that can tell whether a given input signal is from someone previously registered or not. To do that, we propose a deep learning system with 2 learning steps. First, we are going to use the supervised version of the contrastive loss, and learn an encoder that can take audio signals as input and output vectors that describe the input. In this way, we want embedding vectors from the same speakers to be similar, and embeddings from different speakers to be as dissimilar as possible.

After learning the encoder, we are going to learn a linear classifier, on top of fixed representations from the encoder, to map embeddings to labels. Let’s first go over the dataset we used and the preprocessing steps needed to create the PyTorch dataset and train the system.

Dataset and Preprocessing

The Common Voice dataset, from Mozilla, is probably one of the most diverse and rich audio databases out there. The database contains as many as 2,454 recorded hours spread across short MP3 files. We used the English portion of the data, which is about 30GB and contains 780 validated hours of speech. One very good characteristic of this dataset is the vast variability of speakers. It contains recordings of men and women from a large variety of ages and accents. Most important, we also get the id of each speaker, which we are going to use as our labels.

In order to ensure a sizable working training dataset and well-distributed data, we only keep speakers with at least 40 sample recordings and use at most 50 recordings per speaker. Note that if a given speaker has more than 50 samples, we randomly choose 50 and discard the rest. This gives us a total of 15004 audio samples from 307 different speakers. For training, we start from the “train.tsv” file, which contains 63330 total records from 1609 different speakers. Thus, we only use roughly 24% of the total data (15004 of 63330 records) in our experiments.
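The filtering rule can be sketched in a few lines of plain Python. The record format (pairs of speaker id and file path), the function name select_samples, and the toy file names are hypothetical; only the thresholds of 40 and 50 come from the description above.

```python
import random
from collections import defaultdict

def select_samples(records, min_samples=40, max_samples=50, seed=0):
    """Keep speakers with at least min_samples clips, capped at max_samples each."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_speaker = defaultdict(list)
    for speaker_id, path in records:
        by_speaker[speaker_id].append(path)
    selected = {}
    for speaker_id, paths in by_speaker.items():
        if len(paths) < min_samples:
            continue  # too few recordings for this speaker
        if len(paths) > max_samples:
            paths = rng.sample(paths, max_samples)  # random subset without repeats
        selected[speaker_id] = paths
    return selected

# Toy records: one speaker above the cap, one inside the range, one below it.
records = ([("spk_a", f"a_{i}.mp3") for i in range(60)]
           + [("spk_b", f"b_{i}.mp3") for i in range(45)]
           + [("spk_c", f"c_{i}.mp3") for i in range(10)])
selected = select_samples(records)
```

Capping the number of clips per speaker keeps the classes roughly balanced, which matters because each speaker acts as one class label.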

After filtering the training data following the guidelines above, we proceed to create our preprocessing pipeline. First, we downsample the audio signals from their original sampling rate of 48000 Hz to 16000 Hz. This step reduces the overall size of each audio sample while preserving enough quality for speech. We proceed by removing silent frames using the librosa.effects.split() function. Here we used a threshold of 40 decibels, so any portion of the signal more than 40 dB below the reference level (by default, the signal’s peak) is treated as silence and thrown away.

Removing quasi-silent frames from audio signal.
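The thresholding idea behind the silence removal can be illustrated with a dependency-free sketch. This is a simplified stand-in for librosa.effects.split(), not its implementation: the function name non_silent_intervals, the non-overlapping frames, and the tiny frame length are assumptions chosen to keep the example readable.

```python
import math

def non_silent_intervals(signal, frame_length=4, top_db=40.0):
    """Return (start, end) sample ranges whose frame RMS is within top_db of the loudest frame."""
    # RMS energy of each non-overlapping frame.
    rms = []
    for start in range(0, len(signal), frame_length):
        frame = signal[start:start + frame_length]
        rms.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    peak = max(rms, default=0.0)
    if peak == 0.0:
        return []  # the whole signal is silent (or empty)
    # A frame counts as silent when it sits top_db or more below the peak.
    threshold = peak * 10 ** (-top_db / 20)
    intervals = []
    for index, value in enumerate(rms):
        if value <= threshold:
            continue
        start = index * frame_length
        end = min(start + frame_length, len(signal))
        if intervals and intervals[-1][1] == start:
            intervals[-1] = (intervals[-1][0], end)  # extend the previous interval
        else:
            intervals.append((start, end))
    return intervals

# Toy signal: silence, speech, near-silence, speech.
signal = [0.0] * 8 + [0.5, -0.5, 0.4, -0.4] * 2 + [0.0001] * 4 + [0.3, -0.3, 0.3, -0.3]
intervals = non_silent_intervals(signal)
voiced = [x for start, end in intervals for x in signal[start:end]]
```

Concatenating the returned intervals, as done for voiced here, yields the signal with its quasi-silent portions removed.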

Depending on the length of the audio signal, we split it into smaller samples, each no shorter than 2 seconds. Finally, each of these samples is augmented with random variations in pitch and time stretch. For these 2, we used the corresponding routines from librosa: pitch_shift() and time_stretch().

Before we start training, though, we need to create the correlated views, that is, the positive pairs for our loss function. For each of the audio samples previously processed, we randomly crop 2 smaller portions of 0.8 seconds. These 2 small samples from the same signal will act as the positive pair we want to optimize for.

In an unsupervised setup, we can create views by taking random sections of the same audio signal.
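The cropping step can be sketched as follows, assuming the signal is a plain sequence of samples at 16000 Hz. The function name random_crop_pair and the placeholder signal are hypothetical; at that rate, a 0.8-second crop corresponds to 12800 samples.

```python
import random

SAMPLE_RATE = 16_000   # Hz, the rate after downsampling
CROP_SECONDS = 0.8     # length of each view

def random_crop_pair(signal, sample_rate=SAMPLE_RATE, crop_seconds=CROP_SECONDS, rng=random):
    """Return two independently positioned crops of the same signal."""
    crop_len = int(round(sample_rate * crop_seconds))
    if len(signal) < crop_len:
        raise ValueError("signal is shorter than the requested crop")
    crops = []
    for _ in range(2):
        # randint is inclusive on both ends, so the crop always fits.
        start = rng.randint(0, len(signal) - crop_len)
        crops.append(signal[start:start + crop_len])
    return crops[0], crops[1]

# Two-second placeholder signal standing in for a preprocessed audio sample.
signal = [0.0] * (2 * SAMPLE_RATE)
view_a, view_b = random_crop_pair(signal, rng=random.Random(0))
```

Because the two start positions are drawn independently, the crops may overlap; whether that is desirable is a design choice this sketch leaves open.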

Lastly, before feeding the pairs to the network, we convert them to spectrograms using torchaudio, the PyTorch audio library. The spectrogram gives us 2D tensors that can be treated as regular 1-channel images. Note that the positive pair is formed by representations of the same person speaking a different portion of the same sentence.

And for the negatives, we simply use the other spectrograms from the same batch.

We used a ResNet18 as the base encoder and added a non-linear projection head to it. The model is trained using the supervised contrastive loss for 100 epochs with a batch size of 396. After contrastive training, we take the projection head off and append a new classification layer. Here, we train the model with regular cross-entropy loss for more than 100 epochs.

Results

Since our problem is not intended to have a fixed number of classes, we evaluate the system in a slightly different way. Our strategy is as follows.

We take a completely different set of audio files.
We split it into new training and testing subsets.
We append a new linear classifier on top of the feature encoder.
Then, we train this new classifier keeping the encoder weights fixed, that is, no gradient updates flow through the encoder.

Note that here, the new classification layer has a completely different number of classes. That is because the number of classes is equivalent to the number of unique speakers in the dataset. Since this is a different set, the number of unique speakers differs from the training set. Moreover, these second training and testing subsets have never been seen by the encoder. In fact, both the speakers and the sentences they recite have no intersection with the ones seen during training. This property ensures that we are really assessing the generalization power of the encoder features.

Following this protocol, we achieve accuracies of nearly 98.8% and 91.2% on the training and testing sets, respectively. The gap between the two indicates some overfitting, which can be reduced by using more regularization on the classification layer.

Note that even though the feature encoder was trained on completely different training data, it managed to generalize well to these new unseen records. Taking into consideration that we only used a small portion of the data for training the encoder, we could expect much better results from scaling both the training data and the network architecture.

Conclusions

Advances in unsupervised learning methods can also be used to improve supervised algorithms. That is the idea behind the supervised contrastive learning loss function. In this piece, we showed how to develop a supervised learning pipeline using the contrastive loss. We talked about the ideas of unsupervised learning, and the benefits of training encoders to learn representations from labeled and unlabeled data. Moreover, we applied the same ideas developed for Computer Vision to the signal processing domain. The code is available on GitHub and you can contact us for any inquiry.

Thanks for reading.

Acknowledgment

This piece was written by Thalles Silva with the Innovation Team at Daitan. Thanks to Fernando Moraes and Kathleen McCabe for reviews and insights.