Speaker Verification: Introduction to Siamese Network (Part 1)

Antonio Falabella
Data Reply IT | DataTech
4 min read · Jun 17, 2022

This post will give you a quick introduction to Siamese Neural Networks to help you decide whether they are the right solution for your problem.

1. Introduction

Nowadays, neural networks perform well on most tasks, but they rely on huge datasets to do so. For certain problems, like signature verification or face recognition, gathering a sufficient amount of data can be tricky, if not infeasible. Since we can't always rely on getting more data, a different kind of architecture can help: the Siamese Neural Network.

A Siamese Neural Network is a class of neural network architectures that contain two identical subnetworks. The two branches have the same layers with the same parameters and weights, and parameter updates are mirrored across both subnetworks. The network measures the similarity of its inputs by comparing their feature vectors, which makes it useful in many applications, such as image comparison.

Traditionally, a neural network learns to predict a fixed set of classes. This poses a problem when we need to add or remove classes: we have to update the network and retrain it on the whole dataset. Deep neural networks also need a large volume of data to train on. A Siamese Neural Network (SNN), on the other hand, learns a similarity function: we train it to tell whether two samples are the same or not, which lets us handle new "classes" of data without retraining the network.

2. What is a Siamese Neural Network?

The Siamese network architecture used in SigNet (source)

As previously mentioned, a Siamese neural network is a class of neural network architectures that contain two or more identical subnetworks. By "identical" we mean that they have the same layer configuration and share the same parameters and weights; even parameter updates are mirrored across the subnetworks. The subnetworks perform feature extraction on the inputs to produce comparable feature vectors, from which the network can learn a similarity function. In statistics, a similarity measure is a function that quantifies the similarity between two objects; it is, in some sense, the inverse of a distance metric.
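
To make the weight sharing concrete, here is a minimal PyTorch sketch (the layer sizes and the names EmbeddingNet/SiameseNet are illustrative assumptions, not SigNet's actual architecture). Because both branches call the same module, there is a single set of parameters, so every update is automatically mirrored:

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """One branch of the Siamese network: maps an input to a feature vector."""
    def __init__(self, in_features: int = 784, embedding_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 256),
            nn.ReLU(),
            nn.Linear(256, embedding_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class SiameseNet(nn.Module):
    """The two branches are literally the same module, so weights are shared."""
    def __init__(self, embedding_net: nn.Module):
        super().__init__()
        self.embedding_net = embedding_net

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        # Both inputs pass through the same subnetwork (shared parameters).
        return self.embedding_net(x1), self.embedding_net(x2)

model = SiameseNet(EmbeddingNet())
a, b = torch.randn(8, 784), torch.randn(8, 784)
emb_a, emb_b = model(a, b)
distance = torch.norm(emb_a - emb_b, dim=1)  # Euclidean distance per pair
```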

3. Pros & Cons

The main advantages of this kind of network are:

  • Thanks to its learning mechanism, a simple average of the feature vectors works considerably better than averaging two correlated supervised models
  • A few samples per class are enough for a Siamese network to recognize those classes in the future (a minimal enrollment sketch follows this list)
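
As an illustration of both points above, here is a hypothetical enrollment flow (assuming a trained embedding network like the one sketched in Section 2; the function names are mine): a new class is added by averaging the embeddings of a handful of samples into a prototype, and a query is assigned to the nearest prototype, with no retraining.

```python
import torch

@torch.no_grad()
def enroll(embedding_net, samples: torch.Tensor) -> torch.Tensor:
    """Average the embeddings of a few samples into a single class prototype."""
    return embedding_net(samples).mean(dim=0)

@torch.no_grad()
def classify(embedding_net, query: torch.Tensor, prototypes: dict) -> str:
    """Return the enrolled class whose prototype is closest to the query."""
    q = embedding_net(query.unsqueeze(0)).squeeze(0)
    return min(prototypes, key=lambda name: torch.norm(q - prototypes[name]).item())

# Hypothetical usage: enrolling a brand-new class needs a few samples, not a retrain.
# prototypes = {"alice": enroll(net, alice_samples), "bob": enroll(net, bob_samples)}
# prediction = classify(net, unknown_sample, prototypes)
```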

The drawbacks are:

  • Siamese networks require more training time than conventional networks, since the number of pairs to learn from grows quadratically with the dataset size (a quick count follows this list)
  • Since training is pairwise, the output of the network is a distance to each class rather than a prediction probability
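
To see where the quadratic cost comes from: N samples yield N·(N−1)/2 unordered pairs, so the set of training pairs grows much faster than the dataset itself.

```python
from itertools import combinations

for n in (10, 100, 1000):
    n_pairs = sum(1 for _ in combinations(range(n), 2))  # equals n * (n - 1) // 2
    print(f"{n} samples -> {n_pairs} training pairs")
# 10 samples -> 45 training pairs
# 100 samples -> 4950 training pairs
# 1000 samples -> 499500 training pairs
```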

4. Loss Functions

Since training this kind of network involves pairs of inputs, the classic classification losses (such as cross-entropy) cannot be used here. The main loss functions are the following.

Triplet Loss is a loss function in which an anchor input is used as a baseline and compared against a positive and a negative input. The distance between the anchor and the positive input is minimized, while the distance between the anchor and the negative input is maximized.

L(A, P, N) = max(‖f(A) − f(P)‖² − ‖f(A) − f(N)‖² + α, 0)

Triplet loss (source)

In this equation, α (alpha) is a margin term that enforces a gap between the distances of similar and dissimilar pairs in the triplet; f(A), f(P) and f(N) are the feature embeddings of the anchor, the positive and the negative input, respectively.
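
A direct translation of the equation into PyTorch (the batch shapes are illustrative; PyTorch also ships a ready-made nn.TripletMarginLoss, though it uses the plain distance rather than its square):

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, alpha: float = 0.2):
    """max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0), averaged over the batch."""
    d_pos = (f_a - f_p).pow(2).sum(dim=1)  # squared distance anchor-positive
    d_neg = (f_a - f_n).pow(2).sum(dim=1)  # squared distance anchor-negative
    return F.relu(d_pos - d_neg + alpha).mean()

f_a, f_p, f_n = (torch.randn(8, 64) for _ in range(3))
loss = triplet_loss(f_a, f_p, f_n)
```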

Contrastive Loss is a distance-based loss function, as opposed to more conventional losses based on prediction error. It is used to learn embeddings in which two similar points have a small Euclidean distance and two dissimilar points have a large one.

L(Y, X₁, X₂) = (1 − Y) · ½ · Dw² + Y · ½ · [max(0, m − Dw)]²

Contrastive loss function used by LeCun

The equation above is the contrastive loss used by LeCun, in which Dw is the Euclidean distance between the two embeddings, Y is the pair label (0 for a similar pair, 1 for a dissimilar one), and m is the margin.
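
Here is a direct translation of that formula into PyTorch, following the convention above (Y = 0 for a similar pair, Y = 1 for a dissimilar one); the margin value is an illustrative default:

```python
import torch

def contrastive_loss(emb1, emb2, y: torch.Tensor, margin: float = 1.0):
    """Contrastive loss: y = 0 for similar pairs, y = 1 for dissimilar pairs."""
    d_w = torch.norm(emb1 - emb2, dim=1)  # Euclidean distance D_w
    similar = (1 - y) * 0.5 * d_w.pow(2)  # pulls similar pairs together
    dissimilar = y * 0.5 * torch.clamp(margin - d_w, min=0).pow(2)  # pushes apart up to the margin
    return (similar + dissimilar).mean()

emb1, emb2 = torch.randn(8, 64), torch.randn(8, 64)
y = torch.randint(0, 2, (8,)).float()  # pair labels
loss = contrastive_loss(emb1, emb2, y)
```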

Visual representation of the Contrastive Loss (source)

5. Conclusion

In this article, we discussed how Siamese neural networks differ from normal deep learning networks, both in the definition of the loss and in the actual architecture of the network.

In the next article, I will present an architecture for speaker verification and a tutorial on how to implement this Neural Network. Stay tuned!

6. References

SigNet: Convolutional Siamese Network for Writer Independent Offline Signature Verification

Siamese Neural Networks for One-shot Image Recognition

The intuition of Triplet Loss
