Losses explained: Contrastive Loss

Maksym Bekuzarov
11 min read · Apr 19, 2020

--

This is a series of posts explaining different loss functions used for the task of Face Recognition/Face Verification.

An extremely brief introduction

There are 2 types of tasks for Face Verification/Face Recognition (Fig. 0).

Figure 0–Two types of Face Verification/Face Recognition task. (Image shamelessly stolen from here)

The first one is the so-called “Closed-set” task. This is basically a straightforward classification problem, where we need to map the input face image to one of the N classes (people). You can’t add new people and you can’t exclude existing ones, so this formulation of the Face Verification/Face Recognition problem is not very realistic and quite limited in terms of practical usage. The model here is trying to learn separable features, i.e. features that would allow assigning a label from a predefined set to a given image. The model is trying to find a hyperplane, a rule that separates the given classes in space.

The second one is the so-called “Open-set” task. Here we do have some predefined set of people for training, but the model can be applied to any unseen data and it should generalize. In this case the model is trying to solve a metric-learning problem: to learn some sort of similarity metric, and for that it needs to extract discriminative features, features that can be used to distinguish between different people on any two (or more) images. The model is trying not to separate images with a hyperplane, but rather to reorganize the input space, pulling similar images together into some form of a cluster while pushing dissimilar images away.
This is somewhat reminiscent of the clustering problem in Unsupervised Learning, and indeed you can use a model trained on a metric-learning task to create a distance matrix for new data, and then run algorithms like DBSCAN on it to, e.g., cluster images of people’s faces, where each cluster would correspond to a new person.
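As a quick illustration of that last point, here is a minimal sketch of clustering embeddings through a precomputed distance matrix with DBSCAN; the random embeddings and the eps value are placeholders for real model outputs and a properly tuned threshold:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 128))   # stand-in for embeddings of 100 face images

# Pairwise Euclidean distances between all embeddings
dist_matrix = cdist(embeddings, embeddings)

# eps plays the role of "how close two images of the same person should be"
labels = DBSCAN(eps=0.6, min_samples=3, metric="precomputed").fit_predict(dist_matrix)
print(labels)  # one cluster id per image; -1 means the image was not assigned to any cluster
```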

Contrastive Loss: Background

Contrastive loss was first introduced in 2005 by Yann LeCun et al. in this paper, and its original application was Dimensionality Reduction. Now, if you recall, the general goal of a Dimensionality Reduction algorithm can be formulated like this:

Given a sample (a data point), a D-dimensional vector, transform it into a d-dimensional vector, where d ≪ D, while preserving as much information as possible.

LeCun’s case was a little more narrow: he needed a way to learn a parametric mapping function (from D to d dimensions) with the following constraints:

  1. This mapping should preserve neighborhood relationships between data points.
    E.g. if two data points were similar before the transformation, they should be close to each other after it, i.e. the distance between them after the transformation should be small.
    If two data points were dissimilar, they should end up far away from each other, i.e. the distance between them after the transformation should be large.
  2. This mapping should generalize to new, unseen data.

Now, how do we know whether two data points are in fact similar? This comes from our prior knowledge. For example, for the Face Verification task we need to tell whether two given pictures contain the same person, so it seems logical to call two pictures similar if they show the same person, and dissimilar if they show two different people. Basically, these are the labels for our data.

So we need to learn some tricky parametric function that operates on high-dimensional data like images or even video. Sounds like a job for a Neural Network!

Indeed, this is how Face Verification can be implemented: a CNN (convolutional neural network) is trained to map input images of different people to vectors of real numbers (also called “feature vectors” or “embeddings”), for example 128-d vectors, in such a way that the embeddings of photos of the same person are very close to each other (in terms of e.g. Euclidean distance, cosine similarity or some other metric), and embeddings of photos of different people are far from each other.
And to verify that the person in two images is indeed the same person, you run your neural network on both of them and calculate the distance between the obtained embeddings. If this distance is small, it is likely the same person; if it is large, these are most probably two different people.
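A minimal PyTorch sketch of that verification step could look like this; embed_net stands for the trained CNN, and the threshold value is an arbitrary placeholder:

```python
import torch

def same_person(embed_net, img1, img2, threshold=1.0):
    """Return True if two face images (C, H, W tensors) likely show the same person."""
    embed_net.eval()
    with torch.no_grad():
        e1 = embed_net(img1.unsqueeze(0))   # (1, 128) embedding of the first image
        e2 = embed_net(img2.unsqueeze(0))   # (1, 128) embedding of the second image
    distance = torch.norm(e1 - e2, p=2)     # Euclidean distance between the embeddings
    return distance.item() < threshold      # small distance -> likely the same person
```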

Okay, but how do you train such a network? That’s where Contrastive Loss comes into play.

Contrastive Loss: The function

The general formula for Contrastive Loss is shown at Fig. 1.

Figure 1 — Generalized Contrastive Loss

The Y term here specifies whether the two given data points (X₁ and X₂) are similar (Y=0) or dissimilar (Y=1).
The Ls term in Fig. 1 stands for the loss function that should be applied when the given samples are similar, and the Ld term is the loss function to apply when the given data points are dissimilar.
The Dw term in parentheses is the similarity (or, rather, dissimilarity) between the two transformed data points, defined by LeCun like so:

Figure 2 — Distance measure between transformed data points

The G in this formula stands for the mapping function itself, i.e. a Neural Network in our case. This is a regular Euclidean distance calculated between the outputs of the Neural Network, which is what LeCun used in the paper; however, as far as I understand, you can use other similarity metrics like Manhattan distance, cosine similarity, etc.
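In case the formulas in the figures do not render well, here is how I would transcribe them (W denotes the parameters of the mapping G, and Y is the dissimilarity label as above):

```latex
% My transcription of the generalized loss (Fig. 1) and the distance measure (Fig. 2)
L(W, Y, X_1, X_2) = (1 - Y)\, L_S\big(D_W(X_1, X_2)\big) + Y\, L_D\big(D_W(X_1, X_2)\big)

D_W(X_1, X_2) = \lVert G_W(X_1) - G_W(X_2) \rVert_2
```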

The formula in Fig. 1 is highly reminiscent of the Cross-entropy loss: it has the same structure. The difference is that Cross-entropy loss is a classification loss which operates on class probabilities produced by the network independently for each sample, while Contrastive loss is a metric-learning loss which operates on the embeddings produced by the network and on their positions relative to each other. This is also part of the reason a cross-entropy loss is not usually used for metric-learning tasks like Face Verification: it doesn’t impose any constraints on the distribution of the model’s internal representation of the given data, i.e. the model can learn any features regardless of whether similar data points end up located close to each other after the transformation or not.

The exact loss function LeCun came up with is presented in Fig. 3.

Figure 3 — Actual Contrastive Loss function
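For reference, this is my transcription of that loss from the original paper; note that the paper squares both terms and scales them by ½, a detail the simplified discussion below omits:

```latex
% Contrastive loss as given in the paper (m is the margin, Y = 1 for dissimilar pairs)
L(W, Y, X_1, X_2) = (1 - Y)\,\tfrac{1}{2}\, D_W^2 + Y\,\tfrac{1}{2}\,\big(\max(0,\; m - D_W)\big)^2
```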

So Ls (the loss for similar data points) is essentially just Dw, the distance between them: if two data points are labeled as similar, we minimize the Euclidean distance between them.

Ld, on the other hand, needs some explanation. One may think that for two dissimilar data points we just need to maximize the distance between them, i.e. minimize something like {1/Dw}. But why didn’t LeCun just use {1/Dw}?

Time for a little visualization. Let’s say we have some data point (blue dot) and a couple of other data points, which are similar to it (black dots) and dissimilar (white dots), as in Fig. 4. We would naturally like to pull the black dots closer to the blue dot and push the white dots farther away from it. Specifically, we would like to minimize the intra-class distances (blue arrows) and maximize the inter-class distances (red arrows).

Figure 4 — We would like to bring black dots closer to the blue one, and push white dots away.

What we would like to achieve is to make sure that for each class/group of similar points (in the case of the Face Recognition task it would be all the photos of the same person) the maximum intra-class distance is smaller than the minimum inter-class distance. What this means is that if we define some radius/margin m, all the black dots should fall inside this margin, and all the white dots outside of it (Fig. 5). This way we would be able to use a nearest-neighbour algorithm for new data: if a new data point lies within distance m of another, they are similar/belong to the same group/class. The same goes for Face Recognition: if a new face image is located within distance m of another, the two images likely show the same person.

Figure 5 — What we would like the algorithm to do. Notice how the white dots that were outside weren’t moved farther away from the margin.

So we need to make sure that the black dots are inside the margin m, and the white dots are outside of it. And that’s exactly what the function proposed by LeCun does! In Fig. 6 you can see that the right part of the loss penalizes the model when dissimilar data points have a distance Dw < m between them. If Dw ≥ m, the {m - Dw} expression is negative and the whole right part of the loss function is thus 0 due to the max() operation; the gradient is also 0, i.e. we don’t force the dissimilar points farther away than necessary.

Figure 6 — Again, the loss function itself, so that you don’t have to scroll back.
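As a concrete illustration, here is a short PyTorch sketch of this loss, written from the formula above rather than taken from any reference implementation; the default margin of 1.0 is arbitrary:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, y, margin=1.0):
    """emb1, emb2: (batch, d) embeddings; y: (batch,) labels with 0 = similar, 1 = dissimilar."""
    dist = F.pairwise_distance(emb1, emb2)              # Euclidean distance Dw for each pair
    loss_sim = (1 - y) * 0.5 * dist.pow(2)              # pull similar pairs together
    loss_dis = y * 0.5 * torch.clamp(margin - dist, min=0).pow(2)  # push dissimilar pairs apart, but only up to the margin
    return (loss_sim + loss_dis).mean()
```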

Well, why don’t we use {1/Dw}, why don’t we keep forcing white (dissimilar) points away from the similar ones? And why do we not use this “margin” concept for similar points as well? A term like {max(0, Dw − m)} would work for similar points, penalizing the model for any similar points lying outside of the margin. There is no direct answer in the original paper, but LeCun gives some intuition for his choice of terms in the loss function. And it all revolves around the concept of “Equilibrium”.

The idea is somewhat similar to the one used in Generative Adversarial Networks: an update driven by the left term of the function (the similar-points loss) makes the right term (the dissimilar-points loss) worse. If you are pulling similar points together for one class, you are inevitably going to attract some dissimilar points as well, which would increase the dissimilar-points loss. The reverse is also true: if you are pushing all the dissimilar points as far away as possible, you may also push away some similar points, which would increase the similar-points loss. In the iterative training process there comes a point where these two losses stabilize, i.e. each new update barely moves the data points around. This state is called the Equilibrium point.

And the terms in LeCun’s proposed function were empirically chosen to make reaching this Equilibrium point easier.

  • If we were to use the same “margin” concept for similar points, we wouldn’t force the similar points to be as close to each other as possible; a lot of them would end up located near the margin and could easily be pushed out of it, which would make the training unstable and difficult to converge.
  • If we were to use the {1 / Dw} term for the dissimilar points, then we would continue pushing away white dots for eternity (in theory), or at least for a very long time, even when the results are already separable and usable for nearest-neighbour classification. This would also make it difficult to reach the Equilibrium point; it is simply unnecessary and may push dissimilar points TOO far away, which may worsen the generalization performance of the model.

Again, this is just my intuition, and LeCun could have had something different in mind while creating this loss function.

Contrastive Loss: Model architecture

Using Contrastive Loss requires having a Siamese network architecture.

It looks something like this (Fig. 7):

Figure 7 — Siamese network architecture

You have a single convolutional neural network that gets applied to two images (with shared weights), the loss is calculated on its outputs, and then the backpropagation algorithm is run.
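A rough sketch of such a training setup is shown below; the tiny backbone, the pair_loader (assumed to yield batches of image pairs with 0/1 similarity labels) and the hyperparameters are all placeholders, and contrastive_loss is the function sketched earlier:

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Toy CNN mapping a 3-channel image to a 128-d embedding (a stand-in for a real backbone)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, 128)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

net = EmbeddingNet()
optimizer = torch.optim.SGD(net.parameters(), lr=1e-2)

for img1, img2, y in pair_loader:      # y = 0 for "same person", 1 for "different people"
    emb1 = net(img1)                   # the SAME network (shared weights) processes both images
    emb2 = net(img2)
    loss = contrastive_loss(emb1, emb2, y.float())
    optimizer.zero_grad()
    loss.backward()                    # gradients from both branches accumulate in the shared weights
    optimizer.step()
```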

Indeed, this is what LeCun used back in 2005! Here’s a screenshot from the original paper (Fig. 8) and the quote:

Figure 8 — LeNet-style architecture LeCun used in this work for the MNIST dataset.

Called a siamese architecture, it consists of two copies of the function Gw which share the same set of parameters W, and a cost module. A loss module whose input is the output of this architecture is placed on top of it. The input to the entire system is a pair of images (X1, X2) and a label Y. The images are passed through the functions, yielding two outputs G(X1) and G(X2). The cost module then generates the distance Dw(Gw(X1), Gw(X2)). The loss function combines Dw with label Y to produce the scalar loss Ls or Ld, depending on the label Y. The parameter W is updated using stochastic gradient. The gradients can be computed by back-propagation through the loss, the cost, and the two instances of Gw. The total gradient is the sum of the contributions from the two instances.

LeCun took two digits out of the MNIST dataset and trained a network to distinguish them without explicitly telling it which digit is which, just giving the network pairs of same/different digit images. He trained it to produce 2D embeddings, then ran the network on test data (new, unseen samples) and plotted the results (Fig. 9). These results are a little old by now, but they show that the network was able to learn good, discriminative features that separated the two classes pretty well, and it also generalized well to new data.

Figure 9 — Distribution of MNIST image embeddings. Points with images attached are from the test dataset, i.e. unseen data.
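A rough sketch of how such same/different pairs for two MNIST digits could be built (this is my reconstruction of the setup, not the paper's code; the choice of digits 4 and 9 is just an example):

```python
import random
from torchvision import datasets, transforms

mnist = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())

# Keep only two digit classes; the labels are used solely to derive the same/different flag
subset = [(img, label) for img, label in mnist if label in (4, 9)]

def random_pair():
    """Sample a pair of images plus a similarity flag (0 = same digit, 1 = different digits)."""
    (img_a, lab_a), (img_b, lab_b) = random.sample(subset, 2)
    y = 0 if lab_a == lab_b else 1
    return img_a, img_b, y
```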

There are also a lot of other comparisons and plots in the original paper, but I will leave them to you.

Summary

  1. Contrastive Loss is a metric-learning loss function introduced by Yann LeCun et al. in 2005.
  2. It operates on pairs of embeddings produced by the model and on a ground-truth similarity flag: a Boolean label specifying whether the two samples are “similar” or “dissimilar”. So the input is not one image, but two.
  3. It penalizes “similar” samples for being far from each other in terms of Euclidean distance (although other distance metrics could be used).
  4. “Dissimilar” samples are penalized for being too close to each other, but in a somewhat different way: Contrastive Loss introduces the concept of a “margin”, a minimal distance that dissimilar points need to keep. So it penalizes dissimilar samples only for being closer than the given margin.
  5. It can be used for Face Verification (and Face Recognition) task(s) in the following way:
    - You train the model on pairs of face images (of both the same and different people)
    - You calculate the embedding #1 of the given face.
    - You calculate the embedding #2 of the new face that you are trying to verify.
    - You calculate the distance between these 2 embeddings (using the same metric you used in training, in this particular case it’s Euclidean distance)
    - If the distance is smaller than the margin specified during training, these are likely the images of the same person.
    - If the distance is larger than the margin, these two images likely contain two different people.
    - Or just use a nearest-neighbour algorithm to find the most similar face (see the sketch below).
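For that last nearest-neighbour option, a minimal sketch with scikit-learn could look like this; the gallery of known embeddings and names is made up purely for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Embeddings of faces we already know, one 128-d vector per enrolled photo
gallery_embeddings = np.random.randn(50, 128)
gallery_names = [f"person_{i}" for i in range(50)]

index = NearestNeighbors(n_neighbors=1, metric="euclidean").fit(gallery_embeddings)

query_embedding = np.random.randn(1, 128)       # embedding of the face we want to identify
dist, idx = index.kneighbors(query_embedding)
print(gallery_names[idx[0, 0]], dist[0, 0])     # closest known face and how far away it is
```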
