The intuition of Triplet Loss

Susmith Reddy · Analytics Vidhya · Jan 12, 2019 · 5 min read

Many of us feel machine learning is a black box that takes some input and gives out some fantastic output. In recent years, this same black box has been working wonders, mimicking humans in the fields where it is applied.

But in my experience, going deep into this black box is fascinating, fun, and sometimes frustrating (😜). It has achieved so many things that none of us expected a decade ago, and the most fun part of ML is understanding what it does behind the scenes to create those wonders.

Motivation

Recently, when I came across a face recognition model named FaceNet, I was astonished by how accurately it recognized faces with just single-shot training. I was really curious to understand what happens behind the scenes of this model. After reading a bit about FaceNet, the hero, for me, is the loss function it uses, which is none other than the triplet loss function. I was amazed that a small, intuitive thought process, one that happens inside our brains almost every second, could solve such extraordinary problems. This is what motivated me to write this article.

Let’s Go

Here I try to explain my understanding of the triplet loss function. This loss function became popular after FaceNet, a model created by Google that uses triplet loss under the hood, became a state-of-the-art model in face recognition.

I’ll try to give you an intuitive understanding of triplet loss with an analogy. Let’s assume you have two friends (call them A and B), and all three of you study in the same class (and hence the same subjects). It is a known fact that A is the topper of the class. On the day of the results, you scored 50 marks, while your two friends A and B scored 95 and 93 marks, respectively. When your parents come to know the results, and assuming they know A is the topper, they naturally infer that B is also a topper, because the difference between B’s marks and the topper’s (A’s) is small, and that you are a NOOB😔😔, because the difference between your marks and the topper’s (A’s) is large.

We humans make this kind of classification/inference almost all the time by following a simple rule: elements belonging to the same category possess similar characteristics (in our scenario, marks).

Triplet loss uses the same logic, i.e., it tries to reduce the distance/deviation between similar things and increase it between different things.

The triplet loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.

Let’s understand the above diagram by comparing it with our scenario. A is the anchor, as it is a known fact that he is the topper (his identity is fixed). You are considered the negative (since the difference between your marks and the anchor’s is large). B is considered the positive (since the difference between B’s marks and the anchor’s is small). So, while training a model to classify, we tweak the weights of the parameters to minimize the triplet loss, i.e., to reduce the difference between similar things and increase the difference between different things.

Now we are ready to understand how triplet loss works in the FaceNet model. During the training phase of the FaceNet model, every input consists of 3 face images. Two of these images are of the same person (one is taken as the anchor and the other as the positive), and the last image is of a different person (the negative). The FaceNet model processes every image of a human face and encodes its features in a 128-dimensional space, i.e., it gives out a vector of size 128.

Encoding an image to a 128-dimensional vector

Following the thought process of the analogy above, we can say two faces are different if the distance between their encoded points in the 128-dimensional space is high, and the same if the distance is low (generally, we keep a threshold to decide whether a distance counts as high or low). So, the model adjusts its weights in such a way that the distance between the encoded points of

  • Anchor image & Positive image is low.
  • Anchor image & Negative image is high.
FaceNet workflow
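At inference time, verification then reduces to a simple distance check between two embeddings. Here is a minimal sketch in NumPy; the threshold value and all names are illustrative, not the ones the real FaceNet system uses:

```python
import numpy as np

def same_person(embedding_a, embedding_b, threshold=1.0):
    """Decide whether two face embeddings belong to the same identity.

    Returns True when the Euclidean distance between the two encoded
    points in embedding space falls below the chosen threshold.
    """
    distance = np.linalg.norm(embedding_a - embedding_b)
    return distance < threshold

# Stand-ins for real face embeddings: two nearby 128-d points (same person)
# and one far-away point (a different person).
rng = np.random.default_rng(0)
face_a = rng.normal(size=128)
face_a_again = face_a + 0.01 * rng.normal(size=128)  # slight variation of face_a
face_b = rng.normal(size=128)                        # unrelated embedding

print(same_person(face_a, face_a_again))  # nearby points -> same identity
print(same_person(face_a, face_b))        # distant points -> different identity
```

In practice the threshold is tuned on a validation set of known same/different pairs; here it is just a placeholder.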

Now we are in a position to understand the mathematical equation of the Triplet Loss Function.

Mathematical Equation of Triplet Loss Function.
  • f(x) takes an image x as input and returns a 128-dimensional embedding vector.
  • i denotes the i-th training triplet.
  • Superscript a indicates the anchor image, p the positive image, and n the negative image.
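Putting the notation together, the equation (reconstructed here from the FaceNet paper) reads:

```latex
L = \sum_{i=1}^{N} \left[ \left\lVert f(x_i^a) - f(x_i^p) \right\rVert_2^2
  - \left\lVert f(x_i^a) - f(x_i^n) \right\rVert_2^2 + \alpha \right]_{+}
```

where [·]₊ = max(·, 0) clips negative values to zero and α is the margin.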

Our objective is to minimize the above equation, which implicitly means:-

Minimizing the first term → the distance between the anchor and positive images.

Maximizing the second term (since it has a negative sign before it) → the distance between the anchor and negative images.

The third term is a margin (denoted α in the FaceNet paper) which acts as the threshold (we can ignore it here).
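The three terms can be translated into code almost directly. Here is a minimal NumPy sketch for a single triplet; the margin value and all names are my own for illustration, not from any FaceNet implementation:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss for one (anchor, positive, negative) embedding triple.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin`; otherwise it grows linearly.
    """
    pos_dist = np.sum((anchor - positive) ** 2)  # first term: anchor-positive distance
    neg_dist = np.sum((anchor - negative) ** 2)  # second term: anchor-negative distance
    return max(pos_dist - neg_dist + margin, 0.0)

# Toy 3-d "embeddings" (real FaceNet embeddings are 128-d)
anchor = np.array([1.0, 0.0, 0.0])
positive = np.array([0.9, 0.1, 0.0])  # close to the anchor
negative = np.array([0.0, 1.0, 0.0])  # far from the anchor

print(triplet_loss(anchor, positive, negative))  # already well separated -> 0.0
```

Note the clipping at zero: once a triplet is already separated by more than the margin, it contributes nothing, so training focuses on the triplets that still violate the constraint.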

I hope this article helped you understand triplet loss😃.

Feel free to suggest improvements & ask questions.

Happy Machine Learning !!

References & Further Reading:
[1] Florian Schroff, Dmitry Kalenichenko, James Philbin, FaceNet: A Unified Embedding for Face Recognition and Clustering

Image References:
1) Google
2) FaceNet paper: https://arxiv.org/pdf/1503.03832.pdf
