Introduction to FaceNet: A Unified Embedding for Face Recognition and Clustering
Illustration of the FaceNet research paper
Facial recognition is one of the most exciting applications of deep learning. The adoption of facial recognition systems has risen sharply in recent years, but the technology has also come under a lot of scrutiny lately. Many AI aficionados think that the use of any kind of facial recognition system should be properly regulated in order to prevent nefarious activities.
Threats such as data leakage and privacy violations that originate from careless use of facial recognition systems are real, and proper measures should be taken to avoid them. Even after all the recent criticism, though, you have to admit that it is still a useful application that can be widely used to make people’s lives better.
FaceNet provides a unique architecture for performing tasks like face recognition, verification and clustering. It uses deep convolutional networks along with triplet loss to achieve state-of-the-art accuracy.
In this article, I will explain the concepts used in the FaceNet research paper. I have divided this article into the following sections:
- Triplet Loss and Selection
- Deep Learning Basics (SGD, AdaGrad and ReLU)
- CNN Architectures
Prerequisite: a basic understanding of CNNs.
FaceNet provides a unified embedding for face recognition, verification and clustering tasks. It maps each face image into a Euclidean space such that distances in that space correspond to face similarity, i.e. an image of person A will be placed closer to all the other images of person A than to images of any other person present in the dataset.
The main difference between FaceNet and other techniques is that it learns the mapping from images to embeddings directly, rather than using an intermediate bottleneck layer as the representation for recognition or verification tasks. Once the embeddings are created, all the other tasks like verification and recognition can be performed using standard techniques of that particular domain, with these newly generated embeddings as the feature vectors. For example, we can use k-NN for face recognition with the embeddings as feature vectors, and similarly we can use any clustering technique to cluster the faces together. For verification, we just need to define a distance threshold.
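As a rough sketch of how embeddings can be reused with standard techniques, the snippet below does verification with a distance threshold and recognition with k-NN in plain Python. The 2-D embeddings, the labels, the threshold of 0.1 and the helper names are all made up for illustration; in practice the embeddings would come from a trained FaceNet model.

```python
from collections import Counter


def squared_l2(u, v):
    """Squared Euclidean (L2) distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def same_person(emb1, emb2, threshold):
    """Verification: accept the pair if the embeddings are close enough."""
    return squared_l2(emb1, emb2) <= threshold


def knn_identify(query, gallery, k=3):
    """Recognition: majority vote among the k nearest labelled embeddings."""
    nearest = sorted(gallery, key=lambda item: squared_l2(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy gallery of (embedding, identity) pairs, invented for illustration.
gallery = [
    ([0.0, 1.0], "A"), ([0.1, 0.9], "A"), ([0.2, 1.0], "A"),
    ([1.0, 0.0], "B"), ([0.9, 0.1], "B"), ([1.0, 0.2], "B"),
]
query = [0.05, 0.95]

print(same_person(query, gallery[0][0], threshold=0.1))  # close to an "A" image
print(knn_identify(query, gallery))                      # majority label of 3 neighbours
```

The point of the sketch is that nothing here is specific to faces: once the embedding exists, verification and recognition reduce to ordinary distance computations.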
So, the most important thing to note here is that FaceNet doesn’t define any new algorithm to carry out the aforementioned tasks. Rather, it just creates the embeddings, which can be directly used for face recognition, verification and clustering.
FaceNet uses a deep convolutional neural network (CNN). The network is trained such that the squared L2 distances between the embeddings correspond to face similarity. The images used for training are scaled, transformed and tightly cropped around the face area.
Another important aspect of FaceNet is its loss function. It uses the triplet loss function (refer to Fig 1). In order to calculate the triplet loss, we need 3 images, namely anchor, positive and negative. We will explore triplet loss in detail in the next section.
Triplet Loss and Selection
The intuition behind the triplet loss function is that we want our anchor image (an image of a specific person A) to be closer to positive images (all the other images of person A) than to negative images (images of any other person).
In other words, we want the distances between the embedding of our anchor image and the embeddings of our positive images to be smaller than the distances between the embedding of our anchor image and the embeddings of our negative images.
Triplet loss function can be formally defined as —
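In the notation of the FaceNet paper, where f(x) is the embedding of image x, N is the number of triplets and [z]+ means max(z, 0), the loss can be written as:

```latex
L = \sum_{i=1}^{N} \left[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2
    - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \alpha \right]_+
```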
If you don’t understand the formula, don’t worry about it too much. I will explain each and every term. Just remember the intuition behind the formula and it will become very easy to remember.
Here, the superscripts a, p and n correspond to the anchor, positive and negative images respectively.
Alpha is defined here as the margin between positive and negative pairs. It is essentially a threshold on how much farther the negative must be from the anchor than the positive is. If, let’s say, alpha is set to 0.5, then we want the anchor-negative distance to exceed the anchor-positive distance by at least 0.5.
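To make the role of the margin concrete, here is a minimal sketch of the loss for a single triplet in plain Python. The 2-D embeddings and the function names are invented for illustration; alpha defaults to 0.2, the value FaceNet uses for training.

```python
def squared_l2(u, v):
    """Squared Euclidean (L2) distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Loss for one triplet: zero once the negative is at least alpha farther away."""
    pos_dist = squared_l2(anchor, positive)
    neg_dist = squared_l2(anchor, negative)
    return max(pos_dist - neg_dist + alpha, 0.0)


# Toy embeddings (assumptions for illustration only).
anchor = [0.0, 1.0]
positive = [0.0, 0.8]        # squared distance to anchor is about 0.04
easy_negative = [1.0, 0.0]   # squared distance to anchor is 2.0
hard_negative = [0.0, 0.7]   # squared distance to anchor is about 0.09

print(triplet_loss(anchor, positive, easy_negative))  # constraint satisfied, loss is 0.0
print(triplet_loss(anchor, positive, hard_negative))  # constraint violated, loss is about 0.15
```

The easy negative contributes nothing to the loss, which is why triplets like it carry no learning signal.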
Choosing the right triplets is extremely important, as a lot of triplets will already satisfy this condition. Our model won’t learn much from those, and it will converge slowly because of them.
In order to ensure fast convergence, it is crucial to select triplets that violate the triplet constraint.
We essentially want to select the following —
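Written out in the paper's notation, with f(x) denoting the embedding of image x, the two selections are:

```latex
\begin{align}
\operatorname*{argmax}_{x_i^p} \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 \tag{1} \\
\operatorname*{argmin}_{x_i^n} \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 \tag{2}
\end{align}
```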
If the above equations are unclear, let me clarify them.
Eq (1) means that given an anchor image of person A, we want to find a positive image of A such that the distance between those two images is largest.
Eq (2) means that given an anchor image of person A, we want to find a negative image such that the distance between those two images is smallest.
So, we are just selecting the hard positives and hard negatives here. This approach speeds up convergence, as our model learns useful representations. But there is a problem associated with this approach: it is computationally infeasible to compute the hard positives and hard negatives over the entire dataset.
A clever workaround here is to compute the hard positives and negatives within a mini-batch. Here, we will choose around 1000–2000 samples (in most experiments the batch size was around 1800).
In order to have a meaningful representation of the anchor-positive distances, we have to ensure that there are a minimal number of samples of any one identity in each mini-batch. We will select around 40 faces per identity per mini-batch. Additionally, randomly sampled negative faces are added to each mini-batch.
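A minimal sketch of hard positive and hard negative selection within one mini-batch might look like the following. The tiny batch, the labels and the function name `hardest_triplet` are illustrative assumptions, not the paper's actual training pipeline.

```python
def squared_l2(u, v):
    """Squared Euclidean (L2) distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def hardest_triplet(anchor_idx, embeddings, labels):
    """Return (hard_positive_idx, hard_negative_idx) for one anchor in a mini-batch.

    The hard positive is the same-identity sample farthest from the anchor;
    the hard negative is the different-identity sample closest to it.
    """
    anchor = embeddings[anchor_idx]
    identity = labels[anchor_idx]
    positives = [i for i, lab in enumerate(labels)
                 if lab == identity and i != anchor_idx]
    negatives = [i for i, lab in enumerate(labels) if lab != identity]
    hard_pos = max(positives, key=lambda i: squared_l2(anchor, embeddings[i]))
    hard_neg = min(negatives, key=lambda i: squared_l2(anchor, embeddings[i]))
    return hard_pos, hard_neg


# Toy mini-batch: three images of person A, two of person B (made-up values).
embeddings = [[0.0, 0.0], [0.1, 0.0], [0.5, 0.0], [0.6, 0.0], [2.0, 0.0]]
labels = ["A", "A", "A", "B", "B"]

print(hardest_triplet(0, embeddings, labels))  # (2, 3)
```

This only works when each identity has several samples in the batch, which is why a minimum number of faces per identity is enforced.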
Deep Learning Basics
FaceNet trains CNNs using Stochastic Gradient Descent (SGD) with standard backprop and AdaGrad. The initial learning rate is 0.05, alpha (the triplet loss margin) is set to 0.2 and ReLU is chosen as the activation function.
I know that I have thrown a lot of jargon around here and it might put off someone who is new to this field, so I will try to briefly explain all the above concepts.
Stochastic Gradient Descent —
It is an optimisation technique that is used to minimise our loss function.
The two axes (x and y) represent weights and the third axis (z) represents the loss with respect to those two weights.
Let’s label this red point as point A. We will start our journey from this point A. The intuition behind SGD is that we want to traverse this hill-like structure in such a way that we reach the global minimum (the lowest point of this hill). Now you might have understood the Descent part of SGD. So now let’s focus on the Gradient part.
The gradient just gives us the direction of steepest ascent in an n-dimensional space (similar to a derivative, which gives the slope of a curve in one dimension).
The key thing to note here is that it gives us the direction of steepest ascent, not descent, so, we take the negative of the value given by this gradient in order to move down the hill.
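The idea can be sketched in a few lines of plain Python. The toy loss below, with its minimum at (3, -1), is an assumption chosen for illustration; the learning rate of 0.05 matches the initial rate mentioned earlier. For simplicity this uses the exact gradient at every step, whereas SGD would estimate it from a mini-batch.

```python
def loss(w):
    """Toy bowl-shaped loss over two weights (illustrative only)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2


def gradient(w):
    """Gradient of the toy loss: points in the direction of steepest ascent."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


def descend(w, learning_rate=0.05, steps=200):
    """Repeatedly step against the gradient to move downhill."""
    for _ in range(steps):
        g = gradient(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w


start = [0.0, 0.0]            # our "point A"
final = descend(start)

print(loss(start))            # 10.0 at the starting point
print(final, loss(final))     # weights close to (3, -1), loss close to 0
```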
AdaGrad is used to generate variable learning rates. A single fixed learning rate often does not work well in deep learning. In the case of CNNs, where each layer is used to detect a different feature (edges, patterns etc.), a fixed learning rate is a poor fit, as different layers in our network require different learning rates to work optimally. To better understand AdaGrad, let’s look at a few equations.
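One common way to write the SGD and AdaGrad updates, numbered to match the explanation that follows, is given here. The small constant ε (added to avoid division by zero) and the exact placement of the indices are assumptions of this sketch, and the accumulator Ɑ (written \alpha below) is unrelated to the triplet loss margin.

```latex
\begin{align}
w_t &= w_{t-1} - \eta \, \frac{\partial L}{\partial w_{t-1}} \tag{1} \\
w_t &= w_{t-1} - \eta'_t \, \frac{\partial L}{\partial w_{t-1}} \tag{2} \\
\eta'_t &= \frac{\eta}{\sqrt{\alpha_{t-1} + \epsilon}} \tag{3} \\
\alpha_{t-1} &= \sum_{i=1}^{t-1} \left( \frac{\partial L}{\partial w_i} \right)^2 \tag{4}
\end{align}
```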
If you are still reading this article even after seeing so much maths, then I guess you are pretty inquisitive. So let me help you better understand these equations by breaking them up and explaining them.
Eq (1) — It is just the regular weight update equation of SGD. Here we are using a fixed learning rate (η).
Eq (2) — It is the weight update equation of AdaGrad. In this case we are using a variable learning rate (η’t).
Eq (3) — It determines the formula for calculating the variable learning rate.
Eq (4) — It determines the formula for calculating Ɑt-1.
Ɑt-1 is just the sum of the squares of the gradients up to iteration t-1, where ‘t’ is the iteration number. So we just calculate the gradient at each step and add the squares together to generate Ɑt-1, and since Ɑt-1 changes with every iteration, our learning rate changes as well.
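Here is a minimal AdaGrad sketch in plain Python, assuming a toy loss that is steep along one weight and shallow along the other. The loss, the `eps` constant and the function names are illustrative assumptions. Like most implementations, the sketch adds the current squared gradient to the running sum before computing the rate, which differs slightly from the indexing described above.

```python
import math


def loss(w):
    """Toy loss: steep along w[0], shallow along w[1] (illustrative only)."""
    return 10.0 * w[0] ** 2 + 0.1 * w[1] ** 2


def gradient(w):
    return [20.0 * w[0], 0.2 * w[1]]


def adagrad(w, base_lr=0.05, eps=1e-8, steps=100):
    """Per-weight learning rates: base rate divided by the root of summed squared gradients."""
    accum = [0.0] * len(w)            # running sum of squared gradients
    rates = [base_lr] * len(w)
    for _ in range(steps):
        g = gradient(w)
        accum = [a + gi ** 2 for a, gi in zip(accum, g)]
        rates = [base_lr / math.sqrt(a + eps) for a in accum]
        w = [wi - r * gi for wi, r, gi in zip(w, rates, g)]
    return w, rates


start = [1.0, 1.0]
final, rates = adagrad(start)

print(loss(start), loss(final))   # the loss goes down
print(rates)                      # the steep weight ends up with the smaller rate
```

Each weight gets its own effective learning rate, which shrinks faster for weights that keep receiving large gradients.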
ReLU is the non-linear activation function that we are using. Before diving into details about ReLU, let’s understand why we need non-linear activation functions. If we use only linear activation functions, then our output will just be a linear combination of our input, regardless of the number of layers in our network.
Another point to consider is that, without non-linear activation functions, we can’t create neural networks that solve intricate problems: our decision boundary will always be linear if we use linear activation functions.
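A small numerical check makes this collapse visible. In the sketch below, the two weight matrices and the input are arbitrary values chosen for illustration, and biases are left out to keep it short.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def relu(v):
    return [max(0.0, a) for a in v]


# Arbitrary weights and input (assumptions for illustration).
W1 = [[1.0, 2.0], [0.0, 1.0]]
W2 = [[3.0, 0.0], [1.0, 1.0]]
x = [1.0, -2.0]

two_linear_layers = matvec(W2, matvec(W1, x))   # layer 2 applied to layer 1
one_linear_layer = matvec(matmul(W2, W1), x)    # a single equivalent layer
with_relu = matvec(W2, relu(matvec(W1, x)))     # non-linearity in between

print(two_linear_layers, one_linear_layer)  # identical: [-9.0, -5.0] twice
print(with_relu)                            # differs: [0.0, 0.0]
```

Two stacked linear layers are exactly one linear layer with weights W2·W1, so depth adds nothing until a non-linearity is inserted between them.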
So, hopefully you are now convinced that we actually need non-linear activation functions. So now let us understand the basics of ReLU.
ReLU is the successor of the sigmoid and tanh activation functions. I won’t go into much detail here about these functions, but I will briefly explain the issues with sigmoid and tanh that led to the adoption of ReLU. The big problem with both sigmoid and tanh is that of vanishing gradients. Sigmoid outputs a value between 0 and 1 and tanh a value between -1 and 1, and both saturate for large inputs, so their derivatives lie between 0 and 1 (at most 0.25 for sigmoid). While calculating gradients using back propagation (refer to Eq (1)), we have to multiply many such values together, one per layer. After a few layers, the product may become so small and insignificant that the weights of the early layers stop updating. Another issue with both of them is that they are expensive to compute, i.e. we have to evaluate exponentials, which are computationally expensive.
ReLU is defined as f(x) = max(0, x). Its output is not squashed into a fixed range, its gradient is exactly 1 for every positive input, and we don’t have to compute any expensive function. So ReLU addresses both of these problems.
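The difference shows up clearly when gradients are chained through several layers. The sketch below is a simplified illustration that ignores the weights and only multiplies activation derivatives; the depth of 10 layers is an arbitrary assumption.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # never larger than 0.25


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # exactly 1 for positive inputs


layers = 10
# Best case for sigmoid: every unit sits at x = 0, where the derivative peaks at 0.25.
sigmoid_chain = sigmoid_grad(0.0) ** layers
# ReLU with positive pre-activations passes the gradient through unchanged.
relu_chain = relu_grad(2.0) ** layers

print(sigmoid_chain)  # below 1e-6 even in the best case
print(relu_chain)     # 1.0
```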
CNN Architectures
FaceNet uses 2 types of CNNs, namely the Zeiler & Fergus architecture and the GoogLeNet-style Inception model.
I will explain them briefly here.
1. Zeiler & Fergus architecture
The Zeiler & Fergus architecture comes from a paper on visualising and understanding CNNs. That paper introduced a novel visualisation technique that gives insight into the function of intermediate layers and the operation of the classifier, which helps us understand the internal workings of a CNN, and the authors used those insights to refine the network design.
I won’t go into much detail here. I would recommend that you read the Zeiler & Fergus architecture research paper.
The Zeiler & Fergus architecture used in the FaceNet research paper is shown below.
This model has 140 million parameters and requires around 1.6 billion FLOPS (floating-point operations, not operations per second) per image.
2. Inception Model
The main idea behind the Inception network architecture is that of using multiple filters of different sizes simultaneously. In a traditional network architecture we usually choose a single filter size per layer, let’s say 3×3 or 5×5, but in the Inception architecture we apply multiple filter sizes in parallel and concatenate their results.
In Fig 11 (a), we are using multiple filters of size 1×1, 3×3 and 5×5 along with a max pooling layer, and then we concatenate the results. This is the main intuition behind the Inception network architecture. The problem with this approach is that it is computationally very expensive. So, in order to avoid this problem, we use 1×1 convolutions for dimensionality reduction.
In Fig 11 (b), we place a 1×1 filter before each 3×3 and 5×5 convolution (and after the max pooling layer) in order to reduce dimensionality and make this architecture computationally feasible.
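A quick back-of-the-envelope calculation shows why the 1×1 reduction helps. The feature-map size and channel counts below are hypothetical numbers picked for illustration, not values taken from the FaceNet models.

```python
def conv_multiplies(height, width, in_channels, out_channels, kernel):
    """Multiplications for a stride-1 convolution that keeps the spatial size."""
    return height * width * out_channels * kernel * kernel * in_channels


# Hypothetical layer: 28x28 feature map with 192 input channels.
H, W, C_IN = 28, 28, 192
C_OUT = 32        # number of 5x5 filters
C_REDUCED = 16    # channels left after the 1x1 reduction

# 5x5 convolution applied directly to all 192 channels.
direct = conv_multiplies(H, W, C_IN, C_OUT, kernel=5)

# 1x1 reduction to 16 channels first, then the 5x5 convolution.
reduced = (conv_multiplies(H, W, C_IN, C_REDUCED, kernel=1)
           + conv_multiplies(H, W, C_REDUCED, C_OUT, kernel=5))

print(direct)            # 120422400
print(reduced)           # 12443648
print(direct / reduced)  # roughly a 10x saving
```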
If you want to understand this architecture in more detail, then I would highly suggest that you read the Inception research paper.
The Inception models used in the FaceNet research paper have 6.6M to 7.5M parameters and around 500M to 1.6B FLOPS. Several variations of the Inception model are used in FaceNet; some of them are optimised to run on mobile phones and hence have comparatively fewer parameters and filters.
We calculate the true accepts (TA) as follows —
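In the notation of the FaceNet paper, the set of true accepts at threshold d is:

```latex
\mathrm{TA}(d) = \left\{ (i, j) \in \mathcal{P}_{\text{same}}, \ \text{with} \ D(x_i, x_j) \le d \right\}
```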
True accepts are the face pairs that were correctly classified as same at threshold ‘d’.
We define the false accepts (FA) as follows —
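In the notation of the FaceNet paper, the set of false accepts at threshold d is:

```latex
\mathrm{FA}(d) = \left\{ (i, j) \in \mathcal{P}_{\text{diff}}, \ \text{with} \ D(x_i, x_j) \le d \right\}
```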
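False accepts are the face pairs that were incorrectly classified as same at threshold ‘d’.
P same - the set of all face pairs (i, j) of the same identity
P diff - the set of all face pairs (i, j) of different identities
D(xi,xj) - the squared L2 distance between the pair of images
d - the distance threshold
The validation rate (VAL) and false accept rate (FAR) for a given face distance ‘d’ are defined as the fraction of same-identity pairs that are accepted and the fraction of different-identity pairs that are accepted, respectively: VAL(d) = |TA(d)| / |P same| and FAR(d) = |FA(d)| / |P diff|.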
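As a rough sketch of these two rates, with VAL taken as the share of same-identity pairs accepted at threshold d and FAR as the share of different-identity pairs accepted at the same threshold, the computation might look like this. The embedding pairs, the threshold of 1.0 and the function name `val_far` are made up for illustration.

```python
def squared_l2(u, v):
    """Squared Euclidean (L2) distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def val_far(pairs, d):
    """Return (VAL, FAR) at threshold d.

    pairs is a list of (embedding_i, embedding_j, is_same_identity) tuples.
    """
    same = [(a, b) for a, b, is_same in pairs if is_same]
    diff = [(a, b) for a, b, is_same in pairs if not is_same]
    true_accepts = sum(1 for a, b in same if squared_l2(a, b) <= d)
    false_accepts = sum(1 for a, b in diff if squared_l2(a, b) <= d)
    return true_accepts / len(same), false_accepts / len(diff)


# Toy labelled pairs (assumptions for illustration only).
pairs = [
    ([0.0, 0.0], [0.1, 0.0], True),    # same identity, very close
    ([0.0, 0.0], [0.5, 0.0], True),    # same identity, close
    ([0.0, 0.0], [2.0, 0.0], True),    # same identity, far apart (missed at d = 1.0)
    ([0.0, 0.0], [1.0, 0.0], False),   # different identity, squared distance 1.0 (false accept)
    ([0.0, 0.0], [3.0, 0.0], False),   # different identity, far apart
]

val, far = val_far(pairs, d=1.0)
print(val, far)  # two of three same pairs accepted, one of two different pairs accepted
```

Raising d accepts more pairs of both kinds, so VAL is usually reported at a fixed FAR.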