Face Recognition Made Easy | Nuts and Bolts of Face Recognition
This post is an illustrative guide to the main aspects of face recognition. We will see how face recognition solves the one-shot learning problem, i.e. how a network learns to identify a face from a single image of that person.
Introduction
Face recognition is becoming increasingly ubiquitous. It is a method of identifying a person, or verifying their identity, from their face. Governments, social media platforms and devices/apps are increasingly adopting the technology. Face recognition really took off after the introduction of DeepFace by the research team at Facebook [1] and FaceNet from Google [2]. FaceNet achieved 99.63% accuracy on LFW and 95.12% on YouTube Faces DB. Face recognition algorithms are effective and, at the same time, simple and elegant to implement. In this post, we review a FaceNet-style architecture for learning embeddings, followed by face recognition and clustering.
Understanding the network and constraints:
All neural networks have three basic components: a network that maps inputs to the desired output, an objective function that computes the loss/error, and optimization of the parameters using backpropagation. While the network and the loss function have to be defined explicitly, autograd takes care of backpropagation.
A face recognition network has both convolutional and dense layers, which extract features from images and map them to a vector of a chosen size. These mapped vectors are the embeddings. Once the network is fully trained, the embeddings for images of the same person lie close together in vector space, i.e. they form a distinct cluster. Equivalently, Euclidean distances between embeddings of the same person should be much smaller than distances to embeddings of any other person.
This constraint is nicely reflected in the triplet loss. The loss function ensures that distances between images of the same person are smaller than distances to images of different persons. In addition, a margin is added in the triplet loss to widen this gap.
Each batch of inputs consists of triplets of Anchor, Positive and Negative images, where the Anchor and Positive are images of the same person and the Negative is an image of a randomly selected, different person. As stated above, the network maps them into n-dimensional vectors such that the embeddings of each person are close to each other.
Loss/Constraints
Main constraint: Negative distance is greater than positive distance plus enforced margin (alpha)
For all possible triplets (T)
To satisfy the above constraint (rearranged): the left-hand side, i.e. the positive distance minus the negative distance plus the margin, has to be 0 or less
Finally, the loss function takes the following form, where the loss is summed over all triplets in a batch. In the implementation, max(left side, 0) is taken so that any triplet whose left side is below 0 is ignored; as a result, the loss goes to 0 as more and more triplets satisfy the constraint.
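The equations referred to above did not survive in this text. As a reconstruction in the notation of the FaceNet paper [2], the three steps can be written as follows. The symbols are assumptions, since the text does not define them: f is the embedding network, x^a, x^p and x^n are the anchor, positive and negative images of triplet i, alpha is the margin, T is the set of all possible triplets and N is the number of triplets in a batch.

```latex
% Main constraint, for all possible triplets
\| f(x_i^a) - f(x_i^p) \|_2^2 + \alpha < \| f(x_i^a) - f(x_i^n) \|_2^2,
\qquad \forall \, (x_i^a, x_i^p, x_i^n) \in \mathcal{T}

% Rearranged
\| f(x_i^a) - f(x_i^p) \|_2^2 - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha < 0

% Loss, summed over the triplets in a batch
L = \sum_{i=1}^{N} \max\left( \| f(x_i^a) - f(x_i^p) \|_2^2 - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha,\; 0 \right)
```

The max with 0 means a triplet stops contributing as soon as its rearranged left-hand side is zero or negative.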
Illustration of face embedding before and after training.
Visualization of embedding in reduced dimensions. (MNIST in 3d in the example below)
Training a face recognition system.
Implementation
We will build a working prototype from scratch in PyTorch. We will use the AT&T database of faces. It consists of ten different images of each of 40 individuals, saved in 40 folders, 400 images in total. We will use simple functions and networks for the sake of clarity.
Visualize an image.
Next, we need pairs and triplets to implement contrastive loss (explained below) and triplet loss respectively. 370 images from 37 individuals are used for training and 30 images from the remaining 3 individuals are used for testing.
Next, we load batches of samples into the network: select a batch size, find the number of batches available for a list of pairs, and retrieve the subset of pairs for the respective batch.
Then transform the list of images into a list of tensors to be loaded into the network.
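As a rough sketch of how such pairs and triplets could be generated (not necessarily how the original code did it), the function below walks over every same-person image pair and attaches a random image of a different person. The name `make_pairs_and_triplets`, the `images_by_person` dictionary layout and the image ids are hypothetical; the labels follow the 1 (same) / 0 (different) convention used for contrastive loss later in the post.

```python
import random
from itertools import combinations


def make_pairs_and_triplets(images_by_person, seed=0):
    """Build labelled pairs and (anchor, positive, negative) triplets.

    images_by_person maps a person id to a list of image ids
    (a hypothetical layout chosen for this sketch).
    """
    rng = random.Random(seed)
    people = sorted(images_by_person)
    pairs, triplets = [], []
    for person in people:
        others = [p for p in people if p != person]
        # Every unordered pair of images of the same person is an (anchor, positive).
        for anchor, positive in combinations(images_by_person[person], 2):
            # Negative: a random image of a randomly chosen different person.
            neg_person = rng.choice(others)
            negative = rng.choice(images_by_person[neg_person])
            triplets.append((anchor, positive, negative))
            pairs.append((anchor, positive, 1))  # label 1: same person
            pairs.append((anchor, negative, 0))  # label 0: different person
    return pairs, triplets


# Toy data: 3 people with 4 images each, with ids such as "s1/2".
toy = {f"s{p}": [f"s{p}/{i}" for i in range(1, 5)] for p in range(1, 4)}
pairs, triplets = make_pairs_and_triplets(toy)
print(len(triplets), len(pairs))
```

With 10 images per person there are 45 anchor-positive combinations per person, so this scheme would give 37 x 45 = 1665 triplets for the training split.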
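A minimal sketch of this batching step, using plain list slicing. The helper names `num_batches` and `get_batch` are made up for illustration, and keeping the final, smaller batch (rather than dropping it) is a choice made here, not something the post specifies.

```python
import math


def num_batches(samples, batch_size):
    """Number of batches needed to cover all samples (last one may be smaller)."""
    return math.ceil(len(samples) / batch_size)


def get_batch(samples, batch_size, batch_index):
    """Return the slice of samples belonging to the given batch."""
    start = batch_index * batch_size
    return samples[start:start + batch_size]


samples = list(range(10))  # stand-in for a list of pairs or triplets
n = num_batches(samples, 4)
batches = [get_batch(samples, 4, i) for i in range(n)]
print(n, batches)
```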
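One possible way to do the conversion, assuming the images have already been read into grayscale NumPy arrays of 112 x 92 pixels (the size implied by the 56 x 46 feature maps mentioned below). The helper `images_to_tensor` is hypothetical, and it stacks the images into a single batch tensor rather than keeping a Python list of tensors.

```python
import numpy as np
import torch


def images_to_tensor(images):
    """Stack H x W uint8 grayscale arrays into a float tensor of shape (N, 1, H, W)."""
    batch = np.stack(images).astype(np.float32) / 255.0  # scale pixels to [0, 1]
    return torch.from_numpy(batch).unsqueeze(1)  # add the channel dimension


# Random arrays standing in for real face images.
fake_images = [np.random.randint(0, 256, size=(112, 92), dtype=np.uint8) for _ in range(4)]
x = images_to_tensor(fake_images)
print(x.shape, x.dtype)
```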
Defining network
The network consists of both convolutional and fully connected layers. First, a simple 3-layer convolutional network takes the input tensors and generates a final output with 16 channels, where each channel is half the height and width of the original image (16 x 56 x 46). Max pooling is applied once, ReLU is used as the activation, and batch normalization is applied to speed up training.
The fully connected dense network takes the output from the convolutional network after flattening/reshaping. Finally, the dense network outputs a 20-dimensional vector which is the embedding for the given image.
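A minimal sketch of a network matching this description. The intermediate channel counts (4 and 8), the 3 x 3 kernels, the position of the pooling layer and the 128-unit hidden layer are assumptions made for illustration; only the 16 x 56 x 46 convolutional output and the 20-dimensional embedding come from the text.

```python
import torch
import torch.nn as nn


class EmbeddingNet(nn.Module):
    """Three conv layers followed by a dense head that outputs the embedding."""

    def __init__(self, embedding_dim=20):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=3, padding=1),
            nn.BatchNorm2d(4),
            nn.ReLU(),
            nn.Conv2d(4, 8, kernel_size=3, padding=1),
            nn.BatchNorm2d(8),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 112 x 92 -> 56 x 46
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(16 * 56 * 46, 128),
            nn.ReLU(),
            nn.Linear(128, embedding_dim),
        )

    def forward(self, x):
        features = self.conv(x)  # (N, 16, 56, 46)
        return self.fc(features.flatten(start_dim=1))  # (N, embedding_dim)


net = EmbeddingNet()
embeddings = net(torch.randn(2, 1, 112, 92))
print(embeddings.shape)
```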
Adam is used to optimize the weights, which helps the network converge faster while training.
Finally, train the network.
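The training step could look roughly like the sketch below: compute embeddings for a batch of triplets, evaluate the triplet loss, backpropagate and let Adam update the weights. To keep the example short, a single linear layer stands in for the real network and random tensors stand in for images. The margin of 0.2, the learning rate and the use of squared distances (as in FaceNet [2]) are assumptions.

```python
import torch
import torch.nn as nn


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Sum over the batch of max(pos_dist - neg_dist + margin, 0)."""
    pos_dist = (anchor - positive).pow(2).sum(dim=1)  # squared Euclidean distance
    neg_dist = (anchor - negative).pow(2).sum(dim=1)
    return torch.clamp(pos_dist - neg_dist + margin, min=0.0).sum()


torch.manual_seed(0)
# Stand-in for the real embedding network: flatten and project to 20 dimensions.
net = nn.Sequential(nn.Flatten(), nn.Linear(112 * 92, 20))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

# Random stand-in batch of 8 triplets.
anchors, positives, negatives = (torch.rand(8, 1, 112, 92) for _ in range(3))

losses = []
for epoch in range(5):
    optimizer.zero_grad()
    loss = triplet_loss(net(anchors), net(positives), net(negatives))
    loss.backward()   # autograd computes the gradients
    optimizer.step()  # Adam updates the weights
    losses.append(loss.item())
print(losses)
```

In a real run the inner step would loop over all batches of triplets in each epoch.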
After training, we could go straight to recognition and clustering of faces, but it is very helpful to first gauge performance using simple distances, visualizing the patterns and looking for consistency.
Assessing performance using distances
We can compute accuracy and other metrics, particularly for unseen test subjects. One way to assess performance is to check whether positive distances (same person) are shorter than negative distances (different persons). The steps are fairly straightforward: first get the embeddings for all test images using the trained network, then compute the distances between these vectors.
Once the distance matrices are computed, they can be displayed as heat maps, as seen below. The patterns show that the network is able to learn from the data and the relationships within it. This is very helpful for assessing whether the model's predictions are fairly accurate.
Accuracy measures the ability of the algorithm to detect true positives (images of the same person correctly identified as the same person) and true negatives (images of different persons correctly predicted as different). It is visually illustrated below.
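As an illustration of these two steps, the sketch below builds the full pairwise Euclidean distance matrix from a small array of embeddings and compares the mean positive and negative distances. The embeddings and labels are made-up toy values, and `pairwise_distances` is a helper written for this example.

```python
import numpy as np


def pairwise_distances(embeddings):
    """Euclidean distance between every pair of rows, as an (N, N) matrix."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Toy embeddings: two images of person 0 and one image of person 1.
emb = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
labels = np.array([0, 0, 1])

dist = pairwise_distances(emb)
same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(len(labels), dtype=bool)  # ignore each image's distance to itself

positive_mean = dist[same & off_diag].mean()
negative_mean = dist[~same].mean()
print(positive_mean, negative_mean)
```

The matrix `dist` is also what would be passed to a heat-map plot.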
Plotting Accuracy
Accuracy varies according to the threshold we set. Apart from accuracy, it is helpful to look at other metrics such as precision, recall and F1 score. To find the cut-off value that gives the highest accuracy (which could differ between the test and train sets), along with the other relevant metrics, we can write a simple function that returns all metrics at varying thresholds and select the best cut-off accordingly. Face recognition could be further improved by using the mean of the distances to each individual (over all images of the same person), which we will use later.
Once the network is trained and performing well, face recognition and clustering take only a few simple steps. Find the link to calculating performance metrics here.
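A sketch of such a function, under the assumption that a pair is predicted "same person" when its distance falls below the threshold. The function names and the toy distances are hypothetical, not taken from the original code.

```python
import numpy as np


def metrics_at_threshold(distances, is_same, threshold):
    """Accuracy, precision, recall and F1 when 'distance < threshold' means same person."""
    predicted_same = distances < threshold
    tp = np.sum(predicted_same & is_same)
    tn = np.sum(~predicted_same & ~is_same)
    fp = np.sum(predicted_same & ~is_same)
    fn = np.sum(~predicted_same & is_same)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "threshold": float(threshold),
        "accuracy": float((tp + tn) / len(distances)),
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
    }


def best_threshold(distances, is_same, thresholds):
    """Evaluate every threshold and return the metrics of the most accurate one."""
    results = [metrics_at_threshold(distances, is_same, t) for t in thresholds]
    return max(results, key=lambda r: r["accuracy"])


# Toy pair distances with ground-truth labels (True = same person).
distances = np.array([0.25, 0.35, 0.55, 0.95, 1.15, 1.45])
is_same = np.array([True, True, True, False, True, False])
best = best_threshold(distances, is_same, np.linspace(0.1, 1.5, 15))
print(best)
```

When several thresholds tie on accuracy, `max` keeps the first one; picking by F1 instead only requires changing the key.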
Face recognition and clustering
Now that the network is trained and seems to be performing well, we can use it for face recognition and clustering. In face recognition, given one image, the network finds the person who is most similar, i.e. closest. Clustering involves finding all the images of the same person in a pool of many images. Alternatively, we can form k clusters from a number of images.
Rather than computing the embedding each time, it is more efficient to compute all the embeddings once and reuse them for recognition and clustering. It is a simple task to make a list with indices, images and the corresponding embeddings. Here is the code snippet.
Now it becomes a simple task to compute and compare distances, which is the basis for identifying a face and forming clusters.
Below are the code and a few runs of face recognition calls. Just select the image of a person using its index in look_up_list to find his/her identity.
Finding the right person from an image using find_person.
The function also returns the distances to each person. For example, the person in the second folder has indices 10 to 19 in look_up_list, so we can call find_person(11, look_up_list) to find the identity of image 2 of this person. In addition to the image, we also get the mean distances to each of the 40 individuals, as below. Clearly, the mean distance to the second individual (0.246) is the smallest.
array([0.89915437, 0.24623849, 1.40480736, 1.33066083, 1.62849771, 1.41435592, 0.87803077, 0.81626312, 1.61251076, 1.35037916, 0.59056972, 1.63431268, 1.2315311 , 0.85301907, 1.04345673, 0.46334697, 1.00861799, 1.32431773, 0.58127128, 1.13923682, 1.46194084, 1.0527679 , 1.52130461, 0.83202246, 1.54659908, 1.62023422, 0.4725906 , 1.19090982, 1.4773685 , 1.42314241, 1.5581555 , 0.89464784, 1.3679505 , 1.29450277, 1.59522836, 0.72607619, 1.2125357 , 1.30265826, 1.34485011, 1.68174347])
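The original find_person code is not shown here, so the following is only a sketch of how such a function could work. It assumes each entry of look_up_list is a tuple (index, image, embedding), that every person has the same number of consecutive images, and that the query image itself is left out of the mean. All three are assumptions made for this example.

```python
import numpy as np


def find_person(index, look_up_list, images_per_person=10):
    """Return (predicted person index, mean distance from the query to every person)."""
    embeddings = np.stack([entry[2] for entry in look_up_list])
    distances = np.linalg.norm(embeddings - embeddings[index], axis=1)
    distances[index] = np.nan  # leave the query image out of its own person's mean
    mean_distances = np.nanmean(distances.reshape(-1, images_per_person), axis=1)
    return int(np.argmin(mean_distances)), mean_distances


# Toy look-up list: 3 people x 10 images, with embeddings scattered around 3 centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
look_up_list = []
for person_id, center in enumerate(centers):
    for i in range(10):
        embedding = center + 0.1 * rng.standard_normal(2)
        look_up_list.append((person_id * 10 + i, None, embedding))  # image omitted here

person, mean_distances = find_person(11, look_up_list)
print(person, mean_distances)
```

Index 11 is the second image of the second person, so the smallest mean distance should belong to person index 1, mirroring the example above.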
Face Clustering
The same distances are used for clustering. It is a simple and very useful application once we have an easy way of representing faces as vectors. Code is given below.
Given a collection of faces, select an image and compute the Euclidean distances to all the other faces. Faces of the same person will most likely have the shortest distances, which helps to pool all images of the same person. It is quite similar to a KNN (K Nearest Neighbors) problem.
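A small sketch of this nearest-neighbour pooling, using made-up 2-dimensional embeddings. The function `nearest_faces` and the choice of returning a fixed number k of neighbours are assumptions for illustration; a distance threshold would work equally well.

```python
import numpy as np


def nearest_faces(index, embeddings, k):
    """Indices of the k faces closest to the selected face, nearest first."""
    distances = np.linalg.norm(embeddings - embeddings[index], axis=1)
    order = np.argsort(distances)
    return [int(i) for i in order if i != index][:k]


# Toy embeddings: faces 0-2 belong to one person, faces 3-5 to another.
emb = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.2],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.2],
])
print(nearest_faces(0, emb, 2), nearest_faces(4, emb, 2))
```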
Online Mining
Online mining is one of the key concepts discussed in FaceNet [2]. It helps in developing a robust face recognition system. The way the triplet loss is implemented, the network learns nothing from anchor-positive pairs that are already close, and similarly nothing from anchor-negative pairs that are already far apart. By filtering out these easy pairs, the model is systematically exposed to hard and semi-hard pairs.
One way to implement online mining is to compute the positive and negative distances on the fly and filter out the easy negatives that lie beyond the margin. For each batch, we find all triplets where dist(anchor, negative) < dist(anchor, positive) + margin: these are the hard and semi-hard negatives. Only these triplets are used to compute the loss and take an optimization step.
The picture below illustrates the distances and definition of easy, semi-hard and hard negatives.
Code for online mining.
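The original code is not reproduced in this text, so here is a hedged sketch of the filtering rule described above. It keeps only triplets with dist(anchor, negative) < dist(anchor, positive) + margin and sums the loss over those. The function names, the margin of 0.2 and the use of squared distances are assumptions.

```python
import torch


def mine_hard_triplets(anchor, positive, negative, margin=0.2):
    """Boolean mask of hard and semi-hard triplets, plus both distance vectors."""
    pos_dist = (anchor - positive).pow(2).sum(dim=1)
    neg_dist = (anchor - negative).pow(2).sum(dim=1)
    keep = neg_dist < pos_dist + margin  # easy negatives fail this test
    return keep, pos_dist, neg_dist


def mined_triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss summed over the mined (hard and semi-hard) triplets only."""
    keep, pos_dist, neg_dist = mine_hard_triplets(anchor, positive, negative, margin)
    if not keep.any():
        return anchor.sum() * 0.0  # nothing to learn from in this batch
    # Every kept triplet violates the margin, so its term is already positive.
    return (pos_dist[keep] - neg_dist[keep] + margin).sum()


# Toy embeddings: one easy, one semi-hard and one hard negative.
anchor = torch.zeros(3, 2)
positive = torch.tensor([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
negative = torch.tensor([[3.0, 0.0], [1.05, 0.0], [0.5, 0.0]])

keep, _, _ = mine_hard_triplets(anchor, positive, negative)
loss = mined_triplet_loss(anchor, positive, negative)
print(keep.tolist(), loss.item())
```

In a training loop this loss would simply replace the plain triplet loss before calling backward and stepping the optimizer.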
Contrastive loss
This is an alternative way to build a face recognition system, where contrastive loss is used in place of triplet loss. Unlike triplet loss, where the loss is calculated over triplets, here the images are grouped into pairs that are labeled as same or different. Here is a snippet of the loss function.
# Loss function for contrastive loss.
# Find the distance between the anchor and its positive or negative partner.
# Pairs have labels 1 (same) or 0 (different).
# As the loss is minimized, the distance to the same person goes down.
import torch
import torch.nn.functional as F

euclidean_distance = F.pairwise_distance(anc_fc_out, pn_fc_out)
loss_contrastive = torch.mean(
    labels * torch.pow(euclidean_distance, 2)
    + (1 - labels) * torch.pow(torch.clamp(margin - euclidean_distance, min=0.0), 2)
)
Find the link to full implementation of contrastive loss here.
Dimensionality reduction and visualization of clusters
We can employ t-SNE or other methods to reduce the 20-dimensional embeddings to 2D or 3D vectors. Plotting these shows that the embeddings belonging to the same class/person form a small cluster, since all the vectors point to a small region of the vector space.
Below is a 3D transformation of the embeddings using t-SNE. Find the link to the 3D plot here.
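For reference, a reduction like this can be sketched with scikit-learn's TSNE, which is an assumption here since the post does not say which implementation was used. Random 20-dimensional vectors grouped around three centers stand in for real embeddings, and the perplexity value is just a guess suited to such a small sample.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in embeddings: 3 people x 10 images, 20 dimensions each.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 20))
embeddings = np.vstack([c + 0.3 * rng.standard_normal((10, 20)) for c in centers])
labels = np.repeat(np.arange(3), 10)  # person id per row, useful for colouring a plot

# Perplexity must be smaller than the number of samples (30 here).
tsne = TSNE(n_components=3, perplexity=5, init="random", random_state=0)
points_3d = tsne.fit_transform(embeddings)
print(points_3d.shape)
```

The rows of `points_3d` can then be passed to any 3D scatter plot, coloured by `labels`.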
Conclusion
Face recognition is among the most widely used algorithms in AI applications. The basic architecture for one-shot learning is quite intuitive and fun. In this article, I have tried to highlight most of the important concepts in face recognition, such as loss functions, recognition, clustering and online mining. Hope you enjoyed the read, and do remember to leave feedback.
References:
[1] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, Lior Wolf. DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR 2014. DOI: 10.1109/CVPR.2014.220
[2] Florian Schroff, Dmitry Kalenichenko, James Philbin, FaceNet: A Unified Embedding for Face Recognition and Clustering (2015). arXiv:1503.03832v3 [cs.CV] 17 Jun 2015
It is also helpful to watch the deep-learning videos on face recognition by Andrew Ng (https://www.youtube.com/watch?v=-FfMVnwXrZ0).