Nuts and Bolts of Face Recognition

Munesh Lakhey
Published in Towards Data Science · Jul 10, 2019

This post is an illustrative guide to all aspects of face recognition. We will see how face recognition solves the one-shot learning problem, i.e. how the network learns to identify faces from just a single image per person.

Introduction

Face recognition is becoming increasingly ubiquitous. It is a method of identifying or verifying a person's identity from their face. Governments, social media platforms and devices/apps are increasingly adopting face recognition technology. Face recognition really took off after the introduction of DeepFace by the research team at Facebook and FaceNet from Google. FaceNet achieved very impressive performance, with 99.63% accuracy on LFW and 95.12% on YouTube Faces DB. Face recognition algorithms are effective and at the same time simple and elegant to implement. In this post, we shall review a FaceNet-style architecture for finding embeddings, followed by face recognition and clustering.

Understanding the network and constraints

All neural networks have three basic components: a network that maps inputs to the desired output, an objective function for computing the loss/error, and optimization of the parameters using backpropagation. While the network and loss function have to be explicitly defined, the autograd engine takes care of backpropagation.

A face recognition network has both convolutional and dense layers to extract features from images and map them to a vector of arbitrary size. These mapped vectors are the embeddings. Once the network is fully trained, embeddings for images of the same person lie close together in the vector space, i.e. they form a distinct cluster. Equivalently, the Euclidean distances between embeddings of the same person are always much smaller than the distances to any other person.

This constraint is nicely reflected in the triplet loss. The loss function ensures that similarity distances are smaller for images of the same person than for images of different persons. In addition, a margin is added to the triplet loss to widen this gap.

Each batch of inputs consists of triplets of Anchor, Positive and Negative images, where the Anchor and Positive are images of the same person and the Negative is an image of a randomly selected, different person. As stated above, the network maps them into n-dimensional vectors such that the embeddings for each person lie close to each other.

Loss/Constraints

Main constraint: the negative distance must be greater than the positive distance plus an enforced margin (alpha), for all possible triplets in the set T:

$\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2 \quad \forall (x_i^a, x_i^p, x_i^n) \in T$

Rearranged, to satisfy the above constraint the left-side term has to be 0 or less:

$\|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \le 0$

Finally, the loss function takes the following form, where the loss is summed over all triplets in a batch:

$L = \sum_i \max\left(\|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha,\ 0\right)$

While implementing, max(left side, 0) is taken so that whenever the left side is less than 0 it is ignored; as a result, the loss consistently goes to 0 as it is minimized.
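As a minimal sketch (the margin value here is illustrative), this loss can be written in PyTorch as:

    import torch

    def triplet_loss(anchor, positive, negative, margin=0.2):
        # Squared Euclidean distances between the embedding vectors
        pos_dist = (anchor - positive).pow(2).sum(dim=1)
        neg_dist = (anchor - negative).pow(2).sum(dim=1)
        # max(left side, 0): triplets that already satisfy the constraint contribute nothing
        losses = torch.clamp(pos_dist - neg_dist + margin, min=0.0)
        return losses.mean()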

Illustration of face embeddings before and after training.

In the beginning, the network is initialized randomly and many negative samples are closer than positive images. After training, however, all embeddings for the same person lie close to each other and negative images lie farther away. (Source: FaceNet: A Unified Embedding for Face Recognition and Clustering.)

Visualization of embeddings in reduced dimensions (MNIST in 3d in the example below).

Clusters of digits in 3d space. Face embeddings are similar but in a higher dimension. The embedding vector for each face points to a certain region where the whole cluster for that identity lies. Clearly, all vectors for the same digit/face have shorter distances to each other than to any other digit/face. (Image source: Olivier Moindrot)

Training a face recognition system

Design and steps: load grayscale images as tensors (size 112 x 92) → forward through the network → get the embedding vectors (size 20, selected arbitrarily) → compute Euclidean (or squared Euclidean) distances (anchor-positive, anchor-negative) → use these distances along with the margin (alpha) in the loss function. This is followed by iterative minimization of the loss and optimization of the parameters over many epochs.

Implementation

We will build a working prototype from scratch in PyTorch, using the AT&T Database of Faces. It consists of ten different images each of 40 individuals, saved in 40 folders, 400 images in total. We will use simple functions and networks for the sake of clarity.

Visualize an image.

An image in the second folder.
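For reference, a minimal sketch for loading and displaying one image; the folder layout att_faces/s1 … s40 and the file name are assumptions about how the dataset is stored locally:

    from PIL import Image
    import matplotlib.pyplot as plt

    # Path is an assumption about the local folder layout
    img = Image.open('att_faces/s2/1.pgm')
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    plt.show()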

Next, we need pairs and triplets to implement the contrastive loss (explained below) and the triplet loss respectively. 370 images from 37 individuals are used for training and 30 pictures from 3 individuals are used for testing.
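A minimal sketch of triplet generation under the folder layout assumed above (make_triplet and the triplet count are illustrative, not the article's exact code):

    import os
    import random

    people = ['att_faces/s{}'.format(i) for i in range(1, 38)]  # 37 training identities

    def make_triplet(people):
        # Anchor and positive from the same person, negative from a different one
        anc_dir, neg_dir = random.sample(people, 2)
        anc_img, pos_img = random.sample(os.listdir(anc_dir), 2)
        neg_img = random.choice(os.listdir(neg_dir))
        return (os.path.join(anc_dir, anc_img),
                os.path.join(anc_dir, pos_img),
                os.path.join(neg_dir, neg_img))

    triplets = [make_triplet(people) for _ in range(1000)]  # triplet count is arbitrary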

Next we load batches of samples into the network: select a batch size, find the number of possible batches for the list of pairs/triplets, and retrieve the subset belonging to the respective batch.

Transform the list of images into a list of tensors to be loaded into the network.
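A sketch of both steps, reusing the make_triplet output above (to_tensor and get_batch are illustrative names):

    import numpy as np
    import torch
    from PIL import Image

    def to_tensor(path):
        # Grayscale image -> float tensor of shape (1, 112, 92), scaled to [0, 1]
        img = np.array(Image.open(path), dtype=np.float32) / 255.0
        return torch.from_numpy(img).unsqueeze(0)

    def get_batch(triplets, batch_idx, batch_size=32):
        batch = triplets[batch_idx * batch_size:(batch_idx + 1) * batch_size]
        anchors = torch.stack([to_tensor(a) for a, p, n in batch])
        positives = torch.stack([to_tensor(p) for a, p, n in batch])
        negatives = torch.stack([to_tensor(n) for a, p, n in batch])
        return anchors, positives, negatives

    n_batches = len(triplets) // 32  # number of full batches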

Defining the network

The network consists of both convolutional and fully connected layers. First, a simple 3-layer convolutional network takes the input tensors and generates a final output with 16 channels, each channel half the size of the original tensor (16 x 56 x 46). Max-pooling is applied once, ReLU is used as the activation, and batch norm is applied to speed up training.

The fully connected dense network takes the output from the convolutional network after flattening/reshaping. Finally, the dense network outputs a 20-dimensional vector, which is the embedding for the given image.
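A sketch matching this description; only the input/output shapes are taken from the text, while the kernel sizes and hidden width are assumptions:

    import torch.nn as nn

    class EmbeddingNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
                nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
                nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),  # (16, 112, 92) -> (16, 56, 46)
            )
            self.fc = nn.Sequential(
                nn.Linear(16 * 56 * 46, 256), nn.ReLU(),
                nn.Linear(256, 20),  # 20-dimensional embedding
            )

        def forward(self, x):
            x = self.conv(x)
            return self.fc(x.view(x.size(0), -1))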

Adam is used for the optimization of the weights, which helps with faster/better convergence during training.

Finally, train the network.
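Putting the earlier sketches together (triplet_loss, get_batch and EmbeddingNet from above), a minimal training loop might look like this; the learning rate and epoch count are illustrative:

    import torch

    model = EmbeddingNet()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(20):
        for b in range(n_batches):
            anc, pos, neg = get_batch(triplets, b)
            loss = triplet_loss(model(anc), model(pos), model(neg), margin=0.2)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print('epoch {}: loss {:.4f}'.format(epoch, loss.item()))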

After training, we could go straight to recognition and clustering of faces, but it is very helpful to gauge performance using simple distances, visualizing the patterns and looking for consistency.

Assessing performance using distances

We can compute the accuracy and other metrics, particularly for unseen test subjects. One way to assess performance is to check whether positive distances are shorter than the negative distances (to different persons). The steps are fairly straightforward: first get the embeddings for all test images using the trained network, then compute the distances between these vectors.
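A sketch of these two steps, assuming test_paths holds the 30 test image paths and reusing to_tensor from above:

    import torch
    import matplotlib.pyplot as plt

    with torch.no_grad():
        test_emb = model(torch.stack([to_tensor(p) for p in test_paths]))

    # Pairwise Euclidean distances between all test embeddings (30 x 30)
    dist_matrix = torch.cdist(test_emb, test_emb)

    plt.imshow(dist_matrix.numpy())
    plt.colorbar()
    plt.show()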

Once distance matrices are obtained from the computed distances, they can be displayed as heat maps, as seen below. The patterns clearly show that the network is able to learn from the data and the relationships within it. This is very helpful for assessing whether the model's predictions are fairly accurate.

Heat maps of similarity matrices for the test individuals (3 people with 10 images each; similarity distances from each image to all the images). Left: matrix generated using the untrained network. Middle: distance matrix using the trained network, plotted as a heat map. Right: the same matrix as in the middle, but with all values below the cut-off (< 0.054) set to 0 to visualize the cells having distances less than the threshold. Each smallest square is the Euclidean distance between two images; each row or column represents the complete set of similarity distances from one image. The bigger 10 x 10 squares along the diagonal represent distances between the 10 images of the same person. Most detections are true positives and there are a few false positives, giving an accuracy of 92% on the test images with this method for the selected cut-off.

Accuracy measures the ability of the algorithm to detect true positives (images of the same person identified as the same person) and true negatives (images of different persons identified as different persons). It is visually illustrated below.

A quick assessment of accuracy. We know that distances between images of the same person should be smaller and distances between different persons should be larger. We can fix a certain threshold (by eyeballing, plotting or other methods) and compute accuracy as Accuracy = (true positives + true negatives) / sample size. Note the three 10 x 10 diagonal squares represent the same person. Cell values below the threshold are all set to 0, so all such cells have the same color. Deep blue within a 10 x 10 diagonal square implies a true positive, deep blue outside this region implies a false positive, and the absence of deep blue within the diagonal region implies a false negative.

Plotting Accuracy

Accuracy varies according to the threshold we set. Apart from accuracy, it is helpful to look at other metrics like precision, recall and F1 score. To find the cut-off value that gives the highest accuracy (which may be different for the test and train sets) and to obtain the other relevant metrics, we can write a simple function that returns all metrics at varying thresholds, as sketched below, and select the best cut-off margin accordingly. Face recognition can be further improved by using the mean of the distances to each individual (all images of the same person), which we will use later.
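A sketch of such a function, operating on the dist_matrix computed above (metrics_at and the threshold range are illustrative):

    import numpy as np

    labels = np.repeat(np.arange(3), 10)              # 3 test people, 10 images each
    same_mask = labels[:, None] == labels[None, :]    # True where a pair is the same person

    def metrics_at(dists, same_mask, threshold):
        # Predict "same person" wherever the distance is below the threshold
        pred = dists < threshold
        tp = np.sum(pred & same_mask)
        tn = np.sum(~pred & ~same_mask)
        fp = np.sum(pred & ~same_mask)
        fn = np.sum(~pred & same_mask)
        accuracy = (tp + tn) / same_mask.size
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return accuracy, precision, recall, f1

    for t in np.linspace(0.05, 0.20, 16):
        print(t, metrics_at(dist_matrix.numpy(), same_mask, t))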

Cut-off for a positive test vs. accuracy. Note that accuracy stays above 90% for cut-offs from 0.1 to 0.13. The highest accuracy is observed at a margin of 0.112, where precision, recall and F1 score are 0.95, 0.83 and 0.889 respectively.

Once the network is trained and performing well, face recognition and clustering take only a few simple steps. Find the link to calculating performance metrics here.

Face recognition and clustering

As the network is trained and seems to be performing well, we can use it for face recognition and clustering. In face recognition, given one image the network finds the person who is most similar, i.e. closest. Clustering involves finding all images of the same person from a pool of many images. Alternatively, we can form k clusters from a set of images.

Rather than computing embeddings each time, it is more efficient to compute all the embeddings once and reuse them for recognition and clustering. It is a simple task to make a list with indices, images and the corresponding embeddings. Here is a sketch of the code.
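A minimal version, assuming all_image_paths lists the 400 images in folder order and reusing to_tensor and model from above:

    look_up_list = []
    with torch.no_grad():
        for idx, path in enumerate(all_image_paths):
            img = to_tensor(path)
            emb = model(img.unsqueeze(0)).squeeze(0)
            look_up_list.append((idx, img, emb))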

Now it becomes a simple task to compute and compare distances, which is the basis for identifying faces and getting clusters.

Below are the code and a few runs of face recognition calls. Just select the image of a person using the index in the look_up_list to find his/her identity.
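The original code is embedded as a gist; a sketch of what find_person might look like under this scheme (the actual implementation may differ) returns the 1-based folder id and the per-person mean distances:

    def find_person(index, look_up_list, n_people=40, imgs_per_person=10):
        # Mean distance from the query embedding to each person's ten embeddings
        query = look_up_list[index][2]
        embs = torch.stack([e for _, _, e in look_up_list])
        dists = (embs - query).pow(2).sum(1).sqrt()
        mean_dists = dists.view(n_people, imgs_per_person).mean(1)
        return mean_dists.argmin().item() + 1, mean_dists.numpy()

    person_id, mean_dists = find_person(11, look_up_list)  # image 2 of person 2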

Finding the right person from an image using find_person.

The algorithm achieves an overall accuracy of 95 to 99% for face recognition on both train and test images after a few epochs of training. The accuracy function also returns a list showing which persons were misidentified. Link to face_id accuracy here.

The function also returns the distances to each person. For example, the person in the second folder has indices 10 to 19 in look_up_list, so we can use find_person(11, look_up_list) to find the identity of the second image of this person. In addition to the image, we also get the mean distances to each of the 40 individuals, as below. Clearly, the mean distance to the second individual, 0.246, is the smallest.

array([0.89915437, 0.24623849, 1.40480736, 1.33066083, 1.62849771, 1.41435592, 0.87803077, 0.81626312, 1.61251076, 1.35037916, 0.59056972, 1.63431268, 1.2315311 , 0.85301907, 1.04345673, 0.46334697, 1.00861799, 1.32431773, 0.58127128, 1.13923682, 1.46194084, 1.0527679 , 1.52130461, 0.83202246, 1.54659908, 1.62023422, 0.4725906 , 1.19090982, 1.4773685 , 1.42314241, 1.5581555 , 0.89464784, 1.3679505 , 1.29450277, 1.59522836, 0.72607619, 1.2125357 , 1.30265826, 1.34485011, 1.68174347])

Face Clustering

The same distances are used for clustering. It is a simple and very useful application once we have an easy way of representing faces as vectors. The code is given below.

Given a bunch of faces, select an image and find the Euclidean distances to all the other faces; the faces of the same person will most likely have the shortest distances, which lets us pool all images of the same person. It is quite similar to the k-nearest neighbors (KNN) problem.
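A sketch of this KNN-style pooling over the look_up_list built above (closest_faces is an illustrative name):

    def closest_faces(index, look_up_list, k=8):
        # Rank all other images by Euclidean distance to the query embedding
        query = look_up_list[index][2]
        dists = [(i, (e - query).pow(2).sum().sqrt().item())
                 for i, _, e in look_up_list if i != index]
        dists.sort(key=lambda pair: pair[1])
        return dists[:k]  # (index, distance) of the k nearest faces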

The reference image is the top-left image in both groups. Left: the pooling of the 8 closest images for an image in the train folder. Right: the pooling of the 8 closest images for an image in the test folder. Clearly, there are 2 false positive images in this group.

Online Mining

Online mining is one of the key concepts discussed in FaceNet, and it helps in developing a robust face recognition system. As the loss function/triplet loss is implemented, it fails to learn anything from anchor-positive pairs that are already close and, similarly, from anchor-negative pairs that are already far apart. By filtering out these easy pairs, the model is systematically exposed to hard and semi-hard pairs.

One way to implement online mining is to compute the positive and negative distances on the fly and filter out the easy negatives that lie far beyond the margin. For each batch, we find all triplets where dist(anchor, negative) < dist(anchor, positive) + margin; these are the hard and semi-hard negatives. Only these triplets are used to derive the loss and take the optimization step.

The picture below illustrates the distances and the definition of easy, semi-hard and hard negatives.

Left: finding hard and semi-hard negatives. (Source: Olivier Moindrot) Right: heat map of distances for the test images generated using this model. The model achieves similar accuracy.

Code for online mining.
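The original snippet is linked as a gist; a minimal sketch of the filtering described above, building on the earlier triplet loss, could look like this:

    def mined_triplet_loss(anc_emb, pos_emb, neg_emb, margin=0.2):
        pos_dist = (anc_emb - pos_emb).pow(2).sum(1)
        neg_dist = (anc_emb - neg_emb).pow(2).sum(1)
        # Keep only hard and semi-hard triplets: dist(anc, neg) < dist(anc, pos) + margin
        keep = neg_dist < pos_dist + margin
        losses = torch.clamp(pos_dist - neg_dist + margin, min=0.0)
        # If every triplet in the batch is easy, return a zero loss
        return losses[keep].mean() if keep.any() else losses.sum() * 0.0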

Contrastive loss

This is an alternative way to build a face recognition system, where a contrastive loss is used in place of the triplet loss. Unlike the triplet loss, which is calculated over triplets, here the pairs are separated into two groups (same vs. different) and labeled accordingly. Here is a snippet of the loss function.

    # Loss function for contrastive loss
    # Find the distance between anchor and positive or negative
    # Pairs have labels 1 (same) or 0 (different)
    # Finally, as the loss is minimized, the distance to the same person goes down
    euclidean_distance = F.pairwise_distance(anc_fc_out, pn_fc_out)
    loss_contrastive = torch.mean(
        labels * torch.pow(euclidean_distance, 2) +
        (1 - labels) * torch.pow(torch.clamp(margin - euclidean_distance, min=0.0), 2))

Find the link to full implementation of contrastive loss here.

Dimensionality reduction and visualization of clusters

We can employ t-SNE or other methods to reduce the 20-dimensional vectors to 2d or 3d, and plot them to see that the image embeddings belonging to the same class/person form a small cluster, as all the vectors point to a small region of the vector space.

Below is the 3d transformation of the embeddings using t-SNE. Find the link to the 3d plot here.

t-SNE transformation and 3d plotting of 40 individuals. We can see the clusters, though the separation is not so great.
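A sketch of the reduction and plotting, reusing the look_up_list embeddings from above:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    embeddings = np.stack([e.detach().numpy() for _, _, e in look_up_list])  # (400, 20)
    coords = TSNE(n_components=3).fit_transform(embeddings)                  # (400, 3)

    labels = np.repeat(np.arange(40), 10)  # person id per image, in folder order
    ax = plt.figure().add_subplot(projection='3d')
    ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], c=labels, cmap='tab20')
    plt.show()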

Conclusion

Face recognition is one of the most widely used algorithms in AI applications. The basic architecture for one-shot learning is quite intuitive and fun. In this article, I have tried to highlight the most important concepts in face recognition: loss functions, recognition, clustering and online mining. I hope you enjoyed the read, and do remember to leave feedback.

References:

[1] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, Lior Wolf. DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR 2014. DOI: 10.1109/CVPR.2014.220

[2] Florian Schroff, Dmitry Kalenichenko, James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering, CVPR 2015. arXiv:1503.03832v3 [cs.CV], 17 Jun 2015

It is also helpful to watch the deep learning videos on face recognition by Andrew Ng (https://www.youtube.com/watch?v=-FfMVnwXrZ0).
