A simple face verification system using Keras and OpenCV

Debapriya Tula · Published in Analytics Vidhya · 7 min read · Dec 22, 2019

You have almost certainly come across the term ‘face verification’, or at least systems that use this technology. But have you ever wondered what goes on inside these systems? Let me address that.

You have probably seen biometric systems that capture pictures of people and try to determine whether they belong to a predefined set of people. Such a system tries to match the person whose face was captured against the people already registered in its database as authorized users.

What is face verification?

Face verification can be thought of as a classification problem in which a person’s identity is verified using a matching score: if two images are of the same person, they should have a high matching score, and if they are of two different people, the score should be low.

The first thing that may come to your mind is: why not match the captured picture with a stored picture pixel by pixel? If the distance (mean squared or absolute) between the pixel values of the captured image and those of the stored image is small, they must belong to the same person. But since the pixel values in an image change dramatically with even a slight change in lighting, position or orientation, this method might not, and in fact does not, work well.
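To make the weakness of raw pixel matching concrete, here is a minimal sketch that compares two images by their mean squared pixel difference (the file names, resize dimensions and usage are purely illustrative):

```python
import cv2
import numpy as np

def pixel_mse(path_a, path_b, size=(180, 200)):
    """Mean squared difference between the raw pixel values of two images."""
    a = cv2.resize(cv2.imread(path_a), size).astype("float32")
    b = cv2.resize(cv2.imread(path_b), size).astype("float32")
    return float(np.mean((a - b) ** 2))

# Even a small shift in lighting or head pose can change this score far more
# than swapping in a different person's face, which is why it fails in practice.
print(pixel_mse("person1_photo1.jpg", "person1_photo2.jpg"))
```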

So, what do we do now? This is where convolutional neural networks, better known as CNNs, help us. Such networks can help us better represent images by embedding each image into a d-dimensional vector space. The image embeddings are then evaluated for similarity.

Some of the approaches to the problem

  1. Exemplar SVMs: Here, the idea is to train a linear SVM classifier for each exemplar in the training set, so that in each case we end up with one positive instance and a lot of negative ones. To know more about exemplar SVMs, please refer to this.
  2. DeepID: Here, the task of verification is thought of as a sub-problem of face identification (assigning a label to each person). This is based on the idea that training a neural network for the much harder problem of identification can in principle give very good descriptors for verification. This has been observed to learn feature transforms that take multiple views of a face into account. To know more about DeepID, please refer to this.
  3. Siamese Network: This is based on the idea that intrapersonal distances should be much smaller than interpersonal distances. This approach is what we take up in detail here.

Before delving into Siamese networks, let us first discuss a very important concept that Siamese networks are based on: one-shot learning.

One-shot learning is an object categorization problem, found mostly in computer vision, in which we try to learn about object categories from just one or a few training samples. Deep learning usually requires a lot of data, and the more we have access to, the better. In the case of face verification, however, we are unlikely to have thousands of images of a person before the learning actually begins. Moreover, our brain does not need thousands of pictures of a person in order to recognize him or her, so neural networks, which are loosely modelled on the brain, should ideally not require a huge number of examples for this task either.

For the face verification task, we expect the system to be able to judge the identity of a person from a single/few images.

As I mentioned earlier, CNNs are helpful for producing vector representations of images. But CNNs require a lot of examples to train, and it is not convenient to retrain a model every time a new person’s images are added to the database. So why not build a model that learns representations which keep two different people far apart while keeping two images of the same person close together? This is exactly what a Siamese network does.

Image source: Convolutional Neural Networks by deeplearning.ai

The image x(1) is fed to a CNN consisting of convolutional layers followed by fully connected layers. The convolutional layers provide a meaningful, low-dimensional and somewhat invariant feature space, while the fully connected layers appended to them learn a (mostly non-linear) function in that space. What we get at the end is a feature vector (no softmax activation is added, since the vector is not used for classification at this stage). The image x(2) is fed to an identical CNN, and in our case a third image x(3) is fed to the same network as well.

We choose x(1) as our anchor image, x(2) as our positive image and x(3) as our negative image. The anchor and the positive images are of the same person, while the negative image is of a different person. So our objective is to minimize the distance between the anchor and the positive image while maximizing the distance between the anchor and the negative image.
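To make the idea of one network encoding all three images concrete, here is a minimal Keras sketch. The layer sizes and the 128-dimensional embedding are illustrative choices, not the exact network used later in the post; sharing one encoder across the three inputs is the classic Siamese setup.

```python
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from keras.models import Model

def build_base_cnn(input_shape=(224, 224, 3), embedding_dim=128):
    """A small CNN mapping an image to a d-dimensional feature vector (no softmax)."""
    inp = Input(shape=input_shape)
    x = Conv2D(32, (3, 3), activation="relu")(inp)
    x = MaxPooling2D()(x)
    x = Conv2D(64, (3, 3), activation="relu")(x)
    x = MaxPooling2D()(x)
    x = Flatten()(x)
    out = Dense(embedding_dim)(x)  # the feature vector / embedding
    return Model(inp, out, name="base_cnn")

base_cnn = build_base_cnn()

# The same model instance (and therefore the same weights) encodes all three images.
anchor_in   = Input(shape=(224, 224, 3), name="anchor")
positive_in = Input(shape=(224, 224, 3), name="positive")
negative_in = Input(shape=(224, 224, 3), name="negative")

anchor_emb   = base_cnn(anchor_in)
positive_emb = base_cnn(positive_in)
negative_emb = base_cnn(negative_in)
```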

In what follows, d is the distance function, the squared L2 norm between two embeddings: d(x(i), x(j)) = ‖f(x(i)) − f(x(j))‖², where f denotes the embedding produced by the network.

The objective can be written as:

d(x(1), x(2)) ≤ d(x(1), x(3))

This, however, leaves open the possibility of the network learning the same (or almost the same) encodings for the positive and the negative images, which would still satisfy the above inequality. This is why we add a small margin alpha (a hyperparameter) to ensure that there is always some gap between d(x(1), x(2)) and d(x(1), x(3)).

With the margin added, and writing A = x(1), P = x(2), N = x(3), the constraint becomes:

‖f(A) − f(P)‖² − ‖f(A) − f(N)‖² + α ≤ 0

Now, how do we frame this as a loss function? Taking only the positive part of the violated constraint gives:

L(A, P, N) = max(‖f(A) − f(P)‖² − ‖f(A) − f(N)‖² + α, 0)

This is known as the triplet loss function. Whenever the margin constraint is satisfied, the term inside the max is at most 0 and the loss is zero; any violation of the constraint contributes to the error.
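As a rough sketch of how this loss can be implemented with the Keras backend (alpha = 0.2 is just an example value; the linked code may differ):

```python
from keras import backend as K

def triplet_loss(anchor_emb, positive_emb, negative_emb, alpha=0.2):
    """Mean over the batch of max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    pos_dist = K.sum(K.square(anchor_emb - positive_emb), axis=-1)
    neg_dist = K.sum(K.square(anchor_emb - negative_emb), axis=-1)
    return K.mean(K.maximum(pos_dist - neg_dist + alpha, 0.0))
```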

OK, that was a lot of theory.

Let’s have a look at the long-awaited code.

The Code

Creating the data

This is the code for creating the dataset. It uses a pre-trained Haar cascade classifier for frontal faces to detect the face. The code below captures 20 images of the person’s face using a webcam and stores them inside a folder named <username>; these folders are kept inside the dataset folder.

The link to the file can be found here
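The embedded gist is not reproduced here, so below is a minimal sketch of the same idea. It assumes OpenCV’s bundled haarcascade_frontalface_default.xml, a dataset/<username> folder layout and a 180×200 crop size, all of which may differ from the linked file.

```python
import os
import cv2

def create_dataset(username, n_images=20, out_dir="dataset"):
    """Capture face crops from the webcam and save them under dataset/<username>."""
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    save_dir = os.path.join(out_dir, username)
    os.makedirs(save_dir, exist_ok=True)

    cap = cv2.VideoCapture(0)          # default webcam
    count = 0
    while count < n_images:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
        for (x, y, w, h) in faces[:1]:  # take at most one face per frame
            crop = cv2.resize(frame[y:y + h, x:x + w], (180, 200))  # width x height
            cv2.imwrite(os.path.join(save_dir, "%d.jpg" % count), crop)
            count += 1
    cap.release()
```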

Creating the model and training it

While creating the dataset, you can see that I have padded my existing images (here by 22 pixels on each side along the width and 12 on each side along the height). This is because I am using a VGG16 model pretrained on the ImageNet dataset, which expects input images of shape (224, 224, 3), whereas each image in my dataset has shape (200, 180, 3).
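For example, this padding can be done with OpenCV’s copyMakeBorder: 12 pixels on the top and bottom and 22 on the left and right turn a (200, 180, 3) crop into the (224, 224, 3) input that VGG16 expects (the exact call and path in the linked code may differ).

```python
import cv2

# 'face' is one (200, 180, 3) crop from the dataset (the path is hypothetical)
face = cv2.imread("dataset/some_user/0.jpg")

# Pad to (224, 224, 3): 12 px top/bottom, 22 px left/right
padded = cv2.copyMakeBorder(face, top=12, bottom=12, left=22, right=22,
                            borderType=cv2.BORDER_CONSTANT, value=0)
```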

Triplets are formed and stored in the triplets list (in the code attached below). For each person (identified by a folder inside the dataset folder), we store 5 (A, P, N) triplets. The triplets are then used to train the model, which itself consists of three VGG16 networks joined by a Lambda layer that implements the triplet loss function.

The code is quite self-explanatory and easy to understand.

To train the model, which has approximately 420,000,000 parameters, I used Intel’s DevCloud, which provides 200 GB of storage and a whopping 92 GB of RAM. It is also optimized for frameworks such as TensorFlow, PyTorch and Caffe. Training the network takes me around 3–4 hours, after which I save it to ‘model.h5’.

The link to the file can be found here
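The training gist itself is not reproduced above. Below is a rough sketch of the setup described: an ImageNet-pretrained VGG16 backbone, three image inputs, and a Lambda layer computing the triplet loss. For brevity it uses a single shared encoder, whereas the post’s model uses three separate VGG16 branches (which is where the roughly 420 million parameters come from); the embedding size, margin and optimizer are assumptions.

```python
import numpy as np
from keras import backend as K
from keras.applications import VGG16
from keras.layers import Input, Flatten, Dense, Lambda
from keras.models import Model

def build_encoder(embedding_dim=128):
    """VGG16 backbone (ImageNet weights) followed by a dense embedding layer."""
    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    inp = Input(shape=(224, 224, 3))
    x = Flatten()(base(inp))
    return Model(inp, Dense(embedding_dim)(x), name="encoder")

def triplet_lambda(tensors, alpha=0.2):
    """Per-triplet loss: max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    a, p, n = tensors
    pos = K.sum(K.square(a - p), axis=-1, keepdims=True)
    neg = K.sum(K.square(a - n), axis=-1, keepdims=True)
    return K.maximum(pos - neg + alpha, 0.0)

encoder = build_encoder()
anchor   = Input(shape=(224, 224, 3), name="anchor")
positive = Input(shape=(224, 224, 3), name="positive")
negative = Input(shape=(224, 224, 3), name="negative")

loss_out = Lambda(triplet_lambda)([encoder(anchor), encoder(positive), encoder(negative)])
model = Model([anchor, positive, negative], loss_out)

# The Lambda output already *is* the loss, so the compiled loss just passes it through.
model.compile(optimizer="adam", loss=lambda y_true, y_pred: y_pred)

# Training would look roughly like this (A, P, N are arrays of padded face images):
# model.fit([A, P, N], np.zeros(len(A)), batch_size=8, epochs=10)
# model.save("model.h5")
```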

Using the model for verification

The detect_face function takes in an image, img (the image of the person captured using the webcam), finds the face and crops it out. The user also enters a username while getting his/her face verified. The cropped face is then verified as follows (a code sketch of this logic follows the steps below):

a) We find the folder named <username> in our dataset folder. We choose an image from that folder.

b) We randomly select three other folders and choose an image from each one of them. These will serve as the negative images.

c) We find the encodings of all five images: the one returned by detect_face, the one from step a) and the three from step b).

d) We compute the mean squared error between the encoding of the image returned by detect_face and the encoding of each of the images obtained in steps a) and b).

e) If the error from step a) is smaller than every error from step b), we say that the person is authorised; otherwise, the verification fails.

The link to the file can be found here
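The verification gist is likewise only linked, not shown. Below is a minimal sketch of the decision rule in steps a) to e); the embed and load_image helpers (running one image through the trained encoder, and reading plus padding an image) are hypothetical stand-ins for whatever the linked code uses.

```python
import os
import random
import numpy as np

def verify(captured_face, username, embed, load_image,
           dataset_dir="dataset", n_negatives=3):
    """Return True if captured_face matches <username> more closely than other people."""
    # a) pick one reference image of the claimed identity
    user_dir = os.path.join(dataset_dir, username)
    ref = load_image(os.path.join(user_dir, random.choice(os.listdir(user_dir))))

    # b) pick one image from each of three randomly chosen other people
    others = [d for d in os.listdir(dataset_dir) if d != username]
    negatives = []
    for folder in random.sample(others, n_negatives):
        neg_dir = os.path.join(dataset_dir, folder)
        negatives.append(load_image(os.path.join(neg_dir, random.choice(os.listdir(neg_dir)))))

    # c) encode the captured face, the reference and the negatives
    cap_enc = embed(captured_face)

    # d) mean squared error between the captured encoding and each candidate encoding
    ref_err = np.mean((embed(ref) - cap_enc) ** 2)
    neg_errs = [np.mean((embed(img) - cap_enc) ** 2) for img in negatives]

    # e) authorised only if the claimed identity is the closest match
    return ref_err < min(neg_errs)
```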

So, this is how we implement a simple face verification tool. The code for the above implementation can be found here.

This blog post was made as part of the Intel Student Ambassador competition.
