How Do Facial Recognition Systems Work?

Shiv Vignesh
Published in Analytics Vidhya · 9 min read · Aug 5, 2020

This title needs no introduction; everybody has come across the term in everyday life. Yet we use it quite liberally without knowing the wide range of applications face recognition algorithms are used in. Facial recognition is not just limited to unlocking a door by verifying one's face; the applications spread well beyond security. These algorithms can be used to predict a person's gender, age or ethnicity, track missing people & pets, help law enforcement track criminals across the world & so much more.

Humans can perform the task of recognizing faces exceptionally well, but making a computer or a machine recognize faces automatically remains a hard problem to this day. Let us understand how facial recognition software works.

The pipeline of automatic facial recognition

The task of facial recognition can be broken down into 4 steps:

  • Face Detection.
  • Face Alignment.
  • Feature Extraction.
  • Recognition/Verification.

Traditional facial recognition systems implemented each of these 4 steps as a separate module, whereas modern facial recognition software combines some or all of these steps into a single module, depending upon how the system is architected.
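
To make this concrete, here is a minimal sketch using the open-source face_recognition Python library, which hides detection, alignment & embedding behind a couple of calls; the image paths are placeholders & this is just one possible toolchain, not the only way to build the pipeline.

```python
# Minimal end-to-end sketch with the `face_recognition` library (image paths are placeholders).
import face_recognition

known_image = face_recognition.load_image_file("person_a.jpg")
unknown_image = face_recognition.load_image_file("unknown.jpg")

# face_encodings() internally detects the face, aligns it via landmarks
# & runs an embedding CNN, returning a 128-d vector per face found.
known_encoding = face_recognition.face_encodings(known_image)[0]
unknown_encoding = face_recognition.face_encodings(unknown_image)[0]

# Verification: are the two embeddings close enough to belong to the same person?
same_person = face_recognition.compare_faces([known_encoding], unknown_encoding)[0]
print("Same person" if same_person else "Different people")
```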

Face Detection

The first step is to identify the face or faces in the image or input video feed. This involves predicting the location of each face & the extent of the region to be localized (the bounding box).

The task of detecting faces is similar to object recognition & detection; in this case, the only class of object to identify is the human face.

Humans also first detect the faces in their line of sight & then look closely at them to recognize them as somebody or something. Our brains process this information so fast, in a matter of milliseconds, that we can't perceive it as two different processes.

The face detection algorithm must be robust, as this is the first step in the facial recognition pipeline: “A face which is not detected cannot be recognised”.

As human faces are highly dynamic & show a high degree of variability in appearance, the algorithm must be able to detect faces across orientations, angles, hairstyles, skin colours, ages, with or without makeup & so on.

Face detection can be performed in 2 ways:

  1. Using Feature descriptor algorithms

A feature detector algorithm takes an input image & outputs the locations of significant areas in the image. For example, an edge detector outputs the locations of edges in the image; similarly, a corner detector outputs locations that contain corners. These detectors don't provide information about any feature other than the ones they are meant to detect.

Once features have been detected, a local image patch around each feature can be extracted. This extraction may involve a considerable amount of image processing. The result is known as a feature descriptor or feature vector, & feature descriptor algorithms do exactly this.

A feature descriptor is an algorithm which takes an image and outputs feature descriptors/feature vectors. Feature descriptors encode interesting information (features) into a series of numbers and act as a sort of numerical “fingerprint” that can be used to differentiate one feature from another. Ideally, this information would be invariant under image transformation, so we can find the feature again even if the image is transformed in some way or if the same feature appears in another image.

Histogram of Oriented Gradients, or just HOG, is one such feature descriptor which is widely used to detect faces in the facial recognition pipeline. Follow this link to learn more about HOG.

In short, HOG divides the image into small square cells (an n×n grid) & computes a histogram for each cell separately. Each histogram is built from the gradients & orientations of the pixel values in that cell.

To find the faces in a HOG image, all one has to do is find the region of the image which looks similar to a known HOG pattern of a face from the training images.
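
As a rough illustration, the sketch below computes a HOG descriptor with scikit-image & then runs dlib's default frontal face detector, which is a HOG + linear SVM sliding-window model; the image path is a placeholder.

```python
# Sketch: HOG features with scikit-image, HOG-based face detection with dlib.
import dlib
from skimage import io, color
from skimage.feature import hog

image = io.imread("photo.jpg")          # placeholder path
gray = color.rgb2gray(image)

# HOG descriptor: gradients are binned into orientation histograms per 8x8 cell.
features, hog_image = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                          cells_per_block=(2, 2), visualize=True)

# dlib's default detector slides a learned HOG face template over the image.
detector = dlib.get_frontal_face_detector()
for rect in detector(image, 1):          # upsample once to catch smaller faces
    print("Face at", rect.left(), rect.top(), rect.right(), rect.bottom())
```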

Feature descriptors are very efficient for detecting objects in real time but have limitations when the lighting conditions vary, the object is blurred or the orientation of the face changes. These restrictions can be resolved by using a Convolutional Neural Network (CNN) based detector.

2. CNN-based Detector

A pre-trained CNN-based detector can be used to localize the faces in an image. Such a network is trained on images of individuals & crowds across a diverse spectrum of scenarios.

These detectors can handle faces in almost any orientation or lighting condition & are much more robust than HOG or SIFT.

Object detection algorithms like YOLO or Faster-RCNN can be trained to detect faces.

YOLO Network trained to detect faces.

MTCNN is an algorithm which provides face detection & facial landmark detection cascaded into a single network.
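
A minimal sketch with the open-source mtcnn package could look like this; it returns a bounding box, a confidence score & five facial keypoints per face in a single pass (the image path is a placeholder).

```python
# Sketch: CNN-based face detection with the `mtcnn` package.
import cv2
from mtcnn import MTCNN

image = cv2.cvtColor(cv2.imread("crowd.jpg"), cv2.COLOR_BGR2RGB)  # MTCNN expects RGB
detector = MTCNN()

for face in detector.detect_faces(image):
    x, y, w, h = face["box"]                 # bounding box of the face
    keypoints = face["keypoints"]            # eyes, nose & mouth corners
    print(f"Face at ({x}, {y}, {w}, {h}), confidence {face['confidence']:.2f}")
    print("Left eye at", keypoints["left_eye"])
```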

The major disadvantage of using a CNN-based detector is the processing time. Computing & running millions to billions of parameters takes a considerable amount of time, which is not ideal in real time. A powerful Nvidia GPU can be used to run a CNN, but that is very expensive.

Face Alignment

The output of the face detection algorithm is the set of regions where faces are located. Before we proceed to recognize the detected faces, it is essential to centre-align them.

Why?

As discussed, the detected face may not always be the person's frontal profile. Sometimes the detected faces show only the left or right profile, or the faces may be tilted or oriented differently. For the algorithm to extract features effectively, it is necessary to centre-align the faces.

To align the faces, facial landmark detection is applied to detect key landmarks on the faces. The intuition is to locate landmarks such as the eyes, nose, eyebrows & lips, which are found on every human face. There are two widely used landmark detection models: the 68 point landmark & the 5 point landmark models.

Let us understand the process using the 68 point landmark algorithm.

Given the region of a detected face in the image, the algorithm identifies 68 different key points or landmarks for that face. These include the inner eyes, outer eyes, upper lip, lower lip etc.
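
As an illustration, the sketch below uses dlib's pretrained 68 point shape predictor; the .dat model file is distributed separately by dlib & the image path is a placeholder.

```python
# Sketch: 68-point facial landmark detection with dlib.
import dlib

image = dlib.load_rgb_image("photo.jpg")                 # placeholder path
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

for rect in detector(image, 1):
    shape = predictor(image, rect)                        # 68 (x, y) points for this face
    points = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
    eye = points[36:42]                                   # indices 36-41 cover one eye
    print(f"{shape.num_parts} landmarks found, first eye point at {eye[0]}")
```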

Now that we have the locations of important landmarks such as the eyes, nose & mouth, by simply rotating, scaling & shearing the image we can centre-align these key points as best as possible (a minimal sketch follows the list below). The goal of face alignment is to transform the image such that

  • the faces are centred in the image
  • the faces are rotated such that the eyes lie on a horizontal line (the eyes share the same y-coordinate after rotation)
  • the faces are scaled such that they are approximately of the same dimensions.
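
Here is a minimal alignment sketch along those lines: it rotates the image about the midpoint between the eyes so that the eye line becomes horizontal. The eye centres would normally come from a landmark detector as above; the coordinates used here are made up.

```python
# Minimal sketch: rotate a face so that both eyes end up on a horizontal line.
import numpy as np
import cv2

def align_by_eyes(image, left_eye, right_eye):
    """Rotate `image` about the midpoint between the eyes so the eyes become level."""
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))                 # tilt of the eye line
    centre = ((left_eye[0] + right_eye[0]) / 2.0,
              (left_eye[1] + right_eye[1]) / 2.0)
    matrix = cv2.getRotationMatrix2D(centre, angle, scale=1.0)
    h, w = image.shape[:2]
    return cv2.warpAffine(image, matrix, (w, h))

# Made-up eye coordinates on a dummy image, purely for illustration:
dummy = np.zeros((200, 200, 3), dtype=np.uint8)
aligned = align_by_eyes(dummy, left_eye=(60, 95), right_eye=(140, 105))
```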

A more optimized & faster option for face alignment is the 5 point landmarking model, which uses 2 points on the left eye, 2 points on the right eye & 1 point on the nose, or, in another variant, 2 points for the eyes, 1 point on the nose & 2 points on either side of the lips.

The 5 point model is about 10% faster than the 68 point model & is almost 10 times smaller. If the task is just to align faces, the 5 point landmark model is more than sufficient.

The applications of landmark detection are not limited to face alignment & extend well beyond it.

  • Instagram/Snapchat filters & widgets.
  • Drowsiness detection.
  • Digital makeup
  • Prediction of facial expressions & estimating facial pose.

Tasks such as reading facial expressions involve analysing multiple points on the face, so the 68 point landmark model is the most suitable; likewise, for a drowsiness detector, the 68 point model provides 6 points for each eye & 20 points for the mouth region.

Feature Extraction & Face Embeddings

This is where the recognition part of the task begins. Up until now, we dealt with the detection of faces & how to align them so that features can be extracted effectively. This is the part where the network learns to distinguish one face from another.

Human faces are dynamic, with a high degree of variability amongst us. Each person's face is distinct from every other in terms of facial features: for example, the size of the eyes, nose & eyebrows, the colour of the hair & a combination of several other factors.

This raises a question: which facial feature measurement is the most reliable for recognition? Is it the size of the eyes, the size of the nose, or a metric which is a composition of all the facial features?

Turns out the measurements which seem obvious to us humans don't make sense to a computer looking at the pixel values of the images. The best strategy is to let the computer, or rather the deep learning network, figure out the best feature representation.

The feature representation created by the deep learning network for a given face is called a face embedding.

A deep convolutional neural network is trained to generate a vector of fixed length (128 dimensions or more) for an aligned face. This 128-dimensional latent vector acts as a signature for each person the network encounters in the training dataset.
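
For instance, the face_recognition library (a wrapper around dlib's ResNet-based recognition model) produces exactly such a 128-dimensional embedding; the image paths below are placeholders.

```python
# Sketch: computing 128-d face embeddings for two photos of the same person.
import numpy as np
import face_recognition

image_1 = face_recognition.load_image_file("person_a_1.jpg")
image_2 = face_recognition.load_image_file("person_a_2.jpg")

embedding_1 = face_recognition.face_encodings(image_1)[0]
embedding_2 = face_recognition.face_encodings(image_2)[0]

print(embedding_1.shape)                            # (128,)
print(np.linalg.norm(embedding_1 - embedding_2))    # small distance -> likely the same person
```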

How does the network create unique vectors for every person?

The training of the CNN works by looking at 3 images at every single iteration. Here, 2 out of the 3 images are of the same person (say person A) & the third image is of a different person (say person B).

  • Load training image of A (anchor image).
  • Load a different training image of A (positive image).
  • Load a training image of B (negative image).

The CNN outputs 3 feature embedding vectors for the 3 images.

Now, this is where the magic happens, inside the triplet loss function.

The triplet loss measures the distance between the feature vectors of the anchor image & the positive image, and the distance between the feature vectors of the anchor image & the negative image.

The CNN is tweaked such that the distance between the anchor & positive vectors is reduced, whereas the distance between the anchor & negative vectors is increased. By doing this, the network learns to generate similar face embeddings for the same person & dissimilar ones for different people.
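
A minimal numpy sketch of this triplet loss is shown below; f_a, f_p & f_n stand for the network's embeddings of the anchor, positive & negative images, & the margin value is purely illustrative.

```python
# Sketch: triplet loss on three embedding vectors.
import numpy as np

def triplet_loss(f_a, f_p, f_n, margin=0.2):
    """Penalise the network unless the anchor is closer to the positive
    than to the negative by at least `margin`."""
    pos_dist = np.sum((f_a - f_p) ** 2)     # squared distance anchor <-> positive
    neg_dist = np.sum((f_a - f_n) ** 2)     # squared distance anchor <-> negative
    return max(pos_dist - neg_dist + margin, 0.0)

# Toy 128-d embeddings, just to show the computation:
rng = np.random.default_rng(0)
f_a, f_p, f_n = rng.normal(size=(3, 128))
print(triplet_loss(f_a, f_p, f_n))
```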

By training the network on millions of images of thousands of different people, the network learns to output different feature/embedding vectors for different people. The trained network can then reliably output a feature vector even for a person it has never seen before.

Different networks have been trained to output face embeddings & the length of the vector depends on the network one is using. Some notable CNNs used are FaceNet, VGGFace, OpenFace, DeepFace etc.

The applications of face embeddings are not limited to verifying the identity of an individual & span much beyond it. One can predict a person's age, gender, nationality & much more.

Recognizing the face

Finally, the last part of the pipeline is to determine the identity of the person. The previous module, the feature extraction network, outputs a feature vector representing the image of the person. All one has to do now is identify the person in the database whose face embedding matches it most closely.

Since people's day-to-day appearance changes & the orientation of the face won't be the same every single time, the face embedding values also vary, i.e. the face embeddings of two different images of the same person will differ with some degree of variability.

By using an SVM classifier, or any other clustering or classification algorithm, we can obtain the closest match for the face embedding in the database.
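
As a simple sketch, one could also match a query embedding against a database of known embeddings by nearest Euclidean distance with a rejection threshold; the names, embeddings & the 0.6 threshold below are illustrative assumptions rather than part of any particular system.

```python
# Sketch: nearest-neighbour identification of a face embedding with a distance threshold.
import numpy as np

def identify(query, database, threshold=0.6):
    """Return the name of the closest known embedding, or None if nothing is close enough."""
    names = list(database)
    distances = [np.linalg.norm(query - database[name]) for name in names]
    best = int(np.argmin(distances))
    return names[best] if distances[best] <= threshold else None

# Hypothetical database of precomputed embeddings (random vectors for illustration):
rng = np.random.default_rng(1)
database = {"alice": rng.normal(size=128), "bob": rng.normal(size=128)}
query = database["alice"] + rng.normal(scale=0.01, size=128)   # noisy re-capture of "alice"
print(identify(query, database))                                # -> "alice"
```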

And voila…..

Conclusion

I hope you found this article to be informative, please do check out my other articles. Thank you for reading! :)
