Identifying Faces with MTCNN and VggFace

Abhishek Kumar

Published in

The Startup

6 min read · Jan 20, 2021

Banner — Source : https://www.kairos.com/blog/face-detection-explained

Face Verification is the task of confirming a person's identity from a source such as an image, video, or camera feed of their face. Methods of face verification differ in how they analyse and extract features and facial patterns from an image. The task can be broken down into three subtasks: Face Detection, Feature Extraction, and Classification.

Face Detection

Face detection involves detecting the bounding box that contains the face in a given image. An ideal bounding box should perfectly encapsulate the face without cropping out important facial shapes and features and without including more surrounding area than is necessary.

I chose the Multi-task Cascaded Convolutional Network (MTCNN) as the face detector in this program. Although it is not as fast as non-deep-learning methods such as the Haar cascade classifier, its higher accuracy easily makes the trade-off worthwhile.

MTCNN is a Cascaded Network of three CNNs:

The first stage is a fully convolutional Proposal Network (P-Net). It generates candidate windows and then reduces their number by discarding heavily overlapping boxes.

P-Net Block Diagram — Fig. P-Net diagram from the MTCNN paper

The first stage takes as input an image pyramid made up of differently scaled copies of the input image. Because P-Net looks at fixed-size windows, the pyramid lets it find faces of many different sizes and helps make the model scale invariant.
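To make the image pyramid concrete, here is a minimal sketch of how the scale factors could be computed. The defaults (a 12 x 12 P-Net window, a minimum face size of 20 pixels, and a shrink factor of 0.709 per level) are assumptions borrowed from common open-source MTCNN implementations, and `pyramid_scales` is a name made up for this illustration.

```python
def pyramid_scales(height, width, min_face_size=20, factor=0.709, net_input=12):
    """Return the scale factors at which the image is resized for P-Net.

    All defaults are assumptions: P-Net sees 12x12 windows, faces smaller
    than min_face_size pixels are ignored, and every pyramid level shrinks
    the image by `factor`.
    """
    scale = net_input / min_face_size        # maps the smallest face to 12 px
    min_side = min(height, width) * scale
    scales = []
    while min_side >= net_input:             # stop once the image is smaller than one window
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales


scales = pyramid_scales(480, 640)
for s in scales[:3]:
    print(f"scale {s:.3f} -> {round(480 * s)} x {round(640 * s)} pixels")
```

Each entry is one resized copy of the image that P-Net scans. A larger `min_face_size` gives fewer levels, which is one of the easier ways to trade recall on small faces for speed.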

Fig. P-Net Block diagram — Fig. P-Net diagram from the MTCNN paper

The second stage is a CNN called the Refine Network (R-Net). It rejects more of the false candidates and uses non-maximum suppression (NMS) to discard boxes that heavily overlap a higher-scoring box.
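For readers who have not met NMS before, here is a minimal pure-Python sketch of the greedy version. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the 0.5 overlap threshold is an illustrative choice, not a value taken from the MTCNN paper or from my notebook.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that survive greedy NMS, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-identical candidates around one face, plus a separate face.
boxes = [(10, 10, 110, 110), (12, 14, 112, 108), (200, 50, 280, 130)]
scores = [0.95, 0.80, 0.90]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first almost completely and has a lower score, so it is dropped, while the third box covers a different region and survives.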

O-Net block diagram — Fig. O-Net diagram from the MTCNN paper

The Output Network (O-Net) in the third stage refines the boxes further in the same way as R-Net, and adds five facial landmarks (the two eyes, the nose, and the two corners of the mouth) to the final bounding box containing the detected face.
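Here is a small sketch of what you can do with one detection once the three stages have run. It assumes the dict layout returned by `detect_faces()` in the open-source `mtcnn` Python package (a `box` as `[x, y, width, height]`, a `confidence`, and five named `keypoints`). The sample numbers are made up, and `box_to_corners` is a hypothetical helper, not part of that package.

```python
def box_to_corners(box, image_width, image_height):
    """Convert [x, y, width, height] to (x1, y1, x2, y2), clamped to the image."""
    x, y, w, h = box
    x1, y1 = max(0, x), max(0, y)
    x2, y2 = min(image_width, x + w), min(image_height, y + h)
    return x1, y1, x2, y2


# Hand-written sample in the assumed detect_faces() layout; values are illustrative.
detection = {
    "box": [-4, 30, 120, 150],
    "confidence": 0.998,
    "keypoints": {
        "left_eye": (32, 88),
        "right_eye": (84, 86),
        "nose": (58, 118),
        "mouth_left": (36, 146),
        "mouth_right": (82, 145),
    },
}

x1, y1, x2, y2 = box_to_corners(detection["box"], image_width=640, image_height=480)
print((x1, y1, x2, y2), len(detection["keypoints"]))
```

The clamping is there because a predicted box can start slightly outside the image, which would otherwise give a wrong or empty crop when you slice the pixel array with `image[y1:y2, x1:x2]`.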

Feature Extraction

Feature Extraction is the key step in the task of Face Identification. In this step, we take the face image obtained by the Face Detection step above and extract facial features such as landmark points (eyes, nose, mouth, etc.) and the relations between them.

We use the ResNet-50 based VGGFace2 model developed by researchers at the Visual Geometry Group (VGG) at Oxford. This pretrained, open-source model gives much better performance than ‘shallow’ feature extraction and reduction techniques like PCA, LDA, and SIFT.

Among deep learning methods for facial feature extraction, VGGFace has better performance than Facebook’s DeepFace and Carnegie Mellon University’s OpenFace, while being much lighter than Google’s FaceNet (25.6 million parameters vs. 140 million+ parameters).

Fig. Layerwise Parameters breakdown of different ResNet Models.

ResNet-50 has performance comparable to the other much heavier models.

The model represents each face image as a 1 x 2048 feature vector, massively reducing the dimensionality of the data the classifier has to work with and providing distilled features.

Classification

In this step, a classifier decides whether the face in the image matches the identifier face based on the information provided to it.

Some popular methods for classification are:

Cosine Similarity
Euclidean Distance
SVM
K Nearest Neighbours

Cosine Similarity suits our use case best. It tells us how close two images are in an N-dimensional feature space by measuring the cosine of the angle between their feature vectors: the closer the value is to 1, the more alike the two faces are.

In our case, we feed the feature vectors of the ID image and the test image to the cosine similarity function. Using a threshold, we decide whether the face matches or not.
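A minimal sketch of this matching step is shown below. The 4-dimensional vectors stand in for the real 2048-dimensional features, and the 0.5 threshold is an assumption for illustration; in practice you would tune it on your own image pairs. The function names are mine, not from any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def is_match(id_vec, test_vec, threshold=0.5):
    """Declare a match when the similarity reaches the (assumed) threshold."""
    return cosine_similarity(id_vec, test_vec) >= threshold


# Toy vectors standing in for the 2048-dimensional face features.
id_face = [0.9, 0.1, 0.4, 0.2]
same_person = [0.8, 0.2, 0.5, 0.1]
other_person = [0.1, 0.9, 0.0, 0.6]

print(is_match(id_face, same_person))   # True
print(is_match(id_face, other_person))  # False
```

A higher threshold makes the system stricter (fewer false matches, more missed matches), so the right value depends on which mistake is more costly for your use case.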

Et voila! We have our facial Identification System. Now let’s see it in action with one of the toughest facial recognition tests imaginable: telling apart Matt Damon, Mark Wahlberg, and Jesse Plemons.

I have a hard time telling who’s the actor in which movie.

First, let’s extract the identifier face (Matt Damon in our case) using MTCNN:

Screenshot from colab notebook, link provided below.

Now let’s see how our model performs on these faces:

This is Matt Damon, 10 years apart. Our system can be called ‘age invariant’. (or maybe Matt just doesn’t age.)

The model can tell apart Matt Damon and Jesse Plemons, even though I often can’t.

It’s a relatively close call, but the model can just about distinguish between Matt Damon and Mark Wahlberg.

Suffice it to say, our model has passed the Matt Plemonberg test.

Use Cases

Face Detection can be used in various scenarios, for example, to detect whether a person is present in a room or not. This can be employed for security purposes, such as detecting an intruder, or for safety purposes, such as warning a person against entering a very hazardous area, to name a few.

Face Verification is used to check whether the faces in two photos belong to the same person. Video conferencing has become a fact of life now, and Face Verification can be used to take attendance of students, or to measure the ‘attention’ of students in class, i.e. the percentage of the lecture during which the student was present and focussing on the screen. It can also be used for secure admission to residential apartments, by detecting the face of a person and matching it against a database of residents.

Areas of future development

Recent feature extraction methods using Attention Mechanisms have been reported to outperform traditional Residual Networks. Attention-based Facial Feature Extraction is an area worth exploring.
The performance of the classifier stage can be improved by employing deep learning methods such as a Siamese Network.

“The sky’s the limit, if your imagination flies free.
Rise high, soar on the wings of creativity.”

I hope you had fun reading through this blog. If you would like to try the code for yourself, here’s a Colab notebook with the code and some documentation. Please visit the GitHub repository if you want to try out the model on your laptop’s webcam.

References:

[1] Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks : https://arxiv.org/abs/1604.02878

[2] https://www.katacoda.com/ernesto/courses/deep-learning-computer-vision/deep-learning-computer-vision-chapter-29

[3] https://www.youtube.com/watch?v=i_MOwvhbLdI

Author: Abhishek Kumar

About: I am a B.Tech fresher who is extremely passionate about Data Science and Machine Learning. Imagination is the key to success. I live to be on the cutting edge, and believe in always growing and moving forward. Technology and the Human Equation are some favorite topics of discussion. Currently seeking an opportunity to explore and develop in the world of Data Science under a winning enterprise.

Get Connected:

Website: https://AbhishekkumarDS.github.io
Mail: inbox.mrkumar@gmail.com
GitHub: https://github.com/AbhishekKumarDS
LinkedIn: https://www.linkedin.com/in/AbhishekKumarDS
Twitter: https://twitter.com/MADScientist_AK