Face Recognition with FaceNet and MTCNN

Jump in as we introduce a simple framework for building and using a custom face recognition system.

Luka Dulčić
Mar 16, 2021 · 10 min read

This was originally posted in Ars Futura magazine.

We have since put this blog into practical use and created an office door lock that uses face recognition. Read our Smart Lock DIY blog and learn how you can truly do it yourself.

Deep learning advancements in recent years have enabled widespread use of face recognition technology. This article tries to explain deep learning models used for face recognition and introduces a simple framework for creating and using a custom face recognition system.

Formally, Face Recognition is defined as the problem of identifying or verifying faces in an image. How exactly do we recognise a face in an image?

Face recognition can be divided into multiple steps. The image below shows an example of a face recognition pipeline.

Face recognition pipeline.
  1. Face detection — Detecting one or more faces in an image.
  2. Feature extraction — Extracting the most important features from an image of the face.
  3. Face classification — Classifying the face based on extracted features.

There are various ways to implement each of the steps in a face recognition pipeline. In this post we’ll focus on popular deep learning approaches where we perform face detection using MTCNN, feature extraction using FaceNet and classification using Softmax.

MTCNN

MTCNN or Multi-Task Cascaded Convolutional Neural Networks is a neural network which detects faces and facial landmarks on images. It was published in 2016 by Zhang et al.

MTCNN output example.

MTCNN is one of the most popular and most accurate face detection tools today. It consists of 3 neural networks connected in a cascade. You can find a more detailed overview of MTCNN here.
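
As a quick illustration, here is a minimal sketch of face detection with MTCNN, assuming the facenet-pytorch package (one of several available MTCNN implementations; the package choice, image path and parameters are our assumptions, not something prescribed by the original paper):

from facenet_pytorch import MTCNN
from PIL import Image

mtcnn = MTCNN(keep_all=True)          # keep_all=True returns every face found in the image

img = Image.open("group_photo.jpg")   # hypothetical input image
boxes, probs = mtcnn.detect(img)      # bounding boxes and detection probabilities

if boxes is not None:
    for box, prob in zip(boxes, probs):
        print(f"face at {box.tolist()} with probability {prob:.2f}")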

FaceNet

FaceNet is a deep neural network used for extracting features from an image of a person’s face. It was published in 2015 by Google researchers Schroff et al.

How does FaceNet work?

FaceNet takes an image of a face as input and outputs an embedding vector.

FaceNet takes an image of a person’s face as input and outputs a vector of 128 numbers which represent the most important features of the face. In machine learning, this vector is called an embedding. Why embedding? Because all the important information from the image is embedded into this vector. Basically, FaceNet takes a person’s face and compresses it into a vector of 128 numbers. Ideally, embeddings of similar faces are also similar.

Mapping high-dimensional data (like images) into low-dimensional representations (embeddings) has become a fairly common practice in machine learning these days. You can read more about embeddings in this lecture by Google.
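
To make this concrete, here is a minimal sketch of computing a face embedding, again assuming the facenet-pytorch package (an assumed implementation choice; note that its pretrained InceptionResnetV1 model outputs 512-dimensional embeddings rather than the 128 described in the original paper):

from facenet_pytorch import MTCNN, InceptionResnetV1
from PIL import Image
import torch

mtcnn = MTCNN(image_size=160)                              # detects, crops and aligns the face
resnet = InceptionResnetV1(pretrained='vggface2').eval()   # FaceNet-style embedding network

img = Image.open("person1_1.png")                          # hypothetical image containing one face
face = mtcnn(img)                                          # face tensor of shape (3, 160, 160)

with torch.no_grad():
    embedding = resnet(face.unsqueeze(0))                  # embedding tensor of shape (1, 512)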

Ok, what do we do with these embeddings? How do we recognise a person using an embedding?

Embeddings are vectors and we can interpret vectors as points in the Cartesian coordinate system. That means we can plot an image of a face in the coordinate system using its embedding.¹

Face images plotted in 2D.

One possible way of recognising a person in an unseen image would be to calculate the embedding of their face, calculate the distances to the embeddings of known people’s faces, and, if the embedding is close enough to the embeddings of person A, say that the image contains the face of person A.

Recognising a person by calculating embedding distances.
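
In code, this nearest-neighbour idea might look something like the following minimal sketch, assuming we have already computed embeddings for the people we know (the distance threshold is an arbitrary assumption and would need tuning on real data):

import numpy as np

def recognise(new_embedding, known_embeddings, threshold=1.0):
    # known_embeddings: dict mapping a person's name to a list of their face embeddings
    best_name, best_dist = None, float("inf")
    for name, embeddings in known_embeddings.items():
        # distance to the closest known image of this person
        dist = min(np.linalg.norm(new_embedding - e) for e in embeddings)
        if dist < best_dist:
            best_name, best_dist = name, dist
    # accept the match only if the closest known face is close enough
    return best_name if best_dist < threshold else "unknown"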

That looks great, right? Feed the image through FaceNet, get the ✨ magic embedding and see if the face distance is close enough to any of the known faces. But, where does the magic come from? How does FaceNet know what to extract from the image of a face and what do these numbers in an embedding vector even mean?

Let’s try to dig deeper into FaceNet and try to explain how FaceNet learns to generate face embeddings.

In order to train FaceNet we need a lot of images of faces. To keep things simple we’ll assume we only have a couple of images from two people. The same logic can be applied if we have thousands of images of different people. At the beginning of training, FaceNet generates random vectors for every image which means the images are scattered randomly when plotted.

Initial state before training.

FaceNet learns in the following way:

  1. Randomly selects an anchor image.
  2. Randomly selects an image of the same person as the anchor image (positive example).
  3. Randomly selects an image of a person different than the anchor image (negative example).
  4. Adjusts the FaceNet network parameters so that the positive example is closer to the anchor than the negative example.

We repeat these steps until there are no more changes to be done 👉 all the faces of the same person are close to each other and far from others.

This method of learning with anchor, positive and negative examples is called triplet loss.

FaceNet training process.
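
In code, the triplet loss boils down to a few lines. Below is a minimal PyTorch sketch, not the exact implementation from the FaceNet paper; the margin value is an assumption, and PyTorch also ships a ready-made nn.TripletMarginLoss that does the same thing:

import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor, positive, negative: batches of embedding vectors
    pos_dist = F.pairwise_distance(anchor, positive)   # anchor <-> same person
    neg_dist = F.pairwise_distance(anchor, negative)   # anchor <-> different person
    # penalise triplets where the negative is not at least `margin` farther away than the positive
    return F.relu(pos_dist - neg_dist + margin).mean()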

So what do the numbers in the embedding vector mean? Size of the eyes? Distance between the nose and the eyes? Mouth width? Probably… these features seem important for face recognition, but in fact, we don’t really know what these numbers represent and it’s really hard to interpret them.

We don’t directly tell FaceNet what the numbers in the vector should represent during training, we only require that the embedding vectors of similar faces are also similar (i.e. close to each other). It’s up to FaceNet to figure out how to represent faces with vectors so that the vectors of the same people are similar and the vectors of different people are not. For this to be true, FaceNet needs to identify key features of a person’s face which separate it from different faces. FaceNet tries out many different combinations of these features during training until it finds the ones that work best. FaceNet (like neural networks in general) doesn’t represent features in an image the same way we do (distance, size, etc.). That’s why it’s hard to interpret these vectors, but we are pretty sure that something like the distance between the eyes is hidden behind the numbers in an embedding vector.

FaceNet is a function which takes an image of a face as input and outputs a vector of the most important facial features.

The image above is a good summary of what FaceNet is. A function which takes an image as the input and outputs the face embedding (a summary of the face). If you are a developer, you can think of FaceNet as a hash function. FaceNet maps images of the same person to (approximately) the same place in the coordinate system, with the embedding playing the role of the hash code.

Softmax

We mentioned earlier that the classification step could be done by calculating the embedding distances between a new face and known faces, but that approach (known as k-NN) is expensive in both computation and memory. Instead, we decided to use the Softmax classifier, which memorises the boundaries between people and is much more efficient.

Softmax classifier is used as a final step to classify a person based on a face embedding. Softmax was a logical choice for us since the entire stack is neural networks based, but you can use any classifier you wish such as SVM, Random Forest, etc. If the face embeddings themselves are good, all classifiers should perform well at this step.
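
For illustration, a minimal softmax classifier over face embeddings could look like the sketch below. This is a generic PyTorch example, not the framework’s actual training code; the embedding size, number of people, training data and hyperparameters are all placeholders:

import torch
import torch.nn as nn

embedding_dim, num_people = 128, 10                        # assumed sizes
train_embeddings = torch.randn(100, embedding_dim)         # placeholder FaceNet embeddings
train_labels = torch.randint(0, num_people, (100,))        # placeholder person ids

classifier = nn.Linear(embedding_dim, num_people)          # one logit per known person
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                            # applies softmax internally

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(classifier(train_embeddings), train_labels)
    loss.backward()
    optimizer.step()

# at inference time, pick the person with the highest probability
new_embedding = torch.randn(1, embedding_dim)              # placeholder embedding of an unseen face
probs = torch.softmax(classifier(new_embedding), dim=-1)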

EDIT: When I wrote this part I didn’t fully understand how difficult it is to train the classifier. Once I built a real-world face recognition project I realised it’s not easy at all. You can read more about the project here.

Face Recognition Framework

At Ars Futura, we developed a simple framework for creating and using a Face Recognition system. Our Face Recognition system is based on components described in this post — MTCNN for face detection, FaceNet for generating face embeddings and finally Softmax as a classifier. The framework is free, open-source, and you can find it here.

Create your Face Recognition system

First, you need to collect images of people that you want to be able to recognise down the road. The images should be provided in the following directory structure:

- images
  - person1
    - person1_1.png
    - person1_2.png
    ...
    - person1_n.png
  - person2
    ...
  - personN

Every person you want to recognise must have a dedicated directory with their images in it. The images have to contain the face of only one person. If the image contains multiple faces, only the one detected with the highest probability will be considered.

Now that you have your data prepared, you can create your custom Face Recognition system with the following command:

./tasks/train.sh path/to/folder/with/images

That’s it! After this command successfully finishes, you have your own Face Recognition system! You can use your brand-new Face Recognition system in several ways. There are easy-to-use Python scripts that perform face recognition on images or a live video. There is also a Dockerfile for generating a Docker image with Face Recognition system and REST API that you can call. Read more about it in the README!

Which Team Do You Belong To?

As a fun little way to test out the framework, we collected images of our employees and, using the Face Recognition Framework, created our custom Face Recognition system!

We used this system to implement Which Team Do You Belong To? feature on our careers page.

Check it out on our Careers page.

How does it work?

You take a selfie or upload a photo and let us analyze which team at Ars Futura you belong to.

Actually, behind the scenes, we detect your face and get similarity scores between you and all the Ars Futura employees. We sum and normalize scores by team and assign you to the team with the highest similarity score!

Equation for calculating similarity score with a specific team.
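
Roughly, the idea behind that equation can be sketched like this (a simplified assumption based on the description above, not necessarily the exact formula used on the careers page):

def best_team(similarities, teams):
    # similarities: dict mapping employee name -> similarity score with the visitor's face
    # teams: dict mapping team name -> list of employee names on that team
    scores = {team: sum(similarities[m] for m in members) / len(members)
              for team, members in teams.items()}
    return max(scores, key=scores.get)   # team with the highest normalised score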

Bonus: Apple FaceID

FaceID is probably the most popular face recognition system today. It was launched in 2017 alongside the iPhone X. FaceID lets you unlock your iPhone, authenticate with different apps, and authorize payments with your face.

Setting up FaceID on your iPhone is a matter of recording a few pictures of your face from different angles.

FaceID setup process. Source.

How does FaceID work? How can FaceID recognise you based on just a few photographs you take during the setup process?

If you know anything about machine learning, you know that taking just a few photos is not nearly enough for training a robust model. Plus, mobile phones don’t really have enough resources to perform that kind of training. You would run out of battery very fast if there was some kind of training ongoing on your iPhone, so we can safely conclude that FaceID doesn’t do any kind of training on the device.

We don’t really know the inner workings of FaceID because Apple didn’t reveal too many details, but based on the facts we do know, we are pretty sure that the backbone of FaceID is some sort of FaceNet-like neural network which extracts face embeddings.

FaceID uses a FaceNet-like neural network which is trained on millions of faces offline to generate face embeddings. That pre-trained network is shipped and updated together with the iOS operating system. When you set up FaceID, it takes several photos of your face, calculates the face embeddings and stores those embeddings on the device. When you try to unlock your iPhone, FaceID takes your photo, calculates your face embedding and compares it to embeddings it has stored on the device. If those embeddings are similar enough — your phone will be unlocked.

Hold on, isn’t this the same as the Android Trusted Face feature, which could be fooled with a photo of a person’s face and was recently removed because of it?

Sorry! I lied about FaceID taking photos. This is where FaceID differs from face recognition technologies on other mobile devices. FaceID doesn’t use actual photos of faces. Instead, it uses a 3D model of your face. This gives FaceID a lot more details about the face and makes it much more secure. FaceID cannot be fooled by pictures and it’s very hard to fool it with masks (but not impossible).

Sensor-wise, Apple uses the TrueDepth camera to capture a 3D model of your face. The TrueDepth camera is based on IR technology. That’s why iPhones older than the X don’t support FaceID: they don’t have the TrueDepth camera.

iPhone TrueDepth camera.

Want to read more about this topic? Check out part two.

I want to thank Ivan Božić for helping me write this post, also I want to thank Domjan Barić and Natko Bišćan for helpful comments, I want to thank Antonija Golić and Luka Drezga for creating visualisations and animations for this post, and last but not least, I want to thank Elizabeta Peronja and Lea Metličić for editing the post :) Thank You!

[1] FaceNet embedding vectors have 128 numbers, which means they are 128-dimensional. We live in a 3-dimensional world, so we cannot plot a 128-dimensional vector. We are pretending that faces can be plotted in 2D for simplicity; the same logic applies to 2D and 128D, but unlike 128D, we can visualise 2D 😊.

[2] This is an oversimplified explanation which intends to give the reader a high-level intuition of the FaceNet learning process. If you are interested in digging deeper into this, read the original FaceNet paper.
