Multi-task Cascaded Convolutional Networks (MTCNN) for Face Detection and Facial Landmark Alignment

Rosa Gradilla
Jul 27, 2020


This is the first in a series of posts in which I will walk you through how I developed a face detection application with PyTorch and OpenCV. The series is outlined as follows:

  1. Developing the face detector
  2. Incorporating a classifier to blur out a specific face

In this first post I will go over how MTCNN works, based on the paper “Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks” by Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. In the following posts I will go over how I developed the face detection app and, finally, how I trained a classifier to recognize my face and blur it out in a video feed.

Multi-task Cascaded Convolutional Networks (MTCNN) is a framework developed as a solution for both face detection and face alignment. It consists of three stages of convolutional networks that detect faces and locate landmarks such as the eyes, nose, and mouth.

The paper proposes MTCNN as a way to integrate both tasks (detection and alignment) using multi-task learning. The first stage uses a shallow CNN to quickly produce candidate windows. The second stage refines those candidates through a more complex CNN. Finally, the third stage uses an even more complex CNN to further refine the result and output the facial landmark positions.

The Three Stages of MTCNN:

The first step is to resize the image to a range of different scales in order to build an image pyramid, which is the input to the three-stage cascaded network.

Input image is resized to different scales to build an image pyramid
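The pyramid is built by repeatedly shrinking the image by a constant factor. A minimal sketch of how the scales can be computed, using values common in public MTCNN implementations (a 12×12 P-Net receptive field, a minimum face size of 20 px, and a scale factor of 0.709) rather than anything stated in this post:

```python
def pyramid_scales(height, width, min_face_size=20, min_det_size=12, factor=0.709):
    """Return the scales of the image pyramid: the first scale maps the
    smallest face we want to find (min_face_size) onto the 12x12 window
    P-Net sees; each further scale shrinks the image by `factor` until
    the shorter side drops below the detection window size."""
    scale = min_det_size / min_face_size
    min_side = min(height, width) * scale
    scales = []
    while min_side >= min_det_size:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales
```

Each scaled copy of the image is then passed through the first-stage network, so one forward pass per scale covers faces of a different size.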

Stage 1: The Proposal Network (P-Net)

This first stage is a fully convolutional network (FCN). The difference between a CNN and an FCN is that a fully convolutional network does not use a dense layer as part of the architecture. This Proposal Network is used to obtain candidate windows and their bounding box regression vectors.

Bounding box regression is a popular technique to predict the localization of boxes when the goal is detecting an object of some pre-defined class, in this case faces. After obtaining the bounding box vectors, non-maximum suppression (NMS) is applied to merge highly overlapping candidates. The final output of this stage is the set of calibrated candidate windows, with the volume of candidates already reduced.
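The merging of overlapping candidates can be sketched with a standard greedy NMS routine (a generic NumPy sketch, not the paper's exact code):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and drop every remaining box that overlaps it too much (by IoU)."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # indices, best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # intersection of the kept box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]
    return keep
```

MTCNN runs a pass like this after each stage, which is why the number of candidates shrinks as windows move through the cascade.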

P-Net (from MTCNN paper)
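As a rough PyTorch sketch of this stage, with layer sizes taken from the paper's architecture diagram (details such as ceil-mode pooling and PReLU activations follow common public implementations and are assumptions here):

```python
import torch
import torch.nn as nn

class PNet(nn.Module):
    """Sketch of the Proposal Network: fully convolutional, so it accepts
    any input size and emits a dense map of face scores and box offsets,
    one per 12x12 window of the input."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 10, 3), nn.PReLU(),
            nn.MaxPool2d(2, 2, ceil_mode=True),
            nn.Conv2d(10, 16, 3), nn.PReLU(),
            nn.Conv2d(16, 32, 3), nn.PReLU(),
        )
        self.cls = nn.Conv2d(32, 2, 1)   # face / non-face score map
        self.box = nn.Conv2d(32, 4, 1)   # bounding box regression offsets

    def forward(self, x):
        h = self.features(x)
        return self.cls(h), self.box(h)
```

Because there is no dense layer, running the same network on a larger image simply yields a larger output map, which is what lets P-Net scan every scale of the pyramid in one pass per scale.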

Stage 2: The Refine Network (R-Net)

All candidates from the P-Net are fed into the Refine Network. Notice that this network is a CNN, not an FCN like the one before, since there is a dense layer at the last stage of the network architecture. The R-Net further reduces the number of candidates, performs calibration with bounding box regression, and employs non-maximum suppression (NMS) to merge overlapping candidates.

The R-Net outputs whether the input is a face or not, a 4-element vector which is the bounding box for the face, and a 10-element vector for facial landmark localization.

R-Net (from MTCNN paper)
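A PyTorch sketch of R-Net along the same lines, again with layer sizes from the paper's diagram and pooling/activation details borrowed from common public implementations. The three output heads match the three outputs listed above:

```python
import torch
import torch.nn as nn

class RNet(nn.Module):
    """Sketch of the Refine Network: a CNN ending in a dense layer, producing
    a face score, 4 box-offset values, and 10 landmark values per candidate."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 28, 3), nn.PReLU(),
            nn.MaxPool2d(3, 2, ceil_mode=True),
            nn.Conv2d(28, 48, 3), nn.PReLU(),
            nn.MaxPool2d(3, 2, ceil_mode=True),
            nn.Conv2d(48, 64, 2), nn.PReLU(),
            nn.Flatten(),
            nn.Linear(64 * 3 * 3, 128), nn.PReLU(),  # the dense layer
        )
        self.cls = nn.Linear(128, 2)        # face / non-face
        self.box = nn.Linear(128, 4)        # bounding box offsets
        self.landmark = nn.Linear(128, 10)  # 5 (x, y) landmark points

    def forward(self, x):                   # x: (N, 3, 24, 24) candidate crops
        h = self.features(x)
        return self.cls(h), self.box(h), self.landmark(h)
```

Unlike P-Net, this network expects a fixed 24×24 input, so every surviving candidate window is cropped out of the image and resized before being fed in.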

Stage 3: The Output Network (O-Net)

This stage is similar to the R-Net, but this Output Network aims to describe the face in more detail and output the positions of the five facial landmarks: the eyes, the nose, and the mouth corners.

O-Net (from MTCNN paper)
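O-Net has the same three heads as R-Net but a deeper feature extractor and a larger 48×48 input. A sketch in the same style (layer sizes from the paper's diagram; pooling/activation details are assumptions shared with common implementations):

```python
import torch
import torch.nn as nn

class ONet(nn.Module):
    """Sketch of the Output Network: the deepest of the three, refining the
    final face score, box, and the five landmark positions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3), nn.PReLU(),
            nn.MaxPool2d(3, 2, ceil_mode=True),
            nn.Conv2d(32, 64, 3), nn.PReLU(),
            nn.MaxPool2d(3, 2, ceil_mode=True),
            nn.Conv2d(64, 64, 3), nn.PReLU(),
            nn.MaxPool2d(2, 2, ceil_mode=True),
            nn.Conv2d(64, 128, 2), nn.PReLU(),
            nn.Flatten(),
            nn.Linear(128 * 3 * 3, 256), nn.PReLU(),
        )
        self.cls = nn.Linear(256, 2)        # face / non-face
        self.box = nn.Linear(256, 4)        # bounding box offsets
        self.landmark = nn.Linear(256, 10)  # 5 (x, y) landmark points

    def forward(self, x):                   # x: (N, 3, 48, 48) candidate crops
        h = self.features(x)
        return self.cls(h), self.box(h), self.landmark(h)
```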

The Three Tasks of MTCNN

The network is trained to output three things: face/non-face classification, bounding box regression, and facial landmark localization.

  1. Face classification: this is a binary classification problem trained with cross-entropy loss. For a candidate with predicted face probability p and ground-truth label y ∈ {0, 1}:

L_det = −( y·log(p) + (1 − y)·log(1 − p) )

  2. Bounding box regression: the learning objective is a regression problem. For each candidate window, the offset between the candidate and the nearest ground-truth box is calculated, and a Euclidean (squared L2) loss is employed:

L_box = ‖ŷ_box − y_box‖²₂

where ŷ_box is the predicted box offset and y_box the ground truth.

  3. Facial landmark localization: the localization of facial landmarks is also formulated as a regression problem, with the same Euclidean loss:

L_landmark = ‖ŷ_landmark − y_landmark‖²₂

There are five landmarks: left eye, right eye, nose, left mouth corner and right mouth corner.
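The three losses above can be sketched in a few lines of NumPy (the function names are mine; the landmark loss has the same form as the box loss, just over 10 values instead of 4):

```python
import numpy as np

def cross_entropy_loss(p, y):
    """Face classification: y is 1 for a face, 0 for a non-face;
    p is the predicted probability that the sample is a face."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def euclidean_loss(pred, target):
    """Box / landmark regression: squared L2 distance between the
    predicted vector and the ground-truth vector."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(np.sum((pred - target) ** 2))
```

During training the three losses are combined in a weighted sum, with per-sample indicators so that, for example, background samples contribute only to the classification term.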

To build a FaceDetector using MTCNN and OpenCV, check out my following story.


Rosa Gradilla

Data Science graduate student interested in deep learning and computer vision