Mask Detection Using Deep Learning

Harsh Sharma
Published in Analytics Vidhya · Oct 16, 2020

Please Wear a Mask!

Hello readers! Just like my previous article, this one is related to our current dire situation of COVID-19. As the title indicates, I will explain how you can build a mask detection system on a video feed using Deep Learning. Basically, you will be able to detect whether someone is wearing a mask or not, and you can use that further to generate a trigger.

This system can be used in a workplace to monitor whether employees are wearing masks. It can also be used in shopping malls, stations, etc., to make announcements from time to time pointing out people not following the mask rule.

Our final product of detecting masks in a frame will involve two major steps: one, detecting the faces in a frame, and two, classifying each detected face as wearing a mask or not.

To do the face detection we will use an architecture called RetinaFace, which is a state-of-the-art model for detecting faces in a picture, and to further classify each face as mask or no mask we will use a ResNet architecture. I believe that if you know what you are using, you will be more comfortable using it. So, I will explain these two architectures first and then discuss their implementation and provide you the code.

Here I'll explain the RetinaFace architecture, and in my next article I will explain the ResNet architecture and discuss in detail how to combine and implement these two models using Python.

RetinaFace

Earlier, face detection was done using two-stage detectors, which had one region proposal network whose proposed regions were then sent to another network to find the boxes around the faces. If you don't know about single-stage and two-stage detectors, you can go through my previous article where I have explained a bit about them.

Architecture

RetinaFace was one of the first single-stage detectors that performed really well at detecting small faces and highly occluded faces. In one of my articles, I have explained a state-of-the-art CNN architecture in object detection, RetinaNet. The RetinaFace architecture is similar to that one but with some changes specific to face detection. In RetinaFace too, we use an FPN (Feature Pyramid Network) on a ResNet-152 backbone. Again, if you don't know about FPN and receptive fields, you can go through my previous article here.

Taken from RetinaNet Paper

From here on I'll assume that you know about FPN. So, here they use the outputs of different layers of ResNet, which have different receptive fields and which make it possible to detect faces of different sizes. Instead of just using the outputs of these layers to locate and shift the boxes for faces, they include one more layer of computation on top of each output, which they call the Context Module.
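Before moving to the context module, here is a minimal sketch (not the official implementation) of how such a feature pyramid can be assembled in PyTorch: a standard torchvision ResNet supplies the multi-scale feature maps, and torchvision's FeaturePyramidNetwork fuses them. The paper uses ResNet-152; I use ResNet-50 here just to keep the example light.

```python
# Minimal sketch: multi-scale feature maps C3-C5 from a ResNet backbone,
# fused by an FPN, the way RetinaFace-style detectors build their pyramid.
from collections import OrderedDict

import torch
import torchvision
from torchvision.ops import FeaturePyramidNetwork

backbone = torchvision.models.resnet50()   # the paper uses ResNet-152

def extract_features(x):
    """Run the ResNet stem and return the outputs of layer2/3/4 (C3, C4, C5)."""
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
    c2 = backbone.layer1(x)
    c3 = backbone.layer2(c2)   # stride 8
    c4 = backbone.layer3(c3)   # stride 16
    c5 = backbone.layer4(c4)   # stride 32
    return OrderedDict(c3=c3, c4=c4, c5=c5)

# channel sizes below are those of a standard torchvision ResNet-50/152
fpn = FeaturePyramidNetwork(in_channels_list=[512, 1024, 2048], out_channels=256)

image = torch.randn(1, 3, 640, 640)
pyramid = fpn(extract_features(image))
for name, feat in pyramid.items():
    print(name, feat.shape)    # 80x80, 40x40 and 20x20 maps, 256 channels each
```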

Context Module

The concept of a context module was not introduced by them; it was already used by earlier researchers, who report better accuracy with this kind of addition in face detection architectures. In the context module they are essentially increasing the receptive field by using something called a Deformable Convolution Network (DCN). DCNs are similar to CNNs, but a DCN has a few extra offset parameters that free the kernel from looking at only a fixed-shape window, which is square in almost all cases (a 3 x 3 kernel can only look at a 3 x 3 square window at a time). To learn more about DCNs you can get an overview from this beautifully explained article.

Basically, as you can see in the image above, the context module makes it easier for the model to learn different orientations of faces while also increasing the receptive field for each output, and since we are adding computation along with residual connections, it increases the contextual information the module can incorporate.
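To make the idea concrete, here is a rough sketch of an SSH-style context module of the kind RetinaFace places on top of each FPN level. For simplicity it uses ordinary 3x3 convolutions; the paper swaps these for deformable convolutions (DCN), and the exact channel splits here are my own assumption.

```python
# Rough sketch of a context module: parallel branches with growing receptive
# fields (3x3, ~5x5, ~7x7) whose outputs are concatenated back to the original
# channel count. The paper uses deformable convolutions instead of plain ones.
import torch
import torch.nn as nn

class ContextModule(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        half, quarter = channels // 2, channels // 4
        self.branch3x3 = nn.Conv2d(channels, half, kernel_size=3, padding=1)
        self.branch5x5_1 = nn.Conv2d(channels, quarter, kernel_size=3, padding=1)
        self.branch5x5_2 = nn.Conv2d(quarter, quarter, kernel_size=3, padding=1)
        self.branch7x7 = nn.Conv2d(quarter, quarter, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        b1 = self.branch3x3(x)                       # ~3x3 receptive field
        b2 = self.branch5x5_2(self.relu(self.branch5x5_1(x)))   # two stacked 3x3 -> ~5x5
        b3 = self.branch7x7(self.relu(b2))           # three stacked 3x3 -> ~7x7
        # concatenating branches with different receptive fields adds context
        return self.relu(torch.cat([b1, b2, b3], dim=1))

feat = torch.randn(1, 256, 80, 80)            # one FPN level
print(ContextModule(256)(feat).shape)         # torch.Size([1, 256, 80, 80])
```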

One more important thing they use in the architecture is the mesh decoder. This part of the architecture is pretty complex to dive deep into, but I will give an overview of what it ultimately does.

Mesh Decoder

This part of the architecture is a bit unique: it works on the 3D structure of the face. Mathematically, a 3D face can be represented as V ∈ R^(n×6), where V is the set of vertices and each vertex is represented by 6 numbers: the (x, y, z) spatial coordinates and the (R, G, B) colour values. This representation of a face can be converted into a 2D image by using a 3D renderer, which is a rather complex function, so we do not need to dive into it (if you are enthusiastic, you can go through section 3.2 of the paper to understand the function). So, our architecture includes a part which generates a 128-dimensional vector from the predicted face box. This 128-dimensional vector is treated as the shape and texture information for that face and is passed to a network (the mesh decoder) which decodes it into the n×6 mesh, i.e. the 3D representation of the face. This predicted output is then sent to a renderer function which converts it into a 2D image, and the obtained image is compared with the original image using a pixel-wise difference.
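To illustrate just the last step, the pixel-wise comparison can be written as a tiny function. The mesh decoder and renderer are out of scope here, so the tensors below are stand-ins.

```python
# Tiny illustration of the pixel-wise comparison described above: the rendered
# 2D face is compared with the original face crop pixel by pixel.
import torch

def dense_pixel_loss(rendered, original):
    """Mean absolute per-pixel difference between the rendered face and the crop.
    Both tensors are assumed to be (H, W, 3) with values in the same range."""
    return (rendered - original).abs().mean()

rendered = torch.rand(56, 56, 3)   # placeholder for the renderer's output
original = torch.rand(56, 56, 3)   # placeholder for the ground-truth face crop
print(dense_pixel_loss(rendered, original))
```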

This part of the architecture and its loss function incorporate information about the 3D structure of a face into the model, which is very important since we want the model to locate faces in a given image.

Now, we have an understanding of the architecture. Next, we will look into the loss function which is arguably the second most important part of any neural network.

Loss Function

The loss function that they are using is a combination of 4 types of losses as shown in the image below.

Fig : 1
  1. Classification loss: the face classification loss L_cls(p_i, p_i*), where p_i is the predicted probability of anchor i being a face and p_i* is 1 for a positive anchor and 0 for a negative anchor. L_cls is the softmax loss over the two classes (face / not face).
  2. Box regression loss: given by L_box(t_i, t_i*), where t_i = {t_x, t_y, t_w, t_h}_i and t_i* = {t_x*, t_y*, t_w*, t_h*}_i represent the coordinates of the predicted box and of the ground-truth box associated with the positive anchor, respectively. It is a smooth-L1 loss (see the snippet after this list).
  3. Facial landmark regression loss: along with the shape and location of the boxes around faces, the model also predicts 5 landmarks of the face (left eye, right eye, nose, left mouth corner, right mouth corner). These predicted landmarks are compared with the annotated landmarks for each face using a smooth-L1 loss (L_pts).
  4. Dense regression loss: this is the loss from the mesh decoder discussed above, which incorporates the 3D information of the face into the model by taking the pixel-wise difference of the rendered output. The actual function to calculate this loss is:
Taken from paper
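Both regression terms above (boxes and landmarks) use smooth-L1, which PyTorch ships directly; a quick illustration:

```python
# Smooth-L1 behaves quadratically near zero and linearly for large errors,
# which makes box/landmark regression robust to outliers.
import torch
import torch.nn.functional as F

pred = torch.tensor([0.2, 1.5, -0.3, 0.8])     # e.g. predicted (tx, ty, tw, th)
target = torch.tensor([0.0, 1.0, 0.0, 1.0])    # ground-truth regression targets
print(F.smooth_l1_loss(pred, target))
```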

In Fig. 1, which showed the combined loss function, you can see some lambda (λ) parameters. These are called loss-balancing parameters, and they control how much each loss contributes to the total.
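Written out, the combined multi-task loss looks roughly like this; the λ values below are the ones reported in the RetinaFace paper:

```latex
% Combined multi-task loss of RetinaFace (as reported in the paper)
L = L_{cls}(p_i, p_i^{*})
    + \lambda_1 \, p_i^{*} \, L_{box}(t_i, t_i^{*})
    + \lambda_2 \, p_i^{*} \, L_{pts}(l_i, l_i^{*})
    + \lambda_3 \, p_i^{*} \, L_{pixel},
\qquad \lambda_1 = 0.25,\ \lambda_2 = 0.1,\ \lambda_3 = 0.01
```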

So, now we have all the pieces to put together. We have an architecture which spits out some predictions, which are then compared against the true labels for an image using the loss function explained above. This loss function is then optimized using a variant of SGD to train the whole network to finally predict the box around each face along with 5 landmarks on the face. You can see the result of RetinaFace in the image below.

Summary

To sum up all the steps: given an image, we pass it through a ResNet architecture, which acts as a feature extractor, and take FPN outputs from various layers of the network. On top of these outputs we add a context module, which increases the receptive field and contextual information by using DCNs. Each context module produces 5 landmark points, a class label and box regression parameters via small subnets on top of it. We have a large predefined set of anchor boxes tied to the context-module outputs for every level of the FPN. These anchor boxes are adjusted using the values given by the context module's regression output (a small sketch of this adjustment is below). These outputs are then compared with the annotated values (class, bounding-box coordinates and the locations of the 5 landmarks) of all the faces in an image, and the loss is calculated using the formula mentioned in the section above.
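As mentioned above, the regression outputs only nudge and rescale predefined anchors. Here is a minimal sketch of that decoding step in the standard form used by anchor-based detectors; RetinaFace follows the same general recipe.

```python
# Standard anchor decoding: the network predicts (tx, ty, tw, th) offsets that
# shift an anchor's centre and rescale its width/height.
import torch

def decode_boxes(anchors, deltas):
    """anchors: (N, 4) as (cx, cy, w, h); deltas: (N, 4) as (tx, ty, tw, th)."""
    cx = anchors[:, 0] + deltas[:, 0] * anchors[:, 2]   # shift centre by a fraction of anchor width
    cy = anchors[:, 1] + deltas[:, 1] * anchors[:, 3]   # shift centre by a fraction of anchor height
    w = anchors[:, 2] * torch.exp(deltas[:, 2])          # rescale width
    h = anchors[:, 3] * torch.exp(deltas[:, 3])          # rescale height
    return torch.stack([cx, cy, w, h], dim=1)

anchors = torch.tensor([[100.0, 100.0, 32.0, 32.0]])
deltas = torch.tensor([[0.1, -0.2, 0.05, 0.0]])
print(decode_boxes(anchors, deltas))   # centre nudged, width slightly enlarged
```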

One more loss which is calculated and added to the final loss is the dense regression loss, which helps include the shape and texture information of the face in the model. All these losses are combined using the loss-balancing parameters. The error is then backpropagated and the weights are trained.

Combining all the losses ensures that different kinds of information are fed into the model as it adjusts its parameters.

We now understand how RetinaFace works. So, to do the mask detection, we will use RetinaFace to extract all the faces and then use a ResNet architecture to classify each detected face into two classes, i.e. mask/no_mask. In my next article I will explain a bit about the ResNet architecture, which we use almost everywhere in computer vision tasks, and then explain how to implement all of this in code for an end-to-end system. A rough sketch of how the two pieces fit together is below.
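This is only a hedged sketch of the inference loop, not the final implementation: detect_faces is a placeholder for the RetinaFace detector (covered in the next article), the classifier is a plain torchvision ResNet-18 with its final layer replaced for 2 classes into which your trained weights would be loaded, and the class-index mapping is an assumption.

```python
# Sketch of the two-step pipeline on a video feed: detect faces in each frame,
# crop them, and classify each crop as mask / no_mask.
import cv2
import torch
import torchvision
from torchvision import transforms

classifier = torchvision.models.resnet18()
classifier.fc = torch.nn.Linear(classifier.fc.in_features, 2)   # mask / no_mask head
classifier.eval()                                               # load your trained weights here

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def detect_faces(frame):
    """Placeholder: should return a list of (x1, y1, x2, y2) face boxes from RetinaFace."""
    return []

cap = cv2.VideoCapture(0)                       # webcam feed
while True:
    ok, frame = cap.read()
    if not ok:
        break
    for (x1, y1, x2, y2) in detect_faces(frame):
        crop = cv2.cvtColor(frame[y1:y2, x1:x2], cv2.COLOR_BGR2RGB)
        with torch.no_grad():
            logits = classifier(preprocess(crop).unsqueeze(0))
        # class index 0 = mask is an assumption; it depends on how you trained the classifier
        label = "mask" if logits.argmax(1).item() == 0 else "no_mask"
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, label, (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imshow("mask detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```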

Till then, STAY SAFE! WEAR MASKS!

References

RetinaFace: Single-stage Dense Face Localisation in the Wild

Additional Links

  1. Implementation of Mask Detection using Deep Learning
  2. How to Track People Using Deep Learning
