How to Build A Real-Time Face Mask Detector 😷

Learn how to develop your own custom object detection program utilizing Tensorflow + Python and MobileNet SSD

Zaki Rangwala
Analytics Vidhya
14 min read · Dec 1, 2020


I have been really fascinated by the field of artificial intelligence lately, as many advancements are being made to teach machines all sorts of things. Therefore, I decided to build my own mask detector to determine when someone is wearing a mask. I did this by leveraging the TensorFlow Object Detection API and OpenCV to train an SSD (Single Shot MultiBox Detector) neural network on a dataset of 300+ images to come up with these detections.

The image on the left is of a person with a mask, and the image on the right is the result of my model detecting the mask once I input the image, with a confidence score of 94%.

In this article, I will explain everything from what object detection is and how the SSD algorithm works to how you can implement these fundamental principles to build your own custom object detection program, whether it be to distinguish a cat from a dog or, in my case, a face with a mask as opposed to a face without one. Keep in mind that you can make a model capable of detecting more than two classes, or use something like COCO-SSD, which can differentiate objects from 80 different classes.

So without further ado, let’s get right into it!

What is Object Detection? 👀

Object Detection, as you can tell from the name, is one of the most prominent research fields in computer vision today. This technology identifies and locates objects belonging to certain classes (dogs, cats, humans, buildings, cars) within an image or even a video stream. It is not to be confused with image classification, another branch of computer vision that predicts what an image contains as a whole. You could technically use image classification to make a mask detector with two classes: Mask and No Mask.

Object Detection Model labelling dog [left], cat [centre] and dog [right] using a combination of image classification and image localization (Source: Raneko from Flickr)

But what would you do when you have both a person wearing a mask and a person not wearing a mask in the same image? How can you tell them apart? You could use a multi-label classifier that recognizes both objects are present, but you still wouldn't know where each one is. This is where image localization comes in, as it identifies the location of each object in the image and returns a bounding box around it.

Object Detection = Multi-Label Image Classifier + Image Localization

You can make your own classes for your model to detect, like a watch or a bracelet or a mask, but you will need hundreds of labelled images to train your model. But don’t worry too much about it now as I will go over it in this article.

This technology has many real-world applications like video surveillance and image retrieval, with facial detection being one of the most common. It can also be used at traffic lights for pedestrian detection to better direct traffic, or even to help visually impaired people navigate. As you can see, the possibilities are endless, which makes it all the more exciting!

What is SSD-Mobilenet? And How does it work? 💭

The Single Shot MultiBox Detector (SSD) is a single convolutional neural network designed mainly for real-time object detection. It learns to predict bounding box locations around objects and classifies them in a single shot, as opposed to R-CNNs, which use a region proposal network to create bounding boxes that are later used to classify objects. As a result, this model can be trained end-to-end and consists of a MobileNet backbone followed by several additional convolutional layers.

SSD makes 2 predictions of separate classes from a single image by picking the class with the highest score for the bounded object, retaining a class of ‘0’ for the non-bounded objects (Source: Jonathan Hui)

This speeds up the process drastically, but the model takes a hit when it comes to accuracy. The SSD model therefore includes a few improvements, like multi-scale feature maps and default boxes, which allow it to match R-CNN's accuracy even while using lower-resolution images. As shown in the image below, it achieves real-time processing speeds of more than 50 FPS in the best-case scenario, far faster than R-CNN, and in some cases it even beats R-CNN's accuracy. Accuracy is measured as mAP (mean average precision), which represents the precision of the predictions.

Performance Comparison between SSD and other object detection models (Source: Cornell University)

Although this model is fast and accurate, the framework's biggest drawback is that its performance is directly proportional to object size, meaning it has more difficulty detecting smaller objects. This is because small objects may not carry enough information for the deeper layers of the neural network to make a detection. There are, however, techniques like data augmentation to crop, resize and otherwise vary the images. This is also why using a higher resolution for the input images provides better results, as there is more data in the images, more pixels to work with.

SSD Mobilenet Architecture (Source: Cornell University)

The majority of the SSD network is controlled by the backbone network, which here is MobileNet, an architecture built from a special class of convolutional neural networks that are lightweight in terms of the number of parameters and computational complexity. Additionally, width and resolution multipliers can be explicitly specified, controlling the number of input and output channels of the convolutional layers while also scaling the image resolution (height and width). This directly trades latency against accuracy, depending on what the user requires from the model.

A convolutional layer applies a small matrix (a kernel) across the image, performing a mathematical operation on groups of pixels to produce new values that are then passed as the input to the next layer, and so on until the end of the network is reached. The final layer turns the network's output into a numerical class prediction that corresponds to an object we are trying to predict. For example, if '1' is associated with a cat, a prediction of class '1' would be a cat, whereas '0' would be unknown.

An example of a data input going through a convolutional layer that returns an output after applying complex mathematical operations (Source: Analytics India)
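As a rough, standalone illustration of this idea (not part of the mask-detector script), here is a single Keras convolutional layer applied to a dummy 300x300 image, the input size SSD MobileNet typically works with; the filter count and sizes are arbitrary:

import tensorflow as tf

# One convolutional layer: 3x3 kernels slide over the image and produce
# a stack of feature maps that become the input to the next layer.
layer = tf.keras.layers.Conv2D(filters=16, kernel_size=3, activation="relu")

image = tf.random.uniform((1, 300, 300, 3))  # a dummy batch of one RGB image
features = layer(image)
print(features.shape)  # (1, 298, 298, 16): one feature map per filter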

The SSD comprises 6 prediction layers that together make 8732 predictions, and it keeps only the most confident of those to decide what each object finally is.

Integrating MobileNet into the SSD framework opens the door to limitless possibilities, since the network is not very resource-heavy and can run on low-end devices such as smartphones or laptops, and in real time at that, something R-CNNs struggle with.

Time to Start Building Our Mask Detector 🔨

Now that you know how everything works behind the scenes let’s get to actually implementing this. All of the code can be found in my GitHub repository, which you can find here.

To get this working, install Anaconda Python 3.7.4 for Windows, Mac or Linux. Then, for Windows, install Visual Studio C++ 2015, which you will need to compile TensorFlow. If you have a dedicated GPU in your system, install CUDA and then cuDNN.

Required Dependencies, [Left] Python, [Centre] Visual Studio C++, [Right] Tensorflow

We will be using the Tensorflow Object Detection API, a framework that utilizes a deep learning network to solve object detection problems. Use this guide as a reference if you have any issues, as it helped me a lot.

Next, create a folder in your directory, open it up in the code editor of your choice and create your virtual Python environment using the command: conda create -n name_of_your_choice pip python=3.7

Then activate the virtual environment using the command: conda activate name_of_your_choice (for example, conda activate tensorflow if that is what you named it)

Create your virtual environment [left] and activate the environment [right]

Then clone my mask detector repository using the command: git clone https://github.com/ZakiRangwala/Mask-Detector.git

After that, install all the required dependencies and modules by first entering the mask detector repository using: cd mask-detector/tensorflow and then pip install tensorflow==2.3.1 followed by pip install opencv-python.

Verify your installation using the command: python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

Your output should be something like this.

Installing Tensorflow Object Detection API 💻

We will install the Tensorflow Object Detection API from the TensorFlow model garden by cd-ing into the tensorflow-models directory and using the command git clone https://github.com/tensorflow/models to download all the models. You will now see a new folder inside tensorflow-models called models containing tons of different models.

We will now need to install Protobuf, Google's language-neutral compiler, which you can get from their releases page for Linux, Windows or Mac. Extract the contents of the archive into a directory of your choice, e.g. C:\Program Files\Google Protobuf, add that path to your environment variables, then cd into models/research and use the command: protoc object_detection/protos/*.proto --python_out=.

Before we install the dependencies needed for object detection, however, we need to install the COCO API using pycocotools. Run the commands: pip install cython and pip install git+https://github.com/philferriere/cocoapi.git#subdirectory=PythonAPI

Finally, cd into object_detection/packages/tf2 and copy the setup.py script into Tensorflow\models\research. Then run the command python -m pip install . to install the required dependencies.

Test your installation from within Tensorflow\models\research using the command: python object_detection/builders/model_builder_tf2_test.py

Your output should be something like this.

Install Label Image 🌆

Now install the LabelImg tool, which will help us annotate our images, by cd-ing into label-image and using the command git clone https://github.com/tzutalin/labelImg.git

There should now be a new folder inside the directory. cd into labelImg and run the commands conda install pyqt=5 and conda install -c anaconda lxml

After that, all that is left is compiling the Qt resources, which can be done using the command pyrcc5 -o resources.py resources.qrc

Lastly, navigate to the labelImg directory and move the resources.py and resources.qrc files into the libs folder like so

Move the resource files into the libs folder to finish setting up the program.

Get Your Dataset of Images 📷

With all the hard work out of the way, it's time to get to the fun stuff: actually collecting some pictures. You can either take them yourself if you are feeling enthusiastic or get a dataset from somewhere online. If you choose to get a dataset from the web, I suggest using this dataset from Kaggle.

If you choose to take your own pictures, make sure to do so in bright lighting, take them from different angles, and try to capture them with different people, keeping roughly the same number of pictures with a mask as without one. To get stronger confidence scores from your model, make sure to have a dataset of at least 300+ images.

My dataset of images consists of a mixture of images from the web and some captured by myself.

Once you have your dataset of images, navigate to the labelImg directory using the command cd labelImg and run the Graphical User Interface (GUI) using the command python labelImg.py

Your LabelImg GUI Interface should look something like this.

Then click the Open Dir button, navigate to the folder where your pictures are stored, and then click the Save Dir button so the annotations are saved in the same folder.

Make sure to keep your annotation format as PascalVOC, a data format for image classification and object detection that, unlike COCO, creates a new annotation file for every image.

Then begin labelling all the images by pressing 'W', drawing a bounding box around the mask (or the face without one) and labelling it accordingly; the two classes are Mask and NoMask. Press 'D' to move forward or 'A' to move backward. This can be tedious work and take a long time, but the model can only be as good as how the images are labelled.

Label Images with Classes ‘Mask’ or ‘NoMask’ depending on the image; make sure to be precise.

Once all the images are labelled, your directory should hold double the number of files it did before, as each image now has an annotation (.xml) file. Now move the images into the train and test folders, which can be found inside Tensorflow/workspace/images. Make sure to put pictures both with and without masks in each folder, and put a few more in the training folder; try something like a 60/40 split, as the model will be trained on one set and evaluated against the other. This is known as supervised learning, as the algorithm is given a labelled dataset to learn from.

My labelled dataset in my testing and training folders

Making our Python Script 📄

Now create a new file in Mask-Detector called detect.py, and let's begin by importing the libraries that we will be using.
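The original post embeds this code as a gist; a minimal sketch of the imports the rest of the script relies on (assuming the Object Detection API installed earlier) looks like this:

import os
import cv2
import numpy as np
import tensorflow as tf
from object_detection.utils import config_util
from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as viz_utils
from object_detection.builders import model_builder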

After that, let’s set up our environment paths, so they are easier to reference throughout the program.
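Again as a sketch, the path constants below simply mirror the folder layout described in this article; the variable names themselves are just illustrative:

WORKSPACE_PATH = 'Tensorflow/workspace'
SCRIPTS_PATH = 'Tensorflow/scripts'
ANNOTATION_PATH = WORKSPACE_PATH + '/annotations'
IMAGE_PATH = WORKSPACE_PATH + '/images'
MODEL_PATH = WORKSPACE_PATH + '/models'
PRETRAINED_MODEL_PATH = WORKSPACE_PATH + '/pre-trained-models'
CONFIG_PATH = MODEL_PATH + '/my_ssd_mobnet/pipeline.config'
CHECKPOINT_PATH = MODEL_PATH + '/my_ssd_mobnet/'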

Note: You can already find files in the Tensorflow/workspace/annotations folder as well as the Tensorflow/workspace/pre-trained-models folder, which contain my training data in case you don't feel like labelling your images and training your own model.

The next thing you want to do is create a function that will construct a label map containing the class name each image annotation is associated with.
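The gist for this function is not reproduced here; a minimal sketch of it, writing the two classes to label_map.pbtxt, could look like this:

def construct_label_map():
    # One entry per class; the ids start at 1 and the names must match
    # the labels used while annotating in LabelImg.
    labels = [{'name': 'Mask', 'id': 1}, {'name': 'NoMask', 'id': 2}]
    with open(ANNOTATION_PATH + '/label_map.pbtxt', 'w') as f:
        for label in labels:
            f.write('item {\n')
            f.write("    name: '{}'\n".format(label['name']))
            f.write('    id: {}\n'.format(label['id']))
            f.write('}\n')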

Run this function by calling construct_label_map(), and you will find a new file in Tensorflow/workspace/annotations called label_map.pbtxt containing your classes, which in our case would be "Mask" and "NoMask", assuming that is what you labelled your images with using the LabelImg tool.

Now you want to add a function that can merge all the PascalVOC .XML files into one CSV file for the training and testing data.
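Here is a rough sketch of what that conversion function could look like, built on the usual PascalVOC-to-CSV pattern; the exact column names are an assumption on my part:

import glob
import xml.etree.ElementTree as ET
import pandas as pd

def convert():
    # Walk the train and test folders, parse every PascalVOC .xml file
    # and collect one row per bounding box.
    for split in ('train', 'test'):
        rows = []
        for xml_file in glob.glob(IMAGE_PATH + '/' + split + '/*.xml'):
            root = ET.parse(xml_file).getroot()
            for obj in root.findall('object'):
                box = obj.find('bndbox')
                rows.append({
                    'filename': root.find('filename').text,
                    'width': int(root.find('size/width').text),
                    'height': int(root.find('size/height').text),
                    'class': obj.find('name').text,
                    'xmin': int(box.find('xmin').text),
                    'ymin': int(box.find('ymin').text),
                    'xmax': int(box.find('xmax').text),
                    'ymax': int(box.find('ymax').text),
                })
        pd.DataFrame(rows).to_csv(
            ANNOTATION_PATH + '/' + split + 'labels.csv', index=False)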

You can call this function using convert(), and you should find 2 new files under Tensorflow/workspace/annotations named testlabels.csv and trainlabels.csv

Now it’s time to create the TF record (Tensorflow Record) used to train the model.

To do so, use the commands:

# Train Record
python Tensorflow/scripts/generate_tfrecord.py -x Tensorflow/workspace/images/train -l Tensorflow/workspace/annotations/label_map.pbtxt -o Tensorflow/workspace/annotations/train.record
# Test Record
python Tensorflow/scripts/generate_tfrecord.py -x Tensorflow/workspace/images/test -l Tensorflow/workspace/annotations/label_map.pbtxt -o Tensorflow/workspace/annotations/test.record

We are almost finished; all we need to do now is download a pre-trained model from the TensorFlow Model Zoo, which you can find here. The repository should look something like this.

There are a variety of models that you can choose from. Make sure to keep speed and accuracy in mind.

You don't need to download a pre-trained model, as I have already included one in my repository. But if you do download one, make sure to copy its pipeline.config file into the Tensorflow/workspace/models directory, inside a new folder called my_ssd_mobnet, as that's the folder where our training will take place.

Now, before we train, let's modify our pipeline.config file like so

Make sure to keep "num_classes" at 2, since we are only training our dataset on mask vs. no mask; you can alter the "batch_size" variable depending on how much computing power you have; 32 is recommended.
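The gist showing these edits isn't reproduced here. You can simply open pipeline.config in a text editor, but as a sketch, the same fields can also be rewritten programmatically with the Object Detection API's protos; the pre-trained model folder name below is a placeholder for whichever model you downloaded:

from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

# Read the existing pipeline.config into a proto object.
pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
with tf.io.gfile.GFile(CONFIG_PATH, 'r') as f:
    text_format.Merge(f.read(), pipeline_config)

pipeline_config.model.ssd.num_classes = 2      # Mask and NoMask
pipeline_config.train_config.batch_size = 32   # lower this if you run out of memory
pipeline_config.train_config.fine_tune_checkpoint = (
    PRETRAINED_MODEL_PATH + '/your_downloaded_model/checkpoint/ckpt-0')  # placeholder folder name
pipeline_config.train_config.fine_tune_checkpoint_type = 'detection'
pipeline_config.train_input_reader.label_map_path = ANNOTATION_PATH + '/label_map.pbtxt'
pipeline_config.train_input_reader.tf_record_input_reader.input_path[:] = [
    ANNOTATION_PATH + '/train.record']
pipeline_config.eval_input_reader[0].label_map_path = ANNOTATION_PATH + '/label_map.pbtxt'
pipeline_config.eval_input_reader[0].tf_record_input_reader.input_path[:] = [
    ANNOTATION_PATH + '/test.record']

# Write the edited config back to disk.
with tf.io.gfile.GFile(CONFIG_PATH, 'wb') as f:
    f.write(text_format.MessageToString(pipeline_config))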

Now we are ready to train our model; all we have to do is run the following command:

# Train Model
python Tensorflow/tensorflow-models/models/research/object_detection/model_main_tf2.py --model_dir=Tensorflow/workspace/models/my_ssd_mobnet --pipeline_config_path=Tensorflow/workspace/models/my_ssd_mobnet/pipeline.config --num_train_steps=5000

Note: The number of training steps (num_train_steps) is currently set to 5000; it can be increased to improve accuracy, but training will take longer.

Your model typically takes 3 to 5 hours to train, so I tend to run it overnight to make the most of the time.

Your output should look something like that, and you should have new checkpoints inside the Tensorflow/workspace/models folder.

Make Real-Time Predictions 🟥

Once your model is done training, you are ready to start making detections. The first thing we have to do is load up our model like so:
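The loading snippet is embedded as a gist in the original post; a minimal sketch of it, using the standard TF2 Object Detection API calls and the paths defined earlier, looks like this:

# Build the model from the pipeline config and restore the latest checkpoint.
configs = config_util.get_configs_from_pipeline_file(CONFIG_PATH)
detection_model = model_builder.build(model_config=configs['model'], is_training=False)

ckpt = tf.compat.v2.train.Checkpoint(model=detection_model)
ckpt.restore(os.path.join(CHECKPOINT_PATH, 'ckpt-9')).expect_partial()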

Where it says "ckpt.restore(os.path.join(CHECKPOINT_PATH, 'ckpt-9')).expect_partial()", make sure to input your latest checkpoint in place of 'ckpt-9'; you can find the checkpoints in Tensorflow/workspace/models

Detection Function 🔍

Now is the time we have all been waiting for: putting all our hard work to the test. We need to create a function that takes in an image, runs it through the loaded model and predicts what's in the image, returning a class of 1 or 2, where 1 is "Mask" and 2 is "NoMask."
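A minimal sketch of the core detection function, following the standard TF2 Object Detection API preprocess / predict / postprocess pattern; it returns a dictionary of boxes, scores and classes from which the Mask/NoMask class can be read:

@tf.function
def detect_fn(image):
    # Resize and normalize the batched image, run the SSD forward pass,
    # then decode the raw outputs into boxes, classes and scores.
    image, shapes = detection_model.preprocess(image)
    prediction_dict = detection_model.predict(image, shapes)
    detections = detection_model.postprocess(prediction_dict, shapes)
    return detections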

Now let's check if our code works by making another function that converts an image into a tensor, a matrix of the image's pixel values. When the tensor is fed into the model, it is manipulated by the convolutional layers through mathematical operations, each layer's output becoming the next layer's input, until the final layer returns a class. If the class is 0, then the object is unknown.

SSD model architecture comprised of many convolutional layers (Source: Lilian Weng)

Once we run the detection function, we get results giving us the number of detections, the objects' locations, the confidence scores and the detection classes themselves. We can then draw labels and bounding boxes based on the coordinates returned from the function and use OpenCV to read the image and display the output with the detections made.

You can always edit the "min_score_thresh" parameter to control which detections are shown, depending on the confidence scores your model typically produces, which in turn depends on how accurate your model is.
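Here is a rough sketch of what such a check() function could look like, built on detect_fn and the API's visualization utilities; writing the annotated image into a results folder is my assumption about how the results end up there:

def check(image_path, min_score_thresh=0.5):
    category_index = label_map_util.create_category_index_from_labelmap(
        ANNOTATION_PATH + '/label_map.pbtxt')

    # Read the image and turn it into a batched float tensor.
    image_np = np.array(cv2.imread(image_path))
    input_tensor = tf.convert_to_tensor(np.expand_dims(image_np, 0), dtype=tf.float32)
    detections = detect_fn(input_tensor)

    # Strip the batch dimension and convert everything to numpy.
    num_detections = int(detections.pop('num_detections'))
    detections = {k: v[0, :num_detections].numpy() for k, v in detections.items()}
    detections['detection_classes'] = detections['detection_classes'].astype(np.int64)

    # Classes come back 0-indexed, while the label map ids start at 1.
    viz_utils.visualize_boxes_and_labels_on_image_array(
        image_np,
        detections['detection_boxes'],
        detections['detection_classes'] + 1,
        detections['detection_scores'],
        category_index,
        use_normalized_coordinates=True,
        max_boxes_to_draw=5,
        min_score_thresh=min_score_thresh,
        agnostic_mode=False)

    # Save the annotated image so it shows up in the results folder.
    cv2.imwrite(os.path.join('results', os.path.basename(image_path)), image_np)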

You can try checking whether the model works by running check(‘Tensorflow/workspace/images/check/test_case_one.jpg’) and seeing the results pop up in the results folder. I have already added a few examples. Feel free to add more!

Mask Detection Demo -> Inputting an image into the model.

As you can see, the model works flawlessly when given a single picture, but how does it perform when asked to make detections in real time? Let's find out!

Real-Time Prediction 🧔

When making a prediction, the function inputs the image into the model, which returns various elements like the class, confidence score and bounding boxes.

The process is relatively the same when detecting objects in real time, except that we set up a video stream and feed every single frame into the model, getting a detection and then labelling the video in real time.

To do this, we use OpenCV, an open-source computer vision library. SSD shines when it comes to real-time detection, as it is really lightweight and fast, providing speeds of up to 50 FPS, which is pretty amazing considering everything that is happening under the hood.
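A minimal sketch of that real-time loop, using OpenCV's webcam capture together with the detect_fn defined above (the camera index and window name are just placeholders):

category_index = label_map_util.create_category_index_from_labelmap(
    ANNOTATION_PATH + '/label_map.pbtxt')

cap = cv2.VideoCapture(0)  # 0 = default webcam

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Run every frame through the same detection pipeline as for still images.
    image_np = np.array(frame)
    input_tensor = tf.convert_to_tensor(np.expand_dims(image_np, 0), dtype=tf.float32)
    detections = detect_fn(input_tensor)

    num_detections = int(detections.pop('num_detections'))
    detections = {k: v[0, :num_detections].numpy() for k, v in detections.items()}
    detections['detection_classes'] = detections['detection_classes'].astype(np.int64)

    viz_utils.visualize_boxes_and_labels_on_image_array(
        image_np,
        detections['detection_boxes'],
        detections['detection_classes'] + 1,
        detections['detection_scores'],
        category_index,
        use_normalized_coordinates=True,
        max_boxes_to_draw=5,
        min_score_thresh=0.5,
        agnostic_mode=False)

    cv2.imshow('Mask Detector', image_np)
    if cv2.waitKey(1) & 0xFF == ord('q'):  # press 'q' to quit
        break

cap.release()
cv2.destroyAllWindows()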

And now, with real-time prediction working successfully, we are all done! You have successfully built your own custom object detection model and learned how everything works behind the scenes. If you would like the source code for all the methods and functions outlined in this article, you can find it in my GitHub repository, linked down below:

Limitations 😢

As you can tell, this application has some limitations, one of them being that model performance is directly proportional to object size and image quality, so make sure to have good high-definition images in your dataset and to label them precisely and accurately.

Furthermore, although the model boasts speeds of up to 50 FPS, the real-time prediction video stream may seem a bit slower, which may be because of the specs of the device you are running the model on. To improve performance, try feeding in smaller video frames and stay in a bright environment!

Next Steps 👣

This project exceeded my expectations of how it would turn out, and I find it useful and helpful in real-world scenarios. What would make it even more unique and useful would be to add new classes determining whether a person is properly protected, judging by which facial landmarks the mask covers. This would help ensure that people follow the laws and regulations set by authorities all over the world to keep everyone safe.

I could also add more images to my dataset to make my model even more accurate, and modify a few parameters to make real-time detection smoother and more efficient.

If you enjoyed this article and it helped you out, please leave a clap, and if you have any questions, you could always leave a comment down below or, even better, email me at zakirangwala@gmail.com.

If you would like to learn more about me and the work I do, visit my website at zakirangwala.com

