A deep learning solution to detect and mask human faces in videos

Guru Prasad Natarajan
Published May 2, 2018



The objective of this project is to develop a deep-learning-based solution that detects human faces in images and videos and masks them. The solution discussed in this article combines a set of open source tools into a pipeline that processes a given video and outputs another video in which the faces are masked.

Background on Face Detection

Face detection has been one of the hottest topics in computer vision for the past few years.

This technology has been available for some years now and is used all over the place, from cameras that make sure faces are in focus before you take a picture to Facebook, which tags people automatically once you upload a picture.

A computer program that decides whether an image is a positive image (face image) or negative image (non-face image) is called a classifier. A classifier is trained on hundreds of thousands of face and non-face images to learn how to classify a new image correctly.

OpenCV algorithms to detect faces

Initially, we took the novice approach of using OpenCV classifiers. OpenCV provides two types of pre-trained classifiers for face detection:

1) Haar based classifier and

2) LBP based classifier

Since color information is not required to decide whether an image contains a face, both classifiers process images in grayscale. The knowledge behind each classifier is stored in its own file. For example, a Haar cascade classifier is loaded from a file such as haarcascade_frontalface_alt.xml.
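As a rough illustration of the paragraph above, here is a minimal sketch of running the Haar classifier with OpenCV's Python bindings. It assumes the opencv-python package, which ships the cascade XML files and exposes their folder as cv2.data.haarcascades. The function names and the scaleFactor and minNeighbors values are my own choices for illustration, not settings from the project.

```python
def to_corners(boxes):
    """Convert (x, y, w, h) boxes to (x1, y1, x2, y2) corner form."""
    return [(x, y, x + w, y + h) for (x, y, w, h) in boxes]


def detect_faces_haar(image_path, cascade_file="haarcascade_frontalface_alt.xml"):
    """Return the faces found in one image file as (x, y, w, h) tuples."""
    import cv2  # imported lazily so to_corners() is usable without OpenCV

    # Assumption: the cascade file lives in the folder bundled with opencv-python.
    classifier = cv2.CascadeClassifier(cv2.data.haarcascades + cascade_file)
    if classifier.empty():
        raise IOError("could not load cascade file: " + cascade_file)

    image = cv2.imread(image_path)
    if image is None:
        raise IOError("could not read image: " + image_path)

    # The classifier works on grayscale, so drop the color information first.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Illustrative, untuned parameters.
    faces = classifier.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [tuple(int(v) for v in box) for box in faces]
```

Calling detect_faces_haar("photo.jpg") on a hypothetical image returns one tuple per detected face. The to_corners helper is only there because many detectors report boxes as two corner points rather than a corner plus width and height.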

We will not go through the internal mechanics of each algorithm here, as I intend to discuss that in a later post.

So, which of the two is better? Each showed certain advantages and disadvantages when I ran it on my set of images, which I have tabulated below:

Table 1. Differences between Haar and LBP

Although I had two algorithms in hand, testing on my own set of images yielded very poor results. The reason is that the classifiers are trained on frontal face images, so they fail badly when they encounter a face in side pose.

Now it is high time to turn our attention to a much more robust solution: deep learning. Why a deep learning solution? And why does it work so well?

Over the last few years, deep learning methods have been shown to outperform previous state-of-the-art machine learning techniques in several fields, with computer vision being one of the most prominent cases. Deep learning has also become a buzzword of the 21st century.

Deep learning allows computational models of multiple processing layers to learn and represent data with multiple levels of abstraction mimicking how the brain perceives and understands multimodal information, thus implicitly capturing intricate structures of large‐scale data. Deep learning is a rich family of methods, encompassing neural networks, hierarchical probabilistic models, and a variety of unsupervised and supervised feature learning algorithms. The recent surge of interest in deep learning methods is due to the fact that they have been shown to outperform previous state-of-the-art techniques in several tasks, as well as the abundance of complex data from different sources (e.g., visual, audio, medical, social, and sensor).


To solve this problem, I experimented with various pre-trained models such as Multi-task Cascaded Convolutional Networks (MTCNN), Caffe face, and Faster R-CNN. The advantage of a pre-trained model is that someone with a similar problem in hand does not have to go through the pain of training the network again, which is time-consuming and computationally intensive.

So, what is a pre-trained model anyway?

Simply put, a pre-trained model is a model created by someone else to solve a similar problem. Instead of building a model from scratch, you use the model trained on the other problem as a starting point.

For example, if you want to build a self-driving car, you can spend years building a decent image recognition algorithm from scratch, or you can take the Inception model (a pre-trained model) from Google, which was built on ImageNet data, to identify objects in images.

A pre-trained model may not be 100% accurate in your application, but it saves the huge effort required to reinvent the wheel.

With minor tweaks, I was able to get a better-performing model, based on Faster R-CNN, within weeks. It was able to detect faces with high accuracy because it had already been trained on millions of images.

The solution that we built is not a real-time solution. The model takes each video as and when it becomes available and outputs the equivalent masked video. Below is the step-by-step process:

1) Using ffmpeg, an open source tool, the given video will be converted to an image sequence and the audio will be extracted from the video.

2) Input all the images to the Faster R-CNN model and obtain the results. The results consist of the class probabilities, class scores and the coordinates of the bounding boxes. Using this information, we identify the faces in each image by drawing rectangles. The identified faces are then masked using OpenCV’s blurring technique.

3) Once all the images are processed, the images are stitched together to form the video. The new video will have all the faces masked.

4) Finally, merge the audio with the video and the process is done!
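The masking in step 2 can be sketched roughly as follows. This is not the project's actual code: it assumes the detector returns boxes as (x1, y1, x2, y2) pixel corners with one confidence score per box, and the function names, the 0.8 score threshold and the kernel size of 51 are hypothetical values chosen for illustration. Gaussian blur is used here as one of OpenCV's blurring options.

```python
def select_face_boxes(boxes, scores, width, height, min_score=0.8):
    """Keep confident detections and clamp them to the frame.

    boxes  -- iterable of (x1, y1, x2, y2) corner coordinates (assumed format)
    scores -- one confidence value per box, between 0 and 1
    """
    kept = []
    for (x1, y1, x2, y2), score in zip(boxes, scores):
        if score < min_score:
            continue
        # Truncate to whole pixels and clamp to the image bounds.
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(width, int(x2)), min(height, int(y2))
        if x2 > x1 and y2 > y1:  # drop boxes that collapse after clamping
            kept.append((x1, y1, x2, y2))
    return kept


def blur_faces(frame, face_boxes, kernel=51):
    """Blur each box of a BGR frame (a NumPy array) in place and return it."""
    import cv2  # imported lazily so select_face_boxes() works without OpenCV

    for x1, y1, x2, y2 in face_boxes:
        region = frame[y1:y2, x1:x2]
        # GaussianBlur requires an odd kernel size; larger values blur more.
        frame[y1:y2, x1:x2] = cv2.GaussianBlur(region, (kernel, kernel), 0)
    return frame
```

For each extracted frame, one would read it with cv2.imread, pass the model's boxes and scores through select_face_boxes together with the frame's width and height, call blur_faces, and write the result back with cv2.imwrite.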

It is important to identify the attributes of the source video, such as its frame rate, and prepare the destination video accordingly. Otherwise, the resulting video may end up longer than the source video, which causes complications when merging the audio.
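One way to keep the lengths in sync is to read the frame rate of the source with ffprobe and reuse it when the masked frames are stitched back together. The sketch below only builds the command lines for steps 1, 3 and 4. The file layout under "work", the function names and the codec choices are assumptions for illustration, and copying the audio into an .m4a file assumes the source audio is AAC.

```python
import subprocess
from fractions import Fraction


def probe_frame_rate_command(src):
    """ffprobe call that prints the video frame rate, e.g. '30000/1001'."""
    return ["ffprobe", "-v", "error", "-select_streams", "v:0",
            "-show_entries", "stream=r_frame_rate",
            "-of", "default=noprint_wrappers=1:nokey=1", src]


def parse_frame_rate(text):
    """Turn ffprobe's 'num/den' output into an exact Fraction."""
    return Fraction(text.strip())


def build_ffmpeg_commands(src, dst, frame_rate, work_dir="work"):
    """Return the ffmpeg argument lists for steps 1, 3 and 4.

    The frames/ and masked/ folders under work_dir must already exist.
    """
    frames = work_dir + "/frames/%06d.png"   # raw frames from step 1
    masked = work_dir + "/masked/%06d.png"   # frames written by step 2
    audio = work_dir + "/audio.m4a"
    silent = work_dir + "/masked_silent.mp4"
    return {
        # Step 1: split the video into images and pull out the audio track.
        "extract_frames": ["ffmpeg", "-i", src, frames],
        "extract_audio": ["ffmpeg", "-i", src, "-vn", "-acodec", "copy", audio],
        # Step 3: stitch the masked images at the source frame rate.
        "encode": ["ffmpeg", "-framerate", str(frame_rate), "-i", masked,
                   "-c:v", "libx264", "-pix_fmt", "yuv420p", silent],
        # Step 4: merge the audio back without re-encoding either stream.
        "mux": ["ffmpeg", "-i", silent, "-i", audio,
                "-c:v", "copy", "-c:a", "copy", "-shortest", dst],
    }


def run(command):
    """Run one command and return its standard output as text."""
    done = subprocess.run(command, check=True, capture_output=True, text=True)
    return done.stdout
```

A typical run would call run(probe_frame_rate_command("input.mp4")), pass the stripped output to build_ffmpeg_commands as frame_rate, run the two extract commands, mask the frames, and then run "encode" and "mux". Passing the rate through as a fraction avoids rounding rates such as 30000/1001 to 29.97.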

Next Steps…

The next step is to provide a transcription service that takes the audio and converts it to text. In addition, we will be adding speaker diarization, which will enable differentiating between speakers in an audio file.

The Mindboard Data Science Team explores cutting-edge technologies in innovative ways to provide original solutions, including the Masala.AI product line. Masala provides media content rating services such as vRate, a browser extension that detects and blocks mature content with custom sensitivity settings. The vRate browser extension is available for download via the Chrome Web Store. Check out www.masala.ai for more info.