Performance Showdown of Publicly Available Face Detection Models
To rapidly deploy a prototype of a system that uses a machine learning-based algorithm, we need to know which method best suits our needs. It is common knowledge that we must balance a model's accuracy against the processing power it demands, which in turn determines processing time. Often, the more accurate a method is, the more complex it is, requiring more computation and more time. It is therefore essential to find the most optimized and efficient method before deploying it to a real production environment. This applies equally to face detection if you want to build a product around it.
In this writing, we present a performance comparison of several public face detection models, which we have already discussed in our previous writings here and here. Each model is evaluated on the same set of performance metrics, so we will first explain how we measure those metrics.
Our focus here is how to measure the "performance" of a face detection model. The measured performance covers the two significant aspects mentioned at the beginning of this writing — accuracy and complexity.

In terms of accuracy, the field of object detection has two famous, widely used metrics (which we can obviously adapt to measure the performance of face detection too — a face is an object, after all): average IoU and mean average precision.
Intersection over Union (IoU) is a metric that measures how well the predicted bounding boxes match the ground-truth bounding boxes in the dataset, and thus how good the provided object detector is. The visualization of IoU is shown in the image below.
Given a prediction bounding box and a ground-truth bounding box, we can calculate IoU simply by dividing the area of their overlap by the area of their union. The visualization is shown below.

The resulting score is a float between 0 and 1, representing the ratio of the overlapping area; the closer the value is to 1, the better. The average IoU is obtained by computing the IoU of every matched prediction/ground-truth box pair and averaging the results.
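As a minimal sketch of the calculation above (the `(x1, y1, x2, y2)` corner format is an assumption; WIDER Face annotations actually use `(x, y, w, h)` and would need converting first):

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero so disjoint boxes yield an empty intersection.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give 1.0, disjoint boxes give 0.0, and everything else falls in between.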
Mean average precision (mAP) is the average of how precisely the object detector detects and classifies objects with respect to their classes. In the case of face detection, we measure how correctly the model predicts regions in the provided image and labels them as faces rather than background. The common calculation of mAP is shown below:

TP, or true positive, is the count of predicted areas that correctly contain faces. FP, or false positive, is the number of predicted areas where background or other objects were falsely labelled as faces.
In object detection, we first determine whether the detector has correctly detected an object: a prediction counts as correct if the IoU between the predicted bounding box and the ground-truth bounding box exceeds a pre-determined threshold. After that, the precision for the predicted class (for example, face) is calculated. The usual threshold values are 0.5, 0.75, and 0.95, commonly annotated as mAP@0.5 and so on.
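A sketch of that matching step, assuming a simple greedy match where each ground-truth box can be claimed by at most one prediction (real mAP implementations additionally sort by confidence and average precision over recall levels, which is omitted here):

```python
def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter = (max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])) *
             max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])))
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_at_iou(pred_boxes, gt_boxes, threshold=0.5):
    """Precision TP / (TP + FP) at a given IoU threshold."""
    matched = set()          # ground-truth boxes already claimed
    tp = 0
    for pred in pred_boxes:
        for i, gt in enumerate(gt_boxes):
            if i not in matched and iou(pred, gt) >= threshold:
                matched.add(i)
                tp += 1
                break
    fp = len(pred_boxes) - tp
    return tp / (tp + fp) if pred_boxes else 0.0
```

For example, one prediction that hits a ground-truth face plus one stray prediction on the background yields a precision of 0.5 at that threshold.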
In the previous section, we discussed how to measure the accuracy of the available models. The problem is that the best accuracy is sometimes obtained by increasing the complexity of the model. Higher complexity usually brings higher resource usage and longer processing time, which can prohibit use in real-time scenarios.

In this writing, to measure the complexity of a model, we track its resource usage in terms of CPU, GPU, and RAM. We also measure the inference time, which is the total time the model needs to make a prediction on a single image. To standardize the measurement, we time each face detection model predicting a single 1080p image.
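A timing harness along these lines could look as follows (a sketch: `detect_fn` stands in for whichever model's predict call you are benchmarking, and the warm-up/averaging choices are our assumptions, not part of the original measurement protocol):

```python
import time

def mean_inference_time(detect_fn, image, warmup=3, runs=10):
    """Average wall-clock time of detect_fn(image) over `runs` calls."""
    # Warm-up calls let lazy model initialization and caches settle,
    # so they do not inflate the measured time.
    for _ in range(warmup):
        detect_fn(image)
    start = time.perf_counter()
    for _ in range(runs):
        detect_fn(image)
    return (time.perf_counter() - start) / runs
```

`time.perf_counter()` is used rather than `time.time()` because it is a monotonic, high-resolution clock intended for benchmarking.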
The dataset we use to measure the performance of the face detection models is the famous WIDER Face dataset. It contains 32,203 images with 393,703 labelled faces. The dataset does not contain only normal face poses; it provides high variability in scale, pose, and occlusion, which makes it a good benchmark for publicly available face detection models.

However, we did not use all the faces in the dataset. We only use the train and validation splits, which contain ground-truth bounding box annotations. Furthermore, we filter the dataset based on its annotations to avoid 'invalid' images (see readme.txt on the GitHub for further details). Based on our assumption that faces smaller than 15x15 pixels yield no informative data, we excluded those as well, resulting in a total of 16,106 images and 98,871 labelled faces.
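The size filter described above can be sketched like this (the `(x, y, w, h)` tuple layout mirrors the WIDER Face annotation format, but the function name and list-of-tuples representation are our own illustration):

```python
MIN_FACE_SIZE = 15  # faces smaller than 15x15 px are assumed uninformative

def filter_small_faces(boxes, min_size=MIN_FACE_SIZE):
    """Keep only (x, y, w, h) boxes at least min_size wide and tall."""
    return [b for b in boxes if b[2] >= min_size and b[3] >= min_size]
```

Boxes at exactly 15x15 pixels are kept; anything narrower or shorter is dropped.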
Experiment and Result
We conducted an experiment on 5 publicly available face detection models, which you can try on your own in this GitHub. Additionally, we provide an example of how to use the face detection module in your own project. They are:
- OpenCV Haar Cascades Classifier
- DLib Histogram of Oriented Gradients (HOG)
- DLib Convolutional Neural Network (CNN)
- Multi-task Cascaded CNN (MTCNN) — Tensorflow
- Mobilenet-SSD Face Detector — Tensorflow
The device used to benchmark these models is a Dell Inspiron 15 7577 with the following hardware specification:
- CPU = Intel Core i7-7700HQ Quad Core Processor
- GPU = Nvidia GeForce GTX 1060
- RAM = 16GB
The aggregate performance results in terms of mAP and inference (prediction) time at each model's best hardware setup (CPU for models 1 and 2, GPU for models 3, 4, and 5) are shown below.

For the CPU-only implementation, the benchmark result graph is shown below.
As you can see from the graphs above, both with the best hardware configuration and with CPU only, MobileNet-SSD gives outstanding performance, not only in terms of accuracy (higher is better) but also in terms of inference time (lower is better). This shows that among these publicly available models, the MobileNet-SSD implementation may provide the best performance when you want to deploy a real-time system.
In terms of resource usage, RAM usage at each model's best hardware setup (CPU for models 1 and 2, GPU for models 3, 4, and 5) is shown below.

For the CPU-only implementation, the memory usage graph is shown below.
In the end, it's up to you which face detector to use in your own project. But, personally, I prefer the MobileNet-SSD implementation over the others, based on the evaluation metrics shown above.

You can always try these face detector implementations in this GitHub and find the one that best fits your problem. Feel free to ask any questions in the Issues section of the GitHub.
Thank you for reading~