What’s the Best Face Detector?

Comparing Dlib, OpenCV DNN, Yunet, Pytorch-MTCNN, and RetinaFace

Amos Stailey-Young
Python’s Gurus
Jun 9, 2024 · 9 min read


For a facial recognition problem I’m working on, I needed to figure out which facial detection model to select. Face detection is the first part of the facial recognition pipeline, and it’s critical that the detector accurately identifies faces in the image. Garbage in, garbage out, after all.

However, the myriad options available left me feeling overwhelmed, and the scattered writings on the subject weren't detailed enough to help me decide on a model. Comparing the various models took a lot of work, so I figured relaying my research might help folks in similar situations.

What to Look for in a Face Detector?

The primary trade-off when selecting a facial detection model is that between accuracy and performance. But there are other factors to consider.

Most of the articles on face detection models are written either by the creators of the model — typically in journals — or by those implementing the model in code. In both cases, the writers, naturally, have a bias toward the model they are writing about. In some extreme cases, they are essentially promotional advertisements for the model in question.

There aren’t many articles that compare how the different models perform against each other. Adding further confusion, whenever someone is writing about a model such as RetinaFace, they are talking about a particular implementation of that model. The “model” itself is really the neural network architecture, and different implementations of the same network architecture can lead to different results. To make matters more complicated, the performance of these models also differs according to post-processing parameters, such as confidence thresholds, non-maximum suppression, etc.

Every writer casts their model as the “best”, but I quickly realized that “best” depends on context. There is no objective best model. There are two main criteria when deciding which face detection model is most appropriate for the given context: accuracy and speed.

No model combines high accuracy with high speed; it's a trade-off. We also have to look beyond raw accuracy (correct guesses / total sample size), on which most benchmarks are based. The ratio of false positives to true positives, and of false negatives to true negatives, is also an important consideration. In technical terms, the trade-off is between precision (minimizing false positives) and recall (minimizing false negatives). This article discusses the problem in more depth.

Testing the Models

There are a few existing face detection datasets used for benchmarking, such as WIDER FACE, but I always like to see how models perform on my own data. So I randomly sampled 1,064 frames from my collection of TV shows to test the models (±3% margin of error). When manually annotating each image, I tried to select as many faces as possible, including those partially or almost fully occluded, to give the models a real challenge. Because I'm eventually going to perform facial recognition on the detected faces, I wanted to test the limits of each model.

The images are available to download with their annotations. I’ve also shared a Google Colab notebook to interact with the data here.

It helps to group the various models into two camps: those that run on the GPU and those that run on the CPU. In general, if you have a CUDA-compatible GPU, you should use a GPU-based model. I have an NVIDIA GTX 1080 Ti with 11GB of memory, which allows me to use some of the larger-scale models. Nevertheless, the scale of my project is huge (I'm talking thousands of video files), so the lightning-fast CPU-based models intrigued me. There aren't many CPU-based face detection models, so I decided to test only the most popular one: YuNet. Because of its speed, YuNet forms my baseline comparison: a GPU model must be significantly more accurate than its CPU counterpart to justify its slower processing speed.

The CPU Model

YuNet

YuNet was developed with performance in mind, with a model size that is only a fraction of the larger models'. For instance, YuNet has only 75,856 parameters compared to RetinaFace's 27,293,600, which allows YuNet to run on "edge" computing devices that aren't powerful enough for the larger models.

Wu, W., Peng, H. & Yu, S. YuNet: A Tiny Millisecond-level Face Detector. Mach. Intell. Res. 20, 656–665 (2023). https://doi.org/10.1007/s11633-023-1423-y

Code to implement the YuNet model can be found in this repository. The easiest way to get YuNet up and running is through OpenCV.

import cv2

# Create the YuNet detector from the pre-trained ONNX weights.
detector = cv2.FaceDetectorYN_create(
    './face_detection_yunet_2023mar.onnx',
    "",                    # no separate config file needed for ONNX models
    (300, 300),            # input size the detector expects
    score_threshold=0.5,   # discard detections below this confidence
)

The pre-trained model is available in the OpenCV Zoo repository here. Just be sure to use Git LFS when cloning the repo (I made that mistake at first). I've also written a Google Colab notebook to demonstrate, available here.
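For reference, here's a minimal detection loop, continuing from the detector created above (a sketch assuming a local frame.jpg; since the detector was created with a 300x300 input size, the image is resized first and the boxes scaled back afterward):

img = cv2.imread('frame.jpg')
h, w = img.shape[:2]

# Resize to the detector's 300x300 input size, then detect.
_, faces = detector.detect(cv2.resize(img, (300, 300)))

# Each row holds a bounding box, five landmarks, and a confidence score.
if faces is not None:
    for x, y, bw, bh, *rest in faces:
        score = rest[-1]
        # Scale the box back to the original resolution.
        box = (int(x * w / 300), int(y * h / 300),
               int(bw * w / 300), int(bh * h / 300))
        print(box, score)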

YuNet performed a lot better than I expected for a CPU model. It’s able to detect large faces without a problem but does struggle a bit with smaller ones.

YuNet detects large faces even at oblique angles, though the bounding box is a bit off, possibly because the image must be resized to 300x300 before being fed into the model.
YuNet manages to find almost all of the faces but includes some false positives as well.

The accuracy improves greatly when limiting to the largest face in the image.

If performance is a primary concern, YuNet is a great option. It’s even fast enough for real-time applications, unlike the GPU options available (at least without some serious hardware).

YuNet uses a fixed input size of 300x300, so the time difference results from resizing the images to these dimensions.

The GPU Models

Dlib

Dlib is a C++ library with a Python wrapper that strikes a balance between accuracy, performance, and convenience. It can be installed directly through Python or accessed through the Face Recognition Python library. However, there is a very strong trade-off between Dlib's accuracy and performance, governed by its upsampling parameter: when the number of times to upsample is set to 0, the model is faster but less accurate.
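For illustration, here's a minimal sketch using Dlib's CNN face detector (it assumes the mmod_human_face_detector.dat weights have been downloaded from dlib.net):

import dlib

# Load the CNN-based detector (runs on the GPU if dlib was built with CUDA).
detector = dlib.cnn_face_detection_model_v1('mmod_human_face_detector.dat')
img = dlib.load_rgb_image('frame.jpg')

# The second argument is the number of times to upsample the image:
# 0 is fastest; 1 doubles the resolution to find smaller faces,
# at a steep cost in speed and GPU memory.
detections = detector(img, 1)
for d in detections:
    print(d.rect, d.confidence)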

No Upsampling

Upsampling = 1

The accuracy of the Dlib model increases with further upsampling, but anything higher than upsampling=1 crashed my script by exceeding my GPU's 11GB of memory.

Dlib’s accuracy was somewhat disappointing relative to its (lack of) speed. However, it was very good at minimizing false positives, which is a priority of mine. Face detection is the first part of my facial recognition pipeline, so minimizing the number of false positives will help reduce errors downstream. To cut false positives even further, we can use Dlib’s confidence output to filter out lower-confidence detections.

There is a large discrepancy in confidence between false and true positives, which we can use to filter out the former. Rather than choose an arbitrary threshold, we can look at the distribution of confidence scores to select a more precise one.

95% of the confidence values fall above 0.78, so excluding everything below that value reduces the number of false positives by half.
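As a sketch, the cutoff can be derived from the observed distribution rather than hard-coded (detections here is the hypothetical list from the Dlib example above):

import numpy as np

confidences = np.array([d.confidence for d in detections])

# Keep the top 95% of the confidence distribution; on my data the
# 5th percentile lands around 0.78.
threshold = np.percentile(confidences, 5)
filtered = [d for d in detections if d.confidence >= threshold]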

While filtering by confidence reduces the number of false positives, it does not increase overall accuracy. I would consider using Dlib when minimizing false positives is a primary concern, but otherwise it doesn’t offer a large enough gain in accuracy over YuNet to justify the much higher processing times, at least for my purposes.

OpenCV DNN

The primary draw of OpenCV’s face detection model is its speed. However, its accuracy left something to be desired. While it is incredibly fast compared to the other GPU models, even its Top 1 accuracy was hardly better than YuNet’s overall accuracy. It’s unclear to me when I would ever choose the OpenCV model for face detection, especially since it can be tricky to get working (you have to build OpenCV from source, which I’ve written about here).
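For context, here's a minimal sketch of that detector (the ResNet-10 SSD distributed with OpenCV's samples; the file names below are the commonly distributed ones, and the CUDA backend requires that source build):

import cv2
import numpy as np

net = cv2.dnn.readNetFromCaffe('deploy.prototxt',
                               'res10_300x300_ssd_iter_140000.caffemodel')
# Run on the GPU; this only works if OpenCV was built with CUDA support.
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

img = cv2.imread('frame.jpg')
h, w = img.shape[:2]
blob = cv2.dnn.blobFromImage(cv2.resize(img, (300, 300)), 1.0,
                             (300, 300), (104.0, 177.0, 123.0))
net.setInput(blob)
detections = net.forward()  # shape: (1, 1, N, 7)

for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    if confidence > 0.5:
        # Boxes are returned as fractions of the image size.
        box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
        print(box.astype(int), confidence)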

Pytorch-MTCNN

The MTCNN model also performed quite poorly. Although it was slightly more accurate than the OpenCV model, it was quite a bit slower. Since its accuracy was lower than YuNet’s, there was no compelling reason to select MTCNN.
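For completeness, here's a minimal sketch using the facenet-pytorch package, one common PyTorch implementation of MTCNN (assuming a local frame.jpg):

import torch
from PIL import Image
from facenet_pytorch import MTCNN

# keep_all=True returns every detected face, not just the largest.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
mtcnn = MTCNN(keep_all=True, device=device)

img = Image.open('frame.jpg')
boxes, probs = mtcnn.detect(img)  # (N, 4) boxes and confidence scores
if boxes is not None:
    for box, prob in zip(boxes, probs):
        print(box, prob)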

RetinaFace

RetinaFace has a reputation for being the most accurate open-source face detection model. The test results back up that reputation.
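As with the others, here's a minimal sketch, in this case using the pip-installable retina-face package (one of several implementations of the architecture, so results may differ from other implementations):

from retinaface import RetinaFace

# Returns a dict keyed 'face_1', 'face_2', ... with a bounding box,
# landmarks, and a confidence score per detection.
faces = RetinaFace.detect_faces('frame.jpg', threshold=0.9)
for key, face in faces.items():
    print(key, face['facial_area'], face['score'])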

Not only was it the most accurate model, but many of the “inaccuracies” were not, in fact, actual errors. RetinaFace really tested the category of “false positive” since it picked up faces I hadn’t seen, hadn’t bothered to annotate because I thought them too difficult, or hadn’t even considered a “face.”

It picked up a partial face in a mirror in this Seinfeld frame.

It managed to locate faces in picture frames in the background of this Modern Family shot.

And it’s so good at identifying “faces” that it finds non-human ones.

It was a nice surprise to learn that RetinaFace isn’t all that slow either. It wasn’t as fast as YuNet or OpenCV, but it was comparable to MTCNN: slower at lower resolutions, yet it scales relatively well and processes higher resolutions just as quickly. It also beat Dlib (at least when Dlib had to upsample). It is much slower than YuNet but significantly more accurate.

Many of the “false positives” RetinaFace identified can be excluded by filtering out smaller faces. If we drop the lowest quartile of faces, the false positive rate drops drastically.

The boundary for the lowest quartile is 0.0035
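A sketch of that size filter, assuming a hypothetical array of (x1, y1, x2, y2) boxes and face area measured as a fraction of the frame area:

import numpy as np

# Hypothetical (N, 4) array of boxes from a detector, plus frame dimensions.
boxes = np.array([[10, 20, 110, 140], [300, 50, 330, 90]])
frame_w, frame_h = 1280, 720

areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
relative = areas / (frame_w * frame_h)

# Drop the smallest quartile of faces (the cutoff is ~0.0035 on my data).
cutoff = np.percentile(relative, 25)
kept = boxes[relative >= cutoff]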

While RetinaFace is incredibly accurate, the errors do have a particular bias. Although RetinaFace identifies small faces with ease, it struggles with larger, partially occluded ones, which is evident if we look at face size relative to accuracy.

This could be problematic for my purposes since the size of a face in an image is strongly correlated to its importance. Therefore, RetinaFace may miss the most important cases, such as the example below.

RetinaFace failed to detect a face in this image, but YuNet did.

Conclusion

Based on my tests (which, I’d like to emphasize, are not the most rigorous in the world, so take them with a grain of salt), I would only consider using either YuNet or RetinaFace, depending on whether speed or accuracy is my primary concern. I might consider Dlib if I absolutely wanted to minimize false positives, but for my project, it’s down to YuNet or RetinaFace.

The GitHub repo used for this project is available here.


I work at the intersection between cultural history and data science, developing new analytical methods and strategies for use in the Digital Humanities.