Training alternative Dlib Shape Predictor models using Python

In particular, we’re going to see how to train alternative models (to the ones provided by Dlib) for detecting facial landmarks.

Example of the 68 facial landmarks detected by the Dlib pre-trained shape predictor

Dlib is a pretty famous and awesome machine learning library written in C++. It implements a wide range of algorithms that can be used on both desktop and mobile platforms.


Face Landmark Localization

The process of extracting a set of key points from a given face image is called Face Landmark Localization (or Face Alignment).

The landmarks (key points) that we are interested in are the ones that describe the shape of face attributes such as the eyes, eyebrows, nose, mouth, and chin. These points give great insight into the structure of the analyzed face, which can be very useful for a wide range of applications, including face recognition, face animation, emotion recognition, blink detection, and photography.

There are many methods able to detect these points: some achieve superior accuracy and robustness by analysing a 3D face model extracted from a 2D image, others rely on the power of CNNs (Convolutional Neural Networks) or RNNs (Recurrent Neural Networks), and others utilize simple (but fast) features to estimate the location of the points.

The Face Landmark Detection algorithm offered by Dlib is an implementation of the Ensemble of Regression Trees (ERT) presented in 2014 by Kazemi and Sullivan. This technique utilizes simple and fast features (pixel intensity differences) to directly estimate the landmark positions. The estimated positions are then refined through an iterative process performed by a cascade of regressors: each regressor produces a new estimate from the previous one, trying to reduce the alignment error of the estimated points at each iteration. The algorithm is blazing fast; it takes about 1–3 ms (on a desktop platform) to detect (align) a set of 68 landmarks on a given face.
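As a quick reference, this is roughly how the pre-trained predictor can be invoked from Python (the model file name and the input image path are assumptions, not files shipped with this article):

```python
import dlib

# Pre-trained HOG face detector and the 68-point ERT shape predictor.
# The .dat model file must be downloaded separately from the Dlib model zoo.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = dlib.load_rgb_image("face.jpg")    # any image containing a face

for face_rect in detector(img, 1):       # 1 = upsample the image once
    shape = predictor(img, face_rect)    # align the 68 landmarks inside the box
    points = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
    print("detected", len(points), "landmarks")
```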

Dlib pre-trained Models

The author of the Dlib library (Davis King) has trained two shape predictor models (available here) on the iBug 300-W dataset, which respectively localize 68 and 5 landmark points within a face image.

The set of 68-points detected by the pre-trained Dlib shape_predictor_68

In this article we will consider only the shape_predictor_68 model (which we will call SP68 for simplicity).

Basically, a shape predictor can be generated from a set of images, annotations, and training options. A single annotation consists of the face region and the labelled points that we want to localize. The face region can be easily obtained by any face detection algorithm (like the OpenCV HaarCascade, the Dlib HOG detector, CNN detectors, …), while the points have to be manually labelled or detected by already-available landmark detectors and models (e.g. ERT with SP68). Lastly, the training options are a set of parameters that define the characteristics of the trained model. These parameters can be fine-tuned in order to get the desired behaviour of the generated model, more or less :)
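For reference, the iBug annotation files follow Dlib's imglab XML format: each image entry contains a face bounding box and one labelled part per landmark. The snippet below is only illustrative (file name and coordinate values are made up):

```xml
<dataset>
  <images>
    <image file='helen/trainset/example.jpg'>
      <box top='80' left='120' width='260' height='260'>
        <part name='00' x='131' y='160'/>
        <part name='01' x='134' y='190'/>
        <!-- ... one <part> entry per landmark, up to name='67' -->
      </box>
    </image>
  </images>
</dataset>
```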

The official example of the Dlib training process can be found here (C++) and here (Python).


1. Understanding the Training Options

The training process of a model is governed by a set of parameters. These parameters affect the size, accuracy and speed of the generated model.

Default shape predictor training options

The most important parameters are listed below (a Python configuration sketch follows the list):

  • Tree Depth — Specifies the depth of the trees used in each cascade. This parameter represents the “capacity” of the model. An optimal value (in terms of accuracy) is 4, while a value of 3 is a good tradeoff between accuracy and model size.
  • Nu — The regularization parameter. It determines the ability of the model to generalize and learn patterns instead of memorizing the training data. Values close to 1 emphasize memorization of the training data over patterns, raising the chances of over-fitting. A value of 0.1 instead pushes the model to learn patterns rather than specific situations, greatly reducing the risk of over-fitting. The number of training samples can be an issue here: with lower nu values the model needs a lot (thousands) of training samples in order to perform well.
  • Cascade Depth — The number of cascades used to train the model. This parameter affects both the size and the accuracy of the model. A good value is about 10–12, while a value of 15 strikes a balance between maximum accuracy and a reasonable model size.
  • Feature Pool Size — Denotes the number of pixels used to generate the features for the random trees at each cascade. A larger number of pixels makes the algorithm more robust and accurate, but slower to execute. A value of 400 achieves great accuracy with good runtime speed; if speed is not a problem, setting the parameter to 800 (or even 1000) leads to superior precision. Interestingly, with a value between 100 and 150 it is still possible to obtain quite good accuracy with an impressive runtime speed, which makes this range particularly suitable for mobile and embedded applications.
  • Num Test Splits — The number of split features sampled at each node. This parameter drives the selection of the best features at each cascade during the training process, and it affects both the training speed and the model accuracy. The default value is 20. This parameter can be very useful when we want to train an accurate model while keeping its size small: increasing the number of test splits to 100 or even 300 improves the model accuracy without increasing its size (at the cost of a longer training time).
  • Oversampling Amount — Specifies the number of random deformations applied to the training samples. Applying random deformations to the training images is a simple technique that effectively increases the size of the training dataset. Raising the value to 20 or even 40 is only required for small datasets, and it will increase the training time considerably (so be careful). In the latest releases of the Dlib library there is an additional training parameter, the oversampling translation jitter, which applies translation deformations to the given bounding boxes in order to make the model more robust against possibly misplaced face regions.
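As a sketch, the options discussed above can be configured from Python as follows. The specific values are illustrative choices reflecting the discussion, not the ones used for the official SP68:

```python
import dlib

# A possible configuration for a small, reasonably accurate model;
# adjust the values to the constraints of your target platform.
options = dlib.shape_predictor_training_options()
options.tree_depth = 3               # model "capacity" vs. model size
options.nu = 0.1                     # regularization: favor patterns over memorization
options.cascade_depth = 12           # number of cascades of regressors
options.feature_pool_size = 400      # pixels sampled to build features at each cascade
options.num_test_splits = 100        # more splits -> better accuracy, longer training
options.oversampling_amount = 20     # random deformations per training sample
options.oversampling_translation_jitter = 0.1  # jitter the face boxes (newer Dlib releases)
options.be_verbose = True            # print progress during training
options.num_threads = 4              # use more CPU cores if available
```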

2. Getting the Data

In order to replicate the Dlib results, we have to use the images and annotations inside the iBug 300-W dataset (available here; the file size is about 1.7 GB). The dataset is a combination of four major datasets: afw, helen, ibug, and lfpw.

Files and folders inside the dataset

The annotation files that we need to train and test the models are labels_ibug_300W_train.xml and labels_ibug_300W_test.xml.

Before getting into the code, remember to put the scripts in the same directory as the dataset!

3. Training the Models

Imagine we are interested in training a model that localizes only the landmarks of the left and right eye. To do this, we have to edit the iBug training annotations so that they contain only the relevant points:

Editing an xml file by selecting only the desired parts (landmarks)

This can be done by calling the slice_xml() function, which creates a new XML file containing only the selected landmarks.
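slice_xml() is a helper from the accompanying scripts, not part of Dlib; a minimal sketch of how such a function could be implemented (its name, signature, and the output file name are assumptions) is:

```python
import xml.etree.ElementTree as ET

def slice_xml(in_path, out_path, parts):
    """Write a copy of the annotation file that keeps only the landmarks
    whose indices are in `parts` (e.g. 36-47 for the eyes in the 68-point scheme)."""
    tree = ET.parse(in_path)
    keep = {"%02d" % i for i in parts}            # part names are zero-padded strings
    for box in tree.getroot().iter("box"):
        for part in list(box.findall("part")):
            if part.get("name") not in keep:
                box.remove(part)                  # drop the landmarks we don't want
    tree.write(out_path)

# Example: keep only the 12 eye landmarks of the iBug 68-point annotation
slice_xml("labels_ibug_300W_train.xml", "train_eyes.xml", parts=range(36, 48))
```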

After setting the training parameters, we can finally train our eye-model.
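Assuming the file names used in the previous sketches, the training step boils down to a single Dlib call:

```python
import dlib

# "train_eyes.xml" is the sliced training annotation file and `options` holds
# the training parameters configured earlier; both names are assumptions.
dlib.train_shape_predictor("train_eyes.xml", "eye_predictor.dat", options)
print("model saved to eye_predictor.dat")
```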

When the training is done, we can easily evaluate the model accuracy by invoking the measure_model_error() function.
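measure_model_error() is also a small helper; a possible sketch built on top of dlib.test_shape_predictor (the file names are assumptions, and the test annotations must be sliced in the same way as the training ones):

```python
import dlib

def measure_model_error(model_path, xml_annotations_path):
    """Print and return the mean distance between the predicted landmarks
    and the annotated ones over the given dataset."""
    error = dlib.test_shape_predictor(xml_annotations_path, model_path)
    print("mean error of {}: {:.3f}".format(model_path, error))
    return error

measure_model_error("eye_predictor.dat", "test_eyes.xml")
```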


Conclusions

By properly fine-tuning the training options, it is possible to customize the training process so that it satisfies the constraints of the system we are developing.

Such constraints can concern execution speed, memory and storage consumption, and overall accuracy and robustness, and they depend on the platform we are developing for (desktop, mobile, or embedded): in general, different platforms have different sets of quality requirements.

Moreover, by selecting only the relevant landmarks it is possible to create specialized models that localize a particular subset of points, eliminating the unnecessary ones.

Example of alternative shape predictors. The model in the first quadrant detects the eyes and eyebrows. The model on its left localizes the entire set of landmarks with a faster execution speed and a reduced size (compared to SP68). The last two models detect, respectively, the face contour, and the nose and mouth.

In conclusion, I have trained some models (shown in the picture above) that localize specific sets of points. The models can be retrieved from this repo. In addition, I’ve built an Android app (available here) that shows the capabilities of these models during continuous detection.

If you’re wondering how to embed Dlib in your Android application, you can read this Medium post.

I hope you find the article useful and interesting. Consider clapping 👏 if you like it!