Smile detection for image moderation

Assessing the attractiveness of images by capitalizing on Smiles

Nadir Trapsida
Decathlon Digital
8 min read · Dec 15, 2020

Decathlon’s main goal is to make sport accessible to everybody. The Decathlon Community platform helps with that goal by enabling coaches to offer many activities near you. However, getting out of bed to practice sport is really hard, especially in this period. If coaches introduce their activities with inspiring images, and the quality of those images is assessed automatically, that could certainly help motivate you to get up and burn some calories.

To tackle the problem, we started with a quick brainstorm on what images that stimulate participation in an activity have in common. And guess what, it's the smiles 😄. We concluded that images that contain smiles are much more pleasant and attractive than images without them. We therefore decided to create a smile detection model for moderation and image quality purposes, to help coaches select an image to advertise their activities. This model is available for free as part of the Sport Vision API.

What is under the hood?

Smile detection is a really useful topic for sentiment analysis, emotion recognition, and similar tasks. Many solutions can be used to solve this kind of problem; a binary classifier or a custom object detection model are the most common. Our smile detector is much simpler than that: it is based on popular pre-trained models included in the OpenCV and Dlib libraries, along with some simple geometric relations.

To find a smile in a picture, we first detect the faces. We then locate some points (landmarks) on each face. Finally, we compute a smile score using geometric relations between the points detected on the face. The rest of this article explains how we achieve each of these steps.

Face Detection

As mentioned earlier, face detection was the first challenge we encountered. For this task, we had to find a face detection pre-trained model. The most common ones are:

  • OpenCV Deep Neural Networks, commonly called Dnn
  • OpenCV Haar Cascade Classifier
  • Multi-task Cascaded Convolutional Neural Network, also called MTCNN
  • Dlib HOG frontal face detector

Based on several comparisons on blog posts such as this one, we decided to go with OpenCV’s Dnn pre-trained model, which seems to be very good and easy to use. After some tests the results were very promising:

Single face detection using Dnn model

Then we also tested it for multiple faces and it was even better than we expected:

Multiple faces detection using Dnn model

We decided to do one last test on pictures taken close to the face and we finally found the downside of this model: nothing was detected.

Large faces detection using Dnn model
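For readers who want to try the Dnn step, here is a minimal sketch of how it could look. The article does not say which model files were used, so the Caffe prototxt/caffemodel paths, the 300×300 input size, and the mean values (the ones commonly used with OpenCV's res10 SSD face detector) are assumptions, and the function names are hypothetical.

```python
def detections_to_boxes(detections, width, height, min_confidence=0.5):
    """Convert SSD-style rows into pixel boxes (x1, y1, x2, y2).

    Each row is [image_id, class_id, confidence, x1, y1, x2, y2], with the
    corner coordinates normalized to the 0..1 range.
    """
    boxes = []
    for row in detections:
        if float(row[2]) < min_confidence:
            continue
        # Scale to pixels and clamp to the image, since the network can
        # predict corners slightly outside the frame.
        x1 = min(max(round(float(row[3]) * width), 0), width)
        y1 = min(max(round(float(row[4]) * height), 0), height)
        x2 = min(max(round(float(row[5]) * width), 0), width)
        y2 = min(max(round(float(row[6]) * height), 0), height)
        boxes.append((x1, y1, x2, y2))
    return boxes


def detect_faces_dnn(image_bgr, prototxt_path, caffemodel_path, min_confidence=0.5):
    """Run an OpenCV Dnn face detector on a BGR image (NumPy array)."""
    import cv2  # imported lazily so the helper above works without OpenCV

    # Assumption: a Caffe SSD face model; the paths are supplied by the caller.
    net = cv2.dnn.readNetFromCaffe(prototxt_path, caffemodel_path)
    height, width = image_bgr.shape[:2]
    blob = cv2.dnn.blobFromImage(
        cv2.resize(image_bgr, (300, 300)), 1.0, (300, 300), (104.0, 177.0, 123.0)
    )
    net.setInput(blob)
    detections = net.forward()[0, 0]  # array of shape (N, 7)
    return detections_to_boxes(detections, width, height, min_confidence)
```

The confidence threshold of 0.5 is only a placeholder; lowering it finds more faces at the cost of more false positives.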

After finding the pitfall of the Dnn model, we decided to explore another model to see if it would perform better. We started with Dlib’s HOG model. This model is not as good as OpenCV’s Dnn for single or multiple face detection, as the following examples show:

Single and multiple face detection using Dlib HOG model

However, it was surprisingly better than the Dnn model for pictures taken close to the face:

Large faces detection using Dlib HOG model
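The HOG step can be sketched in a few lines with Dlib's frontal face detector. The function names and the upsample value are hypothetical choices for illustration; the boxes are returned as (x1, y1, x2, y2) tuples so they can be compared with boxes from another detector.

```python
def rect_to_box(rect):
    """Convert a dlib rectangle into an (x1, y1, x2, y2) tuple."""
    return (rect.left(), rect.top(), rect.right(), rect.bottom())


def detect_faces_hog(image_rgb, upsample=1):
    """Detect faces with Dlib's HOG frontal face detector.

    image_rgb is an RGB (or grayscale) NumPy array; upsample is how many
    times the image is enlarged before detection to catch smaller faces.
    """
    import dlib  # imported lazily so rect_to_box works without dlib

    detector = dlib.get_frontal_face_detector()
    return [rect_to_box(rect) for rect in detector(image_rgb, upsample)]
```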

Overall, we had one model that was very good in general (Dnn) and another that performed well on “large faces” (HOG). Why not combine both of them? We ultimately decided to merge the results of both models to get a better detection output.
Unfortunately, merging two models meant that some faces would be detected twice with different bounding boxes, as in the images below.

Face detection after combining Dnn (red) and HOG (green)

Since we did not want to analyze the same face twice, we computed the Jaccard similarity coefficient by dividing the overlapping region of the bounding boxes by the combined region. This coefficient is also called the Intersection Over Union (IoU) or the Jaccard index.

Jaccard index computation visual from Source

Then we added a threshold for the Jaccard index. If the index is below this threshold, the two bounding boxes probably do not belong to the same face, so we keep both of them. Otherwise, the two bounding boxes overlap so much that they probably belong to the same face. In this case, we keep only the Dnn box, since this model is in general better than the Dlib HOG model.
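The filtering described above can be sketched as follows. The article does not give the threshold value, so the 0.3 used here is only a placeholder, and the function names and the (x1, y1, x2, y2) box format are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Jaccard index (Intersection over Union) of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def merge_detections(dnn_boxes, hog_boxes, threshold=0.3):
    """Keep every Dnn box, plus the HOG boxes that match no Dnn box."""
    merged = list(dnn_boxes)
    for hog_box in hog_boxes:
        # A HOG box is a new face only if it overlaps every Dnn box
        # by less than the threshold.
        if all(iou(hog_box, dnn_box) < threshold for dnn_box in dnn_boxes):
            merged.append(hog_box)
    return merged
```

For example, two 10×10 boxes offset by 5 pixels in each direction share a 5×5 region, so their index is 25 / (100 + 100 − 25) ≈ 0.14 and, with this placeholder threshold, both would be kept.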

These are examples of results we obtained after combining both models:

Face detection after the Jaccard index filter

Landmarks

After the detection of the faces, the next step was to detect the landmarks on the face.

Not many pre-trained models exist for finding landmarks on faces. TensorFlow Hub offers some pretty good pre-trained models like Blazeface or Facemesh, with hundreds of face points and even 3D landmarks. Unfortunately, all these models are only available for TensorFlow.js or TensorFlow Lite, and since we wanted to serve the model from an API endpoint written in Python, we had to find another solution.

To work around this problem, we turned to the Dlib library, which also offers a face landmark detection model. The model is quite good: it gives us 68 face points, covering the eyes, the eyebrows, the jaw, the mouth, and the nose.

Dlib face landmarks output points Source
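As an illustration, the landmark step could be written like this with Dlib's shape predictor. The function names are hypothetical, and the predictor file passed in is assumed to be a 68-point model such as the widely distributed shape_predictor_68_face_landmarks.dat; the article does not name the file it used.

```python
def shape_to_points(shape):
    """Convert a dlib full_object_detection into a list of (x, y) tuples."""
    return [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]


def detect_landmarks(image_rgb, box, predictor_path):
    """Locate facial landmarks inside one face box (x1, y1, x2, y2)."""
    import dlib  # imported lazily so shape_to_points works without dlib

    # Loading the model is slow; a real service would load it once and reuse it.
    predictor = dlib.shape_predictor(predictor_path)
    x1, y1, x2, y2 = box
    shape = predictor(image_rgb, dlib.rectangle(x1, y1, x2, y2))
    return shape_to_points(shape)
```

With a 68-point model the returned list is indexed 0 to 67, so index 48 is one corner of the mouth and index 54 the other.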

Geometric criteria

Once we were able to detect landmarks on the face, we just had to interpret them to detect smiles. To do so, we combined three criteria into a smile score that determines whether a person is smiling:

  • First, the opening of the mouth (distance between 62 and 66) must be twice the distance between the nose and the mouth (between 51 and 33). This counts as 20% of the score and represents the mouth opening.
Mouth opening computation using distances
  • Second, we verify if a majority of the mouth (points 48 to 67) is below a straight line between the corners of the mouth (points 48 and 54). This counts as 40% of the score and it represents the mouth direction.
Mouth direction computation using a linear function
  • Third, the distance from the mouth to the cheeks (sum of 48–3 and 54–13) must be smaller than the distance from the mouth to the jaw (sum of 48–5 and 54–11). This counts as 40% of the score and represents the mouth position.
Mouth position computation using geometric distances
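The three criteria above can be sketched as follows. This is one possible reading, not the exact production code: it assumes each criterion is all-or-nothing, that "twice" means "at least twice", that "a majority" means more than half of the 20 mouth points, and that landmarks are (x, y) pixel coordinates with y growing downward. The function names are hypothetical.

```python
import math

MOUTH = range(48, 68)  # indices of the 20 mouth landmarks


def _dist(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def mouth_is_open(pts):
    """Criterion 1: inner-lip gap is at least twice the nose-to-lip distance."""
    return _dist(pts[62], pts[66]) >= 2 * _dist(pts[51], pts[33])


def mouth_curves_up(pts):
    """Criterion 2: most mouth points lie below the line joining the corners."""
    (x1, y1), (x2, y2) = pts[48], pts[54]
    if x1 == x2:  # degenerate case: the corner line is vertical
        return False
    slope = (y2 - y1) / (x2 - x1)
    # In image coordinates "below the line" means a larger y value.
    below = sum(1 for i in MOUTH if pts[i][1] > y1 + slope * (pts[i][0] - x1))
    return below > len(MOUTH) / 2


def mouth_is_raised(pts):
    """Criterion 3: mouth corners are closer to the cheeks than to the jaw."""
    to_cheeks = _dist(pts[48], pts[3]) + _dist(pts[54], pts[13])
    to_jaw = _dist(pts[48], pts[5]) + _dist(pts[54], pts[11])
    return to_cheeks < to_jaw


def smile_score(pts):
    """Weighted sum of the three criteria (20% / 40% / 40%), between 0 and 1."""
    score = 0.2 * mouth_is_open(pts) + 0.4 * mouth_curves_up(pts) + 0.4 * mouth_is_raised(pts)
    return round(score, 2)
```

With all-or-nothing criteria the score can only take the values 0, 0.2, 0.4, 0.6, 0.8, or 1; a graded version of each criterion would give a smoother score.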

Finally, with the combination of these three metrics, we obtained the smiling score of the face, between 0 and 1.

Smile detection final result with bounding boxes

Test it yourself

You can test the smile detector using the Sport Vision API. You first need to create an access token through the console, then make a simple POST request, for example:

curl --location --request POST 'http://sportvision.api.decathlon.com/v2/smile-detector/predict/' \
--form 'file=@/<IMAGE_ABSOLUTE_PATH>'

And the answer should look like this:

{
"coordinates": [198, 51, 333, 217],
"landmarks": [
[205, 89],[198, 108],[193, 130],...
],
"smile_score": 0.6
}

The coordinates array holds the coordinates of the two points on the diagonal of the bounding box, the landmarks array holds the coordinates of all 68 facial points returned by the Dlib model, and smile_score is the score of the smile.
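If you consume the response from Python, a small parser like the sketch below may help. It only relies on the three fields shown above; the class and function names are hypothetical, and the order of the coordinates (top-left x, y followed by bottom-right x, y) is an assumption. The sample string contains only the three landmarks visible in the example, whereas a real response holds 68.

```python
import json
from dataclasses import dataclass


@dataclass
class SmilePrediction:
    top_left: tuple      # (x, y) of the first diagonal point
    bottom_right: tuple  # (x, y) of the second diagonal point
    landmarks: list      # list of (x, y) facial points
    smile_score: float   # between 0 and 1

    @property
    def box_size(self):
        """Width and height of the bounding box in pixels."""
        return (
            self.bottom_right[0] - self.top_left[0],
            self.bottom_right[1] - self.top_left[1],
        )


def parse_prediction(body):
    """Parse the JSON body returned by the smile detector endpoint."""
    data = json.loads(body)
    x1, y1, x2, y2 = data["coordinates"]  # assumed order: x1, y1, x2, y2
    return SmilePrediction(
        top_left=(x1, y1),
        bottom_right=(x2, y2),
        landmarks=[tuple(point) for point in data["landmarks"]],
        smile_score=float(data["smile_score"]),
    )


# Truncated sample built from the example response (3 landmarks only).
SAMPLE = (
    '{"coordinates": [198, 51, 333, 217], '
    '"landmarks": [[205, 89], [198, 108], [193, 130]], '
    '"smile_score": 0.6}'
)
```

Calling parse_prediction(SAMPLE).box_size gives the width and height of the detected face, which can be handy for filtering out faces that are too small to judge.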

Next steps

So far the algorithm is working well and the accuracy is not bad at all, but we can still make it more complete and accurate. Two major improvements could increase the quality of the score.

First, improving face landmark detection by using a more recent model such as Facemesh/opset9, with 3D face landmarking and more facial points, would locate the facial landmarks more precisely. Since the landmarks would then be more representative of the real face, we would adapt the geometric relations to make the score better.

Next, the Facial Action Coding System suggests that we can detect a smile more precisely by using a combination of multiple facial action units, not just the mouth. This means that considering more facial characteristics like the nose angle, the cheeks, the eyebrows, and the eyes should improve the quality of the smiling score.

Example of facial action units from Source

Thank you for reading!

🙏🏼 If you enjoyed the article, please share it and consider giving it a few 👏.

❤️ If you have any questions or comments, or just want to contribute, do not hesitate to contact us at sportvisionapi@decathlon.com

💌 Learn more about tech products opened to the community and subscribe to our newsletter on http://developers.decathlon.com

👩‍💻 👨‍💻 Interested in joining the Decathlon Tech Team? Check out https://developers.decathlon.com/careers

A very warm thank you to the members of the AI team at Décathlon Canada for letting me write this blog post and also for the comments and review.

A special thanks to

Nadir Trapsida
Decathlon Digital

A programmer by day and a gamer by night! I’m a recent graduate in computer engineering and an AI enthusiast wanting to learn more about this field …