FaceMesh: Detecting Key Points on Faces in Real Time

David Cochard
axinc-ai
Apr 14, 2021

This is an introduction to FaceMesh, a machine learning model that can be used with the ailia SDK. You can easily use this model to create AI applications with the ailia SDK, as well as many other ready-to-use ailia MODELS.

Overview

FaceMesh is a machine learning model for detecting key facial features from images, published by Google in March 2019. The paper was published in July 2019.

While typical face keypoint detection computes 68 (x, y) points, FaceMesh outputs 468 (x, y, z) points. Since manually annotating 468 points is impractical, images rendered from a 3D morphable model (3DMM) were used to bootstrap the annotation of real images.

Source: https://google.github.io/mediapipe/solutions/face_mesh.html

FaceMesh can be applied to AR applications such as virtual try-on of clothes.

Source: https://google.github.io/mediapipe/solutions/face_mesh.html

FaceMesh architecture

FaceMesh takes a 192x192 input image of a face and outputs 468 3D keypoints. The values (x,y) are the pixel coordinates of the input image and z is the depth value relative to the center of gravity of the mesh.
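As a sketch of how this output might be consumed, the snippet below rescales keypoints from the 192x192 input back to the original face-crop size. The function name and the flat 468×3 output layout are assumptions for illustration, not the SDK's actual API:

```python
import numpy as np

INPUT_SIZE = 192   # FaceMesh input resolution
NUM_POINTS = 468   # number of predicted keypoints

def keypoints_to_image_coords(raw_output, crop_w, crop_h):
    """Rescale keypoints from the 192x192 input to the original face crop.

    raw_output: flat array of 468 * 3 floats, (x, y) in input-image pixels
    and z as depth relative to the mesh's center of gravity. We scale z
    with the same horizontal factor to keep the mesh proportions.
    """
    pts = np.asarray(raw_output, dtype=np.float32).reshape(NUM_POINTS, 3)
    scale = np.array([crop_w / INPUT_SIZE,
                      crop_h / INPUT_SIZE,
                      crop_w / INPUT_SIZE], dtype=np.float32)
    return pts * scale

# Identity case: a 192x192 crop leaves the coordinates unchanged.
dummy = np.tile([96.0, 96.0, 0.0], NUM_POINTS)
pts = keypoints_to_image_coords(dummy, 192, 192)
print(pts.shape)  # (468, 3)
```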

FaceMesh output (Source: https://pixabay.com/ja/videos/%E5%A5%B3%E6%80%A7-%E3%83%A4%E3%83%B3%E3%82%B0-%E8%B1%AA%E8%8F%AF%E3%81%A7%E3%81%99-%E8%A1%A8%E7%8F%BE-32387/)

The architecture of FaceMesh is based on MobileNet and consists of a combination of depthwise and pointwise convolutions. This makes the model capable of real-time inference on the GPU of a mobile device.
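The parameter savings of this factorization can be illustrated with a quick count. The channel and kernel sizes below are arbitrary examples, not FaceMesh's actual layer dimensions:

```python
# Parameter-count comparison between a standard convolution and the
# depthwise + pointwise factorization used by MobileNet-style models.

def standard_conv_params(c_in, c_out, k):
    # Every output channel has its own k x k filter over all input channels.
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k):
    depthwise = c_in * k * k          # one k x k filter per input channel
    pointwise = c_in * c_out          # 1x1 convolution mixes the channels
    return depthwise + pointwise

std = standard_conv_params(64, 128, 3)   # 73728
sep = separable_conv_params(64, 128, 3)  # 576 + 8192 = 8768
print(std, sep, round(std / sep, 1))     # roughly 8x fewer parameters
```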

Source: https://arxiv.org/pdf/1907.06724
https://netron.app/?url=https://storage.googleapis.com/ailia-models/facemesh/facemesh.onnx.prototxt

Machine learning models for keypoint detection often use 2D heatmaps, but these suffer from a high computational load and cannot estimate depth. FaceMesh therefore regresses the 3D coordinates directly.
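The difference in output size is easy to quantify. The heatmap resolution below is an assumed example, but the comparison shows why direct regression is much lighter for 468 points (and heatmaps would still carry no depth):

```python
# Output-size comparison: one 2D heatmap per keypoint vs. direct
# regression of (x, y, z) coordinates, for 468 keypoints.

num_points = 468
heatmap_res = 64  # assumed per-keypoint heatmap resolution

heatmap_outputs = num_points * heatmap_res * heatmap_res  # one map per point
regression_outputs = num_points * 3                       # (x, y, z) per point

print(heatmap_outputs, regression_outputs)  # 1916928 vs 1404
```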

The model was trained on about 30K images taken with mobile cameras, which cover a wide variety of sensors and lighting conditions.

Since annotating 468 key points on 30K images requires an enormous amount of work, a different approach was used. Based on 3DMM-rendered images, a subset of vertices was designated as ground truth. A mechanism to infer 2D landmarks separately from the 3D landmarks was added, together with real images annotated with fewer than 468 keypoints as ground truth, so that the 2D and 3D predictions could be optimized simultaneously.

The model created by this method was then used to refine 30% of the dataset, and the refined annotations were later used as ground truth for training. A brush tool was used during the refinement step to easily adjust multiple vertices simultaneously.

By iterating the process of training and updating the dataset, a highly accurate model could be generated.

FaceMesh usage

The following command runs the model using the webcam input.

$ python3 facemesh.py -v 0

Since FaceMesh is computed frame by frame, some jitter might be visible. It is recommended to implement a 1D temporal filter when using it in AR applications.
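A minimal example of such a temporal filter is an exponential moving average over the per-frame landmarks. The class and the smoothing factor below are illustrative assumptions, not part of the SDK:

```python
import numpy as np

class LandmarkSmoother:
    """Exponential moving average over per-frame landmark arrays.

    alpha close to 1.0 follows the raw detection closely; smaller alpha
    smooths more but adds lag. 0.5 is an arbitrary starting point to tune.
    """

    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.state = None

    def update(self, landmarks):
        pts = np.asarray(landmarks, dtype=np.float32)
        if self.state is None:
            self.state = pts  # first frame: no history to blend with
        else:
            self.state = self.alpha * pts + (1.0 - self.alpha) * self.state
        return self.state

# Feed the smoother each frame's (468, 3) keypoint array.
smoother = LandmarkSmoother(alpha=0.5)
smoothed = smoother.update(np.zeros((468, 3)))
smoothed = smoother.update(np.ones((468, 3)))  # blended halfway toward 1.0
```

A stronger alternative for AR use is a velocity-adaptive filter (such as the one-euro filter), which smooths aggressively when landmarks are still and loosens when they move fast.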

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.
