Shape and Appearance Based Sequenced Convnets to Detect Real-Time Face Attributes on Mobile Devices


In computer vision, classifying facial attributes has attracted deep interest from researchers and corporations. Deep neural network based approaches are now widespread for such tasks and have reached higher detection accuracies than earlier manually designed approaches. Our paper reports how preprocessing and face image alignment influence accuracy scores when detecting face attributes. More importantly, it demonstrates how combining a representation of the shape of a face with its appearance, organized as a sequence of convolutional neural networks, improves classification scores of facial attributes when compared with previous work on the FER+ dataset.

While most studies in the field have tried to improve detection accuracy by averaging multiple very deep networks, the work presented here concentrates on building efficient models while maintaining high accuracy scores. By taking advantage of the face shape component and relying on an efficient shallow CNN architecture, we unveil the first available, highly accurate real-time implementation on mobile browsers.

Link to the paper AMDO 2018: 73–84, Nicolas Livet, George Berkowski

Introduction / Content

Exploring human face attributes from images and sequences of images has been a topic of interest over the years. For decades, a number of approaches have been carefully engineered to try to solve this problem with the highest possible accuracy. However, most manually crafted approaches become inefficient when dealing with real-life “face-in-the-wild” problems. Manually crafted approaches are often combined with machine learning techniques, for example LBP+SVM [1] or SIFT/HOG+SVM [2].

In contrast with most recent approaches, which consist in evaluating ever deeper architectures or averaging multiple deep models, our approach classifies emotions using shallow CNN architectures and prior information. Our objective is to achieve robust, real-time detection of face attributes (emotions in the examples shown) in mobile browsers.


The presented architecture is trained, and its results are evaluated, on the FER+ dataset [FER+], which was first released in 2013 (as FER) and later improved with better annotations. The FER dataset contains about 30,000 samples, making it one of the most complete in-the-wild face emotion datasets available.

Note that for our training and testing phases, we dropped the traditional CK+ dataset [CK+], as accuracy scores on it are saturating. CK+ and other traditional datasets only provide very constrained images, with:

  • Limited # of samples (SFEW),
  • Limited # of poses (CK+, JAFFE),
  • Limited # of lighting variations (All),
  • Limited face diversity (e.g. JAFFE).
The CK+ (Extended Cohn-Kanade expression dataset) has been the dataset of choice for decades when comparing accuracies. We can see how it saturates with recent techniques.

However, the FER+ dataset is far from perfect and includes several biases and issues:

  • Examples are constructed using a face detector which crops faces in a particular manner,
  • Low-quality, low-resolution cropped images contain compression artifacts, aliasing, …
  • Original images were not made available,
  • A majority of the faces are frontal (as the face detector tends to fail on profile faces).

To reduce the impact of such biases, the dataset is preprocessed as follows:

  1. Images are upsampled (2x) and enhanced (noise and JPEG artifact removal) using a super-resolution CNN architecture [SRCNN].
  2. Borders are then expanded by progressively mirroring and blurring pixels to obtain 128x128 face images.
  3. Cleaner augmented examples are finally constructed (at training phase) based on random projective transformations.
Three simple preprocessing steps to reduce the influence of the FER+ dataset biases (image upscaling, border expansion and final random crops).
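
Steps 2 and 3 can be illustrated with a small NumPy sketch. This is not the authors' code: the function names `expand_borders` and `random_homography` are hypothetical, the progressive blur of step 2 is left out, the 96x96 input size assumes a 48x48 image upsampled 2x, and the corner-jitter strategy for drawing a projective transformation is only one plausible choice.

```python
import numpy as np

def expand_borders(img, out_size=128):
    """Mirror-pad a 2-D grayscale image to out_size x out_size (blur omitted)."""
    h, w = img.shape
    pad_h, pad_w = out_size - h, out_size - w
    if pad_h < 0 or pad_w < 0:
        raise ValueError("image is larger than the target size")
    top, left = pad_h // 2, pad_w // 2
    return np.pad(img, ((top, pad_h - top), (left, pad_w - left)), mode="reflect")

def random_homography(size, max_shift, rng):
    """3x3 projective matrix that moves each image corner by a random offset."""
    src = np.array([[0, 0], [size - 1, 0], [size - 1, size - 1], [0, size - 1]],
                   dtype=float)
    dst = src + rng.uniform(-max_shift, max_shift, size=(4, 2))
    rows, rhs = [], []
    # Each corner pair (x, y) -> (u, v) gives two linear equations in the
    # eight unknown entries of the matrix (the last entry is fixed to 1).
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = np.linalg.solve(np.array(rows), np.array(rhs))
    return np.append(h, 1.0).reshape(3, 3)

rng = np.random.default_rng(0)
face = rng.random((96, 96))      # stand-in for an upsampled face crop (assumed size)
padded = expand_borders(face)    # 128 x 128, original pixels kept in the centre
H = random_homography(128, max_shift=8.0, rng=rng)  # assumed jitter of 8 pixels
```

Applying `H` to the padded image would require a warping routine with interpolation, which is omitted here. Expanding the borders first means the warp has real (mirrored) pixels to pull from instead of empty corners.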


Instead of relying on ever deeper and averaged CNN architectures, we decided to rely on modern MobileNet architectures [MOBNET]. Such architectures are built from depthwise and pointwise separable filters, which reduces computation at inference time (see illustration).

Our fastest architecture, based on pointwise/depthwise operators. It has been constructed from a truncated Mobilenet-0.5 architecture.
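
To give a feel for why separable filters are cheaper, the sketch below counts multiply-add operations for a standard 3×3 convolution and for its depthwise plus pointwise replacement. The layer sizes (64 input channels, 128 output channels, 32×32 feature map) are illustrative assumptions, not figures from the paper, and the function names are hypothetical.

```python
def conv_mult_adds(k, c_in, c_out, h, w):
    """Multiply-adds of a standard k x k convolution on an h x w output map."""
    return k * k * c_in * c_out * h * w

def separable_mult_adds(k, c_in, c_out, h, w):
    """Multiply-adds of a depthwise k x k convolution followed by a pointwise 1x1."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1x1 convolution that mixes channels
    return depthwise + pointwise

# Illustrative layer: 3x3 kernel, 64 -> 128 channels, 32x32 feature map.
standard = conv_mult_adds(3, 64, 128, 32, 32)
separable = separable_mult_adds(3, 64, 128, 32, 32)
ratio = separable / standard           # equals 1/c_out + 1/k**2
print(standard, separable, round(ratio, 3))
```

For this layer the separable version needs roughly 12% of the operations of the standard convolution, and the saving grows with the number of output channels.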

However, a shallower architecture cannot compete with very deep VGG or ResNet-50 architectures. This is why supplemental information is fed to the network as prior knowledge, with the objective of approaching as closely as possible the results obtained with a deep architecture.

Shape Prior Architecture

A supplemental channel is constructed and added to the RGB or gray input image. This channel contains the shape of the face as depicted by its internal landmarks. We call this image the shape prior heatmap, as it contains a (rescaled) Gaussian peak response in the vicinity of each face landmark (see illustration).
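
A minimal NumPy sketch of such a heatmap follows. It assumes landmarks are given as (x, y) pixel coordinates on a square image, that each Gaussian is rescaled to peak at 1, and that overlapping peaks are merged with a maximum; the function names, the `sigma` value and the landmark positions are hypothetical and not taken from the paper.

```python
import numpy as np

def shape_prior_heatmap(landmarks, size=128, sigma=2.0):
    """Render one Gaussian peak (maximum value 1) per landmark on a size x size map."""
    ys, xs = np.mgrid[0:size, 0:size]
    heat = np.zeros((size, size), dtype=np.float64)
    for x, y in landmarks:
        peak = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        heat = np.maximum(heat, peak)   # keep the strongest response per pixel
    return heat.astype(np.float32)

def add_shape_channel(image, landmarks, sigma=2.0):
    """Append the heatmap as an extra channel to a square (H, W) or (H, W, C) image."""
    if image.ndim == 2:
        image = image[..., None]
    heat = shape_prior_heatmap(landmarks, size=image.shape[0], sigma=sigma)
    return np.concatenate([image.astype(np.float32), heat[..., None]], axis=-1)

# Three made-up landmark positions (x, y) on a 128 x 128 face image.
landmarks = [(40, 50), (88, 50), (64, 90)]
gray = np.zeros((128, 128), dtype=np.float32)
net_input = add_shape_channel(gray, landmarks)   # shape (128, 128, 2)
```

A gray image becomes a 2-channel input and an RGB image a 4-channel input, so only the first convolution of the network needs to accept the extra channel.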

Our architecture for facial emotion detection is decomposed into two steps: first, face landmarks are localized (not described in this work, see for some results), then emotions are detected based on the face landmark locations. The following illustration describes our complete architecture.


Not surprisingly, the shape prior helps to improve accuracy scores on the FER+ dataset. It’s especially the case when inferring on shallow CNN architectures.

We have improved over state-of-the-art accuracy on the FER+ dataset using a VGG architecture.
In our experiments, using a shape prior provides large improvements on shallow architectures (here on Mobilenet-0.5).

We have benchmarked our approach on different architectures. To make our application real-time in mobile browsers, an optimized implementation of the pointwise 1×1 and depthwise 3×3 convolutions has been developed, relying on the Emscripten tool to compile it to JavaScript. Even though our implementation could be further optimized (e.g. by taking advantage of SIMD instructions), our native Android application reached 300fps on a Google Pixel 2 and nearly 100fps on the same device using the Chrome web browser (refer to the last columns of Table 1 for more results).


Our work discusses existing datasets, their respective drawbacks, and how to prepare the data to improve the quality of the information passed to the learning process. It shows how to transform the results of a facial feature detector into a face shape heatmap image, and how to combine the shape of a face with its appearance to learn modified CNN models. Using this approach, accuracy scores on the FER+ dataset are substantially improved. The choice of a smooth loss (a Huber loss) evaluated on non-discrete label distributions gives our system the capability to interpolate between different emotion attributes. Our architecture and in-house implementation take advantage of efficient separable pointwise 1×1 and depthwise 3×3 convolutional filters. We were able to deploy a face tracker combined with a face emotion detector that works in real time in mobile browsers.
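
As an illustration of a Huber loss evaluated on label distributions, here is a small NumPy sketch. The text above does not state the threshold used or how the loss is reduced, so the `delta` parameter, the mean reduction and the vote counts are assumptions made for the example.

```python
import numpy as np

def huber_loss(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for small residuals, linear beyond delta."""
    r = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return float(np.mean(np.where(r <= delta, quadratic, linear)))

# Made-up annotator votes over 8 emotion classes, turned into a distribution.
votes = np.array([1, 6, 2, 1, 0, 0, 0, 0], dtype=float)
target = votes / votes.sum()

# A uniform prediction, as an untrained network might output.
pred = np.full(8, 1.0 / 8.0)
loss = huber_loss(pred, target)
```

Because both the prediction and the target are probabilities in [0, 1], residuals never exceed 1, so the linear branch only becomes active when `delta` is set below 1.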


[1] Dynamic texture recognition using local binary patterns with an application to facial expressions, G. Zhao et al.

[2] Facial expression recognition and histograms of oriented gradients: a comprehensive study, P. Carcagnì et al.

[FER+] Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution, E. Barsoum et al.

[CK+] The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression, P. Lucey et al.

[SRCNN] Image Super-Resolution Using Deep Convolutional Networks, Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang

[MOBNET] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, A. G. Howard et al.