Shape and Appearance Based Sequenced Convnets to Detect Real-Time Face Attributes on Mobile Devices
In computer vision, classifying facial attributes has attracted deep interest from researchers and corporations. Deep-neural-network-based approaches are now widespread for such tasks and reach higher detection accuracies than earlier manually designed approaches. Our paper reports how preprocessing and face image alignment influence accuracy when detecting face attributes. More importantly, it demonstrates how combining a representation of the shape of a face with its appearance, organized as a sequence of convolutional neural networks, improves classification scores of facial attributes compared with previous work on the FER+ dataset.
While most studies in the field have tried to improve detection accuracy by averaging multiple very deep networks, the presented work concentrates on building efficient models while maintaining high accuracy. By taking advantage of the face shape component and relying on an efficient shallow CNN architecture, we unveil the first available, highly accurate real-time implementation on mobile browsers.
Introduction / Content
Exploring human face attributes in images and sequences of images has been a topic of interest for years. For decades, a number of approaches were carefully engineered to solve this problem with the highest possible accuracy. However, most manually crafted approaches become inefficient when dealing with real-life “face-in-the-wild” problems. Manually crafted approaches are often combined with machine-learning techniques, for example LBP+SVM [LBP] or SIFT/HOG+SVM [HOG].
In contrast with most recent approaches, which consist of evaluating ever-deeper architectures or averaging multiple deep models, our approach classifies emotions using shallow CNN architectures and prior information. Our objective is robust, real-time detection of face attributes (emotions in the shown examples) in mobile browsers.
The presented architecture is trained and evaluated on the FER+ dataset [FER+], first released in 2013 and later improved with better annotations. The FER+ dataset contains about 30,000 samples, making it one of the most complete in-the-wild face emotion datasets available.
Note that for our training and testing phases, we set aside the traditional CK+ dataset [CK+], as accuracy scores on it have saturated. CK+ and other traditional datasets only provide very constrained images, with:
- Limited # of samples (SFEW),
- Limited # of poses (CK+, JAFFE),
- Limited # of lighting variations (All),
- Limited face diversity (e.g. JAFFE).
However, the FER+ dataset is far from perfect and includes several biases/issues:
- Examples are constructed using a face detector which crops faces in a particular manner,
- Low-quality, low-resolution cropped images contain compression artifacts, aliasing, …,
- Original images were not made available,
- Faces are predominantly frontal (the face detector tends to fail on profile faces).
To reduce the impact of such biases, the dataset is preprocessed as follows:
- Images are upsampled (2x) and enhanced (noise and JPEG artifact removal) using a super-resolution CNN architecture [SRCNN],
- Borders are then expanded by progressively mirroring and blurring pixels to obtain 128x128 face images,
- Cleaner augmented examples are finally constructed (at training time) based on random projective transformations.
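The border-expansion step can be sketched as follows. This is a minimal numpy approximation: the pad widths, the box-blur kernel, and the number of blur passes are our assumptions, as the paper does not specify the exact procedure.

```python
import numpy as np

def expand_borders(face, target=128):
    """Expand a cropped face to target x target by mirroring its
    borders outward, then softening the mirrored ring with a few
    box-blur passes so the border fades progressively (sketch)."""
    h, w = face.shape[:2]
    pad_y, pad_x = (target - h) // 2, (target - w) // 2
    # Mirror image content outward to fill the new borders.
    out = np.pad(face, ((pad_y, target - h - pad_y),
                        (pad_x, target - w - pad_x)), mode="reflect")
    # Blur a copy with a separable 3-tap box filter, applied 3 times.
    blurred = out.astype(np.float64)
    for _ in range(3):
        blurred = (np.roll(blurred, 1, 0) + blurred + np.roll(blurred, -1, 0)) / 3.0
        blurred = (np.roll(blurred, 1, 1) + blurred + np.roll(blurred, -1, 1)) / 3.0
    # Replace only the mirrored ring; the original face region is untouched.
    ring = np.ones_like(out, dtype=bool)
    ring[pad_y:pad_y + h, pad_x:pad_x + w] = False
    out = out.astype(np.float64)
    out[ring] = blurred[ring]
    return out

face = np.random.rand(96, 96)  # a 2x-upsampled 48x48 FER crop
expanded = expand_borders(face)
```

The random projective augmentation would then be applied on top of these 128x128 images at training time.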
Instead of relying on ever-deeper and averaged CNN architectures, we rely on modern MobileNet architectures [MOBNET]. Such architectures include several pointwise/depthwise separable filters, optimizing computation at inference time (see illustration).
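The computational saving of depthwise separable filters over standard convolutions can be made concrete by counting multiply-accumulates (MACs); the layer dimensions below are an illustrative assumption, not taken from the paper.

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-accumulates for a standard k x k convolution."""
    return h * w * k * k * c_in * c_out

def separable_macs(h, w, k, c_in, c_out):
    """Depthwise k x k filter per input channel, then a 1x1
    pointwise convolution to mix channels."""
    depthwise = h * w * k * k * c_in
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

# Example layer: 64x64 feature map, 3x3 kernel, 32 -> 64 channels.
std = conv_macs(64, 64, 3, 32, 64)
sep = separable_macs(64, 64, 3, 32, 64)
ratio = std / sep  # about 8x fewer MACs for this layer
```

The saving factor is 1/(1/c_out + 1/k²), so it grows with the number of output channels, which is why the reduction is substantial through most of a MobileNet-style network.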
However, a shallower architecture cannot compete with very deep VGG or ResNet-50 architectures. Supplemental information is therefore fed to the network to give the system prior knowledge, with the objective of approaching as closely as possible the results obtained with a deep architecture.
Shape Prior Architecture
A supplemental channel is constructed and added to the RGB or grayscale input image. This channel contains the shape of the face as depicted by its internal landmarks. We name this image the shape prior heatmap, as it contains a (rescaled) Gaussian peak response in the vicinity of each face landmark (see illustration).
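A minimal numpy sketch of building such a heatmap channel and stacking it with the image: the Gaussian width `sigma` and the [0, 255] rescaling are our assumptions, chosen to match the range of the image channels.

```python
import numpy as np

def shape_prior_heatmap(landmarks, size=128, sigma=2.0):
    """Render face landmarks as a single heatmap channel:
    a Gaussian peak at each (x, y) landmark, rescaled to
    [0, 255] like the image channels (sketch)."""
    ys, xs = np.mgrid[0:size, 0:size]
    heat = np.zeros((size, size))
    for (x, y) in landmarks:
        peak = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        heat = np.maximum(heat, peak)  # overlapping peaks keep the max
    return (heat * 255.0).astype(np.uint8)

# Stack the heatmap with a grayscale face to form the network input.
face = np.zeros((128, 128), dtype=np.uint8)
landmarks = [(40, 50), (88, 50), (64, 90)]  # hypothetical eye/mouth points
prior = shape_prior_heatmap(landmarks)
net_input = np.stack([face, prior], axis=-1)  # H x W x 2
```

For an RGB input the same heatmap is simply appended as a fourth channel.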
Our architecture for facial emotion detection is decomposed into two steps: first, face landmarks are localized (not described in this work; see www.deepar.ai for some results), then emotions are detected based on the face landmark localizations. The following illustration describes our complete architecture.
Not surprisingly, the shape prior improves accuracy scores on the FER+ dataset. This is especially true when inferring with shallow CNN architectures.
Our work discusses existing datasets, their respective drawbacks, and how to prepare the data to improve the quality of the information passed to the learning process. It shows how to transform the results of a facial feature detector into a face shape heatmap image and how to combine the shape of a face with its appearance to learn modified CNN models. Using this approach, accuracy scores on the FER+ dataset are substantially improved. The choice of a smooth loss (a Huber loss) evaluated on non-discrete label distributions gives our system the capability to interpolate between different emotion attributes. Our architecture and in-house implementation take advantage of efficient separable pointwise 1×1 and depthwise 3×3 convolutional filters. We were able to deploy a face tracker combined with a face emotion detector that runs in real time in mobile browsers.
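The Huber loss over soft (crowd-sourced) label distributions can be sketched as below; the threshold `delta = 1.0` and the sum reduction are assumptions, as the paper does not state the exact hyperparameters.

```python
import numpy as np

def huber_loss(pred, target, delta=1.0):
    """Huber (smooth L1) loss between a predicted emotion
    distribution and a soft label distribution: quadratic for
    small errors, linear for large ones (sketch)."""
    err = np.abs(pred - target)
    quad = 0.5 * err ** 2
    lin = delta * (err - 0.5 * delta)
    return np.where(err <= delta, quad, lin).sum()

# Soft labels let the model interpolate between emotions, e.g. a
# face annotated by the crowd as 70% happy / 30% surprised.
target = np.array([0.7, 0.3, 0.0])
pred = np.array([0.6, 0.3, 0.1])
loss = huber_loss(pred, target)
```

Training against the full distribution, rather than the argmax class, is what allows the deployed detector to output blended emotion scores.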
[LBP] Dynamic texture recognition using local binary patterns with an application to facial expressions — G. Zhao et al.
[HOG] Facial expression recognition and histograms of oriented gradients: a comprehensive study — P. Carcagnì et al.
[FER+] Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution — E. Barsoum et al.
[CK+] The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression — P. Lucey et al.
[SRCNN] Image Super-Resolution Using Deep Convolutional Networks — C. Dong et al.
[MOBNET] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications — A. G. Howard et al.