[ CV 2017 / Paper Summary ] Do Deep Neural Networks Learn Facial Action Units When Doing Expression Recognition?

This is a summary of a paper on emotion recognition using facial action units with a CNN-based approach. There are many traditional approaches to recognising a person's emotion, but they depend on domain expertise; with the huge progress in CNNs, the authors of this paper try a CNN-based approach instead. Humans exhibit different types of emotions, such as anger, sadness, happiness, surprise, disgust, fear and neutral, as shown in the figure.

Types of emotions


In this paper the authors perform emotion recognition using a CNN-based appearance classifier on the CK+ and TFD datasets. They train a zero-bias CNN and achieve better performance than state-of-the-art techniques. They also qualitatively analyse the network by visualising the spatial patterns that excite neurons in the convolutional layers.


One of the biggest challenges in deep learning is training a network without annotations: a considerable amount of time and energy is spent annotating the dataset for supervised learning, whereas in real-world scenarios unsupervised learning is always preferable. The authors try to answer the question: can what CNNs learn internally be used to improve the performance of emotion recognition? The standard facial expression databases CK+ and TFD are used to train the network, and the discriminative spatial patterns of the filters in the convolutional layers are analysed visually by projecting them into pixel space with a deconvolutional network. The excited neurons in the filters correspond to Facial Action Units (FAUs).

Related Work

The authors refer to similar work done using different methods, where both appearance and geometric features are used for emotion recognition.


Architecture Workflow

The authors use a classic feed-forward CNN architecture: a first convolutional layer with 64 filters of size 5x5, a second convolutional layer with 128 5x5 filters followed by max pooling, and a third convolutional layer with 256 5x5 filters followed by quadrant pooling. This is followed by a fully connected layer of 300 units with dropout probability 0.5, and finally a softmax over 6–8 outputs, depending on the number of expressions in the training set.

Although a classic feed-forward network is used, the biases in the convolutional layers are removed ("zero-bias" convolutions), which gives better results, reduces the number of parameters, and trains quicker.
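The layer sizes above can be sketched in PyTorch as follows. This is a minimal sketch based on the summary, not the authors' code: the pooling kernel sizes and the use of adaptive max pooling to approximate quadrant pooling are assumptions.

```python
import torch
import torch.nn as nn

class ZeroBiasCNN(nn.Module):
    """Sketch of the summary's architecture with bias-free convolutions."""
    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 5, bias=False),    # conv1: 64 filters, 5x5, zero-bias
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 5, bias=False),  # conv2: 128 filters, 5x5
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 5, bias=False), # conv3: 256 filters, 5x5
            nn.ReLU(),
            nn.AdaptiveMaxPool2d(2),            # 2x2 output: one max per quadrant
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 2 * 2, 300),        # fully connected layer of 300 units
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(300, num_classes),        # softmax is applied inside the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = ZeroBiasCNN(num_classes=7)
out = model(torch.randn(1, 1, 96, 96))  # one 96x96 grayscale image
print(out.shape)
```

Dropping the `bias=True` default on each `Conv2d` is all that is needed to get the "zero-bias" behaviour described above.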

Network Training

The network is trained using a stochastic gradient descent optimizer with a batch size of 64, momentum of 0.9, weight decay of 1e-5 and a constant learning rate of 0.001. The authors initialize the parameters in each layer from a zero-mean distribution with variance in the range [0.2, 1.2].
To combat the problem of overfitting, dropout and data augmentation (rotations, translations, horizontal flips and pixel intensity variations) are applied to the training set.
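The training setup above might be sketched as follows. The linear model is a stand-in for the CNN, and the augmentation only illustrates two of the listed transforms; the flip probability and the jitter strength are assumptions, not values from the paper.

```python
import torch

def augment(img):
    # Data-augmentation sketch: horizontal flip and pixel-intensity jitter.
    # Rotations and translations would be added similarly; magnitudes here
    # are illustrative assumptions.
    if torch.rand(1).item() < 0.5:
        img = torch.flip(img, dims=[-1])                        # horizontal flip
    img = img * (1.0 + 0.2 * (torch.rand(1).item() - 0.5))      # intensity variation
    return img

model = torch.nn.Linear(96 * 96, 7)  # stand-in for the CNN
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.001,           # constant learning rate
    momentum=0.9,
    weight_decay=1e-5,  # the "decay" mentioned above
)
batch = torch.stack([augment(torch.rand(96, 96)) for _ in range(64)])  # batch size 64
print(batch.shape)
```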

Experiments and Analysis

To evaluate the performance of the model, the authors selected the CK+ and TFD datasets with 7 expression labels; each dataset is divided into 10 folds for the training, validation and test sets.
The grayscale images are resized to 96x96 for training, with patch-wise mean subtraction and scaling to unit variance as preprocessing.
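The preprocessing step can be sketched in NumPy as follows. This applies a per-image normalisation with a nearest-neighbour resize; the paper's exact patch-wise scheme and resampling method may differ.

```python
import numpy as np

def resize_nearest(img, size=96):
    # Nearest-neighbour resize to size x size; in practice a library
    # resize (e.g. with interpolation) would be used.
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows[:, None], cols]

def normalise(img, eps=1e-8):
    # Subtract the mean and scale to unit variance (done per image here;
    # the paper's patch-wise variant is approximated).
    img = img.astype(np.float32)
    return (img - img.mean()) / (img.std() + eps)

x = normalise(resize_nearest(np.random.rand(128, 128) * 255))
print(x.shape, float(x.mean()), float(x.std()))
```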

The authors performed analysis on the TFD and CK+ datasets with experiments such as: zero-bias CNN with random initialization, zero-bias CNN with dropout, zero-bias CNN with augmentation, and zero-bias CNN with both dropout and augmentation. The above table shows the results of these experiments for the two datasets.

From the experimentation there are two major observations:
1. Using regularization in the network boosts its performance.
2. Using both dropout and augmentation increases performance even further compared to using only dropout or only augmentation.

Visualisation of higher-level neurons

The authors employ a visualization technique to analyse which facial regions the network uses to perform classification. From the third convolutional layer they choose the filters that give the maximum-magnitude response on the training set, set that activation high and the rest to zero.
The "Guided Backpropagation" method is used to refine the reconstructions of the spatial patterns: during the backward pass, gradients are masked using both the forward ReLU activations and the sign of the gradients themselves, so that negative gradients are suppressed.
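A minimal guided-backpropagation sketch with PyTorch backward hooks is shown below; the two-layer network is a hypothetical stand-in for the authors' model, not their actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical small network standing in for the paper's CNN.
model = nn.Sequential(
    nn.Conv2d(1, 8, 5, bias=False), nn.ReLU(),
    nn.Conv2d(8, 16, 5, bias=False), nn.ReLU(),
)

for m in model.modules():
    if isinstance(m, nn.ReLU):
        # ReLU's backward already zeroes gradients where the forward input
        # was negative; guided backprop additionally zeroes negative
        # incoming gradients, which clamping achieves.
        m.register_full_backward_hook(
            lambda mod, gin, gout: (torch.clamp(gin[0], min=0.0),)
        )

img = torch.randn(1, 1, 96, 96, requires_grad=True)
acts = model(img)
# Pick the filter with the maximum-magnitude response and backprop only
# from its strongest activation (the rest implicitly stay zero).
k = acts.sum(dim=(0, 2, 3)).argmax()
score = acts[0, k].max()
score.backward()
saliency = img.grad[0, 0]  # reconstruction of the spatial pattern in pixel space
print(saliency.shape)
```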

Visualization of facial regions that activate five selected filters in the 3rd convolutional layer of a network trained on the Extended Cohn-Kanade (CK+) dataset. Each row corresponds to one filter in the conv3 layer, and the spatial patterns from the top 5 images are displayed.

Finding Correspondences Between Filter Activations and the Ground Truth Facial Action Units (FAUs)

The authors validate the model on the CK+ dataset, which provides FAU annotations as ground truth. For each filter in the third convolutional layer they compute the KL divergence against the given FAUs, which shows a strong correspondence between activations and the AUs seen in the visualisations. This shows that certain neurons in the network implicitly learn to detect specific FAUs in face images when given only a relatively "loose" supervisory signal (the expression labels), confirming that CNNs can be used as appearance-based classifiers.
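A discrete KL-divergence comparison between a filter's response pattern and an AU's presence pattern can be sketched as follows; the numbers are made-up illustrative values, not data from the paper. A low divergence would indicate a strong filter-to-AU correspondence.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(P || Q) for discrete distributions; eps avoids log(0).
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical example: a filter's response per image vs. the ground-truth
# presence of one AU in the same images.
filter_acts = [0.9, 0.1, 0.8, 0.05]
au_present  = [1.0, 0.0, 1.0, 0.0]
print(kl_divergence(au_present, filter_acts))
```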


The authors have qualitatively shown, by visualising the spatial patterns, that the maximally activated filters of the trained network can discriminatively classify and correlate to the FAUs; and by comparing the ground-truth FAUs with the number of spatial activations they have quantitatively shown that this method works, even beating the state-of-the-art methods for emotion recognition.


  1. M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan. Fully automatic facial action recognition in spontaneous behavior. In FGR, pages 223–230.
  2. M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan. Recognizing facial expression: machine learning and application to spontaneous behavior. In CVPR, pages 568–573, 2005.
  3. M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.
  4. A. Dhall, R. Goecke, J. Joshi, K. Sikka, and T. Gedeon. Emotion recognition in the wild challenge 2014: Baseline, data and protocol. In 16th ACM International Conference on Multimodal Interaction. ACM, 2014.

Final words

If you find any errors, feel free to reach out or mail me at abhigoku10@gmail.com