Facial Emotion Recognition using Convolutional Bidirectional LSTM

Gaurav Sharma · Published in Analytics Vidhya · 7 min read · Sep 19, 2020

This is the second part of the Facial Emotion Recognition series; it is recommended to read the first part before jumping in here.

This story will walk you through creating a FER model using time-distributed convolutional layers followed by bidirectional LSTMs in TensorFlow-Keras. I already covered the introduction to and applications of this topic in the first part.

This story is divided into the following sections:

  • Why we need this approach
  • Inspecting and manipulating the data
  • Creating our own custom FER model from scratch

Why do we need this approach?

So, if you have gone through the first part, you may ask: what is the need for this new approach? The answer is simple and intuitive. As I mentioned at the end of part 1, to apply the simple CNN-based model to video I had to feed the video frame by frame, i.e., feed one frame at a time, predict the emotion on that frame, and keep doing this for every frame (possibly skipping 4–5 frames in between, since that increases the fps while keeping the performance almost the same).

So, is there any way we can feed a small 2–3 second clip (60–90 frames for a 30 fps video) at once? Yes, we can, by time-distributing the CNNs. But we can do even more: just as the words of a sentence share context with each other, the frames of a video share context with adjacent frames. Our model can also learn from this context, and for that we can use bidirectional LSTMs.

Note: To get the most out of this story, you should have a basic understanding of Python and of neural networks, specifically CNNs and LSTMs.

Inspecting and Manipulating the data

As we all know, adding intelligence to machines is mostly about letting them learn from data via some algorithm, and of course for that we need DATA. Data is the most important part of any machine learning / deep learning project, because after all, our trained model is a product of the data it is trained on. The better our data represents the real world, the better our model will behave and perform in the real world. Remember one thing: “GARBAGE IN, GARBAGE OUT”. If we train on data containing lots of garbage, then in production our model will also throw garbage. So, DATA is the most important building block for any ML/DL task.

So, we need data for this FER task as well. We will train our model on it and then test its performance on held-out data and also on a real-time video stream. Note that this is a supervised learning problem, i.e., the model learns to predict labels y as a function of the data x.

For this task I am going to use a very popular dataset available on Kaggle called the CK+48 dataset; each image is grey-scale with a resolution of 48x48. You can use other datasets as well, as a few more are publicly available, or you can create your own.

We will get more insights into this data as we proceed, so stay tuned…

Now we will get our hands wet with some real Python code. First, import all the needed libraries.
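Something like the following should cover the essentials (the exact imports in the original notebook may differ slightly):

```python
import os
from collections import defaultdict

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, callbacks
```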

Let’s inspect the data; we will check the number of emotion categories we have and the number of images in each of those categories.
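A minimal sketch of that inspection, assuming the CK+48 folders are extracted under a directory I call data_dir here (one sub-directory per emotion):

```python
# Count the images per emotion class.
# `data_dir` is an assumed path; point it at wherever you extracted CK+48.
data_dir = "CK+48"

for emotion in sorted(os.listdir(data_dir)):
    emotion_dir = os.path.join(data_dir, emotion)
    if os.path.isdir(emotion_dir):
        print(f"{emotion:10s} {len(os.listdir(emotion_dir))} images")
```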

Sadness and fear have very few images compared to the other classes.

TOP_EMOTIONS = ["happy", "surprise", "anger", "sadness", "fear"]

Unlike the dataset used in the first part, this CK+48 dataset contains directories of images, and the directories are named after the emotion they contain. Within each directory we have a bunch of images for that particular emotion class.

One interesting thing about this data is that the original data consists of video clips, but the dataset we are using contains only the last 3 frames of each clip. Since the upcoming model won’t be fed single images (like in part 1) but rather tiny video clips, i.e., collections of frames, we need to bundle the 3 frames associated with each video clip into a single short 3-frame clip (this clip is very short, only about a tenth of a second, since a single second contains 30 frames in a 30 fps video). Then we will feed these 3-frame videos into our model.

Below are the code snippets that make the data compatible with our upcoming model. I will explain each of them.

Here we create a default-dict called data whose keys are the emotion names and whose values are themselves dicts. The keys of each nested dict are the video ids from which the last three frames were extracted, and the values are lists containing the names of those last three frames. Have a look at the snippet below; it should make this clearer.
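Here is a sketch of how such a nested structure can be built. The filename parsing (treating everything before the last underscore as the clip id) is my assumption about the CK+48 naming scheme, not necessarily the notebook’s exact code:

```python
# Group the frame filenames by emotion and by source video clip.
# CK+48 filenames look roughly like "S010_004_00000017.png"; here I assume
# everything before the last underscore identifies the source clip.
data = defaultdict(dict)

for emotion in TOP_EMOTIONS:
    emotion_dir = os.path.join(data_dir, emotion)
    for fname in sorted(os.listdir(emotion_dir)):
        video_id = fname.rsplit("_", 1)[0]              # e.g. "S010_004"
        data[emotion].setdefault(video_id, []).append(fname)
```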

We will also need some helper functions to get the work done.
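For example, a small helper (my own illustrative load_frame, not necessarily the notebook’s helper) to read one grey-scale frame as a NumPy array:

```python
def load_frame(emotion, fname):
    """Load one 48x48 grey-scale frame as a NumPy array (values 0-255)."""
    path = os.path.join(data_dir, emotion, fname)
    img = tf.keras.preprocessing.image.load_img(
        path, color_mode="grayscale", target_size=(48, 48))
    return np.array(img)
```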

Here we first iterate over the keys of the dict data, and within each emotion class we stack the 3 frames associated with each video clip. This gives us something like the following,
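A sketch of that loop, reusing the data dict and the load_frame helper from above:

```python
# Stack the 3 frames of every clip into a single (3, 48, 48) array.
clips, labels = [], []
for label, emotion in enumerate(TOP_EMOTIONS):
    for video_id, fnames in data[emotion].items():
        frames = [load_frame(emotion, f) for f in sorted(fnames)]
        clips.append(np.stack(frames))                  # shape (3, 48, 48)
        labels.append(label)
```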

Now we are almost done; all we need is some stacking and reshaping, which is done below.
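A sketch of that final stacking and reshaping step:

```python
# Final arrays: clips of shape (num_clips, 3, 48, 48, 1) and one-hot labels.
X = np.stack(clips).astype("float32")[..., np.newaxis]  # add the channel dimension
y = tf.keras.utils.to_categorical(labels, num_classes=len(TOP_EMOTIONS))
print(X.shape, y.shape)
```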

Now, we split the data into training and validation sets. We will train on the training data and validate our model on the validation data.
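A sketch of the split; the 80/20 ratio and random seed are illustrative choices:

```python
# Stratify on the class labels so both sets keep the same class balance.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=labels, random_state=42)
```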

label_emotion_mapper is the mapping from the original class labels to the emotion names.
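A minimal version of that mapping:

```python
label_emotion_mapper = {label: emotion for label, emotion in enumerate(TOP_EMOTIONS)}
# {0: 'happy', 1: 'surprise', 2: 'anger', 3: 'sadness', 4: 'fear'}
```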

Let’s visualize the images of each emotion category.
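One way to do this with matplotlib, showing the first frame of one training clip per class:

```python
# Show the first training clip of every emotion class (first frame of the clip).
fig, axes = plt.subplots(1, len(TOP_EMOTIONS), figsize=(12, 3))
train_labels = np.argmax(y_train, axis=1)
for ax, (label, emotion) in zip(axes, label_emotion_mapper.items()):
    idx = np.argmax(train_labels == label)   # index of the first clip of this class
    ax.imshow(X_train[idx, 0, :, :, 0], cmap="gray")
    ax.set_title(emotion)
    ax.axis("off")
plt.show()
```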


Creating our own custom FER Model

We first need to normalize the image arrays; this is done because neural networks are highly sensitive to non-normalized data. We will use min-max normalization.

For these grey-scale images min = 0 and max = 255, so min-max normalization, x' = (x − min) / (max − min), reduces to simply dividing the array by 255.
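In code this is just a division:

```python
# Min-max normalization: for 8-bit grey-scale images this is a division by 255.
X_train = X_train / 255.0
X_valid = X_valid / 255.0
```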

Below is a Convolutional Neural Network (CNN); I used the following settings (see the sketch after this list):

  • For generalization purposes, dropouts are used at regular intervals.
  • ELU is used as the activation function because it avoids the dying-ReLU problem and also performed better than LeakyReLU, at least in this case.
  • he_normal is used as the kernel initializer as it suits ELU.
  • Batch Normalization is also used for better results.
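Below is a sketch of such a frame-level CNN; the exact filter counts, dropout rates, and number of blocks are illustrative, not necessarily the ones used in the original notebook:

```python
def build_cnn(input_shape=(48, 48, 1)):
    """Frame-level feature extractor following the settings listed above (a sketch)."""
    model = models.Sequential(name="frame_cnn")
    model.add(layers.Conv2D(32, (3, 3), padding="same",
                            kernel_initializer="he_normal",
                            input_shape=input_shape))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("elu"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.3))

    for filters in (64, 128):
        model.add(layers.Conv2D(filters, (3, 3), padding="same",
                                kernel_initializer="he_normal"))
        model.add(layers.BatchNormalization())
        model.add(layers.Activation("elu"))
        model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.Dropout(0.3))

    model.add(layers.Flatten())
    return model
```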

Now, we will time-distribute the above CNN model, then stack a few bidirectional LSTMs, and finally stack a few dense layers at the end. Below is the function for that.
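A sketch of that function, wrapping the CNN in TimeDistributed and stacking two bidirectional LSTMs (the layer widths here are my own illustrative choices):

```python
def build_model(num_classes=len(TOP_EMOTIONS), frames=3):
    """TimeDistributed CNN -> bidirectional LSTMs -> dense head (a sketch)."""
    cnn = build_cnn()
    model = models.Sequential(name="conv_bilstm_fer")
    model.add(layers.TimeDistributed(cnn, input_shape=(frames, 48, 48, 1)))
    model.add(layers.Bidirectional(layers.LSTM(64, return_sequences=True)))
    model.add(layers.Bidirectional(layers.LSTM(64)))
    model.add(layers.Dense(128, activation="elu", kernel_initializer="he_normal"))
    model.add(layers.Dropout(0.4))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

model = build_model()
model.summary()
```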

Unlike the first part, here I used just one callback, ReduceLROnPlateau, which reduces the learning rate whenever the validation accuracy plateaus. A batch size of 32 is used, and the model is trained for 100 epochs.

Now you will see the difference in the input shape: unlike the 4-dimensional array in part 1, we now have a 5-dimensional array (the extra dimension holds the frames of each clip). We then compile the model.
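A sketch of the callback and compile step; the optimizer, learning rate, and ReduceLROnPlateau settings shown here are assumptions, not necessarily the original values:

```python
# Halve the learning rate whenever validation accuracy stops improving (assumed settings).
reduce_lr = callbacks.ReduceLROnPlateau(monitor="val_accuracy", factor=0.5,
                                        patience=5, min_lr=1e-6, verbose=1)

model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```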

Let’s now train the model and log the training performance.
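The training call then looks roughly like this:

```python
history = model.fit(X_train, y_train,
                    validation_data=(X_valid, y_valid),
                    batch_size=32,
                    epochs=100,
                    callbacks=[reduce_lr])
```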

Let’s plot the training and validation metrics,
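A simple matplotlib sketch of those plots:

```python
# Accuracy and loss curves for the training and validation sets.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(history.history["accuracy"], label="train")
ax1.plot(history.history["val_accuracy"], label="validation")
ax1.set_title("Accuracy")
ax1.legend()
ax2.plot(history.history["loss"], label="train")
ax2.plot(history.history["val_loss"], label="validation")
ax2.set_title("Loss")
ax2.legend()
plt.show()
```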


The epoch history shows that accuracy gradually increases, reaching over 86% on both the training and validation sets. Also, ReduceLROnPlateau kicks in whenever the validation accuracy plateaus.

Note: The fluctuations in the epoch metrics are due to the fact that we have very little data for such a complex task.

We will now visualize what is called a confusion matrix; it is one of the most widely used evaluation tools for multi-class classification. It gives us a good glance at the model’s performance across all classes.
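One way to compute and plot it with scikit-learn:

```python
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Predicted vs. true class indices on the validation set.
y_pred = np.argmax(model.predict(X_valid), axis=1)
y_true = np.argmax(y_valid, axis=1)

cm = confusion_matrix(y_true, y_pred)
ConfusionMatrixDisplay(cm, display_labels=TOP_EMOTIONS).plot(cmap="Blues")
plt.show()
```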


If we had more data to train on, we would get a better and more generalized model.

Now, let’s visualize some predictions.

Note: Here t is the true label and p is the predicted label.
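A sketch of such a visualization, reusing y_true and y_pred from the confusion-matrix step:

```python
# Show a few random validation clips with their true (t) and predicted (p) emotions.
sample = np.random.choice(len(X_valid), size=6, replace=False)
fig, axes = plt.subplots(1, 6, figsize=(14, 3))
for ax, i in zip(axes, sample):
    ax.imshow(X_valid[i, 0, :, :, 0], cmap="gray")
    ax.set_title(f"t: {label_emotion_mapper[y_true[i]]}\np: {label_emotion_mapper[y_pred[i]]}")
    ax.axis("off")
plt.show()
```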


Now, what next? Should we stop here? No, not at all. The purpose of any model is not just to be trained and validated but to be tested and used in the real world. I went much further with this project, trying many different models and more emotion classes. In the end I integrated my model with OpenCV and tested it on videos.

Here is a 2-minute demo video showing the power of our model; in it I used many emotions and added some cool annotations as well.

Here is the full project hosted on GitHub.

You can get the entire Jupyter notebook for this story from here; you just need to fork it. Also, if you like the notebook then upvote it; it motivates me to create further quality content.

If you like this story then do clap and share it with others.

Also, have a read of my other stories, which cover a variety of topics including,

and many more.

Thank you once again for reading my stories, my friends :)
