Facial Expression Recognition on FIFA videos using Deep Learning: World Cup Edition

Saurabh Charde · Published in AI Enigma · May 30, 2019 · 7 min read

Few hearts were broken, few still live. No matter who wins, the game will still make me thrill.

Introduction

The FIFA World Cup 2018 has been one of the highest-scoring World Cups in history. No matter which country is playing, the moment those 11 players step onto the field, people connect with them emotionally. While watching them, we share their joy, fear, and excitement through the expressions they convey. So in this article, we are going to use Deep Learning and Image Processing to detect some of those cool expressions we witnessed during the FIFA World Cup 2018.

Don’t worry if you aren’t familiar with any of the terms mentioned above; I have tried my best to keep this article simple, yet informative. So just keep reading!!

What’s in this article?

Let me walk you through the structure of this article.

1. First, we are going to use face detection to locate the faces of players in FIFA videos.

2. Use Deep Learning (specifically Convolutional Neural Networks) to train our model on a facial expression dataset.

3. Finally, we are going to test it on FIFA videos by first detecting the faces and then feeding them to the Deep Learning model to detect the expression portrayed by the player.

This is going to be pretty exciting because it’s none other than Cristiano Ronaldo who is going to be the primary test subject for the rest of this article. Hope you all are ready. So let’s get started!!

Face Detection

In order to recognize facial expressions, we first need to detect the faces in an image. Face detection involves locating the face of a person in a given image. It is used in applications like Facebook to automatically tag your photos, in face-filter apps to enhance your photos, and in augmented reality applications to put cool shades on your face.

There are many ways to detect faces in an image, but we are going to use the popular Haar cascade classifiers, which are machine learning models trained to detect a specific object in an image (like a face, eyes, or lips).

Haar Classifier (overview)

If we look at the structure of a human face, there are some features common to each of them. For example, the region of the eyes is darker than the cheeks, the top of the nose is brighter than the regions on either side of it, the region of the lips is brighter than the region just below them, and so on.

A Haar classifier works on similar principles. It has a set of features that it tries to detect in an image. If the image contains those features, the image is judged to contain a face; if the features are not found, then no face is present in the image.

Haar Classifier (working)

Credits: http://www.sra.vjti.info

A Haar classifier is trained on a large number of images containing faces (positive examples) and non-faces (negative examples). There are several types of features it looks for in an image (line, edge, and center-surround features). A sliding window passes over the entire image, and the features are applied to the area under the window. For each feature, the classifier subtracts the sum of pixels under the white region from the sum of pixels under the black region. If the resulting value is above a threshold, the feature is said to be detected.
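To make the rectangle-sum computation concrete, here is a small sketch using an integral image, which is how these sums are computed cheaply in practice (the image and the feature coordinates here are made up for illustration):

```python
import numpy as np

# Toy 48x48 grayscale image (values 0-255)
img = np.random.randint(0, 256, (48, 48)).astype(np.int64)

# Integral image: entry (y, x) holds the sum of all pixels above and to the
# left, so the sum over any rectangle needs only four lookups.
ii = img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y), width w, height h."""
    total = ii[y + h - 1, x + w - 1]
    if x > 0:
        total -= ii[y + h - 1, x - 1]
    if y > 0:
        total -= ii[y - 1, x + w - 1]
    if x > 0 and y > 0:
        total += ii[y - 1, x - 1]
    return total

# A two-rectangle "edge" feature: bright region on top, dark region below
white = rect_sum(ii, x=10, y=10, w=12, h=6)
black = rect_sum(ii, x=10, y=16, w=12, h=6)
feature_value = black - white  # compared against a learned threshold
```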

In order to select the most relevant features from the full set, an optimization algorithm called AdaBoost is used during training. AdaBoost keeps only those features that improve the accuracy of face detection. The selected features are further grouped into stages (a cascade) to reduce the time required for detection. For example, if the features at stage 0 are not detected, the window is discarded without evaluating the features at later stages (hence reducing the time), as the toy sketch below shows.
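Here is a toy sketch of that early-rejection logic (the stage structure and names are illustrative, not OpenCV’s internals):

```python
def cascade_passes(window, stages):
    """Evaluate a window against a cascade of boosted stages.

    `stages` is a list of (stage_threshold, weak_classifiers) pairs, where
    each weak classifier is a (feature_fn, feature_threshold, weight) triple.
    """
    for stage_threshold, weak_classifiers in stages:
        score = sum(weight
                    for feature_fn, feature_threshold, weight in weak_classifiers
                    if feature_fn(window) > feature_threshold)
        if score < stage_threshold:
            return False  # rejected early; later stages are never evaluated
    return True  # survived every stage: the window likely contains a face
```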

Using Haar Classifier for face detection

We are going to use OpenCV, a popular open-source library of well-known functions required for image processing (like image resizing, background subtraction, edge detection, object detection, etc.). OpenCV contains a pre-trained Haar cascade classifier for face detection, stored as an XML file (haarcascade_frontalface_default.xml). Basically, this file contains information about the different features to be tested on an image, along with their threshold values.

The code for detecting faces in an image using a Haar cascade classifier is as follows:
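(A minimal sketch using OpenCV’s Python bindings; the filenames player.jpg and detected.jpg are placeholders.)

```python
import cv2

# Load the pre-trained frontal-face Haar cascade that ships with OpenCV
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

# Read the image and convert it to grayscale (the cascade works on intensity)
img = cv2.imread("player.jpg")  # placeholder filename
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Detect faces; each detection is an (x, y, w, h) bounding box
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)

# Draw a rectangle around every detected face
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)

cv2.imwrite("detected.jpg", img)
```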

Here, faces is a list of bounding boxes (x, y, width, height) giving the location of each face inside the image.

So finally we have reached the end of the first section. In this section, we learned the basics of face detection and a very popular way of detecting faces using Haar classifiers. Now that we are done detecting faces, let’s move on to the next section, which is going to be quite interesting because we are going to build our own Deep Learning model for detecting expressions in our images.

So, ready to dive in deep? ’Cause I am!!!

Deep Learning

Deep Learning involves using a large number of neural network layers in order to learn complex features from the given data. There are different types of neural network models used for various purposes, like weather prediction, image classification, speech recognition, natural language understanding, etc.

In our case, we will be using Convolutional Neural Networks (CNNs), which are known to perform well on tasks involving image data.

Data Set

In order for our deep learning model to detect expressions, we first need to train it on a facial expression dataset. The dataset used for this purpose is the fer2013 dataset, which was hosted on Kaggle as part of the Facial Expression Recognition Challenge. The data consists of 48x48-pixel grayscale images of faces. The faces have been automatically registered so that each face is more or less centered and occupies about the same amount of space in every image. The dataset covers seven expression categories (0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral). The training set consists of 28,709 images and the test set of 3,589 images.

Fig: Example images from the dataset
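To give a concrete idea of how this data can be loaded, here is a rough sketch (the fer2013.csv filename and column names follow the Kaggle release; treat it as illustrative):

```python
import numpy as np
import pandas as pd

# fer2013.csv (from the Kaggle challenge) has three columns:
# emotion (0-6), pixels (2304 space-separated grayscale values), and Usage
data = pd.read_csv("fer2013.csv")

# Turn each pixel string into a 48x48 image, scaled to [0, 1]
images = np.stack([np.array(p.split(), dtype=np.float32).reshape(48, 48, 1)
                   for p in data["pixels"]]) / 255.0
labels = data["emotion"].values

# The Usage column separates the training split from the test splits
train_mask = (data["Usage"] == "Training").values
train_images, train_labels = images[train_mask], labels[train_mask]
test_images, test_labels = images[~train_mask], labels[~train_mask]
```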

Convolutional Neural Networks

Have you ever wondered how you identify your dad in a whole crowd of people around you? It’s because your brain has stored certain of his features, like his hair, height, body shape, nose, eyes, etc., which help you distinguish him from other people. The part of the brain that does this visual processing is called the visual cortex.

The visual cortex receives visual information coming from your eyes, i.e. the things you see (here, your dad), and processes it to understand what you are looking at.

Now, a convolutional neural network works somewhat like the visual cortex. It receives an input image (an array of pixel values) and tries to detect features in the image in order to identify what the image is about.

The convolution layers are sets of kernels (or feature detectors) used to extract features from the image. Examples of features in our case could be whether the person’s teeth are visible, whether their lips are raised, and so on. These are only a few examples; in reality, a CNN can detect a large number of features that may help determine the emotion of a person.

This is just an intuitive explanation of how a CNN works; there is a lot more going on in the background. I have not covered the math behind CNNs, as it would complicate things and is beyond the scope of this article.

The code for building the CNN model is shown below:
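(A compact Keras sketch; the layer sizes here are illustrative rather than the exact architecture from my repo. It reuses train_images and test_images from the loading sketch above.)

```python
from tensorflow.keras import layers, models

# A small CNN: stacked convolution/pooling blocks followed by dense layers
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(48, 48, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),                    # regularization against overfitting
    layers.Dense(7, activation="softmax"),  # one output per expression class
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(train_images, train_labels,
          epochs=30, batch_size=64,
          validation_data=(test_images, test_labels))
```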

Results

Once the model is trained on the dataset, it’s time to test it on some real stuff. I have chosen a video featuring Ronaldo, which will be used for testing the model.
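Putting the pieces together, a minimal inference loop over video frames might look like this (it reuses face_cascade and model from the earlier sketches; the video filename is a placeholder):

```python
import cv2
import numpy as np

EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

cap = cv2.VideoCapture("ronaldo_clip.mp4")  # placeholder video filename
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
        # Crop, resize to 48x48, and scale to [0, 1], matching the training data
        face = cv2.resize(gray[y:y + h, x:x + w], (48, 48)) / 255.0
        probs = model.predict(face.reshape(1, 48, 48, 1))
        label = EMOTIONS[int(np.argmax(probs))]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, label, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    cv2.imshow("expressions", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```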

Fig: Results obtained from the trained model

Conclusion

Hope you have understood how a modern emotion recognition engine works, and now you can build one of your own. The techniques explained in this article for face detection and emotion recognition are used in many places today. Understanding how this stuff works will enable you to explore and understand more interesting things built using it, like automatic tagging systems (the kind Facebook uses), object detection, style transfer, etc.

If you liked the article, please spare a moment to give it a clap, and if possible, do share it with your friends and colleagues. I have attached the link to my GitHub repo containing code to reproduce the model. Do have a look at it; I hope you find it useful.
