Facial Emotion Expressions

Image Classifier for Facial Emotion Expressions

Mar 27, 2023

GitHub : https://github.com/G-G-Thorat/DM_Assn_1/blob/main/facial-expression.ipynb

Homepage : https://g-g-thorat.github.io/

Goal

The primary goal of this blog post is to guide you through the process of constructing an image classifier for a facial emotion dataset [1]. The dataset, which can be found on Kaggle under the name Facial Emotion Expressions, consists of over 35000 face images across seven categories: angry, sad, happy, neutral, disgust, fear and surprise. Our image classifier will be built using a Convolutional Neural Network (CNN).
I have already built this classifier and shared it in my Kaggle notebook called facial_emotion_notebook.
So, let’s proceed with the tutorial on how to build this image classifier.

Preface:

As mentioned in reference [5], computer vision is a rapidly growing field within the information technology sector that focuses on teaching machines how to interpret images and videos. This technology is essential for the development of various applications such as robotics, self-driving vehicles, and facial recognition systems. At the core of computer vision lies image recognition, which involves identifying the category to which an image belongs.

With that in mind, let’s explore some related terminologies in greater detail.

Image Classification:

Reference [4] defines image classification as the process of categorizing or assigning labels to entire images, with the assumption that each image belongs to just one class. To achieve this, images are inputted into classification models, which predict the specific category to which an image belongs.

Well, here’s an example to illustrate the process of image classification: Suppose we have a dataset consisting of various images of animals, including dogs, cats, and birds. The goal of image classification is to create a model that can accurately predict the class of each image. To train the model, we would input the images along with their corresponding labels (i.e., the animal species) into the algorithm.

During the training process, the model learns to recognize the distinguishing features of each animal class by analyzing the input images. Once the training is complete, we can use the model to predict the class of new, unseen images by feeding them into the algorithm.

For instance, if we input a picture of a cat into the model, it should accurately predict that the image belongs to the cat class. Similarly, if we input a picture of a bird, the model should predict that the image belongs to the bird class.

In order to classify our images, we will construct a model using a Convolutional Neural Network (CNN) architecture. Before we proceed, let’s explore the definition of a CNN.

What does Convolutional Neural Network (CNN) mean?

Convolutional neural networks (CNN) are a type of artificial neural network (ANN) used most frequently in deep learning to interpret visual data as stated in [8].

A CNN has three main types of layers: the convolution layer, the pooling layer and the fully connected layer, as stated in [10].

1. Convolution Layer:

The neurons within a convolutional layer execute the convolution operation on the inputs they are given. The usual hyperparameters associated with a convolutional layer are the filter size and the stride.
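To make the convolution operation and the stride concrete, here is a minimal pure-Python sketch. The helper name conv2d, the 6x6 input and the 3x3 filter are my own illustrative assumptions, not code from the notebook, and no padding is applied.

```python
def conv2d(image, kernel, stride=1):
    """Unpadded 2D cross-correlation, the operation CNN layers call convolution."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            # Multiply the filter with the patch under it and sum the result.
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 6x6 "image" whose values grow from left to right.
image = [[float(6 * r + c) for c in range(6)] for r in range(6)]
kernel = [[1.0, 0.0, -1.0]] * 3  # simple hand-written vertical-edge filter

out_stride1 = conv2d(image, kernel, stride=1)
out_stride2 = conv2d(image, kernel, stride=2)
print(len(out_stride1), len(out_stride1[0]))  # 4 4
print(len(out_stride2), len(out_stride2[0]))  # 2 2
```

A larger stride makes the filter window jump further between positions, so the output gets smaller.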

2. Pooling Layer:

By implementing pooling layers, it is possible to decrease the input size, which results in faster processing and analysis of the data. Typically, convolutional layers are followed by pooling layers, which reduce the spatial dimensions (width and height) of the input and therefore the computational requirements. The hyperparameters associated with a pooling layer are the stride, the pooling type (max or average) and the filter size.
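The effect of the three pooling hyperparameters can be sketched in a few lines of pure Python. The function pool2d and the 4x4 feature map are hypothetical examples of mine, not taken from the notebook.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Slide a size x size window over the map and keep the max or the average."""
    reduce = max if mode == "max" else (lambda values: sum(values) / len(values))
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                feature_map[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 8],
    [1, 1, 7, 5],
]
print(pool2d(feature_map, mode="max"))      # [[6, 2], [2, 9]]
print(pool2d(feature_map, mode="average"))  # [[3.5, 1.0], [1.0, 7.25]]
```

With a 2x2 window and a stride of 2, the 4x4 input shrinks to 2x2, which is where the savings in computation come from.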

3. Fully Connected Layer:

Fully connected layers are named as such because they connect each neuron in one layer to every neuron in the next layer. In these layers, every input value contributes to every output value, resulting in complete interconnectivity between the two layers.
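As a rough sketch of what "every neuron connects to every neuron" means, the snippet below computes a fully connected layer by hand. The function dense and all the numbers are made up for illustration, and no activation function is applied.

```python
def dense(inputs, weights, biases):
    """weights[i][j] is the connection from input i to output neuron j."""
    return [
        sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        for j in range(len(biases))
    ]


inputs = [1.0, 2.0, 3.0]                         # e.g. flattened features
weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # 3 inputs x 2 output neurons
biases = [0.0, 1.0]

outputs = dense(inputs, weights, biases)
print([round(value, 2) for value in outputs])    # [2.2, 3.8]

# One weight per (input, output) pair plus one bias per output neuron.
param_count = len(inputs) * len(biases) + len(biases)
print(param_count)                               # 8
```

The parameter count grows with the product of the two layer sizes, which is why these layers become expensive on large inputs.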

Also, what is overfitting?

Overfitting occurs when a model fits its training dataset so closely that it starts to capture the noise and details specific to that data, which can lead to a decline in the model’s performance when presented with new data. In other words, the model learns not only the underlying patterns in the data but also the noise and fluctuations specific to the training dataset. As a result, the model may become too complex and fail to generalize well to new data.

Consider a scenario where a machine learning model is trained to tell cats from dogs using a dataset of images. If the dataset is small and the model is too complex, the model may memorize the images in the training dataset, including noise and irrelevant features specific to that dataset. As a result, it may perform very well on the training data but fail to generalize to new images of cats and dogs. This is because the model has not learned the underlying features that distinguish cats from dogs, but has instead learned features present only in the training dataset.
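The memorization effect can be mimicked with a deliberately extreme toy example. Everything here is hypothetical: the data is random, the label depends only on the first feature, and the "memorizing model" is just a lookup table rather than a neural network.

```python
import random

random.seed(0)


def make_example():
    """Toy data: the label depends only on the first feature, the rest is noise."""
    signal = random.random()
    noise = tuple(random.random() for _ in range(3))
    label = "dog" if signal > 0.5 else "cat"
    return (signal, *noise), label


train = [make_example() for _ in range(20)]
test = [make_example() for _ in range(200)]

memorized = dict(train)


def memorizing_model(features):
    # Perfect on examples it has seen, a constant guess on everything else.
    return memorized.get(features, "cat")


def simple_model(features):
    # Uses only the underlying pattern and ignores the noise features.
    return "dog" if features[0] > 0.5 else "cat"


def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)


for name, model in [("memorizing", memorizing_model), ("simple", simple_model)]:
    print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

The memorizing model is perfect on the training set and much worse on unseen data, which is the same gap between training and validation accuracy that shows up in overfitted CNNs.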

For developing image classification models, we are also going to use TensorFlow.

TensorFlow :

Using TensorFlow, it is possible to extract image data from various files, resize images, and convert multiple pictures at once [2].

Importing all libraries: [1]

Challenge/Problem :

Before seeking a solution, we should first identify the precise problem. So, what issue arises from using standard fully connected neural networks for image-related tasks such as image classification?

To incorporate images into our neural network, we must first understand their representation on computers. Each pixel of an image is represented by an array of numbers that denote the red, green, and blue channel values. For example, a 256x256 color image would produce a 196608-dimensional feature vector (256 x 256 x 3) when converted to RGB values. Because such high dimensionality requires a very large number of weights, it becomes challenging to process large, high-quality images. Moreover, individual pixel values often contain excessive noise, making it difficult to discern any pattern even with significant computational power.
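The arithmetic behind that number, and what it implies for a fully connected layer, can be checked quickly. The hidden layer size of 1000 is an arbitrary assumption chosen only to show the scale.

```python
height, width, channels = 256, 256, 3

# Every pixel contributes one value per color channel.
input_size = height * width * channels
print(input_size)  # 196608

# Hypothetical first fully connected layer with 1000 neurons:
# one weight per (input, neuron) pair plus one bias per neuron.
hidden_units = 1000
dense_weights = input_size * hidden_units + hidden_units
print(dense_weights)  # 196609000
```

Almost 200 million parameters for a single layer is what makes a plain fully connected network impractical for images of this size.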

So, in order to solve this…

Solution :

Instead of treating images as individual raw pixel values, we aim to obtain features through image filters and convolution operations. However, we must first address the issue that manually designed filters do not generalize well across different tasks. To achieve this, we can randomly initialize the filter values and treat them as parameters of our model. In doing so, the model can learn the filter values during training, allowing for greater adaptability and flexibility.

Using :

> In a convolutional neural network, parameters are shared across various locations of an image, making the architecture more efficient compared to a regular neural network.

> The same filter used to detect vertical edges can be applied to any location within an image, which gives our filters shift invariance. Therefore, our filters are capable of detecting features irrespective of their position within the image.
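The two points above can be illustrated with one small filter applied to two synthetic images. The helpers conv2d, edge_image and peak_column, as well as the 8x8 image size, are hypothetical names and values of mine.

```python
def conv2d(image, kernel):
    """Unpadded, stride-1 cross-correlation of a 2D image with a 2D filter."""
    k_h, k_w = len(kernel), len(kernel[0])
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k_h)
                for dj in range(k_w)
            )
            for j in range(len(image[0]) - k_w + 1)
        ]
        for i in range(len(image) - k_h + 1)
    ]


def edge_image(edge_col, size=8):
    """Dark on the left, bright from column edge_col onwards."""
    return [[1.0 if c >= edge_col else 0.0 for c in range(size)] for _ in range(size)]


# The same nine weights are reused at every position of the image.
vertical_edge_filter = [[-1.0, 0.0, 1.0]] * 3


def peak_column(edge_col):
    response = conv2d(edge_image(edge_col), vertical_edge_filter)
    first_row = response[0]
    return first_row.index(max(first_row))


print(peak_column(3))  # 1
print(peak_column(5))  # 3
```

Moving the edge two columns to the right moves the strongest response two columns to the right as well, without changing or adding any filter weights.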

Now

Dataset :

Our dataset is Facial Emotion Expressions [1]. It consists of seven classes: happy, sad, angry, fear, surprise, neutral and disgust.

Image Pre-Processing :

We will pre-process the dataset in an organized manner before feeding it to the CNN model.

Step 1 :

Initialize the Batch_size, number of epochs, img_height and img_width
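A minimal sketch of this step is shown below. The batch size of 32 and the 48x48 image size are the values mentioned in the Challenges section, the epoch count of 15 is the one used for some of the experiments, and the dataset size is the approximate figure from the Goal section; the constant names themselves are my assumption.

```python
import math

BATCH_SIZE = 32    # images processed per training step
EPOCHS = 15        # full passes over the training data
IMG_HEIGHT = 48    # every image is resized to 48x48
IMG_WIDTH = 48
NUM_CLASSES = 7    # angry, sad, happy, neutral, disgust, fear, surprise

# Rough number of steps per epoch if all ~35000 images were used for training.
approx_num_images = 35000
steps_per_epoch = math.ceil(approx_num_images / BATCH_SIZE)
print(steps_per_epoch)  # 1094
```

Keeping these values in one place makes it easy to change them between experiments.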

Step 2 :

Create a data frame listing the files of the downloaded dataset.
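One possible way to collect the file list is sketched below, assuming the dataset is laid out as one folder per emotion class (this layout, the function list_images and the file names are assumptions, not taken from the notebook). The resulting list of records could then be passed to pandas.DataFrame to obtain the data frame.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_images(root):
    """Return one record per image, using the parent folder name as the label."""
    records = []
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue
        for image_path in sorted(class_dir.iterdir()):
            if image_path.suffix.lower() in IMAGE_SUFFIXES:
                records.append({"filepath": str(image_path), "label": class_dir.name})
    return records


# Demo on a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    for label, name in [("happy", "a.jpg"), ("sad", "b.png")]:
        (Path(root) / label).mkdir()
        (Path(root) / label / name).touch()
    records = list_images(root)

print([record["label"] for record in records])  # ['happy', 'sad']
```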

Step 3 :

Visualize the data

Step 4 :

Model Building

In order to build the model, the data from the data frame was first stored in features and labels arrays.

Data Augmentation

Normalizing the data
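As a simplified illustration of these two steps, the sketch below scales pixel values to the range 0 to 1 and creates a mirrored copy of an image. A horizontal flip is only one possible augmentation and may differ from what the notebook uses; the tiny 2x3 "image" is made up.

```python
def normalize(image):
    """Scale 8-bit pixel values (0 to 255) to floats between 0 and 1."""
    return [[pixel / 255 for pixel in row] for row in image]


def flip_horizontal(image):
    """Mirror the image left to right; the emotion label stays the same."""
    return [row[::-1] for row in image]


image = [
    [0, 128, 255],
    [64, 32, 16],
]

flipped = flip_horizontal(image)
normalized = normalize(image)

print(flipped)           # [[255, 128, 0], [16, 32, 64]]
print(normalized[0][2])  # 1.0
```

Augmentation gives the model extra, slightly different training examples, while normalization keeps the inputs in a small, consistent range.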

Predict with the model
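A classifier of this kind typically outputs one score per class, which is turned into probabilities with softmax before picking the most likely emotion. The sketch below shows that last step in pure Python; the class order and the scores are invented for the example and do not come from the trained model.

```python
import math

# Assumed class order; the real order depends on how the labels were encoded.
CLASS_NAMES = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(score - largest) for score in scores]  # shift for stability
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical raw model output for a single image.
scores = [0.1, -1.2, 0.3, 2.5, 0.9, 0.0, -0.4]

probabilities = softmax(scores)
predicted = CLASS_NAMES[probabilities.index(max(probabilities))]
print(predicted)  # happy
```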

Step 5 :

Basic CNN : Model Analysis

Basic Keras sequential model using the EfficientNet CNN architecture with 2 layers and Adam Optimizer.

After training the model

We got a best training accuracy of 88 % and loss of 0.98

And we got the best validation accuracy of 25 % and loss of 10.23

My Contribution :

Here is my contribution to the image classifier for the facial emotion dataset.

>> To experiment with various hyperparameters, including the number of convolution-pooling pairs, the dropout percentage, and the number of neurons in the fully connected layer of a CNN, I followed the guidance provided in tutorial [3] and applied these hyperparameters to three distinct models.

>> I delved deeper into understanding the causes of overfitting in my CNN and identified methods to prevent it from occurring.

>> While I conducted experiments with various hyperparameters, this blog lacks information on other models, which would have aided my comprehension.

>> I encountered many challenges while building the CNN models and found solutions to fix them.

Keyword terminologies used in my program :

In a convolutional layer, the trainable parameters are referred to as filters or kernels.
During training, the network learns the values of these filters, which are also known as weights.
Strides determine the step size at which the filter window moves across the input.
To process the edges of the input, a frame of 0-valued pixels, known as padding, is often added.
Dropout is a regularization technique that can be used to prevent overfitting in neural networks.
A filter size of 5x5 can be set by specifying kernel_size=5.
The default stride value is 1.
When padding='valid', each spatial dimension of the output is reduced by kernel_size - 1 (for a stride of 1).
On the other hand, when padding='same', the output will have the same spatial size as the input (for a stride of 1).
The ReLU (Rectified Linear Unit) activation function can be set using activation='relu'.
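The padding and kernel_size rules above can be turned into two small helper formulas. The function names are hypothetical, and the 48-pixel input, 5x5 filter and 32 filters are example values of mine rather than settings from the notebook.

```python
import math


def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output width or height of a convolution along one spatial dimension."""
    if padding == "same":
        return math.ceil(input_size / stride)
    return (input_size - kernel_size) // stride + 1


def conv_param_count(kernel_size, in_channels, filters):
    # One kernel_size x kernel_size x in_channels weight block plus one bias per filter.
    return (kernel_size * kernel_size * in_channels + 1) * filters


print(conv_output_size(48, kernel_size=5, padding="valid"))        # 44
print(conv_output_size(48, kernel_size=5, padding="same"))         # 48
print(conv_param_count(kernel_size=5, in_channels=1, filters=32))  # 832
```

With 'valid' padding a 48-pixel side shrinks by kernel_size - 1 = 4, while 'same' padding keeps it at 48.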

Experiments

1. Model 3: Fully Connected Layers

Training the model

Visualizing the model accuracy

The best training accuracy was 85 % and validation accuracy was 25 %.

2. Model 5: 5-Layer Fully Connected CNN Model

Training the model

Accuracy

The best training accuracy was 88 % and validation accuracy was 25 %.

3. Model 6: 3 Layers with 100, 200, 100 Neurons, 15 Epochs

Training the model

Accuracy

The best training accuracy was 84 % and validation accuracy was 28 %.

4. Model 8: 3 Layers with 200, 100, 300 Neurons, 15 Epochs

Training the model

Accuracy

The best training accuracy was 88 % and validation accuracy was 35 %.

5. Model 9: 5 Conv2D Layers with 200, 300, 100, 200, 300 Neurons, 10 Epochs, Regularized with Dropout at 10

Train the model

Accuracy

The best training accuracy was 95 % and validation accuracy was 75 %.

The fifth and final model made me realize that, in order to avoid overfitting, I had to reduce the training dataset as well as add regularization to the model.

The only reason I didn’t apply the same technique to all the previous models was to show my contribution in an appropriate manner, without bias.
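The regularization used in the final model was dropout. A minimal sketch of how (inverted) dropout works is given below; the function dropout and the rate of 0.5 are illustrative assumptions and not the setting used in the experiments.

```python
import random


def dropout(activations, rate, training=True, rng=None):
    """Inverted dropout: zero each value with probability rate, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - rate
    return [
        value / keep_prob if rng.random() < keep_prob else 0.0
        for value in activations
    ]


activations = [1.0] * 10

# During training, roughly half of the values are dropped and the rest are doubled.
dropped = dropout(activations, rate=0.5, rng=random.Random(0))
print(dropped)

# At prediction time dropout is switched off and the values pass through unchanged.
print(dropout(activations, rate=0.5, training=False) == activations)  # True
```

Because a different random subset of neurons is silenced at every step, the network cannot rely on any single neuron to memorize the training images.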

Comparing CNN Models

I tried various hyperparameters and methods to overcome overfitting. A few worked, but not all of the models are good. I have summarized all the models in one image, as shown below :

Performance :

Our performance suffered as we added an extra pooling layer to the convolution, resulting in a loss of valuable information. Although adding a fully connected layer increases the number of parameters, it can also negatively impact performance when working with limited data. Additionally, reducing the batch size may provide regularization but may introduce more noise in the gradients during training.

Challenges and Solutions :

I come from a background where these things should be a piece of cake, but I still faced a lot of issues while building the models.

Adjusting the hyperparameters, particularly the number of neurons and the dropout parameter, was a difficult task during experimentation. However, I was able to gain valuable insights from a Udemy course on CNN [3], where the author provided a clear explanation of how to set these parameters. This guidance was instrumental in building my models (2, 3 & 4).
I faced several challenges when evaluating models 3 and 4, specifically with regard to the amount of time it took to complete the evaluation. Initially, I used a batch size of 32 and image dimensions of 180x180, but this resulted in unacceptably long execution times. As a solution, I reduced the image height and width to 48, which significantly improved the evaluation time.
A secondary challenge I faced was identifying the most suitable CNN Python library for building my model. There are various libraries available, such as PyTorch, TensorFlow, and MXNet, as noted in [7]. After conducting research, I ultimately chose TensorFlow because it is considered an accessible library for image classification, as indicated in the TensorFlow tutorial [2].
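To get a feeling for why shrinking the images from 180x180 to 48x48 helped so much with execution time, the number of pixels per batch can be compared. This is only a back-of-the-envelope sketch; the helper pixels_per_batch is hypothetical and actual run time depends on much more than the pixel count.

```python
def pixels_per_batch(height, width, batch_size=32):
    """Pixels per channel that the network has to process in one batch."""
    return height * width * batch_size


large = pixels_per_batch(180, 180)
small = pixels_per_batch(48, 48)

print(large)          # 1036800
print(small)          # 73728
print(large / small)  # 14.0625
```

Each batch contains about 14 times fewer pixels after the change, which is in line with the much shorter evaluation times.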

References :

1. Kaggle dataset code link: https://www.kaggle.com/code/anand1994sp/facial-expression
2. Tutorial, Image Classification, TensorFlow: https://www.tensorflow.org/tutorials/images/classification
3. Tutorial, Image Recognition with Machine Learning, Educative: https://www.educative.io/courses/image-recognition-ml
4. Image Classification: https://huggingface.co/tasks/image-classification
5. Medium Blog on CNN: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
6. StackOverflow, Changing tick frequency on axes: https://stackoverflow.com/questions/12608788/changing-the-tick-frequency-on-the-x-or-y-axis
7. Python Deep Learning Libraries: https://pyimagesearch.com/2016/06/27/my-top-9-favorite-python-deep-learning-libraries/
8. Wikipedia, CNN: https://en.wikipedia.org/wiki/Convolutional_neural_network
9. Machine Learning: https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/
10. Blog on CNN: https://www.clarifai.com/blog/what-is-convolutional-networking?hs_amp=true&utm_term=&utm_campaign=DSA-Community&utm_source=adwords&utm_medium=ppc&hsa_acc=4305946045&hsa_cam=18142553015&hsa_grp=141361868638&hsa_ad=618056207992&hsa_src=g&hsa_tgt=dsa-19959388920&hsa_kw=&hsa_mt=&hsa_net=adwords&hsa_ver=3&gclid=CjwKCAjwzNOaBhAcEiwAD7Tb6Bd_39yY3s-xpiUj0Nx3y3GfmfnJGGP03tZWNNaBpQ-7AAZkujrVDBoCCAIQAvD_BwE

Written by Gauravthorat