Facial Expression Recognition & Comparative Study on Densenet161 and Resnet152

Published in Analytics Vidhya, Aug 29, 2021

using Deep Learning, PyTorch, and Transfer Learning

Facial expression recognition is a classification task within computer vision. The goal of this project is to use a camera as the machine's eyes and classify the face of the person in view (if any) according to their current expression or mood.

Face recognition is a method of identifying or verifying the identity of an individual using their face. It is one of the most important computer vision applications and attracts great commercial interest. Recently, face recognition technologies have advanced greatly with deep-learning-based methods.

Face recognition in static images and video sequences captured in unconstrained recording conditions is one of the most widely studied topics in computer vision, owing to its extensive range of applications in surveillance, law enforcement, biometrics, marketing, and many more.

History of Deep Face Recognition:

In the early 1990s, face recognition gained popularity following the introduction of the historical Eigenface approach. In the 1990s and 2000s, holistic approaches dominated the face recognition community. Holistic approaches derive the low-dimensional representation through certain distribution assumptions, such as linear subspace, manifold, and sparse representation. The problem of holistic methods is their failure to address uncontrolled facial changes that deviate from their prior assumptions. This led to the development of local feature-based face recognition in the early 2000s.
In the early 2000s and 2010s, local-feature-based face recognition and learning-based local descriptors were introduced. Face Recognition using Gabor filters and Local Binary Pattern (LBP), as well as their multilevel and high-dimensional extensions, achieved robust performance through some invariant properties of local filtering. Unfortunately, handcrafted features suffered from a lack of distinctiveness and compactness. In the early 2010s, learning-based local descriptors were introduced for face recognition, in which local filters are learned for better distinctiveness, and the encoding codebook is learned for better compactness.
In 2014, Facebook’s DeepFace and DeepID achieved state-of-the-art accuracy on the famous Labeled Faces in the Wild (LFW) benchmark, approaching human-level performance in the unconstrained scenario for the first time. Since then, the research focus has shifted to deep-learning-based approaches. Deep learning methods use a cascade of multiple layers of processing units for feature extraction and transformation. To support them, larger-scale face databases and advanced face processing techniques have been developed. As a result, with the representation pipelines becoming deeper, LFW performance steadily improved from around 60% to above 97%.

Face Recognition and Deep Learning:

Deep learning, in particular the deep convolutional neural networks (CNN), has received increasing interest in face recognition, and several deep learning methods have been proposed.

Deep learning technology has reshaped the research landscape of face recognition since 2014, launched by the breakthroughs of DeepFace and DeepID methods. Since then, deep face recognition techniques, which leverage the hierarchical architecture to learn discriminative face representation, have dramatically improved state-of-the-art performance and fostered numerous successful real-world applications. Deep learning applies multiple processing layers to learn representations of data with multiple levels of feature extraction.

Dataset:

I found this dataset on Kaggle. It poses an image classification problem: the data consists of six classes, ‘Happy’, ‘Suprise’, ‘Sad’, ‘Fear’, ‘Angry’, and ‘Neutral’, each containing roughly 3,000 to 7,200 images.

Dataset Link:
https://www.kaggle.com/apollo2506/facial-recognition-dataset

Description:

Number of images per category:

Angry: 3995
Sad: 4830
Suprise: 3171
Fear: 4097
Happy: 7215
Neutral: 4965
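
The counts above are not evenly distributed, which is worth keeping in mind when reading the accuracy later. A minimal sketch that totals them and measures the imbalance follows; the dictionary simply restates the numbers from this section, and the variable names are my own.

```python
# Per-class image counts as listed in the dataset description above.
class_counts = {
    "Angry": 3995,
    "Sad": 4830,
    "Suprise": 3171,  # label spelled as it appears in this article
    "Fear": 4097,
    "Happy": 7215,
    "Neutral": 4965,
}

total = sum(class_counts.values())
largest = max(class_counts, key=class_counts.get)
smallest = min(class_counts, key=class_counts.get)
imbalance_ratio = class_counts[largest] / class_counts[smallest]

# Print classes from most to least frequent with their share of the data.
for name, count in sorted(class_counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {count:5d}  {count / total:6.1%}")

print(f"total images: {total}")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

With the largest class more than twice the size of the smallest, a classifier can look better on overall accuracy than it is on the rarer expressions.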

Procedures:

Download Dataset
Import libraries
GPU Utilities
Creating a Custom PyTorch Dataset
Creating Training and Validation Sets
PyTorch data loaders
Some Examples
Modifying a Pretrained Model (DenseNet161, ResNet152)
Transfer Learning
Training Loop
Finetuning the Pretrained Model
Results
Testing the model against the test dataset to check the accuracy
Save the trained model
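
As an illustration of the "Creating Training and Validation Sets" step, here is a minimal sketch of a per-class (stratified) split. This is an assumption on my part: the project may well use a plain random split instead, and the function name, the 10% validation fraction, the seed, and the file names are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, val_fraction=0.1, seed=42):
    """Split (path, label) pairs into train/val lists, class by class."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Keep at least one validation image per class.
        n_val = max(1, round(len(items) * val_fraction))
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val


# Hypothetical file names, for illustration only.
samples = [(f"Happy/{i}.png", "Happy") for i in range(70)]
samples += [(f"Fear/{i}.png", "Fear") for i in range(30)]

train, val = stratified_split(samples, val_fraction=0.1)
print(len(train), len(val))
```

Splitting class by class keeps the class proportions of the validation set close to those of the full dataset, which matters here because the classes differ a lot in size.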

Modifying a Pretrained Model (DenseNet161):

Recent work has shown that convolutional networks can be substantially deeper, more accurate, and more efficient to train if they contain shorter connections between layers close to the input and those close to the output. The DenseNet paper embraces this observation and introduces the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections (one between each layer and its subsequent layer), a DenseNet has L(L+1)/2 direct connections. For each layer, the feature maps of all preceding layers are used as inputs, and its own feature maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. The authors evaluate the architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state of the art on most of them, while requiring less memory and computation to achieve high performance.

Authors: Gao Huang, Zhuang Liu, Kilian Q. Weinberger, Laurens van der Maaten
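
The L(L+1)/2 connection count mentioned above can be checked with a few lines of arithmetic. This is only a sketch of the counting argument; the function names are my own.

```python
def dense_connections(num_layers: int) -> int:
    """Direct connections in a densely connected block of L layers: L(L+1)/2."""
    return num_layers * (num_layers + 1) // 2


def inputs_seen_by_layer(layer_index: int) -> int:
    """Layer l (1-based) receives the block input plus the l-1 earlier outputs."""
    return layer_index


for L in (4, 6, 12, 24):
    # Summing the inputs of every layer gives the same total as the formula.
    by_sum = sum(inputs_seen_by_layer(l) for l in range(1, L + 1))
    assert by_sum == dense_connections(L)
    print(f"L={L:2d}: plain chain = {L:2d} connections, dense = {dense_connections(L):3d}")
```

The count grows quadratically with depth, which is one reason dense connectivity is applied inside blocks of limited length rather than across the whole network.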

Modifying a Pretrained Model (ResNet152):

Deeper neural networks are more difficult to train. The ResNet paper presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously. The authors explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. They provide comprehensive empirical evidence showing that these residual networks are easier to optimize and can gain accuracy from considerably increased depth. On the ImageNet dataset, they evaluate residual nets with a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity.

An ensemble of these residual nets achieves a 3.57% error on the ImageNet test set. This result won 1st place on the ILSVRC 2015 classification task. The paper also presents an analysis on CIFAR-10 with 100 and 1000 layers.

The depth of representations is of central importance for many visual recognition tasks. Solely due to the extremely deep representations, the authors obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are the foundations of their submissions to the ILSVRC & COCO 2015 competitions, where they also won 1st place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

Comparative Study on Densenet161 and Resnet152:

ResNet or DenseNet? Nowadays, most deep-learning-based approaches are implemented with seminal backbone networks, of which the two arguably most famous are ResNet and DenseNet. Despite their competitive performance and overwhelming popularity, both have inherent drawbacks. For ResNet, the identity shortcut that stabilizes training also limits its representation capacity, while DenseNet has a higher capacity thanks to multi-layer feature concatenation. However, the dense concatenation causes a new problem: it requires more GPU memory and more training time. Partly because of this, choosing between ResNet and DenseNet is not trivial.

The ResNet authors' implementation is a deep residual network with shortcut connections that skip a certain number of layers. The shortcut connections perform identity mappings, and their outputs are added to the outputs of the stacked layers. According to the authors' conclusions, residual networks are easy to optimize and benefit greatly from increased depth. Their 152-layer single model (the deepest on ImageNet at the time), which achieved first place in the ILSVRC 2015 classification competition, has significantly lower complexity than VGG networks. Further, ResNet152 outperforms previous ensembles with a top-5 accuracy of 0.94. To avoid overfitting, more drastic regularization is recommended when training much deeper networks on small datasets.

In contrast with ResNets, DenseNets concatenate the inputs using a composite function, leading to a simpler and more efficient solution. DenseNets are divided into multiple dense blocks connected by transition layers that perform batch normalization, convolution, and pooling. They are more compact and encourage feature reuse; bottleneck and compression layers reduce the number of feature maps and improve computational efficiency. Additionally, with a relatively low number of filters per layer, DenseNets are easy to train and scale to hundreds of layers while raising no optimization concerns. The outcome of the DenseNet authors' research shows that DenseNets achieve similar performance to ResNets while requiring significantly fewer parameters. DenseNet201 achieves a top-5 accuracy of 0.93 on the ImageNet dataset and is believed to obtain further gains through hyperparameter tuning; the overfitting problem is addressed by the regularizing effect of the dense connections.
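
The difference between adding (ResNet) and concatenating (DenseNet) shows up directly in the channel counts. The sketch below tracks only tensor widths, not actual feature maps. The DenseNet161 hyperparameters used (96 initial features, growth rate 48, blocks of 6, 12, 36, and 24 layers, transition layers that halve the channels) are taken from the torchvision definition as I understand it; treat them as assumptions and check them against your installed version.

```python
def resnet_style_widths(in_channels, num_layers):
    """Additive shortcuts (y = F(x) + x) leave the channel count unchanged."""
    return [in_channels for _ in range(num_layers)]


def densenet_block_width(in_channels, num_layers, growth_rate):
    """Each dense layer concatenates `growth_rate` new channels onto its input."""
    return in_channels + num_layers * growth_rate


def densenet_final_width(init_features, block_config, growth_rate, compression=0.5):
    """Channel count after the last dense block, with compressing transitions."""
    channels = init_features
    for i, num_layers in enumerate(block_config):
        channels = densenet_block_width(channels, num_layers, growth_rate)
        if i != len(block_config) - 1:
            # Transition layer between blocks reduces the number of feature maps.
            channels = int(channels * compression)
    return channels


# Assumed DenseNet161 configuration (see the note above).
final_width = densenet_final_width(96, (6, 12, 36, 24), growth_rate=48)

print("ResNet-style stage:", resnet_style_widths(256, 3))
print("DenseNet161 channels before the classifier:", final_width)
```

The steadily widening concatenated tensors are what drive the higher GPU memory use attributed to DenseNet above, and the final width is the input size of the classifier layer that gets replaced during transfer learning.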

Competing in the ILSVRC implies classifying images into one of the 1,000 classes of the ImageNet database. Approximately 150,000 images collected from search engines are used for validation and testing, each algorithm producing a list of labels sorted by decreasing confidence. PyTorch is a popular scientific computing library in the deep learning community with easy debugging and support for hardware accelerators. It offers common image transformations and model architectures addressing image classification. Table 1 provides the top-1 and top-5 error rates for the chosen pre-trained models.
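
Top-1 and top-5 figures follow directly from the confidence-sorted label lists described above: a prediction counts as correct at top-k if the true label appears among the first k entries. Here is a minimal sketch with made-up predictions, using this project's class names for readability; none of these numbers are results from the project.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of samples whose true label is among the first k ranked labels."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Made-up ranked outputs (most confident label first), for illustration only.
ranked = [
    ["Happy", "Neutral", "Sad"],
    ["Sad", "Fear", "Angry"],
    ["Neutral", "Sad", "Happy"],
    ["Fear", "Suprise", "Angry"],
]
truth = ["Happy", "Fear", "Happy", "Angry"]

top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)

# The error rate is simply one minus the accuracy.
print(f"top-1 accuracy {top1:.2f}, top-1 error {1 - top1:.2f}")
print(f"top-2 accuracy {top2:.2f}, top-2 error {1 - top2:.2f}")
```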

The DS2Net authors report that their proposed dense weighted normalized shortcut also speeds up convergence, and that the training error of DS2Net is much smaller.

They provide a unified perspective of dense summation to clarify the core difference between ResNet and DenseNet, and show that it lies in whether the convolution parameters are shared for the preceding feature maps. They propose the dense weighted normalized shortcut as an alternative dense connection method, which outperforms the two existing dense connection techniques: the identity shortcut in ResNet and dense concatenation in DenseNet. They also find that dense summation from the aggregation output performs better than dense summation from the convolutional block output. In short, the dense shortcut addresses the reduced representational capacity of ResNet while avoiding DenseNet's higher GPU resource requirements.

Transfer Learning:

The basic premise of transfer learning is simple: take a model trained on a large dataset and transfer its knowledge to a smaller dataset. For object recognition with a CNN, we freeze the early convolutional layers of the network and only train the last few layers which make a prediction. The idea is the convolutional layers extract general, low-level features that are applicable across images — such as edges, patterns, gradients — and the later layers identify specific features within an image such as eyes or wheels.
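
The freeze-and-retrain idea described above can be sketched in PyTorch as follows. This is a minimal sketch, not the code from the project: the helper names, the `TinyBackbone` stand-in, the Adam optimizer, and the learning rate are all assumptions made for illustration. The stand-in model is used so the example runs without downloading pretrained weights.

```python
import torch
import torch.nn as nn


def prepare_for_transfer(model: nn.Module, num_classes: int) -> nn.Module:
    """Freeze all existing weights and attach a fresh, trainable linear head."""
    for param in model.parameters():
        param.requires_grad = False

    # Assumes the head is a single nn.Linear, as in torchvision's
    # ResNet (`fc`) and DenseNet (`classifier`) models.
    if hasattr(model, "fc"):
        model.fc = nn.Linear(model.fc.in_features, num_classes)
    elif hasattr(model, "classifier"):
        model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    else:
        raise ValueError("model has no recognised classifier head")
    return model


def trainable_parameters(model: nn.Module):
    """Parameters that the optimizer should update (here: only the new head)."""
    return [p for p in model.parameters() if p.requires_grad]


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained network with an `fc` head."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(8, 1000)

    def forward(self, x):
        return self.fc(torch.flatten(self.features(x), 1))


model = prepare_for_transfer(TinyBackbone(), num_classes=6)
optimizer = torch.optim.Adam(trainable_parameters(model), lr=1e-3)

logits = model(torch.randn(2, 3, 48, 48))  # batch of 2 random "images"
print(logits.shape)
```

For the real models, the same helper should work on an instance of `torchvision.models.resnet152` or `torchvision.models.densenet161` loaded with ImageNet weights. The argument used to request pretrained weights differs between torchvision versions (`pretrained=True` in older releases, `weights=...` in newer ones), so check the documentation for your installed version. For the later fine-tuning step, the frozen layers can be unfrozen again by setting `requires_grad = True`.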

Results:

Here are the results for

1. Resnet152:

From the epoch vs. loss graph, we can see that the training and validation losses decrease together up to epoch 15. After that, the training loss rises for a while, and the validation loss runs roughly parallel to it.

Initially, both the training and validation losses seem to decrease over time. However, if you train the model for long enough, you will notice that the training loss continues to decrease, while the validation loss stops decreasing, and even starts to increase after a certain point!

This phenomenon is called overfitting, and it is the number one reason why many machine learning models give rather terrible results on real-world data. It happens because the model, in an attempt to minimize the loss, starts to learn patterns that are unique to the training data, sometimes even memorizing specific training examples. Because of this, the model does not generalize well to previously unseen data.
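
One common way to react to the pattern described above is early stopping: remember the epoch with the lowest validation loss and stop once it has not improved for a few epochs. The sketch below is an assumption on my part and not necessarily something the project does; the loss values are made up for illustration and are not the project's results.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops at the first epoch that is `patience` epochs past the
    best validation loss seen so far, or at the last epoch otherwise.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch


# Made-up validation losses: they improve, then drift upwards (overfitting).
val_losses = [1.60, 1.42, 1.31, 1.25, 1.27, 1.30, 1.34, 1.41]

stop_epoch, best_epoch = best_epoch_with_patience(val_losses, patience=3)
print(f"stop after epoch {stop_epoch}, keep the weights from epoch {best_epoch}")
```

In a real training loop, the model weights would be saved whenever the validation loss improves, so that the checkpoint from the best epoch can be restored after stopping.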

2. Densenet161:

From the epoch vs. loss graph, we can see that the training and validation losses decrease together up to epoch 5. After that, the training loss rises for a while, and the validation loss runs roughly parallel to it.

Conclusion:

1. Training for facial expression recognition was much more difficult than I thought it would be. Some expressions are fairly similar, which seems to cause more errors when the model tries to recognize them.
2. A larger image size in the Resize and RandomCrop transforms might give better results, but that requires a high-end GPU; otherwise, training fails with an “out of memory” error.
3. I also tried different learning rates and other approaches. I tried DenseNet as well, but ResNet gave better results.

Project Link:

https://github.com/soham2707/Facial-Expression-Recognition-Using-Deep-Learning.git

Future work:

As the results are not very satisfying, I will try another image classification project using the transfer learning approach.