An Offline Social Recognition System for Recognizing Contempt

Kevin San Gabriel
5 min read · Aug 16, 2020


We introduce a social recognition system that uses two machine learning models to classify emotion across cultures. The current models classify the universal emotions of contempt, disgust and anger.

Observing Filipino facial expressions using OpenFace
Observing Persian facial expressions using OpenFace

Motivation

Social recognition systems are capable of identifying emotions in humans based on their perceived facial expressions. This entails observing data, like a set of frames from a collection of videos containing faces of humans, then training a machine learning model on this data.

The current state of the literature indicates a lack of research on social recognition systems and contempt across different cultures. Cultures display emotions differently, which can affect how a social recognition system perceives a given emotion.

Our main contribution is to fill this gap by recognizing contempt cross-culturally. This will provide further support for computers that use recognition systems to detect emotions in humans.

Two machine learning models are used to classify the emotions of contempt, disgust and anger based on action units (AUs). The models classify contempt across three different cultures: Filipino, Persian and North American. The two models are a Support Vector Machine (SVM) and a Multi-Layer Perceptron (MLP).

An SVM attempts to separate data into categories. It does so using hyperplanes, searching for the hyperplane with the largest margin, i.e. the one that gives the best separation between classes. The AU attributes of each video are converted into a feature vector, which is fed into the model. Using these feature vectors and a Gaussian radial basis function as a kernel, the model maps the AU attributes into a high-dimensional space in order to find the optimal separating hyperplanes. The separation amounts to classifying the data into emotions across a variety of different cultures.
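As a rough illustration, here is a minimal sketch of such a classifier using scikit-learn's SVC with an RBF kernel. The exact hyperparameters and data shapes below are placeholders, not the settings used in the project:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one feature vector of 16 AU attributes per frame; y: emotion labels
# (anger / contempt / disgust). Random values stand in for the real data.
X = np.random.rand(100, 16)
y = np.random.choice(["anger", "contempt", "disgust"], size=100)

# The RBF (Gaussian radial basis function) kernel maps the AU features into a
# high-dimensional space where a maximum-margin separator is easier to find.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale", C=1.0))
svm.fit(X, y)
print(svm.predict(X[:5]))
```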

An MLP is a deep, artificial neural network composed of multiple perceptrons. MLPs consist of an input layer which takes in data and an output layer which makes a prediction about the data. Between these two layers are a number of hidden layers which serve as the computational engine of the MLP. In other words, MLPs learn to model the correlation between data (input) and predictions about that data (output).

The current MLP consisted of 3 hidden layers with 24, 12 and 6 units respectively. The 16 AU attributes from the dataset were fed into the input layer, and a softmax output layer classified them into the 3 emotions. After finding the model which worked best on the validation sets, it was further examined on a separate test dataset.
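A sketch of this architecture in Keras is shown below. The framework, activation functions, loss and optimizer are assumptions for illustration; the post only specifies the layer sizes, the softmax output and the 100 training epochs:

```python
import tensorflow as tf

# Described architecture: 16 AU inputs, hidden layers of 24, 12 and 6 units,
# and a softmax output over the 3 emotions. ReLU activations are assumed.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(16,)),
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(12, activation="relu"),
    tf.keras.layers.Dense(6, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# Loss and optimizer choices are assumptions, not stated in the post.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100)
```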

Dataset Collection and Preprocessing

We collected 236 YouTube videos containing anger, disgust or contempt, covering Filipino, North American and Persian cultures. Due to copyright, we will not be publishing the dataset.

Each video was analyzed using OpenFace. The OpenFace software breaks each video down into individual frames and identifies facial movements and landmarks. For each frame, we extracted the AUs along with the frame number, face id, confidence and success values into a CSV file.

In the next step, we cleaned the data by removing frames with a confidence of less than 0.8 or a success value not equal to 1.
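This cleaning step can be expressed in a few lines of pandas. The filename and the "_r" intensity-column convention below are illustrative of OpenFace's per-frame CSV output, not a verbatim excerpt from our pipeline:

```python
import pandas as pd

# Load the per-frame CSV produced by OpenFace (placeholder filename).
df = pd.read_csv("video_001.csv")
df.columns = df.columns.str.strip()  # OpenFace column names may carry spaces

# Keep only frames where the face was tracked reliably:
# confidence >= 0.8 and success == 1.
clean = df[(df["confidence"] >= 0.8) & (df["success"] == 1)]

# Keep the AU intensity columns (e.g. AU01_r, AU04_r, ...) as features.
au_cols = [c for c in clean.columns if c.startswith("AU") and c.endswith("_r")]
features = clean[au_cols]
```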

Results

The results for the two models were compared with one another. Before compiling each model, k-fold cross validation was conducted to assess its true predictive power. The training dataset was created by removing all rows belonging to 10% of the videos and using the removed videos as the test dataset.
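Holding out whole videos (rather than individual frames) can be approximated with scikit-learn's GroupShuffleSplit; a sketch is below, assuming `X`, `y` and a `video_ids` array of per-frame video identifiers are available as NumPy arrays:

```python
from sklearn.model_selection import GroupShuffleSplit

# Every frame carries the id of the video it came from, so entire videos
# (not individual frames) end up in the held-out test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.10, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=video_ids))

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```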

The MLP was run for 100 epochs and produced an accuracy of 66% on the validation data and 49% on the test data, while the SVM produced an accuracy of 96% on the validation data and 40% on the test data. The SVM's much higher validation accuracy might indicate a level of overfitting. In addition to accuracy, the f1-score on the test dataset was 0.29 for the SVM and 0.45 for the MLP. The MLP, therefore, proved to be the better model for this dataset.
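For reference, scores like these can be computed with sklearn.metrics; the macro averaging below is an assumption, since the post does not state how the f1-score was aggregated over the three classes:

```python
from sklearn.metrics import accuracy_score, f1_score

y_pred = svm.predict(X_test)  # or the MLP's argmax predictions
print("accuracy:", accuracy_score(y_test, y_pred))
print("f1:", f1_score(y_test, y_pred, average="macro"))
```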

Result of our model on validation and test set
Training and validation loss over 100 epochs

Although the accuracy on the test data was relatively low, this might indicate that cultural differences do affect the outcome of the model.

In addition to the two classifiers, a Gaussian mixture model was used to determine which images were most strongly related to each action unit associated with contempt.
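The post does not detail exactly how the mixture model was applied; one plausible sketch, assuming scikit-learn's GaussianMixture is fitted per AU on the cleaned `features` table and frames are ranked by their responsibility for the strongest-activation component:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

contempt_aus = ["AU04_r", "AU07_r", "AU10_r", "AU25_r", "AU26_r"]

for au in contempt_aus:
    values = features[au].to_numpy().reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(values)

    # Treat the component with the higher mean as "strong activation" and
    # rank frames by their posterior probability of belonging to it.
    strong = int(np.argmax(gmm.means_.ravel()))
    resp = gmm.predict_proba(values)[:, strong]
    best_frames = np.argsort(resp)[::-1][:5]
    print(au, "top frame indices:", best_frames)
```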

Amongst the five AUs associated with contempt (AU4, AU7, AU10, AU25 and AU26), the Filipino category provided the best depiction of AU4, the Persian category provided the best depictions of AU7, AU10 and AU26, and the North American category provided the best depiction of AU25.

Best depiction of AU4 was from the Filipino category

Discussion

Contempt, anger and disgust are closely related to each other, making it difficult for humans to differentiate between them. Given the low accuracy, there were several instances where the model misclassified images that did not contain any emotion at all. For example, the image below was labelled contempt but clearly does not display contempt.

Although our models produced relatively low accuracy scores on the test dataset, they did show that either the dataset requires more data or that variations in how contempt is expressed across cultures do affect the results of the model.

Further research could look into using audio, or even sentiment analysis of the dialogue, to add further context to the dataset, as well as compiling more data across more than three cultures.

Conclusion

In this blog post, we reported on our efforts in the automatic recognition of three negative emotions: anger, contempt and disgust. To this end, we collected videos containing these emotions from YouTube, then labeled and analyzed them. We developed and evaluated a deep neural network and an SVM model. The best result was achieved by our DNN model, with F1 = 0.649 on the validation set and F1 = 0.451 on the test set. Analysis of the experiments showed that these three emotions have indistinct boundaries, which makes them difficult to classify in some cases, especially when we want to generalize to other cultures. Future work could improve the results by incorporating other modalities like text or audio pitch, or by extending the dataset.

Feel free to share your thoughts and comment on this project.
