Comparing Emotion Recognition Tech: Microsoft, Neurodata Lab, Amazon, Affectiva

Neurodata Lab
Apr 19, 2019 · 11 min read


Automated emotion recognition has been around for some time, and since it entered the market it has kept getting more accurate. Even tech giants have joined the race and released their own emotion recognition software, following the smaller startups that did so first. We set out to compare the best-known algorithms.

Emotions are subjective and variable, so when it comes to accuracy in emotion recognition, matters are not self-evident. In machine learning, which includes the emotion recognition task, an algorithm’s accuracy is assessed on specially created test datasets or in external contests. In the world of emotions, such datasets fall into those where emotions are acted, natural, or induced, and those where emotions are represented by discrete categories (labels) or by continuous dimensions (such as valence and arousal). Earlier we talked about how to classify emotional states properly. Among the best-known affective datasets are IEMOCAP, SAVEE, AFEW, RECOLA, MOSEI, GEMEP, and the newest, RAVDESS.

Comparison is important in any branch of machine learning for a variety of reasons. It sets an objective benchmark for the industry and stimulates the development of better algorithms that work faster and give more accurate results. It highlights the blind spots that we cannot tackle for now and that require new approaches. It also exposes possible drawbacks of existing technologies and shows who leads and who lags behind in the world of innovation.

Of course, companies and laboratories take the challenge seriously and gather enormous amounts of data to train their systems. Affective Computing, the field concerned with emotional technologies and with systems that can recognize, interpret, process, and simulate human emotions, is no exception. For instance, Affectiva, a pioneer in the field, has analyzed 4 billion frames containing 7.5 million faces to date. We at Neurodata Lab created Emotion Miner, a global platform for gathering and annotating multicultural content from open sources, and collected one of the biggest affective video datasets. All in all, the more relevant data an algorithm has ‘seen’, the better it will perform on new data, in new conditions.

However easy it may be to teach machines to recognize intense emotional expressions, such as happiness with its indicative smile, it is far harder to deal with natural data where people move, talk, and gesture actively. At the same time, accurate emotion recognition ‘in the wild’ is crucial for real-life applications of the technology in advanced human-machine interaction. If we want machines to understand us as well as we understand one another, we need to teach them to recognize the slightest nuances in our facial expressions, vocal intonations, and body movements, all while making sense of what we actually say. You can read more about this in a recent piece for Digital Trends, where we shared our opinion on the challenges Affective Computing faces and on its future.

Data to compare

We compared the algorithms on the available test datasets. For this purpose we took the emotion recognition technologies developed by Microsoft and Affectiva, then added Amazon and, of course, our own Neurodata Lab emotion recognition unit. All of them recognize emotions by analyzing facial expressions. We used affective video data annotated with discrete labels, since all the compared algorithms were trained to work with emotion categories, that is, with different types of emotions. We chose SAVEE, AFEW, and RAVDESS, the last of which was released only a year ago.

Unless we deal with the matter professionally, we usually take for granted that one technology is more accurate than another. Yet it is interesting to understand what makes an emotion recognition algorithm perform one way or another.

In machine learning, the usual practice is to create a dataset containing examples of the objects an algorithm has to classify, for instance cars or chairs. For emotion recognition, it is common to record actors playing out conversational situations and imitating emotions; such acted emotional expressions can be unnatural or exaggerated. Algorithms trained on acted data show low accuracy on ‘in-the-wild’ data. When an emotion is played by an actor, we cannot be sure that it is expressed the way people really express it. For this very reason, natural data are valuable to developers, yet they are also difficult to work with because of background noise and other limitations, such as large variability in emotional expression and missing or obstructed channels (the face unseen, or the voice not clearly heard).

In any case, the data in these sets consist of short audiovisual fragments, each expressing one particular emotion. How do we know which one it is? Quite a lot of people, so-called ‘annotators’, watch each fragment and manually indicate which emotion it contains. The results of this procedure may differ depending on the annotators’ cultural background, since patterns of emotional expression differ across cultures as well.
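To make this concrete, here is a minimal sketch of one common way to turn several annotators’ votes into a single label per fragment, assuming simple majority voting; the actual aggregation rules used by the dataset authors may differ.

```python
from collections import Counter

def aggregate_labels(annotator_labels):
    """Collapse several annotators' labels for one fragment into a single
    label by majority vote (ties go to the label seen first among the most
    common). This is one plausible rule, not necessarily the one used by
    SAVEE, AFEW, or RAVDESS."""
    counts = Counter(annotator_labels)
    label, _ = counts.most_common(1)[0]
    return label

# Example: five annotators watched the same fragment.
votes = ["happiness", "happiness", "surprise", "happiness", "neutral"]
print(aggregate_labels(votes))  # -> happiness
```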

Thus, the accuracy of an emotion recognition algorithm is determined by how well the emotion predicted by the machine agrees with the emotion indicated by the annotators. Today most algorithms have learned to distinguish 6 emotions (happiness, sadness, fear, disgust, anger, surprise) and a neutral state. These are sometimes called ‘basic’ emotions (which is actually a myth).

Each of the datasets, SAVEE, AFEW, and RAVDESS, includes from 480 to 1440 fragments with these emotions. Since SAVEE and RAVDESS are acted datasets, in this article we concentrate on the results and samples from AFEW, which contains fragments from famous, well-acted scenes of contemporary cinema. Even though acted, these scenes are unrefined and as close to real life as possible.

The comparison results

We examined each of the 7 affective states: 6 emotions and a neutral state. It turned out that some emotions were relatively easy to detect for all 4 algorithms, while others were quite difficult for most of them. All in all, some algorithms performed better than others.

Again, we took the four technologies: Microsoft, Neurodata Lab, Amazon, and Affectiva.

These algorithms work on the principle of single-frame analysis: they split the video stream into frames and detect emotions in each of them as if dealing with single images.
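As an illustration of the single-frame principle, here is a rough sketch in Python; predict_emotion is a hypothetical placeholder for whatever per-frame classifier or vendor API is being called, not any vendor’s actual interface.

```python
import cv2                      # OpenCV, used here only to read video frames
from collections import Counter

def predict_emotion(frame):
    """Hypothetical placeholder for a per-frame emotion classifier,
    e.g. a call to a vendor API or a local model."""
    raise NotImplementedError

def analyze_video(path, step=5):
    """Apply the classifier to every `step`-th frame, treating each frame
    as an independent image; this is the single-frame principle in a nutshell."""
    capture = cv2.VideoCapture(path)
    predictions = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            predictions.append(predict_emotion(frame))
        index += 1
    capture.release()
    # Per-frame labels can then be aggregated, e.g. by counting them.
    return Counter(predictions)
```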

The results for the three datasets are in the table below, for SAVEE, AFEW, and RAVDESS respectively.

Table 1. A plus indicates that, in most videos, the algorithm detected the right emotion for more than half of the fragment’s length (F-score > 0.2). A highlighted plus indicates the best result among the four algorithms. The results are the average F-score over all video fragments for each emotion category in the dataset.

Strictly speaking, we did not measure detection accuracy. In the emotion recognition task, the algorithms had to classify emotional states into the 7 categories, which they did with some precision and recall. Instead of accuracy, we measured the F-score, the harmonic mean of precision and recall, which takes both into account. You can read a very nice Wikipedia entry on that.

Also, if an algorithm guessed the emotions at random, its F-score would be about 0.14 (one in seven). For our purposes, we set the F-score benchmark at 0.2, comfortably above random guessing.
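For readers who want to reproduce this kind of scoring, here is a minimal sketch with scikit-learn; the labels below are toy data for illustration only, not the actual comparison results.

```python
from sklearn.metrics import f1_score

EMOTIONS = ["happiness", "sadness", "surprise", "anger", "disgust", "fear", "neutral"]

# Toy annotated and predicted labels, one per video fragment (illustrative only).
y_true = ["happiness", "sadness", "neutral", "anger", "surprise", "fear", "disgust", "happiness"]
y_pred = ["happiness", "neutral", "neutral", "anger", "surprise", "sadness", "anger", "happiness"]

per_class_f1 = f1_score(y_true, y_pred, labels=EMOTIONS, average=None)

random_baseline = 1 / len(EMOTIONS)  # about 0.14 for uniform random guessing over 7 classes
threshold = 0.2                      # the cut-off behind a "plus" in Table 1

for emotion, score in zip(EMOTIONS, per_class_f1):
    mark = "+" if score > threshold else "-"
    print(f"{emotion:<10} F1 = {score:.2f} {mark}")
```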

The top-3: Happiness, Sadness, Surprise

Happiness, Sadness, and Surprise were among the best-detected emotions. One reason may be that their expressive manifestations are quite intense and clearly distinguishable, such as a smile or an open mouth. Another is that these emotional categories are usually better represented in the datasets: there are simply more data with happy, sad, or surprised people.

Microsoft clearly takes the palm in these three categories. Neurodata Lab and Amazon keep up with the leader, while Affectiva did not perform very well (it detected any emotion at all in only about two-fifths of the AFEW files).

Below are a few examples from the AFEW dataset with the emotions predicted by each algorithm.

A very happy fragment (AFEW).
Astonished (surprised) fragment (AFEW).
Sad fragment (AFEW).
What the results actually looked like for every video fragment from each dataset.

Tough nuts: Anger, Neutral, Disgust, Fear

These four states produced the most errors. Affectiva did well with acted expressions of Disgust, and Neurodata Lab, Amazon, and Microsoft performed well on the RAVDESS dataset; at the same time, no algorithm coped with the naturalistic affective data of AFEW. Almost all algorithms handled Anger, except for Affectiva. Some were occasionally good with Neutral, Microsoft performing best on emotionless neutral faces, while Affectiva does not have this category at all. Only Neurodata Lab managed to correctly recognize Fear.*

*Note that Amazon detects Confusion rather than Fear, while Neurodata Lab recognizes Anxiety instead.

Angry fragment (AFEW).
One more angry fragment (AFEW).
‘Don’t touch me’ (disgusted) fragment (AFEW)
Neutral fragment (AFEW).

The emotional rating

Let’s now have a look at what was confused with what. Since we mostly gave examples from the AFEW dataset, we will illustrate these results with confusion matrices (a short introduction helps to read them). On the left side of each matrix are the actual emotions expressed by the people; at the bottom are the predicted emotions, the results of the algorithms’ work. The darker the square, the more predictions fell into that particular category.

Confusion Matrices. First row, left to right: Neurodata Lab, Microsoft. Second row, left to right: Affectiva, Amazon.
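Here is a minimal sketch of how such confusion matrices can be produced, again with scikit-learn and matplotlib on toy labels; the matrices above were computed over the full set of AFEW annotations.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

EMOTIONS = ["happiness", "sadness", "surprise", "anger", "disgust", "fear", "neutral"]

# Toy labels for illustration; in the comparison these would be the annotated
# and predicted emotions for every AFEW fragment.
y_true = ["happiness", "anger", "fear", "neutral", "sadness", "surprise", "disgust"]
y_pred = ["happiness", "anger", "anger", "sadness", "sadness", "surprise", "anger"]

matrix = confusion_matrix(y_true, y_pred, labels=EMOTIONS)

# Rows are the actual emotions, columns the predicted ones;
# darker cells mean more predictions fell into that category.
plt.imshow(matrix, cmap="Blues")
plt.xticks(range(len(EMOTIONS)), EMOTIONS, rotation=45)
plt.yticks(range(len(EMOTIONS)), EMOTIONS)
plt.xlabel("Predicted emotion")
plt.ylabel("Actual emotion")
plt.tight_layout()
plt.show()
```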

For instance, Affectiva did not do well on the AFEW dataset. Instead of seeing disgust or surprise, the algorithm thought the people were angry, afraid, sad, or almost anything except calm. Microsoft, Amazon, and Neurodata Lab coped well with Happiness, Surprise, Sadness, and occasionally Anger. Neutral and Fear were tough for most algorithms.

Judging by the results of this short study, however limited it might be, below is a rating of the emotions from the easiest to the most difficult to predict.

Diagram 1. The rating of emotions by points scored. Left to right: from the emotions that scored the most points, and thus were recognized by the most algorithms in most cases, to the emotions that scored the least.

Consequently, we can rate the algorithms by the points they scored.

Diagram 2. The rating of algorithms by points scored. Left to right: from the algorithms that scored the most points to those that scored the least. Best results represent the best F-score for each dataset and each emotion in it.

It seems more reasonable, though, to rank the algorithms by their average F-score. Below are the results.

Table 2. Average precision, recall, F-score (F1), and accuracy for each dataset.

Conclusions & Limitations

We aimed to understand the current state of the emotion recognition technology landscape in 2019 and found out that:

  • Microsoft and Neurodata Lab are clearly among the leaders. Amazon’s technology works decently, while Affectiva, surprisingly, did not perform very well.
  • Most algorithms struggle with natural emotional expressions and perform better on acted emotions. This is especially true of Affectiva, despite it being a pioneer in the industry.
  • Happiness, Surprise, and Sadness are the easiest to detect and show the most accurate results. Disgust, Anger, and Fear, on the contrary, remain the hardest for the algorithms.
  • All the algorithms except Microsoft and Affectiva (which does not have the category) overinterpret the neutral face and tend to attribute a wide palette of emotions to it; in other words, they overdramatize.

This comparison and the results, of course, are subject to limitations. It is important to keep them in mind.

  • The emotional classes the algorithms can recognize differ slightly. Affectiva has no Neutral category, Neurodata Lab detects Anxiety instead of Fear, and Amazon detects Confusion instead.
  • The mark-up of the three datasets, that is, the emotions indicated by annotators for each video fragment, is not the ultimate truth. Emotion perception is subjective, and for emotions there is probably no such thing as ‘ground truth’. As Lisa Feldman Barrett famously argues, emotions are concepts constructed on the go.
  • Emotional expression depends on context, so the results of this short study should be transferred to other types of applications with caution. We aimed to test the algorithms’ overall ability to recognize emotions, which is why we took three general datasets. To measure the quality of the algorithms’ work on a practical task, we would need task-specific data, such as recordings from a supermarket or a call center.
  • Emotional expression also depends on the person. Two of the three datasets contained few unique people: 10 in SAVEE and 24 in RAVDESS. Only AFEW had 220, making it more representative in this respect.

What’s next for Affective Computing?

All in all, Affective Computing is at the dawn of its development. Companies and laboratories still have to learn how to deal with complex cognitive and affective states such as shame or self-confidence, not to mention the recognition of mixed, fake, or hidden emotions.

While the single-frame approach may work for simple emotions that change relatively fast, more complex states should be analyzed dynamically, observing how they gradually change over time. A lifted corner of the lips does not always mean a happy smile, and an open mouth during speech, like widened eyes, is not necessarily a signal of fear.

In general, different kinds of emotional analytics are relevant for different tasks, because not every signal is informative for every industry, and that is fine. One of the most widespread applications of emotion recognition is in call centers. What you need there is the ability to filter out annoyed customers and redirect them to staff trained to deal with difficult cases; in other words, to recognize Angry versus Not-Angry states in the voice.
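In practice, this amounts to collapsing the full set of emotion labels into a binary decision. A toy sketch, with an entirely hypothetical routing rule:

```python
# A toy routing rule: collapse the full emotion palette into Angry vs Not-Angry.
ANGRY_LIKE = {"anger"}  # the set could be broader, depending on the business rule

def route_call(predicted_emotion):
    """Hypothetical call-center logic: annoyed customers go to staff trained
    for difficult cases, everyone else stays in the regular queue."""
    return "escalation_queue" if predicted_emotion in ANGRY_LIKE else "regular_queue"

print(route_call("anger"))      # -> escalation_queue
print(route_call("happiness"))  # -> regular_queue
```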

With the analysis of multiple channels, a multimodal approach in which facial expressions, voice, and body movements are analyzed simultaneously and the data from one channel are confirmed by the data from the others, both detection accuracy and the list of recognizable emotions will grow. This will, in turn, broaden real-life applications of the technology, for instance enabling truly conversational virtual assistants, which require a sophisticated analysis of the user’s emotional state and the ability to react to it correctly from an emotional point of view.
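One simple way to combine channels is late fusion: each modality outputs its own probability distribution over the emotion classes, and the distributions are averaged. The sketch below assumes equal channel weights and made-up probabilities; real multimodal systems typically learn the fusion weights or fuse at the feature level.

```python
import numpy as np

EMOTIONS = ["happiness", "sadness", "surprise", "anger", "disgust", "fear", "neutral"]

def late_fusion(channel_probs, weights=None):
    """Average per-channel probability distributions over the emotion classes.
    `channel_probs` maps a channel name ("face", "voice", "body") to a
    distribution over EMOTIONS; equal weights are assumed unless given."""
    names = list(channel_probs)
    stacked = np.array([channel_probs[name] for name in names], dtype=float)
    if weights is None:
        weights = np.ones(len(names)) / len(names)
    fused = np.average(stacked, axis=0, weights=weights)
    return EMOTIONS[int(np.argmax(fused))], fused

# Toy example: the face channel leans towards surprise, the voice towards fear.
label, fused = late_fusion({
    "face":  [0.05, 0.05, 0.45, 0.05, 0.05, 0.30, 0.05],
    "voice": [0.05, 0.10, 0.20, 0.05, 0.05, 0.50, 0.05],
})
print(label, fused.round(3))
```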

We are looking forward to the future and are doing our best to contribute to the creation of fast, accurate, and competitive technologies that automatically recognize emotional and social behavior.

On September 3rd, 2019, we will be organizing the International Workshop on Social & Emotion AI for Industry (SEAIxI). It will be held in conjunction with one of the main events in Affective Computing and emotional technologies, the biennial International Conference on Affective Computing and Intelligent Interaction (ACII 2019), this year taking place in Cambridge, UK.

Authors: Olga Perepelkina, Chief Research Officer at Neurodata Lab, and Kristina Astakhova, Evangelist at Neurodata Lab.


Neurodata Lab

We create multi-modal systems for emotion recognition and develop non-contact methods of physiological signal processing. Reach us at contact@neurodatalab.com