The state of artificial emotional intelligence today

Artificial emotional intelligence is discussed everywhere in the media and industry at the moment. But how good is it really?

Carl Robinson
Voice Tech Podcast


By Felix Burkhardt, Director of Research at audEERING


In large parts of the world, the line that separates humans and machines is becoming increasingly blurred. This is fuelled by two trends: firstly, ubiquitous computing via smartphones, wearables, glasses and implants, and secondly, home and vehicle automation via smart speakers, interlinked home components and entertainment systems.

The explosion in human-machine communication has refocused attention on the most natural form of human communication: speech. But speech is much more than just words; speech is the expression of the soul! (if such a thing exists). Most of what we express is not defined by the words we use, but by the way we say them. As a long-forgotten Greek actor once boasted: “I can make the audience cry just by reciting the alphabet!”

The huge well of information contained in the way we say things is largely neglected by the current, so-called ‘AI’ bots such as Apple Siri, Amazon Alexa, Microsoft Cortana or Google Assistant. Without it, neither urgency, disinterest, disgust nor irony can be detected or acted upon, all of which are vital for an interaction that could be deemed “natural”.

A new and burgeoning field

This ‘emotional channel’ has long been ignored. I remember asking a salesman from a large speech technology provider about emotional speech synthesis just 15 years ago. “We concentrate on stability, and leave research to academia” was the response. This is changing now, of course, fuelled not least by the current wave of AI hype.

Emotional artificial intelligence is a comparatively new field, but one in which there is tremendous activity. Supported by a plethora of newly developed open-source components, modules, libraries and languages to extract acoustic features from audio and feed them into machine learning frameworks, any reasonably able programmer can now throw together a first prototype of an emotion-aware dialog machine in about two working days.
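As a concrete illustration of how little code such a first prototype needs, here is a minimal sketch: an open-source library extracts acoustic features from a set of labelled recordings and feeds them into an off-the-shelf classifier. The file layout, label scheme and library choices (librosa and scikit-learn) are my own assumptions for illustration, not a stack prescribed in this article.

```python
# Minimal sketch of an emotion classifier prototype. Assumptions: WAV files in
# ./data, with the emotion label encoded as the file name prefix, e.g. "angry_001.wav".
import glob
import os

import librosa
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def extract_features(path):
    """Summarise one utterance as a fixed-length acoustic feature vector."""
    signal, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    # Collapse the time axis: mean and standard deviation of each coefficient.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


paths = glob.glob("data/*.wav")
labels = [os.path.basename(p).split("_")[0] for p in paths]
features = np.array([extract_features(p) for p in paths])

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0)

clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```

Whether such a toy model is of any practical use is another matter, as the rest of this article discusses.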

In addition to many SMEs, all the big companies like Amazon, Microsoft, IBM and Apple already have solutions for facial emotion recognition on the market. They surely have internal voice AI research and development projects underway for emotion recognition from speech too. Many smaller voice emotion analytics companies offer services for sentiment detection from text, bio-signal analysis, and audio analysis. But does the technology deliver on the promises of the marketers?

Applications of emotionally intelligent AI

The applications are manifold.

Emotion recognition and speech analytics can help with automated market research, a field in which many companies already offer services that monitor target groups interacting with a new product while objectively measuring their emotional reactions.

Stress or tiredness detection can help to make automobile traffic safer. Interest and boredom detection are obvious candidates for e-learning software. Speaker classification algorithms can help make automated dialogs more humanlike.

Additional applications of emotion AI include automated security observation, believable characters in gaming, and acting training for professionals such as salespeople and politicians.

Another vast field of possible applications is the health-care and wellbeing industries; monitoring emotional expression can help us to understand our own minds and those of others, and aid in therapeutic treatments. Robots with emotions have already started appearing in hospitals.

There are far less obvious applications of computer emotions already on the market; for example, pay-per-laugh pricing for comedies in a cinema.

Drawbacks of emotional artificial intelligence systems

I remember, about 12 years ago when I programmed my first emotional dialogs, the weird moment when my very simple if-then-else dialog suddenly appeared to be intelligent — all because I had added an unpredictability layer due to the erroneous detection of my own emotional state.

Increased expectations

First of all, what should an AI-driven dialog system do with the information about the user’s emotional state? A system reacting to my emotions seems more intelligent than a dumb toaster that ignores the urgency in my voice. But can it stand up to the increased expectations placed upon it?

Restricted domains

Symbolic AI, that is, modelling the world with rule-based expert systems, is to this day only successful in very limited domains. The same is true for systems that are based on machine learning: the world is just too complex, even a very small part of it, to be modelled by an artificial neural network or a support vector machine.

Remember: everything that can happen will happen eventually, and while some events might be rare, there is a very large number of them. The world is chaotic by nature and eludes models! A promising research avenue that could combine the best of both worlds is ontology-based machine learning.


Ethical considerations

Another issue to be conscious of is the ethical consequences of emotion detection technology: there are thousands of definitions of emotion, but most agree that emotional expression is something humans are not conscious of, cannot control directly and, in many cases, don’t want to have advertised. So we have to be very careful how we use these systems if we don’t want to take yet another step towards the world envisaged by George Orwell…

Trade-offs when designing emotion AI

Emotion-aware technology is based on machine learning, which means it is fuelled by data from human interactions. This means there are several trade-offs to be aware of.

Noisy real world data VS clean laboratory data

Acted or elicited laboratory data is of excellent quality, but is of very limited significance for real-world emotions, which are difficult to harvest due to privacy issues and are by definition full of noise and unexpected events. There is tremendous variability in the acoustic conditions of existing audio emotion databases, which makes it difficult to simply combine them all in one big unified model.
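One common mitigation, sketched below purely as an illustration (it is not a recipe from this article), is to standardise features separately per corpus, or per speaker, before pooling databases, so that differences in recording conditions do not dominate what the model learns. The corpus names and data are invented.

```python
# Sketch: z-score normalisation applied per corpus before pooling databases
# with very different acoustic conditions.
import numpy as np


def normalise_per_corpus(features_by_corpus):
    """features_by_corpus maps corpus name -> (n_samples, n_features) array."""
    pooled = []
    for feats in features_by_corpus.values():
        mean = feats.mean(axis=0)
        std = feats.std(axis=0) + 1e-8  # avoid division by zero
        pooled.append((feats - mean) / std)
    return np.vstack(pooled)


corpora = {
    "lab_acted": np.random.randn(120, 26) * 0.5 + 1.0,   # clean, acted speech
    "call_centre": np.random.randn(80, 26) * 2.0 - 0.3,  # noisy field recordings
}
X = normalise_per_corpus(corpora)
print(X.shape)  # (200, 26)
```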

Size of data sets VS cost of labelling

There’s a famous quote which illustrates the emotion definition dilemma quite well: “Everyone except a psychologist knows what an emotion is” (Kleinginna & Kleinginna 1970). Usually what we do is ask humans to manually label the data according to some given categorical system. This is a very costly procedure, as huge amounts of data are needed to train machine learning systems that generalize well enough to handle conditions in “the wild world outside my lab”.
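As a small illustration of that labelling step (the clip names, raters and categories below are invented), the judgments of several annotators are typically collapsed into a single “gold” label, for instance by majority vote, with ambiguous items discarded:

```python
# Sketch: reduce several annotators' labels to one gold label by majority vote.
from collections import Counter

ratings = {
    "clip_001": ["angry", "angry", "neutral"],
    "clip_002": ["happy", "sad", "neutral"],   # no majority -> discarded
    "clip_003": ["sad", "sad", "sad"],
}

gold = {}
for clip, votes in ratings.items():
    label, count = Counter(votes).most_common(1)[0]
    if count > len(votes) / 2:   # require an absolute majority
        gold[clip] = label

print(gold)  # {'clip_001': 'angry', 'clip_003': 'sad'}
```

The discarded items are exactly where annotators disagree, which is part of why labelled data of useful size is so expensive.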

Complexity of context VS generalisability of model

But these are just the first questions a prospective emotion-detecting engineer would encounter; the intermingling of emotion, mood and personality further confuses the matter. How many emotional states are there at any given time? One? Two? More? How do I sound, being an extrovert, having just learned I failed my exam, but being freshly in love? Can I learn to detect the emotions of a Haitian dentist if I’m a German carpenter? If there is a difference between the genders, how does it show in their affective expressions? Then there are the questions of how long an emotion endures, how to split the data, how to model emotion transitions…

Advice for training emotion AI models

On the bright side, most of these issues are not exclusive to emotion detection but concern machine learning in general. There are many ideas for tackling them using unsupervised or semi-supervised learning, innovative architectures inspired by evolutionary models, and subcategorizing parameters for better generalization, to name but a few.

When faced with all these challenges, it is best to start small, keep your expectations realistic and stick to a limited domain as defined by your application. Learn from the data that is produced by your system, and define your emotional models according to the requirements of the use case scenario.

But wait: which system? It hasn’t been built yet! A way out of the classic chicken-and-egg problem is the so-called Wizard-of-Oz scenario, in which a concealed human mimics the system behavior in order to provoke user input to the system.

Another option is to start by training the system with data gathered from another application that is similar with respect to acoustic conditions and targeted emotional expression, or to begin by running a rule-based system for “friendly users”. In any case, each application should incorporate a feedback loop in order to get better with use.
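Such a feedback loop can be as simple as collecting user-corrected predictions and periodically updating the model with them. The sketch below uses scikit-learn's incremental SGDClassifier purely to illustrate the idea; the class list, feature size and batch threshold are my own assumptions, not recommendations from this article.

```python
# Sketch of a feedback loop: the classifier is updated in small batches from
# user corrections gathered while the application runs.
import numpy as np
from sklearn.linear_model import SGDClassifier

CLASSES = ["neutral", "angry", "happy", "sad"]
model = SGDClassifier()
# Seed the model with whatever initial data exists (random placeholders here).
model.partial_fit(np.random.randn(50, 26),
                  np.random.choice(CLASSES, size=50),
                  classes=CLASSES)

corrections = []  # (feature_vector, label_confirmed_by_user) pairs


def on_user_feedback(features, true_label):
    """Store a corrected example and retrain once enough have accumulated."""
    corrections.append((features, true_label))
    if len(corrections) >= 20:
        X = np.vstack([f for f, _ in corrections])
        y = [label for _, label in corrections]
        model.partial_fit(X, y)
        corrections.clear()
```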

How good can artificial emotional intelligence be?

A number of scientific benchmarks have been run in the research world over the last decade that give an estimate of achievable system accuracy, starting with the 2009 Interspeech Emotion Challenge and continuing with the first Audio-Visual Emotion Challenge (AVEC 2011). Since then, seven annual AVEC challenges have taken place, and the Interspeech series revisited emotional speech recognition in 2013.

Meanwhile, challenges that consider media material such as film clips have appeared since 2013, namely the annual Emotion in the Wild Challenge (EmotiW14) and the new Multimodal Emotion Challenge (MEC 2016 and 2017). While progress is not directly comparable, as mostly different databases were used in the challenges, it can be noted that the underlying databases evolved from ones created in the laboratory to more realistic data harvested “in the wild”. Also, new techniques like sophisticated artificial neural network architectures and data augmentation have led to more stable results, not to mention the increase in computing power from the newly found application of GPUs.

Furthermore, some rules of thumb can be applied:

  • The accuracy of a classification task depends, of course, on the number of classes, and can be expected to be around twice the chance level (I speak of “real world” test data, not laboratory data hand-collected by the system designer); see the short calculation after this list.
  • Aspects of emotional expression that influence the speech production apparatus directly, such as the level of arousal, are much more easily detected than, for example, valence, which is more easily detected from facial expressions.
  • Fusion of results from different modalities can help, and some of these additional modalities can even be derived directly from the acoustic signal, such as analysis of the linguistic content or estimating the pulse from fluctuations in the voice.
  • Compared to a group of human labellers, machine classification results can be expected to be at least as good as those of a well-performing human.
  • Strong and clear expressions of emotion will always be much better recognized than weak and obscure signals.
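The short calculation referenced in the first rule of thumb: with N roughly balanced classes the chance level is 1/N, so the rule of thumb suggests accuracies in the region of 2/N on realistic test data. The class counts below are merely illustrative, not results from any particular system.

```python
# Illustrative arithmetic for the "roughly twice chance level" rule of thumb.
for n_classes in (4, 6, 7):
    chance = 1.0 / n_classes
    rule_of_thumb = 2.0 * chance
    print(f"{n_classes} classes: chance {chance:.0%}, rule of thumb ~{rule_of_thumb:.0%}")
```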

Conclusions

So should you use emotional awareness in your system? By all means, yes! There is still a lot to learn, and what is currently called AI does not really deserve the attribute “intelligent”, but ignoring the vast richness of emotional expression in human-machine communication does not help in any way.

Be aware of the pitfalls, avoid unrealistic expectations, and be sure to make your system transparent to the user. We have just started on a long journey and won’t get anywhere if we don’t take the first step. Clearly, there are already many applications in specific domains that greatly benefit from affective analysis, with many more to come.

Originally published at voicetechpodcast.com on November 18, 2018.
