By Madhumitha Loganathan
Conversational agents and chatbots are becoming smarter by the day. AI advancements have made it easy for them to mimic our language, to the point that chatbots are convenient, provide instant support, are easy to communicate with, and are available 24/7. But they still lack a basic yet important quality: empathy.
When people talk to a virtual assistant or chatbot, they end up feeling misunderstood because the bot does not understand emotional context. Words and actions are driven by human emotion, and our virtual friends need the ability to understand emotions and react appropriately.
And with bots becoming more and more intelligent, customers expect a corresponding increase in emotional intelligence and personalization. Imagine, for example, if your chatbot could sense your confused expression, convey good news in a happy tone, or play music to help brighten up your mood.
So how can we make our bots “emotionally intelligent?”
By applying emotion sensing and emotion synthesis techniques to enable them to be more friendly and human-like in conversations.
In my experiment, I used emotion analysis of text, natural language understanding of conversations, and emotional speech synthesis to create an empathetic customer assistant that built a stronger emotional connection with our customers. The goal is not only to fix the bots, but also to make people feel better, by giving our virtual bots the ability to deeply understand text, voice, and facial cues.
To give a bit of context — text-based emotion analysis is very helpful in chatbot conversations and in various other places where text is the primary mode of communication. Words are expressions of emotion, and recent advances in text analysis have made it possible to detect not just sentiment, but also the underlying emotion, communication style, and social propensity.
IBM Watson’s Tone Analyzer service, for example, uses linguistic analysis to detect joy, sadness, fear, and anger, as well as analytical, confident, or tentative tones in text. Several other companies, like Perceive and Bitext, also provide services for emotion recognition from text.
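To make the idea concrete, here is a toy lexicon-based emotion scorer. The word lists and scoring are made up for illustration; production services like Tone Analyzer use trained language models, not keyword matching.

```python
# A minimal sketch of text-based emotion detection: count how many
# emotion-cue words appear in the input and normalize by length.
# The lexicon below is a hypothetical example, not a real resource.
import re
from collections import Counter

EMOTION_LEXICON = {
    "joy": {"great", "happy", "love", "wonderful", "congratulations"},
    "sadness": {"sad", "sorry", "unfortunately", "miss", "lost"},
    "anger": {"angry", "furious", "terrible", "unacceptable", "worst"},
    "fear": {"afraid", "worried", "scared", "nervous", "anxious"},
}

def detect_emotion(text):
    """Return (emotion, score) pairs sorted by score, highest first."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    scores = Counter()
    for emotion, cues in EMOTION_LEXICON.items():
        hits = len(words & cues)
        if hits:
            scores[emotion] = hits / len(words)  # crude normalization
    return scores.most_common()

print(detect_emotion("I am so angry, this is the worst service"))
# top result is "anger" (two cue words matched)
```

A real service would also return communication style and social propensity scores; the shape of the output (ranked emotions with confidences) is the part that carries over.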
It’s not only what you say, but how you say it that matters.
Research by psychologist Albert Mehrabian suggests that verbal, vocal, and facial cues contribute 7%, 38%, and 55%, respectively, to the effect of a message as a whole. The non-verbal components of an assertive message are really the key to its effectiveness.
An individual’s vocal pitch, energy, amplitude, frequency, pauses, and other key elements are used to assess emotional state. This differs from the linguistic and semantic information in the text; vocal analysis helps in phone conversations, where body language cannot be assessed. Emotions like anger and sadness can be detected more easily from voice than from text or facial cues.
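The low-level features mentioned above can be sketched in a few lines. The example below computes short-time energy and a crude autocorrelation-based pitch estimate on a synthetic tone; real systems extract hundreds of such features from actual speech, so treat this as an illustration of the inputs an emotion model might consume, not a working analyzer.

```python
# Illustrative vocal-feature extraction on a synthetic 220 Hz tone:
# per-frame energy plus a pitch estimate from the autocorrelation peak.
import numpy as np

SR = 16000  # sample rate in Hz (an assumed value)

def frame_energy(signal, frame_len=400):
    """Mean energy per non-overlapping frame."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    return (frames ** 2).mean(axis=1)

def estimate_pitch(signal, fmin=80, fmax=400):
    """Estimate fundamental frequency from the autocorrelation peak
    within the plausible human pitch range [fmin, fmax]."""
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1 :]
    lo, hi = SR // fmax, SR // fmin          # lag range to search
    lag = lo + np.argmax(corr[lo:hi])
    return SR / lag

t = np.arange(4000) / SR                     # 0.25 s of audio
tone = np.sin(2 * np.pi * 220 * t)           # 220 Hz test tone
print(estimate_pitch(tone))                  # close to 220 Hz
print(frame_energy(tone).mean())             # close to 0.5 for a sine
```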
Companies like Beyond Verbal and Vokaturi infer the emotional state of the user by decoding voice patterns, either from audio recordings or in real time.
Although every face has its own way of presenting emotions, there are seven universal expressions (anger, contempt, disgust, fear, happiness, sadness, and surprise) that are common across people regardless of their age, ethnicity, language, or religion. Facial emotion analysis tools analyze every identified face for a range of emotions and with a confidence score across a set of emotions.
Affectiva, Kairos, and Microsoft have developed services for facial emotion analysis, some even enabling real-time analysis from a live camera feed. These face readers also track eye gaze and head orientation.
Like facial cues, body gestures and movements also convey affective information. Crossed arms, body leaning forward, and hands on head are examples of body gestures that can be related to an individual’s emotional feeling, and several companies are researching this space.
In chatbots, where multiple input channels are available, you may think that combining the analysis of different modalities will give us better results. This is true, as long as the emotion is consistent across all channels. In the case of sarcasm, a user’s facial expression may suggest the person is happy while the verbal analysis reads as sad or angry.
Consider the word “fine.” The meaning of this one word can change drastically based on how it’s said. Hence, multimodal detection, if not combined accurately, could lead to inaccurate results.
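One way to combine modalities carefully is late fusion with a consistency check: when text, voice, and face disagree (as in sarcasm), the confidence of the fused result should drop rather than the channels being blindly averaged. The weights below are illustrative assumptions, not values from any published model.

```python
# Sketch of weighted late fusion over per-modality emotion labels.
def fuse_emotions(text_emotion, voice_emotion, face_emotion,
                  weights=(0.2, 0.4, 0.4)):
    """Return (emotion, confidence). Confidence is 1.0 only when all
    modalities agree; on conflict it reflects the winning weight share."""
    readings = [text_emotion, voice_emotion, face_emotion]
    if len(set(readings)) == 1:
        return readings[0], 1.0          # all modalities agree
    tally = {}
    for emotion, w in zip(readings, weights):
        tally[emotion] = tally.get(emotion, 0.0) + w
    best = max(tally, key=tally.get)
    return best, tally[best]

print(fuse_emotions("happy", "angry", "angry"))  # voice+face outvote text
print(fuse_emotions("happy", "happy", "happy"))  # unanimous, confidence 1.0
```

A downstream dialogue manager could treat low-confidence results as a cue to ask a clarifying question instead of reacting to the wrong emotion.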
Similar to using voice, text, and facial cues for identifying affect, the same modalities can be used to make meaningful, contextual responses.
Speech synthesizers generally output text in a neutral, declarative style and monotonic fashion. By changing the patterns of stress and vocal intonations in speech, different kinds of emotions can be expressed. Vocal intensity increases for anger and decreases for sadness; speech rate is faster for anger, happiness and fear than for sadness.
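These intensity and rate patterns can be approximated with the standard SSML `prosody` element. The mapping below is an assumption loosely following the findings above (faster and louder for anger, slower and softer for sadness), not a calibrated model.

```python
# Map an emotion to SSML prosody attributes and wrap the text.
from xml.sax.saxutils import escape

# Illustrative emotion-to-prosody mapping (an assumed design choice).
PROSODY = {
    "anger":     {"rate": "fast", "volume": "loud"},
    "happiness": {"rate": "fast", "volume": "medium"},
    "fear":      {"rate": "fast", "volume": "soft"},
    "sadness":   {"rate": "slow", "volume": "soft"},
}

def emotional_ssml(text, emotion):
    """Wrap text in SSML, adding prosody for known emotions."""
    attrs = PROSODY.get(emotion)
    if attrs is None:
        return f"<speak>{escape(text)}</speak>"   # neutral default
    return (f'<speak><prosody rate="{attrs["rate"]}" '
            f'volume="{attrs["volume"]}">{escape(text)}</prosody></speak>')

print(emotional_ssml("I am sorry for the delay.", "sadness"))
```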
Speech Synthesis Markup Language (SSML) is an XML-based markup language for speech synthesis. Using SSML tags, one can customize pitch, speaking rate, volume, and pausing, specify pronunciations, and add or remove emphasis to generate natural-sounding speech. Alexa and Google Assistant support the use of SSML for constructing output speech, and IBM has introduced expressive SSML tags in their text-to-speech services.
As in the example below, one can give the generated audio a positive, regretful, or neutral tone by setting the expected expression in the "express-as" tag.
<express-as type="GoodNews">Congratulations on your wedding!</express-as>
By using the right choice of words and adding facial expressions with adequate intensity, we can make bot avatars empathetic, expressive, and socially acceptable.
The empathetic chatbot assistant uses text-based affect analysis tools for recognizing emotion, IBM’s Expressive SSML for expressive speech synthesis, and Motion Portrait’s avatar and animations for facial expressions. Check out my explanation below for more details on the experiment!