Emotionally Aware Voice Bots

Sam Bobo
Speaking Artificially
7 min read · Nov 30, 2023
Image Created by Microsoft Designer with the following prompt: “A series of three bots, one happy, one sad, and one angry”

Landing in Buenos Aires, Argentina for your honeymoon and waiting for your luggage to arrive, you notice after a while that your bags have not landed on the carousel. Frustrated, you pick up the phone, dial the airline's customer service number, and explain the situation to the voice bot. Upon voicing your frustration, the automated bot exclaims in a delightful, joyous voice, "Sure, let me help you fix your problem." The conversation continues for a few back-and-forth exchanges ("turns"), and with each turn you grow ever more frustrated, yet the bot carries on in a joyful tone that is apparent in its synthetic inflection. By the time you reach a customer service agent, anger has imbued your voice; calm rationality gives way, and you tear into the agent, misplacing your anger at the bot onto the human.

Unfortunately, that experience happened to me, and this mismatched synthetic tone between a voice bot and the situation is all too common in Interactive Voice Response ("IVR") applications. The truth of the matter is that this was largely unavoidable in the "Conversational Intelligence" era of Artificial Intelligence, in which systems were trained using supervised machine learning (for example, by providing a corpus of examples used as ground truth: in this case, a voice actor's recordings and the script of those recordings). Conversational systems were limited by the logic the conversational designer put in to achieve a specific outcome, either routing a customer to the appropriate agent or fulfilling the request through self-service, with the latter being the end goal. That, however, does not tackle the voice aspect.

Many text-to-speech voices possess two levels of inherent prosodic control: (1) the natural prosodic footprint of the voice talent, native to the way they speak, and (2) markup, via the Speech Synthesis Markup Language (SSML), that wraps the intended textual output of the voice with additional modifications. Using SSML, conversational designers can modify the pitch, timbre, prosody, speed, and other elements of a voice to tailor its natural output to the experience they intend to provide to the end customer. SSML, however, is applied either at the global level or within a single response node of a conversational design. This type of markup can also aid in the pronunciation of specific terms of art, using custom lexicons and other phonetic rules that the speech synthesis engine applies at runtime.
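For illustration, here is a minimal sketch of an SSML document in the shape Azure's Neural voices accept. The voice name and the outer <speak>/<voice> wrapper are my own example choices, not prescriptions; the later examples in this post show only the inner <prosody> fragment:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <prosody rate="slow" pitch="-10%" volume="medium">
      I'm sorry to hear about your bags. Let me help.
    </prosody>
  </voice>
</speak>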

Previously, there was no way to tackle this mismatch between the customer's emotions and an appropriately empathetic response from the voice bot, meaning that voice bots lacked emotional intelligence. Until now!

In a previous post, I postulated the idea of creating an emotionally intelligent bot that had the following dynamics:

  1. The bot analyzed the sentiment of the end user's inbound utterance
  2. That sentiment then fueled the appropriate generated response (using Generative AI capabilities, guardrailed with appropriate blockers to prevent unintended responses, such as saying something outside the bot's intended scope or something offensive)
  3. The sentiment further fed into the LLM's code-generation ability (learned from the coding examples in its training data) to dynamically create SSML tags that modify the aforementioned properties to closely match an appropriate response style (see the sketch after this list)
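To make that last step concrete, here is a toy Python sketch of the sentiment-to-prosody idea. This is not the implementation described below (there, the LLM generates the SSML itself in a single pass); the mapping table and function name are hypothetical:

PROSODY = {
    "negative": {"rate": "slow", "pitch": "-10%", "volume": "medium"},
    "neutral": {"rate": "medium", "pitch": "0%", "volume": "medium"},
    "positive": {"rate": "medium", "pitch": "+5%", "volume": "loud"},
}

def wrap_in_ssml(reply: str, sentiment: str) -> str:
    # Look up prosody settings for the detected sentiment and wrap the reply.
    p = PROSODY.get(sentiment, PROSODY["neutral"])
    return (f'<s><prosody rate="{p["rate"]}" pitch="{p["pitch"]}" '
            f'volume="{p["volume"]}">{reply}</prosody></s>')

print(wrap_in_ssml("I'm truly sorry to hear about your lost luggage.", "negative"))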

In order to create this bot, I loaded an instance of Microsoft Azure and created the following resources:

  1. Azure OpenAI — the OpenAI playground granting access to OpenAI's GPT models, DALL-E, Whisper, and others
  2. Azure AI Speech — text-to-speech (Neural) and speech-to-text capabilities for vocal interaction with my bot

Within the playground environment, I started by defining the system prompt. I used instructive prompting to:

  • Assign a persona and role — the bot is a call center agent and audio engineer
  • Define a skillset — enumerate the bot's skills, such as SSML markup and sentiment analysis
  • Provide instructions — define the expected input from the user, the process the bot should undergo, and the expected output; in this case, analyzing sentiment and crafting a response to the user with SSML markup wrapped around it

My system prompt was:

You are an audio engineer highly skilled in writing SSML code and computing sentiment from messages. You will be acting as a call center agent. You will receive a message from a customer. Your task is to analyze the sentiment and use that sentiment as context to respond to the customer. The output should be the response wrapped in SSML tags such as speaking rate, volume, pitch, and timbre that matches the tone of the response.

I then proceeded to provide an example input and output to the bot, known in the industry as "few-shot" learning (or "one-shot" learning in my scenario, since I provided a single example):

User: My luggage bags were lost upon arrival and I can not locate them.
Assistant: <s><prosody rate="slow" pitch="-10%" volume="medium">I'm truly sorry to hear about your lost luggage. I can imagine how frustrating that must be. Please don't worry; we're here to assist you in locating your bags. To help us, could you please provide me with your flight details and a description of your luggage? We'll do our best to resolve this situation for you.</prosody></s>

For any technical integrators reading this blog, the JSON structure would be:

[
  {
    "role": "system",
    "content": "You are an audio engineer highly skilled in writing SSML code and computing sentiment from messages. You will be acting as a call center agent. You will receive a message from a customer. Your task is to analyze the sentiment and use that sentiment as context to respond to the customer. The output should be the response wrapped in SSML tags such as speaking rate, volume, pitch, and timbre that matches the tone of the response."
  },
  {
    "role": "user",
    "content": "My luggage bags were lost upon arrival and I can not locate them."
  },
  {
    "role": "assistant",
    "content": "<s><prosody rate=\"slow\" pitch=\"-10%\" volume=\"medium\">I'm truly sorry to hear about your lost luggage. I can imagine how frustrating that must be. Please don't worry; we're here to assist you in locating your bags. To help us, could you please provide me with your flight details and a description of your luggage? We'll do our best to resolve this situation for you.</prosody></s>"
  }
]

Finally, I set the parameters of the bot, such as temperature and Top P, to be conservative, limiting the randomness of the responses and making them more accurate (the converse would make the bot more creative, which is more conducive to brainstorming); a sketch of the full API call follows this list.

  • Temperature — Controls randomness. Lowering the temperature means that the model will produce more repetitive and deterministic responses. Increasing the temperature will result in more unexpected or creative responses.
  • Top P — Similar to temperature, this controls randomness but uses a different method. Lowering Top P will narrow the model’s token selection to likelier tokens. Increasing Top P will let the model choose from tokens with both high and low likelihood.
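For integrators, here is a minimal sketch of the full request using the openai Python package (v1+) against an Azure OpenAI deployment. The endpoint, key, API version, and deployment name are placeholders for your own resource, the conservative temperature and Top P values are illustrative, and the new user turn is the conversation opener used in the next section:

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2023-05-15",
)

messages = [
    {"role": "system", "content": "You are an audio engineer..."},  # full system prompt shown above
    {"role": "user", "content": "My luggage bags were lost upon arrival and I can not locate them."},
    {"role": "assistant", "content": "<s><prosody rate=\"slow\" pitch=\"-10%\" volume=\"medium\">...</prosody></s>"},  # one-shot example above
    {"role": "user", "content": "My wife and I are planning a romantic trip to Venice Italy for our engagement. What activities do you recommend?"},
]

response = client.chat.completions.create(
    model="my-gpt-deployment",  # your Azure OpenAI deployment name
    messages=messages,
    temperature=0.2,  # conservative: fewer random responses
    top_p=0.5,        # narrows token selection to likelier tokens
)
print(response.choices[0].message.content)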

Trying out my bot (a voice bot, if you use the Azure AI TTS and STT features), I entered the following start to the conversation:

My wife and I are planning a romantic trip to Venice Italy for our engagement. What activities do you recommend?

The Output:

<s><prosody rate="medium" pitch="0%" volume="loud">Congratulations on your engagement! Venice is a beautiful city and perfect for a romantic getaway. There are many activities you can enjoy with your partner, such as taking a gondola ride through the canals, visiting the famous St. Mark's Basilica, exploring the Peggy Guggenheim Museum, or taking a stroll through the charming streets of the city. For a more intimate experience, I recommend booking a private boat tour of the city or enjoying a romantic dinner at a local restaurant. Whatever you choose, I'm sure you'll have a wonderful time in Venice.</prosody></s>

Notice that the SSML tags were properly generated, similar to the example I provided at the system level. The SSML settings for rate, pitch, and volume are characteristic of an excited voice, the appropriate response to news of an engagement. Furthermore, the generated text both acknowledged the engagement and provided recommendations.

While I am not normally a troll in any way, let's completely change our mood and state that the recommendations the bot provided were terrible. Whereas the former input had an (obviously) positive sentiment, we will now shift to an extremely negative sentiment and see how the bot responds.

Input

What terrible recommendations. I hate all of these.

Output

<s><prosody rate="slow" pitch="-5%" volume="medium">I'm sorry to hear that my previous recommendations did not meet your expectations. Can you please provide me with more information about your preferences? This will help me better understand what you're looking for and provide you with more suitable recommendations.</prosody></s>

What was demonstrated was exactly the type of response I would expect. The speaking rate slowed down, the pitch got lower, and the volume decreased to match the fact that I was extremely frustrated with the response.

Even more impressive, when speech synthesis (text-to-speech) is enabled with one of Microsoft's Neural voices (I chose JennyMultilingualV2), the SSML is rendered and interpreted by the system: only the actual text is read aloud, with the modifications interpreted and applied!
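As a brief sketch of that synthesis step with the Azure Speech SDK for Python: the subscription key and region are placeholders, the exact multilingual voice identifier may differ in your region, and the model's <s>/<prosody> fragment is wrapped in the <speak>/<voice> envelope the service expects:

import os
import azure.cognitiveservices.speech as speechsdk

# SSML fragment produced by the model, wrapped for the Speech service.
fragment = '<s><prosody rate="slow" pitch="-5%" volume="medium">I\'m sorry to hear that my previous recommendations did not meet your expectations.</prosody></s>'
ssml = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    '<voice name="en-US-JennyMultilingualV2Neural">' + fragment + '</voice>'
    '</speak>'
)

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"], region=os.environ["SPEECH_REGION"]
)
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)  # default speaker output
result = synthesizer.speak_ssml_async(ssml).get()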

To summarize the ease with which this bot was created:

  1. The resources all fell within Microsoft’s free Azure tiers, so the cost to develop was $0
  2. The bot development took me 15 minutes of experimentation

This certainly is a breakthrough (please do not overhype it; it's not a massive breakthrough, but one that should be celebrated in some regard) whereby voice bots can gain some emotional intelligence. Certainly, Emotion AI as a field needs to develop further before we truly achieve bots with EQ (Emotional Quotient). I encourage you to experiment with OpenAI, and Generative AI more broadly, to come up with new use cases that we have not yet imagined! Happy coding!


Sam Bobo
Speaking Artificially

Product Manager of Artificial Intelligence, Conversational AI, and Enterprise Transformation | Former IBM Watson | https://www.linkedin.com/in/sambobo/