Vocalizing Emotional Intelligence and the Creation of Dynamic Voice Bots

Sam Bobo
Speaking Artificially
5 min read · Sep 12, 2023
Imagined by Bing Image Creator — Powered by DALLE

It’s time to face the “faceless” reality: contact center automation is rapidly accelerating, and systems are working tirelessly to contain you within either the voice bot or the chat bot interface instead of transferring you to a live representative. Personally, I find it extremely frustrating that a simple phone number cannot be found online to at least connect through an IVR, but here we are. Not to mention that when I do reach a voice system and issue a complaint, the voice playback is pleasant, which sometimes puts me further into a bad mood. Can anyone relate?

Let’s catalog the “as-is” journey for callers to a voice bot:

  • Whenever one enters through a voice channel, even with advanced neural text-to-speech capabilities, the “faceless” voice persona speaks in a single style.
  • Traditionally today, within a state manager / dialog-based application that employs pre-defined prompts, responses, and logic, Speech Synthesis Markup Language (“SSML”) can be applied to change the style of the synthesized speech played back by the voice application. A style contains pre-defined configurations for the voice that change its volume, speed, prosody, timbre, and more (a sketch follows this list).
  • Today, thanks to solutions such as Nuance Mix Answers, we know that Generative AI can dynamically create responses to a user grounded in a knowledge source such as a website or broader corpora (recently known as Retrieval-Augmented Generation). However, the way the voice bot speaks a response within the application remains static, at least in the dynamic content generation setting.
  • Code-generation models behind Generative AI (such as OpenAI Codex) and transfer learning thereof enabled the generation of code and markup.
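
To make the styling point concrete, here is a minimal sketch of what such markup can look like, assuming Azure Neural TTS and its proprietary mstts:express-as extension; the voice name, style value, and prosody settings below are illustrative assumptions, not taken from a specific product configuration.

# A minimal sketch of SSML style markup, assuming Azure Neural TTS and its
# proprietary mstts:express-as extension; the voice name, style, and prosody
# values are illustrative, not from the original post.
ssml_example = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="empathetic">
      I completely understand, and I am here to help.
    </mstts:express-as>
    <prosody rate="-10%" pitch="-2st">
      Let me walk you through the next steps.
    </prosody>
  </voice>
</speak>"""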

As a Product Manager of Conversational AI engines such as speech recognition and text-to-speech, I postulated that Generative AI could be used to “magically” equip voice personas with Emotional Intelligence, dynamically changing speaking style based on what the caller said. It turns out, in a simple Proof of Concept, that ChatGPT can do just that.

To facilitate a sample interaction, I started by constructing a prompt template to initialize the agent with context pertinent to playing a specific role and the type of knowledge required to complete the request. In actuality, I requested the agent play two roles, each with varying skillsets; a sketch of the assembled prompt follows the summary below.

In summary:

  • Persona 1 (Audio engineer): “You are an audio engineer named Eric. You have knowledge of SSML markup. You also understand Sentiment Analysis and can deduce the sentiment from a statement you see.”
  • Persona 2 (Call Center Agent): “You are also an agent as part of a call center named Stacy who is responding to customer inquiries.”
  • Instruction: “I will provide a statement the customer would say. As Stacy, craft the response, taking the sentiment into consideration. As Eric, use the sentiment to change the SSML prosody and style tags, plus any you deem appropriate to mark up the response. You should have a professional, yet conversational writing style. Show your work as both Eric and Stacy and your reasoning.”
  • Utterance: “I just bought a flat screen TV for my new house and it came cracked upon arrival. I am hosting a housewarming party in two days. Can you help me remedy the situation?”
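
For reference, here is a sketch of how this two-persona prompt could be assembled programmatically in the OpenAI chat-message format. The experiment itself used the ChatGPT web interface, so the message structure below is an assumption for illustration; the prompt text is taken verbatim from the summary above.

# A sketch of the two-persona prompt assembled as chat messages; the
# programmatic framing is an assumption (the original POC used the
# ChatGPT web interface), but the prompt text matches the post.
SYSTEM_PROMPT = (
    "You are an audio engineer named Eric. You have knowledge of SSML markup. "
    "You also understand Sentiment Analysis and can deduce the sentiment from "
    "a statement you see. You are also an agent as part of a call center named "
    "Stacy who is responding to customer inquiries. I will provide a statement "
    "the customer would say. As Stacy, craft the response, taking the sentiment "
    "into consideration. As Eric, use the sentiment to change the SSML prosody "
    "and style tags, plus any you deem appropriate to mark up the response. "
    "You should have a professional, yet conversational writing style. "
    "Show your work as both Eric and Stacy and your reasoning."
)

utterance = (
    "I just bought a flat screen TV for my new house and it came cracked upon "
    "arrival. I am hosting a housewarming party in two days. Can you help me "
    "remedy the situation?"
)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": utterance},
]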

I used two types of Prompting:

  1. Instructive Prompting — This method involves providing the model with a clear and detailed instruction that specifies the task, the input, the output, and any other relevant information. The idea is to reduce ambiguity and confusion for the model by telling it exactly what to do and how to do it. For example, if the task is to write a blog post about productivity, the instructive prompt might include the topic, the audience, the tone, the length, and some keywords.
  2. Chain-of-Thought Prompting — This method involves breaking down a complex task into a series of intermediate steps and guiding the model through each step with a prompt. The idea is to elicit reasoning and problem-solving skills from the model by providing it with a chain of thought that leads to the final answer. In this scenario, I requested the sentiment categorization, an associated score from -1 to 1 (implied from knowledge of Sentiment Analysis but not explicitly instructed), and the type of language and/or speaking style to apply (see the sketch after this list).
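
To make the contrast concrete, here is a hypothetical sketch of the two styles as prompt fragments; the wording below is my own illustration, not the exact text used in the experiment.

# A hypothetical illustration of the two prompting styles; the wording is
# illustrative, not the exact prompts from the experiment.
instructive_prompt = (
    "Classify the sentiment of the customer statement as positive, neutral, "
    "or negative, then write a two-sentence reply in a professional, "
    "conversational tone."
)

chain_of_thought_prompt = (
    "Step 1: Categorize the sentiment of the customer statement.\n"
    "Step 2: Assign a sentiment score from -1 (very negative) to 1 (very positive).\n"
    "Step 3: Choose language and a speaking style appropriate for that score.\n"
    "Step 4: Write the reply, then mark it up with SSML reflecting that style."
)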

Below is the response I received from ChatGPT, with the associated code copied in for formatting:

<speak>
<s style="mood: empathetic;">
I'm truly sorry to hear that your new flat-screen TV arrived cracked. 🙁
</s>
<break time="300ms"/>
<s style="mood: calm;">
We understand your frustration, and we're here to help. To resolve this issue, I'll initiate a replacement process for you right away.
</s>
<break time="300ms"/>
<s style="mood: reassuring;">
We'll have a new TV shipped to you, and we'll also arrange for the return of the damaged one at no extra cost.
</s>
<break time="300ms"/>
<s style="mood: attentive;">
Please expect an email with all the details shortly. If you have any other concerns or questions, feel free to let me know.
</s>
</speak>

While I did not get the exact Sentiment score, the sentiment was categorized as negative, so partial points! Overall, I was impressed with the level of SSML markup known by ChatGPT, likely augmented by the code-generation (Codex) model behind the product.

Note: Given that this was generated on a chat interface as a POC, there was no audio synthesized to include within this post.
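
For readers who want to hear a result, here is a sketch of how a marked-up response could be synthesized with the Azure Speech SDK. Note that the mood-style attributes ChatGPT produced are not standard SSML, so a real implementation would first map them onto the target platform’s tags; the mapping to mstts:express-as and the Aria voice below are my assumptions.

# A sketch of synthesizing the response with the Azure Speech SDK
# (pip install azure-cognitiveservices-speech). The mood attributes from
# ChatGPT's output are mapped, by assumption, onto Azure's
# mstts:express-as styles before synthesis.
import azure.cognitiveservices.speech as speechsdk

# Placeholders: substitute a real key and region.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="empathetic">
      I'm truly sorry to hear that your new flat-screen TV arrived cracked.
    </mstts:express-as>
  </voice>
</speak>"""

result = synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesized", len(result.audio_data), "bytes of audio")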

There are a number of limitations to the highlighted example and implementation that the reader should be aware of.

  1. Voice Model Baseline — Highly skilled audio engineers have an acutely attuned ear and can detect optimal-sounding output. Typically, building a synthetic persona requires a voice talent who is known for their voice quality. All voices are different — they vary in inflection, prosody, pronunciation, and more. While SSML can modify the characteristics of a voice, that does not necessarily mean the result will sound good. The same applies to the output from ChatGPT: what constitutes a “reassuring” style? How would that sound with one of the selected voice personas? This remains a risk of using the solution out-of-the-box, and it would require a skilled audio engineer to tune the output or provide explicit parameters within the prompt template.
  2. Latency — Generating text and modifying the voice styling as a middleware component could yield higher latency than is acceptable on a voice application / phone call. Caller-perceived latency must be kept to a minimum to reduce the dissatisfaction incurred during the transaction.
  3. SSML Support — Some voice platforms do not support SSML at all, while others use proprietary SSML tags and/or lack support for parts of the spec. An engineer must be mindful of the voice platform in use and its limitations (see the sketch after this list).
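
As one mitigation for the SSML support limitation, a middleware layer could sanitize the generated markup against a per-platform whitelist before playback, so the bot degrades gracefully instead of reading raw tags aloud. The helper below is hypothetical; the whitelist contents would depend on the actual platform.

# Hypothetical fallback for platforms with limited SSML support: strip
# unsupported style markup but keep whitelisted tags and plain text.
import re

SUPPORTED_TAGS = {"speak", "break", "prosody"}  # assumed platform whitelist

def sanitize_ssml(ssml: str) -> str:
    def keep_or_strip(match: re.Match) -> str:
        tag = match.group(1).lstrip("/")
        return match.group(0) if tag in SUPPORTED_TAGS else ""
    # Remove any element tag not in the whitelist; text content is untouched.
    return re.sub(r"<(/?[\w:]+)[^>]*>", keep_or_strip, ssml)

print(sanitize_ssml('<speak><s style="mood: calm;">Hello.</s></speak>'))
# -> <speak>Hello.</speak>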

As mentioned earlier, this is a Proof of Concept. In future iterations, I shall incorporate this concept more dynamically into a voice bot using Nuance Mix, Azure Functions, Azure AI Sentiment, and Azure AI Neural Text-to-Speech. Stay tuned!

Sam Bobo

Product Manager of Artificial Intelligence, Conversational AI, and Enterprise Transformation | Former IBM Watson | https://www.linkedin.com/in/sambobo/