Speech synthesis and paralanguage: Experiments in affective computing

Tim Bettridge
Published in Voice Tech Global
11 min read · Apr 26, 2021

Three years ago, our small team at Voice Tech Global was developing a voice application for ESL (English as a Second Language) students as part of our co-creation and educational workshop series. In our community-led user research, we learned that ESL students attending in-person classes did not get enough practice time, and that this shortfall led them to struggle with literacy skills and damaged their confidence, especially around pronouncing new words they encountered.

community-led product development

After some ideation, evaluation, and early prototyping, we landed on a product concept we called ‘Read Along’: a patient, kind, and always-available reading assistant. Read Along leverages text-to-speech synthesis to create an a-la-carte audiobook experience, enabling the user to customize and control aspects of the speech playback. We were excited about the project’s potential, but we ran into a problem with speech synthesis early in our experimentation. In trying to solve it, we came up with a novel solution that began to trace the contours of what a more expressive, empathetic, and relational conversational AI might look like. To illustrate those contours, I’ll take you through the challenge we faced, our approach to solving it, and the potential future applications of this technology.

For Read Along, we sourced our stories and texts from Project Gutenberg, whose catalogue is free of U.S. copyright. This library includes a great deal of classic literature that is old enough to have entered the public domain, giving us a significant source of material for creating on-demand ‘audiobooks’ for our users. We chose speech synthesis over prerecorded speech because of the production effort prerecorded audiobooks require, and because we wanted to offer features like translation, accent and voice selection, and variable speech rates, which wouldn’t really be possible with prerecorded audio. However, when we tested our prototypes with ESL students, we found that larger bodies of text suffered in ‘listenability’ when rendered with unstyled, raw speech synthesis.

Without the dynamic and emotive qualities of good storytelling, our reading assistant sounded monotonous. This flat, sterile reading made every sentence an effort to get through, and our participants found it difficult to pay attention and follow along with the assistant. When we added dynamic expression to the narration, however, what we termed "listenability" increased dramatically. To validate this, we tested three versions of the same VUI (Voice User Interface) with our ESL students: one used prerecorded human-read stories, another used plain text-to-speech synthesis, and the third used styled TTS (text-to-speech) synthesis. After the testing session, all of our participants agreed that both the prerecorded speech and the styled speech synthesis were easier to follow and more enjoyable to listen to than the flat, unstyled version.

Some of the SSML tags used by Amazon Alexa

The styling we applied to the text was done with SSML (Speech Synthesis Markup Language), a type of markup that lets a TTS engine synthesize paralanguage: a kind of meta-communication that allows humans to layer more nuanced meaning and emotion onto their words. As humans, we do this naturally by adjusting our prosody, pitch, volume, intonation, and other qualities of speech. Computers don’t have this innate ability, but with SSML we can manually craft the paralinguistic expression of our digital assistants’ synthetic voices. We knew from our research that our TTS would be easier to comprehend and listen to if we added expressive paralanguage.
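To give a concrete sense of what this hand-crafting involves, here is a rough sketch in Python. The prosody values and the helper function are purely illustrative, not any platform’s recommended settings:

def wrap_with_prosody(text, pitch="-15%", rate="85%", volume="soft"):
    # Wrap a sentence in SSML prosody tags suggesting a subdued, sombre delivery.
    # The leading pause gives the line a little room to breathe.
    return (
        '<speak>'
        '<break time="400ms"/>'
        f'<prosody pitch="{pitch}" rate="{rate}" volume="{volume}">{text}</prosody>'
        '</speak>'
    )

print(wrap_with_prosody("He puts them in twos because they seem less lonely."))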

The problem is that SSML styling is time-consuming and painstaking work. Even for an experienced conversation designer, a single turn of dialogue or a paragraph of text may take several iterations, and a process of trial and error, to get right. Here is a list of the SSML parameters that can be adjusted. It’s a quintessential but often missed step in voice product development because it’s laborious and requires significant experience and experimentation to implement effectively. If these barriers prevent it from being applied to everyday voice interactions, they make applying SSML to large libraries of classic literature an insurmountable task. The first story we began testing with was the children’s classic Peter Pan by J.M. Barrie. At 40,745 words and 118 pages, the novel clocks in at almost three hours of audio: a length that is very difficult to listen to without expressive styling, and an incredibly daunting amount of text to style by hand. Even a single plainly synthesized paragraph was a struggle for our testing participants.

Our Solution

To accomplish this seemingly impossible task, we found a way to automate the SSML styling of an entire text, regardless of its size. Our method required two steps:

1. Processing of the entire text using sentiment analysis to understand the affective meaning behind the writing.

2. Using the results of this analysis to programmatically apply rule-based SSML styling that would emulate the expressive style that a professional narrator or voice actor would use.
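In sketch form, the pipeline we had in mind looks something like this; the function names and return values are illustrative placeholders rather than our actual implementation:

def analyze_sentiment(sentence):
    # Step 1 (placeholder): return the dominant detected emotion and its
    # confidence score for a sentence, e.g. ("sadness", 0.93).
    return "neutral", 0.0

def apply_ssml_style(sentence, emotion, score):
    # Step 2 (placeholder): wrap the sentence in SSML that emulates how a
    # narrator might deliver a line carrying that emotion.
    return f"<speak>{sentence}</speak>"

def style_text(sentences):
    # Run both steps over every sentence in the text.
    return [apply_ssml_style(s, *analyze_sentiment(s)) for s in sentences]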

With these goals in mind, we began by benchmarking various tools to determine a sentiment analysis solution that would fulfill our needs.

Sentiment Analysis

Sentiment analysis refers to a combination of data analysis techniques used to systematically identify, extract, quantify, and study emotional sentiment and intent within data; it can be applied to text, audio, biometric, image, or video data. For Read Along, we wanted to analyze our long-form text to infer the implicit affective sentiment of each sentence and use that understanding to apply automatic SSML styling, sentence by sentence.

For this to work, we needed a capable sentiment analysis tool, so we started with a technical audit of the following options:

  1. Google Natural Language
  2. Amazon Comprehend
  3. IBM — Tone Analyzer

IBM Watson — Tone Analyzer
The most accurate and capable tool we found was IBM’s Watson Tone Analyzer, which was able to parse and analyze our large text at both the document level and the sentence level. It detects three types of sentiment from whatever text you need to analyze: emotions (anger, disgust, fear, joy, and sadness), social propensities (openness, conscientiousness, extroversion, agreeableness, and emotional range), and language styles (analytical, confident, and tentative).
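For reference, calling the service from Python looks roughly like this, assuming the ibm-watson SDK and the 2017-09-21 version of the API (the key and service URL are placeholders):

from ibm_watson import ToneAnalyzerV3
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Authenticate against the Tone Analyzer service (credentials are placeholders).
authenticator = IAMAuthenticator("YOUR_API_KEY")
tone_analyzer = ToneAnalyzerV3(version="2017-09-21", authenticator=authenticator)
tone_analyzer.set_service_url("https://api.us-south.tone-analyzer.watson.cloud.ibm.com")

passage = "He puts them in twos because they seem less lonely."

# Request tone analysis; the response includes document-level tones and,
# for multi-sentence input, a per-sentence breakdown.
result = tone_analyzer.tone({"text": passage}, content_type="application/json").get_result()
print(result)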

Once we had selected the right platform and affiliated API, the next step was to build an end-to-end application to drive the experience on smart speakers.

IBM Watson Tone Analyzer

Compute Sentiment and Apply Emotional Styling:
We set up a database to pull stories from the Project Gutenberg library and reformat the classic texts to extract chapters, paragraphs, and sentences. With the text extracted, parsed, and structured, we needed to compute sentiment. We decided to do this on two levels: first on each sentence, and then as an averaged, dominant sentiment at the paragraph level. This simple system helps balance outliers in a text, such as a few strongly happy sentences in the middle of a very sad paragraph, which would otherwise risk creating an odd reading experience. Consider the passage below:

a passage from chapter 11 of Peter Pan

Tone Analyzer

The Tone Analyzer Service analyzes text at the document level and the sentence level. The document-level analysis helps to get a sense of the overall tone of the document, and the sentence level analysis helps identify specific areas of your text where detected sentiments are the strongest.

Here is the JSON code returned from tone analysis for one of the more emotionally intense parts of our passage. You can see that sometimes two or three sentiments are detected for a single sentence, each with a coefficient value associated with it. We realized that we’d have to figure out how to handle multiple detected sentiments.

Sentence-level analysis:


{
  "sentence_id": 3,
  "text": "He does this at once because he thinks it is what real boys would do, and you must have noticed the little stones, and that there are always two together.",
  "tones": [
    {
      "score": 0.687768,
      "tone_id": "analytical",
      "tone_name": "Analytical"
    },
    {
      "score": 0.727798,
      "tone_id": "confident",
      "tone_name": "Confident"
    }
  ]
},
{
  "sentence_id": 4,
  "text": "He puts them in twos because they seem less lonely.",
  "tones": [
    {
      "score": 0.931734,
      "tone_id": "sadness",
      "tone_name": "Sadness"
    },
    {
      "score": 0.687768,
      "tone_id": "analytical",
      "tone_name": "Analytical"
    },
    {
      "score": 0.822231,
      "tone_id": "tentative",
      "tone_name": "Tentative"
    }
  ]
},

The document analysis returns the following JSON code. Instead of running the document analysis on the entire text, we used it on our paragraph structures. This allowed us to infer sentiment at both sentence and paragraph levels and provided us with additional contextual understanding, which became very useful later in our project.

Document-level analysis:

{
  "document_tone": {
    "tones": [
      {
        "score": 0.63074,
        "tone_id": "sadness",
        "tone_name": "Sadness"
      },
      {
        "score": 0.552272,
        "tone_id": "joy",
        "tone_name": "Joy"
      },
      {
        "score": 0.525246,
        "tone_id": "analytical",
        "tone_name": "Analytical"
      }
    ]
  }
}
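Tying the two levels together, our per-paragraph processing looked roughly like the sketch below: each paragraph is sent to the service as its own "document", which returns the paragraph-level tones alongside the per-sentence breakdown. The helper assumes the tone_analyzer client shown earlier:

def analyze_paragraph(tone_analyzer, paragraph):
    # Send the whole paragraph as its own "document"; Watson returns the
    # document-level tones and a per-sentence breakdown in a single call.
    result = tone_analyzer.tone({"text": paragraph}, content_type="application/json").get_result()

    # Dominant paragraph sentiment: the document-level tone with the highest score.
    doc_tones = result["document_tone"]["tones"]
    dominant = max(doc_tones, key=lambda t: t["score"]) if doc_tones else None

    # Per-sentence tones ("sentences_tone" is omitted for single-sentence input).
    sentence_tones = result.get("sentences_tone", [])
    return dominant, sentence_tones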

Now that we had a tool we could use to detect sentiment, we needed to begin working on the second part of our proof-of-concept solution: the automatic SSML styling. Read Along would use this styling to synthesize speech that communicates suitable emotional paralanguage. Through iterative experimentation, we developed a set of styles for four emotions (e.g., harder, more laboured breathing, lower pitch, and a slower speech rate for sadness; short breaths, higher pitch, and a faster speech rate for joy), and we applied this styling programmatically at two different strengths. Here are some samples of what the synthesis sounds like after it has been styled to express emotional paralanguage.

Sad SSML styling
Angry SSML styling
Happy SSML styling
Scared SSML styling

The styling is applied programmatically using a mapping that includes both the detected emotion and its corresponding coefficient value (the strength of the detected sentiment). In our early testing, we found that many sentences had more than one sentiment detected, and rather than combining and applying two styles, we created logic that chooses the sentiment with the higher confidence level.

The rules we used were as follows:

  1. For any sentence with a detected sentiment confidence between 0 and 0.49, the neutral tone of Alexa’s voice is used.
  2. Any sentence with a detected sentiment confidence between 0.5 and 0.8 gets moderate-strength emotional styling (for sadness, anger, fear, and happiness).
  3. Any sentence with a detected sentiment confidence between 0.81 and 1.0 gets maximum-strength emotional styling.
  4. If two or more sentiments are detected within the same sentence, the one with the highest confidence level is applied.
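Here is a rough sketch of how such a mapping can be expressed in code. The prosody values and the helper are illustrative placeholders, not the exact parameters we shipped:

# Illustrative prosody settings per emotion and strength; not the exact values we used.
EMOTION_STYLES = {
    "sadness": {"moderate": 'pitch="-10%" rate="90%"',   "max": 'pitch="-20%" rate="80%"'},
    "joy":     {"moderate": 'pitch="+10%" rate="110%"',  "max": 'pitch="+20%" rate="120%"'},
    "anger":   {"moderate": 'rate="105%" volume="loud"', "max": 'rate="115%" volume="x-loud"'},
    "fear":    {"moderate": 'pitch="+5%" rate="110%"',   "max": 'pitch="+15%" rate="125%"'},
}

def style_sentence(text, tones):
    # Rule 4: when several sentiments are detected, keep only the most confident one.
    if not tones:
        return text
    top = max(tones, key=lambda t: t["score"])
    emotion, score = top["tone_id"], top["score"]

    # Rule 1: low confidence, or a tone we have no style for, falls back to the neutral voice.
    if score < 0.5 or emotion not in EMOTION_STYLES:
        return text
    # Rules 2 and 3: choose moderate or maximum strength based on the confidence band.
    strength = "moderate" if score <= 0.8 else "max"
    return f'<prosody {EMOTION_STYLES[emotion][strength]}>{text}</prosody>'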

As we tested, we also found that rapidly switching the expression of our synthesis sometimes sounded emotionally volatile, a bit frantic and unsettling. To mitigate this, we came up with new rules that bring in more context by running the analysis at the paragraph level and leveraging that additional context. The new rules, which also negate the need for rules 1 and 4, are as follows:

5. Sentence-level emotional styling is applied within a given paragraph only to the most prevalent emotion within that paragraph.

6. Sentences whose score for the paragraph’s dominant sentiment falls between 0 and 0.49 have a new, low-strength emotional styling applied. We found in our testing that this eases the transition between styling levels and incorporates the emotional context of the entire paragraph.
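In sketch form, the refinement changes the per-sentence logic to something like the following, building on the illustrative style table above and assuming an added low-strength entry per emotion:

# Low-strength styles extend the table above (again, placeholder values).
LOW_STYLES = {
    "sadness": 'pitch="-5%" rate="95%"',
    "joy":     'pitch="+5%" rate="105%"',
    "anger":   'rate="102%"',
    "fear":    'pitch="+3%" rate="105%"',
}

def style_sentence_in_paragraph(text, tones, dominant_emotion):
    # Rule 5: only the paragraph's dominant emotion is styled at the sentence level.
    if dominant_emotion not in EMOTION_STYLES:
        return text
    score = next((t["score"] for t in tones if t["tone_id"] == dominant_emotion), 0.0)
    # Rule 6: weakly scoring sentences still receive a low-strength version of the
    # dominant emotion, which smooths the transitions within the paragraph.
    if score < 0.5:
        style = LOW_STYLES[dominant_emotion]
    else:
        style = EMOTION_STYLES[dominant_emotion]["moderate" if score <= 0.8 else "max"]
    return f'<prosody {style}>{text}</prosody>'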

These last two rules made our automated styling more conservative, but also more contextually aware. We discovered that when paralanguage goes badly, it can have unintended negative consequences: instead of helping us understand the speaker’s intent, it can confuse what is being said and create dissonance between the meaning of the words and the expression we hear, which is distracting and sometimes even disturbing.

In our passage from Peter Pan, when rule 5 is applied, our sentiment analysis finds sadness to be the most prevalent emotion, so sadness styling is applied programmatically, at three different levels, to every sentence in the paragraph. You can hear the difference in the following samples:

With no styling whatsoever
With SSML styling mapped to sentiment

When we did this work almost three years ago, the range of SSML tags available to us did not include the programmatic emotional styling we developed. However, since then, Amazon has released two Alexa-specific emotional styling SSML tags for expressing both excitement and disappointment. The implementation sounds excellent, and for us, it’s been validating to see development by Amazon in our research area.
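For reference, Alexa’s emotion tag wraps text in a named emotion at a chosen intensity; a minimal snippet looks like this (the sentence itself is just a placeholder):

# SSML using Alexa's emotion tag; "excited" and "disappointed" are the supported names.
ssml = (
    "<speak>"
    '<amazon:emotion name="excited" intensity="medium">'
    "You solved the puzzle on the very first try!"
    "</amazon:emotion>"
    "</speak>"
)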

However, our work takes the idea of parametric emotional styling one step further by combining it with sentiment analysis to automate the styling process, and the applications of this potent combination are fascinating. There are many areas in which the text is simply too large for designers to craft SSML by hand, and in some cases a conversational agent’s dialogue doesn’t even exist in advance, because open-domain language generation models like GPT-3 are producing it on the fly.

It’s important to note that sentiment analysis is not a solved problem, and in our testing the analysis sometimes got it wrong, for example finding joy in a passage whose meaning was more melancholic. A mix-up like this can lead to incorrect emotional styling, which in turn creates a confusing experience for the user because of the dissonance between the text’s intended meaning and the paralanguage the SSML styling communicates. And while it’s not a perfected technology, it does show us the contours of a future in which we’ll interact within a more affective and empathetic computing paradigm: a future where our virtual agents interpret the meaning behind our words through the tone of our voice or the expression on our face, allowing them to understand and respond to us at a deeper, more humanlike level and to engage us with the same paralanguage we use with each other.

Thank you to my colleague Guy Tonye who worked on this research with me and provided valuable input into this article!

If you’re interested in learning more about how to design successful conversational experiences — check out Voice Tech Global’s conversational design training programs today.
