How To Properly Plan and Configure Watson Text to Speech in an IVR

Marco Noel
IBM Watson Speech Services
10 min read · Dec 15, 2021

By Kelly To and Marco Noel — Product Managers, IBM Watson Speech Services.

In a previous Medium article, we shared guidelines and best practices focused on the overall voicebot solution and user experience. One of the biggest factors impacting your user experience is the way your questions and information are played back to your users with Text to Speech (TTS). A poorly designed IVR can be frustrating and lead to unnecessary transfers to human agents, while a well polished one can add value to the user’s experience.

This article will share some guidelines on how to properly analyze, prepare, plan, test and deploy Text to Speech prompts for your IVR, including some helpful features and open-source tools you can use.

Identify pain points in your existing IVR outputs


The first step is to identify the sources of frustration that cause your users to be transferred to a human agent. Here are three major ones:

1. Long greetings

The longer the greeting, the more impatient the user becomes. Here’s an example:

  • “Welcome to ABC Bank’s Virtual Assistant. We are here to deliver the best service you deserve. This is an automated voicebot. Calls will be recorded for training purposes. If you know the extension of the person you wish to contact, dial it right now. How may I help you?”

By the time this greeting finishes, you have already used up some of your user’s patience. If users who are allowed to “interrupt” the greeting (aka barge-in) regularly do so, that’s a clear indication it needs to be changed.

Keep the greeting short — no more than 5 seconds.

“Thank you for calling ABC Bank’s Virtual Assistant. Calls are recorded. How may I help you today?”

If you require a legal disclaimer, ask yourself the following questions:

  • What is the purpose of the disclaimer?
  • Can you break it down into multiple smaller disclaimers, targeting specific turns of the conversation?

2. Too much or too little information

Users want to get information as quickly as possible. Having to listen to long voice prompts with unnecessary information will frustrate users. At the same time, they want clear instructions of what the system needs from them. It’s a delicate balance.

If you’re asking for information, try to keep the prompts short:

  • “What’s your date of birth?”
  • “What’s your member ID?”

If the information was not properly captured, provide some short guidance at the reprompt:

  • “I did not get your date of birth. Starting with the month, day and year, please provide your date of birth.”
  • “I did not get your member ID. Please provide your 6-digit member ID.”

Also, avoid saying things like “I’m sorry. I did not get…” or “Thank you. Please provide…” repetitively at every prompt. It may sound polite, but it can become annoying coming from an IVR. Save these for where they matter most, such as when you detect a “negative” comment or when a complaint is reported. Eliminate prompts like “please listen carefully to the following options as our menu has changed.” Customers don’t care that the options have changed, and these lines only add unnecessary length. Refrain from adding marketing messages; these customers already know your company since they called you.
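One practical way to keep reprompts helpful without repeating the same courtesy phrases is to vary the wording by retry attempt. Here is a minimal sketch in Python; the prompt texts come from the examples above, but the data structure and function name are hypothetical, not part of any Watson API:

```python
# Hypothetical reprompt helper: vary the wording by attempt so the caller
# hears the apology once, then progressively more specific guidance.
REPROMPTS = {
    "date_of_birth": [
        "What's your date of birth?",
        "I did not get your date of birth. Starting with the month, day and year, "
        "please provide your date of birth.",
        "Let's try once more. Say your date of birth, for example 'July 4th, 1980'.",
    ],
    "member_id": [
        "What's your member ID?",
        "I did not get your member ID. Please provide your 6-digit member ID.",
        "Let's try once more. Say or key in the 6 digits of your member ID.",
    ],
}

def next_prompt(slot: str, attempt: int) -> str:
    """Return the prompt for this slot, capped at the most detailed reprompt."""
    prompts = REPROMPTS[slot]
    return prompts[min(attempt, len(prompts) - 1)]
```

After the last reprompt, the same counter can be used to decide when to transfer the caller to a human agent instead of asking again.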

3. Too fast or unintelligible information

If users do not understand or do not have time to process the information played back, they will likely get frustrated.

Clearly identify the important information returned to the users:

  • Credit card balance
  • Insurance coverage
  • Mailing address
  • Phone number

Make sure the text is properly formatted with punctuation so the playback is consistent:

  • Phone number: 1234567890 (bad) → 123-456-7890 (good)
  • Mailing address: 123 Maple Street Some City State 12345 (bad) → 123, Maple Street, Some City, State, 12345 (good)
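This kind of pre-formatting is easy to automate before the text reaches TTS. The sketch below is a small hypothetical Python helper (the function name and grouping are illustrative assumptions) that inserts the hyphens TTS needs to read a 10-digit North American phone number group by group:

```python
import re

def format_phone_for_tts(raw: str) -> str:
    """Insert hyphens into a 10-digit phone number so TTS reads it in groups."""
    digits = re.sub(r"\D", "", raw)  # strip spaces, parentheses, dots, etc.
    if len(digits) != 10:
        return raw  # leave anything unexpected untouched
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
```

The same idea applies to addresses and amounts: normalize the backend value into punctuated text once, in code, rather than relying on each flow author to remember the formatting rules.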

Choose the right tone and personality


Designing the tone and personality of your voicebot helps you craft language and an attitude that live up to users’ expectations of your brand promise. When building the prompts in your flows, choose a personality that matches your company image and appeals to your customer demographic.

  • Formal Reserved Personality

Hello and welcome to ABC Bank’s Virtual Assistant. I can answer your questions about ABC Bank’s savings accounts. How may I help you?

  • Semi-Formal Reserved Personality

Hello, I’m ABC Bank’s Virtual Assistant. I can help with your questions about savings accounts. How can I help you?

  • Semi-Formal Lively Personality

Hello there! I’m ABC Bank’s Virtual Assistant. I can answer your questions about our savings accounts and help you choose the one that’s right for you! How can I help you?

  • Informal Personality

Hi! I’m ABC Bank’s Virtual Assistant. I know heaps about ABC Bank’s savings accounts, so if you’ve got a question, just fire away! How can I help?

How to evaluate with user ratings


Now that you’ve crafted meaningful prompts and selected an appropriate persona for your IVR, it’s time to test it out. Does the IVR sound helpful? Or does it sound annoying? You’ll want to ensure users actually like what they’re hearing before moving to production.

Methodology for evaluating IVR prompts:

  • Build a team of 3 reviewers maximum — too many reviewers will significantly slow down the process. These reviewers should be representative of your customer base (e.g. if evaluating a banking IVR, source people who have a bank account).
  • Generate short audio files for each Watson Assistant output node using an out of the box TTS voice — use the TTS Python tool mentioned below. The files should be representative of all the prompts in your IVR (e.g. greetings, asking for more information, routing to an agent, etc.)
  • Have the reviewers listen to each audio file and document detailed feedback. You can use the Excel spreadsheet “Evaluation_Template.xlsx” as a guide.
  - Does the voice sound natural?
  - Do the voice and prompts sound pleasing/helpful?
  - Are there any mispronounced words?
  - If the audio file uses emphasis, does the reviewer hear it coming through?
  • You want their feedback to be detailed. A response like “it sounds weird” will not help you improve your existing IVR. Probe the reviewer. What sounds weird? Is it a particular word? How should it sound?
  • Ask the reviewers to rate each prompt overall on a scale of 1–5 where 1 is terrible and 5 is excellent.
  • Any aspect ranked 3 or lower requires more details. Have the reviewer identify which part of the prompt is problematic, then update the evaluation spreadsheet accordingly.
  • To fix specific data inputs like IDs, dates, phone numbers, amounts, etc, you can use the designated SSML tags here
  • To fix recurring words and expressions, use TTS custom dictionary
  • For speed adjustment, use SSML prosody rate at the node level
  • For intonation and inflection adjustment, use phonetic symbols or Tune-By-Example
  • Re-generate audio files for each Watson Assistant output node with the new TTS custom dictionary, SSML tags and other improvements

Iterate as many times as required.

For easy SSML experiments, you can use the TTS Demo UI application referenced below.

TTS features and open-source tools


If you would like to adjust the way the text sounds, take advantage of Text to Speech’s customization features and open-source tools.

Create a TTS customization model with a dictionary:

Create a dictionary of words and their translations to specify how the service pronounces unusual words in your text, such as domain-specific terms, words with foreign origins, personal names, and abbreviations or acronyms.

SSML tags:

Use Speech Synthesis Markup Language (SSML) to provide annotations that adjust pronunciation, volume, pitch, speed, etc. With our upcoming feature releases in 2022, you’ll be able to use SSML tags for word emphasis, expressiveness, emotions and more.
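For example, a confirmation prompt that slows down slightly and reads an ID back digit by digit can be wrapped in SSML like this. The sketch below builds the markup in Python; `prosody`, `break`, and `say-as` are standard SSML elements, but the function name is hypothetical and you should confirm the supported attribute values for your chosen voice in the service documentation:

```python
def ssml_member_id_prompt(member_id: str) -> str:
    """Build an SSML prompt: slightly slower rate, a short pause,
    then the member ID read back digit by digit."""
    return (
        "<speak>"
        '<prosody rate="-10%">I found your account.</prosody>'
        '<break time="500ms"/>'
        f'Your member ID is <say-as interpret-as="digits">{member_id}</say-as>.'
        "</speak>"
    )
```

The same prompt without `say-as` would be read as “one hundred twenty-three thousand…”, which is exactly the kind of unintelligible playback described earlier.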

Phonetic symbols (IPA / IBM SPR):

Watson TTS supports both the standard International Phonetic Alphabet (IPA) and IBM Symbolic Phonetic Representation (SPR) notations to represent the sounds of words. The languages that support each notation are listed here.

With phonemes, you can control where the stress falls in a syllable as well as how a syllable is pronounced with different speech sounds. For more information, check out the links below.
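As a sketch of how a phoneme override looks in practice, the helper below builds an SSML `<phoneme>` element with an IPA transcription. The element itself is standard SSML that Watson TTS accepts; the function name, the example word, and the IPA string are illustrative assumptions, so verify the transcription against the service's pronunciation tools before relying on it:

```python
def ssml_with_ipa(word: str, ipa: str) -> str:
    """Wrap one word in an SSML phoneme tag to override its pronunciation."""
    return f'<phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme>'

# Force a specific pronunciation of "tomato" inside a prompt:
prompt = (
    "<speak>Your "
    + ssml_with_ipa("tomato", "təˈmeɪtoʊ")
    + " soup order is confirmed.</speak>"
)
```

If the same word appears in many prompts, a custom-dictionary entry is usually less work than repeating the phoneme tag everywhere.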

Tune-By-Example (US English only):

As an alternative to SSML tags, the Tune-By-Example feature lets you correct misplaced pauses, awkward inflections or a generally unnatural feel using your own voice. Simply provide an audio file that speaks the text the way you want to hear it; Tune-By-Example will apply the intonations and inflections of your voice to the synthesized audio.

TTS Demo UI (NodeJS application):

With this tool, you can test text, SSML tags and phonetic symbols with any available voice. It can run locally on your machine or on IBM Cloud. Before running this tool, you will need to set up a TTS instance on IBM Cloud, then get the API key and the instance URL. To create one on IBM Cloud, you can go through the TTS Getting Started video HERE.

To install it:

  • Download the code from the public Github repository below:
  • Follow the simple instructions in the README.MD file
  • Launch your browser and connect to “http://localhost:3000”

TTS Python tool:

This very useful tool allows you to test out any text, down to your Watson Assistant outputs configured in your skills. You can use the sample files located under the “template-samples” sub-folder to test the tool.

Here’s how to use it:

  • Download the code from this Github repository below
  • You will need a Watson TTS instance to generate your audio files. To create it on IBM Cloud, you can go through the TTS Getting Started video HERE. When completed, you will need the API key and URL of your TTS instance for the configuration file of the tool.

From the folder where you downloaded the code:

  • Copy the config.ini.sample into config.ini
  • Using a text editor, open the config.ini
  • Update the [TextToSpeech] section with the API key and URL of your TTS instance, then add your TTS voice for your output. Leave the customization_id commented for now
  • Update the [Synthesis] section with the output directory of your synthesized audio files, the audio file type (wav, mp3), the input CSV file with the text to synthesize
  • If you wish to get phonemes from text, update the [Pronunciation] section with the input text file (each line with a word or utterance), select the phonetic format you wish to get (IPA or IBM) and the output CSV file
  • If you wish to pull output texts from a Watson Assistant skill, you can update the [Assistant] section with your output CSV result file and the JSON file (if you have a physical Watson Assistant JSON file on your local machine). If you wish to pull them directly from your Watson Assistant instance, get the API key, URL and workspace ID from it.
  • Run the following command to launch the synthesis of the multiple prompts
> python synthesize.py
This takes an input CSV file with ID and text columns, and synthesizes each text into an audio file named as specified by the ID column. All audio files are stored under the sub-folder configured in the “output_dir=” setting of the config.ini file.
  • To launch the IPA or SPR pronunciation, run the following command
> python pronounce.py
This takes a plain text input file with one word or phrase per line and generates pronunciations. The output file is a CSV file suitable for passing as an input file to “synthesize.py”.
  • To pull output text from a Watson Assistant skill or JSON file, run the following command
> python extract_skill_text.py
This takes a Watson Assistant skill (JSON file) and extracts all of the text “spoken” by the assistant. The output file is a CSV file suitable for passing as an input file to “synthesize.py”. You can configure this mode with a pre-downloaded JSON file (as configured in “skill_json_file”) or provide the Watson Assistant connection information. If “skill_json_file” is set, it takes precedence over the Watson Assistant connection.
  • Based on the results of your experiments, you can create a TTS custom model with new pronunciations of words and expressions, update the config.ini file with the new “customization_id”, re-run the experiments above, and iterate as many times as needed.
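Putting the steps above together, a filled-in config.ini might look like the sketch below. All values are placeholders, and the section and key names follow the steps above as assumptions; check the repository’s config.ini.sample for the authoritative names:

```ini
[TextToSpeech]
apikey = <your-tts-api-key>
url = https://api.us-south.text-to-speech.watson.cloud.ibm.com/instances/<instance-id>
voice = en-US_AllisonV3Voice
; customization_id = <uncomment once you have created a custom model>

[Synthesis]
output_dir = ./audio-output
file_type = wav
input_file = ./template-samples/prompts.csv

[Pronunciation]
input_file = ./template-samples/words.txt
format = ipa
output_file = ./pronunciations.csv

[Assistant]
skill_json_file = ./my-skill.json
output_file = ./skill-text.csv
```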

Sr Product Manager, IBM Watson Speech / Language Translator. Very enthusiastic and passionate about AI technologies and methodologies. All views are only my own