Wave Hello to Watson Assistant Voice Interaction, and Goodbye to Complex Phone Trees

Published in

IBM watsonx Assistant

Apr 20, 2020

In late 2019, IBM announced Watson Assistant for Voice Interaction (WAVI). This AI solution enables enterprises to modernize their traditional Interactive Voice Response systems. It also allows callers to speak naturally in order to get their problems solved.

Say good-bye to complex phone trees (“press 1 for new reservations, press 2 for existing reservations”) — and hello to simple actions (“I need help changing my existing reservation for my trip to Hawaii on August 26”).

Companies are struggling to keep up with high call volumes during the COVID-19 pandemic. People are calling about canceled flights, insurance claims, and medical screening questions. Nearly every industry is experiencing higher-than-normal call volumes.

WAVI helps enterprises reduce call wait times and provide faster time to resolution. In fact, according to a Forrester Economic Impact Report, by choosing to adopt Watson, the typical large organization saves millions of dollars per year.

Check out my short one-minute video for an example of a potential WAVI solution:

Demonstration of Watson Assistant Voice Interaction (WAVI)

The caller was able to speak in natural language with normal hesitations. Watson was able to understand the intent of the call and respond appropriately. Watson also sent a text message providing a link for more information.

Let’s break down the four components of WAVI:

Watson Assistant

This is a conversational AI service that classifies the intent of a statement and orchestrates the dialog flow. When building a voice solution, make sure the experience is tailored to callers rather than to chat users. Keep Watson's responses short. Also, train it to handle common voice utterances such as “um”, “sure”, “yep”, and “nope”.
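As a rough illustration of that last tip, the sketch below collects short spoken replies as training examples for two intents. The intent names (affirm, deny) and the example phrases are assumptions made for illustration, and the output only mirrors the general intent-with-examples shape of an assistant's training data. Check the current Watson Assistant documentation for the exact import format.

```python
import json

# Hypothetical intent names and example utterances; adjust to your own assistant.
VOICE_EXAMPLES = {
    "affirm": ["yes", "yep", "sure", "um yes", "Yep"],
    "deny": ["no", "nope", "um no", "no thanks"],
}


def build_intents(examples_by_intent):
    """Return a list of intent dicts with lowercased, de-duplicated examples."""
    intents = []
    for name, utterances in examples_by_intent.items():
        seen = set()
        examples = []
        for text in utterances:
            key = text.strip().lower()
            if key and key not in seen:
                seen.add(key)
                examples.append({"text": key})
        intents.append({"intent": name, "examples": examples})
    return intents


intents = build_intents(VOICE_EXAMPLES)
print(json.dumps(intents, indent=2))
```

Keeping filler-heavy variants such as "um yes" next to the clean ones is the point: callers rarely answer a phone prompt as tidily as they type.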

Watson Speech to Text (STT)

This service transcribes speech into text before the input goes into Watson Assistant. The pre-trained US English base model for STT is very good and did not need any customization for my demo recording above. However, to increase its accuracy, it is recommended to train STT with a Language Model matching your Watson Assistant training data (intents and entities).
Custom training might be needed for Watson to understand domain-specific terminology such as Gastroenterology or Otolaryngology. Before you get nervous about what “custom training” means, for these two words it was as simple as uploading a .txt file with the words split up (e.g., Gastro enterology). Watson does the rest!
Acoustic training can help if your WAVI solution is struggling to understand accents or hear through background noises. Check out this Medium article posted by my colleague: “How to Train Your Own Speech Dragon”.
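To make the .txt step concrete, here is a minimal sketch that writes such a file with plain Python. The file name, the helper functions, and the split spelling for Otolaryngology are assumptions for illustration (only "Gastro enterology" comes from the demo above), and uploading the file to a custom language model is not shown.

```python
import tempfile
from pathlib import Path

# Domain terms mapped to a split-up spelling. Only the first split comes from
# the article; the second is an assumed example.
DOMAIN_TERMS = {
    "Gastroenterology": "Gastro enterology",
    "Otolaryngology": "Oto laryngology",
}


def corpus_lines(terms):
    """Return the split-up spellings, one per corpus line, skipping blanks."""
    return [split.strip() for split in terms.values() if split.strip()]


def write_corpus(terms, path):
    """Write one term per line to a UTF-8 text file and return the line count."""
    lines = corpus_lines(terms)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Demo: write into a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    corpus_path = Path(tmp) / "medical_terms.txt"  # hypothetical file name
    count = write_corpus(DOMAIN_TERMS, corpus_path)
    corpus_text = corpus_path.read_text(encoding="utf-8")

print(f"wrote {count} lines")
print(corpus_text)
```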

Watson Text to Speech (TTS)

This service synthesizes the text output from Watson Assistant into audio. This audio is then played back to the caller.
Voices: It’s important that you select a Watson voice that resonates with the end user. There are over fifty voices to select from, male or female, with accents from around the world. In 2019 we announced neural voices that use Deep Neural Networks. This was a HUGE advancement in our Speech capabilities and a major difference-maker. To help you choose a voice, listen to audio samples for voices in each language and dialect. (FYI, in the example above, I used en-US_EmilyV3Voice.)
Custom Words: If Watson is mispronouncing words, you can easily create custom words and give them specific pronunciation rules. The recording above did not require custom words. However, for certain use cases, such as reciting names of pharmaceutical drugs, you may need to add custom words to a model.
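A custom word is essentially a pairing of the written word with a hint for how to say it. The sketch below builds such a list as JSON. The word/translation field names reflect my understanding of how Text to Speech custom entries are shaped, and the "sounds-like" spellings are illustrative guesses rather than vetted pronunciations, so treat all of it as an assumption to verify against the service documentation.

```python
import json

# Illustrative sounds-like spellings; these are guesses, not vetted pronunciations.
PRONUNCIATIONS = {
    "acetaminophen": "uh see tuh min uh fen",
    "ibuprofen": "eye byoo pro fen",
}


def build_custom_words(pronunciations):
    """Return a {"words": [...]} payload, rejecting empty words or hints."""
    words = []
    for word, translation in sorted(pronunciations.items()):
        if not word.strip() or not translation.strip():
            raise ValueError(f"empty entry for {word!r}")
        words.append({"word": word.strip(), "translation": translation.strip()})
    return {"words": words}


payload = build_custom_words(PRONUNCIATIONS)
print(json.dumps(payload, indent=2))
```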

IBM Voice Gateway

This is a Session Initiation Protocol (SIP) orchestrator and a very important part of Voice-over-IP technology. It handles the orchestration between the telephone (caller) and Watson (virtual agent). No setup steps are required, since the WAVI solution handles them for you.
SMS: As you saw in the recording, WAVI can send the caller a text message with information that is difficult to communicate over voice, such as URLs, Google Maps directions, or long responses.
Barge In: Lets you choose whether a caller can interrupt Watson in the middle of a response. There’s a trade-off to consider here, but as long as you keep Watson's responses short, turning off Barge In ensures that a cough or background noise does not interrupt the dialog.
DTMF: Short for dual-tone multi-frequency, this lets callers press numbers on their phone keypad to provide input to Watson. It’s useful when you need higher accuracy for numbers, so that Watson does not confuse "eight" with "ate" or "four" with "floor".
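To show why keypad input sidesteps the "eight" versus "ate" problem, here is a small sketch of how a run of key presses could be turned into a number. The collect_digits helper, the "#" terminator, and the digit limit are all hypothetical and are not part of the Voice Gateway API. In WAVI the gateway collects the tones for you.

```python
def collect_digits(key_presses, terminator="#", max_digits=16):
    """Gather digits from a sequence of key presses.

    Stops at the terminator key or once max_digits digits are collected.
    Non-digit keys such as "*" are ignored.
    """
    digits = []
    for key in key_presses:
        if key == terminator:
            break
        if key.isdigit():
            digits.append(key)
        if len(digits) == max_digits:
            break
    return "".join(digits)


# A caller keys in a four-digit code followed by the pound key.
print(collect_digits("4085#"))
```

Because each key press is an unambiguous symbol, there is nothing for the speech model to mishear, which is the accuracy benefit described above.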

Note: Each of these cloud service components is set up through WAVI’s simple user interface. You do need a SIP trunk to connect the solution to a phone number. Enterprises typically use major providers such as Avaya, Cisco, or Mitel, but you can also sign up for a free Twilio trial account.

Getting Started on IBM Cloud

In summary, Watson Assistant Voice Interaction (WAVI) helps enterprises transform customer service. This leads to shorter wait times, faster time to resolution, and better user experiences. WAVI is easy to set up and easy to customize. Here are some important links to help you get started on building your own voice agent:

Getting started tutorial (link)
Video with step-by-step setup instructions (link)
Original WAVI announcement (link)