Watson Speech-To-Text: How to Train Your Own Speech “Dragon” — Part 1: Data Collection and Preparation

Marco Noel
IBM Watson Speech Services
9 min read · Aug 12, 2019
Adding voice to your chatbot requires data preparation | Photo by ROOM on Unsplash

Over the past few years, we’ve seen many AI chatbots deployed across organizations. They typically handle general questions about products and services, and even complete basic transactions without the intervention of a human.

Most of these organizations already have an IVR to automate some customer interactions (press “1”… press “2”…) and route callers to the right call center queue. This leaves Call Center Agents answering very basic questions instead of focusing on truly complex situations.

The current trend is to voice-enable these existing web chatbots and replace the legacy IVR, but many organizations have no clue where to start when introducing automated speech recognition services. A typical IVR solution built with Watson services requires the following components:

  • IBM Voice Gateway, which handles the telephony (SIP) connection
  • A Watson Assistant skill that drives the dialog
  • The Watson Speech to Text (STT) service to transcribe the caller’s voice
  • The Watson Text to Speech (TTS) service to speak the responses back

This series of articles focuses on how to train the IBM Watson STT service: its requirements, the methodology and some best practices inspired by actual customer engagements.

This first article explains the core components of Watson STT as well as the data collection and preparation required before training starts.

IBM Watson STT Components

First, you need to know what comes out of the box with Watson STT and which of its components you can use to improve accuracy:

  • Base Model — each provisioned STT service comes with a base acoustic model and a base language model, available in multiple languages. They are pre-trained with general words and terms and can handle light accents. IBM has also introduced a new model called “US English short form”, trained for IVR and Automated Customer Support solutions (US English Narrowband only).
  • Language Model Adaptation/Customization — with UTF-8 plain text files, you can enhance the existing base language model with domain-specific terminology, acronyms, jargon and expressions, which will improve speech recognition accuracy.
  • Grammar Adaptation/Customization — this new feature allows you to adapt speech recognition with specific rules that limit the choices of words returned. This is especially useful when dealing with alphanumeric IDs (e.g., member IDs, policy numbers, part numbers).

I will cover the configuration and training in more detail in Part 2.
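As a quick preview of the Language Model Adaptation described above, here is a minimal sketch using the ibm-watson Python SDK. The credentials, service URL, model name and corpus file name are placeholders; Part 2 walks through this workflow properly.

# pip install ibm-watson
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials and URL -- replace with your own service instance
stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_API_KEY"))
stt.set_service_url("YOUR_SERVICE_URL")

# Create a custom language model on top of the US English narrowband base model
model = stt.create_language_model(
    name="ivr-custom-model",
    base_model_name="en-US_NarrowbandModel",
    description="Domain terms collected from tester utterances",
).get_result()
customization_id = model["customization_id"]

# Add a UTF-8 plain text corpus of domain-specific utterances
with open("domain_corpus.txt", "rb") as corpus_file:
    stt.add_corpus(customization_id, "domain-corpus", corpus_file)

# In practice, poll get_corpus() until its status is "analyzed", then train
stt.train_language_model(customization_id)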

Yeah yeah — this is all really great but again, where do I start?

Where do I start to train Watson STT? | Photo by Kyle Simmons on Unsplash

Data Collection

The first thing you need is data. Not just any data: Representative Audio Data, meaning data that comes from your actual end users. We recommend collecting 10 to 50 hours of audio, depending on the use case, business flows, data inputs and target user population. You need to know the demographic information and distribution of your target users:

  • Location: Some US states, entire USA, worldwide, …
  • Accents: plain English, Indian English, Hispanic English, Japanese English, …
  • Devices: cellphone, deskphone, softphone, computer,…
  • Environments: busy open office, waiting room, quiet office,…
  • Length of Utterances: Long versus short responses
Document the location, accents, devices and environments metadata with your audio files | Photo by Product School on Unsplash | Photo by Lauren Mancke on Unsplash | Photo by Hal Gatewood on Unsplash

You can use existing call recordings from your current IVR or call center solution, but in most cases I have seen, they are very “conversational / free-form / multi-topic” and rarely mimic the target solution. If you have human transcriptions of these audio files, you can search for calls that match your use case and business flows, then extract the utterances and data inputs you need (text and audio segments). If not, you will most likely end up collecting new, more targeted audio files.

It is recommended to maintain a matrix of accents, devices and environments for each collected audio file in your data set. This matrix will help you verify that the data set is representative of your actual user population and identify gaps where more training material is required.

Example of a matrix with demographic information
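If you track this metadata in a simple CSV file, a short script can flag under-represented categories. The sketch below is a minimal example; the file name and the accent, device and environment column names are assumptions you would adapt to your own matrix.

import csv
from collections import Counter

# Hypothetical metadata file: one row per collected audio file,
# with columns such as filename, accent, device, environment, duration_sec
with open("audio_metadata.csv", newline="") as f:
    rows = list(csv.DictReader(f))

total = len(rows)
for category in ("accent", "device", "environment"):
    counts = Counter(row[category] for row in rows)
    print(f"\n{category} coverage ({total} files):")
    for value, count in counts.most_common():
        print(f"  {value:<20} {count:>4}  ({count / total:.0%})")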

How Do I Collect Data?

You will need to identify testers, ideally actual users of your previous solution, matching your demographic matrix (location, accents, devices, environments, etc.). The more varied your testers are across that matrix, the better the quality and variety of your data will be. You can either choose and manage your own group of testers, or crowdsource the work to a company using the specific data collection requirements we will cover next.

To be as representative as possible, you need a data collection environment that is as close as possible to your target solution. For our IVR example, you will need to set up an environment that your testers can call into so their voices can be recorded. All the customers I have worked with set up three environments: Development, User Acceptance Testing (UAT) and Production. You can use the Development environment for your data collection: simply stand up an IBM Voice Gateway directly connected to a Watson Assistant skill (with a simple welcome message and instructions), an STT service and a TTS service. Your telephony team will need to configure an 800 number that connects to your Voice Gateway.

Next is your data collection strategy. If you know your use case and business flows, you know what you are expecting your users to provide:

  • Utterances — to identify your intents and entities
  • Data inputs — dates, IDs, numbers, amounts, part numbers

For the testers, should you use a “free-form” style or a “scripted” style of data collection? Here are the pros and cons of each style below.

Free form:

  • PRO: Provides variety of data, more representative — best for utterances (intents)
  • CON: Unpredictable, hard to manage — especially for data inputs, will need manual transcription

Scripted:

  • PRO: Expected utterances/data inputs (no manual transcription, only review), easier for testers to read and follow
  • CON: Limited scope of data set — could result in model overfitting

The ideal solution is a balance of both: build multiple scripts (minimum 5, recommended 10) where you provide a wide variety of pre-scripted data inputs and utterances to read, then include some “free-form” fields where you give the testers instructions like “How would you ask about filing a claim?”. Include an “identification” section in your scripts; it will greatly help you build your matrix without having to listen to the entire audio (see the parsing sketch after the example). Here’s an example:

I am reading training script #<1, 2, 3,…,10>

I am a <gender> — male, female

I live in <city>, <state> — Newark, New Jersey

I am using a <device> — cellphone, computer, deskphone

I am calling from <environment> — office, street, living room, warehouse

I speak English with a <accent>
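Because the identification section follows a predictable pattern, you can pre-fill your matrix directly from the transcriptions. Below is a minimal sketch, assuming the transcribed identification lines look like the example above; the sample text and regular expressions are illustrative and would need to match your actual script wording.

import re

# Hypothetical transcription of a script's identification section
transcript = (
    "I am reading training script number three "
    "I am a female I live in Newark New Jersey "
    "I am using a cellphone I am calling from my living room "
    "I speak English with a Hispanic accent"
)

patterns = {
    "script":      r"training script (?:number )?(\w+)",
    "gender":      r"I am a (male|female)",
    "location":    r"I live in ([\w ]+?) I am using",
    "device":      r"I am using a ([\w ]+?) I am calling",
    "environment": r"I am calling from (?:my )?([\w ]+?) I speak",
    "accent":      r"I speak English with an? ([\w ]+?) accent",
}

# Extract each field, or None if the tester skipped that line
metadata = {
    field: (m.group(1) if (m := re.search(regex, transcript)) else None)
    for field, regex in patterns.items()
}
print(metadata)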

Testers can call from anywhere if it matches your use case | Photo by Marc Kleen on Unsplash

Test your scripts. Based on past experience, we have seen good adoption when a script does not exceed 15 minutes in duration.

Make sure you deliver a training session to your testers so they understand what you expect out of this exercise, how to read the different scripts, and how many calls they should make. Establish clear collection targets (e.g., 20 hours of audio in the next 3 weeks) and track progress on a daily basis.

Human Transcriptions

With all the audio files you have collected, you need to get them transcribed by humans (transcribers).

Photo by The Climate Reality Project on Unsplash

Wait! Why do I need to get human transcriptions when I already have my scripts above? I’ll just use my scripts and that will be the end of it.

Well… not quite. You’ll find out pretty quickly that your testers will:

  • hesitate (e.g., hummm, mmm, oops, sorry)
  • stutter / restart (e.g., one t-t-two three … one two three five <uhm> one two three four five…)
  • cheat / skip lines
  • cut the script short
  • improvise / not follow the script

There are two reasons why we need human transcriptions of the collected audio files:

  • Measure the accuracy of our STT service — establish the “reference” comparison, knowing exactly what the audio says, so we can compare it to what STT returns (see the sketch below).
  • Train a Language Model Adaptation — even if you use a script, your testers will provide expressions and jargon in their free-form utterances, and that is exactly what we need for the language model adaptation.

More to come in part 2.
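A common way to measure that accuracy is the Word Error Rate (WER): the number of word substitutions, insertions and deletions needed to turn the STT output into the reference transcription, divided by the number of words in the reference. Here is a minimal sketch of the calculation; the two sample strings are made up for illustration.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard edit-distance dynamic programming over words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(ref)][len(hyp)] / len(ref)

# Hypothetical example: reference transcription vs. what STT returned
reference = "I would like to file a claim for my policy"
hypothesis = "I would like to file the claim for my policy number"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")  # 20.00%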

Customers have two options to transcribe the audio files:

  • SMEs — using customer staff who know the use case very well (expressions, acronyms, products, IDs, etc.) may make sense, but be mindful that they all have their day jobs. They may also not be as efficient as professional transcriptionists, and this option can be expensive or hard to scale if you have a large volume of audio files.
  • Outsource — using a company that specializes in audio transcription may be the obvious choice, but budget is usually a concern. You must also consider data confidentiality and ownership with the vendor, and an SME may still be needed to review the transcriptions for quality control.

For Proof-of-Concepts, we used both consultants and customer SMEs to expedite this task as the data set is usually very small.

You will also need a Human Transcription Protocol — a standard, consistent way of transcribing audio across all human transcribers. Here are some general guidelines I have used with customers:

Transcribe everything that is said as words, never use numbers. For example, the year 1997 should be spelled out “nineteen ninety seven,” and a street address like 1901 Center St. should be spelled out “nineteen oh one Center street.”

Human sounds that are not speech such as coughing, laughter, loud breath noise, or a sneeze should all be transcribed as <vocal_noise>

Other small disfluencies like “uh” or “um” should all be marked as: <uh> or with the tag %HESITATION (this depends on the final use of the transcription).

Do not tag any other sounds or noises, even loud ones, unless directed.

Only if the task at hand calls for it, mark background noise with a speaker ID of ‘noise’ or a transcription annotation of <noise>.

Do not make up or use any new tags not mentioned in this sheet, unless directed.

If someone stammers and says ‘thir-thirty’, the corresponding transcription would be ‘thir- thirty’. Please be sure to leave a blank space after the “-” in the above example.

Don’t use any punctuation such as periods, commas, question marks, or exclamation marks, and don’t go back to correct punctuation that already exists in a transcription.

Transcribe abbreviations spoken as letters using capital letters with a period and a space after each letter: “I. B. M.”, “F. T. P.”

Transcribe acronyms that are said as words using capital letters without spaces: “NASA”, “DAT”

If someone spells out a word, capitalize the letters and put a space between them. For example, “my name is Dana D. A. N. A.”

If you are unsure of a word or phrase, do your best to transcribe it. But if you can’t understand it or feel very unsure, then mark it <unintelligible>.
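A lightweight script can catch the most common protocol violations (digits, punctuation, unknown tags) before the transcriptions are used for scoring or training. This is a minimal sketch based on the guidelines above; the allowed tag list and the sample string are assumptions you would adapt to your own protocol.

import re

# Tags permitted by this (hypothetical) protocol
ALLOWED_TAGS = {"<vocal_noise>", "<uh>", "<um>", "<noise>", "<unintelligible>"}

def check_transcription(text: str) -> list[str]:
    """Return a list of protocol violations found in one transcription."""
    issues = []
    if re.search(r"\d", text):
        issues.append("contains digits -- numbers must be spelled out as words")
    if re.search(r"[,?!]", text) or re.search(r"(?<![A-Z])\.", text):
        issues.append("contains punctuation (only periods in spelled-out letters are allowed)")
    for tag in re.findall(r"<[^>]+>", text):
        if tag not in ALLOWED_TAGS:
            issues.append(f"unknown tag {tag}")
    return issues

# Hypothetical example
sample = "my claim number is 1997, I mean <laugh> nineteen ninety seven"
for issue in check_transcription(sample):
    print(issue)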

Building Your Training Set and Your Test Set

Once you have collected your audio files and human transcriptions, the next step is to split them into a training set and a test set. Depending on the amount of data you have collected, you can use the 80/20 rule (80% training / 20% test). Make sure both data sets are randomly selected, evenly distributed across your matrix areas with no overlap between them.

Using a simple spreadsheet with columns and a random function will do the trick. I usually start with the Accents category: I sort the column to group all rows with the same accent together, copy each accent group into its own Excel worksheet, randomize the rows within each accent tab, then pick the top 20% of each accent. I create a Test Set worksheet and copy the top 20% of each accent into it. Finally, I manually check the balance of devices and environments in the test set to make sure all areas are covered.
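If you prefer to script the split instead of using a spreadsheet, here is a minimal sketch that stratifies by accent in the same way. It reuses the hypothetical audio_metadata.csv file and column names from the matrix example earlier.

import csv
import random
from collections import defaultdict

random.seed(42)  # make the split reproducible

# Same hypothetical metadata file as earlier: one row per audio file
with open("audio_metadata.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Group by accent, shuffle within each group, take roughly 20% of each as test
by_accent = defaultdict(list)
for row in rows:
    by_accent[row["accent"]].append(row)

train_set, test_set = [], []
for accent, group in by_accent.items():
    random.shuffle(group)
    cut = max(1, round(len(group) * 0.2))  # at least one test file per accent
    test_set.extend(group[:cut])
    train_set.extend(group[cut:])

print(f"{len(train_set)} training files, {len(test_set)} test files")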

In Part 2, I will explain how to use the collected data to configure and train Watson STT, run some experiments, and optimize the models.

To learn more about STT, you can also go through the STT Getting Started video HERE



Sr Product Manager, IBM Watson Speech / Language Translator. Very enthusiastic and passionate about AI technologies and methodologies. All views are only my own