Watson Speech-To-Text: How to Train Your Speech “Dragon” — Part 3: How to Build a Grammar for Data Inputs

Published in IBM Watson Speech Services · 7 min read · Aug 21, 2019

Grammar is a powerful feature and very helpful with handling IDs with speech | Photo by Edho Pratama on Unsplash | Photo by Romain Vignes on Unsplash

In Part 1, I walked you through how to identify the data required, collect it and prepare it properly. In Part 2, I explained how to use the collected text and audio data to train the Watson STT service using the multiple speech adaptation features available.

In this last article of the trilogy, I want to take the time to explain the latest and pretty powerful training feature of STT — Grammar. I will also show you how to build one, train STT with it and use it.

Why Do I Need a Grammar in Speech?

In voice solutions, depending on the use case, you expect different kinds of data to be captured and processed:

General Utterance: Dictates the direction of the conversation, defined as the “intent”. This answers the very typical question: “How may I help you today?”. It can also be one-word responses (“entities”) that steer the conversation more precisely.
Data Inputs: Pieces of information that help identify a user or a component important in business processes (e.g. user authentication, ordering a part, inquiring about a claim).

Some examples of data inputs are:

Member ID
Card Number
Policy Number
Airline Ticket Number
Healthcare Procedure Code
Date of Birth
Phone Number
Zip Code

For each data input, you have either general or specific “rules” to follow. It can go from digits only to alphanumerics to date formats.

Data Inputs can be of any format | Photo by Ryan Born on Unsplash | Photo by Nick Hillier on Unsplash | Photo by Nicole Geri on Unsplash

Some examples of “rules” for some well-known data inputs are:

Social Security Number (US): AAA-GG-SSSS. The first three-digit field is called the “area number”. The central, two-digit field is called the “group number”. The final, four-digit field is called the “serial number”.
Vehicle Identification Number (VIN): Unique code identifying a specific automobile, composed of 17 characters (digits and capital letters).
Credit Card Number: Numbers composed of 8 to 19 digits structured as follows: a 6- or 8-digit Issuer Identification Number (IIN), whose first digit is the major industry identifier (MII), then a variable-length individual account identifier (up to 12 digits), and a final check digit.

Each company can also identify their products and their members using a unique identifier, based on their own in-house custom rules.

Watson STT has a feature called Grammar that helps it recognize these data inputs.

A grammar uses a formal language specification to define a set of production rules for transcribing strings. The rules specify how to form valid strings from the language’s alphabet. When you apply a grammar to speech recognition, the service can return only one or more of the phrases that are generated by the grammar.

Grammar is so cool! Where do I start? | Photo by Collin Armstrong on Unsplash

How Do I Build a Grammar for my Watson STT?

In Part 2, under the “Using the Grammar Feature for Data Inputs” section, I provide the technical steps to add a grammar to an existing language model customization once it is trained and ready.

But before we do that, we need to build one using one of the following standard formats:

Augmented Backus-Naur Form (ABNF): Plain-text similar to traditional BNF grammar.
XML Form: XML elements used to represent the grammar.

As a reference, check the Watson STT Grammar documentation and the W3C Speech Recognition Grammar Specification Version 1.0.

For this example, I will use the XML format. We will look at a fictitious “Product number” that has 9 positions with the following 2 possible formats:

2 letters followed by 7 digits
1 letter followed by 8 digits
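Before writing any grammar, it can help to state the two formats as a regular expression. The snippet below is only a sketch for checking test values against the rules above. The function name and sample values are made up for illustration, and it assumes the letters are uppercase.

```python
import re

# Either 2 letters + 7 digits, or 1 letter + 8 digits (9 positions in total).
PRODUCT_NUMBER = re.compile(r"(?:[A-Z]{2}[0-9]{7}|[A-Z][0-9]{8})")


def is_product_number(value: str) -> bool:
    """Return True when value matches one of the two 9-position formats."""
    return PRODUCT_NUMBER.fullmatch(value) is not None


print(is_product_number("AB1234567"))  # True
print(is_product_number("A12345678"))  # True
print(is_product_number("ABC123456"))  # False
```

A check like this is handy later for building a list of valid and invalid test cases to say out loud when you test the grammar.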

I encourage you to use a text editor that supports XML display like Atom, to help visualize and troubleshoot the grammar configuration. All the screenshots in this article are taken from Atom.

The grammar XML has the mandatory start and end tags <grammar version…..> and </grammar>. It is broken down into sections called “rules” (each introduced by “rule id”). Each rule contains one or more lists of valid items that are checked when audio is transcribed by Watson STT. Each rule must start and end with these tags: <rule id="…"> and </rule>. Within each rule, the list of items must sit between the <one-of> and </one-of> XML tags. Each item must be wrapped in these tags: <item> and </item>. The “root” value determines which rule contains the main pattern. In our example, the main rule is called “patterns”. Let’s expand it and take a look.
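Here is a minimal sketch of that overall structure, wrapped in a few lines of Python so you can confirm the file is well-formed before training. The header attribute values (language, namespace) and the placeholder phrases are my assumptions for illustration. The namespace is the one defined by the W3C SRGS specification.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"  # namespace from the W3C SRGS spec

# Illustrative skeleton: header values and item phrases are assumptions.
GRAMMAR_XML = """<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xml:lang="en-US" root="patterns"
         xmlns="http://www.w3.org/2001/06/grammar">
  <!-- "root" above names the rule that holds the main pattern -->
  <rule id="patterns">
    <one-of>
      <item>first valid phrase</item>
      <item>second valid phrase</item>
    </one-of>
  </rule>
</grammar>
"""


def describe(grammar_xml):
    """Parse the grammar and return (root rule name, list of rule ids)."""
    grammar = ET.fromstring(grammar_xml.encode("utf-8"))
    rule_ids = [rule.get("id") for rule in grammar.iter(f"{{{SRGS_NS}}}rule")]
    return grammar.get("root"), rule_ids


root_rule, rule_ids = describe(GRAMMAR_XML)
assert root_rule in rule_ids  # the root must point at a rule defined in the file
print(root_rule, rule_ids)
```

If the XML is malformed (a missing closing tag, for example), the parser raises an error right away, which is quicker to spot than a failed grammar upload.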

For the Product number, we have defined 3 different patterns (listed as items), one for each known product number format and one for utterances. You can add a comment to document what each pattern is doing.

Let’s take a look at our first product number pattern: the “9 positions: 2 letters and 7 single digits”.

9-position Grammar Pattern starting with 2 letters followed with 7 digits

Each item contains what we call a “rule reference” (listed as “ruleref uri”), which is simply a link to another rule configured in this same XML file. This makes grammar maintenance much easier: you create reusable rules, then refer to them from any other rule.

The typical rules you start with are numbers and letters, as shown below:

Choice of letters within the “letters” rule

Choice of numbers within the “numbers” rule

As you can see, the “numbers” rule lists each possible digit, including “oh” as a possible option for “zero”. The same goes for the “letters” rule, which lists every letter from A. to Z. Make sure you include the dot following each letter, since that is how Watson STT transcribes letters.
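Typing 26 letters by hand is error-prone, so one option is to generate these two rules. The sketch below is not how the screenshots were produced. It assumes digits are listed as spoken words, and the helper name is hypothetical.

```python
import string
import xml.etree.ElementTree as ET


def one_of_rule(rule_id, choices):
    """Build <rule id=...><one-of><item>...</item>...</one-of></rule>."""
    rule = ET.Element("rule", {"id": rule_id})
    one_of = ET.SubElement(rule, "one-of")
    for choice in choices:
        ET.SubElement(one_of, "item").text = choice
    return rule


# Assumption: digits are listed as spoken words; "oh" is an alternative for "zero".
DIGIT_WORDS = ["zero", "oh", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
# Each letter carries a trailing dot, matching how STT transcribes spelled letters.
LETTERS = [f"{letter}." for letter in string.ascii_uppercase]

numbers_rule = one_of_rule("numbers", DIGIT_WORDS)
letters_rule = one_of_rule("letters", LETTERS)

ET.indent(numbers_rule)  # pretty-print; requires Python 3.9+
print(ET.tostring(numbers_rule, encoding="unicode"))
```

The printed XML can be pasted into the grammar file next to the other rules.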

Let’s go back to our first pattern: the “9 positions: 2 letters and 7 single digits”

As you can see, since one of the valid product formats starts with 2 letters followed by 7 numbers, you simply add one rule reference for each position.
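To make the “one rule reference per position” idea concrete, here is a small sketch that builds such an item. It assumes the reusable rules are named “letters” and “numbers” as above. The helper function is hypothetical.

```python
import xml.etree.ElementTree as ET


def pattern_item(letters, digits):
    """Build an <item> holding one <ruleref> per position of the product number."""
    item = ET.Element("item")
    for _ in range(letters):
        ET.SubElement(item, "ruleref", {"uri": "#letters"})
    for _ in range(digits):
        ET.SubElement(item, "ruleref", {"uri": "#numbers"})
    return item


# 9 positions: 2 letters and 7 single digits
two_letters_seven_digits = pattern_item(letters=2, digits=7)

ET.indent(two_letters_seven_digits)  # pretty-print; requires Python 3.9+
print(ET.tostring(two_letters_seven_digits, encoding="unicode"))
```

The second format is the same call with `letters=1, digits=8`.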

If this product format is more frequent than the one with 1 letter and 8 digits, you can use the “weight” attribute to increase the weight of this format. This tells STT to try this pattern first when validating what it hears, then the other one (which has no “weight”). You do not need to configure a “weight” on every item. I recommend that you first put a “weight” on one pattern only (the most frequent) and test.
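As a sketch, the main “patterns” rule with a weighted first item could look like the fragment below. The weight value of 2.0 is arbitrary and only for illustration, so tune it through testing and check the current Watson STT documentation for how weights are honored. The small helper just reads the weights back so you can verify which items carry one.

```python
import xml.etree.ElementTree as ET

LETTER = '<ruleref uri="#letters"/>'
DIGIT = '<ruleref uri="#numbers"/>'

# Assumptions: the weight value is arbitrary, and the "letters", "numbers" and
# "utterances" rules are defined elsewhere in the same grammar file.
PATTERNS_RULE = f"""
<rule id="patterns">
  <one-of>
    <!-- 9 positions: 2 letters and 7 single digits (most frequent) -->
    <item weight="2.0">{LETTER * 2}{DIGIT * 7}</item>
    <!-- 9 positions: 1 letter and 8 single digits -->
    <item>{LETTER}{DIGIT * 8}</item>
    <!-- opt-out utterances -->
    <item><ruleref uri="#utterances"/></item>
  </one-of>
</rule>
"""


def item_weights(rule_xml):
    """Return the weight of each top-level item (None when no weight is set)."""
    rule = ET.fromstring(rule_xml.strip())
    weights = []
    for item in rule.find("one-of"):
        weight = item.get("weight")
        weights.append(float(weight) if weight is not None else None)
    return weights


print(item_weights(PATTERNS_RULE))  # [2.0, None, None]
```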

But what happens if users sometimes, but not always, say the product number with a prefix or a suffix?

“The product number is …..”
“… is my product number”

You can build rules with “optional” items, meaning that if one of the items is mentioned, transcribe it, if not, leave it blank. This is how the “prefix” and “suffix” rules are built — see below:

“Prefix” rule containing optional prefixes

“Suffix” rule containing optional suffixes

The “optional” item is configured with the <item repeat="0-1"> and </item> tags, which means that this item may appear zero times or one time.
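Here is a sketch of an optional “prefix” rule using the first phrase from the example above (the “suffix” rule follows the same shape). The lowercase casing of the phrase is my assumption. The helper that parses the repeat value is hypothetical and only there to show what "0-1" means.

```python
import xml.etree.ElementTree as ET

PREFIX_RULE = """
<rule id="prefix">
  <item repeat="0-1">
    <one-of>
      <item>the product number is</item>
    </one-of>
  </item>
</rule>
"""


def repeat_bounds(value):
    """Parse a repeat value such as "0-1", "3" or "2-" into (min, max).

    max is None when the range is open-ended.
    """
    if "-" not in value:
        count = int(value)
        return count, count
    low, high = value.split("-", 1)
    return int(low), (int(high) if high else None)


prefix = ET.fromstring(PREFIX_RULE.strip())
optional_item = prefix.find("item")
print(repeat_bounds(optional_item.get("repeat")))  # (0, 1)
```

Note that the repeat value needs a plain hyphen and straight quotes. Typographic dashes or curly quotes pasted from a word processor will not parse.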

The “utterances” rule is also built the same way. It contains what we call “opt-out” utterances, allowing the users to say something other than a product number, which would then be sent to Watson Assistant as-is so they can get immediately transferred to a call agent. Unfortunately, if users opt out using other valid utterances, you have to add them to the grammar configuration, or you will end up getting an invalid (blank) response from STT and the user will be reprompted to give the product number.
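Below is a sketch of an “utterances” rule together with a small sanity check that I find useful before uploading: it lists any rule that is referenced but not defined. The opt-out phrases are hypothetical examples, so replace them with what your users actually say. The “letters” and “numbers” rules are left out on purpose to show what the check reports.

```python
import xml.etree.ElementTree as ET

NS = {"g": "http://www.w3.org/2001/06/grammar"}  # namespace from the W3C SRGS spec

# Deliberately incomplete grammar: "letters" and "numbers" are not defined here.
GRAMMAR_XML = """
<grammar version="1.0" xml:lang="en-US" root="patterns"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="patterns">
    <one-of>
      <item><ruleref uri="#letters"/><ruleref uri="#numbers"/></item>
      <item><ruleref uri="#utterances"/></item>
    </one-of>
  </rule>
  <rule id="utterances">
    <one-of>
      <!-- Hypothetical opt-out phrases -->
      <item>agent</item>
      <item>representative</item>
      <item>i do not have it</item>
    </one-of>
  </rule>
</grammar>
"""


def missing_rules(grammar_xml):
    """Return the sorted names of rules that are referenced but never defined."""
    grammar = ET.fromstring(grammar_xml.strip())
    defined = {rule.get("id") for rule in grammar.iterfind(".//g:rule", NS)}
    referenced = {grammar.get("root")}
    for ref in grammar.iterfind(".//g:ruleref", NS):
        uri = ref.get("uri", "")
        if uri.startswith("#"):  # local reference to a rule in this file
            referenced.add(uri[1:])
    return sorted(referenced - defined)


print(missing_rules(GRAMMAR_XML))  # ['letters', 'numbers']
```

An empty list means every rule reference, and the root, points at a rule that exists in the file.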

Test it out and have fun.

In my next article, I will show you how to leverage all this STT knowledge and start designing an IVR solution with other Watson components.

To learn more about STT, you can also watch the STT Getting Started video.


Written by Marco Noel