Watson Speech-To-Text: A Tale Of Multiple US Cities … with Grammars

Marco Noel
IBM Watson Speech Services
Sep 26, 2020

As customers expand their IVR and other voice solutions, more requirements around specific new data inputs are introduced. One of them I recently had to deal with is US states and cities.

Please note that, although I am covering a US-specific use case in this article, it can be easily applied to other countries (like my very own Canada with its great provinces and cities).

When discussing with clients and colleagues, I often hear the following comments:

“Google is identifying cities and states easily.”

“Google and Siri recognize cities and states on my phone, why don’t you?”

Just because Google Maps recognizes US states and cities does not mean that the Google Speech engine can do it natively for any use case (I did a little research on Google Speech and found no “native” states/cities feature for it). When someone makes the statements above, I always ask for their test methodology and the data sets they used for their evaluation. Then the discussion becomes fuzzy and I never get anything back. In the end, I encourage them to conduct their own evaluation using my article, then get back to me to expand our discussion further.

Then, a lightbulb turned on in my head: why don’t I do it myself, just like a customer?

Tough Problem … But Solvable

The pronunciation of names of people, streets, cities and states is very challenging and subjective: you can often end up with 10+ different spoken variations of the same name, depending on whether the speaker is a native speaker, on their accent, and on their country of origin.

Here at IBM, we could create Language Model customizations with all the cities and states, but with 50 states and 25000+ cities, plus the fallback to the Base Model, controlling the speech accuracy becomes challenging.

The Grammar feature in Watson Speech-to-Text is a great fit because you bound Watson STT to a limited list of items with no other fallback — it’s in the list or it’s not.

To improve the accuracy even further, split the state and the city into separate prompts. I know that it may be more natural for a human to say “Austin, Texas” in one go, but the more data you expect within the same prompt, the more exposed you are to errors, and therefore to re-prompts and user frustration. Never lose sight of your business objectives: great user experience with minimal frustrations. Check this article for more details.

In this experiment, I used the following grammar configurations, uploaded into the same Language Model customization, then trained it:

  • 1 grammar for US States (50 items)
  • 50 grammars, one for each state, each containing a list of major cities (see table below)
Each state grammar with their respective number of city entries

For more details on how to build grammar customization into Speech-to-Text, you can check out this article.

Test Methodology

Going through my article to evaluate speech engines, I identified my typical use case: as a user, I call into an IVR, and, at one point, I need to provide my US state and my US city. The metric I will use to evaluate accuracy is the City Error Rate.

Then, I hit my first blocker: I have no representative test set.

Dang! How do I work around that? — I need audio data to test.

Then, I thought about our great TTS Neural Voices and ... Emily to the rescue! (OK, I’m biased: EmilyV3 is my favorite TTS Neural Voice.) I can get Emily to generate an individual audio file for each US state and city. That’s 25185 audio files in total (25135 cities plus the 50 states). Nice!

I can already hear all of you out there, screaming that I am not following my own rules. Here’s why my fallback plan is not as bad as you might think with this test set:

  • Experiment with the Base Model — no grammar. This will be your baseline.
  • Experiment with the Language Model customization and grammar files. This is experiment #1 to compare to your baseline.

What I can get is a relative accuracy improvement between the Base Model and the Grammars. This is a great first step and will confirm if it’s worth continuing this effort.

Does it replace the need for representative data? — ABSOLUTELY NOT! As I mentioned earlier in this article, when you deal with real users, real accents, real variations, you’ll get different but more representative results. If you deal with accents, you can create an Acoustic Model customization to support your grammars. For more details on how to create one, check this article.

TIP: Let me share with you the bash script I used to build my test set and generate an audio file for each city. Before using it, you’ll need to create a TTS Service here. You will need a Standard Plan, since you will easily exceed the monthly limit of 10k characters.

file=file_with_list_of_cities
url="url of your TTS Service"
# Placeholder credentials in the form apikey:<your API key>
useCred="apikey:your_api_key"
voice=en-US_EmilyV3Voice
# Loop over each line of the file, including a last line with no trailing newline
while IFS='' read -r line || [[ -n "$line" ]]
do
  # Skip blank lines
  [[ -z "$line" ]] && continue
  # Each audio file uses the city name as the base filename, with spaces removed
  output_std_wav="${line// /}.wav"
  echo "Processing city name $line into $output_std_wav using voice $voice"
  # Generate the TTS audio file
  curl -X POST -u "$useCred" --header "Content-Type: application/json" --header "Accept: audio/wav" --data "{\"text\":\"$line\"}" --output "./$output_std_wav" "$url/v1/synthesize?voice=$voice"
done < "$file"

I used 50 text files (one per state), each with the list of cities. Here’s an example below for Alabama (Alabama.txt):

Abbeville
Adamsville
Addison
Akron
Alabaster
Albertville
Alexander City
Alexandria
Aliceville
Allgood
Altoona
Andalusia
Anderson
Anniston
Arab
Ardmore
Argo
Ariton
Arley
Ashford
Ashland
Ashville
Athens
Atmore
Attalla
Auburn
Autaugaville
Avon
Babbie
Baileyton
Banks
Bay Minette
Bayou La Batre
Bear Creek
Beatrice
Beaverton
Belk
Benton
Berry
Bessemer
Billingsley
Birmingham
Black
Blountsville
Blue Mountain
Blue Ridge
Blue Springs
Boaz
Boligee
Bon Air
Branchville
Brantley
Brent
Brewton
Bridgeport
Brighton
Brilliant
Brookside
Brookwood
Brundidge
Butler
Bynum
Cahaba Heights
Calera
Camden
Camp Hill
Carbon Hill
Cardiff
Carolina
Carrollton
Castleberry
Cedar Bluff
Center Point
Centre
Centreville
Chalkville
Chatom
Chelsea
Cherokee
Chickasaw
Childersburg
Citronelle
Clanton
Clay
Clayhatchee
Clayton
Cleveland
Clio
Coaling
Coffee Springs
Coffeeville
Coker
Collinsville
Colony
Columbia
Columbiana
Concord
Coosada
Cordova
Cottonwood
County Line
Courtland
Cowarts
Creola
Crossville
Cuba
Cullman
Dadeville
Daleville
Daphne
Dauphin Island
Daviston
Dayton
Deatsville
Decatur
Demopolis
Detroit
Dodge City
Dora
Dothan
Double Springs
Douglas
Dozier
Dutton
East Brewton
Eclectic
Edgewater
Edwardsville
Elba
Elberta
Eldridge
Elkmont
Elmore
Emelle
Enterprise
Epes
Ethelsville
Eufaula
Eunola
Eutaw
Eva
Evergreen
Excel
Fairfield
Fairhope
Fairview
Falkville
Faunsdale
Fayette
Five Points
Flomaton
Florala
Florence
Foley
Forestdale
Forkland
Fort Deposit
Fort Payne
Fort Rucker
Franklin
Frisco City
Fruithurst
Fulton
Fultondale
Fyffe
Gadsden
Gainesville
Gantt
Gantts Quarry
Garden City
Gardendale
Gaylesville
Geiger
Geneva
Georgiana
Geraldine
Gilbertown
Glen Allen
Glencoe
Glenwood
Goldville
Good Hope
Goodwater
Gordo
Gordon
Gordonville
Goshen
Grand Bay
Grant
Grayson Valley
Graysville
Greensboro
Greenville
Grimes
Grove Hill
Guin
Gulf Shores
Guntersville
Gurley
Gu-Win
Hackleburg
Haleburg
Haleyville
Hamilton
Hammondville
Hanceville
Harpersville
Hartford
Hartselle
Harvest
Hayden
Hayneville
Hazel Green
Headland
Heath
Heflin
Helena
Henagar
Highland Lake
Hillsboro
Hobson City
Hodges
Hokes Bluff
Holly Pond
Hollywood
Holt
Homewood
Hoover
Horn Hill
Hueytown
Huguley
Huntsville
Hurtsboro
Hytop
Ider
Indian Springs Village
Irondale
Jackson
Jacksons Gap
Jacksonville
Jasper
Jemison
Kansas
Kennedy
Killen
Kimberly
Kinsey
Kinston
Ladonia
La Fayette
Lake Purdy
Lakeview
Lake View
Lanett
Langston
Leeds
Leesburg
Leighton
Lester
Level Plains
Lexington
Libertyville
Lincoln
Linden
Lineville
Lipscomb
Lisman
Littleville
Livingston
Loachapoka
Lockhart
Locust Fork
Louisville
Lowndesboro
Loxley
Luverne
Lynn
McDonald Chapel
Macedonia
McIntosh
McKenzie
McMullen
Madison
Madrid
Malvern
Maplesville
Margaret
Marion
Maytown
Meadowbrook
Memphis
Mentone
Meridianville
Midfield
Midland City
Midway
Mignon
Millbrook
Millport
Millry
Minor
Mobile
Monroeville
Montevallo
Montgomery
Moody
Moores Mill
Mooresville
Morris
Mosses
Moulton
Moundville
Mountainboro
Mountain Brook
Mount Olive
Mount Vernon
Mulga
Munford
Muscle Shoals
Myrtlewood
Napier Field
Natural Bridge
Nauvoo
Nectar
Needham
Newbern
New Brockton
New Hope
New Market
New Site
Newton
Newville
North Bibb
North Courtland
North Johns
Northport
Notasulga
Oak Grove
Oak Hill
Oakman
Odenville
Ohatchee
Oneonta
Onycha
Opelika
Opp
Orange Beach
Orrville
Owens Cross Roads
Oxford
Ozark
Paint Rock
Parrish
Pelham
Pell City
Pennington
Petrey
Phenix City
Phil Campbell
Pickensville
Piedmont
Pike Road
Pinckard
Pine Apple
Pine Hill
Pine Ridge
Pinson
Pisgah
Pleasant Grove
Pleasant Groves
Point Clear
Pollard
Powell
Prattville
Priceville
Prichard
Providence
Ragland
Rainbow City
Rainsville
Ranburne
Red Bay
Red Level
Redstone Arsenal
Reece City
Reform
Rehobeth
Repton
Ridgeville
River Falls
Riverside
Riverview
Roanoke
Robertsdale
Rock Creek
Rockford
Rock Mills
Rogersville
Rosa
Russellville
Rutledge
St. Florian
Saks
Samson
Sand Rock
Sanford
Saraland
Sardis City
Satsuma
Scottsboro
Section
Selma
Selmont-West Selmont
Sheffield
Shiloh
Shorter
Silas
Silverhill
Sipsey
Skyline
Slocomb
Smiths
Smoke Rise
Snead
Somerville
Southside
South Vinemont
Spanish Fort
Springville
Steele
Stevenson
Sulligent
Sumiton
Summerdale
Susan Moore
Sweet Water
Sylacauga
Sylvania
Sylvan Springs
Talladega
Talladega Springs
Tallassee
Tarrant
Taylor
Theodore
Thomaston
Thomasville
Thorsby
Tillmans Corner
Town Creek
Toxey
Trafford
Triana
Trinity
Troy
Trussville
Tuscaloosa
Tuscumbia
Tuskegee
Underwood-Petersville
Union
Union Grove
Union Springs
Uniontown
Valley
Valley Head
Vance
Vernon
Vestavia Hills
Vina
Vincent
Vredenburgh
Wadley
Waldo
Walnut Grove
Warrior
Waterloo
Waverly
Weaver
Webb
Wedowee
West Blocton
West End-Cobb Town
West Jefferson
West Point
Wetumpka
White Hall
Wilsonville
Wilton
Winfield
Woodland
Woodville
Yellow Bluff
York

I have created separate folders for each state since I have to test one grammar at a time. This also allows me to measure on a per-state basis.

As an example, here’s my Alabama grammar file:

<grammar version="1.0" xml:lang="en-US" root="patterns"
xmlns="http://www.w3.org/2001/06/grammar">
<rule id="patterns">
<one-of>
<!-- utterances -->
<item>
<ruleref uri="#utterances" />
</item>
<item>
<ruleref uri="#Alabama" />
</item>
</one-of>
</rule>
<rule id="Alabama">
<one-of>
<item>Abbeville</item>
<item>Adamsville</item>
<item>Addison</item>
<item>Akron</item>
<item>Alabaster</item>
<item>Albertville</item>
<item>Alexander City</item>
<item>Alexandria</item>
<item>Aliceville</item>
<item>Allgood</item>
<item>Altoona</item>
<item>Andalusia</item>
<item>Anderson</item>
<item>Anniston</item>
<item>Arab</item>
<item>Ardmore</item>
<item>Argo</item>
<item>Ariton</item>
<item>Arley</item>
<item>Ashford</item>
<item>Ashland</item>
<item>Ashville</item>
<item>Athens</item>
<item>Atmore</item>
<item>Attalla</item>
<item>Auburn</item>
<item>Autaugaville</item>
<item>Avon</item>
<item>Babbie</item>
<item>Baileyton</item>
<item>Banks</item>
<item>Bay Minette</item>
<item>Bayou La Batre</item>
<item>Bear Creek</item>
<item>Beatrice</item>
<item>Beaverton</item>
<item>Belk</item>
<item>Benton</item>
<item>Berry</item>
<item>Bessemer</item>
<item>Billingsley</item>
<item>Birmingham</item>
<item>Black</item>
<item>Blountsville</item>
<item>Blue Mountain</item>
<item>Blue Ridge</item>
<item>Blue Springs</item>
<item>Boaz</item>
<item>Boligee</item>
<item>Bon Air</item>
<item>Branchville</item>
<item>Brantley</item>
<item>Brent</item>
<item>Brewton</item>
<item>Bridgeport</item>
<item>Brighton</item>
<item>Brilliant</item>
<item>Brookside</item>
<item>Brookwood</item>
<item>Brundidge</item>
<item>Butler</item>
<item>Bynum</item>
<item>Cahaba Heights</item>
<item>Calera</item>
<item>Camden</item>
<item>Camp Hill</item>
<item>Carbon Hill</item>
<item>Cardiff</item>
<item>Carolina</item>
<item>Carrollton</item>
<item>Castleberry</item>
<item>Cedar Bluff</item>
<item>Center Point</item>
<item>Centre</item>
<item>Centreville</item>
<item>Chalkville</item>
<item>Chatom</item>
<item>Chelsea</item>
<item>Cherokee</item>
<item>Chickasaw</item>
<item>Childersburg</item>
<item>Citronelle</item>
<item>Clanton</item>
<item>Clay</item>
<item>Clayhatchee</item>
<item>Clayton</item>
<item>Cleveland</item>
<item>Clio</item>
<item>Coaling</item>
<item>Coffee Springs</item>
<item>Coffeeville</item>
<item>Coker</item>
<item>Collinsville</item>
<item>Colony</item>
<item>Columbia</item>
<item>Columbiana</item>
<item>Concord</item>
<item>Coosada</item>
<item>Cordova</item>
<item>Cottonwood</item>
<item>County Line</item>
<item>Courtland</item>
<item>Cowarts</item>
<item>Creola</item>
<item>Crossville</item>
<item>Cuba</item>
<item>Cullman</item>
<item>Dadeville</item>
<item>Daleville</item>
<item>Daphne</item>
<item>Dauphin Island</item>
<item>Daviston</item>
<item>Dayton</item>
<item>Deatsville</item>
<item>Decatur</item>
<item>Demopolis</item>
<item>Detroit</item>
<item>Dodge City</item>
<item>Dora</item>
<item>Dothan</item>
<item>Double Springs</item>
<item>Douglas</item>
<item>Dozier</item>
<item>Dutton</item>
<item>East Brewton</item>
<item>Eclectic</item>
<item>Edgewater</item>
<item>Edwardsville</item>
<item>Elba</item>
<item>Elberta</item>
<item>Eldridge</item>
<item>Elkmont</item>
<item>Elmore</item>
<item>Emelle</item>
<item>Enterprise</item>
<item>Epes</item>
<item>Ethelsville</item>
<item>Eufaula</item>
<item>Eunola</item>
<item>Eutaw</item>
<item>Eva</item>
<item>Evergreen</item>
<item>Excel</item>
<item>Fairfield</item>
<item>Fairhope</item>
<item>Fairview</item>
<item>Falkville</item>
<item>Faunsdale</item>
<item>Fayette</item>
<item>Five Points</item>
<item>Flomaton</item>
<item>Florala</item>
<item>Florence</item>
<item>Foley</item>
<item>Forestdale</item>
<item>Forkland</item>
<item>Fort Deposit</item>
<item>Fort Payne</item>
<item>Fort Rucker</item>
<item>Franklin</item>
<item>Frisco City</item>
<item>Fruithurst</item>
<item>Fulton</item>
<item>Fultondale</item>
<item>Fyffe</item>
<item>Gadsden</item>
<item>Gainesville</item>
<item>Gantt</item>
<item>Gantts Quarry</item>
<item>Garden City</item>
<item>Gardendale</item>
<item>Gaylesville</item>
<item>Geiger</item>
<item>Geneva</item>
<item>Georgiana</item>
<item>Geraldine</item>
<item>Gilbertown</item>
<item>Glen Allen</item>
<item>Glencoe</item>
<item>Glenwood</item>
<item>Goldville</item>
<item>Good Hope</item>
<item>Goodwater</item>
<item>Gordo</item>
<item>Gordon</item>
<item>Gordonville</item>
<item>Goshen</item>
<item>Grand Bay</item>
<item>Grant</item>
<item>Grayson Valley</item>
<item>Graysville</item>
<item>Greensboro</item>
<item>Greenville</item>
<item>Grimes</item>
<item>Grove Hill</item>
<item>Guin</item>
<item>Gulf Shores</item>
<item>Guntersville</item>
<item>Gurley</item>
<item>Gu-Win</item>
<item>Hackleburg</item>
<item>Haleburg</item>
<item>Haleyville</item>
<item>Hamilton</item>
<item>Hammondville</item>
<item>Hanceville</item>
<item>Harpersville</item>
<item>Hartford</item>
<item>Hartselle</item>
<item>Harvest</item>
<item>Hayden</item>
<item>Hayneville</item>
<item>Hazel Green</item>
<item>Headland</item>
<item>Heath</item>
<item>Heflin</item>
<item>Helena</item>
<item>Henagar</item>
<item>Highland Lake</item>
<item>Hillsboro</item>
<item>Hobson City</item>
<item>Hodges</item>
<item>Hokes Bluff</item>
<item>Holly Pond</item>
<item>Hollywood</item>
<item>Holt</item>
<item>Homewood</item>
<item>Hoover</item>
<item>Horn Hill</item>
<item>Hueytown</item>
<item>Huguley</item>
<item>Huntsville</item>
<item>Hurtsboro</item>
<item>Hytop</item>
<item>Ider</item>
<item>Indian Springs Village</item>
<item>Irondale</item>
<item>Jackson</item>
<item>Jacksons Gap</item>
<item>Jacksonville</item>
<item>Jasper</item>
<item>Jemison</item>
<item>Kansas</item>
<item>Kennedy</item>
<item>Killen</item>
<item>Kimberly</item>
<item>Kinsey</item>
<item>Kinston</item>
<item>Ladonia</item>
<item>La Fayette</item>
<item>Lake Purdy</item>
<item>Lakeview</item>
<item>Lake View</item>
<item>Lanett</item>
<item>Langston</item>
<item>Leeds</item>
<item>Leesburg</item>
<item>Leighton</item>
<item>Lester</item>
<item>Level Plains</item>
<item>Lexington</item>
<item>Libertyville</item>
<item>Lincoln</item>
<item>Linden</item>
<item>Lineville</item>
<item>Lipscomb</item>
<item>Lisman</item>
<item>Littleville</item>
<item>Livingston</item>
<item>Loachapoka</item>
<item>Lockhart</item>
<item>Locust Fork</item>
<item>Louisville</item>
<item>Lowndesboro</item>
<item>Loxley</item>
<item>Luverne</item>
<item>Lynn</item>
<item>McDonald Chapel</item>
<item>Macedonia</item>
<item>McIntosh</item>
<item>McKenzie</item>
<item>McMullen</item>
<item>Madison</item>
<item>Madrid</item>
<item>Malvern</item>
<item>Maplesville</item>
<item>Margaret</item>
<item>Marion</item>
<item>Maytown</item>
<item>Meadowbrook</item>
<item>Memphis</item>
<item>Mentone</item>
<item>Meridianville</item>
<item>Midfield</item>
<item>Midland City</item>
<item>Midway</item>
<item>Mignon</item>
<item>Millbrook</item>
<item>Millport</item>
<item>Millry</item>
<item>Minor</item>
<item>Mobile</item>
<item>Monroeville</item>
<item>Montevallo</item>
<item>Montgomery</item>
<item>Moody</item>
<item>Moores Mill</item>
<item>Mooresville</item>
<item>Morris</item>
<item>Mosses</item>
<item>Moulton</item>
<item>Moundville</item>
<item>Mountainboro</item>
<item>Mountain Brook</item>
<item>Mount Olive</item>
<item>Mount Vernon</item>
<item>Mulga</item>
<item>Munford</item>
<item>Muscle Shoals</item>
<item>Myrtlewood</item>
<item>Napier Field</item>
<item>Natural Bridge</item>
<item>Nauvoo</item>
<item>Nectar</item>
<item>Needham</item>
<item>Newbern</item>
<item>New Brockton</item>
<item>New Hope</item>
<item>New Market</item>
<item>New Site</item>
<item>Newton</item>
<item>Newville</item>
<item>North Bibb</item>
<item>North Courtland</item>
<item>North Johns</item>
<item>Northport</item>
<item>Notasulga</item>
<item>Oak Grove</item>
<item>Oak Hill</item>
<item>Oakman</item>
<item>Odenville</item>
<item>Ohatchee</item>
<item>Oneonta</item>
<item>Onycha</item>
<item>Opelika</item>
<item>Opp</item>
<item>Orange Beach</item>
<item>Orrville</item>
<item>Owens Cross Roads</item>
<item>Oxford</item>
<item>Ozark</item>
<item>Paint Rock</item>
<item>Parrish</item>
<item>Pelham</item>
<item>Pell City</item>
<item>Pennington</item>
<item>Petrey</item>
<item>Phenix City</item>
<item>Phil Campbell</item>
<item>Pickensville</item>
<item>Piedmont</item>
<item>Pike Road</item>
<item>Pinckard</item>
<item>Pine Apple</item>
<item>Pine Hill</item>
<item>Pine Ridge</item>
<item>Pinson</item>
<item>Pisgah</item>
<item>Pleasant Grove</item>
<item>Pleasant Groves</item>
<item>Point Clear</item>
<item>Pollard</item>
<item>Powell</item>
<item>Prattville</item>
<item>Priceville</item>
<item>Prichard</item>
<item>Providence</item>
<item>Ragland</item>
<item>Rainbow City</item>
<item>Rainsville</item>
<item>Ranburne</item>
<item>Red Bay</item>
<item>Red Level</item>
<item>Redstone Arsenal</item>
<item>Reece City</item>
<item>Reform</item>
<item>Rehobeth</item>
<item>Repton</item>
<item>Ridgeville</item>
<item>River Falls</item>
<item>Riverside</item>
<item>Riverview</item>
<item>Roanoke</item>
<item>Robertsdale</item>
<item>Rock Creek</item>
<item>Rockford</item>
<item>Rock Mills</item>
<item>Rogersville</item>
<item>Rosa</item>
<item>Russellville</item>
<item>Rutledge</item>
<item>St. Florian</item>
<item>Saks</item>
<item>Samson</item>
<item>Sand Rock</item>
<item>Sanford</item>
<item>Saraland</item>
<item>Sardis City</item>
<item>Satsuma</item>
<item>Scottsboro</item>
<item>Section</item>
<item>Selma</item>
<item>Selmont-West Selmont</item>
<item>Sheffield</item>
<item>Shiloh</item>
<item>Shorter</item>
<item>Silas</item>
<item>Silverhill</item>
<item>Sipsey</item>
<item>Skyline</item>
<item>Slocomb</item>
<item>Smiths</item>
<item>Smoke Rise</item>
<item>Snead</item>
<item>Somerville</item>
<item>Southside</item>
<item>South Vinemont</item>
<item>Spanish Fort</item>
<item>Springville</item>
<item>Steele</item>
<item>Stevenson</item>
<item>Sulligent</item>
<item>Sumiton</item>
<item>Summerdale</item>
<item>Susan Moore</item>
<item>Sweet Water</item>
<item>Sylacauga</item>
<item>Sylvania</item>
<item>Sylvan Springs</item>
<item>Talladega</item>
<item>Talladega Springs</item>
<item>Tallassee</item>
<item>Tarrant</item>
<item>Taylor</item>
<item>Theodore</item>
<item>Thomaston</item>
<item>Thomasville</item>
<item>Thorsby</item>
<item>Tillmans Corner</item>
<item>Town Creek</item>
<item>Toxey</item>
<item>Trafford</item>
<item>Triana</item>
<item>Trinity</item>
<item>Troy</item>
<item>Trussville</item>
<item>Tuscaloosa</item>
<item>Tuscumbia</item>
<item>Tuskegee</item>
<item>Underwood-Petersville</item>
<item>Union</item>
<item>Union Grove</item>
<item>Union Springs</item>
<item>Uniontown</item>
<item>Valley</item>
<item>Valley Head</item>
<item>Vance</item>
<item>Vernon</item>
<item>Vestavia Hills</item>
<item>Vina</item>
<item>Vincent</item>
<item>Vredenburgh</item>
<item>Wadley</item>
<item>Waldo</item>
<item>Walnut Grove</item>
<item>Warrior</item>
<item>Waterloo</item>
<item>Waverly</item>
<item>Weaver</item>
<item>Webb</item>
<item>Wedowee</item>
<item>West Blocton</item>
<item>West End-Cobb Town</item>
<item>West Jefferson</item>
<item>West Point</item>
<item>Wetumpka</item>
<item>White Hall</item>
<item>Wilsonville</item>
<item>Wilton</item>
<item>Winfield</item>
<item>Woodland</item>
<item>Woodville</item>
<item>Yellow Bluff</item>
<item>York</item>
</one-of>
</rule>
<rule id="utterances">
<!-- optional suffix -->
<one-of>
<item repeat="0-1">I don't know</item>
<item repeat="0-1">I don't have it</item>
<item repeat="0-1">I don't have one</item>
<item repeat="0-1">speak to an agent</item>
<item repeat="0-1">please transfer me to an agent</item>
</one-of>
</rule>
</grammar>
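
With 50 states, writing each grammar by hand gets tedious. Here is a minimal sketch of how a per-state grammar could be generated from the city text file (for example Alabama.txt). The function name build_grammar and its two arguments are illustrative assumptions, not part of any Watson tooling, and the sketch emits only the city rule: the optional "utterances" rule shown above would still need to be added by hand.

```shell
# Sketch: build an SRGS XML grammar from a text file with one city per line.
# build_grammar and its arguments are hypothetical names used for illustration.
build_grammar() {
  rule_id=$1      # rule name, for example the state name (no spaces)
  cities_file=$2  # text file with one city name per line

  printf '<grammar version="1.0" xml:lang="en-US" root="%s"\n' "$rule_id"
  printf ' xmlns="http://www.w3.org/2001/06/grammar">\n'
  printf '<rule id="%s">\n<one-of>\n' "$rule_id"
  # Escape XML special characters, skip blank lines, wrap each city in <item>
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' "$cities_file" |
    awk 'NF { print "<item>" $0 "</item>" }'
  printf '</one-of>\n</rule>\n</grammar>\n'
}
```

A possible usage would be: build_grammar Alabama Alabama.txt > Alabama.xml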

TIPS: Below are 2 bash scripts I used to generate STT transcriptions. Each one grabs all the WAV files in the current folder and transcribes them. Before you use them, you’ll need to create an STT instance with a Language Model customization containing the grammar XML file above, then train it and note its customization ID (see the Grammar article for more details).

1 — Baseline script: STT Base Model with no customization or grammar

# Replacing spaces with underscores in WAV filenames in the current directory
for d in *.wav; do if [[ "$d" == *" "* ]]; then mv "$d" "${d// /_}"; fi; done
#
# Placeholders: set these before running
url="url of your STT Service"
useCred="apikey:your_api_key"
base_model_name="put_base_model_name_here"
#
# Cataloging all the WAV files in the current folder
ls -1 *.wav > wavfiles.txt
#
# Transcribing each WAV file listed in the text file "wavfiles.txt"
#
echo "Base Model=$base_model_name"
echo "URL=$url"
while IFS= read -r file
do
  base_wav=$(basename "${file%.*}")
  echo "Converting $file ............."
  curl -u "$useCred" -X POST --limit-rate 40000 --header "Content-Type: audio/wav" --header "Transfer-Encoding: chunked" --data-binary @"$file" "$url/v1/recognize?inactivity_timeout=-1&model=$base_model_name&x-watson-learning-opt-out=true&word_confidence=true" > "$base_wav-baseline-transcription.json"
done < "wavfiles.txt"

2 — Grammar script: STT Language Model customization with the Alabama.xml grammar.

# Replacing spaces with underscores in WAV filenames in the current directory
for d in *.wav; do if [[ "$d" == *" "* ]]; then mv "$d" "${d// /_}"; fi; done
#
# Placeholders: set these before running
url="url of your STT Service"
useCred="apikey:your_api_key"
base_model_name="put_base_model_name_here"
grammar_xml=Alabama.xml
customID="put_LM_Customization_ID_here"
#
# Cataloging all the WAV files in the current folder
ls -1 *.wav > wavfiles.txt
#
# Transcribing each WAV file listed in the text file "wavfiles.txt"
#
echo "Base Model=$base_model_name"
echo "URL=$url"
echo "Language Customization ID=$customID"
echo "grammar_name=$grammar_xml"
while IFS= read -r file
do
  base_wav=$(basename "${file%.*}")
  echo "Converting $file ............."
  curl -u "$useCred" -X POST --limit-rate 40000 --header "Content-Type: audio/wav" --header "Transfer-Encoding: chunked" --data-binary @"$file" "$url/v1/recognize?inactivity_timeout=-1&model=$base_model_name&language_customization_id=$customID&x-watson-learning-opt-out=true&word_confidence=true&grammar_name=$grammar_xml" > "$base_wav-grammar-transcription.json"
done < "wavfiles.txt"

Each script should generate one STT JSON transcription file for each city — in our example, you should have 493 JSON files for your baseline and 493 JSON files with the grammar.
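
Before comparing results, it can help to confirm that every city actually produced a transcription file. Here is a minimal sketch, assuming the file naming used by the scripts above; check_counts and its arguments are hypothetical names. A mismatch can point to failed requests, or to names that collapse into the same file once spaces are removed (for example, "Lakeview" and "Lake View" differ only by case at that point, which matters on case-insensitive file systems).

```shell
# Sketch: compare the number of transcription files against the city list.
# check_counts and its arguments are hypothetical names used for illustration.
check_counts() {
  cities_file=$1   # for example Alabama.txt
  dir=$2           # folder holding the JSON transcriptions
  suffix=$3        # for example -baseline-transcription.json

  # Count non-blank lines in the city list
  expected=$(awk 'NF' "$cities_file" | wc -l | tr -d ' ')
  # Count transcription files with the given suffix directly inside the folder
  actual=$(find "$dir" -maxdepth 1 -type f -name "*$suffix" | wc -l | tr -d ' ')
  echo "expected=$expected actual=$actual"
  # Succeed only when both counts match
  [ "$expected" -eq "$actual" ]
}
```

A possible usage would be: check_counts Alabama.txt . -grammar-transcription.json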

Below are 4 JSON examples — 2 from the baseline and 2 from the grammar:

Adamsville (using the STT Base Model with no grammar)

{
"result_index": 0,
"results": [
{
"final": true,
"alternatives": [
{
"transcript": "Adamsville ",
"confidence": 0.6,
"word_confidence": [
[
"Adamsville",
0.6
]
]
},
{
"transcript": "Adams L. "
},
{
"transcript": "Adam still "
},
{
"transcript": "Adamsville I "
},
{
"transcript": "Adams bill "
}
]
}
]
}

Abbeville (using the STT Base Model with no grammar)

{
"result_index": 0,
"results": [
{
"final": true,
"alternatives": [
{
"transcript": "Advil ",
"confidence": 0.18,
"word_confidence": [
[
"Advil",
0.18
]
]
},
{
"transcript": "Abigail "
},
{
"transcript": "hi bill "
},
{
"transcript": "add bill "
},
{
"transcript": "I bill "
}
]
}
]
}

Adamsville (using the grammar, with the correct result shown as 1st-Best):

{
"results": [
{
"alternatives": [
{
"transcript": "Adamsville ",
"confidence": 0.99,
"word_confidence": [
[
"Adamsville",
1.0
]
]
}
],
"final": true
}
],
"result_index": 0
}

Abbeville (using the grammar but with an incorrect result shown as 1st-Best) — “Albertville” shows up as 1st-best result while “Abbeville” is 1st-alternative

{
"results": [
{
"alternatives": [
{
"transcript": "Albertville ",
"confidence": 1.0,
"word_confidence": [
[
"Albertville",
0.48
]
]
},
{
"transcript": "Abbeville "
},
{
"transcript": "Ashville "
},
{
"transcript": "Adamsville "
},
{
"transcript": "Aliceville "
}
],
"final": true
}
],
"result_index": 0
}

Notice that, with the grammar, the 1st-best result and the alternatives are always limited to entries in the grammar, whereas the Base Model can return arbitrary words (“Advil”, “Abigail”).

Results and Observations

After running all my experiments, and doing all the comparisons between the baseline and with the grammar, here are the City Error Rate results:

  • State: Name of the US state
  • Not Found Top 1: STT did not find the target city name as the 1st-best result
  • % City Error Rate Top 1: # Not Found Top 1 / Total
  • Total: Total number of cities in that state
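
The metric defined above can be sketched in a few lines of shell. This is an illustration under stated assumptions, not the exact tooling used for the experiment: the helper names are hypothetical, the JSON files are assumed to be pretty-printed with one key per line (as in the examples above), and the expected city is assumed to be the file name prefix with spaces removed. A real JSON parser such as jq would be more robust than grep and sed.

```shell
# Sketch: Top-1 City Error Rate for one folder of transcriptions.
# top1_transcript, city_error_rate and state_error_rate are hypothetical helpers.

# Print the first "transcript" value in a file, with spaces removed so that it
# can be compared with the space-free file name.
top1_transcript() {
  grep -m 1 '"transcript"' "$1" |
    sed -e 's/.*"transcript": *"//' -e 's/".*//' -e 's/ //g'
}

# Print (not found / total) as a percentage with two decimals.
city_error_rate() {
  awk -v nf="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * nf / total }'
}

# Walk all transcriptions with a given suffix and report the error rate.
state_error_rate() {
  dir=$1
  suffix=$2   # for example -grammar-transcription.json
  not_found=0
  total=0
  for f in "$dir"/*"$suffix"; do
    [ -e "$f" ] || continue
    name=${f##*/}
    expected=${name%"$suffix"}
    total=$((total + 1))
    # Exact, case-sensitive comparison of the 1st-best result
    [ "$(top1_transcript "$f")" = "$expected" ] || not_found=$((not_found + 1))
  done
  [ "$total" -gt 0 ] || { echo "no files found"; return 1; }
  echo "not_found=$not_found total=$total rate=$(city_error_rate "$not_found" "$total")"
}
```

As a quick check of the formula, city_error_rate 30 129 prints 23.26, which is consistent with the Hawaii figure reported below if 30 of its 129 cities were missed.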

Here are some general observations:

  • Great results: I expected some improvement using grammars compared to the baseline, but not to the level of 96.4% overall accuracy in city recognition for 25135 cities (green columns). This is an absolute improvement of more than 40 percentage points over the baseline.
  • The US English Base Model is trained on some major US capitals and cities, which explains its 54.44% accuracy with no training.
  • Hawaii reports the worst result with grammar, a 23.26% city error rate for 129 cities, but this is still a 63% absolute improvement over the baseline. This is significant given the spelling and pronunciation of Hawaiian city names.
  • Six (6) states have more than 1000 cities in their grammar and show a city error rate of 5% or less.

Conclusion

You can use TTS to generate some test audio data for experimentation purposes, even if it’s high-quality audio. You can test features like grammars or general Language customization training, at least to get a sense of how it performs in a best-case scenario and if it’s worth pursuing further.

This experiment confirms that grammars are a really great feature, not only for alphanumeric inputs but also when you deal with a limited list of possibilities within a category, like cities within a state. By restricting STT to a grammar, you statistically increase your level of accuracy.

It also confirms that grammars do not fix everything. Even with high-quality audio, confusion between similar-sounding names is still a limitation of speech recognition (see the Abbeville vs. Albertville example above). That’s why you need to rely on other validation mechanisms, like using STT alternatives (Abbeville shows up as the 1st alternative) or an extra data input such as the zip code to narrow down the possibilities even further (Abbeville is 36310, Albertville is 35950).
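
One way to combine the two validation ideas is to walk through the STT alternatives in order and keep the first one that is compatible with the extra input. Here is a minimal sketch, assuming the pretty-printed JSON layout shown earlier and a hypothetical "allowed" file listing the cities that match the caller's zip code (the zip code lookup itself is not shown and would come from your own data). pick_allowed is an illustrative name.

```shell
# Sketch: return the first STT alternative that appears in an allowed list.
# pick_allowed and the allowed-list file are hypothetical, for illustration.
pick_allowed() {
  json_file=$1     # STT transcription, one key per line
  allowed_file=$2  # candidate cities, one per line (e.g. cities for a zip code)

  # Extract every transcript in order (1st-best first), trimming the trailing space
  grep '"transcript"' "$json_file" |
    sed -e 's/.*"transcript": *"//' -e 's/ *".*//' |
    while IFS= read -r candidate; do
      # Keep the first candidate that matches a whole line of the allowed list
      if grep -qxF -- "$candidate" "$allowed_file"; then
        printf '%s\n' "$candidate"
        break
      fi
    done
}
```

With the Abbeville transcription above and an allowed list containing only "Abbeville", this prints Abbeville even though "Albertville" was the 1st-best result. It prints nothing when no alternative is allowed, which would be the signal to re-prompt.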

The next phase would be to collect representative audio from actual users of the system, respecting your demographic user distribution, and including accents and other environmental factors. You already have your STT test environment set up. All you need to do is use this new test set and re-run your experiments. Don’t forget to redo your baseline since you are changing the test set.

To learn more about STT, you can also go through the STT Getting Started video HERE

Sr Product Manager, IBM Watson Speech / Language Translator. Very enthusiastic and passionate about AI technologies and methodologies. All views are only my own