Watson Speech-To-Text: How to Train Your Own Speech “Dragon” — Part 2: Training with Data

Marco Noel
IBM Watson Speech Services
Aug 12, 2019 · 8 min read
Photo by Jason Rosewell on Unsplash

In Part 1, I walked you through the different components in Watson STT available for adaptation. I also covered the important step of data collection and preparation. In this article, we will see how we use this data to configure and train Watson STT, then conduct experiments to measure its accuracy.

Establish Your Baseline

In order to see how Watson STT performs and how we measure improvements, we go through multiple iterations of teach, test and calibrate (ITTC).

The first thing we must do is set our baseline using the Test Set we built earlier (see “Building Your Training Set and Your Test Set” in Part 1). My friend and colleague Andrew Freed wrote a great article on how to conduct experiments for speech applications using the sclite tool; read it for more information on experimentation. The first experiment is run against the STT Base Model with no adaptation. This becomes your baseline. Not only will you get a Word Error Rate (WER) and a Sentence Error Rate (SER), but the results will also show you the areas where you need to improve.
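To make the metric concrete, here is a minimal Python sketch of how WER is computed: (substitutions + deletions + insertions) divided by the number of reference words. This is only an illustration of the formula, not a replacement for sclite, and the sample sentences are made up:

```python
# Minimal word error rate (WER) sketch -- illustrates the metric only,
# not a replacement for the sclite tool.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Two substitutions ("calls" -> "called", "his" -> "this") over 6 words.
print(wer("the member calls about his claim",
          "the member called about this claim"))  # 0.333...
```

SER is even simpler: the fraction of test sentences that contain at least one error of any kind.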

The obvious gaps that we usually observe at this point are:

  • Out-Of-Vocabulary words — domain-specific terms, acronyms
  • Technical terminology and jargon — product names, technical expressions, unknown domain context

Take note of your weak areas. They will indicate where Watson STT training is required and what to validate as you go through your multiple iterations.

Create a Language Model Adaptation/Customization

Of the three components available for model adaptation, the Language Model Adaptation is the one that delivers the biggest bang for the buck. Watson STT is a probabilistic and contextual service, so training can include repetitive words and phrases to ‘weight’ the chance of a word being transcribed. The focus of the training text data should be on ‘out-of-vocabulary’ words and on known words that the solution struggles with. Additional emphasis can also be put on high-frequency in-vocabulary words.

To create a Language Model Adaptation/Customization, the steps are the following:

  • Create a new custom model by running the “curl” command below:

curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: application/json" \
--data "{\"name\": \"Example model\", \"base_model_name\": \"en-US_BroadbandModel\", \"description\": \"Example custom language model\"}" \
"https://stream.watsonplatform.net/speech-to-text/api/v1/customizations"

You will get a customization id similar to:

{
  "customization_id": "74f4807e-b5ff-4866-824e-6bba1a84fe96"
}

This ID is the handle you will use to add training data and to reference the custom model in “recognize” API calls. There is no limit on the number of custom models you can create within a Watson STT service, but you can only use one custom model at a time in an API call.

  • Create a UTF-8 text file with utterances and add it to the new custom model

Here’s an example — “healthcare.txt” — that contains gaps identified during the first experiment.
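The embedded file itself does not survive here, so below is a purely illustrative sketch of what such a corpus file might contain: full sentences written the way callers actually say them, repeating the domain terms the base model missed. All phrases are made up for this example:

```text
I would like to check the status of my claim
what is the copay for a specialist visit
I need a prior authorization for a CT scan
can you look up the HCPCS code for this procedure
my deductible has not been met yet
```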

To add the file to your newly created custom model with the customization ID, run the following “curl” command:

curl -X POST -u "apikey:{apikey}" \
--data-binary @healthcare.txt \
"https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/{customization_id}/corpora/healthcare"

You can add as many text files as you want within a single custom model, as long as the total number of words does not exceed 10 million.

  • Add custom words to the custom model

You can use custom words to handle specific pronunciations of acronyms within your domain. One example in our healthcare domain is the Healthcare Common Procedure Coding System (HCPCS). A common pronunciation we hear for it is “hick picks”. You can configure a custom word so that when a caller says “hick picks”, Watson STT transcribes “HCPCS” instead. To add this custom word to your existing custom model, run the following “curl” command:

curl -X PUT -u "apikey:{apikey}" \
--header "Content-Type: application/json" \
--data "{\"sounds_like\": [\"H. C. P. C. S.\", \"hick picks\"]}" \
"https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/{customization_id}/words/HCPCS"

For more details, check the documentation on how to add multiple words.
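For reference, the request body for adding several words in one call (POST to the custom model’s “words” endpoint) takes the following general shape; the entries below are illustrative examples for our healthcare domain:

```json
{
  "words": [
    {
      "word": "HCPCS",
      "sounds_like": ["hick picks", "H. C. P. C. S."],
      "display_as": "HCPCS"
    },
    {
      "word": "copay",
      "sounds_like": ["co pay", "copay"]
    }
  ]
}
```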

  • Train the custom model

Every time you add, update or delete training data to your custom model, you must train it with the following command:

curl -X POST -u "apikey:{apikey}" \
"https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/{customization_id}/train"

You can check the status of the custom model by running this command:

curl -X GET -u "apikey:{apikey}" \
"https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/{customization_id}"

When you create the custom model, the status is “pending”. When you add data to it, after the processing is complete, it moves to “ready”. When you issue the train command, the status changes to “training”. When the training is done, it shows “available” and your custom model is ready to use.
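This status lifecycle lends itself to a simple polling loop. Here is a hedged Python sketch; `get_status` is a stand-in for whatever function you use to issue the GET request above and pull out the “status” field, so the example runs on a stubbed sequence with no network involved:

```python
import time

def wait_until(target: str, get_status, poll_seconds: float = 10,
               max_polls: int = 60) -> bool:
    """Poll a status-returning callable until it reports `target`.

    `get_status` is a placeholder for your own GET /customizations/{id}
    call that returns the model's "status" field as a string.
    """
    for _ in range(max_polls):
        status = get_status()
        if status == target:
            return True
        if status == "failed":
            raise RuntimeError("custom model training failed")
        time.sleep(poll_seconds)
    return False  # gave up before reaching the target status

# Example with a stubbed status sequence (no network involved):
statuses = iter(["pending", "ready", "training", "available"])
print(wait_until("available", lambda: next(statuses), poll_seconds=0))  # True
```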

New Experiment with The New Language Model Adaptation

Run experiments, review, analyze, adjust then re-test | Photo by Trust “Tru” Katsande on Unsplash

Now that we have a new custom model, let’s re-run the same experiment against it and review the results. Check the gaps you identified from your baseline and validate your improvements. It does not need to be perfect. As long as you have the correct Watson STT transcription with a high confidence score (0.8 or more), you are good to go.
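To apply that 0.8 threshold programmatically, you can inspect the JSON that the “recognize” endpoint returns. The response body below is a trimmed, illustrative example of the response shape, and `accepted_transcripts` is my own helper name:

```python
import json

# Trimmed, illustrative example of a /v1/recognize response body.
response_json = """
{
  "results": [
    {
      "alternatives": [
        {"transcript": "I want to check my claim status", "confidence": 0.93}
      ],
      "final": true
    }
  ],
  "result_index": 0
}
"""

def accepted_transcripts(body: str, threshold: float = 0.8):
    """Return the transcripts whose confidence meets the threshold."""
    doc = json.loads(body)
    out = []
    for result in doc.get("results", []):
        best = result["alternatives"][0]  # top-ranked alternative
        if best.get("confidence", 0.0) >= threshold:
            out.append(best["transcript"])
    return out

print(accepted_transcripts(response_json))  # ['I want to check my claim status']
```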

Also, make sure you are not experiencing any regression on good results you already had in your baseline.

Keep iterating your experiments, identifying gaps and improving your training, using ONLY the Language Model Adaptation for now. Based on past project experience, it is where I got the best results and improvements at first.

New Pattern Experiment with The New Language Model Adaptation / Customization

Experiment with audio matching your “pattern” (accents, environment, etc) | Photo by Antenna on Unsplash

Using the pattern audio files from your test set, run an experiment against the new custom language model you created earlier.

Here’s a “curl” command showing how to use your custom language model:

curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: audio/flac" \
--data-binary @audio-file1.flac \
"https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?language_customization_id={customization_id}"

Compare your results and make sure you have corrected the “pattern” issue.

Enhance your original test set by adding the “pattern” test set audio and transcription data. The more data you have in your test set, the more accurate the results will be.

Using the Grammar Feature for Data Inputs

For general utterances to identify intents and entities, training your Watson STT with a custom language model will do the trick. But what about when you handle specific data inputs like a part number, a member ID, a policy number or a healthcare code?

In speech recognition, certain characters get misrecognized or confused with others. I personally call these the “speech confusion matrix”. Here are some examples:

  • A. vs H. vs 8
  • F. vs S.
  • D. vs T.
  • B. vs D.
  • M. vs N.
  • 2 vs to vs too
  • 4 vs for

There are multiple factors that can cause this confusion like accent or audio quality. Watson STT Grammar is a feature we can use to improve accuracy for these data inputs, and mitigate this confusion. It supports grammars that are defined in the following standard formats:

  • Augmented Backus-Naur Form (ABNF): Plain-text similar to traditional BNF grammar.
  • XML Form: XML elements used to represent the grammar.

For more information on creating a grammar configuration, check the Watson STT Grammar documentation and the W3C Speech Recognition Grammar Specification Version 1.0.
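As an illustration, a minimal yes/no confirmation grammar in ABNF form might look like the following. The rule names and phrase lists are my own; see the SRGS specification for the full syntax:

```abnf
#ABNF 1.0 ISO-8859-1;
language en-US;
mode voice;
root $confirm;

$confirm = $yes | $no;
$yes = yes | yeah | correct | that is right;
$no = no | nope | that is wrong;
```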

To train Watson STT with a grammar configuration, you will need a custom language model. The steps are:

  • Create a new custom model or use an existing one

I recommend that you create a separate custom language model dedicated to all your grammar configurations, purely for ease of administration and maintenance. You can use an existing custom language model if you want. See the section “Create a Language Model Adaptation/Customization” for more information.

  • Add the grammar configuration to the custom language model

If your grammar configuration is in ABNF format, run this “curl” command:

curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: application/srgs" \
--data-binary @confirm.abnf \
"https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/{customization_id}/grammars/confirm-abnf?allow_overwrite=true"

If your grammar configuration is in XML format, execute the following “curl” command:

curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: application/srgs+xml" \
--data-binary @confirm.xml \
"https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/{customization_id}/grammars/confirm-xml?allow_overwrite=true"
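For comparison, an equivalent confirmation grammar in XML form might look like this; the phrase list is illustrative only:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" root="confirm">
  <rule id="confirm">
    <one-of>
      <item>yes</item>
      <item>yeah</item>
      <item>no</item>
      <item>nope</item>
    </one-of>
  </rule>
</grammar>
```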

Note: I frequently use the “allow_overwrite” query parameter, as it allows you to overwrite an existing grammar configuration when you update it.

  • Validate your grammar configuration

Once your grammar configuration is uploaded to your custom language model, I find this command very useful for validating it and identifying issues:

curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: application/srgs+xml" \
--data-binary @confirm.xml \
"https://stream.watsonplatform.net/speech-to-text/api/v1/validate_grammar?customization_id={customization_id}"

If there are no errors, you should see the out-of-vocabulary (OOV) results:

{
  "results": [
    {
      "OOV_words": []
    }
  ],
  "result_index": 0
}

Here’s an example of an error you can see during the validation. It will give you an indication where the error is located in your grammar file:

{
  "code_description": "Bad Request",
  "code": 400,
  "error": "Invalid grammar. LMtools getOOV grammar - syntax error in RAPI configure: compiler msg: Syntax error, line number: 10, position: 21: "
}

  • Check the status of your grammar

This “curl” command will show you the status of all your grammar configurations in your custom language model:

curl -X GET -u "apikey:{apikey}" \
"https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/{customization_id}/grammars"

You should be getting a response similar to the following:

{"grammars": [{
  "out_of_vocabulary_words": 0,
  "name": "confirm.xml.xml",
  "status": "analyzed"
}]}

Note: The “status” may be “being_processed” (the grammar is still being processed), “undetermined” (the service ran into a problem processing the grammar; validate it as shown above) or “analyzed” (completed and valid).

  • Train the custom model

As mentioned previously, every time you update a custom language model, you have to train it:

curl -X POST -u "apikey:{apikey}" \
"https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/{customization_id}/train"

… then check the status:

curl -X GET -u "apikey:{apikey}" \
"https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/{customization_id}"

When the training status is “available”, you are ready to use the grammar.

  • Using a grammar in your “recognize” request

As part of each “recognize” request, you can only use one custom language model and one grammar configuration. The example below shows the use of a custom language model and a grammar configuration:

curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: audio/flac" \
--data-binary @audio-file.flac \
"https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?language_customization_id={customization_id}&grammar_name={grammar_name}"

Re-run Experiments with New Updated Test Set and Establish a New Baseline

Re-run the same experiments you first ran against the Base Model, but now using the new custom language model and the new grammar configuration where applicable. Review your results and compare. Make sure you are showing improvements and not regressing in any other areas.

Identify new gaps, rinse and repeat.

When your results are optimal, this will become your new baseline.

In Part 3 of this series, I will show you how to train Watson STT with audio data using the Acoustic Model adaptation.

To learn more about STT, you can also go through the STT Getting Started video.



Sr Product Manager, IBM Watson Speech / Language Translator. Very enthusiastic and passionate about AI technologies and methodologies. All views are only my own