A Postman Collection for Training IBM Watson Speech to Text

Peter Tuton
5 min read · Mar 12, 2019


The IBM® Watson Speech to Text (STT) service provides APIs that use IBM’s speech-recognition capabilities to produce transcripts of spoken audio. This post describes how to train STT with a custom language model (extending a base model) and a custom acoustic model, using Postman.

Assumptions

  1. You have an existing instance of the Watson Speech to Text (STT) service, and are familiar with the IBM Cloud portal. For details on how to create an STT instance, refer to the documentation.
  2. You have an application that makes use of the STT service. You can find a (really rough!) sample application that transcribes provided audio files in my GitHub repo, here.
  3. You have training files available, both text and audio, e.g. audio and associated transcriptions recorded from a call center application. If transcripts are not available, one method I recommend is to run the source audio files through a vanilla (i.e. untrained) STT application, then correct the textual output for use as subsequent training material. This can take some time (but it’s likely an activity your voice analyst team is performing anyway! Go ask them…).
  4. You’re somewhat familiar with using Postman, though it’s not tough to learn.

For best results, there are two components to training Watson STT (described here):

  1. the language model, and
  2. the acoustic model.

The language model training uses sample transcripts to add to the textual part of training. For example, we may want to train Watson that the spoken word “Coles” (i.e. the large supermarket chain in Australia) is returned with more confidence than “coals”. There’s also a method to explicitly train the language model, e.g. to train Watson that “Woolworths” (also commonly pronounced “Woolies” in Australia) is to be interpreted as “Woolworths” and not “wool worth”. Note that training a custom language (and acoustic) model extends the base model, so the more training material provided the better the results.
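That explicit word training maps to the service’s “add word” API. Here’s a minimal curl sketch of it, assuming the API key, service endpoint, and a language model’s customization_id are held in placeholder shell variables APIKEY, URL, and LANG_ID (names of my own choosing):

    # Add or update a custom word on the language model:
    # "sounds_like" lists spoken forms, "display_as" sets the transcript spelling.
    curl -X PUT -u "apikey:$APIKEY" \
      --header "Content-Type: application/json" \
      --data '{"sounds_like": ["Woolworths", "Woolies"], "display_as": "Woolworths"}' \
      "$URL/v1/customizations/$LANG_ID/words/Woolworths"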

The acoustic model training uses sample audio files. It’s our best method of training Watson on accents, for example, among other benefits such as handling background noise (think of a noisy call center). You must add a minimum of 10 minutes and a maximum of 100 hours of audio that includes speech, not silence, to a custom model before you can train it.

Combining the two models gives us the best outcome, as described here.

The order of training is:

  1. train the custom language model, using the source transcription files only; then
  2. train the custom acoustic model, providing the custom language model as a guide, using the source audio files.

Use Postman — it’s an invaluable tool that will save you much time… Grab the Postman Collection I’ve created [Run in Postman] to use as a template. Import the collection and modify the following collection values:

  1. apikey: In the collection’s Authorization tab, set the value of apikey to the API key of your STT instance (found in “Service credentials — View credentials”).
  2. url: In the collection’s Variables tab, set the value of url to the appropriate endpoint for your service (based on the region in which the service is provisioned). The default value is https://stream.watsonplatform.net/speech-to-text/api. For a list of appropriate values, refer to the API documentation.
  3. model: In the collection’s Variables tab, set the value of model to the base model that matches your audio files. The default value is en-US_NarrowbandModel. For a list of available models, refer to the API documentation. (A quick way to check these values outside Postman follows this list.)
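If you’d like to confirm apikey and url are right before going further, the equivalent curl call below lists the available base models; a 200 OK response means the credentials and endpoint are good (APIKEY and URL are placeholder shell variables for the values above):

    # List the available base models; confirms apikey and url are valid.
    curl -u "apikey:$APIKEY" "$URL/v1/models"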

Now, time to create and train the custom language model…

  1. Create the source language model file. The training expects a single source file, so merge all the transcripts into a single file, e.g. cat transcript1.txt transcript2.txt transcript3.txt > transcripts.txt. If there are too many files to do this on the command line, create a bash script to do it.
  2. Create a new custom language model. In the ‘Body’ tab of the “Language customization — create” Postman command, change the name and description values, as you wish, then send the command. Capture the resulting value of the “customization_id” and update the collection’s value for language_customization_id. Check for a 201 Created response. (A curl sketch of this whole flow follows the list.)
  3. Confirm the creation of the new language model. Send the “Language customization — list” command and confirm the details are as supplied. Also check for a “status”: “pending” value.
  4. Add the source model file to the custom language model. In the ‘Body’ tab of the “Language customization — add/update corpus” command, ensure ‘binary’ is selected then choose the transcripts.txt file you created in step 1, then send the command. Check for a 201 Created response.
  5. Wait for the upload to complete, by sending the “Language customization — list corpora” command. The upload is complete when you receive a "status": "analyzed" response.
  6. Train the custom language model. Send the “Language customization — train” command. Check for a 200 OK response.
  7. Wait for training to complete. Send the “Language customization — list” command, checking for a “status”: “available” value.
  8. Check your model for correctness. You can check the model’s customizations using the “Language customization — list words” command. You may be surprised at the number of mistakes in the source files…
  9. Repeat, if required. If something looks incorrect or broken, e.g. a spelling error, you’ll need to find the error in the appropriate source file(s), fix it, and retrain. To retrain, first reset the model by sending the “Language customization — reset” command, then repeat the training steps.
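For reference, here’s roughly what steps 2–7 look like as raw API calls with curl. This is a sketch rather than the collection itself; APIKEY, URL, and LANG_ID are placeholder shell variables, and the corpus name transcripts and the model name are arbitrary choices of mine:

    # Create the custom language model; note the customization_id in the response.
    curl -X POST -u "apikey:$APIKEY" \
      --header "Content-Type: application/json" \
      --data '{"name": "My custom model", "base_model_name": "en-US_NarrowbandModel", "description": "Call center transcripts"}' \
      "$URL/v1/customizations"

    # Add the merged corpus file from step 1.
    curl -X POST -u "apikey:$APIKEY" \
      --data-binary @transcripts.txt \
      "$URL/v1/customizations/$LANG_ID/corpora/transcripts"

    # Poll until the corpus status is "analyzed".
    curl -u "apikey:$APIKEY" "$URL/v1/customizations/$LANG_ID/corpora"

    # Train, then poll until the model status is "available".
    curl -X POST -u "apikey:$APIKEY" "$URL/v1/customizations/$LANG_ID/train"
    curl -u "apikey:$APIKEY" "$URL/v1/customizations/$LANG_ID"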

To train the acoustic model, it’s much the same process:

  1. Create the source model file. Merge all the audio files into a single archive file (e.g. mp3-audio.zip). If you have multiple source audio file types, you will need to create a separate archive for each type, because an audio archive can contain only one type of audio, e.g. one file for mp3 files, a separate file for wav files, etc. (A curl sketch of this flow follows the list.)
  2. Create a new custom acoustic model. In the ‘Body’ tab of the “Acoustic customization — create” command, change the values of the name and description values, as you wish, then send the command. Capture the resulting value of the “customization_id” and update the collection’s value for acoustic_customization_id. Check for a 201 Created response.
  3. Confirm the creation of the new acoustic model. Send the “Acoustic customization — list” command and confirm the details are as supplied. Also check for a “status”: “pending” value.
  4. Add the source model file(s) to the custom acoustic model. In the ‘Body’ tab of the “Acoustic customization — add audio <type>” command, ensure ‘binary’ is selected then choose the source audio file you created in step 1. In the ‘Headers’ tab, ensure the value for Contained-Content-Type matches the audio type (e.g. mp3), then send the command. Check for a 201 Created response. Repeat this command for each audio source file type.
  5. Wait for the upload to complete. Send the “Acoustic customization — details” command, checking for a “status”: “ok” value for each audio type and that the value for “duration” appears correct.
  6. Train the custom acoustic model, by sending the “Acoustic customization — train” command. Check for a 200 OK response.
  7. Wait for training to complete, by sending the “Acoustic customization — list” command and checking the value of “status”. During training the value will be “training”; it changes to “available” when training is complete. This will take a while, depending on the length of your audio files… be patient.
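Again for reference, the acoustic flow looks roughly like this as raw curl calls. Same caveats as before: a sketch only, with APIKEY, URL, LANG_ID, and ACOUSTIC_ID as placeholder shell variables, and the model and audio resource names arbitrary:

    # Bundle the mp3 source files into a single archive.
    zip mp3-audio.zip *.mp3

    # Create the custom acoustic model; note the customization_id in the response.
    curl -X POST -u "apikey:$APIKEY" \
      --header "Content-Type: application/json" \
      --data '{"name": "My acoustic model", "base_model_name": "en-US_NarrowbandModel", "description": "Call center audio"}' \
      "$URL/v1/acoustic_customizations"

    # Add the archive; Contained-Content-Type describes the audio inside the zip.
    curl -X POST -u "apikey:$APIKEY" \
      --header "Content-Type: application/zip" \
      --header "Contained-Content-Type: audio/mp3" \
      --data-binary @mp3-audio.zip \
      "$URL/v1/acoustic_customizations/$ACOUSTIC_ID/audio/mp3-audio"

    # Train against the custom language model, then poll until "available".
    curl -X POST -u "apikey:$APIKEY" \
      "$URL/v1/acoustic_customizations/$ACOUSTIC_ID/train?custom_language_model_id=$LANG_ID"
    curl -u "apikey:$APIKEY" "$URL/v1/acoustic_customizations/$ACOUSTIC_ID"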

Once both the language and acoustic models have completed training, you can test the accuracy using one of the source audio files or a completely unseen audio file. Remember, the more content you provide for training purposes the better the STT results!
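A quick way to run that test is a recognize request that references both custom models. Here’s a rough curl sketch, using the same placeholder variables as above (test-call.mp3 stands in for whatever audio file you’re testing with):

    # Transcribe a test file using both the custom language and acoustic models.
    curl -X POST -u "apikey:$APIKEY" \
      --header "Content-Type: audio/mp3" \
      --data-binary @test-call.mp3 \
      "$URL/v1/recognize?model=en-US_NarrowbandModel&language_customization_id=$LANG_ID&acoustic_customization_id=$ACOUSTIC_ID"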
