Watson Speech to Text: How to Plan Your Migration to the Next-Generation Models

Marco Noel
IBM Watson Speech Services
4 min read · Aug 30, 2021
Photo by Daniel McCullough on Unsplash

In a previous article, we announced the release of our IBM Watson Speech to Text (STT) next-generation models, which can deliver accuracy as high as 92%, depending on the test set, and return results 50% faster than our previous-generation models. This release is the beginning of a major architectural shift for Watson Speech to Text: we have overhauled our model training technique to deliver higher accuracy and faster results for our customers. These models are now generally available, and the customer feedback so far has been great.

This article will guide you in your migration from the previous-generation Speech to Text models to the next-generation ones.

Get started with these new models

Photo by Evgeni Tcherkasski on Unsplash

Identify the base model you currently use

The previous-generation STT model names, listed here, contain either the word “Narrowband” or “Broadband” for each language. Narrowband models are designed for telephony use cases, while Broadband models are best for multimedia use cases.

We slightly changed the model names to better reflect this usage. If your model name contains “Narrowband”, the matching next-generation model name contains “Telephony”; if it contains “Broadband”, the matching name contains “Multimedia”.

So if you are using “en-US_NarrowbandModel”, your matching next-generation model is “en-US_Telephony”.
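Since the rename follows a predictable pattern, a small helper can translate names in bulk. This is only an illustrative sketch: not every previous-generation model has a next-generation counterpart yet, so always check the supported-models list in the documentation.

# Illustrative sketch: derive the next-generation model name from a
# previous-generation one, assuming the rename pattern described above.
def next_gen_model_name(previous_model: str) -> str:
    return (previous_model
            .replace('_NarrowbandModel', '_Telephony')
            .replace('_BroadbandModel', '_Multimedia'))

print(next_gen_model_name('en-US_NarrowbandModel'))  # en-US_Telephony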

Identify the features and parameters you use

While the next-generation Speech models support all the commonly used features, they are not yet at complete feature parity with the previous-generation STT models, so it is important to review the most up-to-date list of supported features for the next-generation STT models here. We are continuously rolling out new features and will update the link as more become available.

Here are the key features to consider for the next-generation STT models:

  • Language Model customization with corpora text files: you can reuse your existing STT corpora text files as-is, with no changes
  • Low Latency: our next-generation models are already very fast, but this option helps if you need even faster results at slightly lower accuracy (see the snippet below)
  • Speaker labels
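
Here is a minimal sketch of enabling the low-latency option with the Python SDK. It assumes an authenticated speech_to_text client (a SpeechToTextV1 instance), a hypothetical test-call.wav file, and a recent version of the ibm-watson SDK that supports the low_latency parameter:

# Minimal sketch: trade a little accuracy for faster results by enabling
# low latency on a next-generation model.
with open('test-call.wav', 'rb') as audio_file:
    result = speech_to_text.recognize(
        audio=audio_file,
        content_type='audio/wav',
        model='en-US_Telephony',
        low_latency=True  # supported by next-generation models only
    ).get_result()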

Support for Grammars is coming soon.

Let’s take an STT WebSocket Python code excerpt as an example, using a previous-generation STT model with its features and parameters.

from os.path import dirname, join

from ibm_watson.websocket import AudioSource

# speech_to_text (an authenticated SpeechToTextV1 client), myRecognizeCallback
# (a RecognizeCallback subclass) and wav_file are assumed to be defined earlier.
with open(join(dirname(__file__), './.', wav_file), 'rb') as audio_file:
    audio_source = AudioSource(audio_file)
    speech_to_text.recognize_using_websocket(
        audio=audio_source,
        content_type='audio/wav',
        recognize_callback=myRecognizeCallback,
        model="en-US_NarrowbandModel",
        language_customization_id="8acf31fa-0aa2-4ecc-a805-1f527f342dba",
        grammar_name="grammar-id.abnf",
        interim_results=False,
        word_confidence=True,
        timestamps=True,
        keywords=["example", "keyword"],  # keywords expects a list of strings
        keywords_threshold=0.5,
        speaker_labels=True,
        smart_formatting=True,
        audio_metrics=False,
        end_of_phrase_silence_time=1.5,
        inactivity_timeout=-1,
        split_transcript_at_phrase_end=True
    )

With the next-generation model and its supported features, the following items need to be updated in the Python script:

  • model: the new next-generation STT model name
  • language_customization_id: a new language model customization ID (since you changed to a new base model, you have to create a new custom model using the same training corpora text files; see the sketch after the updated code below)
  • Remove the unsupported features: “grammar_name”, “keywords”, “keywords_threshold”, “split_transcript_at_phrase_end”
  • Keep the remaining supported features
with open(join(dirname(__file__), './.', wav_file), 'rb') as audio_file:
    audio_source = AudioSource(audio_file)
    speech_to_text.recognize_using_websocket(
        audio=audio_source,
        content_type='audio/wav',
        recognize_callback=myRecognizeCallback,
        model="en-US_Telephony",
        language_customization_id="636d8494-7e53-436a-8557-30d6b2a63cd7",
        interim_results=False,
        word_confidence=True,
        timestamps=True,
        speaker_labels=True,
        smart_formatting=True,
        audio_metrics=False,
        end_of_phrase_silence_time=1.5,
        inactivity_timeout=-1
    )
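
Here is a minimal sketch of recreating the custom language model on the next-generation base model with the Python SDK. The API key, service URL, custom model name, and corpus file name are placeholders:

from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

speech_to_text = SpeechToTextV1(authenticator=IAMAuthenticator('{apikey}'))
speech_to_text.set_service_url('{service-url}')

# Create a new custom model on the next-generation base model.
language_model = speech_to_text.create_language_model(
    'my-telephony-custom-model', 'en-US_Telephony').get_result()
customization_id = language_model['customization_id']

# Reuse the same corpora text files that trained the previous-generation
# custom model.
with open('corpus1.txt', 'rb') as corpus_file:
    speech_to_text.add_corpus(customization_id, 'corpus1', corpus_file)

# Corpus processing is asynchronous; wait until its status is 'analyzed'
# before training.
speech_to_text.train_language_model(customization_id)

Once training completes, pass the new customization_id as the language_customization_id in the updated script above.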

Conduct your own experiments and see for yourself

Photo by Kelly Sikkema on Unsplash

If you do not have an existing audio test set, that’s ok. You can collect some representative audio from live calls.

Run these audio files against your previous-generation STT model, collect your results and document your baseline.

Run the same audio files against our next-generation STT model out of the box (no custom model), document your results, and compare them with the baseline.

You can use simple “curl” commands or Python scripts — see our API documentation for code examples.
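
For instance, here is a minimal Python sketch that transcribes the same file with both models so you can compare the transcripts side by side (the authenticated speech_to_text client and the test-call.wav file name are assumptions):

# Minimal sketch: transcribe the same audio with the previous- and
# next-generation models, then print both transcripts for comparison.
for model in ('en-US_NarrowbandModel', 'en-US_Telephony'):
    with open('test-call.wav', 'rb') as audio_file:
        response = speech_to_text.recognize(
            audio=audio_file,
            content_type='audio/wav',
            model=model
        ).get_result()
    transcript = ' '.join(
        r['alternatives'][0]['transcript'] for r in response['results'])
    print(f'{model}: {transcript}')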

If you need a Word Error Rate (WER) tool to measure your accuracy, you can use the one in our public GitHub repository.
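
If you prefer a self-contained starting point, WER is simple enough to sketch yourself: it is the word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words.

# Minimal WER sketch: word-level Levenshtein distance divided by the
# number of words in the reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer('hello world', 'hello word'))  # 0.5: one substitution over two words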

To learn more about our next-generation models, check out our documentation. Also keep an eye out for more language and feature releases — we’re hard at work to make sure you’re able to provide an exceptional experience to your customers.

We would love to get your feedback and comments!
