Watson Speech to Text: How to Plan Your Migration to the Next-Generation Models
In a previous article, we announced the release of our IBM Watson Speech to Text (STT) next-generation models, which can deliver accuracy as high as 92%, depending on the test set, and return results 50% faster than our previous-generation models. This release is the beginning of a major architectural shift for Watson Speech to Text: we have overhauled our model training technique to deliver higher accuracy and faster results for our customers. These models are now generally available, and so far we have received great customer feedback.
This article will guide you in your migration from the previous-generation Speech to Text models to the next-generation ones.
Get started with these new models
Identify the base model you currently use
The previous-generation STT model names listed here contain either “Narrowband” or “Broadband” for each language. Narrowband models are used for telephony use cases, while Broadband models are best for multimedia use cases.
We changed the model names slightly to better reflect this usage. If your model name contains “Narrowband”, the matching next-generation model name contains “Telephony”; if it contains “Broadband”, the matching name contains “Multimedia”.
So if you are using “en-US_NarrowbandModel”, your matching next-generation model is “en-US_Telephony”.
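The renaming rule can be captured in a few lines of Python. This is just a sketch: next_gen_model is a hypothetical helper, and it assumes your language follows the naming convention described above.

```python
# Hypothetical helper: map a previous-generation Watson STT model name
# to its next-generation counterpart using the renaming rule above.
def next_gen_model(prev_model: str) -> str:
    lang, _, kind = prev_model.partition("_")
    if kind == "NarrowbandModel":
        return f"{lang}_Telephony"
    if kind == "BroadbandModel":
        return f"{lang}_Multimedia"
    raise ValueError(f"Unrecognized previous-generation model: {prev_model}")

print(next_gen_model("en-US_NarrowbandModel"))  # en-US_Telephony
```

Note that not every language offers both variants, so always confirm the target model exists in the supported-models list before switching.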
Identify the features and parameters you use
While the next-generation Speech models have all the commonly used features, they are not yet at complete feature parity with the previous-generation STT models. It is therefore important to review the most up-to-date list of supported features for the next-generation STT models here. We are continuously rolling out new features and will update the link as more features become available.
Here are the key features to consider for the next-generation STT models:
- Language Model customization with corpora text files — You can reuse your existing STT corpora text files as-is with no change
- Low Latency — Our next-generation models are already very fast; enable this option if you need even faster results at slightly lower accuracy
- Speaker labels
Grammars support is coming soon.
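Because corpora text files carry over unchanged, recreating a custom language model on a next-generation base model is mostly a matter of repeating the same SDK calls. Here is a minimal sketch with the ibm-watson Python SDK: migrate_custom_model is a hypothetical helper, and speech_to_text is assumed to be an authenticated SpeechToTextV1 client.

```python
def migrate_custom_model(speech_to_text, corpus_path,
                         base_model_name="en-US_Telephony",
                         name="my-telephony-custom-model"):
    """Create a custom model on a next-generation base model and
    train it with an existing corpora text file."""
    # Create the new custom model on the next-generation base model.
    language_model = speech_to_text.create_language_model(
        name=name, base_model_name=base_model_name).get_result()
    customization_id = language_model["customization_id"]

    # Reuse the same corpora text file as-is, then kick off training.
    with open(corpus_path, "rb") as corpus_file:
        speech_to_text.add_corpus(
            customization_id, corpus_name="corpus1", corpus_file=corpus_file)
    speech_to_text.train_language_model(customization_id)
    return customization_id
```

The returned customization ID is what you would then pass as language_customization_id in your recognition requests.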
Let’s take an excerpt of STT WebSocket Python code as an example, using a previous-generation STT model with its features and parameters.
# Assumes the usual setup: a SpeechToTextV1 client `speech_to_text`, a
# RecognizeCallback `myRecognizeCallback`, and imports of join/dirname/AudioSource.
with open(join(dirname(__file__), './.', wav_file), 'rb') as audio_file:
    audio_source = AudioSource(audio_file)
    speech_to_text.recognize_using_websocket(
        audio=audio_source,
        content_type='audio/wav',
        recognize_callback=myRecognizeCallback,
        model="en-US_NarrowbandModel",
        language_customization_id="8acf31fa-0aa2-4ecc-a805-1f527f342dba",
        grammar_name="grammar-id.abnf",
        interim_results=False,
        word_confidence=True,
        timestamps=True,
        keywords=['keyword-one', 'keyword-two'],  # placeholders; takes a list of strings
        keywords_threshold=0.5,
        speaker_labels=True,
        smart_formatting=True,
        audio_metrics=False,
        end_of_phrase_silence_time=1.5,
        inactivity_timeout=-1,
        split_transcript_at_phrase_end=True)
With the next-generation model and its supported features, the following items need to be updated in the Python script:
- model: new next-generation STT model name
- language_customization_id: new language model customization ID (since you changed base models, you must create a new custom model, which you can train with the same corpora text files)
- remove unsupported features: “grammar_name”, “keywords”, “keywords_threshold”, “split_transcript_at_phrase_end”
- Keep the remaining supported features
with open(join(dirname(__file__), './.', wav_file), 'rb') as audio_file:
    audio_source = AudioSource(audio_file)
    speech_to_text.recognize_using_websocket(
        audio=audio_source,
        content_type='audio/wav',
        recognize_callback=myRecognizeCallback,
        model="en-US_Telephony",
        language_customization_id="636d8494-7e53-436a-8557-30d6b2a63cd7",
        interim_results=False,
        word_confidence=True,
        timestamps=True,
        speaker_labels=True,
        smart_formatting=True,
        audio_metrics=False,
        end_of_phrase_silence_time=1.5,
        inactivity_timeout=-1)
Conduct your own experiments and see for yourself
If you do not have an existing audio test set, that’s ok. You can collect some representative audio from live calls.
Run these audio files against your previous-generation STT model, collect your results and document your baseline.
Use the same audio files against our STT next-generation model out-of-the-box (no custom model), document your results and compare with the baseline.
You can use simple “curl” commands or Python scripts — see our API documentation for code examples.
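As a rough illustration, an out-of-the-box recognition request boils down to a single POST. In this sketch the API key, instance URL, and audio file name are placeholders for your own values, and the request is guarded so nothing is sent unless credentials are set.

```shell
# Placeholders -- substitute your own service credentials and instance URL.
APIKEY="${APIKEY:-}"
STT_URL="${STT_URL:-https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your-instance-id}"

# POST one audio file to the out-of-the-box next-generation model.
# Guarded so the request is only sent when real credentials are provided.
if [ -n "$APIKEY" ]; then
  curl -X POST -u "apikey:$APIKEY" \
    --header "Content-Type: audio/wav" \
    --data-binary @test-call.wav \
    "$STT_URL/v1/recognize?model=en-US_Telephony"
fi
```

Swap the model query parameter back to your previous-generation model name to produce the baseline run with the same command.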
If you need a Word Error Rate (WER) tool to measure your accuracy, you can use this one from our public GitHub repository.
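For a rough inline check, WER is the word-level edit distance between a reference transcript and the hypothesis, divided by the number of reference words. A minimal sketch follows; the linked tool also handles text normalization and reporting, which this toy version skips.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming Levenshtein distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,            # deletion
                      d[j - 1] + 1,        # insertion
                      prev + (r != h))     # substitution (or match)
            prev, d[j] = d[j], cur
    return d[-1] / len(ref)

print(wer("hello world", "hello word"))  # 0.5
```

Run it on each (reference, transcript) pair from your baseline and next-generation result sets to compare the two averages.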
To learn more about our next-generation models, check out our documentation. Also, keep an eye out for more language and feature releases — we’re hard at work to make sure you’re able to provide an exceptional experience to your customers.
We would love to get your feedback and comments!