Next-Generation Watson Speech to Text

Rachel Liddell
IBM Watson Speech Services
Apr 23, 2021 · 6 min read


Watson Speech to Text has released ten languages on our next-generation engine. This release is the beginning of a major architectural shift for Watson Speech to Text. We have overhauled our model training technique in order to deliver better accuracy to our customers. The new engine delivers accuracy improvements as high as 57%! Read on to learn about how the new engine works and where you can access the new models.

A person speaking, the audio going to the model to be transcribed, the transcription, and the recurrent processing of that audio and transcription.
Illustration by Briana Van Koll

How it Works:

Below is an illustration of how next-generation Speech to Text transcribes audio.

The encoder transforms the audio into an acoustic embedding. The decoder predicts the next character using the acoustic embedding and the previously predicted characters to form the final transcription.

The model is made up of an encoder and a decoder. First, the model’s encoder uses deep recurrent neural networks to extract features from the audio at multiple levels of abstraction, called an acoustic embedding. An acoustic embedding is a multidimensional representation of sound that can be used for classification tasks. Then the model evaluates those features bidirectionally to predict the correct transcription. Think of it this way: the new engine “listens” to your data twice. With that second listen, the engine can make a smarter hypothesis about the words spoken in the audio. Then the decoder hypothesizes each character based on the acoustic embedding and the previously predicted characters. Said more simply, the model is predicting the spelling of a word, and it’s basing that prediction on the sound of the word and the characters it has already predicted.
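
To make that flow concrete, here is a minimal, illustrative sketch in PyTorch. It is not IBM's production model, just a toy that mirrors the structure described above: a bidirectional LSTM encoder that turns acoustic features into an acoustic embedding, and an autoregressive decoder that predicts one character at a time from that embedding plus the characters predicted so far. The layer sizes, vocabulary, and the mean-pooling stand-in for a real attention or joint network are all assumptions made for brevity.

```python
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    """Bidirectional LSTM encoder: turns acoustic features into acoustic embeddings."""
    def __init__(self, feat_dim=80, hidden=256, layers=2):
        super().__init__()
        # num_layers=2 with bidirectional=True yields two forward and two
        # backward layers, similar to the structure sketched in the figure below.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)

    def forward(self, features):               # features: (batch, time, feat_dim)
        embeddings, _ = self.lstm(features)     # (batch, time, 2 * hidden)
        return embeddings

class CharacterDecoder(nn.Module):
    """Autoregressive decoder: predicts the next character from the acoustic
    embedding plus the characters it has already predicted."""
    def __init__(self, vocab_size=30, hidden=256, enc_dim=512):
        super().__init__()
        self.hidden = hidden
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.LSTMCell(hidden + enc_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def greedy_decode(self, acoustic, max_len=50, bos=0):
        # Mean-pool the acoustic embedding over time as a crude stand-in for
        # the attention / joint network a real system would use.
        context = acoustic.mean(dim=1)
        batch = acoustic.size(0)
        h = torch.zeros(batch, self.hidden)
        c = torch.zeros(batch, self.hidden)
        token = torch.full((batch,), bos, dtype=torch.long)
        hypothesis = []
        for _ in range(max_len):
            x = torch.cat([self.embed(token), context], dim=-1)
            h, c = self.rnn(x, (h, c))
            token = self.out(h).argmax(dim=-1)   # next-character hypothesis
            hypothesis.append(token)
        return torch.stack(hypothesis, dim=1)    # (batch, max_len) character IDs

# Smoke test on random "audio": ~2 seconds of 80-dimensional filterbank features.
features = torch.randn(1, 200, 80)
encoder, decoder = AcousticEncoder(), CharacterDecoder()
print(decoder.greedy_decode(encoder(features)).shape)   # torch.Size([1, 50])
```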

Even with all of this analysis, the next-generation engine runs more efficiently than our current engine. Transcriptions are delivered faster and more accurately.

The encoder module is a deep bidirectional long short-term memory network (LSTM). Audio moves from the input layer through layers of LSTM blocks to the output layer. Between the input and output layer, there are two forward layers and two backward layers.

User Benefits of Next-Generation Speech to Text:

  • Higher accuracy out of the box. The next-generation engine is 19% more accurate for US English telephony data and as high as 57% more accurate for other languages.
  • Spend less time customizing. Not only is the next-generation engine more accurate, but it can transcribe words never seen in training. That means the model is more flexible, so you can spend less time customizing.
  • Get results faster. The next-generation engine analyzes audio with a higher throughput. That means you’ll receive your transcriptions more quickly.

Use Cases:

You can use the next-generation engine for any use case you currently tackle with Watson Speech to Text. Here are some examples.

Virtual Assistant on the Phone

Imagine eliminating hold times and improving customer satisfaction at the same time. The Watson Assistant phone integration enables you to do just that. You can provide live support to your customers with the pre-built integration of Speech to Text and Text to Speech within Watson Assistant, and hand off to agents as needed.

Analyze Customer Calls

With Watson Speech to Text, you can transcribe customer phone calls to uncover patterns and conduct root cause analysis. After you transcribe your audio, you can use Watson Natural Language Understanding or Watson Discovery to analyze those transcriptions.
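
Here is a hedged sketch of that pipeline using the ibm-watson Python SDK. The API keys, service URLs, file name, the 'en-US_Telephony' model name, and the chosen NLU features (keywords and sentiment) are illustrative assumptions; substitute the model names and features that fit your data.

```python
from ibm_watson import NaturalLanguageUnderstandingV1, SpeechToTextV1
from ibm_watson.natural_language_understanding_v1 import Features, KeywordsOptions, SentimentOptions
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials and service URLs (assumptions).
stt = SpeechToTextV1(authenticator=IAMAuthenticator('<stt-api-key>'))
stt.set_service_url('<stt-service-url>')

# 1. Transcribe a recorded customer call with a next-generation telephony model.
with open('customer_call.wav', 'rb') as audio_file:
    response = stt.recognize(
        audio=audio_file,
        content_type='audio/wav',
        model='en-US_Telephony',  # illustrative next-generation model name
    ).get_result()

transcript = ' '.join(
    chunk['alternatives'][0]['transcript'] for chunk in response['results']
)

# 2. Mine the transcript for keywords and sentiment with Natural Language Understanding.
nlu = NaturalLanguageUnderstandingV1(
    version='2021-03-25',  # API version date (placeholder)
    authenticator=IAMAuthenticator('<nlu-api-key>'),
)
nlu.set_service_url('<nlu-service-url>')

analysis = nlu.analyze(
    text=transcript,
    features=Features(keywords=KeywordsOptions(limit=5), sentiment=SentimentOptions()),
).get_result()

print(analysis['sentiment']['document']['label'])
print([kw['text'] for kw in analysis['keywords']])
```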

Support Agents

You can provide real-time information to improve agent efficiency and focus. Leverage Watson Speech to Text to transcribe live calls, then use Watson Discovery to automatically surface relevant information so your agent can focus on the customer rather than on the search.

Language Availability and Accuracy Improvements

The following languages are supported:

U.S. English,* British English,* Australian English,* French,* Canadian French, German,* Italian, and Spanish

The asterisk (*) indicates that low latency mode is supported for the language. You should use low latency mode (if available) when you need the shortest possible interval between submitting your audio and receiving your transcription. Low latency mode might be the best option for Watson Assistant phone integration use cases. Test both options to determine what works best for your solution.
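
If your SDK version exposes the low_latency option for next-generation models, requesting it is a one-line change. Below is a minimal sketch with the ibm-watson Python SDK, using placeholder credentials and an illustrative model name; if your client library predates the option, the same low_latency flag can be passed as a query parameter on the underlying request.

```python
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials and service URL (assumptions).
stt = SpeechToTextV1(authenticator=IAMAuthenticator('<your-api-key>'))
stt.set_service_url('<your-service-url>')

with open('utterance.wav', 'rb') as audio_file:
    result = stt.recognize(
        audio=audio_file,
        content_type='audio/wav',
        model='en-US_Telephony',  # an asterisked, low-latency-capable model
        low_latency=True,         # trade a little accuracy for a faster response
    ).get_result()

print(result['results'][0]['alternatives'][0]['transcript'])
```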

All current languages (see the full list here) will be available on the next-generation engine in 2021. Most will be available by the end of June.

Accuracy improvements for supported languages:

Below are the accuracy improvements for most of the newly released languages. These numbers reflect the increase in accuracy of the next-generation engine compared to the previous generation of Speech to Text; the higher the number, the larger the improvement. All figures compare against the existing generation of Speech to Text. Note that the improvements for low latency mode are smaller, since this mode sacrifices some accuracy in favor of speed.

  • US English, Multimedia: 29%
  • US English, Telephony: 19%
  • British English, Telephony: 19%
  • Australian English, Telephony: 25%
  • German, Telephony: 34%
  • French, Telephony: 41%
  • Canadian French, Telephony: 44%
  • Spanish, Telephony: 46%
  • Italian, Telephony: 57%

Feature Availability

Switching to the next-generation engine is seamless. The new models are available at the same service endpoints and over the same communication protocols (WebSockets/REST) as the previous generation of the service.

We continue to support the majority of our most popular features for the next-generation engine. Language model customization will be supported later in 2021.

Service endpoints: You can keep using Speech to Text the same way you do now.

  • WebSocket/REST communication protocols: You can transcribe audio with WebSocket or HTTP.
  • Asynchronous: You can use the asynchronous HTTP interface, which transcribes the entire file and then delivers the results to you. This interface works well for any file-based use case, including offline call analytics (see the sketch below).
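
For instance, a file-based asynchronous job might look roughly like the following with the ibm-watson Python SDK; the credentials, file name, 'en-US_Multimedia' model name, and polling interval are placeholder assumptions.

```python
import time

from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials and service URL (assumptions).
stt = SpeechToTextV1(authenticator=IAMAuthenticator('<your-api-key>'))
stt.set_service_url('<your-service-url>')

# Submit the whole file as an asynchronous job...
with open('meeting.mp3', 'rb') as audio_file:
    job = stt.create_job(
        audio=audio_file,
        content_type='audio/mp3',
        model='en-US_Multimedia',  # illustrative next-generation model name
    ).get_result()

# ...then poll until the transcription is ready.
while True:
    status = stt.check_job(job['id']).get_result()
    if status['status'] in ('completed', 'failed'):
        break
    time.sleep(5)

if status['status'] == 'completed':
    print(status['results'][0]['results'][0]['alternatives'][0]['transcript'])
```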

Transcription features:

We continue to support our most popular features.

  • Diarization: Label speakers in a conversation for US English, British English, Australian English, Spanish, and German
  • Interim results: You can request results that are not final using the interim_results parameter for use cases that require ultra-low latency and partial transcripts.
  • Profanity Filter: Get rid of naughty words for U.S. English.
  • Format numbers and dates: Use smart formatting to control how the engine transcribes numbers and dates for U.S. English and Spanish.
  • Redact sensitive information: Remove numbers and other sensitive information for U.S. English using redaction.
  • Understand your audio and improve transcription: Use speech detector sensitivity, background audio suppression, and audio metrics to understand your audio and improve accuracy on noisy audio.
  • Save money on silence: Use inactivity timeout to avoid transcribing silence if your end user walks away from a live mic.
  • Identify the timing of words: Use the timestamps parameter to know when a word starts and ends. This is key for users looking to combine multiple channels of audio or pinpoint when particular words occur in a transcript (see the sketch after this list).
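
Most of these features are switched on with parameters on the recognition request. Here is a hedged sketch using the ibm-watson Python SDK; the credentials, file, model name, and parameter values are illustrative, and as the bullets above note, not every feature is available for every language.

```python
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials and service URL (assumptions).
stt = SpeechToTextV1(authenticator=IAMAuthenticator('<your-api-key>'))
stt.set_service_url('<your-service-url>')

with open('support_call.wav', 'rb') as audio_file:
    result = stt.recognize(
        audio=audio_file,
        content_type='audio/wav',
        model='en-US_Telephony',           # illustrative next-generation model name
        speaker_labels=True,               # diarization: label who said what
        smart_formatting=True,             # format numbers and dates
        profanity_filter=True,             # mask profanity (U.S. English)
        timestamps=True,                   # per-word start and end times
        inactivity_timeout=30,             # stop after 30 seconds of silence
        background_audio_suppression=0.5,  # suppress background noise
        speech_detector_sensitivity=0.5,   # tune sensitivity to non-speech sounds
    ).get_result()

# With timestamps=True, each word comes back as a [word, start_seconds, end_seconds] triple.
for word, start, end in result['results'][0]['alternatives'][0]['timestamps']:
    print(f'{word}: {start:.2f}s to {end:.2f}s')
```
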
Abstract illustration of a conversation
Illustration by Briana Van Koll

How to get started

If you’re a new user, welcome! Set up your IBM Cloud account here and get started using Speech to Text. Refer to this list of model names to access our next-generation engine.

If you’re an existing user, we’re glad to have you! You can access these new models using the model names listed on this page of our documentation. Note that the next-generation engine will transcribe audio differently than the model you’re currently using. The results will be more accurate, but you may need to make slight adjustments to your implementation. Feel free to run a few tests with the new models before moving into production. For methodology on how to do so, see this blog post.
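
In practice, the switch is usually just the model parameter on your existing calls. A minimal sketch assuming the ibm-watson Python SDK, with a previous-generation model name swapped for an illustrative next-generation one; confirm the exact names against the documentation page linked above.

```python
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials and service URL (assumptions).
stt = SpeechToTextV1(authenticator=IAMAuthenticator('<your-api-key>'))
stt.set_service_url('<your-service-url>')

with open('sample.wav', 'rb') as audio_file:
    # Previous generation: model='en-US_NarrowbandModel'
    # Next generation (illustrative name from the documentation):
    result = stt.recognize(
        audio=audio_file,
        content_type='audio/wav',
        model='en-US_Telephony',
    ).get_result()

print(result['results'][0]['alternatives'][0]['transcript'])
```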

The next-generation models are also available through the Watson Assistant voice integration. You can find them in the drop-down menu of Speech to Text model options. Use a model with “telephony” or “multimedia” in the name to access the next-generation engine.

The next-generation engine is a major architectural shift for Watson Speech to Text. It will deliver improved out-of-the-box accuracy for our clients and a better experience for their users. We look forward to supporting all of our languages and dialects on the next-generation engine in the coming months.

Stay tuned!

This blog was written in collaboration with George Saon and Daniel Bolanos.

Abstraction of data in various storage formats and locations, including a computer, a cell phone, a book, and a folder.
Illustration by Briana Van Koll



Rachel is a Product Manager for Watson Assistant. She focuses on channels and integrations.