Measuring and Improving Speech-to-Text Accuracy | Google Cloud Platform

Nishit Kamdar
Google Cloud - Community
9 min read · Mar 9, 2022

What is Speech-to-Text?

Speech-to-text (STT), or automatic speech recognition (ASR), is a technology that automatically converts human speech into text using computational linguistics.

It is one of the key areas of artificial intelligence that has strived over decades to come as close as possible to human-level speech recognition, using breakthrough advances in deep learning.

These advancements have driven speech-to-text adoption everywhere from everyday personal use on phones and home devices to industrial applications such as call analytics and agent assist, media subtitling, clinical documentation, and content search.

What are we solving for using STT?

Problem statement: The customer is an asset management company. As part of their onboarding journey and policy registration process, they record a video testimony of customers agreeing to the terms and conditions of the policy. This is used as legally admissible evidence should there be a conflict of interest in the future. During a compliance audit of the video archives, quite a few non-compliances were found: some videos had no voice, some had incomplete or missing facts, and some had too much background noise, making it difficult to understand what the customer was saying.

Solution: The Group IT team embarked on an initiative to introduce a speech-to-text transcription service to transcribe the customer video testimonials, approving and storing them only if the transcriptions accurately capture the key facts. The team started the initiative but could not achieve the required accuracy and reached out to us.

In this blog, we will see how we helped the customer measure and improve the accuracy of their speech-to-text pipeline using GCP's Speech-to-Text service.

ASR under-the-hood and Google Speech-to-Text!

Before we jump into improving the accuracy, it is important to:

  1. Have a high-level understanding of how ASR systems function.
  2. Know how the outcome is measured so that it can be improved.

ASR — System Architecture

This is an extremely simplified diagram to give a high-level understanding of the system.

Google Speech-to-text

The system takes audio input from the user as chunks of 10–25 millisecond speech frames and processes them through what is called an acoustic model. The acoustic model looks at the audio waveforms (or the spectrogram) and converts these waveforms into phonemes.

A phoneme is a distinct unit of sound in a given language that makes up a word. The example below shows 3 distinct phonemes that make up the words “five” and “four” in English. These word-to-phoneme mappings are developed by experts and can be thought of as the equivalent of the letters that make up a word, but using phonetics.

These phoneme-based word predictions are then fed into what is known as a language model.

The language model provides a way to interpret how words should be ordered in a particular language; these models are developed by training on very large amounts of text in that language. The language model looks at the groupings and sequences of phonemes and transcribes them into a word lattice, from which the output text is produced.

So, in a very simplified way, that is how speech gets transcribed into text.

Measurement: Word Error Rate (WER)

Speech-to-text output is measured by Word Error Rate (WER), which combines the three types of transcription errors that can occur:

  • Insertion Error (I) — Words present in the hypothesis transcript that are not present in the ground truth (human transcription)
  • Substitution errors (S) — Words in the ground truth that were transcribed as a different word in the hypothesis
  • Deletion errors (D) — Words that are missing from the hypothesis but present in the ground truth
  • Total number of words (N) — total words in the ground truth transcript

WER is calculated by adding all the errors (S+D+I) and dividing by the total number of words (N) in the ground truth transcript. The aim, therefore, is to have as low a WER as possible.
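As a concrete illustration, WER can be computed with a standard word-level edit distance. This is a minimal sketch; in practice you would typically normalize case and punctuation first, or use a dedicated library such as jiwer:

```python
def wer(ground_truth: str, hypothesis: str) -> float:
    """Word Error Rate = (S + D + I) / N, via word-level edit distance."""
    ref, hyp = ground_truth.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution (or match)
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` gives one deletion over six reference words.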

Measuring and Improving the accuracy

With this background, let's work through the problem of the AMC customer. I will be using the following ground truth to simulate the customer's recorded testimonial and benchmark its accuracy:

“I Nishit Kamdar am applying for a guaranteed income milestone policy from cymbal direct insurance. I confirm to pay the premium amount of rupees 50000 for a period of 5 years on an annual basis. I am based in Mumbai at 789 Alpine Regency Main Street Fort Mumbai 400001 and my contact number is 98211 65243 kindly consider this as my approval to process the application.”

Iteration 1: Reviewing the current configuration and accuracy

Following is the Python SDK code that the Group IT team was trying out:
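A minimal sketch of such a request (the configuration is in dict form, which the google-cloud-speech Python client also accepts; the GCS URI below is a placeholder, not the customer's actual bucket):

```python
# Initial request configuration: note the language code, which we
# will revisit in the next iteration.
config = {
    "encoding": "LINEAR16",
    "sample_rate_hertz": 16000,
    "language_code": "en-US",
}
audio = {"uri": "gs://your-bucket/customer-testimonial.wav"}

# With google-cloud-speech installed and credentials configured:
# from google.cloud import speech
# client = speech.SpeechClient()
# response = client.recognize(config=config, audio=audio)
# transcript = " ".join(r.alternatives[0].transcript for r in response.results)
```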

WER: Green strikethrough text indicates the errors in the transcription and yellow highlights indicate the ground-truth text.

The WER of the current configuration was extremely high at 37.88%, bringing the overall STT accuracy down to just over 60%.

Iteration 2: Language Code

One of the key issues with the above configuration is language_code=en-US. Google Speech-to-Text supports 125+ languages, including country-specific variants. For English alone, it provides 40+ variants reflecting the way English is spoken in different countries. Since this is for Indian customers, let's change it to the Indian variant.

Config updated to English for India: en-IN
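In dict form, the only change from the earlier request is the language code (the other fields are assumed unchanged):

```python
# Same request configuration, switched to the Indian English variant.
config = {
    "encoding": "LINEAR16",
    "sample_rate_hertz": 16000,
    "language_code": "en-IN",  # was "en-US"
}
```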

WER: The WER dramatically drops to around 10%, giving close to 90% accuracy from this change alone, which makes this one of the most important configurations to review.

Iteration 3: Transcription Model

Another key area to review is the context of the speech being transcribed. The context differs across domains, as shown below, so one should look for a model specific to the type of transcription being done to get better accuracy.

Google Speech-to-Text supports the following models as of this writing. These are available for specific languages only and are constantly evolving.

Since none of the models apply to our scenario, we will go with the “default” one.

Iteration 4: Speech Adaptations (Phrase Hints and Boost)

In our Iteration 2 output, we still see errors in proper nouns and rare words (the person and company names), as well as in the address and phone number.

To address this, GCP provides speech adaptation: mechanisms that provide hints to the ASR to bias it toward the required output. There are three types of adaptation features: phrase hints, boost, and class tokens.

Phrase hints provide the ability to define words or long phrases in the configuration request that may be present in the speech, so that the system takes them into account and biases toward that particular piece of information.

Boost allows you to add numerical weights to these phrase hints to adjust the strength of speech adaptation effects on your transcription results.

So let's update the config with a speech_contexts parameter, passing the proper nouns and boosting their weight.
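A sketch of the updated config in dict form; the boost value of 20 is an assumption here, and the right weight for a given workload is found by re-measuring WER after each change:

```python
config = {
    "encoding": "LINEAR16",
    "sample_rate_hertz": 16000,
    "language_code": "en-IN",
    "speech_contexts": [
        {
            # Proper nouns from the testimonial that were being missed
            "phrases": ["Nishit Kamdar", "Cymbal Direct Insurance"],
            "boost": 20.0,  # assumed weight; tune by re-measuring WER
        }
    ],
}
```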

Output:

Both proper nouns, Nishit and Cymbal, have now been transcribed correctly using the speech adaptation phrase hints and boost features, thereby improving the transcription.

Boost is also useful when dealing with words that sound exactly the same, like “fair” and “fare” or “weather” and “whether”.

Phrase hints and boosts can be used in two ways:

  1. Static list: You could set up a dictionary of the key words unique to your domain or business and pass it as a parameter with each request so that they are transcribed correctly. Google STT also provides a way to store a collection of thousands of items and pass the collection ID, so that you do not have to send all the items with every request. An example of a static list could be a list of Indian medicine names, which may not be standard words, for use in medical transcription.
  2. Dynamic parameters: It is not always possible to build a static list; for example, for transcribing names you cannot build a collection of all possible names. In such cases, you can pass these values from upstream processes as parameters and create the config object dynamically.

Iteration 5: Speech Adaptations (Classes)

We are still left with phone-number-related errors, and that's where the Classes feature of speech adaptation is helpful.

Classes represent common concepts that occur in natural language, such as defining monetary units, addresses and calendar dates.

For example, if an audio recording contains “My house is at 123 Main Street”, the expected output is the number in digits (“123”) rather than spelled out (“one hundred twenty-three”), even if that is how it is spoken.

This is where a class token can be used: [“My house is $ADDRESSNUM”], where $ADDRESSNUM is the class token that biases STT to transcribe the numbers within the address as digits. Similar to the address token, there are various other class tokens for dates, phone numbers, etc. Refer to https://cloud.google.com/speech-to-text/docs/class-tokens for the list of class tokens available per language. GCP STT also allows you to define your own custom classes.

In our example, we have some inconsistencies with the phone number, so I have added the phone number class token below.
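A sketch of the config with a class-token phrase added to the speech contexts; I am assuming the $FULLPHONENUM token here, so check the class-tokens page for the exact token supported for your language:

```python
config = {
    "encoding": "LINEAR16",
    "sample_rate_hertz": 16000,
    "language_code": "en-IN",
    "speech_contexts": [
        {
            "phrases": ["Nishit Kamdar", "Cymbal Direct Insurance"],
            "boost": 20.0,
        },
        {
            # Class token biasing the digits toward a phone-number format
            "phrases": ["my contact number is $FULLPHONENUM"],
        },
    ],
}
```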

The overall WER has dropped to 3%, increasing the accuracy to 97% from just over 60% through these configurations and optimisation techniques. There are other small errors which could be fine-tuned further through a combination of adaptations. However, all the key facts I need from the testimonial are transcribed accurately, so I am happy with the 97% accuracy.

Along with these optimisations, there are some best practices, such as recommended sampling rates and lossless codecs, that help provide optimal input to the STT service; these can be found at https://cloud.google.com/speech-to-text/docs/best-practices

Summary:

Speech is hard and extremely complex under the hood. The expectation from ASR should not be 100% accuracy, but to leverage its potential to accelerate your transcriptions and drive business outcomes.

Google Cloud Platform's Speech-to-Text is built on decades of Google's research and contributions to the ASR space and is one of the leading speech recognition and transcription services in the industry. With speech adaptation, it also provides mechanisms to tune and bias the engine to improve transcription accuracy. For more details, please visit:

https://cloud.google.com/speech-to-text


Data and Artificial Intelligence specialist at Google. This blog is based on “My experiences from the field”. Views are solely mine.