How to Properly Evaluate Speech-To-Text Engines

Marco Noel
IBM Watson Speech Services
6 min read · Jan 2, 2020


What is the best way to compare speech recognition engines? | Photo by Troy T on Unsplash

Over the past year, I have had a lot of discussions with customers and colleagues about which speech recognition engine is the best on the market. Unfortunately, I still see a lot of very basic mistakes in how they are evaluated. A bad choice can have a major impact on your voice solution, whatever the use case.

The first and most common mistake is assuming that if an engine works for one thing, it will work for everything else, the classic “one-size-fits-all” trap. It gets worse when you discover gaps you did not notice up front and end up coding functions that already exist in other speech engines. By then it is too late: you have already invested months in the solution and cannot turn back.

Here are some general guidelines I recommend for your evaluation.

1. Clearly Identify Your Use Case and Requirements

Common voice use cases involve call centers | Photo by Alex Kotliarskyi on Unsplash

It is extremely important that you know the target use case for your future voice solution.

An IVR pattern usually handles short utterances, requires that you identify intents, entities and data inputs (member ID, credit card number, dates, alphanumeric product ID, etc) and has defined process flows. Common use cases are self-service and call queue transfer management.

A Call Analytics pattern deals with longer audio recordings (approx. 1 to 10 minutes), requires that you identify key insights (product names, tone, sentiment, etc.) and usually involves two or more speakers. Call quality assurance, customer satisfaction gaps and overall sentiment are common use cases.

I encourage you to listen to call recordings, sketch the typical interactions on a whiteboard as a process flow, and document your requirements. Also check for things like audio quality, background noise, accents, and crosstalk, which are frequent indicators of challenges to come.

2. Collect Representative Data and Define a Test Methodology

Representative data and a good test methodology are key to properly evaluate speech engines | Photo by Mika Baumeister on Unsplash | Photo by Daria Nepriakhina on Unsplash

To evaluate speech recognition engines, you need to have representative data related to your use case, with key factors like devices used, environment (e.g. a noisy warehouse) and accents. Check out this article for more details on how to collect and build audio data sets.

Once you have your data, you need to define a test methodology based on your use case. As you have guessed, testing an IVR pattern is different from testing a Call Analytics one. Let’s use the IVR pattern as our example moving forward.

Some common data we see in an IVR pattern are:

  • Short utterances to identify the intents and entities (“I need to check the status of my claim”, “How do I add my daughter to my policy?”, “My computer hangs all the time”)
  • Data inputs to authenticate the user, like a member ID (“LK12345”) and a date of birth (“January 25, 1973”, “01/25/73”)
  • Product names (“Macbook Air”, “Colgate Mouthwash”), unique policy numbers (“PQ1234R67”) or claim numbers (“1234K56”)

As we will see later in this article, each data category needs to be tested and measured individually, and the results documented.
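To keep these measurements organized, it helps to maintain a simple manifest that maps each data category to its audio clips and reference transcripts. Here is a minimal Python sketch; the file names, categories and transcripts are purely illustrative.

```python
# A minimal sketch of a test manifest that groups audio clips and reference
# transcripts by data category. File names and transcripts are illustrative only.
TEST_SETS = {
    "intents_entities": [
        ("audio/claim_status_001.wav", "i need to check the status of my claim"),
        ("audio/add_dependent_002.wav", "how do i add my daughter to my policy"),
    ],
    "member_id": [
        ("audio/member_id_001.wav", "L K 1 2 3 4 5"),
    ],
    "date_of_birth": [
        ("audio/dob_001.wav", "january twenty fifth nineteen seventy three"),
    ],
}

def iter_test_set(category):
    """Yield (audio_path, reference_transcript) pairs for one data category."""
    yield from TEST_SETS[category]
```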

IMPORTANT NOTE: A common mistake is to limit your tests to the “Out-of-the-box Base Model” of each speech engine. Avoid it at all costs. Each vendor has features and functionalities to improve accuracy and fix potential gaps. Test them out and compare.
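For instance, with IBM Watson Speech to Text you can compare the base model against a custom language model trained on your domain terms. Below is a minimal sketch, assuming the ibm-watson Python SDK; the API key, service URL, audio file and customization ID are placeholders you would swap for your own.

```python
# Sketch: transcribing one test clip with and without a custom language model,
# using the ibm-watson Python SDK. API key, URL, file and IDs are placeholders.
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator("YOUR_API_KEY")
stt = SpeechToTextV1(authenticator=authenticator)
stt.set_service_url("YOUR_SERVICE_URL")

def transcribe(audio_path, customization_id=None):
    """Return the top transcript for an audio file, optionally with a custom model."""
    with open(audio_path, "rb") as audio:
        response = stt.recognize(
            audio=audio,
            content_type="audio/wav",
            model="en-US_NarrowbandModel",  # telephony-quality audio for IVR
            language_customization_id=customization_id,
        ).get_result()
    return " ".join(
        result["alternatives"][0]["transcript"].strip()
        for result in response["results"]
    )

# Compare the out-of-the-box base model against a trained custom model.
base = transcribe("audio/claim_status_001.wav")
custom = transcribe("audio/claim_status_001.wav",
                    customization_id="YOUR_CUSTOM_LM_ID")
```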

Word Error Rate is a good metric… but not that good

All vendors claim they can get a Word Error Rate (WER) of 1% (under perfect lab conditions with plain English). This is not realistic and it sets the wrong expectation. Even humans deal with a 4–5% WER under ideal conditions. If you have listened to regular calls, it is very common to hear a call agent asking a caller to repeat, especially if the caller is on a busy street, in a noisy warehouse or simply outside on a windy day. If the caller speaks with a thick foreign accent, it can get worse. These are “real life factors” that directly affect WER.

Here’s why you should never rely ONLY on WER to evaluate your speech engines. WER evaluates a speech transcription (the hypothesis) against a human transcription (the reference) and identifies errors like deletions, insertions and substitutions. If your human transcribers are not consistent when building your references, you might get a word error when it’s not really one.

“It’s” vs “it is” : Word Error (not really!)

“five three” vs “5 3”: Word Error (really!?)
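To make this concrete, here is a minimal WER sketch in Python with a light normalization pass so that formatting choices like the ones above are not counted as word errors. The normalization rules are only examples; in practice you would align them with your transcription guidelines.

```python
# Minimal WER sketch: Levenshtein distance over words, with a light text
# normalization pass so that formatting choices ("it's" vs "it is", "5" vs "five")
# are not counted as word errors. The normalization rules are illustrative only.
import re

NORMALIZE = {"it's": "it is", "don't": "do not", "5": "five", "3": "three"}

def normalize(text):
    """Lowercase, strip punctuation and expand a few known variants."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    words = []
    for w in text.split():
        words.extend(NORMALIZE.get(w, w).split())
    return words

def wer(reference, hypothesis):
    ref, hyp = normalize(reference), normalize(hypothesis)
    # dp[i][j] = minimum edits (substitutions, insertions, deletions)
    # to turn the first i reference words into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("it is five three", "it's 5 3"))  # 0.0 after normalization
```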

Another aspect is what we call “glue” words (“the”, “and”, “I”, “they”, “so”, “this”, “it”), which make up the highest volume of words in typical conversations. They also carry almost no meaning.

The most meaningful words (“claims”, “need”, “policy”, “buy”, “sell”, “purchase”) are usually rare in a conversation, but they are the most important ones: they are what you use to identify your intents and entities.

If you get a 10% WER but 50% of your intents and entities are missed because your meaningful words are not properly transcribed, your 90% speech accuracy is not representative of the real performance of your voice solution.
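One way to capture this is to measure, alongside WER, how many of the domain keywords in the reference actually made it into the hypothesis. Here is a small sketch; the keyword list is hypothetical and would come from your own domain vocabulary.

```python
# Sketch: a keyword recall metric that ignores "glue" words and only checks
# whether the domain terms that drive intents and entities were transcribed.
# The keyword list is hypothetical and would come from your own domain.
KEYWORDS = {"claim", "claims", "policy", "status", "purchase", "cancel", "refund"}

def keyword_recall(reference, hypothesis):
    ref_terms = {w for w in reference.lower().split() if w in KEYWORDS}
    if not ref_terms:
        return None  # nothing to measure in this utterance
    hyp_terms = set(hypothesis.lower().split())
    return len(ref_terms & hyp_terms) / len(ref_terms)

print(keyword_recall("i need to check the status of my claim",
                     "i need to check the status of my clean"))  # 0.5
```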

For data inputs (member ID, claim number, date), WER is irrelevant because if one word is wrong, the whole data input is considered wrong. The metric we use in this case is Sentence Error Rate (SER). As you can imagine, even if you have a low WER, you can still have a high SER:

For example, suppose you have 10 policy numbers of 5 digits each = 50 words in total.

If one digit is wrong in each policy number = 10 word errors out of 50 (20% WER)

If one digit is wrong in each policy number = 10 wrong policy numbers out of 10 (100% SER)
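Here is a small standalone SER sketch; a data input counts as wrong as soon as any word in it differs from the reference (applying the same normalization you use for WER).

```python
# Sketch: Sentence Error Rate (SER) over data inputs such as policy numbers.
# A data input counts as wrong as soon as any word differs from the reference.
# In practice you would apply the same normalization here as for WER.
def ser(pairs):
    """pairs: list of (reference, hypothesis) transcript pairs for data inputs."""
    wrong = sum(1 for ref, hyp in pairs
                if ref.lower().split() != hyp.lower().split())
    return wrong / len(pairs)

# Ten 5-digit policy numbers, each with exactly one digit misrecognized:
# 10 word errors out of 50 words = 20% WER, but 10 wrong sentences out of 10 = 100% SER.
```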

Another metric to consider is the “intent/entity recognition rate”. If you send the utterance (the voice transcription) as-is to your chatbot, is your intent recognized? If the answer is yes, then your speech engine has achieved its purpose. If not, check for the presence of domain-specific terms, the quality of the audio (noise, crosstalk) and the context of the conversation, then leverage the extra features and functionalities available to improve accuracy.
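If you track this metric in code, it can be as simple as the sketch below, where classify_intent() is a placeholder for whatever chatbot or NLU service you call with the transcript.

```python
# Sketch: intent recognition rate on top of speech transcripts.
# classify_intent() is a placeholder for your chatbot / NLU service call
# (e.g. a function returning the top intent name for a given utterance).
def intent_recognition_rate(test_cases, classify_intent):
    """test_cases: list of (transcript, expected_intent) pairs."""
    hits = sum(1 for transcript, expected in test_cases
               if classify_intent(transcript) == expected)
    return hits / len(test_cases)
```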

3. Experiment and Evaluate All Features Available

An example I frequently use in my discussions is the Olympic decathlon. The athlete who wins the gold medal does not have to be #1 in the high jump, nor break a world record in the 110-meter hurdles. The key is to perform well in enough disciplines to achieve the overall objective.

For speech recognition, you can have a great out-of-the-box base model with great results for general-use utterances, but what about recognizing domain-specific terms or alphanumeric inputs? What about users with heavy accents working in a noisy environment? Does it handle dates properly? Does it spot keywords?

Start by building a grid in Excel with your list of technical requirements, one per row, and each speech recognition engine in its own column. In a first pass, whenever you see speech features and functionalities that could address a requirement, document them in the corresponding intersection cell (requirement × speech engine).

Create different experiments with audio test sets to measure each requirement individually (e.g. intent/entity utterances, dates, claim numbers, policy numbers, member IDs) against each speech engine, and rate how easy those features are to implement, use and maintain (e.g. on a 1–10 rating scale).
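If you prefer to keep the grid in code rather than Excel, a pandas DataFrame works just as well; the engines and scores below are placeholders, not real results.

```python
# Sketch: the evaluation grid as a DataFrame, one row per requirement and one
# column per speech engine. The engine names and scores are placeholders.
import pandas as pd

grid = pd.DataFrame(
    {
        "Engine A": [8, 6, 9, 7],
        "Engine B": [7, 8, 5, 9],
    },
    index=["Intent/entity utterances", "Dates", "Claim numbers", "Member IDs"],
)

print(grid)
print(grid.mean().sort_values(ascending=False))  # rough overall ranking per engine
```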

Don’t rely only on the vendor’s “brochure” to complete your evaluation! Don’t settle for shortcuts and quick tests! There’s nothing like good, structured experimentation: getting your hands dirty and witnessing the results for yourself.

How do you conduct your ASR evaluation?


Sr Product Manager, IBM Watson Speech / Language Translator. Very enthusiastic and passionate about AI technologies and methodologies. All views are my own.