Word Error Rate 101: Your Guide to STT Vendor Evaluation

Felix Laumann
NeuralSpace
Oct 25, 2023

In the rapidly evolving world of Speech-to-Text (STT) technology, making an informed choice can seem overwhelming. Yet, the success of your project hinges on this crucial decision. With so many claims about performance and accuracy, how can you navigate the maze of marketing hype to select the right vendor? The answer lies in objective benchmarking.

One of the key factors to consider when evaluating an STT model is the Word Error Rate (WER). WER is a metric used to determine the accuracy of transcriptions produced by an STT system. In this blog post, we will explore what WER is, why it is important, the nuances of calculating WER for different languages, and what is considered a good WER score.

Key takeaways:

  • WER is a vital measure of the performance of an STT model.
  • Understand normalization techniques in calculating WER and the associated challenges.
  • It is essential to compare services using a relevant test set due to varying evaluation methods.
  • WER calculation differs across languages, as each exhibits unique linguistic characteristics and pronunciations.
  • Learn how to measure WER and conduct your own evaluations with our calculator.

What is word error rate (WER)?

Word Error Rate or WER is a metric used primarily in the field of speech recognition to measure the performance of an automatic speech recognition (ASR) system. WER calculates the minimum number of operations (substitutions, deletions, and insertions) required to change the system’s transcription (prediction) into the reference transcription (truth), divided by the number of words in the reference.

Word Error Rate Calculation

WER can range from 0 to infinity; the closer it is to 0, the better. WER is often expressed as a percentage, obtained by simply multiplying by 100. For example, a WER of 0.15 might also be written as 15%.
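
To make the definition concrete, here is a minimal sketch of the calculation in Python. The function name, the example strings, and the dynamic-programming implementation are our own illustration; in practice you would typically reach for an open-source package such as jiwer.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + deletions + insertions)
    divided by the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits needed to turn hyp[:j] into ref[:i]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + sub)      # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 error / 6 words ≈ 0.17
```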

WER is important because it provides:

  1. Performance Measure: It gives an objective measure of how well an ASR system is transcribing speech into text.
  2. Comparison: It allows for comparison between different ASR systems or versions of a system.

Why evaluate models yourself?

Provider X advertises a WER of 4.5% for its English model, while provider Y publishes 7.5%. We know a lower WER indicates higher accuracy, so does that mean provider X is the better provider for you? No, the answer is not that simple.

Providers X and Y may have used completely different evaluation methods. They could have evaluated on different test sets (affected by recording quality, noise, accents, etc.) or normalized the text differently. WER is a sensitive metric, and these factors can dramatically affect the results.

Hence the need to evaluate all providers on a test set representative of your use case, and then compare the results and metrics.

How to evaluate STT services

  1. Identify your use case and prepare a representative audio test set with a decent number of files. Around 5 hours' worth of audio (roughly 2,000 files of about 10 seconds each) is a good starting point.
  2. Run transcriptions on all the models/providers for the whole test set.
  3. After getting the results, normalize (more on normalization below) both the results and the ground truths.
  4. Calculate a single WER over the whole test set (total errors divided by total reference words). Do not calculate WER for each sample and then average the per-sample scores (see the sketch after this list).
  5. Compare on the basis of this WER.
  6. Complement the numbers with human evaluation (more on this below).
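
Step 4 deserves emphasis. A quick sketch with made-up per-file numbers (the counts below are purely illustrative) shows how pooling errors over the whole test set differs from averaging per-file WER, which over-weights short files:

```python
# (word errors, reference word count) per audio file -- purely illustrative numbers
files = [
    (2, 40),
    (1, 5),
    (0, 55),
]

# Pooled WER: total errors divided by total reference words
pooled_wer = sum(errors for errors, _ in files) / sum(words for _, words in files)

# Average of per-file WERs: over-weights the short 5-word file
mean_per_file_wer = sum(errors / words for errors, words in files) / len(files)

print(f"pooled WER:        {pooled_wer:.3f}")         # 3 / 100 = 0.030
print(f"mean per-file WER: {mean_per_file_wer:.3f}")  # (0.050 + 0.200 + 0.000) / 3 ≈ 0.083
```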

What is normalization?

Text normalization is the process of transforming text into a consistent and standardized form. Normalization is a crucial step before calculating WER. It helps ensure that different variations or representations of the same content are treated as equivalent, thereby improving the accuracy and efficiency of text analysis. But it’s not an easy process and can be very nuanced for different languages.

Normalization for English usually involves:

  1. Converting all letters to lowercase or uppercase.
  2. Removing punctuation or special characters.
  3. Expanding contractions (e.g., “isn’t” to “is not”).
  4. Converting numbers to words (e.g., “100” to “one hundred”) or vice versa.
  5. Correcting spelling errors.
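
As a rough sketch of what such an English normalizer might look like (the contraction list below is deliberately tiny, spelling correction is left out, and none of this is a standard recipe):

```python
import re

# A deliberately tiny contraction list; a real normalizer would use a much larger one.
CONTRACTIONS = {"isn't": "is not", "don't": "do not", "can't": "cannot"}

def normalize_english(text: str) -> str:
    text = text.lower()                               # 1. lowercase
    for short, expanded in CONTRACTIONS.items():      # 3. expand contractions
        text = text.replace(short, expanded)
    text = re.sub(r"[^\w\s]", " ", text)              # 2. drop punctuation / special characters
    # 4. number-to-word conversion would typically use a package such as num2words
    return re.sub(r"\s+", " ", text).strip()

print(normalize_english("Isn't it 5 o'clock?"))       # "is not it 5 o clock"
```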

Normalization for other languages like Arabic can become even more challenging owing to their rich morphology, script, and phonetic variations. Some additional steps to normalize the script include:

  1. Removing diacritics
  2. Letter standardization (e.g., standardizing the letter ا (Aleph), which has multiple representations, and converting ة (Ta marbuta) to ه (Ha))
  3. Ligature resolution (e.g., decomposing لا (Lam-Aleph) into its constituent letters ل and ا)
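
A comparable sketch for these Arabic-specific steps (the Unicode ranges and mapping choices below are simplifications chosen for illustration):

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tanween, harakat, shadda, sukun, dagger alef

def normalize_arabic(text: str) -> str:
    text = DIACRITICS.sub("", text)                            # 1. remove diacritics
    text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)     # 2a. unify alef variants with ا
    text = text.replace("\u0629", "\u0647")                    # 2b. ta marbuta (ة) -> ha (ه)
    text = re.sub(r"[\uFEF5-\uFEFC]", "\u0644\u0627", text)    # 3. lam-alef ligatures -> ل + ا
    return text

print(normalize_arabic("مَدْرَسَةٌ"))  # "مدرسه"
```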

The need for normalization can be best understood using an example:
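
Consider the following reference/prediction pair (an illustration of our own, computed here with the open-source jiwer package, whose default settings do not lowercase or strip punctuation):

```python
import jiwer  # pip install jiwer

reference  = "Hello, how are you?"
prediction = "hello how are you"

# Raw comparison: "Hello," vs "hello" and "you?" vs "you" count as substitutions
print(jiwer.wer(reference, prediction))  # 2 errors / 4 reference words = 0.5

def normalize(text: str) -> str:
    # lowercase and strip punctuation, as described in the previous section
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())

# After normalization both strings are "hello how are you"
print(jiwer.wer(normalize(reference), normalize(prediction)))  # 0.0
```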

Without normalization, WER is 50% because 2 out of 4 words in the predicted sentence need to be substituted to arrive at the truth. On the other hand, with normalization, WER is 0% as the ground truth and the prediction are exactly the same after normalization.

Thus, normalization in the above example helped us accurately measure how well the model was able to convert speech to text, without being affected by how well the text was punctuated. This example is also a clear indicator of how sensitive WER can be to minor features of a transcript.

Human evaluation

Human evaluation is also a very important step in the process of choosing the best STT provider. A model could be performing well while the WER figures suggest it is subpar, for reasons such as:

  • Differences in representation: some words have multiple valid written forms (e.g., with an extra space or a hyphen) or alternative spellings. For example, in Tagalog, ‘right?’ can be written as both ‘di ba’ and ‘diba’.
  • Representation of dates, currency, and ordinals can vary as there are multiple correct ways of writing them.

Hence, it is a good idea to get predictions verified by a human who can speak and read the language.

Try out our WER calculator yourself

Head to https://neuralspace.ngrok.io/?__theme=light to try out our interactive WER calculator.

NeuralSpace Word Error Rate Calculator

What’s considered a good WER?

The target WER score often hinges on the unique needs of a particular industry. Generally, a lower WER signifies superior performance: a WER of 0% represents a perfect transcription, albeit a rarity. Typically, a WER below 10% is seen as excellent, while scores between 10% and 20% are good.

But this generalization should not necessarily be your guiding star. WER, as we have seen, can vary a lot depending on testing methodologies and test sets. Hence, WER should be looked at in a relative way. Using the same testing strategies to compare results among providers helps make a more informed choice than just considering absolute scores. It’s also essential to align WER standards with the specific demands and industry norms of your application.

WER plays a vital role in evaluating the accuracy and reliability of an STT vendor. By understanding WER and its nuances for different languages, along with determining the appropriate range of WER scores based on the specific context, you can make an informed decision on which STT vendor aligns best with your unique requirements and expectations.

Join our expert-led session where we demystify the selection process for STT systems.
