Two minutes NLP — Intro to Word Error Rate (WER) for Speech-to-Text

Speech Recognition, Levenshtein distance, and BLEU

Fabio Chiusano
NLPlanet
3 min read · Feb 3, 2022


Hello fellow NLP enthusiasts! While researching the BLEU and ROUGE metrics for articles over the past few weeks, I came across WER. Many of the models we use every day are trained and evaluated with metrics like these, so I think it’s very important to know them (even if they may not be the sexiest of topics) 🤷🏻‍♂️. I added WER to my editorial plan, and today is the day to talk about it! Enjoy! 😄

What is WER?

Word Error Rate (WER) is a common performance metric mainly used for speech recognition.

When recognizing speech and transcribing it into text, some words may be left out or misinterpreted. WER compares the predicted output and the reference transcript word by word to figure out the number of differences between them.

There are three types of errors considered when computing WER:

  • Insertions: the predicted output contains additional words that are not present in the reference transcript;
  • Deletions: the predicted output is missing words that are present in the reference transcript;
  • Substitutions: the predicted output contains misinterpreted words that replace words in the reference transcript.

Let’s look at an example. Consider the following reference transcript and predicted output:

  • Reference transcript: “The dog is under the table”.
  • Predicted output: “The dog is the fable”.

In this case, the predicted output has one deletion (the word “under” disappears) and one substitution (“table” becomes “fable”).

So, what is the Word Error Rate of this transcription? Simply put, WER is the total number of errors divided by the number of words in the reference transcript.

WER = (num inserted + num deleted + num substituted) / num words in the reference

Thus, in our example:

WER = (0 + 1 + 1) / 6 = 0.33
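The formula above can be sketched in code. Below is a minimal word-level implementation using the classic dynamic-programming edit-distance algorithm (the function name and structure are my own illustration, not from a specific library):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: minimum word-level edits divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the dog is under the table", "the dog is the fable"))  # 2/6 ≈ 0.33
```

Running it on the example sentence reproduces the hand-computed result: one deletion plus one substitution over six reference words.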

A lower WER indicates that the Automated Speech Recognition (ASR) system is more accurate at recognizing speech; a higher WER, conversely, indicates lower ASR accuracy.

WER, Levenshtein distance, and edit distance

The WER calculation is based on the Levenshtein distance, applied at the word level rather than the character level. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.
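To make the character-level version concrete, here is a small sketch of the Levenshtein distance between two words, using the same dynamic-programming idea with a rolling row (an illustrative implementation, not from a particular library):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("table", "fable"))    # 1 (one substitution)
print(levenshtein("kitten", "sitting")) # 3
```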

The Word Error Rate may also be referred to as the length normalized edit distance.

WER drawbacks

Although WER is the most widely used metric for Speech Recognition, it has some drawbacks:

  • There is no differentiation between the words that are essential to the meaning of the sentence and those that are not as important.
  • It doesn’t take into consideration if two words differ just by one character or if they differ completely.
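The second drawback is easy to demonstrate. When the reference and hypothesis have the same length and align word by word, WER reduces to the fraction of mismatched positions, so a near-miss and a completely wrong word cost exactly the same (the helper below is a simplified illustration for this equal-length case only):

```python
def simple_wer(reference: str, hypothesis: str) -> float:
    """WER for the special case of equal-length, position-aligned sequences:
    just the fraction of mismatched words."""
    ref, hyp = reference.split(), hypothesis.split()
    assert len(ref) == len(hyp), "equal-length case only"
    return sum(r != h for r, h in zip(ref, hyp)) / len(ref)

# "fable" is one character away from "table"; "chair" is a different word
# entirely. WER scores both errors identically: 1/6 ≈ 0.17.
print(simple_wer("the dog is under the table", "the dog is under the fable"))
print(simple_wer("the dog is under the table", "the dog is under the chair"))
```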

Moreover, WER does not account for the reason why errors may happen, which may affect WER without necessarily reflecting the capabilities of the ASR technology itself. Some examples of these factors are the recording quality, the speaker's pronunciation, and the presence of unusual names or domain-specific terms.

WER and BLEU

Why is the BLEU score used for machine translation and summarization but not for speech-to-text?

Although Automatic Speech Recognition models output text just as machine translation systems do, in speech recognition the target sentence is unambiguous: there is usually a single correct transcript, not many equally valid phrasings open to interpretation. BLEU is designed to handle multiple acceptable outputs, so in this case it is not the ideal metric and the simpler WER suffices.
