How to rate an Automatic Speech Recognition (ASR) system?

Shannon Rong
5 min read · Oct 22, 2021


Bacon: “What? ? ?” — Inn at Eagle Mountain — Sep 2021

Automatic speech recognition (ASR) systems are deployed everywhere. One may be the upstream component of the AT&T bot agent; another may be part of your Siri or Cortana. Their main job is to convert spoken audio into computer-understandable text. The question, then, is how to rate how well that conversion is done, given that the rating needs to take both the audio and the text into account.

There are three main categories of metrics, each from a different perspective.

1. Human-Perceived Metrics

As the category name suggests, this group of metrics takes a human's perspective. Having humans listen to an audio file, comprehend it, and assign a rating themselves is perhaps the most accurate approach, given that the system's intended users are human.

One direct and simple metric is the Mean Opinion Score (MOS). To obtain it, a human rater listens to the audio file and gives a subjective score on a scale from 1 to 5, with 5 signifying excellent transcription quality. The obvious problem is that this is labor-intensive and time-consuming. Two more efficient and practical metrics can therefore be considered:

Human Perceived Accuracy

Human Perceived Accuracy (HPA) [1] aims to predict how humans perceive and comprehend the transcriptions produced by an ASR system. It measures how much of the information that a human perceives as useful is captured by the ASR transcription.

Figure 1: HPA metric equation

In the experiments conducted by Mishra et al., saliency was estimated with Inverse Document Frequency (IDF). To determine the relative weights of the different ASR error types (insertion, deletion and substitution), they fit a regression on the average subject scores obtained from the experiments. In their experiments, the Pearson correlation between HPA and human judgement turned out to be high.
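The exact HPA formula is given in Mishra et al. [1]; the Python sketch below only illustrates the ingredients described above, with IDF as the saliency measure and invented error-type weights standing in for the regression-fitted ones:

```python
import math

def idf(word, documents):
    """Inverse Document Frequency as a saliency estimate: rarer words count more."""
    df = sum(1 for doc in documents if word in doc)
    return math.log((1 + len(documents)) / (1 + df))

def hpa_sketch(ref_words, errors, documents, w_ins=0.8, w_del=1.2, w_sub=1.0):
    """Illustrative only: each error is penalized by the saliency (IDF) of the
    affected word; the error-type weights here are invented, whereas Mishra et
    al. fit them by regression against average human scores."""
    total_saliency = sum(idf(w, documents) for w in ref_words) or 1.0
    weight = {"ins": w_ins, "del": w_del, "sub": w_sub}
    # errors is a list of (error_type, affected_word) pairs from the alignment
    penalty = sum(weight[kind] * idf(word, documents) for kind, word in errors)
    return max(0.0, 1.0 - penalty / total_saliency)
```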

Word Recognition Ratio

Word Recognition Ratio (R) [2] is an estimate of MOS. Using this metric automates the evaluation process and eliminates tedious audio listening.

Figure 2: Absolute R and Relative R equations

The relative word recognition ratio is assigned per speaker, because each speaker may have individually identifiable traits, such as accent and speaking rate, that can influence the recognizer's performance. Relative R is calculated by dividing the absolute R under packet loss probability p by its value under the zero-loss condition.
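Spelled out (assuming, consistent with the description above, that absolute R is simply the fraction of reference words recognized correctly under packet-loss probability p):

```latex
R_{\mathrm{abs}}(p) = \frac{\text{words recognized correctly under loss probability } p}{\text{total words in the reference}},
\qquad
R_{\mathrm{rel}}(p) = \frac{R_{\mathrm{abs}}(p)}{R_{\mathrm{abs}}(0)}
```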

Jiang et al. have shown that the Word Recognition Ratio can reliably predict the human-perceived quality of ASR. The relative ratio is also speaker-independent, which makes it a universal MOS estimator.

2. Keyword-Based Metrics

Keyword-based measurements are alternatives to Word Error Rate (WER), which treats every single word uniformly, and they address some of WER's drawbacks.

Keyword Error Rate

Keyword Error Rate (KER) [3] stems from WER, but it only takes pre-filtered keywords into consideration.

For keyword selection, Park et al. propose to identify domain-specific words by comparing a word's relative frequency in a domain-specific text with its frequency in a non-domain-specific text. Words with a high enough domain specificity are labeled as keywords.
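A minimal Python sketch of that selection step, assuming plain relative-frequency counts and a hand-picked specificity threshold (the function name and the threshold value are illustrative, not from the paper):

```python
from collections import Counter

def select_keywords(domain_tokens, general_tokens, threshold=3.0):
    """Label a word as a keyword when it is markedly more frequent in the
    domain-specific text than in the non-domain text (threshold is arbitrary)."""
    domain_counts = Counter(domain_tokens)
    general_counts = Counter(general_tokens)
    n_domain, n_general = len(domain_tokens), len(general_tokens)
    keywords = set()
    for word, count in domain_counts.items():
        p_domain = count / n_domain
        p_general = (general_counts[word] + 1) / (n_general + 1)  # add-one smoothing
        if p_domain / p_general >= threshold:
            keywords.add(word)
    return keywords

# Hypothetical usage:
# keywords = select_keywords(medical_corpus_tokens, news_corpus_tokens)
```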

Figure 3: KER equation

KER is calculated as the number of falsely recognized keywords F plus the number of missed keywords M, divided by the number of keywords in the reference data N, then multiplied by 100 for readability.
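Written out from the description above (Figure 3 itself is not reproduced here):

```latex
\mathrm{KER} = \frac{F + M}{N} \times 100
```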

Calculating this metric involves a keyword-selection step in which a specificity threshold has to be picked manually. In general, though, the metric is easy to apply.

Weighted Keyword Error Rate

Weighted Keyword Error Rate (WKER) [4] is developed from KER. WKER weights errors from an information-retrieval perspective, which makes it more appropriate for predicting the performance of key-sentence indexing of oral presentations.

First, key sentences are indexed using the term frequency (TF) of the keywords they contain:

Figure 4: WKER equation — indexing key sentences

In the key-sentence indexing equation, df stands for the number of documents in which the word appears, and Nd is the number of presentations, used for normalization.
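Nanjo and Kawahara's exact weighting is given in the paper; a standard TF-IDF-style score consistent with the variables described above (an assumption on my part, not a reproduction of Figure 4) would be:

```latex
\mathrm{score}(s) = \sum_{w \in s} \mathit{tf}(w) \cdot \log \frac{N_d}{\mathit{df}(w)}
```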

Then, we move forward to calculate WKER:

Figure 5: WKER equation — calculating WKER

Since keywords are more important than common words, this predictor requires a collection of key sentences, which may take a small amount of data preprocessing. The keyword indexing can be automated, and the WKER calculation itself is relatively simple to apply.
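Figure 5 is likewise not reproduced here. As a rough, hedged sketch of the weighting idea (not the paper's exact equation), one can picture WKER as KER with each keyword error counted by its index weight w(k) instead of being counted once:

```latex
\mathrm{WKER} = \frac{\sum_{k \in \text{erroneous keywords}} w(k)}{\sum_{k \in \text{reference keywords}} w(k)} \times 100
```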

3. Information-Based Metrics

Information-based metrics are grounded in information theory and offer a different view of ASR performance. Instead of being picky about every single word or keyword, this category considers how much of the underlying information is preserved.

Word Information Lost

Word Information Lost (WIL) [5] [6] is an estimate of Relative Information Lost (RIL). RIL is based on mutual information, but it is not a practical metric because of its complexity.

WIL, however, is simple to apply because it is calculated from the Hit-Substitution-Deletion-Insertion (HSDI) counts of the input/output alignment. It is easy to compute and can be interpreted as the probability that any given input word is lost in the output.

Figure 6: WIL equation
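For reference, the formula from Morris et al. [5], in terms of the hit (H), substitution (S), deletion (D) and insertion (I) counts:

```latex
\mathrm{WIL} = 1 - \frac{H^2}{(H + S + D)\,(H + S + I)}
```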

References

[1] Mishra, Taniya, Andrej Ljolje, and Mazin Gilbert. “Predicting human perceived accuracy of ASR systems.” Twelfth Annual Conference of the International Speech Communication Association. 2011.

[2] Jiang, Wenyu, and Henning Schulzrinne. “Speech recognition performance as an effective perceived quality predictor.” Tenth IEEE International Workshop on Quality of Service (Cat. No. 02EX564). IEEE, 2002.

[3] Park, Youngja, et al. “An empirical analysis of word error rate and keyword error rate.” INTERSPEECH. 2008.

[4] Nanjo, Hiroaki, and Tatsuya Kawahara. “A new ASR evaluation measure and minimum Bayes-risk decoding for open-domain speech understanding.” Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’05). Vol. 1. IEEE, 2005.

[5] Morris, Andrew Cameron, Viktoria Maier, and Phil Green. “From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition.” Eighth International Conference on Spoken Language Processing. 2004.

[6] Errattahi, Rahhal, Asmaa El Hannani, and Hassan Ouahmane. “Automatic speech recognition errors detection and correction: A review.” Procedia Computer Science 128 (2018): 32–37.

Find me on LinkedIn!
