How Accurate is Amazon Transcribe on South African English?

Measuring transcription accuracy when there is no ground truth

Nick Wilkinson
Published in Tesserae AI
Sep 30, 2021 · 10 min read


“Alexa, play Bohemian Rhapsody.” “Okay, calling grandma!” We’ve all been there with our digital assistants, particularly those of us who speak English with one of the many accents these assistants struggle with (basically any accent other than American or British). Many cloud providers, such as Amazon and Google, now offer their own speech-to-text services, which are often quoted as having a very high degree of accuracy. For example, this blog found error rates as low as 3–5% when testing these services on podcast speech. However, many real-world applications require speech-to-text to be performed in less than ideal noise conditions, on speakers with many different accents. In this post we explore how Amazon Transcribe performs when faced with a variety of South African English accents.

The Data

The dataset we use for these tests comprises a collection of voicemail messages to a call centre, in which customers are asked to provide feedback concerning the service they have just received. Speakers are either first or second language English speakers from South Africa. The total duration of the dataset is around 290 hours of audio, which is to our knowledge the largest corpus of transcribed South African English audio in existence.

An overview of the dataset statistics

The dataset statistics are shown in the figure above. For the roughly 47 000 voicemails in the dataset, we collected approximately 63 500 annotations, transcribed by non-expert human annotators, with 15% of the voicemails being annotated multiple times to allow for quality control. The idea was to use these repeatedly transcribed voicemails to get some idea of the accuracy of our annotators.

Inter-Annotator Agreement

Usually, if we are interested in the accuracy of a transcription, we would calculate the word error rate (WER) to get our answer. This approach works well for comparing transcriptions produced by a speech-to-text system against a ground truth generated by an expert human annotator: the ground truth is used as the reference transcription and the speech recognizer output as the hypothesis transcription.

However, what happens if we would like to compare non-expert human annotators against one another? In this case, neither transcription is the ground truth reference. Without a ground truth we can no longer measure “accuracy” as such, but rather we can look at the level of agreement as a proxy for accuracy. By analogy, if we were to ask a number of students to write a mathematics test for which no memorandum was available, one way to compile a memorandum would be to assume the most commonly given answers are likely to be the correct answers. This method has its flaws, but fortunately transcription is much easier than calculus, so those flaws are unlikely to have a large influence.

A better metric

Unfortunately, the WER calculation is not commutative, i.e. we get a different answer if we swap the reference and the hypothesis. This is demonstrated by the examples in the figure below, where swapping the order of the two transcriptions changes the WER from 40% to 44.44%. In order to measure pairwise agreement between annotators, we need a commutative metric to measure the amount by which two transcriptions differ.

A comparison between WER and Normalized WER. Note that Normalized WER is commutative whereas WER is not.

This is where Normalized WER comes into play. Normalized WER makes a simple modification to WER to make it commutative: instead of using the length of the reference transcription as the denominator, it uses the length of the longer of the two transcriptions. This means the order of the transcriptions is no longer important. Once again this is shown in the examples in the figure above, where the Normalized WER remains 40% regardless of the transcription order.

Normalized WER has the added benefit of having a maximum of 100% (no agreement) and a minimum of 0% (total agreement), making it easier to interpret than WER which can have values exceeding 100%. Normalized WER is equivalent to another metric known as match error rate (MER), which is discussed in more detail here, along with other measures of transcription accuracy.
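To make the difference concrete, here is a minimal Python sketch of both metrics. The word-level edit distance is implemented from scratch, and the example sentences are made up (they are not the pair from the figure), but they reproduce the same 40% vs 44.44% asymmetry.

```python
def word_edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    rows, cols = len(ref_words) + 1, len(hyp_words) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i              # deleting every reference word
    for j in range(cols):
        d[0][j] = j              # inserting every hypothesis word
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    return d[-1][-1]


def wer(reference, hypothesis):
    """Standard WER: edits divided by the length of the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    return word_edit_distance(ref, hyp) / len(ref)


def normalized_wer(a, b):
    """Normalized WER: edits divided by the length of the longer transcription, so it is symmetric."""
    a_words, b_words = a.split(), b.split()
    return word_edit_distance(a_words, b_words) / max(len(a_words), len(b_words))


t1 = "please call me back about the service i received today"  # 10 words
t2 = "please phone me back regarding the service i got"        # 9 words

print(wer(t1, t2))             # 0.4     -> 40%
print(wer(t2, t1))             # ~0.444  -> 44.44% (not commutative)
print(normalized_wer(t1, t2))  # 0.4     -> 40%
print(normalized_wer(t2, t1))  # 0.4     -> 40%    (commutative)
```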

How good are human annotators?

Using Normalized WER we can compute the pairwise agreement between all transcriptions that have been completed by multiple annotators. We can then average the pairwise agreement per annotator, giving us an approximate measure of each annotator’s individual accuracy. This is shown in the graph below. We see that the annotators range from a Normalized WER of 44.9% down to 17.6%, with the median and mean lying in the region of 23–24%.

The Normalized WER of each annotator. The average Normalized WER between annotators is 23.96%.
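A per-annotator figure like the one above could be produced along the following lines. This is a sketch that reuses the normalized_wer function from the earlier snippet and assumes a hypothetical list of (voicemail id, annotator id, transcription) records; the values are illustrative, not our actual data.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (voicemail_id, annotator_id, transcription).
annotations = [
    ("vm_001", "ann_a", "thank you the service was great"),
    ("vm_001", "ann_b", "thank you the service was grand"),
    ("vm_002", "ann_a", "please ask someone to call me back"),
    ("vm_002", "ann_c", "please ask somebody to phone me back"),
]

# Group transcriptions of the same voicemail together.
by_voicemail = defaultdict(list)
for vm_id, annotator, text in annotations:
    by_voicemail[vm_id].append((annotator, text))

# Accumulate pairwise Normalized WER per annotator.
scores = defaultdict(list)
for vm_id, entries in by_voicemail.items():
    for (ann1, text1), (ann2, text2) in combinations(entries, 2):
        disagreement = normalized_wer(text1, text2)  # symmetric, so order is irrelevant
        scores[ann1].append(disagreement)
        scores[ann2].append(disagreement)

# Average pairwise disagreement per annotator, as plotted in the figure.
for annotator, values in sorted(scores.items()):
    print(annotator, sum(values) / len(values))
```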

We can interpret the figure as showing an average disagreement of 2.3 words for every 10 words between annotators. This is very similar to what was found by researchers at Cambridge in this paper, who report an average error rate of 23.5% between three pairs of professional human transcription services.

It is worth noting that some of this error is not what we would classify as a true “mistake”. The annotations are cleaned before the WER calculation is performed; however, the cleaning is not perfect, so alternative or incorrect spellings of a word still occur, for example American versus British English spellings like color/colour and finalize/finalise.

It is also important to note that the inter-human errors are very different from those between humans and automatic speech-to-text services. Even though the error between humans is found to be around 24%, upon closer examination the transcriptions are usually semantically very similar. This is not the case for automatic speech-to-text services. Therefore, it is not fair to directly compare inter-annotator WERs with human vs automatic speech-to-text WERs.

Human vs AI — Amazon Transcribe

As mentioned previously, if one has access to a ground truth reference, measuring WER is a fairly trivial task. However, if you have non-expert human transcriptions that are not necessarily a perfect ground truth, the problem becomes more difficult. The Venn diagram below demonstrates the task. There is an overlap between the crowd-sourced human transcription and the ground truth “true” transcription. There is also an overlap between the Amazon Transcribe transcription and the ground truth “true” transcription. We would like to use the human transcriptions, which are approximations of the ground truth “true” transcription, to estimate the WER of Amazon Transcribe.

We would like to measure the error rate of Amazon Transcribe against the ground truth, but a ground truth is not available, so we need to estimate it using our flawed human transcriptions.

How do humans and Amazon Transcribe differ?

We approach the problem of estimating Amazon Transcribe’s error rate from two different angles. The first is simply to compare the WER between all our human-produced transcriptions and those of Amazon Transcribe. If we do this, we find that the WER is 25%. If we had a true ground truth reference, we would take this value as our final answer. However, we have already seen from the disagreement between different annotators that we clearly have some errors in our human transcriptions. This means the true error rate of Amazon Transcribe is likely to be lower than this.

The source of error between Amazon Transcribe and human annotators. In reality the true error rates are likely between the bounds given by Case 1 and Case 2.

To show why, consider the following two scenarios. In scenario 1, we assume the human annotations are 100% correct. In this case, the error of Amazon Transcribe would be the WER we measure, which is 25%. In scenario 2, we assume that Amazon Transcribe is 100% correct. Now, the source of the 25% WER would be from the human annotations. In reality, we know that neither the human annotations, nor Amazon Transcribe is 100% correct. Therefore, we can think of the measured WER of 25% as an upper bound on the WER of Amazon Transcribe, because in reality the source of the error will not be solely from Amazon Transcribe, but partially from the human error too. This argument is demonstrated in the figure above with examples.
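For the record, the upper-bound measurement itself is straightforward. Here is a sketch that reuses word_edit_distance from the earlier snippet and assumes hypothetical dictionaries mapping voicemail ids to Amazon Transcribe outputs and to human transcriptions; the real pipeline reads these from our annotation database and the Transcribe job results.

```python
# Hypothetical data: Amazon Transcribe output and human transcriptions per voicemail.
amazon_outputs = {
    "vm_001": "thank you the service was great",
    "vm_002": "please ask someone to call me back",
}
human_transcriptions = {
    "vm_001": ["thank you the service was great today"],
    "vm_002": ["please ask someone to phone me back",
               "please ask somebody to phone me back"],
}

total_edits = 0
total_reference_words = 0
for vm_id, amazon_text in amazon_outputs.items():
    for human_text in human_transcriptions[vm_id]:
        ref_words = human_text.split()    # human transcription treated as the reference
        hyp_words = amazon_text.split()   # Amazon Transcribe treated as the hypothesis
        total_edits += word_edit_distance(ref_words, hyp_words)
        total_reference_words += len(ref_words)

# Corpus-level WER: any human error also inflates this number,
# which is exactly why it is an upper bound on Amazon Transcribe's true error.
print(total_edits / total_reference_words)
```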

How many human corrections are required to correct Amazon Transcribe?

The second angle from which we approach the question of Amazon Transcribe’s error rate is to look at the number of edits a human annotator needs to make to correct Amazon Transcribe. The way we measured this was to have the annotators perform a slightly different transcription task to before. For everything discussed up until this point, annotators were given the voicemail audio and an empty text field where they could type the transcription. Now, instead of an empty text box, we gave the annotators the Amazon Transcribe transcription and asked them to correct it. This allows us to directly measure the number of edits (insertions, deletions, and substitutions) a human annotator makes to correct Amazon Transcribe. We refer to this task as transcription with pre-labelling.
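The per-operation counts can be recovered by adding a backtrace to the dynamic-programming table used for the edit distance. The sketch below is self-contained, and the example strings are made up.

```python
def edit_operations(ref_words, hyp_words):
    """Count substitutions, deletions and insertions in a minimal word-level alignment."""
    rows, cols = len(ref_words) + 1, len(hyp_words) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    # Walk back through the table to recover which operations were used.
    subs = dels = ins = 0
    i, j = len(ref_words), len(hyp_words)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref_words[i - 1] == hyp_words[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1                # match
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            subs += 1; i, j = i - 1, j - 1     # substitution
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1; i -= 1                  # deletion
        else:
            ins += 1; j -= 1                   # insertion
    return subs, dels, ins


# The corrected transcription is treated as the reference and the Amazon Transcribe
# pre-label as the hypothesis; the counts measure how much the annotator changed.
corrected = "please ask somebody to phone me back".split()
pre_label = "please ask someone to call me back".split()
print(edit_operations(corrected, pre_label))  # (2, 0, 0): two substitutions
```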

One issue with the pre-labelling approach is that the annotators are now biased towards the transcription they are given. We can imagine that if a word is difficult to hear, the annotator is likely to leave the Amazon Transcribe pre-label for that word unchanged. Furthermore, if we have annotators who are tired or perhaps a bit lazy (we all have those Friday afternoons…) they are likely to miss corrections that need to be made, or, perhaps more maliciously, they might leave the Amazon Transcribe output unchanged to get through their workload more quickly.

Consequently, the best way to measure the number of edits required to fix the transcription is not to measure the average WER, but rather to look at the annotators with the highest WER, because those annotators clearly found more corrections and were therefore likely putting the most effort into fixing the transcription. The same logic is often applied when one has multiple transcriptions of the same audio and would like to choose the best one. A simple heuristic is to pick the longest transcription, as the person who wrote the most is likely to have put the most effort into their transcription. This is discussed in this paper, where the heuristic is seen to work quite well in practice.
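As a sketch, the longest-transcription heuristic is a one-liner over whatever set of competing transcriptions is available (the strings below are hypothetical):

```python
transcriptions = [
    "thanks the service was great",
    "thank you the service was really great today",
    "thank you service great",
]

# Pick the transcription with the most words, on the assumption that the
# annotator who wrote the most put in the most effort.
best = max(transcriptions, key=lambda text: len(text.split()))
print(best)
```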

The bar plot below shows the WER of each annotator for the pre-labelled annotation task. We see that the annotators who made the most changes have an error rate of around 17%, which is equivalent to correcting 1.7 words for every 10 words in the Amazon Transcribe output. On the other end of the spectrum we see annotators with very low error rates, indicating that they didn’t make many changes to the Amazon Transcribe output.

The Normalized WER of each annotator for the pre-transcribed task.

In reality, the true WER is probably higher than the 17% suggested by the pre-labelling experiment. The first reason has already been mentioned: the annotators are biased towards using words “heard” by Amazon Transcribe, since they have access to its output. The second reason is that even the very best annotators are likely to sometimes miss corrections that need to be made. As a result, we can think of the 17% WER from the pre-labelling experiment as a lower bound on the true WER.

Putting it all together

We’ve seen that our human transcribers are not perfect, and as a result we don’t have a perfect ground truth transcription against which to measure the performance of Amazon Transcribe. To get around this problem we measured the error rate of Amazon Transcribe in two different ways. First, we measured Amazon Transcribe against our flawed human transcriptions to get an upper bound for the WER, which was found to be around 25%. Next, we gave our annotators the Amazon Transcribe outputs and asked them to correct them, to measure how many edits are required to fix Amazon Transcribe’s transcriptions. This gave us a lower bound for the WER of around 17%. Therefore, we can say that Amazon Transcribe has a WER of 17–25% on a large corpus of South African-accented English audio.

Acknowledgements

This work would not have been possible without the fantastic work done by everyone on the Tesserae team. Ashley Gritzman, our CTO, who supports and inspires us daily. Gilad Even-Tov and Jay Meyerowitz, the fantastic engineers behind our annotation platform. Kartik Mistry and Lutho Matiwane on the BizOps team, for coordinating our projects and annotation teams. And of course many thanks to our fearless leader Stu Iverson. Finally, a note of thanks to all our annotators for their hard work, and to Standard Bank for making this project possible.

About Us

Artificial intelligence will define the financial services sector within the next five years, but companies are struggling to adopt AI due to two core problems: access to quality data, and access to high-calibre talent. At Tesserae we’re building the only full-stack AI platform focused on the financial services industry, including a large-scale crowd-labelling platform, and a state-of-the-art AI model marketplace. Tesserae was started together with Standard Bank Group to help companies leverage their data while generating job opportunities across the African continent.

References

Cloud Compiled, “Transcription API Comparison: Google Speech-to-text, Amazon, Rev.ai”, Cloud Compiled Blog, 2020. [Online]. Available: https://cloudcompiled.com/blog/transcription-api-comparison/.

A. C. Morris, V. Maier and P. Green, “From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition,” in Proc. INTERSPEECH, Jeju Island, Korea, 2004. [Online]. Available: https://www.isca-speech.org/archive/archive_papers/interspeech_2004/i04_2765.pdf.

Y. Gaur, W. S. Lasecki, J. P. Bigham and F. Metze, “The effects of automatic speech recognition quality on human transcription latency,” in Proc. W4A, Montreal, Canada, 2016. [Online]. Available: https://www.cs.cmu.edu/~fmetze/interACT/Publications_files/publications/asr_threshold_w4a.pdf.

R. C. van Dalen, K. M. Knill, P. Tsiakoulis and M. J. F. Gales, “Improving multiple-crowd-sourced transcriptions using a speech recogniser,” in Proc. ICASSP, Queensland, Australia, 2015. [Online]. Available: https://www.repository.cam.ac.uk/bitstream/handle/1810/247607/van%20Dalen%20et%20al%202015%20Proceedings%20of%20the%20IEEE%20International%20Conference%20on%20Acoustics%2c%20Speech%20and%20Signal%20Processing%20%28ICASSP%29.pdf?sequence=1&isAllowed=y.
