Photo by Jacques Nel on Unsplash

Is Local Really Lekker when it comes to Transcription Models?

Testing different Amazon Transcribe models on a large dataset of South African English voicemails

Nick Wilkinson
7 min read · Feb 22, 2022


Towards the end of 2021, we published a blog titled “How accurate is Amazon Transcribe on South African English?”. In that blog, we did a deep dive into the accuracy of Amazon Transcribe on our crowd-labelled dataset, and concluded that the Word Error Rate (WER) was in the region of 17–25%. That work was done using Amazon’s US English transcription model, as there was no South African model available.

Since then, Amazon has released support for speech-to-text in 6 new languages, including South African English. This raised the obvious question: how much better is the new model on our South African English dataset? Surely a model trained on South African English would be better…?

In this blog we explore the accuracy of three different transcription models: the US English model and the new South African English model from Amazon Transcribe, as well as the US English model with a custom language model.

The Data

The dataset we use for these tests comprises a collection of voicemail messages to a call centre, in which customers are asked to provide feedback concerning the service they have just received. Speakers are either first or second language English speakers from South Africa. The total duration of the dataset is around 290 hours of audio, which is to our knowledge the largest corpus of transcribed South African English audio in existence.

An overview of the dataset statistics.

The dataset statistics are shown in the figure above. For these 47,000 voicemails, we collected approximately 63,500 annotations, transcribed by non-expert human annotators, with 15% of the voicemails being annotated multiple times to allow for quality control.

The annotation process was split into a number of distinct milestones, each with differing goals. For this blog, milestones 1 to 3, comprising 31,000 voicemails, were used. An overview of these milestones is provided in the figure below. Milestone 1 covered the first 1,000 voicemails, each annotated 3 times, allowing us to measure the accuracy of our annotators. For more info on how we measured annotation accuracy, refer to our previous blog. Milestones 2 and 3 scaled up the number of annotators and voicemails.

Breakdown of the annotation milestones.

The Experiments

Data Prep

To explore the accuracy of these models we need to settle on a test set. We decided to use the Milestone 1 data for this purpose, which would allow us to use the bulk of the data (Milestones 2 and 3) as a training set for our custom language model.

In order to use Milestone 1 as a test set, we prepare the data by removing any blank transcriptions and stripping special tokens (such as “[non eng]”, which was used to mark any non-English words encountered). Next, we choose the best of the three repeated transcriptions for each voicemail by selecting the longest one, using the number of words as a proxy for the amount of effort put into the transcription. This is a simple yet effective heuristic, as described in this paper.
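
As an illustrative sketch of this preparation step (the file format and field names below are hypothetical, not our actual pipeline), the cleaning and longest-transcription selection might look like this in Python:

```python
# A minimal sketch of the test-set preparation described above.
# File names and JSON fields are placeholders.
import json
import re
from collections import defaultdict

def clean(text: str) -> str:
    """Strip special tokens such as "[non eng]" and collapse whitespace."""
    text = re.sub(r"\[.*?\]", " ", text)
    return " ".join(text.split())

def best_transcription(candidates):
    """Pick the longest transcription, using word count as an effort proxy."""
    return max(candidates, key=lambda t: len(t.split()))

# annotations: list of {"voicemail_id": ..., "transcription": ...} dicts
with open("milestone1_annotations.json") as f:
    annotations = json.load(f)

by_voicemail = defaultdict(list)
for ann in annotations:
    text = clean(ann["transcription"])
    if text:  # drop blank transcriptions
        by_voicemail[ann["voicemail_id"]].append(text)

test_set = {vid: best_transcription(texts) for vid, texts in by_voicemail.items()}
```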

Custom Language Model Training

Once our data is prepared, we have a dataset to test how well our models perform, as well as a dataset to train new models. Next, we train a custom language model using the Milestone 2 and 3 training set data.

The model we train is a custom language model. Language models are statistical models that describe the probability distribution of a sequence of words. In speech-to-text systems they are used to find the most likely sequence of words, given the input audio. For example, a model that only uses acoustic information may transcribe a sentence as “the cat sat on the pat”. The addition of a language model would allow the system to infer, given that output, that a more likely sentence would be “the cat sat on the mat”. Custom language models, either trained from scratch or adapted from larger pre-trained models, allow speech-to-text systems to learn domain-specific language, making them more accurate in those specific domains. For example, a custom language model trained on sentences that relate to farming would allow a speech-to-text system to more accurately transcribe farming-related conversations.
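
To make this concrete, here is a toy sketch (not how Amazon Transcribe works internally) of rescoring two acoustic hypotheses with a simple bigram count model built from a tiny text corpus:

```python
# Toy illustration of language-model rescoring: the candidate whose word
# sequence is better supported by the text corpus wins.
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog lay on the mat",
    "a mat on the floor",
]
bigrams = Counter(
    (w1, w2)
    for sentence in corpus
    for w1, w2 in zip(sentence.split(), sentence.split()[1:])
)

def score(sentence: str) -> int:
    words = sentence.split()
    # count how often the sentence's bigrams were seen in the corpus
    return sum(bigrams[(w1, w2)] for w1, w2 in zip(words, words[1:]))

candidates = ["the cat sat on the pat", "the cat sat on the mat"]
print(max(candidates, key=score))  # "the cat sat on the mat"
```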

Custom language model training is a service provided by Amazon Transcribe. It allows one to bring a custom set of text data, which is used to adapt a base model from Amazon Transcribe into a new language model. Amazon recommends a minimum of 10,000 words of text to train a custom language model, and around 100,000 words of domain-specific text to see good improvements. Our dataset contains around 1,000,000 words of domain-specific text, comfortably exceeding both recommendations. We train the model using the narrowband base model, with the language set to US English. Unfortunately, custom language model training for South African English is not yet supported.
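
As a rough sketch of what this step looks like with the boto3 Transcribe client (the bucket, IAM role, and model names below are placeholders, not our actual setup):

```python
# Sketch of custom language model training via boto3 (AWS SDK for Python).
# S3 locations, IAM role and model name are placeholders.
import boto3

transcribe = boto3.client("transcribe", region_name="eu-west-1")

transcribe.create_language_model(
    LanguageCode="en-US",          # custom LMs are not yet supported for en-ZA
    BaseModelName="NarrowBand",    # telephone-quality audio
    ModelName="voicemail-clm",
    InputDataConfig={
        "S3Uri": "s3://my-bucket/clm-training-text/",  # Milestone 2 & 3 text
        "DataAccessRoleArn": "arn:aws:iam::123456789012:role/TranscribeS3Access",
    },
)

# Training is asynchronous; poll until the model is ready.
status = transcribe.describe_language_model(ModelName="voicemail-clm")
print(status["LanguageModel"]["ModelStatus"])  # IN_PROGRESS -> COMPLETED
```

Once the model is ready, a transcription job can reference it through the ModelSettings parameter, while the South African English model is selected simply by using the en-ZA language code. A sketch of the three configurations we compare (again with placeholder names):

```python
media = {"MediaFileUri": "s3://my-bucket/audio/voicemail-0001.wav"}

# 1. Default US English model
transcribe.start_transcription_job(
    TranscriptionJobName="vm-0001-en-us",
    LanguageCode="en-US",
    Media=media,
)

# 2. South African English model
transcribe.start_transcription_job(
    TranscriptionJobName="vm-0001-en-za",
    LanguageCode="en-ZA",
    Media=media,
)

# 3. US English model with the custom language model trained above
transcribe.start_transcription_job(
    TranscriptionJobName="vm-0001-custom-lm",
    LanguageCode="en-US",
    Media=media,
    ModelSettings={"LanguageModelName": "voicemail-clm"},
)
```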

Findings

In order to compare the models, we use a metric known as Word Error Rate (WER). WER is the number of errors in the hypothesis transcription (in our case generated by Amazon Transcribe) divided by the number of words in the reference transcription (generated by the human annotator). One can interpret the WER as the amount by which the transcriptions differ. If the transcriptions match completely, the WER will be 0, and if they are vastly different the WER will be very high. The actual calculation of WER is shown in the figure below, with an example.

The formula for WER, with an example.
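
For those who want to compute the metric themselves, here is a minimal word-level edit-distance implementation of WER (libraries such as jiwer provide the same calculation):

```python
# Minimal, self-contained WER: word-level edit distance divided by the
# number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on pat"))  # 2 errors / 6 words ≈ 0.33
```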

Using this metric we can compute the WER between each of the three models and the cleaned Milestone 1 test dataset described previously. The results are shown in the figure below.

WER comparison between the three models.

The first noteworthy result is that our Tesserae custom language model shows a relative improvement of 11.3% over the default US English Amazon Transcribe model. This result is somewhat expected, as the custom language model is trained on transcriptions from the same domain as the test data. Still, it is impressive that text-only data can provide such an improvement, without any change to the underlying acoustic model.

The more surprising result is that the South African English Amazon Transcribe model performs 17.1% worse relative to the US English Amazon Transcribe model. There are a few possible reasons for this, depending on how the model was trained.

The first possibility is that the South African Amazon Transcribe model was trained from scratch, with a corpus of South African English. In this scenario, it is possible that the dataset was not large enough, or not diverse enough. There are a wide variety of South African English accents, and due to our data being from call centres, it contains most, if not all, of those accents.

Being the original and most used English model for Amazon Transcribe, the US English model was likely trained on a much larger dataset than the South African English model. It would, therefore, have a very robust acoustic model, trained on many hours of audio from speakers with different American accents. And so, despite not being specifically trained on South African English, the combination of a robust acoustic model and a highly accurate language model (which tends to be more region-independent than an acoustic model) may allow this model to generalise better to a diverse set of speakers than the South African English model.

The other possibility is that the South African Amazon Transcribe model was fine-tuned from the US English model. In this scenario, the degradation could be caused by the fine-tuning overfitting the training data. Consequently, the base (US) model would likely generalise better than an overfitted fine-tuned version of that same model.

It must be noted, however, that the exact details of the training data and underlying models used by Amazon Transcribe are not publicly available, and so these hypotheses should be taken with a pinch of salt. (PS. If anyone has any details of how this model was trained, please do share!)

Also, if your company would like to find out how our custom transcribe model performs on your data, please get in touch with us at nick@tesserae.co.

Finally, in the figure below there are some examples of each model’s output transcription for one voicemail:

Examples of transcriptions from the three models, as well as the human annotation.

Conclusion

We have shown that training a custom language model can provide significant performance improvements for transcription using AWS. Furthermore, we have shown that using the South African English transcription model is not necessarily the best choice for transcribing South African English. In cases such as ours, even if one were not able to train a custom language model, one would still see better results using the default US English model from Amazon Transcribe.

For other English variants recently added to Amazon Transcribe, such as Australian and New Zealand English, it may be worth running a similar experiment before adopting the new models, to check whether they truly are more accurate than the US English model on your specific data.

About Us

At Tesserae we’re building the only full-stack AI platform focused on the financial services industry, including a large-scale crowd-labelling platform, and a state-of-the-art AI model marketplace. Tesserae was started together with the Standard Bank Group to help companies leverage their data while generating job opportunities across the African continent.
