Using Automated Speech Recognition to Support Human Transcriptions

Eoin Watts
Aigent
Oct 8, 2020

In early 2020 Aigent built a Speech Model with a Word Error Rate (WER) of 9%. For those new to Speech Recognition, WER is a common metric for the accuracy of speech model transcriptions: it measures the proportion of words that are substituted, deleted, or inserted relative to a reference transcript, so lower is better. A 9% WER is particularly impressive given that the model is trained on data from a notoriously tricky domain: call center conversations between US consumers and customer service agents from the Philippines.
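For readers who want to see the metric in code, here is a minimal sketch of a WER calculation in Python. It computes the word-level edit distance (substitutions, insertions, deletions) between a reference and a hypothesis transcript; the example sentences are purely illustrative and not taken from our data.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Illustrative example: one missing word out of five -> WER of 0.2 (20%).
print(wer("thank you for calling today", "thank you for calling"))
```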

To get the quality of transcriptions required to train such a model we built an extremely resource-heavy Transcription Process. For a transcription to be defined as Gold Standard, we required that at least four human transcribers came into contact with it. Such a resource-heavy process was an option for us because we have a huge Data Acquisition Team in the Philippines (see here for more information on how we set this team up).

Now, we want to streamline this entire process and utilize our state-of-the-art Speech Model to support our Data Acquisition Team. Ultimately, we hope to reduce the 30.5 active human working hours it takes for one hour of audio to pass through the entire process. This will not only allow us to get transcriptions in a more timely manner, but also free up our human resources to do a wider range of data acquisition tasks, from labelling text to annotating audio.

The Plan

Our current time-consuming Transcription Process looks as follows:

Image 1: Aigent’s current Transcription Process Flow

Audio (in our case complete two-channel phone conversations) is passed through this process from left to right.

  • First, the audio is manually labelled in our Audio Labelling Tool, with unusable audio weeded out and other labels such as emotion and speaking rate added.
  • Next, the audio is split into sentence-sized segments by an automated tool.
  • The audio is then routed to a human transcriber who transcribes the audio from scratch.
  • To avoid bias, the very same audio is then transcribed again from scratch by a different transcriber.
  • A human Quality Assurance (QA) round is then completed. An experienced transcriber is presented with both of the previous rounds’ transcriptions and decides which is correct, or makes adjustments where both are wrong.
  • A final human QA verification round is completed before the audio can be classified as Gold Standard.
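For those who prefer code to diagrams, the flow above can be sketched as an ordered list of stages. The names below are illustrative only and are not the identifiers used in our internal tooling.

```python
from enum import Enum, auto

class Stage(Enum):
    """Illustrative stages of the current Transcription Process."""
    AUDIO_LABELLING = auto()        # manual labelling, unusable audio weeded out
    SEGMENTATION = auto()           # automated split into sentence-sized segments
    TRANSCRIPTION_ROUND_1 = auto()  # human transcribes from scratch
    TRANSCRIPTION_ROUND_2 = auto()  # second human transcribes from scratch
    QA_ROUND_1 = auto()             # experienced transcriber reconciles both rounds
    QA_ROUND_2 = auto()             # final verification -> Gold Standard

PIPELINE = list(Stage)  # audio passes through the stages in this order
```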

For this initial investigation into utilizing Automated Speech Recognition (ASR) to support our Transcription Process, we decided to simply replace one of the most resource-heavy initial transcription rounds (Transcription Round 2) with an automated transcription completed by our Speech Model.

To give our model the ultimate test, we created a fifteen-hour dataset containing audio recordings from a call center program that our Speech Model has not been trained on. Within ten days we had our results.

To evaluate the success of replacing a human transcription round we accounted for a number of quantitative and qualitative metrics. They are as follows:

Transcription Speed

  • How is the speed of the overall process affected?
  • How is the speed of the specific stages affected?

Transcription Quality

  • What is the WER between QA Round 1 and the ASR Transcription Round? How much worse is this WER than that of a human round?
  • Is the QA Round 1 quality affected? (this is determined by calculating the WER between QA Round 1 and QA Round 2).

User Observations

  • What do the impacted users (particularly the QA Round 1 Team) think of the change? For example, do they use the ASR transcriptions, or do they ignore them and focus on the human round?

Transcription Speed

We were certain going into this task that, by substituting one of the two human transcription rounds with a transcription completed by our ASR model, we would see a significant decrease in the overall time taken for audio to be transcribed.

We were, however, less sure about the impact of this on the QA rounds, particularly the first QA round. Would introducing a non-human transcription round make these subsequent QA rounds more difficult and thus increase the time taken to QA?

The below table outlines the time taken for one hour of audio to pass through each of the 5 stages. In the Normal Transcription Process (1 hour) column you can see how long it takes for 1 hour of audio to be transcribed using just human labor. In the ASR Supported Transcription Process (1 hour) column you can see how long it took for 1 hour of audio to be transcribed after incorporating ASR transcriptions.

Table 1: Transcription Speed

The results are conclusive and clearly show that by incorporating ASR transcriptions into the process, Aigent will significantly reduce the time taken for audio to pass from non-transcribed to Gold Standard. For each hour of audio we would save over nine hours of manual human work.

However, it should be noted that the time taken to transcribe in QA Round 1 does increase substantially after introducing ASR transcriptions. This is partly because the first QA stage is now harder and requires more effort from these users, who can rely less upon a second human transcription round. The spike can probably also be partly explained by the QA Team making observations about the ASR quality while working through this dataset. In all likelihood, as the QA Team adapts to this change, the time spent will converge towards that of the normal Transcription Process.

Transcription Quality

When evaluating the benefits of incorporating ASR transcriptions into the Transcription Process, it is vital to take into account the impact on transcription quality, i.e. the WER of transcriptions. For instance, if the overall speed of the Transcription Process increases but we also witness a significant dip in overall transcription quality, then the costs of implementing such a change may outweigh the benefits.

ASR Transcription Quality

We calculate the quality of the transcriptions by comparing the ASR Transcription Round and the Human Transcription Round to the first human Quality Assurance Round.
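As a rough illustration, such a comparison could be aggregated across calls along the following lines, assuming the wer function sketched earlier and hypothetical {call_id: transcript} mappings for each round.

```python
from statistics import mean

def round_quality(reference_by_call: dict, round_by_call: dict) -> dict:
    """Aggregate per-call WER of one transcription round against the QA Round 1 reference.

    Both arguments are hypothetical {call_id: transcript} mappings; `wer` is the
    word error rate function sketched earlier in this post.
    """
    scores = [wer(reference_by_call[call_id], transcript)
              for call_id, transcript in round_by_call.items()]
    return {"avg_wer": mean(scores), "max_wer": max(scores), "min_wer": min(scores)}

# Usage: compare the human round and the ASR round against QA Round 1.
# human_stats = round_quality(qa1_transcripts, human_round_transcripts)
# asr_stats = round_quality(qa1_transcripts, asr_round_transcripts)
```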

The below table displays the WER percentage; the lower the percentage, the more accurate the transcription. For this 15-hour dataset of 250 calls, you can see the Average WER for a human transcriber and for the ASR round. You can also see the Max WER and Min WER to get a grasp of the quality range within and between the two rounds.

Table 2: Transcription Quality (compared to Quality Assurance Round)

In line with expectations, the ASR transcriptions have on average roughly three times the WER of human transcriptions. The accuracy range (difference between Max and Min WER) of the ASR transcriptions is also much wider than for human transcribers. This is because when a human makes a mistake, it is generally a word here or there; the ASR, on the other hand, although often accurate, can get whole speech segments wrong.

Other interesting observations in terms of the quality difference between the human and ASR transcriptions are that:

  • on one occasion the ASR Transcription Round had a lower WER than the Human Transcription Round (12% vs 18%).
  • the average difference between the WER scores of the ASR and human transcriptions is 16.5%. Normally, the WER difference between the Round 1 and Round 2 transcriptions is closer to 4%.

QA Round 1 Transcription Quality

As mentioned, we expected the ASR transcription quality to be significantly worse than that of a trained human transcriber, particularly as the Speech Model has not been trained on this audio. However, we were unsure how strong the impact of introducing this new round would be on the quality of the first QA Round. It is certainly possible that the added distraction of this new, less accurate and non-human round would negatively impact the quality of the first QA transcription (which is measured by calculating the WER between this round and the second QA Round).

To evaluate this, we compared the WER between QA Round 1 and QA Round 2 for a normal 15-hour dataset against the WER between QA Round 1 and QA Round 2 for the dataset which included ASR transcriptions.
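As a rough sketch, that comparison could look like the following; the dataset structure and the choice of QA Round 2 as the reference are assumptions made for illustration.

```python
from statistics import mean

def qa1_quality_shift(normal_dataset: dict, asr_supported_dataset: dict) -> float:
    """Change in average QA Round 1 vs QA Round 2 WER after introducing ASR.

    Each argument is a hypothetical {call_id: (qa1_transcript, qa2_transcript)}
    mapping; `wer` is the function sketched earlier, with QA Round 2 treated as
    the reference. A positive result would mean QA Round 1 quality dropped.
    """
    def avg_wer(dataset: dict) -> float:
        return mean(wer(qa2, qa1) for qa1, qa2 in dataset.values())
    return avg_wer(asr_supported_dataset) - avg_wer(normal_dataset)
```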

Table 3: QA Round 1 Quality (compared to Quality Assurance Round 2)

The above table shows that the quality output of the first QA Round does not seem to be particularly impacted by the introduction of the ASR Transcription Round. The Average WER does increase slightly, but the difference is negligible. Also, the Max WER is actually lower for the ASR-supported dataset, and in both datasets the Min WER is 0% (i.e. the QA Round 2 transcription is identical to the QA Round 1 transcription).

User Observations

Lastly, it’s important to look at the observations that the users most affected by this change — the QA Round 1 Team — noted while correcting the transcriptions.

As the higher WER would imply, the team pointed out that they were much less likely to select the ASR transcription than the human one. They reported that for recordings with minimal background noise and clear speech, the ASR transcription was more accurate and thus more helpful. But when the audio became noisier or the speaker had a stronger accent, the Speech Model struggled. On a number of these occasions the Speech Model produced truncated transcriptions, missing large parts of speaker turns.

As a result, a number of the QA Round 1 Team admitted that they lost confidence in the ASR transcriptions. Many began ignoring this round’s transcription altogether, making the conscious decision to focus on the human round and their own judgement. Our user behavior metrics confirmed that users were considerably more likely to click on the human transcription round and correct it, even when both rounds contained similar transcriptions.

Next Steps

By simply examining the hard facts from the above investigation, it would seem obvious that permanently replacing one of the resource-heavy human transcription rounds with an ASR transcription round is the logical next step. Such a move would shave over nine hours of human labor from the overall transcription time with little detriment to the overall transcription quality.

But these quantitative metrics fail to tell the full story. The observations from our QA Team suggest that these improvements are more a consequence of removing a human transcription round than a positive consequence of adding an ASR transcription round.

Due to these observations we have decided to investigate a further shake-up of our Transcription Process. Rather than directly replacing a human transcription with an ASR transcription, we want to stop manually transcribing audio from scratch and instead have a human correct an initial ASR transcription. Image 2 outlines how this future flow could look.

Image 2: Aigent’s potential Transcription Process Flow

We expect such a change to have a profoundly greater impact on our Transcription Process, further reducing the time it takes for Aigent’s Speech Team to get the vast amount of transcriptions required to train the model.

Nevertheless, there are certainly risks that such a method will cement and exacerbate the model’s inherent biases. Transcribers may subconsciously fail to correct some of the model’s errors, which will then be fed back into the model through its training data.

We look forward to finding out and sharing the results with you!

Data Acquisition Officer. Aigent. Amsterdam, by way of Manchester.