Udemy’s speech-to-text vendor evaluation

Martin Bachwerk
Udemy Tech Blog
Aug 24, 2022

Introduction

As a predominantly video-based learning platform, Udemy has always invested in making education as accessible as possible for all of our learners and instructors around the globe. In 2017, as part of our commitment to accessibility, we identified subtitles on all of our videos as a must if we are to succeed in supporting students with hearing impairments.

Having evaluated several speech-to-text vendors, we identified one as providing the best quality and soon rolled out automatically generated subtitles for all published English courses on our platform. Two years later, we extended auto-subtitle generation to all Spanish and Portuguese courses. This year, we are keeping up the momentum by adding four more languages to the mix, transcribing most courses in French, German, Italian and Japanese.

Of course, all of these additional languages are pointless if the automatically generated subtitles are of poor quality. So how do we evaluate different vendors to ensure the best possible user experience for our learners? Read on to find out!

Evaluation approach

The most common approach to assessing an Automatic Speech Recognition (ASR) system is to compute the word error rate (WER) between the reference transcription of an audio stream and the candidate transcription generated by the ASR system. WER is derived from the Levenshtein (edit) distance and is a good general measure of how close the output is to the expected, correct transcript.

In short, the WER is the number of substitutions, insertions and deletions needed to turn the candidate transcript into the reference, divided by the number of words in the reference: WER = (S + I + D) / N. For example, if “This is an example text” is transcribed as “This is and my example text”, we have one substitution (“an” → “and”) and one insertion (“my”), so the WER is 2/5 = 0.4. Ideally, we want a WER of 0, i.e. a perfect transcript, which is entirely possible!
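To make the definition concrete, here is a minimal Python sketch of a WER calculation based on the Levenshtein distance. The function names are purely illustrative and this is not our production tooling; open-source libraries such as jiwer implement the same metric.

```python
# Minimal WER sketch: Levenshtein distance over words, divided by the
# number of words in the reference. Illustrative only, not Udemy tooling.

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Minimum number of substitutions, insertions and deletions
    needed to turn one token sequence into the other."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref_words = reference.lower().split()
    return edit_distance(ref_words, hypothesis.lower().split()) / len(ref_words)


print(wer("This is an example text", "This is and my example text"))  # 0.4
```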

Since WER works at a very high level and treats every word equally, it does have a few limitations. The main one is its inability to penalize mistakes on content words (e.g. “compiler”, “management”, “corporate law”, etc.) more heavily than mistakes on function words (e.g. “this”, “they”, “at”, etc.). For further reading on evaluating ASR systems, this article from AWS is highly recommended as a detailed and accessible resource.

In our evaluation, we also explore a character error rate (CER), which applies the same approach as WER but at the character level. The potential advantage of CER over WER is that it down-weights small mistakes that a learner’s brain would easily correct (e.g. for “their” vs “there”, the WER is 1.0 while the CER is 0.4, which is arguably more representative). We also explore a punctuation error rate (PER), but a challenge with punctuation is that there is no definitive, accurate reference, since punctuation is not explicitly spoken.
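The same contrast can be reproduced with the open-source jiwer library (assuming a recent version, which ships both wer and cer helpers); again, this is purely illustrative and says nothing about our internal tooling.

```python
# Cross-checking the "their" vs "there" example with jiwer (pip install jiwer).
import jiwer

print(jiwer.wer("their", "there"))  # 1.0 - the only word is wrong
print(jiwer.cer("their", "there"))  # 0.4 - two of five characters differ
```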

For our most recent evaluation, we randomly selected a total of 70 lectures from our courses, totaling over 4.5 hours in length. Half of these came from our “Tech” category and the other half from our “Business” category. The videos from these lectures were transcribed by our localization vendor to serve as the reference transcriptions.

We then generated hypothesis transcriptions by sending the same videos to APIs provided by five different products and calculating the error rates for each individual lecture. The evaluated products were: AWS Transcribe, Deepgram, Google Speech-to-text (using the premium, “video” model), Rev.ai and Speechmatics (using both their standard and enhanced models). All transcription requests had punctuation enabled where supported by the product.
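As a rough illustration of the scoring step, here is a hedged sketch that assumes the reference and vendor transcripts have already been fetched and saved to disk; the directory layout, vendor keys and use of jiwer are all assumptions made for the example, not our actual pipeline.

```python
# Hypothetical layout: references/<lecture_id>.txt holds the human reference,
# transcripts/<vendor>/<lecture_id>.txt holds that vendor's ASR output.
from pathlib import Path

import jiwer

VENDORS = [
    "aws_transcribe",
    "deepgram",
    "google_video",
    "rev_ai",
    "speechmatics_standard",
    "speechmatics_enhanced",
]


def score_vendor(vendor: str) -> dict[str, float]:
    """Return a mapping of lecture id -> WER for one vendor."""
    scores = {}
    for ref_path in sorted(Path("references").glob("*.txt")):
        hyp_path = Path("transcripts") / vendor / ref_path.name
        scores[ref_path.stem] = jiwer.wer(ref_path.read_text(), hyp_path.read_text())
    return scores


# One distribution of per-lecture error rates per vendor, ready to plot.
per_lecture_wer = {vendor: score_vendor(vendor) for vendor in VENDORS}
```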

Results

Now to the part of the article that everyone has really been waiting for: the evaluation results! The graphs below use boxplots, mainly because they can represent the spread of a metric in a very compact way. If you are not familiar with boxplots, here is a quick refresher.

The gist of it is that the box represents the range between the first (Q1) and third (Q3) quartiles of the data, with the solid line in the middle of the box being the median and the dotted line the mean for the given data set. Visualizing the whole data set this way lets us see not just the average but also the best and worst cases, with the worst case being particularly important: a really bad transcription would likely be very detrimental to the end user’s experience.
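For reference, matplotlib can produce this style of boxplot with its showmeans and meanline options. The sketch below reuses the hypothetical per_lecture_wer dictionary from the scoring sketch above; it is not the code behind the actual figures.

```python
# Boxplot sketch: box = Q1..Q3, solid line = median, dashed line = mean.
import matplotlib.pyplot as plt

labels = list(per_lecture_wer)                                # vendor names
data = [list(s.values()) for s in per_lecture_wer.values()]   # per-lecture WERs

fig, ax = plt.subplots(figsize=(10, 4))
ax.boxplot(data, labels=labels, showmeans=True, meanline=True)
ax.set_ylabel("Word error rate (per lecture)")
plt.setp(ax.get_xticklabels(), rotation=30, ha="right")
fig.tight_layout()
plt.show()
```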

In the first figure, we look at the overall WER for the five vendors, and it is immediately clear that the models perform with significantly different degrees of accuracy. Leading the pack is the Speechmatics enhanced model, with a Q1 of 0.05, a median of 0.074 and a Q3 of 0.106. None of the other vendors has a median below 0.10; in fact, the Q1 of every other model is above the Speechmatics enhanced model’s median, which is a substantial gap.

The results look a little different when broken down by videos from technology and business category courses. The first striking observation is that business lectures fare significantly better than tech ones, with word error rates for the latter nearly twice as high. This is likely due to the larger amount of jargon and technical terminology in tech courses, but may also be caused by a more diverse instructor (accent) distribution, which is something we need to research further.

Looking at the character error rate, the overall picture does not change significantly: the Speechmatics enhanced and standard models are still clearly leading the way. The gap to the other models is smaller than for WER, however, which potentially implies that the other models do reasonably well at capturing the overall phonetics of the audio stream but do not convert those phonetic sequences into exactly the right words as reliably.

Finally, we take a brief look at the punctuation error rate, remembering the caveat outlined above about the challenges in punctuation evaluation due to a lack of a definitive correct reference. In effect, the graph below shows that four of the models are in the 0.3–0.35 error rate range when compared to the punctuation provided by human transcribers. Only AWS appears to be significantly worse off when compared to the rest of the models, with a median error rate above 0.4.

Given the challenges of assessing punctuation reliably, we did not put as much weight on this part of the analysis in our final vendor selection. For anyone to whom punctuation is of high importance, we would recommend a manual assessment by a linguistic vendor to get a better sense of quality here.

Conclusion

Selecting a vendor for your needs involves more than “just” looking at how well they execute the primary task at hand. There is obviously the cost (prices for the products in this evaluation are publicly available through the links above), which can limit your options depending on the budget at your disposal. There is the turnaround time, especially if you are looking for near-real-time captioning. And then there is the level of customer support that each vendor provides.

When it comes to word error rates, it appears from the above that our vendor of choice — Speechmatics — is indeed performing quite strongly, which in turn hopefully results in satisfied students and instructors, even if no speech-to-text system is quite perfect. Having said that, if you look at the evolution of word error rates over the past ten years or so, these have been falling steadily across a variety of models, from 0.15 a decade ago, to closer to 0.05 today.

The question, then, is what will differentiate ASR vendors in the near future, when word error rates alone will almost certainly converge to the point where there is little to no significant difference between the various products. A recently published article on “The History of Speech Recognition to the Year 2030” points to the advent of semi- and even self-supervised models, as well as a greater reliance on personalization.

All of the vendors in this evaluation already support so-called custom dictionaries, which make it possible to adapt their models to the relevant context. Given that the ability to contextualize is essential for technical videos, this will become an even greater focal point for future development.

Thanks for reading about the vendor selection process at Udemy! Though this blog post focused on our automatic video transcription, it demonstrates how we are always looking for ways to improve access to education and learning outcomes for people around the globe. If you want to be part of a team that puts learning outcomes first, check out our open positions at about.udemy.com/careers.
