Find your Best Solution: Speech to Text

Matthew Leyburn
Kainos Applied Innovation
7 min read · Oct 10, 2019

If you want to test your own audio samples as described in this article, check out the testing tool on GitHub.

Adding speech capabilities to a business or a product is a popular and beneficial move these days, but how do you choose the best service for your use case? This is the problem I've been trying to solve.

Speech to text has certainly found its home in the world of technology. We can translate conversations into any language straight from our smartphones and even control our homes from a smart fridge, all with the power of voice.

[Embedded tweet: a joke about tweeting from a smart fridge]

Over the past few weeks, I have been creating a tool to easily transcribe and test speech to text services. It basically automates the processes described in this article! This has helped our team determine which service performs best in various use cases.

Currently, this tool supports Amazon Transcribe, Google Cloud Speech-to-Text and Microsoft Speech Services.

The tool transcribes audio using the default speech recognition model of each service, although during testing I also used Google's enhanced 'video' model due to inconsistencies with Google's default model, which I'll talk about later.

Batch processing is also supported, which automates tests for multiple audio files.

This post will outline how I did it, the results I generated and how you can test your own samples.

Let’s get started 👏

Process

  1. Testing structure
  2. Testing samples
  3. Transcribe
  4. Compare and calculate
  5. Analyse results

1. Testing structure

I first set out to define a testing structure.

It's difficult to define accuracy when it comes to speech to text. There are dozens of factors to take into account, like accents, background noise, context and even the equipment used to record. With many services you can also train your own model to improve accuracy. That being said, I've chosen to test with Word Error Rate (WER). WER has its pros and cons, but it does provide a baseline accuracy metric for general use, covering a wide range of use cases.

Word Error Rate (WER) is a method of measuring the performance of automatic speech recognition (ASR). It compares a known-correct transcript (the reference) with the text produced by a speech-to-text service (the hypothesis).

Here’s an example:

reference: Mary had a little lamb, its fleece was white as snow

hypothesis: Larry had a clam, its crease was as white as snow

In this case, there are 3 substitutions (Larry, clam, crease), 1 deletion (little) and 1 insertion (as).

WER is the total number of substitutions, deletions and insertions divided by the number of words in the reference, so here it's (3 + 1 + 1) / 11 = 0.4545… or roughly 45.5%.
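To make the calculation concrete, here's a minimal sketch of a word-level WER function in TypeScript. It's my own illustration rather than the testing tool's code, and it reproduces the example above using a standard edit-distance calculation over words.

```typescript
// Minimal WER sketch: word-level edit distance divided by reference length.
// Illustration only; the testing tool itself uses the 'word-error-rate' npm package.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().replace(/[^\w\s']/g, "").split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().replace(/[^\w\s']/g, "").split(/\s+/).filter(Boolean);

  // dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
  const dp: number[][] = Array.from({ length: ref.length + 1 }, () =>
    new Array<number>(hyp.length + 1).fill(0)
  );
  for (let i = 0; i <= ref.length; i++) dp[i][0] = i; // i deletions
  for (let j = 0; j <= hyp.length; j++) dp[0][j] = j; // j insertions

  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,               // deletion
        dp[i][j - 1] + 1,               // insertion
        dp[i - 1][j - 1] + substitution // substitution (or match)
      );
    }
  }
  return dp[ref.length][hyp.length] / ref.length;
}

const wer = wordErrorRate(
  "Mary had a little lamb, its fleece was white as snow",
  "Larry had a clam, its crease was as white as snow"
);
console.log(wer.toFixed(4)); // 0.4545
```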

2. Testing samples

Through using several speech to text services and researching a wide range of use cases, I've decided to use 4 datasets to test against. I feel these datasets cover a broad scope of use cases for speech to text.

The testing samples are:

  • Accents, 36 audio recordings of speakers with different accents from around the world, each reading the same text (samples taken from speech accent archive)
  • Same as above only with heavy background noise added to all samples
  • Unscripted dialogue, like podcasts, interviews and phone calls
  • Monty Python clips (a true test of one’s voice recognition technology and also a great test of low-quality audio recordings with sound effects and music)

Each service supports many audio formats and encodings. To keep results consistent, I ensured all audio files had a 16kHz sample rate, 16-bit resolution and a single (mono) audio channel.
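If your samples are in a different format, one way to convert them (a sketch, assuming ffmpeg is installed and on your PATH; the file names are placeholders) is to call ffmpeg from Node:

```typescript
import { execFileSync } from "child_process";

// Convert an arbitrary audio file to 16kHz, 16-bit PCM, mono WAV.
// "input.mp3" and "output.wav" are placeholder file names.
execFileSync("ffmpeg", [
  "-i", "input.mp3",
  "-ar", "16000",         // sample rate: 16kHz
  "-ac", "1",             // channels: mono
  "-acodec", "pcm_s16le", // 16-bit signed PCM
  "output.wav",
]);
```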

3. Transcribe

First, we need a reference (something to compare against). In our case, this is the original transcript of the audio file we are testing against. Then we transcribe our audio file. The resulting transcribed text is our hypothesis.

Before comparing the reference and hypothesis, we need to make sure both transcripts are consistent with each other.

For instance,

  • The format of numbers: digits (1, 12, 64) or words (one, twelve, sixty-four)
  • Stylistic differences like acronyms ("AT and T" or "AT&T"), contractions ("they are" or "they're") or shortened word forms ("street" or "st.")

The testing tool ensures that number formats, punctuation, acronyms and contractions are consistent in both transcripts before comparing. For best results, stylistic differences like "street" and "st." should be identified and changed in the reference so that they match the speech to text service's transcription style.
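As an illustration of the kind of normalisation involved (not the testing tool's actual implementation), here's a small sketch that lower-cases text, expands a few contractions and strips punctuation. The contraction map is just an example and would need extending for real use.

```typescript
// Hypothetical normalisation helper for aligning reference and hypothesis text.
// The contraction map below is illustrative, not exhaustive.
const CONTRACTIONS: Record<string, string> = {
  "they're": "they are",
  "it's": "it is",
  "don't": "do not",
  "can't": "cannot",
};

function normalise(text: string): string {
  let result = text.toLowerCase();

  // Expand contractions so "they're" and "they are" compare as equal.
  for (const [short, long] of Object.entries(CONTRACTIONS)) {
    result = result.split(short).join(long);
  }

  // Strip punctuation (keep letters, digits and whitespace) and collapse spaces.
  return result.replace(/[^\w\s]/g, " ").replace(/\s+/g, " ").trim();
}

console.log(normalise("They're at 123 Main St."));
// "they are at 123 main st"
```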

Another point to consider is that words the transcription service finds uncommon, like names or brands, can often result in a word error. Keep in mind that these words can usually be added to a transcription service's vocabulary manually.

4. Compare and calculate

Comparing two pieces of text can be done in many ways, from online comparators to premium software. The testing tool uses an npm package called 'word-error-rate'. It's lightweight and provides reliable results when comparing text and calculating WER. The results for each audio file are stored in a table and a CSV file.

You can automate tests of multiple audio files with the ‘batch’ feature of the testing tool. It will then calculate averages once the tests are complete.
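To give a feel for what a batch run involves (this is a sketch of the general idea, not the testing tool's code), the loop below reads a folder of audio files, calls whatever transcription function you plug in, scores each result with the wordErrorRate function sketched earlier and writes the rows, plus an average, to a CSV:

```typescript
import * as fs from "fs";
import * as path from "path";

// Assumes wordErrorRate() from the earlier WER sketch is in scope.
// `transcribe` is a placeholder for a call to Amazon Transcribe,
// Google Cloud Speech-to-Text or Microsoft Speech Services.
async function runBatch(
  audioDir: string,
  referenceDir: string,
  outCsv: string,
  transcribe: (audioPath: string) => Promise<string>
): Promise<void> {
  const rows = ["file,wer"];
  let total = 0;
  const files = fs.readdirSync(audioDir).filter((f) => f.endsWith(".wav"));

  for (const file of files) {
    // The reference transcript is assumed to live alongside the audio as <name>.txt
    const reference = fs.readFileSync(
      path.join(referenceDir, file.replace(".wav", ".txt")),
      "utf8"
    );
    const hypothesis = await transcribe(path.join(audioDir, file));
    const wer = wordErrorRate(reference, hypothesis);
    total += wer;
    rows.push(`${file},${wer.toFixed(4)}`);
  }

  rows.push(`average,${(total / files.length).toFixed(4)}`);
  fs.writeFileSync(outCsv, rows.join("\n"));
}
```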

5. Analyse results

After the tests are run, results can be viewed in the table generated by the testing tool, or you can extract them from the CSV file to analyse further. The original transcript and the transcribed text are both stored in the CSV.

Results are in…

🏆 Standings

  1. Microsoft Speech Services speech to text 🥇
  2. Amazon Transcribe 🥈
  3. Google Cloud Speech-to-Text API 🥉

A slight disclaimer: as you can see in the results, Google finished last. This is largely because Google will often ignore pieces of audio and not transcribe them at all. I kept these cases in my results, as the purpose of these tests is to cover a variety of real-world use cases, not just raw transcription accuracy.

Here are the overall results I generated. If you’re interested in seeing the actual transcripts, you can see the accents dataset detailed results below.

Word error rate results

Findings

There are lots of other features that the Amazon, Google and Microsoft speech to text services provide which I haven't touched on, like speaker detection, custom vocabularies and model training. These can help improve accuracy and performance, but my findings are based solely on my own results and testing.

I've summarised the findings from my results with a nifty little rating system. It has three factors: price, speed and accuracy, with 10 being best.

Amazon

Amazon Transcribe is not as accurate as the others tested, and it's super slow. When I say super slow, I mean reeeaaally slow, often 15x slower than Microsoft or Google. A 2 second audio clip took about 2 minutes to transcribe on average, even with a 100Mbps connection.

Audio can't be transcribed locally; it has to be uploaded to an S3 bucket first. This adds complexity to the process and can raise costs over time.
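To show why that S3 step adds friction, here's a rough sketch of the workflow with the AWS SDK for JavaScript (v2). The bucket name, key and job name are placeholders, and polling and error handling are left out for brevity.

```typescript
import * as fs from "fs";
import * as AWS from "aws-sdk";

// Placeholders: replace with your own bucket, key and job name.
const BUCKET = "my-transcribe-bucket";
const KEY = "sample.wav";

async function transcribeOnAws(audioPath: string): Promise<void> {
  const s3 = new AWS.S3();
  const transcribe = new AWS.TranscribeService();

  // 1) Transcribe only reads from S3, so upload the audio first.
  await s3
    .upload({ Bucket: BUCKET, Key: KEY, Body: fs.createReadStream(audioPath) })
    .promise();

  // 2) Start an asynchronous transcription job pointing at that object.
  await transcribe
    .startTranscriptionJob({
      TranscriptionJobName: "wer-test-job",
      LanguageCode: "en-US",
      MediaFormat: "wav",
      Media: { MediaFileUri: `s3://${BUCKET}/${KEY}` },
    })
    .promise();

  // 3) Poll getTranscriptionJob until the status is "COMPLETED", then download
  //    the transcript JSON from the returned TranscriptFileUri.
}
```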

I do like Amazon Transcribe's UI on AWS. You can clearly see the transcriptions and the confidence score of each word, and it's very easy to locate specific words and phrases that might need tweaking if you had your own custom model.

Overall, Amazon Transcribe is a good speech to text service for more complex use cases that incorporate other AWS services, but for general use cases I feel it's too bulky and doesn't perform as well as Google or Microsoft.

💰 Price 5/10
🏎 Speed 2/10
🎯 Accuracy 4/10

Google

As I mentioned earlier, Google speech to text is incredibly inconsistent at times with both models I tested. It randomly ignores pieces of audio, or sometimes whole audio files, and simply doesn't transcribe them.

Google offers 4 pre-built models, some of which are 'enhanced' models and cost more. I tested with both the 'default' and 'video' models. 'Video' is described as an 'enhanced' model and is definitely superior in terms of accuracy, but again, the 'default' model can sometimes produce better results. This further illustrates the inconsistency I experienced while using this service.
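For reference, choosing a model in Google Cloud Speech-to-Text is just a config field. Here's a minimal sketch with the @google-cloud/speech Node client (the file path and language are placeholders, and enhanced models may need to be enabled on your project):

```typescript
import * as fs from "fs";
import { SpeechClient } from "@google-cloud/speech";

async function transcribeWithVideoModel(audioPath: string): Promise<string> {
  const client = new SpeechClient();

  const [response] = await client.recognize({
    audio: { content: fs.readFileSync(audioPath).toString("base64") },
    config: {
      encoding: "LINEAR16",   // 16-bit PCM WAV
      sampleRateHertz: 16000,
      languageCode: "en-US",
      model: "video",         // the enhanced 'video' model; "default" otherwise
      useEnhanced: true,
    },
  });

  // Join the transcript of the top alternative from each result.
  return (response.results ?? [])
    .map((r) => r.alternatives?.[0]?.transcript ?? "")
    .join(" ");
}
```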

If you need to cater for a wide range of languages, go with Google. Its language support is unrivalled compared to Amazon or Microsoft: it supports 120 languages and can auto-detect the language being spoken.

If only the audio that Google actually transcribed were taken into consideration, it would be a close second to Microsoft, but due to its inconsistencies I don't think its performance is up there with Microsoft's.

💰 Price 3/10
🏎 Speed 7/10
🎯 Accuracy 6/10

Microsoft

One thing that stood out was Microsoft's ability to accurately punctuate transcriptions. This resulted in much more human-like, conversational transcriptions, which I really liked.

Microsoft also transcribed the background noise samples with the highest accuracy, with only a ~14% increase in error rate. This is significantly lower than Amazon and Google, and background noise is a major variable that affects accuracy.

So here's our winner: the Microsoft Cognitive Services speech to text API. In every criterion I tested against, Microsoft came out on top. Its word error rate was the lowest in each dataset, it consistently had the fastest transcription speeds, it's the cheapest and it was pretty easy to set up and use.

From these findings, Microsoft speech services seem to be the best choice for most use cases.

💰 Price 8/10
🏎 Speed 8/10
🎯 Accuracy 9/10

Considerations

  • WER is not a completely reliable metric. It treats all errors the same and doesn’t take things like slang into consideration
  • A pretty common occurrence across the services is that they all struggled with the Monty Python clips. Music and sound effects seem to be the hardest variables for transcription to handle
  • Again, as mentioned before, there are many features (custom vocabularies, model training and so on) that can improve the performance of these speech to text services

If you want to test your own audio samples as described in this article, check out the testing tool on GitHub. The project is open source, so feel free to contribute or get in touch via email.
