Choosing a Speech-to-Text Service

For one of our projects we needed to transcribe Dutch phone conversations from audio to text. Since we decided to automate the transcription, we had to choose which software would do this for us. To help with this decision we compared two commercial services and one open source project on Word Error Rate and hourly cost.

Our problem seemed simple enough: we had a lot of recordings of Dutch phone conversations and we wanted to analyze their content. This meant going from audio to text. Transcribing audio is not complicated for humans: listen to the audio and write down every word. The task simply takes time. However, our sample of test data was already over 200 hours of audio, so manual transcription was out of the question. We therefore looked at automated speech-to-text solutions to do the transcription work for us: software by Google and Speechmatics, and the open source Kaldi project. We compared these three on their ability to transcribe Dutch phone recordings between customers and customer service agents.

Google Cloud Speech-to-Text and Speechmatics

These services are quite similar in their offering: cloud-based speech-to-text for many different languages with high performance. Both Google and Speechmatics continually update their language models to increase accuracy and introduce new words where applicable. This continuous development is a strong point for both, especially the introduction of new words, which can help with new company names and other terms. Both services charge per minute of audio transcribed: Google uses a fixed price, while the price for Speechmatics goes down as more minutes are purchased up front. Google has a few different transcription models available, although unfortunately their specialized model for phone conversations is not available in Dutch at the time of writing.

Kaldi (NL)

Kaldi is an open source speech recognition toolkit developed and maintained mainly by Daniel Povey with the help of about 70 other contributors so far. The toolkit has a lot of flexibility, especially since it’s open source and can be extended or improved by anyone who dares to understand it. It does carry a lot of complexity, however, and requires a lot of time and effort to fully learn its quirks. Much credit to Povey for actively replying on the Kaldi-help forums, even though they seem to contain an infinite stream of questions.

“If you are inexperienced with computers . . . we won’t have much patience with you” — kaldi-asr.org/forums.html

Kaldi NL is a set of scripts and models for transcribing Dutch audio using the Kaldi toolkit. The model was trained on the Corpus Gesproken Nederlands (Spoken Dutch Corpus), which contains roughly 900 hours of annotated Dutch audio. Since the corpus is one of the largest collections of annotated Dutch audio, if not the largest, it is a great baseline for what is possible when training models using Kaldi. With Kaldi NL we have been able to transcribe our own recordings and make a comparison to other transcribers.

One big advantage of running audio transcription on our own hardware is full control over the data. While Google promises not to use your audio for their own projects unless you specifically allow them to, the audio would still have to be sent to their servers. Furthermore, if you want to use their specialized phone model you have to allow them to use your audio as well, giving up control over that data. Keeping everything in-house gives some peace of mind, especially with the new GDPR laws in place.

Word Error Rate

Word Error Rate (WER) is calculated by counting the number of substitutions, insertions and deletions compared to a ‘correct’ transcription baseline (usually human made) and dividing that count by the number of words in the baseline. Substitutions are wrong words, simply misheard or misspelled. Deletions and insertions are the mistakes of having too few or too many words respectively.

Google Command is a speech recognition model mainly trained to recognize spoken commands for a digital assistant, but we found that it performs well enough to use in our comparisons. In the example below we also used two example models from Kaldi, the ASpIRE and Fisher models that can be found in the Kaldi repository. However, for our actual investigation we looked at Dutch models, since Dutch is what we’ll be transcribing in our project.
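The WER definition above can be sketched in a few lines of Python. This is a minimal illustration and not the scoring script we used: the function name `word_error_rate`, the lowercasing and the whitespace tokenization are assumptions made for the example. It computes the word-level edit distance between a reference and a transcriber's output and divides it by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up sentences, purely for illustration.
    reference = "the cat sat on the mat"
    print(word_error_rate(reference, "the cat sit on mat"))  # one substitution, one deletion
    print(word_error_rate(reference, ""))                    # every word deleted
```

Note that an empty output scores a WER of 1.0, because every reference word counts as a deletion. Insertions can even push the WER above 1.0, since the count is divided by the length of the reference only.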

Listen to the following audio sample (it’s only 3 seconds!) and try to transcribe the sentence before reading the actual sentence below.

For this single sentence example WER is a very harsh metric; it makes more sense for longer sentences and even whole paragraphs. The words marked in red are substitutions, and in this example there are no insertions or deletions made by the transcribers. Furthermore, since this is a publicly available sound bite it might have been used by the transcribers for training, so it should not be used as an indication of performance on real-world recordings.

Word Error Rate calculation example. These transcriptions were made on 06–07–2018.

We tested the transcribers on fragments of Dutch phone recordings with customer service agents. For these recordings the WER difference between the transcribers is reasonably small. The Dutch Kaldi model performs worse than Google on the long sound files. However, one reason for its higher error rate is its tendency to output nothing at all when it is unsure. While this is not inherently bad, every skipped word counts as a deletion, whereas a guess might sometimes be right.

Word Error Rate for some of our Dutch audio fragments

The cost of running a transcriber

Making a 1:1 comparison for Kaldi is not straightforward because we run Kaldi on an Amazon EC2 instance, while the other services run on their own servers. To get an indication we have simply divided the hourly cost of our EC2 instance by the number of hours of audio it can transcribe per hour. For this test we used a c5.2xlarge instance that costs $0.384 per hour. There is an additional monthly fee for storage usage as well, but this does not significantly increase hourly costs.

With the Amazon instance running and the Kaldi NL model working we did some testing to determine how much audio could be transcribed per hour. We found that transcription takes 5 to 10% of the duration of the audio, i.e. one hour of audio is transcribed in just 3 to 6 minutes. Taking the conservative figure of 10% of the audio duration means that we would run at a cost of about $0.038 per hour of transcribed audio ($0.384 × 0.10). However, this does not include the cost of setup and maintenance!
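The arithmetic behind this estimate can be written out as a small sketch. The instance price and the 5 to 10% processing ratios come from the text above; the function name `cost_per_audio_hour` is ours and only serves the illustration. Storage, setup and maintenance are deliberately left out, as in the estimate itself.

```python
INSTANCE_COST_PER_HOUR = 0.384  # USD per hour for the c5.2xlarge instance quoted above


def cost_per_audio_hour(instance_cost_per_hour: float, processing_ratio: float) -> float:
    """Compute the cost of transcribing one hour of audio.

    processing_ratio is the processing time as a fraction of the audio
    duration, e.g. 0.10 means one hour of audio takes six minutes.
    """
    if processing_ratio <= 0:
        raise ValueError("processing_ratio must be positive")
    return instance_cost_per_hour * processing_ratio


if __name__ == "__main__":
    for ratio in (0.05, 0.10):
        minutes = 60 * ratio
        cost = cost_per_audio_hour(INSTANCE_COST_PER_HOUR, ratio)
        print(f"{ratio:.0%} of audio duration: {minutes:.0f} min per audio hour, ${cost:.4f}")
```

Multiplying by the ratio is the same as dividing the hourly instance cost by the number of audio hours processed per hour (10 audio hours per hour at a ratio of 10%).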

Transcriber cost per hour of audio:

  • Google $1.44
  • Speechmatics $2.40 to $3.60*
  • Kaldi NL (estimate) $0.038

*Speechmatics costs decrease with the volume of minutes purchased up front, but the cheapest rate requires buying a package of $10,000.
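To put these hourly prices in perspective, the sketch below scales them to a data set of 200 hours, a round figure based on the size of our test sample. The prices are copied from the list above; the names `PRICE_PER_AUDIO_HOUR` and `total_cost_range` are illustrative only, and the Kaldi figure still excludes setup and maintenance.

```python
AUDIO_HOURS = 200  # round figure based on the test sample size mentioned earlier

# USD per hour of audio as (low, high), copied from the list above.
PRICE_PER_AUDIO_HOUR = {
    "Google": (1.44, 1.44),
    "Speechmatics": (2.40, 3.60),
    "Kaldi NL (estimate)": (0.038, 0.038),
}


def total_cost_range(price_range, audio_hours):
    """Scale a (low, high) hourly price range to a total cost in USD."""
    low, high = price_range
    return round(low * audio_hours, 2), round(high * audio_hours, 2)


totals = {
    name: total_cost_range(price_range, AUDIO_HOURS)
    for name, price_range in PRICE_PER_AUDIO_HOUR.items()
}

if __name__ == "__main__":
    for name, (low, high) in totals.items():
        if low == high:
            print(f"{name}: ${low:,.2f}")
        else:
            print(f"{name}: ${low:,.2f} to ${high:,.2f}")
```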

Conclusion

In the future we could look into training our own models for Kaldi instead of using existing ones, to get rid of some quirks and to include newer vocabulary. Google and Speechmatics would not need training since they update their models over time. However, for now the choice seems rather obvious if we compare cost per hour of audio to WER. While Kaldi scores slightly worse on WER, its price is many times lower than that of both Google and Speechmatics. Kaldi will require some extra time to understand and maintain, but the cost difference makes it a simple choice for the time being.


About Artificial Industry: We help entrepreneurs change the world by transforming their ideas quickly and efficiently into successful online businesses. We do this by creating (data) prototypes and MVPs for our clients.