Speech recognition: initial thoughts

Published in hyperoslo, Feb 5, 2018

Let’s talk about speech recognition. We’re probably in the golden age of artificial intelligence now, and it’s relatively easy to find decent tools and services to add this desirable smartness to our apps. Nowadays we’re equipped better than ever with:

cloud computing platforms from industry giants like Google, Amazon and IBM;
powerful machine learning libraries for training custom models;
system frameworks that work at a much higher level of abstraction, making it more accessible for developers to use these cutting-edge technologies.

So it should be pretty straightforward to implement something as simple as speech-to-text conversion, right? Yes and no. I’ll try to explain why.

Let’s say we’re making a dictation app. You speak to it and receive the recognised speech as text. But there is a tiny detail: instead of English, which is what tutorials normally show, you have to work with a language that is not so widely supported. And guess what? Now your options are more limited.

Building a custom speech recognition system can be a challenging and quite expensive task, so let’s drop this idea for now and focus on existing solutions. A few cloud-based services come to mind first (the language counts below are as of the time of writing):

The most attractive one is the Google Cloud Speech API. It recognises over 110 languages and variants, works in real time and has a lot of handy features, which is more than impressive.
The Microsoft speech recognition API is behind, with around 29 supported languages.
In comparison, the IBM Watson API supports up to 7 languages, and Amazon Transcribe works with US English and Spanish speech.

And don’t forget about the native iOS and Android APIs. They handle a wide range of languages and always have something extra to offer.

Speech recognition can be even more challenging if you have to deal with domain-specific words. Fortunately, the frameworks and services you choose usually have out-of-the-box features to work around this problem. For instance, the Google Cloud Speech API lets you pass phrase hints with any recognition task, and Apple lets you provide contextual strings to help the Speech framework recognise words that are not in the system vocabulary.
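To make the phrase hints idea more concrete, here is a minimal sketch of a request body for the Cloud Speech REST endpoint `speech:recognize`, where hints go into `speechContexts`. The helper name `build_recognize_request`, the `nb-NO` language code, the sample phrases and the placeholder audio are all assumptions made for illustration. The sketch only builds the JSON; it does not send anything or handle authentication.

```python
import base64
import json


def build_recognize_request(audio_bytes, language_code, phrases, sample_rate_hertz=16000):
    """Build a JSON-serialisable body for a `speech:recognize` call (hypothetical helper)."""
    return {
        "config": {
            # Assumes raw 16-bit PCM audio; other encodings need a different value.
            "encoding": "LINEAR16",
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,
            # Phrase hints: domain-specific words the recogniser should favour.
            "speechContexts": [{"phrases": list(phrases)}],
        },
        "audio": {
            # Inline audio is sent as base64-encoded text.
            "content": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }


if __name__ == "__main__":
    body = build_recognize_request(
        audio_bytes=b"\x00\x00" * 160,  # placeholder silence, not real speech
        language_code="nb-NO",          # assumed example language
        phrases=["fjord", "lutefisk"],  # assumed example vocabulary
    )
    print(json.dumps(body, indent=2))
```

On iOS the equivalent knob is the `contextualStrings` property of a Speech framework recognition request, so the same list of words could be shared between both backends.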

So what should you choose for your next groundbreaking speech recognition project? I would say it depends on several factors. First of all, think about time constraints. Maybe you would like more control over the internals and prefer the flexibility of the CMUSphinx toolkit. Or maybe it’s simply less expensive to pay for ready-to-use services and tools. Then comes language support, together with the need to handle specialised vocabularies, words or custom commands. Last but not least, you should be aware of platform possibilities and limitations. If you’re building an iOS application, it’s probably smart to use the Speech framework. SpeechRecognizer seems like a good default choice for Android. But it gets more complicated if you aim to create a cross-platform solution or want to support multiple clients. Then it feels more natural to stick with cloud-based APIs, so that recognition results stay consistent across your apps.
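The platform part of that reasoning can be written down as a tiny rule of thumb. This is only a sketch of the paragraph above, not a real selection tool: the function name `suggest_recognizer` and its parameters are made up for illustration, and it deliberately ignores time budget, pricing and language coverage.

```python
def suggest_recognizer(platforms, needs_internal_control=False):
    """Return a default recogniser for the given target platforms (hypothetical helper)."""
    targets = {p.lower() for p in platforms}
    if needs_internal_control:
        # More control and flexibility, at the cost of more work.
        return "CMUSphinx"
    if targets == {"ios"}:
        return "Speech framework"
    if targets == {"android"}:
        return "SpeechRecognizer"
    # Several clients: one cloud API keeps results consistent across apps.
    return "cloud-based API"


if __name__ == "__main__":
    print(suggest_recognizer(["iOS"]))
    print(suggest_recognizer(["iOS", "Android", "web"]))
```

In a real project, language support would probably be checked first, since a recogniser that cannot handle your language is out no matter how well it fits the platform.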

Speech recognition can be both hard and easy. It depends heavily on the task, like everything in the software development world. You shouldn’t expect it to work perfectly. The technology is not new, but there is still a lot of work to do and a lot to improve. Remember how many times you’ve complained about Siri, Alexa or Google Assistant. But hey, even humans have difficulty understanding each other 😅.

Looking on the bright side, nothing is set in stone, and thanks to machine learning, speech recognition is continuously evolving. There are plenty of good libraries, betas and preview software to play with. The Web Speech API brings speech synthesis and recognition to web apps. And services like Custom Speech will not only let you create custom models, but will also learn the speaking style of your users. There are exciting times ahead of us. Every day we take one more step towards the future, whether it’s bright or not.

Written by Vadym Markov