Android Speech To Text — The missing guide (Part 1)
Table of Contents
· SpeechRecognizer API
· Simple example
· Language model
· Speech recognition language
· Query for the list of supported languages
· Check whether the configuration is supported (API 33+)
· Speech minimum length, silence length
· End of part 1
At Reveri, we are committed to bringing the benefits of decades of scientific research on hypnosis to the palm of your hand. Our mobile app allows you to experience the benefits of self-hypnosis from the comfort of your own home. Whether you want to improve your sleep, reduce stress, increase focus, manage pain, or achieve other personal goals, our app is designed to help you achieve your desired outcome.
Initially, our app provided simple audio playback, which guided users through a hypnosis session to help them achieve their goals. However, after integrating speech recognition technology, we received exceptional feedback from our users. With personalized guidance, tailored to the user’s unique experiences and needs during the session, they have reported better results than ever before.
We are proud to provide our users with an innovative, personalized approach to self-hypnosis, helping them to achieve their desired goals.
When we started looking into Android speech-to-text (STT) we knew it wasn’t a commonly integrated feature in apps, but we were surprised by the lack of resources available on the topic.
Not even Google has provided a comprehensive guide or codelab for implementing STT. The only information we found was the basic class documentation.
So in this two-part series, we will demonstrate how to integrate speech-to-text functionality into your application using the android.speech.SpeechRecognizer class. We will also highlight some of its quirks that we learned through blood and sweat while developing our own app.
In the first part, we will look at simple usage, configuration options, and errors that you can encounter.
- Part 1 — Basic usage, configuration, and errors
- Part 2 — Advanced usage, offline recognition, alternatives (coming soon)
We hope you will find it helpful!
The SpeechRecognizer class was added in Android 2.2 (API 8) and then went unchanged for years. In Android 12 (API 31) and Android 13 (API 33), Google added functionality to help with on-device speech recognition.
As the official documentation says:
This class provides access to the speech recognition service. This service allows access to the speech recognizer.
That is important to understand because there is no guarantee speech recognition is available on a specific device or that the implementation will always be from Google. For example, Samsung is using its own implementation on some devices.
⚠️ The time it takes to establish a connection with the recognition service can vary significantly depending on the device, SDK version, and manufacturer. Before the recognizer is ready for speech input, it can take up to 950ms on an older Android 8 device, but it’s usually around 150–250ms on more modern devices.
Let’s start with a simple example that demonstrates how to capture the recognized text.
First, update the AndroidManifest.xml to include the RecognitionService query, the microphone permission, and the internet permission.
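A minimal sketch of the manifest entries, assuming a standard Android project (the `<queries>` element is needed on Android 11+ so the recognition service is visible to your app under package visibility rules):

```xml
<!-- Make the speech recognition service visible to the app (Android 11+). -->
<queries>
    <intent>
        <action android:name="android.speech.RecognitionService" />
    </intent>
</queries>

<!-- Required to capture audio and to reach the online recognizer. -->
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-permission android:name="android.permission.INTERNET" />
```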
❗ The microphone runtime permission request is not covered in this blog post. See Google's guide on how to request it.
Next, check whether a speech recognition service is available on the system with SpeechRecognizer.isRecognitionAvailable(). Only if this method returns true should a SpeechRecognizer instance be created.
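A minimal sketch of the availability check and recognizer creation:

```kotlin
import android.content.Context
import android.speech.SpeechRecognizer

// Returns a recognizer only when a speech recognition service exists on this device.
fun createRecognizerIfAvailable(context: Context): SpeechRecognizer? =
    if (SpeechRecognizer.isRecognitionAvailable(context)) {
        SpeechRecognizer.createSpeechRecognizer(context)
    } else {
        null // No recognition service installed; disable the STT feature in the UI.
    }
```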
Next, we set up a callback for when speech is recognized. This must be registered before calling startListening(); otherwise, no notifications will be received. All the callbacks are executed on the main thread.
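A sketch of the listener registration; most callbacks are left as stubs here, and only the ones relevant to a basic flow are annotated:

```kotlin
import android.os.Bundle
import android.speech.RecognitionListener
import android.speech.SpeechRecognizer

// Register the listener BEFORE calling startListening(); callbacks run on the main thread.
fun setupListener(recognizer: SpeechRecognizer) {
    recognizer.setRecognitionListener(object : RecognitionListener {
        override fun onReadyForSpeech(params: Bundle?) { /* recognizer is ready for input */ }
        override fun onBeginningOfSpeech() { /* the user started speaking */ }
        override fun onRmsChanged(rmsdB: Float) { /* sound level changed */ }
        override fun onBufferReceived(buffer: ByteArray?) {}
        override fun onEndOfSpeech() { /* the user stopped speaking */ }
        override fun onError(error: Int) { /* e.g. SpeechRecognizer.ERROR_NO_MATCH */ }
        override fun onResults(results: Bundle?) { /* final results arrive here */ }
        override fun onPartialResults(partialResults: Bundle?) {}
        override fun onEvent(eventType: Int, params: Bundle?) {}
    })
}
```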
Create the request intent. For now, let’s only specify the language model.
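The request intent can be built along these lines:

```kotlin
import android.content.Intent
import android.speech.RecognizerIntent

// Build the recognition request; for now only the language model is specified.
val recognizerIntent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
    putExtra(
        RecognizerIntent.EXTRA_LANGUAGE_MODEL,
        RecognizerIntent.LANGUAGE_MODEL_FREE_FORM
    )
}
```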
And finally, call startListening() to trigger speech recognition and start listening to the user. Once something has been recognized, look for the result in the Bundle returned by the onResults() callback.
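A sketch of the final step, assuming the recognizer and intent from the previous snippets; the result Bundle holds candidate transcriptions under SpeechRecognizer.RESULTS_RECOGNITION:

```kotlin
import android.content.Intent
import android.os.Bundle
import android.speech.SpeechRecognizer

// Kick off recognition with the intent built earlier.
fun start(recognizer: SpeechRecognizer, recognizerIntent: Intent) {
    recognizer.startListening(recognizerIntent)
}

// Inside RecognitionListener#onResults: candidates are ordered from most to least likely.
fun extractBestMatch(results: Bundle?): String? =
    results?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)?.firstOrNull()
```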
All code snippets combined can be found here.
At this point you’ve set up basic speech recognition but it may not work as you expect.
All the available configuration parameters can be set as Intent extras passed to startListening(). All the constants can be found in the RecognizerIntent class.
Let’s examine the most frequently configurable items.
There are two options for setting a language model:
- Free-form speech recognition. Use LANGUAGE_MODEL_FREE_FORM
- Web search terms. Use LANGUAGE_MODEL_WEB_SEARCH
The free-form speech recognition model allows users to speak naturally and use their own words and phrases to communicate. This option is useful when users want to interact with the system in a more conversational way.
On the other hand, the web search terms model is designed to recognize specific words and phrases that are commonly used in web searches. This option is ideal for applications that require the user to input specific information, such as a search query or a command.
Speech recognition language
Specifying the language is optional. By default, the recognizer will use the system language.
However, for optimal user experience, it is recommended that the language setting matches the application’s UI language.
The language can be specified in the recognizerIntent by adding a RecognizerIntent.EXTRA_LANGUAGE extra with a language tag in BCP 47 format, for example, "en-US".
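For example, to explicitly request US English rather than relying on the system default:

```kotlin
import android.content.Intent
import android.speech.RecognizerIntent

// Explicitly request US English (BCP 47 tag) instead of the system default language.
fun setLanguage(recognizerIntent: Intent) {
    recognizerIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, "en-US")
}
```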
⚠️ We have seen examples on StackOverflow where Locale.getDefault().toLanguageTag() was used. This should be avoided, as the default locale can return any language, which may not be supported by the recognizer. Instead, it is best to explicitly specify the desired language.
Query for the list of supported languages
The set of supported languages for speech recognition can vary based on the device and OS version.
If you want to set a recognition language that is not English or one of the most commonly spoken languages in the world, it’s especially important to query for the list of supported languages. Attempting to set an unsupported language will result in an error, and speech recognition will not work.
The result will contain the preferred language on the device and a list of all possible languages that can be requested in the configuration.
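One way to perform the query is with the ordered-broadcast API around ACTION_GET_LANGUAGE_DETAILS, sketched below; the exact set of returned languages varies by device and recognition service:

```kotlin
import android.app.Activity
import android.content.BroadcastReceiver
import android.content.Context
import android.content.Intent
import android.speech.RecognizerIntent

// Ask the recognition service for its language details via an ordered broadcast.
fun querySupportedLanguages(context: Context) {
    context.sendOrderedBroadcast(
        RecognizerIntent.getVoiceDetailsIntent(context),
        null, // no receiver permission
        object : BroadcastReceiver() {
            override fun onReceive(context: Context, intent: Intent) {
                val extras = getResultExtras(true)
                // The device's preferred recognition language.
                val preferred = extras.getString(RecognizerIntent.EXTRA_LANGUAGE_PREFERENCE)
                // All language tags that can be requested in the configuration.
                val supported = extras.getStringArrayList(RecognizerIntent.EXTRA_SUPPORTED_LANGUAGES)
            }
        },
        null,               // scheduler (null = main thread)
        Activity.RESULT_OK, // initial result code
        null,               // initial result data
        null                // initial result extras
    )
}
```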
Check whether the configuration is supported (API 33+)
When targeting Android 13 or later, you can take advantage of the newly introduced method checkRecognitionSupport(Intent, Executor, RecognitionSupportCallback). This method allows you to verify that the recognizerIntent configuration is supported before invoking SpeechRecognizer#startListening(Intent), thereby preventing potential errors. The returned RecognitionSupport object will contain a list of online and offline languages available on the device.
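A sketch of the support check, assuming an AndroidX project (for the @RequiresApi annotation) and a recognizer plus intent created as shown earlier:

```kotlin
import android.content.Intent
import android.os.Build
import android.speech.RecognitionSupport
import android.speech.RecognitionSupportCallback
import android.speech.SpeechRecognizer
import androidx.annotation.RequiresApi
import java.util.concurrent.Executors

@RequiresApi(Build.VERSION_CODES.TIRAMISU)
fun checkSupport(recognizer: SpeechRecognizer, recognizerIntent: Intent) {
    recognizer.checkRecognitionSupport(
        recognizerIntent,
        Executors.newSingleThreadExecutor(),
        object : RecognitionSupportCallback {
            override fun onSupportResult(recognitionSupport: RecognitionSupport) {
                // Languages already installed for offline (on-device) recognition.
                val offline = recognitionSupport.installedOnDeviceLanguages
                // Languages available through the online recognizer.
                val online = recognitionSupport.onlineLanguages
            }
            override fun onError(error: Int) { /* the support check itself failed */ }
        }
    )
}
```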
Speech minimum length, silence length
Google has provided default values for speech recognition in Android that make the process feel natural. However, depending on your specific use case, you may want to customize the following settings:
- Minimum length: the minimum length of the recognition session. The recognizer will not stop recognizing speech before this amount of time has elapsed. See EXTRA_SPEECH_INPUT_MINIMUM_LENGTH_MILLIS
- Silence length to complete recognition: the amount of time after the recognizer stops hearing speech before it considers the input complete and ends the recognition session. See EXTRA_SPEECH_INPUT_COMPLETE_SILENCE_LENGTH_MILLIS
- Silence length to possibly complete recognition: the amount of time after the recognizer stops hearing speech before it considers the input possibly complete. This is used to prevent the endpointer from cutting off during very short mid-speech pauses. Both this and the previous setting seem to control the same thing; whichever is lower ultimately applies. See EXTRA_SPEECH_INPUT_POSSIBLY_COMPLETE_SILENCE_LENGTH_MILLIS
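These settings are plain Intent extras; the values below are illustrative only and should be tuned for your own use case:

```kotlin
import android.content.Intent
import android.speech.RecognizerIntent

// Illustrative timing values; tune them for your own use case.
fun configureTimings(recognizerIntent: Intent) {
    recognizerIntent.apply {
        // Keep the session open for at least 5 seconds.
        putExtra(RecognizerIntent.EXTRA_SPEECH_INPUT_MINIMUM_LENGTH_MILLIS, 5_000L)
        // End the session after 2 seconds of silence.
        putExtra(RecognizerIntent.EXTRA_SPEECH_INPUT_COMPLETE_SILENCE_LENGTH_MILLIS, 2_000L)
        // Treat 2 seconds of silence as "possibly complete" as well.
        putExtra(RecognizerIntent.EXTRA_SPEECH_INPUT_POSSIBLY_COMPLETE_SILENCE_LENGTH_MILLIS, 2_000L)
    }
}
```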
Getting an error and not sure what it means? Here’s a list of the most common errors and what went wrong.
End of part 1
This concludes part one of this series. It covered the most basic and important details of the SpeechRecognizer API, so hopefully you now have a better grasp of how it works under the hood.
Thanks for making it to the very end of this post. Please let us know in the comments if there is anything in particular you would like us to cover in the next post, or if you have any suggestions or improvements for this implementation.
Stay tuned for the next post where we will check more advanced topics: listening for partial results, observing sound volume, offline (on-device) speech recognition, and more.
I want to give a big shoutout to my awesome colleagues, Marcel Pallarés and Tristan Warner-Smith, for their amazing suggestions that have really leveled up this blog post. Thanks, guys!