Add Voice Commands to Android Apps

Anshaj Khare · Published in Geek Culture · Jul 6, 2021

“Okay Google, take me home.”

“Hey Siri, play some music.”

If you have ever used voice commands on a mobile device, you have probably tried something similar. The voice assistants bundled with your phone are getting better every day and can help you do many things. But can they help you once you are inside an app? Can they help you navigate the maze of each app’s UI design? Can they help you use your voice for tasks once you open the app?

Wait a minute. What do you mean by voice inside the app?

Yes — have you ever wondered if you could add voice commands to an app? Imagine getting things done just by speaking to the app rather than trying to wriggle through the UI via touch, just like how we see in sci-fi movies.

We are going to do just that with an Android app today.

No, I will not ask you to shell out loads of time and money to build some fancy deep learning based Automatic Speech Recognition system as a prerequisite. Instead, we will make use of the freely available SpeechRecognizer class in Android to complete this project.

However, if you are interested in building your own ASR, you can refer to our blog on building your own transcriber.

We will create a new Android Studio project for the purpose of this experiment.

Five steps need to be followed to enable our app to respond to voice commands:

  1. Set up audio permissions.
  2. Create a UI element that will act as a microphone trigger.
  3. Initialize and set up the SpeechRecognizer instance.
  4. Create a list of supported utterances that will serve as the domain for voice commands in our app.
  5. Execute actions in the app depending on the voice command given by the user.

Set up audio permissions

We will need to declare the audio permission requirement by adding the following line in the AndroidManifest.xml file of our app.
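That line is the standard RECORD_AUDIO permission declaration:

```xml
<!-- AndroidManifest.xml: allows the app to record audio from the microphone -->
<uses-permission android:name="android.permission.RECORD_AUDIO" />
```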

We will handle the microphone logic in MainActivity.java. Before our app can start listening to user input, we need to run a permission check and, if audio permission has not been granted, request the user to allow access to the microphone.

Note: You can choose to ask for audio permission on app startup or at runtime, when the user tries to talk to your app by tapping the designated UI element. Here we will ask for audio permission on app startup.
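A minimal sketch of that startup check inside MainActivity; the request-code constant and the checkAudioPermission() helper are illustrative names, and the method would be called from onCreate():

```java
import android.Manifest;
import android.content.pm.PackageManager;
import androidx.core.app.ActivityCompat;
import androidx.core.content.ContextCompat;

// Fields and methods below live inside MainActivity

// Arbitrary request code used to match the permission result callback
private static final int RECORD_AUDIO_REQUEST_CODE = 101;

// Called from onCreate() so the permission prompt shows up on app startup
private void checkAudioPermission() {
    if (ContextCompat.checkSelfPermission(this, Manifest.permission.RECORD_AUDIO)
            != PackageManager.PERMISSION_GRANTED) {
        ActivityCompat.requestPermissions(
                this,
                new String[]{Manifest.permission.RECORD_AUDIO},
                RECORD_AUDIO_REQUEST_CODE);
    }
}
```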

To know if the user approved or denied our permission request, we will need to override the onRequestPermissionsResult() function in MainActivity.
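The override could look roughly like this, reusing the same request code; the toast messages are just illustrative feedback:

```java
import android.widget.Toast;

@Override
public void onRequestPermissionsResult(int requestCode, String[] permissions,
                                        int[] grantResults) {
    super.onRequestPermissionsResult(requestCode, permissions, grantResults);
    if (requestCode == RECORD_AUDIO_REQUEST_CODE) {
        if (grantResults.length > 0
                && grantResults[0] == PackageManager.PERMISSION_GRANTED) {
            Toast.makeText(this, "Microphone access granted", Toast.LENGTH_SHORT).show();
        } else {
            Toast.makeText(this, "Voice commands need microphone access", Toast.LENGTH_SHORT).show();
        }
    }
}
```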

Add UI element for microphone trigger

The next step is to add a microphone button, a.k.a. the trigger. Whenever the user clicks the trigger, we will start a new audio session. For this experiment, let us use a microphone image as our trigger icon. We will add the image resource to the project and then add an ImageView to activity_main.xml. We will also add a TextView to show the user’s detected utterance while they are speaking.
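The additions to activity_main.xml could look something like this; the view IDs and the ic_microphone drawable are placeholder names:

```xml
<!-- Placed inside the existing root layout of activity_main.xml -->
<ImageView
    android:id="@+id/iv_mic_trigger"
    android:layout_width="72dp"
    android:layout_height="72dp"
    android:contentDescription="Microphone trigger"
    android:src="@drawable/ic_microphone" />

<TextView
    android:id="@+id/tv_detected_utterance"
    android:layout_width="wrap_content"
    android:layout_height="wrap_content"
    android:textSize="16sp" />
```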

We will also add a click listener to our trigger icon to start and stop audio sessions.
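Wiring the views up in MainActivity might look like this; handleTriggerClick() is a placeholder we will fill in once the SpeechRecognizer is ready:

```java
import android.widget.ImageView;
import android.widget.TextView;

// Fields on MainActivity (names match the placeholder IDs from the layout sketch above)
private ImageView ivMicTrigger;
private TextView tvDetectedUtterance;

// Inside onCreate(), after setContentView(R.layout.activity_main)
ivMicTrigger = findViewById(R.id.iv_mic_trigger);
tvDetectedUtterance = findViewById(R.id.tv_detected_utterance);

// Start or stop an audio session whenever the trigger icon is tapped
ivMicTrigger.setOnClickListener(view -> handleTriggerClick());
```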

Set up the SpeechRecognizer instance

Now onto the exciting stuff. Using the SpeechRecognizer instance, we will be able to listen to the user’s commands and execute actions inside our app.

Add a global instance of SpeechRecognizer called mSpeechRecognizer to MainActivity, and add a function to initialize it.

We will also add a RecognitionListener to mSpeechRecognizer, which will enable us to be notified at each stage of the audio session. The two callbacks we are interested in are onPartialResults() and onResults(). We will fill in these callbacks later on. You can ignore the other callbacks; we will not need them for this experiment.
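A sketch of the initialization; initializeSpeechRecognizer() is an illustrative name, and every callback is stubbed out except the two we care about:

```java
import android.os.Bundle;
import android.speech.RecognitionListener;
import android.speech.SpeechRecognizer;

private SpeechRecognizer mSpeechRecognizer;

private void initializeSpeechRecognizer() {
    // Bail out early if no recognition service is available on this device
    if (!SpeechRecognizer.isRecognitionAvailable(this)) {
        Toast.makeText(this, "Speech recognition is not available", Toast.LENGTH_SHORT).show();
        return;
    }

    mSpeechRecognizer = SpeechRecognizer.createSpeechRecognizer(this);
    mSpeechRecognizer.setRecognitionListener(new RecognitionListener() {
        @Override public void onReadyForSpeech(Bundle params) { }
        @Override public void onBeginningOfSpeech() { }
        @Override public void onRmsChanged(float rmsdB) { }
        @Override public void onBufferReceived(byte[] buffer) { }
        @Override public void onEndOfSpeech() { }
        @Override public void onError(int error) { }
        @Override public void onEvent(int eventType, Bundle params) { }

        @Override
        public void onPartialResults(Bundle partialResults) {
            // Filled in later: show the partial transcript while the user speaks
        }

        @Override
        public void onResults(Bundle results) {
            // Filled in later: handle the final utterance
        }
    });
}
```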

Before starting an audio session, we also need to define a speech recognition intent that will be passed to mSpeechRecognizer.

The EXTRA_LANGUAGE_MODEL is a required extra for ACTION_RECOGNIZE_SPEECH. We also set EXTRA_PARTIAL_RESULTS to true so that we are notified of partial speech results via the onPartialResults() callback while the user is speaking.

The EXTRA_LANGUAGE extra is optional and sets the locale for speech recognition. In addition, there are a few other optional extras, which you can check out in the RecognizerIntent documentation.
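The intent could be built like this; the field and method names are illustrative, and the same intent is reused for every audio session:

```java
import android.content.Intent;
import android.speech.RecognizerIntent;
import java.util.Locale;

private Intent mSpeechRecognizerIntent;

// Called once (e.g., from onCreate()) so the same intent can be reused for every session
private void buildRecognizerIntent() {
    mSpeechRecognizerIntent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
    // Required extra: tells the recognizer which language model to apply
    mSpeechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
            RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
    // Deliver partial results through onPartialResults() while the user is still speaking
    mSpeechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_PARTIAL_RESULTS, true);
    // Optional: pin the recognition locale (here, the device default)
    mSpeechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, Locale.getDefault());
}
```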

Let us now handle the audio sessions inside the click handler for our trigger icon.
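One way to toggle the session inside the click handler; the isListening flag and the handleTriggerClick() name are illustrative choices:

```java
private boolean isListening = false;

private void handleTriggerClick() {
    // Don't start a session if microphone permission is still missing
    if (ContextCompat.checkSelfPermission(this, Manifest.permission.RECORD_AUDIO)
            != PackageManager.PERMISSION_GRANTED) {
        Toast.makeText(this, "Please grant microphone access first", Toast.LENGTH_SHORT).show();
        return;
    }

    if (!isListening) {
        // Start a new audio session with the recognition intent defined above
        mSpeechRecognizer.startListening(mSpeechRecognizerIntent);
        tvDetectedUtterance.setText("Listening...");
    } else {
        // Stop the ongoing audio session
        mSpeechRecognizer.stopListening();
    }
    isListening = !isListening;
}
```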

Once an audio session starts and the user starts speaking, we will be notified of partial and complete results in the callbacks. Let us define those callbacks.
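The two callbacks inside the RecognitionListener we attached earlier could be filled in like this; both read the transcript list that the recognizer places in the result bundle:

```java
import java.util.ArrayList;

@Override
public void onPartialResults(Bundle partialResults) {
    ArrayList<String> partial =
            partialResults.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION);
    if (partial != null && !partial.isEmpty()) {
        // Live feedback: show the in-progress transcript on screen
        tvDetectedUtterance.setText(partial.get(0));
    }
}

@Override
public void onResults(Bundle results) {
    ArrayList<String> matches =
            results.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION);
    if (matches != null && !matches.isEmpty()) {
        String utterance = matches.get(0);
        tvDetectedUtterance.setText(utterance);
        // Map the final utterance to an app action (handleCommand() is defined below)
        handleCommand(utterance);
    }
    // The session ends once final results arrive
    isListening = false;
}
```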

Create a list of app-supported actions

Let us define a few actions the app can perform and add the utterances for those actions to an array. Of course, this list depends on the app; for example, an e-commerce app would have a different set of actions than a messaging app.

We will define actions for an e-commerce app in this experiment. Our array of user command utterances would look something like this.
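The exact phrases below are illustrative placeholders; a real app would list whatever commands its screens actually support:

```java
// Illustrative set of supported utterances for an e-commerce style app
private final String[] supportedCommands = {
        "search for shoes",
        "show my cart",
        "add to cart",
        "proceed to checkout",
        "show my orders",
        "go to home"
};
```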

Execute actions as per user voice commands

Now that we have our list of supported commands and end-to-end speech recognition working in the app, let us define the handleCommand() function responsible for executing user commands.

Depending on the command we receive from the user, we will trigger different app actions. For the purposes of this experiment, we will only update the UI on MainActivity, informing the user that the action was a success. If we cannot find the user’s utterance in our static list of commands, we will tell them that the requested action is not supported.
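A minimal sketch of handleCommand(), matching the utterance against the static list above:

```java
private void handleCommand(String utterance) {
    String command = utterance.toLowerCase(Locale.ROOT).trim();

    for (String supported : supportedCommands) {
        if (supported.equals(command)) {
            // In a real app this is where the actual action (search, cart, etc.) would run;
            // for this experiment we simply confirm success on screen
            tvDetectedUtterance.setText("Done: " + supported);
            return;
        }
    }

    // The utterance did not match any supported command
    tvDetectedUtterance.setText("Sorry, \"" + utterance + "\" is not supported yet");
}
```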

That’s it! Our sample app is now a voice-enabled app that will enable users to perform actions based on voice commands.

Hope you found this helpful, and if you decide to take up any experiments with voice, do let us know in the comments below.

While the steps above help you code your own in-app voice commands, the result is still limited in what it can do and in the flexibility it provides. For example, if the user phrases a command slightly differently, it will not be recognized. Nor is this setup able to talk back to the user to collect additional information when required. The on-device speech recognition is a generic ASR: it is optimized for broad-based recognition, and there is no way to increase its accuracy for the words that matter most to your app. Finally, the UI and UX still need to be built separately, as the default text box is very limited in functionality.

This is where a platform like Slang CONVA, which we have built at Slang, can help you add sophisticated in-app voice assistants without having to explicitly handle the audio pipeline or parse the command to understand the intent behind it. Its full-stack, pre-built voice assistants provide out-of-the-box support for various domains, including handling multiple languages.

Check out our blog on building custom In-app voice assistants to understand this in more detail.
