Speech Recognition on Roku using Google Cloud Speech-to-Text API

Alejandro Cotilla
Float Left Insights
5 min read · Jun 6, 2018

On Roku OS 7.6, Microphone APIs were made available to developers, bringing some very capable audio-capturing features. This allows us developers to integrate cloud-based APIs that process recorded audio for multiple purposes, one of them being Speech Recognition.

Arguably the best Speech Recognition technology available today is Google’s Cloud Speech-to-Text API. The API recognizes 120 languages and variants; it supports automatic punctuation, inappropriate-content filtering, and many other advanced features. And yes, the results are almost flawless. You can read all about it on their website and even try it out.

Why would we want Speech Recognition?

  • Dictation — you can enable dictation on your custom keyboards to make entering logins and passwords, or searching for content, much easier and faster.
  • Accessibility — you can significantly improve the accessibility features of your app by allowing users to navigate just with their voice.
  • Or you could just make a Flappy Bird game where instead of pressing a button every time, you get to say “Jump” (I would play that).

Prerequisites

Before we continue, there are a few things that we need to cover to be able to run and test the demo app.

Before getting started

To make things easier during this tutorial, let’s enable “Always allow” in the Roku microphone settings; otherwise, every time the app is side-loaded, the Roku will ask for permission.

To enable “Always allow”, go to Settings/Privacy/Microphone/Channel microphone access and select “Always allow”.

Also, as with any Google Cloud API, the API has to be enabled on a project within the Google Cloud Console, and all API calls will be associated with that project.

Summarized steps:
1. Create a project (or use an existing one) in the Cloud Console.
2. Make sure that billing is enabled for your project.
3. Enable the Speech-to-Text API.
4. Create an API key.

Show me the code

For this demo, we’ll make a very simple app with just a label that displays the transcript of the words the user spoke.
To begin, let’s download the starter project from here.
Uncompress the project (or not, depending on which side-loading solution you’re using) and side-load it.

You should see just the empty starter screen. Right now, nothing happens.

To fix that, let’s begin by adding the brains of the operation to our project: the SpeechRecognizer component. This component will handle all the communication between our app and the Google Cloud Speech-to-Text API.
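The component’s actual implementation lives in SpeechRecognizer.brs; to give a feel for what it does, its core boils down to something like the sketch below. This is illustrative, not the starter code: the function name, the assumption that the captured audio already sits in a WAV file, and the 16 kHz LINEAR16 config are all assumptions on my part.

```brightscript
' Rough sketch of the request SpeechRecognizer sends (illustrative only).
function transcribe(wavPath as string, apiKey as string) as dynamic
    ' Base64-encode the audio previously captured via roMicrophone
    audio = CreateObject("roByteArray")
    audio.ReadFile(wavPath)

    request = {
        config: {
            encoding: "LINEAR16"             ' 16-bit PCM
            sampleRateHertz: 16000           ' assumed capture rate
            languageCode: "en-US"
            enableAutomaticPunctuation: true ' gives us "Hi Roku, can you hear me?"
        }
        audio: { content: audio.ToBase64String() }
    }

    port = CreateObject("roMessagePort")
    http = CreateObject("roUrlTransfer")
    http.SetMessagePort(port)
    http.SetCertificatesFile("common:/certs/ca-bundle.crt")
    http.InitClientCertificates()
    http.SetUrl("https://speech.googleapis.com/v1/speech:recognize?key=" + apiKey)
    http.AddHeader("Content-Type", "application/json")

    ' PostFromString() only returns the HTTP code, so use the async
    ' variant and wait for the roUrlEvent to read the response body.
    if http.AsyncPostFromString(FormatJson(request))
        msg = wait(0, port)
        if type(msg) = "roUrlEvent" and msg.GetResponseCode() = 200
            return ParseJson(msg.GetString())
        end if
    end if
    return invalid
end function
```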

Make sure you go through all the comments in SpeechRecognizer.brs; they explain why and how everything is set up, with references to the documentation. Also, before continuing, we have to replace YOUR_API_KEY with the actual API key generated in the steps above.

To use the SpeechRecognizer component, we just need to do:
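The snippet from the starter project isn’t reproduced here, but in essence it amounts to something like the following two lines (the Task-based design and field names are my assumptions):

```brightscript
' Create the recognizer node and kick it off -- assuming SpeechRecognizer
' is a Task node that starts when its control field is set to "run".
m.speechRecognizer = CreateObject("roSGNode", "SpeechRecognizer")
m.speechRecognizer.control = "run"
```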

…and the delegate callback must also be implemented; we’ll get to that shortly.

So let’s use the SpeechRecognizer component. Roku’s microphone is only accessible while the “OK” button is held, so we must run the speech recognizer on an “OK” button press. To do that, we simply implement the onKeyEvent function and listen for the “OK” button. We’ll also add a helper function that instantiates the recognizer only when needed. Let’s update AppScene by adding:
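The original snippet isn’t shown in this excerpt; a plausible version of what gets added to AppScene.brs looks like this (the helper name and the component’s field names are assumptions):

```brightscript
' Run the recognizer on every "OK" press; return true to consume the key.
function onKeyEvent(key as string, press as boolean) as boolean
    if press and key = "OK"
        runSpeechRecognizer()
        return true
    end if
    return false
end function

' Instantiate the recognizer lazily, then (re)start it.
sub runSpeechRecognizer()
    if m.speechRecognizer = invalid
        m.speechRecognizer = CreateObject("roSGNode", "SpeechRecognizer")
    end if
    m.speechRecognizer.control = "run"
end sub
```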

Now, side-load the app again, press and hold the “OK” button, and say something, like “Hi Roku, can you hear me?” (You don’t have to explicitly say the punctuation marks; they will be detected automatically.) If everything is set up correctly, the OS should display a “listening” animation at the bottom of the screen, notifying the user that the microphone is being used.

Also, you should see in the BrightScript console a printout similar to this:

ResponseCode 200
Response {
  "results": [
    {
      "alternatives": [
        {
          "transcript": "Hi Roku, can you hear me?",
          "confidence": 0.93763953
        }
      ]
    }
  ]
}

Do you see it? YES!! This is HUGE, we just converted speech to text on a Roku device!

Now the only thing left is to implement the SpeechRecognizer delegate callback, so that we can display the results on our label.

But first, we have to declare the callback function in AppScene.xml.
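In SceneGraph, a component function is only callable from outside (e.g. by the SpeechRecognizer Task) if it is declared in the component’s interface. The declaration looks something like this; the function name is an assumption, not necessarily what the starter project uses:

```xml
<!-- AppScene.xml -->
<interface>
    <function name="onSpeechRecognized" />
</interface>
```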

And then we implement that function in AppScene.brs, like so:
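Again, the original snippet isn’t reproduced here; a minimal sketch, assuming the label’s id is transcriptLabel and the callback receives the transcript string, would be:

```brightscript
' AppScene.brs -- label id and parameter shape are assumptions
sub onSpeechRecognized(transcript as string)
    m.top.findNode("transcriptLabel").text = transcript
end sub
```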

Side-load one last time, press and hold the “OK” button, and say the same phrase or whatever you want to. The label should be updated with what you said after you release the “OK” button.

Are you seeing the same? PERFECT!! That’s it, now you can say that you successfully “applied the most advanced deep learning neural network algorithms to audio for speech recognition with unparalleled accuracy” on a Roku. I bet you always wanted to say that.

The complete project is available here.

Some tips

  • Modern applications support real-time speech recognition, meaning that as the user speaks, letters/words start displaying. I would be interested in seeing how you would implement that. As a hint, Google also provides an RPC API, and that API has a method named StreamingRecognize 😉.
  • Know your limits. Speech recognition is neither cheap nor limitless, so you should check the pricing and limitations. Place and consume this feature strategically within your app; to start, cap the time the user can record audio at 15s–30s, which should fit most apps.
  • Your app should be able to gracefully lock/restore input interactions depending on the request state.

That’s all for now, thanks for reading! See you next time!
