Say That Again!

Exploring the Web Speech API

While developing my final project for Flatiron School, I was investigating different ways of transcribing speech to text and discovered some pretty cool APIs.

External API Options

My first step was to look for services that could transcribe speech to text. After a bit of googling, it seemed that the two best services were the Google Speech API and Watson Speech to Text.

The biggest problem that I encountered was that both of these services required that I upload or stream an audio file to them. My project required the user to be able to speak for an unlimited amount of time, and I began to run into problems dealing with audio files that were too large.

I continued looking for a solution that would allow me to transcribe speech without having to create and deal with large files. Finally, I stumbled onto the Web Speech API, which is built into the browser and available directly from JavaScript.

Web Speech API

The Web Speech API has two functions: it can recognize speech and return text, and it can synthesize speech from text. Recognition works by accessing your computer’s microphone and sending any speech it receives to a speech recognition service. In Chrome, the Google Speech API is used.

The first thing we need to do is create a new instance of the SpeechRecognition object (Chrome exposes it as webkitSpeechRecognition) and assign it some properties: what language to listen for (lang), whether to listen for continuous speech (continuous), and how many guesses we want the service to return to us (maxAlternatives).
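A minimal sketch of that setup might look like the following. The helper name createRecognizer and the values assigned to each property are assumptions for illustration. The global scope is passed in as an argument (window in a browser) so the prefixed and unprefixed constructors can both be checked.

```javascript
// Hypothetical helper: builds a configured recognizer from a global scope.
function createRecognizer(scope) {
  // Chrome exposes the constructor with a webkit prefix.
  const Recognition = scope.SpeechRecognition || scope.webkitSpeechRecognition;
  if (!Recognition) {
    throw new Error('Speech recognition is not supported in this browser');
  }

  const recognition = new Recognition();
  recognition.lang = 'en-US';      // language to listen for (assumed value)
  recognition.continuous = true;   // keep listening after the first result
  recognition.maxAlternatives = 1; // how many guesses to return per result
  return recognition;
}
```

In a browser this would be called as createRecognizer(window).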

The speech recognition service has three useful methods for controlling when it is and isn’t listening.

.start() starts the speech recognition service listening.

.abort() stops the service from listening without attempting to return a result.

.stop() stops the service from listening and attempts to return a SpeechRecognitionResult from the audio captured so far.
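To make the difference between the three methods concrete, here is a small sketch that wraps them in a controller object. createControls and the names listen, finish, and cancel are hypothetical. Only start(), stop(), and abort() come from the API itself.

```javascript
// Hypothetical helper: gives the three control methods descriptive names.
function createControls(recognition) {
  return {
    listen: () => recognition.start(), // begin listening
    finish: () => recognition.stop(),  // stop, but still try to return a result
    cancel: () => recognition.abort(), // stop and discard, no result returned
  };
}
```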

Now that we have created our instance and set our parameters, how do we interact with it? The SpeechRecognition object comes with a bunch of events that we can use to create the functionality we need.

For this example, we will add an event listener for the result event, which is triggered when the speech-to-text service returns a result. After adding our event listener, we can call .start() on our object to begin listening.
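Those two steps can be sketched roughly as follows. startListening and its onResult callback are hypothetical names. The recognizer is assumed to be an already configured SpeechRecognition instance.

```javascript
// Hypothetical helper: registers a handler for the result event, then starts.
function startListening(recognition, onResult) {
  // The result event fires each time the service returns a transcription.
  recognition.addEventListener('result', onResult);
  recognition.start();
  return recognition;
}
```

A call such as startListening(recognition, (event) => console.log(event.results)) would log the raw results each time the service responds.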

When our event listener is triggered, it receives a SpeechRecognitionEvent whose results property is a SpeechRecognitionResultList containing SpeechRecognitionResult objects. Each result in turn holds one or more SpeechRecognitionAlternative objects, and these have two keys: transcript, whose value is a string of the converted text, and confidence, which is a decimal value that represents the speech recognition service’s estimate of its own accuracy.
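As a sketch of how that nesting can be unpacked, the hypothetical helper below flattens a result event into plain objects. It takes the first alternative of each result, which is typically the service's most likely guess, and uses an index loop because the result list is array-like.

```javascript
// Hypothetical helper: pulls transcript and confidence out of a result event.
function readResults(event) {
  const heard = [];
  // event.resultIndex is the first result that changed in this event.
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const firstAlternative = event.results[i][0];
    heard.push({
      transcript: firstAlternative.transcript,
      confidence: firstAlternative.confidence,
    });
  }
  return heard;
}
```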

This example is set up to log the speech that is heard by the computer, but there is much more functionality than that built in.

You can use SpeechGrammarList to give the service a list of words to listen for and then create different behavior based on what is said. “Alexa!”
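Here is a rough sketch of that idea, assuming the grammar is written in the JSGF format. addWordGrammar and findCommand are hypothetical helpers, and the grammar name commands is made up. Support for grammars varies between browsers and some implementations may ignore them, so the sketch also checks the transcript itself.

```javascript
// Hypothetical helper: attaches a JSGF grammar built from a word list.
function addWordGrammar(recognition, GrammarList, words) {
  const grammar =
    `#JSGF V1.0; grammar commands; public <command> = ${words.join(' | ')} ;`;
  const list = new GrammarList();
  list.addFromString(grammar, 1); // weight ranges from 0 to 1
  recognition.grammars = list;
  return grammar;
}

// Hypothetical helper: returns the first known word found in a transcript.
function findCommand(transcript, words) {
  const spoken = transcript.toLowerCase();
  return words.find((word) => spoken.includes(word.toLowerCase())) || null;
}
```

In a browser, the GrammarList argument would be window.SpeechGrammarList or, in Chrome, window.webkitSpeechGrammarList.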

Gotchas

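Browser support for the Web Speech API is quite limited. Google Chrome has the most complete implementation of speech recognition (under the webkit prefix), while Firefox supports speech synthesis but does not enable speech recognition by default.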

It’s also worth noting that the service will sometimes stop listening on its own if it thinks the user has stopped talking. This was very problematic for me, because I needed the service to listen continuously until told to stop. To get around it, I ended up using setInterval() to restart the speech object a few seconds after it turns itself off.
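The setInterval() approach is not reproduced here. As a variation on the same idea, the sketch below restarts the recognizer from its end event, which fires when the service disconnects. This is an assumption about one way to do it, not the code the project used, and keepListening is a hypothetical helper.

```javascript
// Hypothetical helper: restarts the recognizer whenever it ends on its own.
function keepListening(recognition) {
  let wanted = true; // becomes false once the caller asks to stop

  recognition.addEventListener('end', () => {
    // Only restart if the stop was not requested by us.
    if (wanted) recognition.start();
  });
  recognition.start();

  // Call the returned function to stop for good.
  return function stopListening() {
    wanted = false;
    recognition.stop();
  };
}
```

Calling const stopListening = keepListening(recognition) starts the loop, and stopListening() ends it.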