How do you develop a speech recognition feature on a web page that has to be completely compatible with the most widely used browsers, such as Chrome, Firefox, Safari, Opera and Edge?
We’re going to talk about our choices, the pain points we faced and the solutions we found. Please note that this is not a tutorial: there will be some code snippets, but you’re not going to find a completely functioning demo.
In order to provide our users with a great conversational experience we need to:
- Get audio stream from browser
- Stream audio data to a speech recognition service and get real time results as our user talks
1. Getting audio stream from browser
This part is easy: we can use the getUserMedia/Stream API, which is supported by the most popular browsers.
Using this browser API we can get the audio stream from the microphone, so we can send it to our speech recognition service.
We then initialize the AudioContext and call the getUserMedia API.
While this configuration is compatible with major browsers, the parameters passed to the getUserMedia API may differ depending on your needs. For more details you can check the official documentation.
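A minimal sketch of this initialization, with the constraint values as assumptions you should tune to your own needs:

```javascript
// Constraint values are assumptions; adapt them to your use case.
const constraints = {
  audio: true,
  video: false,
};

// Not invoked here: requires a browser with the MediaDevices API.
async function getMicrophoneStream() {
  // Prompts the user for microphone permission and resolves
  // with a MediaStream on success.
  return navigator.mediaDevices.getUserMedia(constraints);
}
```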
Now, let’s set the configuration for the audio context:
The configuration of the audio context and script processor may differ depending on your needs. A pain point is that some combinations of parameters currently do not work on Safari.
Please note the registration of the “audioprocess” event, which acts as the “ondata” event for the microphone stream.
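A sketch of this setup, where the buffer size and channel counts are assumptions (and, as noted, some combinations may not work on Safari):

```javascript
// bufferSize and channel counts are assumptions; tune to your needs.
const BUFFER_SIZE = 4096;  // samples delivered per audioprocess event
const NUM_CHANNELS = 2;    // stereo input from the microphone

// Not invoked here: requires a browser AudioContext and a MediaStream.
function setupAudioProcessing(audioContext, stream, onAudioData) {
  const source = audioContext.createMediaStreamSource(stream);
  const processor = audioContext.createScriptProcessor(
    BUFFER_SIZE, NUM_CHANNELS, NUM_CHANNELS);

  // "audioprocess" is the "ondata" event for the microphone stream.
  processor.onaudioprocess = onAudioData;

  source.connect(processor);
  processor.connect(audioContext.destination);
  return processor;
}
```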
Here’s the function called by the audioprocess event. Note that the audio is captured in stereo, so two audio channels come from the microphone. We keep only one of them, because the speech recognition service needs single-channel audio, but depending on your needs you can treat the audio as you prefer.
Note that the data coming from the audio stream is a Float32Array typed array, which means we will need to convert it before sending it to our speech recognition service; we’ll take a look at this later on.
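The handler can be sketched like this; `sendAudioChunk` is a hypothetical callback that forwards the samples to the backend:

```javascript
// Sketch of the audioprocess handler. `sendAudioChunk` is a
// hypothetical callback that forwards samples to the backend.
function makeAudioProcessHandler(sendAudioChunk) {
  return function onAudioProcess(event) {
    // getChannelData(0) returns one channel as a Float32Array of
    // samples in [-1, 1]; we ignore the other channel because the
    // recognition service expects mono audio.
    const monoSamples = event.inputBuffer.getChannelData(0);
    sendAudioChunk(monoSamples);
  };
}
```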
In the function stopping the recording process, we need to:
- Stop browser use of the microphone and reset variables, if any.
- Detach the listener on the audioprocess event, because the stream will continue to trigger events even after the microphone is deactivated.
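These two steps might look like this; the function and parameter names are assumptions, so adapt them to however you stored the stream and the processor node:

```javascript
// Sketch of stopping the recording. Names are assumptions.
function stopRecording(stream, processor) {
  // Release the microphone: stop every track of the MediaStream.
  stream.getTracks().forEach((track) => track.stop());

  // Detach the listener, otherwise audioprocess keeps firing
  // even after the microphone is deactivated.
  processor.onaudioprocess = null;
  processor.disconnect();
}
```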
2. Streaming audio data to a speech recognition service
Web Speech API
The easiest way to do this is to use the Web Speech API’s SpeechRecognition interface. With this API a web application can perform streaming speech recognition directly inside the browser, without the need for third-party services. And that’s awesome 🚀
As with every awesome innovation, there are some “minor” compatibility issues that make this approach completely useless for production applications: at the time of writing, SpeechRecognition is essentially supported only in Chrome.
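For reference, the Web Speech API approach would look roughly like this; this is a sketch of the browser-only path we could not ship:

```javascript
// Sketch of the Web Speech API approach (browser only; at the time
// of writing it is effectively Chrome-only, via a webkit prefix).
function startWebSpeechRecognition(onTranscript) {
  const SpeechRecognition =
    window.SpeechRecognition || window.webkitSpeechRecognition;
  if (!SpeechRecognition) {
    throw new Error('Web Speech API not supported in this browser');
  }
  const recognition = new SpeechRecognition();
  recognition.continuous = true;
  recognition.interimResults = true; // results while the user talks
  recognition.onresult = (event) => {
    const result = event.results[event.results.length - 1];
    onTranscript(result[0].transcript, result.isFinal);
  };
  recognition.start();
  return recognition;
}
```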
So how could you do speech recognition on a web page?
For a good user experience we need a real-time speech recognition system that returns results while the user is talking, rather than recording the full audio and transcribing it only afterwards.
So we realized that we need a third party service with these features:
- Streaming speech recognition
- Compatibility with every browser
Google Speech API
There are plenty of services on the web that enable applications to do speech recognition via APIs. The one we chose is Google Speech API.
Google Speech API provides a great streaming recognition service that is fast and accurate, and real-time response is very important for us.
The SDK provides both an asynchronous recognition service (via gRPC and REST APIs) and a real-time recognition service (through gRPC APIs only).
Our goal is to have real-time speech recognition, so we chose the streaming recognition service. Google provides this API only through gRPC, using the streaming feature of the gRPC protocol.
That’s why it’s not possible to stream the audio directly from the browser to Google: browsers can make gRPC calls (via gRPC-Web), but client-side streaming is not supported.
We chose to stream our data to a server, have the server stream it to the Google streaming recognition API, and stream the recognition results back the other way.
There aren’t many options for streaming audio data from browser to server: the only systems that can stream data from a browser are WebRTC and WebSocket. For simplicity we chose WebSocket to send the audio data to the server chunk by chunk.
We chose to develop the server in Node.js because of how simply it handles streams and how easy WebSockets are to implement; it could be written in other languages, such as Java, Python or Go, that are more efficient at handling large numbers of connections.
So the resulting infrastructure is:
And the information flow is:
- At record button click, connect to the WebSocket (we want to avoid long-lived connections that can slow down the server) and add a listener for audio stream data;
- Listen for events on browser microphone;
- Send the data buffer to backend via WebSocket;
- Stream the data from backend to Google Speech Recognition service;
- Listen for results;
- Send back results to frontend via web socket in order to visualize them as they arrive.
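The first and last steps on the browser side can be sketched as follows; the URL and the two callbacks are placeholders:

```javascript
// Sketch of the record-button handler: one short-lived WebSocket per
// recording session. serverUrl, startCapture and onResult are
// placeholders for your own endpoint and handlers.
function onRecordClick(serverUrl, startCapture, onResult) {
  const ws = new WebSocket(serverUrl);
  ws.binaryType = 'arraybuffer'; // we send binary audio chunks
  // Start capturing from the microphone only once the socket is open.
  ws.onopen = () => startCapture(ws);
  // Recognition results come back from the server as JSON messages.
  ws.onmessage = (msg) => onResult(JSON.parse(msg.data));
  return ws;
}
```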
Web socket implementation on browser
Note that the audio data stream is a Float32Array, and we need to convert it into a buffer and then convert it back on the server.
We tried to convert the Float32Array directly to a buffer, but we ended up receiving wrong audio values on the server. In the end we solved it by converting the Float32Array to an Int16Array before sending it to the server. Without this conversion the server does not receive the data in the right format.
The conversionFactor scales the floating-point samples in [-1, 1] to the 16-bit integer range. You can find more details here.
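A minimal sketch of this conversion and send step, assuming `ws` is an open browser WebSocket:

```javascript
// Maps float samples in [-1, 1] to the 16-bit signed integer range.
const conversionFactor = 0x7fff; // 32767, max positive Int16 value

function float32ToInt16(float32Samples) {
  const int16Samples = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    int16Samples[i] = float32Samples[i] * conversionFactor;
  }
  return int16Samples;
}

// Hypothetical send helper: `ws` is an open browser WebSocket.
function sendAudioChunk(ws, float32Samples) {
  // WebSocket.send accepts an ArrayBuffer, so send the backing buffer.
  ws.send(float32ToInt16(float32Samples).buffer);
}
```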
Web socket implementation on server
The implementation on the server is standard; there were only a couple of pain points for us:
- When to start the recognition stream via Google
- How to handle the difference between the stream data format coming from the web socket and the stream audio format for the speech recognition stream
First we need to instantiate a speech recognition stream; you can refer to this documentation for details.
Note that when a speech stream is created, a recognition request has been started, and Google has a limit of 65 seconds for the length of the speech recognition audio. This means we have to create the speech recognition stream only when the user pushes the record button on the web page and the server receives a WebSocket connection request from the browser.
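A sketch of that instantiation, assuming the official @google-cloud/speech Node.js client; the sample rate and language code are assumptions you should match to the audio actually captured on the browser side:

```javascript
// Sample rate and language are assumptions: match them to the audio
// you actually capture in the browser.
const streamingRequest = {
  config: {
    encoding: 'LINEAR16',    // 16-bit signed PCM, as sent by the browser
    sampleRateHertz: 44100,  // assumption: typical browser capture rate
    languageCode: 'en-US',   // assumption
  },
  interimResults: true,      // get partial results while the user talks
};

// `speechClient` would come from the official client, e.g.:
//   const speech = require('@google-cloud/speech');
//   const speechClient = new speech.SpeechClient();
function createRecognizeStream(speechClient) {
  // Returns a duplex stream: write audio in, read transcripts out.
  return speechClient.streamingRecognize(streamingRequest);
}
```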
Then we can stream the audio chunks arriving from browser to the speech API service.
Note that the buffer coming from the WebSocket is a standard buffer, an array of 8-bit integers. In the speech recognition settings we specified the audio encoding “LINEAR16”, which is an array of 16-bit integers, so we need to convert the standard buffer to an Int16Array before writing it to the speech recognition stream.
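This reinterpretation can be sketched in one small helper (note the assumption that both ends use the same, little-endian byte order):

```javascript
// Reinterpret the bytes of a Node Buffer as 16-bit signed samples.
// Assumes both ends use the platform's (little-endian) byte order.
function bufferToInt16(buffer) {
  // Each Int16 sample spans 2 bytes; the view shares memory, no copy.
  return new Int16Array(
    buffer.buffer, buffer.byteOffset, buffer.length / 2);
}
```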
The recognize stream, returned by the Google speech recognition library, is a standard stream that can be handled this way:
It’s very important to end the recognition stream when the WebSocket closes. This way we avoid triggering the Google speech recognition error for streaming audio that is too long.
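The wiring between the two streams might look like this; `ws` is a connected server-side socket (e.g. from the `ws` package) and `recognizeStream` comes from `streamingRecognize`, with the result shape based on the streaming recognition response:

```javascript
// Sketch: forward transcripts to the browser and end the recognition
// stream when the socket closes.
function wireStreams(ws, recognizeStream) {
  // Forward every (partial or final) transcript to the browser.
  recognizeStream.on('data', (data) => {
    const result = data.results[0];
    if (result && result.alternatives[0]) {
      ws.send(JSON.stringify({
        transcript: result.alternatives[0].transcript,
        isFinal: result.isFinal,
      }));
    }
  });

  // End the recognition stream when the socket closes, so we never
  // hit Google's maximum audio stream length error.
  ws.on('close', () => recognizeStream.end());
}
```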
As a result we get a cross-browser speech recognition system that simplifies the user experience: people can read what the system is understanding in real time, which makes them more comfortable and accustomed to speech interaction.