Speech recognition in browsers, 2018

Seoul Engineer
Dec 18, 2018


Kaleidos, an open source company based in Madrid, Spain, hosts #PiWeek, a biannual open hack-week event which I had a chance to attend this year, supported by Máximo Cuadros and a bunch of saved OSDs at source{d}.

One of the teams I know, introduced to me by Xaviju, is working with the in-browser WebSpeech API to build a voice assistant for cooking.

As someone keen on the engineering aspects of applied Machine Learning, I took a quick peek under the hood of two major browsers to better understand how it actually works and what implementation of this W3C API draft sits behind the browsers’ voice capabilities.

WebSpeech API

WebSpeech API consists of two parts: recognition and synthesis. We’ll focus only on the recognition part here, as the first goal of the team was to activate the assistant.

The first thing one notices when looking at this API is that not every browser supports it.
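
Checking which flavor, if any, a given browser exposes is a matter of plain feature detection. A minimal sketch in TypeScript (the prefixed constructor is not in the standard typings, hence the any casts):

// Feature detection for the recognition part of the WebSpeech API.
// Chrome only ships the constructor under the -webkit prefix.
const SpeechRecognitionCtor =
  (window as any).SpeechRecognition ||
  (window as any).webkitSpeechRecognition;

if (SpeechRecognitionCtor) {
  console.log("SpeechRecognition is available in this browser");
} else {
  console.log("SpeechRecognition is not supported here");
}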

A practical question would be: does the speech recognition model work on-device, or does it require an Internet connection?

TL;DR

WebSpeech API recognition works only in Chrome, and it requires a connection to a Google service (even though that connection does not show up in DevTools).

Keep reading for more details and a documented discovery process.

Google Chromium

Status: the speech recognition API is implemented, but support is partial and behind the -webkit prefix.
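
For reference, a minimal recognition session against the prefixed constructor looks roughly like this (a sketch, not production code; typings are stubbed with any since the prefixed API is not in the standard TypeScript lib):

// Start a one-shot recognition session in Chrome via the prefixed API.
const recognition = new (window as any).webkitSpeechRecognition();
recognition.lang = "en-US";
recognition.interimResults = false;

recognition.onresult = (event: any) => {
  // event.results is a SpeechRecognitionResultList;
  // [0][0] is the top alternative of the first result.
  console.log("Recognized:", event.results[0][0].transcript);
};

// Prompts for microphone access, then starts streaming audio.
recognition.start();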

As the Chromium source code is kindly indexed by the awesome Kythe.io team, one can quickly find speech_recognition_engine.h on cs.chromium.org:

// A speech recognition engine supporting continuous recognition by means of
// interaction with the Google streaming speech recognition webservice.
//
// This class establishes two HTTPS connections with the webservice for each
// session, herein called "upstream" and "downstream". Audio chunks are sent on
// the upstream by means of a chunked HTTP POST upload.

The C++ implementation of the Chromium speech recognition engine reveals the HTTPS endpoint that Chromium uses: https://www.google.com/speech-api/full-duplex/v1

This is quite easy to verify: turning wifi off breaks the demo. Interestingly enough, no external connection shows up in Chrome DevTools.
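
The failure can also be observed programmatically: with the backend unreachable, the recognition object fires an error event whose error field is "network". A self-contained sketch:

// With no connection, Chrome reports a "network" error instead of a
// result, hinting that recognition depends on a remote service.
const rec = new (window as any).webkitSpeechRecognition();
rec.onerror = (event: any) => {
  if (event.error === "network") {
    console.log("Recognition backend unreachable; no offline fallback");
  } else {
    console.log("Recognition error:", event.error);
  }
};
rec.start();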

The same demo does not quite work in an open source Chromium build, probably because it hits quotas while missing some API keys.

Android

On Android, Chrome for mobile seems to rely on the OS service android.speech.RecognitionService.

The Android Open Source Project is not covered by the kythe.io semantic index, but a public OpenGrok instance reveals the SpeechRecognizer.java interface:

The implementation of this API is likely to stream audio to
remote servers to perform speech recognition.

The concrete implementation of SpeechRecognizer seems not to be available, or at least is hard to find.

So it might or might not work without an Internet connection to a proprietary service; this is considered an implementation detail and is neither documented nor available as part of the open source code.

Some details of the service’s policy for handling your streamed voice data are available in Google’s whitepaper; for example, it states that cookies are not sent with such requests.

Fun fact: a quick look at the Chromium project issue tracker shows users complaining that results of the recognition API are censored/filtered for “bad words” in different languages.

There seems to be a workaround to avoid the filtering, but nevertheless it clearly illustrates how ML models are just embedded opinions, and in our current brave new world users do not have much information about, or control over, them.

On-device speech capabilities

What about offline, on-device speech capabilities in Chromium? A search in the issue tracker reveals issue 816095, a feature request closed as WontFix in May 2018.

That is a bit sad, given all the recent progress in embedded ML and the deployment of advanced models like WaveNet to production by multiple companies. There is now even an OSS implementation of both synthesis and recognition using the WaveNet model, so there does not seem to be a strong technical reason for not including this in the browser.

But of course, if a company develops an operating system and a browser and wants to grow its cloud business, there will most probably be a strong non-technical incentive not to include such models in open source projects.

It may require a non-trivial R&D effort, and if one gets such a model running, especially on mobile devices, it may be a good opportunity to start a company.

Funnily enough, according to the article above, to do so one might need some insights:

The startup counts Ian Hodson — the former head of Google’s text-to-speech program … The market for text-to-speech applications is expected to grow to $3 billion by 2022, according to Research and Markets, and sales of digital assistants could hit $4 billion by the same year.

So, it’s time to look for open source alternatives.

Mozilla Firefox

In Firefox, the WebSpeech API is hidden behind the media.webspeech.recognition.enable feature flag in about:config.

A plain-text search for speech recognition on the GitHub mirror reveals SpeechRecognition.cpp in Firefox, which seems to use an external project: PocketSphinx 🎉

Sounds promising, but let’s dig deeper using better code navigation for the Firefox codebase.

The Mozilla organization does not use Kythe, but did some amazing engineering and built its own source code search & navigation engine: DXR.

On a hosted instance, nsISpeechRecognitionService.h can be found, along with @mozilla.org/webspeech/service;1?name= but nothing else really shows up 😕

The existing developer documentation seems to confirm the guess, but is a bit outdated: there is no PocketSphinxSpeechRecognitionService class anywhere in the codebase.

And indeed, it was introduced in 2014, and in September 2017 all code that depended on PocketSphinx was removed under bug 1396158.

So it seems that by the end of 2018 in Firefox, the old offline PocketSphinx-based implementation has been removed, and no replacement implementation of the API has landed yet.

Which means that a hackweek demo due by the end of the week for a project with the WebSpeech API will most probably not be using Firefox in 2018.

Conclusions

  • Only Chromium- and Firefox-based browsers expose the WebSpeech API
  • Chromium: as of the end of 2018, it ships the only implementation of the speech recognition API that actually works. It is online-only, so it sends all recorded audio data to a service at https://www.google.com/speech-api/full-duplex/v1
  • Chromium: there are no public plans for on-device, offline speech recognition (the feature request was closed as WontFix)
  • Chromium: by default, online speech recognition service filters for “bad words” and returns `***`
  • Firefox: as of the end of 2018, the old offline PocketSphinx-based implementation of recognition has been removed, and there is no implementation of this API yet
  • Firefox: there is a WIP patch with an implementation of online speech recognition through a service https://speaktome-2.services.mozilla.com
  • Firefox: there is an unconfirmed plan to add offline recognition through the DeepSpeech project

Disclaimer: this is just a series of educated guesses.
All opinions are personal, and the author is not a browser developer.
