It’s Time for Local Speech Recognition

Creating an offline voice-enabled timer with Electron & Vue

David Bartle
Picovoice
May 21, 2021


Smart devices are surprisingly dumb at voice recognition: to understand our commands, they merely pipe microphone data to a technology giant’s data center for processing. A direct microphone connection from a private residence to a massive corporation carries uncomfortable privacy implications. Smart home skills are useful, though, so what if we could recreate that experience locally and keep our sensitive data to ourselves?

The voice timer lights up after hearing the wake word (“Computer”) then processes “Start a timer for three minutes and two seconds”

A voice-controlled timer is a great smart home skill to have around: in the kitchen, your hands may be occupied, or covered in flour. Let’s build a proof-of-concept timer app with Electron and Vue, using offline Voice AI powered locally by WebAssembly instead of a cloud API. A kitchen timer doesn’t need to tell you jokes, order toothpaste, or recite fun facts, but making it work offline is a win for robustness and privacy.

Flipping speech inference upside-down

In the current cloud-powered status quo, processing naturally spoken sentences typically means anticipating every possible word and phrase in a language, transcribing the audio to text, then deciphering the free-form text to infer meaning.

Speech-to-Intent is the reverse approach: create a hyper-specialized voice model that by default understands nothing, then define the grammar that solves a particular problem (e.g., a kitchen timer). It’s easy to see why this focused approach can provide dramatic improvements over generic speech recognition.

The Picovoice Rhino Speech-to-Intent engine combines this “from scratch” strategy with extreme efficiency, producing complete voice models that require very little RAM and CPU (they even run on microcontrollers). Building a bespoke voice model sounds daunting, but it’s actually about as straightforward as building an Alexa Skill; thanks to transfer learning techniques, no audio training data is needed, just textual descriptions of the words and phrases, and a grammar to tie them all together.

Designing the Voice Model

We can start by thinking about the types of expressions we want our app to respond to. For example:

“Set a timer for 5 minutes”

From this spoken utterance, we want to know what the intent was (set a timer) and the specific information provided (5 minutes). If we design our model to understand this type of phrase, Rhino can process audio and then return a JSON object with all the relevant details:

“Set a timer for 5 minutes” → { "intent": "setTimer", "slots": { "minutes": "5" } }

To capture this utterance (and similar ones), we can use an expression like the following (a sketch of Rhino’s expression syntax; the exact expression in the tutorial’s grammar may differ):
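```
[set, start] (a, the) timer (for) $pv.SingleDigitInteger:minutes [minute, minutes]
```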

Phrases in square brackets provide a choice: exactly one of them must be spoken. Phrases in parentheses are entirely optional and may be omitted. See the syntax cheat sheet for more details. The item $pv.SingleDigitInteger:minutes is a “slot”: it matches one of a set of possible values (here, the built-in single-digit integers) and assigns whichever value was spoken to the “minutes” field of the output.

This single expression handles many variations in phrasing:

“Set a timer for 5 minutes”, “Start timer 2 minutes”, “Set the timer for 3 minutes”

… et cetera.

Here’s the grammar source file (YAML format) for this tutorial. It handles hours, minutes, and seconds along with distinct intents for “pause” and “resume”.
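Its overall shape is something like this sketch (the expressions here are illustrative and abridged; see the full grammar in the repository):

```yaml
context:
  expressions:
    setTimer:
      - "[set, start] (a, the) timer (for) $pv.SingleDigitInteger:minutes [minute, minutes]"
      - "[set, start] (a, the) timer (for) $pv.TwoDigitInteger:seconds [second, seconds]"
    pause:
      - "pause (the) timer"
    resume:
      - "[resume, unpause] (the) timer"
```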

Once we’re satisfied with our grammar, we can train it into a bespoke voice model. Contexts are designed and trained in Picovoice Console; the trained context model for this tutorial is available on GitHub under the Apache 2.0 license.

Coding a Voice-enabled Electron/Vue app

With a voice model ready, let’s build a quick app. Electron is a tremendously popular platform for making cross-platform desktop applications. We’ll use Electron with Vue as the front-end framework for the GUI. The Picovoice SDK provides a wake word engine, a Speech-to-Intent engine, and the state-machine logic that combines them into a continuous voice assistant interaction loop (e.g., “Jarvis, dim the living room lights”).

Logos for technology used in the tutorial

Electron is essentially Node.js and Chromium glued together, which raises the question of where our audio processing logic should live. We’ll do all of the audio processing on the Chromium side, which has full support for the Web Audio API and WebAssembly (the details of which are thankfully abstracted away completely by the SDK).

First, install a recent version of the @vue/cli if you have not already. We’ll create a new Vue 3 project, add Electron Builder, and then add the Picovoice SDK for Vue and its peer dependencies. The commands look something like this (the project name is arbitrary, and the Picovoice package names are assumptions; check the Picovoice documentation for the current ones):
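```bash
npm install -g @vue/cli               # if not already installed
vue create picovoice-vue-timer        # choose the Vue 3 preset; project name is arbitrary
cd picovoice-vue-timer
vue add electron-builder              # Vue CLI Plugin Electron Builder
npm install @picovoice/picovoice-vue @picovoice/web-voice-processor
```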

To keep the article succinct, I’ve provided the project on GitHub. If you’d like, clone it to obtain the final result.

  1. Initial setup of the package and dependencies (the commands shown above)
  2. Add the trained timer model (store it in the public directory) and its source grammar (YAML)
  3. Replace HelloWorld.vue with VoiceTimer.vue. It picks a built-in wake word (I chose “Computer”), provides our timer model as an argument, and outputs voice events to the JavaScript console
  4. Link the voice commands to timer logic

VoiceTimer.vue is the heart of the project. It imports the Picovoice composable function and calls its init function with the arguments described above. It watches wakeWordDetection and inference so it can react to voice commands. Calling start turns on the microphone and voice processing. The voice events emitted by Picovoice drive the timer, and the results are reflected in the GUI. Here’s a condensed sketch of the script portion (the composable, argument shapes, and model paths follow the description above but are assumptions; exact signatures may differ between SDK versions):
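```js
// VoiceTimer.vue — <script setup>, condensed. The init arguments and model
// paths below are assumptions for illustration; consult the Picovoice Vue
// SDK documentation for the exact signatures.
import { ref, watch, onMounted } from "vue";
import { usePicovoice } from "@picovoice/picovoice-vue"; // package name assumed

const { wakeWordDetection, inference, init, start } = usePicovoice();

const awake = ref(false);          // lights up the UI after the wake word
const remainingSeconds = ref(0);   // countdown state shown in the GUI
let ticker = null;

onMounted(async () => {
  await init(
    "${ACCESS_KEY}",                        // AccessKey from Picovoice Console
    { builtin: "Computer" },                // built-in wake word
    { publicPath: "porcupine_params.pv" },  // wake word model (assumed path)
    { publicPath: "timer.rhn" },            // our trained timer context
    { publicPath: "rhino_params.pv" }       // Speech-to-Intent model (assumed path)
  );
  await start(); // turns on the microphone and voice processing
});

// Wake word heard: light up and listen for a command.
watch(wakeWordDetection, (detection) => {
  if (detection !== null) awake.value = true;
});

// Inference received: route the intent to the timer logic.
watch(inference, (result) => {
  if (result === null) return;
  awake.value = false;
  if (!result.isUnderstood) return;
  const slots = result.slots ?? {};
  switch (result.intent) {
    case "setTimer":
      remainingSeconds.value =
        3600 * Number(slots.hours ?? 0) +
        60 * Number(slots.minutes ?? 0) +
        Number(slots.seconds ?? 0);
      startTicking();
      break;
    case "pause":
      clearInterval(ticker);
      break;
    case "resume":
      startTicking();
      break;
  }
});

function startTicking() {
  clearInterval(ticker);
  ticker = setInterval(() => {
    if (remainingSeconds.value > 0) remainingSeconds.value--;
  }, 1000);
}
```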

The finished demo

We’ve built a minimalist kitchen timer app that provides a small but useful skill without sending microphone data over the internet. Running on Electron makes our app portable, and Vue gives us a solid foundation to grow the application’s visual complexity while keeping it organized. We can extend the voice model to support things like timer names and queries, all while keeping everything 100% offline.
