Yet Another Wake-Word Detection Engine

Alireza Kenarsari
5 min read · Apr 25, 2018

A wake-word engine is a tiny algorithm that monitors a stream of audio for a special wake word and activates your voice assistant upon detecting it. Every time you say “Alexa”, “Hey Siri”, or “OK Google” you are using one. The rest of this article explains what wake-word engines are and why I thought it was a good idea to create a new one in 2018!

Things you should know about wake-word engines

1- It must run on the edge (not cloud) for multiple reasons:

  • Privacy.
  • Cost. It is simply impractical to stream audio from every IoT device to the cloud 24/7.
  • Power efficiency. Because it is always running, it must be extremely power efficient, especially on mobile/wearable devices.

2- Big companies (think Amazon, Google, Microsoft) have teams of scientists and engineers building their own wake-word engines.

3- There are a handful of competent wake-word engine vendors in the market, including Sensory, KITT.AI (acquired by Baidu), and Snips, as well as a freely available engine, PocketSphinx (from CMU).

Why another wake-word engine?

Integrating a wake-word engine is expensive, involves upfront cost, and takes time. Although affordable for big companies, this can be a show-stopper for startups that want to join the voice-enabled revolution. Furthermore, it can implicitly discourage customization and personalization even for bigger players.

This is due to how wake-word engines are trained today. In order to train an engine for “O Canada”, the vendor needs to collect recordings of hundreds of people saying “O Canada” and train a model just for that. The resulting model does one thing really well: it can detect “O Canada”. A lot of money and time is spent on data collection and custom model training. I know all this because in a previous life (prior to my career at Amazon) I built a platform that did exactly this, and it is currently licensed by some of the big players in the market.

I wanted to do my part in the democratization of voice interfaces by helping startups and innovators build what they want to build. Whether you are an established company building a world-class solution, an iOS/Android developer thinking of the next big voice application for mobile, or a hacker dreaming of crowdfunding your cool IoT gadget, I want to help you.

Introducing Porcupine

Porcupine is the wake-word engine we built at Picovoice. Here is my elevator pitch (in a very tall building) for Porcupine:

  • It is self-service: it does not require new training data to create a new wake word; only its hyper-parameters need to be optimized for the new phrase. For the customer, this means you don’t need to pay a hefty price to have a model built for your wake word, and you don’t need to wait for weeks before you can use it.
  • It is highly accurate (more on this below).
  • It is cross-platform (Raspberry Pi, Android, iOS, Linux, and Mac) and lightweight (3.8% CPU usage on a Raspberry Pi 3).
  • It is scalable in the sense that it can detect multiple wake words concurrently with no additional CPU/memory footprint.
  • It is partially open-sourced on GitHub under Apache 2.0. You can create new models for Linux, macOS, Windows, Android, iOS, web browsers, Raspberry Pi, NVIDIA Jetson, BeagleBone, and microcontrollers, or use a handful of existing models on all supported platforms (a minimal usage sketch follows this list).
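To give a feel for the developer experience, here is a minimal sketch of a detection loop using the pvporcupine Python package as it exists today. Note that the AccessKey requirement and the exact function names come from SDK versions newer than what this article originally described, and test.wav stands in for any 16 kHz, 16-bit, mono recording, so treat this as illustrative rather than definitive:

```python
import struct
import wave

import pvporcupine  # pip install pvporcupine

# NOTE: the AccessKey (from the Picovoice Console) is required by current SDK
# versions; it did not exist when this article was first written.
porcupine = pvporcupine.create(
    access_key='${ACCESS_KEY}',
    keywords=['porcupine', 'computer'],  # two built-in wake words, detected concurrently
)

# Porcupine consumes 16-bit, 16 kHz, single-channel PCM in frames of
# `frame_length` samples.
with wave.open('test.wav', 'rb') as wav:
    while True:
        raw = wav.readframes(porcupine.frame_length)
        if len(raw) < 2 * porcupine.frame_length:
            break
        pcm = struct.unpack_from('%dh' % porcupine.frame_length, raw)
        keyword_index = porcupine.process(pcm)  # -1 when no wake word in this frame
        if keyword_index >= 0:
            print('Detected keyword #%d' % keyword_index)

porcupine.delete()
```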

Every time I finish my pitch for an interested party, I get this comment: “It can’t be as accurate as other products in the market that specifically train a model for my wake word, right?”

But, it can! :-)

The assumption that “a model trained for a specific task will outperform a general model merely tuned for it” may have been true a few years ago, but deep learning has evolved quite a bit since then. When a model is trained for a general task, it has access to more data; today we have access to many labelled speech corpora with thousands of speakers, for example here and here. There have also been many advances in transfer learning. Even big players like Amazon are experimenting with transfer learning internally for their wake-word engines.

How much better is Porcupine?

In order to measure the performance of Porcupine, we created an open-source tool that facilitates benchmarking different wake-word engines (to the best of my knowledge, none was available beforehand). Although we are using it for performance evaluation of Porcupine, it is modular enough that it can be reused for different engines, datasets, etc.
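To make the numbers below concrete: the core measurement fixes a false-alarm budget on negative (non-wake-word) audio and reports the miss rate at the corresponding detection threshold. The following is my simplified sketch of that procedure, not the tool’s actual code; the score lists are assumed to come from a hypothetical per-engine adapter:

```python
def miss_rate_at_fixed_fa(pos_scores, neg_scores, negative_hours, fa_per_10h=1.0):
    """Miss rate at the threshold allowing at most `fa_per_10h` false alarms
    per 10 hours of negative audio.

    pos_scores: detection score for each true wake-word utterance.
    neg_scores: score of every candidate detection on the negative audio.
    """
    allowed = int(fa_per_10h * negative_hours / 10.0)
    neg_sorted = sorted(neg_scores, reverse=True)
    # Pick the highest threshold that keeps false alarms within budget:
    # only scores strictly above it count as detections.
    threshold = neg_sorted[allowed] if allowed < len(neg_sorted) else float('-inf')
    misses = sum(1 for s in pos_scores if s <= threshold)
    return misses / len(pos_scores)
```

The chart below uses exactly this kind of operating point: 1 false alarm per 10 hours.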

Here we are going to compare Porcupine with:

  • Snowboy (KITT.AI/Baidu): part of the Alexa Voice Service SDK.
  • PocketSphinx (CMU): open source and freely available.

There are six different wake words in this test (Alexa, Computer, Jarvis, Smart Mirror, Snowboy, and View Glass). The utterances of the wake words are crowdsourced; we have open-sourced the dataset here. The background speech is taken from LibriSpeech, and noise data (to simulate noisy environments) is taken from the DEMAND dataset. The dataset is mirrored here on Kaggle.
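For noisy conditions, speech and noise are mixed at a controlled signal-to-noise ratio (SNR). Here is a minimal numpy sketch of that general technique; it is my illustration, not the benchmark’s exact mixing code:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix 1-D float arrays `speech` and `noise` at the given SNR (in dB)."""
    noise = np.resize(noise, speech.shape)  # loop/trim the noise to match length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + scale * noise
```

Without further ado, here is the result: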

Accuracy comparison of wake word engines. The false alarm rate is set to 1 per 10 hours for all engines.

The figure above compares the accuracy of the different engines at 1 false alarm per 10 hours. PocketSphinx has the lowest accuracy of the bunch. Snowboy improves on PocketSphinx and lowers the average miss rate. Finally, Porcupine gets the best result, reducing the miss rate by more than 3x compared to Snowboy. BTW, Porcupine even runs on MCUs and DSP cores; it can run with 18 KB of RAM. Check it out in action below:

Porcupine wake word engine detecting multiple always-listening voice commands on Raspberry Pi Zero.

Why didn’t we compare with other vendors in the market? We don’t have access to their engines. Also, for the ones that are available, their EULAs are not permissive enough to allow publishing benchmark results.

If you liked this article please share it.

Feel free to visit Porcupine’s GitHub repository and play with it.
