A wake-word engine is a tiny algorithm that monitors a stream of audio for a special wake-word and activates your voice assistant upon detecting it. Every time you say “Alexa”, “Hey Siri”, or “OK Google” you are using one. The rest of this document talks about wake-word engines and why I thought it was a good idea to create a new one in 2018!
Things you should know about wake-word engines
1- It must run on the edge (not cloud) for multiple reasons:
- Cost. It is simply impractical to stream audio from every voice-enabled device to the cloud 24/7.
- Power efficiency. To run continuously, it must be extremely power efficient on mobile/wearable devices.
2- Big companies (think Amazon, Google, Microsoft) have teams of scientists and engineers to build their own wake-word engines.
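To put the cost argument from point 1 into numbers, here is a rough back-of-the-envelope sketch. The audio format (16 kHz, 16-bit, mono, uncompressed) and the fleet size of one million devices are my assumptions for illustration, not figures from any particular product.

```python
# Assumed audio format (an illustration, not a measured figure):
# 16 kHz sample rate, 16-bit samples, mono, uncompressed PCM.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2
SECONDS_PER_DAY = 24 * 60 * 60


def daily_upload_bytes(num_devices: int = 1) -> int:
    """Bytes of raw audio uploaded per day if every device streams nonstop."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * SECONDS_PER_DAY * num_devices


per_device_gb = daily_upload_bytes() / 1e9
# Hypothetical fleet of one million always-listening devices.
fleet_tb = daily_upload_bytes(1_000_000) / 1e12

print(f"{per_device_gb:.2f} GB per device per day")
print(f"{fleet_tb:.0f} TB per day for one million devices")
```

Under these assumptions a single device would upload roughly 2.76 GB of audio per day, nearly all of it containing no wake-word at all. Compression would shrink that number, but not the basic problem.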
Why another wake-word engine?
Integrating a wake-word engine is expensive, involves upfront cost, and takes time. Although affordable for big companies, this can be a show-stopper for startups that want to join the voice-enabled revolution. Furthermore, it can implicitly discourage customization and personalization even for bigger players.
This is due to how wake-word engines are trained today. In order to train an engine for “O Canada” the vendor needs to collect recordings of hundreds of people saying “O Canada” and train a model just for that. The model does one thing really well: it can detect “O Canada”. A lot of money and time is spent on data collection and custom model training. I know all this because in a previous life (prior to my career at Amazon) I built a platform that did exactly this and is currently being licensed by some of the big players in the market.
I wanted to do my part in the democratization of voice interfaces by helping startups and innovators build what they want to build. Whether you are an established company building a world-class solution, an iOS/Android developer thinking of the next big voice application for mobile, or a hacker dreaming of crowdfunding your cool IoT gadget, I want to help you.
- Porcupine is self-service, as it does not require new training data to create a new wake-word. Only its hyper-parameters need to be optimized for a new wake-word. What this means for the customer is that you don’t need to pay a hefty price for building a model for your wake-word and do not need to wait for weeks before being able to use it.
- It is highly accurate (more on this below).
- It is cross-platform (Raspberry Pi, Android, iOS, Linux, and Mac) and lightweight (7% CPU usage on Raspberry Pi 3).
- It is scalable in the sense that it can detect multiple wake-words concurrently without additional CPU/memory footprint.
- It is partially open-sourced on GitHub under the Apache 2.0 license. You can create new models for Linux/Mac and use a handful of existing models on all supported platforms.
Every time I finish my pitch for an interested party I get this comment: “It can’t be as accurate as other products in the market that specifically train a model for my wake-word, right?”.
But, it can! :-)
The assumption that “a model that is trained for a specific task would outperform a general model only tuned for it” could have been true a few years ago. But deep learning has evolved quite a bit since then. When a model is trained for a general task it has access to more data. Currently, we have access to many labeled speech corpora with thousands of speakers (for example, here and here). Also, there have been many advances in transfer learning. Even big players like Amazon are experimenting with the use of transfer learning internally for their wake-word engines.
How much better is Porcupine?
In order to measure the performance of Porcupine we created an open-source tool that facilitates benchmarking different wake-word engines (to the best of my knowledge there was none available beforehand). Although we are using it for performance evaluation of Porcupine, it is modular enough to be reused for different engines, datasets, etc.
Here we are going to compare Porcupine with:
- Snowboy (KITT.AI-Baidu): It is part of Alexa Voice Service SDK.
- Pocketsphinx (CMU): Open source and freely available.
There are six different wake-words for this test (Alexa, Computer, Jarvis, Smart Mirror, Snowboy, and View Glass). The utterances of wake-words are crowdsourced. We have open-sourced the dataset here. The background speech is taken from LibriSpeech. Finally, noise data (to simulate noisy environments) is taken from the DEMAND dataset. The dataset is mirrored here on Kaggle. Without further ado, here is the result:
The figure above compares the miss probability of different engines at 1 false alarm per hour. The lower the miss probability, the more accurate the engine is. Pocketsphinx has the highest miss rate among the engines. Snowboy improves on Pocketsphinx and lowers the average miss rate. Finally, Porcupine gets the best result of the bunch and reduces the miss rate by more than a factor of 3 compared to Snowboy. BTW, Porcupine compressed is a variant of Porcupine specifically designed for MCUs and DSP cores. It can run with 18 KB of RAM. Check it out in action below.
Why didn’t we compare with other vendors in the market? We don’t have access to their engines. Also, for the ones that are available, the EULA is not permissive enough to let us publish the benchmarking results.
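For readers who want to see what “miss probability at 1 false alarm per hour” means in practice, here is a minimal sketch of how such a number could be computed. It is not the code of the benchmarking tool. The `OperatingPoint` type, the function names, and all counts below are hypothetical and made up for illustration. It also assumes the engine exposes a sensitivity setting and that we pick the best setting that stays within the false-alarm budget.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OperatingPoint:
    """Counts measured for one engine at one sensitivity setting (hypothetical)."""
    sensitivity: float
    misses: int        # wake-word utterances the engine failed to detect
    false_alarms: int  # detections fired when no wake-word was spoken


def miss_rate(misses: int, total_utterances: int) -> float:
    return misses / total_utterances


def false_alarms_per_hour(false_alarms: int, audio_hours: float) -> float:
    return false_alarms / audio_hours


def miss_rate_at_budget(
    points: List[OperatingPoint],
    total_utterances: int,
    audio_hours: float,
    max_fa_per_hour: float = 1.0,
) -> Optional[float]:
    """Lowest miss rate among settings within the false-alarm budget."""
    eligible = [
        miss_rate(p.misses, total_utterances)
        for p in points
        if false_alarms_per_hour(p.false_alarms, audio_hours) <= max_fa_per_hour
    ]
    return min(eligible) if eligible else None


# Made-up counts: 100 wake-word utterances mixed into 10 hours of audio.
points = [
    OperatingPoint(sensitivity=0.3, misses=30, false_alarms=2),
    OperatingPoint(sensitivity=0.5, misses=12, false_alarms=9),
    OperatingPoint(sensitivity=0.7, misses=5, false_alarms=25),
]
result = miss_rate_at_budget(points, total_utterances=100, audio_hours=10.0)
print(result)
```

In this made-up example the most sensitive setting misses the fewest utterances, but it fires 2.5 false alarms per hour and is therefore excluded. Fixing the false-alarm rate first is what makes the miss rates of different engines comparable.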
If you liked this article please share it.
Feel free to visit Porcupine’s GitHub repository and play with it.