Yet Another Wake-Word Detection Engine
A wake-word engine is a tiny algorithm that monitors a stream of audio for a special wake-word and activates your voice assistant upon detecting it. Every time you say “Alexa”, “Hey Siri”, or “OK Google” you are using one. The rest of this document talks about wake-word engines and why I thought it was a good idea to create a new one in 2018!
Things you should know about wake-word engines
1- It must run on the edge (not cloud) for multiple reasons:
- Cost. It is simply impractical to stream audio from every voice-enabled device to the cloud 24/7.
- Power efficiency. Because it is always running, it must be extremely power efficient on mobile/wearable devices.
2- Big companies (think Amazon, Google, Microsoft) have teams of scientists and engineers to build their own wake-word engines.
Why another wake-word engine?
Integrating a wake-word engine is expensive, involves upfront cost, and takes time. Although affordable for big companies, this can be a show-stopper for startups that want to join the voice-enabled revolution. Furthermore, it can implicitly discourage customization and personalization even for bigger players.
This is due to how wake-word engines are trained today. In order to train an engine for “O Canada”, the vendor needs to collect recordings of hundreds of people saying “O Canada” and train a model just for that. The model does one thing really well: it can detect “O Canada”. A lot of money and time is spent on data collection and custom model training. I know all this because in a previous life (prior to my career at Amazon) I built a platform that did exactly this and is currently being licensed by some of the big players in the market.
I wanted to do my part in the democratization of voice interfaces by helping startups and innovators build what they want to build. Whether you are an established company building a world-class solution, an iOS/Android developer thinking of the next big voice application for mobile, or a hacker dreaming of crowdfunding your cool IoT gadget, I want to help you. That is why I built Porcupine, a wake-word engine with the following properties:
- It is self-service, as it does not require new training data to create a new wake-word. Only its hyper-parameters need to be optimized for a new wake-word. What this means for the customer is that you don’t need to pay a hefty price for building a model for your wake-word and do not need to wait for weeks before being able to use it. In fact, we create a model for free within a day of the licensing agreement.
- It is highly accurate (more on this below).
- It is cross-platform (Raspberry Pi, Android, iOS, Linux, and Mac) and lightweight (7% CPU usage on Raspberry Pi 3).
- It is scalable in the sense that it can detect multiple wake-words concurrently without additional CPU/memory footprint.
- It is partially open-sourced on GitHub under Apache 2. You can create new models for Linux/Mac and use a handful of existing models on all supported platforms.
Every time I finish my pitch for an interested party I get this comment: “It can’t be as accurate as other products in the market that specifically train a model for my wake-word, right?”.
But, it can! :-)
The assumption that “a model that is trained for a specific task would outperform a general model only tuned for it” could have been true a few years ago. But deep learning has evolved quite a bit since then. When a model is trained for a general task, it has access to more data. Currently, we have access to many labeled speech corpora with thousands of speakers, for example here and here. Also, there have been many advances in transfer learning. Even big players like Amazon are experimenting with the use of transfer learning internally for their wake-word engines.
How much better is Porcupine?
In order to measure the performance of Porcupine we created an open-source tool that facilitates benchmarking different wake-word engines (to the best of my knowledge there was none available beforehand). Although we are using it to evaluate Porcupine, it is modular enough to be reused for different engines, datasets, etc.
Here we are going to compare Porcupine with:
- Snowboy (KITT.AI-Baidu): It is part of Alexa Voice Service SDK.
- Pocketsphinx (CMU): Open source and freely available.
The wake-word for this test is “Alexa”, as Snowboy already has a model for it on its GitHub repository. The utterances of the wake-word were crowdsourced using Mechanical Turk and recorded on Android mobile devices. We have open-sourced the dataset here on Kaggle. The background speech is taken from Mozilla’s Common Voice project. Finally, noise data (to simulate noisy environments) is taken from the DEMAND dataset. The dataset is mirrored here on Kaggle. Without further ado, here are the ROC curves:
The horizontal axis is the number of false alarms per hour and the vertical axis is the miss rate (the fraction of wake-word utterances that go undetected). The lower the curve, the better. Two conditions have been tested separately: quiet environments and noisy environments with noise mixed in at 10 dB SNR. Each point in the figure is created by setting a different threshold (aka sensitivity) for the wake-word engines and running the whole dataset through them. A more compact way of looking at this is to compare the engines’ miss rates for a given false alarm rate, as below:
Pocketsphinx has the highest miss rate of the three engines, with an average of 45% in clean and 52% in noisy conditions. Snowboy achieves a significant improvement over Pocketsphinx and lowers the average miss rate to 25% in clean and 36% in noisy conditions. Finally, Porcupine gets the best result of the bunch with 9% in clean and 19% in noisy conditions.
Why didn’t we compare with other vendors in the market? We don’t have access to their engines. Also, for the ones that are available, the EULA is not permissive enough to allow publishing benchmarking results.
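For readers who want to see how a point on such a curve can be computed, here is a minimal Python sketch. It is not the code of the benchmarking tool. It assumes, purely for illustration, that an engine has already produced one confidence score per wake-word utterance and one score per candidate detection in the background audio, so that sweeping a threshold over those scores stands in for re-running the dataset at each sensitivity. The function name `roc_points` and all numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(
    wake_scores: Sequence[float],
    background_scores: Sequence[float],
    background_hours: float,
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, false_alarms_per_hour, miss_rate) per threshold.

    wake_scores: assumed score for each true wake-word utterance.
    background_scores: assumed score for each candidate detection in
        audio that contains no wake-word.
    background_hours: total duration of that background audio.
    """
    points = []
    for threshold in thresholds:
        # A wake-word utterance is missed if its score is below the threshold.
        misses = sum(1 for score in wake_scores if score < threshold)
        # A background candidate is a false alarm if it reaches the threshold.
        false_alarms = sum(1 for score in background_scores if score >= threshold)
        points.append(
            (threshold, false_alarms / background_hours, misses / len(wake_scores))
        )
    return points


# Made-up scores, only to show the shape of the output.
points = roc_points(
    wake_scores=[0.9, 0.8, 0.4, 0.95],
    background_scores=[0.1, 0.6, 0.3],
    background_hours=2.0,
    thresholds=[0.25, 0.5, 0.75],
)
for threshold, fa_per_hour, miss_rate in points:
    print(f"threshold={threshold:.2f}  false alarms/hour={fa_per_hour:.2f}  miss rate={miss_rate:.2f}")
```

Raising the threshold trades false alarms for misses, which is why each engine traces out a curve rather than a single point.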
If you liked this article please share it.
Feel free to visit Porcupine’s GitHub repository and play with it.