A considerable number of use cases when building voice interfaces involve inferring the user’s intent from a spoken command. For example:
- Switch to ESPN.
- What’s my heart rate?
- I just took my blood pressure medicine.
- Set a reminder for a dentist appointment on January 5th at 11 am.
- Can I have a double-shot Americano with lots of milk and a bit of sugar?
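Conceptually, we need a black box that translates the above commands into structured data representing the user’s intention. For example, given “set the lights in the living room to purple,” we would like the black box to return the intent (change the light color) along with its details (location: living room, color: purple).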
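As a rough illustration, the structured data for the lighting command could be represented as in the sketch below. The class name `Inference`, the field names `intent` and `slots`, and the label `changeColor` are assumptions made for this example; they are not taken from Rhino's actual output format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Inference:
    """Hypothetical container for the output of a Speech-to-Intent engine."""

    intent: str  # what the user wants to do
    slots: Dict[str, str] = field(default_factory=dict)  # details of the intent


# Hand-written target for: "set the lights in the living room to purple"
expected = Inference(
    intent="changeColor",
    slots={"location": "living room", "color": "purple"},
)
```

Downstream application code can then branch on `expected.intent` and read the details from `expected.slots` without parsing any text.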
Let’s call this black box a Speech-to-Intent engine. In what follows, I talk about the current approaches, their limitations, and the solution we developed at Picovoice.
State of the Art
Current approaches to intent inference break down the problem into two sub-tasks. First, the speech signal is transcribed using a speech-to-text engine. The speech-to-text engine can optionally be tuned for the domain of interest for improved accuracy. Then the transcription is fed into a natural language understanding (NLU) engine. The NLU engine is responsible for inferring the topic, intent, and slots (intent details) from the text.
An NLU engine can be as simple as a collection of regular expressions. It can also have a probabilistic component, which is almost always used as plan B. Why? Because determinism is desired.
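To make the "collection of regular expressions" idea concrete, here is a minimal sketch of such an NLU engine operating on a transcript. The intent names, patterns, and the `infer` function are hypothetical and cover only two of the example commands; a production engine would need many more patterns and, possibly, a probabilistic fallback.

```python
import re
from typing import Dict, Optional, Tuple

# Each hypothetical intent maps to a pattern; named groups capture the slots.
PATTERNS = {
    "changeColor": re.compile(
        r"^set the lights in the (?P<location>[a-z ]+?) to (?P<color>[a-z]+)$"
    ),
    "changeChannel": re.compile(r"^switch to (?P<channel>[a-z0-9 ]+)$"),
}


def infer(transcript: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (intent, slots) for the first matching pattern, else None."""
    # Normalize case and drop trailing punctuation from the transcription.
    text = transcript.lower().strip().rstrip(".?!")
    for intent, pattern in PATTERNS.items():
        match = pattern.match(text)
        if match:
            return intent, match.groupdict()
    # No deterministic match; a real system might try a probabilistic model here.
    return None
```

Because the matching is deterministic, the same transcript always yields the same result, which is the property the rule-based component is valued for.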
There are some great articles about how this is done, such as this one.
The Need for a Different Approach
The main limitation of the above approach is that it requires a significant amount of compute, memory, and storage resources. When implemented as a cloud solution, this is not a significant issue. But there is a lot of interest in running voice AI offline (on-device) for better privacy, latency, and reliability. Mycroft, Snips, and Sensory are a few of the companies providing such technologies and solutions.
Hence, the current on-device Speech-to-Intent solutions are resource-demanding. A quick survey suggests that in order to have a Speech-to-Intent engine running fully on-device (no cloud delegation), at least a Raspberry Pi 3 or equivalent is needed. This limits their applicability to resource-constrained IoT and mobile applications.
The above use cases are each concerned with a specialized domain (context) that has a limited vocabulary (think thousands, not millions, of words) and a limited number of spoken command variants.
Picovoice’s Speech-to-Intent engine (a.k.a. Rhino) takes advantage of this property to build a jointly optimized speech recognition and NLU engine specialized for the domain of interest. The result of this joint optimization is a much smaller model size and significantly lower run-time requirements. This means Rhino can run on MCUs with extremely tight compute and memory restrictions, as well as on power-constrained (e.g. wearable) devices.
What’s the catch? Well, Rhino’s models are domain-specific. That means that a model built for a coffee maker cannot be used in a laundry machine and vice versa.
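To see why a single domain stays small, consider the toy sketch below, which enumerates every command in a made-up smart-lighting domain and counts the distinct words involved. The templates, slot values, and the `expand` helper are illustrative assumptions only; this is not how Rhino defines or trains its models.

```python
from itertools import product

# Hypothetical slot values and phrase templates for a smart-lighting domain.
SLOTS = {
    "location": ["living room", "kitchen", "bedroom"],
    "color": ["purple", "blue", "green", "red"],
}
TEMPLATES = [
    "set the lights in the {location} to {color}",
    "make the {location} lights {color}",
    "turn the {location} {color}",
]


def expand(templates, slots):
    """Enumerate every command the toy domain allows."""
    names = sorted(slots)
    commands = []
    for template in templates:
        # Try every combination of slot values in each template.
        for values in product(*(slots[name] for name in names)):
            commands.append(template.format(**dict(zip(names, values))))
    return commands


commands = expand(TEMPLATES, SLOTS)
vocabulary = {word for command in commands for word in command.split()}
print(len(commands), "commands,", len(vocabulary), "distinct words")
```

The search space is tiny compared with open-ended dictation, and none of these commands would be meaningful for a coffee maker, which illustrates both the benefit and the catch of a domain-specific model.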
Rhino in Action
To demonstrate Rhino’s capabilities, we have ported it to an embedded processor from NXP called the i.MX RT1050. The processor is essentially an ARM Cortex-M7 with 512 KB of RAM. Rhino and our wake-word engine (Porcupine) collectively use 150 KB of RAM in this demo. See below.
Last but not least, we have partially open-sourced Rhino. Feel free to visit Rhino’s GitHub repository.
Finally, we suspect that our approach can significantly increase accuracy compared to a generic system, as it purposely limits the search space. We are in the process of creating a statistically significant benchmark for further investigation.