Talking to Your LLM Voice Assistant

Todd Mozer's Desk
Sensory Perspectives on AI
6 min read · May 31, 2024

Large language models (LLMs) have advanced remarkably, making it inevitable that we will soon interact with personal voice assistants that are knowledgeable about everything and able to carry out actions for us as well. They will be capable of performing almost any task a super smart helper can handle: vacation planning, tour guiding, education, entertainment, shopping, and more!

Many of the current challenges with LLMs are rapidly being resolved. Newer models like GPT-4o can combine vision with speech for enhanced intelligence. Models are being updated more frequently, reducing outdated information, and updates are happening faster as large models are broken into submodels that can be independently controlled. Issues like hallucinations can be minimized by cross-checking answers for accuracy and citing sources.

However, a significant problem remains unaddressed: these assistants can’t be always on, always listening and watching, and always sending data to a cloud-based LLM. That’s too much bandwidth, it destroys privacy, and it would lead to a lot of unintended interruptions. More generally, the problem is how we get an assistant’s attention without being interrupted when we don’t want it, in a way that preserves privacy and conserves energy.

Full disclosure: I started an on-device neural net speech recognition company called Sensory many years ago. We specialize in high-accuracy, on-device speech recognition, including wake words. Many credit Sensory with creating the first usable wake words. Sensory is likely the only company that has licensed wake word technology to major players like Google, Amazon, Microsoft, Cupertino, Samsung, Huawei, and Baidu, and we are certainly the only company that has licensed to all of these companies and hundreds of others. Sensory technology has shipped in over 3 billion products, and many of these used our wake words. If you’ve used speech recognition, you’ve probably used Sensory technology, most likely in a wake word. So, I have a bias for wake words!

Current Approaches to Talking to Voice Assistants

Many people want wake words to disappear, envisioning a smart assistant that simply “knows” when you’re talking to it. That might be ideal, but it’s unlikely. Instead, voice assistants will take a hybrid approach that combines wake words with on-device vision, touch, and various other low-power sensors to detect people, noise, and environments.

Let’s examine some of the current approaches:

  1. Traditional wake words (e.g., Alexa, Google, Siri): These assistants typically use a low-power listening device to conserve energy, with revalidation on the device or in the cloud to improve accuracy. For instance, an ultra-low-power wake word for Alexa can always be on, running below the OS level on chips to conserve power. Secondary and final checks confirm the word was actually spoken and determine the intent (see the sketch after this list). This approach works okay but not great: users can experience false accepts (unwanted activations) and missed activations in noisy environments.
  2. Push-to-Talk Assistants: Devices like the Rabbit R1 and the Humane Ai Pin use a push-to-talk method reminiscent of walkie-talkies. This approach eliminates the technical challenge of speech detection and keeps power consumption low. However, it is impractical when hands are busy or when the device is not easily reachable.
  3. LLM Smart Assistants of Today: Some of the nicer new voice assistants use a button to switch into an “always on” mode. By this I don’t mean the low-power, always-on listening for a wake word, nor the button press to quickly capture a command. I mean always on and connected to the cloud LLM, analyzing everything it hears (and, in some cases, sees).
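
To make the first approach concrete, here is a minimal sketch of the cascaded detection described above, under my own simplifying assumptions. The thresholds and the two scoring functions are hypothetical stand-ins for real neural nets, not any vendor’s actual pipeline.

```python
import numpy as np

# Hypothetical two-stage wake word cascade (illustrative only; not any
# vendor's actual pipeline). Stage 1 is a tiny always-on model with a
# permissive threshold, so it rarely misses the wake word but sometimes
# false-accepts. Stage 2 is a larger on-device model that revalidates
# stage-1 candidates before any audio is streamed to the cloud.

STAGE1_THRESHOLD = 0.5   # permissive: runs on every frame at very low power
STAGE2_THRESHOLD = 0.9   # strict: runs only on stage-1 candidates

def tiny_always_on_model(frame: np.ndarray) -> float:
    """Stand-in for a tiny neural net scoring one ~20 ms audio frame."""
    return float(np.clip(np.abs(frame).mean() * 4.0, 0.0, 1.0))

def bigger_revalidation_model(window: np.ndarray) -> float:
    """Stand-in for a larger model scoring ~1 s of buffered audio."""
    return float(np.clip(np.abs(window).mean() * 4.0, 0.0, 1.0))

def detect_wake_word(frames: list) -> bool:
    """Return True only when both stages accept the wake word."""
    buffer = []
    for frame in frames:
        buffer = (buffer + [frame])[-50:]              # keep ~1 s of context
        if tiny_always_on_model(frame) < STAGE1_THRESHOLD:
            continue                                   # stage 2 stays asleep
        if bigger_revalidation_model(np.concatenate(buffer)) >= STAGE2_THRESHOLD:
            return True                                # wake the assistant
    return False
```

The point of the cascade is the power budget: the expensive model only spins up when the cheap one fires, so stage-1 false accepts cost a little compute, while final-stage false accepts are the ones that annoy users.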

My favorite voice assistant in this third category is Pi by Inflection. It’s an amazing product that always listens. Its ability to understand intent is so good that it kind of works to be always listening without a wake word, even though it consumes way too much power and bandwidth. But I do mean kind of works: I find it’s great when I’m alone, but when it’s noisy or I have a few friends with me, it becomes too interrupting and seems to want to join the conversation (which it is amazingly good at, even when I don’t want it to). I also lose a sense of privacy, unless I push the button to turn it on and off with each query.

The recent demo of OpenAI’s GPT-4o was really amazing, but I did notice a few interesting issues related to waking up that are worth pointing out:

a) The barge-in ability seemed really impressive, and it works great when you want to barge in, but it seemed to be affected by background noise and chatter even in a controlled environment (a minimal sketch of barge-in logic follows these observations).

b) They used button presses to turn it on and off so it wasn’t overly interrupting. You had to watch carefully to see them do this.

c) Sometimes they said “Hey ChatGPT” or “So ChatGPT.” My guess is this was just a natural way to talk to it and it wasn’t really being used as a wake word. If it was a wake word, then it was probably built into their cloud-based NLU rather than a low-power on-device implementation. Interestingly, I often start my conversations with Pi by saying “Hi Pi,” even though I know I don’t need to say it!

d) They didn’t combine the camera with the voice intent to know who was speaking to whom. This will come in a future generation!

e) At one point they tried to turn it off to talk amongst themselves, and they must have missed the off button, because ChatGPT started talking about the clothes they were wearing and quickly got manually shut down.
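
For readers curious what barge-in actually involves, here is a minimal sketch under deliberately naive assumptions: the assistant pauses its own text-to-speech output the moment a voice activity detector (VAD) fires on the microphone. The `EnergyVAD` class and `play` function are hypothetical placeholders; a real system pairs a neural VAD with echo cancellation so the assistant’s own voice and room chatter don’t trigger it, which is exactly where the demo seemed to struggle.

```python
import numpy as np

class EnergyVAD:
    """Naive energy-based voice activity detector (a stand-in; real
    systems use echo cancellation plus a neural VAD, or barge-in fails
    in exactly the noisy conditions the demo showed)."""

    def __init__(self, threshold: float = 0.02):
        self.threshold = threshold

    def is_speech(self, mic_frame: np.ndarray) -> bool:
        rms = float(np.sqrt((mic_frame ** 2).mean()))
        return rms > self.threshold

def play(tts_frame: np.ndarray) -> None:
    """Hypothetical audio output; a real app writes to a sound device."""
    pass

def speak_with_barge_in(tts_frames, mic_frames, vad: EnergyVAD) -> str:
    """Play TTS frame by frame, stopping the moment the user talks."""
    for tts_frame, mic_frame in zip(tts_frames, mic_frames):
        if vad.is_speech(mic_frame):
            return "interrupted"   # stop talking, hand the turn to the user
        play(tts_frame)
    return "finished"

# Usage: silence for ~1 s of frames, then the user talks over the TTS.
tts = [np.zeros(320) for _ in range(100)]
mic = [np.zeros(320)] * 50 + [np.random.randn(320) * 0.1 for _ in range(50)]
print(speak_with_barge_in(tts, mic, EnergyVAD()))   # -> "interrupted"
```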

Future Approaches to Talking to Voice Assistants

The art of creating a device that you can talk to whenever you want — without interruptions or the need for button presses, while also ensuring privacy and low power consumption — is extremely complex and has not yet been mastered. Such a device requires several key components:

1) Better NLU, to identify who the intended recipient of the conversation is.

2) Better knowledge of the room, the people, and the noise environment. In a quiet room with one person, a wake word or button press isn’t even needed.

3) Cameras, beamforming, and other approaches that estimate the intended recipient of speech through directionality. Of course, something like this needs to be probabilistically weighted, as I might want to talk to a device without facing it!

4) Wake words. These aren’t going away, and for sure they are the simplest and most proven useful approach of today. They will be deployed AT TIMES in every conversational assistant product of tomorrow. For power and privacy reasons, wake words should run at low power and on device.

Future voice assistants will not have a single approach to waking up. Wakeup will be handled by a probabilistic, deep-learned scheme that intelligently shifts between different wakeup modes. All of these modes have a place, and the most intelligent products will deploy them probabilistically based on the scenario of use, just like we humans do! Sometimes we touch, or wave, or direct our gaze, or call out a name!
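
As a thought experiment, such a scheme might fuse several weak cues into a single wake probability rather than relying on any one trigger. Everything in the sketch below is an assumption for illustration: the cue names, weights, bias, and threshold are made up, not a description of any shipping product, and a real system would learn them from labeled interactions.

```python
import math

# Illustrative fusion of wakeup cues into one wake probability.
# All weights, cue names, and the bias are made-up assumptions; a real
# product would learn them from data and adapt the decision threshold
# to the environment.

CUE_WEIGHTS = {
    "wake_word_score":  4.0,   # low-power on-device detector output, 0..1
    "gaze_at_device":   2.0,   # camera: speaker is facing us, 0..1
    "beam_toward_mic":  1.5,   # beamformer: speech aimed at the device, 0..1
    "recent_touch":     3.0,   # user tapped the device recently, 0 or 1
    "others_present":  -2.0,   # multi-person chatter argues against waking
    "room_noise_level": -1.5,  # noisy rooms demand stronger evidence
}
BIAS = -3.0  # default to staying asleep

def wake_probability(cues: dict) -> float:
    """Logistic fusion of weighted cues into P(user is addressing us)."""
    z = BIAS + sum(CUE_WEIGHTS[name] * value for name, value in cues.items())
    return 1.0 / (1.0 + math.exp(-z))

# Quiet room, one person, no wake word spoken: gaze plus directionality
# alone crosses even odds (~0.56), so no wake word or button is needed.
print(wake_probability({"wake_word_score": 0.1, "gaze_at_device": 0.9,
                        "beam_toward_mic": 0.8, "recent_touch": 0.0,
                        "others_present": 0.0, "room_noise_level": 0.1}))

# The same vocal cues in a noisy room with friends present: the device
# stays asleep (~0.06) unless a wake word or touch adds evidence.
print(wake_probability({"wake_word_score": 0.1, "gaze_at_device": 0.9,
                        "beam_toward_mic": 0.8, "recent_touch": 0.0,
                        "others_present": 1.0, "room_noise_level": 0.8}))
```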

Sensory has technologies to enable end-user-defined wake words (you choose the name), and an amazing tool called VoiceHub that creates high-accuracy, low-power wake words on the fly at any size, on any platform, and in almost any language! If you have read this article and want to deploy a wake word into your conversational LLM, I will offer the first 10 respondents free wake word development on VoiceHub with a free year of usage!
