Field testing told us screens are useless: here’s how we approached the design of a hands-free mobile UX.

Julian Harris
Published in Knowcast
5 min read · Jul 5, 2023

We’d just learned that the screen is unavailable at the moment our users want to interact with our app. So we designed a custom voice assistant so you can control the app using just your voice.

What luck! Voice and NLP are right in my wheelhouse!

In terms of experience and career, the insight that voice needed to be central to the product design couldn’t have played out better. I didn’t rig it, promise! You see, from 2018 (with the launch of Google BERT) I’d spent several years working with “next-gen” natural language processing (NLP) and voice technologies. I observed two things:

  • I watched the tide around chatbots and voice assistants slowly ebb, from “this is how we’ll talk to computers now, forever and always” to just a handful of genuinely best-fit situations.
  • It did, however, leave behind a fairly substantial industry of NLP products, open source projects, and expertise. The needle had moved, substantially, forever. Because I had some of this expertise, I had a good sense of what was cheap, what was expensive, and what was still sci-fi. We had the shoulders of giants we needed!

How does voice input work anyway? An anatomy of a voice command

A voice command has up to three parts:

  1. Listen for a specific phrase. E.g. “Hey Siri”. Or in our case “Knowcast”.
  2. Listen for a command and its options e.g. “take a note”.
  3. Knowcast offers note-taking too, so for that command there’s a third step: transcribe the note’s audio into text. For example, “Learn more about this idea of carbon sequestration”.
A snapshot of a voice command and the tech we evaluated at one point. There are three parts: wake word, command detection, and transcription. We evaluated a bunch of tools for this; picovoice was promising but didn’t quite hit the mark in the end. Amazon Transcribe was quickly replaced by open source solutions.
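
To make the shape of this concrete, here’s a minimal sketch of the pipeline in Swift. It’s illustrative only: the type and case names are mine, not Knowcast’s actual code.

```swift
// Illustrative only: the three stages a voice interaction moves through.
enum VoicePipelineStage {
    case waitingForWakeWord      // 1. listening for "Hey Knowcast" and nothing else
    case listeningForCommand     // 2. short command window: "take a note", "go back 30", ...
    case transcribingNote        // 3. only for note-taking: capture free-form speech
}

// What falls out the other end of the pipeline.
struct VoiceCommand {
    let intent: String              // e.g. "take_note", "skip_back"
    let slots: [String: String]     // e.g. ["seconds": "30"]
    let noteText: String?           // present only when the command was "take a note"
}
```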

1. The wake word listens for just one phrase

If we listened to everything the user said, there’d be a good chance that Knowcast would get really annoying, trying to act on whatever random phrases happened to sound like commands.

So you need a filter (at least today). Apple has one: “Hey Siri” is the trigger, and even that fires incorrectly sometimes. The industry calls this a “wake word”: the computer is “asleep” and you “wake it” with a specific phrase.

We needed our own wake word. We started with “Knowcast”, and field testing showed that adding another syllable, making it “Hey Knowcast” significantly improved accuracy. (Makes sense: Knowcast is only two syllables and all the major assistants have at least three: Hey Siri, Ok Google, Alexa, Cortana.)
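
To give a flavour of what the wake word does in code, here’s a simplified sketch of the gating logic. The detector protocol is hypothetical (the real thing came from a third-party engine); the point is simply that everything gets discarded until the wake phrase is heard.

```swift
// Hypothetical wake-word detector interface; the real engine was a third-party SDK.
protocol WakeWordDetector {
    /// Returns true when the audio frame completes the wake phrase, e.g. "Hey Knowcast".
    func process(_ frame: [Int16]) -> Bool
}

final class VoiceFrontEnd {
    private let detector: WakeWordDetector
    private var isAwake = false

    init(detector: WakeWordDetector) {
        self.detector = detector
    }

    /// Called for every incoming microphone frame.
    func handle(frame: [Int16]) {
        if !isAwake {
            // Asleep: throw everything away until the wake word fires.
            isAwake = detector.process(frame)
        } else {
            // Awake: pass frames on to command detection (step 2 below).
            forwardToCommandDetection(frame)
        }
    }

    private func forwardToCommandDetection(_ frame: [Int16]) {
        // speech-to-text + intent parsing happen downstream
    }
}
```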

Sidebar: do you need the wake word at all? Actually, we’re not sure you always do. For a general-purpose assistant you would, but when there’s only a small set of commands you can give, I think dropping it would be worthy of a field trial. It’s also now very cheap to tell individual speakers apart in a crowd of people talking (“diarization”), so we could focus on just what the owner has to say.

2. The search begins for the user’s request (command detection)

With no modest amount of hubris, the industry calls the next step “natural language understanding”, or “NLU”. To be clear, it understands NOTHING. See the picture above to help understand how it works:

  • Transcribe the audio into text very quickly, e.g. “go backwards 30”
  • Look for additional information and extract it, e.g. the “30”. (These extras are often called “slots”, because NLU peeps like to invent new names for things to seem fresh and valuable.)
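
Here’s a minimal sketch of that step, assuming a small fixed command set. It’s deliberately rule-based, which is nothing like a statistical NLU engine, but it shows the contract: transcribed text in, intent plus slots out. The command names are examples, not Knowcast’s full set.

```swift
import Foundation

struct ParsedCommand {
    let intent: String
    let slots: [String: String]
}

// Toy command detection: transcribed text in, intent + slots out.
func parse(transcript: String) -> ParsedCommand? {
    let text = transcript.lowercased()

    // Pull out any number in the utterance, e.g. the "30" in "go backwards 30".
    let number = text
        .components(separatedBy: CharacterSet.decimalDigits.inverted)
        .first { !$0.isEmpty }

    if text.contains("go back") {
        return ParsedCommand(intent: "skip_back", slots: ["seconds": number ?? "30"])
    }
    if text.contains("take a note") {
        return ParsedCommand(intent: "take_note", slots: [:])
    }
    return nil // not something we recognise; do nothing rather than guess
}

// parse(transcript: "Go backwards 30")
//   → ParsedCommand(intent: "skip_back", slots: ["seconds": "30"])
```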

Sidebar: does the voice always have to be turned into text first? Mostly, yes. However, I did talk to one founder who had great success going straight from audio to the command and its parameters.

3. Finally, note-taking

For most commands, we only had two stages. But for “Hey Knowcast, take a note” we had a third: the actual note.

The note was:

  • Captured as audio
  • Turned into text
  • Inserted as a yellow “sticky note” next to the transcript
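
In code, the flow looked roughly like this. It’s a sketch: the `transcribe` closure stands in for whichever speech-to-text engine you plug in, and the model names are illustrative, not Knowcast’s production types.

```swift
import Foundation

// Illustrative model: a sticky note pinned to a position in the episode.
struct StickyNote {
    let text: String                // the transcribed note
    let timestamp: TimeInterval     // where in the episode the note was taken
    let createdAt: Date
}

// Audio in, text out, note ready to drop into the transcript view.
func takeNote(audio: Data,
              at playbackPosition: TimeInterval,
              transcribe: (Data) async throws -> String) async throws -> StickyNote {
    let text = try await transcribe(audio)          // 2. turn the captured audio into text
    return StickyNote(text: text,                   // 3. the UI layer then inserts this as a
                      timestamp: playbackPosition,  //    yellow sticky next to the transcript
                      createdAt: Date())
}
```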

Cool huh?

Knowcast could take notes and insert a transcription of them inline completely hands-free using just your voice. This is prototype 2 in action. See full video below.

Here’s a demo of how it worked in the car.

Here’s the March 2022 prototype in action in a car. Fully voice-controlled note-taking while listening to podcasts.

What about Siri?

Always use built-in tools if they’re there. If you want to give your iPhone voice commands, Siri should be the starting point. Here’s where we ended up:

  • Siri integration is super, super easy. It was about a day’s work.
  • There’s no easy way to pass extra information (like a time in seconds) without adding a separate shortcut for every possible value. This is bonkers inflexible.

So that’s a no for Siri.

2023 update: anything changed?

iOS 16 was exciting because Siri added enhanced third-party support. Apps can now preinstall Siri shortcuts (since no one was ever going to set those up themselves), but we found the following obstacles:

  • You ALWAYS have to include the app name in the command. This is fine for occasional triggers but not for a full app-control experience.
  • Despite our best efforts over several weeks, we didn’t find Siri reliable enough, and it never moved out of the prototype stage.
  • Still no access to the original audio (even though the user spoke to our app…?)
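
For reference, the iOS 16 route looks roughly like this with the App Intents framework. This is a simplified sketch with made-up intent names, not our production code, but it shows the first limitation directly: every registered phrase must include the application name.

```swift
import AppIntents

// Illustrative intent: skip the player back by N seconds.
struct SkipBackIntent: AppIntent {
    static var title: LocalizedStringResource = "Skip Back"

    @Parameter(title: "Seconds", default: 30)
    var seconds: Int

    func perform() async throws -> some IntentResult {
        // Hand off to the app's player here, e.g. player.skipBack(seconds).
        return .result()
    }
}

// iOS 16 lets the app pre-register phrases for Siri, but every phrase
// has to contain \(.applicationName), so it's always "skip back in Knowcast",
// never just "skip back".
struct KnowcastShortcuts: AppShortcutsProvider {
    static var appShortcuts: [AppShortcut] {
        AppShortcut(
            intent: SkipBackIntent(),
            phrases: ["Skip back in \(.applicationName)"]
        )
    }
}
```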

So we look forward to iOS offering more reliable capabilities and more flexible input options for fully voice-driven mobile app experiences, but as of July 2023 it isn’t there yet.

Audio feedback: the battle between noise and silence

If you watch the video above you’ll see there are some sounds playing back to tell the user what happened. This warrants a whole separate article but here are some highlights:

  • Silence is as valuable as noise. For example, we spent a lot of time thinking about how long to wait before deciding the user has finished speaking. For note-taking we made it much longer than Alexa and friends, because we knew the user might pause and have a think. I think we settled on around 5s of no talking.
  • Too many feedback sounds can confuse the user. We had a feedback loop early on that went like this: “Hey Knowcast <bip> Go back 30s <bip>”. That first bip was more jarring for our users than we’d have liked, so we found a way to remove it, letting a user move seamlessly from wake-word detection to NLU.
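
To make the silence point concrete, here’s the shape of the end-of-speech logic as a sketch. The ~5-second window is the note-taking value mentioned above; short commands want something much tighter, and the speech/no-speech flag would come from an energy threshold or a voice-activity detector.

```swift
import Foundation

// Sketch: decide "the user has finished speaking" after a window of silence.
final class EndpointDetector {
    private let silenceWindow: TimeInterval
    private var lastSpeechAt = Date()
    private var hasFinished = false
    var onUtteranceFinished: (() -> Void)?

    init(silenceWindow: TimeInterval = 5.0) {   // ~5s for notes; far less for commands
        self.silenceWindow = silenceWindow
    }

    /// Feed this once per audio frame with a simple "is anyone speaking right now?" flag.
    func update(speechDetected: Bool) {
        if speechDetected {
            lastSpeechAt = Date()
            hasFinished = false
        } else if !hasFinished, Date().timeIntervalSince(lastSpeechAt) > silenceWindow {
            hasFinished = true
            onUtteranceFinished?()
        }
    }
}
```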

How did we pick the commands?

That’s what I will cover next!

Julian Harris
Knowcast

Ex-Google Technical Product guy specialising in generative AI (NLP, chatbots, audio, etc). Passionate about the climate crisis.