Some Stories About Designing for Voice

Her, Spike Jonze’s near-future love story, came out in 2013. The plot is simple: a man installs an advanced Siri-like operating system and then falls in love with it. The entire relationship is conducted via speech, through a single in-ear headphone. It’s an interesting premise, and an interesting way to talk to a computer. And it was especially interesting to me, considering my background designing for this stuff.

In 2007, I worked at a voice recognition startup. You’d call a phone number, speak to the service, and we’d send you a transcript via email. Because it was done with human assistance, you didn’t need to be as careful with your speech, or get frustrated nearly as often. When I was deciding whether or not to take the job, I called the service and said “that German company that starts with a V” and the transcript replaced that with “Volkswagen.” Obviously a human is capable of doing that sort of calculation, but what if robots could one day do it?

A year later, Apple announced an official SDK for writing iPhone apps. I lobbied my CEO and CTO to give me budget for hiring someone who knew Objective-C. I got the budget, barely, then designed and PM’d the release. It went live on day one of the App Store, hit #1 in Productivity, got featured by Apple in print media, and eventually the company was bought out by Nuance. But before all that was the meeting with Steve Jobs’s friend, a VC guy in Seattle.

We arrived at his office and showed him a beta of our app. He immediately picked up on the blah visual design. “We’re basing it off the standard design patterns in iOS, like Mail,” I said. “That’s a shame,” he said. He said the design was too blah, too flat, not the kind of thing Steve could show on stage. “And trust me, Steve is obsessed with audio input. Always has been. Remember the 1984 Mac?”

(Indeed. You’ve been able to control a Mac with your voice forever. In the ’90s I had trained my computer with AppleScript to run a series of actions from a single command. For example, “OK Computer, build a new project” would close all other windows, start playing music, clean up my desktop, go to my Projects folder, make a new folder, and wait for me to type a new project name. Musicians have long been fascinated by the artistic potential, which is why they have used the Mac’s robotic voice in their songs, and why Radiohead’s album is called OK Computer, the default command that didn’t require holding the Escape key before speaking.)

I don’t recall the exact phrases he used, but this man impressed on us very clearly that Steve believed the future of computing, or at least a big part of it, would be voice controlled. You want to be able to tell an assistant “Set up the house for the party” and just know that everything is going to happen correctly. But imagine the number of clicks and keyboard shortcuts you’d need to go through in order to do the same things via a WIMP (windows, icons, menus, pointer) interface. How many actions would it take? How much time? It doesn’t scale.

“And then we’ve got this iPhone, a big sheet of glass, and you can fit, what, five navigational items across the bottom? And a few actions on the top?” This guy knew, back in 2008, that we’d soon ask our phones to do nearly everything we ask of desktop operating systems, with far fewer pixels to work with. Our young company was headed in the right direction with voice input, because it was the only thing that could scale to the level of complexity without requiring additional UI. But, he pressed further, we were requiring humans to do the transcription. Which adds a delay. Which would be a problem for us until we could automate it entirely.

Then he asked about integration with other apps. Our company had tons of them, probably over a hundred, and a dozen or two really popular ones. But somewhere along the way, there was an elephant in the room. To really do voice right, it would need to be done at a system level. No third party app could address key scenarios, like asking your phone in the morning “is it going to rain today?” Or even something like “Find Sarah’s phone number and text it to Betsy.”

Back in the office, the CEO and CTO quizzed me about Apple’s philosophy. I’m an amateur Apple historian and they were the kind of Microsoft veterans that say “JScript” instead of JavaScript or “the fruit company” rather than saying Apple. So I shared my point of view about the meeting. I agreed wholeheartedly that Steve Jobs has historically been obsessed with audio input and output. I agreed wholeheartedly that Apple would add something like this at the OS level. And I agreed wholeheartedly that the dictation scenario was a sliver of the overall audio strategy for Apple.

“There’s got to be some hook into the OS we can get via the SDK,” they said. “There’s got to be some workaround for being the app that Steve is looking for across the entire phone.”

“Maybe,” I said. I hated breaking the bad news to them, but I went for it. “But Apple believes in owning the whole widget. If they think it’s a big deal, they’re going to build it themselves. At the OS level.”

Within a year, the startup had been bought by a leading voice recognition company called Nuance. Two years after that, Siri launched as an OS-level voice assistant powered by Nuance. It was exactly what Steve’s VC friend had told us, and exactly what Steve had been aiming Apple towards since 1984. His company had finally kicked off his long-dreamed-about voice assistant. Steve died the day after Siri was born.

Six months prior to Siri’s beta release, I was a lead designer on Windows Phone’s Apps team. We built Mail, Calendar, SMS, Internet Explorer, and so forth. If it was bundled on the device, our team was tasked with designing it. We had been working on an idea called “Capture” which sprang from the insight that jotting down quick notes via voice, scribble, or text is really important. I was fond of saying “we’re losing to the sticky note” to explain why phones were still not the preferred way to take down information on the fly. Software is too slow, requiring too many steps, with too much customization to reasonably compete against the speed of a sticky note.

Capture got killed off multiple times to make room for other features. At one point, while trying to argue my case, I remembered what Steve’s VC friend had told us years earlier about voice input. And I remembered a vague rumor floating around amongst the Apple faithful, that Apple was working on an “assistant” for iOS. iPhone and Windows Phone already had a rudimentary voice-to-text feature, but “assistant” sounded like a bigger bet. It sounded like it might be Steve’s voice dream coming to pass. So I put together a PowerPoint presentation explaining that I thought the future of “Capture” was not just a Microsoft Design idea. I thought it was what Microsoft calls a “table stakes” feature. Something expected of every product because it’s so obvious.

I’m not claiming that Cortana was kicked off because of my PowerPoint, because it wasn’t. And I’m not claiming that when Siri was launched Microsoft experienced the familiar feeling of a feature going from “we’re not doing it” to “now we have to do it” because who knows. But that’s how it looked from the inside. We built a voice team and they got to work. Many of the designers were my friends, and they sat right next to me, so I got to listen in and share thinking through the early stages.

I remember one of the key things I was concerned about was the idea of Cortana being too much of a “black box.” What if you don’t care for Taylor Swift but somehow the computer has decided you love her? What if your entire search and entertainment experience gets corrupted by this bad assumption? And worse, what if there’s no way to see the bad assumption? And worse still, what if there’s no way to correct it? I was very pleased to see that the concept of Cortana Notebook survived in the final version. I think it’s a key component. When things are stored in black and white, you can correct mistakes and even input important insights to allow Cortana to help in far more intelligent ways than a black box assistant might.

One final thought, this time about AirPods. The reaction from most people was pretty straightforward. “I don’t need wireless headphones. Besides, they’re expensive. And it looks like I’d lose them.”

My take is a bit different. Running (or walking around normally) with traditional iPhone earbuds often causes me trouble. I accidentally yank them out of my ear, or I get tangled on them and it makes my iPhone slip out of my hands. I don’t consider myself particularly clumsy, but headphone wires are always getting in my way.

I think back to something my friend told me in 2014. I asked him what future trends he saw over the horizon, and he told me that he thought headphone wearables were going to explode in popularity. He explained how while traveling he didn’t want to have his nose buried in his phone while wayfinding. He wanted a gentle voice in his ear explaining when to turn. Sort of like that movie Her.

So he had purchased some headphones that look a bit like a necklace. Most of the time they just hung around his neck, but when he wanted to ask a question, he liked the idea that he could just say “directions to the Louvre” and have Siri get him there subtly.

He described his needs through the lens of a traveler, but his story resonated with me. I’ve been using my Apple Watch in a similar way. I’ve found that I don’t need to carry my phone through my office or my home, because I know I’ll be able to use my Apple Watch for light computing scenarios, and it will use Wi-Fi to talk to my phone.

And rather than feeling hamstrung by the tiny screen, I find it fits a good triage role. I get less distracted. I go down fewer rabbit holes because there aren’t as many. For example, I can ask “Check for new email” while walking down the street and I’m told I have nothing new. Without being distracted by anything else, which is itself a feature. Or if I do have new email, I can ask to have it read to me. Or I can say “Read me the email from Mike today” to perform a search.

When I’m at the park playing with my kids, my wrist buzzes. I see it’s from my wife. With two taps I can send back a quick 👍 or 😂 or whatever is needed. Or I can transcribe a quick response like “Home in 10 minutes.” And then I’m back to playing with my kids. We’re used to describing that as a stripped down feature set, but the ability to only see the thing I want to see and ignore everything else might be my favorite feature of something like Apple Watch. And soon, of AirPods too.