Conversations with my phone

Speech recognition is everywhere — and better than ever. So, why doesn’t my phone understand me?

My colleague wants to tell me how excited he is about the speech recognition on his iPhone. He claims to use it all the time.

“Oh, really?” I ask. “Like in the car?” (It literally takes the threat of a moving violation to get me to even think of using a voice interface.)

No, he says. All the time. Searching the Web. Finding a contact. Dictating a text message. Filing a Yelp review. Asking Siri about the weather. He’s stunned that I (a technophile and a computational linguist) could have such an aversion to talking to my phone. “Doesn’t Siri work for you?” he asks. “It works great for me.”

I don’t have the heart to tell him the truth. No, Siri doesn’t work for me. None of the current crop of so-called conversational interfaces do, frankly. And until they get better, I won’t be using a single one, no matter how many Easter eggs or jokes they include.

My quibble isn’t with automatic speech recognition (ASR), the systems that transcribe my voice into something approximating what I actually said. I’m a native English speaker with a relatively neutral U.S. accent. Most of the ASR models used today have been trained on hundreds of thousands of hours of speech from guys just like me.

No, my issue is with what systems like Siri, Alexa, and Cortana (and now, Hound) do with what I tell them. Right now, conversational interfaces are far from conversational. Instead of engaging me in a dialog that will help me get what I want, they reflexively respond to the last thing I said. While this is occasionally useful, it’s usually disappointing.

I don’t want another way to send off an imperfect Google query. I want a system that understands what I said, figures out whether it can help, and if so, works with me to make sure I’m satisfied.

Five years after Siri’s acquisition by Apple and Watson’s victory on Jeopardy, this shouldn’t be too much to ask.

So, what do conversational interfaces need to do to win me back? Lots of things. But I’d be happy if they made progress towards four “Cs”.

Context. Conversational interfaces need to understand the situational context that leads me to call on them. Anticipate what I’m going to need, based on where I am and what you know about what I generally do in situations like these. For example, if I’m in a new city, offering me directions to the nearest gas station might be useful. When I’m at home, not so much.

Systems like Google Now/Now on Tap and Apple’s forthcoming Proactive are taking steps to better anticipate users’ search behaviors. We’ll know that context has arrived when systems like Google Voice Actions and Siri start leveraging this knowledge to initiate conversations with users.

Common ground. Conversational interfaces need to pay attention and remember the things I’ve told them previously. If I’ve asked for “organic flea medicine for dogs” in the preceding statement, don’t play dumb when I ask “How much does it cost?” the next time up. Likewise, if I always search for hotels that get more than 4 stars, please help me out by assuming I’m not looking for a flea-bag motel if I forget to stipulate otherwise.
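
To make that concrete, here is a minimal sketch of what “common ground” could look like in code. It is an illustration only, not how Siri, Hound, or any other product actually works: the class name DialogState, the pronoun list, and the min_stars preference are all hypothetical, and real reference resolution is far harder than swapping a pronoun for the last topic.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical, deliberately tiny pronoun list; a real system would
# need proper coreference resolution.
PRONOUNS = {"it", "that", "this", "them"}


@dataclass
class DialogState:
    """Toy memory of what the user has already said (an assumption,
    not any vendor's actual design)."""
    last_topic: Optional[str] = None
    defaults: dict = field(default_factory=dict)

    def remember_topic(self, topic: str) -> None:
        # Store the most recent thing the user asked about.
        self.last_topic = topic

    def resolve(self, utterance: str) -> str:
        # Swap a bare pronoun for the remembered topic, if there is one.
        if self.last_topic is None:
            return utterance
        words = utterance.rstrip("?.!").split()
        resolved = [
            self.last_topic if w.lower() in PRONOUNS else w
            for w in words
        ]
        return " ".join(resolved)

    def apply_defaults(self, slots: dict) -> dict:
        # Start from habitual preferences; anything the user states
        # explicitly overrides them.
        merged = dict(self.defaults)
        merged.update(slots)
        return merged


state = DialogState(defaults={"min_stars": 4})
state.remember_topic("organic flea medicine for dogs")
print(state.resolve("How much does it cost?"))
print(state.apply_defaults({"city": "Seattle"}))
```

The follow-up question comes out as “How much does organic flea medicine for dogs cost”, and the hotel search quietly picks up the 4-star floor unless I say otherwise. Crude, but it is more memory than I get today.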

Complexity. Today’s systems need to start handling more complex queries. Over the past 5 years, Siri hasn’t dazzled with support for many new query types. And new entrants (like Cortana, Amazon’s Alexa, and now, Hound) have tried to make their bones with a couple of rather pedestrian queries that Siri et al. didn’t support previously.

While the reviews of Hound have been positive, I was generally underwhelmed. Yes, Hound can handle some pretty complex queries. But that’s more because its developers built a few shiny new templates that can handle multiple attributes at the same time, not because Hound can understand me more deeply than its competitors.

For example, Hound gave me excellent results for a query like “Find me hotel rooms in Seattle this weekend for more than $300 a night that get more than 4 stars”. (Incidentally, the same query worked okay on Siri. I got a list of nearby hotels that had vacancies this weekend; the high-end parts of my query were ignored, however.) Other queries on Hound weren’t so successful, though. Asking “Find me a place to get a pastrami sandwich less than $15 near here that gets more than 4 stars” merely returned an offer to search the Web for something tasty.

Cooperativity. Conversational interfaces are far from cooperative. They don’t know when they don’t understand you — and what’s worse, they don’t care. They don’t help steer you to the kinds of things they could help you with. They blithely send your all-too-carefully constructed statements off to the Web in the hopes that you wanted to stare at 10 blue links after all.

Human conversations don’t work like that. Have a conversation with even the surliest of teenage minimum-wage customer service workers and you’ll have a better interaction than you’ll get with one of today’s conversational interfaces. Why? Humans know that they shouldn’t go through the effort of talking to someone — or something — without first verifying that there’s a chance the person on the other end will try to understand them. We expect that if the other person doesn’t understand us, they’ll communicate/negotiate/gesture at us until the basic message gets through.

Conversational interfaces could really benefit from trying to do less. If systems could understand when they couldn’t help us, we might not walk away any wiser, but we’d be a little less fed up with our technology.

I know how hard this is. I’ve spent most of my career trying to build natural language interfaces that can help users interact with machines in the same way they interact with humans.

I’m just ready for more.