Google Stole Our Idea!

Damn you, Google!

Yesterday, the Google Assistant team announced Google Duplex, a system to interact with businesses on your behalf over the telephone, using natural language. Here’s the video:

Here I am with my former co-founder Jeff doing a demo of John Done at last year’s Betaworks Voice Summit:

As best as we can surmise, here’s the Google Duplex roadmap:

  • Schedule salon appointments
  • Make restaurant reservations
  • Ask for holiday hours

Here’s the John Done roadmap circa July 2017:

  • Ask a question, with holiday hours as our default example
  • Schedule appointments
  • Complete transactions
  • Make restaurant reservations (when making reservations in Manhattan, be prepared with a credit card)

As several friends have asked in the last 24 hours, “How does it feel when Google steals your idea?” OK, OK, I’ll tell you. But first…

Is It Hype? Yes.

Of course it’s hype. Hype is necessary. Know why we call it artificial intelligence instead of machine learning? Marketing. The future has to be marketed to get it funded.

But more specifically, while the Duplex demo is neat, a lot of what we see here isn’t actually that new.

Speech Synthesis

Google and DeepMind have already shown us, via Tacotron and WaveNet, that end-to-end deep learning would transform speech synthesis. The voices synthesized using these new techniques are uncannily realistic.

But here’s a crazy idea: maybe don’t pretend to be human.

There are tens of millions of smart speakers out there. Aren’t most of us accustomed to whispering commands into our smartphones? We know what it means to interact with a bot by now.

And yeah, they can be frustrating, but how many really good point-and-click user interfaces have you actually encountered? Imagine now your frustration upon your descent into the uncanny valley as you realize the human you’ve been chatting with isn’t human at all.

Disfluencies: “Umm,” “er,” “ah,” “hm”

It’s hardly news that humans don’t always get right to the point. The Duplex team appears to have some good linguists working with them. I’ve met some of the voice user interface designers from the Assistant team, and they know their stuff.

Moreover, if the recurrent neural network driving their dialog manager is intelligently deciding where in the automaton’s dialog to pepper these disfluencies, that’s progress. I’m happy to see that.
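
The trick itself is cheap once you have those decisions. Here's a toy sketch, assuming (purely for illustration) that some model has already scored each token boundary with a probability that a hesitation belongs there:

```python
import random

FILLERS = ["um", "er", "ah", "hm"]

def pepper_disfluencies(tokens, boundary_probs, rng=random.Random(0)):
    """Insert a filler word at token boundaries, with probability given
    by boundary_probs[i] for the gap before tokens[i].

    In a Duplex-like system those probabilities would come from the
    dialog model itself; here they're just an input we assume exists.
    """
    out = []
    for tok, p in zip(tokens, boundary_probs):
        if rng.random() < p:
            out.append(rng.choice(FILLERS) + ",")
        out.append(tok)
    return " ".join(out)

# A high score on the boundary before "4": hesitate before committing to a time.
print(pepper_disfluencies(
    ["we", "could", "do", "4", "p.m."],
    [0.0, 0.0, 0.1, 0.9, 0.0],
))
```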

But while it makes for a good demo, in my view intelligent disfluencies are not a major technical achievement, especially compared to the harder challenges still to tackle.

Closed Domains

Case in point, natural language understanding still works best in tightly controlled domains. A system built to schedule salon appointments will probably bork up making a restaurant reservation.

I worked previously at x.ai, a company founded on the idea that if you define a problem narrowly enough, such as meeting scheduling via email, you can model the universe of that domain sufficiently to build a software agent (Amy) who can handle it automatically. And even then, it’s hard.

The example phone conversations in the Duplex press release are very much to the point. The rather well-understood technique of slot filling would likely be sufficient in such a tightly closed domain.
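
For the skeptical: slot filling, at its simplest, is just pulling a fixed set of fields out of an utterance. Here's a minimal regex-based sketch for a hypothetical reservation domain; the slot names and patterns are mine, not Duplex's:

```python
import re

# The closed domain is defined by a fixed set of slots and the patterns
# that fill them. These patterns are illustrative, not exhaustive.
SLOT_PATTERNS = {
    "party_size": re.compile(r"(?:table|party|reservation) for (\d+)"),
    "time":       re.compile(r"\b(\d{1,2}(?::\d{2})?\s*(?:am|pm))\b"),
    "day":        re.compile(r"\b(today|tomorrow|monday|tuesday|wednesday|"
                             r"thursday|friday|saturday|sunday)\b"),
}

def fill_slots(utterance: str) -> dict:
    """Return whichever slots we can extract; the slots still missing
    drive the follow-up questions the agent asks next."""
    text = utterance.lower()
    slots = {}
    for name, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            slots[name] = match.group(1)
    return slots

print(fill_slots("Hi, I'd like a table for 4 on Friday at 7pm"))
# {'party_size': '4', 'time': '7pm', 'day': 'friday'}
```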

And at John Done we discovered early on that in tightly controlled domains, you don’t need highly accurate automatic speech recognition (machine transcription). A natural language understanding system (even a straightforward intent classifier) can easily be trained to compensate for speech-to-text errors.
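
Character n-grams are one cheap way to get that robustness, since a transcription error usually mangles only part of a word. A minimal sketch using scikit-learn, with a deliberately noisy, made-up training set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set; the transcripts include the kinds of
# errors speech-to-text actually makes ("ours" for "hours", etc.).
transcripts = [
    "what are your holiday ours",
    "are you open on thanksgiving",
    "what time do you close to day",
    "id like to book a table for too",
    "can i make a reservation for for people",
    "table for six tonight please",
]
intents = ["hours", "hours", "hours",
           "reservation", "reservation", "reservation"]

# Character n-grams within word boundaries survive mis-transcribed
# words far better than whole-word features do.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(),
)
clf.fit(transcripts, intents)

# Shares n-grams ("open", "on", ...) with the "hours" examples.
print(clf.predict(["when do you open on the forth of july"]))
```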

I can tell you from my personal experience of sifting through lots of human-to-software agent telephone conversations that humans are frustratingly prone to going off topic, especially when they aren’t sure whether they are speaking to a robot. And ay, there’s the rub with the state of the art.

Handling humans who go off topic continues to be a product and user-interface design problem, at least until modeling techniques get better or Duplex’s neural network becomes almost unfathomably large. Labeling all the data to get there is going to be very expensive for them.

Oh, and about labeling those conversations…

Real-Time Supervised Training: They’re Using Humans, Folks

Actually, kudos to them for mentioning in their press release how they train the system on new domains. We still live in a world where machines are trained using supervised learning, meaning we still require humans in the mix to train the machine what to do. There is considerable research interest in unsupervised learning — having the machines figure it out on their own — and progress is being made.

Managing those humans in the mix and the data they produce is too frequently glossed over as a detail. I get it. Again, we’ve got to market this stuff. And the image of thousands of contractors laboring over mundane conversations, labeling disfluencies is, well, lacking in pizazz.

But the data-labeling operation is a core activity for any business aspiring to truly make AI work as their proprietary technology. To all those previously successful founders now diving into AI: seriously, figure it out now. If you haven’t started, you don’t even know what you don’t know.

It would be great to hear more from the Duplex team on how they handle their humans in the mix. Again from experience I can tell you that involving human data-labelers in synchronous, latency-sensitive conversations is very hard to get right.

But I’ll Admit It: Duplex Is Cool

There is, after all, a good reason we wanted to build it too. I want this technology in my life.

But there’s news here too.

Dialog Manager Powered by Deep Learning

What I’ll call Duplex’s dialog manager — the system used to choose the automaton’s next utterance in the dialog — is a recurrent neural network taking as its input the interlocutor’s transcribed speech, features from the audio itself, the state of the conversation up to that point, the task at hand, and more context that they don’t divulge. That is undeniably cool.
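
To make the shape of that concrete, here's a rough PyTorch sketch of what such a dialog manager could look like. The architecture, feature sizes, and fixed action inventory are all my guesses, not Google's actual design:

```python
import torch
import torch.nn as nn

class DialogManager(nn.Module):
    """Choose the next utterance (from a fixed inventory) given the
    conversation so far. Inputs mirror the press release's list:
    transcribed speech, audio features, conversation state, and task."""

    def __init__(self, vocab_size, n_tasks, n_actions,
                 text_dim=64, audio_dim=16, hidden_dim=128):
        super().__init__()
        self.embed_text = nn.Embedding(vocab_size, text_dim)
        self.embed_task = nn.Embedding(n_tasks, text_dim)
        # One RNN step per turn; its hidden state *is* the conversation state.
        self.rnn = nn.GRU(text_dim + audio_dim + text_dim,
                          hidden_dim, batch_first=True)
        self.next_action = nn.Linear(hidden_dim, n_actions)

    def forward(self, token_ids, audio_feats, task_id, state=None):
        # Mean-pool the turn's tokens into one vector (a simplification).
        text = self.embed_text(token_ids).mean(dim=1, keepdim=True)
        task = self.embed_task(task_id).unsqueeze(1)
        turn = torch.cat([text, audio_feats, task], dim=-1)
        out, state = self.rnn(turn, state)
        return self.next_action(out[:, -1]), state

# One turn of a fake conversation: batch of 1, 5 tokens, 16 audio features.
dm = DialogManager(vocab_size=1000, n_tasks=3, n_actions=20)
logits, state = dm(torch.randint(0, 1000, (1, 5)),
                   torch.randn(1, 1, 16),
                   torch.tensor([1]))
print(logits.shape)  # torch.Size([1, 20]) -- scores over next utterances
```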

Following Microsoft’s Tay debacle, it was tempting when designing software agents who interact with humans to strip them of this autonomy. In other words, while we might use machine learning to produce all those inputs, from the transcribed speech to the predicted intent of the user, allowing the machine to choose the next utterance seemed, well, risky.

Hence the proliferation of bots and their stilted dialogues. They are rules-based systems; they are only so flexible. The problem with these systems is that in any domain sufficiently complex to be interesting, those rules get out of hand. Even trying to specify systems at that level of complexity is an impossibly large task, and without those specifications, how do you even know the system is doing what it’s meant to?

Better to spend that effort modeling the problem probabilistically — figuring out how to let the machine work it out by looking at example conversations that humans have labeled. With these closed domains, it’s possible to use pre-defined templates for the bot’s speech, avoiding the worst tendencies of that bigot Tay.
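
Concretely, the model picks which thing to say; the surface text itself never comes from the network. A trivial sketch, with invented template and action names:

```python
# The model's output is an action id; the text is fixed ahead of time.
# The network can pick a bad *moment* to say something, but it can
# never compose a sentence nobody reviewed.
TEMPLATES = {
    "confirm_time":   "So that's {day} at {time}, correct?",
    "ask_party_size": "And how many people will that be for?",
    "close_call":     "Great, you're all set. Thanks so much, bye!",
}

def render(action: str, slots: dict) -> str:
    return TEMPLATES[action].format(**slots)

print(render("confirm_time", {"day": "Friday", "time": "7pm"}))
# So that's Friday at 7pm, correct?
```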

Awesome Language Modeling

Based on the fairly terse description of that dialog manager provided in the press release, Duplex appears to have a very sophisticated language model embedded in it. Aside from the aforementioned disfluencies, they imply that their model understands many linguistic features beyond just words, including, for example, pauses of different lengths. Their granular description of the dynamics of human conversations also speaks to the linguistic sophistication of the team and the model.

That is some awesome work and an impressive achievement for natural language understanding.

It follows that they have also devised a technique for the data representation of those linguistic features, a way of converting them into lists of numbers that the machine can understand. I hope we’ll hear more details from the Duplex team on how they got there.
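
As a crude illustration of what "converting them into lists of numbers" might look like: one vector per token, with a word id the model can embed plus extra dimensions for pause length and disfluency flags. The feature choices here are entirely mine:

```python
import numpy as np

VOCAB = {"<unk>": 0, "do": 1, "you": 2, "have": 3, "a": 4, "table": 5}
PAUSE_BUCKETS = [0.1, 0.3, 1.0]  # short / medium / long, in seconds
FILLERS = {"um", "er", "ah", "hm"}

def encode_token(word: str, pause_before: float) -> np.ndarray:
    """One token -> one fixed-length vector: a word id the model can
    embed, plus a one-hot pause bucket and a disfluency flag."""
    pause = np.zeros(len(PAUSE_BUCKETS) + 1)
    pause[np.searchsorted(PAUSE_BUCKETS, pause_before)] = 1.0
    is_filler = float(word in FILLERS)
    word_id = VOCAB.get(word, VOCAB["<unk>"])
    return np.concatenate([[word_id], pause, [is_filler]])

# "do you ... um ... have" with a long pause before the "um"
for word, pause in [("do", 0.0), ("you", 0.05), ("um", 0.8), ("have", 0.4)]:
    print(word, encode_token(word, pause))
```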

So, How Does It Feel?

Again, there are major product and user-interface design problems to be solved even with this apparent leap in the state of the art. How do you handle it when humans go off topic? How do you fail gracefully?

The technical challenges are fascinating and still really hard, but as a startup seeking product-market fit, these design problems loom large for us.

And then there’s the business problem of distribution, how we insert ourselves into our customers’ lives. We never figured it out with John Done’s original incarnation, and of course it isn’t a problem for the Duplex team — it’ll be right there in Google Assistant.

That’s why we decided to take a slightly different tack, putting John Done into stealth several months back. We’re now preparing to launch a new incarnation of John Done: an assistant who helps you manage your phone calls. Please sign up here.

How does it feel? Validating, actually. I knew it was a good idea.

So thanks to the Google Duplex team, sure, for the neato advances you’ve made, but mostly for finally convincing my wife of what until now she has only met with one raised eyebrow. “No really, honey, I’m a visionary.” ;)