The 6 Evolutionary Stages of Chatbot AI
Suddenly this year, there are ‘conversational interfaces’ and ‘chatbots’ on offer from companies big and small. Many claim to be breaking new ground, and that’s surely true in some cases. For example, Google’s recent release of SyntaxNet uses Deep Learning in a new way to achieve the greatest accuracy yet attained for syntactic parsers. The release of that tool allows companies whose chatbots have reached a certain stage of development to swap out their own syntactic parser and focus even more intently on the really interesting problems. But how can we assess the smartness (i.e. potential usefulness) of these bots?
Many of the techniques used in these bots are new to industry but have been known to AI researchers for a decade or more. So an AI researcher can make pretty good guesses about how each bot works based on what features are promoted in its demos. What I’m going to do in this article is give you some tools for making the same kinds of inference yourself.
Let’s adopt a guiding metaphor that bot intelligence follows a natural evolutionary trajectory. Of course, bots are artifacts, not creatures, and are designed, not evolved. Yet within the universe of artifacts, bots have more in common with, say, industrial design than fashion, because functionality determines what survives. Like the major branches of Earth’s genetic tree, there are bundles of features and functionality that mark out a qualitative change along the trajectory of development. How many stages are there in the evolution of chatbot AI? Based on my years of research in NLP and discussion with other experts, my answer is six. Let’s look at each.
Stage 1, The Character Actor: Bots at this stage look for key phrases in what you type (or say, which they process with speech recognition) and give scripted responses. The primary effort that goes into these bots is making the scripted responses entertaining so you’ll want to continue the interaction. But because they only look for key phrases, they have a shallow understanding of what you’re saying and cannot go in-depth on any topic, so they tend to shift topics frequently to keep the interaction interesting. Bots at higher stages also need to provide interesting interaction, so it’s not a good idea to skip stage 1 entirely: there are no really good techniques at later stages for conveying personality or emotion, or for being entertaining.
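To make that mechanism concrete, here’s a minimal sketch of the stage 1 pattern; the trigger phrases and scripted replies are invented for illustration:

```python
import random

# Hypothetical key-phrase rules: each maps trigger words to scripted replies.
RULES = [
    ({"hello", "hi"}, ["Hey there! What's on your mind?"]),
    ({"movie", "film"}, ["I adore old sci-fi. Seen anything good lately?"]),
    ({"sad", "tired"}, ["Rough day? Tell me about it."]),
]
FALLBACK = ["Interesting! By the way, do you like movies?"]  # topic shift

def respond(user_text: str) -> str:
    words = set(user_text.lower().split())
    for triggers, replies in RULES:
        if triggers & words:            # shallow match: any key phrase present
            return random.choice(replies)
    return random.choice(FALLBACK)      # nothing matched: change the subject

print(respond("I watched a movie yesterday"))
```

The fallback reply is where the frequent topic-shifting comes from: with nothing matched, the bot changes the subject rather than admit it’s lost.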
Stage 2, The Assistant: Bots at this stage can do narrow tasks they’ve been explicitly programmed to do, like take a pizza order or make a reservation. Each kind of task is represented with a form (i.e., a tree of labelled slots) that must be filled out so the bot can pass the form values to a web service that completes the order. These bots use rules similar to stage 1, but the rules are split into two sets. One set looks for key phrases in what you type, as before, and uses parts of what you write to fill in the form. The other set checks which parts of the form still need filling and prompts for answers, or, when the form is full, calls a web service.
This form-filling design has been part of VoiceXML, a technology that underpins automated phone services, since the early 2000s. And it’s been in AI research systems since the late 90s. Recent work in machine learning and NLP certainly makes this design work better than it did in those old systems.
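As a concrete illustration of the two rule sets, here’s a minimal sketch, assuming an invented pizza-order form and a stub standing in for the web-service call:

```python
# Hypothetical slot-filling form for a pizza order; the web service is a stub.
FORM = {"size": None, "topping": None}
SIZES = {"small", "medium", "large"}
TOPPINGS = {"pepperoni", "mushroom", "cheese"}

def extract(user_text: str) -> None:
    """Rule set 1: scan the input for values that fill open slots."""
    for word in user_text.lower().split():
        if word in SIZES:
            FORM["size"] = word
        elif word in TOPPINGS:
            FORM["topping"] = word

def next_action() -> str:
    """Rule set 2: prompt for the first empty slot, or submit the full form."""
    for slot, value in FORM.items():
        if value is None:
            return f"What {slot} would you like?"
    return place_order(FORM)           # stand-in for the real web-service call

def place_order(form: dict) -> str:
    return f"Ordering a {form['size']} {form['topping']} pizza."

extract("I'd like a large pizza")
print(next_action())                    # -> "What topping would you like?"
extract("pepperoni please")
print(next_action())                    # -> "Ordering a large pepperoni pizza."
```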
Stage 3, The Talking Encyclopedia: Bots at this stage no longer require a predefined form but instead build up a representation of what you mean word by word (aka ‘semantic parsing’). This is the stage that Google’s SyntaxNet is designed to help with, because these systems usually start by identifying the ‘part of speech’ of each word and how the words relate to each other, as a guide to extracting meaning. SyntaxNet can do that first step, leaving the hard, interesting work of extracting meaning, which is what distinguishes bots at this stage.
A common use for these bots is looking up facts, such as the weather forecast two days from now, the third-tallest building in the world, or a highly rated Chinese restaurant nearby. The way they make this work is by building a database query as part of doing the semantic parse, leveraging the syntactic parse result. For example, nouns like ‘restaurant’ can be indexed to a database table of restaurants, and modifiers like ‘Chinese’ and ‘nearby’ add constraints for querying that table.
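Here’s a sketch of that mapping, with an invented lexicon and a SQL string standing in for whatever query language a real system uses; the noun picks the table, and the modifiers attached to it become constraints:

```python
# Hypothetical lexicon tying words to database structure.
NOUN_TO_TABLE = {"restaurant": "restaurants"}
MOD_TO_CONSTRAINT = {
    "chinese": "cuisine = 'Chinese'",
    "nearby": "distance_km < 2",
    "highly rated": "rating >= 4.5",
}

def build_query(head_noun: str, modifiers: list[str]) -> str:
    """Turn a parsed noun phrase into a query over the matching table."""
    table = NOUN_TO_TABLE[head_noun]
    constraints = [MOD_TO_CONSTRAINT[m] for m in modifiers]
    where = " AND ".join(constraints) if constraints else "1=1"
    return f"SELECT * FROM {table} WHERE {where};"

# A syntactic parser (e.g. SyntaxNet) would tell us 'restaurant' is the head
# noun and the other words modify it; here we hand-code that parse result.
print(build_query("restaurant", ["chinese", "nearby", "highly rated"]))
```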
Stage 4, The Historian: Bots at all the stages so far aren’t good at understanding how one of your sentences connects to another (unless both sentences are focused on a common task). They can’t even do a good job of knowing what you’re talking about when you say ‘he’ or other pronouns. These limitations stem from not having a usable record of how the exchange has gone and what it has been about. What distinguishes a bot at stage 4 is that it maintains a ‘discourse history’.
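As a minimal sketch of what a discourse history buys you, assume a toy lexicon of known entities: the bot logs each entity as it’s mentioned and resolves a pronoun to the most recently mentioned compatible one:

```python
# Toy lexicon of known entities with a feature a pronoun can match on.
ENTITIES = {"alice": "she", "bob": "he", "the report": "it"}
PRONOUN_CANON = {"him": "he", "her": "she"}

history: list[str] = []   # discourse history: entities in order of mention

def update_history(user_text: str) -> None:
    for entity in ENTITIES:
        if entity in user_text.lower():
            history.append(entity)

def resolve(pronoun: str):
    """Resolve a pronoun to the most recently mentioned compatible entity."""
    target = PRONOUN_CANON.get(pronoun, pronoun)
    for entity in reversed(history):
        if ENTITIES[entity] == target:
            return entity
    return None

update_history("Bob sent the report to Alice")
print(resolve("he"))    # -> 'bob'
print(resolve("it"))    # -> 'the report'
```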
Beyond fixing those limitations, research suggests that this is the best stage to integrate gesture understanding with language understanding. The reason gesture understanding fits at this stage is that pointing gestures are a lot like pronouns, and more complex demonstrative gestures (like miming how to get a bottle cap off) build on pointing-like gestures much as sentences build on pronouns.
Remember how last year’s buzz was about virtual and augmented reality? In the near future the buzz will be about giving chatbots a 3D embodiment that can understand you just from your body language. After all, conversational interfaces should be easier to use than other interfaces because you just talk, as one does in everyday life. If you reduce the amount of talk needed to just what people use with each other, that’s really smart, and really useful.
Stage 5, The Collaborator: The Assistant in stage 2 recognises requests that match the tasks it knows about, but it doesn’t think about what you’re ultimately trying to achieve and whether it could advise a better path. The best example of such a bot to be found in AI research is one that, when asked which platform to catch a certain train from, answers that the train isn’t running today. It’s certainly possible to get bots at earlier stages to check for specific problems like this one, but only stage 5 bots have a general ability to help in this way.
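Here’s a sketch of that behaviour with an invented timetable: before answering the literal platform question, the bot checks whether the inferred goal, actually catching the train, can succeed, and volunteers the obstacle if not:

```python
# Invented timetable; a real system would query a live service.
TRAINS = {
    "10:15 to Glasgow": {"platform": 4, "running_today": False},
    "10:40 to Leeds":   {"platform": 2, "running_today": True},
}

def answer_platform_question(train: str) -> str:
    info = TRAINS[train]
    # A stage 2 bot would answer the literal question and stop there.
    # A stage 5 bot first checks whether the inferred goal -- actually
    # catching this train -- can succeed, and flags the obstacle if not.
    if not info["running_today"]:
        return f"The {train} isn't running today."
    return f"The {train} leaves from platform {info['platform']}."

print(answer_platform_question("10:15 to Glasgow"))  # obstacle, not platform
print(answer_platform_question("10:40 to Leeds"))    # -> platform 2
```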
Bots at this stage seem to fulfill the ideal of the helpful bot, but it’s still not what the public or researchers dream of. That dream bot is found in the next stage.
Stage 6, The Companion: This is a mythical stage that all bot creators aspire to but which so far exists only in fiction, like the movie Her. Ask yourself what it is about the computer character that people wish were real. I believe her empathy and humor are part of it, but her core virtue is that she understands his situation, and what could help him through it, and then provides that help. Would a real bot require any technological advance beyond stage 5 to help in similar ways? It seems not. What bots do need to reach this stage is a model of the everyday concerns people have and how to provide counsel about them. That core skill, leavened with humor from stage 1, could provide something close to companionship.
Could a bot be a good counselor? It’s certainly a serious question, because the needs of the elders, teens, parents, and anxious wage-earners of today and the coming years are outstripping the available professional care. Conversational AI has the potential to deliver real benefits beyond simplifying interfaces. Let’s work with professional counselors and make it happen.
I hope this article has provided a way for us all to identify the strengths in different chatbots and talk about them with workable shorthand terms. It would be great to see discussions about some particular chatbot being strong in stage 4 but boring because it neglected stage 1, for example. That would give everyone a chance to influence the directions of this very promising technological advance.