Betting Big on Small Data for Conversational AI

A lot of attention has been directed of late at how to leverage big data, much of it emphasizing approaches that aim to learn patterns from massive amounts of raw information with minimal human involvement. There are strong practical and economic reasons why technology companies with large numbers of users favor big data approaches: being constantly deluged with multitudinous streams of raw data gives them a competitive advantage over companies that don’t enjoy such large user bases. So the incentives for leveraging big data are clear, but what’s less clear is how purely data-driven approaches will fare by themselves at creating artificially intelligent machines that can carry on humanlike conversations.

The promise of big data is that by using techniques collectively referred to as machine learning, computers can employ statistical methods to glean important patterns automatically. Big data proponents take the view that these patterns can be pulled out of massive datasets without relying on human experts, using only the shape of the data itself or rough proxies like clicks and conversions. These approaches usually work quite well for problems where the data’s inherent patterns allow it to be transparently understood. That’s why they have enjoyed so much success in areas like signal processing, image recognition, and speech recognition. But the big-data playbook does not transfer very well to the task of conversational AI, because the relevant patterns are simply not present in raw data.

Data, knowledge, and the problem of learning to converse

For the specific case of building machines that interact with humans, big data faces a stark challenge. Human language has evolved as a lossy encoding for meaning, in that what is spoken or written by people is really just the tip of an enormous iceberg of thoughts, emotions, intentions, and goals. As a result, linguistic utterances are rife with ambiguity and vagueness, with much left to the hearer to infer. People can afford this lossiness because they rely on their interlocutors’ ability to reason, filling in the missing parts of the signal based on a negotiated set of facts, awareness of the particular situation, and what is implied but not said.

Big data techniques are not particularly well suited to understanding human language in the turn-based dialog setting of conversational AIs, simply because data is not knowledge. Having knowledge of something requires human understanding; raw data is just a signal that is usually intended to evoke a certain understanding in a certain audience.

To see the difference, we need look no further than encryption. An encrypted email message surely qualifies as data, since it can be encoded as a sequence of bytes just like an unencrypted message. But an encrypted message, by design, is only readable and understandable by someone who is able to produce the unencrypted form. Big data techniques make the bet that a useful understanding of the message behind the data can be extracted from the data alone, with the right techniques and enough volume, assuming the data by itself will suffice as a proxy for understanding.

Unfortunately, raw data usually doesn’t come close to capturing the elaborate reasoning people perform in the process of understanding natural language. Some of this reasoning might be relative to a shared background of knowledge about the world. For example, if on November 9, 2016, you overheard someone say to a friend, “She lost, but she’ll almost certainly win the popular vote by a lot,” you would probably have inferred that the person referred to as “she” in that exchange was Hillary Clinton, a candidate in the U.S. presidential election. People sometimes reach conclusions based on their knowledge of how classes of things in the world behave in general: few would suggest that a compact car is capable of transporting a grand piano.

But critically, human reasoning can also function independently of facts in the real world. As an example, I can hypothesize that if my flight arrives late, then I will miss my connecting flight, which in turn implies that I’ll need to spend the night in Chicago. I can work out that contingency even though as far as I know, my flight is on schedule. People also reason about unspoken intentions when having a conversation. Imagine that Kim and Sandy are planning to have a meeting at some public place, and Sandy knows that Kim likes to drink coffee in the morning. If Kim then suggests a morning meeting at a place Sandy knows doesn’t have very good coffee, Sandy might suggest an alternative even though nothing has been explicitly stated about the coffee available at the proposed location.
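The flight example can be sketched as a few lines of rule-based inference. This is only an illustration of reasoning from a supposition rather than from known facts; the rule names and the `forward_chain` helper are hypothetical and are not drawn from any particular system.

```python
def forward_chain(facts, rules):
    """Derive everything that follows from `facts` under `rules`.

    Each rule is a (premises, conclusion) pair; a rule fires once all of
    its premises are in the derived set.
    """
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and all(p in derived for p in premises):
                derived.add(conclusion)
                changed = True
    return derived


# Hypothetical rule base for the travel example in the text.
RULES = [
    (("flight_arrives_late",), "miss_connection"),
    (("miss_connection",), "overnight_in_chicago"),
]

# What is actually known: the flight is on schedule.
actual = forward_chain({"flight_on_schedule"}, RULES)

# What would follow if we merely suppose the flight arrives late.
hypothetical = forward_chain({"flight_arrives_late"}, RULES)

print(sorted(actual))
print(sorted(hypothetical))
```

The same rules yield the Chicago contingency only under the supposition, which is the point of the example: the conclusion comes from the rules and the hypothesis, not from anything observed.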

Who’s the boss?

Approaches that aim to avoid the critical human understanding step using raw data are immediately confronted with the question of how and to what degree to replace that understanding. When machines try to handle natural language, the related tasks can be characterized along a spectrum, with some requiring more human-level understanding and some less. Tasks that demand only the shallowest level of understanding are the most amenable to big data: examples are tagging a word as a noun, verb, adjective, and so on (part-of-speech labeling, in the jargon of the field) and identifying certain complexes, such as noun-like structures (called syntactic chunking).

But as we move in the direction of a more complete model of human language understanding, supervised machine learning, in which a human expert helps a machine learn the right patterns by providing labels for the data it learns from, becomes more effective. This human supervision of machine learning contrasts with unsupervised techniques that use only raw unlabeled data. Human supervision is more expensive and time consuming because it often requires the painstaking attention of highly trained experts. Supervised machine learning is often employed in more abstract tasks such as syntactic parsing (generating the unseen grammatical constituency structure), semantic representation (giving an abstract model of the literal meaning), and the interactive reasoning people undertake when they engage in multi-turn dialogs. Expert supervision via labeled data is a way to expose, in a form that machines can make use of, the hidden patterns in data that a human would grasp in the course of understanding.

When big isn’t big enough

Another challenge big data approaches face in understanding human language, and dialog in particular, is sparsity. It’s highly likely that everyone has at some point during their lifetime uttered a string of words of their language that had never before been produced. That’s one fascinating fact about language: its recursive structure means that a finite number of words can be combined to produce infinitely many constructions. This infinite productivity is one of the reasons unsupervised big data approaches to understanding language perform best when applied to very large amounts of data.
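A toy illustration of that productivity: a single recursive rule over a handful of words already produces a new sentence at every depth. The function name and the word list are made up for the sketch.

```python
def nested_sentence(depth):
    """Build a sentence with `depth` levels of clausal embedding."""
    subjects = ["Kim", "Sandy", "Robin", "Chris"]
    if depth == 0:
        return "the flight is late"
    # Cycle through the finite list of names as the embedding gets deeper.
    speaker = subjects[(depth - 1) % len(subjects)]
    return f"{speaker} thinks that {nested_sentence(depth - 1)}"


for d in range(4):
    print(nested_sentence(d))
```

Every depth gives a distinct, grammatical sentence, so no finite corpus can contain them all.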

But the effects of sparsity become compounded when people string utterances together, so that it’s very likely that all of us have interactions with other people all the time that have never occurred before, especially when we take into account the understood meanings and not just the words themselves. For example, suppose someone says this to you at a party during the holidays:

Did you see that weird pink thing she got him?

It’s quite possible that you have never encountered this exact string of words before, but you probably understand immediately that the person is referring to the giant stuffed pink giraffe that you heard someone say your friend Robin gave your other friend Chris at that same party.

The chance of already having observed verbatim, in some large data set, the exact dialog a user has with a conversational AI and then inferring an understanding about it is small and diminishes with the length of the exchange. We could try to get around this problem by abstracting things to a high level, for example, by trying to identify user intentions and dialog moves rather than attempting to understand entire dialogs in all their intricacy. But given the way human conversations unfold, where each utterance is not only interpreted in the prior linguistic and world-knowledge context but also updates and extends that context, the expert labeling required is so complex that it becomes extremely time consuming and expensive to get enough useful examples to learn from.

Bigger isn’t always better

Rather than pursuing a purely big-data strategy for language understanding and dialog, at Ozlo, we are instead building a hybrid system that leverages both big data and what I’ll call small data approaches. Small data explicitly models human understanding of language in context, rather than trying to distill artifacts of that understanding from lots of examples. It aims to capture the deep and complex processes of understanding and reasoning that go on behind the user’s (and the system’s) natural language utterances. These processes often transcend the approximations that can be gleaned from the utterances themselves.

To achieve this, small data leverages expert training about the inner workings of dialog that draws on linguistic theory, our understanding of how people perform logical inference, and models of human knowledge about the world. This requires developing both an extensive graph of human knowledge of concepts and the relationships between them as well as a database of deep background knowledge about real-world entities. It also requires building an understanding of the mapping of language to meaning in context that looks not just at the hidden semantic structure of the utterances but at what else has been said in the current conversation. And it requires a model of how humans start from facts, assumptions, and hypotheses and proceed to derive other, possibly contingent, facts from them.
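To make the idea of a concept graph concrete, here is a minimal sketch in which concepts inherit properties along is-a links. It is not a description of Ozlo’s actual data structures; the concept names, property names, and volume figures are all invented for illustration and are not measurements.

```python
# Hypothetical toy graph: concept -> (parent concept, locally stated properties).
# The numbers are illustrative placeholders, not real measurements.
CONCEPTS = {
    "vehicle": (None, {"can_transport_people": True}),
    "car": ("vehicle", {"cargo_volume_m3": 0.5}),
    "compact_car": ("car", {"cargo_volume_m3": 0.3}),
    "moving_truck": ("vehicle", {"cargo_volume_m3": 20.0}),
    "instrument": (None, {}),
    "grand_piano": ("instrument", {"volume_m3": 3.0}),
}


def lookup(concept, prop):
    """Return the nearest value of `prop`, walking up the is-a links."""
    while concept is not None:
        parent, props = CONCEPTS[concept]
        if prop in props:
            return props[prop]
        concept = parent
    return None


def can_carry(vehicle, item):
    """Compare an item's volume with a vehicle's cargo capacity."""
    capacity = lookup(vehicle, "cargo_volume_m3")
    size = lookup(item, "volume_m3")
    if capacity is None or size is None:
        return None  # the graph does not say
    return size <= capacity


print(can_carry("compact_car", "grand_piano"))
print(can_carry("moving_truck", "grand_piano"))
```

A compact car is never told directly that it can transport people; it inherits that from `vehicle`. That kind of general knowledge about classes of things is what lets the graph answer questions no one stated explicitly.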

The hybrid small and big data approach we embrace at Ozlo uses both supervised and unsupervised machine learning where they make sense, especially when they can help the system make decisions that it lacks enough information to make otherwise. But specifically for the case of dialog, we are also making a significant investment in capturing what it is that people do when they understand and reason about information in a multi-turn conversational setting. For some purposes, we use human experts for supervised machine learning, crystallizing their expertise in the form of labeled data that improves our machine-learned models. For others, we directly encode expert understanding of human language and reasoning in the form of algorithms (such as the one that drives our dialog system) and data structures (for example, in building our databases of world knowledge).

Our aim is to create more natural AI systems that approach conversational interaction from a humanlike angle. We believe much of the complexity of dialog will be very difficult or impossible to capture from data alone, regardless of how much is used.

Deciding what counts as AI: Turing vs. Winograd

Alan Turing’s famous test for whether a machine qualifies as exhibiting AI essentially comes down to being able to produce outputs that fool a human most of the time. Turing’s proposal maintains that a machine’s internal understanding of its conversations with people doesn’t matter, that all that counts is whether it generates utterances that are good enough. This way of thinking is very much in the spirit of big data approaches, and reflects the pervasive undercurrent of empiricism in many branches of science in the late 1940s and early 1950s, when Turing devised his test. There are at least two examples of computer programs that have been credited, at least by some accounts, with passing versions of the Turing test using a combination of mimicry and clever strategies for formulating questions by echoing back parts of the user’s utterances. One is Joseph Weizenbaum’s ELIZA, developed in the 1960s, and probably the first attempt at an AI chatbot; more recently, the chatbot Eugene Goostman was built using similar tactics.
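The echoing tactic is easy to sketch. The patterns and canned replies below are hypothetical and are not taken from Weizenbaum’s original script; the point is only that a plausible-sounding reply can be produced with no model of meaning at all.

```python
import re

# Swap first- and second-person words so the echo reads as a reply.
REFLECTIONS = {"i": "you", "my": "your", "me": "you", "am": "are",
               "you": "I", "your": "my"}


def reflect(fragment):
    """Rewrite a fragment of the user's utterance from the bot's point of view."""
    words = fragment.lower().split()
    return " ".join(REFLECTIONS.get(w, w) for w in words)


# (pattern, response template) pairs; hypothetical examples only.
PATTERNS = [
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (.+)", re.IGNORECASE), "Tell me more about your {0}."),
]


def respond(utterance):
    """Echo part of the utterance back as a question, or fall back to a stock reply."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in PATTERNS:
        match = pattern.search(text)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please go on."


print(respond("I feel ignored by my manager."))
print(respond("The weather is nice"))
```

Nothing in this code knows what a manager or a feeling is, which is exactly the gap between producing acceptable output and understanding.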

In small data approaches, the Turing test’s empiricist bent is tempered with a strong dose of rationalism, because small data is driven by the idea that what goes on inside the machine, the actual mechanism of understanding, really does matter. Small data takes the point of view that a more suitable test for AI is given by Winograd schemas: questions that are difficult to answer without near-human-level understanding. An example Winograd schema is

The trophy would not fit in the brown suitcase because it was too big

where the challenge is to correctly answer the question “What was too big?” Without world knowledge about the relative sizes of suitcases and trophies, and what it means for something to fit inside something else, a computer would have a very difficult time answering correctly. Small data maintains that human-level AI will need to be able to pass tests for intelligence that more closely resemble Winograd schemas than they do the Turing test. To date, the most successful attempts to build systems that pass Winograd schema-based tests have used a mix of small and big data.
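As a rough sketch of what explicitly encoded world knowledge buys for the trophy example, the snippet below hard-codes a single rule about fitting: it fails when the contents are too big or the container is too small. The table and function names are hypothetical, and a real system would need far more general knowledge than this one rule.

```python
# Hypothetical encoding of one piece of world knowledge about "fit in":
# fitting fails when the contents are too big or the container is too small.
FIT_FAILURE_CAUSE = {"big": "contents", "small": "container"}


def resolve_it(contents, container, adjective):
    """Pick the referent of "it" in a sentence of the form
    "The <contents> would not fit in the <container> because it was too <adjective>."
    """
    role = FIT_FAILURE_CAUSE.get(adjective)
    if role is None:
        return None  # this rule says nothing about other adjectives
    return contents if role == "contents" else container


print(resolve_it("trophy", "suitcase", "big"))
print(resolve_it("trophy", "suitcase", "small"))
```

Changing the single word “big” to “small” flips the answer from the trophy to the suitcase, even though the surface form of the sentence is otherwise identical. Word statistics alone give little to go on here, while one explicit rule about containers settles it.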

In fact, throughout the history of conversational AI, the vast majority of systems of any scale or complexity have made intentional use of small data at some level. People who study and build the AI systems of the future would be well served to learn from that evident pattern in the data.

Thanks to Chris Brew, Mike Hanson, Ron Kaplan, Robin Martin, Adwait Ratnaparkhi, and Heidi Young for critiquing an earlier draft.