What Are Dialog Systems, or Something About Eliza

It seems that dialog systems have been around since the dawn of our age. A good example is Eliza, a chatbot born in the 1960s. It is a psychotherapist bot whose interface resembles that of an Apollo mission, and that’s not a coincidence. Let’s talk more about dialog systems, but first, our guest star Eliza:

Nowadays you meet dialog systems virtually everywhere: when you call a bank, you first hear the pleasant voice of a so-called auto-informer saying “dial 1 for credit card”; when you drive to a new place, your navigation software gives you directions. These, along with Apple’s Siri and Microsoft’s Cortana, are all dialog systems.

Why are they so popular? Probably because a conversation is a natural way for people to get information.

Classification

We need to talk about classification of dialog systems to better understand how they are built. There are two main criteria which we’ll use to describe dialog systems:

  1. General-Purpose — Task-Oriented
  2. Open Domain — Closed Domain

Task-oriented systems are created to solve a particular problem: find the information a user requested or accomplish a task. Therefore, there is always a point at which the system has done what it was asked to do (or has failed and run out of options) and the conversation can end. General-purpose systems are the opposite: their goal is the conversation itself, and they aim to make it enjoyable.

Closed-domain systems can accomplish a particular task (or a small set of tasks) in a narrow domain, e.g. ordering a taxi or finding information on flights. Open-domain systems, on the other hand, are not limited to one domain and are meant to be omni-purpose: Siri, for example, is supposed to do anything that can be done by an iPhone.

Open-domain systems are significantly more difficult to implement than closed-domain ones, because the system needs to “know” more. Let’s have a look at a few examples:

Let’s start with the simplest dialog systems: auto-informers. They are certainly closed-domain and task-oriented. They barely even have a dialog: you can take part in the conversation only by dialing numbers on your phone.
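
To make this concrete, here is a minimal sketch of an auto-informer as a tree of prompts that the caller walks by pressing keys. The menu contents and the function name `run_auto_informer` are made up for illustration; a real phone system would be configured quite differently.

```python
# Hypothetical phone menu: every node has a prompt and the keys it accepts.
MENU = {
    "prompt": "Dial 1 for credit card, 2 for loans.",
    "options": {
        "1": {
            "prompt": "Dial 1 to check your balance, 2 to block the card.",
            "options": {
                "1": {"prompt": "Your balance will be read out now.", "options": {}},
                "2": {"prompt": "Your card has been blocked.", "options": {}},
            },
        },
        "2": {"prompt": "Connecting you to the loans department.", "options": {}},
    },
}


def run_auto_informer(keys, menu=MENU):
    """Follow the dialed keys through the menu and return the prompts heard."""
    heard = [menu["prompt"]]
    node = menu
    for key in keys:
        if not node["options"]:
            break  # leaf reached: the task is done and the conversation ends
        next_node = node["options"].get(key)
        if next_node is None:
            # Unknown key: stay where we are and repeat the current prompt.
            heard.append("Sorry, that option is not available. " + node["prompt"])
            continue
        node = next_node
        heard.append(node["prompt"])
    return heard


for line in run_auto_informer(["1", "2"]):
    print(line)
```

The whole “dialog” is a walk from the root to a leaf, which is why such systems are both closed-domain and task-oriented.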

Another example is ELIZA, which is also closed-domain: it interprets every phrase in a psychological way (which resembles real psychologists a bit, don’t you think?..) and is general-purpose in the sense that it has no explicit goal in the chat.
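
The flavor of such a bot can be sketched with a handful of pattern-and-template rules. The rules below are invented for illustration and are not the original ELIZA script; they only show how every input gets turned into a “psychological” question without any goal behind it.

```python
import re

# Hypothetical rules in the spirit of ELIZA: (pattern, response template).
RULES = [
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (mother|father|family)\b", re.IGNORECASE),
     "Tell me more about your {0}."),
]
FALLBACK = "Please go on."


def eliza_reply(utterance):
    """Return the response of the first matching rule, or a generic prompt."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(*match.groups())
    return FALLBACK


print(eliza_reply("I feel tired."))       # Why do you feel tired?
print(eliza_reply("Nice weather today"))  # Please go on.
```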

Next on the list is Char-RNN by Andrej Karpathy. It is here as an example of the simplest chatbots. Char-RNN itself is a neural language model that continues a string of text fed to it as input. You can use it as a chatbot if, for example, you train it on subtitles from series or movies, so that it learns to “answer” your line the way characters in the movies do. Based on that, one could say that Char-RNN is open-domain (it can talk on any topic) and general-purpose (there is no notion of a goal in such a dialog). The only issue with this model is that it also has no notion of a dialog, a phrase or even a word. It just does the only thing it knows: it continues a string of text.

Only two items are left on the list: ConvAI and so-called True AI. Why state explicitly that the AI is “true”? Because nowadays everything is called AI, and you need to set, say, auto-informers apart from Artificial Intelligence in its full glory. True AI can keep up a dialog on any topic, and not only keep up but also lead the conversation. As you know, there is no such system at the moment, but one can try to build a system as close to it as possible. ConvAI is one such effort.
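
To show what “continuing a string of text” means, here is a toy stand-in. It is not Char-RNN and not a neural network at all: it is a simple character n-gram counter, used only because it fits in a few lines. Like Char-RNN, it sees nothing but characters and has no idea what a word or a dialog turn is. The function names and the tiny corpus are assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def train_char_model(text, order=3):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1
    return counts


def continue_text(counts, seed, length=20, order=3):
    """Greedily append the most frequent next character, one at a time."""
    out = seed
    for _ in range(length):
        context = out[-order:]
        if context not in counts:
            break  # unseen context: the model has nothing to add
        out += counts[context].most_common(1)[0][0]
    return out


corpus = "hello there. hello there. "
model = train_char_model(corpus, order=3)
print(continue_text(model, "hel", length=9))  # hello there.
```

A real Char-RNN replaces the lookup table with a recurrent network and samples the next character instead of always taking the most frequent one, but the interface is the same: text in, more text out.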

ConvAI

The last item left to discuss is ConvAI, and it’s time to talk about it. ConvAI stands for Conversational Artificial Intelligence; it is the short name of the Conversational Intelligence Challenge (convai.io), which will be held at NIPS this year, a week from today. The idea of the challenge is to build a dialog system that can discuss a text with a human. Teams submit their dialog systems, which then chat with volunteers.

But back to the topic. Since ConvAI is a challenge, one has to decide somehow who the winner is. And this is actually not a simple question, because the standard text-comparison metrics used, for example, in machine translation competitions (BLEU, ROUGE, etc.) don’t suit the task.

BLEU and human judgement correlation. From [1].

As you can see in the picture, the BLEU score is not even slightly correlated with human judgements. At the same time, human judgements correlate quite strongly with one another, which means that one person’s opinion about the quality of an answer usually matches that of other people.

If you’re not familiar with such metrics, here is a short description. Metrics like BLEU are made for comparing strings: they score a pair of strings higher the more words and phrases the two have in common.

Let's consider an example. Say we have a question to a dialog system: “Tell me something about elephants.” A possible answer from the system could be: “These are huge animals with big ears”. But if the reference answer is “Elephants are mammals which live in Africa and India, their weight can reach 3 tons”, then the suggested answer will score poorly even though it is perfectly correct according to common sense.
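
The effect is easy to reproduce. The sketch below computes a simplified unigram precision, which is only one ingredient of BLEU (the full metric also uses longer n-grams and a brevity penalty), on the two elephant answers. The tokenizer and function names are assumptions made for this example.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace and strip basic punctuation."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]


def unigram_precision(candidate, reference):
    """Share of candidate words that also occur in the reference (clipped)."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0


reference = ("Elephants are mammals which live in Africa and India, "
             "their weight can reach 3 tons")
candidate = "These are huge animals with big ears"

# Only "are" is shared, so the score is 1 out of 7 candidate words.
print(round(unigram_precision(candidate, reference), 2))  # 0.14
```

A perfectly sensible answer gets about 0.14 simply because it uses different words than the reference.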

Because answers vary so much, we cannot score dialog systems directly on their answers. More importantly, since we have no other scoring for general-purpose systems, we cannot score them independently. We need some other way to score at least goal-oriented systems. There is such a metric, and it is pretty straightforward. It is called Task Completion Rate:

Task completion rate is the percentage of successful runs of a dialog system. A successful run is a conversation in which the system accomplished its goal: for goal-oriented systems the goal is usually pre-defined, while for general-purpose systems we usually consider user satisfaction to be the goal.

Task completion is judged by humans; in our case they decide whether the conversation was satisfactory, engaging, and so on. We can therefore compare dialog systems by the number of successful dialogs they had, where a successful dialog is one that achieved its goal. For ConvAI we’ve decided to make the conversation be about a piece of text, for example a Wikipedia paragraph. After finishing the conversation, a person is asked to leave feedback on its quality. The trick is that we don’t tell the person what their counterpart is: it could be either an AI (a bot from one of the participating teams) or a fellow human being. Based on this assessment we can rank dialog systems from a dummy Char-RNN bot up to real human intelligence.
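
In code the definition is a one-liner. In this sketch each dialog is represented by a boolean that says whether a human judged it successful; that representation and the function name are assumptions made for the example.

```python
def task_completion_rate(outcomes):
    """Percentage of dialogs judged successful (`outcomes` holds booleans)."""
    if not outcomes:
        raise ValueError("need at least one dialog")
    return 100.0 * sum(outcomes) / len(outcomes)


# Four hypothetical dialogs, three of them judged successful.
print(task_completion_rate([True, True, False, True]))  # 75.0
```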
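
The ranking step could look roughly like this. The participant names, the ratings and the 1 to 5 scale are all invented for illustration; the sketch simply averages the human feedback per participant and sorts, which is one plausible way to rank and not necessarily the challenge's exact procedure.

```python
from statistics import mean


def rank_systems(ratings):
    """Sort participants by mean human rating, best first."""
    averages = {name: mean(scores) for name, scores in ratings.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical feedback on a 1-5 scale; the rater does not know who is a bot.
ratings = {
    "char_rnn_bot": [1, 2, 1],
    "team_bot": [3, 4, 3],
    "human": [5, 4, 5],
}

for name, score in rank_systems(ratings):
    print(f"{name}: {score:.2f}")
```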

Datasets

You can donate your dialogs to the community, so we are collecting a dataset of human-human and human-bot dialogs. The first part of the dataset is already available here. After the end of our challenge the rest of the data will also be made public.

Of course, ours is not the first dialog systems challenge with a dataset (although it is unique in some respects). One convenient list of dialog corpora is published in [2]. Another example of a dialog systems challenge and dataset is DSTC, which stands for Dialog State Tracking Challenge. This year it is being held for the 6th time. All its data is available here. What is dialog state tracking? The idea of a dialog state refers to a semantic frame, a restricted representation of the information in a dialog; the dialog can be considered successful if the whole frame is filled with knowledge from the dialog. As the definition suggests, this challenge is for closed-domain systems. So although it is the closest challenge to ConvAI, the unique feature of ConvAI is the open-domain nature of the proposed conversations.
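
A semantic frame is easiest to picture as a small dictionary of slots that the tracker fills turn by turn. The sketch below uses a made-up flight-search frame with three slots and naive keyword patterns. It is not how DSTC systems work internally, and the patterns would misfire on many real sentences (for example “I want to fly” would fill the destination with “fly”); it only shows the idea of a frame that is complete once every slot has a value.

```python
import re

# Hypothetical frame for a flight-search bot: slot name -> extraction pattern.
SLOT_PATTERNS = {
    "origin": re.compile(r"\bfrom (\w+)", re.IGNORECASE),
    "destination": re.compile(r"\bto (\w+)", re.IGNORECASE),
    "date": re.compile(r"\bon (\w+)", re.IGNORECASE),
}


def update_state(state, utterance):
    """Return a copy of the dialog state updated with slots found in the turn."""
    new_state = dict(state)
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            new_state[slot] = match.group(1)
    return new_state


def is_complete(state):
    """The frame is filled when every slot has a value."""
    return all(state.get(slot) is not None for slot in SLOT_PATTERNS)


state = {}
state = update_state(state, "I need a flight from Moscow to Berlin")
print(state, is_complete(state))
state = update_state(state, "On Friday, please.")
print(state, is_complete(state))
```

Because the set of slots is fixed in advance, this kind of tracking only makes sense in a closed domain.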

Conclusion

If you’re attending NIPS, don’t miss the chance to have some fun chatting with bots and humans, and donate your dialogs with chatbots to help create more robust conversational AI. We should also mention that the challenge is co-organized by Yoshua Bengio, who needs no introduction, and by Alan Black and Alex Rudnicky from Carnegie Mellon University, who have extensive experience in dialog systems research and development.

Finally, we’d like to invite you once more to volunteer in ConvAI: bots (and fellow humans!) are waiting for you. You can start chatting right now!

References

  1. Chia-Wei Liu et al. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. arXiv:1603.08023
  2. Iulian Vlad Serban et al. A Survey of Available Corpora for Building Data-Driven Dialogue Systems. arXiv:1512.05742