Neuromation Researchers Won the NeurIPS ConvAI2 Competition!

A team with Neuromation researchers has won the Conversational Intelligence Challenge 2 (ConvAI2), an official competition of this year’s NeurIPS 2018 conference in Montreal, Canada.

Sergey Golovanov, AI Researcher at Neuromation, together with his friend Alexander Tselousov formed the “Lost in Conversation” team to enter the NeurIPS 2018 competition… and they won the ConvAI2 finals!

Neither Sergey nor Alexander was able to attend NeurIPS in person, so we asked their former teammate and Neuromation AI Researcher Rauf Kurbanov to present their solution.

Neuromation AI Researcher Rauf Kurbanov

Let’s find out what happened, what the competition was about, and how we won…

A slide from the organizers’ presentation announcing the victory. Yes, it was a contest about text generation, not graphic design.

Conversational Intelligence: Why It Is So Hard

First, let’s discuss the challenge itself. Basically, it was a competition among chatbots, programs that are supposed to keep a conversation going with a real human being. Chatbots were a very hot topic in AI a couple of years ago, and some overly optimistic promises were made. Today it is very clear that we still have a long way to go before passing the Turing test.

The problem with the Turing test is that human language is redundant, contrived, and extremely underdefined, all at the same time. First, language has a lot of ways of saying the same thing, and you can usually remove quite a few words while still getting the message across (not all of us are Hemingway, as these blog posts make painfully clear). This redundancy helps us understand each other through such imperfect media as sound, but for an AI model it means that it has a lot of extra structure to learn. And this structure is very contrived; the Chomskyan program of constructing a “universal grammar” has never really worked, and every natural language has many more rules than we usually think. One classical example would be the order of adjectives in English: usually you say “a big red brick building”; if you say “a red big brick building”, you are deliberately twisting the phrase to emphasize color; and it would be completely wrong to say “a brick big red building”: nobody says that. I bet most of you have never given this rule any thought unless you’re linguists, but we all follow it without thinking.

But that’s not the hardest part. AI models can learn complex structures just fine, and this baroqueness of natural language does have its logic that linguists have been studying for hundreds of years. The hardest part is that language relies on our natural understanding of the world, and that’s the “underdefined” part. To make conversation, a human being has to understand the world around him/her, relying on the intuition learned over decades of living in this world. And that’s what we definitely cannot give to AI models, at least not yet.

To answer a question like “I have a book in the left hand and a pen in the right; which do you see on the left while standing in front of me?”, you have to have a deep understanding of the 3D world we live in. A chatbot can have a predefined answer to “What are your favourite movies?”, but if you dig deeper and ask “What do you like about that movie?”, “What’re your favourite parts of it?”, “What books does it remind you of and why?” and so on, the chatbot will have to start evading questions because it can’t have the fullness of human experience preprogrammed or even learned: at this point, we simply don’t know how to do that.

So What’s the Challenge About?

However, even though the main goal remains out of reach, in recent years conversational models have seen a lot of exciting progress. Modern conversational models are based on sequence-to-sequence learning and usually employ recurrent neural networks, attention mechanisms, or both. One of the first successful applications of seq2seq to conversational models was by Vinyals & Le (2015). Since then, there have been several important works on hierarchical encoder-decoder architectures that combine an encoder/decoder architecture for individual utterances with a general state vector that keeps the history of the conversation (Sordoni et al., 2015; Serban et al., 2016), and so on. We don’t want to go into mathematical details here but definitely plan to devote some future NeuroNuggets to this field: it’s extremely exciting and is on the cutting edge of research in attention mechanisms and RNNs.

The ConvAI2 Competition specifically concentrated on three problems that still plague even state-of-the-art chatbots (see the competition’s web site for more details):

  1. conversational models usually don’t have long-term memory since they are trained only on recent dialogue history. Straightforward seq2seq models suffer from this the most, but even hierarchical encoder-decoder architectures struggle to account for long-term history of the dialogue.
  2. chatbots lack a consistent personality: to be convincing, a chatbot should have its own biography, surroundings, family and pets, character traits etc. There have been recent works to remedy this, e.g., (Zhang et al., 2018), but this is still very much a work in progress.
  3. since conversational models have to be able to evade tricky questions, they often fall into the local optimum of non-specific answers like “I don’t know” or “Let’s change the subject”. This is of course better than some completely incoherent answer but still ultimately unsatisfying. There has been some work on this, see e.g., (Li et al., 2015), but it’s still a big problem.

To make the contestants work on these problems, ConvAI2 used the so-called Persona-Chat dataset. It consists of conversations between real people (it was crowdsourced) who were given the task to simply chit-chat in a natural way, get acquainted and learn a bit about each other based on a profile communicated to the crowdworkers. The dataset has over 160K utterances in almost 11K dialogs among 1,155 possible personas, with a portion set aside for validation. A persona is a small set of profile sentences that describes likes, dislikes, and discussion points that should come up in a chit-chat conversation.

Sample dialogue from the ConvAI2 dataset. Note how the personas are defined through natural language text as well.

I would like to use this opportunity to thank the organizers of the ConvAI2 competition. Several primary organizers are from the Moscow Institute of Physics and Technology and DeepPavlov: Mikhail Burtsev, Varvara Logacheva, and Valentin Malykh. Thanks, guys! But the organizational team (see the ConvAI2 website) was much larger and included many top NLP researchers. Our big thanks go to Ryan Lowe, Iulian Serban, Shrimai Prabhumoye, Emily Dinan, Douwe Kiela, Alexander Miller, Kurt Shuster, Arthur Szlam, Jack Urbanek, and Jason Weston; and a special thanks to the advisory board of the competition with none other than Yoshua Bengio, Alan W. Black, Joelle Pineau, Alexander Rudnicky, and Jason Williams. Thank you all for this wonderful opportunity!

The Evaluation Problem

So the organizers published the Persona-Chat dataset and submissions started pouring in. There is, of course, the problem of which loss function to use for your model, but just for the sake of argument let’s assume everybody programmed huge tables of predefined answers. The question remains: how do you rate the submissions? What is the metric you evaluate these models on?

This is a very interesting problem for conversational models. It’s very hard to come up with a good metric for text generation, especially for an open-ended problem like dialogue. “Hello”, says the human. “Hello”, replies the other human in the training set. “Hi, how’re you today?”, says the model. Is it a wrong answer? Well, sure, not a single word matches the ground truth! But is there really anything wrong with it? The same problem arises in machine translation: one sentence can have lots of correct translations that have little to do with each other (because any language is redundant, remember?).

Although there is still no good answer to the evaluation problem, we have to make do with what we’ve got. The organizers of the ConvAI2 competition settled on two different ways to evaluate the submissions.

First, the automated metrics. Naturally, there wouldn’t be enough resources to score all models manually, so the preliminary stage of the competition was evaluated automatically. There were three metrics at this stage:

  1. Perplexity (PPL) measures how well the model predicts the ground truth answer (that is, the human reply recorded in the dataset): it is the exponent of the average negative log-likelihood per token that the model assigns to that answer, so lower is better.
  2. Hits@1 is a metric that attempts to simplify conversation down to a classification problem. It takes the correct response from the dataset, adds several (usually about 20) random utterances from other dialogues, and then measures how often the model ranks the correct response first.
  3. F1 score is the metric that most directly compares a model’s replies to “ground truth” replies. It’s the harmonic mean of precision and recall computed over the words shared by the model’s reply and the ground truth. Basically, the more ground truth words a model reproduces in its reply without padding it with extra ones, the better.
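
To make the three automated metrics above a bit more concrete, here is a minimal Python sketch of how they are commonly computed. This is not the organizers’ scoring code: the function names and the plain whitespace tokenization are our own assumptions for illustration, and the official ConvAI2 evaluation may normalize text differently. Perplexity is computed in its standard form, as the exponent of the average negative log-likelihood per token.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Exponent of the average negative log-likelihood (natural log) per token.

    token_log_probs: log-probabilities the model assigns to each token
    of the ground truth reply. Lower perplexity is better.
    """
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def hits_at_1(candidate_scores, correct_index):
    """1.0 if the model's top-scored candidate is the true reply, else 0.0.

    candidate_scores: one model score per candidate reply (true reply
    plus random distractors). Average this over many dialogues.
    """
    best = max(range(len(candidate_scores)), key=candidate_scores.__getitem__)
    return 1.0 if best == correct_index else 0.0


def word_f1(prediction, reference):
    """Harmonic mean of word-level precision and recall.

    Assumption: lowercasing and whitespace splitting only; punctuation
    stays attached to words.
    """
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared word at most as often
    # as it appears in both replies.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# The example from the text: a perfectly reasonable reply scores zero
# because it shares no words with the ground truth.
print(word_f1("Hi, how're you today?", "Hello"))  # 0.0
```

The last line shows the evaluation problem discussed above in miniature: a word-overlap metric punishes a natural reply simply because the human in the dataset happened to say something else.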

Here is the final leaderboard for the preliminary stage:

Source. Pears mark the teams that went through to the final. Apples mark best results.

Note that the Hugging Face team (from the eponymous social AI startup) had the very best results in terms of all three metrics, sometimes by a large margin, and Lost in Conversation barely got in.

The models that made it to the finals were, however, scored differently. This time, ConvAI2 went to Amazon Mechanical Turk (still the default source for your crowdsourcing needs). It would be relatively easy to set up a real Turing test, but at the current state of technology it would be pointless: no general conversational model can hold its own and avoid being discovered as a chatbot for any prolonged period of time. So turkers simply scored the models on a scale from 1 to 5 with respect to three metrics: fluency, consistency, and engagement. Finally, turkers tried to guess the persona provided for the bot; this was another metric since a good chatbot was supposed to use this persona as a condition and show it through conversation. We refer to (Zhang et al., 2018) for more details on these metrics. The organizers also solicited volunteers to do the same; since volunteers are unpaid and not bound by instructions as strictly as the turkers, this was supposed to test how robust the models were.

By now, you can guess the results. By human evaluations, the Lost in Conversation model made by Sergey Golovanov and Alexander Tselousov outperformed every other model in the finals! It wasn’t even far from the results scored by actual humans:

Source. Lost in Conversation is about as close to live humans as to the second place model, Hugging Face. The confidence intervals are pretty wide, though.

And here is the final leaderboard with all the numbers. Lost in Conversation did not have the most recognizable personas but otherwise the ratings are indeed excellent:


How Lost in Conversation Did It

The competition started in April, and the teams were allowed to submit a solution once a month to be scored both on the hidden test set and on live people. We did not really take advantage of this offer though.

In July the iPavlov team (many organizers of ConvAI2 are members) organized a live hackathon called DeepHack on this topic. Our team decided to participate in this hackathon. At that time it was a larger team: Sergey Golovanov, Alexander Tselousov, Rauf Kurbanov, and Mark Geller (from JetBrains Research). The team did not score particularly high at the DeepHack competition. Rauf and Mark left the team after the hackathon, and there was a moment of doubt whether it was worth continuing.

We don’t have a good picture from this year’s contest, but here is a photo of the winners together with Rauf Kurbanov from last year’s DeepHack devoted to machine translation, which they (together with former Neuromation researcher Anastassia Gaydashenko) also won!

The Lost in Conversation team on DeepHack.Babel 2018. Left to right: Rauf Kurbanov, Alexander Tselousov, Sergey Golovanov.

However, in mid-August Sergey came back to the project and started trying out new ideas. Then Alexander joined him. They basically reworked the whole model and codebase from scratch. But the hackathon did bring experience and knowledge about how state of the art conversational models should work.

Sergey and Alexander worked on the solution in their spare time. They were short on resources and time, so many ideas were left unexplored. The final version of the model was actually trained on the very last day because the guys found a bug in the previous version.

But it all worked out quite well in the end. Sergey Golovanov and Alexander Tselousov won the ConvAI2 NeurIPS Challenge!

And that’s all for today, folks. We will probably dive into the details of our solution in a subsequent NeuroNugget at some point in the future, but today we just wanted to share our joy of this exciting victory and explain what it was about. Thank you for reading — and good luck to our wonderful researchers in their future endeavors! May all competitions go as well as this one!

Sergey Nikolenko
Chief Research Officer, Neuromation

Sergey Golovanov
Researcher, Neuromation