🦄 How to build a State-of-the-Art Conversational AI with Transfer Learning

Thomas Wolf
May 9, 2019
Online demo of the pretrained model we’ll build in this tutorial at convai.huggingface.co. The “suggestions” (bottom) are also powered by the model putting itself in the shoes of the user.
In this post, we'll cover:
  • How you can reproduce the model we used in the NeurIPS 2018 dialog competition ConvAI2, which won the automatic metrics track,
  • How we distilled 3k+ lines of competition code into fewer than 250 lines of commented training code (with distributed & FP16 options!), and
  • How you can train this model for less than $20 on a cloud instance, or just use our open-sourced pre-trained model.

An AI with a personality 🤠

We'll build a conversational AI with a persona, following a transfer learning recipe in two steps:

  • start from a language model pretrained on a very large corpus of text, so that it can already generate long stretches of coherent text, and
  • fine-tune this language model to adapt it to our end-task: dialog.

What would be a good pretrained model for our purpose?

The bigger the better, but we also need a model that can generate text. The most commonly used pretrained NLP model, BERT, is pretrained with a masked-word objective on full sentences only and is not able to complete an unfinished sentence left to right. Two other models, open-sourced by OpenAI, are a better fit for our use case: GPT & GPT-2.

🦄 OpenAI GPT and GPT-2 models

In 2018 and 2019, Alec Radford, Jeffrey Wu and their co-workers at OpenAI open-sourced two language models trained on a very large amount of data: GPT and GPT-2 (where GPT stands for Generative Pretrained Transformer).

A decoder/causal Transformer attends to the left context to generate next words
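To make this concrete, here is a minimal sketch of left-to-right generation with the pretrained GPT model. It uses today's transformers library rather than the pytorch-pretrained-bert package the original code was built on; the class names and the "openai-gpt" checkpoint come from that library, and the greedy loop is only an illustration of causal decoding, not the decoding strategy we end up using (see the decoder section below).

```python
import torch
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
model.eval()

prefix = "i like playing football and my favorite team is"
input_ids = tokenizer.encode(prefix, return_tensors="pt")

with torch.no_grad():
    for _ in range(5):                              # greedily append 5 tokens
        logits = model(input_ids).logits            # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()            # most probable next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```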

👻 Adapting a language model to a dialog task

Our language model is trained with a single input: a sequence of words. In a dialog setting, however, our model will have to make use of several types of context to generate a reply:

  • one or several persona sentences,
  • the history of the dialog with at least the last utterance from the user,
  • the tokens of the output sequence that have already been generated, since we generate the output sequence word by word.
Input sequence: a concatenation of persona (blue), history (pink) and reply (green) with delimiters (light pink). Here we generate the word “you” to complete the reply.
A simple answer is to concatenate these contexts into a single sequence, putting the reply at the end. This simple setup has two issues, though:
  • Our transformer is segment-blind: the delimiter tokens alone only give it a weak signal of whether a word comes from the persona, the history or the reply, so we add segment embeddings marking each part.
  • Our transformer is position-blind: attention is a symmetrical dot-product with no notion of order, so we should add position information for each token.
Summing three types of input embeddings indicating words (grey), positions (gradient) and segments (blue/pink/green)
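In code, building the word, segment and position inputs can be sketched as follows. This is a simplified version in the spirit of the open-sourced training code; the special-token names (<bos>, <eos>, <speaker1>, <speaker2>) and the exact interleaving are illustrative assumptions.

```python
from itertools import chain

# Illustrative special tokens (assumed names) delimiting the segments
BOS, EOS, SPEAKER1, SPEAKER2 = "<bos>", "<eos>", "<speaker1>", "<speaker2>"

def build_inputs(persona, history, reply):
    """Concatenate persona, history and reply into one sequence and build
    the parallel word / segment / position inputs."""
    sequence = [[BOS] + list(chain(*persona))] + history + [reply + [EOS]]
    # Prepend a speaker token to each utterance, alternating between speakers
    sequence = [sequence[0]] + [
        [SPEAKER2 if (len(sequence) - i) % 2 else SPEAKER1] + utt
        for i, utt in enumerate(sequence[1:])
    ]
    words = list(chain(*sequence))                           # word tokens
    segments = [SPEAKER2 if i % 2 else SPEAKER1              # segment tokens
                for i, utt in enumerate(sequence) for _ in utt]
    positions = list(range(len(words)))                      # position ids
    return words, segments, positions

words, segments, positions = build_inputs(
    persona=[["i", "like", "playing", "football", "."]],
    history=[["hello", "how", "are", "you", "?"]],
    reply=["i", "am", "fine", "thanks", "."],
)
```

Since these delimiter and segment tokens don't exist in the pretrained vocabulary, we also add them to the tokenizer and resize the model's embedding matrix (in transformers, roughly `tokenizer.add_tokens([...])` followed by `model.resize_token_embeddings(len(tokenizer))`) so that their embeddings are learned during fine-tuning.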

👑 Multi-task losses

We have now initialized our pretrained model and built our training inputs; all that remains is to choose a loss to optimize during fine-tuning.

We will use a multi-task loss combining language modeling with a next-sentence prediction objective.

The next-sentence prediction objective is part of BERT pretraining: we randomly sample distractor replies from the dataset and train the model to distinguish whether an input sequence ends with the gold reply or with a distractor. This teaches the model to look at the global meaning of the segments, beyond the local context used for language modeling.

Multi-task training objective — the model is provided with two heads for language modeling prediction (orange) and next-sentence classification (blue)
  • Language modeling: we project the hidden states onto the word embedding matrix to get logits and apply a cross-entropy loss on the portion of the target corresponding to the gold reply.
  • Next-sentence prediction: we pass the hidden state of the last token (the end-of-sequence token) through a linear layer to get a score and apply a cross-entropy loss to classify the gold reply correctly among the distractors.
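With the double-heads model from the transformers library, one fine-tuning step on a toy batch could be sketched as follows. The `mc_token_ids` argument and the `logits`/`mc_logits` output names follow that library's current API (an assumption about versions); the candidate texts, the crude padding and the loss coefficients are illustrative, and a real implementation would restrict the LM loss to the reply tokens only.

```python
import torch
import torch.nn.functional as F
from transformers import OpenAIGPTDoubleHeadsModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTDoubleHeadsModel.from_pretrained("openai-gpt")

# Toy batch: one dialog context with two candidate replies (index 1 is the gold reply)
context = "hello how are you ? "
candidates = ["i hate football .", "i am fine thanks ."]
encoded = [tokenizer.encode(context + c) for c in candidates]
lengths = [len(seq) for seq in encoded]
max_len = max(lengths)
padded = [seq + [0] * (max_len - len(seq)) for seq in encoded]    # crude right-padding

input_ids = torch.tensor([padded])                       # (batch=1, n_candidates=2, seq_len)
mc_token_ids = torch.tensor([[l - 1 for l in lengths]])  # index of each candidate's last token
mc_labels = torch.tensor([1])                            # the gold reply is candidate 1

outputs = model(input_ids, mc_token_ids=mc_token_ids)
lm_logits, mc_logits = outputs.logits, outputs.mc_logits

# Language-modeling loss on the gold candidate (targets shifted by one position)
gold = torch.tensor(encoded[1])
lm_loss = F.cross_entropy(lm_logits[0, 1, : len(gold) - 1], gold[1:])

# Next-sentence prediction loss: classify the gold reply among the candidates
mc_loss = F.cross_entropy(mc_logits, mc_labels)

total_loss = 2.0 * lm_loss + 1.0 * mc_loss               # illustrative weighting
```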

🦊 Training on a dialog dataset

The ConvAI2 competition used an interesting dataset released by Facebook in 2018: PERSONA-CHAT.

Organization of the JSON version of PERSONA-CHAT
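To give an idea of that organization, here is a minimal sketch of reading the JSON version of the dataset. The field names ("train"/"valid", "personality", "utterances", "candidates", "history") reflect the JSON dump used by the open-sourced code, and the local file name is an assumption about where you saved it.

```python
import json

# Assumes you have downloaded the JSON dump of PERSONA-CHAT locally
with open("personachat_self_original.json", encoding="utf-8") as f:
    dataset = json.load(f)

print(dataset.keys())                      # expected: dict_keys(['train', 'valid'])

dialog = dataset["train"][0]
persona = dialog["personality"]            # list of persona sentences
exchange = dialog["utterances"][-1]        # last exchange of this dialog
history = exchange["history"]              # alternating utterances so far
candidates = exchange["candidates"]        # distractor replies, with the gold reply last

print(persona[0], "|", history[-1], "|", candidates[-1])
```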

👻 Talking with the Model — the Decoder

The amazing thing about dialog models is that you can talk with them 🤗

Generating a sentence word by word
Left: Probability assigned to tokens generated by humans and beam search using GPT-2 (Note the strong variance in human text not reproduced by beam-search). Right: N-gram distributions in human and machine-generated texts (Note the complete separation between greedy/beam-search and sampling decoding methods).
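Because greedy and beam-search decoding fail to reproduce these human distributions, we sample from the model instead, after restricting its output distribution with top-k and nucleus (top-p) filtering. Here is a sketch of such a filtering function for a 1-D tensor of logits; the temperature and threshold values in the usage comment are illustrative defaults.

```python
import torch
import torch.nn.functional as F

def top_filtering(logits, top_k=0, top_p=0.0, filter_value=-float("inf")):
    """Filter a 1-D tensor of logits with top-k and/or nucleus (top-p) filtering."""
    if top_k > 0:
        # Keep only the top_k tokens with the highest logits
        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
        logits[indices_to_remove] = filter_value
    if top_p > 0.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        # Remove tokens once the cumulative probability exceeds the threshold,
        # always keeping at least the single most probable token
        sorted_indices_to_remove = cumulative_probs > top_p
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = 0
        indices_to_remove = sorted_indices[sorted_indices_to_remove]
        logits[indices_to_remove] = filter_value
    return logits

# Typical decoding step (temperature and thresholds are illustrative):
# logits = model(input_ids).logits[0, -1] / 0.7
# probs = F.softmax(top_filtering(logits, top_k=0, top_p=0.9), dim=-1)
# next_token = torch.multinomial(probs, num_samples=1)
```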

Example using the interactive scripts with default settings — Bot personality: I read twenty books a year. I’m a stunt double as my second job. I only eat kosher. I was raised in a single parent household.

👻 Conclusion

We’ve come to the end of this post describing how you can build a simple state-of-the-art conversational AI using transfer learning and a large-scale language model like OpenAI GPT.

The live demo is at convai.huggingface.co, and the open-sourced code and pretrained models are available at https://github.com/huggingface/transfer-learning-conv-ai.


Thanks to Victor Sanh, Clément Delangue, and Pierric Cistac.
