Recipes for building an open-domain chatbot

Published in DAIR.AI, Jun 5, 2020 (10 min read)

(Roller et al., 2020).
This paper studies in depth the performance of a Transformer-based chatbot. It shows that the chatbot can respond in a human-like way and maintain a chit-chat conversation. However, the authors also show that the model lacks in-depth knowledge, forgets facts stated earlier, and tends to repeat what the other speaker says.

What does it propose?

The paper constructs different chatbots based on the Transformer, and it analyzes different axes of developing a chatbot. It finds that:

Fine-tuning on datasets that focus on personality, empathy, knowledge, etc. makes the chatbot conduct more human-like dialog (even when using smaller models).
It tries different decoding strategies, showing that beam search can be as good as or better than sampling.
It presents some of the flaws of the developed models.

To construct a chatbot we need to build a system that generates text answers given a previous dialogue. To achieve this we need a model, training data, a way to train this model (a loss function), a decoder (or simply an algorithm) to produce an answer given the model output, and finally proper evaluation metrics. In the following sections (1–5), we’ll go through the different strategies tested in this paper for each of those steps, and then we’ll look at the results obtained (Section 6).

1. Models

1.1 Retriever

(Humeau et al., 2019)

The idea: given a dialogue history (context), it retrieves the next dialogue utterance by scoring a large set of candidate responses (typically all possible training responses).

How: It constructs an embedding of the context (y_ctxt) and one for each response candidate (y_cand_i), and then scores each candidate with the dot product y_cand_i ⋅ y_ctxt. These embeddings are constructed in the following steps:

First, the model obtains the candidate embeddings using a Transformer-based encoder (BERT) and an aggregator function that simply takes the classifier embedding C of the output, or the average of the token outputs (as shown on the right side of Figure 1).
Then the model encodes the context (as shown on the left side of Figure 1) using another Transformer, followed by m attention blocks (m being a hyperparameter). Each attention block (see the definition here) uses the Transformer outputs as keys and values and a learned code c_i, unique to that block, as the query. It then computes another attention on top of the resulting m embeddings, where the query is y_cand_i and the keys and values are the outputs of the previous attention blocks. Below is the step represented in equations:
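A reconstruction of these equations following Humeau et al. (2019). The symbols h_1, …, h_N (the outputs of the context Transformer) and the weights w are notation introduced here, and a single candidate is written y_cand to avoid reusing the index i:

```latex
% One context embedding per learned code c_i (i = 1, ..., m)
y^{i}_{ctxt} = \sum_{j=1}^{N} w^{c_i}_{j} \, h_j,
\qquad
(w^{c_i}_{1}, \dots, w^{c_i}_{N}) = \mathrm{softmax}(c_i \cdot h_1, \dots, c_i \cdot h_N)

% Final context embedding: attention over the m embeddings, queried by the candidate
y_{ctxt} = \sum_{i=1}^{m} w_i \, y^{i}_{ctxt},
\qquad
(w_1, \dots, w_m) = \mathrm{softmax}(y_{cand} \cdot y^{1}_{ctxt}, \dots, y_{cand} \cdot y^{m}_{ctxt})

% Score of the candidate
\mathrm{score}(y_{cand}) = y_{cand} \cdot y_{ctxt}
```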

1.2 Generator

(Vaswani et al., 2017)

The proposed generator is a standard seq2seq Transformer, as originally proposed by Vaswani et al. (2017), but a lot bigger: the authors train models with 90M, 2.7B and 9.4B parameters. In comparison, Meena, Google’s chatbot (Adiwardana et al., 2020), has 2.6B parameters.

1.3 Retrieve and refine

(Weston et al., 2018)

In an effort to solve the common problems of generator models (e.g. knowledge hallucination, inability to read and access external knowledge, dull and repetitive responses), the authors propose combining the two models above: the output of a retriever model is appended to the input of a generator model, using a special separator token (Figure 2). They experiment with two variations:

Dialogue retrieval: the retriever takes the dialogue history and produces a response (same retriever architecture as in 1.1).
Knowledge retriever: this retrieves information from a large knowledge base, where the candidates are obtained from a TF-IDF-based inverted index lookup over a Wikipedia dump. In this case, a Transformer is additionally trained to decide when to add the knowledge retrieval and when not to (as some contexts may not require knowledge).

Figure 2. Retrieve and Refine architecture.
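As a minimal sketch of how the generator input might be assembled in retrieve and refine: the retrieved text is appended to the dialogue context after a separator. The separator string `[RETRIEVED]` and the function name are hypothetical, chosen only for illustration; the summary does not give the actual token.

```python
from typing import Optional

# Hypothetical separator token; the real token used by the authors is not stated here.
SEPARATOR = "[RETRIEVED]"


def build_refine_input(context: str, retrieved: Optional[str], sep: str = SEPARATOR) -> str:
    """Append the retriever output to the generator input.

    `retrieved` is None when nothing should be added, for example when the
    knowledge-retrieval gate decides that the context needs no knowledge.
    """
    if not retrieved:
        return context
    return f"{context} {sep} {retrieved}"


print(build_refine_input("do you have any pets?", "i have two dogs."))
```

Passing `None` models the knowledge-retrieval case in which the additional Transformer decides not to add any retrieved knowledge.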

2. Training objectives

The following are the training objectives proposed in the paper:

Retriever: cross-entropy over the candidate scores y_cand_i ⋅ y_ctxt, where the first candidate is the correct response and the others are sampled negatives
Generator: uses standard maximum likelihood estimation (MLE)
Dialogue retrieval: simply training with MLE has been observed to make the model ignore the retrieved utterance completely. This probably happens because the relation between the retrieved response and the gold label (the correct final answer) is not clear. To ensure the retrieval is used, the authors replace the retrieved utterance with the gold label α% of the time.
Knowledge retrieval: here we can simply use MLE because the datasets used for fine-tuning have a clear correspondence between the correct knowledge retrieval and response.
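For the retriever objective in the list above, a minimal sketch of cross-entropy over candidate scores may help. It assumes the scores are plain floats (the dot products from Section 1.1) and that the correct response sits at index 0; the function name is illustrative.

```python
import math


def retriever_loss(scores, correct_index=0):
    """Cross-entropy of a softmax over candidate scores.

    Equals -log softmax(scores)[correct_index], computed with the
    log-sum-exp trick for numerical stability.
    """
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - scores[correct_index]


# The loss is small when the correct candidate has the highest score.
print(retriever_loss([5.0, 0.0, 0.0]))
print(retriever_loss([0.0, 5.0, 0.0]))
```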

Unlikelihood training

(Welleck et al., 2020)

The authors also tried an unlikelihood objective because it was created to mitigate problems of MLE when training language models, such as repetition (using the same tokens more frequently than a human), and token distribution mismatch (using specific tokens that have low frequency too rarely compared to humans).

Main idea: to decrease the model’s probability of certain tokens, called negative candidates C_t. To achieve that, we’ll add an expression to the MLE loss that will take these candidates into account, which is referred to as the unlikelihood loss:
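Written out in the notation of Welleck et al. (2020), the unlikelihood term at step t is (a reconstruction; the label L_UL^t is chosen here):

```latex
\mathcal{L}^{t}_{UL} = -\sum_{c \in C_t} \log\bigl(1 - p_\theta(c \mid x_{<t})\bigr)
```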

Where p_θ is the language model’s prediction and x_<t is the sequence of tokens preceding position t. As is typical with losses, we minimize a negative logarithm, which is equivalent to maximizing the logarithm and therefore whatever is inside it. As we don’t want the negative candidates (c) to be highly probable, we maximize the probability of not generating them, that is, 1−p_θ(…).

Thus, the actual training objective is a mixture (weighted by the hyperparameter α) of the unlikelihood of bad candidates and the likelihood of the next token:
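In the same notation, following the token-level objective of Welleck et al. (2020), the per-step loss can be reconstructed as the usual MLE term plus the α-weighted unlikelihood term:

```latex
\mathcal{L}^{t} = -\alpha \sum_{c \in C_t} \log\bigl(1 - p_\theta(c \mid x_{<t})\bigr) \;-\; \log p_\theta(x_t \mid x_{<t})
```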

The authors defined the set of bad candidates as the tokens that were generated more frequently by the model than by humans. To measure these frequencies they kept a running count of the distribution of the tokens generated by the model and they compared it to the distribution of the gold responses.
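The two ideas above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors’ implementation: token probabilities are given as a plain dictionary, the negative candidates are all tokens whose relative frequency in model output exceeds their relative frequency in gold responses, and the gold token itself is never penalized. All function names are hypothetical.

```python
import math
from collections import Counter


def overused_tokens(model_counts, gold_counts):
    """Tokens the model generates relatively more often than the gold responses do."""
    model_total = sum(model_counts.values())
    gold_total = sum(gold_counts.values())
    if model_total == 0 or gold_total == 0:
        return set()
    return {
        tok
        for tok, n in model_counts.items()
        if n / model_total > gold_counts.get(tok, 0) / gold_total
    }


def step_loss(probs, target, negatives, alpha=1.0):
    """MLE term for the gold token plus alpha times the unlikelihood term."""
    mle = -math.log(probs[target])
    # Clamp to avoid log(0) if a negative candidate has probability 1.
    ul = -sum(
        math.log(max(1e-12, 1.0 - probs[c])) for c in negatives if c != target
    )
    return mle + alpha * ul


model_counts = Counter({"do": 5, "you": 3, "cats": 2})
gold_counts = Counter({"do": 2, "you": 3, "cats": 5})
negatives = overused_tokens(model_counts, gold_counts)
probs = {"do": 0.5, "you": 0.25, "cats": 0.25}
print(negatives, step_loss(probs, "cats", negatives))
```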

3. Decoding

The authors tried different decoding strategies as described below:

Beam search (summary here)
Top-k sampling: at each time step, the next token is selected by sampling from the k (=10) most likely candidates according to the model distribution.
Sample-and-rank sampling: N independent sentences are sampled from the model distribution, and then the one with the highest probability is selected.
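A minimal sketch of the two sampling strategies above. The "model" is a hypothetical function `toy_model` that maps the tokens generated so far to a next-token distribution; a real chatbot would replace it with the Transformer output.

```python
import math
import random


def top_k_sample(probs, k, rng):
    """Sample one token from the k most likely entries of `probs`."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [tok for tok, _ in top]
    weights = [p for _, p in top]
    # random.choices renormalizes the weights internally.
    return rng.choices(tokens, weights=weights, k=1)[0]


def sample_sequence(next_probs, rng, max_len=20, eos="<eos>"):
    """Sample one full response; return it with its log-probability."""
    tokens, logp = [], 0.0
    for _ in range(max_len):
        probs = next_probs(tokens)
        tok = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        logp += math.log(probs[tok])
        if tok == eos:
            break
        tokens.append(tok)
    return tokens, logp


def sample_and_rank(next_probs, n, rng, **kwargs):
    """Draw n independent samples and keep the most probable one."""
    candidates = [sample_sequence(next_probs, rng, **kwargs) for _ in range(n)]
    return max(candidates, key=lambda cand: cand[1])


def toy_model(prefix):
    """Hypothetical stand-in for a language model."""
    if len(prefix) < 2:
        return {"a": 0.6, "b": 0.3, "<eos>": 0.1}
    return {"<eos>": 1.0}


rng = random.Random(0)
print(top_k_sample({"a": 0.6, "b": 0.3, "c": 0.1}, k=2, rng=rng))
print(sample_and_rank(toy_model, n=5, rng=rng))
```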

They also tried additional constraints for the decoding process:

Minimum length: forces the model to produce an answer of at least a defined length.
Predictive length: predicts (with a retriever model) the minimum length of the answer (e.g., <10, <20, <30, >30 tokens) and then applies it as the minimum length constraint described in the previous item.
Beam blocking: prevents the model from producing, in the next utterance, a trigram (a group of 3 words) that is already in the input or in the utterance itself. This can be achieved by setting to 0 the probability of any word that would complete an already existing trigram.
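The minimum length and beam blocking constraints can both be expressed as a filter on the next-token distribution. The sketch below assumes tokens are whitespace-separated words and that the distribution is a plain dictionary; the function names are illustrative, and the result is left unnormalized for simplicity.

```python
def ngrams(tokens, n=3):
    """Set of all n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def constrain(probs, context, response, n=3, min_length=0, eos="<eos>"):
    """Zero out tokens that end the response too early or repeat an n-gram.

    Assumes n >= 2. `context` and `response` are lists of tokens.
    """
    seen = ngrams(context, n) | ngrams(response, n)
    prefix = tuple(response[-(n - 1):]) if len(response) >= n - 1 else None
    constrained = {}
    for tok, p in probs.items():
        if tok == eos and len(response) < min_length:
            p = 0.0  # minimum length: the response may not end yet
        elif prefix is not None and prefix + (tok,) in seen:
            p = 0.0  # blocking: this token would complete a repeated n-gram
        constrained[tok] = p
    return constrained


context = "it was a lot of fun".split()
response = "that is a lot of".split()
probs = {"fun": 0.5, "work": 0.4, "<eos>": 0.1}
print(constrain(probs, context, response, min_length=10))
```

Here "fun" is blocked because "lot of fun" already appears in the context, and the end token is blocked because the response is still shorter than the minimum length.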

4. Training data

The following dataset is used for pre-training:

Pushshift.io Reddit: Reddit discussions covering a vast range of topics.

Two-way conversational datasets are used to fine-tune the models:

ConvAI2 dataset (Zhang et al., 2018) focuses on personality and engaging the other speaker. It gives a persona description to the speaker (which is concatenated to the history to use it as input in the model).
Empathetic Dialogues (Rashkin et al., 2018) focuses on empathy.
Wizard of Wikipedia (Dinan et al., 2018) focuses on knowledge.
Blended Skill Talk (Smith et al., 2020) provides a dataset that focuses on blending all the previous skills. It is constructed with one human speaking freely (using their persona) and another who is guided, that is, he/she has to choose a response from 3 different possibilities, each produced by a model trained on one of the three previous datasets.

5. Evaluation methods

The evaluation techniques used in this paper are as follows:

ACUTE-Eval

This is a manual evaluation where a rater chooses between two chatbot dialogues constructed by a human talking to a model. The human rater needs to choose one of the conversations for each of the following questions:

“Who would you prefer to talk to for a long conversation?” (Engagingness)
“Which speaker sounds more human?” (Humanness)

So, we send the two dialogues to several raters and we count the votes given to each model.

Self-Chat ACUTE-Eval

Same as ACUTE-Eval, but the dialogues are generated by the model talking to itself instead of to a human.

6. Results

The results are comparisons of the number of votes each model received for the questions presented above. In the paper, the authors mention that some results are “not significant”, which basically means that, given the number of answers collected and the votes on each side, we cannot be confident that one model is better than the other: the difference could simply be measurement noise.
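To make “not significant” concrete, here is a minimal sketch of a two-sided binomial test on pairwise votes. The choice of this test and the 0.05 threshold are assumptions made for illustration; this summary does not state which test the authors used. The null hypothesis is that each rater picks either model with probability 0.5.

```python
from math import comb


def binomial_two_sided_p(votes_a, votes_b):
    """Two-sided p-value for the hypothesis that both models are equally preferred."""
    n = votes_a + votes_b
    k = max(votes_a, votes_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Doubling is exact because the null distribution is symmetric.
    return min(1.0, 2 * tail)


for a, b in [(75, 25), (55, 45)]:
    p = binomial_two_sided_p(a, b)
    print(a, b, "significant" if p < 0.05 else "not significant")
```

With 100 votes, a 75 to 25 split is very unlikely under the null hypothesis, while a 55 to 45 split is well within what noise alone would produce.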

Results of Self-Chat ACUTE-Eval

When comparing the 3 models using standard beam search (beam size 10, no minimum length constraint, but with context and response 3-gram blocking), the results indicate that the Retriever model outperforms the other two.

When comparing decoding choices:

In terms of minimum length, the best results were obtained when setting a minimum length of 20 or 40, or when predicting the minimum length using the buckets 10, 20, 30, 40.
In terms of beam blocking, blocking 3-grams already used in the context or in the response gives the highest scores; however, the differences were not significant.
Comparing different beam sizes and sampling methods, it appears that a beam value of 10 is superior to 1 or 30, and a 10 size beam is on par with sampling methods.

Other results include:

Larger models perform better
Fine-tuning on the 4 extra datasets gives large improvements
Using the persona context (a description of a specific persona) after fine-tuning provides a small improvement compared to not using it
Unlikelihood training has a small gain (although it’s not statistically significant). Notice that the conversations in these experiments are short so maybe the advantages of this training objective are not totally exploited

Results of ACUTE-Eval

Results for conversations of 14 turns between a human and a chatbot.

Comparing the 3 models with the improved decoding strategy (beam size 10, minimum length 20, context and response 3-gram blocking), the results show that RetNRef outperforms both the Generator and Retriever variants.

Compared to Meena (Adiwardana et al., 2020), the results indicate that:

In the engagingness question, the generative model of the same size is better 75% of the time.
In the humanness question, the generative model of the same size is better 65% of the time, and the generative model of the same size trained with unlikelihood is better 70% of the time.

When comparing a human-chatbot dialogue to a human-human dialogue on the engagingness question, the statistically significant results show the models in this paper being preferred over the human-human dialogue 37% of the time. Additionally, the generative model is preferred 49% of the time on the same question, but this result is not statistically significant. Even though this sounds promising, the model is not close to performing human-like dialogue.

Failure cases

Below we can see the flaws that the authors presented and that are not really measured by this evaluation:

Word repetition. The minimum length helps to create more detailed messages, but the core problem remains. Some 3-grams were over-expressed compared to human-human conversations, such as “do you like”, “lot of fun”, “have any hobbies”. The current evaluation does not seem to expose this as boring because the conversations are short and are evaluated separately.
Idea repetition. Beam blocking helps with this issue, but the model still tends to repeat what the other party says: if the human says he/she has a dog, the bot replies that it has one too, the chatbot likes the same bands as you, etc.
Forgetfulness. The model does not link correctly to past statements. For example, you tell the model you have a dog, but later in the conversation it asks what pets you have.
Contradiction. It makes contradictions linked to general knowledge. For example, it says it lives in the Midwest, and then it specifies that it lives in Georgia (which is not in the Midwest).
Knowledge. The authors observed that the models often switch topics, avoiding the challenge of going “deeper”. Reading knowledge only hurt the model in this evaluation setup, possibly because:

The model attempts to use knowledge when there is no need, or uses it incorrectly.
Deeper knowledge is not really required in this setup, since the dialogues are short and tend to cover only shallow topics in which the speakers get to know each other.

Context length. The models in this paper have a hard limit of 128 tokens of context. There has been some research on this problem, but evaluating it would require another setup, with dialogues longer than 14 turns.
Deeper understanding. These models cannot be taught a concept through further conversation, so as-is they are stuck with their initial knowledge. See fun examples in Figure 3.

What’s next!

This paper shows a really robust and advanced chatbot; however, it also presents a lot of remaining challenges on the way to a truly human-like bot. If you’re interested in one of the challenges presented in this summary, I encourage you to read the paper! 😀 The paper is also interesting because it cites work that is trying to overcome some of the issues discussed.