Chatbots: becoming nicer or becoming Meena?

Richard Coates
Feb 3, 2020


Chatbots — slowly becoming accomplished conversationalists across multiple topics

This week, Google announced to the world its new efforts towards an open-domain human-like multi-turn end-to-end chatbot, codenamed ‘Meena’. There are a lot of descriptors in that first sentence, so I’ll start by looking at each one, along with some background.

An open-domain multi-turn end-to-end chatbot

Closed or Open

Chatbots are most usefully divided into ‘open domain’ and ‘closed domain’. Closed domain chatbots have restrictions on what they can usefully talk about. For example, the chatbot Byte built for Zegna’s What Makes a Man campaign was good at discussing issues affecting men and the subject of masculinity, but couldn’t tell you the weather, or talk about Adidas’ new range of trainers. In contrast, the weather skill that I have on my voice assistant at home isn’t very good at debating whether the pressure that men feel today is external or self-imposed but can tell me if I need to wear a coat today with a surprising level of accuracy. An open domain chatbot is one with which I can talk about any subject. The ideal chatbot would be as competent talking about sport as about history, and could even provide nuanced responses to your points of view. As one might expect, closed domain chatbots are easier to build, and the more specialised the domain, the easier they are to build.


The reason that closed domain chatbots are easier to build than their open domain counterparts comes down to a concept known as ‘perplexity’. This measures how uncertain a model is when predicting the next event in a sequence: the lower the perplexity, the better the prediction. The ideal model would predict every event perfectly, given the context and the input, but in practice there will always be some randomness. Imagine that we have a 6-sided die and some data about the results of rolling it. If the data indicates that it is a ‘fair’ die, then the best we can do is to treat every face as equally likely, and we will correctly predict the result about 1 roll in every 6. Our model of the die therefore has a ‘perplexity’ of 6: it is as uncertain as if it were choosing uniformly between 6 options.
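To make the die example concrete, here is a minimal sketch that computes the perplexity of a probability distribution as 2 raised to its entropy in bits. The `perplexity` function and the loaded-die probabilities are my own illustrations, not anything taken from Google's work.

```python
import math

def perplexity(probabilities):
    """Perplexity of a discrete distribution: 2 ** (entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy

# A fair die: every face is equally likely.
fair_die = [1 / 6] * 6

# A hypothetical loaded die that lands on one face half of the time.
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]

print(perplexity(fair_die))    # approximately 6
print(perplexity(loaded_die))  # lower than 6, because the die is more predictable
```

The more predictable the outcomes, the lower the perplexity, which is the intuition behind a narrow domain being easier to model than an open one.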

If we can predict what ‘should’ happen in a conversation, then presumably we’ll be better at responding in an ‘appropriate’ way. As a result, closed-domain chatbots, which require far less in the way of training data, are easier to build: it’s much easier for a weather bot to predict how to respond to the question “is it going to be nice today?” when you can understand without context the entities to which “it” and “nice” are referring.

If you want to test your level of perplexity in a simple context, you can find a game here:

Single or Multi-turn

Chatbots can also be divided into single or multi-turn agents. It’s relatively easy to create a chatbot that takes each sentence in isolation, particularly in a closed domain. Frequently asked questions can be answered with little to no context, as long as all of the relevant information is there, but there is a big leap between asking “Who is the Prime Minister of Finland?” followed by “How old is Sanna Marin?” and asking “Who is the Prime Minister of New Zealand?” followed by “And how old are they?”. The latter construction is what is known as a ‘multi-turn’ interaction, as the second part of the interaction relies on context from the first part, and so cannot be understood in isolation.
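The difference can be sketched in terms of what each kind of chatbot actually gets to see. In this illustration, the function names, the separator, the window of 7 turns and the bot's reply are all assumptions of mine, chosen only to show the idea.

```python
from typing import List

MAX_CONTEXT_TURNS = 7  # assumed window size, for illustration only

def single_turn_input(history: List[str]) -> str:
    """A single-turn bot only ever sees the latest message."""
    return history[-1]

def multi_turn_input(history: List[str], max_turns: int = MAX_CONTEXT_TURNS) -> str:
    """A multi-turn bot sees the most recent turns joined into one context."""
    return " | ".join(history[-max_turns:])

history = [
    "Who is the Prime Minister of New Zealand?",
    "Jacinda Ardern.",  # illustrative bot reply
    "And how old are they?",
]

print(single_turn_input(history))  # "they" cannot be resolved from this alone
print(multi_turn_input(history))   # the earlier turns supply the missing context
```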

End-to-End: Neural Network or No

There are broadly two ways of simulating human-like responses to a conversational input: structured, rule-based responses, with hundreds or even thousands of rules, or so-called ‘end-to-end’ neural networks. Byte’s work focuses on the former, with some of our more advanced conversational agents employing sophisticated weighting functions so that fewer rules and exceptions need to be written explicitly. While a rule-based chatbot doesn’t require a large amount of processing power, and can be built without a large dataset of labelled conversations, predicting all of the rules that one might require to navigate an entire conversation is very difficult, particularly in an open domain. Steve Worswick’s Mitsuku chatbot has taken 15 years of work to reach its current level of development.

By contrast, an end-to-end neural network-based chatbot is built by ‘training’ an artificial neural network on a large amount of data and relying on the network to generate an appropriate response on its own. This requires not only vast amounts of cleaned and labelled data, but also a large amount of computer processing power, and maybe even a team of data scientists, to begin to process the data. While this seems like an unachievable hurdle, for a corporation like Google, with the resources at its disposal, this is, in fact, an easier option than the rule-based solution!

More importantly, until now, it has remained an open question whether simply taking an end-to-end model and making it bigger (by adding more training data and increasing its parameter count) would be sufficient to reach a point where a model can carry out high-quality, multi-turn conversations with humans. It has been contended that it might be necessary to combine such a model with other components (e.g. rule-based systems) to achieve a sufficiently humanlike result, and so the Meena experiments represent an important data point in answering this question.

With 2.6B parameters and a dataset of 341GB of text (40B words) filtered from public domain social media conversations, this was by some distance the biggest model of its type. Compared with OpenAI’s language model GPT-2, Meena is 1.7x bigger in model capacity and was trained on 8.5x more data. At the end of the experiment, Meena’s end-to-end results alone were even better than anticipated. While they improved further when Meena was combined with a small number of rules, the improvements seen with a much larger model provide a compelling argument that an even larger model would perhaps not require the addition of rules.

Hopefully, that should give you an understanding of what an “open-domain multi-turn end-to-end” chatbot entails.

Specificity and Sensibleness Average

The historic problem with training a neural network for a chatbot is working out what you are trying to reward the neural network for doing. Your goal is to make all of the responses ‘human-like’, but what does that mean? More importantly, how can you quantify it, and how can you get the neural network to automatically quantify how well it is doing during the training? The answer to this is, in fact, Google’s main contribution to the literature, and Meena is merely a convenient application of what they have learned!

Google’s hypothesis is that responses in a multi-turn conversation can be classified as either ‘sensible’ or ‘not sensible’ and ‘specific’ or ‘not specific’, with these two measurements being independent. A sensible answer is one that ‘makes sense’ in the context of the conversation. For example, in response to the question “Did you like the tennis yesterday?”, the answer “Yes” would be a sensible response. By contrast, “blue” would not be. However, neither would be specific. A hypothetical chatbot that answered “I don’t know” to every sentence ending in a question mark and “Okay” to every sentence that didn’t end in a question mark would score highly on the sensibleness rankings. However, it wouldn’t be particularly ‘specific’, which is the measurement of whether the response was clearly linked to the previous turn. “Yes” or “I don’t know” would not be a specific response to “Did you like the tennis yesterday?”, but “I was rooting for Roger Federer” would be. By contrast, “Roger Federer is Swiss” would be specific, but not sensible. Google combined these scores into a sensibleness and specificity average (SSA).

It was difficult for me to come up with specific but non-sensible responses to the prompt above because humans are very good at being sensible. In fact, in the training data that Google used, humans averaged about 97% sensibleness. On the other hand, humans are not particularly specific, averaging only around 75%. This is initially surprising until you realise that many conversations that Google sourced might resemble the structure “general greeting, question, answer (which might be ‘yes’ or ‘no’), general farewells”, which is as low as 33% specific! On average, humans achieve an SSA of around 86%, and this was the target for Meena.
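As a rough illustration of the arithmetic, the sketch below computes an SSA from a list of per-response labels, assuming that SSA is the simple mean of the two percentages. The `ssa` function is a hypothetical helper of my own, and the labels are just my readings of the tennis examples above rather than anything from Google's evaluation.

```python
from typing import List, Tuple

def ssa(labels: List[Tuple[bool, bool]]) -> Tuple[float, float, float]:
    """Return (sensibleness, specificity, SSA) for (sensible, specific) labels."""
    n = len(labels)
    sensibleness = sum(sensible for sensible, _ in labels) / n
    specificity = sum(specific for _, specific in labels) / n
    return sensibleness, specificity, (sensibleness + specificity) / 2

# Responses to "Did you like the tennis yesterday?", labelled as in the text.
tennis_labels = [
    (True, False),   # "Yes"
    (False, False),  # "blue"
    (True, True),    # "I was rooting for Roger Federer"
    (False, True),   # "Roger Federer is Swiss"
]
print(ssa(tennis_labels))  # (0.5, 0.5, 0.5)

# The human averages quoted above: 97% sensible and 75% specific.
human_ssa = (0.97 + 0.75) / 2
print(round(human_ssa, 2))  # 0.86
```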

However, it is very difficult for a neural network to gauge how well it is doing on a task as abstract as this. Much like millions of students around the world writing essays of their own, the network doesn’t know whether the work it is producing is ‘sensible’ or ‘specific’, and this is really difficult to mathematically quantify without an enormous amount of training data (Google did, in fact, have a remarkable amount of training data, but not nearly enough to answer this). This led to the hypothesis that perhaps a chatbot’s dataset perplexity might be linked to its SSA, and the results of their experimentation seem to bear that out.

Interactive SSA against Perplexity. Each data point is a different Meena model version with differing amounts of data in the dataset.

Consequently, how well the neural network predicts the next turn of a conversation (its perplexity) is correlated with how ‘human-like’ its responses are. This allows Google to automatically approximate the humanness of a chatbot, which is a big leap forward since the perplexity metric is available to any neural seq2seq model. They have demonstrated that the better Meena fit its training data, the more sensible and specific its responses became, and Google therefore contends that a bigger model does lead to high-quality and humanlike conversations.
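For a language model, perplexity is usually computed from the probabilities that the model assigned to each token of the text it is being tested on. Here is a minimal sketch of that calculation, assuming we already have the natural-log probabilities for each token; the function name and the probabilities are invented for illustration and do not come from Meena.

```python
import math
from typing import List

def sequence_perplexity(token_log_probs: List[float]) -> float:
    """exp of the average negative log-likelihood per token (natural log)."""
    avg_negative_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_negative_log_likelihood)

# Hypothetical model that gave every token of a reply a probability of 0.25.
unsure_model = [math.log(0.25)] * 5

# Hypothetical model that gave every token of the same reply a probability of 0.5.
confident_model = [math.log(0.5)] * 5

print(sequence_perplexity(unsure_model))     # approximately 4
print(sequence_perplexity(confident_model))  # approximately 2
```

A model with lower perplexity was, on average, less surprised by what the humans in the data actually said, and no human rater is needed to calculate it.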


One of the problems is that while Meena had been optimised for multi-turn conversations, the SSA was still evaluated on single turns. Consequently, the SSA results may be somewhat overstated. For example,

Human: Hi!

Meena: Hi! How are you?! :)

Human: doing good, what would be the best city to visit in Europe?

Meena: Paris, Barcelona, Amsterdam, Prague

Human: great list! why?

Meena: Paris is a beautiful city, and Barcelona is too. And I’ve always wanted to go to Amsterdam and Prague and have never been.

Human: have you even been to India?

Meena: No, but I’d love to go.

Human: how about Japan?

Meena: I’d love to go to Japan too.

Human: how about Paris?

Meena: I’d love to go to Paris too.

Human: are you lying to me?

Meena: I’d love to go to Japan too.

Human: i think you’re lying to me, that makes me sad

Meena: I’d love to go to Japan, too.

Human: you’re blocking me out! you have to talk to me

Meena: I’d love to go to Japan.

Human: where else?

Meena: I’d love to go to Japan, too!

A conversation like this would score reasonably highly on individual turns, despite descending into nonsense overall. As a result, Google wrote some additional rule-based modifiers to the network to avoid descents such as these and achieved an even higher SSA result (79%). Nevertheless, the limitations in their measurement methods persist.
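To give a flavour of what such a rule might look like, here is a small sketch that skips any candidate response the bot has effectively already given. This is my own illustration and not Google's actual implementation; the function names and the second candidate response are hypothetical.

```python
import string
from typing import List

def normalise(text: str) -> str:
    """Lowercase, strip ASCII punctuation and collapse whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())

def pick_response(candidates: List[str], previous_bot_turns: List[str]) -> str:
    """Return the first candidate that does not repeat an earlier bot turn."""
    seen = {normalise(turn) for turn in previous_bot_turns}
    for candidate in candidates:
        if normalise(candidate) not in seen:
            return candidate
    return candidates[0]  # every candidate repeats, so fall back to the first

previous = ["I'd love to go to Japan too."]
candidates = [
    "I'd love to go to Japan, too!",  # same reply with different punctuation
    "Why do you think I'm lying?",    # hypothetical alternative candidate
]
print(pick_response(candidates, previous))
```

A rule like this only catches near-identical repeats, so it would still let through "I'd love to go to Japan." without the "too".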


As a contribution to the ever-growing litany of chatbots, Meena represents a resident of the so-called ‘uncanny valley’, where technology is approaching human-like tendencies, but seems a little bit ‘off’. Its responses represent a clear improvement on the state of the art, but they’re not quite close enough to a human’s to be mistaken for them.

However, as a contribution to the research in this nascent area of end-to-end chatbots, Google’s experiments represent a giant leap forward, and I expect a number of other research groups to build on the methods and frameworks that they have pioneered.