Chat About Anything with a Human-Like Open-Domain Chatbot

Aaditya Prakash Singh
Published in Analytics Vidhya · Mar 25, 2021

Metrics developed by Google to evaluate open-domain chatbots

Most of today’s chatbots are highly domain-specific in their conversations, and users cannot stray far from the intended use case. They are poor at retaining context from past conversations, sometimes give meaningless or illogical responses, and all too easily fall back on “I don’t know”.

Open-domain chatbots are conversational agents that can chat about anything and have basic knowledge of the real world. In the research paper “Towards a Human-like Open-Domain Chatbot”, Google introduced Meena, which it claims is markedly more sensible and specific in its responses than other chatbots.

One of the challenges with chatbots is evaluation: there is no automated, quantitative way of determining how well an open-domain chatbot performs relative to a human. In this article we’ll discuss two such metrics, but first let’s take a look at what Meena is.

Meena’s Architecture

Meena is a neural conversational model consisting of 1 Evolved Transformer encoder block (to process the context from past conversation turns) and 13 Evolved Transformer decoder blocks (to formulate a response to that context). It is an Evolved Transformer sequence-to-sequence architecture trained to minimize perplexity (the uncertainty in predicting the next token/word). Hyperparameter tuning suggested that a powerful decoder is key to better conversational skills.

Meena has 2.6 billion parameters and is trained on 341 GB of preprocessed social-media text, about 8.5 times as much training data as was used for OpenAI’s GPT-2 (the former state of the art among generative models).

The architecture was discovered using neural architecture search (NAS), which automates neural-network design by allocating more resources to well-performing candidate architectures and fewer (or none) to underperforming ones.

Sensibleness and Specificity Average (SSA)

Google introduced a new metric for evaluating open-domain chatbots called the Sensibleness and Specificity Average (SSA). To test it, Google crowdsourced human evaluations of Meena and several other open-domain chatbots, such as Cleverbot and DialoGPT.

Crowd workers were asked two questions about each chatbot response: “Is the response sensible?” and “Is it specific?”. For example, if a person says “I love tennis” and the chatbot replies “That’s nice”, the reply is sensible but not specific, as it could be used in many situations. If it instead replies “Great!! I love Roger Federer”, the response is considered specific too.

The Sensibleness and Specificity Average is the mean of the fraction of sensible responses and the fraction of specific responses. The results showed that the fine-tuned Meena (79%) outperforms other open-domain chatbots and comes close to human level (86%) on SSA.
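The SSA computation described above can be sketched in a few lines of Python. The function name and the toy labels here are hypothetical, purely for illustration of the averaging:

```python
def ssa(labels):
    """Sensibleness and Specificity Average: the mean of the fraction
    of sensible responses and the fraction of specific responses.
    `labels` is a list of (is_sensible, is_specific) crowd judgments,
    one pair per chatbot response."""
    sensible = sum(1 for s, _ in labels if s) / len(labels)
    specific = sum(1 for _, p in labels if p) / len(labels)
    return (sensible + specific) / 2

# Hypothetical crowd labels for four responses:
# sensible fraction = 3/4, specific fraction = 2/4 -> SSA = 0.625
labels = [(True, True), (True, False), (False, False), (True, True)]
print(ssa(labels))  # 0.625
```

Note that in practice a response judged not sensible is also marked not specific, which the label pairs above respect.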

The disadvantage of SSA is that it relies on manual human labeling, which slows down a chatbot’s evaluation.

SSA scores of humans and various open-domain chatbots (source: ai.googleblog.com)

From the chart above, it is clear that Meena (large) comes close to human level in the art of conversation.

Perplexity

Perplexity is an automatic metric that enables faster evaluation of language models. It measures how well the model predicts the next token in a conversation, i.e., what is going to be said next.
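Concretely, perplexity is the exponential of the average negative log-probability the model assigns to each actual next token. A minimal sketch, assuming we already have the model’s probabilities for the true tokens (the function name and the toy probabilities are illustrative):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability the model
    assigned to each actual next token (lower is better)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Toy probabilities a model assigned to three true next tokens.
# This equals the reciprocal of their geometric mean: (1/64)^(-1/3) = 4.
probs = [0.5, 0.25, 0.125]
print(perplexity(probs))  # ≈ 4.0
```

A model that always assigned probability 1 to the correct token would have a perplexity of 1; higher values mean the model is, on average, “choosing” among more plausible alternatives.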

Moreover, the research found that perplexity correlates strongly with the human evaluation metric SSA and can therefore be used for efficient chatbot development.

SSA vs. perplexity (source: ai.googleblog.com)

This is encouraging because perplexity is automatic and directly optimizable with the standard cross-entropy loss function.

A lower perplexity score indicates higher model confidence. Meena (without fine-tuning) achieves a perplexity of 10.2 (smaller is better), which corresponds to an SSA of 72%.

Further Research & Conclusion

Beyond sensibleness and specificity, attributes such as personality and factuality could also feed into evaluation metrics. Perplexity can be reduced further by optimizing the algorithm, the data, and the architecture of the neural network.

Google is currently not releasing any external research demo because of safety concerns.

In sum, perplexity is to date the best automatic way to evaluate a chatbot, given its strong correlation with the human evaluation method (SSA), and we may not be far from an era when we can no longer tell a human’s conversation apart from a chatbot’s.

For more in-depth intuition about Meena, refer to the link below.

Cheers, Happy Learning

Aaditya Prakash Singh,

Chief AI Officer at Ontribe — https://ontribe.in/
