Don’t build chatbots end-to-end

Sean Wu
Published in Voice Tech Podcast
Oct 3, 2020

Being able to create a well-functioning chatbot just by looking at past conversations in which customers were served well is such an appealing idea that a great deal of effort has gone into it, from both academia and industry. But is it a good idea from a production point of view?

At the recently held SIGDIAL 2020, the 21st annual meeting of the Special Interest Group on Discourse and Dialogue, a paper in the evaluation track won the best paper award. In that paper, the authors painstakingly implemented various strategies for each of the four main components (NLU, state tracking, dialog policy, and NLG) needed for a chatbot, and subjected every valid combination to rigorous testing, at both the component level and the system level. Here are the key takeaways:

“We find that rule-based pipeline systems generally outperform state-of-the-art joint systems and end-to-end systems, in terms of both overall performance and robustness to task complexity. The main reason is that fine-grained supervision on dialog acts would remarkably help the system plan and make decisions because the system should predict the user intent and take proper actions during the conversation. This supports that good pragmatic parsing (e.g. dialog acts) is essential to build a dialog system.”

Furthermore, the paper shows that combining a BERT-based natural language understanding (NLU) module, a rule-based dialog state tracker and dialog policy, and a retrieval-based natural language generation (NLG) module produces the best chatbot according to all the different metrics.
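To make this architecture concrete, here is a minimal, runnable sketch of such a component-wise pipeline. Every piece of it is a toy stand-in of my own: the keyword matcher takes the place of a BERT NLU, and the template lookup takes the place of a retrieval-based NLG.

```python
# Toy sketch of a component-wise pipeline: NLU -> state tracking ->
# dialog policy -> NLG. All logic here is illustrative, not the
# paper's actual implementation.

def nlu(utterance):
    """NLU: map a user utterance to dialog acts (intent, slot, value)."""
    acts = []
    if "north" in utterance:
        acts.append(("inform", "area", "north"))
    if "cheap" in utterance:
        acts.append(("inform", "pricerange", "cheap"))
    return acts

def update_state(state, acts):
    """Rule-based state tracker: fold user acts into the belief state."""
    for intent, slot, value in acts:
        if intent == "inform":
            state[slot] = value
    return state

def policy(state):
    """Rule-based dialog policy: pick the next system act from the state."""
    for slot in ("area", "pricerange"):
        if slot not in state:
            return ("request", slot)
    return ("offer", "restaurant")

def nlg(act):
    """Template NLG: render the system act back into text."""
    templates = {
        ("request", "area"): "Which part of town are you looking in?",
        ("request", "pricerange"): "What price range do you have in mind?",
        ("offer", "restaurant"): "I found a restaurant matching your criteria.",
    }
    return templates[act]

state = {}
for utterance in ["I want a cheap place to eat", "somewhere in the north"]:
    state = update_state(state, nlu(utterance))
    print(nlg(policy(state)))
# Which part of town are you looking in?
# I found a restaurant matching your criteria.
```

Note how each module exposes a small, structured interface (dialog acts, a belief state); this is exactly the "fine-grained supervision on dialog acts" the quote above credits for the pipeline's strong performance.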

But while the paper makes it clear that building a chatbot via end-to-end learning is a bad idea, it offers little analysis of why that is. In this blog, we offer some hypotheses, along with some practical reasons why we should not build chatbots end to end.


1. Issue of exponentially many conversational paths

Unlike a graphical user interface, where users can only interact with the system in the ways its implementation allows, in a conversational interface users can say anything they want. Different topics can interweave arbitrarily, at different junction points; within each topic, information can be exchanged in many different orders. This results in exponentially many conversational paths, which artificially inflates the space/dimension of conversations that the learning machine has to model, and makes it harder to find the common patterns and learn how to react to them.
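A back-of-the-envelope count makes this concrete. If users may mention slots in any order and topics interleave freely, the number of distinct orderings of slot mentions alone grows factorially in the total slot count. The numbers below are illustrative, not from the paper:

```python
from math import factorial

# Illustrative count: distinct orderings of slot mentions when users
# may provide slots in any order and topics interleave freely; this is
# before even considering slot values or paraphrases of each mention.
for n_slots in (3, 5, 8, 10):
    print(f"{n_slots} slots -> {factorial(n_slots):,} orderings")
# 3 slots -> 6 orderings
# 5 slots -> 120 orderings
# 8 slots -> 40,320 orderings
# 10 slots -> 3,628,800 orderings
```

An end-to-end model has to cover this whole space from examples, while a state tracker collapses all of these orderings into the same belief state.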

2. Ambiguous system actions

In an arXiv paper, Mosig et al. studied many leading conversational datasets and used error analysis to find out why it is hard for a learning machine to model the conversations. One of the phenomena they discovered in these datasets is ambiguous system actions: many different system actions follow from the same dialogue history, so the history is only useful if the model predicts a probability distribution over system actions instead of a single best action. Clearly, a system trained on such data cannot learn deterministic behavior. While modifying the evaluation metrics might help, the bottom line is that there is no reason to model surface system actions at all, since they are something the system itself produces. Building the system component-wise forces us to model system behavior in a factorized fashion, and thus naturally sidesteps this wasteful effort.
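This ambiguity is easy to surface yourself: group training examples by dialogue history and count the distinct system actions that follow each one. The data below is made up for illustration; Mosig et al. run this kind of error analysis on real datasets.

```python
from collections import defaultdict

# Made-up examples: the same dialogue history is followed by several
# different system actions, so a deterministic model's accuracy is
# capped by the frequency of the most common action.
examples = [
    ("greet; ask_hotel", "request_area"),
    ("greet; ask_hotel", "request_price"),
    ("greet; ask_hotel", "offer_hotel"),
    ("greet; ask_hotel", "request_area"),
    ("greet; ask_taxi", "request_destination"),
]

actions_by_history = defaultdict(list)
for history, action in examples:
    actions_by_history[history].append(action)

for history, actions in actions_by_history.items():
    best = max(set(actions), key=actions.count)
    cap = actions.count(best) / len(actions)
    print(f"{history!r}: {len(set(actions))} distinct actions, "
          f"deterministic accuracy capped at {cap:.0%}")
# 'greet; ask_hotel': 3 distinct actions, deterministic accuracy capped at 50%
# 'greet; ask_taxi': 1 distinct actions, deterministic accuracy capped at 100%
```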

3. Separation of concerns

To react to ever-changing business conditions, business owners and managers need to pass their knowledge and wisdom on to frontline workers, or to the chatbot system, so that they know what to do. Businesses run on rules, especially digitalized ones. Running a business on a small set of rules not only makes it easy to produce a consistent user experience; it also makes it easy to pass around the business values behind those rules. If a chatbot is built from scratch on raw conversations, the business owner faces the extra work of creating endless conversational sessions every time a changing business condition requires a change in chatbot behavior. If instead we build/teach/program in a component-wise fashion, business people can work on the dialog policy directly in the business-rule space, while the NLU team figures out how to convert language into a structured form that the service can digest directly (work that can easily be outsourced). A chatbot can then be built much more efficiently and at much lower cost.
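Here is what working in the business-rule space can look like in practice: when a business condition changes, the owner edits one rule rather than collecting new conversations and retraining. The rule format below is a hypothetical sketch, not any real tool's configuration.

```python
# Hypothetical sketch: dialog policy written directly as business
# rules over the structured state the NLU produces. A new business
# condition (e.g., large parties must go to a human) is a one-line
# rule edit, not a data-collection-and-retraining cycle.

RULES = [
    # (condition on structured state, system action)
    (lambda s: s.get("party_size", 0) > 8, "route_to_human"),
    (lambda s: "date" not in s,            "ask_date"),
    (lambda s: "time" not in s,            "ask_time"),
    (lambda s: True,                       "confirm_booking"),
]

def policy(state: dict) -> str:
    """Return the first action whose condition matches the state."""
    for condition, action in RULES:
        if condition(state):
            return action

print(policy({"party_size": 12}))                               # route_to_human
print(policy({"party_size": 2, "date": "fri"}))                 # ask_time
print(policy({"party_size": 2, "date": "fri", "time": "7pm"}))  # confirm_booking
```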

In this blog, we outlined some reasons why building a chatbot system in an end-to-end fashion is not a good idea. Building it component-wise instead not only produces better performance consistently, but is also much easier to maintain. I hope this helps those who are just starting to think about building a conversational interface to their service for real production use.

Reference:

  1. Ryuichi Takanobu, Qi Zhu, Jinchao Li, Baolin Peng, Jianfeng Gao, and Minlie Huang. “Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation.” SIGDIAL 2020. https://www.sigdial.org/files/workshops/conference21/pdf/2020.sigdial-1.37.pdf
  2. Johannes E. M. Mosig, Vladimir Vlasov, and Alan Nichol. “Where is the context? — A critique of recent dialogue datasets.” arXiv preprint. https://arxiv.org/pdf/2004.10473.pdf
