Making your bot shine

Editor at Sage
Sage Developer Blog
6 min read · Aug 21, 2020

Techniques for improving your chatbot’s performance

Focusing on building a chatbot, but not on optimizing it, is a bit like ordering a burger without the fries — sure, it will do the job, but are you really getting the full experience?

Defining the use case and building the chatbot is important, but thinking about the user experience is critical in today's ‘bot ready’ marketplace — users expect chatbots to understand colloquialisms, accents, slang, jokes and so much more.

Testing, gathering feedback, and refining your chatbot ensures that your user experience is crisp and clear, and dispels the confusion and dissatisfaction that, in some cases, can be worse than not having a chatbot at all.

If we are supplementing humans with machines, the hand-off needs to be seamless. The machine shouldn't pretend to be human, but it should work in partnership with us and provide a decent experience.

In this article, we discuss how optimization is handled in several popular frameworks for bot development.

What type of information is needed for improving the bot?

The information for optimizing a chatbot can come from various sources:

The bot knowledge (training data): this comes from the training examples and can help to detect conflicts in the data. Training data can also contain near-duplicate examples that decrease the performance of the recognition algorithm, class imbalance, empty entries, etc. This is especially relevant for large solutions with many purposes, or for knowledge bases developed by two or more people, because it is very easy to add overlapping content when trying to cover the maximum number of possible variations.
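As a concrete illustration, a duplicate check over training data can be sketched in a few lines of Python. The intent names and examples below are invented for illustration and don't come from any particular framework:

```python
from collections import defaultdict

def find_conflicts(training_data):
    """Map each normalized example to the set of intents that use it;
    any example claimed by more than one intent is a conflict."""
    seen = defaultdict(set)
    for intent, examples in training_data.items():
        for example in examples:
            seen[example.lower().strip()].add(intent)
    return {ex: intents for ex, intents in seen.items() if len(intents) > 1}

# Hypothetical training data with one conflicting example
training_data = {
    "greeting": ["hi", "hello", "good morning"],
    "farewell": ["bye", "good morning", "see you"],
}
conflicts = find_conflicts(training_data)
print(sorted(conflicts))  # ['good morning']
```

Each conflicting example maps to the set of intents that claim it, which is usually enough to decide which intent should keep it.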

The conversation logs: these can be much more valuable for improving and building knowledge; however, they are only available in tools that also host the bots and have access to the logs, e.g. LUIS, SAP Conversational AI, or DialogFlow. This information is essential for evaluating the bot's performance with real users in real conversations, and once the bot is live it helps you adapt the initial design to the behavior users actually demand.

Regression tests: regression test values are generated by comparing two or more versions of the bot. The bot solution and the related training data should be treated like code, as changes in the training data can have a dramatic impact on the way your bot performs. It's important to run regression tests to make sure that everything works as expected.
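A minimal regression harness can be sketched as follows. The two stand-in "bot versions" are just dictionaries standing in for real prediction calls, and all names are illustrative:

```python
def run_regression(test_cases, predict_old, predict_new):
    """Compare two versions of a bot on a fixed test set and report
    the utterances where the new version regresses."""
    regressions = []
    for utterance, expected in test_cases:
        old_ok = predict_old(utterance) == expected
        new_ok = predict_new(utterance) == expected
        if old_ok and not new_ok:
            regressions.append(utterance)
    return regressions

# Stand-in predictors for two hypothetical bot versions
old_bot = {"hi": "greeting", "bye": "farewell"}.get
new_bot = {"hi": "greeting", "bye": "greeting"}.get  # regressed on "bye"

cases = [("hi", "greeting"), ("bye", "farewell")]
print(run_regression(cases, old_bot, new_bot))  # ['bye']
```

In a real pipeline the two predictors would call the old and new model endpoints, but the comparison logic stays the same.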

Using these sources properly, it is possible to better adapt the user experience over time and increase satisfaction levels.

How can this information be used to improve your bot's performance?

Insights for bot optimization can be presented in different ways; these are the most useful:

A list of specific errors, for example, intents with duplicated training examples, user inputs that match more than one intent with similar confidence values, etc. Information about those errors is easy to grasp and normally allows quick fixes.

Metrics for evaluating performance, most commonly precision and recall scores.
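For reference, per-intent precision and recall can be computed directly from paired prediction/label lists. The intents and data below are made up for illustration:

```python
def precision_recall(predicted, actual, intent):
    """Per-intent precision and recall over paired prediction/label lists."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == intent and a == intent)
    fp = sum(1 for p, a in zip(predicted, actual) if p == intent and a != intent)
    fn = sum(1 for p, a in zip(predicted, actual) if p != intent and a == intent)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

predicted = ["greeting", "greeting", "farewell", "greeting"]
actual    = ["greeting", "farewell", "farewell", "greeting"]
p, r = precision_recall(predicted, actual, "greeting")
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=1.00
```

Low precision means the intent over-triggers; low recall means its utterances leak into other intents.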

Tips. This is very useful for non-technical users, as it translates the metrics into meaningful messages and provides specific actions easy to understand.

Graphical interfaces, like charts, are a nice way of detecting problems that affect the whole solution, for example, intents that overlap with each other or a specific intent that over-triggers. Confusion matrix graphs are good for detecting these issues. On the other hand, graphs can sometimes be confusing if their aim is not clear or the design mixes lots of different information in the same interface.
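A confusion matrix of this kind can be built directly from (actual, predicted) pairs; the off-diagonal cells are exactly the intent overlaps described above. This is a generic sketch with invented intents, not tied to any of the tools discussed:

```python
from collections import Counter

def confusion_matrix(predicted, actual):
    """Count (actual, predicted) pairs; off-diagonal cells reveal which
    intents are being confused with each other."""
    return Counter(zip(actual, predicted))

predicted = ["greeting", "farewell", "greeting", "help"]
actual    = ["greeting", "farewell", "help", "help"]
matrix = confusion_matrix(predicted, actual)

# Keep only the off-diagonal entries, i.e. the confused intent pairs
overlaps = {pair: n for pair, n in matrix.items() if pair[0] != pair[1]}
print(overlaps)  # {('help', 'greeting'): 1}
```

Here one "help" utterance was classified as "greeting", which is the kind of overlap a confusion matrix chart makes visible at a glance.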

How does each tool for designing bots provide this information?

The various tools for designing bots considered in this article show insights for optimizing your bot in different ways. Let’s discuss some of the approaches we have found for each task.

Detecting problems in the training examples:

Microsoft LUIS is a good example here, as it shows specific tips for improving intents and entities, and it also offers general tips for boosting bot performance.

Sometimes a basic intent is missing from your knowledge, and it is useful if your tool detects this. In this regard, DialogFlow shows different messages when basic information is missing.

Intents with similar or overlapping training examples. A good tool detects them and provides a quick action for fixing the problem. In this regard, SAP Conversational AI and LUIS allow you to easily see which intents are very similar or share common examples.

The confusion matrix shows the relationships between all the intents and is also useful for getting an overview of the solution. SAP Conversational AI offers interesting visualizations for seeing overlaps between intents.

However, other tools like LUIS or QBox offer a visualization that works well for a few intents but that is not practical if your solution has many intents (>30).

Detecting problems in conversation logs:

For basic metrics, like the most frequent answers, number of users, etc., SAP Conversational AI offers the essential information. Other tools like DialogFlow assume that you will create a custom dashboard for showing this information.
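If your tool only exposes raw logs, these basic metrics are straightforward to compute yourself. The log records below are hypothetical, since each framework exposes logs through its own API:

```python
from collections import Counter

# Hypothetical log records; field names are illustrative assumptions
logs = [
    {"user": "u1", "answer": "Here are your invoices."},
    {"user": "u2", "answer": "Here are your invoices."},
    {"user": "u1", "answer": "I didn't understand that."},
]

top_answers = Counter(entry["answer"] for entry in logs).most_common(1)
unique_users = len({entry["user"] for entry in logs})

print(top_answers)   # [('Here are your invoices.', 2)]
print(unique_users)  # 2
```

The same aggregation pattern extends to metrics such as fallback rate or conversations per day once timestamps are available.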

Some apps provide ways of creating on-the-fly answers for such inputs or automatically selecting the best intent. In this regard, LUIS allows you to easily assign intents. However, in tools like DialogFlow it is not possible to add questions from the history view.

Inputs that matched two or more intents with similar confidence values. LUIS is a good example here, as it clearly shows the questions whose intents received similar confidence values.
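Flagging such ambiguous inputs yourself is also easy if you can export confidence scores from the logs. The utterances, intent names, and the 0.1 margin below are illustrative assumptions:

```python
def ambiguous_inputs(scored_logs, margin=0.1):
    """Flag utterances whose top two intent confidences are within `margin`."""
    flagged = []
    for utterance, scores in scored_logs:
        ranked = sorted(scores.values(), reverse=True)
        if len(ranked) >= 2 and ranked[0] - ranked[1] <= margin:
            flagged.append(utterance)
    return flagged

scored_logs = [
    ("show my bill", {"invoices": 0.55, "payments": 0.52}),  # ambiguous
    ("hello there", {"greeting": 0.97, "farewell": 0.05}),   # clear winner
]
print(ambiguous_inputs(scored_logs))  # ['show my bill']
```

Each flagged utterance is a candidate for new training examples, or a sign that two intents should be merged.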

Using other indicators for assessing the accuracy of the response: user feedback, sentiment analysis, etc. Unfortunately, none of the tools we analyzed showed this information.

Detecting problems with regression tests:

PROs:

Useful for making sure that bot behaviour is consistent after changes (setting a benchmark)

Provide a numerical indicator of the bot's performance that can be used for cross-version evaluation, although the statistical ratios need to be interpreted for non-technical users.

CONs:

Difficult and time-consuming to build

Need updating every time the knowledge changes

Work best if the tool allows building tests on-the-fly from log data

Regarding regression tests, SAP Conversational AI and Botium are good tools, as they offer great visualizations and allow you to easily detect problems when the new version is worse than the previous one. In contrast, other tools like DialogFlow don't offer this feature, and others like QBox offer it but with a visualization that is sometimes confusing.

Conclusion

From my work in this field, these are the main elements I would recommend that you take into consideration when choosing a tool for designing your chatbot:

1. The tool should provide metrics and a specific set of actions to take in order to improve performance. Examples: DialogFlow, SAP Conversational AI, and LUIS.

2. The tool should also provide simple graphs that represent just one metric but show how it affects all the components. This way, the user can easily detect the intents/entities that need changes or fixes.

3. Avoid chatbot design tools that lack clear metrics, present complex graphs mixing lots of information, or show messages without clear improvement suggestions.

References

Google DialogFlow: https://cloud.google.com/dialogflow

SAP Conversational AI: https://cai.tools.sap/

Microsoft LUIS: https://www.luis.ai/

QBox: https://qbox.ai/

Botium: https://www.botium.ai/
