How to Design the Training Data for an AI Assistant

Announcing Dialog Skill Analysis for Watson Assistant

Published in IBM watsonx Assistant, Dec 10, 2019

Building AI assistants that work well for your business is not as easy as it should be. That’s because it’s hard to understand why an assistant makes mistakes and, more importantly, how to minimize those mistakes in the limited time available…

I’m on the Watson Assistant algorithms team and our mission is to make assistants (aka chatbots) much easier to design and launch. That’s why we just released the Dialog Skill Analysis Notebook, a new Python framework along with an easy-to-use Python notebook to help you quickly and effectively build AI assistants using Watson Assistant! Whether you are new to the process and are building your first AI assistant or you’re a veteran and have an assistant working well in production, this new framework is intended to help everyone with questions like…

How do I know my assistant is doing a good job?
How do I test and measure my assistant’s performance?
Why is the assistant responding incorrectly to this question?
How do I improve my assistant’s ability to understand questions?

How it Works

The Python notebook can be run on your local laptop by forking or cloning the GitHub repository linked at the end of this post. You can also use the hosted version of the notebook on the IBM Gallery and not worry about the installation process!

In the next few sections, we briefly describe some of the analyses you can try out using the new framework. In these sections, we use an example customer care scenario, where an AI assistant is trained to classify questions about a store like “Where is your store located?” or “What time does it open?” into intents like Customer_Care_Store_Location and Customer_Care_Store_Hours. This scenario uses the same sample data that you get access to when you create your first assistant using Watson Assistant.

Note: We assume basic familiarity with the process of creating AI assistants using Watson Assistant.

Part 1: Training Data Analysis

When you have many developers working on creating and evolving a dialog skill, the design of your assistant can get complicated. Part 1 of the framework focuses on analyzing your training data and uncovering potential pitfalls in design.

For example, if you run the term analysis section on the sample customer care data, you can see terms that are correlated with each intent in your dataset!

You will see that terms like speak & agent are correlated with the General_Connect_To_Agent intent. This appears reasonable as the data includes examples like “I want to speak to an agent” for the General_Connect_To_Agent intent.

In this analysis, if you see numbers or unusual terms that should not be a part of that intent, it indicates potential flaws in the design of the skill which you should immediately review!
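To make the idea of terms being correlated with an intent concrete, here is a minimal sketch that scores each term against each intent with a chi-squared statistic computed by hand. It is an illustration only and not necessarily how the framework computes its term analysis; the TRAINING examples and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy training data: (utterance, intent) pairs.
TRAINING = [
    ("I want to speak to an agent", "General_Connect_To_Agent"),
    ("Can you let me speak with a human agent", "General_Connect_To_Agent"),
    ("Where is your store located", "Customer_Care_Store_Location"),
    ("What is the store address", "Customer_Care_Store_Location"),
    ("What time does the store open", "Customer_Care_Store_Hours"),
    ("Are you open on Sunday", "Customer_Care_Store_Hours"),
]


def tokenize(text):
    """Return the set of lowercase terms in an utterance."""
    return set(text.lower().split())


def chi2_scores(examples):
    """Return {intent: {term: chi-squared score}} for positively associated terms."""
    n = len(examples)
    intent_count = defaultdict(int)
    term_count = defaultdict(int)
    joint = defaultdict(int)
    for text, intent in examples:
        intent_count[intent] += 1
        for term in tokenize(text):
            term_count[term] += 1
            joint[(term, intent)] += 1

    scores = defaultdict(dict)
    for (term, intent), a in joint.items():
        b = term_count[term] - a        # term present, other intents
        c = intent_count[intent] - a    # term absent, this intent
        d = n - a - b - c               # term absent, other intents
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom == 0 or a * d <= b * c:
            continue                    # skip degenerate or negative associations
        scores[intent][term] = n * (a * d - b * c) ** 2 / denom
    return scores


def top_terms(examples, k=2):
    """Return the k highest-scoring terms per intent (ties broken alphabetically)."""
    scores = chi2_scores(examples)
    return {
        intent: sorted(terms, key=lambda t: (-terms[t], t))[:k]
        for intent, terms in scores.items()
    }


print(top_terms(TRAINING)["General_Connect_To_Agent"])
```

On this toy data the two strongest terms for General_Connect_To_Agent are agent and speak. A number or an unrelated term near the top of such a list would be the kind of signal worth reviewing.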

Part 2: Analyze Your Dialog Skill

We know that the amount of time you have to enhance your assistant prior to launch is limited. We would like to help you focus your enhancement efforts on metrics that positively impact your use case.

When you first build a dialog skill you may use the Try it out panel to evaluate if the assistant is able to predict the correct intent when you enter test examples. Unfortunately, this technique does not scale!

Users can ask questions in a variety of different ways and it becomes difficult to manually track performance when you have more than a few intents.

Part 2 of the framework helps you analyze your dialog skill using a test data set! You need to create a test set that consists of additional examples for each intent.

These examples should NOT overlap with the examples that were entered as part of dialog skill creation. The assistant has already seen those examples during training, so testing on them would overstate how well it handles new questions.
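A quick way to check for overlap before running the analysis is to compare normalized versions of the two sets. The sketch below is a hypothetical helper, not part of the framework, and the normalization rules (lowercasing, collapsing whitespace, dropping trailing punctuation) are an assumption you may want to adjust.

```python
def normalize(text):
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    return " ".join(text.lower().split()).rstrip("?!. ")


def find_overlap(train_examples, test_examples):
    """Return test examples that also appear (after normalization) in training."""
    seen = {normalize(t) for t in train_examples}
    return [t for t in test_examples if normalize(t) in seen]


# Hypothetical examples for illustration.
train = ["Where is your store located?", "What time does it open?"]
test = ["where is your store located", "How late are you open today?"]

overlap = find_overlap(train, test)
print(overlap)
```

Any example returned here should be removed from the test set or replaced with a new phrasing.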

Using the test set the framework calculates performance on statistical metrics like Accuracy, Precision, Recall & F1. In the illustration below, you can view the assistant’s performance on the sample customer care data.

Let us walk through an example of how to interpret the metrics mentioned above.

For an intent like Help, recall is 100%, precision is 66.67% and F1 is 80% on the test set:

A high Recall score indicates that sentences belonging to the Help intent are being correctly identified as belonging to the Help intent.
A high Precision score indicates that sentences belonging to other intents are not being misidentified as belonging to the Help intent.
Both precision and recall are important. F1 is a metric that combines both. A high F1 score indicates you are doing well on both precision and recall.
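The numbers quoted for the Help intent can be reproduced with a small worked example. The sketch below computes per-intent precision, recall and F1 from scratch; the four test labels are hypothetical and were chosen only so that the results match the figures above (two Help sentences both identified correctly, plus one Cancel sentence misidentified as Help).

```python
def per_intent_metrics(true_intents, predicted_intents, intent):
    """Return (precision, recall, f1) for a single intent."""
    pairs = list(zip(true_intents, predicted_intents))
    tp = sum(1 for t, p in pairs if t == intent and p == intent)
    fp = sum(1 for t, p in pairs if t != intent and p == intent)
    fn = sum(1 for t, p in pairs if t == intent and p != intent)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical test-set labels and predictions.
true_intents = ["Help", "Help", "Cancel", "Thanks"]
predicted_intents = ["Help", "Help", "Help", "Thanks"]

precision, recall, f1 = per_intent_metrics(true_intents, predicted_intents, "Help")
print(f"precision={precision:.2%} recall={recall:.2%} f1={f1:.2%}")
```

Here precision is 2/3 because one of the three sentences predicted as Help actually belongs to Cancel, while recall is 2/2 because no Help sentence was missed.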

In the metrics chart above, we see that recall is relatively high for the Help intent. This indicates that sentences that belong to the Help intent get correctly identified.

We also see that precision is relatively low for the same intent. This indicates other sentences not belonging to it are getting misidentified as belonging to the Help intent.
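When precision is low for an intent, it helps to list which sentences were pulled into it and where they really belong. The following is a minimal sketch of that lookup; the row layout (utterance, true intent, predicted intent) and the sample rows are assumptions for illustration, not the framework's output format.

```python
from collections import Counter


def false_positives(test_rows, intent):
    """Return (utterance, true_intent) pairs wrongly predicted as `intent`."""
    return [(u, t) for u, t, p in test_rows if p == intent and t != intent]


def confusion_sources(test_rows, intent):
    """Count which true intents are most often misidentified as `intent`."""
    return Counter(t for _, t in false_positives(test_rows, intent))


# Hypothetical test results: (utterance, true intent, predicted intent).
rows = [
    ("I need some help", "Help", "Help"),
    ("Can you assist me", "Help", "Help"),
    ("Help me cancel my booking", "Cancel", "Help"),
    ("Thanks a lot", "Thanks", "Thanks"),
]

print(false_positives(rows, "Help"))
print(confusion_sources(rows, "Help"))
```

Intents that show up often in this count are good candidates for additional or more distinctive training examples.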

In your production environment, if messages identified by your assistant as belonging to the Help intent result in costly, high priority interactions with your customer support, then the relatively lower precision might result in unnecessarily higher costs! Focusing on improving precision can potentially lower your costs…

Part 3: Advanced Analysis

Part 3 of the framework dives into the advanced features available for analysis. For example, if you want to know why certain sentences are being misidentified, you can experiment with the advanced analysis section.

Illustrated below is a novel machine learning technique that visualizes the relative importance of the terms in the sentence.

The sample customer care data includes intents like

Customer_Care_Store_Location, Cancel, Customer_Care_Appointments, General_Connect_to_Agent, Thanks, Customer_Care_Store_Hours, General_Greetings, Help

Ideally, the assistant should map the sentence “If you are closed on Sunday, can you slot me in for tomorrow afternoon?” to the Customer_Care_Appointments intent because the user is asking for an afternoon appointment. But the assistant maps it to the Customer_Care_Store_Hours intent!

The framework can help you identify terms which seem to mislead the assistant. If you look at the visualization shown above, you will notice that the terms closed & afternoon seem important to the assistant.

These terms are part of the training examples used in the Customer_Care_Store_Hours intent. On the other hand, the training examples for the Customer_Care_Appointments intent contain no terms related to being slotted in.

The framework helps you identify key terms in the sentence that it picked up as important. We know the quality of the data affects the learning of the model. You can fix the model by adding more examples that help your assistant make smarter decisions. In the above scenario, you can add examples that relate to being slotted in for an appointment.
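One simple way to build intuition for term importance is leave-one-term-out: remove a term, re-score the sentence, and see how much the confidence in the predicted intent drops. The sketch below uses that substitute technique with a toy keyword-overlap scorer; it is not the technique the framework uses, and the INTENT_TERMS vocabularies and scoring rule are hypothetical stand-ins for a real classifier.

```python
import string

# Hypothetical vocabularies standing in for what a classifier learned per intent.
INTENT_TERMS = {
    "Customer_Care_Store_Hours": {"closed", "open", "afternoon", "hours", "time"},
    "Customer_Care_Appointments": {"appointment", "book", "schedule", "tomorrow"},
}


def tokenize(sentence):
    """Lowercase the sentence, strip ASCII punctuation, and split on whitespace."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def confidence(tokens, intent):
    """Toy confidence: add-one smoothed share of vocabulary hits for `intent`."""
    hits = {name: len(set(tokens) & terms) + 1 for name, terms in INTENT_TERMS.items()}
    return hits[intent] / sum(hits.values())


def term_importance(sentence, intent):
    """Return {term: drop in confidence for `intent` when the term is removed}."""
    tokens = tokenize(sentence)
    base = confidence(tokens, intent)
    importance = {}
    for term in dict.fromkeys(tokens):  # unique terms, original order
        reduced = [t for t in tokens if t != term]
        importance[term] = base - confidence(reduced, intent)
    return importance


SENTENCE = "If you are closed on Sunday, can you slot me in for tomorrow afternoon?"
scores = term_importance(SENTENCE, "Customer_Care_Store_Hours")
top_two = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_two)
```

With these toy vocabularies, closed and afternoon push the sentence toward Customer_Care_Store_Hours, while slot contributes nothing to either intent because no training example mentions it. Adding appointment examples that mention being slotted in would give the assistant evidence for the intended intent.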

In Conclusion

The capabilities illustrated above are a glimpse of the novel features introduced by the Dialog Skill Analysis framework. We hope they help you launch your assistant projects faster. We would also love to get feedback on your experience using Watson Assistant and the new Dialog Skill Analysis framework.

Getting Access

You can fork or download the dialog skill analysis framework on GitHub: https://github.com/watson-developer-cloud/assistant-dialog-skill-analysis

If you run into any issues using our framework, please create issues on the GitHub repository and we will try to address them as soon as we can.

For those of you who do not want to worry about the installation process, we have also made available a hosted version of the notebook on the IBM Gallery: https://dataplatform.cloud.ibm.com/exchange/public/entry/view/4d77701840fcb2f21587e39fdb887049

One More Thing…

If you are also interested in analyzing the effectiveness of your assistant using the log data you generate in production, do check out the Python recommendations notebook: https://github.com/watson-developer-cloud/assistant-improve-recommendations-notebook