Testing Strategies for Chatbots (Part 2) — Testing Their Dialog Logic
Dialog unit testing (philosophy)
In my previous post I talked about how to test a chatbot’s classifier. Now that we know the chatbot can interpret user input by turning user utterances into user intents, we are ready to test the dialog routing logic in our chatbot. It’s important to remember that this testing should focus only on the conversation route and state, and not bleed into testing classification performance. In Watson Assistant we can guarantee we are not testing the classifier by using only exact utterances from the ground truth, since those will be classified perfectly.
Chatbots generally implement routing logic through contextual state and conditional branches. We can verify the routing logic implementation by sending the chatbot a series of user utterances and examining our conversational state. When using Watson Assistant there are two primary ways to examine the conversational state: the response text and context variables.
Dialog unit testing (standalone chatbot)
Each way of examining conversational state, the response text and context variables, has advantages and drawbacks. Response text is straightforward to evaluate; however, it may change frequently, leading to brittle test assertions. This can be mitigated by matching only a portion of the response text, down to a single expected word. Context variables are easy to examine, but the chatbot may not have a unique combination of state variables for each node in the dialog. Thus, a good dialog unit testing strategy will use both.
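For instance, a partial-match assertion can be as simple as the following sketch. I am assuming the response text can be pulled out of the Watson Assistant MessageResponse as a list of strings; the assertContains helper in my framework (linked below) may differ in its details.
import static org.junit.Assert.fail;
//Sketch of a partial-match assertion on response text
public static void assertContains(String message, MessageResponse response, String expectedSubstring) {
    String responseText = String.join(" ", response.getOutput().getText());
    if (!responseText.contains(expectedSubstring)) {
        fail(message + ": expected response to contain [" + expectedSubstring + "] but was [" + responseText + "]");
    }
}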
We can use any number of unit testing frameworks to verify dialog routing logic. No matter what framework we use, each test will set up a conversational state through one or more utterances and then will verify that conversational state by examining the response text(s) and context variables.
I’ve created a JUnit-based chatbot testing framework based on the Java SDK for Watson Assistant. You can find this framework on GitHub at https://github.com/andrewrfreed/assistant-javatester.
The complexity of our unit testing depends partially on whether our chatbot is purely standalone or if the chatbot is integrated with an orchestration layer that coordinates activities with one or more other systems. Let’s first start by testing dialog routing logic in a standalone chatbot, where our test can interact directly with the chatbot.
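Before diving into the example, here is a sketch of the test setup the code below assumes. The ConversationTester wrapper name is illustrative, and the version string and authentication details depend on your SDK version; the real setup is in my framework on GitHub.
//Sketch of JUnit test setup. ConversationTester is an illustrative wrapper
//that sends messages to a workspace and carries context across turns.
private ConversationTester conversation;

@Before
public void setUp() {
    Assistant service = new Assistant("2018-07-10"); //version date is an assumption
    service.setUsernameAndPassword(username, password);
    conversation = new ConversationTester(service, workspaceId);
}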
The example
I’ve created a simple testing workspace with a single dialog flow. The chatbot is primarily concerned with detecting requests for appointments. Once a user requests an appointment, they ask how to contact the repair personnel. The contact request is answered differently for “gold” members and regular (non-“gold”) members.
The conversational flow in Figure 1 is supported by the workspace in Figure 2.
Unit testing the example with test code
The most interesting routing logic is the conditional. We want to verify that the chatbot gives different responses to gold members and non-gold members. This is easily accomplished with two tests, one for each branch.
Non-gold member:
@Test
public void two_turns_without_outside_context() {
    MessageResponse response = null;
    //turn 1, goes to node "Schedule appointment"
    response = conversation.turn("Can someone support me at home?");
    //turn 2, within appointment goes to "contact us"
    response = conversation.turn("Can I text you a question?");
    assertContains("turn 2 state text", response, "Contact me via email");
}
Gold member:
@Test
public void two_turns_with_outside_context() {
    MessageResponse response = null;
    conversation.getContext().put("goldMember", "true");
    //turn 1, goes to node "Schedule appointment"
    response = conversation.turn("Can someone support me at home?");
    //turn 2, within appointment goes to the gold member "contact us"
    response = conversation.turn("Can I text you a question?");
    assertContains("turn 2 state text", response, "No, we'll call you!");
}
The full listing for this test code is found in ExampleTest_Code.java.
This style of unit testing works well if you don’t mind writing and maintaining your tests in code. As mentioned before, there is a maintenance challenge if dialog response text changes frequently. It is also possible to write dialog unit tests without using code. Let’s explore a code-less way to add new tests.
Unit testing the example with test configuration
The core of our testing logic follows a stable pattern: the user provides one or more input utterances, and the test examines the response text and context variables. We can therefore use a data-driven pattern like JUnit’s Parameterized tests to drive tests from a data template. My test framework provides an example of this pattern.
The pattern for a single test is as follows:
#test_name,(context variable name, context variable value)*
utterance1,expected_output_substring1,(context variable name, context variable value)*
utterance2,expected_output_substring2,(context variable name, context variable value)*
…
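To make the pattern concrete, here is a sketch of how one line of this template might be parsed into a test step inside the test class. The TestStep holder and parseLine method are illustrative names; the real parsing lives in my framework.
import java.util.HashMap;
import java.util.Map;

//Illustrative holder for one conversation turn in a data-driven test
static class TestStep {
    String utterance;
    String expectedSubstring;
    Map<String, String> expectedContext = new HashMap<>();
}

//Sketch: parse "utterance,expected_output_substring,(name,value)*" into a TestStep
static TestStep parseLine(String line) {
    String[] fields = line.split(",");
    TestStep step = new TestStep();
    step.utterance = fields[0];
    if (fields.length > 1) step.expectedSubstring = fields[1];
    //Remaining fields come in pairs of expected context variable names and values
    for (int i = 2; i + 1 < fields.length; i += 2) {
        if (!fields[i].isEmpty()) {
            step.expectedContext.put(fields[i], fields[i + 1]);
        }
    }
    return step;
}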
The same tests from the previous example are represented in the following test configuration.
#regular appointment,,,,,
Can someone support me at home
Can I text you a question?,Contact me via email
#gold member appointment,goldMember,true,,,
Can someone support me at home
Can I text you a question?,we'll call you!
These tests are driven via a single parameterized test file: ExampleTest_Parameterized.java.
The full configuration listing for the test file is found in simple-assistant-test-cases.csv.
Dialog unit testing (orchestrated chatbot)
Most chatbots are not standalone systems; rather, they interact with one or more other systems through an orchestration layer. In this architecture, users interact only with the orchestration layer. The orchestration layer can forward user input to the chatbot for intent identification and dialog routing, but the orchestrator can also choose to modify input before it is sent to the chatbot (for example, masking sensitive inputs) or modify output from the chatbot (for example, personalizing responses or injecting data from external systems). Since the user interacts only with the orchestration layer, our tests will exclusively interact with the orchestration layer as well.
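In sketch form, an orchestrator’s input handler might look like the following. Every helper here (maskSensitiveData, personalize, the chatbot wrapper) is a hypothetical name used for illustration, not part of any SDK.
//Sketch of an orchestration layer pass-through. All helper methods are
//hypothetical names used for illustration.
public String onInput(String userInput) {
    String safeInput = maskSensitiveData(userInput); //e.g. redact card numbers
    MessageResponse response = chatbot.sendMessage(safeInput, context);
    context = response.getContext();                 //carry state across turns
    return personalize(response);                    //e.g. inject the user's name
}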
Let us reiterate that our interest is in unit testing the dialog routing logic. It is much cheaper to test conversation variations at this unit testing layer than at higher layers, which include user interfaces and external systems. As such, you should aim to catch all dialog coding bugs at this layer. You may decide to test “most” scenarios rather than “all” here; the decision hinges mostly on test code maintenance cost.
I highly recommend using test doubles or mocks for any external systems the orchestration code interacts with. Assuming you follow the good software engineering principles of loose coupling and high cohesion, this will not be a problem. Dependency injection is your friend! The example below will use dependency injection and test doubles to isolate our testing to the routing logic only.
The example
An orchestrator is configured to handle eCommerce order cancellations. In the example, the chatbot is used to detect that the user wants to cancel an order, the orchestrator asks directly for the user’s member ID and a yes/no confirmation (without involving the chatbot), and the orchestrator calls the appropriate backend systems to handle the order processing particulars.
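In sketch form, the orchestrator’s routing is a small state machine. The state names and helper methods below are illustrative only; see Orchestrator.java (linked below) for the actual flow.
//Sketch of the orchestrator routing as a state machine. State names and
//helpers (forwardToChatbot, isAffirmative, etc.) are illustrative.
enum State { DETECT_INTENT, AWAIT_MEMBER_ID, AWAIT_CONFIRMATION }

public String onInput(String userInput) {
    switch (state) {
    case DETECT_INTENT:
        if (chatbotDetectsCancelIntent(userInput)) {
            state = State.AWAIT_MEMBER_ID;
            return "Sorry to hear that, what is your Member ID?";
        }
        return forwardToChatbot(userInput);
    case AWAIT_MEMBER_ID:
        //The orchestrator handles this turn itself, without the chatbot
        context.put("memberId", userInput);
        context.put("orderId", orderManagement.getLastOrder(userInput));
        state = State.AWAIT_CONFIRMATION;
        return "Cancel order number " + context.get("orderId") + "?";
    case AWAIT_CONFIRMATION:
        state = State.DETECT_INTENT;
        if (isAffirmative(userInput)) {
            orderManagement.cancelOrder(context.get("orderId"));
            return "I cancelled order number " + context.get("orderId") + ".";
        }
        return "Okay, I will not cancel the order.";
    }
    return forwardToChatbot(userInput);
}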
Unit testing the example with test code
As in the previous test examples, the test largely follows the pattern of providing user input and verifying conversational response text and state variables. The test below verifies the response text from each turn of the conversation. It also inspects the test double to see if the expected external interactions were made — in this case, was a call made to cancel an order at the right time?
@Test
public void test_order_return() {
    String utterance1 = "I would like to cancel my order";
    String utterance2 = "Yes, cancel that order";
    String memberId = "ABC123";

    //Conversation turn 1
    //response = "Sorry to hear that, what is your Member ID?"
    String resp1 = orchestrator.onInput(utterance1);
    assertContains(resp1, "Member ID");

    //Conversation turn 2
    //response = "Cancel order number 456?"
    String resp2 = orchestrator.onInput(memberId);
    assertContext(orchestrator.getContext(), "memberId", memberId);
    assertContext(orchestrator.getContext(), "orderId", "456");
    assertContains(resp2, "Cancel");
    assertFalse(testDouble.isOrderCancelled());

    //Conversation turn 3
    //response = "I cancelled order number 456."
    String resp3 = orchestrator.onInput(utterance2);
    assertTrue(testDouble.isOrderCancelled());
    assertContains(resp3, "cancelled");
}
The test makes use of a test double which is injected into the orchestrator. Use of a test double means we don’t have to rely on the external system being around (if the external system is down, we can still test our routing logic). We can hardcode simple logic for un-interesting functions (for example, getLastOrder) and we can watch the interesting functions more closely. In the test double below the cancelOrder function is implemented so that we can be sure it was called.
class OrderManagementTestDouble implements IOrderManagement {
    private boolean orderCancelled = false;

    public String getLastOrder(String memberId) { return "456"; }

    public void cancelOrder(String orderId) {
        System.out.println("Cancelling order #" + orderId);
        orderCancelled = true;
    }

    public boolean isOrderCancelled() { return orderCancelled; }
}
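Wiring the double into the orchestrator is then a one-liner in the test setup, assuming the orchestrator accepts its IOrderManagement dependency through its constructor. The constructor signature here is an assumption; check Orchestrator.java for the actual wiring.
private Orchestrator orchestrator;
private OrderManagementTestDouble testDouble;

@Before
public void setUp() {
    testDouble = new OrderManagementTestDouble();
    //Constructor injection (assumed): the orchestrator uses whatever
    //IOrderManagement implementation it is handed
    orchestrator = new Orchestrator(testDouble);
}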
The full code listing is found in ExampleCode_Orchestrator.java. The accompanying sample orchestrator is Orchestrator.java.
Conclusion
As in any software engineering endeavor, it is important to apply good testing principles to your chatbot. Chatbot testing includes two fundamental aspects: testing the classifier used by the chatbot and testing the routing logic implemented in your chatbot. Testing both aspects is essential to building a high-quality chatbot.
The previous post outlined strategies for testing your classifier’s performance and using the test results to improve the classifier.
This post demonstrated several strategies for testing the routing logic used by your chatbot. As with any automated testing strategy, the approaches in this post require an up-front investment of time to create the tests and an ongoing maintenance cost. If you have very simple routing logic with frequently changing response text, it may not be worth automating tests with the strategies in this post. However, if you have complex or frequently changing routing logic, you should consider using automated tests.