Tracking Loss of Containment in Virtual Agents

Cari Jacobs
IBM Data Science in Practice
Mar 27, 2020

If your enterprise has a virtual agent (chatbot) or is thinking about building one, you will want to understand how the solution can deliver business value. “Containment” is one of the top metrics a conversational solution will be measured against. This term refers to how often the bot is able to resolve a user’s request without escalating to a human agent.

Every transaction the bot successfully completes can be counted as cost savings. Business value is often realized once the solution contains enough transactions to cover the cost of the implementation and ongoing maintenance.

Many solutions are capable of tracking the total number of interactions that escalate to a live agent. However, the reasons for transfer are myriad. Having the ability to drill down into these reasons can aid you in identifying and prioritizing where to spend your improvement efforts.

Case Study

I recently consulted with a Watson Assistant client to help them improve the containment for their virtual agent. Since the bot was already in production, the first step was to audit the dialog flows. We needed to identify which nodes were making the decision to escalate to an agent.

For each node where an escalation decision is made, we assigned one of the categories listed below and tallied the results. Because the solution was fairly mature, this was no small task: the dialog design contained dozens of complex interaction flows. The findings were an eye-opener for the solution’s business sponsor.
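
If it helps to picture the tallying step, here is a small Python sketch. The node names and category assignments are hypothetical stand-ins, not the client’s actual dialog.

    from collections import Counter

    # Hypothetical output of the dialog audit: each node that can escalate,
    # mapped to the category we assigned it during review.
    escalation_nodes = {
        "node_billing_dispute": "Business Process",
        "node_claims_menu_other": "Menu Choice Other",
        "node_input_too_long": "Long Utterance",
        "node_retry_limit_reached": "Low Confidence",
        "node_negative_feedback": "Negative Feedback",
        "node_agent_keyword": "User Requested Agent",
    }

    # Tally how many escalation points fall into each category.
    for category, count in Counter(escalation_nodes.values()).most_common():
        print(f"{category}: {count}")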

Read on for a detailed description, including some mitigation techniques, for each category…

Business Process

These escalation triggers are dictated by the business. Drilling down further, the reasons behind them included:

  • sensitive topics (as defined by the business)
  • the conversational interface does not provide an ideal user experience
  • complex manual processing required
  • lack of integration with systems required for task automation
  • lack of automation capability (such as APIs) within integrated systems

Each one should be reviewed for enhancement opportunities. Are there places where you can simplify the steps needed to complete a task? Can you overcome the barriers that are preventing integration with backend systems? Can APIs be implemented to facilitate automation?

Some topics or goals may never be containable. Others are achievable; they just take time and resources. These enhancements should be prioritized according to your runtime metrics and your most urgent pain points.

Menu Choice Other

This escalation category was a bit nuanced and may not apply to all use cases. In our use case, it was very closely related to Business Process. For complex (multi-turn) dialog flows, we had a topic confirmation mechanism. (Confirmation meant we were getting the right general intent.) At some point in the flow, the user is presented with drill-down options/subtopics. All options keep the conversation contained except the last one, “Other”. The user is escalated to an agent when they select this option.

In some cases, the wording of these menu options was confusing: the instructions were unclear, or the choices contained jargon that an average user might not understand. Either problem could cause users to select “Other” even though a valid answer was listed among the choices.

In other cases, we discovered options that were deliberately omitted from the original scope. The strategy (which is very common for an MVP) was, “If the problem isn’t ‘x’, ‘y’, or ‘z’, it’s too complex and we just need to escalate for now.” These cases are essentially the same as the Business Process category. This approach is often rooted in the constraints of inefficient/undocumented manual business processes. Fortunately, virtual agent projects tend to drive overall business process maturity. As time goes on, look for opportunities that allow your bot to start taking on more responsibility.

Another cause of hitting this escalation point was simply part of the natural evolution for a conversational solution. You don’t know what you don’t know. But you still have to plan for this eventuality. Review your runtime logs to discover the gaps between what you thought the users would come here for and what they actually want from your bot.


Long Utterance

This escalation point was implemented only after the solution had been in production for a while. Though Watson Assistant allows a maximum of 2048 characters as input, some use cases will not come near that. Our runtime logs revealed that extremely long inputs tended to be truly complex or contained multiple, unrelated requests. Since these requests were not essential to (or representative of) the bot’s core purpose, the best user experience was to immediately escalate.
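
A minimal sketch of this kind of check, assuming it runs in an orchestration layer before the message reaches the assistant (the 300-character threshold is illustrative, not the value we used):

    # Hypothetical pre-check for extremely long inputs. Tune the threshold
    # against your own runtime logs before escalating on length alone.
    LONG_UTTERANCE_THRESHOLD = 300

    def should_escalate_for_length(utterance: str) -> bool:
        """Flag very long inputs for immediate hand-off to a live agent."""
        return len(utterance.strip()) > LONG_UTTERANCE_THRESHOLD

    print(should_escalate_for_length("Where can I reset my password?"))
    print(should_escalate_for_length("I have several unrelated problems... " * 20))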

Low Confidence

When the classifier returns low confidence, the bot will usually respond with a message such as, “I’m sorry, I didn’t understand. Please rephrase your question.” It is very common to set a counter for this occurrence and escalate after two or three retries (see the sketch at the end of this section). A high rate of escalations in this category comes down to two main causes:

  • poor training of the model
  • scope of a model does not match real-world demand

You will need to analyze your runtime logs to determine which is the cause. You may find it is a combination of the two.

This was mentioned above but bears repeating: Review your runtime logs to discover the gaps between what you thought the users would come here for and what they actually want from your bot.
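
Here is a minimal sketch of the retry-then-escalate pattern described above. The confidence threshold, retry limit, and action names are assumptions for illustration, not the client’s implementation.

    # Sketch of the "retry then escalate" pattern. The 0.4 confidence
    # threshold and two-retry limit are illustrative values.
    LOW_CONFIDENCE_THRESHOLD = 0.4
    MAX_RETRIES = 2

    def next_action(top_intent_confidence: float, retry_count: int):
        """Return the next action and the updated retry counter."""
        if top_intent_confidence >= LOW_CONFIDENCE_THRESHOLD:
            return "answer", 0                      # understood: reset the counter
        if retry_count + 1 > MAX_RETRIES:
            return "escalate_low_confidence", retry_count + 1
        return "ask_to_rephrase", retry_count + 1   # "Please rephrase your question."

    # Simulate three consecutive low-confidence turns.
    retries = 0
    for confidence in (0.22, 0.18, 0.31):
        action, retries = next_action(confidence, retries)
        print(confidence, action, retries)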

Negative Feedback

#General_Negative_Feedback is a pre-built intent which you can add from the Watson Assistant Content Catalog. It detects utterances that express unfavorable feedback. When a solution escalates due to negative feedback, it is important to find out how often this is happening. If the frequency is high, you will need to review the logs (or perhaps you have survey results available) to find out why so many users are frustrated or unhappy. Some possible causes for user frustration may include:

  • bot user interface design
  • poor intent detection
  • unsatisfactory answers
  • lack of “human touch”

Review the utterances that trigger this event to find patterns in what is making the users unhappy.


User Requested Agent

Our use case distinguished between users who requested an agent immediately and users who requested an agent after the conversation was a few turns in. We suspected that users who ask for an agent right away have either had a poor experience with a bot, or believe that their question is too complicated for a bot. This is the equivalent of hitting zero for an operator right away in a phone menu.

You can leverage this opportunity to collect training data (or perform some basic routing) by responding with something like, “In order to direct you to the right team, please provide a brief statement about what you need.” Review these utterances to find out how often they could have been successfully resolved by the bot.

If the user is requesting an agent after the conversation is underway, you might take a look at the topic/intent that was most recently identified. If a pattern emerges, investigate. This could be poor training or a problematic business/dialog flow.
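
One simple way to spot such a pattern is to group mid-conversation agent requests by the last recognized intent. The record fields and intent names below are hypothetical stand-ins for whatever your logs capture.

    from collections import Counter

    # Hypothetical escalation records pulled from runtime logs.
    agent_requests = [
        {"turns_before_request": 1, "last_intent": None},                # asked immediately
        {"turns_before_request": 4, "last_intent": "billing_dispute"},
        {"turns_before_request": 3, "last_intent": "billing_dispute"},
        {"turns_before_request": 5, "last_intent": "password_reset"},
    ]

    # Count the last recognized intent for requests made after the first turn.
    pattern = Counter(
        rec["last_intent"]
        for rec in agent_requests
        if rec["turns_before_request"] > 1
    )
    print(pattern.most_common())   # e.g. [('billing_dispute', 2), ('password_reset', 1)]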

Either way, the fixes will take place over time, as you continuously improve a bot to gain your users’ confidence.


Runtime Logs: What’s Happening in the Real World?

These categorizations can be applied, in part or in whole, in many other domains and use cases. They help narrow down actionable resolutions. This is only one side of the story, though. Before we could make any judgments or take action, we needed to start collecting data about how often each of these events occurs in production.

Our next sprint was dedicated to tagging the dialog nodes with context variables. A reporting script was implemented to scrape the runtime logs. Then, all we could do was wait for the data to start coming in.
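
The reporting pass can be as simple as the sketch below. It assumes each escalating dialog node sets a context variable (here called escalation_reason, an illustrative name) and that the runtime logs have already been exported to a JSON file, for example via the Watson Assistant logs API.

    import json
    from collections import Counter

    def count_escalation_reasons(log_file: str) -> Counter:
        """Tally the escalation_reason context variable across exported logs."""
        with open(log_file) as f:
            export = json.load(f)            # expected shape: {"logs": [ ... ]}
        reasons = Counter()
        for entry in export.get("logs", []):
            context = entry.get("response", {}).get("context", {})
            reason = context.get("escalation_reason")
            if reason:
                reasons[reason] += 1
        return reasons

    if __name__ == "__main__":
        for reason, count in count_escalation_reasons("assistant_logs.json").most_common():
            print(f"{reason}: {count}")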

Actionable Insights

The insights gained from this exercise were incredibly valuable. For instance, the “Low Confidence” numbers indicated some problems in our training. The chatbot team immediately got to work on classifier improvements. The model steadily improved and our escalations for this reason were cut in half over the following sprints.

The “Long Utterance” and “Negative Feedback” escalations were relatively low, so we didn’t need to spend much time on these; they simply need to be monitored.

We can’t prevent the user from asking for an agent, but we can indirectly influence those categories over time through improvement in other areas. It takes time to build trust, so we planned to monitor these as well.

The business now had a starting point for prioritizing automation efforts. This data also gave the team a clearer sense of what level of containment could reasonably be expected for this use case. In an ideal world, the majority of escalations would be due to “Business Process”, and we would see that overall volume decrease over time.

Conclusion

Understanding why a chatbot is escalating to your live agents is key to improving your overall solution. Armed with this knowledge, your organization can make data-driven choices about where to invest resources for improvement and enhancement.

If you would like help in building or improving a conversational solution, reach out to IBM Data and AI Expert Labs and Learning.

Special thanks to the reviewer: Andrew Freed

Cari Jacobs is a Cognitive Engineer at IBM and has been working with companies to implement Watson AI solutions since 2014. Cari enjoys kayaking, science fiction, and Brazilian jiu-jitsu.
