Training Considerations for Chatbots Using Multiple Skills Orchestration

Cari Jacobs
IBM Data Science in Practice
8 min read · Jun 12, 2020

When you are building a Watson Assistant chatbot that must handle a variety of knowledge domains, you may begin looking into a concept known as “multiple skills orchestration”. Natural language classifiers often perform best when the classifier is focused on a single domain. Some use cases will perform just fine when mixing domains in a single skill (such as a banking use case that also handles light chit-chat). However, if your chatbot needs to demonstrate a broad range of knowledge and is struggling with accuracy and confidence, the solution may benefit from isolating each domain as a separate skill. This allows you to deliver a robust natural language experience where the chatbot can cover a wide range of topics and also go deep into any single domain.

The three main approaches for implementing a multiple skill solution are:

  • Router
  • Spray
  • Waterfall

There are a number of factors to consider when selecting an approach. (Check out this article to dive into the technical implementation aspects.) Once an approach is selected, there are a couple of options for training your skills. This article focuses on those training considerations.

Note: This article includes the results of several experiments that demonstrate various machine learning/training concepts. These experiments were conducted using data available in the Watson Assistant Content Catalog. Consider replicating these experiments with your own data if you are struggling to identify the best approach for your use case.

The Router Approach

The router pattern is a hierarchical configuration in which an utterance is initially classified to the most likely topic or domain, and then “routed” to a sub-skill for refined intent classification (and usually to deliver the associated response).
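
To make the flow concrete, here is a rough sketch of what a router-pattern orchestrator might look like using the Watson Assistant V1 Python SDK. The workspace IDs and the domain-to-sub-skill mapping are placeholders, and the router skill is assumed to return a domain name as its top intent.

```python
# A rough router-pattern sketch using the Watson Assistant V1 Python SDK.
# The workspace IDs and the domain-to-sub-skill mapping are placeholders; the router
# skill is assumed to return a domain name as its top intent.
from ibm_watson import AssistantV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

assistant = AssistantV1(version='2020-04-01', authenticator=IAMAuthenticator('YOUR_API_KEY'))
assistant.set_service_url('https://api.us-south.assistant.watson.cloud.ibm.com')

ROUTER_SKILL = 'router-workspace-id'
SUB_SKILLS = {                     # keys must match the router skill's domain intents
    'COVID': 'covid-workspace-id',
    'Insurance': 'insurance-workspace-id',
    'Mortgage': 'mortgage-workspace-id',
}

def route(utterance):
    # Step 1: classify the utterance against the router skill to pick a domain.
    router_resp = assistant.message(
        workspace_id=ROUTER_SKILL, input={'text': utterance}
    ).get_result()
    if not router_resp['intents']:
        return None  # no domain matched; hand off to a catch-all / anything_else flow
    top_domain = router_resp['intents'][0]['intent']

    # Step 2: send the same utterance to that domain's sub-skill for fine-grained intents.
    return assistant.message(
        workspace_id=SUB_SKILLS[top_domain], input={'text': utterance}
    ).get_result()
```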

If you choose to implement the router pattern, you will need to decide how to train the router skill. You have two options:

  • Option 1 — import all intents from each sub-skill “as-is”
  • Option 2 — merge all intents from each sub-skill under a single intent (per sub-skill) before importing

I ran two experiments to test these options.

Experiment Setup for Option 1:

1. Add several domains from the Content Catalog (For my experiment, I added Bot Control, General, COVID, Insurance, Mortgage, and Telco)

2. Export the entire training set and select a percentage to be permanently removed as a blind test set (I randomly selected 400 examples to be withheld as blinds)

3. Import the remaining examples (the training set) into an empty skill

For my option 1 experiment, this resulted in a training set containing 1445 examples under 95 separate intents.
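
If you want to script the blind-set split from step 2, a minimal sketch might look like this. It assumes the exported data is a headerless two-column CSV of example text and intent; the file names are placeholders.

```python
# Randomly withhold 400 examples as a blind test set; keep the rest for training.
# Assumes a headerless two-column CSV (example text, intent), as produced by a skill export.
import csv, random

with open('all_examples.csv', newline='') as f:
    rows = list(csv.reader(f))

random.seed(42)                      # make the split reproducible
random.shuffle(rows)
blind, train = rows[:400], rows[400:]

for name, subset in (('blind_set.csv', blind), ('train_set.csv', train)):
    with open(name, 'w', newline='') as f:
        csv.writer(f).writerows(subset)
```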

Experiment Setup for Option 2:

1. Update the training set to combine all examples for the same domain under a single intent

2. Similarly, update the “golden intent” in the blind test set

For my option 2 experiment, this resulted in a training set containing 1445 examples under 6 separate intents.
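
The Option 2 transformation is just a relabeling pass over the same CSV. Here is a rough sketch; the prefix-to-domain mapping is illustrative and would need to match your own intent naming scheme.

```python
# Collapse every intent in the training data to a single domain-level intent.
# The prefix-to-domain mapping is illustrative; adjust it to your intent naming scheme.
import csv

DOMAIN_BY_PREFIX = {
    'Covid': 'COVID',
    'Insurance': 'Insurance',
    'Mortgage': 'Mortgage',
    'Telco': 'Telco',
    'General': 'General',
    'Bot_Control': 'Bot_Control',
}

def to_domain_intent(intent_name):
    for prefix, domain in DOMAIN_BY_PREFIX.items():
        if intent_name.startswith(prefix):
            return domain
    return intent_name  # leave anything unrecognized untouched

with open('train_set.csv', newline='') as f:
    rows = [(text, to_domain_intent(intent)) for text, intent in csv.reader(f)]

with open('router_train_set.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)
```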

I like to use the WA-Testing-Tool for most of my performance testing, and analyzing the results is easy. Since all of the questions are considered "in-domain", we simply divide the true positives (the utterances scored as "correct") by the total number of questions to get an overall accuracy result.
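
If you prefer to script that calculation yourself, a minimal sketch might look like the following. The column names are assumptions, not necessarily the tool's exact output headers.

```python
# Overall accuracy for an all-in-domain test set: true positives / total questions.
# Column names below are assumptions; map them to your test tool's actual output headers.
import csv

with open('test_output.csv', newline='') as f:
    rows = list(csv.DictReader(f))

correct = sum(1 for r in rows if r['predicted intent'] == r['golden intent'])
print(f'Overall accuracy: {correct / len(rows):.1%}')
```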

Router Experiment Results

Option 1 (leaving the intents split) resulted in 75% overall accuracy. The clear winner for this data set is the option 2 method, with an overall accuracy of 92%. This result makes sense: at this stage we just want to identify the general topic, and the sub-skill handles the fine-grained classification.

What are the trade-offs? To get the better performance outcome demonstrated in the second option, the training data from each sub-skill must undergo an additional transformation before it is loaded into the router skill: the examples must be merged into a single intent. This is a pretty minor trade-off considering the improved accuracy.

The Spray Approach

The Spray approach is a flat configuration in which an utterance is sent to all skills simultaneously and the skill returning the highest confidence is selected to continue the interaction. In this type of solution, we want each skill to be good at identifying utterances that belong to its domain. Ideally, each skill would also reject utterances which do not belong in the domain.
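
A spray orchestrator might look something like the sketch below, again using the V1 Python SDK with placeholder workspace IDs. The skills are called in parallel, since the orchestrator cannot pick a winner until every skill has responded (more on that under the trade-offs).

```python
# A rough spray-pattern sketch: send the utterance to every domain skill in parallel
# and keep the response with the highest top-intent confidence. Workspace IDs are placeholders.
from concurrent.futures import ThreadPoolExecutor
from ibm_watson import AssistantV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

assistant = AssistantV1(version='2020-04-01', authenticator=IAMAuthenticator('YOUR_API_KEY'))
assistant.set_service_url('https://api.us-south.assistant.watson.cloud.ibm.com')

SKILLS = ['botcontrol-general-ws-id', 'covid-ws-id', 'insurance-ws-id',
          'mortgage-ws-id', 'telco-ws-id']

def ask(workspace_id, utterance):
    return assistant.message(workspace_id=workspace_id, input={'text': utterance}).get_result()

def spray(utterance):
    # The orchestrator has to wait on every skill, so issue the calls concurrently.
    with ThreadPoolExecutor(max_workers=len(SKILLS)) as pool:
        responses = list(pool.map(lambda ws: ask(ws, utterance), SKILLS))
    # A skill trained with counterexamples should return no intent for out-of-domain input.
    candidates = [r for r in responses if r['intents']]
    if not candidates:
        return None  # no skill claimed the utterance; fall through to a catch-all response
    return max(candidates, key=lambda r: r['intents'][0]['confidence'])
```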

There are two options for training your skills under this approach: train your skills without counterexamples or train your skills with counterexamples.

I ran ten experiments to demonstrate the benefits and trade-offs of using counterexamples in the individual domain skills.

Experiment Setup

The setup for each experiment involved creating a separate skill for each domain from my original train set. (I used the same data from my train/test sets in the ‘Router Option 1’ experiment). I trained five skills as follows:

  • Bot Control + General Intents (310 examples / 19 intents)
  • COVID (510 examples / 23 intents)
  • Insurance (185 examples / 12 intents)
  • Mortgage (107 examples / 20 intents)
  • Telco (333 examples / 20 intents)

I ran my blind set against each skill to get a baseline performance.

Next, I added counterexamples to each domain skill:

  • Bot Control + General Intents (Add 1135 counterexamples from COVID, Insurance, Mortgage, and Telco)
  • COVID (Add 935 counterexamples from Bot Control, General, Insurance, Mortgage, and Telco)
  • Insurance (Add 1260 counterexamples from Bot Control, COVID, General, Mortgage, and Telco)
  • Mortgage (Add 1338 counterexamples from Bot Control, COVID, General, Insurance, and Telco)
  • Telco (Add 1111 counterexamples from Bot Control, COVID, General, Insurance, and Mortgage)

Recall that for the router experiments, we were mainly concerned with true positives — this is because the entire test set was considered “in domain”.

For the spray experiments, a few additional steps need to be taken on the test output files. First, you need to identify which questions are “in domain” for any given test. Next, you need to calculate the true negatives and false positives for each test.

  • True positive: an in-domain utterance that matches the correct intent
  • True negative: an out-of-domain utterance that does not match any trained intent
  • False positive: an out-of-domain utterance that matches a trained intent
  • False negative: an in-domain utterance that does not match the correct intent
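
Scoring that breakdown can be scripted once each row of the test output carries the golden intent (or an out-of-domain marker) and the predicted intent. A minimal sketch, with assumed column names and an assumed out-of-domain marker:

```python
# Classify each test result as TP / TN / FP / FN for a single domain skill, then
# compute adjusted (overall) accuracy. Column names and the out-of-domain marker
# are assumptions; adapt them to your test tool's actual output format.
import csv
from collections import Counter

counts = Counter()
with open('spray_test_output.csv', newline='') as f:
    for row in csv.DictReader(f):
        in_domain = row['golden intent'] != 'OUT_OF_DOMAIN'
        predicted = row['predicted intent']            # empty string when nothing matched
        if in_domain:
            counts['TP' if predicted == row['golden intent'] else 'FN'] += 1
        else:
            counts['FP' if predicted else 'TN'] += 1

total = sum(counts.values())
print(dict(counts))
print(f"Adjusted accuracy: {(counts['TP'] + counts['TN']) / total:.1%}")
```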

Spray Experiment Results

The accuracy for our in-domain utterances tended to fall 3–5 points between the baseline skill and the counterexamples version. However, the false positives consistently shifted to true negatives, resulting in a significantly higher adjusted (overall) accuracy.

Let’s take a deeper look into what’s going on. For the utterance “Have you been well?”, we want it to match the intent ‘General_Greetings’ under Domain Skill 1. In the Baseline Solution, our utterance is recognized in four of the five domains, and the highest confidence is returned from our COVID domain, which would result in an incorrect response. After adding the data from Domain Skill 1 (Bot Control and General) as counterexamples in all of the other domains, only Domain Skill 1 returned an intent, making this an easy choice for our orchestrator.

What are the trade-offs?

As shown in the results above, you may see a slight decrease in the accuracy of in-domain utterances when counterexamples are added. (If the decrease is significant, this could indicate overlap between domains or insufficient training in that particular skill.) Also, training each skill with counterexamples is a bit more time consuming. The training effort grows with each skill that is added, as every other skill must be updated.

Another thing to be aware of is that counterexamples can only be managed through the API (as of this writing), and the ‘create counterexample’ operation is rate-limited to 1000 requests per 30 minutes. (I used a simple shell script which loops through the lines in a file, but I had to split my files and take a short break when my counterexamples exceeded 1000.)
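
If you would rather use Python than a shell script, a minimal sketch of that loop might look like this, pausing once the request window is used up. The workspace ID and file name are placeholders.

```python
# Load counterexamples from a text file (one utterance per line), pausing to stay under
# the 1000-requests-per-30-minutes limit. The workspace ID and file name are placeholders.
import time
from ibm_watson import AssistantV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

assistant = AssistantV1(version='2020-04-01', authenticator=IAMAuthenticator('YOUR_API_KEY'))
assistant.set_service_url('https://api.us-south.assistant.watson.cloud.ibm.com')

WORKSPACE_ID = 'insurance-workspace-id'

with open('insurance_counterexamples.txt') as f:
    lines = [line.strip() for line in f if line.strip()]

for i, text in enumerate(lines, start=1):
    assistant.create_counterexample(workspace_id=WORKSPACE_ID, text=text)
    if i % 1000 == 0:
        time.sleep(30 * 60)  # wait out the rate-limit window before continuing
```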

One additional trade-off consideration for the spray approach is latency. Your orchestrator will need to wait for all skills to respond before selecting the highest result.

The Waterfall Approach

The waterfall approach is similar to the router approach in that it is somewhat hierarchical. However, the primary classifier is trained on the main business use case. If an utterance does not match with good confidence in the main use case, the skill would instruct the orchestrator to “fall back” to a secondary skill. This would most commonly be accomplished via an “anything_else” node at the dialog root level of the main skill. (The fallback skill would then have an “anything_else” node to handle inputs that are not understood by either skill.)
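
A sketch of that fallback logic is shown below. It assumes, as a convention rather than anything Watson Assistant provides out of the box, that the primary skill's anything_else node sets a context flag (here called fallback) telling the orchestrator to try the secondary skill; the workspace IDs are placeholders.

```python
# A rough waterfall-pattern sketch: try the primary skill first and fall back to a
# secondary skill when the primary hits its anything_else node. The 'fallback' context
# flag is a convention assumed to be set by that node; workspace IDs are placeholders.
from ibm_watson import AssistantV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

assistant = AssistantV1(version='2020-04-01', authenticator=IAMAuthenticator('YOUR_API_KEY'))
assistant.set_service_url('https://api.us-south.assistant.watson.cloud.ibm.com')

PRIMARY_SKILL = 'primary-workspace-id'
FALLBACK_SKILL = 'fallback-workspace-id'

def waterfall(utterance):
    primary = assistant.message(workspace_id=PRIMARY_SKILL, input={'text': utterance}).get_result()
    if not primary.get('context', {}).get('fallback'):
        return primary  # the primary skill handled the utterance
    return assistant.message(workspace_id=FALLBACK_SKILL, input={'text': utterance}).get_result()
```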

As for training the skills, you would probably see the best performance by adding counterexamples from the fallback skill into the primary skill. It isn’t necessary to train the fallback skill with counterexamples from the primary skill, though doing so won’t hurt: it can help force false negatives from the primary skill to hit the fallback’s ‘anything_else’ node. (But that is a training/confidence issue you really should address back in the primary skill.)

Refer to the experiment results under the spray approach (or run a similar experiment using your own data).

Ongoing Maintenance Considerations

Whichever approach you go with, you will need to have a process in place for keeping your skills updated. For the router approach, you will need to feed sub-skill updates into the main skill. For the spray approach, you will need to feed each domain skill with counterexamples from all other skills. For waterfall, you will need to update the primary skill with counterexamples from the fallback. All of these updates could be automated using the Watson Assistant APIs.
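
For example, a router-skill refresh could start by exporting each sub-skill over the API and re-importing its examples under the merged domain intent. A rough sketch of the export step (the workspace ID is a placeholder; the re-import could reuse the domain-merge step sketched earlier):

```python
# Export a sub-skill's intents and user examples so they can be merged into the router
# skill. The workspace ID is a placeholder; the re-import can reuse the domain-merge
# step sketched in the router section.
from ibm_watson import AssistantV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

assistant = AssistantV1(version='2020-04-01', authenticator=IAMAuthenticator('YOUR_API_KEY'))
assistant.set_service_url('https://api.us-south.assistant.watson.cloud.ibm.com')

export = assistant.get_workspace(workspace_id='covid-workspace-id', export=True).get_result()
examples = [(ex['text'], intent['intent'])
            for intent in export['intents']
            for ex in intent['examples']]
print(f"Exported {len(examples)} examples across {len(export['intents'])} intents")
```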

It is also important to have a person or small team of “gatekeepers” who will ensure that any sub-skill or domain skill does not contain conflicts or overlap. For example — don’t include a “Hello” intent in more than one skill.

Conclusion

You can provide a robust, natural language chat experience using multiple Watson Assistant skills. Select the approach that best meets your goals and maintenance capacity. Make sure you have a plan to coordinate the ongoing training of your skills to get the best performance out of your solution.

If you would like help in building or improving a conversational solution, reach out to IBM Data and AI Expert Labs and Learning.

Special thanks to the reviewers: Victor Povar and Andrew Freed

Cari Jacobs is a Cognitive Engineer at IBM and has been working with companies to implement Watson AI solutions since 2014. Cari enjoys kayaking, science fiction, and Brazilian jiu-jitsu.
