Best Practices for Building a Natural Language Classifier / Chatbot, Part Two

Cari Jacobs
IBM Data Science in Practice
Mar 15, 2020

In Part One of this series, we covered what it takes to get your team up and running to start building a natural language classifier. Now let’s continue with some specific guidelines for working with your data.

Best Practice #4: Define Each Intent

Intents are purposes or goals that are expressed in a user’s input. It is important to group examples according to what the user’s goal is — from the user’s perspective.

The most common underlying problems in every poorly performing model I’ve encountered are intents that are either too broad or too specific (or a combination of the two). It’s not always easy to know the “right” level of specificity to be covered by an intent. In some domains, there is an unavoidable degree of topic overlap. As a general rule, think of intents as the verb/action part of a sentence.

Real-world example 1: Too Broad

A few years ago, I worked with an auto financial services company to build a conversational agent. The end-users were auto lease and loan customers. When we reviewed initial training data from human-to-human chat logs, we found questions such as “What is my car’s VIN number?”. The business SME determined that the answer to this question would be to display the customer’s contract. Their reasoning was that the vehicle identification number is provided on the contract. They put this example into the intent “view_contract”.

Later, we found questions asking, “How many total payments will I make?”. This was another question that would be answered by displaying the customer’s contract. The SME proposed adding this to the “view_contract” intent. However, we found that, from the customer’s perspective, these questions did not share a goal with the VIN questions. Therefore, these questions should not be trained in the same intent.

This was an example of trying to group questions based on the answer unit (which is really the business perspective).

The classifier’s job is to identify the user’s objective.


Keep in mind that the end-user does not have any knowledge about your back-end business processes. Once we understand what the user needs, we can map that to how the business would actually process the request.
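One way to picture this separation: the classifier identifies the user’s goal, and a separate routing layer maps each goal to the business response that fulfills it. A minimal sketch of that idea (the intent and response names here are hypothetical, not from the actual project):

```python
# Two distinct user goals ("What is my VIN?" vs. "How many payments
# will I make?") are trained as separate intents, but a routing table
# maps both to the same business response: displaying the contract.
INTENT_TO_RESPONSE = {
    "vehicle_info": "display_contract",
    "payment_schedule": "display_contract",
    "update_address": "address_change_form",
}

def route(intent: str) -> str:
    """Map a classified intent to the back-end response that fulfills it."""
    return INTENT_TO_RESPONSE.get(intent, "fallback_clarify")
```

Because the mapping lives outside the training data, the business can change how a request is processed without retraining the model.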

Real-World Example 2: Too Specific

In an auto insurance use case, we came across questions such as: “How much will my cost go up if I add my wife?”, “Will my premium increase if I add my daughter?”, “How much does it cost to include my roommate on my policy?”

In all of these cases, the user’s goal is to determine the cost of adding a driver. We don’t want three separate intents split by the added person’s relationship to the policyholder. One intent for all of these questions should be sufficient, even if the answer will be different for each relationship.

Key Takeaways for Defining Intents

· Questions must be grouped according to expressions of a common goal or objective — from the end user's point of view

· Questions should not be grouped together based only on the fact that they share the same response or outcome

· Once your model can successfully identify the right intents with good confidence, it is quite simple to route multiple intents to a single answer

· With context or other natural language processing strategies (such as entity detection), you can also route a single intent to multiple possible endpoints

· These distinctions can be very subtle; sometimes it takes a few experimentation cycles to find the right balance of depth or breadth for an intent

· Be sure to document what purpose/goals are meant to be classified under each intent in the Intent Description field. This definition needs to address the full range of examples represented in the set and is vital to aid future teach/train cycles
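The last two takeaways come together in the insurance example above: one broad intent covers every “how much to add X?” question, and entity detection supplies the detail needed to tailor the answer. A hedged sketch (the entity values, lookup logic, and response keys are all illustrative; a production system would use your NLP platform’s entity detection rather than this toy keyword lookup):

```python
# Single intent ("add_driver_cost"), multiple endpoints: a relationship
# entity extracted from the utterance decides which response to deliver.
RELATIONSHIPS = {
    "wife": "spouse", "husband": "spouse",
    "daughter": "child", "son": "child",
}

ANSWERS = {
    "spouse": "spouse_rate_response",
    "child": "child_rate_response",
    "other": "generic_quote_response",
}

def detect_relationship(utterance: str) -> str:
    """Toy entity detector: match whole words against known relationships."""
    for word in utterance.lower().split():
        word = word.strip("?.,!")
        if word in RELATIONSHIPS:
            return RELATIONSHIPS[word]
    return "other"

def answer_add_driver(utterance: str) -> str:
    # The intent is already known; the entity picks the endpoint.
    return ANSWERS[detect_relationship(utterance)]
```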


Best Practice #5: Omit single word/phrase examples from training

In general, I advise against the inclusion of any single words or terms in the training set. Without additional words provided in the utterance, it is nearly impossible to determine the user’s goal. There are a few exceptions, such as certain Interactive Voice Response (IVR) scenarios or conversational intents that handle greetings, closings, and other minor chit-chat (“Hi”, “Hello”, “Thanks”, “Thank you”, “Goodbye”, “Help”).

Consider a Human Resources scenario: the term “W2”. This term could show up in examples under multiple intents, such as:

· w2_reprint (“I need to reprint a W2.”)

· employee_status_change (“How do I change an employee from contractor to W2?”)

· w2_delivery (“When will I receive my W2?”)

Adding the example “W2” to any intent will cause that exact utterance to match the intent with 100% confidence. It’s best to omit the single word from training and let Watson Assistant’s Disambiguation feature do its job.

If your logs show that users start out treating the conversation interface as a search engine, another strategy for handling single words is to create a dialog node that detects the exact match input of targeted keywords. You can then deliver an intelligent response while guiding the user to provide an expanded natural language request: “I can answer many questions about W2’s, can you tell me more about what you need?” (Caution: this approach should be applied sparingly — matching input text in your node conditions can run counter to the effort you have invested into training your model.)
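In Watson Assistant this keyword guard would live in a dialog node condition; the same idea expressed in plain Python might look like the following sketch (the keyword list and prompt wording are illustrative):

```python
# Intercept bare keyword "searches" before they reach the classifier,
# prompting the user for a fuller natural language request instead.
KEYWORD_PROMPTS = {
    "w2": "I can answer many questions about W2's. "
          "Can you tell me more about what you need?",
}

def preprocess(utterance: str):
    """Return a clarifying prompt on an exact keyword match, else None."""
    return KEYWORD_PROMPTS.get(utterance.strip().lower())
```

Note the exact-match lookup: a full question like “When will I receive my W2?” falls through to the classifier untouched, which keeps this guard from competing with the trained model.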

Best Practice #6: Assess Volume and Distribution


The Watson Assistant documentation recommends at least 5 examples per intent; however, for many enterprises, domain-specific solutions perform better starting around 15–20 examples.

A large variance in the number of training examples per intent can cause some real problems. Some intents may perform well with just a few examples, while other intents will require more. Finding the right balance can be tricky and time-consuming. Term overlap will often drive the need for additional examples within targeted intents.

When you train your first model, consider keeping your volume distribution within a specified range in order to get a baseline performance reading. For example, target an average of 15 examples per intent, but permit no fewer than 7 and no more than 25 per intent as a starting point.
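A quick audit script can flag intents that fall outside a target range before you train. A minimal sketch using the starting thresholds suggested above (7 minimum, 25 maximum; adjust to your own baseline):

```python
from collections import Counter

# Starting-point thresholds from the guideline above, not hard rules.
MIN_EXAMPLES, MAX_EXAMPLES = 7, 25

def check_distribution(examples):
    """Given (utterance, intent) pairs, return under- and over-trained intents."""
    counts = Counter(intent for _, intent in examples)
    under = {i: n for i, n in counts.items() if n < MIN_EXAMPLES}
    over = {i: n for i, n in counts.items() if n > MAX_EXAMPLES}
    return under, over
```

Running this against each training export makes distribution drift visible early, before an over-trained intent starts swallowing its neighbors.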

After you run your first performance experiment, begin adjusting as necessary based on the accuracy of each intent. Over-trained intents tend to produce more false positives, while under-trained intents can produce too many false negatives.

The worst case of over-training that I have encountered was a set that contained 1000+ examples in a few intents, while other intents had just 10. This classifier struggled to perform.

Even in complex domains, Watson Assistant often excels with intents that contain just 20–50 examples. If you find yourself building an intent with more than 200 examples and your performance still isn’t satisfactory, you may need to revisit your intent definitions and the dialog strategy.


Best Practice #7: Start Small, Iterate, and Incorporate

A solution that does 5 things very well will usually deliver more business value than a solution that does 100 things poorly.

Start small (identify what is minimally viable). Be iterative. Expect to reshuffle your intent classifications a few times. This may cause re-work in dialog nodes. For that reason, focus on getting your initial ground truth in a good state of performance before you start building other complex rules to handle your inputs.

Deployment of your natural language solution is not the end of your project; it is an early milestone.

AI solutions can only improve once they are exposed to real-world interactions. Plan to review your logs. (Several analysis tools are explored in this post.) Begin your next improvement cycle by incorporating representative examples and other key learnings about how your users interact with the solution.

Conclusion

Many companies don’t know where to start on their AI journey. For projects that use natural language classification, it is important to assemble the right team. Understand how your data drives the scope and powers the solution. Incorporate user research and data science techniques to make evidence-based recommendations for improvement. These principles can get you on the right track for building your natural language classifier.

For help in implementing these practices, reach out to IBM Data and AI Expert Labs and Learning.

Special thanks to the reviewers: Andrew Freed, Leo Mazzoli, and Zach Eslami

Cari Jacobs is a Cognitive Engineer at IBM and has been working with companies to implement Watson AI solutions since 2014. Cari enjoys kayaking, science fiction, and Brazilian jiu jitsu.
