Best Practices for Building a Natural Language Classifier / Chatbot, Part One

Cari Jacobs
Mar 12, 2020 · 7 min read
Photo by Dylan Lu on Unsplash

Natural language classification is a component in many AI-powered solutions such as chatbots, virtual agents, and agent assistants. IBM’s Watson Assistant and Watson Natural Language Classifier (NLC) products use machine learning techniques to extract meaning from a user request. However, the underlying classifier must be trained in your business domain and use case. This training step is often poorly understood, yet it is probably the most crucial part of an AI project because it is the foundation on which the entire solution rests.

Natural language classifiers are built using a supervised machine-learning approach, which means that humans annotate data by mapping example inputs to classification labels (a.k.a. intents). This results in a training set (a.k.a. ground truth), which is ingested into the model. Once trained, your model can begin making predictions for inputs it has never seen.

In practical terms, this means that if you train a classifier with questions about a business’s operating hours, such as:

“When do you open?”

“What time do you close?”

“Are you open today?”

“What are your hours?”

The classifier will be able to understand other questions as having the same goal or meaning, even if those questions are not worded exactly the same, such as:

“What time does the store open?”

“Are you open on the weekend?”

“How late do you stay open?”
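To make the idea of a training set concrete, here is a minimal sketch of how the labeled examples above might be held in code before they are loaded into a classifier. The intent name `store_hours` and the helper `group_by_intent` are my own illustrative assumptions, not part of any Watson API.

```python
from collections import defaultdict

# Hypothetical ground truth: (utterance, intent) pairs labeled by humans.
# The intent name "store_hours" is an illustrative assumption.
GROUND_TRUTH = [
    ("When do you open?", "store_hours"),
    ("What time do you close?", "store_hours"),
    ("Are you open today?", "store_hours"),
    ("What are your hours?", "store_hours"),
]


def group_by_intent(pairs):
    """Return {intent: [utterances]}, trimming whitespace and dropping exact duplicates."""
    grouped = defaultdict(list)
    for utterance, intent in pairs:
        text = utterance.strip()
        if text and text not in grouped[intent]:
            grouped[intent].append(text)
    return dict(grouped)


training_set = group_by_intent(GROUND_TRUTH)
for intent, examples in training_set.items():
    print(f"{intent}: {len(examples)} examples")
```

Keeping the ground truth as plain utterance/intent pairs makes it easy to review in a spreadsheet with your SMEs and to count examples per intent later on.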

For general domains with a very basic use case, this task is quite easy: a simple chatbot could be deployed within a day. However, plenty of complex business scenarios require a higher degree of planning and effort. Most enterprise-level conversational solutions take four weeks to six months to deploy a minimum viable product (MVP), and a significant portion of that time will be dedicated to training the classifier.

The following best practices are provided as guidance for building a natural language classifier that will perform well enough to deliver real business value in your AI solutions. Many of the examples cited refer to Watson Assistant, but the general principles apply to NLC as well.

Best Practice #1: Establish The “Ground Truth” Team

Photo by S O C I A L . C U T on Unsplash

Before we can talk about what to do, we need to establish who’s going to do it. What resources and skillsets does your project need? Who is responsible for training, testing, and updating the machine-learning model?

For a proof-of-concept or small pilot, the above tasks may be owned by a single person, but most enterprise use cases will require about 2–5 people contributing to this effort.

Generally, a project will require one or two people who have some data-science skills and understand the goals for the business use case. This role should act as the gatekeeper for the training set. They will need to be able to run performance experiments and interpret the results. Because artificial intelligence is still a relatively new business initiative in most organizations, there may not be an existing job role to own this task. For our discussion, let’s call this role a Data Engineer.

The other half of this team should consist of subject matter experts (SMEs) who represent the business. They are likely to be the primary stakeholders and main beneficiaries of the value that the AI solution will deliver. For example, if you are building a virtual agent chatbot that is intended to reduce call center volume, the SMEs should be senior-level customer care agents or team leads. At least one SME is necessary; two or three may be needed for a complex domain. These individuals will identify candidate training utterances and assign (label) the appropriate intent classifications.

Best Practice #2: Plan For Several “Ground Truth” Working Sessions

Photo by Helloquence on Unsplash

In my experience, the highest-performing models are a result of the team working together to achieve “inter-annotator agreement”. In other words, the guidelines for labeling data must be commonly understood and consistently applied. Unfortunately, this isn’t a good fit for a “divide and conquer” approach. A disagreement between your experts may go unnoticed or have a low impact on day-to-day human operations, but this will limit the effectiveness of your classifier. These disagreements must be exposed and resolved as you complete your first training cycle.
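If you want to put a number on agreement between two SMEs, one common statistic is Cohen's kappa, which corrects raw agreement for the agreement you would expect by chance. I am not prescribing a specific metric here, so treat the following as a sketch under that assumption. The labels and the function names are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same utterances in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of utterances.")
    n = len(labels_a)
    # Fraction of utterances where the two annotators chose the same intent.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (observed - expected) / (1 - expected)


def disagreements(utterances, labels_a, labels_b):
    """List the utterances the two annotators labeled differently, for group review."""
    return [(u, a, b) for u, a, b in zip(utterances, labels_a, labels_b) if a != b]


# Illustrative labels only; intent names are assumptions.
sme_1 = ["hours", "hours", "location", "returns", "returns", "hours"]
sme_2 = ["hours", "location", "location", "returns", "hours", "hours"]
kappa = cohens_kappa(sme_1, sme_2)
print(f"kappa = {kappa:.3f}")
```

The list returned by `disagreements` is often more useful than the score itself, because it gives the team a concrete agenda for the next working session.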

So what might your approach look like in practice? It may seem counter-intuitive, but the fastest path to a high-performing model is for the data engineers and business SMEs to meet for several working sessions and walk through the potential training candidates as a group. SMEs need to reach an internal agreement on the meaning behind each utterance. Their reasoning should be communicated and understood across the team.

The SMEs should make recommendations on whether or not to include training examples and suggest the appropriate intent.

The data engineer should validate these recommendations against intent definitions and best practices (more on this later). Next, they should run experiments to determine if the proposed changes have the intended effect on model performance.
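One way such an experiment could be scored, assuming you hold out a labeled test set and collect the model's predictions before and after a training change, is to compare precision and recall per intent. This is a sketch of that assumption, not a description of any Watson tooling. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def per_intent_scores(expected, predicted):
    """Compute precision and recall per intent from parallel lists of labels."""
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for gold, guess in zip(expected, predicted):
        if gold == guess:
            true_pos[gold] += 1
        else:
            false_neg[gold] += 1   # The correct intent was missed.
            false_pos[guess] += 1  # The wrong intent was predicted.
    scores = {}
    for intent in set(expected) | set(predicted):
        p_den = true_pos[intent] + false_pos[intent]
        r_den = true_pos[intent] + false_neg[intent]
        scores[intent] = {
            "precision": true_pos[intent] / p_den if p_den else 0.0,
            "recall": true_pos[intent] / r_den if r_den else 0.0,
        }
    return scores


# Illustrative test set and predictions from two hypothetical model versions.
expected = ["hours", "hours", "location", "returns"]
before = per_intent_scores(expected, ["hours", "location", "location", "hours"])
after = per_intent_scores(expected, ["hours", "hours", "location", "returns"])
for intent in sorted(before):
    print(intent, before[intent], "->", after[intent])
```

Looking at scores per intent, rather than a single overall accuracy, helps show whether a change that fixed one intent quietly hurt another.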

Best Practice #3: Select The Right Scope For Your Solution

Photo by Ricardo Arce on Unsplash

Successful virtual agents require a user-centered design approach. It is essential to discover everything you can about your target end-users and how they will want or need to interact with your brand over a conversational platform.

What scenarios will drive users to engage with a conversational solution? What questions or requests will be asked of the solution? The answers to these questions will dictate how you train your classifier.

Many companies start with an idea that a chatbot should be built based on their website’s Frequently Asked Questions (FAQ). An FAQ may be a good starting point as a concept, but it has two inherent risks:

1) Lack of evidence that users actually will ask these questions, which can result in wasted effort and resources

2) Lack of sufficient, representative training data

Unfortunately, the FAQ isn’t a list of questions that are frequently asked — it is a collection of answers that are frequently provided. Sound pretty nuanced? The difference is this: the list of “FAQ questions” maps one official, canonized version of each question to the corresponding answer. When we build a natural language classifier, we are looking for multiple variations of how each particular type of request might be worded.

If your data collection strategy provided you with representative data, identifying the appropriate scope should be pretty straightforward: label your utterances and take a look at the volume distribution. Your chart will probably look something like this:

This type of chart is commonly referred to as “long tail”, or “fat head/long tail”.

Pick a cutoff point for minimum training examples per intent. Anything to the left of that cutoff (the high-volume intents in the “fat head”) should be included because the evidence indicates that these questions are likely to be asked.
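The counting and cutoff step can be sketched in a few lines. The intent names, the volumes, and the cutoff of 2 below are invented purely for illustration; your own cutoff should come from your data.

```python
from collections import Counter

# Hypothetical labeled utterances as (text, intent) pairs.
# Intent names and volumes are invented for illustration only.
VOLUMES = {"store_hours": 5, "store_location": 3, "return_policy": 2, "gift_wrap": 1}
labeled = [
    (f"utterance {i} about {intent}", intent)
    for intent, count in VOLUMES.items()
    for i in range(count)
]


def select_scope(labeled_utterances, min_examples):
    """Rank intents by volume and split them at the minimum-examples cutoff."""
    counts = Counter(intent for _, intent in labeled_utterances)
    ranked = counts.most_common()  # Highest-volume intents first.
    in_scope = [(intent, n) for intent, n in ranked if n >= min_examples]
    out_of_scope = [(intent, n) for intent, n in ranked if n < min_examples]
    return in_scope, out_of_scope


in_scope, out_of_scope = select_scope(labeled, min_examples=2)
coverage = sum(n for _, n in in_scope) / len(labeled)

# A plain-text version of the long-tail chart.
for intent, n in in_scope + out_of_scope:
    print(f"{intent:<15} {'#' * n}")
print(f"In-scope intents cover {coverage:.0%} of collected utterances")
```

The coverage figure is a quick sanity check: if the in-scope intents cover only a small share of what users actually asked, the cutoff is probably too strict.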

Keep in mind that if you choose to omit training for questions that are frequently asked, your solution will probably default to a standard “Don’t Understand” response such as, “I’m sorry, I didn’t understand your question, please re-phrase your request.” No amount of re-phrasing is going to deliver a satisfactory answer for the user. This experience will probably leave them extremely frustrated. The solution will gain more trust and confidence from your users if it can acknowledge the topic, even if it has to deflect or escalate the user.

If you don’t have the option of following a data-driven approach for your initial launch, it is all the more important to plan for iterative, fast-follow improvement cycles (e.g. Agile sprints). Plan to review your logs immediately to determine the gap between what you hypothesized users would ask and what they actually ask.
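A log review along these lines might start with something like the sketch below. The record format (a dict with `text`, `intent`, and `confidence` keys), the function name, and the 0.5 threshold are all assumptions for illustration; adapt them to whatever your platform actually exports.

```python
def review_logs(log_records, hypothesized_intents, min_confidence=0.5):
    """Compare trained intents against what users asked, given hypothetical log records."""
    # Intents that were matched with reasonable confidence at least once.
    observed = {r["intent"] for r in log_records if r["confidence"] >= min_confidence}
    # Utterances the classifier could not place; candidates for new or retrained intents.
    unmatched = [r["text"] for r in log_records if r["confidence"] < min_confidence]
    return {
        "never_asked": sorted(set(hypothesized_intents) - observed),
        "unmatched_utterances": unmatched,
        "unmatched_rate": len(unmatched) / len(log_records) if log_records else 0.0,
    }


# Illustrative log records; all values are invented.
logs = [
    {"text": "What time do you close?", "intent": "store_hours", "confidence": 0.93},
    {"text": "Where are you located?", "intent": "store_location", "confidence": 0.88},
    {"text": "Can I pay with a gift card?", "intent": "store_hours", "confidence": 0.21},
    {"text": "Do you price match?", "intent": "return_policy", "confidence": 0.18},
]
report = review_logs(logs, ["store_hours", "store_location", "return_policy"])
print(report)
```

Intents in `never_asked` are the ones you hypothesized but users have not yet confirmed, while `unmatched_utterances` is the raw material for your next ground truth working session.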

Summary, Part One

At this point, you should have a good understanding of how to establish a team, plan for working sessions, and select the scope for your solution. In the second part of this series, we will get into some specific guidelines for working with your data.

Continue to Part Two…

Special thanks to the reviewers: Andrew Freed, Leo Mazzoli, and Zach Eslami

Cari Jacobs is a Cognitive Engineer at IBM and has been working with companies to implement Watson AI solutions since 2014. Cari enjoys kayaking, science fiction, and Brazilian jiu jitsu.

IBM Data and AI

