Image for post
Photo by Jeremy Thomas

A bottom-up approach to NLU

Alex Dubois
Nov 26, 2020 · 4 min read

An important aspect of conversation design is understanding your customers’ intents. What are your customers asking? What problems do they have?

To answer these questions, access to real conversational data is critical. Without it, you’re playing a guessing game: you can brainstorm the most common intents with your team, but correctly addressing the long tail specific to your domain is next to impossible.

Image for post
Image by Author

However, access to conversational data isn’t enough: without proper tooling, you’ll find yourself manually sifting through conversation transcripts with no idea where to start or when to stop, or which utterances express a valid intent and which are noise.

The typical approach to this problem has been to apply unsupervised clustering techniques.

Image for post
Image taken from IBM Watson (reference)

There are two clear problems with unsupervised clustering as an approach to discovery and training of intents:

  • The first, obvious problem is that clusters will often overlap (see image above) and represent similar or identical intents, so manual intervention is required to disambiguate them.
  • A less obvious but more fundamental problem is that unsupervised clustering techniques say nothing about how abstract or specific the intent generated from a given cluster should be.
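
To make the second problem concrete, here is a minimal sketch of what a clustering step hands back. It is a deliberately naive, hypothetical stand-in (greedy grouping on word overlap), not the algorithm used by any particular product; the function names and the `threshold` value are assumptions chosen for illustration.

```python
def tokens(utterance):
    """Lowercase an utterance and split it into a set of words."""
    return set(utterance.lower().replace("?", "").split())


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def greedy_cluster(utterances, threshold=0.3):
    """Put each utterance in the first cluster whose seed it resembles.

    A toy replacement for a real clustering algorithm (assumption).
    """
    clusters = []  # each cluster is a list; element 0 acts as the seed
    for utterance in utterances:
        for cluster in clusters:
            if jaccard(tokens(utterance), tokens(cluster[0])) >= threshold:
                cluster.append(utterance)
                break
        else:
            # No existing cluster was similar enough: start a new one.
            clusters.append([utterance])
    return clusters


utterances = [
    "how can I transfer funds to my checking account?",
    "how do I transfer funds to my savings account?",
    "my card was declined",
    "my card was declined again today",
]

clusters = greedy_cluster(utterances)
for cluster_id, members in enumerate(clusters):
    print(cluster_id, members)
```

The result is a set of numbered groups. Nothing in it says whether the first group should be named “has a question”, “about bank account” or “transfers”; that decision is still left to a human.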

For example, a cluster with utterances similar to “how can I transfer funds to my checking account?” could be assigned any one of these three labels, from most abstract to most specific:

  1. Has a question
  2. Has a question > about bank account
  3. Has a question > about bank account > transfers

Determining which label to apply is a non-trivial problem, as the right level of abstraction for any given intent depends on whether there is sufficient data to accurately train the intent at that level of abstraction.

This is a classic chicken-and-egg problem: you need labeled data in order to correctly label your data.
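
The dependence on data volume can be sketched in a few lines. The example below assumes labels already exist, which is precisely what is missing at the start, so it only illustrates why the “right” level changes with the amount of data. The sample utterances, the tuple representation of an intent path, and the `min_examples` threshold are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical labeled data: each utterance carries its most specific known
# intent, written as a path from abstract to specific.
labeled = [
    ("how can I transfer funds to my checking account?",
     ("has a question", "about bank account", "transfers")),
    ("can I move money between my accounts?",
     ("has a question", "about bank account", "transfers")),
    ("what is my account balance?",
     ("has a question", "about bank account", "balance")),
    ("what are your opening hours?",
     ("has a question", "about branches")),
]


def count_by_prefix(examples):
    """Count how many examples fall under every prefix of every path."""
    counts = Counter()
    for _, path in examples:
        for depth in range(1, len(path) + 1):
            counts[path[:depth]] += 1
    return counts


def trainable_label(path, counts, min_examples):
    """Return the most specific prefix of `path` with enough examples."""
    for depth in range(len(path), 0, -1):
        if counts[path[:depth]] >= min_examples:
            return path[:depth]
    return None  # not even the top-level intent has enough data


counts = count_by_prefix(labeled)
path = ("has a question", "about bank account", "transfers")
print(" > ".join(trainable_label(path, counts, min_examples=3)))
```

With only two “transfers” examples and a threshold of three, the most specific label that can be backed by data is “has a question > about bank account”; lowering the threshold or adding data moves the answer one level deeper.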

Bottom-up approach to intent discovery & data labeling

Bottom-up labeling applies the tried and tested divide-and-conquer approach to this problem, with great success. Instead of expecting a human or unsupervised algorithm to correctly “predict” what intents and abstractions exist in the data, it provides a simple framework to iteratively discover this information.

The bottom-up “algorithm” is simple:

  • Step 1: Identify a few very high-level intents that capture most (if not all) of the meaning in your data (in our experience, “has a question” and “has a problem” are great starting points).
  • Step 2: Label your conversation / utterance data, assigning each utterance to one of these high-level intents (the cognitive load at this step is minimal, since the decision boils down to picking one of a handful of existing intents).

The outcome of this step is very valuable in itself, as it provides high-quality, domain-specific training data to classify users who “have a question” or “have a problem”.

Image for post
Image by Author
  • Step 3: For every intent (e.g. “has a question”), identify more specific “sub-intents” that its training examples can fall into (e.g. “has a question > about credit account”, “has a question > about account settings”).
  • Step 4: Re-assign the top-level intents’ training data to the more specific sub-intents you’ve just created.
Image for post
Image by Author
  • Repeat steps 3 & 4 (i.e. divide and conquer).
Image for post
Image by Author

Every step produces training data for classifiers that can recognize increasingly specific intents: this is one of the major advantages of this approach.
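
The steps above can be sketched with a very small data structure. This is a minimal illustration under stated assumptions, not HumanFirst’s implementation: the sample utterances, the tuple representation of an intent path, and the `refine` and `training_data` helpers are all hypothetical.

```python
from collections import defaultdict

# Steps 1 and 2: every utterance gets one of a few high-level intents.
labels = {
    "how can I transfer funds to my checking account?": ("has a question",),
    "how do I change my password?": ("has a question",),
    "my card was declined": ("has a problem",),
}


def refine(labels, parent, assignments):
    """Steps 3 and 4: move utterances labeled `parent` into sub-intents.

    `assignments` maps an utterance to the name of its new sub-intent.
    Utterances missing from `assignments` stay at the parent level.
    """
    refined = dict(labels)
    for utterance, sub_intent in assignments.items():
        if labels.get(utterance) != parent:
            raise ValueError(f"{utterance!r} is not labeled {parent!r}")
        refined[utterance] = parent + (sub_intent,)
    return refined


def training_data(labels, depth):
    """Group utterances by their intent path truncated to `depth` levels."""
    data = defaultdict(list)
    for utterance, path in labels.items():
        data[path[:depth]].append(utterance)
    return dict(data)


# One pass of steps 3 and 4 on "has a question".
labels = refine(labels, ("has a question",), {
    "how can I transfer funds to my checking account?": "about bank account",
    "how do I change my password?": "about account settings",
})

# Each depth yields a training set for a classifier at that level.
for depth in (1, 2):
    for path, examples in training_data(labels, depth).items():
        print(depth, " > ".join(path), len(examples))
```

Because each label is a path, truncating it to a given depth recovers the training set for every earlier, more abstract level, so no labeling work is lost when intents are split further.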

What’s the catch?

If this solution to labeling and training data seems too obvious, it’s because it is: divide-and-conquer has long been used to break problems down into manageable chunks; it just hasn’t been made easily available for data labeling and intent discovery use cases yet.

The main reason is tooling and resources: the labeling and refactoring workflows required to make this efficient and manageable at scale are costly to build, and only the more sophisticated companies have done so. These companies are able to charge customers thousands of dollars to build and train intents from unstructured data.

There are, however, some solutions focused on democratizing this approach: HumanFirst is one of them, and provides one of the first out-of-the-box bottom-up labeling and intent discovery solutions. In our next article, we’ll explore how machine learning and semantic search can accelerate this bottom-up approach. Stay tuned!

HumanFirst Blog

The Hub for Conversational AI Data.
