A bottom-up approach to NLU

An important aspect of conversation design is understanding your customers’ intents. What are your customers asking? What problems do they have?

Answering these questions requires access to real conversational data. Without it, you’re playing a guessing game: you can brainstorm the most common intents with your team, but correctly addressing the long tail specific to your domain is next to impossible.

However, access to conversational data isn’t enough. Without proper tooling, you’ll find yourself manually sifting through conversation transcripts with no idea where to start, when to stop, or which utterances constitute a valid intent and which are noise.

The typical approach to this problem has been to apply unsupervised clustering techniques.

There are two clear problems with unsupervised clustering as an approach to discovery and training of intents:

  • The first, obvious problem is that clusters often overlap (see image above) and represent similar or identical intents, so manual intervention is required to disambiguate them.
  • A less obvious but more fundamental problem is that unsupervised clustering techniques say nothing about how abstract or specific the intent generated from a given cluster should be.

For example, a cluster with utterances similar to “how can I transfer funds to my checking account?” could be assigned any one of these three labels, from most abstract to most specific:

  1. Has a question
  2. Has a question > about bank account
  3. Has a question > about bank account > transfers
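One way to make these nested labels concrete is to treat each label as a path, where every prefix of the path is itself a valid, more abstract label. The sketch below is only an illustration: the helper names (`parse_label`, `ancestors`) and the use of “ > ” as a separator are assumptions taken from the notation above, not a format the article prescribes.

```python
from typing import List, Tuple

SEPARATOR = " > "  # assumed separator, mirroring the notation in the list above


def parse_label(label: str) -> Tuple[str, ...]:
    """Split a label such as 'a > b > c' into its parts ('a', 'b', 'c')."""
    return tuple(part.strip() for part in label.split(">"))


def ancestors(label: str) -> List[str]:
    """Return every label on the path, from most abstract to most specific."""
    parts = parse_label(label)
    return [SEPARATOR.join(parts[:depth]) for depth in range(1, len(parts) + 1)]


levels = ancestors("Has a question > about bank account > transfers")
for level in levels:
    print(level)
```

Seen this way, the three candidate labels are not competing alternatives but three depths of the same path, which is what makes it possible to start abstract and refine later.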

Determining which label to apply is a non-trivial problem, as the right level of abstraction for any given intent depends on whether there is sufficient data to accurately train the intent at that level of abstraction.
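To illustrate what “sufficient data” could mean in practice, here is a minimal sketch that backs off from a specific label to the most specific ancestor that has enough examples. The threshold `MIN_EXAMPLES`, the function names, and the sample labels are all hypothetical; a real threshold would depend on the classifier and the domain. Note that the sketch presupposes labels already exist, which is exactly the dependency described next.

```python
from collections import Counter
from typing import Iterable, List

MIN_EXAMPLES = 3  # hypothetical threshold, chosen only for this illustration


def split_label(label: str) -> List[str]:
    """Split 'a > b > c' into ['a', 'b', 'c']."""
    return [part.strip() for part in label.split(">")]


def count_by_prefix(labels: Iterable[str]) -> Counter:
    """Count examples at every level: an example of a sub-intent also counts for its ancestors."""
    counts = Counter()
    for label in labels:
        parts = split_label(label)
        for depth in range(1, len(parts) + 1):
            counts[" > ".join(parts[:depth])] += 1
    return counts


def trainable_label(label: str, counts: Counter, min_examples: int = MIN_EXAMPLES) -> str:
    """Return the most specific ancestor of `label` with at least `min_examples` examples."""
    parts = split_label(label)
    for depth in range(len(parts), 0, -1):
        candidate = " > ".join(parts[:depth])
        if counts[candidate] >= min_examples:
            return candidate
    return parts[0]  # fall back to the top-level intent


labels = [
    "has a question > about bank account > transfers",
    "has a question > about bank account > transfers",
    "has a question > about bank account > balance",
    "has a question > about credit card",
]
counts = count_by_prefix(labels)
print(trainable_label("has a question > about bank account > transfers", counts))
print(trainable_label("has a question > about credit card", counts))
```

With only two “transfers” examples, the first call backs off to “has a question > about bank account”; the single “about credit card” example backs off all the way to “has a question”.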

This is a classic chicken-and-egg problem: you need labeled data in order to correctly label your data.

Bottom-up approach to intent discovery & data labeling

Bottom-up labeling applies the tried and tested divide-and-conquer approach to this problem, with great success. Instead of expecting a human or unsupervised algorithm to correctly “predict” what intents and abstractions exist in the data, it provides a simple framework to iteratively discover this information.

The bottom-up “algorithm” is simple:

  • Step 1: Identify a few very high-level intents that can capture most (if not all) of the meaning in your data (in our experience, “has a question” and “has a problem” are great starting points).
  • Step 2: Label your conversation / utterance data, assigning each utterance to one of these high-level intents (the cognitive load at this labeling step is minimal, since the decision boils down to picking one of a handful of existing high-level intents).

The outcome of this step is very valuable in itself, as it provides high-quality, domain-specific training data to classify users who “have a question” or “have a problem”.

  • Step 3: For every intent (e.g. “has a question”), identify more specific “sub-intents” that its training examples can fall into (e.g. “has a question > about credit account”, “has a question > about account settings”).
  • Step 4: Re-assign the top-level intents’ training data to the more specific sub-intents you’ve just created.
  • Repeat steps 3 & 4 (i.e. divide and conquer).
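The steps above can be sketched as a small intent tree in which examples are moved from a parent intent into its sub-intents. This is a minimal sketch under stated assumptions, not any product's implementation: the `Intent` class, the `refine` function, and the keyword-based `human_choice` rule are hypothetical, and the rule merely stands in for the decision a human labeler would make.

```python
from typing import Callable, Dict, List, Optional


class Intent:
    """A node in the intent tree, holding the examples labeled at this level."""

    def __init__(self, name: str, parent: Optional["Intent"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: Dict[str, "Intent"] = {}
        self.examples: List[str] = []

    def path(self) -> str:
        """Full label, e.g. 'has a question > about bank account'."""
        if self.parent is None:
            return self.name
        return self.parent.path() + " > " + self.name

    def add_child(self, name: str) -> "Intent":
        child = Intent(name, parent=self)
        self.children[name] = child
        return child


def refine(intent: Intent, assign: Callable[[str], Optional[str]]) -> None:
    """Steps 3 and 4: create sub-intents and move examples into them.

    `assign` returns a sub-intent name, or None to leave the example on the parent.
    """
    remaining: List[str] = []
    for utterance in intent.examples:
        choice = assign(utterance)
        if choice is None:
            remaining.append(utterance)
            continue
        if choice not in intent.children:
            intent.add_child(choice)
        intent.children[choice].examples.append(utterance)
    intent.examples = remaining


# Steps 1 and 2: a high-level intent with its labeled utterances.
question = Intent("has a question")
question.examples = [
    "how can I transfer funds to my checking account?",
    "how do I change my password?",
    "can you help me?",
]


def human_choice(utterance: str) -> Optional[str]:
    """Stand-in for a human labeler's decision (hypothetical keyword rule)."""
    if "transfer" in utterance:
        return "about bank account"
    if "password" in utterance:
        return "about account settings"
    return None


refine(question, human_choice)
for child in question.children.values():
    print(child.path(), child.examples)
print(question.path(), question.examples)
```

Calling `refine` again on any child is the “repeat” step: each pass only asks the labeler to split one intent's examples, which keeps every individual decision small.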

Every step produces training data for classifiers that can recognize increasingly specific intents: this is one of the major advantages of this approach.
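As a rough illustration of this point, the sketch below derives a training set for any depth from a single collection of labeled utterances: an utterance labeled with a sub-intent is also a valid example of every ancestor intent. The function name `training_data_at_depth` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def training_data_at_depth(
    labeled: List[Tuple[str, str]], depth: int
) -> Dict[str, List[str]]:
    """Group utterances by their label truncated to `depth` levels.

    Labels shallower than `depth` are kept at their own level.
    """
    data: Dict[str, List[str]] = {}
    for utterance, label in labeled:
        parts = [part.strip() for part in label.split(">")]
        key = " > ".join(parts[:depth])
        data.setdefault(key, []).append(utterance)
    return data


labeled = [
    ("how can I transfer funds to my checking account?", "has a question > about bank account"),
    ("how do I change my password?", "has a question > about account settings"),
    ("my card was declined", "has a problem"),
]

top_level = training_data_at_depth(labeled, depth=1)
second_level = training_data_at_depth(labeled, depth=2)
print(sorted(top_level))
print(sorted(second_level))
```

Here `top_level` feeds a coarse classifier (“has a question” vs. “has a problem”), while `second_level` feeds a more specific one, both built from the same labeling effort.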

What’s the catch?

If this solution to labeling and training data seems too obvious, it’s because it is: divide-and-conquer has been used to break down problems into manageable chunks for a long time; it just hasn’t been made easily available for data labeling and intent discovery use cases yet.

The main reason for this is a question of tooling and resources: the labeling and refactoring workflows required to make this efficient and manageable at scale are costly to build, and only the more sophisticated companies have done so. These companies are able to charge customers thousands of dollars to build and train intents from unstructured data.

There are, however, some solutions focused on democratizing this approach: HumanFirst is one of them, and provides one of the first out-of-the-box bottom-up labeling and intent discovery solutions. In our next article, we’ll explore how machine learning and semantic search can accelerate this bottom-up approach. Stay tuned!




Head of Growth @ HumanFirst

Alex Dubois
