Quality Metrics for NLU/Chatbot Training Data / Part 2: Embeddings

Florian Treml
Sep 15 · 4 min read

What are Embeddings? What are similarity, cohesion and separation?

This article series provides an introduction to important quality metrics for your NLU engine and your chatbot training data. We will focus on practical usage of the introduced metrics, not on the mathematical and statistical background — I will add links to other articles for this purpose.

This is part 2 of the Quality Metrics for NLU/Chatbot Training Data series of articles.

For this article series, you should understand what NLU and NLP are and be familiar with the vocabulary involved (intent, entity, utterance) and the underlying concepts (intent resolution, intent confidence, user examples).

What are Embeddings?

Embeddings are a type of word or sentence representation that allows words or sentences with similar meaning to have a similar representation.

While this sounds complex, the concept is easy to understand by looking at this scatter chart and an example:

  • each colored dot represents a word or a sentence
  • the lower the distance between two dots, the more similar the words or sentences are (in this case: from a semantic point of view)
  • the higher the distance, the less similar they are
2D visualization of word embeddings

As an example:

  • “I’d like to order a drink”
  • “I want iced coffee”
  • “not interested”

The first two sentences will be rather close in the embedding space, while the third one will appear distant from both of them.

Mathematically speaking, an embedding is a vector in an n-dimensional space: the higher n is, the more nuanced the concepts it can capture. Mapping natural language into such a space while preserving semantic similarity is not a trivial task. Fortunately, ready-to-use models are available for the most widely spoken languages, for example the Universal Sentence Encoder developed by Google.
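
To make the idea of "close" and "distant" vectors concrete, here is a minimal sketch in Python. It assumes cosine similarity as the similarity measure, which is a common choice but one this article does not specify. The three-dimensional vectors are made up for illustration; a real encoder returns vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors standing in for real sentence embeddings.
toy_embeddings = {
    "I'd like to order a drink": [0.9, 0.1, 0.2],
    "I want iced coffee": [0.8, 0.2, 0.3],
    "not interested": [0.1, 0.9, 0.1],
}

drink = toy_embeddings["I'd like to order a drink"]
coffee = toy_embeddings["I want iced coffee"]
refusal = toy_embeddings["not interested"]

sim_drink_coffee = cosine_similarity(drink, coffee)
sim_drink_refusal = cosine_similarity(drink, refusal)

print(f"drink vs. coffee:  {sim_drink_coffee:.2f}")
print(f"drink vs. refusal: {sim_drink_refusal:.2f}")
```

With these toy values, the two ordering sentences score far higher against each other than either does against "not interested", which mirrors the example above.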

An encoder is a neural network that takes an input (here, a word or a sentence) and outputs a feature vector, that is, a point in n-dimensional space.

Reducing this n-dimensional vector to a 2D representation that can be shown on a flat scatter chart is done with Principal Component Analysis (PCA).
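
For readers who want to see what such a reduction does, the following is a rough plain-Python sketch of PCA using power iteration. It is an illustration under simplifying assumptions, not the implementation used by any particular tool; the function name project_to_2d and the toy vectors are made up. For real data you would normally reach for a numerical library instead.

```python
import math

def project_to_2d(vectors, iterations=200):
    """Project vectors onto their first two principal components (toy PCA)."""
    n, dim = len(vectors), len(vectors[0])

    # 1. Center the data so that every dimension has mean zero.
    means = [sum(v[j] for v in vectors) / n for j in range(dim)]
    centered = [[v[j] - means[j] for j in range(dim)] for v in vectors]

    # 2. Build the (dim x dim) sample covariance matrix.
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1)
            for j in range(dim)] for i in range(dim)]

    components = []
    for _component in range(2):
        # 3. Power iteration: repeated multiplication by the covariance
        #    matrix converges to its dominant eigenvector.
        vec = [float(j + 1) for j in range(dim)]
        length = math.sqrt(sum(x * x for x in vec))
        vec = [x / length for x in vec]
        for _step in range(iterations):
            nxt = [sum(cov[i][j] * vec[j] for j in range(dim))
                   for i in range(dim)]
            length = math.sqrt(sum(x * x for x in nxt))
            if length == 0.0:  # no variance left in any direction
                break
            vec = [x / length for x in nxt]
        components.append(vec)

        # 4. Deflation: remove the direction just found so that the next
        #    round yields the second principal component.
        eigenvalue = sum(vec[i] * cov[i][j] * vec[j]
                         for i in range(dim) for j in range(dim))
        for i in range(dim):
            for j in range(dim):
                cov[i][j] -= eigenvalue * vec[i] * vec[j]

    # 5. Coordinates of each centered point along the two components.
    return [[sum(r * c for r, c in zip(row, comp)) for comp in components]
            for row in centered]

# Made-up toy vectors standing in for real sentence embeddings.
toy_embeddings = [
    [0.9, 0.1, 0.2],  # "I'd like to order a drink"
    [0.8, 0.2, 0.3],  # "I want iced coffee"
    [0.1, 0.9, 0.1],  # "not interested"
]

points_2d = project_to_2d(toy_embeddings)
for point in points_2d:
    print([round(coordinate, 3) for coordinate in point])
```

The resulting 2D points can be plotted directly on a scatter chart. The two ordering sentences end up close together and the refusal far away, just as in the original space.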

Using Embeddings for Training Data Analysis

When training an NLU engine for chatbots, you typically have labeled training data available: a list of intents, each with a handful of training phrases. Our tool of choice for showing a sample data analysis workflow is Botium Box.

Botium first generates semantic embeddings of the training phrases using the Universal Sentence Encoder module and visualizes them in a 2D map. Based on the similarity between the training phrases, it computes the average distance between the phrases of each pair of intents (separation) and the average similarity of the phrases within each intent (cohesion). This approach helps to identify training phrases that might confuse your chatbot, based on their similarity in the embedding space.

Utterance Similarity

Training phrases that belong to different intents but have a high similarity value can confuse the NLU engine and could lead to user input being routed to the wrong intent.

Utterance similarity
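
A check like this can be sketched in a few lines of Python. The sketch below assumes cosine similarity as the measure and uses made-up intents, phrases, toy vectors and a hypothetical threshold of 0.99; none of these come from Botium, and a sensible threshold depends on the encoder and the data.

```python
import math
from itertools import combinations, product

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def confusable_pairs(training_data, threshold):
    """Return cross-intent phrase pairs with similarity >= threshold."""
    flagged = []
    # Compare every intent with every other intent exactly once.
    for (intent_a, phrases_a), (intent_b, phrases_b) in combinations(
            training_data.items(), 2):
        # Compare each phrase of one intent with each phrase of the other.
        for (text_a, vec_a), (text_b, vec_b) in product(phrases_a, phrases_b):
            similarity = cosine_similarity(vec_a, vec_b)
            if similarity >= threshold:
                flagged.append((similarity, intent_a, text_a, intent_b, text_b))
    # Most similar (most suspicious) pairs first.
    return sorted(flagged, reverse=True)

# Made-up intents, phrases and toy vectors for illustration only.
toy_training_data = {
    "order_drink": [
        ("I'd like to order a drink", [0.9, 0.1, 0.2]),
        ("I want iced coffee", [0.8, 0.2, 0.3]),
    ],
    "order_food": [
        ("I want a sandwich", [0.7, 0.2, 0.5]),
        ("bring me some coffee cake", [0.8, 0.2, 0.35]),
    ],
    "decline": [
        ("not interested", [0.1, 0.9, 0.1]),
    ],
}

flagged = confusable_pairs(toy_training_data, threshold=0.99)
for similarity, intent_a, text_a, intent_b, text_b in flagged:
    print(f"{similarity:.3f}  [{intent_a}] {text_a!r}  <->  [{intent_b}] {text_b!r}")
```

Phrases from the same intent are never compared with each other here, because only pairs that cross an intent boundary can send user input to the wrong intent.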

Intent Separation

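Separation is the average distance between pairs of training phrases taken from two different intents, one phrase from each. The higher the separation, the less likely the NLU engine is to confuse the two intents.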

Intent separation
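
As a rough sketch, separation can be computed like this in Python. It assumes that distance means cosine distance (1 minus cosine similarity), which the article does not state, and it uses made-up toy vectors instead of real embeddings.

```python
import math
from itertools import product
from statistics import mean

def cosine_distance(a, b):
    """1 minus cosine similarity: 0.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def intent_separation(vectors_a, vectors_b):
    """Average distance over all pairs with one phrase from each intent."""
    return mean(cosine_distance(a, b) for a, b in product(vectors_a, vectors_b))

# Made-up toy vectors, one list per intent.
order_drink = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.3]]
order_food = [[0.7, 0.2, 0.5], [0.8, 0.2, 0.35]]
decline = [[0.1, 0.9, 0.1], [0.2, 0.8, 0.1]]

sep_drink_food = intent_separation(order_drink, order_food)
sep_drink_decline = intent_separation(order_drink, decline)

print(f"order_drink vs. order_food: {sep_drink_food:.3f}")
print(f"order_drink vs. decline:    {sep_drink_decline:.3f}")
```

In this toy data the two ordering intents sit much closer together than the ordering and declining intents, so the first pair is the one worth a closer look.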

Intent Cohesion

Cohesion is the average similarity value between each pair of training phrases in the same intent. That value is computed for each intent. The higher the cohesion value, the more consistent the training phrases of that intent are.

Intent cohesion
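
Cohesion can be sketched in the same way. The Python snippet below assumes cosine similarity as the measure and uses made-up toy vectors. Treating an intent with fewer than two phrases as an error is a choice made for this sketch, since the average over zero pairs is undefined.

```python
import math
from itertools import combinations
from statistics import mean

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def intent_cohesion(vectors):
    """Average similarity over all pairs of phrases within one intent."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("cohesion needs at least two training phrases")
    return mean(cosine_similarity(a, b) for a, b in pairs)

# Made-up toy vectors: one intent with consistent phrases,
# one intent whose phrases point in very different directions.
consistent_intent = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.3], [0.85, 0.15, 0.2]]
scattered_intent = [[0.9, 0.1, 0.2], [0.1, 0.9, 0.1], [0.2, 0.3, 0.9]]

cohesion_consistent = intent_cohesion(consistent_intent)
cohesion_scattered = intent_cohesion(scattered_intent)

print(f"consistent intent: {cohesion_consistent:.3f}")
print(f"scattered intent:  {cohesion_scattered:.3f}")
```

An intent that scores like the scattered one here probably mixes several unrelated meanings and is a candidate for splitting or rewording.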

Improve Chatbot Training Phrases

To improve the quality of the training phrases for your intents, consider the following approaches:

  • Find the phrases in different intents with high similarity in the Utterance Similarity table, and change or remove them
  • For intents with low cohesion, add more meaningful training phrases
  • For intent pairs with low separation, investigate the training phrases of both intents

Give Botium Box a test drive today. Start with the free Community Edition; we are happy to hear from you if you find it useful!

Looking for contributors

Please take part in the Botium community to bring chatbots forward! By contributing you help in increasing the quality of chatbots worldwide, leading to increasing end-user acceptance, which again will bring your own chatbot forward! Start here

Written by Florian Treml, Founder and CTO, Botium

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com
