Quality Metrics for NLU/Chatbot Training Data

Part 1: Confusion Matrix

Florian Treml
Nov 12, 2019 · 7 min read

What is a Confusion Matrix? How to generate and read a Confusion Matrix? How to calculate precision, recall and F1-Score for your NLU engine?

This article series provides an introduction to important quality metrics for your NLU engine and your chatbot training data. We will focus on practical usage of the introduced metrics, not on the mathematical and statistical background — I will add links to other articles for this purpose.

This is part 1 of the Quality Metrics for NLU/Chatbot Training Data series of articles.

For this article series, you should have a basic understanding of what NLU and NLP are, of the vocabulary involved (intent, entity, utterance) and of the concepts involved (intent resolution, intent confidence, user examples).

Why are Quality Metrics for NLU important?

Training an NLU engine for a chatbot always follows the same approach:

  1. You have labeled training data available (labeled means: you know what intent an utterance has to resolve to)

  2. The NLU engine is trained with this data

  3. The NLU performance is verified

  4. Based on the verification results, the training data is refined and the process repeats from step 2

For step 2, the tool of choice to automate the training process is Botium.

For step 3, one approach is to run manual or automated tests against the NLU engine with a dedicated test set not used for training. The basic question is: did step 4, the refinement of the training data, have a positive or negative impact on the NLU performance? Did it make my NLU better or worse at resolving the intents?
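As a minimal sketch of what such a verification step could look like (the predict function below is a hypothetical stand-in for whatever call your NLU engine offers; the real API differs per engine):

```python
# Minimal sketch of step 3: verify the NLU against a held-out test set.
# `predict` is a hypothetical stand-in for your engine's API call.

def evaluate(predict, test_set):
    """Return the fraction of utterances resolved to the expected intent."""
    correct = sum(1 for utterance, expected in test_set
                  if predict(utterance) == expected)
    return correct / len(test_set)

# Utterances NOT used for training, with their expected intents (made-up data)
test_set = [
    ("can you connect me to sales", "contact_sales"),
    ("yes please", "affirm"),
]

# Dummy predictor for illustration; replace with a real NLU call
print(evaluate(lambda utterance: "affirm", test_set))  # 0.5
```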

You may ask why refinement of the training data could have a negative impact. Because changing the user examples for one intent can and will have an impact on intent resolution and confidence for intents with similar user examples.

Even when dealing with a small to medium chatbot project with 30 intents and 70 user examples per intent, there are thousands of test results to validate and compare with previous training cycles — impossible to do by hand if you rely on quick feedback cycles. What we need is a rather small set of comparable numbers (or metrics) — in the best case exactly one number — telling us about the overall NLU performance, and some other numbers pointing out the hot spots that need attention. In one sentence:

Quality Metrics make NLU training cycles comparable and point out areas of interest.

The Confusion Matrix

A Confusion Matrix shows an overview of the predicted intents vs the expected intents. It answers questions like “When sending user example X, I expect the NLU to predict intent Y; what did it actually predict?”.

The expected intents are shown as rows, the predicted intents are shown as columns. The user examples are sent to the NLU engine, and the cell value at the expected intent row and the predicted intent column is increased by 1. So whenever the predicted and the expected intent match, a cell value on the diagonal is increased — these are our successful test cases. All cell values not on the diagonal are our failed test cases.
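As a rough sketch of how such a matrix is filled (using Python and scikit-learn purely for illustration; the expected/predicted lists are made-up examples, and Botium generates the matrix for you anyway):

```python
from sklearn.metrics import confusion_matrix
import pandas as pd

# One entry per user example in the test set (made-up data)
expected  = ["affirm", "affirm", "contact_sales", "enter_data", "affirm"]
predicted = ["affirm", "enter_data", "contact_sales", "enter_data", "affirm"]

labels = sorted(set(expected) | set(predicted))
matrix = confusion_matrix(expected, predicted, labels=labels)

# Rows = expected intent, columns = predicted intent,
# the diagonal holds the successful test cases
print(pd.DataFrame(matrix, index=labels, columns=labels))
```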

Examples

Here is a small extract from a large confusion matrix calculated for Sara, the Rasa Demo Bot:

This matrix lets us deduce statements like these:

  • There are 53 (52 + 1) user examples for the affirm intent. But for one of them, the NLU predicted the enter_data intent instead.

From these statements, several conclusions can be drawn:

  • The ask_howold and how_to_get_started intents are trained perfectly

Precision, Recall and F1-Score

The statements above are logically flawless, but not totally intuitive.

  • How to decide if an intent or an intent pair needs refinement and additional training?

That’s where the statistical concepts of precision and recall come into play, along with the F1-Score, which represents the trade-off between the two.

You can find details on Wikipedia, this article will only give a rough overview.

Precision

In the example above, the NLU predicted the intent ask_faq_platform for 21 (18 + 3) user examples. For 3 of them, the expected intent was a different one, so 3 out of 21 predictions are wrong. The precision is ~0.86 (18 / 21): the number of correct predictions for intent ask_faq_platform divided by the total number of predictions for intent ask_faq_platform.

The question answered by the precision rate is: How many predictions of an intent are correct?
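The same calculation written out (numbers taken from the example matrix above):

```python
# Precision for ask_faq_platform
correct_predictions = 18  # expected AND predicted ask_faq_platform
total_predictions   = 21  # all predictions of ask_faq_platform (18 + 3)

precision = correct_predictions / total_predictions
print(round(precision, 2))  # 0.86
```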

Recall / Sensitivity

In the example above, there are 121 (1 + 117 + 3) user examples for which we expect the intent contact_sales. The NLU predicted the intent contact_sales for only 117 of them. The recall is ~0.97 (117 / 121): the number of correct predictions for intent contact_sales divided by the total number of expectations for intent contact_sales.

The question answered by the recall rate is: How many of the expected occurrences of an intent are predicted correctly?
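And the corresponding calculation for recall (numbers again taken from the example matrix above):

```python
# Recall for contact_sales
correct_predictions = 117  # expected AND predicted contact_sales
total_expectations  = 121  # all user examples expecting contact_sales (1 + 117 + 3)

recall = correct_predictions / total_expectations
print(round(recall, 2))  # 0.97
```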

Precision vs Recall — F1-Score

While those two sound pretty much the same, they are not. In fact, it is not possible to evaluate the NLU performance with just one of those two metrics.

Again, from the example above:

  • The contact_sales intent has been predicted 117 times, and 117 of the predictions are correct. The precision rate is 1.0, perfect.

In theory, it is possible to get a perfect precision rate by making a very small number of predictions for an intent (for example, by setting the confidence threshold very high). But the recall rate will drop dramatically in this case, as the NLU will make no prediction (or a wrong prediction) in many cases.

On the other hand, it is possible to get a perfect recall rate for an intent by resolving EVERY user example to this intent. The precision will then be very low.

The trade-off between recall and precision is captured by the F1-Score, which is the harmonic mean of the two. Most importantly, the F1-Score is a comparable metric for measuring the impact of NLU training. The rule of thumb (with some exceptions) is:

Increasing F1-Score means increasing NLU performance, decreasing F1-Score means decreasing NLU performance, within your test data.

An F1-Score of 0.95 is usually a good value, meaning the NLU engine is working pretty well on your test data.

An F1-Score of 1.0 means that all your test data is perfectly resolved by your NLU, the perfect NLU performance. This may be pleasant for regression testing, but typically it is a sign of overfitting — a topic for another article.
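A minimal sketch of the F1 calculation, plus a cross-check with scikit-learn, which reports precision, recall and F1-Score per intent in one go (the expected/predicted lists are the same made-up ones as in the confusion matrix sketch above):

```python
from sklearn.metrics import classification_report

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# contact_sales from the example above: precision 1.0, recall ~0.97
print(round(f1(1.0, 0.97), 2))  # 0.98

# scikit-learn computes all three metrics per intent in one report
expected  = ["affirm", "affirm", "contact_sales", "enter_data", "affirm"]
predicted = ["affirm", "enter_data", "contact_sales", "enter_data", "affirm"]
print(classification_report(expected, predicted))
```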

Automatically Calculate Precision/Recall/F1-Score and Generate Confusion Matrix

Botium is the Selenium for Chatbots, and the perfect choice for automated training and testing of any of the supported NLU engines:

  • IBM Watson

Botium Box (free Community Edition available) records all test data and calculates important NLP analytics for you.

In addition to the Confusion Matrix with recall, precision and F1-Score (see the screenshots in this article), you also get a full test result list, showing all your utterances with expected and predicted intent and confidence in a plain Excel list.

Give Botium Box a test drive today — start with the free Community Edition, we are happy to hear from you if you find it useful!

Looking for contributors

Please take part in the Botium community to bring chatbots forward! By contributing you help to increase the quality of chatbots worldwide, leading to higher end-user acceptance, which in turn will bring your own chatbot forward! Start here:

https://github.com/codeforequity-at/botium-core
