Quality Metrics for NLU/Chatbot Training Data / Part 1: Confusion Matrix

Florian Treml
Published in Analytics Vidhya
7 min read · Nov 12, 2019


UPDATE 2020/11/01: Botium’s free plan is live! With Botium Box Mini you will be able to:

  • use multiple chatbot technologies
  • set up test automation in a few minutes
  • enjoy a new improved user interface
  • get the benefits of a hosted, free service

Take it for a test drive

What is a Confusion Matrix? How do you generate and read a Confusion Matrix? How do you calculate precision, recall and F1-Score for your NLU engine?

This article series provides an introduction to important quality metrics for your NLU engine and your chatbot training data. We will focus on practical usage of the introduced metrics, not on the mathematical and statistical background — I will add links to other articles for this purpose.

This is part 1 of the Quality Metrics for NLU/Chatbot Training Data series of articles.

For this article series, you should have an understanding of what NLU and NLP are, of the vocabulary involved (intent, entity, utterance) and of the underlying concepts (intent resolution, intent confidence, user examples).

Why are Quality Metrics for NLU important?

Training an NLU engine for a chatbot always follows the same approach:

  1. You have labeled training data available (labeled means: you know what intent an utterance has to resolve to)
  2. You feed the training data into your NLU engine
  3. You validate the training outcome
  4. Refine training data and repeat until satisfied

For step 2, Botium is the tool of choice to automate the training process.

For step 3, one approach is to run manual or automated tests against the NLU engine with a special test set not used for training. The basic question is: did step 4, the refinement of the training data, have a positive or negative impact on the NLU performance? Did it make my NLU better or worse at resolving the intents?

You may ask why refinement of training data could have a negative impact. Because changing user examples for an intent can and will affect intent resolution and confidence for intents with similar user examples.

Even a small to medium chatbot project with 30 intents and 70 user examples per intent produces over 2,000 test results (30 × 70 = 2,100) to validate and compare to the previous training cycles. Checking them by hand is impossible when relying on quick feedback cycles. What we need is a rather small set of comparable numbers (or metrics), in the best case exactly one number, to tell us about the overall NLU performance, plus some other numbers pointing out the hot spots that need attention. In one sentence:

Quality Metrics make NLU training cycles comparable and point out areas of interest.

The Confusion Matrix

A Confusion Matrix shows an overview of the predicted intent vs. the expected intent. It answers questions like: “When sending user example X, I expect the NLU to predict intent Y; what did it actually predict?”

The expected intents are shown as rows, the predicted intents as columns. User examples are sent to the NLU engine, and the cell value at the expected intent’s row and the predicted intent’s column is increased by 1. So whenever predicted and expected intent match, a cell value on the diagonal is increased; these are our successful test cases. All cell values off the diagonal are our failed test cases.
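The counting scheme above can be sketched in a few lines of Python. The (expected, predicted) pairs below are made up for illustration; in practice they would come from your test runs:

```python
from collections import Counter

# Hypothetical test results as (expected intent, predicted intent) pairs
results = [
    ("affirm", "affirm"),
    ("affirm", "enter_data"),               # a failed test case
    ("ask_faq_platform", "ask_faq_platform"),
    ("contact_sales", "ask_faq_platform"),  # another failed test case
    ("contact_sales", "contact_sales"),
]

# Each (expected, predicted) pair is one cell of the confusion matrix;
# pairs where both intents match land on the diagonal.
matrix = Counter(results)

for (expected, predicted), count in sorted(matrix.items()):
    status = "OK" if expected == predicted else "FAIL"
    print(f"{expected:>18} -> {predicted:<18} {count}  {status}")
```

Each distinct pair becomes one cell, and the counts on the diagonal are the successful test cases.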


Here is a small extract from a large confusion matrix calculated for Sara, the Rasa Demo Bot:

This matrix lets us deduce statements like these:

  • There are 53 (52 + 1) user examples for the affirm intent. But for one of them, the NLU predicted the enter_data intent instead.
  • The NLU predicted the ask_faq_platform intent for 21 (18 + 3) user examples, but it was expected for only 18 of them. For the remaining 3, the expected intent was contact_sales, so the prediction was wrong.
  • For the ask_faq_platform intent there are 19 (18 + 1) user examples, but only 18 of them have been recognized by the NLU.
  • For 38 user examples, the ask_howold intent was expected, and the NLU predicted it for exactly these 38 user examples.

And from these statements, there are several conclusions:

  • The ask_howold and how_to_get_started intents are trained perfectly
  • There are 3 user examples where the NLU predicted ask_faq_platform, but the test data expected the intent contact_sales — find out the 3 user examples and refine training data for them
  • The enter_data intent was predicted for 3 (1 + 1 + 1) user examples where another intent was expected. On the other hand, there are 682 user examples correctly identified as enter_data, so the trade-off for this intent is acceptable
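Conclusions like these can also be drawn programmatically: scanning the matrix for the largest off-diagonal cells points straight at the intent pairs that need refinement. A minimal sketch, with cell counts loosely based on the Sara extract above:

```python
# Confusion matrix as {(expected intent, predicted intent): count};
# the counts are illustrative, loosely based on the Sara extract.
matrix = {
    ("affirm", "affirm"): 52,
    ("affirm", "enter_data"): 1,
    ("ask_faq_platform", "ask_faq_platform"): 18,
    ("ask_faq_platform", "enter_data"): 1,
    ("contact_sales", "ask_faq_platform"): 3,
    ("contact_sales", "contact_sales"): 117,
    ("enter_data", "enter_data"): 682,
}

# Off-diagonal cells are failed test cases; sort them by count to find
# the intent pairs that most need training-data refinement.
hot_spots = sorted(
    ((count, expected, predicted)
     for (expected, predicted), count in matrix.items()
     if expected != predicted),
    reverse=True,
)

for count, expected, predicted in hot_spots:
    print(f"{count} x expected {expected}, predicted {predicted}")
```

The top entries of this list are the hot spots mentioned earlier: the intent pairs where refining user examples pays off most.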

Precision, Recall and F1-Score

The statements above are logically flawless, but not totally intuitive.

  • How to decide if an intent or an intent pair needs refinement and additional training?
  • How to compare the total NLU performance to a previous training cycle?
  • How to compare the performance of the most important intents to the previous training cycle?
  • How to decide if the training data is good enough for production usage?

That’s where the statistical concepts of precision and recall come into play, together with the F1-Score representing the trade-off between the two.

You can find details on Wikipedia, this article will only give a rough overview.


Precision

In the example above, the NLU recognized the intent ask_faq_platform for 21 (18 + 3) user examples. For 3 of them, another intent was expected, so 3 out of 21 predictions are wrong. The precision is ~0.85 (18 / 21): the number of correct predictions for intent ask_faq_platform divided by the total number of predictions for intent ask_faq_platform.

The question answered by the precision rate is: How many predictions of an intent are correct?
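As a minimal sketch, the same precision calculation in Python, with the counts taken from the matrix extract above:

```python
# Counts for the ask_faq_platform intent, from the matrix extract above
true_positives = 18   # predicted ask_faq_platform, expected ask_faq_platform
false_positives = 3   # predicted ask_faq_platform, but expected contact_sales

# Precision = correct predictions / all predictions of this intent
precision = true_positives / (true_positives + false_positives)
print(round(precision, 3))  # 0.857
```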

Recall / Sensitivity

In the example above, we have 121 (1 + 117 + 3) user examples for which we expect the intent contact_sales. The NLU predicted the intent contact_sales for only 117 of them. The recall is ~0.97 (117 / 121): the number of correct predictions for intent contact_sales divided by the total number of user examples expecting intent contact_sales.

The question answered by the recall rate is: How many user examples of an intent are correctly recognized?
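Again as a minimal sketch, the recall calculation for contact_sales with the counts from the example above:

```python
# Counts for the contact_sales intent, from the example above
true_positives = 117   # expected contact_sales, predicted contact_sales
false_negatives = 4    # expected contact_sales, but predicted another intent

# Recall = correct predictions / all user examples expecting this intent
recall = true_positives / (true_positives + false_negatives)
print(round(recall, 3))  # 0.967
```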

Precision vs Recall — F1-Score

While those two sound pretty much the same, they are not. In fact, it is not possible to evaluate the NLU performance with just one of those two metrics.

Again, from the example above:

  • The contact_sales intent has been predicted 117 times, and 117 of the predictions are correct. The precision rate is 1.0, perfect.
  • There are 4 more user examples for which the NLU predicted another intent. The recall rate is ~0.97, which is pretty good, but not perfect.

In theory, it is possible to get a perfect precision rate by making a very small number of predictions for an intent (for example, by setting the confidence threshold very high). But the recall rate will decrease dramatically in this case, as the NLU will make no prediction (or a wrong one) in many cases.

On the other hand, it is possible to get a perfect recall rate for an intent by resolving EVERY user example to this intent. The precision will then be very low.

The trade-off between recall and precision is captured by the F1-Score, the harmonic mean of the two. Most importantly, the F1-Score is a comparable metric for measuring the impact of NLU training. The rule of thumb (with some exceptions) is:

Increasing F1-Score means increasing NLU performance, decreasing F1-Score means decreasing NLU performance, within your test data.

An F1-Score of 0.95 usually is a good value, meaning the NLU engine is working pretty well on your test data.

An F1-Score of 1.0 means that all your test data is perfectly resolved by your NLU: perfect NLU performance. This may be pleasant for regression testing, but typically it is a sign of overfitting, a topic for another article.
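The harmonic mean is simple to compute; a minimal sketch using the contact_sales numbers from the example above:

```python
# contact_sales numbers from the example above
precision = 117 / 117   # every contact_sales prediction was correct -> 1.0
recall = 117 / 121      # 4 user examples were resolved to other intents

# F1-Score: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.983
```

Note how the single wrong-side metric (recall) pulls the F1-Score below the perfect precision: the harmonic mean penalizes imbalance between the two.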

Automatically Calculate Precision/Recall/F1-Score and Generate Confusion Matrix

Botium is the Selenium for Chatbots, and the perfect choice for automated training and testing of any of the supported NLU engines:

  • IBM Watson
  • Google Dialogflow
  • Microsoft LUIS
  • Amazon Lex
  • SAP Conversational AI
  • Wit.ai
  • Rasa
  • Botpress
  • Custom HTTP/JSON endpoints
  • and many more …

Botium Box records all test data and calculates important NLP analytics for you.

Apart from the Confusion Matrix with recall, precision and F1-Score (see the screenshots in this article), you also get a full test result list, showing all your utterances with expected and predicted intent and confidence in a plain Excel list.

Give Botium Box a test drive today and start with the free Mini Edition. We are happy to hear from you if you find it useful!

Looking for contributors

Please take part in the Botium community to bring chatbots forward! By contributing, you help increase the quality of chatbots worldwide, leading to greater end-user acceptance, which in turn will bring your own chatbot forward! Start here:



