Benchmarking Intent Classification services — June 2018

Intent classification is an important component of the Natural Language Understanding (NLU) system in any chatbot platform. For the chatbot to work well, it must correctly recognize the user's intent from the user input in order to trigger the correct action or dialog.

Another important aspect of NLU is entity extraction, which is the subject of a separate benchmark that we completed previously.

In most bot-building platforms, the bot developer creates a list of intents and, for each intent, provides a set of training phrases representing what a typical user might say to express that intent. The number of training phrases varies across intents and bots, from a few (<10) to a hundred or more per intent. Once the bot is trained, its intent classification is evaluated on testing phrases to check whether it detects the intents correctly.
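The train-then-evaluate workflow described above can be sketched in a few lines of Python. This is only an illustration: the intents, the phrases and the word-overlap classifier are hypothetical stand-ins, not the method used by Botfuel or any other platform mentioned in this post.

```python
from collections import Counter

# Hypothetical intents and training phrases, for illustration only.
training_phrases = {
    "greet": ["hello", "hi there", "good morning"],
    "book_flight": ["book a flight", "i need a plane ticket", "fly me to paris"],
}


def train(phrases_by_intent):
    """Build one bag-of-words profile (word counts) per intent."""
    return {
        intent: Counter(w for phrase in phrases for w in phrase.lower().split())
        for intent, phrases in phrases_by_intent.items()
    }


def classify(model, phrase):
    """Return the intent whose profile overlaps most with the phrase, or None."""
    words = phrase.lower().split()
    scores = {intent: sum(profile[w] for w in words) for intent, profile in model.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def evaluate(model, labelled_phrases):
    """Return the fraction of testing phrases whose intent is detected correctly."""
    correct = sum(1 for phrase, intent in labelled_phrases if classify(model, phrase) == intent)
    return correct / len(labelled_phrases)


# Testing phrases are kept separate from the training phrases.
testing_phrases = [
    ("hello there", "greet"),
    ("book a plane ticket", "book_flight"),
    ("i need to say hi", "greet"),  # misclassified by this toy model
]

model = train(training_phrases)
accuracy = evaluate(model, testing_phrases)
print(f"accuracy: {accuracy:.2f}")
```

The third testing phrase shares more words with the "book_flight" training phrases than with the "greet" ones, so the toy model gets it wrong. Real platforms use far more robust models, but the evaluation loop has the same shape.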


In August 2017, a published paper “Evaluating Natural Language Understanding Services for Conversational Question Answering Systems” compared the intent classification and entity extraction of various platforms, including DialogFlow, LUIS, Watson and RASA, using the three corpora listed in the table below:

Link to the corpus data used in the paper: https://github.com/sebischair/NLU-Evaluation-Corpora

We compare the intent classification results reported in the above paper with those of the Botfuel NLP classification service (thus excluding the entity extraction part). The Botfuel classification service is part of the Botfuel bot-building platform.

In addition:

  • We computed the results for Recast, a bot-building platform that was not covered in the paper, using its API, and its UI for manual verification.
  • We show the results for Snips, which were computed by Snips and reported in their blog post. The details of this calculation are available here.
  • We re-tested all three corpora on DialogFlow and obtained results very similar to those in the paper.

The table below shows the f1-score for intent classification on each corpus, together with the overall result computed using micro averaging, as in the original paper:

f1-score for intent classification for each corpus and the overall f1 using micro averaging. Results for LUIS, DialogFlow, Watson and RASA are from the paper.
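For readers unfamiliar with micro averaging, here is a minimal sketch of how an overall f1 can be computed by pooling true positives, false positives and false negatives across all intents and corpora before computing precision and recall. The labels and predictions are toy data, and the function names are ours; this is not the evaluation script used in the paper or in our benchmark.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count per-intent true positives, false positives and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the wrongly predicted intent gains a false positive
            fn[g] += 1  # the missed gold intent gains a false negative
    return tp, fp, fn


def micro_f1(corpora):
    """Pool counts over all intents and corpora, then compute a single f1."""
    tp = fp = fn = 0
    for gold, predicted in corpora:
        c_tp, c_fp, c_fn = confusion_counts(gold, predicted)
        tp += sum(c_tp.values())
        fp += sum(c_fp.values())
        fn += sum(c_fn.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy (gold, predicted) label lists standing in for two corpora.
corpus_a = (["a", "a", "b", "b"], ["a", "b", "b", "b"])
corpus_b = (["x", "y"], ["x", "x"])

f1_a = micro_f1([corpus_a])
f1_b = micro_f1([corpus_b])
f1_overall = micro_f1([corpus_a, corpus_b])
print(f1_a, f1_b, f1_overall)
```

Because the counts are pooled, the overall score is not the plain mean of the per-corpus scores: a corpus with more testing phrases weighs more heavily in the result.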

You can find our classification results here: https://github.com/Botfuel/benchmark-nlp-2018.

Here are some observations:

  • All platforms perform similarly well (with Botfuel, LUIS and Watson slightly better). Note that the three corpora are domain-specific, so, as the paper itself points out, the results above should not be taken as an estimate of the platforms’ performance in other situations.
  • The same performance pattern was observed across platforms: best on Chatbot, then Ask Ubuntu, and worst on Web Applications. This is the same order as the average number of training phrases per intent, which is 50, 10 and 4 for Chatbot, Ask Ubuntu and Web Applications respectively.

This benchmark shows that the performance of the Botfuel classification service is on par with that of other platforms, such as LUIS and DialogFlow. It also strongly suggests that one can improve intent classification by increasing the number and the quality of the training phrases.