Natural Language Understanding Benchmarks — part 1

Jose Marcos
Published in Melior.AI
3 min read · Sep 3, 2018

What is Natural Language Understanding?

One of the key elements of an Artificial Intelligence driven chatbot, a technology we have developed at Melior, is the Natural Language Understanding (NLU) engine.

The NLU engine is the technology in charge of making sense of what a user is saying: what a given statement means and what relevant information it contains. (Sentiment, Toxicity, Formality and Tone are some other aspects the NLU can be in charge of understanding, but that’s a story for another post.)
A system like this is usually composed of several sub-components: pre-processing algorithms, Machine Learning models of varying levels of complexity, and post-processing pipelines. For the purposes of this post, we will focus on one of the core aspects: intent classification.

We refer to intent classification as the process of assigning one or more labels, or intents, to a given sentence with the goal of determining what the sentence is attempting to achieve.

For example, suppose we were building an NLU engine to classify between two intents, ‘DepartureTime’ and ‘FindConnection’, and received the following statement:

when is the next train in muncher freiheit?

The NLU engine should understand that we are asking about DepartureTime.
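To make the task concrete, here is a minimal sketch of an intent classifier in Python using scikit-learn. The training phrases are hypothetical and the model is a deliberately simple illustration of intent classification in general, not of our actual engine:

```python
# A minimal intent-classification sketch (hypothetical training phrases,
# not the benchmark data): TF-IDF features plus a linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_sentences = [
    "when is the next train in muncher freiheit?",     # DepartureTime
    "what time does the next tram leave?",             # DepartureTime
    "how do i get from garching to the city centre?",  # FindConnection
    "find me a connection to the airport",             # FindConnection
]
train_intents = ["DepartureTime", "DepartureTime",
                 "FindConnection", "FindConnection"]

# Character n-grams are fairly robust to typos such as "muncher freiheit".
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(),
)
clf.fit(train_sentences, train_intents)

print(clf.predict(["when does the next train leave?"]))
# Likely output: ['DepartureTime']
```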

How does Melior’s NLU rate against competitors?

Our NLU engine is such an integral part of our work that we spent a great deal of time prototyping, improving and measuring the performance of our technology against different models and solutions.

We followed the methodology described in Evaluating Natural Language Understanding Services for Conversational Question Answering Systems to compare our technology with previous evaluations.

To do so, we used the same data sets, trained our models, and calculated the same metrics as found in the 2018-01-Braun-et-al-extension of this evaluation carried out by Snips.
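As an illustration of that evaluation step, the snippet below computes precision, recall and F1 with scikit-learn over a few hypothetical predictions. The micro-averaging shown here is an assumption made for the sketch, not a claim about the exact averaging used in the original evaluation:

```python
# Sketch of the scoring step: precision, recall and F1 over
# hypothetical predicted intents for a held-out test set.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["DepartureTime", "FindConnection", "FindConnection", "DepartureTime"]
y_pred = ["DepartureTime", "FindConnection", "DepartureTime", "DepartureTime"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro"
)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```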

These are our preliminary results alongside those of other existing solutions:

Although the measurements were taken some time ago¹— and NLU services improve over time — we are nonetheless encouraged to see the very high scores our NLU has achieved in such a short period and are confident about future improvements. These results are to be taken as preliminary baselines.

For the purpose of this benchmark, we trained our models without parameter fine-tuning, data augmentation, model selection, or model ensembling. A combination of these techniques would probably boost performance further.
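For readers wondering what such fine-tuning and model selection would involve, here is a rough sketch of a hyperparameter search with scikit-learn. The pipeline and search space are purely illustrative assumptions, not our production setup:

```python
# Sketch of hyperparameter fine-tuning via cross-validated grid search
# (illustrative search space only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_micro")
# search.fit(train_sentences, train_intents)  # would use the benchmark training split
# print(search.best_params_, search.best_score_)
```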

In our following posts, we will expand the benchmarks to include more fine-grained details as well as other datasets, and comparisons after applying some of the aforementioned improvement techniques.

Moreover, to make comparisons fairer, we will re-train the models of open-source alternatives to account for possible improvements before re-publishing our findings.

Stay tuned if you want to learn more about this and other aspects of our AI!

[1] RASA and Snips metrics were obtained in January 2018 by Snips in this benchmark.
