Benchmarking Natural Language Understanding Systems: Google, Facebook, Microsoft, Amazon, and Snips

By Alice Coucke, Adrien Ball, Clément Delpuech, Clément Doumouro, Sylvain Raybaud, Thibault Gisselbrecht and Joseph Dureau

Snips is an AI voice platform for connected devices. It’s Private by Design, runs on-device, and stays up when the internet goes down. Under the hood, here are the three main components at play:

End-to-end pipeline for voice interfaces.

As a company, we put a lot of emphasis on measuring how well our algorithms are doing at each of these steps. First of all, we do not want our clients to sacrifice performance for privacy. In addition, comparing our performance to that of our main competitors gives us a tangible indication of the progress we’re making.

In this post, we are going to focus on Natural Language Understanding (NLU). We will compare our solution to the main existing alternatives: Google’s, Facebook’s, Microsoft’s, and Amazon’s Alexa. For full transparency, we are sharing the data, the procedure, and the detailed metrics obtained for each provider.

An open-source benchmark

Today, we are open-sourcing a dataset of over 16K crowdsourced queries. More precisely, this dataset contains 2400 queries for each of the 7 user intents we tested:

7 intents covered in this benchmark.

This data is freely available on GitHub, along with more details on the procedure. We hope it will serve anyone working in the field who wants to compare their performance to the services we benchmarked on this data: Google’s, Facebook’s Wit, Microsoft’s Luis, Amazon’s Alexa, and Snips’ NLU.

It’s all about filling slots

The trickiest part in Natural Language Understanding is extracting the attributes of the query the user is making. This problem is called slot-filling.
Let’s take the following example:

“Is it gonna be sunny on Sunday after lunch, so we can go to Coney Island?”

A first model will identify that the user is asking for the weather conditions. The slot filler will then look for typical attributes of such queries: a location, a date, etc. In short, the NLU component receives the sentence in natural language: “Is it gonna be sunny on Sunday after lunch, so we can go to Coney Island?”, and returns a structured object that a classic algorithm can act upon:

{
  "intent": "GetWeather",
  "datetime": "2017-06-04T14:00:00-05:00",
  "location": "Coney Island"
}

The performance of such systems can be measured in two ways. The first, precision, measures how exact the attributes extracted by the engine are. Recall, on the other hand, captures the proportion of existing attributes that are recovered by the model. You can see it as a measure of the model’s sensitivity.

For instance, a model that makes few predictions but whose extracted attributes are always correct would have a high precision, but a very low recall. A balance needs to be found between the two. Precision and recall can be aggregated into a single score, their harmonic mean, called the F1-score. A 100% F1-score means that you’re catching 100% of the attributes specified in user queries, and that the attributes you extract never contain any wrong or superfluous information.
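
To make these definitions concrete, here is a minimal Python sketch that scores one query. It assumes a predicted slot counts as correct only when both its name and its value match the expected slot exactly; the matching rule actually used in the benchmark is described on GitHub and may differ. The function name `slot_scores` and the example slot values are illustrative only.

```python
def slot_scores(predicted, expected):
    """Return (precision, recall, F1) over sets of (slot_name, value) pairs.

    Assumption: a prediction is correct only on an exact match of
    both the slot name and the slot value.
    """
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    # Precision: share of extracted slots that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of expected slots that were recovered.
    recall = true_positives / len(expected) if expected else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Slots a perfect engine would return for the Coney Island query.
expected = {("datetime", "Sunday after lunch"), ("location", "Coney Island")}
# A cautious model: it extracts a single slot, and that slot is correct.
predicted = {("location", "Coney Island")}

precision, recall, f1 = slot_scores(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The cautious model is never wrong, so its precision is perfect, but it misses half of the expected slots, which pulls its recall and therefore its F1-score down.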

How are Google, Microsoft, Facebook, Amazon, and Snips doing at this problem?

With every NLU system, building your own model always takes you through the same steps: you enter examples of possible queries users would make, and tag the sentence segments that match the attributes you’re looking for. It is generally a manual procedure that takes a good deal of effort and imagination. On rather rich intents, an enthusiastic developer will generally stop after annotating around 70 query examples.

We’ve computed the performance of each service when used under these conditions. For a given intent, we randomly picked multiple training sets of 70 queries and fed each of them to every NLU engine. Each trained engine was then scored on a validation dataset. For each intent, we averaged the scores obtained across the repetitions of the experiment. The details of the procedure are described on GitHub. Here are the results we obtained:

F1-score (per intent, and averaged) for the different providers trained on 70 queries / intent.
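
The repeated-sampling procedure described above can be sketched as follows. This is only an illustration, not the benchmark code: the number of repetitions, the function names, and the toy scoring function standing in for a real NLU engine are all assumptions; the actual procedure is the one documented on GitHub.

```python
import random
from statistics import mean


def average_score(pool, validation, train_and_score,
                  train_size=70, repetitions=5, seed=0):
    """Average a score over several randomly drawn training sets.

    `train_and_score(train, validation)` is a hypothetical callback that
    trains an engine on `train` and returns its score on `validation`.
    `repetitions=5` is an arbitrary choice made for this sketch.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repetitions):
        train = rng.sample(pool, train_size)  # one random training set
        scores.append(train_and_score(train, validation))
    return mean(scores)


def toy_train_and_score(train, validation):
    # Toy stand-in for an engine: it "knows" a city only if that city
    # was the last word of at least one training query.
    known = {query.split()[-1] for query in train}
    hits = sum(query.split()[-1] in known for query in validation)
    return hits / len(validation)


# Synthetic data: 200 training candidates and 100 validation queries.
pool = [f"{prefix} city{i}"
        for i in range(100)
        for prefix in ("weather in", "forecast for")]
validation = [f"is it sunny in city{i}" for i in range(100)]

score = average_score(pool, validation, toy_train_and_score)
print(f"average score over repetitions: {score:.2f}")
```

Averaging over several random draws reduces the influence of any single lucky or unlucky selection of 70 training queries.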

These results show that the Natural Language Understanding engine you’d create with Snips would be significantly more reliable than what would be achieved with the other platforms. If we dive a bit into the details, we see that Microsoft’s, Facebook’s, and Amazon’s solutions are penalised by their poor sensitivity: their recall is respectively 53%, 56%, and 49%, versus 65% for Google, and 77% for Snips.

A truly reliable AI

The results above show how current NLU solutions compare, when manually trained on a sample of 70 queries. They also show how these solutions stand, in absolute terms. The main take-away is that performances achieved in this setting are far from perfect. None of these solutions works more than 4 times out of 5.

There is a reason for that: it is hard to handle any possible formulation when you’ve only seen 70 examples of a given intent. This is why we have built a unique solution that leverages a combination of automated pattern generation, crowdsourcing, and automated validation to generate any amount of high-quality data in a few hours. This is a service we are offering to our enterprise clients.

For example, if the Snips NLU engine is trained on 2000 queries instead of 70, a massive increase in performance is observed:

F1-score (per intent, and averaged) for the Snips NLU engine trained on 2000 queries / intent.

Thanks to data generation, the F1-score is over 20% higher than the performance that can be achieved with alternative services, for which no such data-generation solution is available. Beyond the comparison between NLU providers, it is remarkable that under these conditions the extracted attributes are exact in 95% of cases (precision), and the ability to identify attributes (recall) is over 90%. We are thrilled to be working towards solutions for voice assistants that are not only private by design (our NLU engine can be embedded on devices as small as a Raspberry Pi Zero, while all alternatives run on a server) but also truly reliable.

To end this post on concrete illustrations, let’s have a look at three queries taken from the validation set, to understand how the different solutions perform. In the following simple example, all NLU engines perform correctly:

Here, Sacaton is confused with a country by Alexa and with a datetime by The latter understands that gluten is a restaurant type, along with Wit and Alexa. Luis, on the other hand, does not detect any attribute.

In this last example, we see that again, Luis does not detect any slot. confuses the best bistro with a restaurant name, and interprets the pronoun me as the state in which the user wants to make a reservation. Alexa, on the other hand, has the correct answer.

Both the benchmark data and the detailed performance of each provider can be found on GitHub. Contributions and feedback on this dataset are of course welcome, and would help bring greater transparency to the NLU industry. Feel free to contact us should you have any questions or comments.

Follow us on Twitter @alicecoucke, @jodureau, and @snips

If you want to work on AI + Privacy, check our jobs page!