Snips brings Cloud-Level Spoken Language Understanding to the Edge

Over the summer of 2017, Google and Microsoft raced to claim a major milestone in the history of Human-Machine Interfaces: their respective Speech Recognition engines had reached human-level accuracy. Their cloud-based solutions offered accuracy better than, or nearly equivalent to, that of professional transcribers on a reference academic dataset of informal phone conversations.

Less than 18 months later, the adoption of voice interfaces is overwhelming. It is now ordinary to have a cloud-powered microphone in our direct environment, whether in our pocket, home or vehicle. Naturally, concerns are rising about the privacy and security risks associated with such centralized, ubiquitous voice interfaces. These concerns have motivated further research into alternative Speech Recognition solutions that do not run in the cloud. At Snips, we are happy to announce that we have reached a new milestone: on voice interface use cases, cloud-level Spoken Language Understanding is now possible on the edge.

We base this statement on performance metrics tailored to what voice interfaces are used for on the edge. Rather than focusing on the word error rate, a metric that penalises missing “the” just as much as missing “kitchen” in a query like “Turn on the lights in the kitchen”, we focus on complete Spoken Language Understanding solutions. These systems combine Speech Recognition with Natural Language Understanding to extract the intention of the user, as well as the parameters of this intention (a.k.a. slots). The end-to-end target metric we use is the success rate in understanding speech, i.e. the proportion of voice queries for which both the intent and the slots have been correctly understood.
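To make the difference between the two metrics concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the benchmark code: the `Parse` class, the `TurnOnLights` intent and the `room` slot are hypothetical names, and the NLU outputs are written by hand rather than produced by a real engine.

```python
from dataclasses import dataclass, field


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # prev[j] holds the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(substitution, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1] / len(ref)


@dataclass
class Parse:
    """Hypothetical container for an NLU result: one intent plus its slots."""
    intent: str
    slots: dict = field(default_factory=dict)


def success_rate(expected: list, predicted: list) -> float:
    """Proportion of queries whose intent and slots are both correct."""
    hits = sum(e == p for e, p in zip(expected, predicted))
    return hits / len(expected)


reference = "turn on the lights in the kitchen"
missing_the = "turn on lights in the kitchen"
missing_kitchen = "turn on the lights in the"

# Both transcripts drop exactly one word, so their word error rates are equal.
same_wer = word_error_rate(reference, missing_the) == word_error_rate(reference, missing_kitchen)
print(same_wer)

# Hand-written NLU outputs (assumption): losing "the" keeps the room slot,
# losing "kitchen" drops it, so only the first query counts as understood.
target = Parse("TurnOnLights", {"room": "kitchen"})
predicted = [Parse("TurnOnLights", {"room": "kitchen"}), Parse("TurnOnLights")]
end_to_end = success_rate([target, target], predicted)
print(end_to_end)
```

The word error rate cannot tell the two transcription errors apart, whereas the end-to-end success rate only penalises the one that changes what the assistant would do.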

To assess the performance of different solutions on voice interface tasks, our Machine Learning team released a reference dataset focused on two of the most typical use cases: smart lights 💡, and music 🎵.

Smart lights 💡 are probably the use case for which sending and processing data in the cloud makes the least sense. Basically, you are sending your voice to a distant server, possibly thousands of miles away, to turn on a light bulb you could reach by hand. Something that simple should always be handled locally.

Music 🎵, on the other hand, is both the most common and one of the most complex use cases for voice interfaces. The difficulty arises from the unlimited creativity in naming artists, tracks and albums. The number of unique words used in music streaming services’ catalogs is significantly larger than the size of the English dictionary. In addition, many of these words originate from languages foreign to the user, which means several potentially accented pronunciations for each word need to be supported. It follows that demonstrating that Snips can achieve cloud-level accuracy on the edge on the music use case fundamentally disproves the widely-held assumption that the cloud is necessary for voice interfaces.

For both domains, we crowdsourced, curated, and re-recorded 1,500 different voice queries in far-field conditions. The resulting dataset is shared, in the interest of reproducibility and with the hope that it will prove useful to the Spoken Language Understanding research community.

In what follows, we present some of the main results on how the embedded Snips Voice Platform fares compared with cloud-based solutions. More details regarding the methodology, and how to architect an embedded, private-by-design Spoken Language Understanding system, can be found in an article we recently published.

Cloud-level performance on a Raspberry Pi 3

At Snips, one of our goals is to make the embedded Snips Voice Platform as widely accessible as possible. This is why, when we refer to the edge, we don’t think of high-end graphics processing units (“GPUs”) that could only fit into the bill of materials of an expensive car. Instead, the reference hardware we take is one of the most mundane of current single-board computers: the Raspberry Pi 3.

The Raspberry Pi 3B+ is a quad-core Cortex-A53 at 1.4GHz, with 1GB of RAM. This is as much RAM as there was in an iPhone 5 (2012), and as much computing power as there was in an iPhone 4S (2011). We are also looking into (much) smaller devices, but let’s keep the focus on this one for now. In short, the way we optimize performance on this type of hardware is by:

  • optimizing the trade-off between accuracy, memory footprint and computational efficiency when training the acoustic model of the Speech Recognition engine. This is the component that translates sound into probabilities over phonemes. We look for the optimal phonetic resolution for this component to minimize footprint while not sacrificing performance.
  • specializing the Language Model and Natural Language Understanding components to the domain of the assistant, in order both to reduce their size and to increase their in-domain accuracy. This means that for every use case, we specialize our models on broad, tailored datasets, as opposed to cloud solutions, which rely on a “one-size-fits-all” approach and aim for vocabularies that are as large as possible.

These aspects are described in further detail here.

Below is the performance of the embedded Snips Voice Platform, specialized on each of the voice interface domains, compared to the equivalent cloud-based Google services: Google Speech-to-Text, combined with Google Dialogflow. Dialogflow is trained on exactly the same datasets as the Snips NLU counterpart. We used the service’s built-in slots and features whenever possible in the interest of fairness. Performance is measured on a combination of close and far-field recordings. Metrics are normalized with respect to the performance obtained by professional transcribers on each use case.
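The normalization step can be sketched in a few lines of Python. This is a hedged illustration of one plausible reading (dividing a system’s success rate by the human baseline on the same use case); the function name and the numbers are assumptions made for the example and are not taken from the benchmark.

```python
def normalized_success_rate(system_rate: float, human_rate: float) -> float:
    """Express a system's success rate as a fraction of the human baseline."""
    if not 0.0 < human_rate <= 1.0:
        raise ValueError("human_rate must be in (0, 1]")
    if not 0.0 <= system_rate <= 1.0:
        raise ValueError("system_rate must be in [0, 1]")
    return system_rate / human_rate


# Made-up rates, for illustration only: a system at 72% against humans at 90%.
score = normalized_success_rate(system_rate=0.72, human_rate=0.90)
print(f"{score:.2f}")
```

With this convention, a score of 1.0 means the system understands as many queries as the professional transcribers do, and a score above 1.0 means it understands more.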

Both on a small vocabulary domain like smart lights 💡, and on a large vocabulary domain such as music 🎵, the performance of the Snips Voice Platform is close to human level, and higher than or on par with Google’s cloud-based services.

These results are obtained on a Raspberry Pi 3, with Speech Recognition running faster than real time, while the audio signal is being captured. At the end of the query, the Natural Language Understanding step runs in less than 60ms, which is typically less than a round trip to the cloud. Similar or better performance can be obtained with industrial alternatives to the Raspberry Pi, such as the i.MX8 series of application processors.
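“Faster than real time” is usually quantified with a real-time factor: the compute time divided by the duration of the audio, which must stay below 1. The sketch below shows that ratio and a simple latency check against the 60ms figure. The helper names are hypothetical, the timings are illustrative, and the timed callable is a stand-in, not a call into the Snips platform.

```python
import time
from typing import Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Ratio of compute time to audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def timed_ms(step: Callable[[], object]) -> float:
    """Wall-clock duration of one call to `step`, in milliseconds."""
    start = time.perf_counter()
    step()
    return (time.perf_counter() - start) * 1000.0


# Illustrative numbers only: 1.8 s of compute for a 3 s voice query.
rtf = real_time_factor(processing_seconds=1.8, audio_seconds=3.0)
print(f"real-time factor: {rtf:.2f}")

# Stand-in workload (assumption); a real NLU call would be timed here instead.
nlu_ms = timed_ms(lambda: sum(range(10_000)))
print(f"within the 60 ms budget: {nlu_ms < 60.0}")
```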

No silver bullet in Speech — the need for specialization

Let’s dig a little deeper into the results of the music use case. Our test dataset can be segmented into three Tiers, spanning different levels of popularity among the 10,000 most popular artists according to a public ranking of Spotify charts from the same week these benchmarks were run. The first Tier corresponds to the most popular artists, ranked between 1 and 1,000 (“Tier 1 Artists”). The second Tier comprises artists ranked between 4,500 and 5,500 (“Tier 2 Artists”). The third Tier contains artists ranked between 9,000 and 10,000 (“Tier 3 Artists”).
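The segmentation above can be written as a small lookup. This is a sketch for illustration: the rank ranges come from the text, while treating both bounds as inclusive and returning `None` for ranks that fall between Tiers are assumptions.

```python
from typing import Optional

# (tier name, lowest rank, highest rank); bounds assumed inclusive.
TIERS = [
    ("Tier 1", 1, 1_000),
    ("Tier 2", 4_500, 5_500),
    ("Tier 3", 9_000, 10_000),
]


def tier_for_rank(rank: int) -> Optional[str]:
    """Return the Tier containing this popularity rank, or None if it is in a gap."""
    for name, low, high in TIERS:
        if low <= rank <= high:
            return name
    return None


for rank in (1, 2_000, 5_000, 9_500):
    print(rank, tier_for_rank(rank))
```

Note that the three Tiers sample the ranking rather than partition it: ranks such as 2,000 or 7,000 belong to no Tier.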

Here’s how the Snips Voice Platform fares in comparison to (i) Google Speech-to-Text and (ii) professional transcribers on these different Tiers. Numbers are normalized with respect to the performance of the professional transcribers on the most popular Tier.

This experiment shows that Google Speech-to-Text’s performance decreases quickly as the artists’ popularity diminishes in the charts. On Tier 2 and Tier 3, Google Speech-to-Text’s success rate is less than 50% of what human transcribers achieve on Tier 1. Since the transcribers’ own success rate cannot exceed 100%, this implies an absolute failure rate of more than 50% on these Tiers.

Google Speech-to-Text is probably the reference large vocabulary Speech Recognition engine, yet it fails to live up to expectations on this common, real-life use case, confirming that there is no silver bullet in Speech Recognition. This is despite the fact that, according to the documentation, it supports more than ten times as many proper nouns as there are words in the entire Oxford English Dictionary.

The Snips Voice Platform follows a radically different, specialized logic. It is systematically and automatically specialized for the domain at hand. For music, its vocabulary is tailored to the 10,000 most popular artists on Spotify, as well as to over 60,000 track names and 100,000 album names, which cumulatively involve 178,000 word pronunciations.

Specialization ensures a high and consistent level of support across the board, offering a better experience on the music 🎵 use case than Google Speech-to-Text.

Importantly, the 10,000 artist names used in this benchmark represent only a fraction of the 75,000 artists that generate 95% of the traffic on Spotify, and of the over 2 million total artists available on the main streaming platforms. To meet the challenge of covering all available artists, the embedded Snips Voice Platform offers the possibility to dynamically, and locally, extend the vocabulary of the Spoken Language Understanding engine. This means that all of a user’s favourite music, however original or obscure, will be supported with the same level of accuracy. As tastes evolve, and a user keeps discovering new artists, tracks or albums, regular updates of the vocabulary can simply be run on-device to keep their voice interface current and relevant. Context awareness is the key.
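As a toy illustration of what extending a vocabulary locally involves, the sketch below tracks the set of words covered by a catalog and reports which words are new when entries are added. `DomainVocabulary` and `words_in` are hypothetical names and not part of the Snips API; a real engine would also need a pronunciation for each new word and an updated language model, which this sketch does not attempt.

```python
import re


def words_in(name: str) -> set:
    """Lowercased word tokens of an artist, track or album name."""
    return set(re.findall(r"[\w']+", name.lower()))


class DomainVocabulary:
    """Hypothetical on-device word list for one voice interface domain."""

    def __init__(self, entries=()):
        self.words = set()
        self.extend(entries)

    def extend(self, entries) -> set:
        """Add entity names and return the words that were not covered before."""
        new_words = set()
        for entry in entries:
            new_words |= words_in(entry) - self.words
        self.words |= new_words
        return new_words


vocabulary = DomainVocabulary(["Daft Punk", "Random Access Memories"])

# Later, the user discovers a new artist: only the unseen words are returned.
added = vocabulary.extend(["Daft Punk", "Sigur Rós"])
print(sorted(added))
```

Because the update only touches the words that are actually new, it stays small enough to run on the device itself, with no catalog data leaving it.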

The road ahead

With Speech Recognition engines reaching human level in 2017, it is only natural that the performance achieved on lighter hardware is now catching up, to the point where cloud-level Spoken Language Understanding is possible on the edge, on common IoT hardware. Even for the most challenging use cases, voice interfaces can run on the edge without compromising on accuracy.

That being said, many challenges still lie ahead. First of all, the benchmarks show there is still room for improvement to reach human-level performance on such problems. More generally, far-field and noisy conditions, as well as accents, still pose difficulties for all current Speech technologies. Errors are still not rare enough, regardless of the system.

At Snips, we strongly believe that a better use of context is the key to reaching the next level in the performance of Speech Recognition technology. So far, we have shown that specializing the Spoken Language Understanding engine to the domain of the voice interface can bring a significant improvement in performance, as illustrated on the music use case. Further improvements will come with context-augmented specialization, for example through speaker recognition, cleaner separation between a user’s voice and background noise, and dynamic adaptation to the user’s vocabulary. And in this quest for more context awareness, privacy will remain an increasingly important concern.

Stay tuned for more progress by the Snips team on these fronts! 😉🤖

Follow us on Twitter jodureau and snips.