Deep dive into Snips Spoken Language Understanding Embedded System

Joseph Dureau
Published in Snips Blog
5 min read · May 31, 2018

How we achieved a private and efficient cloud-independent voice interface.

In this blog post, we introduce our machine learning team’s recent article on the architecture of the Spoken Language Understanding (SLU) system embedded into the Snips Voice Platform. This embedded inference solution is fast and accurate while enforcing Privacy by Design: no personal user data is ever collected. The resulting SLU engine runs entirely offline, has a small footprint, and executes quickly, making it a fit for deployment on small devices. The aim of the article is to contribute to the collective effort towards ever more private and efficient cloud-independent voice interfaces.

The Voice Interface Landscape Today

Over the last several years, thanks in part to steady improvements brought by deep learning approaches to speech recognition, voice interfaces have greatly evolved. They have become much more reliable, with state-of-the-art speech recognition engines reaching human-level performance in English. This achievement has unlocked many practical applications of voice assistants, which are now used in many fields, from customer support to autonomous cars and smart homes. In particular, smart speaker adoption by the public is on the rise, with a recent study showing that nearly 20% of U.S. adults reported having a smart speaker at home.

However, these recent developments raise questions about user privacy, especially since uniquely identifying a speaker from their voice is an active field of research, making voice a sensitive biometric feature. The CNIL (French Data Protection Authority) advises owners of connected speakers to switch off the microphone when possible and to warn guests of the presence of such a device in their home. The General Data Protection Regulation, which harmonizes data privacy laws across the European Union, now requires companies to ask for explicit consent before collecting user data.

A Private-by-Design Embedded Platform

The Privacy by Design principle sets privacy as the default standard in the design and engineering of a system. In the context of voice assistants that can be deployed anywhere, including users’ homes, this principle calls for a strong interpretation to protect users against any future misuse of their private data. We at Snips define Private-by-Design as a system that does not transfer user data to any remote location, such as cloud servers.

Within the Snips ecosystem, the SLU components are trained on servers, but the inference happens directly on the device once the assistant has been deployed — no data from the user is ever collected or stored. This design choice adds engineering complexity, as most IoT devices run on specific hardware with limited memory and computing power. Cross-platform support is also a requirement in the IoT industry, since IoT devices are powered by many different hardware boards, with sustained innovation in that field.

For these reasons, the Snips Voice Platform has been built with portability and footprint in mind.

The Snips Voice Platform embedded inference runs on common IoT hardware as light as the Raspberry Pi 3, a popular choice among developers. Other Linux boards are also supported; the Snips SDK for Android works with Android 5 devices with an ARM CPU, and the iOS SDK targets iOS 11 and newer. For efficiency and portability reasons, the algorithms have, whenever needed, been re-implemented in Rust, a modern programming language offering high performance, low memory overhead, and cross-compilation.
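As a rough illustration of what cross-platform Rust looks like in practice (this snippet is not taken from the Snips codebase), conditional compilation lets a single crate build for Linux boards, Android, and iOS from the same source tree:

```rust
// Illustrative sketch only, not Snips code: Rust's cfg attributes select
// platform-specific code paths at compile time, so one crate can cover
// Linux boards, Android, and iOS.
#[cfg(target_os = "linux")]
fn platform_name() -> &'static str {
    "Linux board (e.g. Raspberry Pi 3)"
}

#[cfg(target_os = "android")]
fn platform_name() -> &'static str {
    "Android (ARM)"
}

#[cfg(target_os = "ios")]
fn platform_name() -> &'static str {
    "iOS 11+"
}

#[cfg(not(any(target_os = "linux", target_os = "android", target_os = "ios")))]
fn platform_name() -> &'static str {
    "other platform"
}

fn main() {
    // Targeting a new board is mostly a matter of building against the right
    // target triple, e.g. `cargo build --release --target armv7-unknown-linux-gnueabihf`.
    println!("Running SLU inference on: {}", platform_name());
}
```

Supporting a new board then comes down to cross-compiling for its target triple rather than maintaining a separate port.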

Challenges of SLU Engines

The Snips ecosystem comprises a web console to build voice assistants and train the corresponding Spoken Language Understanding (SLU) engine, made of an Automatic Speech Recognition (ASR) engine and a Natural Language Understanding (NLU) engine.

The ASR engine translates a spoken utterance into text through an Acoustic Model, which maps raw audio to a phonetic representation, and a Language Model (LM), which maps this phonetic representation to text. The NLU engine then extracts the intent and slots from the decoded query. The LM and the NLU have to be mutually consistent in order to optimize the accuracy of the SLU engine.

The figure below describes the building blocks of this SLU pipeline.

Transforming sound into meaning
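To make the data flow concrete, here is a minimal Rust sketch of how these stages compose. The trait and type names (AcousticModel, LanguageModel, NluEngine, IntentParse) are illustrative placeholders, not the actual Snips SDK API:

```rust
// Illustrative placeholders only, not the Snips SDK API: the point is the
// shape of the data flow, which runs entirely on the device.

/// A slot extracted by the NLU, e.g. a hypothetical "room" slot with value "kitchen".
struct Slot {
    name: String,
    value: String,
}

/// The final output of the SLU engine: an intent plus its slots.
struct IntentParse {
    intent: String,
    slots: Vec<Slot>,
}

/// Maps raw audio to a phonetic representation.
trait AcousticModel {
    fn phonemes(&self, audio: &[i16]) -> Vec<String>;
}

/// Maps a phonetic representation to the most likely text.
trait LanguageModel {
    fn decode(&self, phonemes: &[String]) -> String;
}

/// Extracts the intent and slots from the decoded query.
trait NluEngine {
    fn parse(&self, query: &str) -> IntentParse;
}

/// End-to-end inference: audio -> phonemes -> text -> intent + slots.
fn understand(
    am: &dyn AcousticModel,
    lm: &dyn LanguageModel,
    nlu: &dyn NluEngine,
    audio: &[i16],
) -> IntentParse {
    let phonemes = am.phonemes(audio);
    let query = lm.decode(&phonemes);
    nlu.parse(&query)
}
```

Because every stage in this sketch operates on local data only, the full audio-to-intent path can execute on the device without any network call.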

ASR engines relying on large deep learning models have improved dramatically over the past few years. Yet, they still have a major drawback today:

The size of these models, along with the computational resources necessary to run them in real-time, makes them unfit for deployment on small devices, so that solutions implementing them are bound to rely on the cloud for speech recognition.

Our Recent Article on Building a Reliable Embedded SLU Engine

Enforcing Privacy by Design implies developing new solutions to build reliable SLU engines that are constrained in size and computational requirements. We detail these solutions in our recent paper, “Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces”.

This article covers the following aspects:

  • the challenges of embedding high-performance machine learning models on small IoT devices;
  • how small-sized neural networks can be trained to yield near state-of-the-art accuracy while running in real-time on small devices;
  • how to train the language model of the ASR and the NLU in a consistent way, efficiently specializing them to a particular use case;
  • the high generalization accuracy of the SLU engine in the context of real-world voice assistants;
  • a data generation procedure to automatically create training sets through a combination of machine learning and crowdsourcing, providing sufficient and high-quality training data without compromising user privacy.

Focusing on Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU), the article presents Snips’ approach to training high-performance machine learning models that are small enough to run in real-time on small devices, while maintaining compliance with the Privacy by Design principle. As a result, assistants created through the Snips Voice Platform never send user queries to the cloud. You can read the full article here. Don’t hesitate to ask any questions you may have!

If you liked this article and want to support Snips, please share it!

Follow us on Twitter: @jodureau and @snips.

If you want to work on AI + Privacy, check our jobs page!
