Visual Wake Words with TensorFlow Lite Micro

Oct 31, 2019

Posted by Aakanksha Chowdhery, Software Engineer

Why does “Yo Google” not work with Google Assistant? After all, it differs by only one word from the phrase “Ok Google”. It’s because Google Assistant is listening for two specific words, known as wake words. Wake words are critical to the design of low-power machine learning: a computationally inexpensive model processes incoming data continuously and “wakes up” the device for full processing only when needed. Audio wake words, such as “Ok Google”, are widely used to wake up AI assistant devices before they process speech using more computationally expensive machine learning models.

With the availability of low-power cameras, a popular application is to pair a vision sensor with a microcontroller to classify whether an image frame contains a person (or any object of interest). We refer to this application as Visual Wake Words because it enables a device to wake up when a human is present, analogous to how audio wake words are used in speech recognition.
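The wake-up pattern described above can be sketched as a two-stage cascade. This is only an illustration: the function names, the 0.5 threshold, and the toy "frames" are assumptions made for the example, not part of any real device API.

```python
from typing import Callable, Iterable, List


def run_wake_cascade(
    frames: Iterable[dict],
    tiny_detector: Callable[[dict], float],
    full_pipeline: Callable[[dict], str],
    threshold: float = 0.5,  # hypothetical decision threshold
) -> List[str]:
    """Run the cheap detector on every frame; wake the expensive stage only on hits."""
    results = []
    for frame in frames:
        score = tiny_detector(frame)  # cheap, always-on check
        if score >= threshold:
            results.append(full_pipeline(frame))  # expensive, runs rarely
    return results


# Toy stand-ins: each "frame" carries a precomputed person score.
frames = [
    {"id": 0, "person_score": 0.1},
    {"id": 1, "person_score": 0.9},
    {"id": 2, "person_score": 0.4},
]
woken = run_wake_cascade(
    frames,
    tiny_detector=lambda f: f["person_score"],
    full_pipeline=lambda f: f"processed frame {f['id']}",
)
print(woken)  # only frame 1 crosses the threshold
```

The tiny detector runs on every frame, so its cost dominates the power budget; the expensive stage only matters when something is actually detected.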

Machine learning at the edge refers to models that run on-device, without connectivity to the cloud. As users deploy models on audio sensors or low-power cameras, microcontrollers provide a good compromise as a low-cost computational platform. However, few available machine learning models fit the constraints of microcontrollers (in terms of power consumption and processing power), as detailed in this paper.

At CVPR 2019, Google organized a new challenge, the “Visual Wake Words Challenge”, soliciting submissions of tiny vision models for microcontrollers. The task was to classify images into two classes (person/not-person), a popular use case for microcontrollers. Google open-sourced the Visual Wake Words dataset, derived from the COCO dataset: label 1 corresponds to at least one person (or object of interest) being present in the image, and label 0 corresponds to the image containing no objects from the person class.
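As a rough illustration of the labeling rule, the sketch below derives a binary label from per-image object annotations. The annotation format here (a list of dicts with a "category" string) is a simplified, hypothetical stand-in for real COCO annotations, and the actual dataset-generation script may apply additional filtering that this example does not model.

```python
from typing import Dict, List

PERSON_CATEGORY = "person"  # object of interest; could be swapped for another class


def visual_wake_word_label(annotations: List[Dict[str, str]]) -> int:
    """Return 1 if at least one annotation is a person, otherwise 0."""
    return int(any(a["category"] == PERSON_CATEGORY for a in annotations))


# Hypothetical images with simplified annotations.
images = {
    "img_a": [{"category": "dog"}, {"category": "person"}],
    "img_b": [{"category": "car"}],
    "img_c": [],  # no annotated objects at all
}
labels = {name: visual_wake_word_label(anns) for name, anns in images.items()}
print(labels)  # {'img_a': 1, 'img_b': 0, 'img_c': 0}
```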

Visual Wake Words model to detect the presence of a person.

Machine learning for microcontroller devices requires rethinking model design to account for new tradeoffs between memory footprint, accuracy, and compute cost. Typical microcontrollers have extremely limited on-chip memory (100–320 KB SRAM) and flash storage (256 KB–1 MB). The entire neural network model, including its weight parameters and code, has to fit within the small budget of flash storage. Further, the temporary memory buffer required to store the input and output activations during computation must not exceed the on-chip memory. In the Visual Wake Words Challenge, researchers designed the most accurate models they could within the microcontroller constraints: a model size of less than 250 KB, peak memory usage of less than 250 KB, and an inference cost of less than 60 million multiply-adds per inference. The talk from the workshop is available on IEEEtv.
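The three challenge limits can be checked with simple arithmetic. The sketch below is a minimal illustration under stated assumptions: 1 KB is taken as 1024 bytes, the network is a plain sequence of layers, peak memory is approximated as the largest input-plus-output activation pair, and only weights (not code) are counted toward model size. The Layer type and the toy numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

KB = 1024  # assumption: 1 KB = 1024 bytes

MAX_MODEL_BYTES = 250 * KB   # model size limit
MAX_PEAK_BYTES = 250 * KB    # peak memory limit
MAX_MACS = 60_000_000        # multiply-adds per inference


@dataclass
class Layer:
    input_bytes: int    # size of the input activation buffer
    output_bytes: int   # size of the output activation buffer
    weight_bytes: int   # size of the layer's parameters
    macs: int           # multiply-adds needed to run the layer


def profile(layers: List[Layer]) -> Tuple[int, int, int]:
    """Return (model bytes, approximate peak activation bytes, total multiply-adds)."""
    model_bytes = sum(l.weight_bytes for l in layers)
    peak_bytes = max(l.input_bytes + l.output_bytes for l in layers)
    total_macs = sum(l.macs for l in layers)
    return model_bytes, peak_bytes, total_macs


def fits_challenge_budget(layers: List[Layer]) -> bool:
    model_bytes, peak_bytes, total_macs = profile(layers)
    return (model_bytes < MAX_MODEL_BYTES
            and peak_bytes < MAX_PEAK_BYTES
            and total_macs < MAX_MACS)


# Toy int8 network on a 96x96x1 input (numbers chosen for illustration only).
toy = [
    # 3x3 conv, stride 2, 8 filters: 96x96x1 -> 48x48x8
    Layer(input_bytes=96 * 96 * 1, output_bytes=48 * 48 * 8,
          weight_bytes=3 * 3 * 1 * 8, macs=48 * 48 * 8 * 9),
    # 3x3 conv, stride 2, 16 filters: 48x48x8 -> 24x24x16
    Layer(input_bytes=48 * 48 * 8, output_bytes=24 * 24 * 16,
          weight_bytes=3 * 3 * 8 * 16, macs=24 * 24 * 16 * 72),
]
print(profile(toy))                # (1224, 27648, 829440)
print(fits_challenge_budget(toy))  # True
```

Real frameworks plan buffers more carefully than this, so the peak figure here is only a first-order estimate.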

Typical microcontrollers have extremely limited on-chip memory and flash storage. For example, the SparkFun Edge development board has 384 KB of RAM and 1 MB of flash.

The submissions used optimization techniques such as model pruning and quantization (available in the TensorFlow Model Optimization Toolkit), along with neural architecture search algorithms, to design tiny models that fit microcontroller constraints. To deploy a model on device, users can leverage TensorFlow Lite Micro, the ML framework for microcontrollers from the TensorFlow team.

The challenge generated a lot of excitement in the research community and received submissions from ARM, Samsung, Qualcomm, MIT, Berkeley, the University of Oxford, and others. On the Visual Wake Words dataset, top-scoring entries from MIT and Qualcomm achieved classification accuracies of 94.5% and 95% in the two categories: deployable today and deployable in next-generation ML frameworks.

One of the winning teams, from MIT, recently released a demo of its implementation.

To download the Visual Wake Words dataset and train a model yourself, you can walk through the following tutorial.

Thank you to everyone who worked on this release: Aakanksha Chowdhery, Daniel Situnayake, Pete Warden as well as the following colleagues for their guidance and advice: Jon Shlens, Andrew Howard, Rocky Rhodes, Nat Jeffres, Bo Chen, Mark Sandler, Meghna Natraj, Andrew Selle, Jared Duke.

