Posted by Aakanksha Chowdhery, Software Engineer
Why does “Yo Google” not work with Google Assistant? After all, it differs from “Ok Google” by only one word. The reason is that Google Assistant listens for two specific words, known as Wake Words. Wake Words are central to the design of low-power machine learning: a computationally inexpensive model processes the incoming data and “wakes up” the device for full processing only when needed. Audio wake words, such as “Ok Google”, are widely used to wake up AI assistant devices before they process speech using more computationally expensive machine learning models.
With the availability of low-power cameras, a popular application is to pair a vision sensor with a microcontroller to classify whether an image frame contains a person (or any other object of interest). We refer to this application as Visual Wake Words because it enables a device to wake up when a human is present, analogous to how audio wake words are used in speech recognition.
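The wake-up pattern described above can be sketched as a two-stage cascade: a cheap model scores every frame, and the expensive model runs only on frames that pass. This is a minimal illustration, not code from the challenge. The names `run_cascade`, `wake_model`, `full_model`, and the `threshold` value are hypothetical, and the two models are stand-in functions.

```python
from typing import Callable, Iterable, List, TypeVar

Frame = TypeVar("Frame")
Result = TypeVar("Result")


def run_cascade(
    frames: Iterable[Frame],
    wake_model: Callable[[Frame], float],
    full_model: Callable[[Frame], Result],
    threshold: float = 0.5,  # hypothetical decision threshold
) -> List[Result]:
    """Run the expensive model only on frames the cheap model flags."""
    results = []
    for frame in frames:
        # Stage 1: the inexpensive wake model sees every frame.
        if wake_model(frame) >= threshold:
            # Stage 2: full processing happens only after a wake-up.
            results.append(full_model(frame))
    return results


# Stand-in models: each "frame" is simply a made-up person score.
full_model_calls = []


def fake_wake_model(frame: float) -> float:
    return frame


def fake_full_model(frame: float) -> str:
    full_model_calls.append(frame)
    return f"processed {frame}"


outputs = run_cascade([0.1, 0.9, 0.4, 0.7], fake_wake_model, fake_full_model)
```

In this toy run the expensive stand-in model is invoked for two of the four frames, which is where the power saving in a real deployment would come from.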
Machine learning at the edge refers to models that run on-device, without connectivity to the cloud. As users deploy models on audio sensors or low-power cameras, microcontrollers provide a good compromise as a low-cost computational platform. However, available machine learning models rarely fit the constraints of microcontrollers (in terms of power consumption and processing power), as detailed in this paper.
At CVPR 2019, Google organized a new challenge, the “Visual Wake Words Challenge”, soliciting submissions of tiny vision models for microcontrollers. The task was to classify images into two classes (person/not-person), a popular use case for microcontrollers. Google open-sourced the Visual Wake Words Dataset, derived from the COCO dataset: label 1 corresponds to at least one person (or object of interest) being present in the image, and label 0 corresponds to the image containing no objects from the person class.
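The labeling rule can be sketched in a few lines. This is only an illustration of the rule as stated here, not the dataset's actual generation script. The function name, the in-memory annotation format, and the image IDs are hypothetical, and any additional filtering the real dataset may apply is not modeled.

```python
from typing import Dict, Iterable, List

PERSON = "person"  # COCO category name for the class of interest


def visual_wake_word_label(categories: Iterable[str], target: str = PERSON) -> int:
    """Return 1 if any annotated object belongs to `target`, otherwise 0."""
    return int(any(category == target for category in categories))


# Hypothetical annotations: image ID -> category names of annotated objects.
annotations: Dict[str, List[str]] = {
    "img_001": ["person", "dog"],
    "img_002": ["car", "traffic light"],
    "img_003": [],  # no annotated objects at all
}

labels = {
    image_id: visual_wake_word_label(categories)
    for image_id, categories in annotations.items()
}
```

Passing a different `target` gives the "object of interest" variant mentioned above.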
Machine learning for microcontroller devices requires rethinking the model design, accounting for new tradeoffs between memory footprint, accuracy, and compute cost. Typical microcontrollers have extremely limited on-chip memory (100–320 KB SRAM) and flash storage (256 KB–1 MB). The entire neural network model, with its weight parameters and code, has to fit within the small memory budget of flash storage. Further, the temporary memory buffer required to store the input and output activations during computation must not exceed the on-chip memory. In the Visual Wake Words Challenge, researchers designed the most accurate models they could within the microcontroller device constraints: a model size below 250 KB, peak memory usage below 250 KB, and an inference cost below 60 million multiply-adds per inference. The talk from the workshop is available on IEEEtv.
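The three challenge limits translate into a simple budget check. The sketch below is an assumption-laden illustration rather than the challenge's official evaluation code: the `ModelProfile` type and function names are hypothetical, 1 KB is taken to be 1024 bytes (the convention is not stated here), and the multiply-add count uses the standard formula for a dense 2D convolution.

```python
from dataclasses import dataclass
from typing import List

KB = 1024  # assumption: 1 KB = 1024 bytes

MAX_MODEL_SIZE_BYTES = 250 * KB
MAX_PEAK_MEMORY_BYTES = 250 * KB
MAX_MULTIPLY_ADDS = 60_000_000


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical summary of a candidate model's resource usage."""
    name: str
    model_size_bytes: int   # weights plus code stored in flash
    peak_memory_bytes: int  # largest activation buffer held in SRAM
    multiply_adds: int      # multiply-adds per inference


def conv2d_multiply_adds(out_h: int, out_w: int, kernel: int,
                         in_channels: int, out_channels: int) -> int:
    """Multiply-adds for one dense 2D convolution with a square kernel."""
    return out_h * out_w * kernel * kernel * in_channels * out_channels


def budget_violations(profile: ModelProfile) -> List[str]:
    """List every limit the model fails; each limit is a strict 'less than'."""
    violations = []
    if profile.model_size_bytes >= MAX_MODEL_SIZE_BYTES:
        violations.append("model size")
    if profile.peak_memory_bytes >= MAX_PEAK_MEMORY_BYTES:
        violations.append("peak memory")
    if profile.multiply_adds >= MAX_MULTIPLY_ADDS:
        violations.append("multiply-adds")
    return violations


# Made-up profiles for illustration only.
fits = ModelProfile("tiny", 200 * KB, 180 * KB, 55_000_000)
too_big = ModelProfile("big", 300 * KB, 100 * KB, 60_000_000)
first_layer_cost = conv2d_multiply_adds(48, 48, 3, 3, 8)
```

Summing `conv2d_multiply_adds` over all layers gives a rough estimate of whether an architecture is anywhere near the 60 million limit before any training is done.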
The submissions used optimization techniques such as model pruning and quantization (available from the TensorFlow Model Optimization toolkit) and neural architecture search algorithms to design tiny models that fit microcontroller device constraints. To deploy their models on device, users can leverage TensorFlow Lite Micro, the ML framework for microcontrollers from the TensorFlow team.
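To give a sense of why quantization helps with the flash budget, here is a minimal pure-Python sketch of 8-bit affine quantization, which stores each weight in one byte instead of the four bytes of a 32-bit float. It illustrates the general idea only. It is not the TensorFlow Model Optimization toolkit's API or the method any particular submission used, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float, int]:
    """Affine-quantize floats to int8. Returns (values, scale, zero_point)."""
    # Widen the range to include 0.0 so that zero stays exactly representable.
    low = min(min(weights), 0.0)
    high = max(max(weights), 0.0)
    # Fall back to a scale of 1.0 when every weight is zero.
    scale = (high - low) / 255.0 or 1.0
    zero_point = int(round(-128 - low / scale))
    quantized = [
        max(-128, min(127, int(round(w / scale)) + zero_point))
        for w in weights
    ]
    return quantized, scale, zero_point


def dequantize(quantized: List[int], scale: float, zero_point: int) -> List[float]:
    """Map int8 values back to approximate floating-point weights."""
    return [(q - zero_point) * scale for q in quantized]


# Made-up weights for illustration only.
weights = [-0.82, -0.1, 0.0, 0.37, 1.25]
quantized, scale, zero_point = quantize_int8(weights)
restored = dequantize(quantized, scale, zero_point)
max_error = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
```

The reconstruction error per weight is bounded by roughly one quantization step (`scale`), which is the accuracy cost paid for the fourfold reduction in weight storage.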
The challenge generated a lot of excitement in the research community and received submissions from ARM, Samsung, Qualcomm, MIT, Berkeley, the University of Oxford, and others. On the Visual Wake Words dataset, the top-scoring entries, from MIT and Qualcomm, achieved classification accuracies of 94.5% and 95% in the challenge's two categories: deployable today and deployable in next-generation ML frameworks.
To download the Visual Wake Words dataset and train a model yourself, you can walk through the following tutorial.
Thank you to everyone who worked on this release: Aakanksha Chowdhery, Daniel Situnayake, Pete Warden as well as the following colleagues for their guidance and advice: Jon Shlens, Andrew Howard, Rocky Rhodes, Nat Jeffres, Bo Chen, Mark Sandler, Meghna Natraj, Andrew Selle, Jared Duke.