Simple Speech Keyword Detecting with Depthwise Separable Convolutions

Chengwei Zhang
Jun 27, 2018

Keyword detection, or speech command recognition, can be viewed as a minimal version of a speech recognition system. What if we could build a model that is accurate yet has a memory and computational footprint small enough to run in real time, even on a bare-metal microcontroller (one without an operating system)? If that becomes real, imagine how much smarter traditional consumer electronic devices could become with always-on speech commands enabled.

In this post, we will take the first step: building and training such a deep learning model for keyword detection with limited memory and compute resources in mind.
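To give a rough sense of why the depthwise separable convolutions named in the title suit a tight memory budget, here is a minimal sketch that counts the weights in a standard convolution and in its depthwise separable counterpart. The layer sizes (3x3 kernel, 64 input and 64 output channels) and the function names are assumptions chosen for illustration, not the layers of the model built in this post.

```python
def standard_conv_params(k_h, k_w, c_in, c_out):
    """Weights in a standard 2D convolution (bias omitted)."""
    return k_h * k_w * c_in * c_out


def depthwise_separable_params(k_h, k_w, c_in, c_out):
    """Weights in a depthwise convolution (one k_h x k_w filter per
    input channel) followed by a 1x1 pointwise convolution (bias omitted)."""
    depthwise = k_h * k_w * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer sizes, for illustration only.
std = standard_conv_params(3, 3, 64, 64)
sep = depthwise_separable_params(3, 3, 64, 64)
print(std, sep, round(std / sep, 1))  # 36864 4672 7.9
```

For this hypothetical layer, the separable version needs roughly one eighth of the weights, which is the kind of saving that matters on a microcontroller.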

Keyword detection system

Compared to a full speech recognition system, which is typically cloud-based and can recognize almost any spoken word, a keyword detection system is “always on” and detects only predefined keywords such as “Alexa”, “Ok Google”, “Hey Siri”, etc. Detecting a keyword triggers a specific action, such as activating the full-scale speech recognition system. In other use cases, such keywords can be used to activate a device directly, for example a voice-enabled lightbulb.

A keyword detection system consists of two essential parts.

  1. A feature extractor to convert an audio clip from time domain waveform to…
