The Evolution of Sound Recognition

Brandon Marin
Wavio
Dec 4, 2018 · 5 min read

It’s been a while since I last published a piece covering sound recognition. My apologies; the hustle and bustle took away my time to write about it.

Today, I’m going to write something quirky and educational…

… the evolution of sound recognition!

No, this is not your typical history lesson luring you into a deep slumber… I promise to keep it crash-course style!

You can refresh your mind by reading “What the heck is Sound Recognition?” to see what sound recognition looks like nowadays (practicing radical candor here… I need moar views).

Anyways, let’s start!

Sound waves

One of the earlier discoveries was when people realized machines could intercept sound waves (a.k.a. vibrations) to trigger simple actions. A simple example: adjusting the brightness of a light to match the level of incoming sound.
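To make that concrete, here is a toy sketch of how such a sound-reactive light might work. Everything here is illustrative: the function name, the brightness scale, and the use of RMS amplitude as a loudness proxy are my assumptions, not a description of any real device.

```python
import math

def amplitude_to_brightness(samples, max_brightness=255):
    """Map the loudness of a chunk of audio samples (floats in [-1, 1])
    to a brightness level, the way an early sound-reactive light might.
    Uses root-mean-square amplitude as a simple proxy for loudness."""
    if not samples:
        return 0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return min(max_brightness, int(rms * max_brightness))

# Silence stays dark; a loud chunk drives the light toward full brightness.
print(amplitude_to_brightness([0.0] * 100))       # → 0
print(amplitude_to_brightness([1.0, -1.0] * 50))  # → 255
```

No frequency analysis, no categories; just “more vibration, more light,” which is roughly where the technology started.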

Decibels

People, however, continued to question the phenomenon of sound waves: how could machines measure sound waves in real time, rather than merely sensing their presence? Soon thereafter, machines were able to convert sound waves into decibels.
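The conversion itself is a one-line formula: decibels are twenty times the base-10 log of the signal’s amplitude relative to a reference. A minimal sketch (the function name and the full-scale reference are my assumptions, in the spirit of a dBFS meter):

```python
import math

def rms_to_decibels(samples, ref=1.0):
    """Convert a chunk of audio samples to a decibel reading relative
    to a reference amplitude (here, digital full scale)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")  # digital silence has no defined dB level
    return 20 * math.log10(rms / ref)

print(round(rms_to_decibels([0.1] * 100), 1))  # → -20.0
```

A tenth of full-scale amplitude comes out as −20 dB, which matches the familiar rule that every factor of 10 in amplitude is 20 dB.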

Frequency

Heinrich Hertz’s discovery of frequency back in the 19th century served as a precursor to machines measuring and identifying the frequencies in sounds. As a result, machines could measure sound with higher accuracy through frequencies rather than loudness alone.
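In its most basic form, “identifying a frequency” means decomposing a signal and picking the strongest component. Here is a toy sketch using a naive discrete Fourier transform; real systems use an FFT, and the function name is mine:

```python
import math

def dominant_frequency(samples, sample_rate):
    """Estimate the strongest frequency in a signal with a naive DFT.
    Fine for short clips; production code would use an FFT instead."""
    n = len(samples)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):  # skip DC, stop below Nyquist
        re = sum(samples[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(samples[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mag = math.hypot(re, im)
        if mag > best_mag:
            best_bin, best_mag = k, mag
    return best_bin * sample_rate / n

# A pure 440 Hz tone sampled at 8 kHz should come back as ~440 Hz.
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(800)]
print(dominant_frequency(tone, sr))  # → 440.0
```

Once a machine can read off a number like 440 Hz, sounds stop being just “loud” or “quiet” and start having measurable identities.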

Categorizing sounds

People began to think about how to categorize sounds rather than simply comparing their frequencies. As a result, machines were trained to identify different sound categories (e.g. clinking, bang, increasing, increasing then decreasing…).
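A crude way to picture this kind of categorization is to look at a sound’s loudness envelope over time. The sketch below is purely illustrative: the rules, thresholds, and labels are my invention, loosely echoing the category names above, not any real taxonomy.

```python
def categorize_envelope(levels):
    """Toy categorizer for a sound's loudness envelope. `levels` is a
    list of per-frame loudness values; the rules below are illustrative."""
    peak = max(levels)
    if peak == 0:
        return "silence"
    peak_idx = levels.index(peak)
    if peak_idx == 0 and levels[-1] < 0.2 * peak:
        return "bang"  # sudden onset followed by a quick decay
    if levels[-1] >= levels[0] and peak_idx >= len(levels) - 2:
        return "increasing"
    if peak_idx not in (0, len(levels) - 1):
        return "increasing then decreasing"
    return "decreasing"

print(categorize_envelope([1.0, 0.5, 0.2, 0.05]))      # → bang
print(categorize_envelope([0.1, 0.3, 0.6, 1.0]))       # → increasing
print(categorize_envelope([0.1, 0.8, 1.0, 0.4, 0.1]))  # → increasing then decreasing
```

Hand-written rules like these were quickly outgrown, which is exactly why the story moves on to trained models next.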

Processing sounds on the cloud

You probably called this one, but cloud computing happened. The cloud allowed machines to ingest enormous amounts of data in order to process recorded sounds and identify them.

Manual Sound Recording

Interestingly enough, the cloud was a huge step forward, since the sheer volume of sound data had been a big obstacle for people trying to figure out how to train machines. However, it was a step back for consumers’ privacy. Fortunately, a few sound recognition startups emerged recently, including Wavio.ai, which solved this privacy issue with privacy-by-design solutions. Wavio made this possible with its first prototype: a hardware dev board that could be fed self-recorded sound data, then run locally to identify that same sound again.

In 2015, Wavio pushed sound recognition further by developing a training model that could be fed self-recorded data and installed as firmware on a plug-and-play device. The result was a locally run product that identified sounds by matching them against the sound data stored in the hardware, then triggered notifications or actions.

Data for making sound recognition better

With the rise of machine learning, the rule of thumb became: to improve accuracy in sound recognition, you need more data. But how do you feed a model 1,000 samples of a doorbell sound? People had to record the sound data themselves, which was quite the hassle. In addition, due to the massive amount of data, a device could only hold 1–3 sound models; beyond that, limited processing power made more impractical.

Next came AudioSet (thanks, Google’s Sound Understanding team!): a large-scale dataset of manually annotated audio events from YouTube, accessible to anyone. The dataset made it easier to train sound models off-device before importing them onto a device.

Machine Learning Models

As sound data became readily available, the next question was: how can we make machine learning models lightweight enough to squeeze huge amounts of valuable data into devices and process multiple sounds/sound models simultaneously? And once again, Google came to the rescue with TensorFlow. TensorFlow provides open-source machine learning models that are lightweight and highly effective. Judging by the confusion matrix, sound recognition accuracy could theoretically be pushed as high as 95% with sufficient sound data.
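A confusion matrix is how you read off that accuracy: correct predictions sit on the diagonal, mistakes everywhere else. A minimal sketch with made-up numbers (the three class names and the counts are entirely hypothetical):

```python
def accuracy_from_confusion(matrix):
    """Overall accuracy from a confusion matrix where rows are the true
    class and columns are the predicted class: diagonal / total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical 3-class sound model (doorbell / dog bark / siren),
# 100 test clips per class.
cm = [
    [96, 3, 1],   # doorbell: 96 right, 4 confused with other classes
    [2, 95, 3],   # dog bark
    [1, 4, 95],   # siren
]
print(accuracy_from_confusion(cm))  # ≈ 0.95 on this made-up data
```

The off-diagonal cells also tell you *which* sounds the model mixes up, which is exactly where more training data gets targeted.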

By combining TensorFlow, AudioSet and Wavio’s proprietary sound recognition algorithm, one could build and deploy a locally-run product identifying 500+ sounds.

What Wavio is doing nowadays represents the current stage of sound recognition technology.

Sound recognition down the road

At this pace, new research, discoveries, and inventions become stepping stones overnight. Sound recognition technology is moving so quickly that investors, businesses, and the general public are still confused about which use cases to focus on, or are still batshit crazy about voice recognition. Yep, I’m calling these people out.

When will we see the market catch up with the value of sound recognition? Or will it ever catch on, given that the sheer number of use cases could become overwhelming?

Well, now at least you know a bit more about sound recognition. It’s high time we pop a few cold ones and ride out the future of this emerging technology.

Cheers.

Thanks for learning. If you enjoyed this article, feel free to hit that clap button 👏 to help others learn more about sound recognition.

Be on the lookout for the next few pieces on sound recognition technology itself. Or email brandon@wavio.ai to say hi and chat!
