What is music recognition software and how does it work?

Simon Lee
4 min read · Jan 8, 2019

Have you ever heard a song and wondered what it was and who sang it? With audio recognition software, you can identify songs, melodies, advertisements and even movies at the press of a button.

Shazam Entertainment Limited was founded by Chris Barton and Philip Inghelbrecht in 1999 and launched its music recognition service, Shazam, on mobile phones in 2002. Users had to dial in and hold their phone up to the music for about 30 seconds, then received the result as a text message. Only in 2008 did Shazam become a smartphone app, on the App Store that arrived with iPhone OS 2.0. Shazam's user base has grown steadily over the years, and the company formed partnerships with Spotify and Apple before being acquired by Apple in September 2018 for a reported $400 million.

— What are the challenges with identifying a song for a computer?

Humans do not recognize a song by comparing every bit we hear to a memorized version. Instead, we recognize specific chords in succession that trigger our memory. Computers can only compare data literally and have no way of implicitly recognizing patterns as easily. We as engineers need to define and quantify these patterns so that the computer can match them. This is where spectrograms and audio fingerprints come in handy.

— What are spectrograms and audio fingerprints?

Spectrograms are visual graphs of sound, with time along the x axis, frequency along the y axis, and a color gradient to represent the amplitude of each frequency. One can take two spectrograms, one of a live recording and the other from a database of songs, and compare them to see if they match. If they do, you have identified the song just from its spectrogram. The problem is scale: a full spectrogram holds an amplitude value for every frequency at every moment in time, so storing one for every song and comparing a recording against each and every one of them in a database would be impractical.
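To make the idea concrete, here is a minimal sketch of how a spectrogram can be computed from raw samples. It is not Shazam's code: the function name `spectrogram`, the frame size, the hop size, and the 8 kHz sample rate are all assumptions chosen for illustration, and a real system would use an optimized FFT library instead of this slow, plain-Python transform.

```python
import cmath
import math

SAMPLE_RATE = 8000   # assumed sample rate in Hz, for illustration only
FRAME_SIZE = 64      # assumed number of samples per time slice


def spectrogram(samples, frame_size=FRAME_SIZE, hop=32):
    """Return one list of frequency-bin magnitudes per time frame."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window softens the frame edges to reduce spectral leakage.
        windowed = [
            s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        magnitudes = []
        # Naive discrete Fourier transform over the lower half of the bins.
        for k in range(frame_size // 2):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# A pure 1000 Hz test tone: every time frame should peak in the same bin.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]
spec = spectrogram(tone)
loudest_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
loudest_hz = loudest_bin * SAMPLE_RATE / FRAME_SIZE
print(len(spec), "frames x", len(spec[0]), "bins; loudest at", loudest_hz, "Hz")
```

Even this tiny 256-sample clip produces 7 frames of 32 values each. A full song at a realistic sample rate produces millions of values, which is why comparing raw spectrograms does not scale.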

— How does Shazam handle this pattern recognition?

Shazam handles this by taking the spectrogram and transforming it into an audio fingerprint, which looks like dots on a graph. Each dot represents the highest-magnitude frequency at a specific point in time. Converting to an audio fingerprint drastically decreases the amount of data needed to represent a specific sound. Shazam further simplifies the audio fingerprints by saving snippets of the sound, represented by frequency numbers, and storing them in a hash table. With a hash table, searching for a song comes down to finding the song in your database with enough matching snippets. Higher efficiency means lower lookup times, and that matters: the longer users wait for results, the more likely they are to become frustrated and stop using your app.
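The fingerprint-and-hash-table idea can be sketched in a few lines. This is an illustration under stated assumptions, not Shazam's actual implementation: the helper names (`strongest_bins`, `snippets`, `build_index`, `identify`), the choice to pair each dot with the next few dots, and the made-up song data are all hypothetical.

```python
from collections import Counter, defaultdict


def strongest_bins(spectrogram):
    """One dot per time frame: the index of the loudest frequency bin."""
    return [max(range(len(frame)), key=lambda k: frame[k]) for frame in spectrogram]


def snippets(peaks, fan_out=3):
    """Yield (key, time) pairs; key = (frequency, later frequency, time gap)."""
    for t, f1 in enumerate(peaks):
        for dt in range(1, fan_out + 1):
            if t + dt < len(peaks):
                yield (f1, peaks[t + dt], dt), t


def build_index(songs):
    """Hash table mapping each snippet key to the songs and times it occurs at."""
    index = defaultdict(list)
    for title, peaks in songs.items():
        for key, t in snippets(peaks):
            index[key].append((title, t))
    return index


def identify(index, recording_peaks):
    """Return the title with the most snippets agreeing on one time offset."""
    votes = Counter()
    for key, t_rec in snippets(recording_peaks):
        for title, t_song in index.get(key, []):
            votes[(title, t_song - t_rec)] += 1
    if not votes:
        return None
    (title, _offset), _count = votes.most_common(1)[0]
    return title


# Made-up fingerprints: one loudest-bin number per time frame.
songs = {
    "Song A": [3, 7, 7, 12, 5, 9, 3, 14, 8, 2],
    "Song B": [4, 4, 10, 6, 11, 1, 13, 6, 2, 9],
}
index = build_index(songs)

recording = songs["Song A"][3:8]   # a short clip from the middle of Song A
print(identify(index, recording))  # Song A
print(identify(index, [0, 0, 0]))  # None (no matching snippets)
```

Each lookup in the hash table takes roughly constant time, so the cost of identifying a recording depends on the length of the recording rather than on the number of songs stored. Counting votes per time offset means a clip from the middle of a song still matches.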

— What are other technologies/uses for Audio recognition?

Copyright claims can be filed automatically using audio recognition

Audio fingerprinting has uses beyond identifying songs in Shazam's app. YouTube, for example, can run an algorithm on its videos and check for copyright infringement just by matching audio fingerprints against songs that don't belong to the content provider. Twitch similarly mutes audio automatically on users' videos by partnering with Audible Magic, which helps identify songs that are being used without authorization.

Alexa, play Despacito

Speech recognition is a related field that is quickly becoming mainstream, with Google Home and Amazon's Echo as two familiar examples. By combining acoustic modeling, artificial neural networks, and audio fingerprints, complex systems can be built that interpret and respond to human language almost instantaneously.