How does Shazam work?

Ana
6 min read · Dec 20, 2018

Magic behind Shazam explained in simple English.

Shazam is a music recognition app that has been around for almost 20 years now. It’s one of my favorite apps that has always fascinated me with its speed and precision. I’ve read about its algorithm many times in the past, but every time I’d go back to thinking about it as some sort of magic. That changed recently when I started coding and building my own apps. I’ve learned enough to understand the basic pipeline of Shazam, and I’ve decided to share it here.

What is sound?

Sound is made of waves that travel through a medium, such as air or water, and can be detected by human ears. The two main physical characteristics of sound are frequency and amplitude.

Frequency is the number of cycles per second, and it’s measured in Hertz. Amplitude, on the other hand, represents the size of each cycle.
If we take the graph below as an example, we see a total of 2 cycles over a period of 0.1 seconds (represented on the horizontal (X) axis). This means that the frequency of this wave is 2 / 0.1 = 20 Hz (or 20 cycles per second). Amplitude is represented on the vertical (Y) axis, and in the case of this wave, it’s 1.
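If you like seeing this in code, the same arithmetic looks like this in Python (a toy sketch, not anything Shazam actually runs):

```python
# The wave in the graph completes 2 cycles in 0.1 seconds.
cycles = 2
duration_s = 0.1

# Frequency = cycles per second, measured in Hertz.
frequency_hz = cycles / duration_s
print(frequency_hz)  # 20.0
```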

source: coding geek

Another thing we can notice is that there is only one sound wave on this graph. This kind of wave is known as a pure tone, and it represents an idealized, isolated sound that doesn’t really exist on its own in nature. Instead, all real sounds are more complex and are a sum of multiple pure tones at different amplitudes. The graph below is an example of a real sound.
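We can simulate such a “real” sound by adding a few pure tones together. Here’s a small Python sketch (the frequencies and amplitudes are just made-up examples):

```python
import math

def tone(freq_hz, amplitude, t):
    """Value of a single pure tone (one sine wave) at time t (seconds)."""
    return amplitude * math.sin(2 * math.pi * freq_hz * t)

def composite(t):
    # A "real" sound: a sum of several pure tones at different amplitudes.
    return tone(20, 1.0, t) + tone(40, 0.5, t) + tone(80, 0.25, t)

# Sample the composite wave at 100 points in time.
samples = [composite(n / 1000) for n in range(100)]
print(round(composite(0.0125), 3))  # 1.0 (only the 20 Hz tone is at its peak here)
```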

source: coding geek

Analog vs. digital sound

Every sound that exists in air or any other medium is analog. Analog waves are continuous and extremely detailed: even the smallest fraction of a wave can still be divided into smaller parts. Digital sound, on the other hand, has to have a minimum unit of time, simply because we can’t afford to store an infinite amount of data. During this unit of time the sound wave can’t change. For example, if the minimum unit is 1 millisecond, then frequency, amplitude, and every other characteristic of the sound stay fixed for that millisecond. Keep in mind that the minimum unit has to be small enough, otherwise the digital sound might sound completely different from the analog one. When represented on a graph, analog sound has smooth, curvy waves, while digital waves look more like steps.

The process of transforming analog sound to digital is called sampling. During this process certain information gets lost, and what we end up with is an approximate representation of the sound rather than an exact copy of it. Most of the music we listen to today is digital, with the exception of vinyl records, which have an analog recording imprinted on them.
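Here’s a minimal Python sketch of sampling, using a 440 Hz tone as a stand-in for the analog signal. The sample rate of 44,100 Hz is the standard CD rate; the function names are my own:

```python
import math

SAMPLE_RATE = 44_100  # CD quality: 44,100 measurements per second

def analog(t):
    """Stand-in for a continuous (analog) signal: a pure 440 Hz tone."""
    return math.sin(2 * math.pi * 440 * t)

def sample(signal, duration_s, rate=SAMPLE_RATE):
    """Digitize: measure the signal once every 1/rate seconds."""
    n = int(duration_s * rate)
    return [signal(i / rate) for i in range(n)]

digital = sample(analog, duration_s=0.01)
print(len(digital))  # 441 samples for just 10 ms of audio
```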

Why is all this important?

Well, think about it. The first step to identifying (or should I say, ‘shazaming’) a song is recording it with your phone’s microphone. The song that you hear is analog, and it gets transformed into a digital signal while being recorded. Makes sense, right? But this wave is still a wave; it has only been sampled and stored. It needs to be transformed into some form that’s easier to compare and identify.

Fourier transform

The Fourier transform (FT) is a formula that transforms a sound wave into a graph of the frequencies the sound is made of, together with their intensities. In the gif below we can see a digital signal (the red, step-like wave) being decomposed into the pure tones it’s made of (the blue curvy waves) and turned into a frequency graph.

source: wikipedia

There are a few problems, though: 1. computing the FT directly takes a very long time for a whole song, and 2. the FT gives us only frequencies and their intensities (amplitudes), without any information about timing. In other words, we don’t know when in the song these frequencies occur.

The problem of speed can easily be solved by using the Fast Fourier Transform (FFT), which computes the same result with a divide-and-conquer approach, cutting the cost from O(n²) down to O(n log n) operations. However, neither the FT nor the FFT can solve the problem of the missing time information. In order to fix this, Shazam uses a special type of graph called a spectrogram.
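To make the divide-and-conquer idea concrete, here is a minimal recursive Cooley-Tukey FFT in Python. This is a textbook sketch, not Shazam’s actual implementation:

```python
import cmath
import math

def fft(samples):
    """Recursive Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return list(samples)
    # Split into even- and odd-indexed halves, transform each recursively...
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    # ...then combine the two half-size spectra into one.
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

# One second of a signal sampled at 64 Hz, containing a pure 8 Hz tone:
rate = 64
signal = [math.sin(2 * math.pi * 8 * t / rate) for t in range(rate)]
spectrum = fft(signal)

# The strongest bin in the lower half of the spectrum sits at 8 Hz:
peak_bin = max(range(rate // 2), key=lambda k: abs(spectrum[k]))
print(peak_bin)  # 8
```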

Spectrogram and the audio fingerprint algorithm

A spectrogram is a visual representation of frequencies as they vary over time. In other words, it’s a three-dimensional graph. If we look at the example below, we can see that the axes represent frequency and time, while the third value (amplitude) is represented by the color intensity of each point in the image.
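A spectrogram can be sketched in a few lines: split the signal into short frames and transform each frame separately, so we know *when* each frequency occurs. This toy version skips the windowing and overlapping that real implementations use:

```python
import cmath
import math

def dft(frame):
    """Plain discrete Fourier transform of one short frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def spectrogram(samples, frame_size=64):
    """Rows = moments in time, columns = frequency bins,
    values = magnitudes (how strong each frequency is in that frame)."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    return [[abs(x) for x in dft(f)[:frame_size // 2]] for f in frames]

# A signal whose pitch jumps halfway through: 4 Hz first, then 16 Hz.
rate = 64
sig = [math.sin(2 * math.pi * (4 if t < rate else 16) * t / rate)
       for t in range(2 * rate)]

spec = spectrogram(sig)
# The loudest bin per frame tracks the pitch over time:
print([max(range(len(row)), key=lambda k: row[k]) for row in spec])  # [4, 16]
```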

source: coding geek

The spectrogram is the very basis of Shazam’s audio fingerprint algorithm. We can think of the fingerprint as a condensed digital summary of a song. Just like human fingerprints, every song’s acoustic fingerprint is unique, and can be identified even when there are small variations in the data. This allows Shazam’s algorithm to get rid of all the unnecessary information about a song. This is achieved in a few ways.

Human hearing abilities help filter out a big chunk of data. Human ears can register sounds of frequencies between 20 Hz and 20,000 Hz. In practice, this range tends to be even narrower, and it shrinks with age. On top of that, human perception of loudness depends on frequency: two tones played at the same physical intensity but at different frequencies are not heard as equally loud. All of this allows the algorithm to focus only on the peak points in the graph, the points of highest energy content. This filters out the unnecessary data and reduces the impact of background noise on audio identification.
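The peak-picking step can be sketched like this: keep only points that are louder than both of their neighbors and above some threshold (the threshold value here is arbitrary):

```python
def peaks(spec, threshold=5.0):
    """Keep only the strongest points of a spectrogram: local maxima
    per frame above a magnitude threshold. Returns (frame, bin) pairs."""
    out = []
    for t, row in enumerate(spec):
        for k in range(1, len(row) - 1):
            if row[k] > threshold and row[k] > row[k - 1] and row[k] > row[k + 1]:
                out.append((t, k))
    return out

# Toy spectrogram: 2 frames x 6 frequency bins; only the clear peaks survive.
spec = [[0, 1, 9, 1, 0, 0],
        [0, 0, 1, 8, 1, 0]]
print(peaks(spec))  # [(0, 2), (1, 3)]
```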

Once the audio fingerprint is created, it gets stored in the database in the form of a hash table. The keys in this table are built from pairs of frequency peaks: an “anchor” point combined with a nearby target point, together with the time offset between them. This method of acoustic fingerprinting allows applications such as Shazam to differentiate between two closely related covers of the same song.
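Here’s a toy version of that pairing scheme, loosely following the anchor/target idea from Wang’s paper (the fan-out of 3 targets and the data shapes are my own simplifications):

```python
def fingerprint_hashes(peak_points):
    """Turn a sorted list of (time, frequency) peaks into hash keys.
    Each anchor peak is paired with a few nearby target peaks; the key
    combines both frequencies and the time offset between them."""
    hashes = {}
    for i, (t1, f1) in enumerate(peak_points):
        for t2, f2 in peak_points[i + 1:i + 4]:  # fan-out of 3 targets
            key = (f1, f2, t2 - t1)
            hashes[key] = t1  # remember where the anchor occurred
    return hashes

peak_points = [(0, 12), (1, 30), (2, 12), (3, 45)]
hashes = fingerprint_hashes(peak_points)
print(len(hashes))          # 6 hash keys from just 4 peaks
print(hashes[(12, 30, 1)])  # 0: that pair was anchored at time 0
```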

Shazam has more than 11 million songs in its database.

Now that we know all this, let’s put it all together:

Step 1: the song we want to identify is an analog wave
Step 2: it’s recorded by our phone’s microphone and converted into a digital format
Step 3: the digital audio is converted to the frequency domain using the Fourier transform
Step 4: a unique audio fingerprint is formed using the spectrogram
Step 5: the fingerprint is compared to the samples in the database
Step 6: if the fingerprint matches a sample in the database, the song is identified
Step 7: the user gets the information about the song back
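The matching part (steps 5 and 6) can be sketched as a simple vote count: the song that shares the most fingerprint hashes with the recording wins. The real algorithm additionally checks that the matching hashes line up consistently in time, which this toy version skips:

```python
from collections import Counter

def match(sample_hashes, database):
    """Count how many fingerprint hashes each song shares with the
    recorded sample; the best-scoring song is the match."""
    scores = Counter()
    for h in sample_hashes:
        for song in database.get(h, []):
            scores[song] += 1
    return scores.most_common(1)[0] if scores else None

# Hypothetical database mapping each hash to the songs containing it.
database = {
    "a": ["Song X"],
    "b": ["Song X", "Song Y"],
    "c": ["Song Y"],
}
print(match(["a", "b"], database))  # ('Song X', 2)
```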

source: toptal

If you’re interested in learning more about Shazam check out the following blogs:
- Coding Geek’s very well explained and detailed post about the mechanism behind Shazam and everything else you need to know about it
- TopTal’s post about Shazam with the simplified version made in Java
- the research paper written by Avery Li-Chun Wang, the co-founder of Shazam
- Shazam official blog

Ana

I moved from Europe to USA and from human languages to programming languages.