How Shazam Works

Trey Cooper
7 min read · Jan 29, 2018


Shazam lets you submit a recording, made on your phone, of nearly any song, and it will tell you the song’s name, the artist’s name and other data about that song. This other data often includes links to places where you can purchase the song as well as upcoming tour dates for the artist. In recent years, Shazam has extended its library to include ads and television. For example, you can Shazam a commercial and be given additional information about the product, including a link to purchase it.

Founded in 1999, Shazam is older than the smartphone. In its early days, users would call in from an ordinary mobile phone, hold it up to the music, and receive the result as a text message.

A Shazam search is able to find a match even in noisy environments like bars or nightclubs, as long as the song is already in Shazam’s database. A recording of at least five seconds gives the best results, and you can start recording at any point in the song; Shazam will send you a match in a matter of seconds. For this service to work well, Shazam maintains a growing database of over 8 million songs and other audio files. Assuming the average file is three minutes long, that is roughly 24 million minutes of audio, so it would take over 45 years to play every file back to back!

With a database of this size they have great coverage but how does Shazam find a match so quickly in such a large database? First off, the actual audio files are not what is being searched when you Shazam a song. Instead, Shazam has an audio fingerprint for each audio file in the database. The recording that a Shazam user submits is also made into an audio fingerprint which allows them to make comparisons accurately and quickly. These audio fingerprints consist of collections of numerical data. If you are wondering how this catchy tune that you are Shazaming gets turned into numbers, the next section is for you.

How sound works

At its most basic definition, sound is particles vibrating. Three elements make each sound unique: amplitude, frequency, and time. Amplitude is the size of the vibration, which we perceive as the loudness of the sound. Frequency is the rate at which the vibration occurs, and it is what we perceive as pitch. Frequency is measured in Hertz (Hz), which represents how many times a sound wave repeats per second. The human ear can hear sounds ranging from 20Hz to 20,000Hz. To give some perspective, the lowest note on a traditional 88-key piano, A0, has a frequency of 27.5Hz. We perceive pitch logarithmically: the frequency of each octave is double that of the octave below it. For example, the frequency of A1 is 55Hz, the frequency of A2 is 110Hz, and the frequency of A3 is 220Hz.
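
As a quick illustration of that doubling relationship, here is a minimal Python sketch (not part of Shazam itself) that walks up the A notes from A0:

```python
# A minimal sketch of the octave relationship: each A is double the A below it.
a0 = 27.5  # A0, the lowest note on an 88-key piano, in Hz

for octave in range(5):
    print(f"A{octave}: {a0 * 2 ** octave:g} Hz")
# A0: 27.5 Hz, A1: 55 Hz, A2: 110 Hz, A3: 220 Hz, A4: 440 Hz
```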

In the chart above, you can see that many instruments can play the same notes, but a note on a violin and the same note on a piano will sound different. This difference in tonal quality is known as timbre. The timbre of a sound is created by frequencies within the sound that are higher (they repeat at a faster rate) than the perceived pitch of the sound. These frequencies are known as overtones. Check out this recording, which begins with a note (C4, 261.63Hz) on a piano, plays each of the overtones within that sound one by one, and concludes with the same note it began with.
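
For pitched instruments, those overtones sit at (or very near) whole-number multiples of the fundamental frequency, so a rough sketch of the first few overtones of C4 looks like this; the relative strength of these frequencies is what varies from instrument to instrument:

```python
# Illustrative only: for pitched instruments, overtones fall at (or near)
# whole-number multiples of the fundamental frequency.
c4 = 261.63  # middle C, in Hz

for n in range(2, 6):
    print(f"overtone {n - 1}: {n * c4:.2f} Hz")
# overtone 1: 523.26 Hz, overtone 2: 784.89 Hz, ...
```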

These overtone frequencies are what give an instrument its characteristic timbre.

Time is important because it not only gives us the length of a sound but also tells us when a sound occurs in relation to other sounds. A given song can be made up of many instruments that vary in frequency and amplitude as they move through time in relation to each other. Because amplitude, frequency, and time combine in such complex ways and can be measured so precisely, even two different versions of the same song will each generate their own unique audio fingerprint.

How a fingerprint is made

To make an audio fingerprint, an audio file is converted into a spectrogram where the y-axis represents frequency, the x-axis represents time and the density of the shading represents amplitude (Fig 1A).
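
As a rough sketch of that first step, here is how a spectrogram could be computed with SciPy’s off-the-shelf spectrogram function; the synthetic sine wave stands in for a real recording, and none of this is Shazam’s actual pipeline:

```python
import numpy as np
from scipy import signal

# Hypothetical stand-in for a real recording: 5 seconds of a 440 Hz sine wave.
sample_rate = 8000
t = np.arange(0, 5, 1 / sample_rate)
audio = np.sin(2 * np.pi * 440 * t)

# Spectrogram: rows are frequency bins, columns are time slices,
# and the values are the energy (amplitude) in each time/frequency cell.
freqs, times, sxx = signal.spectrogram(audio, fs=sample_rate,
                                       nperseg=1024, noverlap=512)
print(sxx.shape)  # (number of frequency bins, number of time slices)
```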

For each section of an audio file, the strongest peaks are chosen and the spectrogram is reduced to a scatter plot. At this point, amplitude is no longer necessary (Fig 1B).
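
One simple way to sketch that peak-picking step is to keep only the points that are the loudest in their local neighbourhood; the neighbourhood size and loudness threshold below are illustrative values, not Shazam’s:

```python
import numpy as np
from scipy import signal
from scipy.ndimage import maximum_filter

# Two sine waves standing in for a real recording (same idea as the sketch above).
sample_rate = 8000
t = np.arange(0, 5, 1 / sample_rate)
audio = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
freqs, times, sxx = signal.spectrogram(audio, fs=sample_rate,
                                       nperseg=1024, noverlap=512)

# Keep a point only if it is the loudest in its local neighbourhood and is well
# above the average energy; both thresholds are made up for this example.
local_max = maximum_filter(sxx, size=(15, 15))
peaks = (sxx == local_max) & (sxx > 10 * sxx.mean())

freq_idx, time_idx = np.nonzero(peaks)
constellation = sorted(zip(times[time_idx], freqs[freq_idx]))
print(constellation[:5])  # (time in seconds, frequency in Hz) pairs
```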

Now we have all of the basic data needed to match two files that have undergone the fingerprinting process. However, a direct comparison would only work if a Shazam user began recording at the exact millisecond that a song began. Since this is almost never the case, there are additional steps to audio fingerprinting. Through a process called combinatorial hashing, points on the scatter plot are chosen as anchors, and each anchor is linked to other points that occur after it within a window of time and frequency known as a target zone (Fig 1C).
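
A simplified sketch of that pairing step might look like the following, assuming a `constellation` of (time, frequency) peaks like the one built above; the target-zone boundaries and fan-out limit are made-up values for illustration:

```python
# A simplified sketch of combinatorial hashing, not Shazam's actual parameters.
TARGET_ZONE_DELAY = 0.1    # seconds after the anchor before the zone opens
TARGET_ZONE_LENGTH = 2.0   # how far past the anchor the zone extends, in seconds
MAX_FREQ_SPREAD = 500.0    # Hz above or below the anchor frequency
MAX_PAIRS_PER_ANCHOR = 5

def make_pairs(constellation):
    """Pair each anchor peak with a handful of later peaks in its target zone."""
    points = sorted(constellation)  # order by time
    pairs = []
    for i, (anchor_time, anchor_freq) in enumerate(points):
        fanout = 0
        for point_time, point_freq in points[i + 1:]:
            delta_t = point_time - anchor_time
            if delta_t < TARGET_ZONE_DELAY:
                continue
            if delta_t > TARGET_ZONE_LENGTH or fanout == MAX_PAIRS_PER_ANCHOR:
                break
            if abs(point_freq - anchor_freq) > MAX_FREQ_SPREAD:
                continue
            # The hash: (anchor frequency, point frequency, time between them),
            # kept alongside the anchor's time from the start of the file.
            pairs.append(((anchor_freq, point_freq, round(delta_t, 2)), anchor_time))
            fanout += 1
    return pairs
```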

Each anchor-point pair is stored as a hash made up of the frequency of the anchor, the frequency of the point, and the time between them. Each hash is then linked to a table that records the time between the anchor and the beginning of the audio file. Files in the database also have unique IDs that are used to retrieve more information about the file, such as the song’s title and the artist’s name.
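
Continuing the sketch, the pairs could be kept in a toy in-memory index keyed by hash, with each entry recording which song the hash came from and where its anchor sits in that song; a real system would pack the hash into a compact integer and use a purpose-built datastore:

```python
from collections import defaultdict

# A toy in-memory index, assuming the make_pairs sketch above.
database = defaultdict(list)  # hash -> list of (song_id, anchor_time) entries
song_info = {}                # song_id -> metadata used to build the result

def index_song(song_id, title, artist, constellation):
    """Fingerprint one catalogue track and add its hashes to the index."""
    song_info[song_id] = {"title": title, "artist": artist}
    for pair_hash, anchor_time in make_pairs(constellation):
        database[pair_hash].append((song_id, anchor_time))
```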

How to find a match

Now that fingerprints exist for both audio files, each of the anchor-point pairs from the Shazam user’s recording is sent to Shazam’s database to look for matching anchor-point pairs. This search returns the audio fingerprints of all songs that contain any hash matches. Once we have all of the possible matches for the user’s recording, we need to find the time offset between the beginning of the user’s recording and the beginning of each possible match from the database. This offset can be calculated by subtracting the time at which the anchor-point pair occurs in the user’s recording from the time at which the matching hash occurs in the audio file from Shazam’s database. If a significant number of matching hashes share the same time offset, that song is declared a match!
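
Here is a hedged sketch of that matching step, built on the toy index above: every matching hash casts a vote for a (song, offset) pair, and a large pile of votes at a single offset, the spike described below, is treated as a match. The vote threshold is an arbitrary illustrative number:

```python
from collections import Counter

# A sketch of matching, assuming the database, song_info, and make_pairs helpers above.
def find_match(recording_constellation, min_votes=20):
    """Vote for (song, time offset) pairs; a spike at one offset is a match."""
    votes = Counter()
    for pair_hash, recording_time in make_pairs(recording_constellation):
        for song_id, song_time in database.get(pair_hash, []):
            # True matches line up at a single offset: the gap between the start
            # of the full song and the point where the user started recording.
            offset = round(song_time - recording_time, 1)
            votes[(song_id, offset)] += 1

    if not votes:
        return None
    (song_id, offset), count = votes.most_common(1)[0]
    return (song_info[song_id], offset) if count >= min_votes else None
```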

When mapped to a scatter plot where the y-axis represents the time at which a hash occurs in the Shazam user’s recording and the x-axis represents the time at which the hash occurs in the audio file from Shazam’s database, the matching hashes form a diagonal line (Fig 3A). In a histogram of the same data, where the y-axis represents the offset times and the x-axis represents the number of matches, there will be a large spike at the correct offset time (Fig 3B).

This audio search method is accurate enough to find matches even when the Shazam user’s recording contains noise such as people talking, road noise, and even other songs. Because the number of anchor-point hashes created by an audio fingerprint is much higher than the number of anchor-point matches required to return a positive result, the hashes that are masked by external noise are not enough to prevent Shazam from consistently finding a match for an audio file in the database. Since the search algorithm is built to find matches to recorded audio in Shazam’s database, if you are at a concert and get a positive match when Shazaming a song, it is most likely that the performer is using backing tracks and/or lip-syncing. Another side effect is that Shazam will return the original recording a sample comes from if an artist has not combined the sample with any other sounds and has not altered it in any way.

References

Research paper by Shazam co-founder Avery Wang https://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf
