Audio fingerprinting — what is it and why is it useful?

Chirp · Mar 15, 2018

An audio fingerprint (also referred to as an acoustic fingerprint) is a compact representation of a piece of audio (be it music, environmental sound, etc.) that encapsulates information specific to the audio it represents. The role of an audio fingerprint is to capture the signature of a piece of sound, such as a song, so that it can be differentiated from other sounds. Audio fingerprinting has many applications, including watermarking, monitoring broadcast/distribution of audio content, and content-based sound retrieval. The latter application will be the focus of this article, although the same methods and considerations apply to other uses of audio fingerprint technology.

Much like human fingerprints are used to identify a single person, audio fingerprints are designed to be specific to an instance of sound (an audio file), as opposed to a concept or class of sounds (like ‘ambient music’ or ‘rain sounds’). So let’s see how this works in practice.

Figure 1 — Spectrogram of some speech

Most audio fingerprinting technology used for services like Shazam and AcoustID extracts fingerprints from a time-frequency representation of audio called a spectrogram (Figure 1 shows a spectrogram of some speech). Spectrograms are great because they allow us to see the frequency content over time, and how loud or quiet each frequency is. This is all well and good, but in their raw form spectrograms are not very useful as audio fingerprints, for a couple of reasons. Firstly, they contain a lot of information, much of which may be redundant for the purposes of audio fingerprinting. Secondly, they are not robust to degradation of audio quality. Figure 2 shows the same audio file as Figure 1, but played back in a different environment with clearly audible background noise. The background noise has clearly resulted in a different spectrogram, but one thing is immediately obvious: the peaks are mostly intact. As such, the spectrogram peaks are a good starting point for generating a robust audio fingerprint.

Figure 2 — The same piece of speech played in a noisy environment
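To make this concrete, here is a minimal sketch of how such a spectrogram might be computed in Python with NumPy and SciPy. The input filename, window length and overlap are illustrative assumptions, not the exact parameters used for the figures above.

```python
# Minimal sketch: computing a log-magnitude spectrogram with SciPy.
# "speech.wav" is a hypothetical input file; window length, overlap and
# FFT size are illustrative choices, not those of any particular system.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

fs, audio = wavfile.read("speech.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)        # mix down to mono
audio = audio.astype(np.float64)

# Short-time Fourier transform: frequency content of each windowed frame.
freqs, times, Z = stft(audio, fs=fs, window="hann", nperseg=1024, noverlap=512)

# Log-magnitude spectrogram (dB), the representation shown in Figures 1 and 2.
spectrogram = 20 * np.log10(np.abs(Z) + 1e-10)
print(spectrogram.shape)              # (frequency bins, time frames)
```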

There are many approaches to detecting peaks, but essentially we want to identify the salient points in each region of the spectrogram that are not the result of background noise. Each region can be considered as a 2D window, the size of which determines the number of peaks we end up with.
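One common way to find peaks within such 2D windows is a sliding maximum filter: a point is kept only if it is the largest value in its neighbourhood and sits sufficiently above the noise floor. Below is a sketch of that idea, continuing from the spectrogram computed earlier; the neighbourhood size and dynamic-range threshold are assumed values, and real systems tune them to control peak density.

```python
# Sketch of peak picking with a 2D maximum filter. The neighbourhood size
# and the 40 dB dynamic-range threshold are illustrative assumptions.
import numpy as np
from scipy.ndimage import maximum_filter

def pick_peaks(spectrogram, neighbourhood=(20, 20), dynamic_range_db=40.0):
    """Return (freq_bin, time_frame) coordinates of local spectrogram peaks."""
    local_max = maximum_filter(spectrogram, size=neighbourhood) == spectrogram
    # Keep only points within `dynamic_range_db` of the loudest point,
    # so quiet background energy does not produce spurious peaks.
    strong = spectrogram > spectrogram.max() - dynamic_range_db
    peak_freqs, peak_times = np.nonzero(local_max & strong)
    return list(zip(peak_freqs, peak_times))

peaks = pick_peaks(spectrogram)
print(f"{len(peaks)} peaks detected")
```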

Figure 3 — Segment of speech with peaks annotated

Figures 3 and 4 show the peaks detected on our speech file, with and without background noise. Note that the peaks appear more closely grouped at higher frequencies, but this is just due to the logarithmic frequency scale used for plotting. One nice property of taking the peaks (as opposed to other audio features such as spectral statistics or zero crossing rate) is robustness to noise. Even with the clearly audible background noise, most of the peaks in the clean audio file are also detected in the noisy one, with a few extra peaks introduced by the background noise.

Figure 4 — The same segment of speech played in a noisy environment, with peaks annotated
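As a rough, hypothetical way to quantify that robustness, one could count how many of the clean-signal peaks reappear, within a small tolerance, among the peaks detected in the noisy recording. The tolerances here are arbitrary illustrative values rather than anything from the experiments above.

```python
# Fraction of clean-signal peaks that also appear (within a small
# frequency/time tolerance) in the noisy recording. Purely illustrative.
def peak_overlap(clean_peaks, noisy_peaks, freq_tol=2, time_tol=2):
    noisy = set(noisy_peaks)
    def has_match(peak):
        f, t = peak
        return any((f + df, t + dt) in noisy
                   for df in range(-freq_tol, freq_tol + 1)
                   for dt in range(-time_tol, time_tol + 1))
    if not clean_peaks:
        return 0.0
    return sum(has_match(p) for p in clean_peaks) / len(clean_peaks)

# e.g. peak_overlap(pick_peaks(clean_spec), pick_peaks(noisy_spec))
```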

Once we have identified all the peaks in an audio file, we have the starting point for a fingerprint. We could just take the coordinates of each peak as a fingerprint, but it is easy to imagine that there are lots of pieces of audio (or songs) that share some peak positions, so a single peak would not suffice as a unique fingerprint. This is explained in this excellent paper by Avery Wang (the man behind Shazam’s technology). For a given frame of audio analysed over 1024 frequencies, we would have 1024 potential peak positions, equating to 10 bits of information. This is pretty low considering the potential size of the search space (millions of songs or many hours of audio), so we need a method of increasing the entropy of our fingerprint. There are a few clever ways to do this. Wang’s approach is to construct fingerprints from pairs of peaks (Figure 5). The peaks are split up into target zones, and each zone is allocated an anchor peak. Each peak in the target zone is paired with the anchor. A hash can then be constructed for each pair, consisting of the frequencies of the two peaks and the time difference between them.

Figure 5 — Pair of peaks used for the fingerprints (each fingerprint based on an anchor-peak pair)
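A minimal sketch of this pairing scheme, assuming peaks are given as (frequency bin, time frame) coordinates, might look like the following. The fan-out, maximum time difference and bit packing are illustrative choices, not the actual values used by Shazam.

```python
# Sketch of hashing peak pairs in the spirit of Wang's paper: each anchor
# peak is paired with a handful of peaks in a target zone ahead of it, and
# the hash packs (anchor frequency, paired frequency, time difference).
def peak_pair_hashes(peaks, fan_out=5, max_dt=100):
    """peaks: list of (freq_bin, time_frame). Returns (hash, anchor_time) pairs."""
    peaks = sorted(peaks, key=lambda p: p[1])
    hashes = []
    for i, (f1, t1) in enumerate(peaks):
        for f2, t2 in peaks[i + 1:i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt <= max_dt:
                # Pack two 10-bit frequency bins and a 10-bit time delta.
                h = (f1 << 20) | (f2 << 10) | dt
                hashes.append((h, t1))
    return hashes
```

Keeping the anchor’s time alongside each hash is what makes lookup practical: matching hashes between a query and a database track can then be checked for a consistent time offset, which is how Wang’s system confirms a true match.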

But there are many alternative ways to increase the entropy. For example, we could take all the peaks in a section of audio as the fingerprint (Figure 6). The size of the section (i.e. the width of the fingerprint) is an important factor here, and will depend on the use case. If it is too short, then we risk having low entropy (i.e. the fingerprint will not contain enough information to differentiate it from fingerprints of other audio files). However, if it is too long then we increase the amount of information required to store a database of fingerprints and to search the database for a given query. As such, there is a trade-off between the amount of information we need to make our fingerprints unique and the practicality of searching potentially millions of fingerprints.
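As a sketch of this alternative, one could slide a fixed-width window across the peak list and hash the relative positions of the peaks inside each window. The window width, hop and hashing scheme below are illustrative assumptions; the width parameter is exactly the entropy-versus-storage trade-off described above.

```python
# Sketch of window-based fingerprints: hash the set of peaks falling inside
# each fixed-width section of the spectrogram. Width and hop are illustrative.
import hashlib

def window_fingerprints(peaks, width=50, hop=25):
    """Yield (hash, start_frame) for overlapping sections of `width` time frames."""
    if not peaks:
        return
    last_frame = max(t for _, t in peaks)
    for start in range(0, last_frame + 1, hop):
        # Peak positions relative to the window start, so the hash is
        # invariant to where the section occurs in the file.
        section = sorted((t - start, f) for f, t in peaks
                         if start <= t < start + width)
        if section:
            digest = hashlib.sha1(repr(section).encode()).hexdigest()[:16]
            yield digest, start
```

An exact hash like this is sensitive to any missing or extra peaks, so in practice this style of fingerprint generally needs some quantisation of the peak positions, or error tolerance in the matching stage.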

The approach that we developed for the Hijinx Alive Beat Bugs toys is unusual in that it is capable of performing real-time audio fingerprint recognition on a particularly low-spec computer — in fact, on an embedded microchip that costs less than $1 per unit. It’s a very different set of constraints from the large-scale fingerprinting used by Shazam, and an example of how fingerprinting can be used to create interactive audio experiences without high-performance hardware.

The Hijinx Alive Beat Bugs Toys

In this blog, we have explained what audio fingerprints are, and given some examples of how they can be created. It is important to note that there is no single approach to generating an audio fingerprint, although most of the best performing methods are based on peaks in the spectrogram. Some approaches to audio fingerprinting lend themselves more towards scaling up to potentially millions of audio files (and hundreds or thousands of millions of fingerprints), whereas others are more suited to specific hardware constraints such as low memory and performance requirements.

Adib’s experience in audio spans a wide range of areas, including sound engineering, signal processing, machine learning, vocal analysis, and audio perception. He holds a BSc in Audio Technology and is currently finishing up a PhD in the Centre for Digital Music at Queen Mary University of London. Adib joined Chirp to focus on researching machine listening and intelligent audio systems.

Chirp is a technology company enabling seamless transfer of digital information via soundwaves, using a device’s loudspeaker and microphone only. The transmission uses audible tones or inaudible ultrasound and takes place with no network connection. To learn more visit chirp.io
