A Fingerprint for Audio

The popularity of the human fingerprint has never been higher than it is now. No longer are human fingerprints solely used by forensic specialists to identify individuals. Nowadays, we identify ourselves with our fingerprint every time we unlock our phone.

Our use of the human fingerprint for identification purposes is an extremely successful data reduction method. We can identify an individual with great certainty based only on a few key features in the fingerprint. Can we do something similar with audio, i.e. identify an audio track based on only a few key features?

Why fingerprint audio?

There are many reasons why one might want to identify an audio track. We might want to determine the title and author of the song we’re listening to. Audio fingerprinting may also be used not only to identify what track we’re listening to but also where in the track we’re listening, so that we can synchronise multiple pieces of content (for instance, allowing us to join in on a TV quiz show). The article “An Overview of Audio Recognition Methods” described use cases of audio recognition in more detail. It also introduced audio fingerprinting as a method to perform audio recognition.

When compared to e.g. audio watermarking, fingerprinting is termed a passive technique, since it does not require altering the original content. Passive methods sample the audio as-is, and aim to perform recognition based on the sampled data. The main advantages of this methodology are its unobtrusiveness and the possibility to recognise audio that is unavailable for modification (e.g. audio tracks stored in the user’s home or controlled by a third party). The disadvantage is, in general, the need to compare sampled data to a database.

History of techniques

Although pioneering work on audio matching goes back to the very beginning of audio broadcasting, most of this early work comprised straightforward correlation computation, that is, a direct comparison between audio data. This technique is prohibitively expensive to compute and to store, given that all raw audio of interest must be stored and compared. Around the year 2000, the first ideas for audio fingerprints were proposed. The idea is to extract the most important features that are characteristic of the audio yet robust to noise and distortion, and compare these features between samples and references. Note the similarity with real fingerprints: our fingerprints can be used to identify us because they are characteristic of the individual, whilst also being robust to noise and distortion (for instance, dirty fingers).

Spectrograms

Practically all audio fingerprints are based on features in a spectrogram. A spectrogram is an approximate decomposition of the signal over time and frequency. It is created by taking a short window of time of the signal, and then performing a Fourier transform that decomposes that window over its frequencies. By repeatedly performing this calculation for subsequent windows of time, we find the frequency composition of the audio as time progresses. To illustrate this, let’s look at the waveform of the first five seconds of Amy Winehouse’s “Back to Black”:

The waveform of the first five seconds of Amy Winehouse’s “Back to Black”.

The waveform shows the amplitude of the sound wave as time progresses. The music rhythm is clearly visible but, as we’ve seen in the overview on audio recognition, the exact shapes are rather sensitive to distortions. Now let’s look at the spectrogram of this same section of music:

A spectrogram of the first five seconds of Amy Winehouse’s “Back to Black”.

Note the additional information provided by the spectrogram. It now becomes apparent that some of the beats manifest themselves in much higher frequencies than others. For instance, the beats at around 1.8 and 2.7 seconds have components at 5 kHz and 15 kHz, whereas the beats at around 1.2 and 2.2 seconds are characterised by frequencies of 1 kHz or lower. In the overview, we saw that the first set of beats is much more robust to distortion than the latter. This is no coincidence, but rather a natural result of distortions: lower frequencies tend to be more easily influenced by speaker and room set-up than the higher frequencies are.
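The windowed Fourier transform described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the code used to produce the figures: the window size, hop size, Hann window and the synthetic two-tone test signal are all assumptions chosen for the example.

```python
import numpy as np

def spectrogram(signal, sample_rate, window_size=1024, hop=512):
    """Return (times, freqs, magnitudes) for a mono signal.

    magnitudes has shape (n_windows, window_size // 2 + 1).
    """
    window = np.hanning(window_size)  # taper each window to reduce spectral leakage
    n_windows = 1 + (len(signal) - window_size) // hop
    frames = np.stack([
        signal[i * hop : i * hop + window_size] * window
        for i in range(n_windows)
    ])
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))  # one Fourier transform per window
    times = (np.arange(n_windows) * hop + window_size / 2) / sample_rate
    freqs = np.fft.rfftfreq(window_size, d=1.0 / sample_rate)
    return times, freqs, magnitudes

# Assumed test signal: a 1 kHz tone for 0.5 s followed by a 5 kHz tone for 0.5 s.
sample_rate = 44100
t = np.arange(sample_rate // 2) / sample_rate
signal = np.concatenate([np.sin(2 * np.pi * 1000 * t), np.sin(2 * np.pi * 5000 * t)])

times, freqs, mags = spectrogram(signal, sample_rate)
dominant = freqs[np.argmax(mags, axis=1)]  # strongest frequency in each window
print(dominant[0], dominant[-1])
```

The waveform of this test signal would look similar throughout, whereas the spectrogram shows the jump from roughly 1 kHz to roughly 5 kHz halfway through.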

To create a fingerprint, we must extract the characteristics that best define the audio from the spectrogram. There are many possible ways to do this. Let’s look at some market-leading variants.

Shazam

One of the first algorithms developed in the industry was developed by researchers from Shazam. Their solution is to identify the strongest peaks in the spectrogram, and to store the relative signatures of these peaks. The algorithm is illustrated in the image below:

Red circles indicate the strongest peaks and red lines connect peaks that are close to each other. The result is a “spider web” over the spectrogram. The web is much sparser than the original spectrogram and can therefore be stored more efficiently. Furthermore, the web is robust to distortions like white noise, since that will have a relatively small impact on the strongest peaks. The web therefore acts like an audio fingerprint.
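The peak-and-connection idea can be sketched as follows. This is a simplified illustration and not Shazam's actual implementation: the neighbourhood size, the number of peaks kept, the `max_dt` and `fan_out` limits, and the synthetic spectrogram are all assumptions made for the example.

```python
import numpy as np

def find_peaks(mags, neighbourhood=5, n_peaks=20):
    """Return (time_idx, freq_idx) of the strongest local maxima, sorted by time."""
    n_t, n_f = mags.shape
    peaks = []
    for ti in range(n_t):
        for fi in range(n_f):
            t0, t1 = max(0, ti - neighbourhood), min(n_t, ti + neighbourhood + 1)
            f0, f1 = max(0, fi - neighbourhood), min(n_f, fi + neighbourhood + 1)
            # A peak is the largest value within its local time-frequency patch.
            if mags[ti, fi] > 0 and mags[ti, fi] == mags[t0:t1, f0:f1].max():
                peaks.append((mags[ti, fi], ti, fi))
    peaks.sort(reverse=True)  # strongest first
    return sorted((ti, fi) for _, ti, fi in peaks[:n_peaks])

def pair_peaks(peaks, max_dt=10, fan_out=3):
    """Connect each peak to a few later peaks: (freq1, freq2, time_delta, time1)."""
    links = []
    for i, (t1, f1) in enumerate(peaks):
        targets = [(t2, f2) for t2, f2 in peaks[i + 1:] if 0 < t2 - t1 <= max_dt]
        for t2, f2 in targets[:fan_out]:
            links.append((f1, f2, t2 - t1, t1))
    return links

# Assumed stand-in for a spectrogram: a weak noise floor with four planted peaks.
rng = np.random.default_rng(0)
mags = rng.random((40, 64)) * 0.1
planted = [(5, 10), (12, 40), (20, 22), (30, 50)]
for ti, fi in planted:
    mags[ti, fi] = 1.0

peaks = find_peaks(mags, n_peaks=4)
links = pair_peaks(peaks)
print(peaks)
print(links)
```

Each link stores two frequencies and the time difference between them, which is what makes the signature relative: it does not depend on where in the track the sample was taken.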

Some challenges remain. For instance, it is not immediately clear just how many peaks and connections we should store. The more intricate the web we create, the larger the dataset and the harder it is to compare with a reference. However, if the web is made simpler, the risk that it is impacted by noise increases, as well as the chance of a false positive: falsely reporting a match with a reference.

Furthermore, although many types of noise will not impact the strongest peaks, some distortions will shift or modify even the strongest peaks (for instance, low quality speakers). The web may also break if a strong burst of noise (for instance, someone talking) takes out an essential section.

Philips

Roughly at the same time as Shazam, the Philips audio research department presented another method of fingerprinting. Where Shazam’s algorithm selectively represents parts of the spectrogram that it considers most important, Philips’ algorithm revolves around compressing the full spectrogram as much as possible. The Philips algorithm looks out for changes in time and frequency.

This is illustrated in the figure above, which shows the spectrogram changes of “Back to Black” (compare with the original spectrogram). Note that this technique naturally places an emphasis on the noise-robust high-frequency peaks and less so on the low-frequency peaks, since the latter don’t change as quickly.

The resulting figure is then rendered into a binary fingerprint by storing only whether the changes are larger than a threshold, as illustrated:

Though the above figure may not seem very useful to us, it codes a lot of essential information that is easily parsed and compared by a computer. This shows some key differences between this algorithm and Shazam’s algorithm: the Philips algorithm is much more resistant to a strong burst of noise or a shift of the peaks, since it doesn’t prefer any sections of the spectrogram to others. However, it is less resistant to continuous noise.
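A minimal sketch of this kind of binary fingerprint is shown below, in the spirit of the scheme described by Haitsma and Kalker. It is not the Philips implementation: the input is assumed to be a matrix of energies per time frame and frequency band, and the random test data, noise level and zero threshold are chosen purely for illustration.

```python
import numpy as np

def binary_fingerprint(band_energy, threshold=0.0):
    """band_energy: (n_frames, n_bands) -> bits of shape (n_frames - 1, n_bands - 1)."""
    d_freq = band_energy[:, :-1] - band_energy[:, 1:]  # change across neighbouring bands
    d_both = d_freq[1:, :] - d_freq[:-1, :]            # change of that difference over time
    return (d_both > threshold).astype(np.uint8)       # keep only one bit per change

def bit_error_rate(a, b):
    """Fraction of bits that differ between two fingerprints of equal shape."""
    return float(np.mean(a != b))

# Assumed test data: random band energies, a mildly distorted copy, and an unrelated clip.
rng = np.random.default_rng(1)
clean = rng.random((200, 33))
noisy = clean + 0.05 * rng.standard_normal(clean.shape)
other = rng.random((200, 33))

fp_clean, fp_noisy, fp_other = map(binary_fingerprint, (clean, noisy, other))
ber_same = bit_error_rate(fp_clean, fp_noisy)
ber_diff = bit_error_rate(fp_clean, fp_other)
print(ber_same, ber_diff)
```

The distorted copy differs from the original in only a small fraction of bits, whereas an unrelated clip differs in about half of them, which is what a computer exploits when comparing fingerprints.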

Intrasonics

Intrasonics have researched a solution that attempts to find a middle ground between the original philosophies of Philips and Shazam. Instead of deciding what part of the spectrogram we think is most important, the Intrasonics algorithm leverages modern machine-learning techniques to select the best features of the spectrogram. To do that, the Intrasonics fingerprinting algorithm first enters a training phase, in which the spectrogram is filtered by a range of possible filters at different frequencies, some of which are illustrated below.

The shape of the filter roughly reflects the feature that the filter looks out for. During the training phase, more than 10'000 filters are applied to the spectrogram, and the results are all collected together. These results are then analysed by the computer to learn which features are most characteristic of the audio.

The image above illustrates what the computer is looking for. If a feature is very characteristic, it will look very different for differing audio segments (blue) whilst looking very similar for similar audio segments (red). The computer looks for features that separate the red and the blue data points as much as possible, and selects the best filters out of the filter candidates. Once a set of filters is selected that work well together, the training phase is completed.

The selected filters are then used to extract the features from the spectrogram that help us identify the audio best. The advantage of this methodology is that the system is not reliant on what we think may be distinctive features of the spectrogram, but on the features that have been shown to be. Furthermore, by adding noise to the training data, we can train the system to ignore similar noise in the future. The system can easily be re-trained for a specific use case if so desired. These are typical advantages of using machine learning to aid feature extraction.
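To make the idea of selecting filters from training data concrete, here is a toy sketch. It is emphatically not the Intrasonics training procedure: the random "patches", the noise model (the first half of the features is assumed to be heavily distorted), the random candidate filters and the simple per-filter score are all hypothetical, and each filter is scored independently rather than as part of a set that works well together.

```python
import numpy as np

rng = np.random.default_rng(2)
n_pairs, dim, n_filters, n_keep = 2000, 16, 50, 8

# Hypothetical training data: feature patches, distorted copies, and unrelated patches.
clean = rng.standard_normal((n_pairs, dim))
noise_scale = np.where(np.arange(dim) < dim // 2, 2.0, 0.1)  # first half is fragile
distorted = clean + noise_scale * rng.standard_normal((n_pairs, dim))
unrelated = rng.standard_normal((n_pairs, dim))

# Hypothetical candidate filters: each looks at either the fragile or the robust half.
filters = rng.standard_normal((n_filters, dim))
uses_fragile = np.arange(n_filters) % 2 == 0
filters[uses_fragile, dim // 2:] = 0.0
filters[~uses_fragile, :dim // 2] = 0.0

def bits(patches):
    """One bit per (patch, filter): is the filter response positive?"""
    return (patches @ filters.T) > 0

# A characteristic feature agrees for matching pairs and disagrees for non-matching pairs.
agree_match = np.mean(bits(clean) == bits(distorted), axis=0)
agree_nonmatch = np.mean(bits(clean) == bits(unrelated), axis=0)
score = agree_match - agree_nonmatch

selected = np.argsort(score)[::-1][:n_keep]  # keep the best-scoring filters
print(selected, uses_fragile[selected])
```

Because the distortion was part of the training data, the selection avoids the filters that rely on the fragile features without anyone having to specify which features those are.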

Audio Matching

Once the fingerprints have been extracted, the next challenge is to identify the content to which the fingerprint belongs. This process is often called audio matching (although arguably, it should be called fingerprint matching). In order to perform the matching, we must first establish a reference database. Essentially, the reference database is acquired by extracting fingerprints from the reference content. The reference content may be pre-existing (e.g. a film), or could be a live feed (e.g. a TV channel). The size of the reference database will generally determine the solution architecture.

Note: We will focus below on the use case of a mobile phone app. However, use cases in toys or other embedded devices are possible, as well.

Small reference database

The standard use case is characterised by recognition of only a small selection of audio. In this case, a small reference database is one that represents roughly 10 hours of audio or less. This is generally enough to create interactive solutions that synchronise with a film, TV show or similar.

The reference database is created by fingerprinting the reference material. The reference database can then be built into the app off-line, or it can be distributed later (on-line). Once the app has received the reference database, it can function locally and off-line. The app performs the following two-step process:

1. extract a fingerprint from audio captured with the microphone, and

2. compare the fingerprint to the reference database.

This results in a low-latency solution that doesn’t require any connectivity or server infrastructure. The app can work as a standalone unit and can recognise and synchronise to content whenever and wherever it is encountered.
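The comparison step can be sketched as a sliding search over a binary reference fingerprint. This is an illustrative brute-force version under stated assumptions, not a production matcher: the fingerprint layout (32 bits per frame), the 10% bit-flip rate that stands in for microphone noise, and the 0.25 decision threshold are all made up for the example.

```python
import numpy as np

def best_match(query, reference):
    """Slide query (n_q, n_bits) along reference (n_r, n_bits).

    Returns (offset, bit_error_rate) of the best-aligned position.
    """
    n_q = len(query)
    errors = [
        np.mean(query != reference[off:off + n_q])
        for off in range(len(reference) - n_q + 1)
    ]
    best = int(np.argmin(errors))
    return best, float(errors[best])

rng = np.random.default_rng(3)
reference = rng.integers(0, 2, size=(1000, 32), dtype=np.uint8)  # stand-in reference database

# Simulate a captured sample: a slice of the reference with about 10% of bits flipped.
true_offset = 417
query = reference[true_offset:true_offset + 50].copy()
flips = rng.random(query.shape) < 0.1
query[flips] ^= 1

offset, ber = best_match(query, reference)
is_match = ber < 0.25  # assumed decision threshold
print(offset, ber, is_match)
```

The returned offset tells the app not only which content is playing but also where in the content it is, which is what allows synchronisation.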

Large reference database

In cases in which the app is designed to recognise and respond to a large body of content, the architecture of the previous section is no longer feasible. In this case, the storage and computational complexity required to handle the reference database is too much for it to be incorporated in an app. Therefore, the reference database is not distributed to the app, but maintained on the server instead.

It follows that the audio matching must be performed on the server, too. In this scenario, the app extracts the fingerprint from the audio and uploads it to the server that maintains the reference database. The server performs the audio matching, and the result is passed back to the app.

The advantage of this methodology is that the body of reference content may be arbitrarily large. An added benefit is the simplicity of appending more reference material into the database. This is especially useful in cases in which audio is continuously being added to the database, for instance for live TV matching.

Data-rate and matching quality

It is apparent that there is a wide range of techniques to create a fingerprint from audio, as well as a wide range of applications. It may not come as a surprise that no “ideal” system exists. For instance, we’ve seen that different types of noise affect some methodologies more than others. We’ve also seen that use cases require reference databases ranging in size from a few hours of audio to nearly boundless (for the latter, simply imagine the number of songs that Shazam may need to match).

Let’s look at some of the most important performance metrics for an audio fingerprinting system:

· False negative probability — A false negative denotes the case that the audio-matching system fails to report a match when it should have (that is, the audio was present in the reference database).

· False positive probability — A false positive describes the case that the audio-matching system reports a match when it shouldn’t have (that is, the audio was not present in the reference database).

· Data rate — indicates the amount of data in a fingerprint

Although these are not the only relevant metrics, it is especially illustrative to consider the above metrics in more detail. We will see that choices in system design will generally impact multiple metrics simultaneously. For example, we will see that systems that aim for a very low false negative probability tend to suffer from a higher false positive probability and vice versa.

Data rate

The data rate indicates the number of bytes generated for the fingerprint. If Shazam chooses to store more spectrogram peaks and connections, for instance, this increases the fingerprint size and therefore the data rate. Equally, if Intrasonics increase the number of filters applied to the spectrogram, the data rate increases. It may be obvious that an increased data rate leads to more storage required on the phone and potentially increased internet data and/or bandwidth usage. What may be less obvious is that an increased data rate leads to increased storage requirements for the reference database and slower matching. This happens because the reference database contains fingerprinted audio: if the fingerprint data rate increases, the references grow in size, and the matching must compare a bigger set of data.
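A back-of-the-envelope calculation shows how the data rate translates into reference database size. The figures used here (32 or 64 bits per fingerprint frame, one frame every 11.6 ms, 10 hours of audio) are assumptions chosen for illustration and do not describe any particular product.

```python
def database_megabytes(bits_per_frame, frame_interval_s, hours_of_audio):
    """Size of a reference database that stores one fingerprint frame per interval."""
    bits_per_second = bits_per_frame / frame_interval_s  # the fingerprint data rate
    total_bits = bits_per_second * hours_of_audio * 3600
    return total_bits / 8 / 1e6

small = database_megabytes(32, 0.0116, 10)
large = database_megabytes(64, 0.0116, 10)
print(f"{small:.1f} MB vs {large:.1f} MB")
```

Under these assumptions, 10 hours of reference audio take roughly 12.4 MB, and doubling the bits per frame doubles both the storage and the amount of data the matcher must compare.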

A reference distribution of bit differences between matching and non-matching audio segments.
The distribution of bit differences between matching and non-matching audio segments, using 4x fewer bits than the reference above.
The distribution of bit differences between matching and non-matching audio segments, using 16x fewer bits than the reference (top picture).

So why increase the data rate? The data size of the fingerprint governs the maximum number of minutes of distinct audio that can be discerned by the system. Information theory teaches us that a fingerprint of only 4 bits, for example, can discern at most 2 × 2 × 2 × 2 = 2^4 = 16 different possibilities. If the system must be resilient to random noise, the number of possibilities is reduced further. In practice, the more bits we use, the better we can discern a larger reference database. To illustrate this, let’s look at the comparison between the red and the blue curves we’ve seen before.

As before, the red curve represents similar audio, the blue curve represents different audio. We wish that our audio-matching system classifies similar audio as “the same” (everything in the red curve) and classifies different audio as “different” (everything in the blue curve). The images above show, from top to bottom, the effect of reducing the number of bits in the fingerprint (with ratio 16:4:1). It is clear that the fewer bits used, the more overlap appears between the curves and the harder it is to discern similar audio from different audio.
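The growing overlap can be reproduced with a simple model. The sketch below assumes, purely for illustration, that each fingerprint bit differs independently with probability 0.15 for matching audio and 0.5 for different audio, so that the number of differing bits follows a binomial distribution. Real fingerprint bits are not independent, so the numbers only indicate the trend.

```python
from math import comb

def binom_pmf(n, p):
    """Probability of k differing bits out of n, for k = 0..n."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def overlap(n_bits, p_match=0.15, p_different=0.5):
    """Shared area of the two bit-difference distributions (0 = fully separable)."""
    red = binom_pmf(n_bits, p_match)       # matching audio
    blue = binom_pmf(n_bits, p_different)  # different audio
    return sum(min(r, b) for r, b in zip(red, blue))

# Bit counts in the ratio 16:4:1.
for n_bits in (256, 64, 16):
    print(n_bits, overlap(n_bits))
```

With every fourfold reduction in bits the shared area grows, so more matching and non-matching segments become indistinguishable.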

False negatives versus false positives

Immediately relevant to this discussion is the tradeoff between false negatives and false positives. Once the red curve and blue curve have been established by choosing the data-rate, we must set a threshold that separates “similar” audio from “different” audio. This is illustrated in the figure below:

As illustrated, the threshold defines the cut-off point for audio that we classify as a match. Everything to the right-hand side of the cut-off point is classified as different audio. Thus, similar (red) audio samples to the right of the cut-off point will falsely be classified as different, that is, they will become false negatives. Conversely, different audio samples (blue) to the left of the cut-off point will falsely be classified as the same, that is, they will become false positives. As we set the cut-off further to the left, the false negative probability increases and the false positive probability decreases.
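The effect of moving the cut-off can be computed directly under the same kind of simplified model: a 16-bit fingerprint whose bits differ independently with probability 0.15 for matching audio and 0.5 for different audio. These numbers are assumptions for illustration only.

```python
from math import comb

def error_probabilities(n_bits, threshold, p_match=0.15, p_different=0.5):
    """Classify as a match when the number of differing bits is <= threshold.

    Returns (false_negative_probability, false_positive_probability).
    """
    def cdf(p):
        return sum(
            comb(n_bits, k) * p**k * (1 - p)**(n_bits - k)
            for k in range(threshold + 1)
        )
    false_negative = 1.0 - cdf(p_match)  # matching audio to the right of the cut-off
    false_positive = cdf(p_different)    # different audio to the left of the cut-off
    return false_negative, false_positive

rows = [(threshold, *error_probabilities(16, threshold)) for threshold in (2, 4, 6, 8)]
for threshold, fn, fp in rows:
    print(f"cut-off {threshold}: false negative {fn:.4f}, false positive {fp:.4f}")
```

Reading the rows from the largest cut-off to the smallest corresponds to moving the cut-off to the left: the false negative probability rises while the false positive probability falls.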

Compromise

As we’ve seen, the data rate, false positive rate and false negative rate combine into an intricate compromise. It is impossible to optimise all three performance indicators at once; rather, we must choose a compromise that suits the use case. Using machine learning to establish the system parameters, as Intrasonics do, provides the advantage that the system can be re-trained for new requirements when necessary.

Privacy

In the current time of tracking cookies and customer profiling, recording audio is a sensitive topic. Therefore, the Intrasonics fingerprinting solution is developed with privacy in mind. In the case of an in-app reference database, no audio data leaves the phone. That means that the entire interactive app can run without requiring WiFi, Bluetooth or access to the internet, making it especially suitable for privacy-centric applications.

Even in the case of an external reference database, the fingerprints that are uploaded are abstract signatures. The original audio cannot be recreated from its fingerprint; the fingerprint cannot be used to interpret voice or context.

Further reading

Issues with digital watermarking and perceptual hashing (Kalker et al., 2001)

An Industrial-Strength Audio Search Algorithm (Wang, 2003)

A Highly Robust Audio Fingerprinting System (Haitsma and Kalker, 2002)

Pairwise Boosted Audio Fingerprint (Jang et al., 2009)