An overview of audio recognition methods

Jerome Schalkwijk
Intrasonics
Aug 23, 2018

Use cases

Audio recognition is the discipline of establishing whether two audio clips are “the same”; that is, to recognise audio as something that’s already known. Some of the reasons why we might wish to do so are:

· Information retrieval — We all know the feeling of hearing a piece of music that we really like, and wondering what it is. If that information is not announced to us, we might be able to find it out using audio recognition: Shazam is a well-known example of this use case.

· Synchronisation — By recognising what audio is playing and how far into the track we are, a secondary device may synchronise with audio or video. This enables synchronised interactive experiences: for instance, a TV show may offer a play-along quiz for your smartphone, or a phone app might show synchronised lyrics during a concert.

· Determining audience — It is important for content creators and distributors to know the size of their audience. It is now possible for audience researchers to install audio recognition apps onto volunteers’ phones. The app determines what content is being heard and reports this back to the researchers.

· Copyright protection — To protect content creators, it is important for them to know who is distributing their material, and whether that material is properly licensed. Automated audio recognition solutions may help find their content in a wide range of media platforms and allow them to determine whether a licence is in place.

Audio distortions

In the previous section, the words “the same” were written in quotes. Intuitively, the concept of “the same” needs no explanation. For instance, we all easily recognise our favourite song on the radio when it comes on. So what is the difficulty with establishing whether two audio clips are the same?

In the digital world, we call two datasets identical if every bit in the two sets is identical. Digital audio is just such a dataset: like everything else in the digital world, it is represented by a set of bits. It is easily understood, however, that two audio clips that humans would call “the same” need not comprise identical bits. For instance, if we store our favourite song in a wav file, it will comprise different bits than the same song stored in mp3 format. The bits will differ again between mp2, aac, flac, and so forth. Irrespective of the format, however, it remains the same song to us. Any digital solution worthy of the name “audio recognition” must therefore identify audio in all these different formats as being “the same”.
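To make this concrete, here is a minimal sketch in Python that compares two such files byte for byte via their hashes. The file names are purely illustrative; the point is that a listener would call them “the same” song, yet their digests will never match:

```python
import hashlib

def file_digest(path: str) -> str:
    """Return the SHA-256 digest of a file's raw bytes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Hypothetical files holding the same song in two formats.
# Their bytes differ throughout, so the digests won't match,
# even though a listener would call the audio "the same".
print(file_digest("song.wav"))
print(file_digest("song.mp3"))
```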

At Intrasonics, we’ve set ourselves an even more ambitious goal: audio recognition that isn’t limited to the digital domain. That is, we wish to recognise ambient audio, audio that travels through the air. Most of us realise that our favourite song sounds different when played through different speakers, or even through the same speakers in a different room. But to appreciate just how different our song is in these scenarios, let’s have a look at its waveform. The waveform illustrates how amplitude (roughly: volume level) varies over short periods of time. A beat at a constant frequency, for instance, would show up as a series of peaks and troughs at regular intervals.

Waveform visualising the first 5 seconds of Amy Winehouse’s “Back to Black”.

The figure above shows the waveform of the first five seconds of Amy Winehouse’s “Back to Black”. The song’s rhythm is clearly visible. This waveform is created from the original, digital version of the song. Now let’s play the song through high-quality Tannoy speakers and record the sound 150 cm away from them, with high-quality recording equipment and without any background noise.

Waveform visualising the first 5 seconds of Amy Winehouse’s “Back to Black”, after playback and recording.

The first five seconds of the resulting recording are visualised in the waveform above. Although some of the strongest beats are clearly recognisable (compare, for instance, the beats 1.8 seconds into the track), other characteristic features of the waveform have disappeared or been smudged (compare, for instance, the beats 1.2 seconds into the track). Interestingly enough, if you listened to the original and the recording, you wouldn’t hear many differences beyond some general loss in audio quality.

A recording made by a mobile phone listening to the same track in a noisy environment, played from lower quality speakers, would show a waveform hardly recognisable as the same as that pictured above. It seems that our human ears and brains have developed techniques to mask out most of the sound distortions that happen in a normal environment. Even though the waveform might look very different, we can easily recognise the audio as being “the same”.
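If you’d like to produce waveform plots like the ones above yourself, it takes only a few lines. Here is a minimal sketch using the soundfile and matplotlib libraries; “recording.wav” is a hypothetical stand-in for any clip you have to hand:

```python
import matplotlib.pyplot as plt
import numpy as np
import soundfile as sf

# Load a hypothetical recording and plot its first five seconds.
samples, rate = sf.read("recording.wav")
if samples.ndim > 1:             # mix stereo down to mono
    samples = samples.mean(axis=1)
clip = samples[: 5 * rate]

t = np.arange(len(clip)) / rate  # time axis in seconds
plt.plot(t, clip, linewidth=0.3)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()
```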

Automated audio recognition

So how do we automate audio recognition? In other words, how do we develop a computer system that identifies whether two tracks are similar enough to classify as “the same” to the human ear? It turns out that determining whether two datasets are identical is computationally trivial, but determining whether two datasets are alike is rather hard.
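A naive attempt in code shows the gap. Exact equality is a one-liner; “alike” forces us to choose a similarity measure, and the normalised correlation below is a deliberately simplistic choice that any of the distortions described earlier would defeat:

```python
import numpy as np

def identical(a: np.ndarray, b: np.ndarray) -> bool:
    # Exact comparison: trivially easy, but any single changed
    # sample (or a different file format) breaks it.
    return a.shape == b.shape and np.array_equal(a, b)

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Naive "alikeness": normalised correlation at zero lag.
    # Mild EQ, a small time offset, or background noise defeats
    # this, which is why practical systems compare derived
    # features rather than raw waveforms.
    n = min(len(a), len(b))
    a = a[:n] - a[:n].mean()
    b = b[:n] - b[:n].mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0
```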

There are two main approaches to automated audio recognition: active and passive recognition, sometimes called watermarking and fingerprinting, respectively. Both of these approaches have their advantages and disadvantages. Intrasonics provides a solution for each of these approaches, as they each have their use cases, and can complement each other well.

Watermarking

Active recognition methods modify, or watermark, the original audio. The term watermarking refers to the long-standing method of adding a watermark to a text document. When an audio watermark is added to the original audio clip, a recognition system can look for the existence of this watermark in a second audio clip to determine whether the clips are the same.
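As a toy illustration, and emphatically not a description of Intrasonics’ actual scheme, here is the classic spread-spectrum idea: embed a faint pseudorandom signal derived from a secret key, then test for it by correlation:

```python
import numpy as np

def embed(audio: np.ndarray, key: int, strength: float = 0.005) -> np.ndarray:
    # Add a faint pseudorandom sequence derived from a secret key.
    rng = np.random.default_rng(key)
    return audio + strength * rng.standard_normal(len(audio))

def detect(audio: np.ndarray, key: int) -> bool:
    # Correlate against the keyed sequence; a correlation well
    # above the noise floor suggests the watermark is present.
    rng = np.random.default_rng(key)
    mark = rng.standard_normal(len(audio))
    score = np.dot(audio, mark) / np.sqrt(len(audio))
    return score > 3.0 * np.std(audio)   # ~3-sigma decision rule
```

Even this toy survives a surprising amount of noise, because the correlation grows with the length of the clip while uncorrelated noise averages out; production-grade watermarking additionally shapes the mark psychoacoustically so that it stays inaudible.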

Intrasonics have developed audio watermarking that is robust enough to persist through mp3, aac, mp2 and other forms of compression. Furthermore, the watermarks are robust enough to survive being played through speakers and recorded again. Our watermark detection is sensitive enough to pick up watermarks even in a noisy environment: they can be detected in background music over a conversation, in the music of a shopping mall, or in a revving car.

An important difference between traditional document watermarks and audio watermarks is their conspicuousness. Whereas it is acceptable for a document watermark to be visible (as long as it doesn’t impact the readability of the text), it is generally unacceptable for audio watermarks to be audible. Intrasonics’ watermarking technology is unique in that it is inaudible to the human ear, yet can easily be identified even in highly compressed media (for instance, on YouTube or Netflix).

One big advantage of watermarking is the ease of recognition. Once audio has been watermarked, the watermarks can easily be detected by a phone, tablet, laptop, or embedded device. No internet or Bluetooth connectivity is required because audio does not need to be sent to a server: detection is local and offline. Furthermore, detection requires very little processing effort, so it has negligible impact on the battery life of portable devices. Interested in more detail about how audio watermarking works? Read our watermarking article!

Watermarking is not possible in all use cases, however. In some cases it is not possible to modify the original audio, either because the original audio is already available to users, or because we wish to recognise audio controlled or distributed by a third party.

Fingerprinting

For such cases, audio fingerprinting may present a solution. Fingerprinting, sometimes also called “audio hashing”, is a form of passive recognition because it requires no modification of the source material. Instead, the goal is to recognise an audio clip by comparing it to a reference database of audio, and determining whether a match can be found.

The waveforms above clearly illustrate why direct comparison of audio waveforms is not effective. Instead, the audio is reduced to a fingerprint. The name fits because an audio fingerprint functions much like a real fingerprint: it represents features of the audio which can be compared to a reference. By creating a database of the audio references we’re interested in, we can compare an audio sample against those references to identify it, similar to how a forensic department might identify an individual from their fingerprint.
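For a flavour of what a fingerprint might look like, here is a deliberately crude sketch: keep only the strongest frequency bin of each frame. Practical systems, such as the landmark scheme in the Wang paper listed under Further reading, hash constellations of spectral peaks instead, which is far more robust to noise and distortion:

```python
import numpy as np

def fingerprint(samples: np.ndarray, frame: int = 4096) -> list[int]:
    # Crude fingerprint: the index of the strongest frequency bin
    # in each non-overlapping frame of the clip.
    peaks = []
    for start in range(0, len(samples) - frame + 1, frame):
        spectrum = np.abs(np.fft.rfft(samples[start:start + frame]))
        peaks.append(int(spectrum.argmax()))
    return peaks
```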

In cases where we’re looking to identify something specific, for instance a movie, the reference database doesn’t need to be large. It is then possible to perform fingerprinting and reference look-up locally and offline, much like watermarking. However, when the possibilities span a larger set of content, this approach becomes infeasible. In that case, the fingerprint is uploaded to a server that compares it to the references. The server then responds with the result of the match, much as Shazam performs music lookup.
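In the small, local case, the reference “database” can be as simple as a hash table mapping short fingerprint subsequences to track names. Building on the toy fingerprint sketch above (all names here are illustrative):

```python
def build_index(references: dict[str, list[int]], n: int = 5) -> dict[tuple, str]:
    # Map every n-peak subsequence of each reference fingerprint
    # to the track it came from.
    index = {}
    for name, peaks in references.items():
        for i in range(len(peaks) - n + 1):
            index[tuple(peaks[i:i + n])] = name
    return index

def lookup(index: dict[tuple, str], query: list[int], n: int = 5) -> str | None:
    # Return the first track whose indexed subsequence matches the
    # query fingerprint, or None if nothing matches.
    for i in range(len(query) - n + 1):
        hit = index.get(tuple(query[i:i + n]))
        if hit is not None:
            return hit
    return None
```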

Interested in more detail about how audio fingerprinting works? Read our fingerprinting article.

Further reading

Techniques for data hiding (Bender et al., 1996)

A review of algorithms for audio fingerprinting (Cano et al., 2002)

An Industrial-Strength Audio Search Algorithm (Wang, 2003)
