Demystifying Audio Watermarking, Fingerprinting and Modulation/Demodulation
In the real world, we’re continuously inferring meaning from sounds we hear around us, from alarms and ring-tones to the complex semantics of spoken language. Similarly, in the digital realm, there are lots of different ways to embed and extract meaning from a piece of sound — and different motivations for doing so.
In this post, we’ll examine three specific approaches to audio information analysis: watermarking (as done by Digimarc and Fraunhofer), fingerprinting (as done by Shazam), and modulation/demodulation (as done by us at Chirp).
Each has specific affordances and applications, with substantial overlap between them. This article should help to distinguish between the three, and explain where and why each can be best deployed.
Motivations behind audio information retrieval
There are a number of different reasons we may need to extract information from an audio signal.
- for communication: A dominant theme, and the one we’re concerned with at Chirp, is the use of audio as a data carrier: translating digital information into sound and then decoding it at the receiving end. The objective is simply to send a digital message, intact, from machine to machine. This has a long historical precedent, from Morse code to V.90 dial-up modems. Today, the big revolution is in doing so over the air, across the “air gap” between one device’s speaker and another’s microphone, a situation that is becoming increasingly relevant in the age of the Internet of Things.
- to obtain metadata: When listening to a piece of music on the radio or in a podcast, it’s useful to be able to find out what it is and who it’s by. Assuming that all we have to go on is the audio itself (without an additional metadata stream such as m3u), this means that the metadata must be somehow encoded within the audio, or derived from it by identifying the work.
- to protect rights-holders: For owners or licensees of copyrighted material, it’s important to check whether their library of material has been properly licensed by third-party broadcasters. This can be done by automatically listening in to radio and TV streams, identifying the material and confirming that the broadcaster has licensed the work. Likewise, a broadcaster can use information retrieval to check over their own library to confirm the rights are in place.
- for musical analysis and performance: Fields such as computational musicology can provide a wealth of information on compositional trends, instrumentation and performance styles by automatically analysing a recorded work.
- device synchronisation and second-screen display: An increasing number of traditional TV broadcasts now make use of tablets and other mobile devices to display supplementary information, or to provide secondary interactive controls such as voting mechanisms. Embedding sync data in the audio stream allows the broadcast and the accompanying device to stay in sync with each other, to trigger specific actions, or to make user interfaces visible to the viewer.
We’ll now look at three particular approaches to extracting information from an audio stream: watermarking, fingerprinting, and modulation/demodulation.
Audio fingerprinting
Fingerprinting, or “content-based audio identification”, produces a fingerprint of a snippet of audio by analysing its musical content and mapping out its general contours: for example, by looking for distinctive melodies or rhythms. (In practice, most real-world implementations use more sophisticated measures derived from properties of the frequency spectrum.)
This fragmentary fingerprint can then be looked up in a huge database or “corpus” of known fingerprints, typically to identify the source music that the fragment comes from. The best-known example of audio fingerprinting in the consumer realm is Shazam, capable of identifying short fragments of music from a vast database of tens of millions of tracks. The difficult feat of this kind of real-world fingerprinting is that it must be (a) performed efficiently, in near real-time despite the complexity of the search process; and (b) robust to background noise and distortion, given that the sample may be taken from a noisy environment or played from a low-bitrate MP3. (More about how Shazam works).
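To make the lookup idea concrete, here is a minimal sketch of spectrum-based fingerprinting. It is a toy illustration, not Shazam’s algorithm: the sample rate, frame length, track names and the choice to hash pairs of consecutive spectral peaks are all assumptions made for this example. It also assumes the query is aligned to frame boundaries, which a real system cannot rely on.

```python
import cmath
import math
import random
from collections import Counter

SAMPLE_RATE = 8000  # Hz (hypothetical)
FRAME = 128         # samples per analysis frame (hypothetical)

def dominant_bin(frame):
    """Return the index of the strongest frequency bin, using a naive DFT."""
    n = len(frame)
    best_bin, best_power = 0, -1.0
    for k in range(1, n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        power = abs(acc) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin

def fingerprint(samples):
    """Hash each pair of consecutive spectral peaks into a set of landmarks."""
    peaks = [dominant_bin(samples[i:i + FRAME])
             for i in range(0, len(samples) - FRAME + 1, FRAME)]
    return set(zip(peaks, peaks[1:]))

def tone_sequence(freqs, noise=0.0, seed=0):
    """Synthesise a toy 'track': one sine tone per frame, plus optional noise."""
    rng = random.Random(seed)
    out = []
    for f in freqs:
        out.extend(math.sin(2 * math.pi * f * i / SAMPLE_RATE)
                   + rng.gauss(0.0, noise) for i in range(FRAME))
    return out

def identify(query_prints, corpus):
    """Return the corpus entry sharing the most landmarks with the query."""
    votes = Counter({name: len(query_prints & prints)
                     for name, prints in corpus.items()})
    return votes.most_common(1)[0][0]

# Pre-populate the corpus with two hypothetical tracks.
corpus = {
    "track_a": fingerprint(tone_sequence([500, 1000, 750, 1500, 1250, 2000])),
    "track_b": fingerprint(tone_sequence([1000, 500, 2000, 1250, 1500, 750])),
}

# A short, noisy excerpt from the middle of track_a.
query = tone_sequence([750, 1500, 1250], noise=0.2)
match = identify(fingerprint(query), corpus)
```

Note that the query carries no payload of its own: everything useful comes from the corpus entry it matches, which is why the track must already be in the database.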
Fingerprinting is applied in the broadcast realm to protect rights holders from copyright infringement. Services such as Nielsen Broadcast Data Systems provide fingerprinting-as-a-service, maintaining a database of tracks on behalf of their rights holders and then scanning a substantial list of radio stations to ensure licensing is in place.
It can also be deployed for synchronisation and second-screen purposes, with the ability to make broadcasts interactive by listening for the specific audio fingerprint in question.
The greatest power — and the greatest constraint — of fingerprinting is that it does not modify the original source material in any way. This has the major benefit of being non-destructive and not requiring any expensive or time-consuming retroactive processing of media databases.
However, it means that arbitrary data cannot be directly embedded or communicated in this way. Once the fingerprint is obtained, all it provides is a digest of the track, which is then looked up in a database. This additional lookup step typically requires that the device be internet-connected (in the case of networked databases such as Shazam’s), and adds some architectural complexity.
Because a fingerprint must be pre-populated in the corpus for the match to be successful, the fingerprinting paradigm is a poor fit for cases where the data to be embedded needs to be generated or modified in real-time, such as a sports score.
Audio watermarking
Like fingerprinting, audio watermarking is popular for rights management, in that it can also be used to identify a particular audio recording or broadcast. Unlike fingerprinting, watermarking operates by modifying an original piece of source material, layering additional information on top of it.
The crux of watermarking is that it should be done in a way that is (a) imperceptible to the listener, yet (b) resilient to distortion and compression. This is a challenging and often contradictory set of requirements: compression codecs such as MP3 operate on the basis that they explicitly strip out those parts of the acoustic signal that are not audible to the listener. Yet a good watermark should persist and remain detectable even when compressed using a lossy codec such as MP3.
How is this achieved? Typically, by using clever combinations of steganography (that is, concealing information by subtly modifying the source material) and psychoacoustics (the scientific understanding of how humans perceive sound). Some approaches layer low-frequency noise onto the original recording, which can be correlated with an expected carrier pattern; others make use of psychoacoustic phenomena such as the Haas effect, adding a subtle echo to the recording that our brains perceive as part of the original sound. For a summary of different approaches, see this BBC R&D audio watermarking white paper.
Watermarking itself has different motivations, which affect the selected approach. A watermark for copyright control should be as difficult as possible to detect and remove from the original broadcast. Ideally, to prevent others from stripping or spoofing watermarks, it should use cryptographic processes that cannot easily be reverse-engineered.
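The “noise correlated with an expected carrier pattern” idea can be sketched as follows. This is a simplified illustration under stated assumptions, not any vendor’s scheme: the carrier here is a plain pseudo-random sequence of +1/-1 values with no psychoacoustic shaping, and the chunk length, amplitude, seed and 440Hz host tone are all hypothetical values chosen for the example.

```python
import math
import random

CHUNK_LEN = 4096  # samples used to carry one bit (hypothetical)
ALPHA = 0.05      # watermark amplitude relative to full scale (hypothetical)

def carrier(seed, n):
    """Pseudo-random +1/-1 sequence shared by the embedder and the detector."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def embed(host, bits, seed):
    """Add the carrier to each chunk of the host, sign-flipped for a 0 bit."""
    pn = carrier(seed, CHUNK_LEN)
    out = list(host)
    for b, bit in enumerate(bits):
        sign = 1.0 if bit else -1.0
        for i in range(CHUNK_LEN):
            out[b * CHUNK_LEN + i] += ALPHA * sign * pn[i]
    return out

def detect(marked, n_bits, seed):
    """Correlate each chunk with the expected carrier; the sign gives the bit."""
    pn = carrier(seed, CHUNK_LEN)
    bits = []
    for b in range(n_bits):
        chunk = marked[b * CHUNK_LEN:(b + 1) * CHUNK_LEN]
        correlation = sum(x * c for x, c in zip(chunk, pn))
        bits.append(1 if correlation > 0 else 0)
    return bits

payload = [1, 0, 1, 1, 0, 0, 1, 0]
# Stand-in for the original recording: a 440Hz tone sampled at 44.1kHz.
host = [0.5 * math.sin(2 * math.pi * 440 * i / 44100)
        for i in range(CHUNK_LEN * len(payload))]
marked = embed(host, payload, seed=1234)
recovered = detect(marked, len(payload), seed=1234)
```

Spreading each bit over thousands of samples is what lets a very quiet watermark be recovered: the carrier adds up coherently in the correlation, while the host audio largely averages out. The price is a very low bit-rate.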
More benign motives for watermarking include the addition of metadata, including subtitles or artist/track data, or sync information to a broadcast. In these cases, the hard-to-remove requirement matters less. Here, a watermark could be as simple as a set of inaudible ultrasonic tones that are overlaid onto the original material. This can obtain higher bit-rates than the subtle approaches of steganographic watermarking, whilst retaining the benefits of imperceptibility and dynamic payload support. Chirp’s ultrasonic encoding protocols can be used to introduce inaudible watermarks of this type.
Modulation/demodulation
The final approach we’ll look at is that used by Chirp in our technology products. Audio data encoding, or modulation/demodulation, is a technology that has been used since the early days of radio communication, from Morse code to DTMF dial-tones to the V.90/V.92 56kbps protocols used by dial-up modems and fax machines. This is, in fact, the etymology of “modem”: modulator/demodulator.
Unlike either of the previous approaches, it does not require an existing audio signal to operate on. Instead, data is encoded by generating a new signal whose properties are determined by the data to be transmitted. In the simplest mapping, the presence of a signal denotes a “1”, and the absence of a signal denotes a “0”.
Where noise is already present in the communications channel, such a distinction is less clear, so early dial-up modems instead used two distinct frequencies: in the case of Bell 103, 1270Hz to denote a “1” and 1070Hz to denote a “0”.
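The two-frequency scheme can be sketched in a few lines. The 1270Hz and 1070Hz tones are the Bell 103 values quoted above; the sample rate and bit duration are assumptions chosen to keep the example simple, and are much slower than a real modem would use.

```python
import math

SAMPLE_RATE = 8000     # Hz (assumed)
SAMPLES_PER_BIT = 160  # bit duration chosen for convenience (assumed)
MARK_HZ = 1270.0       # tone denoting a "1"
SPACE_HZ = 1070.0      # tone denoting a "0"

def modulate(bits):
    """Generate one burst of tone per bit, keeping the phase continuous."""
    samples = []
    phase = 0.0
    for bit in bits:
        freq = MARK_HZ if bit else SPACE_HZ
        for _ in range(SAMPLES_PER_BIT):
            samples.append(math.sin(phase))
            phase += 2 * math.pi * freq / SAMPLE_RATE
    return samples

def tone_energy(chunk, freq):
    """Energy of `chunk` at `freq`, from its in-phase and quadrature parts."""
    i_sum = sum(x * math.cos(2 * math.pi * freq * n / SAMPLE_RATE)
                for n, x in enumerate(chunk))
    q_sum = sum(x * math.sin(2 * math.pi * freq * n / SAMPLE_RATE)
                for n, x in enumerate(chunk))
    return i_sum ** 2 + q_sum ** 2

def demodulate(samples):
    """Decide each bit by comparing the energy at the two tone frequencies."""
    bits = []
    for start in range(0, len(samples) - SAMPLES_PER_BIT + 1, SAMPLES_PER_BIT):
        chunk = samples[start:start + SAMPLES_PER_BIT]
        is_mark = tone_energy(chunk, MARK_HZ) > tone_energy(chunk, SPACE_HZ)
        bits.append(1 if is_mark else 0)
    return bits

message = [1, 0, 1, 1, 0, 0, 1, 0]
decoded = demodulate(modulate(message))
```

Because the receiver compares two energies rather than testing one against a fixed threshold, a uniform rise in background noise affects both sides of the comparison roughly equally.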
Of course, it is possible to go beyond a binary on-or-off approach. Chirp’s communication system maps integers to larger sets of frequencies: our standard protocol uses tones of 32 different frequencies, resulting in far larger throughput.
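With 32 distinct tones, each tone can carry log2(32) = 5 bits rather than 1. The sketch below shows the general idea of mapping integers to frequencies and back. The base frequency, tone spacing, sample rate and symbol length are hypothetical values for illustration, not the parameters of Chirp’s actual protocol.

```python
import math

SAMPLE_RATE = 8000    # Hz (hypothetical)
SYMBOL_SAMPLES = 160  # samples per tone (hypothetical)
BASE_HZ = 1000.0      # frequency of symbol 0 (hypothetical)
STEP_HZ = 50.0        # spacing between adjacent tones (hypothetical)
NUM_TONES = 32

BITS_PER_TONE = int(math.log2(NUM_TONES))

def tone_hz(symbol):
    """Map an integer in 0..31 to its tone frequency."""
    return BASE_HZ + STEP_HZ * symbol

def encode(symbols):
    """Emit one sine burst per symbol."""
    out = []
    for s in symbols:
        f = tone_hz(s)
        out.extend(math.sin(2 * math.pi * f * n / SAMPLE_RATE)
                   for n in range(SYMBOL_SAMPLES))
    return out

def energy(chunk, freq):
    """Energy of `chunk` at `freq`, from its in-phase and quadrature parts."""
    i_sum = sum(x * math.cos(2 * math.pi * freq * n / SAMPLE_RATE)
                for n, x in enumerate(chunk))
    q_sum = sum(x * math.sin(2 * math.pi * freq * n / SAMPLE_RATE)
                for n, x in enumerate(chunk))
    return i_sum ** 2 + q_sum ** 2

def decode(samples):
    """For each burst, pick whichever of the 32 tones holds the most energy."""
    symbols = []
    for start in range(0, len(samples), SYMBOL_SAMPLES):
        chunk = samples[start:start + SYMBOL_SAMPLES]
        symbols.append(max(range(NUM_TONES),
                           key=lambda s: energy(chunk, tone_hz(s))))
    return symbols

symbols = [0, 31, 7, 18, 4]
received = decode(encode(symbols))
```

In this sketch the five tones carry 25 bits in the time a binary scheme with the same tone length would carry 5.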
However, sending acoustic information between air-gapped devices has challenges of its own. Background noise and reverberation serve to distort the original signal, meaning that the transmission rate must be reduced to keep reliability high.
There’s also the likelihood of missed tones — for example, in a situation in which a passing siren obscures part of the original signal — which requires additional error-correction to be factored in to correct and compensate for misdetections.
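As a minimal illustration of recovering from a missed tone, the sketch below appends a single XOR parity symbol to a message of 5-bit symbols. This is a deliberately simple stand-in for a real error-correcting code, and is not the scheme Chirp uses: it can repair exactly one tone that the receiver knows it missed (marked here as `None`), and cannot fix tones that were misheard as a different value.

```python
from functools import reduce
from operator import xor

def add_parity(symbols):
    """Append one parity symbol: the XOR of all data symbols."""
    return list(symbols) + [reduce(xor, symbols, 0)]

def recover(received):
    """Repair at most one missed tone (None), then strip the parity symbol."""
    missing = [i for i, s in enumerate(received) if s is None]
    if len(missing) > 1:
        raise ValueError("one parity symbol can only repair one missed tone")
    repaired = list(received)
    if missing:
        # XOR of every surviving symbol (parity included) equals the lost one.
        repaired[missing[0]] = reduce(
            xor, (s for s in received if s is not None), 0)
    return repaired[:-1]

sent = add_parity([3, 17, 31, 8])
damaged = list(sent)
damaged[2] = None  # e.g. a passing siren obscured the third tone
restored = recover(damaged)
```

The cost is one extra tone per message. Stronger codes spend more redundancy in exchange for repairing several missing or incorrect tones.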
One of the key strengths of audio modulation is that information can be encoded and decoded in real-time, without any external components such as a look-up database. It’s a good fit for compact, dynamic payloads, which means it’s appropriate for creating network-like communication links between devices. It has a relatively high throughput versus watermarking (and certainly fingerprinting), making it suitable for security-critical applications such as exchanging authentication or payment tokens.
It is also relatively computationally light, particularly in comparison to the great complexity of a fingerprinting server. This means that audio-based communication is viable amongst simple devices such as IoT nodes.
Modulation plays some part in watermarking, in which data must be acoustically encoded before it can be applied to the original source material. However, the ultrasonic “watermarking” described above would be highly inappropriate for copyright control or other integrity-critical purposes: it could trivially be removed by a filter that strips out ultrasonic frequencies.
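To show how little effort that removal takes, here is a sketch of a standard Hamming-windowed sinc low-pass filter applied to a mix of an audible tone and a near-ultrasonic one. The 500Hz and 19kHz tones, their amplitudes, the 12kHz cutoff and the filter length are all hypothetical values chosen for the example.

```python
import math

SAMPLE_RATE = 48000      # Hz (hypothetical)
AUDIBLE_HZ = 500.0       # stand-in for the programme audio
ULTRASONIC_HZ = 19000.0  # stand-in for a near-ultrasonic watermark tone
CUTOFF_HZ = 12000.0      # low-pass cutoff (hypothetical)
NUM_TAPS = 101
ANALYSIS_LEN = 1920      # a whole number of cycles of both tones

def lowpass_taps():
    """Hamming-windowed sinc low-pass filter, normalised to unity gain at DC."""
    fc = CUTOFF_HZ / SAMPLE_RATE
    mid = (NUM_TAPS - 1) / 2
    taps = []
    for n in range(NUM_TAPS):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (NUM_TAPS - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]

def fir_filter(samples, taps):
    """Apply the (symmetric) filter wherever it fully overlaps the input."""
    return [sum(t * samples[i + j] for j, t in enumerate(taps))
            for i in range(len(samples) - len(taps) + 1)]

def amplitude(samples, freq):
    """Estimate the amplitude of the component of `samples` at `freq`."""
    i_sum = sum(x * math.cos(2 * math.pi * freq * n / SAMPLE_RATE)
                for n, x in enumerate(samples))
    q_sum = sum(x * math.sin(2 * math.pi * freq * n / SAMPLE_RATE)
                for n, x in enumerate(samples))
    return 2 * math.hypot(i_sum, q_sum) / len(samples)

mixed = [0.5 * math.sin(2 * math.pi * AUDIBLE_HZ * n / SAMPLE_RATE)
         + 0.1 * math.sin(2 * math.pi * ULTRASONIC_HZ * n / SAMPLE_RATE)
         for n in range(ANALYSIS_LEN + NUM_TAPS - 1)]
filtered = fir_filter(mixed, lowpass_taps())

audible_after = amplitude(filtered, AUDIBLE_HZ)
ultrasonic_before = amplitude(mixed[:ANALYSIS_LEN], ULTRASONIC_HZ)
ultrasonic_after = amplitude(filtered, ULTRASONIC_HZ)
```

The audible tone passes through essentially unchanged while the high-frequency tone is attenuated to a small fraction of its original level, so the listener would hear no difference and the payload would be gone.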
The table below summarises some of the key affordances of each of these three different approaches to audio information retrieval.