In our audio recognition document, we described the concept and use cases of audio recognition. We also introduced audio watermarking as a method to perform audio recognition. But how does one watermark audio in such a way that it’s robust to noise and distortions, yet won’t be noticed by a listener?
The use of audio watermarking as a recognition method may not be entirely self-explanatory. Essentially, audio watermarking solves a more general problem, that of sending additional data over sound (more below). We can use this additional data to recognise the audio if the watermark embedded in the audio acts as a reference to metadata that holds the desired information.
Once the audio is watermarked, a phone (or other device) can extract the watermark and dereference the metadata, thus learning what it needs about the audio. This process is illustrated in the graphic below:
In the graphic, a watermark id is added to original audio, creating watermarked audio (the blue waveform). The watermark id uniquely identifies metadata in an external metadata database. The metadata database may exist in the cloud or locally on the phone. When the phone captures the audio using its internal microphone, it can extract the watermark from it. The watermark can be dereferenced to obtain the audio metadata.
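The dereferencing step described above can be sketched as a simple lookup. This is a minimal illustration, not an actual product API: the `METADATA_DB` table, its ids and its fields are all hypothetical, and a real database could equally live in the cloud.

```python
from typing import Optional

# Hypothetical metadata database: watermark id -> metadata record.
# The ids and fields below are made up for illustration.
METADATA_DB = {
    0x1A2B: {"title": "Example advert", "source": "Channel A"},
    0x1A2C: {"title": "Example advert", "source": "Channel B"},
}

def dereference(watermark_id: int) -> Optional[dict]:
    """Return the metadata a watermark id refers to, or None if unknown."""
    return METADATA_DB.get(watermark_id)

print(dereference(0x1A2B))
```

Note that the two records describe the same content with different ids, so the extracted id alone is enough to tell where the audio came from.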
Using audio watermarking for recognition has the following advantages:
· Audio watermarks can be detected and extracted from the audio on any device with a microphone. No Bluetooth, WiFi or other connectivity is required. Watermark extraction is battery-efficient and can run even on old and/or slow hardware.
· Watermark recognition scales exceedingly well. Detection is as simple as extracting a watermark, no matter how large the set of audio is that one wishes to recognise. Furthermore, since no servers are required for extraction, there’s no cost to increasing the user base.
· Audio watermarks can distinguish material that is indistinguishable to humans. For instance, a song might be watermarked differently on Spotify than on Apple Music, allowing watermark detection to establish the source of playback.
An important difference between traditional document watermarks and audio watermarks is their conspicuousness. Whereas it is acceptable for a document watermark to be visible (as long as it doesn’t impact the readability of the text), it is generally unacceptable for audio watermarks to be audible. Therefore, audio watermarking techniques aim to “hide” data in the audio in such a way that a listener won’t notice. A few approaches have been tried and tested over the years.
Possibly the most obvious way to hide sound from humans is to hide it outside the human auditory spectrum. The human ear and brain work together to process sounds roughly between 20Hz and 20kHz, as illustrated below.
The image shows the sound level threshold for different frequencies. This shows that the human auditory system only perceives sounds of sufficient loudness. The loudness required for sound to be perceived varies with frequency and age group. In practice, this means that humans tend not to perceive sounds above around 16kHz, and this threshold drops further as they age. Therefore, the audio spectrum beyond 16kHz is sometimes loosely called the ultrasound range (strictly speaking, ultrasound begins at around 20kHz).
Provided that mobile phone (or other) microphones can capture a part of the ultrasound range, it is possible to add data there. The advantage of adding data in this frequency range is that it’s very simple, since we don’t have to worry about impacting the audio quality. This technique therefore allows high volumes of data to be embedded.
The problem with ultrasound watermarking, however, is that the watermark is easily removed. Since the watermark occupies frequencies that are distinct from those of the main content, it can be removed without impacting the content. This makes the technique ineffectual for forensic purposes.
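To make the idea concrete, here is a minimal sketch of data embedding in the high-frequency range, assuming simple on-off keying of a single 18kHz carrier. The carrier frequency, bit duration, amplitude and threshold are illustrative assumptions, not the parameters of any real product. Detection uses the Goertzel algorithm to measure the carrier level in each bit slot.

```python
import math

SAMPLE_RATE = 44100      # Hz (assumed capture rate)
CARRIER_HZ = 18000       # assumed carrier, above the ~16 kHz limit discussed above
SAMPLES_PER_BIT = 441    # 10 ms per bit, so the carrier completes whole cycles

def embed_bits(audio, bits, amplitude=0.05):
    """Add the carrier tone during every 1-bit; leave 0-bits untouched."""
    out = list(audio)
    for i, bit in enumerate(bits):
        if bit:
            for n in range(i * SAMPLES_PER_BIT, (i + 1) * SAMPLES_PER_BIT):
                out[n] += amplitude * math.sin(2 * math.pi * CARRIER_HZ * n / SAMPLE_RATE)
    return out

def tone_level(block, freq_hz):
    """Goertzel algorithm: estimated amplitude of one frequency in a block."""
    coeff = 2 * math.cos(2 * math.pi * freq_hz / SAMPLE_RATE)
    s1 = s2 = 0.0
    for x in block:
        s1, s2 = x + coeff * s1 - s2, s1
    power = s1 * s1 + s2 * s2 - coeff * s1 * s2
    return 2 * math.sqrt(max(power, 0.0)) / len(block)

def extract_bits(audio, n_bits, threshold=0.025):
    """Decide each bit by whether the carrier is present in its time slot."""
    return [
        1 if tone_level(audio[i * SAMPLES_PER_BIT:(i + 1) * SAMPLES_PER_BIT], CARRIER_HZ) > threshold else 0
        for i in range(n_bits)
    ]

# Host content: a 440 Hz tone, well below the carrier.
bits = [1, 0, 1, 1, 0, 0, 1, 0]
host = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(len(bits) * SAMPLES_PER_BIT)]
print(extract_bits(embed_bits(host, bits), len(bits)))  # prints [1, 0, 1, 1, 0, 0, 1, 0]
```

Because the carrier sits far from the host content, the detector needs no knowledge of the content at all, which is what makes this approach so simple.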
Furthermore, it is not always by intent that the watermark gets removed. Since ultrasound is not audible to humans, it often gets removed in compression. For instance, the following figure shows what happens if we upload a file to YouTube or Vimeo.
We can see that if an input of constant amplitude per frequency is uploaded to YouTube or Vimeo, the streaming service’s compression algorithms remove audio content with frequencies higher than 15.8kHz or 17.0kHz, respectively. This figure was made using the best upload and playback settings available to us in each service. This implies that any watermark added above these cut-off frequencies will be lost as soon as the video is uploaded.
The removal of high frequencies is not limited to YouTube and Vimeo. The images below are audio spectral analyses taken from Netflix and BBC iPlayer, respectively.
The figures show that Netflix’s compression algorithms strongly compress audio above the 10kHz mark, and BBC iPlayer removes anything higher than 15kHz. Although these figures may be somewhat dependent on internet speed and contract type (in the case of Netflix), the majority of content will have its watermark stripped out on these channels. Compression algorithms like these are active even in TV broadcasting, although quality in these cases will vary much more with contract, geographic location and channel.
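The effect of such a cut-off can be sketched with an ordinary low-pass filter. This is only a stand-in under stated assumptions: real streaming codecs are far more complex than a single FIR filter, and the 101-tap Hamming-windowed design, the 1kHz "content" tone and the 18kHz "watermark" tone are all illustrative choices. Only the 15.8kHz cut-off is taken from the text above.

```python
import math

SAMPLE_RATE = 44100  # Hz (assumed)

def lowpass_taps(cutoff_hz, num_taps=101):
    """Hamming-windowed sinc low-pass filter, normalised to unity gain at DC."""
    fc = cutoff_hz / SAMPLE_RATE
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]

def fir_filter(signal, taps):
    """Direct convolution, keeping only fully overlapped output samples."""
    k = len(taps)
    return [sum(taps[j] * signal[i + k - 1 - j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def tone_amplitude(signal, freq_hz):
    """Estimate the amplitude of one frequency by projecting onto cos/sin."""
    w = 2 * math.pi * freq_hz / SAMPLE_RATE
    re = sum(x * math.cos(w * n) for n, x in enumerate(signal))
    im = sum(x * math.sin(w * n) for n, x in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)

# 0.1 s of "content" at 1 kHz plus a weak "watermark" tone at 18 kHz.
signal = [0.5 * math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE)
          + 0.05 * math.sin(2 * math.pi * 18000 * n / SAMPLE_RATE)
          for n in range(4410)]
filtered = fir_filter(signal, lowpass_taps(15800))

for name, s in (("before", signal), ("after", filtered)):
    print(name, "content:", tone_amplitude(s, 1000), "watermark:", tone_amplitude(s, 18000))
```

The content tone passes through essentially unchanged, while the 18kHz tone is attenuated to a small fraction of its original level.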
The disadvantages of ultrasound watermarking have led to the invention and use of spread spectrum watermarking. Spread spectrum watermarking was originally used in image watermarking, but was later adapted for audio, too. As the name implies, the idea of this technique is to distribute the watermark over the whole spectrum. This means that the watermark becomes intertwined with the actual content, making the watermark hard to remove. And since the watermark exists in the same frequency range as the content, it won’t be removed by compression algorithms.
Because the watermark exists in the same frequency range as the content, we must be more careful with its placement. In order to avoid degrading the original content, watermarks must exploit weaknesses in human hearing to avoid being noticeable. A spread spectrum watermark is generally added in the form of low amplitude noise over the full spectrum. As long as the original content comprises a wide enough range of frequencies, the human auditory system won’t notice the low level noise.
Spread spectrum watermarking solves many of the problems of ultrasound watermarking, but is not without limitations itself. For one, it seems that not all listeners are equally insensitive to low level noise. Depending on the aggressiveness of watermark insertion and the type of audio content, some listeners report hearing the watermark as a “hum”.
Secondly, spread spectrum watermarking tends to be very sensitive to pitch shifts. If the frequencies of the sender differ from the frequencies that the receiver expects, watermark extraction fails. One common cause of pitch shifting is the Doppler effect: the apparent shift in frequency that occurs when the sender and receiver move relative to each other. We experience the Doppler effect when we hear the siren of an ambulance change pitch as it comes toward us, passes us and then moves away from us. Spread spectrum watermarking is generally so susceptible to the Doppler effect that extraction of a watermark may fail if the phone and speaker aren’t perfectly still, limiting its use in day-to-day applications.
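A minimal sketch of this idea, assuming the textbook direct-sequence approach: each bit is spread over a long pseudo-random sequence of ±1 "chips" that is added to the host at low amplitude, and the receiver recovers the bit by correlating with the same sequence. The chip length, amplitude and key are illustrative assumptions, and practical systems additionally shape the noise with a psychoacoustic model, which is omitted here.

```python
import random

CHIPS_PER_BIT = 8192   # assumed spreading length per data bit

def chip_sequence(key, length):
    """Pseudo-random +/-1 sequence shared by sender and receiver via `key`."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]

def embed(host, bits, key, alpha=0.02):
    """Add the chip sequence at low amplitude; its sign carries each bit."""
    chips = chip_sequence(key, len(bits) * CHIPS_PER_BIT)
    out = list(host)
    for i, bit in enumerate(bits):
        sign = 1.0 if bit else -1.0
        for n in range(i * CHIPS_PER_BIT, (i + 1) * CHIPS_PER_BIT):
            out[n] += alpha * sign * chips[n]
    return out

def extract(audio, n_bits, key):
    """Correlate each segment with the chip sequence; the sign gives the bit."""
    chips = chip_sequence(key, n_bits * CHIPS_PER_BIT)
    bits = []
    for i in range(n_bits):
        segment = range(i * CHIPS_PER_BIT, (i + 1) * CHIPS_PER_BIT)
        corr = sum(audio[n] * chips[n] for n in segment)
        bits.append(1 if corr > 0 else 0)
    return bits

# Wide-band host content (Gaussian noise), ten times stronger than the watermark.
bits = [1, 0, 0, 1, 1, 0, 1, 0]
host_rng = random.Random(42)
host = [host_rng.gauss(0.0, 0.2) for _ in range(len(bits) * CHIPS_PER_BIT)]
marked = embed(host, bits, key=1234)
print(extract(marked, len(bits), key=1234))
```

The watermark is recoverable even though it is much weaker than the host, because the correlation accumulates the watermark over thousands of samples while the host contribution averages out. The same sample-exact correlation is also why any shift in timing or pitch hurts extraction.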
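To get a feel for the size of the shift, the sketch below evaluates the standard Doppler formula for a receiver moving toward a stationary source. The speed of sound (343 m/s, air at roughly 20°C), the walking speed and the 10kHz component are illustrative assumptions.

```python
SPEED_OF_SOUND = 343.0  # m/s, air at roughly 20 degrees C (assumed)

def doppler_shift_hz(freq_hz, receiver_speed_ms):
    """Frequency offset heard by a receiver moving toward a stationary source."""
    return freq_hz * receiver_speed_ms / SPEED_OF_SOUND

# A phone carried at walking pace (1.5 m/s) toward a loudspeaker.
print(round(doppler_shift_hz(10_000, 1.5), 1))  # prints 43.7
```

A shift of a few tens of hertz is inaudible as a change in pitch, but it can be enough to misalign a detector that expects the watermark at exact frequencies.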
In principle, sound echoes off every object within earshot. The reason we don’t usually hear these echoes is that the human auditory system has evolved to filter out echoes, especially the short ones. After all, if we heard every echo from every object, our senses would be overwhelmed. We only hear the long echoes, which is why we notice echoes over long distances, e.g. in a big empty church or cave. Echo-modulation techniques make use of this insensitivity to hide data in short echoes. Intrasonics is currently the only party offering echo-modulation watermarking.
Echo-modulation watermarking involves taking the original content, calculating what its natural echoes would be, and adding data into these echoes. These artificial echoes have the same structure as natural echoes, but are placed in such a way that devices can selectively extract the artificial echoes to determine the watermark. The shape of the echo determines the data content. If required, multiple non-overlapping echoes can be added simultaneously to increase the data rate.
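As a rough illustration of hiding data in echoes, the sketch below uses classic delay-based echo hiding, in which the delay of a single short echo encodes each bit. This is an assumption made for illustration and not necessarily how Intrasonics' echo-modulation works. The segment length, delays and echo strength are likewise hypothetical, and the autocorrelation detector relies on the host being wide-band.

```python
import random

SEGMENT = 4096                   # samples per bit (assumed)
DELAY_ZERO, DELAY_ONE = 40, 60   # echo delays in samples (~0.9 ms / ~1.4 ms at 44.1 kHz)

def embed(host, bits, alpha=0.3):
    """Add a short echo of the host; the echo delay encodes each bit."""
    out = []
    for n, x in enumerate(host[:len(bits) * SEGMENT]):
        delay = DELAY_ONE if bits[n // SEGMENT] else DELAY_ZERO
        echo = host[n - delay] if n >= delay else 0.0
        out.append(x + alpha * echo)
    return out

def autocorr(audio, start, lag):
    """Autocorrelation of one segment at the given lag."""
    return sum(audio[n] * audio[n - lag] for n in range(max(start, lag), start + SEGMENT))

def extract(audio, n_bits):
    """Pick, per segment, the candidate delay with the stronger autocorrelation."""
    bits = []
    for i in range(n_bits):
        start = i * SEGMENT
        one = autocorr(audio, start, DELAY_ONE)
        zero = autocorr(audio, start, DELAY_ZERO)
        bits.append(1 if one > zero else 0)
    return bits

# Wide-band host content (Gaussian noise) as a stand-in for real audio.
bits = [0, 1, 1, 0, 1, 0, 0, 1]
rng = random.Random(7)
host = [rng.gauss(0.0, 0.2) for _ in range(len(bits) * SEGMENT)]
print(extract(embed(host, bits), len(bits)))
```

Since the echo is derived from the host itself, an all-zero (silent) host comes out of `embed` unchanged: there is nothing to echo, and therefore nothing to carry the data.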
Because echo-modulation techniques use echoes of the original content, the audibility and decodability of the watermark depend on the original content. For example, echo-modulation techniques cannot watermark silence (since silence has no echoes). Since artificial echoes are different from random noise, the watermarks don’t sound like a noisy hum. In fact, echo-modulation watermarks are practically inaudible in audio in which we usually experience many echoes (e.g. a football match). On the other hand, they may be more audible in situations where we’re not accustomed to echoes (e.g. a single-instrument classical music piece).
Echo-modulation techniques are far less sensitive to frequency and timing shifts than spread-spectrum watermarking is. Echo-modulation watermarks are very robust to the Doppler effect and can easily be picked up from a moving phone.
Two design choices matter here. First, watermark audibility is determined by watermarking amplitude, i.e. the strength of the watermark that’s introduced. If the watermarking amplitude is very low, the watermark will be practically inaudible, but very little data can be sent. Conversely, more data can be sent over a stronger watermark, but it’ll be more audible. Second, a certain amount of data is used for error correction, which ensures that even a partially received watermark can still be successfully extracted. Error correction compensates for noise that hinders the retrieval of artificial echoes, ensuring that watermarking remains robust even in heavily distorted or noisy audio. Error correction comes at the cost of useful data rate, though.
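The cost of error correction can be illustrated with the simplest possible code, a three-fold repetition code with majority voting. This is an assumption chosen for brevity; practical systems typically use considerably stronger codes, and nothing here reflects the coding used in any particular product.

```python
def encode(bits, repeat=3):
    """Repeat every data bit `repeat` times."""
    return [b for b in bits for _ in range(repeat)]

def decode(coded, repeat=3):
    """Majority vote over each group of `repeat` received bits."""
    return [
        1 if 2 * sum(coded[i:i + repeat]) > repeat else 0
        for i in range(0, len(coded), repeat)
    ]

data = [1, 0, 1, 1]
sent = encode(data)

received = list(sent)
received[0] ^= 1   # corrupt one bit in the first group
received[7] ^= 1   # and one bit in the third group

print(decode(received) == data)   # prints True
print(len(data) / len(sent))      # useful data rate: one third of the raw rate
```

Each group survives a single bit error, but only one in every three transmitted bits carries useful data.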
Watermark quality is thus a trade-off between data rate, robustness and audibility.
This trade-off is visualised in the figure above. A watermark that is designed purely for inaudibility will have a low data rate and won’t be very robust. A watermark designed for a very high data rate will likely suffer in robustness and will be more audible. This illustrates that no perfect watermark exists. Rather, we may want to choose different watermark configurations for different situations. At Intrasonics, we call these configurations schemes. Schemes are designed to select a trade-off that fits a certain scenario. Intrasonics offers a range of standard schemes that are suitable for common scenarios.
Techniques for data hiding (Bender et al., 1996)
Secure Spread Spectrum Watermarking for Multimedia (Cox et al., 1997)
Robust Audio Watermarking using Improved TS Echo Hiding (Erfani and Siahpoush, 2009)
Echo Hiding (Gruhl et al., 1996)