A Perceptually Meaningful Audio Visualizer
Audioscope: What you see is what you hear.
I pay a lot of attention to details in sound. I wanted to be able to see these details, and also point them out while describing sounds to people. Unfortunately, most audio visualizers don’t reveal these details.
So I created Audioscope, and made a video and soundtrack to demonstrate how some of these fine sonic details are made visible and obvious:
How it Works
tl;dr it turns sine waves into circles
Technically: The y axis is the raw audio signal, and the x axis is the signal filtered such that every frequency is phase shifted by 90˚.
Here’s the visual explanation:
Sound Is Made of Sine Waves
We can decompose signals/sound waves into sine waves/pure frequency components. These components have an amplitude and phase.
By summing them together, we can get the original signal back.
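As a rough illustration (not Audioscope's code), here is a minimal sketch of that decomposition using a naive discrete Fourier transform. The test signal, its length `N`, and the helper names `dft` and `resynthesize` are made up for the example.

```python
import cmath
import math

def dft(signal):
    """Naive O(N^2) discrete Fourier transform: one complex number per frequency bin."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def resynthesize(spectrum):
    """Sum the sinusoidal components back together (inverse DFT)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

# A made-up test signal: two sine waves with different amplitudes and phases.
N = 64
signal = [1.0 * math.sin(2 * math.pi * 3 * t / N)
          + 0.5 * math.sin(2 * math.pi * 7 * t / N + 0.8) for t in range(N)]

spectrum = dft(signal)
amplitudes = [2 * abs(c) / N for c in spectrum[:N // 2]]  # amplitude of each component
phases = [cmath.phase(c) for c in spectrum[:N // 2]]      # phase of each component
rebuilt = resynthesize(spectrum)                          # matches `signal` up to rounding
```

Only bins 3 and 7 end up with a non-negligible amplitude, and summing the components reproduces the original samples.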
Sine Waves Are Made from Circles
We can get a sine wave by tracing out a circle and plotting the y axis.
We can do the same using the x axis.
These two waves are the same, except they are 90˚ out of phase.
If we put a sine wave on the y axis and combine it with a 90˚ phase-shifted version on the x axis, we trace out a circle.
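A quick way to check this numerically, as a sketch with a hypothetical helper `circle_point`: put the sine on y, the same wave shifted by 90˚ on x, and confirm that every point sits at the same distance from the center.

```python
import math

def circle_point(phase, amplitude=1.0):
    """Scope position for a pure sine: y is the sine, x is the same wave shifted by 90 degrees."""
    y = amplitude * math.sin(phase)
    x = amplitude * math.sin(phase + math.pi / 2)  # equals amplitude * cos(phase)
    return x, y

# One full cycle of a sine wave with an arbitrary amplitude of 0.7.
points = [circle_point(2 * math.pi * i / 100, amplitude=0.7) for i in range(100)]
radii = [math.hypot(x, y) for x, y in points]  # distance of each point from the center
```

Every entry of `radii` equals the amplitude, which is exactly what makes the trace a circle.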
If we make phase/time a dimension on the z axis, we trace out a helix.
Helices turn out to be a mathematically simple way of working with signals. In my opinion, they’re also a more natural way of interpreting audio signals. Converting a pure sine wave to a helix requires adding an imaginary component to the signal, but the resulting helix is more representative of the purity of the sound, since its radius/magnitude is constant.
But let’s keep time in the time dimension and keep the visuals two dimensional.
Turning Waves into Circles
We can decompose signals into component sine waves, and we can convert sine waves into circles. So we can convert every component sine wave of a signal into a circle and represent the signal as a sum of circles. The y axis is the original signal, and the x axis is the signal with every component sine wave phase shifted by 90˚.
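Here is a minimal offline sketch of that idea, assuming a naive DFT over a whole buffer rather than a real-time filter. It keeps only the positive-frequency components (doubling them so the real part still equals the input) and also drops DC and Nyquist. The function name `analytic_signal` and the choice of which sign the 90˚ shift takes are assumptions made for the example.

```python
import cmath
import math

def analytic_signal(signal):
    """Complex signal whose real part is the input and whose imaginary part is the
    input with every sine component shifted by 90 degrees (DC and Nyquist dropped)."""
    n = len(signal)
    spectrum = [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    for k in range(n):
        if 0 < k < n / 2:
            spectrum[k] *= 2   # positive frequencies: keep, doubled
        else:
            spectrum[k] = 0    # DC, Nyquist, and negative frequencies: remove
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

N = 64
pure = [0.5 * math.sin(2 * math.pi * 4 * t / N) for t in range(N)]
mixed = [math.sin(2 * math.pi * 2 * t / N) + 0.3 * math.sin(2 * math.pi * 5 * t / N)
         for t in range(N)]

pure_trace = analytic_signal(pure)
mixed_trace = analytic_signal(mixed)
# Scope coordinates: y is the original signal, x is the phase-shifted copy.
points = [(z.imag, z.real) for z in mixed_trace]
```

For the pure sine, every point of `pure_trace` has magnitude 0.5, so it draws a circle. For the two-tone signal, the real part still equals the input, but the magnitude varies and the shape is no longer a circle.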
This results in a visualization of the signal that is one part real and one part imaginary, but also perceptually meaningful:
- Loud sounds have large shapes, and quiet sounds have small shapes. Near silence is a dot in the middle, and pure silence is a plain black screen.
- A pure sine wave is just a circle, where the radius corresponds to amplitude.
- Purer sounds are very round because they’re made of very few sine waves.
- Brighter sounds end up looking spiky because they have many frequency components, and also because digital sound has limited resolution (it is “pixelated”).
- Percussive/transient sounds flash on the screen because these signals are very short.
- Sustained tones create sustained shapes because tones are periodic signals that have repeating parts that have the same shape, and these shapes keep getting traced out over and over again.
- Multiple tones in perfect harmony also have sustained shapes because perfect harmony means the frequencies are integer ratios of each other. In other words, the combination of these periodic signals is also a periodic signal.
- Multiple tones in imperfect harmony have shifting/vibrating shapes. The frequencies aren’t exact integer ratios of each other, so the tones interfere and beat, the combined signal isn’t periodic, and the same shape never quite gets repeated. Most music uses imperfect harmony, so whenever there are multiple tones it’s probably going to look messy. Sorry, this deserves a dedicated post.
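The last two points can be checked numerically. The sketch below compares a fifth with an exact 3:2 frequency ratio against an equal-tempered fifth with ratio 2^(7/12), and measures how much the combined signal changes after one common period. The base frequency, the choice of interval, and the helper names are assumptions picked for the example.

```python
import math

def two_tone(t, f1, f2):
    """Sum of two unit-amplitude sine waves at time t (in seconds)."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

def repeat_error(f1, f2, period, steps=200):
    """Largest change in the signal between time t and time t + period."""
    return max(abs(two_tone(t + period, f1, f2) - two_tone(t, f1, f2))
               for t in (period * i / steps for i in range(steps)))

BASE = 200.0          # hypothetical lower tone, in Hz
PERIOD = 1 / 100.0    # 200 Hz and 300 Hz both complete whole cycles every 1/100 s

just_fifth = repeat_error(BASE, BASE * 3 / 2, PERIOD)               # exact 3:2 ratio
tempered_fifth = repeat_error(BASE, BASE * 2 ** (7 / 12), PERIOD)   # about 299.66 Hz
```

`just_fifth` is zero up to rounding, so the shape retraces itself exactly. `tempered_fifth` is small but nonzero, so each pass lands slightly off the previous one and the shape slowly drifts.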
Thickness, Hue, and Saturation
The beam of the Audioscope visualizer has variable thickness and color. These things are more subtle and unpredictable, but if you’re curious and ok with more math, read on.
- Thickness: inversely proportional to speed.
- Hue: instantaneous pitch, derived from angular velocity.
- Saturation: inversely proportional to amount of noise.
The thickness decreases as the beam moves faster. This causes high-frequency sounds or loud sounds to appear thinner.
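A minimal sketch of an inverse-speed thickness rule follows. The `gain` and `max_thickness` parameters, and the clamp itself, are assumptions for the example, since the actual constants are not given here.

```python
import math

SAMPLE_RATE = 48000.0  # assumed sample rate

def beam_thickness(p0, p1, sample_rate, gain=1.0, max_thickness=10.0):
    """Thickness inversely proportional to the beam speed between two consecutive points."""
    speed = math.hypot(p1[0] - p0[0], p1[1] - p0[1]) * sample_rate  # distance per second
    if speed == 0.0:
        return max_thickness          # a stationary beam gets the maximum thickness
    return min(max_thickness, gain / speed)

def circle_sample(freq_hz, amplitude, n):
    """n-th scope point of a pure sine wave with the given frequency and amplitude."""
    angle = 2 * math.pi * freq_hz * n / SAMPLE_RATE
    return amplitude * math.cos(angle), amplitude * math.sin(angle)

thick_100 = beam_thickness(circle_sample(100, 1.0, 0), circle_sample(100, 1.0, 1), SAMPLE_RATE)
thick_200 = beam_thickness(circle_sample(200, 1.0, 0), circle_sample(200, 1.0, 1), SAMPLE_RATE)
thick_loud = beam_thickness(circle_sample(100, 2.0, 0), circle_sample(100, 2.0, 1), SAMPLE_RATE)
```

Doubling the frequency or doubling the amplitude both roughly double the speed of the beam around the circle, so both roughly halve the thickness.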
The color is a lot more complicated. Using the HSV color space:
- Hue relates to pitch (more technically, pitch class). Pitch is circular, and hue is circular, so this is a natural mapping to make.
- Saturation corresponds to amount of noise, where: more noise → more white, less noise → purer colors.
- Value is maxed out because I want only the brightest colors.
At a high level: the hue of the color roughly corresponds to the pitch of the locally largest frequency component. If we’re dealing with pure sine waves, it directly corresponds to the pitch of the sine wave. This means that, if a 440Hz (A4) sine wave is red, 220Hz (A3) and 880Hz (A5) are also red. A sine wave going from 440Hz to 880Hz would start at red, cycle through every color of the rainbow, and end up at red.
pitch ≈ log_2(frequency)
pitch class ≈ pitch mod 1
Technically: At a given point in the beam, we have the angular velocity ω, which is how fast the beam is turning at that point (this is distinct from instantaneous frequency). For a pure sine wave, ω corresponds to frequency: if the beam turns twice as fast, the frequency doubles. Interpreting ω as frequency, we can use the above formula to convert it to something corresponding to pitch (class), and use the result as the hue of the color at this point.
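Here is a hedged sketch of that mapping. It assumes ω is measured as the angle between consecutive steps of the beam, and it pins 440Hz to hue 0 (red) to match the hypothetical example above. The names `turning_rate` and `hue_from_frequency` and the 48kHz sample rate are made up for the example.

```python
import cmath
import colorsys
import math

def turning_rate(z_prev, z_curr, z_next):
    """Radians per sample by which the beam's direction of travel rotates at z_curr."""
    step_in = z_curr - z_prev
    step_out = z_next - z_curr
    return cmath.phase(step_out * step_in.conjugate())

def hue_from_frequency(freq_hz, reference_hz=440.0):
    """Pitch class in [0, 1): log2 of the frequency, wrapped once per octave.
    Assumes freq_hz is positive."""
    pitch = math.log2(freq_hz / reference_hz)
    return pitch % 1.0

SAMPLE_RATE = 48000.0
omega_true = 2 * math.pi * 440.0 / SAMPLE_RATE
# Three consecutive beam positions for a pure 440 Hz sine wave (points on a circle).
beam = [cmath.exp(1j * omega_true * n) for n in range(3)]

omega = turning_rate(*beam)
freq_estimate = omega * SAMPLE_RATE / (2 * math.pi)   # recovers 440 Hz up to rounding
hue = hue_from_frequency(freq_estimate)
rgb = colorsys.hsv_to_rgb(hue, 1.0, 1.0)              # full saturation and value
```

With this mapping, `hue_from_frequency` returns the same hue for 220Hz, 440Hz, and 880Hz, and a frequency half an octave above 440Hz lands on the opposite side of the color wheel.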
Even more technically: For small values of ω, the effects of noise are much more prominent, so there’s actually a filtering step at the end that basically gets the average hue and amount of noise. However, this type of noise isn’t directly related to noise in the signal. It is related to the amount of noise in the angular velocity over time. Well, it should be related, but the current formula needs improvement.
After all this, the colors only have apparent meaning in exceptional cases (pure frequencies). But it does make for nice rainbows that entirely depend on the sound.
Filter Design and Implementation
I’m able to describe the concept of phase shifting every frequency by 90˚ while avoiding heavy mathematics. But actually creating the filter that does this for arbitrary signals requires domain knowledge. This is for those who are familiar with digital signal processing.
I created a generator for an FIR filter that removes all negative frequencies, as well as DC and Nyquist. I could have just used the plain Hilbert transform, but I wanted the magnitude of the real part to roll off in approximately the same way as the imaginary part, both in the lower transition band and in the transition band near Nyquist. That way the results stay as circular as possible (as opposed to becoming vertically oriented ellipses). Low frequencies are very important in electronic music.
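One way to get a filter with these properties is the classic windowed ideal-response design, sketched below. This is an assumption about the general approach, not the actual generator: the band edges, the tap count, and the names `analytic_bandpass_taps` and `response` are all hypothetical. The taps are the inverse transform of a response that equals 2 on the positive band between the two edges and 0 everywhere else, multiplied by a Hamming window.

```python
import cmath
import math

def analytic_bandpass_taps(half_length, low_edge, high_edge):
    """Complex FIR taps (odd length) that pass only positive frequencies between
    low_edge and high_edge, both given in radians per sample."""
    taps = []
    for n in range(-half_length, half_length + 1):
        if n == 0:
            ideal = complex((high_edge - low_edge) / math.pi, 0.0)
        else:
            ideal = (cmath.exp(1j * high_edge * n)
                     - cmath.exp(1j * low_edge * n)) / (1j * math.pi * n)
        window = 0.54 + 0.46 * math.cos(math.pi * n / half_length)  # Hamming, centered on n = 0
        taps.append(ideal * window)
    return taps

def response(taps, omega):
    """Frequency response of the zero-centered taps at omega radians per sample."""
    half_length = len(taps) // 2
    return sum(h * cmath.exp(-1j * omega * (i - half_length)) for i, h in enumerate(taps))

taps = analytic_bandpass_taps(63, 0.1 * math.pi, 0.9 * math.pi)
passband = abs(response(taps, math.pi / 2))    # close to 2
negative = abs(response(taps, -math.pi / 2))   # close to 0
dc = abs(response(taps, 0.0))                  # close to 0
nyquist = abs(response(taps, math.pi))         # close to 0
```

The real part of the taps is a symmetric bandpass and the imaginary part is an antisymmetric, band-limited Hilbert transformer. Because both come from the same band edges and the same window, their magnitudes roll off together in the transition bands. The taps are indexed around zero here, so a causal implementation would delay them by `half_length` samples.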
Rust is still relatively new, and it seems no one has implemented an efficient FFT-based convolution yet, so I just implemented overlap-save on the spot and made the filter length as large as possible (and also odd) for the given FFT size. I generated the impulse response for a bandpass filter whose real part removes DC and Nyquist and whose imaginary part is the Hilbert transform, and windowed it with a Hamming window.
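For readers who have not seen overlap-save, here is a minimal Python sketch of the technique (the real implementation is in Rust and is not shown here). The recursive radix-2 `fft`, the block handling, and the test sizes are assumptions chosen to keep the example self-contained.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def ifft(spectrum):
    """Inverse FFT via the conjugate trick."""
    n = len(spectrum)
    return [v.conjugate() / n for v in fft([complex(v).conjugate() for v in spectrum])]

def overlap_save(x, h, n_fft):
    """Filter x with FIR taps h using FFT blocks of size n_fft (power of two, >= len(h))."""
    taps = len(h)
    hop = n_fft - taps + 1                        # new output samples per block
    h_spectrum = fft(list(h) + [0] * (n_fft - taps))
    padded = [0] * (taps - 1) + list(x)           # history for the first block
    out = []
    pos = 0
    while pos < len(x):
        block = padded[pos:pos + n_fft]
        block += [0] * (n_fft - len(block))       # zero-pad the final block
        y = ifft([a * b for a, b in zip(fft(block), h_spectrum)])
        out.extend(y[taps - 1:])                  # first taps-1 samples are wrapped, discard
        pos += hop
    return out[:len(x)]

def direct_convolution(x, h):
    """Reference O(N*L) convolution, truncated to len(x) samples."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0) for n in range(len(x))]

x = [math.sin(0.3 * n) + 0.1 * n for n in range(50)]
h = [0.1, 0.2, 0.4, 0.2, 0.1]
fast = overlap_save(x, h, 16)
slow = direct_convolution(x, h)
```

Each block reuses the last `len(h) - 1` input samples of the previous block, which is why a longer filter leaves fewer new samples per FFT and why the filter length is tied to the FFT size.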
I could have used a pair of IIR filters, which would use less memory and have a better magnitude response, but I saw the group delays for the lows and felt they were unacceptably long for an application that needs to be as responsive as possible. Also, I wasn’t okay with the idea of non-linear phase, which I imagine would ruin the integrity of the waveform.
As for the filters for getting the hue and saturation: I just implemented my own biquad lowpass (as in, I copypasted the formula). As I mentioned, I think there’s room for improvement. Currently, I take the angular velocity, take the logarithm of it, and then filter it. My reasoning was that taking the log would amplify the noise, so the filter would remove it more strongly. But isn’t there some invariance in that ordering? idk, I didn’t want to think too much about math tbh, and I was also too rushed to really take a good look at the waveform and spectrum of the angular velocity, BUT IT WOULD BE NICE.
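For reference, here is a sketch of a biquad lowpass using the widely copied Audio EQ Cookbook coefficients. Whether this is the exact formula used in Audioscope is an assumption, and the cutoff, Q, and test input are made up. The second half compares the two orderings on a toy angular velocity that alternates between two values.

```python
import math

class BiquadLowpass:
    """Second-order lowpass with Audio EQ Cookbook coefficients (direct form I)."""

    def __init__(self, cutoff_hz, sample_rate, q=math.sqrt(0.5)):
        w0 = 2 * math.pi * cutoff_hz / sample_rate
        alpha = math.sin(w0) / (2 * q)
        cos_w0 = math.cos(w0)
        a0 = 1 + alpha
        self.b0 = (1 - cos_w0) / 2 / a0
        self.b1 = (1 - cos_w0) / a0
        self.b2 = self.b0
        self.a1 = -2 * cos_w0 / a0
        self.a2 = (1 - alpha) / a0
        self.x1 = self.x2 = self.y1 = self.y2 = 0.0

    def process(self, x):
        """Filter one sample and return the smoothed value."""
        y = (self.b0 * x + self.b1 * self.x1 + self.b2 * self.x2
             - self.a1 * self.y1 - self.a2 * self.y2)
        self.x2, self.x1 = self.x1, x
        self.y2, self.y1 = self.y1, y
        return y

# Toy "noisy" angular velocity: alternates between two values every sample.
omegas = [0.01 if n % 2 == 0 else 0.04 for n in range(5000)]

log_then_filter = BiquadLowpass(100.0, 48000.0)
filter_then_log = BiquadLowpass(100.0, 48000.0)
for w in omegas:
    a = log_then_filter.process(math.log2(w))   # the ordering described above
    b = math.log2(filter_then_log.process(w))   # the alternative ordering
```

The two orderings are not equivalent, because the lowpass is linear and the logarithm is not. Here `a` settles at log2(0.02), the average of the logs, while `b` settles at log2(0.025), the log of the average. Which one is preferable for the hue is a separate question.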