For real-time analysis we will do our best to detect beats within, or as close as possible to, the audio that is currently playing in the scene. We'll find some limitations here, but we'll come away with a solution that is usable for many use cases. It's also a good introduction to the concepts needed to perform the preprocessing analysis.
For playing back audio in Unity, we will always be using an AudioSource to play a file which is represented as an AudioClip. Once we’ve imported our audio files as AudioClips, we should set the load type to “Decompress on Load” to ensure that we have access to the audio sample data at runtime.
Attach an AudioSource component to a GameObject in your scene, drag your AudioClip into your AudioSource, make sure “Play on Awake” is selected, and press Play. You should now hear your audio file playing within Unity.
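If you prefer to wire this up in code instead of the Inspector, a minimal sketch might look like this (the class and field names are mine; this assumes the clip's load type was already set to "Decompress on Load" in the importer):

```csharp
using UnityEngine;

// Minimal playback sketch: attach to a GameObject, assign a clip in the
// Inspector, and the clip starts playing when the scene runs.
[RequireComponent(typeof(AudioSource))]
public class SongPlayer : MonoBehaviour
{
    public AudioClip song; // assign your imported AudioClip in the Inspector

    void Start()
    {
        AudioSource source = GetComponent<AudioSource>();
        source.clip = song;
        source.Play(); // equivalent to checking "Play on Awake"
    }
}
```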
Once we have audio playing in Unity, the Unity API provides some very handy helpers for getting access to information about the audio that is currently playing. This makes our real-time analysis of the audio samples themselves very straightforward.
We have access to two important helpers:
As you can see (at the time of this writing), the documentation for each helper isn’t very descriptive. I’ll try to add some clarity as we go along.
GetOutputData is going to give us an array representing amplitude, or loudness, over time for a particular channel. We will call this simply “sample data”. While sample data becomes more useful for preprocessing, we don’t actually need it for real-time beat detection due to the next helper: GetSpectrumData.
GetSpectrumData is going to give us an array representing relative amplitude, which I’ll refer to as significance, over the frequency domain for a time sample on a particular channel. We will call this “spectrum data”. This is extremely useful, because this can give us a clear indication that not only did something significant happen at this point in time, but it can also tell us which frequency ranges had significant action. That can help us make intelligent decisions about what is happening in different frequency ranges, which we can roughly translate to different instruments, within the track.
GetSpectrumData, under the hood, performs a Fast Fourier Transform (FFT) to convert amplitude over time into the frequency domain, returning just the relative amplitude portion of the complex data produced by the FFT. While the FFT is computed over the entire frequency range from 0Hz to the sampling rate (let's use 48kHz for this example), the second half of that range (24kHz-48kHz) is a mirror of the first half (0Hz-24kHz). This midpoint, half the sampling rate, is called the Nyquist frequency. For this reason, GetSpectrumData and some other FFT-based helpers only return relative amplitude for the first half of the analysis. That means the number of audio samples required to perform the FFT is double the number of frequency bins returned by GetSpectrumData. So, the higher the frequency granularity you want to analyze, the larger the time frame required to generate it. If you want 1024 frequency bins, giving you a granularity of about 23.44Hz per bin, it will require 2048 audio samples. Requiring more audio samples means it takes longer to collect them, which adds ambiguity about exactly where in the time domain each frequency value was detected. It's a trade-off that you can experiment with to see what works best for you. I found that spectrum array sizes of 512 and 1024 were both more than adequate for basic onset detection.
I strongly suggest you watch the following video on the Fourier Transform to get a better understanding of what Unity is doing for us:
For the parameters of GetSpectrumData:
- Channel 0 is a safe choice here, because if we have audio in stereo we would need to deal with 2 channels worth of sample data individually. Channel 0 contains the average of the stereo samples, combining every 2 stereo samples into 1 mono sample. This allows us to make simpler decisions about what’s happening in the audio.
- We can choose from a number of FFT Windows to window and scale our spectrum data. I’ve found that BlackmanHarris does an excellent job of showing us distinct action per frequency bin without much leakage between bins.
- The array we provide will be populated, to the length of the array, with the spectrum data of the most recently played audio samples. This means we don't have to call GetOutputData before GetSpectrumData. The array length must be a power of 2: the FFT is most performant on power-of-2 sample sizes, and Unity enforces that requirement in GetSpectrumData.
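Put together, a call with these parameters might look like this sketch (the array size of 1024 is simply the value used throughout this article):

```csharp
using UnityEngine;

[RequireComponent(typeof(AudioSource))]
public class SpectrumReader : MonoBehaviour
{
    AudioSource audioSource;
    float[] spectrum = new float[1024]; // must be a power of 2

    void Start()
    {
        audioSource = GetComponent<AudioSource>();
    }

    void Update()
    {
        // Channel 0, Blackman-Harris window: fills `spectrum` with the
        // relative amplitude of the most recently played audio samples.
        audioSource.GetSpectrumData(spectrum, 0, FFTWindow.BlackmanHarris);
    }
}
```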
I’ve found that a spectrum array size of 1024 gives us nice granularity. Again, this means Unity will be taking 2048 audio samples under the hood. If we know the audio sample rate, we can find the supported frequency range of our spectrum data, and then, using our array size, we can quickly find out what frequency range each index of our array represents.
Working with Frequencies
Unity can tell us the audio sample rate in hertz (Hz) of our mixer with the static member AudioSettings.outputSampleRate. This is the rate at which Unity is playing the audio and will typically be either 48000 or 44100. We can also get the sampling rate of an individual AudioClip with AudioClip.frequency.
Knowing our sample rate, we can then know the maximum supported frequency of our FFT, which will be half the sampling rate. At that point, we can divide by our spectrum length to know which frequencies each bin (index) represents.
48000 / 2 = 24000Hz at the top of our supported range. We typically only care about 20Hz-20000Hz for audio, but the extra range at the top doesn't cause any problems.
24000 / 1024 = ~23.44Hz per bin. Obviously we don't have enough granularity to represent every individual frequency, but it should be good enough for most use cases. At this granularity our 10th bin would give us the relative amplitude for ~234Hz, plus or minus a small window of neighboring frequencies.
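The arithmetic above is easy to wrap in a couple of helpers (a sketch; the names are mine, and at runtime the sample rate would come from AudioSettings.outputSampleRate):

```csharp
using System;

public static class FrequencyBins
{
    // Width of each FFT bin in Hz: the supported range (0Hz to the Nyquist
    // frequency, i.e. half the sample rate) divided across the spectrum array.
    public static float BinWidthHz(int sampleRate, int spectrumSize)
    {
        return (sampleRate / 2f) / spectrumSize;
    }

    // Approximate frequency represented by a given bin index.
    public static float BinToFrequencyHz(int binIndex, int sampleRate, int spectrumSize)
    {
        return binIndex * BinWidthHz(sampleRate, spectrumSize);
    }

    // Inverse: which bin index covers a given frequency.
    public static int FrequencyToBin(float frequencyHz, int sampleRate, int spectrumSize)
    {
        return (int)(frequencyHz / BinWidthHz(sampleRate, spectrumSize));
    }
}
```

With a 48kHz sample rate and a 1024-length spectrum, `BinWidthHz` gives ~23.44Hz and bin 10 lands at ~234Hz, matching the worked example above.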
Using the pre-defined standard for describing frequency ranges, we can start to make decisions based on which frequencies have action at a point in time in our track. You can also go a level deeper and attempt to detect different notes based on frequency.
To make sure this is working correctly and that our math is right, I downloaded the audio from a headphone test, which simply sweeps the frequency spectrum from 10Hz to 20kHz. The track hits 234Hz at about 2:08. So, if I load this track into my AudioSource and call GetSpectrumData at 128 seconds, I should expect to see action around bin 10.
And we do!
You can see here that there is some leakage between bands, but we are still granular enough that we can see that there is significant action in the 234Hz range at this time, just as we expected.
Here is the code to do such a test yourself:
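A sketch of such a check might look like this (class and field names are illustrative; the 128-second mark is where the sweep described above passes 234Hz):

```csharp
using UnityEngine;

// Seeks the sweep track to ~128 seconds, then logs the spectrum bin with the
// most action so we can verify our bin math by eye in the Console.
[RequireComponent(typeof(AudioSource))]
public class SweepTest : MonoBehaviour
{
    AudioSource audioSource;
    float[] spectrum = new float[1024];

    void Start()
    {
        audioSource = GetComponent<AudioSource>();
        audioSource.time = 128f; // the sweep hits 234Hz at about 2:08
        audioSource.Play();
    }

    void Update()
    {
        audioSource.GetSpectrumData(spectrum, 0, FFTWindow.BlackmanHarris);

        int loudestBin = 0;
        for (int i = 1; i < spectrum.Length; i++)
        {
            if (spectrum[i] > spectrum[loudestBin]) loudestBin = i;
        }
        // Expect this to hover around bin 10 (~234Hz) right after seeking.
        Debug.Log("Loudest bin: " + loudestBin);
    }
}
```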
Alright, now we can know the frequency distribution at a point in time in our audio track. This opens a lot of doors. Most importantly for now, it means we have the data necessary to perform Onset Detection using Spectral Flux.
Onset Detection using Spectral Flux
If you haven’t already, I really do encourage you to read Mario’s article linked above. In Part 1, Mario defines a lot of the terminology we need to continue. We will jump in around Part 6 to plug our real-time spectrum data into the spectral flux algorithm. Mario does an excellent job explaining what spectral flux is and how we can use it to detect beat onsets, so I won’t attempt to reword it here. I will mostly be translating Mario’s preprocessed Java solution into a real-time Unity C# solution.
Spectral Flux is all about finding the aggregate difference per bin between spectrum data at two close points in time. For us, that really means comparing the spectrum data for the audio that is playing during the current frame to the data from the last frame, or the last time we checked. You can see here that we’re already exposing ourselves to error with real-time analysis, as we are limited by our progress into the song and the framerate itself.
Each time we call GetSpectrumData, we should keep the most recent spectrum data for comparison. If we are updating our spectrum data each frame, then that is pretty simple.
Great, now we are keeping enough history to do our comparison. We want to know the difference, per frequency bin, between the most recent spectrum data and the current spectrum data. To clean the data up a bit, we will only keep positive differences, so that hopefully we can see if we are on our way up an onset in the spectrum as a whole. We will call this the rectified spectral flux. Remember that we are analyzing the entire spectrum which contains all supported frequencies. If you want to run the same algorithm over a subset of frequencies, just the sub-bass and bass range for example, it’s as simple as specifying which indexes of the 1024 length spectrum you want to process the rectified spectral flux for. Nothing else about the algorithm needs to change. Now that you know how to determine which index corresponds to which frequency bin, you should be able to fit this to your exact use case.
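The rectified spectral flux calculation can be sketched as a single pass over the two spectrum arrays (the names are mine; the optional bin range is how you would restrict the analysis to, say, just the sub-bass and bass bins):

```csharp
using System;

public static class SpectralFlux
{
    // Rectified spectral flux: the sum of only the positive per-bin
    // differences between the previous spectrum and the current one.
    // Bins that got quieter are ignored, so a rising onset stands out.
    public static float Rectified(float[] previous, float[] current, int startBin = 0, int endBin = -1)
    {
        if (endBin < 0) endBin = current.Length;
        float sum = 0f;
        for (int i = startBin; i < endBin; i++)
        {
            float diff = current[i] - previous[i];
            if (diff > 0f) sum += diff;
        }
        return sum;
    }
}
```

Restricting to a sub-range is just a matter of passing the bin indexes you computed from the frequency math earlier; nothing else changes.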
At Giant Scam we like to run this algorithm over multiple frequency ranges simultaneously to remove noise and home in on the many different things that could be happening in a song at a point in time. Then we can make better decisions about gameplay by knowing which “instruments” have how much action going on at different times in the audio track.
Here, for simplicity, we analyze the entire frequency spectrum at once.
Next, we want to be able to generate a threshold based on our spectral flux values calculated for a time frame surrounding (both in the past and in the future) a particular spectral flux value. This will allow us to determine if the change in the spectrum was significant enough that we consider it a beat. We average the frame of spectral flux values and multiply it by our sensitivity multiplier. If the spectral flux value we are processing is higher than our raised average, we have an onset!
This gets tricky when we are talking about real-time, because we can’t really look at spectral flux in the future. You can do some tricky things with Unity like sending the audio to a muted Mixer and playing it ahead of what the user hears, but we aren’t going to implement that here.
What we’ll do instead is just do onset detection a number of spectral flux values (our frame sample size / 2) in the past. So if our frame size is 30, once we have 30 spectral flux values we can accurately process value 15 by averaging values 1–30, multiplying the average by some sensitivity multiplier, and checking to see if value 15 is higher. So we’ll start processing value 15 and move forward from there to value 16, which will be compared to the average of values 2–31, and so on. Being ~15 spectral flux values behind real-time can leave us roughly a half second behind the currently playing audio depending on framerate.
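The windowed average described above can be sketched like this (names are mine; windowSize and multiplier correspond to the frame sample size and sensitivity multiplier discussed above):

```csharp
using System;
using System.Collections.Generic;

public static class OnsetThreshold
{
    // Averages the flux values in a window centered on `index` (clamped at
    // the edges of the history collected so far), then raises that average
    // by a sensitivity multiplier. A flux value above this result is an onset.
    public static float At(List<float> fluxSamples, int index, int windowSize, float multiplier)
    {
        int start = Math.Max(0, index - windowSize / 2);
        int end = Math.Min(fluxSamples.Count - 1, index + windowSize / 2);

        float sum = 0f;
        for (int i = start; i <= end; i++) sum += fluxSamples[i];

        return multiplier * (sum / (end - start + 1));
    }
}
```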
The remaining functions are very straightforward.
First, we only really care about the portion of the flux that is above the threshold. We will call that the pruned spectral flux. If the flux is below the threshold, we’ll say the pruned spectral flux is 0.
Finally, we can determine if a spectral flux sample is a peak. We do this by comparing the sample’s pruned spectral flux to its immediate neighbors. If it is higher than both the previously sampled pruned spectral flux, and the next, then it is a peak! This means, of course, we have to go one more sample behind the currently playing audio, so that the sample we are analyzing has neighbors that have already calculated their pruned spectral flux.
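Both remaining steps can be sketched as small helpers (names are mine; `pruned` is the running list of pruned spectral flux values):

```csharp
using System;
using System.Collections.Generic;

public static class PeakDetection
{
    // Pruned spectral flux: only the portion of the flux above the
    // threshold; 0 if the flux didn't clear the threshold.
    public static float Pruned(float flux, float threshold)
    {
        return Math.Max(0f, flux - threshold);
    }

    // A sample is a peak if its pruned flux is higher than both of its
    // immediate neighbors. Note that evaluating index `i` requires the
    // sample at `i + 1` to already have its pruned flux calculated, which
    // is why we trail one more sample behind the currently playing audio.
    public static bool IsPeak(List<float> pruned, int i)
    {
        return pruned[i] > pruned[i - 1] && pruned[i] > pruned[i + 1];
    }
}
```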
Bringing it all together isn’t too tricky. We need to continuously collect spectrum data, calculate the rectified spectral flux, and then take a look in the past (by half of our window size), to see if that sample was above the threshold that we can now generate, and then go one more sample into the past to see if that sample was a peak.
A sample script to do that, showing the index gymnastics described above, is here:
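Condensed into one class, the whole pipeline might look something like this sketch (class and member names are illustrative; thresholdWindowSize and thresholdMultiplier are the frame size and sensitivity multiplier discussed above):

```csharp
using System;
using System.Collections.Generic;

// Output of each step of the algorithm for one spectrum sample, grouped so
// it can be plotted or inspected later.
public class SpectralFluxSample
{
    public float time;
    public float spectralFlux;
    public float threshold;
    public float prunedSpectralFlux;
    public bool isPeak;
}

public class SpectralFluxAnalyzer
{
    public List<SpectralFluxSample> samples = new List<SpectralFluxSample>();

    int thresholdWindowSize = 30;     // flux samples averaged for the threshold
    float thresholdMultiplier = 1.5f; // sensitivity: higher = fewer peaks

    float[] previousSpectrum;

    // Call once per new spectrum (e.g. once per frame in real time).
    public void analyzeSpectrum(float[] spectrum, float time)
    {
        var sample = new SpectralFluxSample { time = time };

        // 1. Rectified spectral flux vs. the last spectrum we saw.
        if (previousSpectrum != null)
        {
            for (int i = 0; i < spectrum.Length; i++)
            {
                float diff = spectrum[i] - previousSpectrum[i];
                if (diff > 0f) sample.spectralFlux += diff;
            }
        }
        previousSpectrum = (float[])spectrum.Clone();
        samples.Add(sample);

        // 2. Once a full window is available, threshold and prune the
        //    sample sitting half a window in the past.
        int thresholdIndex = samples.Count - 1 - thresholdWindowSize / 2;
        if (thresholdIndex >= thresholdWindowSize / 2)
        {
            var s = samples[thresholdIndex];
            int start = thresholdIndex - thresholdWindowSize / 2;
            int end = Math.Min(samples.Count - 1, thresholdIndex + thresholdWindowSize / 2);
            float sum = 0f;
            for (int i = start; i <= end; i++) sum += samples[i].spectralFlux;
            s.threshold = thresholdMultiplier * (sum / (end - start + 1));
            s.prunedSpectralFlux = Math.Max(0f, s.spectralFlux - s.threshold);

            // 3. One more sample behind: a peak is higher than both neighbors.
            int peakIndex = thresholdIndex - 1;
            if (peakIndex > 0)
            {
                samples[peakIndex].isPeak =
                    samples[peakIndex].prunedSpectralFlux > samples[peakIndex - 1].prunedSpectralFlux &&
                    samples[peakIndex].prunedSpectralFlux > samples[peakIndex + 1].prunedSpectralFlux;
            }
        }
    }
}
```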
You might notice that I’m populating a class to group the output of each step of our algorithm for each spectral flux sample.
Grouping this information allows me to plot the output of the algorithm in real time. I find that visualizing the output of the algorithm makes the algorithm make more sense, and it also can give us clues on how sensitive we should make our threshold since we can see the peaks being detected as the track plays.
To start seeing some results, we can call the analyzeSpectrum function from our behavior that is playing the song:
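For example, the per-frame call from the playback behaviour might be sketched like this (assuming an analyzer class exposing the analyzeSpectrum method described above; names are illustrative):

```csharp
using UnityEngine;

[RequireComponent(typeof(AudioSource))]
public class RealTimeBeatDetector : MonoBehaviour
{
    AudioSource audioSource;
    SpectralFluxAnalyzer analyzer; // the analyzer class described above
    float[] spectrum = new float[1024];

    void Start()
    {
        audioSource = GetComponent<AudioSource>();
        analyzer = new SpectralFluxAnalyzer();
        audioSource.Play();
    }

    void Update()
    {
        // Grab the spectrum for the audio playing right now and feed it,
        // along with the current playback time, into the analyzer.
        audioSource.GetSpectrumData(spectrum, 0, FFTWindow.BlackmanHarris);
        analyzer.analyzeSpectrum(spectrum, audioSource.time);
    }
}
```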
I plugged my friend Terror Pigeon’s song “Chamber of Secrets for 1” into my AudioSource, and visualized the output of the algorithm over time. Here, you can see the heavy bass beat is coming through pretty clearly and consistently. If you notice that you’re over-recording, you can bump the threshold up a bit to remove some of the extra noise.
Each green point is a rectified spectral flux sample. You can see that we have green points all the way up to the green line, which is representing the currently playing time in the audio track. The blue points are the threshold at that point in time. As you can see, we can only generate the threshold a few samples behind real-time, due to the time frame of samples necessary to calculate the average. The red points are our peaks, which in this case are the evenly spaced bass beats at the beginning of the song. Since we are analyzing the entire frequency spectrum here, the bass beats aren’t always in the same spot on the y-axis of our plot (the rectified spectral flux value), because there are other things going on in the song, in frequency ranges outside of bass, that may be increasing or decreasing in intensity.
So what can you do now that you have the peaks? Well, the peaks are clear indicators of onsets, and onsets are clear indicators of beats. You are now technically tracking the most significant beats happening in the song. You can use this information to spawn items, affect the environment, visualize the audio track, etc… The catch is that you’re always going to be a few samples behind the real-time audio, and that isn’t optimal. You could look into the Unity hack with the Mixer that I mentioned earlier, or store the beats while playing the song for the first time and cache the results for later. If you’re like me and you don’t want either of those limitations, then we should find a way to preprocess an entire audio file up front, running all of the samples through our same spectral flux algorithm, so that we can detect all beats prior to gameplay — even if it’s the first time we’ve seen the audio track. It takes some work to get preprocessing audio functioning in Unity, but it’s certainly doable.
And that’s where we’ll start in the next part of the series:
Preprocessed Audio Analysis