If you haven’t read the previous post in the series, Real-time Audio Analysis Using the Unity API, please take the time to do so before reading on. It covers many of the core concepts needed to perform our preprocessing analysis.
When doing real-time analysis, we found that we had to lag slightly behind the currently playing audio in order to detect beats. We also could only use the beats detected up until the current time in the track to make decisions. We can eliminate these limitations by preprocessing the entire audio file up front and detecting all beats before playing audio to the user.
While most of Unity’s helpers are meant for real-time analysis, it does offer us one helper that can get us started. To process an entire audio file up front, we need all of that file’s sample data. Unity allows us to get that data with AudioClip.GetData. As you can see in the documentation example, it allows us to populate an array with all of the samples that the clip contains.
AudioClip.samples is the number of per-channel samples (sample frames) contained within the clip, not the total count across channels. That means if a clip held 100 interleaved stereo samples, AudioClip.samples would be 50. We multiply by AudioClip.channels because AudioClip.GetData is going to return sample data in interleaved format, meaning that if we have a clip that is in stereo, the data will come back as:
L = Left Channel ; R = Right Channel
[L, R, L, R, L, R,…]
That means that, if we want data similar to what Channel 0 returns in the real-time helper AudioSource.GetOutputData, we need to go through our full set of samples and average every two samples together to get mono samples. You’ll find that this can be very taxing on the CPU because of the number of samples in a given clip.
Let’s do the math:
If I have a 5 minute (300 second) audio track, with a sampling rate of 48000 samples per second, that means we have:
300 * 48000 = 14,400,000 samples (which can be found in AudioClip.samples)
BUT, if we are in stereo, we have interleaved samples, which doubles our sample count
14,400,000 * 2 = 28,800,000 samples
We really want those 14.4 million samples in mono, so we'll iterate over the 28.8 million samples and average every pair of stereo samples to get there. If you'd like to do this over a series of frames, using the offset parameter to get smaller chunks of the audio samples, you should know that the offset parameter is measured in per-channel samples (the same units as AudioClip.samples). So if you pass an offset of 10 on a stereo clip, you're still going to get interleaved stereo samples, but they will start 20 values (10 stereo pairs) into the interleaved data.
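To make the offset semantics concrete, here is a minimal sketch of reading a clip in chunks (assuming `clip` is an AudioClip reference; the chunk size is illustrative):

```csharp
// Read the clip in chunks instead of all at once.
// GetData's offset is measured in per-channel samples (AudioClip.samples),
// so a stereo read of 1024 frames needs a 2048-element buffer.
int framesPerChunk = 1024;
float[] chunk = new float[framesPerChunk * clip.channels];

for (int offsetFrames = 0; offsetFrames < clip.samples; offsetFrames += framesPerChunk) {
    clip.GetData(chunk, offsetFrames);
    // chunk now holds interleaved samples starting at per-channel frame offsetFrames
}
```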
No matter what you’re doing, iterating over 28.8 million samples is not going to be fast enough to do on the main thread in Unity. I recommend passing this task to a background thread using the System.Threading library, which ships with Unity’s C#. We won’t go deeply into System.Threading here, but you should know that Unity strongly suggests (often in the form of throwing an exception) that you do not access any Unity API functionality from within a background thread. Grab any values from AudioSource / AudioClip that you need, make them accessible, do the math inside of the thread, and leave all access of the Unity API to the main thread.
Let’s get set up to convert our stereo samples to mono. First, we need to grab any attributes from the Unity API that we might need later when we spawn the background thread:
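A minimal sketch of that setup might look like the following (field and method names here are illustrative, not prescribed by Unity):

```csharp
// Cache everything we need from the Unity API on the main thread,
// because AudioSource / AudioClip must not be touched from a background thread.
float[] multiChannelSamples;
int numChannels;
int numTotalSamples;
int sampleRate;
float clipLength;

void Start() {
    AudioSource audioSource = GetComponent<AudioSource>();

    numChannels = audioSource.clip.channels;
    numTotalSamples = audioSource.clip.samples;
    sampleRate = audioSource.clip.frequency;
    clipLength = audioSource.clip.length;

    // One float per channel per frame, interleaved
    multiChannelSamples = new float[numTotalSamples * numChannels];
    audioSource.clip.GetData(multiChannelSamples, 0);

    // Hand the heavy lifting off to a background thread
    System.Threading.Thread bgThread = new System.Threading.Thread(processAudio);
    bgThread.Start();
}
```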
Now the loop for combining channels:
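A sketch of that loop, assuming `multiChannelSamples` (the interleaved data from AudioClip.GetData), `numChannels`, and `numTotalSamples` have already been cached from the clip on the main thread:

```csharp
// Average each interleaved frame down to a single mono sample.
void processAudio() {
    float[] preProcessedSamples = new float[numTotalSamples];

    int numProcessed = 0;
    float combinedChannelAverage = 0f;
    for (int i = 0; i < multiChannelSamples.Length; i++) {
        combinedChannelAverage += multiChannelSamples[i];

        // Once we've summed one sample per channel, store the average as a mono sample
        if ((i + 1) % numChannels == 0) {
            preProcessedSamples[numProcessed] = combinedChannelAverage / numChannels;
            numProcessed++;
            combinedChannelAverage = 0f;
        }
    }
}
```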
Now, our float array preProcessedSamples contains data that is of the same format as what’s returned in real-time by AudioSource.GetOutputData, but it contains samples from the beginning of the track until the end instead of just the samples for the currently playing audio. You can compare your output to the output of AudioSource.GetOutputData for a point in time in the song and see that they are fairly close. One catch here is that AudioSource.GetOutputData returns the most recently played 1024 (or length of your array) samples, which might trip you up if you’re doing the math to find the starting point in your preprocessed samples that represent a point in time in the track.
Working with Frequencies (and external libraries)
We have our sample data, which is amplitude over time, but what we want is spectrum data, or significance over the frequency spectrum at a point in time. In our real-time analysis we achieved this with Unity’s helper AudioSource.GetSpectrumData to perform a Fast Fourier Transform. Unity does not have an equivalent helper for performing an FFT on raw sample data, so we’ll want to look elsewhere.
I landed on a very lightweight C# implementation of the basic Fourier Transform called DSPLib. It includes both the Discrete Fourier Transform and the Fast Fourier Transform, a number of available windows, and, most importantly, it gives us output that represents the same relative amplitude that we had when using AudioSource.GetSpectrumData.
Click “Download Library C# code only” to download the source code for the FFT, then extract the DSPLib.cs file and place it in a directory in your project.
There is a catch here. DSPLib relies heavily on the Complex data type from .NET’s System.Numerics library, so you’ll see Unity complaining about not being able to find the Complex data type or System.Numerics. This is because Mono, the flavor of .NET provided by Unity, does not include the System.Numerics library. What we can do to get around this is go directly to the source (Microsoft’s GitHub page in this case), download Complex.cs, and place it in our project. Complex.cs does require some minor conversions before it will compile in Unity. I’ll place my version of the conversion here:
If we add references to our new libraries at the top of our file, we should be able to compile with no issue.
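For reference, the using directives would look something like this (assuming DSPLib.cs declares a DSPLib namespace, and Complex.cs keeps its original System.Numerics namespace):

```csharp
using System.Numerics;  // the Complex type from the Complex.cs we pulled in
using DSPLib;           // the FFT and DSP helper classes from DSPLib.cs
```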
I’ll go ahead and show you the code for executing the rest of the sample preparation before we send our spectrum data to our spectral flux algorithm. If we’ve done everything correctly, we won’t have to change our spectral flux algorithm at all.
Much of what you see here is based on the examples given on the DSPLib site. The basic steps are:
- Combine stereo samples to mono
- Iterate over samples in chunks the size of a power of 2. 1024 is our magic number here again.
- Grab the current chunk of samples to be processed
- Scale and window it with one of the available FFT Windows. I found DSPLib’s implementation of Hanning to be pretty reliable here.
- Execute the FFT to retrieve complex spectrum values, which will again be half the size of our sample data — 512 in this case.
- Convert our complex data to a usable format (array of doubles)
- Apply our window scale factor to the output
- Calculate the current audio time represented by the current chunk
- Pass our scaled, windowed, converted spectrum data to our spectral flux algorithm. You can see here that I set us up to use floats, and DSPLib really likes doubles. I’m adding some overhead by doing the conversion to float instead of converting the spectral flux algorithm to use doubles.
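The steps above can be sketched roughly as follows, based on the examples from the DSPLib site. This assumes `preProcessedSamples` (our mono float array) and `sampleRate` have already been prepared, and that `analyzeSpectrum` is our existing spectral flux entry point from the previous post:

```csharp
int spectrumSampleSize = 1024;
int iterations = preProcessedSamples.Length / spectrumSampleSize;

FFT fft = new FFT();
fft.Initialize((uint)spectrumSampleSize);

// Window coefficients and scale factor only need to be computed once
double[] windowCoefs = DSP.Window.Coefficients(DSP.Window.Type.Hanning, (uint)spectrumSampleSize);
double scaleFactor = DSP.Window.ScaleFactor.Signal(windowCoefs);

double[] sampleChunk = new double[spectrumSampleSize];
for (int i = 0; i < iterations; i++) {
    // Grab the current 1024-sample chunk (converting float -> double for DSPLib)
    for (int j = 0; j < spectrumSampleSize; j++) {
        sampleChunk[j] = preProcessedSamples[i * spectrumSampleSize + j];
    }

    // Scale and window the chunk
    double[] scaledSpectrumChunk = DSP.Math.Multiply(sampleChunk, windowCoefs);

    // Execute the FFT and convert the complex output to usable magnitudes
    Complex[] fftSpectrum = fft.Execute(scaledSpectrumChunk);
    double[] scaledFFTSpectrum = DSP.ConvertComplex.ToMagnitude(fftSpectrum);
    scaledFFTSpectrum = DSP.Math.Multiply(scaledFFTSpectrum, scaleFactor);

    // Audio time represented by the current chunk
    float curSongTime = ((1f / (float)sampleRate) * i) * spectrumSampleSize;

    // Hand off to the spectral flux algorithm, converting doubles back to floats
    float[] fluxInput = new float[scaledFFTSpectrum.Length];
    for (int j = 0; j < fluxInput.Length; j++) {
        fluxInput[j] = (float)scaledFFTSpectrum[j];
    }
    analyzeSpectrum(fluxInput, curSongTime);
}
```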
In the real-time analysis we asked Unity to give us 1024 spectrum values, which means under the hood Unity was sampling 2048 audio samples on the time domain. Here we provide 1024 audio samples from the time domain which gives us 512 spectrum values. That means we have slightly different frequency bin granularity. We could of course provide 2048 audio samples to have the exact same granularity as we did in the real-time analysis, but I found that having 512 bins was very passable for onset detection at different points in the spectrum.
For 512 bins, we’d simply divide the supported frequency range (our sampling rate / 2 — the Nyquist frequency) by 512 to determine the frequency represented per bin.
48000 / 2 / 512 = 46.875Hz per bin.
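If you want to target a particular part of the spectrum for onset detection, a hypothetical helper for mapping a frequency to its bin might look like this (assuming `sampleRate` was cached from AudioClip.frequency, 48000 here):

```csharp
// Which of our 512 bins covers a given frequency?
float hzPerBin = (sampleRate / 2f) / 512f;  // 46.875 Hz per bin at 48 kHz

int GetBinForFrequency(float hz) {
    return Mathf.Clamp((int)(hz / hzPerBin), 0, 511);
}
```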
We don’t need to redefine our Onset Detection using Spectral Flux algorithm here. It stays exactly the same because we’ve formatted our spectrum data in a similar way to what we had available when doing real-time analysis.
Navigating the Spectral Flux Output
The calculation of audio time is pretty simple. We know that our sample data is amplitude over time, so therefore each index must represent a different point in time. Our sample rate and our current position in the sample data can tell us roughly (within a few milliseconds) what time our current sample chunk is representing.
Let’s take our Terror Pigeon song from the real-time example. It is 04:27, or 267 seconds long. The clip has a sampling rate of 44100 (AudioClip.frequency) samples per second. So, we expect there to be roughly:
44100 * 267 = 11,774,700 samples
Logging AudioClip.samples gives us 11,782,329 samples: about 7.6k more samples, or 0.17 more seconds, than we expected. This is only because the track is not exactly 267 seconds long.
We need to know how much time is being processed per chunk, so that we have some idea of what time is represented by a spectral flux sample.
1 / 44100 = ~0.0000227 seconds per sample
0.0000227 * 1024 = 0.023 seconds per chunk
We could come to the same result by dividing the track length in seconds by the total number of samples, if we have that information available.
267 / 11,774,700 = ~0.0000227 seconds per sample
0.0000227 * 1024 = 0.023 seconds per chunk
We can do the math in reverse as well, to know what spectral flux index corresponds to a particular time in the song. Listening to the song, I’m hearing a bass beat about 3.6 seconds in. There are plenty of others, but let’s use this one as an example.
The time divided by the length of time per sample should give us the index, but we have to remember that we’ve been grouping by 1024 samples per chunk to get our spectral flux.
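Both directions of this conversion can be wrapped in a pair of helpers. This is a sketch; it assumes `sampleRate`, `clipLength`, and `numTotalSamples` were cached from the AudioClip, and that `spectrumSampleSize` is our 1024-sample chunk size:

```csharp
// Spectral flux index for a given song time
public int GetIndexFromTime(float curTime) {
    float lengthPerSample = clipLength / (float)numTotalSamples;
    return Mathf.FloorToInt(curTime / (lengthPerSample * spectrumSampleSize));
}

// Song time represented by a given spectral flux index
public float GetTimeFromIndex(int index) {
    return ((1f / (float)sampleRate) * index) * spectrumSampleSize;
}
```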
So at time 3.6:
3.6 / 0.023 = 156.52 — so we’d expect to find a peak near index 156.
Each index in our spectral flux output only represents 0.023 seconds, and I picked out time 3.6 by ear, so we may not be right on target. Let’s take a look at 10 samples in each direction from index 156, about a half second total, to see if we’re close.
Our algorithm found a peak at index 148. So I was 8 indexes, or about 0.184 seconds off. Not too shabby.
Looking at a larger window in each direction, we can see that we are logging a peak about every 18 indexes, or every 0.414 seconds. This means that the song’s tempo must be ~145 BPM for the section we are analyzing. To be sure, I asked Terror Pigeon what the actual BPM of the song is and he said I was 2 off, it’s 143 BPM for the full song. Not bad for analyzing a ~1 second time frame!
You can do a similar comparison with the following code:
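A sketch of that comparison, assuming `spectralFluxSamples` is the list produced by our spectral flux algorithm (with `spectralFlux` and `isPeak` members, as in the real-time post) and `sampleRate` is cached from the clip:

```csharp
// Log the spectral flux output around a time picked out by ear (3.6s here)
float secondsPerChunk = 1024f / sampleRate;  // ~0.023s at 44100
int centerIndex = Mathf.RoundToInt(3.6f / secondsPerChunk);

for (int i = centerIndex - 10; i <= centerIndex + 10; i++) {
    SpectralFluxInfo info = spectralFluxSamples[i];
    Debug.Log(string.Format("Index {0} ({1:0.000}s): flux {2:0.0000}{3}",
        i, i * secondsPerChunk, info.spectralFlux, info.isPeak ? " <- peak" : ""));
}
```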
Let’s plot the output of both the real-time analysis and the preprocessed analysis side-by-side to compare.
Our real-time plot is on the top and our preprocessed plot is on the bottom. You can see there is a bit of skew, but the highest 9 peaks on the real-time plot roughly correspond to the same 9 peaks on the preprocessed plot. The skew is because our real-time indexes are roughly one frame’s time apart, which is not the same spacing as our preprocessed indexes, which we know are 0.023 seconds apart for this track. You can also see that there is less fluctuation and less over-reporting in the preprocessed plot and, of course, our preprocessed plot extends beyond real time. We can jump to any point in the song and our preprocessed spectral flux will be available.
We are now mapping all beats of an audio file before playing the audio to the user. While results may vary, this should set you up nicely to start creating your own gameplay based on beats within an audio file.
What to experiment with and where to focus next will be covered briefly in the Outro.