For the past two years, I have been actively developing ChowTape, an audio plugin designed to emulate the sound of reel-to-reel analog tape. One of the reasons I’ve enjoyed working on this project is that it gives me an opportunity to implement advanced, cutting-edge signal processing techniques in service of user-friendly, musically interesting software. In this article, I’d like to talk about a new feature that was recently added to the plugin: a specific type of neural network known as a State Transition Network (STN). I wanted to write about this feature because I think it provides useful insight into the practical considerations involved in using neural networks for real-time audio signal processing.
For most readers it may not be immediately obvious how tape emulation and neural networks are connected. Many proponents of neural networks enjoy the fact that they can be used to model “end-to-end” systems, which in this case would mean recording audio directly from an actual reel-to-reel tape machine, and training the neural net to replicate the recorded signal. However, for use in the ChowTape plugin, this type of system would not be particularly useful. ChowTape contains many parameters that allow users to alter the physical characteristics of the tape and tape machine that the plugin is emulating, several of which cannot easily be changed on an actual tape machine (for example, the size of the playhead used by the tape machine). With that in mind, it would be difficult to collect the data needed to train an “end-to-end” neural network that covers all of the parameters used by the plugin.
Instead, to show how neural networks can be useful for audio effects like this, we must dig a little bit deeper into the guts of the tape emulation algorithm. The most important aspect of emulating analog tape is re-creating the sound of “tape distortion”, which is created by the process of magnetic hysteresis that occurs when audio is recorded to tape. In ChowTape, this re-creation is accomplished using a discretized version of the Jiles-Atherton equation, a mathematical model of magnetic hysteresis. While the process of discretizing the Jiles-Atherton equation is quite involved (for more technical readers, see here), all we really need to know about it in this case is that it results in a nonlinear “state-space” equation that needs to be solved in real-time.
A state-space equation is an equation that can be solved for some output, based on a current input, and a current “state” that contains intermediate values from past iterations of the equation. In this case, the “input space” is comprised of the current input signal and the current derivative of the input; the “state space” is comprised of the previous input sample, the input derivative from the previous sample, and the previous output sample. While solving nonlinear state-space equations of this sort is difficult to do in real-time, it is often necessary in audio signal processing. Often, these equations are solved using explicit solvers, like the Runge-Kutta method, or with iterative solvers like the Newton-Raphson method.
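To make the input/state split concrete, here is a minimal C++ sketch of a per-sample state-space loop. It is a toy under stated assumptions, not ChowTape's actual code: the names (SolverInput, SolverState, solveStep, processBlock) are hypothetical, the implicit equation y = tanh(x - y) merely stands in for the discretized hysteresis equation, and the derivative is estimated with a plain backward difference. The iterative solve is a Newton-Raphson loop capped at a fixed number of iterations.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// All names are hypothetical; this is a toy, not ChowTape's solver.
struct SolverInput { double x; double xDot; };                        // current input and its derivative
struct SolverState { double xPrev; double xDotPrev; double yPrev; };  // values kept from the previous sample

// Toy implicit equation standing in for the hysteresis update (assumption):
// find y such that residual(x, y) = y - tanh(x - y) = 0.
inline double residual (double x, double y) { return y - std::tanh (x - y); }

// Newton-Raphson with a fixed cap on the number of iterations, seeded with
// the previous output. The toy equation ignores xDot and the other state
// values; they are carried along only to mirror the layout described above.
inline double solveStep (const SolverInput& in, const SolverState& state, int maxIters)
{
    double y = state.yPrev;
    for (int i = 0; i < maxIters; ++i)
    {
        const double t = std::tanh (in.x - y);
        const double slope = 2.0 - t * t; // d(residual)/dy, always >= 1
        y -= (y - t) / slope;
    }
    return y;
}

std::vector<double> processBlock (const std::vector<double>& input, double sampleRate, int maxIters)
{
    SolverState state { 0.0, 0.0, 0.0 };
    std::vector<double> output;
    output.reserve (input.size());

    for (double x : input)
    {
        // Assumption: backward-difference estimate of the input derivative.
        const SolverInput in { x, (x - state.xPrev) * sampleRate };
        const double y = solveStep (in, state, maxIters);
        output.push_back (y);

        // The current input and output become the state for the next sample.
        state = SolverState { in.x, in.xDot, y };
    }
    return output;
}
```

The key point is the shape of the loop: each output depends only on the current input and the stored state, and the cost per sample grows with the number of solver iterations.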
Indeed, the most recent version of ChowTape allows users to choose their own “hysteresis mode”, corresponding to a 2nd-order Runge-Kutta solver (RK2), a 4th-order Runge-Kutta solver (RK4), or a Newton-Raphson solver with either 4 or 8 maximum iterations (NR4 or NR8). The most accurate mode is the NR8 solver. However, computing 8 iterations of the hysteresis equation is quite processor-intensive, so most users can only use a few instances of the plugin in NR8 mode before maxing out their CPU. With that in mind, I started thinking about creating a new hysteresis mode that could achieve comparable accuracy to the NR8 solver, but with faster computation time. Enter the State Transition Network.
State Transition Networks
As it turns out, a similar problem was tackled in 2019 by some researchers at Native Instruments. The researchers were attempting to emulate the filter circuit from the Korg MS-20 synthesizer, but found that their state-space equation for the circuit could only be solved by a rather nasty Newton-Raphson iteration. The researchers noticed that the Newton-Raphson solver was simply performing a memoryless nonlinear mapping from one point in the state space to another, and realized that another tool could be used instead: Deep Neural Networks. The basic idea was to train the neural network to perform this state transition, so that given the same inputs and the same prior state, the aptly named State Transition Network (STN) would provide an output that accurately emulated the desired state-space system.
The researchers quickly realized the full advantage of the neural network approach: using the State Transition Network doesn’t require any prior knowledge of the state-space equations! So instead of having to derive the full set of equations that could be used by a Newton-Raphson solver, or another type of state-space equation solver, all they needed to do was collect their data, assemble the neural network architecture, and train the network. To collect the data, the researchers would construct the physical circuit they were modelling, and measure the voltage at a couple of specific points in the circuit, thereby collecting the state-space data necessary for training the network. Clearly the idea has been a success, as evidenced by the 2019 DAFx paper published by the group, as well as Native Instruments’ new Guitar Rig 6 Pro, which claims to use the same technique.
Inspired by the success of the folks at NI, I decided to attempt to train an STN to be used in the ChowTape hysteresis emulation. Unfortunately, since I’m modelling tape, rather than a circuit, the only way I could collect my state-space data was to generate synthetic data. From there, the STN process worked as expected: I generated data using the NR8 solver from the ChowTape hysteresis algorithm, created the STN architecture in TensorFlow, and successfully trained an STN that could replicate the output of the Newton-Raphson solver. However, there were still a couple of important problems left to solve.
First, I needed to be able to implement the STN in the plugin, and I needed that implementation to be fast enough to give a significant speed improvement over the NR8 solver. Second, the hysteresis processing in ChowTape contains 3 parameters, Drive, Saturation, and Bias. I needed a way to somehow include the parameters in the STN model without compromising accuracy or speed.
To implement the STN for real-time processing, I decided to create my own C++ implementation of a deep neural network. This can be done in native C++ using algorithms such as std::inner_product; however, external libraries like Eigen or xsimd can also be useful for improving the network inferencing speed through SIMD instructions.
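As a rough illustration of the native C++ route, the sketch below builds a fully connected layer on top of std::inner_product. It is a minimal example under stated assumptions rather than the plugin's implementation: the DenseLayer and runNetwork names are hypothetical, and the tanh activation is an assumed choice.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical minimal dense layer; not the plugin's actual implementation.
struct DenseLayer
{
    std::vector<std::vector<double>> weights; // one row of weights per output neuron
    std::vector<double> biases;               // one bias per output neuron
    bool useTanh;                             // tanh activation (assumed); false = linear

    std::vector<double> forward (const std::vector<double>& in) const
    {
        assert (biases.size() == weights.size());
        std::vector<double> out (weights.size());

        for (std::size_t i = 0; i < weights.size(); ++i)
        {
            assert (weights[i].size() == in.size());

            // Dot product of the weight row with the input, starting from the bias.
            const double sum = std::inner_product (weights[i].begin(), weights[i].end(),
                                                   in.begin(), biases[i]);
            out[i] = useTanh ? std::tanh (sum) : sum;
        }
        return out;
    }
};

// Feeds the input through each layer in turn and returns the final activations.
std::vector<double> runNetwork (const std::vector<DenseLayer>& layers, std::vector<double> x)
{
    for (const auto& layer : layers)
        x = layer.forward (x);
    return x;
}
```

A real-time version would avoid allocating vectors per sample (for example, by using fixed-size arrays), and a SIMD library could replace the inner loop; the structure of the computation stays the same.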
The Native Instruments researchers had tried 2 different network sizes in their paper, a 2x8 network with 2 layers of 8 “neurons” each, and a 3x4 network. In order to test the computation time of these networks, I set up a benchmarking suite for the hysteresis algorithm, and measured how long it took each network to process 30 seconds of audio, compared to the NR8 solver. For both networks, I found that the hysteresis algorithm with the STN was ~2.5x faster than with the NR8 solver. Eventually, I found that I could achieve the desired accuracy with a 2x4 STN, which was measured to be ~3x faster than the NR8 solver.
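A benchmark of this kind can be sketched with std::chrono. The example below is an assumption-laden illustration, not the actual benchmarking suite: the function names, the 48 kHz sample rate, and the 100 Hz sine test signal are all hypothetical choices, and the per-sample callable stands in for whichever hysteresis solver is being measured.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the wall-clock time (in seconds) that `processSample` needs to
// process `seconds` of audio. Sample rate and test signal are assumptions.
template <typename Processor>
double benchmarkSeconds (Processor&& processSample, double seconds = 30.0, double sampleRate = 48000.0)
{
    const auto numSamples = static_cast<std::size_t> (seconds * sampleRate);

    // Build the test signal up front so it is not included in the timing.
    std::vector<double> buffer (numSamples);
    const double twoPi = 6.283185307179586;
    for (std::size_t n = 0; n < numSamples; ++n)
        buffer[n] = std::sin (twoPi * 100.0 * static_cast<double> (n) / sampleRate);

    const auto start = std::chrono::steady_clock::now();
    for (auto& x : buffer)
        x = processSample (x);
    const auto end = std::chrono::steady_clock::now();

    // Consume the output so the compiler cannot discard the processing loop.
    volatile double sink = std::accumulate (buffer.begin(), buffer.end(), 0.0);
    (void) sink;

    return std::chrono::duration<double> (end - start).count();
}

// Speed-up of a candidate relative to a reference (e.g. ~3 means "3x faster").
inline double speedupRatio (double referenceSeconds, double candidateSeconds)
{
    return referenceSeconds / candidateSeconds;
}
```

In practice it helps to run each measurement several times and compare averages, since a single run can be skewed by whatever else the machine is doing.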
A common problem with using “black-box” systems like neural networks in audio processing is being able to implement user parameters for these systems. Somehow, I needed a way to implement all 3 of the ChowTape hysteresis parameters (Drive, Saturation, Bias) using the STNs. A handy way to think about this sort of thing is to view the parameters for an audio processor as a sort of “parameter space”. In this case, since the hysteresis processor contains 3 parameters, the parameter space is 3-dimensional, like a cube. The STN mode needs to sufficiently “cover” the parameter space so that users of the plugin can utilise the full range of all the parameters.
My first thought was to try to train individual STNs for every possible parameter configuration. Since all 3 parameters are continuous, I would essentially need to “sample” the parameter space. I decided to use 20 “samples” for each parameter, which I figured would sufficiently cover the parameter space, without the “steps” between each sample becoming noticeable. Unfortunately, that meant I would need to train 20^3 = 8000 networks, which would take quite a long time. Further, the weights for each network are stored in a file ~100 kB in size, so having to store 8000 of those files would add 800 MB to the size of the plugin binary, and make the plugin take significantly longer to load.
Next, I thought I could try to train the parameters as inputs to the STN. In theory, this would allow the network to effectively “learn” the parameters, meaning I could train a single network that could cover the entire parameter space. Unfortunately, while the STN was able to learn the simplest parameter “Drive”, the other two parameters were only confusing the network, so it never fully converged during training. I think the network may have been able to learn the more complex parameters if I had used a larger network, but I was worried that using a larger network would compromise the speed improvements I was hoping to gain.
Eventually, I decided to combine the two approaches: I would train individual networks for each configuration of the “Bias” and “Saturation” parameters, and train each network to learn the “Drive” parameter. That way, the individual networks would only have to sample 2 dimensions of the parameter space. Further, I decided to only use 10 samples for the “Bias” parameter, since small changes to that parameter are not quite as noticeable. In all, this left me with the task of training 20 x 10 = 200 STNs. Not an easy undertaking, but certainly doable. More importantly, storing the resulting network weights would only take ~20 MB, and they could be loaded into the plugin in just a few milliseconds.
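One way to picture the hybrid scheme is as a 20 x 10 bank of networks indexed by Saturation and Bias, with Drive passed to the selected network as an extra input. The sketch below is only an illustration under assumptions: the names are hypothetical, the parameters are assumed to be normalized to [0, 1], the nearest grid point is chosen (the plugin may do something different), and the layout of the network input is a guess based on the input and state values described earlier.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t numSaturationSteps = 20;
constexpr std::size_t numBiasSteps = 10;
constexpr std::size_t numNetworks = numSaturationSteps * numBiasSteps;
constexpr std::size_t kilobytesPerNetwork = 100; // approximate weights file size

static_assert (numNetworks == 200, "hybrid scheme: 20 x 10 networks");
static_assert (numNetworks * kilobytesPerNetwork == 20000, "about 20 MB in total");
static_assert (20 * 20 * 20 * kilobytesPerNetwork == 800000, "full 3-D grid: about 800 MB");

// Maps a normalized parameter in [0, 1] to the nearest grid step (assumption).
inline std::size_t nearestStep (double normalized, std::size_t numSteps)
{
    const double clamped = std::min (1.0, std::max (0.0, normalized));
    return static_cast<std::size_t> (std::lround (clamped * static_cast<double> (numSteps - 1)));
}

// Index of the network to use for a given Saturation / Bias setting.
inline std::size_t networkIndex (double saturation, double bias)
{
    return nearestStep (saturation, numSaturationSteps) * numBiasSteps
           + nearestStep (bias, numBiasSteps);
}

// Hypothetical input layout: current input, its derivative, the previous
// input, derivative and output, with Drive appended as a learned input.
inline std::array<double, 6> makeNetworkInput (double x, double xDot, double xPrev,
                                               double xDotPrev, double yPrev, double drive)
{
    return { { x, xDot, xPrev, xDotPrev, yPrev, drive } };
}
```

With this layout, moving the Saturation or Bias control swaps which set of weights is active, while moving Drive only changes one of the values fed to the active network.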
In the end, training all 200 STNs took about 2.5 weeks, in large part because all of my attempts to automate the training process failed. That said, I am happy with how the final product turned out: for most choices of parameters and input signal, the STN sounds quite close to the NR8 solver, with a significantly shorter computation time. However, for more “extreme” choices of parameters, particularly for signals that are heavily distorted by the hysteresis algorithm, the STN starts to display its own unique, characteristic sound. Specifically, the bass response is slightly attenuated, and the higher frequencies will sound a little bit more “harsh” as they distort. When testing the plugin on some of my mixes, I found that on some signals I preferred the sound of the STN solver, while other signals were better suited by the NR8 solver. At the end of the day, the STN mode is just another sonic tool that you can use to achieve the sound you’re after.
I hope this article has shed some light on some of the practical challenges associated with using neural networks in real-time audio signal processing, particularly in the world of audio effects where user parameters are such an important factor. If you’d like to hear these neural networks in action, feel free to download ChowTape for free on GitHub!
Big thanks to Julian Parker, Fabian Esqueda, Andre Bergner, and Boris Kuznetsov for their work developing the STN as an analog modelling tool. In particular, thanks to Fabian for reaching out to me with clarifications about how the STN is typically trained.