Neural Networks for Real-Time Audio: Introduction
This is the first of a five-part series on using neural networks for real-time audio.
Artificial intelligence impacts our lives more each day, whether we are aware of it or not. From social media feeds to online shopping to self-driving cars, A.I. is changing the way we live and how we make decisions.
But wait, isn’t A.I. all about terminators and humanoid robots and machines taking over the world? That’s what we see in movies, but in reality A.I. is just a different way to solve problems. It uses computers, big data, and clever math to solve problems in a way that previously only humans could do.
Neural networks are a subset of artificial intelligence. The theory is loosely based on the way the brain works: the brain is made up of neurons that form the complex connections that make us who we are. Instead of biological cells, a neural network uses numerical values or “nodes” that form layers and are connected in defined mathematical ways. It turns out that even simple structures of these artificial neurons can be used for impressively complex tasks, such as reading handwritten letters or recognizing faces in pictures.
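To make “nodes that form layers” a little more concrete, here is a minimal sketch in plain Python. It assumes the simplest kind of layer, where each node computes a weighted sum of its inputs plus a bias and passes the result through tanh. The names (`dense_layer`, `tiny_network`) and all the weight values are hypothetical and hand-picked for illustration; a trained network would learn its weights from data.

```python
import math

def dense_layer(inputs, weights, biases):
    """One layer: each node is a weighted sum of the inputs plus a bias, squashed by tanh."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(math.tanh(total))
    return outputs

# Hypothetical, hand-picked values (not learned from anything).
hidden_weights = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]  # 3 nodes, 2 inputs each
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[0.7, -0.5, 0.2]]  # 1 node, fed by the 3 hidden nodes
output_biases = [0.0]

def tiny_network(inputs):
    """Two layers chained together: 2 inputs -> 3 hidden nodes -> 1 output."""
    hidden = dense_layer(inputs, hidden_weights, hidden_biases)
    return dense_layer(hidden, output_weights, output_biases)

result = tiny_network([1.0, 0.5])
print(result)  # a list holding one value between -1 and 1
```

Stacking more of these layers, and letting an optimizer choose the weights, is the basic idea behind the larger networks discussed later in the series.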
Why use neural networks for audio? One common application is text-to-speech or speech-to-text. Products like Alexa listen to our voice, understand what we are asking (most of the time), and respond with a voice that sounds human. But there are many other applications for neural networks and audio.
Audio quality is a big deal in the music industry. We want to hear the best sound possible coming out of our stereos or headphones. Analog hardware is often prized for its ability to add warmth to sound. For example, high-end guitar amplifiers use vacuum tubes to amplify and overdrive an otherwise clean guitar signal. They distort the sound in a way that’s pleasing to the ear. Transistor-based amplifiers (solid state) are a little too good at what they are made to do. It’s the imperfections in the way vacuum tubes amplify sound that add something special. But tubes are expensive and fragile, and like a lightbulb, they will eventually burn out.
Neural networks can be used to accurately model these non-linear audio components. This type of modeling is called “black-box” modeling, because we aren’t concerned with how the physical system works, only that the model responds in the same way. The opposite approach is “white-box” modeling, where detailed circuit analysis and equations define each electrical component. A hybrid of the two is called “grey-box” modeling.
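As a rough illustration of what “non-linear” means here, the sketch below compares an idealized linear amplifier with a tanh soft clipper. The soft clipper is only a common textbook stand-in that I am assuming for the example; it is not a model of any real tube circuit, and the function names and gain value are hypothetical.

```python
import math

def linear_gain(sample, gain=4.0):
    """Idealized linear amplifier: the output is a scaled copy of the input."""
    return gain * sample

def soft_clip(sample, gain=4.0):
    """Illustrative non-linear stand-in (tanh waveshaper), NOT a real tube model."""
    return math.tanh(gain * sample)

def one_cycle(amplitude, num_samples=8):
    """One cycle of a sine wave at the given amplitude."""
    return [amplitude * math.sin(2 * math.pi * n / num_samples) for n in range(num_samples)]

def peak(signal):
    """Largest absolute sample value in the signal."""
    return max(abs(s) for s in signal)

quiet = one_cycle(0.25)
loud = one_cycle(0.5)  # the same signal, played twice as loud

# Linear device: doubling the input doubles the output.
linear_ratio = peak([linear_gain(s) for s in loud]) / peak([linear_gain(s) for s in quiet])

# Non-linear device: doubling the input does NOT double the output.
clipped_ratio = peak([soft_clip(s) for s in loud]) / peak([soft_clip(s) for s in quiet])

print(linear_ratio, clipped_ratio)
```

A black-box model only ever sees pairs like `(s, soft_clip(s))`. It never looks inside the function, which is why the same approach can be pointed at a real amplifier whose internals we don't want to analyze.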
There are three steps involved with using just about any neural network:
1. Data Collection: The model must have data to learn from. In the case of a guitar amplifier, the neural network needs to take a clean guitar signal and simulate the dynamic response of the amplifier. This requires two separate recordings: the input to the system, and the output from the system (where the system is the audio amplifier/circuit/component). Audio recording is an art of its own, and there are many techniques for capturing the best sound.
2. Model Training: The input and output signals are then used to train the neural network to behave like the real system. A network architecture is defined, along with an optimizer and loss function. The neural net is trained to minimize the loss between the predicted signal and the truth signal (the actual signal out of the audio device). The choices of network architecture and loss function are critical to how the model performs.
3. Model Deployment: Once the neural net model is trained, you can deploy it for use in a specific task. This is also known as “inference”, because the model is inferring how to react to new inputs based on what it learned in the training phase. In the case of a real-time guitar amp model, the network is continually processing a live audio signal. Digital audio effects typically run at 44.1 kHz (44,100 samples per second) or higher. That’s a massive amount of data to convert from analog to digital, push through a neural net, convert back to analog, and send to your speakers before your ears notice a delay. High-performance algorithms are required to make this happen.
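The three steps above can be sketched end to end in a few lines of plain Python. This is a toy under heavy assumptions, not the GuitarML code: the “amplifier” is a hypothetical tanh soft clipper, the “network” is a single node with two parameters, the optimizer is plain stochastic gradient descent, and the loss is mean squared error. It also ignores the dynamic (time-dependent) response mentioned in step 1, since this model treats every sample independently.

```python
import math
import random

SAMPLE_RATE = 44_100  # samples per second, as in the text

def device(sample):
    """Hypothetical stand-in for the real amplifier: a tanh soft clipper."""
    return math.tanh(3.0 * sample)

# 1. Data collection: paired input ("clean") and output ("amplified") signals.
random.seed(0)
clean = [random.uniform(-1.0, 1.0) for _ in range(2000)]
target = [device(x) for x in clean]

# 2. Model training: fit y = tanh(w * x + b) by minimizing squared error.
def model(sample, w, b):
    return math.tanh(w * sample + b)

def mse(w, b):
    """Mean squared error between the predicted and the truth signal."""
    return sum((model(x, w, b) - y) ** 2 for x, y in zip(clean, target)) / len(clean)

w, b = 1.0, 0.0
learning_rate = 0.1
initial_loss = mse(w, b)
for epoch in range(20):
    for x, y in zip(clean, target):
        pred = model(x, w, b)
        # Gradient of (pred - y)^2 w.r.t. (w * x + b), using tanh' = 1 - tanh^2.
        grad = 2.0 * (pred - y) * (1.0 - pred * pred)
        w -= learning_rate * grad * x
        b -= learning_rate * grad
final_loss = mse(w, b)

# 3. Deployment (inference): process live audio one buffer at a time.
def process_buffer(buffer):
    return [model(x, w, b) for x in buffer]

def buffer_budget_ms(buffer_size, sample_rate=SAMPLE_RATE):
    """Time available to process one buffer before the next one arrives."""
    return 1000.0 * buffer_size / sample_rate

print(initial_loss, final_loss)
print(buffer_budget_ms(256))
```

The last function hints at why performance matters: with a buffer of 256 samples (a size I picked only as an example) at 44.1 kHz, a new buffer arrives roughly every 5.8 ms, so processing each one has to finish within that window.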
This article series assumes the reader has a basic understanding of programming languages and A.I. frameworks, and is intended as a reference for implementing real-time audio solutions using neural networks. I will mainly be using the GuitarML project for my code examples. GuitarML is a collection of open-source guitar plugins and the associated machine learning software used to train the models. We will be going step-by-step through three different neural net architectures and their real-time algorithms. The real-time software uses the JUCE audio framework and is written in C++. The machine learning code uses TensorFlow (with Keras) and PyTorch.
In my next article, I’ll run through code examples of the WaveNet neural network model for real-time audio processing.
If you like my work here and on GuitarML, consider joining my Patreon for behind-the-scenes software development posts and neural net amp/pedal models for my plugins.