AI in Music Production (Part 2)

Hans-Martin ("HM") Will
Jun 14, 2023 · 12 min read


This is the second part of a mini-series on the state of the art, applications, and potential impact of AI in music production.

The outline of the overall mini-series is as follows:

1. Part I: Origins of AI, generative AI, and examples of using generative AI in music. In case you missed it, you can find Part 1 here.

2. Part II (this article): Key approaches of the most recent AI technology wave, and specific additional techniques relevant to musical applications

3. Part III: Potential AI applications along the music production workflow and future outlook

As I am writing this second part, I realize that it is somewhat challenging to talk about AI technology and music without assuming too much about the reader’s background, in AI as much as in music and sound processing. So, hopefully, what follows here is useful as a kind of translation and juxtaposition, showing how certain AI techniques correspond to musical concepts.

Key AI Concepts relevant to Processing Music

There are a few types of neural network architectures that are of particular importance for music applications.

Convolutional Neural Networks

Convolutional Neural Networks (CNN) apply to data that has a natural 1- or higher-dimensional organization. For example, an audio stream can be represented as a series of samples, and a digital image as a 2-dimensional arrangement of same-sized, square pixels. Given a specific time point in the case of audio, or a specific pixel in the case of an image, a convolution operation calculates a weighted average for that location using a small neighborhood. We may consider, for example, 5 time points before and after the time of interest when calculating such a weighted sum. Or, we may decide to consider the square extending 3 pixels left, right, up, and down from our pixel of interest as its neighborhood. The weights used in this operation are called a kernel.

These operations that compute weighted sums using kernels are quite similar to applying a filter in the sense of traditional digital signal processing. The key addition that neural networks provide is that the specific values of the kernel are not specified ahead of time but rather learned as part of the training process. And because filters allow us to extract certain properties of the incoming signal, what we obtain are trainable feature extractors and detectors. CNNs also provide so-called pooling layers, which aggregate the incoming signal or image to a lower resolution. This corresponds to downsampling a signal in traditional processing.
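
To make this concrete, here is a minimal sketch in PyTorch (an assumed choice of framework): a small stack of 1-D convolutions and pooling layers applied to raw audio. All layer sizes and kernel widths are arbitrary, illustrative choices, not taken from any particular published model.

```python
import torch
import torch.nn as nn

# A toy 1-D convolutional feature extractor for raw audio.
# The kernel weights are random at first and would normally be
# learned during training; all layer sizes are arbitrary choices.
feature_extractor = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=16, kernel_size=11, padding=5),  # learnable "filters"
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=4),   # pooling layer: downsamples by a factor of 4
    nn.Conv1d(16, 32, kernel_size=11, padding=5),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=4),
)

# One second of mono audio at 16 kHz: shape (batch, channels, samples)
audio = torch.randn(1, 1, 16000)
features = feature_extractor(audio)
print(features.shape)  # torch.Size([1, 32, 1000]) -- lower time resolution, more channels
```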

There is also a conceptual inverse of computing a convolution, called a transposed convolution. Transposed convolutions can be used to transform a filtered, lower-resolution signal back toward the resolution of the original input. Like the convolution kernel, the weights defining a transposed convolution are typically learned as part of the training process. Transposed convolutions are commonly found in so-called decoders, which transform an internal representation of a signal back to its external form. For example, in the music setting, a decoder would transform an internal representation back into an audio waveform that can then be played back and listened to.
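
Continuing the sketch above, a transposed convolution can upsample a low-resolution internal representation back toward the audio rate, as a decoder might; again, all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A transposed convolution upsamples a low-rate internal representation;
# stride=4 undoes one of the factor-4 pooling steps from the example above.
# Again, the weights would be learned during training.
upsample = nn.ConvTranspose1d(in_channels=32, out_channels=1,
                              kernel_size=8, stride=4, padding=2)

latent = torch.randn(1, 32, 1000)     # e.g. the output of the encoder sketched above
waveform = upsample(latent)
print(waveform.shape)                 # torch.Size([1, 1, 4000]) -- 4x more time points
```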

Recurrent Networks

Recurrent Neural Networks (RNN) are networks that include cycles in their network topology. This creates a feedback loop that allows the network to learn dependencies over time. To a certain extent, these feedback loops are similar to the feedback lines used in traditional signal processing to create delay, echo, chorus, or phaser effects. However, for each of these effects, it is primarily the time delay and the amount of feedback that influence the overall result, and any additional processing of the signal being fed back is rather secondary. For RNNs, there is a fixed-size loop in the system, but the specific processing, which may include certain forms of memory, is learned through training.

A specific sub-family of networks adds explicit gated memory, where the rules to update and use the memory state are part of the training process. The best-known network type with gated memory is the long short-term memory (LSTM) network; a second popular type is the gated recurrent unit (GRU). With gated memory, we gain two important properties: (i) there is explicit state that the system maintains over time (“memory”), and (ii) the system controls when this state is updated (“gated”). Thus, in principle, such a system can learn rules involving time and properties of the incoming signal. It is therefore not surprising that recurrent networks have been used extensively for the generation of melodies and other audio or music patterns.
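
As a small illustration of gated memory in practice, here is a toy PyTorch LSTM that predicts the next note of a melody encoded as MIDI pitch numbers. The vocabulary, layer sizes, and the name MelodyLSTM are assumptions made for this sketch, not taken from any particular published system.

```python
import torch
import torch.nn as nn

# Toy next-note predictor: maps a sequence of MIDI pitches (0..127)
# to a distribution over the next pitch. The LSTM's gated cell state
# is the explicit "memory" carried across time steps.
class MelodyLSTM(nn.Module):
    def __init__(self, vocab=128, embed=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, pitches, state=None):
        x = self.embed(pitches)             # (batch, time, embed)
        out, state = self.lstm(x, state)    # hidden/cell state = gated memory
        return self.head(out), state        # logits for the next pitch at each step

model = MelodyLSTM()
melody = torch.randint(0, 128, (1, 16))     # a random 16-note "melody" as a stand-in
logits, state = model(melody)
next_pitch = logits[:, -1].argmax(dim=-1)   # greedy prediction of note 17
```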

Recurrent and related autoregressive models have been used successfully for speech synthesis, and the underlying models, such as WaveNet (van den Oord et al. 2016), can be adapted to general sound synthesis (Zhao et al. 2019). In addition, LSTMs have been used for speech recognition and, as such, can be used for sound analysis and classification. As a direct application to music processing, LSTMs have been used to build emulations of guitar amplifiers and effects (Wright et al. 2020) by directly learning a transfer function from the input to the output signal, as sketched below. GuitarML is a project site collecting a few audio processors created using such networks.
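
The sketch below illustrates the general idea of learning such a transfer function. It is not the architecture from Wright et al. (2020) or GuitarML, but a deliberately simplified stand-in that maps chunks of the dry input signal to the processed output signal and trains with a plain MSE loss.

```python
import torch
import torch.nn as nn

# Sketch of learning a transfer function from dry to processed audio:
# the LSTM sees the dry signal sample by sample and predicts the
# corresponding output sample (simplified relative to published models).
class AmpModel(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                   # x: (batch, samples, 1)
        h, _ = self.lstm(x)
        return self.out(h)                  # predicted output samples

model = AmpModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# dry_chunks / wet_chunks would be aligned excerpts recorded through the
# amplifier or pedal being modeled (random data used here as a stand-in).
dry_chunks = torch.randn(8, 2048, 1)
wet_chunks = torch.randn(8, 2048, 1)

pred = model(dry_chunks)
loss = nn.functional.mse_loss(pred, wet_chunks)   # published work often adds spectral loss terms
loss.backward()
optimizer.step()
```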

Variational Autoencoders

Variational Autoencoders (VAE) were proposed as an approach to learn a smooth, lower-dimensional representation (mathematically: a manifold) of the space represented by a training data set (Kingma and Welling 2013). A VAE consists of two parts: an encoder network that maps an input example to a lower-dimensional parameter space, and a decoder network that is trained to recreate the example from the lower-dimensional parameters. During the training process, noise is added in a controlled form to ensure that the mapping is smooth, that is, small changes to the parameters result in small changes in the generated output. Once trained, the decoder can be used to generate output from a set of parameters provided by the user. Because the mapping is smooth, it is possible to traverse a path in the parameter space, resulting in continuous, smooth changes to the generated output. Therefore, it is possible to control the parameters through envelopes or automation.
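
The following is a minimal VAE sketch in PyTorch, operating on fixed-size frames (for example, flattened spectrogram slices). The layer sizes and latent dimension are arbitrary assumptions; real audio VAEs such as RAVE use considerably more elaborate encoders and decoders.

```python
import torch
import torch.nn as nn

# Minimal VAE over fixed-size audio frames. The encoder predicts a mean and
# log-variance; sampling with that noise is what encourages a smooth latent space.
class VAE(nn.Module):
    def __init__(self, frame=1024, latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(frame, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent)
        self.logvar = nn.Linear(256, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, frame))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # "reparameterization trick"
        return self.dec(z), mu, logvar

vae = VAE()
x = torch.randn(4, 1024)                      # random frames as stand-ins for training data
recon, mu, logvar = vae(x)
recon_loss = nn.functional.mse_loss(recon, x)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # keeps the latent space well-behaved
loss = recon_loss + kl
```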

For example, RAVE (project, VST, Max) (Caillon and Esling 2021) is a real-time audio synthesizer and synthesizer plug-in that is trained on audio material to construct a low-dimensional parameter space. Once trained, it can synthesize audio signals from parameter values and transitions between them. For example, if the parameter values are derived from another audio signal, the system performs real-time style transfer, such as converting an audio signal from one instrument to another.

Generative Adversarial Networks

Generative Adversarial Networks (GAN) were originally proposed by Goodfellow et al. (2014). Training a GAN means training two networks in parallel: a generator learns to generate artifacts, such as an image or a waveform, that mimic the distribution of the examples provided as training data. Often, the generator learns to generate output as a function of a latent distribution or parameter space. The discriminator, in turn, is trained to distinguish generated output from real examples. By training these two networks together, the generator learns to “fool” the discriminator. The goal of this process is that, ultimately, generated content cannot be distinguished from the provided training data just by looking at the artifact itself.
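
Here is a minimal, illustrative GAN training step on fixed-length audio frames. The tiny fully connected generator and discriminator are stand-ins chosen for brevity, not the architectures used in actual audio GANs.

```python
import torch
import torch.nn as nn

# Minimal GAN training step on fixed-length frames (random data as a stand-in).
latent_dim, frame = 64, 1024
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, frame), nn.Tanh())
D = nn.Sequential(nn.Linear(frame, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(16, frame)                  # stands in for real training frames
z = torch.randn(16, latent_dim)
fake = G(z)

# 1) Discriminator: tell real frames from generated ones.
d_loss = bce(D(real), torch.ones(16, 1)) + bce(D(fake.detach()), torch.zeros(16, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# 2) Generator: produce frames the discriminator labels as real ("fooling" it).
g_loss = bce(D(fake), torch.ones(16, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```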

GANs are therefore a common building block for creating “realistic”, high-fidelity output. In music applications, RAVE, for example, uses a GAN stage as part of its training to achieve high-quality synthesis results.

Transformer Models and Language Models

Transformer models were originally proposed by Vaswani et al. (2017) as an architecture for machine translation and form the underlying technology of large language models. Transformers have been applied in other areas as well, including computer vision, and are a key component of AlphaFold 2, which can predict the 3D structure of a protein based on the sequence of amino acids making up the molecule. In the case of machine translation, transformer networks learn to map a sequence in an input language to a sequence in an output language. In this setting, the overall network can be split into an encoder stage that creates a latent representation of the input, and a decoder stage that is trained to predict the next symbol of the output based on the input sequence and the output generated thus far.

For language models covering a single language (or multiple languages, but outside the realm of machine translation), transformers can be trained to backfill symbols or words that have been masked away (“fill in the blanks”). This approach allows transformer networks to be trained in a self-supervised fashion: instead of costly data labeling, one can simply use a large corpus of relevant example data (such as a web crawl) and create individual training examples by masking away parts of the input and using the unmasked version as ground truth.
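
As a toy illustration of this idea, the snippet below builds “fill in the blanks” training pairs from unlabeled text. The whitespace tokenization and the [MASK] token are simplifications assumed for the example.

```python
import random

# Build "fill in the blanks" training pairs from unlabeled text:
# the masked sequence is the input, the original sequence is the target.
def make_masked_example(tokens, mask_prob=0.15, mask_token="[MASK]"):
    masked = [mask_token if random.random() < mask_prob else t for t in tokens]
    return masked, tokens    # (input, ground truth)

sentence = "the quick brown fox jumps over the lazy dog".split()
inp, target = make_masked_example(sentence)
print(inp)     # e.g. ['the', 'quick', '[MASK]', 'fox', ...]
print(target)  # ['the', 'quick', 'brown', 'fox', ...]
```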

A key building block of transformer networks is so-called attention, which allows the network to learn interdependencies between different elements of a sequence, even if those are not in direct proximity. This is a key differentiator from CNN architectures, where the neighborhood is fixed. In the case of natural language, attention allows transformer networks to learn grammatical dependencies within a sentence, semantic and contextual carryover across sentences, or the ability to rhyme. This ability to learn and represent long-distance relationships also sets transformers apart from recurrent networks.
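
The core computation can be written in a few lines. The sketch below shows scaled dot-product attention only, leaving out the learned query/key/value projections and the multi-head machinery of a full transformer.

```python
import torch
import torch.nn.functional as F

# Scaled dot-product attention: every position in the sequence can attend to
# every other position, regardless of distance. In a full transformer, q, k, v
# would come from learned projections of the input (omitted here).
def attention(q, k, v):
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # pairwise "relevance" of positions
    weights = F.softmax(scores, dim=-1)           # how strongly each position attends to each other one
    return weights @ v

seq_len, dim = 32, 64
q = k = v = torch.randn(1, seq_len, dim)          # self-attention over one sequence
out = attention(q, k, v)                          # shape: (1, 32, 64)
```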

For musical applications, transformers have been used to learn and generate melodic patterns and accompaniments (C.-Z. A. Huang et al. 2018). Transformers also serve as the foundation for several systems aiming to generate music end-to-end from textual prompts by way of “music language models”. Key examples include Google’s MusicLM (Agostinelli et al. 2023) and Meta’s MusicGen (Copet et al. 2023). MusicLM builds on an earlier transformer-based model called AudioLM (Borsos et al. 2022), which uses a hierarchical approach to represent semantic patterns, coarse acoustic patterns, and fine acoustic patterns. MusicLM extends AudioLM by conditioning the generation process on textual descriptions and other signals, such as melody. MusicGen, as well, uses a transformer-based decoder conditioned on a text or melody representation. Given the importance but also the complexity of these models, I plan to write a dedicated, separate article about music language models.

Music-specific Techniques

This section collects a few prominent techniques used for analyzing and synthesizing audio signals.

Spectral Processing via Short-Time Fourier Transform

A Fourier Transform (FT) converts a representation of a signal over time, say, as a sequence of sample values, into a representation in frequency space: amplitudes and phases of sinusoidal waves. However, the traditional FT is “fixed” in the sense that the full signal is converted into a single vector of frequency phases and amplitudes. This makes it difficult to apply an FT to a longer or even continuously playing audio signal, and interactive, real-time applications become impossible. For these kinds of applications, it is desirable to apply the FT to smaller fragments of the signal, but in a way that allows them to be properly “stitched together”. A way to do this is the Short-Time Fourier Transform (STFT).

The STFT applies the Fourier transform using a window sliding over a longer recording to create a spectrogram that captures the magnitude and phase of frequency components over time. In the discrete case, the data to be transformed is broken up into chunks or frames, which usually overlap each other to reduce artifacts at the boundaries. Each chunk is then Fourier transformed, and the complex results are appended to form a matrix that records magnitude and phase for each point in time and frequency. A specific variant of the STFT computes a mel spectrogram, which represents frequencies on the mel scale. The mel scale uses a unit of pitch such that equal distances on the scale sound equally distant to the listener, and it is particularly useful for applications that model how a sound is perceived by a human, for example speech recognition (Volkmann, Stevens, and Newman 1937).
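
As a concrete example, the snippet below computes an STFT and a mel spectrogram using the librosa library; the file name guitar.wav and the analysis parameters are assumptions made for illustration.

```python
import numpy as np
import librosa

# Load a mono recording ("guitar.wav" is a placeholder for any audio file).
y, sr = librosa.load("guitar.wav", sr=44100, mono=True)

# STFT: 2048-sample windows, hopping 512 samples at a time (75% overlap).
stft = librosa.stft(y, n_fft=2048, hop_length=512)
magnitude, phase = np.abs(stft), np.angle(stft)
print(magnitude.shape)   # (1025 frequency bins, number of frames)

# Mel spectrogram: the same analysis mapped onto perceptually spaced bands.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)
print(mel.shape)         # (128 mel bands, number of frames)
```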

With the STFT, music can be condensed into a representation that requires far fewer time points per second than raw audio. Therefore, quite a few systems use spectrograms obtained via STFT as the internal representation in which musical and sonic patterns are learned, and then transform that representation into high-fidelity audio using an appropriate decoder stage.

Resynthesis

Resynthesis approaches aim to derive control parameters for a given synthesis engine so that it reproduces the sounds presented during training of the underlying ML model. For example, an ML model can be constructed to derive parameters for a concrete synthesizer engine, such as the Yamaha DX7, in order to match example sounds provided as audio data (Caspe, McPherson, and Sandler 2022). Effectively, the ML model is learning the inverse synthesis function, which maps the resulting audio back to synthesis parameters. For such approaches to work, it is necessary to have a differentiable mathematical model of the synthesis process, which can be plugged into the optimization process underlying ML model training.
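
To illustrate the core idea on a toy scale, the snippet below fits the amplitudes of a small additive “synthesizer” by gradient descent so that its output matches a target tone. Real systems such as DDX7 instead train a network to predict parameters of a full FM synthesizer and typically use spectral rather than sample-wise losses; everything below is a simplified stand-in.

```python
import torch

sr, duration = 16000, 0.5
t = torch.arange(int(sr * duration)) / sr
f0 = 220.0

# "Target" sound: a harmonic tone with known amplitudes (stands in for a recording).
true_amps = torch.tensor([1.0, 0.5, 0.25, 0.125])
harmonics = torch.stack([torch.sin(2 * torch.pi * f0 * (k + 1) * t) for k in range(4)])
target = (true_amps[:, None] * harmonics).sum(dim=0)

# Differentiable "synthesizer": the same oscillator bank with learnable amplitudes.
amps = torch.nn.Parameter(torch.rand(4))
opt = torch.optim.Adam([amps], lr=0.05)

for step in range(200):
    pred = (amps[:, None] * harmonics).sum(dim=0)   # render audio from current parameters
    loss = torch.nn.functional.mse_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()

print(amps.detach())   # converges toward [1.0, 0.5, 0.25, 0.125]
```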

Differentiable Digital Signal Processing

A specific kind of synthesis is additive synthesis, where the amplitude of a fundamental frequency and the amplitudes of its harmonics are controlled directly. Differentiable Digital Signal Processing (DDSP) is a technique proposed by Engel et al. (2020) that combines additive synthesis with filtered noise. Within this framework, the ML model is tasked with predicting a fundamental frequency and the relative amplitudes of the harmonics for the additive part, alongside amplitude and filter parameters for the noise component. More specifically, the ML model generates time series for all of these parameters to describe the evolution of a sound over time.
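
The sketch below implements the signal model itself, a bank of harmonic oscillators plus enveloped noise, with hand-crafted parameter curves standing in for what the ML model would predict. It is a simplification of the actual DDSP implementation, which, among other things, applies a learned time-varying filter to the noise.

```python
import numpy as np

sr, duration, n_harmonics = 16000, 1.0, 8
t = np.arange(int(sr * duration)) / sr

# In DDSP these time series would be predicted by the ML model;
# here they are hand-crafted stand-ins.
f0 = np.full_like(t, 220.0)                       # fundamental frequency over time
harm_amps = np.stack([np.exp(-3 * t) / (k + 1) for k in range(n_harmonics)])  # decaying partials
noise_amp = 0.05 * np.exp(-5 * t)                 # noise envelope

# Additive part: sum of harmonics with time-varying amplitudes.
phase = 2 * np.pi * np.cumsum(f0) / sr
harmonic_part = sum(harm_amps[k] * np.sin((k + 1) * phase) for k in range(n_harmonics))

# Noise part: white noise shaped by an envelope.
noise_part = noise_amp * np.random.randn(len(t))

audio = harmonic_part + noise_part
```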

Predicting Sample-by-Sample

The most general and fundamental approach to generating audio is to have an ML model create a direct sequence of signal amplitudes (“samples”) over time. For high-quality output, this implies that the model needs to create a series of amplitude values at a rate of 44.1 kHz or higher. The most prominent method in this category is probably WaveNet (van den Oord et al. 2016), which was proposed as, and has been used as, the foundation for speech synthesis engines. In practical implementations, output amplitudes are commonly represented using a one-hot encoding. This means that, say, an amplitude range of 0 to 255 is represented using a binary vector with 256 components, where only the component corresponding to the selected value is set to 1, while all other components are set to zero. For a higher resolution, such as 16 bits with 2^16 = 65,536 different values, this would require the ML model to create output vectors of dimension 2^16, which is computationally expensive and more difficult to train. For this reason, such approaches tend to predict amplitudes with less precision (and thus a higher noise floor). A common compromise is to use a non-linear scale to represent amplitude, such as the logarithmic µ-law companding used in WaveNet.
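
The snippet below sketches µ-law companding followed by one-hot encoding, in the spirit of WaveNet-style models; the helper functions are illustrative and not taken from any particular library.

```python
import numpy as np

# Mu-law companding: a logarithmic mapping that spends the 256 available
# levels where the ear is most sensitive, instead of spacing them linearly.
def mu_law_encode(x, mu=255):
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # [-1, 1] -> [-1, 1]
    return ((compressed + 1) / 2 * mu).astype(np.int64)                 # -> integers 0..255

def one_hot(indices, num_classes=256):
    out = np.zeros((len(indices), num_classes))
    out[np.arange(len(indices)), indices] = 1.0      # exactly one component set per sample
    return out

samples = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # a 440 Hz test tone
codes = mu_law_encode(samples)          # 256 discrete amplitude levels per sample
targets = one_hot(codes)                # shape: (16000, 256) -- what the model predicts
```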

Next

Having discussed key techniques in current AI and music processing, I will focus on applications and implications of all of this in the next part of this series. Stay tuned!

References

Agostinelli, Andrea, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, et al. 2023. “MusicLM: Generating Music From Text.” arXiv [cs.SD]. arXiv. http://arxiv.org/abs/2301.11325.

Borsos, Zalán, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. 2022. “AudioLM: A Language Modeling Approach to Audio Generation.” arXiv [cs.SD]. arXiv. http://arxiv.org/abs/2209.03143.

Caillon, Antoine, and Philippe Esling. 2021. “RAVE: A Variational Autoencoder for Fast and High-Quality Neural Audio Synthesis.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2111.05011.

Caspe, Franco, Andrew McPherson, and Mark Sandler. 2022. “DDX7: Differentiable FM Synthesis of Musical Instrument Sounds.” arXiv [cs.SD]. arXiv. http://arxiv.org/abs/2208.06169.

Copet, Jade, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Défossez. 2023. “Simple and Controllable Music Generation.” arXiv [cs.SD]. arXiv. http://arxiv.org/abs/2306.05284.

Engel, Jesse, Lamtharn Hantrakul, Chenjie Gu, and Adam Roberts. 2020. “DDSP: Differentiable Digital Signal Processing.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2001.04643.

Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Networks.” arXiv [stat.ML]. arXiv. https://arxiv.org/abs/1406.2661.

Huang, Cheng-Zhi Anna, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. 2018. “Music Transformer.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1809.04281.

Huang, Qingqing, Aren Jansen, Joonseok Lee, Ravi Ganti, Judith Yue Li, and Daniel P. W. Ellis. 2022. “MuLan: A Joint Embedding of Music Audio and Natural Language.” arXiv [eess.AS]. arXiv. http://arxiv.org/abs/2208.12415.

Kingma, Diederik P., and Max Welling. 2013. “Auto-Encoding Variational Bayes.” arXiv [stat.ML]. arXiv. http://arxiv.org/abs/1312.6114v11.

Liu, Haohe, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D. Plumbley. 2023. “AudioLDM: Text-to-Audio Generation with Latent Diffusion Models.” arXiv [cs.SD]. arXiv. http://arxiv.org/abs/2301.12503.

Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. “WaveNet: A Generative Model for Raw Audio.” arXiv [cs.SD]. arXiv. http://arxiv.org/abs/1609.03499.

Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. n.d. Latent-Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models. GitHub. Accessed April 26, 2023. https://github.com/CompVis/latent-diffusion.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 5998–6008. Curran Associates, Inc. Also available at http://arxiv.org/abs/1706.03762.

Volkmann, J., S. S. Stevens, and E. B. Newman. 1937. “A Scale for the Measurement of the Psychological Magnitude Pitch.” The Journal of the Acoustical Society of America 8 (3): 208–208.

Wright, Alec, Eero-Pekka Damskägg, Lauri Juvela, and Vesa Välimäki. 2020. “Real-Time Guitar Amplifier Emulation with Deep Learning.” NATO Advanced Science Institutes Series E: Applied Sciences 10 (3): 766.

Zhao, Yi, Xin Wang, Lauri Juvela, and Junichi Yamagishi. 2019. “Transferring Neural Speech Waveform Synthesizers to Musical Instrument Sounds Generation.” arXiv [eess.AS]. arXiv. http://arxiv.org/abs/1910.12381.
