Neural Nets for Generating Music
Algorithmic music composition has developed a lot in the last few years, but the idea has a long history. In some sense, the first automatic music came from nature: Chinese windchimes, ancient Greek wind-powered Aeolian harps, or the Japanese water instrument suikinkutsu. But in the 1700s automatic music became “algorithmic”: Musikalisches Würfelspiel, a game that generates short piano compositions from fragments, with choices made by dice.
Markov chains, formalized in the early 1900s to model probabilistic systems, can also be used to generate new musical compositions. They take the motivations behind the dice game a step further, in two ways. First, Markov chains can be built from existing material rather than needing fragments explicitly composed as interchangeable components. Second, instead of assuming fragments have equal probabilities, Markov chains encode the variation in probabilities with respect to context.
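To make those two differences concrete, here is a minimal sketch of a first-order Markov chain over note names. The toy melody and the function names (`build_transitions`, `generate`) are made up for illustration and are not taken from any of the systems discussed here.

```python
import random
from collections import Counter, defaultdict

def build_transitions(notes):
    """Count how often each note is followed by each other note."""
    counts = defaultdict(Counter)
    for current, following in zip(notes, notes[1:]):
        counts[current][following] += 1
    return counts

def generate(counts, start, length, rng):
    """Walk the chain, sampling each next note in proportion to its count."""
    sequence = [start]
    while len(sequence) < length:
        options = counts.get(sequence[-1])
        if not options:  # dead end: this note was never followed by anything
            break
        choices, weights = zip(*options.items())
        sequence.append(rng.choices(choices, weights=weights)[0])
    return sequence

# Hypothetical toy melody, not taken from any real piece.
melody = ["C", "D", "E", "C", "D", "G", "E", "C", "D", "E", "G", "C"]
transitions = build_transitions(melody)
generated = generate(transitions, "C", 16, random.Random(0))
print(" ".join(generated))
```

The table is built directly from existing material, and the counts act as context-dependent probabilities: in this toy melody C is always followed by D, while D is followed by E twice as often as by G.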
Iannis Xenakis used Markov chains in his 1958 compositions, “Analogique”. He describes his process in “Formalized Music: Thought and Mathematics in Composition”, down to the details of transition matrices that define the probabilities of certain notes being produced.
In 1981 David Cope began working with algorithmic composition to solve his writer’s block. He combined Markov chains and other techniques (musical grammars and combinatorics) into a semi-automatic system he calls Experiments in Musical Intelligence, or Emmy. David cites Iannis Xenakis and Lejaren Hiller (Illiac Suite 1955, Experimental Music 1959) as early inspirations, and he describes Emmy in papers, patents, and even source code on GitHub. Emmy is most famous for learning from and imitating other composers.
While Markov chains trained on a set of compositions can only produce subsequences that also exist in the original data, recurrent neural networks (RNNs) attempt to extrapolate beyond those exact subsequences. In 1989 the first attempts to generate music with RNNs, developed first by Peter M. Todd, then Michael C. Mozer and others, were limited by their short-term coherence.
In 2002 Doug Eck updated this approach by switching from standard RNN cells to “long short-term memory” (LSTM) cells. Doug used his architecture to improvise blues based on a short recording. He writes, “Remarkably […] LSTM is able to play the blues with good timing and proper structure as long as one is willing to listen.”
A big leap in compositional complexity came out of Magenta in September 2018 with Music Transformer by Huang et al. Unlike Performance RNN, the samples from Music Transformer do not succumb to chaos after the first few measures. The authors trained on Bach chorales (without dynamics) as well as piano competition data (with dynamics).
One of the recurring difficulties encountered when training these systems is deciding on a representation of music. Designing an encoding for an RNN might start with a metaphor of text: the RNN is processing a sequence of states (letters) unfolding over time or space (the page). But unlike text, a single moment in music can contain more than one symbol: it can be a chord, or it can have a combination of qualities that is best described by its components. There can also be long durations of silence, or states can have wildly varying lengths. These differences may be resolved by carefully crafting the representation, or by heavily augmenting the dataset and designing the architecture with the capacity to learn all the invariances.
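As one illustration of "carefully crafting the representation", here is a sketch of an event-based encoding that flattens chords, silences, and notes of different lengths into a single text-like token sequence. The token names (`NOTE_ON_`, `NOTE_OFF_`, `SHIFT_`) and the quarter-beat grid are assumptions chosen for this example. It is only one of many possible schemes, not the encoding used by any particular system mentioned here.

```python
def encode(notes, step=0.25):
    """Flatten (start, duration, pitch) notes into one token sequence.

    Times are in beats and are quantized to multiples of `step`.
    """
    # Each event is (tick, order, pitch, token); order 0 sorts note-offs
    # before note-ons that fall on the same tick.
    events = []
    for start, duration, pitch in notes:
        on_tick = round(start / step)
        off_tick = round((start + duration) / step)
        events.append((on_tick, 1, pitch, f"NOTE_ON_{pitch}"))
        events.append((off_tick, 0, pitch, f"NOTE_OFF_{pitch}"))
    events.sort()

    tokens = []
    now = 0
    for tick, _, _, token in events:
        if tick > now:  # advance the clock before emitting the next event
            tokens.append(f"SHIFT_{tick - now}")
            now = tick
        tokens.append(token)
    return tokens

# Hypothetical input: a C major chord held for one beat, a beat of silence,
# then a single note held for four beats.
notes = [(0.0, 1.0, 60), (0.0, 1.0, 64), (0.0, 1.0, 67), (2.0, 4.0, 72)]
tokens = encode(notes)
print(tokens)
```

The chord becomes three consecutive tokens with no time shift between them, while the silence and the long note are both expressed through `SHIFT_` tokens, so a sequence model never has to emit more than one symbol per step.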
Another significant challenge with data-driven algorithmic composition is: what data to use? Whose music counts? When any automated creative system needs to be trained on a large number of cultural artifacts, it can only perpetuate the dominance of what is already well-documented. In music, this means a lot of Bach, Beethoven, and other old white European men. (Two exceptions: some English and Irish folk music, and some video game music.) The data is also selected by machine learning researchers, who are also a relatively homogeneous group (though decreasingly so).
While LSTMs and Transformers manage to maintain long-term consistency better than a standard RNN or Markov chain, there is still a gap between generating shorter phrases and generating an entire composition, one that has not yet been bridged without lots of tricks and hand-tuning. Startups like Jukedeck, Aiva, Amper, and others are trying to fill this space of on-demand, hand-tuned formulaic generative music. Some go so far as to produce entire pop albums as marketing. Big companies are getting in on the action, too. François Pachet, formerly at Sony Computer Science Laboratories and now at Spotify, has been working with algorithmic music for some time, from his Continuator to the more recent Flow Machines.
Eduardo Reck Miranda, a composer and researcher previously at Sony CSL, has released an entire album of “computer-aided symphonic works” called “Mind Pieces, Sound to Sea” through an otherwise traditional label specializing in classical and jazz. While the technologies behind groups like Sony CSL are proprietary, we can make some guesses based on the researchers involved. For example: it’s likely that Flow Machines has continued with the same approach as Continuator, more akin to David Cope than Doug Eck. (But for RNN-based approaches to “duets” and “continuations”, check out Deep Musical Dialogue by Mason Bretan, and AI Duet by Magenta.)
At IBM the Watson team has developed a system called Watson Beat that can produce complete tracks in a limited number of styles, based on a melodic prompt.
Other researchers on the Watson team have worked with Alex Da Kid to suggest themes and inspiration for music based on data mined from social media and culture.
Dice games, Markov chains, and RNNs aren’t the only ways to make algorithmic music. Some machine learning practitioners explore alternative approaches like hierarchical temporal memory, or principal components analysis. But I’m focusing on neural nets because they are responsible for most of the big changes recently. (Though even within the domain of neural nets there are some directions I’m leaving out that have fewer examples, such as restricted Boltzmann machines for composing 4-bar jazz licks, short variations on a single song, or hybrid RNN-RBM models, or hybrid autoencoder-LSTM models, or even neuroevolutionary strategies.)
The power of RNNs wasn’t common knowledge until Andrej Karpathy’s viral post “The Unreasonable Effectiveness of Recurrent Neural Networks” in May 2015. Andrej showed that a relatively simple neural network called char-rnn could reliably recreate the “look and feel” of any text, from Shakespeare to C++. The same way that the popularity of dice games was buoyed by a resurgence of rationalism and interest in mathematics, Andrej’s article came at a time when interest in neural networks was exploding, triggering a renewed interest in recurrent networks. Some of the first people to test Andrej’s code applied it to symbolic music notation.
Christian Walder uses LSTMs in a more unusual way: starting with a pre-defined rhythm, and asking the neural net to fill in the pitches. This provides a lot of the global structure that is otherwise usually missing, but heavily constrains the possibilities.
While all the examples so far are based on symbolic representations of music, some enthusiasts pushed char-rnn to its limits by feeding it raw audio.
Unfortunately it seems that char-rnn is fundamentally limited in its capacity to abstract higher level representations of raw audio. The most inspiring results on audio turned out to be nothing more than noisy copies of the source material (some people explain this when sharing their work, see SomethingUnreal modeling his own speech). In machine learning this is related to the concept of “overfitting”: when a model can recreate the training data faithfully, but can’t effectively generalize to anything novel that it hasn’t been trained on. During training, the model initially performs poorly on both the training data and novel data, then it starts to perform better on both. But if you let it train too long, it gets better at recreating the training data at the expense of generalizing to novel data. Researchers stop the training just before hitting that point. But overfitting is not so clearly a “problem” in creative contexts, where recombination of existing material is a common strategy that is hard to distinguish from “generalization”. Some people like David Cope go so far as to say “all music [is] essentially inspired plagiarism” (but he has also been accused of publishing pseudoscience and straight-up plagiarism).
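The practice of stopping "just before that point" is usually called early stopping. Here is a minimal sketch of the logic, assuming we record the loss on held-out (novel) data after every epoch. The loss values, the `patience` parameter, and the function name are invented for illustration.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the epoch whose weights we would keep.

    Training is abandoned once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # held-out loss keeps rising: the model is overfitting
    return best_epoch

# Hypothetical held-out losses: improving at first, then getting worse.
validation_losses = [2.1, 1.6, 1.3, 1.2, 1.3, 1.5, 1.8]
keep = early_stopping_epoch(validation_losses)
print(keep)
```

In a creative context you might deliberately ignore this signal and keep training, which is one way to get the "noisy copies" described above.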
Instead of using a recurrent network to learn representations over time, the WaveNet authors used a convolutional network. Convolutional networks learn combinations of filters. They’re normally used for processing images, but WaveNet treats time like a spatial dimension.
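To give a feel for what "treating time like a spatial dimension" means, here is a sketch of a one-dimensional causal convolution, where each output sample depends only on the current and earlier input samples. The WaveNet paper describes stacks of dilated causal convolutions; this example keeps only that one idea and leaves out everything else (learned weights, gating, residual connections). The fixed averaging filters are assumptions for illustration.

```python
def causal_conv1d(signal, kernel, dilation=1):
    """Filter so that output[t] depends only on signal[t], signal[t - d], ..."""
    output = []
    for t in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            index = t - k * dilation
            if index >= 0:  # samples before the start are treated as zero
                total += weight * signal[index]
        output.append(total)
    return output

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
# Hypothetical fixed filters; a real network would learn these weights.
layer1 = causal_conv1d(signal, [0.5, 0.5], dilation=1)
layer2 = causal_conv1d(layer1, [0.5, 0.5], dilation=2)
print(layer2)
```

Stacking the second layer with a dilation of 2 means each of its outputs sees four input samples instead of two, which is how a stack of such layers can cover a long stretch of audio without recurrence.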
Looking into the background of the co-authors, there are some interesting predecessors to WaveNet.
- Sander Dieleman is first author on End-to-end learning for music audio (2014), a rare and early example of processing raw audio sample-by-sample with a neural net; in this case for genre classification (first use of neural nets for this task was five years earlier).
- Aäron van den Oord is first author on Pixel Recurrent Neural Networks (2016), introducing networks that generate images pixel-by-pixel.
- Alex Graves, besides having a long history working with speech and recurrent neural networks, showed a demo of end-to-end trained neural net generated synthetic speech in March 2015.
One of my favorite things to emerge from the WaveNet research is this rough piano imitation by Sageev Oore, who was on sabbatical at Google Brain at the time.
In April 2017, Magenta built on WaveNet to create NSynth, a model for analyzing and generating monophonic instrument sounds. They created an NSynth-powered “Sound Maker” experiment in collaboration with Google Creative Lab New York. I worked with the Google Creative Lab in London to build NSynth into an open-source portable MIDI synthesizer, called “NSynth Super”.
In February 2017 a team from Montreal led by Yoshua Bengio published SampleRNN (with code) for sample-by-sample generation of audio using a set of recurrent networks in a hierarchical structure. This research was influenced by experiments from Ishaan Gulrajani who trained a hierarchical version of char-rnn on raw audio.
Both SampleRNN and WaveNet take an unusually long time to train (more than a week), and without optimizations (like fast-wavenet) they are many times slower than realtime for generation. To reduce the training and generation time, researchers use audio at 16 kHz and 8 bits.
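One common way to squeeze audio into 8 bits is mu-law companding, which the WaveNet paper describes: quiet samples get finer steps than loud ones before rounding to 256 levels. The sketch below assumes input samples are floats in [-1, 1]; the function names are my own, and other systems may quantize differently.

```python
import math

MU = 255  # 256 levels, i.e. 8 bits

def mu_law_encode(x):
    """Map a sample in [-1, 1] to an integer code in [0, 255]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((compressed + 1) / 2 * MU))

def mu_law_decode(code):
    """Map an integer code in [0, 255] back to an approximate sample."""
    compressed = 2 * code / MU - 1
    magnitude = math.expm1(abs(compressed) * math.log1p(MU)) / MU
    return math.copysign(magnitude, compressed)

sample = 0.3
code = mu_law_encode(sample)
restored = mu_law_decode(code)
print(code, restored)
```

With 256 possible codes, predicting the next sample becomes a 256-way classification problem, which is much cheaper than predicting one of 65,536 values for 16-bit audio.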
But for companies like Google or Baidu, the primary application of audio generation is text to speech, where fast generation is essential. In March 2017 Google published their Tacotron research, which generates audio frame-by-frame using a spectral representation as an intermediate output step and a sequence of characters (text) as input.
The Tacotron demo samples are similar to WaveNet, with some small discrepancies. In May 2017, Baidu built on the Tacotron architecture with their Deep Voice 2 research, increasing the audio quality by adding some final stages specific to speech generation. Because generating audio from amplitude spectra requires a phase reconstruction step, the quality of polyphonic and noisy audio from this approach can be limited. But this hasn’t stopped folks like Dmitry Ulyanov from using spectra for audio stylization, while Leon Fedden, Memo Akten and Max Frenzel have used spectra for generation. For phase reconstruction, Tacotron, Dmitry and Max use Griffin-Lim, while Leon and Memo use LWS. Leon, Memo and Max all use an autoencoder to build a latent space across spectrograms.
Besides Dmitry, other researchers who have looked into style transfer include Parag Mital in November 2017 (focused on audio stylization) and Mor et al. in May 2018 (focused on musical style transfer across instruments/genres). For more early work on audio style transfer with only concatenative synthesis, “Audio Analogies” (2005) provides a lot of inspiration.
In November 2017, DeepMind published their “Parallel WaveNet” technique where a slow-to-train WaveNet teaches a fast-to-generate student. Instead of predicting a 256-way 8-bit output, they use a discretized mixture of logistics (DMoL), which allows for 16-bit output. Google immediately started using Parallel WaveNet in production. In December 2017, Google published Tacotron 2 using a parallel WaveNet as the synthesis (vocoder) step instead of Griffin-Lim phase reconstruction. This kicked off a wave of papers focusing on speech synthesis conditioned on mel spectra, including ClariNet (which also introduces an end-to-end text-to-wave architecture), WaveGlow and FloWaveNet. In October 2018, Google published a controllable version of their Tacotron system, allowing them to synthesize voice in different styles (something they proposed in the original Tacotron blog post). There is a wealth of other research related to speech synthesis, but it isn’t always relevant to the more general task of generating audio in a musical context.
In February 2018, DeepMind published “Efficient Neural Audio Synthesis” or “WaveRNN” which solves fast generation using a handful of optimizations. Instead of using DMoL outputs, they achieve 16-bit output by using two separate 8-bit outputs: one for the high bits, and one for the low bits.
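The high/low split is simple bit arithmetic. Here is a sketch of it, assuming samples have already been shifted into the unsigned range 0 to 65535. The function names are hypothetical and this shows only the representation, not how the network predicts the two halves.

```python
def split_sample(sample):
    """Split an unsigned 16-bit sample into (coarse, fine) 8-bit halves."""
    if not 0 <= sample <= 0xFFFF:
        raise ValueError("expected an unsigned 16-bit value")
    return sample >> 8, sample & 0xFF

def join_sample(coarse, fine):
    """Recombine the two 8-bit halves into the original 16-bit sample."""
    return (coarse << 8) | fine

coarse, fine = split_sample(0xABCD)
print(coarse, fine, join_sample(coarse, fine))
```

Two 256-way outputs cover the same 65,536 values as a single 16-bit output, but each prediction stays as small as the 8-bit case.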
Where might this research head next?
One domain that seems underexplored is corpus-based synthesis (granular or concatenative) combined with frame-level representations. Concatenative synthesis is common in speech synthesis (where it’s called “unit selection”). These techniques also have a long history in sound design for texture synthesis with tools like CataRT. One significant limitation of this sort of corpus-based approach is that it’s impossible to generate a “moment” of audio that never appeared in your original corpus. If you trained a corpus-based model on all of Bach, and Bach never wrote a C minor major 7th chord, then you will never be able to generate a C minor major 7th. Even if the model learns how to produce each of the notes in the chord, and even if it learns how to represent the corresponding frame, you won’t have source material to sample for synthesis. To overcome this constraint, perhaps there is something waiting to be discovered at the intersection of frame-by-frame granular modeling and research on audio decomposition/factorization.
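The limitation is easy to see in a toy version of unit selection, sketched below. Each "frame" is a made-up three-number feature vector standing in for a short slice of audio, and synthesis just picks the nearest corpus frame for each target. Real systems like CataRT use richer descriptors and selection costs; the names and data here are assumptions for illustration.

```python
def nearest_unit(target, corpus):
    """Return the corpus frame closest to `target` by squared distance."""
    def distance(frame):
        return sum((a - b) ** 2 for a, b in zip(frame, target))
    return min(corpus, key=distance)

def synthesize(targets, corpus):
    """Replace every target frame with its nearest corpus frame."""
    return [nearest_unit(target, corpus) for target in targets]

# Hypothetical corpus: three frames, each containing a single "note".
corpus = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
# The second target is a blend that does not exist anywhere in the corpus.
targets = [(0.9, 0.1, 0.0), (0.4, 0.4, 0.6)]
output = synthesize(targets, corpus)
print(output)
```

However well a model describes the blended target, the output can only ever be one of the three frames in the corpus.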
In terms of the research approach, I see at least two recurring questions. First, what kind of representations should we use? Should we treat sound as individual samples, as spectral frames with mostly monophonic tonal content, as pitches in a grid, as properties of a vocal synthesizer? How much domain-specific knowledge should we embed into our representation of sound? And second, how do we want to interact with these systems? Do we want them to learn from the entire documented history of music with a vague goal of producing something similar, or something novel? To construct entire compositions, or to improvise with us? I’m wary of anyone who suggests that there is only one answer to these questions, and if anything we need to expand our imagination in terms of sound representation and modes of interaction.
I’ve noticed the more “accessible” algorithmic compositions are likely to trigger the question from journalists: “does this make human musicians obsolete?” Usually the researchers say they’re “not trying to replace humans”, but they’re trying to “build new tools”, or they encourage musicians to “think of the algorithms as collaborators”. Talking about creative AI as “augmenting” the human creative process feels reassuring. But is there any reason that an AI won’t eventually create a pop hit from scratch? Or not a pop hit, but just one of your favorite songs? I think the big question is less about whether human artists and musicians are obsolete in the face of AI, and more about what work we will accept as “art”, or as “music”. Maybe your favorite singer-songwriter can’t be replaced because you need to know there is a human behind those chords and lyrics for it to “work”. But when you’re dancing to a club hit you don’t need a human behind it, you just need to know that everyone else is dancing too.
There’s also an opportunity here to look beyond traditional models for what makes music “work”. Vocaloids like Hatsune Miku have shown that a virtual persona backed by a vocal synthesizer can bring together millions of people in a massively crowdsourced act of composition and listening. Music is probably older than language, but we’re still discovering all the things music can be, and all the ways it might be crafted.
Thanks to Lauren McCarthy, Brian Whitman, Kyle Kastner, and Parag Mital for feedback.