Music Generation with the Help of AI

Elucidate AI · Jun 23, 2021

Artificial Intelligence (AI) is being used to improve various aspects of our lives, even if we might not always be aware of it. The Spotify playlist that was recommended to you today. Siri providing you with the weather outlook for the week. Using your face to unlock your iPhone. These are all examples of how AI is making our lives easier and more productive. The success of AI in areas such as image recognition, natural language processing and personalisation has inspired us to investigate the full breadth of its applications. One area of application relates to an art form that is older than language itself: music. Music generation is not one of the fields where AI performs best (at least not yet), but it is one of the most interesting fields in which it is being applied.

Music as a Sequence of Notes

If you break it down to its raw form, music (or at least the melodies it is composed of) can be represented by a series of notes. There are 12 different notes that you can play (A, A#/Bb, B, C, C#/Db, D, D#/Eb, E, F, F#/Gb, G, G#/Ab). You can think of these as 12 different keys on a conventional piano, where the ‘#/b’ (sharp/flat) denote the black keys.

These notes can all be played at different pitches (or octaves). Shifting a note up by an octave involves doubling the frequency of its waveform. Since the range of human hearing is approximately 20 Hz to 20 kHz, our range spans approximately log(20,000/20)/log(2) ≈ 10 octaves. Hence, we can represent a melody as a sequence of notes, with 10 × 12 = 120 options for each position in the sequence, although in practice it is possible to use far fewer (e.g., a full-sized piano has only 88 keys).
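To make this concrete, here is a minimal Python sketch (not from the original article) of one way to encode a note name and octave as a single integer out of the 120 options, and to compute its approximate frequency from the doubling-per-octave rule; the 27.5 Hz reference is the standard tuning of the lowest A on a piano.

NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def note_index(name: str, octave: int) -> int:
    """Encode a (note, octave) pair as one of 12 * 10 = 120 integers.
    Octaves are counted from the lowest A for simplicity, so the numbering
    differs slightly from scientific pitch notation."""
    return octave * 12 + NOTE_NAMES.index(name)

def frequency(name: str, octave: int, base_a_hz: float = 27.5) -> float:
    """Approximate frequency: each octave doubles the base frequency,
    and each semitone multiplies it by 2 ** (1 / 12)."""
    semitones = octave * 12 + NOTE_NAMES.index(name)
    return base_a_hz * 2 ** (semitones / 12)

print(note_index("C", 4))        # -> 51
print(round(frequency("A", 4)))  # -> 440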

[Figure: the names of all the notes on a piano keyboard]

Music Generation through AI

So how can we get an AI to produce novel melodies/music for us?

One way is to build a model that — given a particular note — predicts the likelihood or probability of the next note (from e.g., a choice of 120 different options). Given the probability distribution over these 120 different options, you can then use random sampling to choose the next note. This process can be carried out iteratively to produce a full sequence or melody.
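Sketched in Python, and with a hypothetical next_note_probabilities function standing in for a trained model, the generation loop might look something like this:

import random

NUM_NOTES = 120  # 12 note names x 10 octaves, as above

def next_note_probabilities(current_note: int) -> list[float]:
    # Hypothetical placeholder: a trained model would return a probability
    # distribution over the 120 possible next notes here.
    return [1.0 / NUM_NOTES] * NUM_NOTES

def generate_melody(start_note: int, length: int) -> list[int]:
    melody = [start_note]
    for _ in range(length - 1):
        probs = next_note_probabilities(melody[-1])
        # Sample the next note at random, weighted by the model's probabilities.
        next_note = random.choices(range(NUM_NOTES), weights=probs, k=1)[0]
        melody.append(next_note)
    return melody

print(generate_melody(start_note=51, length=16))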

But how does the model know which notes are more likely?

The model needs to be trained using a large sample of music or melodies. The music needs to be processed into a form that is intelligible to the computer such as an encoded series of MIDI inputs or notes. The training set needs to be based on the genre of music that you want to generate. If you want the AI to generate classical music, you need to train it with classical music samples. Through the training process, the model will learn the typical notes/keys used in that genre, and which notes are more likely to follow one another.
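As an illustration, one possible preprocessing step (a sketch using the open-source pretty_midi library, which the article does not prescribe) is to flatten each MIDI file in the training set into a sequence of pitch numbers:

import pretty_midi

def midi_to_pitch_sequence(path: str) -> list[int]:
    """Read a MIDI file and return its notes as a flat sequence of pitches (0-127)."""
    midi = pretty_midi.PrettyMIDI(path)
    notes = []
    for instrument in midi.instruments:
        if instrument.is_drum:
            continue  # skip percussion tracks, which carry no melodic pitch
        notes.extend(instrument.notes)
    # Order the notes by start time so the sequence follows the melody.
    notes.sort(key=lambda note: note.start)
    return [note.pitch for note in notes]

# Example usage (hypothetical file name):
# sequence = midi_to_pitch_sequence("nocturne.mid")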

[Figure: MIDI input; MIDI/notes can be encoded and modelled]

The choice of model plays a large role in how effectively the above can be implemented and numerous architectures have been tested over the years — to varying degrees of success. We briefly cover three broad classes here:

  • [Hidden] Markov Models (HMMs)
  • Recurrent Neural Networks (RNNs)
  • Generative Adversarial Networks (GANs)

[Hidden] Markov Models

The work of Hiller and Isaacson (two American professors), leading to the completion of the Illiac Suite string quartet in 1957, is often credited as the first extensive use of algorithms to create music. The musical establishment’s reaction to their creation was one of hostility, as the consensus was that it undermined human creativity. When it comes to AI, one of the prevailing fears (whatever the field) is that it will perform better than humans and hence replace them.

The two professors programmed a computer to compose music, and the process relied on generating probabilities via a Markov model. A Markov model is a stochastic model of a system that transitions between states. The Markov property assumes that the probability of future states depends only on the current state.

An extension of this is a Hidden Markov Model (HMM), where the observable sequence is generated through an underlying (hidden) process which is assumed to have the Markov property. One of the major limitations cited in using HMMs to generate music is the lack of global structure and melodic progression in the composed pieces. Additionally, HMMs can only produce sub-sequences that also exist in the original data. To create novel sequences, we need to explore a class of models that has risen to prominence in the last decade or so — neural networks.
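For intuition, a plain first-order Markov chain over notes (a simplified sketch rather than a full HMM) can be trained by counting which note follows which in the training melodies, and then sampling from those observed transitions:

import random
from collections import defaultdict

def train_markov(melodies: list[list[int]]) -> dict[int, list[int]]:
    """Record, for each note, every note that followed it in the training data."""
    transitions = defaultdict(list)
    for melody in melodies:
        for current, following in zip(melody, melody[1:]):
            transitions[current].append(following)
    return transitions

def sample_melody(transitions, start: int, length: int) -> list[int]:
    melody = [start]
    for _ in range(length - 1):
        options = transitions.get(melody[-1])
        if not options:
            break  # dead end: this note never had a successor in the training data
        # Picking uniformly from the observed successors reproduces the
        # empirical transition probabilities.
        melody.append(random.choice(options))
    return melody

# Toy usage with two short training "melodies" of MIDI pitches:
transitions = train_markov([[60, 62, 64, 62, 60], [60, 64, 65, 64, 60]])
print(sample_melody(transitions, start=60, length=8))

The dead-end check above also hints at the limitation mentioned earlier: the chain can only emit transitions it has already seen in the training data.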

Recurrent Neural Networks (RNNs)

A recurrent neural network (RNN) is a class of artificial neural networks designed to process sequential information such as text (or indeed music). RNNs perform the same function for every element of a sequence, with each result depending on previous computations. This contrasts with traditional feed-forward neural networks, where outputs are independent of previous computations. RNNs have had great success in applications such as speech recognition, language translation and sentiment analysis.

RNNs are built to have a concept of ‘memory’: as they process each element of a sequence, they take into account the elements that came before it. This is critical when generating music, as key elements such as harmony and rhythm involve considering more than just one note at a time. However, one drawback of RNNs is that they suffer from the vanishing gradient problem.

[Figure: typical RNN architecture; inputs (x) are fed into the model sequentially, along with the previous outputs (c)]

This problem is addressed through the use of Long Short-Term Memory (LSTM) networks. LSTMs are a type of RNN that can learn efficiently via gradient descent; they use gating mechanisms to recognize and encode long-term patterns. They are therefore extremely useful for problems where the network needs to remember information for a long period of time (e.g., in music generation). For a more detailed explanation of how LSTMs solve the RNN gradient problem, please see this article.
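For illustration only, a minimal next-note LSTM could be set up in Keras as follows (the vocabulary size, window length, layer sizes and dummy training data are arbitrary choices for the sketch, not values from the article):

import numpy as np
import tensorflow as tf

NUM_NOTES = 120       # size of the note vocabulary, as discussed above
SEQUENCE_LENGTH = 32  # how many previous notes the model sees at once

# Predict the next note from a fixed-length window of previous notes.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=NUM_NOTES, output_dim=64),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(NUM_NOTES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Dummy training data: random note windows and their "next" notes, standing in
# for windows sliced out of real encoded melodies.
x = np.random.randint(0, NUM_NOTES, size=(1000, SEQUENCE_LENGTH))
y = np.random.randint(0, NUM_NOTES, size=(1000,))
model.fit(x, y, epochs=1, batch_size=32)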

Generative Adversarial Networks (GANs)

Generative Adversarial Networks, described by some as ‘the most interesting Machine Learning idea in the last 10 years’, have had huge success since they were introduced in 2014. A GAN is a two-part model in which two neural networks compete to become more accurate in their predictions. The two networks that make up a GAN are referred to as the generator and the discriminator. The goal of the generator is to artificially manufacture outputs that could be mistaken for real data, while the goal of the discriminator is to identify which of the outputs it receives are artificial.

Essentially, GANs create their own training data (they typically run unsupervised). As the feedback loop between the two networks continues, the generator produces higher-quality output and the discriminator becomes better at flagging artificially created data. This unique arrangement enables GANs to achieve impressive feats of media synthesis such as composing melodies, generating human faces, blending photos, and many more.
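A heavily compressed sketch of this two-network setup in Keras might look like the following (the shapes and layer sizes are arbitrary illustrations, not a production music-GAN architecture):

import tensorflow as tf

LATENT_DIM = 100      # length of the random noise vector fed to the generator
SEQUENCE_LENGTH = 32  # length of the generated note sequence
NUM_NOTES = 120       # size of the note vocabulary

# Generator: turns random noise into a sequence of note probabilities.
generator = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(LATENT_DIM,)),
    tf.keras.layers.Dense(SEQUENCE_LENGTH * NUM_NOTES),
    tf.keras.layers.Reshape((SEQUENCE_LENGTH, NUM_NOTES)),
    tf.keras.layers.Softmax(axis=-1),  # one distribution over notes per time step
])

# Discriminator: classifies a note sequence as real (1) or generated (0).
discriminator = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(SEQUENCE_LENGTH, NUM_NOTES)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Combined model for training the generator: noise -> generator -> discriminator.
# The discriminator's weights are frozen here so only the generator is updated.
discriminator.trainable = False
gan = tf.keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

In a full training loop, the discriminator would be trained on batches of real and generated sequences, and the combined model would then be used to update the generator so that its output fools the discriminator.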

Conclusion

There is no doubt that AI music generation still has a long way to go. But as with other deep learning applications, results will keep improving as we experiment with different architectures and as the computing power at our disposal increases. AI will undoubtedly play a big role in music production in the future, providing producers with a tool for interesting, novel sounds and ideas. Fast forward a couple of years and it is entirely plausible that the next viral summer hit is made not by David Guetta or co., but by a computer!

Elucidate AI

Elucidate AI is a machine learning solutions provider specialising in marketing & sales and financial services. Powering 3 million decisions daily.