Image courtesy of https://indstrlmnky.deviantart.com/art/Robot-DJ-22107727

AI Jukebox: Creating Music with Neural Networks

Brian McMahon
6 min read · Apr 6, 2018

The AI Jukebox is a neural network that generates music. Let's start by sampling some of the AI Jukebox's work:

The AI Jukebox trains on a collection of MIDI music files. It builds a "machine understanding" by mapping the latent, internal structural relationships of the dataset, and from this "understanding" it is able to create new, unique generated content.

The work thus far has focused on collections of MIDI files grouped by genre. A few of the genres sampled include:

You can train AI Jukebox with your own collections of MIDI files; simply use the code and follow the operational instructions in its GitHub repo.

Now that we have seen what the AI Jukebox can do, let's dive a bit deeper into how it works, and why it is important.

Generative Models

I'd like to start out by pondering a quote from the legendary theoretical physicist Richard Feynman: "What I cannot create, I do not understand."

Taking a bit of interpretive liberty, I would draw out a symbiotic relationship between creation and understanding. That is, understanding is required in order to create, and the act of creation feeds understanding:

Generative models map the hidden structure within a dataset, and then new, unique content can be generated as “samples” from this mapping.

How a generative model works.

There are an abundance of potential use cases of generative models, with applications including the creation of new and unique images, audio and text, and perhaps in the not-too-distant future the generation of other key blueprints of our society, such as code, designs or even physical structures (such as with 3D printing).

Long Short-Term Memory ("LSTM") is a type of recurrent neural network which is often used in generative models to generate sequences of text, or in our case, notes and chords. This model is useful in that it carries "memory", allowing information, including long-term dependencies, to persist within the network. The state of a recurrent neural network (with LSTM being one of the most popular variants) is constantly updated from both new inputs and the previous state of the model.
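To make the idea concrete, here is a minimal numpy sketch (not the AI Jukebox code itself) of a single recurrent step, showing how the new state depends on both the new input and the previous state:

```python
import numpy as np

def rnn_step(x, h_prev, W_x, W_h, b):
    """One step of a basic recurrent network.

    The new hidden state is a function of both the current input x
    and the previous hidden state h_prev, which is how information
    "persists" across the sequence.
    """
    return np.tanh(W_x @ x + W_h @ h_prev + b)
```

An LSTM refines this basic recurrence with gates (covered below) so that long-term dependencies survive many steps.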

A LSTM network. Diagram courtesy of Christopher Olah’s blog.

The LSTM network works via a "gate layer" structure: each LSTM node is actually made up of several "gates" that manage the "cell state", the memory of the network.
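The gate structure can be sketched in a few lines of numpy. This is an illustrative single step (the weight layout follows the standard forget/input/candidate/output convention, not any particular library's internals):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step: four gates managing the cell state c (the "memory")."""
    z = W @ x + U @ h_prev + b   # all four gate pre-activations at once
    n = h_prev.shape[0]
    f = sigmoid(z[0:n])          # forget gate: what to erase from the cell state
    i = sigmoid(z[n:2*n])        # input gate: what new information to write
    g = np.tanh(z[2*n:3*n])      # candidate values to write
    o = sigmoid(z[3*n:4*n])      # output gate: what part of memory to expose
    c = f * c_prev + i * g       # updated cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c
```

The key point is the cell state update: the forget and input gates decide, at every step, what to keep and what to overwrite, which is what lets long-term dependencies persist.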

Architecture

The AI Jukebox is designed as a bidirectional LSTM network, with two LSTM layers, two dense layers and dropout at each layer.

Bidirectional LSTM neural network architecture.

A bidirectional LSTM is a special type of LSTM network in which there are actually two layers of LSTM: one trains on the sequence of notes/chords in the forward direction, while the other trains on the same sequence in reverse. This is one way to map the latent relationships within the data more comprehensively.

Dropout has been distributed evenly throughout the network so as to avoid loss of memory; sporadic use of dropout in recurrent networks has been known to cause issues of this sort, but the problem is thought to be largely mitigated when dropout is spread evenly across the layers. For this model, 50% dropout was used at each layer.

The input layer is 512 nodes, and the softmax output layer has one node for each distinct note or chord found in the training dataset. The dataset input into the model can be any collection of MIDI music; a few different genres were explored in this exercise, including Celtic, Dance, Jazz and Classical.
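Putting the pieces of the architecture together, a Keras sketch along these lines would look like the following. The 512-node layer size, 50% dropout, two bidirectional LSTM layers, two dense layers and softmax output come from the description above; the dense hidden size of 256 and the vocabulary size are assumptions for illustration:

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Dropout

seq_length = 200  # notes/chords per input window
n_vocab = 130     # assumed; in practice, the number of distinct notes/chords in the dataset

model = Sequential([
    Input(shape=(seq_length, 1)),
    # two bidirectional LSTM layers, each with dropout
    Bidirectional(LSTM(512, return_sequences=True)),
    Dropout(0.5),
    Bidirectional(LSTM(512)),
    Dropout(0.5),
    # two dense layers (the 256 hidden size is an assumption)
    Dense(256, activation='relu'),
    Dropout(0.5),
    Dense(n_vocab, activation='softmax'),  # one output per distinct note/chord
])
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
```

With the softmax output one-hot-encoded over the note/chord vocabulary, categorical cross-entropy is the natural loss for this setup.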

The AI Jukebox was built in Python, with Keras and TensorFlow used for the neural network, and music21 and MuseScore 2 for music analysis. Models were trained on an Amazon Web Services p2.xlarge instance.

Music Generation

Sequences of notes/chords are generated by starting from a random point in the underlying dataset, taking a window of sequential notes and chords (200 in our case), and predicting the 201st note from a probability distribution (the output of the softmax activation). The 200-note window then shifts forward one note at a time until the model has generated the full requested sequence; in our case, 500 notes/chords.
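The sliding-window loop above can be sketched as follows. This is a simplified version of the idea, with the trained network abstracted as a `predict_fn` that maps a window of note indices to a softmax distribution (the function name and argmax decoding are illustrative choices, not the exact AI Jukebox code):

```python
import numpy as np

def generate_sequence(predict_fn, seed_window, n_notes=500, window=200):
    """Generate notes one at a time with a fixed-length sliding window.

    predict_fn: maps an array of `window` note indices to a probability
    distribution over the vocabulary (the network's softmax output).
    """
    history = list(seed_window)[-window:]
    generated = []
    for _ in range(n_notes):
        probs = predict_fn(np.array(history[-window:]))
        next_note = int(np.argmax(probs))  # most probable next note
        generated.append(next_note)
        history.append(next_note)          # the window slides forward by one
    return generated
```

Each prediction is fed back into the window, so the model conditions on its own output as the piece unfolds.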

Sequence generation by an LSTM network. Diagrams courtesy of Sigurður Skúli, Towards Data Science.

Temperature is an added hyperparameter which increases or decreases the probability of any given note being chosen. A decrease in temperature will lead to more accurate, yet less interesting, rhythms. An increase in temperature will lead to more randomized note selection, which could potentially lead to more interesting pieces. But if you turn the temperature up too high you may just get random noise!
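A common way to apply temperature, sketched here as an assumption about the sampling step rather than the exact AI Jukebox implementation, is to rescale the softmax probabilities in log space before sampling:

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0):
    """Re-shape a softmax distribution before sampling a note index.

    temperature < 1 sharpens the distribution (safer, more repetitive notes);
    temperature > 1 flattens it (more surprising notes, eventually just noise).
    """
    logits = np.log(np.asarray(probs, dtype=float) + 1e-9) / temperature
    scaled = np.exp(logits - logits.max())  # stable softmax over rescaled logits
    scaled /= scaled.sum()
    return int(np.random.choice(len(scaled), p=scaled))
```

At temperature 1.0 this reduces to sampling from the model's original distribution; as the temperature goes to zero it approaches a plain argmax.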

Testing

When listening to the music generated by AI Jukebox (see SoundCloud), we should keep the following points in mind:

  • as the model is generative (as opposed to discriminative) there are no labels; as such, we are the best judges
  • in general, we are looking for repeating patterns within a reasonable long-term structure
  • the model has been trained to minimize both training and validation loss to prevent overfitting on the underlying dataset
  • most importantly, the music should be aesthetically pleasing

For those musically-inclined, the first half of the generated Celtic Piano 1 piece is “noted” below:

Key Takeaways

In this post we explored one application of how a generative model can generate unique, new content. I hope you would agree that the music samples contain some interesting, if not evocative, rhythmic patterns — but perhaps we aren’t in the running for awards just yet. This model just “scratches the surface” of generative modeling in music — more work to be done!

Next Steps

There are many avenues to take in refining the performance of this model. We could continue to run different datasets, perhaps narrowing the collections from genre to a single artist, or even to a specific style or tempo for that artist.

There are also many other architectures which may be able to further enhance the output of the model. Generative Adversarial Networks (GANs), variational autoencoders and attention-based RNNs all seem to be gaining significant traction in the generative music area.

We can further explore different inputs, such as raw music or even text.

There are also other tweaks that can be explored, such as adding multiple instrument functionality, varying rhythmic patterns (currently we are only working with eighth notes) and rhythm seeding where the initial notes are “seeded” in order to influence initial melodies.

Ideally, the model would be wrapped into a web app where a user could upload any collection of music and receive a unique, AI-generated MIDI file as output. But it isn't ready for that just yet; training times and output consistency make this relatively infeasible at this time.

AI Jukebox is just a simple first prototype in a "bleeding-edge" area of generative neural networks. It is truly amazing to think about how far this technology has come so quickly, and a bit unfathomable as to where this may lead in the not-so-distant future.

In the following video, I present the AI Jukebox as my "passion project" at Metis Data Science Bootcamp Career Day on 5 April 2018.

Once again, the code and presentation are available on GitHub here.

If you liked this post, a clap (or two) would be much appreciated, which will make the content more easily available for the benefit of other readers.

Thanks for stopping by!


Brian McMahon

Machine Learning and AI enthusiast. Never stop learning.