How we made music using Neural Networks

My name is Aleksandr Tavgen and I work as a Software Architect at Playtech. I have always loved playing. When I was younger, I mostly played with LEGO, but these days my toys are slightly more complex. Recently, for example, I have been playing around with Recurrent Neural Network models.

Last year, I proposed a collaboration to a friend of mine, Aleksandr Zedeljov (http://faershtein.com/), who is a composer, musician, and musical director at the Russian Theatre of Estonia. We decided to carry it out together with the musical project MODULSHTEIN (Aleksandr Zedeljov, Aleksej Seminihhin, Marten Altrov) and to present the results of our work in a live performance at the badge pick-up party at Topconf 2017 in Tallinn.

We had less than two months to implement the project, which was not much considering that we live in different cities and could only run our experiments at weekends. Fortunately, we received a lot of invaluable support from Playtech in our research.

The Basics

The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy (deep learning expert and Tesla’s Director of Artificial Intelligence) serves as an excellent introduction to the technology behind this project.

Generally speaking, Recurrent Neural Networks produce good results when handling time series data with an underlying structure. Take natural languages, for example. The structure of a natural language has a number of dimensions. On the one hand, there’s the semantic dimension, which is quite difficult for machines to grasp. A famous example here is the Chinese room argument, which holds that running a program cannot by itself give a computer a “mind”. However, this argument is less persuasive when it comes to multimodal learning, which is a large topic in and of itself.

On the other hand, there is the syntactic dimension, which is quite manageable for machines. The “magic” behind Recurrent Neural Networks (RNNs) comes from the fact that an RNN holds a state and takes its previous states into account. Roughly speaking, we can feed a model many different strings of text and then ask it to predict the next character in a string depending on what the previous characters were. For example, there is a very high probability that the string ‘I lov’ is followed by an ‘e’, but the next character after the string ‘I love ’ is not as obvious. It could be ‘y’ (‘you’), ‘h’ (‘him/her’), ‘t’ (‘them’), and so on.
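The next-character idea can be illustrated without a neural network at all. The sketch below is not an RNN and not what we used in the project; it simply counts which character follows each five-character context in a tiny made-up corpus, which is enough to reproduce the ‘I lov’ versus ‘I love ’ contrast. The function names and the corpus are invented for this illustration.

```python
from collections import Counter, defaultdict


def train_char_model(texts, context_len=5):
    """Count which character follows each context of `context_len` characters."""
    counts = defaultdict(Counter)
    for text in texts:
        for i in range(context_len, len(text)):
            context = text[i - context_len:i]
            counts[context][text[i]] += 1
    return counts


def next_char_probs(counts, context):
    """Return the estimated probability of each character that may follow `context`."""
    followers = counts.get(context, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in followers.items()}


# Tiny illustrative corpus (an assumption, not real training data).
corpus = ["I love you", "I love him", "I love her", "I love them"]
model = train_char_model(corpus, context_len=5)

probs_after_lov = next_char_probs(model, "I lov")    # only 'e' ever follows
probs_after_love = next_char_probs(model, "love ")   # 'y', 'h' and 't' compete
print(probs_after_lov)
print(probs_after_love)
```

An RNN replaces the fixed-length lookup table with a learned internal state, so it can take much longer contexts into account, but the question it answers at every step is the same: given what came before, how likely is each possible next symbol?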

Music is also structured along various dimensions such as rhythm, intervals, dynamics, etc. It could be said that music constitutes well-structured time series data, which we can use as input for our RNNs. The MIDI format is highly suitable for this, because a MIDI signal is essentially a sequence of numeric codes arranged in time.
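To make “numeric codes arranged in time” concrete, here is a minimal sketch that decodes raw three-byte MIDI note messages into timestamped events. It handles only note-on and note-off messages; the `NoteEvent` type, the function name, and the example timings are assumptions made for illustration. The byte values themselves follow the MIDI standard: a status byte of 0x9n is a note-on for channel n, 0x8n is a note-off, and General MIDI places percussion on channel 10 (index 9), with pitch 36 as the bass drum and 38 as the snare.

```python
from typing import NamedTuple


class NoteEvent(NamedTuple):
    time_s: float   # when the event happens, in seconds
    kind: str       # "note_on" or "note_off"
    channel: int    # 0-15
    pitch: int      # 0-127, where 60 is middle C
    velocity: int   # 0-127, how hard the note was struck


def decode_note_message(time_s, data):
    """Decode one 3-byte MIDI note-on or note-off message."""
    status, pitch, velocity = data
    message_type = status & 0xF0   # upper nibble: message type
    channel = status & 0x0F        # lower nibble: channel
    if message_type == 0x90 and velocity > 0:
        kind = "note_on"
    elif message_type == 0x80 or (message_type == 0x90 and velocity == 0):
        # A note-on with velocity 0 is treated as a note-off by convention.
        kind = "note_off"
    else:
        raise ValueError(f"not a note message: status byte {status:#04x}")
    return NoteEvent(time_s, kind, channel, pitch, velocity)


# A kick, then a snare, then the snare being released (timings are made up).
events = [
    decode_note_message(0.0, bytes([0x99, 36, 100])),
    decode_note_message(0.5, bytes([0x99, 38, 90])),
    decode_note_message(0.6, bytes([0x89, 38, 0])),
]
for event in events:
    print(event)
```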

It was while I was thinking about how to implement our project and performing some tests using the Python MIDI library that I stumbled upon the Magenta project, which was launched in 2016 by the Google Brain team. Its aim was to find out whether machine learning could be used to create compelling art (images, music, texts, etc.). Today it functions both as a research project and as a community for like-minded researchers, developers, and artists. The models and tools used by the team are regularly released to the public in open source form together with demos, tutorials, and technical papers.

For our purposes, the key part of Magenta is its MIDI interface to TensorFlow, a widely used open-source framework for machine intelligence. Magenta creates virtual MIDI ports for call-response (or input-output) interactions, and it is possible to create many parallel pairs of such ports. Each pair can be connected to a TensorFlow model, in our case a Recurrent Neural Network that has been trained on a set of MIDI files.

Technical Overview

The interaction process looks as follows.

Because RNNs do not work well with multidimensional data, MIDI signals should first be mapped onto an artificial alphabet, where every musical element corresponds to a unique character or hash code. During the interaction process, a MIDI signal goes through the Magenta MIDI interface and is converted into the NoteSequence format, which is a protocol buffers format for exchanging data with a TensorFlow model. The TensorFlow model is an RNN that receives an input sequence and produces a response using this sequence as a seed.
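The mapping onto an artificial alphabet can be sketched as follows. This is an illustration of the idea only, not Magenta’s actual encoder: every distinct set of drums that sound together on one time step gets its own integer symbol, so a multidimensional event (several pitches at once) becomes a single character in a one-dimensional sequence. The function names and the drum pattern are made up for the example; the pitches use the General MIDI drum map (36 kick, 38 snare, 42 closed hi-hat).

```python
def build_alphabet(steps):
    """Assign an integer symbol to every distinct set of simultaneous pitches."""
    alphabet = {}
    for step in steps:
        key = frozenset(step)
        if key not in alphabet:
            alphabet[key] = len(alphabet)
    return alphabet


def encode(steps, alphabet):
    """Turn a list of per-step pitch lists into a list of integer symbols."""
    return [alphabet[frozenset(step)] for step in steps]


def decode(symbols, alphabet):
    """Turn integer symbols back into sorted per-step pitch lists."""
    inverse = {index: key for key, index in alphabet.items()}
    return [sorted(inverse[symbol]) for symbol in symbols]


# One bar of a hypothetical drum pattern, quantised to eight steps.
pattern = [[36, 42], [42], [38, 42], [42], [36, 42], [36, 42], [38, 42], [42]]

alphabet = build_alphabet(pattern)
symbols = encode(pattern, alphabet)
print(symbols)                     # the bar as a one-dimensional sequence
print(decode(symbols, alphabet))   # and back to pitches
```

Once the music is a sequence of symbols like this, predicting the next drum hit is the same kind of problem as predicting the next character of a text.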

At first, I intended to train TensorFlow models from scratch, but I ran into some problems. To begin with, it was difficult to obtain large data sets for specific styles of music. It is possible to download some training data (for example, from http://colinraffel.com/projects/lmd/), but it makes little sense to train your own models on the same data set that the Magenta team uses. So I decided to take one of Magenta’s pre-trained models (an LSTM RNN with two hidden layers of 128 units each) and continue its training process. I downloaded a lot of breakbeat music in the MIDI format, ranging from The Prodigy’s Out of Space and Goldie’s Inner City Life to Moby and Massive Attack. Magenta’s GitHub repository contains conversion tools that can help you prepare your own training set from a collection of MIDI files.

The second problem arose when I tried to run the training process on the AWS Cloud. I started it on a p2.8xlarge instance, but kept running into non-obvious hang-ups that were related to native calls. Unfortunately, I had no time to investigate that problem.

Only the last two bars were taken into account during every training instance, so we only spent a few nights training the models with different settings. The loss did not fall especially fast, and I cannot say whether the trend was towards continued descent or whether it all amounted to oscillation around local minima, but it was acceptable for my purposes.
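I did not measure this at the time, but one simple way to tell a slow descent from oscillation is to fit a straight line through the most recent loss values and compare its slope with the scatter around it. The sketch below does that with plain least squares; the function name, the window size, and both loss curves are invented for illustration and are not our actual training logs.

```python
def loss_trend(losses, window=100):
    """Fit a least-squares line to the last `window` losses.

    Returns (slope, spread): the slope per training step and the
    root-mean-square distance of the points from the fitted line.
    """
    recent = losses[-window:]
    n = len(recent)
    if n < 2:
        raise ValueError("need at least two loss values")
    mean_x = (n - 1) / 2
    mean_y = sum(recent) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(recent))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in enumerate(recent)]
    spread = (sum(r * r for r in residuals) / n) ** 0.5
    return slope, spread


# Two synthetic curves: a steady descent and a pure oscillation.
descending = [2.0 - 0.01 * step for step in range(100)]
oscillating = [1.0 + 0.1 * (-1) ** step for step in range(100)]

slope_desc, spread_desc = loss_trend(descending)
slope_osc, spread_osc = loss_trend(oscillating)
print(slope_desc, spread_desc)
print(slope_osc, spread_osc)
```

A clearly negative slope with a small spread suggests the model is still learning, while a slope close to zero with a large spread suggests it is bouncing around without making progress.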

The last checkpoint from the training process was converted into a bundle file using Magenta’s tools. Our configuration consisted of two pairs of virtual MIDI ports with a separate model for each. The first model was to listen to the rhythm provided by one of the live musicians and then provide a rhythm in response. The second model was to listen to the first model’s output and generate its own response, and so on. It is also possible to build a set-up where two or more models listen to each other and create music together. We performed one such test as well, letting our models play with each other while we took a lunch break. The result was vaguely trance-like and tinged with Afro-beats, bizarre but not unpleasant.

The next step was to figure out how to bind the models together with live musicians during an improvisation session. Aleksandr Zedeljov proposed that we use Ableton Live as a universal glue. It is a musical sequencer and digital audio workstation that was designed for live performances as well as for music production. Max/MSP has previously been used to bind Ableton Live with Magenta, but this solution did not work for us, because Max/MSP kept crashing and taking Ableton down with it. So we ended up discarding Max/MSP and connecting Ableton and Magenta directly. Later, we also had some problems with synchronising Ableton and Magenta via midi_clock.
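Some background on why clock synchronisation is delicate, assuming that midi_clock here refers to the standard MIDI timing clock (how it was wired in our set-up is not covered in this article). The MIDI specification defines the timing clock as a single status byte, 0xF8, sent 24 times per quarter note, so at typical tempos the pulses are only about 20 milliseconds apart and small delays quickly add up to audible drift. The helper names below are made up for this sketch.

```python
MIDI_CLOCK_STATUS = 0xF8       # MIDI "timing clock" real-time message
PULSES_PER_QUARTER_NOTE = 24   # fixed by the MIDI specification


def clock_interval_seconds(bpm):
    """Time between two consecutive clock pulses at the given tempo."""
    if bpm <= 0:
        raise ValueError("tempo must be positive")
    return 60.0 / (bpm * PULSES_PER_QUARTER_NOTE)


def estimate_bpm(pulse_times):
    """Estimate the tempo from the timestamps (in seconds) of clock pulses."""
    if len(pulse_times) < 2:
        raise ValueError("need at least two pulses")
    average_gap = (pulse_times[-1] - pulse_times[0]) / (len(pulse_times) - 1)
    return 60.0 / (average_gap * PULSES_PER_QUARTER_NOTE)


interval = clock_interval_seconds(120)
# Simulated timestamps for one quarter note's worth of pulses at 120 BPM.
pulse_times = [i * interval for i in range(PULSES_PER_QUARTER_NOTE + 1)]
recovered_bpm = estimate_bpm(pulse_times)
print(interval, recovered_bpm)
```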

Our first attempts at improvisation looked like this:

Aleksandr Zedeljov played a rhythm example on the drum pad (Ableton Push 2), the model received the MIDI signals through Magenta, and then produced a response.

We went through many “try, test, modify, repeat” cycles. It was quite entertaining, especially when the results were unusual. The main complexity arose from the fact that the models gave different responses every time, so every test amounted to pure improvisation. During the process, we noticed that longer input from live musicians seemed to result in more sensible responses from the models. It felt like the models gained courage as they kept working. I decided to check whether this was merely our impression, or whether there were actual differences in the performance of the RNNs. In the beginning, the models produced responses with a log-likelihood of around -70, but after a certain amount of time the value fell to -150, -400, and even -700, and the difference was distinguishable by ear. This appears to be related to the internal state of the RNN, which seems to settle, after a certain amount of time, on values that produce increasingly better responses.
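For readers unfamiliar with the metric: the log-likelihood of a generated sequence is the sum of the logarithms of the probabilities the model assigned to each event it emitted. The sketch below uses made-up probabilities rather than values from our models, and the function name is an assumption. It shows that the number depends both on how confident each step is and on how many steps the sequence contains, which is worth keeping in mind when comparing responses of different lengths.

```python
import math


def sequence_log_likelihood(step_probabilities):
    """Sum the natural logs of the probabilities of each generated event."""
    return sum(math.log(p) for p in step_probabilities)


# Hypothetical sequences in which the model gives every event probability 0.5.
short_response = [0.5] * 100
long_response = [0.5] * 1000

ll_short = sequence_log_likelihood(short_response)
ll_long = sequence_log_likelihood(long_response)
print(ll_short, ll_long)

# Dividing by the length gives a per-event value that is comparable
# between sequences of different lengths.
print(ll_short / len(short_response), ll_long / len(long_response))
```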

We decided to film our first real improvisation session with live musicians in Playtech’s Tallinn office. It was pretty cool, because the office was empty that late in the evening and we were up on the 10th floor, with a view of planes landing at the airport.

[Photo: A bit of Moog :)]
[Photo: Work moments…]

Magenta’s browser interface enabled us to monitor what was going on within the models in real time, making it possible for us to change the parameters of the models on the fly (see the orange “bricks” running on the screen).

Performance

[Photo: Martin Altrov, Aleksandr Zedeljov, Aleksei Semenihhin (MODULSHTEIN)]
[Photo: 20 minutes before the start]
[Photo: Control panel]

We achieved much better results during our Topconf performance thanks to the additional time we had for tuning the whole system. However, because we lacked the time and data to train melodic models, the two RNNs provided only the rhythm section.

MIDI signals can also be used for controlling digital video workstations, so it would be interesting to also use models that produce video responses in order to supplement the music with an improvised video stream. There are a lot of possible approaches to chaining various models and combining them with music and video devices, experimenting with various harmonic models, implementing call-response loops during any intermediate step, and so on.

Many thanks to the team (Aleksandr Zedeljov, Martin Altrov, Aleksei Semenihhin, Nikolay Alhazov), to Playtech, and personally to Marianne Võime and Ergo Jõepere.