During my senior year in the University of Washington’s Computer Science Department, I had the honor of taking the Sound Capstone class (CSE481i) taught by senior lecturer Bruce Hemingway. There were only a dozen of us in the class, and the task given to us was simple: design, implement, and release an impressive application using computer audio. Bruce provided a whole slew of hardware in the sound lab for us to tinker with: Leap Motion controllers, Oculus VR, sonar sensors, Microsoft Kinects, and NVIDIA Jetson TK1 Developer Kits. It was up to us to decide what to build.
The ideas we generated were limitless. The ensemble of students created a long list of possible projects: a location-based sound-capturing app, music recommendation apps, sentiment analysis paired with complementary music, anything really.
Our group chose to build a website called Algo Rhythm which allows anybody to easily listen to and generate classical music using machine learning. The following is a summary of our results.
This project began with multiple purposes, including exploring and expanding on existing recurrent neural network (RNN) based techniques for algorithmic music generation. We sought to use an NVIDIA Jetson TX1 teraflop-level supercomputer, hoping to accelerate what we knew to be a computationally intensive machine learning (ML) process. But more than that, we thought: why should computer programmers be the only ones who can use machine learning to produce music?
What if you could let anybody train a computer to produce music?
Now things are getting interesting. Ideally, any non-technical person would be able to use a computer application to choose the style of music to generate and listen to the results right then and there. To bring this idea to fruition, we built a website that interfaces with the GPU-accelerated hardware running our ML service.
The whole user flow runs as follows:
- Upload pieces of music in MusicXML format, or choose from existing pieces.
- Select any number of music samples and press “Train” to generate a configuration.
- Select a configuration and press “Generate Song” to generate a song of N seconds.
How exactly do we produce computer-generated music? You could build a rules-based system that attempts to generate the next note given a set of facts about the song thus far, or spawn the next note using a Hidden Markov Model. However, music doesn’t change its tempo or melody every second. Rather, the cadence and style of sound have persistence. For this reason, we chose to use a neural network to model the music.
Neural nets are excellent at modeling complex processes that cannot be simply modeled through more functional approaches. Recurrent neural nets specifically are powerful in their ability to retain memory and construct sequences, making them especially applicable to the problem of music generation. The issue lies in training them. Not only does it take a very large amount of processing time for a network to begin to come anywhere close to modeling the desired behavior — humans can take years to learn things with significantly more processing power and much more complex networking systems — but it is also necessary to give RNNs clean data that they can model. Too much noise in the data, or too complicated of a system and your network will fail to produce anything. Too little, and the network will mimic the training data rather than producing anything of its own. How can we have users of Algo Rhythm upload usable music data for us to train on?
Cue the music…
A common way to save sheet music is in a format called MusicXML. MusicXML is the universal standard for sheet music and can be used/edited by most of the currently available scorewriters including Sibelius, Finale, and more than 100 others. The XML provides a standard format for valuable song information such as tempo, note pitch, duration, octave, et cetera. It is often used to produce graphical scores and can be converted to audio formats such as MIDI.
Composers use their favorite music notation software to write a score that looks like…
Which can be exported to…
Many free online copies of MusicXML exist on websites such as musicalion.com and openmusicscore.org. We’ve already crawled these sites for a good variety of classical pieces that can be seen/downloaded here. We use this XML format as a convenient way for users to customize the input style of the generated music.
Music As Input
One prerequisite for music generation with RNNs, at least at the level of abstraction on which we did it, is quantization of the music.
Quantization is the process of converting, or digitizing, the almost infinitely variable amplitude of an analog waveform to one of a finite series of discrete levels.
This means that each note must be assigned a discrete pitch, as well as starting and ending times in terms of the underlying beat of the music. Initially, we planned to use MIDI as input, but while MIDI is pitch-quantized, time quantization is more complicated. MIDI music is quantized around a time unit called a “tick”, which has a very fine granularity (a beat might be 180 ticks, for example). However, this number was “noisy”, i.e., the number of ticks per beat would vary from beat to beat. In addition, grace notes, trills, and tempo changes exist in MIDI music, making it challenging to beat-align. While one option was to develop reliable time quantization code for MIDI, we instead decided to use MusicXML as an input format for training data.
Parsing training data from MusicXML solved our problems with time quantization. The only deficiency was that not all pieces of MusicXML contain tempo information (i.e., beats per minute), which is required for our system to map the music to real time. Our solution was to simply require that all MusicXML used as training data include a tempo. In many cases this involved adding a reasonable tempo tag to the XML file by hand.
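To make the tempo requirement concrete, here is a minimal sketch of how a tempo could be read from a MusicXML file using Python's standard library. It relies on the MusicXML `sound` element, whose `tempo` attribute is given in quarter notes per minute. The function names and the sample score are hypothetical illustrations, not code from our importer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed MusicXML score used only for illustration.
SAMPLE = """<score-partwise version="3.0">
  <part id="P1">
    <measure number="1">
      <direction><sound tempo="96"/></direction>
    </measure>
  </part>
</score-partwise>"""


def find_tempo_bpm(musicxml_text):
    """Return the first tempo (quarter notes per minute) found, or None."""
    root = ET.fromstring(musicxml_text)
    for sound in root.iter("sound"):
        tempo = sound.get("tempo")
        if tempo is not None:
            return float(tempo)
    return None  # the file would need a tempo tag added by hand


def seconds_per_beat(bpm):
    """Convert a tempo in beats per minute to the length of one beat."""
    return 60.0 / bpm


bpm = find_tempo_bpm(SAMPLE)
print(bpm, seconds_per_beat(bpm))
```

A file for which `find_tempo_bpm` returns `None` is the case where a tempo tag has to be added manually before training.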
Also relevant to the compilation of our training set: we restricted our training data to music in a simple, duple meter (i.e., music with groups of two or four beats in a measure). In some cases, music in our training set was primarily simple/duple but had some triplet divisions of beats. Our import system handled this by simply deleting the notes that did not align to simple/duple boundaries.
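The filtering step can be sketched as follows. This is an illustration under assumptions, not our import code: notes are represented as hypothetical `(pitch, start_beat, duration_beats)` tuples, and the grid is assumed to be a quarter of a beat (a sixteenth note when the beat is a quarter note). Exact fractions avoid floating-point trouble with triplets.

```python
from fractions import Fraction

# Assumed grid: a quarter of a beat.
GRID = Fraction(1, 4)


def is_on_grid(value, grid=GRID):
    """True if value is a whole multiple of the grid size."""
    return (value / grid).denominator == 1


def drop_off_grid_notes(notes, grid=GRID):
    """Keep only notes whose start and duration align to the grid."""
    return [
        note for note in notes
        if is_on_grid(note[1], grid) and is_on_grid(note[2], grid)
    ]


notes = [
    (60, Fraction(0), Fraction(1)),        # quarter note on beat 0: kept
    (62, Fraction(1), Fraction(1, 3)),     # triplet eighth: dropped
    (64, Fraction(4, 3), Fraction(1, 3)),  # triplet eighth: dropped
    (65, Fraction(2), Fraction(1, 2)),     # eighth note on beat 2: kept
]
print(drop_off_grid_notes(notes))
```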
Now that we have relatively clean data to use, let’s look at how we perform the learning magic, bottom-up.
NVIDIA Jetson TX1
The training of a neural net, as well as the evaluation of a neural net after training, involves a large amount of parallelizable floating-point computation from matrix multiplications. This type of computation is amenable to GPU acceleration. We were fortunate enough to have a few of the 2015 NVIDIA Jetson TX1 boards, capable of up to 1 teraFLOP of performance.
The Jetson TX1 is a development board for the NVIDIA Tegra APU (accelerated processing unit), a chip that combines CPU and GPU functionality in a single unit. The Tegra APU is targeted at the embedded systems market, providing high floating-point throughput while consuming very little power (10 W). The boards are available for purchase at the retail price of $600.
Setting up the board was not an easy process, as fully described here. After flashing Ubuntu v.14.04+ and attaching a hard drive to beef up its 16GB eMMC, the Jetson requires the installation of “JetPack”, NVIDIA’s bundle of libraries such as CUDA (the parallel computing platform) and cuDNN (the CUDA Deep Neural Net libraries). These packages require an approved NVIDIA account (it takes a day to sign up and get approved). Theano, in our case, must also be installed. Many settings need to be changed, such as the virtual memory and virtual hard disk sizes, ssh settings, CUDA settings, and clock speed. It’s quite a hassle setting up the first time, but once it’s properly configured, the performance is worth the setup time.
The Tegra GPU consumes under 10 watts, as opposed to the Intel Core i5, which consumes around 200 watts. In our measurements, the Jetson was 10 times faster than an Intel Core i5 processor at neural net training. For our typical usage, running 10,000 training iterations took 30 hours on the Jetson, while it would have taken 300 hours on commodity hardware: quite the difference.
Training our Neural Network
Provided we have a sufficient number of MusicXML files (40 five-minute pieces in our case), we can now start training our network. What format should the input be, though? Naïve approaches, such as simply feeding in a matrix of binary on/off switches for notes at each time step, or passing notes in as raw time and pitch values, failed, usually by producing no sound at all or by constantly playing every possible note at every possible time. Our most successful attempt came from bootstrapping off the work from Daniel Johnson’s blog, which uses the Theano library to take advantage of GPU acceleration and involves a series of linked, partially recurrent LSTM networks, each corresponding to a note. As input, each network takes the state of its own note (started, off, or continuing to be played) at the previous time step, as well as the position of its note (e.g. middle C or high A in numerical form corresponding to MIDI pitch value), the states of other notes up to an octave in either direction, the number of notes played at each value mod 12 at the last time step (12 is the number of half steps in an octave, so this is effectively a count of each letter note), and a four-bit count of position within the measure. In addition, later layers take “recurrent” input from the networks of surrounding notes rather than from themselves.
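To make the input description more tangible, here is a rough sketch of how such a per-note input vector could be assembled. It is an illustration under assumptions, not the actual Theano code: the state constants, the two-bit state encoding, the 16-step measure, and the function names are all hypothetical.

```python
# Hypothetical note states at a single time step.
OFF, STARTED, CONTINUING = 0, 1, 2


def state_bits(state):
    """Encode a state as [is_sounding, was_just_started]."""
    return [int(state != OFF), int(state == STARTED)]


def note_features(prev_states, pitch, step):
    """Build the input vector for one note network.

    prev_states maps MIDI pitch -> state at the previous time step
    (missing pitches are treated as OFF); step is the time step index.
    """
    features = [pitch]  # position of the note as a MIDI pitch value

    # States of this note and its neighbours up to an octave either way.
    for offset in range(-12, 13):
        features += state_bits(prev_states.get(pitch + offset, OFF))

    # Count of sounding notes per pitch class (value mod 12).
    counts = [0] * 12
    for other_pitch, state in prev_states.items():
        if state != OFF:
            counts[other_pitch % 12] += 1
    features += counts

    # Four-bit position within an assumed 16-step measure, lowest bit first.
    beat = step % 16
    features += [(beat >> bit) & 1 for bit in range(4)]
    return features


prev = {60: STARTED, 64: CONTINUING, 72: STARTED}
vector = note_features(prev, pitch=60, step=5)
print(len(vector))
```

Because every note network receives the same kind of vector, centred on its own pitch, the same weights can be reused for every note.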
This approach has a number of benefits. First, having an independent network for each note allows the network to treat notes as relative rather than absolute, meaning that similar patterns from songs that happen to be in different keys or time signatures are correctly identified as similar. It also speeds up training: only one such note network actually needs to be trained, so it is effectively given as many pieces of training data as there are possible notes at each time step, rather than just the one that other approaches offered. Second, the high level of recurrence coming from the networks interacting with both themselves and each other at every time step gives the network a good deal of ability to reconstruct patterns it detected in training. Third, since each note network performs a relatively simple function (its output is the probability that a note should be played), it has no need for the excessive size that characterizes most powerful neural networks, which allows for faster performance.
We found that 10,000 is the minimum number of RNN epochs/iterations needed to produce pleasant-sounding music. A lower number of iterations would produce seemingly random piano notes.
Using a random initialization, we are able to produce a unique piece of music every time we run our genmusic script for a given model. Our system on average takes 6 seconds for every 1 second of music produced.
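One way to picture where the randomness enters, as a simplified sketch rather than our genmusic script: if each note network outputs the probability that its note is played, a piece can be drawn by sampling those probabilities with a seeded random number generator. The function names and the fixed probabilities below are hypothetical; in the real model, each step's probabilities depend on the notes sampled before it.

```python
import random


def sample_step(play_probabilities, rng):
    """Draw one on/off decision per note from its play probability."""
    return [int(rng.random() < p) for p in play_probabilities]


def sample_piece(probabilities_per_step, seed):
    """Sample every time step using a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [sample_step(step, rng) for step in probabilities_per_step]


# Three time steps, four notes each (made-up probabilities).
probabilities = [[0.9, 0.1, 0.5, 0.5]] * 3
print(sample_piece(probabilities, seed=1))
print(sample_piece(probabilities, seed=2))
```

The same seed always reproduces the same piece, while a fresh seed generally gives a different one.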
Process Management Backend
We created a software module dedicated to allowing users to spawn and manage machine learning and music generation processes. One method forks a new Python process, which in turn, given a list of filenames that have already been uploaded to the server, trains a machine learning model. Another method forks a new Python process for music generation and saves a MIDI file to a generated music directory. Locking prevents concurrency issues between the forked training/music generation processes and the server state. Lastly, there is a method that queries the state of the system, giving information about all existing training pieces, trained configurations, generated pieces, and training and generation processes, along with their percent completion. Additionally, this module actively monitors running processes and delivers events in software when the statuses of processes change. These events are captured by the server front end and translated into notifications that are sent to the user’s web browser.
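The general shape of such a module can be sketched with Python's standard `subprocess` and `threading` modules. This is a minimal illustration under assumptions, not our actual backend: the `ProcessManager` class, its method names, and the state labels are hypothetical, and percent completion and event delivery are left out.

```python
import subprocess
import sys
import threading


class ProcessManager:
    """Tracks child Python processes for training and music generation."""

    def __init__(self):
        self._lock = threading.Lock()  # guards the shared process table
        self._processes = {}           # process id -> (kind, Popen handle)
        self._next_id = 0

    def spawn(self, kind, python_args):
        """Start a new Python child process and return its id."""
        child = subprocess.Popen([sys.executable, *python_args])
        with self._lock:
            process_id = self._next_id
            self._next_id += 1
            self._processes[process_id] = (kind, child)
        return process_id

    def wait(self, process_id):
        """Block until the given child process has exited."""
        with self._lock:
            _, child = self._processes[process_id]
        child.wait()

    def status(self):
        """Return a snapshot of every tracked process and its state."""
        with self._lock:
            snapshot = dict(self._processes)
        report = {}
        for process_id, (kind, child) in snapshot.items():
            exit_code = child.poll()  # None while still running
            if exit_code is None:
                state = "running"
            elif exit_code == 0:
                state = "done"
            else:
                state = "failed"
            report[process_id] = {"kind": kind, "state": state}
        return report


manager = ProcessManager()
job = manager.spawn("train", ["-c", "pass"])  # stand-in for a training script
manager.wait(job)
print(manager.status())
```

Holding the lock only while touching the process table keeps status queries from blocking on long-running children.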
The easiest way to interact with the process of generating music is through an always available website where a user can access the dedicated hardware to create and listen to generated music. To make this a reality, there are many parts to the web interface.
To make a friendlier interface for the neural net programs, we created a Flask Python server that contains an API for uploading MusicXML files, training models, and generating songs, as well as a way to serve static files and dynamic HTML pages. There is one main HTTP GET handler for getting the home page and a couple of POST handlers for network communication that must be done over HTTP, like uploading XML files. In addition, there are many internal server methods that validate input from the front end, report any errors, and dispatch processes via the backend module. The server is configured to run on host 0.0.0.0, port 80, which allows the HTTP and websocket interface to be publicly reachable on the host machine. We ran the server on jetson2.cs.washington.edu during our final presentation.
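As an example of the kind of server-side validation involved, here is a small sketch of checking an uploaded file before it is accepted. It uses only the standard library and is not taken from our server: the function name, the accepted extensions, and the error messages are assumptions. The two accepted root tags are the two MusicXML score layouts.

```python
import xml.etree.ElementTree as ET

# Root elements of the two MusicXML score layouts.
ALLOWED_ROOTS = {"score-partwise", "score-timewise"}


def validate_upload(filename, data):
    """Return an error message for a bad upload, or None if it looks fine.

    filename is the client-supplied name; data is the raw file content.
    """
    if not filename.lower().endswith((".xml", ".musicxml")):
        return "File must have a .xml or .musicxml extension."
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return "File is not well-formed XML."
    if root.tag not in ALLOWED_ROOTS:
        return "File is not a MusicXML score."
    return None


print(validate_upload("sonata.xml", b"<score-partwise/>"))
print(validate_upload("page.xml", b"<html/>"))
```

A POST handler could call a check like this and return the message to the browser instead of passing a bad file on to training.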
To build a seamless front end that automatically syncs the state of the server with the UI in the browser, a dedicated websocket is initialized for every client that connects to the server. This has a variety of benefits, from being able to submit form data without a page reload, to the server being able to broadcast status updates when other clients start new processes or the percent completion of any process updates.
The UI was built with Facebook’s React UI framework, which allowed us to create isolated components that could easily be updated as the state from the server changed. Some of these components, like the progress bars, are used more than once.
An open-source notification manager shows a nice confirmation message when a user starts a new learning or generation process. All the forms also have form validation logic (also seen on the server) to prevent the system from crashing. Lastly, plain CSS adds a nice bit of color, some structure, and subtle animations to make the website look clean and professional.
Playing MIDI on the Web
For playing the generated songs, we put the files in a static folder on the server and used the library MIDI.js found at http://midijs.net.
Here’s a selection of the unedited sounds we were able to produce:
Some songs have long periods of silence or long periods of the same note playing over and over again. A future project could be detecting these patterns and eliminating them from the final song.
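As a starting point for that future work, detecting long silences could look something like the sketch below. It is purely hypothetical: it assumes a generated song is available as a list of time steps, each holding the set of pitches sounding at that step, and the function name and threshold are made up.

```python
def find_long_silences(steps, min_length):
    """Return (start_index, length) for each silent run of at least min_length.

    steps is a list of sets, one per time step, of the pitches sounding then.
    """
    runs = []
    run_start = None
    for index, sounding in enumerate(steps):
        if not sounding:
            if run_start is None:
                run_start = index  # a silent run begins here
        else:
            if run_start is not None and index - run_start >= min_length:
                runs.append((run_start, index - run_start))
            run_start = None
    # Handle a silent run that lasts until the end of the song.
    if run_start is not None and len(steps) - run_start >= min_length:
        runs.append((run_start, len(steps) - run_start))
    return runs


song = [{60}, set(), set(), set(), {62}, set(), {64}, set(), set()]
print(find_long_silences(song, min_length=2))
```

The same run-length idea could be applied to a single repeated note by comparing each step with the previous one instead of testing for emptiness.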
Success of a project such as this one is somewhat difficult to judge.
What constitutes “good” with something as subjective as art? How close to “good” should we expect a computer program to be at something that is decidedly considered a human skill?
The project succeeded in generating something that few would argue isn’t music, and it did so with a robust and attractive interface that allows both ease of use and ease of extension should we or anyone else decide to expand upon our work in the future. We created something that incites curiosity and has some ability to entertain, which is what we set out to do, even if the music itself might not be good enough to listen to were it not written by a computer. So in that sense, the project was a success. We did perhaps fall short in some areas; we had hoped to experiment with alternate network structures and training data, but finding alternate network structures that generated anything usable proved challenging, and a time crunch combined with slow computational speeds limited our ability to experiment with training sets in different musical styles.
Both of these shortcomings also provide obvious avenues for potential expansion, however, so we can count ourselves lucky that they aren’t more critical to the viability or success of the project itself.
Build it yourself!
Project source code and documentation can be found at github.com/grant/algo-rhythm. The README.md contains instructions on how to run the website. Feel free to contribute pull requests or open issues!
Grant Timmerman is a senior in the University of Washington’s Computer Science Department interested in machine learning, programming languages, and mobile applications. When he’s not building 3D printers, Grant actively contributes to open source, gives talks about the hottest web frameworks, and practices speaking Mandarin Chinese. For more interesting projects and experiments, visit Grant’s personal website grant.cm.