The Believability of Music Generated By AI

Read about my undergraduate senior project!

Emily Thi Tran
12 min read · Dec 15, 2018

This semester has been very messy for me because I struggled for a while to figure out what my project would be. Now that we are in the endgame, I have a much better idea of what my project looks like.

Project Overview

Essentially, I am testing single-instrument music generated by AI on real people to gauge how plausibly AI music generators can understand musical themes and sound convincingly human-composed. Specifically, I fed several music generators training data from three categories and tested the outputs on people. It turned out that people were able to tell which music was created by AI through patterns of unnatural phrasing, repetition, and awkward pauses. People also had trouble predicting the genre. This is all to say that music generated by AI still has its limitations and can still be improved before it sounds as fluid as the works of Mozart himself.

Motivation

I am interested in this project because, as a musician myself, I know that music composition is a human-centered activity that is difficult to replicate well. However, I also know that there are art generators and style-transfer algorithms out there that can create beautiful art, so maybe AI-generated music is possible.

There are companies that focus on using AI to generate music, like JukeDeck, whose output sounds decent, but I want to test specific algorithms and find out whether humans can hear the difference.

Data Collection

I created a training data set of 3 categories:

  • all Mozart compositions (41 samples), because each composer usually has their own style, which I am hoping the music generator can pick up on and recreate.
  • only Mozart sonatas (20 samples), because sonatas have a particular theme to them, so I want to test whether the music generator can recognize the sonata theme.
  • various jazz pieces (90 samples). For this category, I am particularly interested in how the music generator handles a wide variety of data, since jazz spans such a broad range of styles.

I used import.io to scrape the content and a Chrome extension to batch-download the results. You can check out how I did it in this blog post.

All the MIDIs contained the full composition, including left hand, right hand, and sometimes countermelody parts. I did not clean up the MIDI files for Hexahedria’s music generator because the person who created the generator didn’t modify their data. However, I did clean up the MIDI files for Watson Beat because it can’t handle dense melodies.
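
For the Watson Beat set, cleanup mostly meant reducing a dense piano texture to a single melodic line. I did most of this by hand in a DAW, but here is a rough sketch of how the thinning could be scripted with the pretty_midi library (the file names are placeholders, and the greedy "keep the highest note" rule is only an approximation of what I actually kept):

```python
# A hedged sketch of thinning a dense MIDI down to one line.
# Greedily keeps the highest note at each onset and drops notes that overlap it.
import pretty_midi

pm = pretty_midi.PrettyMIDI("mozart_sonata_full.mid")   # placeholder file name
melody = pretty_midi.Instrument(program=0)              # acoustic grand piano

all_notes = sorted(
    (n for inst in pm.instruments for n in inst.notes),
    key=lambda n: (n.start, -n.pitch),                  # earliest first, highest first on ties
)
last_end = 0.0
for note in all_notes:
    if note.start >= last_end:      # skip anything that overlaps the note we kept
        melody.notes.append(note)
        last_end = note.end

thinned = pretty_midi.PrettyMIDI()
thinned.instruments.append(melody)
thinned.write("mozart_sonata_melody.mid")
```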

Music Generator 1: Hexahedria

Motivation

I researched many music generators, some of which I have written about in a previous blog post. Most music generators use Long Short-Term Memory (LSTM) neural networks (a type of Recurrent Neural Network with a better “memory”; you can read more about the architecture of LSTMs here) because

  • the sequential nature of LSTMs fits music generation well
  • well-composed music relies on a musical theme, so remembering previous output is crucial to music generation (hence why people use LSTMs as opposed to vanilla Recurrent Neural Networks, which have poor memory)

I’ve looked at music generators that use a simple LSTM structure to produce musical notes, but the result is also simple and overfits the training data, as there is no structure to handle time. For example, Skuldur’s Classical Piano Composer trained on Final Fantasy songs and assumes an output for every eighth note. Therefore the result can sound robotic and too limited.
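
To make the simple approach concrete, here is a minimal sketch of that kind of single-LSTM note predictor in Keras. This is my own illustration, not Skuldur’s actual code; the window size, pitch range, and placeholder data are arbitrary:

```python
# Minimal sketch: predict the next note from a fixed window of previous notes.
import numpy as np
from tensorflow.keras.layers import LSTM, Dense, Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical

SEQ_LEN = 32       # notes of context fed to the network (assumed)
N_PITCHES = 88     # assume a piano range, one class per pitch

model = Sequential([
    Input(shape=(SEQ_LEN, 1)),
    LSTM(256),
    Dense(N_PITCHES, activation="softmax"),  # pick the single next note
])
model.compile(loss="categorical_crossentropy", optimizer="adam")

# Toy data standing in for pitches parsed from MIDI files
notes = np.random.randint(0, N_PITCHES, size=2000)
X = np.array([notes[i:i + SEQ_LEN] for i in range(len(notes) - SEQ_LEN)])
y = to_categorical(notes[SEQ_LEN:], num_classes=N_PITCHES)
model.fit(X[..., None] / N_PITCHES, y, epochs=1, batch_size=64)
```

Because the network only ever sees one note at a time on a fixed grid, there is nothing in the structure that understands timing or chords, which is why the output tends to sound mechanical.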

This is why I was so intrigued by Hexahedria’s LSTM “biaxial structure”.

How it works

The biaxial structure applies the concept behind convolutional neural networks (from image recognition) to LSTMs so that the network can understand relative pitches, while its two axes let the network handle time and note information independently. A visualization of the beautiful architecture looks like this:

Taken from the Hexahedria blog post

It might look intimidating, but essentially, the yellow blocks are the note inputs (where every note is an input), the first two purple columns are feeding each other information about time, and the last two purple columns are feeding each other information about the notes. The inputs feed into the current layer as well as 12 semitones above and below the current layer, mimicking the analysis of neighboring pixels in a convolutional neural network.
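
As a rough illustration of the idea, here is a toy sketch in Keras (my own simplification, not Hexahedria’s actual Theano code, and the sizes are made up): recurrence runs along the time axis first, treating each note as its own sequence, and then along the note axis, treating each time step as a sequence running up the keyboard.

```python
# Toy sketch of the biaxial idea: separate recurrence over time and over notes.
import tensorflow as tf
from tensorflow.keras import layers

N_NOTES, N_STEPS, N_FEATS = 88, 128, 64   # assumed sizes for illustration

inputs = layers.Input(shape=(N_STEPS, N_NOTES, N_FEATS))

# Time axis: treat each of the 88 notes as an independent sequence over time.
x = layers.Permute((2, 1, 3))(inputs)                        # -> (notes, steps, feats)
x = layers.TimeDistributed(layers.LSTM(128, return_sequences=True))(x)

# Note axis: treat each time step as a sequence running up the keyboard.
x = layers.Permute((2, 1, 3))(x)                             # -> (steps, notes, 128)
x = layers.TimeDistributed(layers.LSTM(64, return_sequences=True))(x)

# Per-note output: probability that this note sounds at this time step.
outputs = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)
model = tf.keras.Model(inputs, outputs)
```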

What I had to do

You can find the Github repo here.

Now that I was set on using this architecture, I went ahead and forked the Github repo (yay open source) and got it working on my machine. After fixing compatibility issues, I was able to run the code, only to realize how long training would take. At the training rate I was seeing, it would take weeks to complete, meaning weeks of not being able to use my laptop. So I worked with my advisor, Justin Li, to set up training on the cluster.

Every few days, the cluster outputs a sample MIDI, which I export to mp3.
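
The export step can be scripted along these lines (a sketch, not necessarily the exact tools I used: midi2audio wraps FluidSynth, pydub wraps ffmpeg, and the soundfont and file names are placeholders):

```python
# Render a MIDI sample to WAV with a General MIDI soundfont, then convert to mp3.
from midi2audio import FluidSynth
from pydub import AudioSegment

fs = FluidSynth("GeneralUser_GS.sf2")                           # placeholder soundfont
fs.midi_to_audio("epoch_sample.mid", "epoch_sample.wav")        # placeholder file names

AudioSegment.from_wav("epoch_sample.wav").export("epoch_sample.mp3", format="mp3")
```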

The results

I created a site that holds most of the results that I currently have. The site isn’t live, but here is the Github repo if you want to run the site locally to access the results. I’ve also included playlists of all the outputs:

Here are the samples of each genre. The higher the Epoch number, the more trained the network, meaning the better the output. Have fun playing through the samples!

Music Generator 2: Watson Beat

Watson Beat is IBM’s music generator and is popular for its ability to understand Western music theory and to create compositions inspired by a training snippet in different genres. Therefore, it can create an EDM version of “Twinkle Twinkle Little Star” if you want it to.

Motivation

Since training for Hexahedria was taking longer than expected, I decided to explore another music generator. I chose Watson Beat because it has open source code, it isn’t LSTM-based, and it seems successful. I wanted to test its potential.

How It Works

Watson Beat takes in a short MIDI sample. You define the “Mood” and the composition structure (called an “ini”). Watson Beat has a lot of code defining Western music theory. It uses a Deep Belief Network, a different kind of deep learning network, to predict outputs based on the sample MIDI and the chosen “Mood”, and then applies Reinforcement Learning to punish or reward itself so that it follows the right music theory.
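
To give a feel for the reward side, here is a toy illustration I wrote (not Watson Beat’s actual code): a candidate bar is scored against a simple music-theory rule, here “stay in C major”, and that score can be fed back as a reward or punishment.

```python
# Toy music-theory reward: +1 per in-scale note, -1 per out-of-scale note.
C_MAJOR = {0, 2, 4, 5, 7, 9, 11}   # pitch classes allowed by the rule

def theory_reward(bar_pitches):
    """Score a bar of MIDI pitches against the C major scale."""
    return sum(1 if p % 12 in C_MAJOR else -1 for p in bar_pitches)

print(theory_reward([60, 62, 64, 65]))   # C D E F   -> 4
print(theory_reward([60, 61, 63, 66]))   # C C# D# F# -> -2
```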

What I had to do

You can find the Github repo here.

All the provided moods are multi-instrument, but I was interested in single-instrument output to match the data I already have from Hexahedria. Therefore, I had to create my own mood for single-instrumentation, which I called “PianoSolo”. Here is a blog post I wrote about the code structure of Watson Beat.

Training is really quick, taking less than a minute. The more time-consuming task for Watson Beat is stitching together the outputs. Watson Beat spits out many MIDI files per instrument, per section. Every training session can end up with over 25 MIDI files that I have to synthesize together in a Digital Audio Workstation and export as a WAV or mp3 file. I used REAPER! Stitching the files together requires supervision because sometimes there are duplicate parts, which ruins the harmonics and balance of the entire segment.
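
For a quick rough mix to audition before doing the real stitching in REAPER, the per-instrument files of a single section can be layered programmatically. Here is a sketch using the pretty_midi library (the folder and file names are placeholders, and this is not how I produced the final exports):

```python
# Rough pre-listen mix: overlay the per-instrument MIDI files of one section
# into a single MIDI file. All parts are assumed to start at t = 0.
import glob
import pretty_midi

merged = pretty_midi.PrettyMIDI()
for path in sorted(glob.glob("watson_output/section_0/*.mid")):   # placeholder folder
    part = pretty_midi.PrettyMIDI(path)
    merged.instruments.extend(part.instruments)

merged.write("section_0_rough_mix.mid")
```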

The Results and Analysis

I created many samples with Watson Beat to play around with composition flow. As I played with the samples, I noticed a couple of things.

Watson Beat really pays attention to the training sample and uses the structure of it. Sometimes what Watson Beat creates from the training data is good, and sometimes it is not. Can you tell?

Testing the Results

A part of my project is to test the outputs on real people to find out if

  • they can tell the difference between the genres, in order to see if the AIs have some level of understanding of each genre
  • they can tell whether the music was composed by an AI or a human, in order to see if the AIs are doing a comparable job at a creatively human task.

To test people, I played a combination of samples produced by Hexahedria, Watson Beat, and humans. I then asked them to answer two questions for each sample: 1) what genre is this? 2) was this composed by a human or an AI?

For a bit, I tested people manually. However, I stopped because it took a while to get results, and it seemed like people felt under pressure to answer, especially those who weren’t musicians. So, I created an online quiz (Github repo) where I could easily send the link to all of my non-Oxy friends and have them take it on their own time. I linked all the results to a single spreadsheet for further analysis.
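
Once everything lands in the spreadsheet, summarizing it is a short pandas job. Here is a sketch (the file and column names are my own stand-ins, not the exact quiz export):

```python
# Summarize quiz responses: percent of each answer per sample.
import pandas as pd

df = pd.read_csv("quiz_responses.csv")   # hypothetical export: one row per answer

# Share of "human" vs "AI" guesses for every sample
composer_pct = (
    df.groupby("sample")["composer_guess"]
      .value_counts(normalize=True)
      .mul(100).round(1)
      .unstack(fill_value=0)
)

# Share of each genre guess for every sample
genre_pct = (
    df.groupby("sample")["genre_guess"]
      .value_counts(normalize=True)
      .mul(100).round(1)
      .unstack(fill_value=0)
)
print(composer_pct)
print(genre_pct)
```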

There were about 65 participants with various backgrounds in music.

Results of the Testing

The results from the testing were very interesting. I posted the percentages of each answer down below:

The percentages of each answer for every sample tested on real people. The cells with a colored block indicate the correct answer, and the color of the block indicates how satisfying the result is.

The color of the block represents an evaluation of the correct answer. Red means “the percentage for the correct answer is way off,” yellow means “it’s acceptable but still close to being incorrect,” and blue means “it’s decently high.” (I assigned the colors based on whether the percentage is lower than the highest percentage and whether it is higher than (n-1)/n percent, where n is the total number of possible choices.)

Human vs. AI

TL;DR: the AI music wasn’t convincingly human-composed. Watson Mozart was the most convincing, managing to fool about half of the listeners.

Mozart/Classical: If you look at the right side of the chart, you can see that most people were able to tell whether a composition sample was written by a human or an AI. The confidence was highest for the real Mozart (89.4%) and Hexahedria Mozart (86.4%), which means that Hexahedria Mozart’s composition was not comparable to Mozart himself.

Watson Mozart, on the other hand, confused the listeners: half of them thought the composition was written by an AI, while the other half thought otherwise.

Sonata: The second-highest confidences were for the real Sonata (73.2%) and Hexahedria Sonata (71.1%), which means that Hexahedria Sonata is a little more convincing than Hexahedria Mozart. Still, Hexahedria Sonata didn’t sound convincing to most listeners.

Watson Sonata didn’t do any better, since it convinced only 35% of the listeners that it was human-composed.

Jazz: Listeners were 39% confident that Watson Jazz was human, which is not bad, but not good either. Hexahedria Jazz performed worse; it convinced only 20.3% of listeners that it was human.

Genre Differentiation

Overall confidence was rather low and often incorrect when people tried to guess the genre of the samples; the highest accuracy was 65.9%, for the real Sonata, and the accuracies for the AI-composed pieces were all below that.

The category where the AI had the most relative success was jazz, where the majority of the genre votes were correct.

Analysis

Human vs. AI

For the most part, people had a rough idea of which samples were AI, given the accuracies from the testing. The genres that most failed to sound human-composed were Classical and Sonata, both of which were trained on Mozart compositions. Mozart’s writing is very advanced and complicated, so it was not too surprising that the AIs weren’t able to produce something close to the masterpieces I fed them.

Certain giveaways were that the music sounded robotic, some parts were awkwardly repetitive, and there were awkward phrasings. Human-composed pieces, on the other hand, tend to sound seamless and intentional. As some of the participants described it,

“the only indicators that made me choose AI was whenever the samples would sound… robotic. The keys that were played almost sounded like a mechanical clock going “tick… tock… tick… tock”… steady beats. However, the human ones had… emotion.”

“I don’t think a computer can ever perfectly duplicate the expression and naturalism from humans”

“ [The AI samples] seem less pleasing, possibly having to do with a slightly unnatural sound in their melodic content and spacing.”

Interestingly, there were people who genuinely had a very difficult time hearing the difference. The funny thing is that my music teacher told me he actually scored a zero on my quiz. Other participants commented,

“I must say this was tough choosing if the AI made the music or a human”

“Honestly, I cant really tell which one is made by AI.”

Maybe the AI can trick some people, but more participants could hear the difference, and the scores show this as well.

Genre Differentiation

In general, it seems like people were confusing Mozart/Classical with Sonata, which makes sense because

  • they did sound somewhat similar. The training set for the sonata is a subset of the Mozart set, so the bulk of Mozart’s overall style sounds similar in both. This can result in similar weights for both the Mozart/Classical and Sonata categories. I’d say that almost half of the participants were musicians, and they were still thrown off by the similarities of the two categories.
  • most of the listeners didn’t have trained ears to tell the difference between sonatas and concertos in the first place. Despite being given real examples, they still confused the two categories. Even the Real Sonata and Real Mozart had marginal accuracies: 65.9% and 36.2% respectively. As some of the participants said,

“I can’t tell music apart to save my life, so I apologize if the genres I actually put down are completely wrong.”

“Well for me, it was already hard to distinguish the difference between the genre, now to distinguish if it’s AI or human isn’t really possible if I don’t know how to differences the genres even after listening to the example samples”

  • the samples that I provided might have tricked listeners. Watson Beat is a little different from Hexahedria in that Watson Beat builds a composition using the short MIDI sample as the main melody, while Hexahedria trains on MIDI samples to try to recreate them. Therefore, Watson’s outputs tend not to sound like the original genre of the short MIDI sample.

I think it is very interesting that jazz had the best accuracy of all the genres, because the jazz dataset had a large variety of jazz pieces, which I presumed would confuse the weights. The Sonata and Classical networks trained on consistent-sounding pieces (aka all Mozart stuff), which should help the networks create something more accurate and consistent.

Conclusion

All in all, the music from AI sounds like AI due to the unnatural, repetitive, robotic-sounding composition. The participants had a really tough time predicting the genre of the tracks, which was due to a combination of participants not knowing what to listen for and of me confusing listeners. If I could go back in time, I would have made sure to give more examples of each genre so that participants would know what to listen for rather than going by feel. But the fact that people weren’t able to tell the genres apart speaks volumes about the AI’s lack of understanding of the music it trains on.

On the bright side, many people were very impressed by the AI music despite how robotic they might sound because creating something that sounds okay from just computation is outstanding. Some people can’t even create a ringtone if their life depends on it.

But at the end of the day, I am not too convinced that AI generated music — at least the ones that I created — can truly trick people, yet. I am excited to see where this field goes in the upcoming years!

Bonus: The Challenges

There were several challenges that I faced:

  1. Training takes forever for the biaxial structure! It has been almost a month and the training is still running (75% finished). With the way the code is written, it doesn’t create a long composition until training finishes. However, it does create a sample output every 500 epochs, so at least I have some data. The reason training takes so long is the biaxial structure itself: since many of the neural layers depend on neighboring and previous layers, you have to train along one axis before being able to train along the other.
  2. Working with code that isn’t yours. As much as I love open source code, modifying it to do what you need is tricky. I spent a while trying to create a composition using the training weights that I already have, but the code isn’t compatible with the newer Python version. Fixing one part created more problems in other parts, to the point where it just wasn’t worth the hassle anymore. At least I have sample files to work with.

Of course there were many more, but I’ll save you some time by not including them. That is it, folks! I hope you enjoyed reading this!
