Deep Learning: Can We Use Computer Vision to Predict the Composer of Classical Music?

An expert who understands the history and high points of classical music can often spot a composer from just a few bars.

We’re going to see whether a Deep Learning model (specifically a Convolutional Neural Network) can do the same. But there’s a catch.

Our model won’t listen to a single note. We’re going to show images to the model that reflect the shape and form of a piece of music to find out whether our model can see the telltale signs of a particular composer from a visualization of the sound itself. We’ll test using 149 different composers.

As we go through this experiment I’ll share a basic workflow for image classification gleaned from the great course, as well as some rough but useful scripts for wrangling data. By the end of the article, we’ll have a flawed but functional model.

Supporting code and a Jupyter Notebook version of this model are available here.

Sourcing Data

Unfortunately, the data is a mess. Some folders are composers, some are compositions, and there’s no clear order to any of the files.

We need to get things in a roughly sensible order. I manually moved compositions into the folder of their composer as best I could, then merged duplicates and extracted any zipped MIDI files. This leaves us with a list of folders named by composer.

Next, we need to normalize our folder and file names to get them ready for conversion.
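The normalization script isn’t reproduced here, but here’s a rough sketch of the idea in Python. The exact rules (strip accents, lowercase, squash separators into underscores) are one reasonable choice, not the only one:

```python
# Rough sketch: normalize folder and file names so every MIDI path
# is lowercase ASCII with underscores. The rules here are my own
# choice, not a standard.
import re
import unicodedata
from pathlib import Path

KNOWN_EXTS = (".mid", ".midi", ".zip")

def normalize_name(name):
    """Lowercase, ASCII-only, underscore-separated version of a name."""
    stem, ext = name, ""
    for known in KNOWN_EXTS:
        if name.lower().endswith(known):
            stem, ext = name[: -len(known)], known
            break
    # Strip accents (Dvořák -> Dvorak), then squash everything that
    # isn't a letter or digit into single underscores.
    ascii_stem = (unicodedata.normalize("NFKD", stem)
                  .encode("ascii", "ignore").decode())
    ascii_stem = re.sub(r"[^a-z0-9]+", "_", ascii_stem.lower()).strip("_")
    return ascii_stem + ext

def normalize_tree(root):
    """Rename every folder and file under root, deepest paths first,
    so parent paths stay valid while we rename."""
    for p in sorted(Path(root).rglob("*"), key=lambda p: -len(p.parts)):
        p.rename(p.with_name(normalize_name(p.name)))
```

Renaming deepest paths first matters: if we renamed a composer folder before the files inside it, the file paths we collected earlier would no longer exist.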


MIDI files not only encode the notes of a given composition, but also data about the instruments. Our source data has symphonies, pieces for solo guitar, piano concertos, and just about everything in between. Some composers wrote heavily for certain instrument groups, so we could use this instrumentation data for prediction.

But for this project we’ll see if we can find patterns in only the notes themselves. To do this, we need to convert all MIDIs to the same synth. All compositions will use the same organ-like synth sound.

We’ll use a NodeJS package called synth to handle the conversion.

Here are two scripts to convert all midi files in a directory. The bash and Python scripts should be in the same directory.

Here’s the companion bash script.
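Neither script survives in this copy of the article, but a rough bash sketch of the conversion loop might look like this. Note the invocation `synth input.mid output.wav` is my assumption; check the package’s docs for its real command-line interface:

```shell
#!/usr/bin/env bash
# Sketch of the MIDI -> WAV conversion loop. Assumes the converter
# is invoked as `synth input.mid output.wav` (an assumption).

wav_name() {
    # foo/bar.mid -> foo/bar.wav
    echo "${1%.mid}.wav"
}

convert_all() {
    for mid in "$1"/*.mid; do
        [ -e "$mid" ] || continue          # guard against empty dirs
        out="$(wav_name "$mid")"
        # Skip files we already converted, so re-runs are safe.
        [ -e "$out" ] && continue
        synth "$mid" "$out" || echo "failed: $mid" >&2
    done
}
```

Because the loop skips any `.wav` that already exists, you can re-run it safely after an interruption.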

Conversion may take a long time! Don’t worry if something fails or the script gets interrupted. You can call the script as many times as you need without worrying about overwriting or losing the files you’ve already converted.


WAV to Spectrogram

A spectrogram is a visual representation of sound and its properties.

Our spectrograms will plot frequency on the y axis and time on the x axis. Brighter colors indicate more energy at a given frequency at that point in time.

Here’s another script to convert all our WAV files to spectrograms. This will take a while!
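The script itself isn’t reproduced here, but a minimal sketch using scipy and matplotlib (my choice of tools; the original may have used something else) looks like this:

```python
# Sketch: convert every WAV in a directory tree into a spectrogram PNG,
# mirroring the one-folder-per-composer layout.
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal
from scipy.io import wavfile

def wav_to_spectrogram(wav_path, out_path):
    rate, samples = wavfile.read(wav_path)
    if samples.ndim > 1:               # mix stereo down to mono
        samples = samples.mean(axis=1)
    freqs, times, spec = signal.spectrogram(samples, fs=rate)
    plt.figure(figsize=(7, 6))
    # Log scale keeps quiet frequencies visible.
    plt.pcolormesh(times, freqs, np.log(spec + 1e-10))
    plt.axis("off")                    # the model only needs the pixels
    plt.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close()
    return freqs, times, spec

def convert_all(src_dir, dst_dir):
    for wav in Path(src_dir).rglob("*.wav"):
        out = Path(dst_dir) / wav.relative_to(src_dir).with_suffix(".png")
        out.parent.mkdir(parents=True, exist_ok=True)
        wav_to_spectrogram(wav, out)
```

Mirroring the directory tree preserves the one-folder-per-composer layout that fastai’s `.from_folder` expects later.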

more waiting!

Look at these beautiful spectrograms!

Here’s a Mozart, which looks pretty orderly and Classical.


Here’s a Beethoven, looking a bit more Romantic.

And here’s early-20th century composer Charles Griffes, looking wild and Modern.

Charles Griffes

Finally, here’s Arnold Schoenberg with one of his atonal masterpieces.

Arnold Schoenberg

Moving Data to GCP

Let’s zip up all our files to make them smaller and easier to send.

Cool. The easiest way to get our data to GCP is with scp.

Haven’t used scp before? That’s ok. Here’s the one-liner to send your archive off to Google Cloud (swap in your own username and instance IP):

scp compositions.tar.gz your-user@your-instance-ip:~

You may need to move the files from your root directory in GCP to another location depending on how your files are organized.

Finally, untar (unzip) your data on the instance.
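To make the archive-and-extract cycle concrete, here’s a round trip that’s runnable end-to-end on a dummy folder. The archive name compositions.tar.gz is my assumption; use whatever you named yours:

```shell
# Stand-in data so this snippet runs end-to-end; in practice the
# compositions folder already holds your spectrograms.
mkdir -p compositions/bach
printf 'fake' > compositions/bach/example.png

# On your local machine: tar and gzip everything into one archive.
tar -czf compositions.tar.gz compositions

# On the GCP instance (after scp): extract the archive.
rm -r compositions          # simulate the fresh instance
tar -xzf compositions.tar.gz
```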


Training with Resnet 34

We’ll start with Resnet 34, a model pre-trained for image recognition.

Let’s load in our data! The remaining code snippets should go in your Jupyter notebook.

path = Path('compositions')

We’ll use fastai’s ImageDataBunch class to get training and validation sets from our data.

data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2, size=224)

.from_folder automatically looks through a directory of folders sorted by class and splits the files contained in each class into a train (training) and valid (validation) set.

We’ll also set size to 224, since 224x224 are the dimensions of the images used to train Resnet34. Setting valid_pct=0.2 means we’ll train on 80% of our data and validate on the remaining 20%.
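Under the hood, valid_pct just holds out a random fraction of the files. Here’s a toy sketch of the idea (not fastai’s actual implementation):

```python
import random

def train_valid_split(files, valid_pct=0.2, seed=42):
    """Randomly hold out valid_pct of the files for validation."""
    files = list(files)
    random.Random(seed).shuffle(files)  # deterministic with a seed
    n_valid = int(len(files) * valid_pct)
    return files[n_valid:], files[:n_valid]
```

Shuffling before slicing matters: our files are grouped by composer, so a non-random split would put whole composers in only one of the two sets.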

Sanity Check!

First, let’s peek at a few rows to make sure they look right:

data.show_batch(rows=5, figsize=(7,6))

You should see something like this. Looks promising!

Let’s check out our classes. These are the labels that help us categorize.

data.classes

149 composers = 149 labels

Here’s our 149 composers.

Cool! Let’s go ahead and create our learner and train it with 5 cycles through the data.

learn = create_cnn(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(5)

Let’s save the model so we can use it in future sessions without having to wait for training again.

learn.save('res-34')

Cool. Our model is looking not too bad, with roughly 79% accuracy. That’s a decent start.

Let’s try a few more epochs and see how we’re doing.

learn.fit_one_cycle(3)


With our accuracy up to about 81%, we’re doing pretty well. By deep learning standards, 81% is nothing outstanding.

But if we consider that we’re identifying the composer of classical music by looking at spectrograms, this is pretty cool. I’d be impressed if someone could hear a piece and correctly identify the composer 81% of the time!

learn.save('resnet-34-8-epoch')

Model Interpretation and Analysis

fastai offers the ClassificationInterpretation class to help us understand our model.

interp = ClassificationInterpretation.from_learner(learn)

First, we’ll plot our top_losses. These are the images that the model was most confused about.

interp.plot_top_losses(9, figsize=(15,11))

Based on the spectrograms alone, it’s hard to say whether these confusions are reasonable. Let’s evaluate them using what we know about musical eras.

Here are our confused composers and their lifespans:

  • Scarlatti (1685–1757)
  • Bacewicz (1909–1969)
  • Czerny (1791–1857)
  • Rossini (1792–1868)
  • Bach (1685–1750)
  • Méhul (1763–1817)
  • Mozart (1756–1791)
  • Dussek (1760–1812)
  • Paganini (1782–1840)

Scarlatti/Bacewicz: These two lived two centuries apart. Definitely a real confusion.

Czerny/Rossini: These composers were contemporaries, born one year apart. Not unreasonable.

Bach/Mehul: These composers were not contemporaries. Méhul is often considered an early Romantic and Bach was squarely in the Baroque period. Seems like confusion.

Mozart/Dussek: Contemporaries in a similar style. Like Mozart’s, Dussek’s later pieces are considered precursors to Romanticism. Not unreasonable.

Scarlatti/Paganini: Definitely confusion.

Of our top five losses, three are obvious confusions.

Next, we’ll plot a confusion matrix. A confusion matrix visualizes the labels that the model is most confused about. If you look above and below the sharp diagonal line, you’ll see how many times a pair of labels was miscategorized.

interp.plot_confusion_matrix(figsize=(30,30), dpi=60)

The confusion matrix is too large to visualize here, so you’ll have to check out the notebook for more details.

a portion of the confusion matrix

As we look across the diagonal line that indicates correct identification by our model, a pattern starts to emerge.

The highest performers were overwhelmingly the classes with the most data. Famous composers like Bach, Mozart, Mendelssohn and Scarlatti, who were well represented in our data set, are classified well. We correctly identified Bach 79 times, Chopin 23 times, and Bartok 12 times! But some of our classes have only one or two spectrograms, so we can’t expect much accuracy in those cases.

This tells us that we might be able to improve our model just by getting a lot more spectrograms, especially for the less famous composers in our list. We should also take our accuracy estimates with a grain of salt. If we removed our best represented composers from our dataset, our accuracy would plummet.
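One quick way to check this skew is to count spectrograms per composer folder. This helper isn’t from the original notebook, just a convenience sketch:

```python
# Count images per composer folder to see how unbalanced the classes are.
from collections import Counter
from pathlib import Path

def class_counts(root, pattern="*.png"):
    """Number of spectrogram images in each composer folder."""
    return Counter(p.parent.name for p in Path(root).rglob(pattern))
```

Calling `class_counts("compositions").most_common(10)` would show the best-represented composers at a glance.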

Let’s check out our most confused pairs to see what else we can learn.

interp.most_confused(min_val=5)


This seems great! Is our model only significantly confused between Satie and Mozart? Let’s lower our min_val:

interp.most_confused(min_val=2)



For a classifier with fewer labels (a dataset with fewer composers), we could set a higher minimum value. We would only be interested in seeing where our model consistently confuses two labels.

But since we have so many labels, our model might be confused about a particular composer but not confused about that composer in relation to only one other label.

As you can see in our list, Haydn is confused three times, but never more than once with any single composer. Lowering min_val reveals that the model does struggle with certain composers like Haydn.

Improving the Model with Resnet 50

We’ll follow the same process as last time, just with slightly different parameters. Since Resnet 50 is more memory intensive, we’ll need to lower our batch size (bs) to 16.

data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2, size=224, bs=16)
learn = create_cnn(data, models.resnet50, metrics=error_rate)
learn.fit_one_cycle(5)

resnet 50 results after 5 epochs

Awesome! We’re at roughly 82% accuracy. That’s already better than our Resnet34 model, and after only 5 epochs. It looks like we may be overfitting here, but let’s keep going.

Let’s see if we can’t tune it up a bit more by unfreezing the model and letting it train every layer.

learn.unfreeze()
learn.fit_one_cycle(1)

Bad idea. Let’s go back to our original model and try unfreezing again, but with a better learning rate.

learn.lr_find(start_lr=1e-12, end_lr=1e-8)

We can pick a slice with a nice downward trend to set our learning rate. From about 10^-11 to 10^-10 should do.

learn.fit_one_cycle(1, max_lr=slice(1e-11,1e-10))

Awesome! Although we’re still not near 100% accuracy, this model works pretty well given our limited data.

We learned a basic but useful workflow for creating deep learning models with fastai. Most importantly, we found that it’s possible to detect composers from spectrograms. But there are still a lot of problems with this model.

First, our model falls flat with composers that aren’t well represented in the data. Our accuracy would significantly decrease if we removed even one composer (Bach) from our dataset.

Second, we’re using a model pre-trained on ImageNet. ImageNet was trained on real objects like dogs and cars. Since we’re using spectrograms, we likely only benefit from the earliest layers in our pre-trained model. These early layers detect more abstract shapes like lines and corners, which exist in spectrograms just as they exist on the pre-trained data. But later layers, which detect features like eyes or signs, probably don’t help our model much. Perhaps we could unfreeze the entire model and train it from scratch, letting the features of spectrograms emerge over time. Or we could search for a model pre-trained on spectrograms.

We should definitely add more data by finding more MIDI files or by augmenting our existing images. We could also try this same approach with better spectrograms, like those generated by Librosa.

Thanks for reading!

If you enjoyed this article consider following on Twitter or Github.
