Learning Music Genres

Diego Aguado
5 min read · Dec 13, 2017


Can you listen to a song and name its genre? How many seconds would you need to listen?

A song’s genre is not something you can categorically classify with just one tag. You could tag a song with more than one genre and it would still make sense. More importantly, the definition of a genre is somewhat subjective. Given all that, building a single-label music classifier is somewhat counterintuitive. Nevertheless, inspired by Sander Dieleman’s awesome work and post, I decided to build a Deep Learning music classifier.

For any Deep Learning project you need data, lots of it. After doing a bit of research on music datasets that provide raw signals, I figured that the best option for my purposes was the FMA (Free Music Archive) dataset. It has a considerable number of tracks and comprehensive metadata. Special thanks to Michaël Defferrard et al. for putting it together. For this project I chose to use FMA’s small subset.

High level overview

Besides the data (songs and tags), in order to build a Deep Learning classifier you’ll need:

  • A Deep Learning architecture; I’ll base mine on Dieleman’s work.
  • A programming language: Python.
  • A library for audio analysis: librosa.
  • Deep Learning frameworks: TensorFlow + Keras.

As with most machine learning projects, this one involved training, validating, testing, visualizing and doing interesting analysis to explore the model’s behavior.

What to do with all these songs?

The FMA small dataset has 8,000 tracks, each thirty seconds long. The genres in this dataset include Hip-Hop, Electronic, Rock, Instrumental, International, Experimental, Pop and Folk. Since some of these genres sounded a bit ambiguous, I decided to trim the dataset to only consider Hip-Hop, Electronic, Rock and Instrumental. You might think this is not enough data to train an interesting model; what’s more, Sander used a million songs. But there’s a trick to perform data augmentation on this kind of dataset that I’ll describe later in the post.

In order to exploit the temporal and content information of the songs, I represented each one of them as a spectrogram. A spectrogram is a time, frequency and amplitude representation; it’s commonly used to visualize these three attributes of a signal.

Example of a track’s spectrogram

It’s also common to use a mel-spectrogram, which is a spectrogram with frequency mapped to the mel scale. The mel scale spaces frequencies according to how listeners perceive differences in pitch, so very little perceptually relevant information is lost in the transformation.

Strictly speaking, the representation I use is the log-scaled mel-spectrogram. This is where librosa comes in quite handy: it takes 3 lines of code to transform the audio signal into the log-mel spectrogram.

librosa FTW
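The original post embeds that snippet as an image; here’s a minimal sketch of what those few lines might look like, assuming a 30-second clip and librosa’s default mel parameters (the file name, sample rate and n_mels are my assumptions):

```python
import librosa
import numpy as np

# Load a 30-second clip; sample rate and mel parameters here are
# assumptions, not necessarily the ones used in the original post.
y, sr = librosa.load("track.mp3", sr=22050, duration=30.0)

# Mel-scaled power spectrogram, then convert to a log (dB) scale.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)
```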

After the songs have been transformed into our useful representation, log-mel spectrograms, the trick to perform data augmentation is to split the thirty seconds into 10 chunks of 3 seconds each. Each window of the song is tagged with the same genre as the original thirty seconds: if a song was labeled rock, then all 10 windows that come out of the split are labeled rock. This trick gives 10x more data than the original dataset, as sketched below.
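A minimal sketch of the splitting, assuming the (n_mels, time_frames) layout that librosa produces (the function name is mine, not from the original code):

```python
import numpy as np

# Hypothetical sketch of the augmentation step: split one track's
# log-mel spectrogram into 10 non-overlapping windows (~3 seconds each)
# and give every window the track's genre label.
def split_into_windows(log_mel, label, n_windows=10):
    frames_per_window = log_mel.shape[1] // n_windows
    windows, labels = [], []
    for i in range(n_windows):
        start = i * frames_per_window
        windows.append(log_mel[:, start:start + frames_per_window])
        labels.append(label)
    return np.stack(windows), np.array(labels)
```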

Model’s Architecture

The model I implemented has a similar architecture to the one described by Sander in his post:

  • Three blocks of one-dimensional convolution + max pooling.
  • A merged layer of global average pooling and global max pooling over the time dimension.
  • Two blocks of Dense layer + Dropout.
  • Finally, a Softmax layer to predict class probabilities.
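Here’s a minimal Keras sketch of that architecture; the input shape, filter counts, kernel sizes, dense widths and dropout rate are my assumptions, not necessarily the hyperparameters used in the project:

```python
from tensorflow.keras import layers, Model

# Input: a log-mel spectrogram window, transposed to (time_frames, n_mels).
# The sizes below (130 frames, 128 mel bands, filter counts) are assumptions.
inputs = layers.Input(shape=(130, 128))

x = inputs
for filters in (64, 128, 256):               # three conv + max-pooling blocks
    x = layers.Conv1D(filters, kernel_size=4, activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=2)(x)

# Merge global average and global max pooling over the time dimension.
avg = layers.GlobalAveragePooling1D()(x)
mx = layers.GlobalMaxPooling1D()(x)
x = layers.concatenate([avg, mx])

# Two dense + dropout blocks.
for units in (512, 512):
    x = layers.Dense(units, activation="relu")(x)
    x = layers.Dropout(0.5)(x)

# Softmax over the four genres.
outputs = layers.Dense(4, activation="softmax")(x)

model = Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```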

In Sander’s original work, instead of the Softmax, the last layer is described as a:

Regression trying to predict latent representations from a collaborative filtering model.

Training and Performance

Having built the model, I trained it for 10 epochs. Through the callbacks parameter, Keras lets you save your model only when a validation metric improves from one epoch to the next. This way the model you keep is the one with the best validation performance, which also helps prevent overfitting.
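For example, something along these lines (the file name and variable names are hypothetical; in older Keras versions the monitored metric is called val_acc):

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Keep only the weights that achieve the best validation accuracy.
checkpoint = ModelCheckpoint("best_model.h5",
                             monitor="val_accuracy",
                             save_best_only=True)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=10,
          callbacks=[checkpoint])
```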

Evaluating the classifier’s performance was done in two ways: single window and averaged windows. The first predicts each window’s genre on its own, while the second averages the predictions of the ten windows of each track. Here are the results of one of the models I experimented with.

+----------------+------------+----------+
| evaluation     | set        | accuracy |
+----------------+------------+----------+
| single window  | training   | 79%      |
| single window  | validation | 69%      |
| single window  | test       | 70%      |
| avg windows    | training   | 84%      |
| avg windows    | validation | 74%      |
| avg windows    | test       | 75%      |
+----------------+------------+----------+

These results make total sense if you think about them.
Averaging the predictions over all of a track’s windows is what we are really after: a single 3-second window could sound Instrumental while the next one sounds like Rock. Averaging the predicted probabilities across all windows therefore gives a better description of the entire song, and is a more accurate way to evaluate the classifier.
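The averaged-windows evaluation boils down to something like this (the array names and shapes are my assumptions):

```python
import numpy as np

# Hypothetical sketch: `window_probs` holds the predicted class
# probabilities for each of a track's 10 windows, shape (n_tracks, 10, 4),
# and `track_labels` holds the true genre index per track.
track_probs = window_probs.mean(axis=1)     # average over the 10 windows
track_preds = track_probs.argmax(axis=1)    # predicted genre per track
track_accuracy = (track_preds == track_labels).mean()
```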

Also, just in case you were curious, here’s the confusion matrix on the predictions of the training dataset.

Confusion Matrix on Training Dataset

To better understand the previous figure, here’s the genre mapping:
0 - Electronic, 1 - Hip-Hop, 2 - Instrumental and 3 - Rock.

The learning part

In Convolutional Networks, the first layers in the model will learn low level features. The closer the layer is to the output of the model, the higher level the features will be.

So it’s interesting to compare the activations, i.e. how a track is transformed after the first layer, from genre to genre.
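One way to look at those activations in Keras is to wrap the first convolutional layer in a sub-model (a sketch, assuming the `model` object defined earlier; `window` is a hypothetical input):

```python
from tensorflow.keras import Model
from tensorflow.keras.layers import Conv1D

# Build a sub-model whose output is the first Conv1D layer's activations.
first_conv = next(layer for layer in model.layers if isinstance(layer, Conv1D))
activation_model = Model(model.input, first_conv.output)

# `window` is a hypothetical log-mel window with shape (1, time_frames, n_mels).
activations = activation_model.predict(window)
```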

Hip-Hop vs Instrumental

Instrumental songs seem to evenly trigger low frequencies, while Hip-Hop triggers a breadth of frequencies, probably within the human voice spectrum.

First convolutional activations of correct classifications: Hip-Hop (left) and Instrumental (right)

Rock vs Electronic

Rock activates a wide spectrum of frequencies; my guess is that it picks up combinations of drums (low and high frequencies), guitars and vocals. Electronic, on the other hand, seems to activate specific frequencies.

First convolutional activations of correct classifications: Rock (left) and Electronic (right)

Conclusions

This project involved subject research, data cleaning and augmentation, hyperparameter tuning and analysis. While that covers a lot of ground, it only scratches the surface of what is possible when applying Deep Learning to music content analysis.

Possible alternative projects

Using this as an introduction, you could think of alternative formulations / projects regarding music content analysis:

  • Auto-encoders for clustering
  • Music style transferring
  • Automatic music composition

I hope this motivates you to start diving deeper into this exciting field. The code to replicate this analysis can be found in this repo, along with Jupyter Notebooks to explore the models and visualizations.

Happy modeling :)


Diego Aguado

On Machine Learning, Artificial Intelligence and Neuroscience. Github, Twitter: @DiegoAgher.