
Identifying the Genre of a Song with Neural Networks

This article will show you how to build a neural network that can identify the genre of a song.

Dataset: You can find the GTZAN Genre Collection at the following link: GTZAN

It has 1,000 songs spread across 10 genres, with 100 songs per genre, and each song is about 30 seconds long.

Library used: the Python library librosa, to extract features from the songs, specifically Mel-frequency cepstral coefficients (MFCCs).

MFCC values mimic human hearing, and they are commonly used in speech recognition applications as well as music genre detection. These MFCC values will be fed directly into the neural network.

Let’s understand MFCC in detail

To help you understand MFCCs, let’s use two examples: download Kick Loop 5 by Stereo Surgeon and Whistling by cmagar. One is a low bass beat and the other is higher-pitched whistling. They clearly sound different, and you can see that they look different in their MFCC values too.

Let’s go to the code. (Note that all the necessary code files for this article can be found at the GitHub link.)

The following is the list of things you need to import:

  • the librosa library
  • glob, because you’ll have to list the files in the different genre directories
  • numpy
  • matplotlib to draw the MFCC graphs
  • the Sequential model from Keras, a typical feed-forward neural network
  • the dense neural network layer, which is just a layer that has a bunch of neurons in it.

A dense layer, unlike a convolutional layer, for example, doesn’t work on 2D representations. You’ll also import Activation, which allows you to give each layer an activation function, and to_categorical, which allows you to turn class names such as rock, disco, and so forth into one-hot encodings, as follows:
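Since the original imports appear only as a screenshot, here is a minimal sketch of them, assuming the standalone Keras package used in the book (in newer setups, the same names live under tensorflow.keras):

    import glob

    import numpy as np
    import matplotlib.pyplot as plt

    import librosa
    import librosa.display  # specshow lives in the librosa.display module

    from keras.models import Sequential
    from keras.layers import Dense, Activation
    from keras.utils import to_categorical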

Next, develop a helper function to display the MFCC values.

First, load the song and then extract the MFCC values from it. Then, use specshow, the spectrogram display function from librosa.display.
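A minimal sketch of that helper, assuming the function name display_mfcc (the original code is shown only as a screenshot):

    def display_mfcc(song):
        y, sr = librosa.load(song)           # load the audio samples
        mfcc = librosa.feature.mfcc(y=y, sr=sr)  # extract the MFCC values

        plt.figure(figsize=(10, 4))
        librosa.display.specshow(mfcc)       # draw the MFCCs like a spectrogram
        plt.colorbar()
        plt.title(song)
        plt.tight_layout()
        plt.show()

Here’s the kick drum: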

Low frequency: Kick loop 5

You can see that at low frequency, the bass is very obvious and the rest of the time it’s kind of like a wash. Not many other frequencies are represented. However, if you look at the whistling, it’s pretty clear that there are higher frequencies being represented:

High frequency: Whistling

The darker the color, or closer to red, the more power is in that frequency range at that time.


Qualifying song genres

So, you can even see the kind of change in frequency with the whistles. Now, here is the frequency for disco songs:

Song type/genre: Disco

This is the frequency output:

Disco Songs

You can see the beats in the preceding output, but since the clips are only 30 seconds long, it is hard to see the individual beats. Compare this with classical, where there are not so many beats, but rather a continuous bass line such as one that would come from a cello, for example:

Song genre: Classical

Here is the frequency for hip-hop songs:

Song genre: HipHop
HipHop songs

It looks somewhat similar to disco, but if you could reliably tell the difference with your own eyes, you wouldn’t really need a neural network, because it would be a relatively simple problem. So, the fact that you can’t really tell the difference between these is the problem the neural network is there to solve.

There’s another auxiliary function here that again just loads the MFCC values, but this time prepares them for the neural network:

It also loads the MFCC values for the song, but because these values may range from about negative 250 to positive 150, they are no good for a neural network as-is. You don’t want to feed in such large and small values; you want values near negative 1 and positive 1, or from 0 to 1.

Therefore, figure out the maximum absolute value for each song, and then divide all of that song’s values by the maximum. Also, since the songs are of slightly different lengths, pick just the first 25,000 MFCC values. You have to be certain that what you feed into the neural network is always the same size, because there are only so many input neurons and you can’t change that once you’ve built the network.
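A sketch of that function, assuming the name extract_features_song and the 25,000-value cutoff described above:

    def extract_features_song(f):
        y, sr = librosa.load(f)

        # extract the MFCC values
        mfcc = librosa.feature.mfcc(y=y, sr=sr)

        # normalize the values into roughly the -1 to 1 range
        mfcc /= np.amax(np.absolute(mfcc))

        # flatten and keep exactly 25,000 values so every song
        # yields an input of the same size
        return np.ndarray.flatten(mfcc)[:25000]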


Qualifying songs to get MFCC values and class names

Next, there’s a function called generate_features_and_labels, which will go through all the different genres and all the songs in the dataset and produce the MFCC values and the class names:

Prepare a list for all the features and one for all the labels, then go through each of the 10 genres. For each genre, look at the files in that folder; the ‘genres/’ + genre + ‘/*.au’ pattern shows how the dataset is organised.

Each genre folder contains 100 songs; for each file, extract the features and put them in the list with all_features.append(features). The name of the genre for that song needs to be put in a list as well. So, in the end, all_features will have 1,000 entries and all_labels will have 1,000 entries. Each of those 1,000 feature entries will have 25,000 values, making a 1,000 x 25,000 matrix.

For the labels, at this point there is a 1,000-entry-long list, and inside are words such as blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock. This is going to be a problem, because a neural network is not going to predict a word or even letters. You need to give it a one-hot encoding, which means that each word here is going to be represented by ten binary numbers:

  • In the case of the blues, it is going to be one and then nine zeros.
  • In the case of classical, it’s going to be a zero, followed by a one, followed by eight zeros, and so forth.

First, figure out all the unique names by using the np.unique(all_labels, return_inverse=True) command to get them back as integers. Then, use to_categorical, which turns those integers into one-hot encodings, as sketched below.
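As a small standalone sketch of just this encoding step (using the 10 GTZAN genre names):

    all_labels = ['blues', 'classical', 'country', 'disco', 'hiphop',
                  'jazz', 'metal', 'pop', 'reggae', 'rock']

    # map each name to an integer 0-9, then to a one-hot vector
    label_uniq_ids, label_row_ids = np.unique(all_labels, return_inverse=True)
    onehot_labels = to_categorical(label_row_ids)

    print(onehot_labels[0])  # blues -> [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]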

What comes back has dimensions 1,000 x 10: 1,000 because there are 1,000 songs, and each of those has ten binary numbers to represent the one-hot encoding. Then, return all the features stacked into a single matrix, along with the one-hot label matrix, with return np.stack(all_features), onehot_labels. So, call that function and save the features and labels:
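Putting it all together, a sketch of generate_features_and_labels, assuming the genres/<genre>/*.au folder layout described above:

    def generate_features_and_labels():
        all_features = []
        all_labels = []

        genres = ['blues', 'classical', 'country', 'disco', 'hiphop',
                  'jazz', 'metal', 'pop', 'reggae', 'rock']
        for genre in genres:
            sound_files = glob.glob('genres/' + genre + '/*.au')
            print('Processing %d songs in %s genre...' % (len(sound_files), genre))
            for f in sound_files:
                features = extract_features_song(f)
                all_features.append(features)
                all_labels.append(genre)

        # convert the genre names to integers, then to one-hot encodings
        label_uniq_ids, label_row_ids = np.unique(all_labels, return_inverse=True)
        onehot_labels = to_categorical(label_row_ids, len(label_uniq_ids))
        return np.stack(all_features), onehot_labels

    features, labels = generate_features_and_labels()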

Just to be sure, print the shape of the features and the labels: it is 1,000 by 25,000 for the features and 1,000 by 10 for the labels. Now, split the dataset into a train and test set, using the 80% mark defined as training_split = 0.8 to perform the split.

Before that, shuffle; and before you shuffle, stack the labels with the features so that they don’t shuffle in different orders. Call np.random.shuffle(alldata) to do the shuffle, split it using splitidx = int(len(alldata) * training_split), and then you’ll have the train and test sets.

Looking at the shapes of the train and test sets, the train set has 800 rows, 80% of the 1,000, and 25,010 columns. Those aren’t really all features, though: it is actually the 25,000 features plus the 10 columns of the one-hot encoding, because you stacked those together before you shuffled. Therefore, you have to strip those back off.

You can do that with train_input = train[:, :-10]. For both the training input and the test input, take everything but the last 10 columns, and for the labels, take only the last 10 columns. Then you can see the shapes of the training input and training labels: the proper 800 by 25,000 and 800 by 10.
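A sketch of the shuffle and split, following the steps just described:

    print(np.shape(features))  # (1000, 25000)
    print(np.shape(labels))    # (1000, 10)

    training_split = 0.8

    # stack the labels onto the features so the pairs shuffle together
    alldata = np.column_stack((features, labels))

    np.random.shuffle(alldata)
    splitidx = int(len(alldata) * training_split)
    train, test = alldata[:splitidx, :], alldata[splitidx:, :]

    print(np.shape(train))  # (800, 25010)
    print(np.shape(test))   # (200, 25010)

    # strip the 10 one-hot columns back off the end
    train_input = train[:, :-10]
    train_labels = train[:, -10:]
    test_input = test[:, :-10]
    test_labels = test[:, -10:]

    print(np.shape(train_input))   # (800, 25000)
    print(np.shape(train_labels))  # (800, 10)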

Next, build the neural network:

You’ll have a sequential neural network. The first layer will be a dense layer of 100 neurons. Now, just on the first layer, it matters that you give the input dimensions or the input shape, and that’s going to be 25,000 in your case.

This says how many input values are coming per example. Those 25,000 are going to connect to the 100 in the first layer.

The first layer will compute the weighted sum of its inputs, its weights, and a bias term, and then run the relu activation function. relu turns anything less than 0 into 0, while anything higher than 0 stays as the value itself.

These 100 neurons will then connect to 10 more, and that will be the output layer. It is 10 because you have used one-hot encoding and have 10 binary numbers in that encoding.

The softmax activation used in the code takes the 10 outputs and normalizes them so that they add up to 1. That way, they end up being probabilities. Take the highest-scoring, or highest-probability, output of the 10 as the prediction; it corresponds directly to the position of the highest number. For example, if it is in position 4, that would be disco.
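A sketch of that architecture in Keras:

    model = Sequential([
        Dense(100, input_dim=np.shape(train_input)[1]),  # 25,000 inputs -> 100 neurons
        Activation('relu'),
        Dense(10),                 # one output neuron per genre
        Activation('softmax'),     # normalize the 10 outputs into probabilities
    ])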

Next, compile the model: choose an optimizer such as Adam and define the loss function. Any time you have multiple output classes, you probably want categorical cross-entropy, and the accuracy metric lets you see the accuracy during training and evaluation, in addition to the loss, which is always shown; accuracy usually makes more intuitive sense. Next, print model.summary(), which tells you details about the layers:
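A sketch of the compile step, with Adam and categorical cross-entropy as described:

    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    model.summary()  # prints a table of layers and parameter counts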

The output shape of the first 100-neuron layer is 100 values because there are 100 neurons, and the output of the second dense layer is 10 because there are 10 neurons. So, why are there 2.5 million parameters, or weights, in the first layer? That’s because you have 25,000 inputs.

You have 25,000 inputs, and each one of those goes to each of the 100 dense neurons. So that’s 2.5 million, plus 100, because each of the 100 neurons has its own bias term, its own bias weight, and that needs to be learned as well.
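A quick sanity check of that arithmetic:

    # 25,000 inputs x 100 neurons, plus one bias weight per neuron
    print(25000 * 100 + 100)  # 2500100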

Overall, you have about 2.5 million parameters or weights. Next, run the fit. It takes the training input and training labels, and the number of epochs that you want; you want 10, so that’s 10 repeats over the training input. It takes a batch size telling it how many songs to go through before updating the weights, and a validation_split of 0.2, which says to take 20% of the training input, split it off, not actually train on it, and use it to evaluate how well the model is doing after every epoch. The model never trains on the validation split, but it lets you watch the progress as training goes.
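A sketch of the fit call; the batch size of 32 is an assumption, since the original value is only visible in the screenshot:

    model.fit(train_input, train_labels,
              epochs=10,              # 10 passes over the training input
              batch_size=32,          # songs to process before each weight update
              validation_split=0.2)   # hold out 20% of the training data for monitoring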

Finally, because you separated the training and test data ahead of time, run an evaluation on the test data and print its loss and accuracy.
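A sketch of the evaluation step:

    loss, acc = model.evaluate(test_input, test_labels)
    print('Loss: %.4f' % loss)
    print('Accuracy: %.4f' % acc)

Here are the training results: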

It printed the loss and the accuracy as it went. This is on the training set itself, not the validation set, so it should get pretty close to 1.0. You actually probably don’t want it to get close to 1.0, because that could represent overfitting; but if you let it run long enough, it often does reach 1.0 accuracy on the training set, because it’s memorising the training set.

What you really care about is the validation accuracy, because that is measured on data the network has never trained on, and indeed it ends up relatively close to the final accuracy, which is measured on the test data that you separated ahead of time. Now you’re getting an accuracy of around 53%. That seems relatively low until you realize that there are 10 different genres: random guessing would give you 10% accuracy, so it’s a lot better than random guessing.

If you found this article useful, you can explore Dr. Joshua Eckroth’s Python Artificial Intelligence Projects for Beginners to build smart applications by implementing real-world artificial intelligence projects. This book demonstrates AI projects in Python, covering modern techniques that make up the world of artificial intelligence.


For more updates, you can follow me on Twitter at @NavRudraSambyal.

Thanks for reading! Please share the article if you found it useful.