Music Genre Classification Using a Feed-Forward Neural Network (Using PyTorch)

Pankaj Kumar
8 min read · Jun 11, 2020


Music classification is a tough task in deep learning because a single audio clip carries many features, such as amplitude, spectrum, zero-crossing rate, spectral bandwidth, spectral centroid, and so on. But don’t worry, I am not going to use any of those features in this post. Instead, I am using one of the simplest methods to classify music genres. The idea is to turn genre classification into image classification. But how? By converting each track into a spectrogram image.

Before going to start, let’s discuss the flow of the post…

1. About Data Set

2. Data Visualisation And Cleaning

3. Data Preparation

4. Defining Feed Forward Neural Network Model

5. Load Model And Data Into GPU

6. Train The Model

7. Test Accuracy And Loss

8. Finally, Save The Data And Code To The Jovian Platform

About DataSet

This data set is very popular and was used in a well-known genre classification paper, “Musical genre classification of audio signals” by G. Tzanetakis and P. Cook, IEEE Transactions on Speech and Audio Processing, 2002.

You can download the dataset from here. The dataset consists of 1000 audio tracks each 30 seconds long. It contains 10 genres, each represented by 100 tracks. The tracks are all 22050 Hz monophonic 16-bit audio files in .wav format.

The 10 genres are:

  • Blues
  • Classical
  • Country
  • Disco
  • Hip-hop
  • Jazz
  • Metal
  • Pop
  • Reggae
  • Rock

Each genre contains 100 songs. Total dataset: 1000 songs.

You can find the whole code and data in my Kaggle-hosted notebook.

!mkdir genres && wget http://opihi.cs.uvic.ca/sound/genres.tar.gz && tar -xf genres.tar.gz genres/

mkdir genres creates the directory, wget downloads the archive, and tar -xf genres.tar.gz genres/ extracts the data into the genres directory. Alternatively, you can download the archive manually and upload it to your platform, as I did.

Step 1: Convert Music To Image And Store Into Different Folder

The following code converts each track into its spectrogram image and stores it in the directory for its genre. For this, I am using these libraries.

import librosa
import librosa.display
from PIL import Image
import matplotlib.pyplot as plt
from skimage.io import imread, imsave

Let’s visualise one track. This wave plot shows a clip from the blues genre.

Let’s play.

Now you can see the spectrogram image of the same sound, which has been converted and stored in a separate folder.
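The conversion itself is not shown in this post, so here is a minimal sketch of the idea. Note the assumptions: it swaps in torch.stft instead of librosa so that it only depends on PyTorch, it uses a synthetic 440 Hz tone instead of a real track, and the function name and the n_fft and hop_length values are hypothetical choices, not the ones from my notebook.

```python
import math
import torch

def waveform_to_spectrogram_image(waveform, n_fft=2048, hop_length=512):
    """Turn a 1-D waveform tensor into a 2-D uint8 spectrogram array.

    Hypothetical helper: name and default parameters are illustrative only.
    """
    window = torch.hann_window(n_fft)
    # Short-time Fourier transform: rows are frequency bins, columns are frames.
    stft = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)
    magnitude = stft.abs()
    # Convert to decibels; the small constant avoids log10(0).
    db = 20.0 * torch.log10(magnitude + 1e-6)
    # Rescale to the 0..255 range used by 8-bit grayscale images.
    db = db - db.min()
    scaled = db / db.max().clamp(min=1e-12) * 255.0
    # Flip vertically so that low frequencies end up at the bottom of the image.
    return scaled.flip(0).to(torch.uint8)

# One second of a synthetic 440 Hz tone at the data set's 22050 Hz sample rate.
sample_rate = 22050
t = torch.arange(sample_rate) / sample_rate
waveform = torch.sin(2 * math.pi * 440.0 * t)

image = waveform_to_spectrogram_image(waveform)
print(image.shape, image.dtype)
```

The resulting array could then be written to the genre's folder, for example with PIL's Image.fromarray(image.numpy()).save(path), where path is wherever you keep the images for that genre.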

Step 2: Create DataSet From Image Folder

Before creating the data set, we need to discuss image augmentation. Image augmentation is a method of artificially creating training images by manipulating image pixels through shifts, random rotations, shears, flips, etc. For the augmentation, I use transforms.Compose with the following libraries.

import torch
import torchaudio
import torchvision
import torch.nn as nn
import torch.nn.functional as F
from torchvision.transforms import ToTensor,transforms
from torchvision.utils import make_grid
from torch.utils.data.dataloader import DataLoader
from torch.utils.data import random_split
from torch.utils.data.sampler import SubsetRandomSampler

Now it is time to create the data set from the image folder using ImageFolder() from the torchvision library.

Its arguments are root, the path of the image folder, and transform, which augments the images and converts them into tensors.

After transformation

Some sound images before transformation

Step 3: Create DataLoader

Now it is time to cut the vegetables and mix. After creating the data set, it is time to create a data loader, which loads the data in batches. Here I am using a batch size of 32. But before creating the data loaders, we need to split the data set into two parts, a training set and a validation set, using the random_split function from torch.utils.data.
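The split and the loaders can be sketched like this. To keep the example self-contained it uses a small random TensorDataset as a stand-in for the image data set, and the 80/20 split is an assumed ratio, not necessarily the one from my notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in data set: 100 fake 3-channel "images" with labels for 10 genres.
images = torch.randn(100, 3, 8, 8)
labels = torch.randint(0, 10, (100,))
dataset = TensorDataset(images, labels)

# Assumed split: 20 samples for validation, the rest for training.
val_size = 20
train_size = len(dataset) - val_size
train_ds, val_ds = random_split(dataset, [train_size, val_size])

batch_size = 32
# Shuffle only the training data; validation order does not matter.
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=batch_size)

xb, yb = next(iter(train_loader))
print(xb.shape, yb.shape)
```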

Let’s test the mixture. This is one batch from the data loader. Look at the image shape:

images.shape: torch.Size([32, 3, 299, 299])

32 is the batch size, 3 is the number of color channels, and 299 × 299 is the image size in pixels

Step 4: Defining Model AND Supportive Functions

Now time to define our magic 😏

This is the accuracy function for the model. It simply compares the predicted labels with the actual labels and returns the fraction that match.
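A minimal sketch of such an accuracy function, assuming the model outputs one score per class and the prediction is the class with the highest score:

```python
import torch

def accuracy(outputs, labels):
    # The predicted class is the index of the largest score in each row.
    _, preds = torch.max(outputs, dim=1)
    # Fraction of predictions that match the true labels.
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))

# Four samples, two classes: the third sample is predicted wrongly.
outputs = torch.tensor([[2.0, 0.1], [0.3, 1.5], [0.9, 0.2], [0.1, 0.8]])
labels = torch.tensor([0, 1, 1, 1])
print(accuracy(outputs, labels))  # 3 of 4 correct
```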

This model is very simple. Before going into the model description, here is a single neuron, which consists of inputs, weights, a non-linear activation function, and an output.

Using such neurons, we build a dense network that consists of an input layer, a few hidden layers, and an output layer.

Our model is defined in the same way:

class ClassifyMusic(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.linear1 = nn.Linear(input_size, 1024)
        self.linear2 = nn.Linear(1024, 512)
        self.linear3 = nn.Linear(512, 128)
        self.linear4 = nn.Linear(128, 32)
        self.linear5 = nn.Linear(32, output_size)

The model consists of 5 linear layers:

layer 1: input layer taking the flattened image (3 × 299 × 299 = 268,203 values for the batch shape shown above) and producing 1024 outputs

layer 2: 1st hidden layer with 1024 inputs and 512 outputs

layer 3: 2nd hidden layer with 512 inputs and 128 outputs

layer 4: 3rd hidden layer with 128 inputs and 32 outputs

layer 5: output layer with 32 inputs and 10 final outputs, one per genre

    def forward(self, xb):
        out = xb.view(xb.size(0), -1)
        out = self.linear1(out)
        out = F.relu(out)
        out = self.linear2(out)
        out = F.relu(out)
        out = self.linear3(out)
        out = F.relu(out)
        out = self.linear4(out)
        out = F.relu(out)
        out = self.linear5(out)
        return out

The forward() method flattens each image in the input batch and passes it into the first layer. That output goes through the activation function (here I am using ReLU), then into the next layer, and so on until the last layer produces the final output.

ReLU stands for the rectified linear unit and is a type of activation function. Mathematically, it is defined as y = max(0, x). Visually, it looks like the following:
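The definition is easy to check numerically. This small example applies F.relu to a few values and compares it with max(0, x) written out with torch.clamp; the input values are arbitrary.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

# ReLU keeps positive values and replaces negative ones with zero.
y = F.relu(x)

# The same thing written directly as max(0, x).
y_manual = torch.clamp(x, min=0)

print(y)
print(torch.equal(y, y_manual))
```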

The next function is training_step:

    def training_step(self, batch):
        images, labels = batch
        out = self(images)
        loss = F.cross_entropy(out, labels)
        return loss

This function has 3 main lines of code. The first takes the input batch and separates the images from their correct labels. The second passes the images into the forward method, which generates the predictions. Those predictions are not accurate at first, so the third line calculates the loss, for which I am using the cross-entropy function.

For one sample, this is nothing but the negative sum, over all classes, of the actual (one-hot) label times the natural log of the predicted probability.

What makes it very interesting is that it is also differentiable, which is what Stochastic Gradient Descent (SGD) needs.
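To make the formula concrete, this sketch computes the cross-entropy by hand and compares it with F.cross_entropy. The logits and labels are made-up numbers; the one thing to keep in mind is that F.cross_entropy expects raw scores and applies softmax itself, then averages over the batch.

```python
import torch
import torch.nn.functional as F

# Made-up raw scores (logits) for 2 samples and 3 classes.
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 0.3]])
labels = torch.tensor([0, 1])

# Natural log of the predicted probabilities.
log_probs = F.log_softmax(logits, dim=1)

# One-hot encoding of the actual labels.
one_hot = F.one_hot(labels, num_classes=3).float()

# Negative sum of (actual * log predicted) per sample, averaged over the batch.
manual = -(one_hot * log_probs).sum(dim=1).mean()

builtin = F.cross_entropy(logits, labels)
print(manual.item(), builtin.item())
```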

Credit: data-camp

Stochastic Gradient Descent (SGD)

After training_step, next is validation_step:

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)                   # Generate predictions
        loss = F.cross_entropy(out, labels)  # Calculate loss
        acc = accuracy(out, labels)          # Calculate accuracy
        return {'val_loss': loss.detach(), 'val_acc': acc}

This function does the same as training_step, but it returns both the loss and the accuracy, each averaged over the batch.

Finally, here is one view of my model:

Let’s define the evaluate and fit functions, which evaluate the model and fit its weights.

def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

This function takes the validation loader, passes it batch by batch into validation_step, and combines the outputs into the average loss and accuracy.
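The validation_epoch_end method called by evaluate is not shown in this post. A minimal sketch of what it could look like, written here as a plain function and assuming it only needs to average the per-batch dictionaries returned by validation_step:

```python
import torch

def validation_epoch_end(outputs):
    # Assumed behaviour: average the per-batch losses and accuracies.
    batch_losses = [x['val_loss'] for x in outputs]
    epoch_loss = torch.stack(batch_losses).mean()
    batch_accs = [x['val_acc'] for x in outputs]
    epoch_acc = torch.stack(batch_accs).mean()
    return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item()}

# Two made-up batch results in the format returned by validation_step.
outputs = [
    {'val_loss': torch.tensor(2.0), 'val_acc': torch.tensor(0.25)},
    {'val_loss': torch.tensor(1.0), 'val_acc': torch.tensor(0.75)},
]
result = validation_epoch_end(outputs)
print(result)
```

Inside the model class it would take self as its first argument.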

Now it is time to define the fit function; this is the actual gameplay. The fit function needs the following arguments:

epochs: how many times to run over the training data
lr: learning rate for Stochastic Gradient Descent (SGD)
model: the model to train
train_loader: training data loader
val_loader: validation data loader
opt_func=torch.optim.SGD: the optimizer, Stochastic Gradient Descent (SGD) by default

In the definition, the history list stores the result of each epoch, and the optimizer handles SGD; it is initialized with the model parameters (nothing but the weights and biases) and the learning rate.

For each epoch, go batch by batch: calculate the loss, backpropagate it with loss.backward(), let the optimizer adjust the weights and biases, and reset the optimizer's gradients. Finally, evaluate the result on the validation data.
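The description above can be sketched as follows. This is not the exact code from my notebook: TinyModel and evaluate_loss are hypothetical stand-ins (a one-layer model and a loss-only evaluation) so that the example runs on its own with random data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

class TinyModel(nn.Module):
    """Hypothetical one-layer stand-in that exposes training_step."""
    def __init__(self, input_size, output_size):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, xb):
        return self.linear(xb.view(xb.size(0), -1))

    def training_step(self, batch):
        images, labels = batch
        return F.cross_entropy(self(images), labels)

def evaluate_loss(model, val_loader):
    # Simplified stand-in for evaluate(): average loss only, no gradients.
    with torch.no_grad():
        losses = [model.training_step(batch) for batch in val_loader]
    return {'val_loss': torch.stack(losses).mean().item()}

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        for batch in train_loader:
            loss = model.training_step(batch)  # forward pass and loss
            loss.backward()                    # backpropagation
            optimizer.step()                   # adjust weights and biases
            optimizer.zero_grad()              # reset gradients for next batch
        history.append(evaluate_loss(model, val_loader))
    return history

torch.manual_seed(0)
x = torch.randn(64, 3, 4, 4)
y = torch.randint(0, 10, (64,))
loader = DataLoader(TensorDataset(x, y), batch_size=32)

model = TinyModel(3 * 4 * 4, 10)
history = fit(5, 0.1, model, loader, loader)
print(history)
```

The same loader is used for training and validation here only to keep the example short.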

Step 6: Load Data And Model Into GPU

OK! This section is optional. In this section I am loading my model and data onto the GPU, because a GPU is more powerful for this work and is specially designed for matrix calculation and manipulation.

Is a GPU available? In PyTorch, torch.cuda.is_available() is the function for testing this. Here CUDA is NVIDIA's parallel computing platform and programming interface.

This function simply sets the device to the GPU if one is available, and to the CPU otherwise.

Load the data loaders onto the GPU using this function:

train_loader = DeviceDataLoader(train_loader, device)
val_loader = DeviceDataLoader(val_loader, device)
test_loader = DeviceDataLoader(test_loader, device)

Load the ClassifyMusic() model into the GPU

model = to_device(ClassifyMusic(input_size,output_size), device)
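The helper definitions are not shown in this post. A minimal sketch of what to_device and DeviceDataLoader could look like is below; get_default_device is a hypothetical name for the device-selection function described above, and the random data is only there to make the example runnable.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def get_default_device():
    # Pick the GPU if one is available, otherwise fall back to the CPU.
    return torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

def to_device(data, device):
    # Move a tensor, a model, or a list/tuple of tensors to the device.
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)

class DeviceDataLoader:
    """Wrap a data loader so every batch is moved to the device when it is read."""
    def __init__(self, dl, device):
        self.dl = dl
        self.device = device

    def __iter__(self):
        for batch in self.dl:
            yield to_device(batch, self.device)

    def __len__(self):
        return len(self.dl)

device = get_default_device()
data = TensorDataset(torch.randn(8, 3), torch.randint(0, 2, (8,)))
loader = DeviceDataLoader(DataLoader(data, batch_size=4), device)

xb, yb = next(iter(loader))
print(device, xb.device, len(loader))
```

Wrapping the loader this way moves one batch at a time, so the whole data set never has to fit in GPU memory.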

Step 7: Time To Play Movie

Now it is time to train the model. Before training, test the initial accuracy and loss, and then call the fit function to adjust the weights.

After fitting the model with different numbers of epochs and learning rates, I achieved 33% accuracy with a loss of 1.8341.

Step 8: Test Accuracy And Loss

On the test data, I achieved 25% accuracy. I know this is very low, but this is a very, very simple model for such a complex task. I will try the same data set with a much more complex model.

I ran this model 3 to 4 times, changing the hyperparameters, and was able to achieve 35% accuracy on the test data. You can see the other results as well.

Thanks to Jovian for building the tool that made this tracking possible; it handles machine learning projects very effectively.

You can find my Jovian profile

Kaggle NoteBook


Pankaj Kumar

Full Stack Developer With Machine Learning Enthusiast.