Music Genre Classification Using Feed Forward Neural Network (Using PyTorch)
Music classification is a tough task in deep learning because a single audio clip carries many features, such as amplitude, spectrum, zero-crossing rate, spectral bandwidth, and spectral centroid. Don’t worry, I am not going to use those features in this post. Instead, I use one of the simplest methods to classify music genres: convert each track into an image, so that genre classification becomes image classification. But how?
Before going to start, let’s discuss the flow of the post…
1. About Data Set
2. Data Visualisation And Cleaning
3. Data Preparation
4. Defining Feed Forward Neural Network Model
5. Load Model And Data Into GPU
6. Train The Model
7. Test Accuracy And Loss
8. Finally, Save The Data And Code To The Jovian Platform
About DataSet
This data set is very popular and was used in the well-known genre classification paper “Musical genre classification of audio signals” by G. Tzanetakis and P. Cook, IEEE Transactions on Speech and Audio Processing, 2002.
You can download the dataset from here. The dataset consists of 1000 audio tracks each 30 seconds long. It contains 10 genres, each represented by 100 tracks. The tracks are all 22050 Hz monophonic 16-bit audio files in .wav format.
The dataset consists of 10 genres:
- Blues
- Classical
- Country
- Disco
- Hip-hop
- Jazz
- Metal
- Pop
- Reggae
- Rock
Each genre contains 100 songs. Total dataset: 1000 songs.
You can find the whole code and data in my Kaggle hosted NoteBook
!mkdir genres && wget http://opihi.cs.uvic.ca/sound/genres.tar.gz && tar -xf genres.tar.gz genres/
Here mkdir genres creates a directory, wget downloads the archive, and tar -xf genres.tar.gz genres/ extracts the genres folder from it. Alternatively, you can download the data manually and upload it to your platform, as I did.
Step 1: Convert Music To Image And Store Into Different Folder
The following code converts each track into its spectrogram image and stores it in the directory of its genre. For this, I am using these libraries.
import librosa
import librosa.display
from PIL import Image
import matplotlib.pyplot as plt
from skimage.io import imread, imsave
Let’s visualise one track. This wave plot shows a clip from the blues genre.
Let’s play.
Now you can see the spectrogram image of the same sound, which is converted and stored in a separate folder.
Step 2: Create DataSet From Image Folder
Before creating the data set, we need to discuss image augmentation. Image augmentation is an artificial method of creating training images by manipulating image pixels with shifts, random rotations, shears, flips, and so on. For the augmentation, I use transforms.Compose, with the following libraries:
import torch
import torchaudio
import torchvision
import torch.nn as nn
import torch.nn.functional as F
from torchvision.transforms import ToTensor,transforms
from torchvision.utils import make_grid
from torch.utils.data.dataloader import DataLoader
from torch.utils.data import random_split
from torch.utils.data.sampler import SubsetRandomSampler
Now it is time to create the dataset from the image folder using ImageFolder() from the torchvision library.
Its arguments are root, the path of the image folder, and transform, which applies the augmentation and converts the images into tensors.
After transformation
Some sound images before transformation
Step 3: Create DataLoader
Now it is time to cut the vegetables and mix. After creating the dataset, it is time to create a data loader, which loads the data in batches. Here I am using a batch size of 32. Before creating the data loader, we need to split the data set into two parts, a training data set and a validation data set, using the random_split method from the torch.utils.data library.
Let’s test the mixture. This is one batch from the data loader. Look at the image shape:
images.shape: torch.Size([32, 3, 299, 299])
32 is the batch size, 3 is the color dimension, and 299x299 is the image dimension
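The split and the loaders can be sketched as follows. This is a hedged example on random placeholder tensors rather than the real spectrograms; the 80/20 split, the small 8x8 image size, and the doubled validation batch size are assumptions made only to keep the example light.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in dataset: 100 random "images" with random genre labels (placeholder data only).
images = torch.rand(100, 3, 8, 8)
labels = torch.randint(0, 10, (100,))
dataset = TensorDataset(images, labels)

# Hold out part of the data for validation; the 80/20 split is an assumption.
val_size = 20
train_size = len(dataset) - val_size
train_ds, val_ds = random_split(dataset, [train_size, val_size])

batch_size = 32
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size * 2)

# Inspect one batch: the first dimension is the batch size.
for xb, yb in train_loader:
    print("images.shape:", xb.shape)
    break
```

Shuffling is turned on only for the training loader, since the order of validation batches does not affect the averaged metrics.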
Step 4: Defining Model AND Supportive Functions
Now time to define our magic 😏
This is the accuracy function for calculating the accuracy of the model. It simply compares the predicted output with the actual output and computes the fraction of matches.
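The accuracy function itself is not shown in the text, so here is a minimal sketch of a function that matches the description. It assumes the prediction is the class with the highest score in each row.

```python
import torch


def accuracy(outputs, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    _, preds = torch.max(outputs, dim=1)  # index of the largest score per row
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))


# Three samples, two classes: predictions are [1, 0, 1], labels are [1, 0, 0].
outputs = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
labels = torch.tensor([1, 0, 0])
acc = accuracy(outputs, labels)  # two of three predictions are correct
```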
This model is very simple. Before the model description, look at a single neuron, which consists of inputs, weights, a non-linear activation function, and an output.
Using this neuron, we build a dense network that consists of an input layer, a few hidden layers, and an output layer.
Similarly defined in our model…
class ClassifyMusic(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.linear1 = nn.Linear(input_size, 1024)
        self.linear2 = nn.Linear(1024, 512)
        self.linear3 = nn.Linear(512, 128)
        self.linear4 = nn.Linear(128, 32)
        self.linear5 = nn.Linear(32, output_size)
The model consists of 5 layers:
layer 1: input layer that takes the flattened image (3x299x299 values) and produces 1024 outputs
layer 2: 1st hidden layer with 1024 inputs and 512 outputs
layer 3: 2nd hidden layer with 512 inputs and 128 outputs
layer 4: 3rd hidden layer with 128 inputs and 32 outputs
layer 5: output layer with 32 inputs and 10 final outputs
    def forward(self, xb):
        out = xb.view(xb.size(0), -1)
        out = self.linear1(out)
        out = F.relu(out)
        out = self.linear2(out)
        out = F.relu(out)
        out = self.linear3(out)
        out = F.relu(out)
        out = self.linear4(out)
        out = F.relu(out)
        out = self.linear5(out)
        return out
The forward() method reshapes the input batch into flat vectors and passes it to the first layer. The output of each layer goes through the activation function, here ReLU, and then into the next layer, until the last layer generates the final output.
ReLU stands for the rectified linear unit and is a type of activation function. Mathematically, it is defined as y = max(0, x). Visually, it looks like the following:
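A tiny example of the same definition, using the F.relu call from the model on a few hand-picked values:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
y = F.relu(x)      # negative entries become 0, the rest pass through unchanged
print(y.tolist())  # [0.0, 0.0, 0.0, 1.5, 3.0]
```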
The next function is training_step:
    def training_step(self, batch):
        image, labels = batch
        out = self(image)
        loss = F.cross_entropy(out, labels)
        return loss
This function has three main lines of code. The first line takes the input batch and separates the images from their correct labels. The images are then passed to the forward method, which generates the predictions. Those predictions are not accurate at first, so we need to calculate the loss, and for that I am using the cross-entropy function.
Cross-entropy is simply the negative sum, over all classes, of the actual label multiplied by the natural log of the predicted probability.
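To make the cross-entropy formula concrete, here is a small sketch that computes it by hand and compares it with F.cross_entropy. The logits and labels are made-up values for illustration; note that F.cross_entropy applies the softmax internally and averages over the batch.

```python
import torch
import torch.nn.functional as F

# Raw model outputs (logits) for 2 samples and 3 classes, with their true labels.
logits = torch.tensor([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
labels = torch.tensor([0, 1])

# Manual version: softmax -> one-hot targets -> negative sum of target * log(probability).
probs = F.softmax(logits, dim=1)
one_hot = F.one_hot(labels, num_classes=3).float()
manual_loss = -(one_hot * torch.log(probs)).sum(dim=1).mean()

# Built-in version, as used in training_step.
builtin_loss = F.cross_entropy(logits, labels)

print(manual_loss.item(), builtin_loss.item())  # the two values agree
```

Because the target is one-hot, only the log-probability of the correct class contributes to each sample's loss.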
Conveniently, this loss is also differentiable, which is what Stochastic Gradient Descent (SGD) needs.
Stochastic Gradient Descent (SGD)
After training_step, next is validation_step:
    def validation_step(self, batch):
        images, labels = batch
        out = self(images)                   # Generate predictions
        loss = F.cross_entropy(out, labels)  # Calculate loss
        acc = accuracy(out, labels)          # Calculate accuracy
        return {'val_loss': loss.detach(), 'val_acc': acc}
This function does the same as training_step, but it returns both the loss and the accuracy of the batch.
Finally, one view of my model is this
Let’s define the evaluate and fit functions, which evaluate the model and fit its weights.
def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)
This function takes the validation loader, passes it batch by batch into validation_step, and combines the outputs into the average accuracy and loss.
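evaluate calls validation_epoch_end, which is not shown in the text. A plausible sketch, assuming it simply averages the per-batch dictionaries returned by validation_step, is given below. It is written as a plain function so it runs on its own; inside the model class it would take self as its first argument.

```python
import torch


def validation_epoch_end(outputs):
    """Average the per-batch results returned by validation_step."""
    batch_losses = [x['val_loss'] for x in outputs]
    batch_accs = [x['val_acc'] for x in outputs]
    epoch_loss = torch.stack(batch_losses).mean()
    epoch_acc = torch.stack(batch_accs).mean()
    return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item()}


# Two fake batch results, shaped like the dictionaries validation_step returns.
outputs = [
    {'val_loss': torch.tensor(2.0), 'val_acc': torch.tensor(0.25)},
    {'val_loss': torch.tensor(1.0), 'val_acc': torch.tensor(0.75)},
]
result = validation_epoch_end(outputs)  # {'val_loss': 1.5, 'val_acc': 0.5}
```

Averaging batch means gives a smaller final batch the same weight as a full one, which is usually close enough for monitoring.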
Now it is time to define the fit function, which is where the actual gameplay happens. The fit function needs the following arguments:
epochs: how many times to run over the training data
lr: learning rate for Stochastic Gradient Descent (SGD)
model: the model to train
train_loader: training data loader
val_loader: validation data loader
opt_func=torch.optim.SGD: the optimizer function, Stochastic Gradient Descent (SGD) by default
In the definition, the history list stores the result of each epoch, and the optimizer handles SGD. It is initialized with the model parameters, which are nothing but the weights and biases, and with the learning rate.
Then run the epochs: calculate the loss, backpropagate it with loss.backward(), adjust the weights and biases, and reset the optimizer gradients batch by batch. Finally, evaluate the result.
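The description above can be sketched as the loop below. This is a hedged reconstruction, not necessarily the exact notebook code: TinyModel is a hypothetical one-layer stand-in for ClassifyMusic, the data is random, and evaluate is wrapped in torch.no_grad() as an extra assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


class TinyModel(nn.Module):
    """Small stand-in for ClassifyMusic so the loop can run on its own."""

    def __init__(self, input_size, output_size):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, xb):
        return self.linear(xb.view(xb.size(0), -1))

    def training_step(self, batch):
        images, labels = batch
        return F.cross_entropy(self(images), labels)

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)
        acc = (out.argmax(dim=1) == labels).float().mean()
        return {'val_loss': F.cross_entropy(out, labels).detach(), 'val_acc': acc}

    def validation_epoch_end(self, outputs):
        loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        acc = torch.stack([x['val_acc'] for x in outputs]).mean()
        return {'val_loss': loss.item(), 'val_acc': acc.item()}


def evaluate(model, val_loader):
    with torch.no_grad():  # no gradients are needed for evaluation
        outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)


def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()        # compute gradients by backpropagation
            optimizer.step()       # adjust weights and biases
            optimizer.zero_grad()  # reset gradients before the next batch
        result = evaluate(model, val_loader)
        history.append(result)
    return history


# Tiny random dataset, for demonstration only.
torch.manual_seed(0)
data = TensorDataset(torch.rand(64, 3, 4, 4), torch.randint(0, 10, (64,)))
loader = DataLoader(data, batch_size=16)
model = TinyModel(3 * 4 * 4, 10)
history = fit(2, 0.1, model, loader, loader)
```

Each entry of history is the dictionary returned by evaluate for one epoch, so it can be plotted afterwards to compare runs with different learning rates.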
Step 6: Load Data And Model Into GPU
OK! This section is optional. Here I am loading my model and data onto the GPU, because a GPU is more powerful for this work and is specially designed for matrix calculation and manipulation.
Is a GPU available? In PyTorch, torch.cuda.is_available() is the function for testing this. Here CUDA is NVIDIA’s parallel computing platform and programming interface.
This function simply sets the device to the GPU if one is available, and to the CPU otherwise.
Load the data loaders onto the GPU using this function:
train_loader = DeviceDataLoader(train_loader, device)
val_loader = DeviceDataLoader(val_loader, device)
test_loader = DeviceDataLoader(test_loader, device)
Load the ClassifyMusic() model into the GPU
model = to_device(ClassifyMusic(input_size,output_size), device)
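The helpers used in the lines above are not shown in the text. A common way to write them is sketched below: to_device and DeviceDataLoader are the names used in this post, while get_default_device is an assumed name for the device-selection function.

```python
import torch


def get_default_device():
    """Pick the GPU if one is available, otherwise fall back to the CPU."""
    if torch.cuda.is_available():
        return torch.device('cuda')
    return torch.device('cpu')


def to_device(data, device):
    """Move a tensor, a model, or a list/tuple of them to the chosen device."""
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)


class DeviceDataLoader:
    """Wrap a data loader so every batch is moved to the device on the fly."""

    def __init__(self, dl, device):
        self.dl = dl
        self.device = device

    def __iter__(self):
        for batch in self.dl:
            yield to_device(batch, self.device)

    def __len__(self):
        return len(self.dl)
```

Moving batches lazily inside __iter__ means only the current batch has to fit in GPU memory, rather than the whole dataset.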
Step 7: Time To Play Movie
Now it is time to train the model. Before training, test the initial accuracy and loss, then call the fit function to adjust the weights.
After fitting the model with different numbers of epochs and learning rates, I achieved 33% accuracy with a loss of 1.8341.
Step 8: Test Accuracy And Loss
On the test data, I achieved 25% accuracy. I know this is very low, but this is a very simple model for such a complex task. I will try the same data set with a more complex model.
So I ran this model 3 to 4 times with different hyperparameters, and I was able to achieve 35% accuracy on the test data. You can see the other results as well.
For this tracking, thanks to Jovian for building this tool, which handles machine learning projects very effectively.
You can find my Jovian profile
Kaggle NoteBook