Music Genre Classification using Transfer Learning (PyTorch)

Aryan Khatana · Published in The Startup · Jun 26, 2020
Spectrograms of different genres of music

My work is an extension of Pankaj Kumar’s work, which can be found here. Instead of a feed-forward neural network, I used a pre-trained ResNet model (transfer learning) to gain better accuracy. (Thanks a lot, Pankaj Kumar.)

You can download the dataset from here. The dataset consists of 1000 audio tracks each 30 seconds long. It contains 10 genres, each represented by 100 tracks. The tracks are all 22050 Hz monophonic 16-bit audio files in .wav format.

The 10 genres are:

  • Blues
  • Classical
  • Country
  • Disco
  • Hip-hop
  • Jazz
  • Metal
  • Pop
  • Reggae
  • Rock

Let’s start

!mkdir genres && wget http://opihi.cs.uvic.ca/sound/genres.tar.gz && tar -xf genres.tar.gz genres/

This downloads the data and unpacks it into a folder called ‘genres’.

Now, let’s import the libraries we will need:

import torch
import torchvision
import torchaudio
import random
import numpy as np
import librosa
import librosa.display
import pandas as pd
import os
from PIL import Image
import pathlib
import matplotlib.pyplot as plt
import torch.nn as nn
import torch.nn.functional as F
from torchvision.transforms import ToTensor
from torchvision.utils import make_grid
from torch.utils.data.dataloader import DataLoader
from torch.utils.data import random_split
%matplotlib inline
from tqdm.autonotebook import tqdm
import IPython.display as ipd
import torchvision.transforms as T

Now, set the path to the data directory:

data_path = '/content/genres/'

Now we will convert the music files (.wav) into spectrogram images using the Librosa library. A detailed explanation can be found here.

cmap = plt.get_cmap('inferno')
plt.figure(figsize=(8, 8))
genres = 'blues classical country disco hiphop jazz metal pop reggae rock'.split()
for g in genres:
    pathlib.Path(f'img_data/{g}').mkdir(parents=True, exist_ok=True)
    for filename in os.listdir(f'{data_path}/{g}'):
        songname = f'{data_path}/{g}/{filename}'
        y, sr = librosa.load(songname, mono=True, duration=5)
        plt.specgram(y, NFFT=2048, Fs=2, Fc=0, noverlap=128, cmap=cmap,
                     sides='default', mode='default', scale='dB')
        plt.axis('off')
        plt.savefig(f'img_data/{g}/{filename[:-3].replace(".", "")}.png')
        plt.clf()

Let’s visualize an image:

import matplotlib.image as mpimg
img_path = 'img_data'   # the folder of spectrogram images created in the previous step
img = mpimg.imread(img_path + '/blues/blues00093.png')
imgplot = plt.imshow(img)
plt.show()
print('shape of image is:', img.shape)

Now we will create a custom dataset:

batch_size = 8
image_size = 224
train_trms = T.Compose([
    T.Resize(image_size),
    T.RandomRotation(20),
    T.RandomHorizontalFlip(),
    T.ToTensor()])
val_trms = T.Compose([
    T.Resize(image_size),
    T.ToTensor()])
train_data = torchvision.datasets.ImageFolder(root=img_path, transform=train_trms)
val_data = torchvision.datasets.ImageFolder(root=img_path, transform=val_trms)

Let’s have a look at the data augmentation we just applied.

After Data Augmentation
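To see the effect yourself, here is a small sketch of my own (not from the original notebook) that draws the same spectrogram several times, so the random rotation and flip become visible:

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for ax in axes:
    img, label = train_data[0]          # the random transforms are re-applied on every access
    ax.imshow(img.permute(1, 2, 0))     # CHW -> HWC for matplotlib
    ax.set_title(train_data.classes[label])
    ax.axis('off')
plt.show()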

After this, we have to split the data into training and validation sets. For that we will use PyTorch’s DataLoader class and the random_split function.

First, we decide how much data goes to each set. In my case, I am giving 10% of the data to the validation set and 90% to the training set.

torch.manual_seed(43)
val_size = int(len(train_data)*0.1)
train_size = len(train_data) - val_size

Now split the data using random_split.

from torch.utils.data import random_split
train_ds, val_ds = random_split(train_data, [train_size,val_size])
len(train_ds), len(val_ds)
900 images end up in the training set and 100 in the validation set.

Now we will load the data using the DataLoader class.

train_dl = DataLoader(train_ds, batch_size, shuffle=True, num_workers=4, pin_memory=True)
val_dl = DataLoader(val_ds, batch_size*2, num_workers=4, pin_memory=True)

Let’s have a look at a batch of data after loading.

The shape of the data is (8, 3, 224, 224)
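If you want to check this yourself, here is a quick peek at one batch (a small snippet of my own):

images, labels = next(iter(train_dl))
print(images.shape)   # torch.Size([8, 3, 224, 224]) -> (batch, channels, height, width)
print(labels)         # tensor of 8 integer class indices in the range 0-9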

The accuracy function we will use to evaluate the model:

def accuracy(outputs, labels):
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))

Now we get to the good part: building the model. I used a pretrained ResNet34 model, which improved the accuracy considerably, along with a learning rate scheduler and gradient clipping.

The basic template for training, which extends the nn.Module class:
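The template itself appeared only as a screenshot in the original post; below is a sketch of what such a base class typically looks like in this style of notebook. The class name comes from the model definition further down, while the method bodies are my reconstruction of the usual training_step / validation_step / validation_epoch_end pattern:

class MultilabelImageClassificationBase(nn.Module):
    def training_step(self, batch):
        images, labels = batch
        out = self(images)                    # forward pass
        loss = F.cross_entropy(out, labels)   # classification loss
        return loss

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)
        loss = F.cross_entropy(out, labels)
        acc = accuracy(out, labels)
        return {'val_loss': loss.detach(), 'val_score': acc}

    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()
        batch_accs = [x['val_score'] for x in outputs]
        epoch_acc = torch.stack(batch_accs).mean()
        return {'val_loss': epoch_loss.item(), 'val_score': epoch_acc.item()}

    def epoch_end(self, epoch, result):
        print("Epoch [{}], last_lr: {:.5f}, train_loss: {:.4f}, val_loss: {:.4f}, val_score: {:.4f}".format(
            epoch, result['lrs'][-1], result['train_loss'],
            result['val_loss'], result['val_score']))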

The Model(ResNet34)

import torchvision.models as models

class Net(MultilabelImageClassificationBase):
    def __init__(self):
        super().__init__()
        # Use a pretrained model
        self.network = models.resnet34(pretrained=True)
        # Replace the last layer with a 10-class classifier head
        num_ftrs = self.network.fc.in_features
        self.network.fc = nn.Linear(num_ftrs, 10)

    def forward(self, xb):
        return self.network(xb)

    def freeze(self):
        # Freeze everything except the final layer
        for param in self.network.parameters():
            param.requires_grad = False
        for param in self.network.fc.parameters():
            param.requires_grad = True

    def unfreeze(self):
        for param in self.network.parameters():
            param.requires_grad = True

The evaluate function that will help to evaluate the model on the validation set:

@torch.no_grad()
def evaluate(model, val_loader):
    model.eval()
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

Now we will define the fit function with gradient clipping, a learning rate scheduler, and weight decay; the optimizer we will use is Adam.
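The fit function was shown only as a screenshot in the original post; here is a sketch of a typical fit_one_cycle implementation that matches the call used below (one-cycle learning rate schedule via OneCycleLR, gradient clipping with clip_grad_value_, and weight decay passed to the optimizer). Treat it as a reconstruction rather than the exact original code:

def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

def fit_one_cycle(epochs, max_lr, model, train_loader, val_loader,
                  weight_decay=0, grad_clip=None, opt_func=torch.optim.SGD):
    torch.cuda.empty_cache()
    history = []
    optimizer = opt_func(model.parameters(), max_lr, weight_decay=weight_decay)
    # One-cycle learning rate schedule over the whole training run
    sched = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr, epochs=epochs, steps_per_epoch=len(train_loader))
    for epoch in range(epochs):
        model.train()
        train_losses, lrs = [], []
        for batch in tqdm(train_loader):
            loss = model.training_step(batch)
            train_losses.append(loss)
            loss.backward()
            if grad_clip:
                # Gradient clipping
                nn.utils.clip_grad_value_(model.parameters(), grad_clip)
            optimizer.step()
            optimizer.zero_grad()
            lrs.append(get_lr(optimizer))
            sched.step()
        # Validation phase
        result = evaluate(model, val_loader)
        result['train_loss'] = torch.stack(train_losses).mean().item()
        result['lrs'] = lrs
        model.epoch_end(epoch, result)
        history.append(result)
    return history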

To use the GPU power of your machine or of Google Colab, we will use a few helper functions that pick the GPU if one is available and move the model and data loaders onto it, which makes training much faster.

Moving the model and the data loaders to the GPU:
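These helpers were also shown only as screenshots in the original post; the sketch below follows the usual course-style pattern (the names get_default_device, to_device and DeviceDataLoader are my assumption):

def get_default_device():
    # Pick the GPU if available, else the CPU
    return torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

def to_device(data, device):
    # Move tensor(s) to the chosen device
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)

class DeviceDataLoader():
    # Wrap a DataLoader so batches are moved to the device on the fly
    def __init__(self, dl, device):
        self.dl = dl
        self.device = device
    def __iter__(self):
        for b in self.dl:
            yield to_device(b, self.device)
    def __len__(self):
        return len(self.dl)

device = get_default_device()
train_dl = DeviceDataLoader(train_dl, device)
val_dl = DeviceDataLoader(val_dl, device)
model = to_device(Net(), device)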

Evaluating the model with randomly initialized weights and biases: of course, when the weights and biases are random the model performs badly, but as training progresses the model gets better and the val_score increases.

history = [evaluate(model, val_dl)]

Now, we will freeze all layers of the model except the last one, because the pretrained layers don’t need further training; only the new final layer has to be trained. For that, we call the freeze function we defined in the model.

model.freeze()

Now, we will define the hyperparameters of our model i.e. max learning rate, gradient clipping factor, weight decay, and optimizer.

epochs = 5
max_lr = 0.001
grad_clip = 0.1
weight_decay = 1e-4
opt_func = torch.optim.Adam

Let’s use the fit_one_cycle function to train the model.

%%time
history += fit_one_cycle(epochs, max_lr, model, train_dl, val_dl, grad_clip=grad_clip, weight_decay=weight_decay, opt_func=opt_func)

After these 5 epochs, we get a validation accuracy of about 48%:

val_score: 0.4821

Now train the model a little more with more epochs. In the end, we get an accuracy of 68.75%.

val_score: 0.6875

Training after unfreezing the layers:

model.unfreeze()

At the end of 100 epochs, we get a val_score of 74.11%.

val_score: 0.7411

Saving the weights and biases of the model:
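The saving step was shown only as a screenshot; a minimal sketch (the file name is my own choice):

# Save only the learned parameters, not the whole model object
torch.save(model.state_dict(), 'music-genre-resnet34.pth')

# Later, to restore the weights into a fresh model instance:
# model = to_device(Net(), device)
# model.load_state_dict(torch.load('music-genre-resnet34.pth'))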

Conclusion: After doing the course up to this point, I have learned a lot of new things. I did not know PyTorch at all at the beginning, but now I am fairly familiar with the framework and most of the deep learning concepts are clear to me. I thank Aakash sir for this opportunity, and Pankaj Kumar.

Credits: Pankaj Kumar’s notebook, Aakash Sir.

If you want to have a look at my notebook it can be found here.
