Music Genre Classification using RNN-LSTM

Premtibadiya
7 min read · Jul 16, 2020


Abstract

Every day, numerous music albums are released, and managing them manually is arduous. Recommending songs to listeners is an equally tough task. Music Information Retrieval (MIR) is one of the most impressive fields of study, and it gave me the motivation to write about music genre classification (and, of course, to make some money). This blog uses a dataset of audio signals, converts those signals into meaningful MFCC vectors, and feeds those vectors as input to an RNN-LSTM that classifies each track into a genre. The model distinguishes 10 genres: disco, reggae, rock, pop, blues, country, jazz, classical, metal and hip-hop.

Introduction

Nowadays, the booming field of "machine learning" is on its way to its pinnacle. Nearly every domain, knowingly or unknowingly, uses machine learning to improve performance: health care, medicine, security, marketing, business analytics and, most crucially, information retrieval. Music Information Retrieval (MIR) is one of the most profound techniques used today to extract useful information from music (audio signals). Artificial neural networks are among the most complex and effective techniques for solving prediction and classification problems.

In this blog, I have implemented a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM), which is somewhat more complex than a simple RNN, for music genre classification. Our target for this project is to classify audio files into 10 different genres. Such an algorithm is very useful for users searching for their favorite music and has great commercial potential. Applying machine learning to music classification is not as common as applying it to image classification. Genre classification can also be tackled with a Convolutional Neural Network (CNN), but the results were not satisfactory for our dataset because it is small, and the accuracy drops as the number of genres increases. In this project I have therefore used an RNN with LSTM instead of a CNN to classify the input audio files into 10 different genres.

The dataset contains two broad types of genre: strong and mild classes. The strong classes have high amplitude and include hip-hop, pop, reggae, metal and rock. The mild classes have low amplitude and include disco, blues, country, jazz and classical.

The remaining sections of this post are arranged as follows. Section 2 defines the problem statement of our work. Section 3 discusses the theory used. Section 4 covers the dataset and the preprocessing of the audio files. Section 5 concludes the project, and Section 6 lists the references.

Methodology

We use the GTZAN music dataset to train our system. We apply the Librosa library to extract audio features, i.e. the Mel-frequency cepstral coefficients (MFCCs), from the raw data. The extracted features are input to the Long Short-Term Memory (LSTM) neural network model for training. Our LSTM is built with Keras and TensorFlow.

Pipeline for our project

Mel frequency cepstral coefficient (MFCC)

MFCC features are commonly used for speech recognition, music genre classification and audio-signal similarity measurement. The computation of MFCCs has already been discussed in various papers, so we will focus on how to apply the MFCC data in our application. In practice, we use the Librosa library to extract the MFCCs from the audio tracks. Since the main aim of this blog is to classify audio files into their respective genres, I am not explaining the whole theory of how MFCCs are extracted, just the pipeline of the process.

Pipeline for MFCC

The figure below shows the code for extracting MFCCs using Librosa, along with the program's output array. It is a 2-D array: one dimension represents time, while the other represents the different frequencies.

Code to extract MFCC
MFCC vector of an audio file

The Long Short Term Memory Network (LSTM)

There are several approaches to music genre classification, such as Convolutional Neural Networks (CNNs). But I have tried something different: a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM). LSTM is a variant of the RNN, and the RNN itself differs from the traditional neural network style: it stores past data and uses it as context to predict the new output. However, a plain RNN fails to link information when the gap between the current state and the past state it needs information from is large. LSTM is a modified, improved variant that overcomes this problem of long-term dependencies. The network that I have used has 4 layers. The details of long-term dependencies have been discussed at length in the literature. The figure below shows the internal structure of a typical LSTM cell.

Internal cell structure of LSTM

If you want to understand what the gates and functions inside the LSTM do, you can watch this amazing video, which explains LSTMs very concisely.

Dataset and Pre-processing

Dataset

We used the GTZAN dataset from the marsyas.info website. It contains 100 samples of each of the blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae and rock genres. Each genre includes 100 soundtracks, each 30 seconds long, in .au format. We randomly split the data into training and testing sets with the model_selection module of the scikit-learn library, using a 75%-25% training-testing split with no overlap between the two sets. The training set is then further divided into training and validation sets, with the validation set taking 20% of the training data. From the waveforms shown in the figure below, we found that some genres look similar: blues, jazz and country form one group, and rock, pop and reggae another.

Waveforms of all 10 genres
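The two-stage split described above can be sketched with scikit-learn. The arrays here are random stand-ins for the real MFCC features and genre labels, just to show the shapes and proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: (tracks, frames, MFCC coefficients) and genre labels 0-9
X = np.random.rand(100, 130, 13).astype("float32")
y = np.random.randint(0, 10, size=100)

# 75%/25% train/test split, with no overlap between the two sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Hold out 20% of the training data as a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.20, random_state=42)
```

Fixing `random_state` makes the split reproducible across runs.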

Pre-processing

We cannot directly use the audio files as input for our RNN-LSTM network; the data in the GTZAN dataset must first be preprocessed, which means extracting useful features from the audio signal. Mel-frequency cepstral coefficients (MFCCs) are one way to extract useful information from the signal, because they describe the brightness of a sound and can also be used to characterize its timbre (quality). In this model, I have used the Librosa library to convert the audio files from the GTZAN dataset into MFCC features. In particular, we choose a frame size of 25 ms. Each 30-second soundtrack then has 1293 frames with 13 MFCC features each, C1 to C13. The figure below shows some examples of Mel-frequency cepstrum plots of the music signals in the database. The following discussion demonstrates how to process the data to achieve classification using an LSTM.

Experimental Results

The model is trained on the dataset for 30 epochs. Accuracy and loss are taken into consideration for evaluating the model: the model is used to predict labels on the evaluation dataset, and the predicted targets are compared with the actual answers.
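The train-and-evaluate loop can be sketched as follows. To keep the snippet self-contained it uses toy-sized random data and a single small LSTM layer; in the real experiment the inputs are the GTZAN MFCC arrays and training runs for 30 epochs:

```python
import numpy as np
from tensorflow import keras

# Toy stand-ins for the real MFCC features and genre labels
X_train = np.random.rand(32, 130, 13).astype("float32")
y_train = np.random.randint(0, 10, size=32)
X_test = np.random.rand(8, 130, 13).astype("float32")
y_test = np.random.randint(0, 10, size=8)

model = keras.Sequential([
    keras.Input(shape=(130, 13)),
    keras.layers.LSTM(32),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Train (epochs=30 in the real experiment) and evaluate on held-out data
history = model.fit(X_train, y_train, epochs=2, batch_size=8, verbose=0)
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
```

`history.history` holds the per-epoch loss and accuracy curves that the plot in this section is drawn from.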

The figure above compares the model's accuracy on the training and testing datasets. On the training dataset, the model achieved an accuracy of 68.3%, while on the test data it had an average accuracy of 59.07% at the end of 30 epochs.

The accuracy is low because the dataset is small.

There’s a saying in machine learning “Not the best algorithm, but the model with more data wins”.

Conclusion

Music genre classification plays a vital role in the music industry, since manually organizing tracks is an arduous task, and it also makes recommending similar music genres easier. RNN-LSTM is a good technique for music genre classification because it remembers the past outputs of the cells in the recurrent layer and therefore classifies music more effectively and efficiently.

In this project, I obtained a training accuracy of 68.3% and a testing accuracy of 59.07%.

Written by Prem Tibadiya

Email: premtibadiya7@gmail.com


Machine Learning enthusiast | Undergrad IT engineer | Data Science