Music Genre Classification

Ashitaboyina
9 min read · Jan 8, 2022


Team: Ashita Boyina, Rahul Dahiya, Sai Chandan, Arunim Gupta

We’re always looking for new ways to improve our musical experience. Our musical journey began with radio programmes and television channels, and later with downloading music from well-known websites for offline listening.

Streaming platforms brought that experience close to its peak, yet they still do not always give listeners a fully satisfying way to discover new music. Categorizing music into genres and styles remains a hard problem. Our goal was to address it by using machine learning techniques to classify or cluster music into genres and give users a way to discover new music based on that information. We are not attempting to fully resolve the problem, but rather to take a step toward a better experience.

We present multi-class classification models that categorize the supplied music dataset into distinct genres. We carefully experimented with 7–8 machine learning models, including hyperparameter tuning, to obtain the highest possible performance.

This article does not go into too much technical detail. For the complete code and precise details, see: https://github.com/ashcode028/Music-Genre-Classification

1. Dataset Description

The dataset we work with for this project is GTZAN (the famous GTZAN dataset, often called the MNIST of sounds), consisting of 1,000 audio tracks, each 30 seconds long. It contains ten genres, each represented by 100 tracks.

The ten genres are as follows:

  • Blues
  • Classical
  • Country
  • Disco
  • Hip-hop
  • Jazz
  • Metal
  • Pop
  • Reggae
  • Rock

Let's dive a little deeper now…

2. Audio Signal Feature Extraction

We have raw audio files and need to extract information the model can understand. So we explored the music in terms of waveforms, which can be represented in two ways: the time domain and the frequency domain.

Time domain: This representation didn't provide much information about musical quality that could be extracted and explored, apart from a visual distinction between the waveforms of different genres.

(Figure: time-domain waveform of a metal track)

We then applied the Fourier Transform to convert the signals into the frequency domain, which exposes more information such as timbre and pitch. From it we get two types of features: spectral and rhythm.
Python has a package named Librosa for music and audio analysis. For further information on this, you can refer here.
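As a quick illustration, here is a minimal sketch of extracting some of these features from a single clip with Librosa (the file path is a placeholder and the parameters are illustrative, not the project's exact settings):

```python
# Minimal sketch: extracting frequency-domain features from one clip with
# Librosa. The file path is a placeholder; any 30-second GTZAN track works.
import librosa
import numpy as np

y, sr = librosa.load("genres/metal/metal.00000.wav", duration=30.0)

# Spectral features
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)        # timbre
chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # pitch classes
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # "brightness"

# Harmonic component and a rhythm feature (tempo)
y_harmonic = librosa.effects.harmonic(y)
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)

# Each feature is computed per frame; summarising with mean/variance
# gives one fixed-length vector per track, suitable for a classifier.
feature_vector = np.hstack([mfcc.mean(axis=1), mfcc.var(axis=1),
                            chroma.mean(axis=1), centroid.mean(axis=1),
                            np.atleast_1d(tempo)])
print(feature_vector.shape)  # (54,) with the settings above
```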

Frequency domain: Some features are shown below for one genre, but we analyzed every feature for every genre.

(Figures: MFCC, spectrogram, chroma, harmonic, and rhythm plots for one genre)

2.1 Visualizations

To analyze the data, we must understand the relationships between features. So we performed PCA to visualize the data and plotted a heatmap of the correlation matrix.

Standardization: The goal is to bring the ranges of the continuous initial variables onto a uniform scale so that they all contribute equally to the analysis.

Don't forget to standardize your data before PCA. Standardization is necessary because PCA is sensitive to the variances of the initial variables. If the ranges of the starting variables differ significantly, the variables with larger ranges will dominate over those with smaller ranges (for example, a variable ranging from 0 to 100 will dominate one ranging from 0 to 1), producing biased results. Converting the data to comparable scales avoids this issue.

(Figures: PCA projection of the 3-second samples of all audio files; correlation matrix heatmap)
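For reference, here is a minimal sketch of this preprocessing step. The CSV filename and column names are assumptions modeled on the common GTZAN feature files, not necessarily our exact setup:

```python
# Sketch: standardize, then project onto two principal components.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.read_csv("features_3_sec.csv")        # assumed filename
y = df["label"]                               # genre labels
X = df.drop(columns=["label", "filename"])    # numeric features only

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)      # 2-D points to scatter-plot
print("explained variance:", pca.explained_variance_ratio_)

# Correlation matrix for the heatmap
corr = pd.DataFrame(X_scaled, columns=X.columns).corr()
```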

3. Various Classifiers

We tested ten different classifiers to discover the best one. GridSearchCV with 10-fold cross-validation was used to tune the parameters of the models below.

We used hyperparameter tuning to find the ideal parameters for the logistic regression, Random Forest, Naive Bayes, SGD, and KNN classifiers. For SVM and MLP, we modified parameters manually until we discovered the optimal ones.

In addition, to strengthen the decision tree and Random Forest baseline models, we applied different boosting approaches such as gradient boost, AdaBoost, CatBoost, and XGBoost.

The primary pipeline for all models was to create a baseline model with default parameters, then do hyperparameter tuning to improve on the baseline. We did manual HP tuning for SVM, DT, RF, and XGB.
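Here is a minimal sketch of that baseline stage, assuming the scaled features `X_scaled` and labels `y` from the preprocessing step above (the split ratio and model set are illustrative):

```python
# Sketch of the baseline stage: every classifier with default parameters,
# scored with 10-fold cross-validation.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, stratify=y, random_state=42)

baselines = {
    "LogReg": LogisticRegression(max_iter=5000),
    "SGD": SGDClassifier(),
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "SVM": SVC(),
}
for name, model in baselines.items():
    scores = cross_val_score(model, X_train, y_train, cv=10)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```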

3.1 Logistic Regression

This model uses a predictive algorithm based on the concept of probability. It works well for categorical targets (genres, in our case).

GridSearchCV was used to pass all combinations of hyperparameters into the model, and we selected the best ones. This increased accuracy from 67% to 70%.
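A sketch of what that grid search could look like, reusing the train/test split from above (the parameter grid is illustrative; the exact grid we used is in the repository):

```python
# Illustrative grid for logistic regression.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {"C": [0.01, 0.1, 1, 10, 100],
              "solver": ["lbfgs", "saga"]}
grid = GridSearchCV(LogisticRegression(max_iter=5000),
                    param_grid, cv=10, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```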

3.2 SGD Classifier

Stochastic gradient descent (SGD) is a variant of gradient descent that uses one training example per iteration. The SGD Classifier uses stochastic gradient descent to fit regularised linear models.

Our simple SGD model gave 61% accuracy, which increased to 64% after HP tuning. The gap was due to some underperforming genres such as rock, disco, and country.
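A minimal sketch of this classifier, reusing the split from the pipeline sketch above (hyperparameter values are illustrative):

```python
# SGDClassifier fits a regularised linear model one example at a time;
# with the default hinge loss it behaves like a linear SVM.
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=2000, random_state=42)
sgd.fit(X_train, y_train)
print("SGD accuracy:", sgd.score(X_test, y_test))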

3.3 Gaussian Naive Bayes

Naive Bayes classifiers are based on Bayes' Theorem, with a strong independence assumption between features: the value of one feature is assumed to be unrelated to the value of any other. It is a probabilistic classifier, making predictions based on an object's class probabilities.

This model performed poorly, with only 48% accuracy, improving to 51% after HP tuning. The poor performance was due to the independence assumption, which does not hold among our features. (The same assumption works very well for text classification and email spam detection.)
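For completeness, the corresponding sketch; Gaussian NB has essentially nothing to tune beyond variance smoothing, which is why HP tuning moved the needle so little:

```python
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB(var_smoothing=1e-9)  # the scikit-learn default
nb.fit(X_train, y_train)
print("NB accuracy:", nb.score(X_test, y_test))
```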

3.4 KNN

The K-Nearest Neighbour algorithm is a supervised learning technique and one of the most basic machine learning algorithms. It is based on the idea that data points close to each other belong to the same class. [More info]

This model gave 86% accuracy, and we reached 90% with hyperparameter tuning. Compared to the Gaussian NB model, KNN performed better because it relies on the proximity of points rather than feature independence. We also observed that after HP tuning the correlation between features decreased; some pairs even had zero correlation.
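A sketch of tuning KNN with GridSearchCV (the grid is illustrative, not the exact one from the project):

```python
# Tuning k and the distance weighting. Small k with distance weighting
# tends to help when genres form tight local clusters in feature space.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": list(range(1, 31)),
                     "weights": ["uniform", "distance"]},
                    cv=10, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```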

3.5 Decision Trees

A decision tree is a supervised machine learning technique that repeatedly splits the data according to a split criterion. Ensembling methods were used to avoid the overfitting typical of decision trees. These come in two types: bagging (parallel) and boosting (sequential).

Plain decision trees performed worse than the ensemble approaches. CatBoost outperformed all ensemble methods with an accuracy of 82%; gradient boost came close, while the remainder were all in the 50–60% range. CatBoost had a high AUC for all genres, unlike gradient boost, which had low accuracy for some genres.
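A sketch comparing these boosting approaches (catboost is a third-party package, `pip install catboost`; parameter values are illustrative):

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from catboost import CatBoostClassifier

boosters = {
    "AdaBoost": AdaBoostClassifier(n_estimators=300),
    "GradientBoost": GradientBoostingClassifier(n_estimators=300),
    "CatBoost": CatBoostClassifier(iterations=300, verbose=0),
}
for name, model in boosters.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```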

3.6 Random Forest

Random forest is a supervised learning algorithm that builds numerous decision trees and blends their predictions to generate a more accurate and reliable result.

The hyperparameters of a random forest are quite similar to those of a decision tree or a bagging classifier. Fortunately, you can use the random forest classifier class directly instead of combining a decision tree with a bagging classifier, and the corresponding regressor handles regression tasks.

While growing the trees, the random forest adds extra randomness to the model: when splitting a node, it searches for the best feature among a random subset of features rather than among all of them. The resulting diversity generally leads to a better model, as sketched below.
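```python
# The extra randomness lives in max_features: each split considers only
# a random subset of features ("sqrt" of the total here). Values are
# illustrative, not the project's exact settings.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            random_state=42)
rf.fit(X_train, y_train)
print("RF accuracy:", rf.score(X_test, y_test))
```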

In our case, it performed well for some genres and very badly for others. Even with HP tuning, accuracy didn't improve much.

We also tried XGBoost-style boosting on RF, but it reduced accuracy to 70% and hurt precision and recall to a large extent.

3.7 XGB classifiers

XGBoost improves upon the basic gradient boosting framework through systems optimization and algorithmic enhancements. It is one of the most popular libraries in the ML community [read here for more info].

This model outperformed all other boosting classifiers, with an accuracy of around 90% across genres (a minimal sketch follows the list below). Some observations we noted:

  • Less correlation with the variables
  • Every genre was always classified with 85+% accuracy.
  • Genres like classical and hip-hop even reached 100% accuracy.
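The promised sketch (xgboost is a third-party package, `pip install xgboost`; recent versions expect integer-encoded labels for multi-class targets, hence the LabelEncoder; parameter values are illustrative):

```python
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
xgb = XGBClassifier(n_estimators=1000, learning_rate=0.05)
xgb.fit(X_train, le.fit_transform(y_train))
print("XGB accuracy:", xgb.score(X_test, le.transform(y_test)))
```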

3.8 MLP

This model is an Artificial Neural Network with multiple layers and many activation neurons in each layer. The following hyperparameters were tuned manually: activation function, number of neurons, number of hidden layers, learning rate, regularisation, dropout, and so forth. The Softmax function in the output layer was kept fixed.

Performance was adjusted via manual hyperparameter tuning over the following knobs:

  • Learning Rate
  • Epochs
  • Number of hidden layers
  • Neurons per hidden layer
  • Activation function

We tried different combinations and cross-validated each one. Regularization and dropout techniques were used to decrease the variance of the model.

Before

  • Training accuracy: 0.699
  • Testing accuracy: 0.490

After

  • Training accuracy: 0.772
  • Testing accuracy: 0.607

The overfitting and weak performance of the MLP were due to the high number of features and the amount of detail in the data. A sketch of such a network follows.
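This is an assumed setup using scikit-learn's MLPClassifier with the knobs listed above; note that MLPClassifier offers L2 regularisation (alpha) and early stopping but not dropout proper, which would need a framework like Keras or PyTorch:

```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(256, 128, 64),  # 3 hidden layers
                    activation="relu",
                    learning_rate_init=1e-3,
                    alpha=1e-4,                # L2 regularisation strength
                    max_iter=500,              # upper bound on epochs
                    early_stopping=True,       # curbs overfitting
                    random_state=42)
mlp.fit(X_train, y_train)
print("train:", mlp.score(X_train, y_train), "test:", mlp.score(X_test, y_test))
```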

3.9 SVM(Linear & Polynomial)

A support vector machine (SVM) is a supervised machine learning model that, in its basic form, solves two-class classification problems. Its significant advantages are high speed and, on small datasets like ours, often better performance than neural networks.

A support vector machine only needs to find the decision boundary. From the data points it produces the hyperplane (in two dimensions, essentially a line) that optimally separates the classes. This line serves as a decision boundary: everything falling on one side is classified as one group, and everything on the other side as another. For nonlinear data, it cleverly maps the points to higher dimensions, using a kernel function to compute those transformations efficiently; the kernel can be customized to our requirements.

For this project, HP tuning on this model was done manually. SVM outperformed every model we used, giving the best accuracy of 94.24%. The kernel is the most critical parameter, as it controls how the input variables are projected. Linear, polynomial, and RBF kernels were compared using confusion matrices. Tuning was done manually by varying the following:

  • kernels = [‘linear’, ‘poly’, ‘rbf’]
  • Degree = 1 to 7
  • C = 10 to 1000
  • Gamma = 0.01 to 10

The best linear kernel reached 70% accuracy, while the best polynomial kernel gave 88%. Finally, the RBF kernel beat all other models with 94%, classifying almost every genre with about 90% accuracy. A sketch of this configuration follows.
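```python
# The winning family: RBF-kernel SVM. C and gamma are within the search
# ranges above, not necessarily the exact best values we found.
from sklearn.svm import SVC

svm = SVC(kernel="rbf", C=100, gamma=0.1)
svm.fit(X_train, y_train)
print("SVM (RBF) accuracy:", svm.score(X_test, y_test))
```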

4. Conclusions

In this project, we studied and developed a strategy for automated music genre identification based on signal features such as spectral, rhythm, and pitch patterns.

Precision was the first metric chosen in the early stages of the project. Since the genre classes were balanced, the tradeoff between precision and recall was barely observed, so we switched to accuracy as the primary metric in the later stages of the project.

Precision was higher than recall for the KNN, DT, and ensemble classifiers, while for LR, SGD, NB, MLP, and SVM, recall was higher than precision.

The most successful classifier is the SVM with RBF kernel; the Gaussian (RBF) kernel outperformed the polynomial kernel in almost all iterations. In terms of avoiding significant genre-categorization errors, ensemble techniques like AdaBoost and gradient boost surpassed the polynomial SVM classifier.

We are always looking for improvements to our code. Feel free to submit an issue or a pull request [here].
