Music Genre Classification — Part 2: SVM, Convolutional Neural Network, Convolutional Recurrent Neural Network

Young Park · Published in Geek Culture · 4 min read · Apr 19, 2021

As a continuation of my previous post, this post dives into the modeling stage of building my music genre classifier. I divided my modeling approach into two stages. In the first stage, I experimented with various classification models: Logistic Regression, KNN, Random Forest, SVM, Gradient Boosting, and XGBoost. In the second stage, I experimented with deep learning models: a Convolutional Neural Network and a Convolutional Recurrent Neural Network.

For the first stage of my modeling process, instead of diving into each classification model individually, I’ll lay out the general framework that I used and highlight a few notes worth mentioning.

Pipeline and GridSearch


As a general rule of thumb, I like to set up my models in a way that is repeatable and scalable. This is especially important because data science is never linear; it is iterative. Setting up my model so that I can change course or pivot is critical, and one of the ways I keep it flexible is by building a pipeline.

According to the official scikit-learn documentation, “the purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters”.

In simple terms, a pipeline lets you chain data transformers together and tie them to an estimator of your choice. For instance, to properly train a support vector machine classifier, I needed to scale my data first, because SVMs are sensitive to feature scale: the margin is computed from distances between data points, so features with large ranges would dominate. To combine scaling and training, I created a pipeline in which I specify the data transformer (StandardScaler) and the estimator (SVC). This is a simple case for demonstration purposes, but you can see how powerful a pipeline becomes when you have multiple data transformers to apply before training a model.

It’s important to note that the order of the steps in a pipeline matters. In this case, I want to scale my data first and then train my model, so I put StandardScaler as my first step.
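Here is a minimal sketch of such a pipeline. The step names are my own, and X_train / y_train are assumed to come from an earlier train/test split:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Order matters: the scaler runs first, then the SVM is fit on scaled data
svm_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC()),
])

# X_train / y_train are placeholders for the extracted features and genre labels
svm_pipe.fit(X_train, y_train)
```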

Another great thing about setting up a pipeline is that you can GridSearch over its hyperparameters for optimal performance!

Here is a simple GridSearch setup for demonstration purposes. What I love about GridSearch is that not only can I tune various hyperparameters, I can cross-validate at the same time!
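A sketch of what that setup might look like, reusing the svm_pipe defined above; the parameter values are illustrative, not the grid from the original experiment:

```python
from sklearn.model_selection import GridSearchCV

# Pipeline step names prefix the parameter names, e.g. "svm__C"
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
}

# cv=5 cross-validates every hyperparameter combination with 5 folds
grid = GridSearchCV(svm_pipe, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```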

Convolutional Neural Network

Using the Mel-spectrograms that I extracted from my music dataset, which are visual representations of the core elements of music (time, frequency, and magnitude), I built and trained a Convolutional Neural Network.
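For context, here is a minimal sketch of that kind of extraction using librosa; the file path and parameter values are assumptions for illustration:

```python
import librosa
import numpy as np

# "track.wav" is a hypothetical path; GTZAN-style clips run 30 seconds
y, sr = librosa.load("track.wav", duration=30)

# 128 mel bands; convert power to decibels for a log-scaled spectrogram
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)
```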

A Convolutional Neural Network is a powerful model that performs really well on spatial data such as images.

Reshaping mel-spectrogram data to be compatible with convolutional layers
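In NumPy terms, this amounts to adding a channel axis, since Conv2D expects (height, width, channels) inputs; the array name and shape here are assumptions:

```python
import numpy as np

# X: (n_samples, n_mels, n_frames) stack of mel-spectrograms (assumed shape)
X = X[..., np.newaxis]  # -> (n_samples, n_mels, n_frames, 1)
```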
CNN Topology

The topology consists of convolutional layers, each coupled with MaxPooling2D, followed by a series of Dense layers interleaved with Dropout layers to regularize the network and avoid overfitting.
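A sketch of that topology in Keras; the filter counts, layer sizes, and ten-genre output are assumptions, not the exact architecture from the original post:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 130, 1)),       # (n_mels, frames, channel), assumed
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),                      # regularize to avoid overfitting
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),   # one output per genre
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```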

Convolutional Recurrent Neural Network

As a final attempt at improving performance, I decided to combine convolutional layers with recurrent layers (GRU). The rationale behind this effort is that music is essentially sequential data over a period of time. Since recurrent layers such as GRU and LSTM have proven to work really well with sequential data, I decided to combine the two kinds of layers and measure the change in performance.

CRNN Topology

The basic topology remains the same, but it is important to reshape the convolutional output back into a (time steps, features) sequence so that the GRU can process it correctly. After the GRU, the Dense and Dropout layers remain the same.
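A sketch of how the convolutional output can be turned back into a sequence for the GRU; again, the exact sizes are assumptions:

```python
from tensorflow.keras import layers, models

n_mels, n_frames = 128, 132   # assumed input dimensions (divisible by 4)

model = models.Sequential([
    layers.Input(shape=(n_mels, n_frames, 1)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    # Put time first, then merge frequency and channels into one feature axis
    layers.Permute((2, 1, 3)),                            # (time, freq, channels)
    layers.Reshape((n_frames // 4, (n_mels // 4) * 64)),  # (time steps, features)
    layers.GRU(64),                       # reads the spectrogram as a sequence
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),
])
```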

Performance based on accuracy

In terms of overall performance, the CRNN (with dropout layers) turned out to be the best-performing model based on accuracy score: 98.6% on training data and 86.7% on testing data.

Given that current state-of-the-art accuracy scores hover between 90 and 95%, I found my CRNN model promising and worth the additional effort to continue enhancing for better performance.
