Deep learning models and music genre classification

Abhik · Published in SIGMA XI VIT · Dec 2, 2023

The chronically online are usually known for having a penchant for the obscure. The internet’s a fascinating place; I don’t blame them. But to say that the seemingly innocuous activity of farming karma and arguing with strangers on message boards is without its demerits would be naïve. I’d even go to the extent of saying that the social atrophy you receive in exchange for that fleeting dopamine hit is not a fair trade-off, but hey, that’s just me. This habit can be tough to break, especially because it often leads users to develop some pretty unusual interests. It is not the interests themselves that are left-field or off-kilter, but the depth to which they are pursued that is of academic intrigue.

Take music for instance. A cursory glance at top RYM threads would reveal a music classification system so incredibly extensive that it might’ve induced a state of terminal illness in a Victorian child. The descriptors on albums range from your typical tags of pop, hip hop, etc., to confusingly specific labels like ‘melancholic male vocals’ and ‘ominous’. While these tags are mostly community-driven, this might not necessarily be the case in the near future. What if I were to tell you that the task of music genre classification could possibly be automated by deep learning models?

Album descriptors on RYM for clipping.’s last LP

Music genre identification and classification is a task that primarily involves the categorization of audio signals into predefined musical genres. That’s a broad statement. But how exactly does this work? What are the mechanisms in place that enable such an activity? The answer lies in deep learning models. Deep learning models have gained a lot of popularity in recent times because of their ability to extract intricate features from raw audio data, which aids accurate genre classification. This audio data, however, first needs to undergo preprocessing before it can be fed into these models. For training purposes, you could use benchmark datasets such as GTZAN.

Data Preprocessing:

The available raw audio signals are usually first transformed into a feature representation so that they are compatible with the subsequent deep learning model. This conversion is typically carried out in one of two ways: spectrogram generation or wavelet generation. A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. Common feature choices here are the Short-Time Fourier Transform (STFT) spectrogram and Mel-Frequency Cepstral Coefficients (MFCCs). The generated features help capture the spectral and temporal characteristics of the audio.
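To make that concrete, here’s a minimal sketch of the preprocessing step using the librosa library. The file name, sample rate, and frame parameters are placeholder choices, not requirements:

```python
import librosa
import numpy as np

# Load one 30-second clip (GTZAN excerpts are 30 s); the file name is a placeholder.
y, sr = librosa.load("blues.00000.wav", sr=22050, duration=30.0)

# Log-scaled mel spectrogram: spectral content over time, STFT under the hood.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# MFCCs: a compact description of the spectral envelope, frame by frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(mel_db.shape, mfcc.shape)  # roughly (128, 1292) and (13, 1292)
```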

A comprehensive overview of the model

Deep Learning Models:

Convolutional Neural Networks (CNNs): CNNs are the most commonly used models for music genre classification. But what is it about CNNs that helps us achieve this odd goal of ours? What they do, and you’ll want to hold on to your chairs for this one, is treat audio spectrograms as images and employ convolutional layers to learn hierarchical features. This is typically followed by fully connected layers for classification.
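For the curious, here’s a rough sketch of such a CNN in Keras. The input shape (a 128×128 log-mel spectrogram with a single channel) and the 10 output genres (as in GTZAN) are assumptions for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Treat a log-mel spectrogram as a one-channel "image": (mel bands, time frames, 1).
def build_cnn(input_shape=(128, 128, 1), n_genres=10):
    return tf.keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_genres, activation="softmax"),  # one probability per genre
    ])

model = build_cnn()
model.summary()
```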

Recurrent Neural Networks (RNNs): RNNs, particularly Long Short-Term Memory (LSTM) networks, are a popular choice as well, primarily because of the way they capture temporal dependencies in music by processing sequential audio data. Now there’s an especially interesting feature of RNNs that I shall now tell y’all about, as a present for making it this far into the article. They’re suitable for models that consider not just individual frames but also the context of previous audio frames. I had to sit down and process this for a full minute before I found myself in a state fit for writing. Make of that what you will.
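Here’s a hedged Keras sketch of such an LSTM. Each time step is one frame of features, say the 13 MFCCs per frame from the preprocessing step; note that librosa returns arrays as (features, frames), so you’d transpose before feeding them in. The sequence length is an assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers

# One clip = a sequence of frames; 1292 frames of 13 MFCCs matches the
# preprocessing sketch above (both numbers are assumptions, not requirements).
def build_lstm(n_frames=1292, n_features=13, n_genres=10):
    return tf.keras.Sequential([
        layers.Input(shape=(n_frames, n_features)),
        layers.LSTM(64, return_sequences=True),  # carries context from frame to frame
        layers.LSTM(64),                         # final state summarizes the whole clip
        layers.Dense(n_genres, activation="softmax"),
    ])
```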

Hybrid Models: Some approaches combine both CNNs and RNNs to leverage spectral as well as temporal information for improved accuracy.
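A common incarnation is the convolutional recurrent network (CRNN): a convolutional front end learns local spectral patterns, and an LSTM then reads the resulting feature maps as a sequence over time. A sketch, with all shapes being illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_crnn(input_shape=(128, 128, 1), n_genres=10):
    return tf.keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),            # -> (64, 64, 16)
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),            # -> (32, 32, 32)
        layers.Permute((2, 1, 3)),         # put time first: (time, freq, channels)
        layers.Reshape((32, 32 * 32)),     # one feature vector per time step
        layers.LSTM(64),                   # temporal modeling over the conv features
        layers.Dense(n_genres, activation="softmax"),
    ])
```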

Training and Evaluation:

Now this is an often understated aspect of developing machine learning models. Settling on the right model only gets you so far.

Data Splitting: The dataset should be divided into training, validation, and testing sets: the first to train the model, the second to tune hyperparameters, and the third to evaluate the model’s performance.
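With scikit-learn, a typical approach is to carve out the test set first and then split the remainder into training and validation sets; the arrays here are synthetic stand-ins:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 1000 feature vectors, integer labels for 10 genres.
X = np.random.rand(1000, 13)
y = np.random.randint(0, 10, size=1000)

# Hold out the test set first, then split the rest into train and validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15, stratify=y_tmp, random_state=42)
```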

Loss Functions: Categorical cross-entropy is a common choice of loss function, measuring the dissimilarity between the predicted and actual genre labels (see the compile sketch after the next point).

Optimization: Stochastic Gradient Descent (SGD) or more advanced optimizers like Adam are employed to minimize the loss function during training.
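In Keras, the loss and the optimizer meet in a single compile call. A sketch, where `model` is any of the models drafted earlier and the arrays come from the splitting step:

```python
import tensorflow as tf

# `model`, X_train, y_train, X_val, y_val are placeholders from the earlier
# sketches. Sparse categorical cross-entropy is the same loss for integer
# labels; use "categorical_crossentropy" with one-hot labels instead.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=30, batch_size=32)
```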

Evaluation Metrics: Metrics such as accuracy, precision, recall, F1-score, and confusion matrices are absolutely pivotal in the assessment of the model’s performance on the test set.
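scikit-learn covers these out of the box; `model` and the test arrays are placeholders carried over from the earlier sketches:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

probs = model.predict(X_test)        # per-genre probabilities
y_pred = np.argmax(probs, axis=1)    # most probable genre per clip

print(classification_report(y_test, y_pred))  # accuracy, precision, recall, F1 per genre
print(confusion_matrix(y_test, y_pred))       # which genres get mistaken for which
```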

Challenges:

There are multiple challenges that come with developing a model for music genre classification. Some of these are as follows:

Data Variability: Music data is highly variable due to differences in instrumentation, tempo, and artist styles, making classification particularly challenging.

Data Imbalance: Unequal distribution of genres in datasets often leads to biased models, making techniques like data augmentation or re-sampling essential; one lightweight remedy is sketched below.
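That remedy is class weighting: weight each genre inversely to its frequency so that rare genres aren’t drowned out during training. A sketch with scikit-learn, where `y_train` is carried over from the splitting example:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# `y_train` holds integer genre labels (placeholder from the splitting sketch).
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))

# Keras applies the weights per sample during training:
# model.fit(X_train, y_train, class_weight=class_weight, ...)
```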

Generalization: Ensuring the model generalizes well to unseen music and different subgenres is essential. I’d even go to the extent of saying that this is the end goal of our interest in automated music genre classification. Imagine the outrage that would follow if you were to label a Drake album as anime J-pop. It would be in our best interest not to concern ourselves with such overtly dark hypotheses.

Advanced Techniques:

Transfer Learning: Models previously trained on large audio datasets, such as those for speech recognition, can be tweaked or fine-tuned for music genre classification.
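One popular trick, sketched below, is to stack the spectrogram into three channels and reuse an ImageNet-pretrained backbone, training only a new classification head. The choice of VGG16 and the input size are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Reuse an ImageNet-pretrained backbone on 3-channel spectrogram "images".
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(128, 128, 3))
base.trainable = False  # freeze pretrained weights; train only the new head

model = tf.keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),  # 10 genres, as in GTZAN
])
```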

Ensemble Methods: Combining predictions from multiple models, whether through simple averaging (soft voting) or stacking, can sometimes enhance accuracy.
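The simplest version is soft voting: average the predicted genre probabilities of several trained models and take the most probable genre. A sketch, with `models` as a placeholder list:

```python
import numpy as np

# Soft voting: average the per-genre probabilities of several trained models,
# then pick the genre with the highest average probability.
def ensemble_predict(models, X):
    probs = np.mean([m.predict(X) for m in models], axis=0)
    return np.argmax(probs, axis=1)
```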

Conclusion:

Deep learning models, particularly CNNs and RNNs, have so far demonstrated promising results in music genre identification and classification. However, the accuracy and robustness of these models still need work, necessitating further research and experimentation. It genuinely amazes me that we’ve gotten to a point of technological advancement where machines or software systems can, with a fair degree of accuracy, label art borne out of human creativity. Something about this feels jarring, to me at least. Art and music are arguably the cornerstones of human civilization. To hand these over to lifeless systems, for them to mechanically draw conclusions on, feels fundamentally wrong. But I suppose that deserves its own article. So long.

