Music genre classification using CNN: Part 1 - Feature extraction

Namrata Dutt
6 min read · May 20, 2022


Learn how to extract features from audio and classify music into different genres using a Convolutional Neural Network.

Photo by Marius Masalar on Unsplash

Introduction

We listen to music every day, whether at home, in the car, or anywhere else. Music is categorized into different genres like Pop, Rock, Metal, Jazz, Blues, etc. With around 24,000 songs released every day, it is impossible to categorize them all manually, so we use Machine Learning algorithms to automate the work. Apps like Shazam and Spotify stream millions of songs and rely on music genre classification to group them into categories, provide a customized UX (User Experience), and give personalized recommendations to their users. Spotify, for instance, categorizes its music into 5,071 genres.

For classification purposes, we extract time-domain and frequency-domain features from the audio files. These features include Spectrograms, Mel-Spectrograms, MFCCs, Spectral Centroids, Chromagrams, Energy, Spectral Roll-off, Spectral Flux, Spectral Entropy, Zero-crossing Rate, and Pitch. One major problem in music genre classification is that some genres are likely to be misclassified as each other, such as Country and Rock, Pop and Disco, or Jazz and Reggae: a few classes are highly similar and overlap significantly with one another.

Dataset description

We have used the GTZAN dataset, which contains 1000 audio files: 100 clips for each of 10 genres. The classes are Rock, Pop, Jazz, Blues, Country, Metal, Disco, Reggae, Hip-hop, and Classical. While reading the audio files, 56 corrupt files were found, so features were extracted and classification was performed on the remaining 944 files.

Temporal features are extracted from the time domain and spectral features are extracted from the frequency domain. However, features like spectrograms, MFCCs, etc. contain both time and frequency information.

Time-Frequency Domain Features

We have used the following features for the classification task:

1. Spectrogram

A spectrogram is a visual representation of the strength of a signal over time, showing the frequencies present at each time step. In the case of audio, it is also known as a sonograph or voicegram. A spectrogram has three dimensions: time (x-axis), frequency (y-axis), and amplitude (shown as color intensity).

A spectrogram is shown in the figure below, where yellow represents high amplitude and blue represents low amplitude.

Spectrograms are computed from the time-domain signal using the Fourier Transform. The sampled audio is first segmented into several overlapping windows, a step called windowing. The Short-Time Fourier Transform (STFT) then calculates the frequency spectrum of each window, and each spectrum becomes one vertical line in the image; placing these lines side by side produces the spectrogram. In practice, creating a digital spectrogram essentially amounts to computing the squared magnitude of the STFT of the signal for a particular window width. The spectrogram of an audio file is shown in Figure 1.

Fig 1: Spectrogram of sample audio (Image by the Author)
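As a reference, here is a minimal sketch of how such a spectrogram can be computed and plotted with librosa; the file path and the STFT parameters (n_fft, hop_length) are illustrative assumptions, not necessarily the exact settings used for the figures in this article.

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load an example clip (path is hypothetical)
y, sr = librosa.load("genres/blues/blues.00000.wav", sr=22050)

# Short-Time Fourier Transform: overlapping windows of 2048 samples,
# hopping 512 samples between consecutive windows
stft = librosa.stft(y, n_fft=2048, hop_length=512)

# Magnitude spectrogram converted to decibels for visualization
spec_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

librosa.display.specshow(spec_db, sr=sr, hop_length=512,
                         x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram")
plt.show()
```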

2. Mel-Spectrogram

The Mel scale (after the word melody) is a perceptual scale of pitches judged by listeners to be equal in distance from one another.

Humans distinguish lower frequencies much better than higher frequencies. For example, we can easily tell the difference between 200 Hz and 400 Hz, but we can barely tell the difference between 2000 Hz and 2200 Hz, even though both pairs are 200 Hz apart. The reason is that humans perceive pitch in a non-linear way, and the mel scale models exactly this non-linearity: as frequency increases, ever wider ranges of frequencies are grouped into equal mel steps. The Mel scale versus the Hertz scale is shown in Figure 2.

Fig 2: Mel scale versus Hertz scale (source)
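To make the example above concrete, one commonly used Hz-to-mel conversion (the HTK-style formula) can be sketched as follows; this particular formula is one of several mel-scale variants, and the printed values are only illustrative.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert Hertz to mel using the common HTK-style formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Both pairs are 200 Hz apart, but the perceived (mel) gap is much
# smaller at higher frequencies:
print(hz_to_mel(400) - hz_to_mel(200))    # roughly 226 mels
print(hz_to_mel(2200) - hz_to_mel(2000))  # roughly 80 mels
```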

A Mel-Spectrogram is a spectrogram converted to the mel scale using a mel filter bank: after computing the spectrogram, we map its frequency axis (y-axis) to the mel scale. The Mel-Spectrogram of an audio file is shown in Figure 3.

Fig 3: Mel-Spectrogram of sample audio (Image by the Author)
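A minimal sketch of this mapping with librosa is shown below; the path, sample rate, and the choice of n_mels=128 are illustrative assumptions.

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load an example clip (path is hypothetical)
y, sr = librosa.load("genres/blues/blues.00000.wav", sr=22050)

# Power spectrogram mapped onto 128 mel bands, then converted to dB
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                          hop_length=512, n_mels=128)
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

librosa.display.specshow(mel_spec_db, sr=sr, hop_length=512,
                         x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Mel-Spectrogram")
plt.show()
```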

3. MFCC

MFCC stands for Mel-Frequency Cepstral Coefficients. They are a small set of features that concisely describe the overall shape of the spectral envelope. MFCC computation uses the mel scale to divide the frequency band into sub-bands and then extracts the cepstral coefficients by applying the Discrete Cosine Transform (DCT). In a way, MFCCs compress the information in a Mel-Spectrogram; the DCT is the same transform that is popular for image compression. MFCCs also capture the shape of the vocal tract in speech. The MFCC of an audio file is shown in Figure 5.

Fig 5: MFCC of sample audio (Image by the Author)
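A hedged sketch with librosa follows; the path is hypothetical, and n_mfcc=13 is a common but arbitrary choice rather than a setting confirmed by the article.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load an example clip (path is hypothetical)
y, sr = librosa.load("genres/blues/blues.00000.wav", sr=22050)

# 13 Mel-Frequency Cepstral Coefficients per frame
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=2048, hop_length=512)

librosa.display.specshow(mfcc, sr=sr, hop_length=512, x_axis="time")
plt.colorbar()
plt.title("MFCC")
plt.show()
```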

Implementation

Here are the steps to extract different kinds of features:

1. Import libraries

2. Save audio paths and target labels

3. Extract features: Next, we extract different features from the audio. Here, we extract the Spectrogram, Mel-Spectrogram, MFCC, Zero-crossing Rate, Spectral Centroids, and Chromagrams, but for classification we will only use the Spectrogram, Mel-Spectrogram, and MFCC. Some audio files were corrupt, so we record the indices of those files in a list.

4. Remove corrupt files: Delete the features and labels at the corrupt indices. The features are then converted to the float32 data type to reduce memory usage. After that, we assign each label a numerical value and convert the labels into categorical (one-hot) form. Finally, we save all the extracted features and their labels into a .npz file, so that when we start the classification task we can load the .npz file directly. A hedged end-to-end sketch of these four steps is shown below.
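The sketch below walks through all four steps, assuming librosa, TensorFlow/Keras, and a GTZAN-style folder layout (genres/<genre>/<clip>.wav). The paths, parameter values (sample rate, clip length, n_fft, hop length, n_mels, n_mfcc), and variable names are illustrative assumptions, not the author's exact code.

```python
import os
import numpy as np
import librosa
from tensorflow.keras.utils import to_categorical

# --- Steps 1 & 2: import libraries, collect audio paths and labels --------
DATA_DIR = "genres"                            # hypothetical dataset root
genres = sorted(os.listdir(DATA_DIR))          # e.g. ['blues', 'classical', ...]

audio_paths, labels = [], []
for genre in genres:
    genre_dir = os.path.join(DATA_DIR, genre)
    for fname in sorted(os.listdir(genre_dir)):
        audio_paths.append(os.path.join(genre_dir, fname))
        labels.append(genre)

# --- Step 3: extract features, remembering corrupt files ------------------
spectrograms, mel_specs, mfccs = [], [], []
corrupt_idx = []
n_samples = int(22050 * 29.0)                  # fixed clip length in samples

for i, path in enumerate(audio_paths):
    try:
        y, sr = librosa.load(path, sr=22050, duration=29.0)
    except Exception:
        corrupt_idx.append(i)                  # unreadable / corrupt file
        continue
    y = librosa.util.fix_length(y, size=n_samples)  # pad/trim to equal length

    stft = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
    spectrograms.append(librosa.amplitude_to_db(stft, ref=np.max))

    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,
                                         n_fft=2048, hop_length=512)
    mel_specs.append(librosa.power_to_db(mel, ref=np.max))

    mfccs.append(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                      n_fft=2048, hop_length=512))
    # (zero-crossing rate, spectral centroid, and chroma could be added similarly)

# --- Step 4: drop corrupt entries, cast to float32, encode and save -------
labels = [lab for i, lab in enumerate(labels) if i not in corrupt_idx]

spectrograms = np.array(spectrograms, dtype=np.float32)
mel_specs = np.array(mel_specs, dtype=np.float32)
mfccs = np.array(mfccs, dtype=np.float32)

label_to_id = {g: i for i, g in enumerate(genres)}
y_cat = to_categorical([label_to_id[lab] for lab in labels],
                       num_classes=len(genres))

np.savez("gtzan_features.npz",
         spectrogram=spectrograms, mel=mel_specs, mfcc=mfccs, labels=y_cat)
```

When the classification part begins, the saved arrays can be reloaded with something like `data = np.load("gtzan_features.npz")` and accessed as `data["mel"]`, `data["mfcc"]`, and so on.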

The complete code is available on GitHub here.

Conclusion

In this article, we learned how to extract different features from audio. In the next part of this article, we will learn how to classify audio using these features separately and an ensemble of these features. We will explore deep CNNs for music genre classification.

Thanks for reading! I hope you found this article helpful.

Go Gators!🐊

