Music Genre Classification — Part 1

Young Park · Published in The Startup · Jan 10, 2021 · 5 min read
Source: https://in.pinterest.com/pin/66709638220423404/

For my capstone project, I decided to explore the wonderful world of music genre classification through machine learning. I purposely chose this topic because I wanted to take this opportunity to challenge myself to learn something new and to demonstrate how data science is truly an interdisciplinary field.

As someone who casually enjoys listening to music and plays a few instruments here and there, learning all the fundamentals of audio analysis was a challenge in and of itself. That said, it was really amazing and inspiring to learn about all the intricate and complex dynamics that take place under the hood when we plug in our earphones and press play on our devices.

As a side note, I owe a tremendous amount of thanks to Valerio Velardo, who is a subject matter expert in the world of AI/Audio/Music. If this topic is something that interests you and you want to learn more about all the fundamentals of sound and audio analysis, I highly recommend you check out his YouTube page!

Dataset

For my project, I decided to use the GTZAN Genre Collection dataset, which consists of 1,000 audio tracks, each 30 seconds long. It contains 10 genres, each represented by 100 tracks.

10 Genres:
- Blues
- Classical
- Country
- Disco
- Hiphop
- Jazz
- Metal
- Pop
- Reggae
- Rock

Dataset → http://marsyas.info/downloads/datasets.html

Librosa

Since the dataset consists of raw audio files, I had to figure out a way to perform my own feature extraction. It turns out there is a great Python package for music and audio analysis called Librosa. Using Librosa, I was able to extract key features from my audio samples such as Tempo, Chroma Energy Normalized, Mel-Frequency Cepstral Coefficients, Spectral Centroid, Spectral Contrast, Spectral Rolloff, and Zero Crossing Rate.
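To give a rough idea of what that looks like in practice, here is a minimal sketch of this kind of extraction with Librosa (the file path and the mean-over-time aggregation are illustrative choices, not a prescription):

```python
import librosa
import numpy as np

# Load one 30-second GTZAN clip (path is illustrative; point it at your own copy)
y, sr = librosa.load("genres/blues/blues.00000.wav", duration=30)

# Pitch-based features such as chroma work best on the harmonic component
y_harmonic = librosa.effects.harmonic(y)

tempo, _ = librosa.beat.beat_track(y=y, sr=sr)                   # Tempo (BPM)
chroma_cens = librosa.feature.chroma_cens(y=y_harmonic, sr=sr)   # Chroma Energy Normalized
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)               # Mel-Frequency Cepstral Coefficients
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
zcr = librosa.feature.zero_crossing_rate(y)

# Most features come back as (n_coefficients, n_frames) arrays, so one common
# approach is to summarize each across time (here, with the mean) to get a
# single flat feature vector per track
features = np.hstack([
    np.atleast_1d(tempo),
    chroma_cens.mean(axis=1),
    mfcc.mean(axis=1),
    centroid.mean(axis=1),
    contrast.mean(axis=1),
    rolloff.mean(axis=1),
    zcr.mean(axis=1),
])
```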

Without going down the rabbit hole, it’s sufficient to note that these features capture certain aspects of a music piece relating to its timbral texture, rhythmic content, and pitch content. I’ll be using these features to run classification models. For a deeper dive into what each feature means and how it relates to audio analysis, I highly recommend this blog post, where the author describes the significance of each feature in detail.

Waveform

One great thing about Librosa is the ability to visually graph the audio data you are working with. For instance, I created what are known as waveforms using a sample from each genre. Waveforms are visual representations of sound, with time on the x-axis and amplitude on the y-axis. Waveforms are great because they allow you to quickly scan your audio data and visually compare and contrast which genres might be more similar than others.
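As a rough sketch, plotting a waveform for a single clip can be as simple as the following (the path is illustrative, and older versions of Librosa use waveplot instead of waveshow):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Illustrative path; any 30-second GTZAN clip works
y, sr = librosa.load("genres/jazz/jazz.00000.wav")

plt.figure(figsize=(12, 4))
librosa.display.waveshow(y, sr=sr)   # librosa.display.waveplot() on librosa < 0.9
plt.title("Waveform")
plt.tight_layout()
plt.show()
```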

Taking it one step further, Librosa also allows you to separate the harmonic and percussive signals in your audio data. Harmonic sounds carry the pitch information, whereas percussive sounds are perceived more like two objects colliding, producing noise-like patterns. For the purpose of this analysis, it is sufficient to note that separating harmonic and percussive signals matters mainly for feature extraction, since certain extractors within Librosa expect harmonic signals as inputs.
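A minimal sketch of that separation (with an illustrative file path) looks like this:

```python
import librosa

y, sr = librosa.load("genres/disco/disco.00000.wav")  # illustrative path

# Split the signal into harmonic (pitch-bearing) and percussive (transient) components
y_harmonic, y_percussive = librosa.effects.hpss(y)

# The harmonic component can then be passed to pitch-based extractors,
# e.g. librosa.feature.chroma_cens(y=y_harmonic, sr=sr)
```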

Mel-Spectrograms

As seen above, a waveform is a visual representation of audio in the time domain. It turns out we can apply what is known as the Fourier transform to move a waveform out of the time domain and into the frequency domain. What this essentially means is that we can derive a whole new set of features and characteristics in the frequency domain. However, one obvious downside of the Fourier transform is that we lose the “time” element of the audio data. To circumvent this issue, another operation called the Short-Time Fourier Transform (STFT) is often applied instead. Again, without going too much into detail, the STFT computes many Fourier transforms over short, overlapping frames of a fixed size and produces what is known as a Spectrogram, which is a visual representation of all the core elements of an audio signal (time, frequency, and magnitude).
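For illustration, here is a minimal sketch of computing an STFT-based spectrogram with Librosa (the path, frame size, and hop length are just common defaults, not the only valid choices):

```python
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load("genres/classical/classical.00000.wav")  # illustrative path

# STFT: many short Fourier transforms over overlapping frames
stft = librosa.stft(y, n_fft=2048, hop_length=512)

# Convert the magnitude spectrogram to decibels for easier viewing
spec_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

plt.figure(figsize=(12, 4))
librosa.display.specshow(spec_db, sr=sr, hop_length=512, x_axis="time", y_axis="log")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram")
plt.tight_layout()
plt.show()
```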

How do you go from a Spectrogram to a Mel-Spectrogram? In order to go from one to the other, another operation needs to take place where we convert the frequencies onto the Mel scale. The rationale behind this is that the human auditory system does not perceive frequency on a linear scale; it is closer to logarithmic. In simple terms, a Mel-Spectrogram is a Spectrogram with its frequencies converted to the Mel scale to mimic human auditory perception.
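In Librosa, a Mel-Spectrogram sketch might look like the following (again with illustrative parameters; 128 Mel bands is simply a common choice):

```python
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load("genres/pop/pop.00000.wav")  # illustrative path

# Mel-scaled power spectrogram, then converted to decibels
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

plt.figure(figsize=(12, 4))
librosa.display.specshow(mel_db, sr=sr, hop_length=512, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Mel-Spectrogram")
plt.tight_layout()
plt.show()
```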

Interestingly enough, Mel-Spectrograms have been shown in research to work really well with deep learning models, especially Convolutional Neural Networks, since Mel-Spectrograms can be treated like images.

Mel-Frequency Cepstral Coefficients (MFCCs)

While MFCCs are more common in the field of human speech recognition, they have also been shown to be effective in music genre classification and instrument classification problems. MFCCs are derived from a type of cepstral representation called the cepstrum, which is essentially a spectrum of a spectrum, transformed onto the Mel scale. With the rise of deep learning, MFCCs have somewhat taken a backseat to Mel-Spectrograms, but they are nonetheless still considered one of the most important and fundamental features in the world of audio and sound analysis.
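As a small sketch, extracting MFCCs and flattening them into a per-track feature vector might look like this (the mean/standard-deviation summary is one illustrative option):

```python
import librosa
import numpy as np

y, sr = librosa.load("genres/metal/metal.00000.wav")  # illustrative path

# 13 coefficients per frame is a common starting point
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Summarize each coefficient across time so classical ML models
# can consume a fixed-length vector per track
mfcc_vector = np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1)])
print(mfcc.shape, mfcc_vector.shape)  # (13, n_frames) and (26,)
```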

For my next post, my plan is to start modeling with the features I extracted using Librosa and see which model performs best based on accuracy. Then, I’ll extract Mel-Spectrograms, apply a CNN, and compare the results to see which approach performs better!
