Music Information Retrieval: Feature Engineering

Katsiaryna Ruksha
9 min read · Feb 15, 2024


Photo by Namroud Gorguis on Unsplash

In the first article on MIR, I explained what audio data are, what basic transformations and knowledge are required to start working with audio files, and even how to generate your own sounds. Now let’s talk about ways to use audio data in machine learning. In this article we will generate features that are commonly used for many audio data modeling tasks.

Audio Data

There are many applications of audio processing such as:

  • speech recognition, used to control devices via voice commands and dictate text, for example Siri by Apple, Alexa by Amazon, Google Assistant, and Cortana by Microsoft,
  • voice recognition, used in security systems for user authentication, for example the Nuance Gatekeeper biometric engine applied in the banking sector,
  • music recognition, often used together with genre classification and recommendation systems. Well-known examples are Shazam and Spotify,
  • environment sound recognition, which covers a very wide range of tasks from self-driving cars and maintenance to healthcare. Examples are Audio Analytic, SoundSee, and Sleep.ai.

Depending on your task, you can get data from various sources.

Commercial and expert datasets usually provide higher-quality audio data. You can also choose between different audio data formats, for example:

  • WAV or WAVE (Waveform Audio File Format) is a lossless or raw file format developed by Microsoft and IBM,
  • AIFF (Audio Interchange File Format) developed by Apple also saves uncompressed audio,
  • FLAC (Free Lossless Audio Codec) files are compressed without losing sound quality,
  • MP3 (MPEG-1 Audio Layer 3) compresses audio with an acceptable sound quality.

It’s recommended to use uncompressed audio formats such as WAV (WAVE) or AIFF.

For the genre classification task, I’ll use the Kaggle GTZAN dataset, the MNIST of sound. The GTZAN dataset is the most-used public dataset for evaluation in machine listening research on music genre recognition. It includes a collection of 10 genres with 100 audio files each, all 30 seconds long. Along with the audio files, GTZAN includes ML features and mel spectrograms for the audio data. In this article you’ll see how these features can be generated.

Great! Once we have our data, we can start analysis and feature engineering.

From stream to discrete

Before starting to generate features, let’s spend some time understanding how sound as a continuous stream is transformed into discrete features.

Sound is processed in a few steps:

  • Framing means cutting the continuous stream of sound into short pieces (frames) of the same length, typically 20–40 ms,
  • Windowing is a fundamental audio processing technique in which a window function (usually bell-shaped, such as the Hanning window) is applied to a sound frame. It reduces or smooths the amplitude at the start and the end of each frame while increasing it at the center to preserve the average value,
  • The overlap-add (OLA) method prevents losing vital information that windowing could otherwise cause. OLA provides 30–50 percent overlap between adjacent frames, allowing them to be modified without the risk of distortion.

The steps of sound processing should be clear from this image:

Sound preprocessing: K is a frame size, Q is a hop length. Source: https://dsp.stackexchange.com/questions/36509/why-is-each-window-frame-overlapping
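As a rough illustration of framing and windowing, here is a minimal NumPy sketch. The frame length, hop length and Hanning window below are assumptions chosen for the example, not values prescribed by the article.

```python
import numpy as np

# Assumed illustration values: 2048-sample frames with a 512-sample hop (~75% overlap)
frame_length, hop_length = 2048, 512

def frame_and_window(signal: np.ndarray) -> np.ndarray:
    """Cut a 1-D signal into overlapping frames and apply a Hanning window to each."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    window = np.hanning(frame_length)
    frames = np.stack([
        signal[i * hop_length : i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    return frames  # shape: (n_frames, frame_length)

# Example: 1 second of a 440 Hz sine wave sampled at 22050 Hz
sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
frames = frame_and_window(np.sin(2 * np.pi * 440 * t))
print(frames.shape)  # (40, 2048)
```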

There’s no need to preprocess each sound manually: you’ll see how easy feature generation is when using librosa, an open-source Python library for audio visualization and feature extraction.
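For example, loading an audio clip with librosa is a one-liner; the file path below is a placeholder for wherever you stored the GTZAN data.

```python
import librosa

# Placeholder path: adjust to your local copy of the GTZAN dataset
path = "genres_original/blues/blues.00000.wav"

# librosa resamples to 22050 Hz by default; pass sr=None to keep the native rate
y, sr = librosa.load(path)
print(y.shape, sr)  # ~30 s of mono audio as a 1-D NumPy array
```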

Classic machine learning approach for audio data modeling

In the classic ML approach we generate a bunch of different kinds of features and feed them all into a model. We can split all possible audio features into three groups:

Example of MIR features. There is no single classification of features; this is a summary of different sources

High-level features describe a full sound. Mid-level features cover all three dimensions of sound and include MFCCs, chromagrams, and the separation of harmonic and percussive components. Finally, low-level features are based on just two dimensions and can be further divided into time domain and frequency domain features. If you want to find out more about audio data dimensions, check out this article.

High-level features

The high level includes features that describe the audio as a whole, e.g. band, mood, instrumentation. In GTZAN we have no such data; all audio files are only labeled by their genre. The only high-level feature we can derive is tempo, estimated in beats per minute.
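Tempo can be estimated with librosa’s beat tracker. A minimal sketch (the file path is a placeholder):

```python
import librosa

y, sr = librosa.load("genres_original/disco/disco.00000.wav")  # placeholder path

# Estimate the global tempo (beats per minute) and the beat positions (in frames)
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
print("Estimated tempo (BPM):", tempo)
```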

For analysis we can use a box plot to compare the tempo of different genres:

Tempo measured in beats per minute for different genres

Mid-level features

Mid-level features cover all three dimensions of sound; we’ll generate mel spectrograms, MFCCs, chromagrams, and harmonic and percussive components.

A mel spectrogram helps to represent all three dimensions of audio data at once: the horizontal axis represents time, the vertical axis represents frequency, and the color intensity represents the amplitude of a frequency at a certain point in time. As human perception of pitch is logarithmic, in a mel spectrogram frequencies are converted from hertz to the mel scale.
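A minimal sketch of computing and plotting a mel spectrogram with librosa (the path and the number of mel bands are assumptions for illustration):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("genres_original/jazz/jazz.00000.wav")  # placeholder path

# Mel spectrogram in power scale, converted to decibels for plotting
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)

librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Mel spectrogram")
plt.show()
```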

Mel spectrograms of four sounds of different genres

The mel frequency cepstral coefficients (MFCCs) of a signal are a small set of features which concisely describe the overall shape of the spectral envelope. Usually the first 12–13 coefficients are used together with their first and second derivatives. To get them, we apply the Discrete Cosine Transform to a (log) mel spectrogram. The Discrete Cosine Transform is a simplified version of the Fourier Transform which returns real-valued coefficients, decorrelates energy in different mel bands and reduces dimensionality to represent the spectrum. I like to make an analogy between principal components and MFCCs: in both cases we use the first few decorrelated components to represent the data.
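A sketch of extracting the first 13 MFCCs and their deltas with librosa (path is a placeholder):

```python
import librosa

y, sr = librosa.load("genres_original/rock/rock.00000.wav")  # placeholder path

# First 13 MFCCs plus their first and second derivatives (deltas)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc_delta = librosa.feature.delta(mfcc)
mfcc_delta2 = librosa.feature.delta(mfcc, order=2)
print(mfcc.shape)  # (13, n_frames)
```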

MFCCs of four sounds of different genres

Chroma features. A chroma vector is computed by summing the log-frequency magnitude spectrum across octaves. The resulting sequence of chroma vectors is known as a chromagram. One main property of chroma features is that they capture harmonic and melodic characteristics of music while being robust to changes in timbre and instrumentation.

Chromagram shows what notes are being played in a specific moment

Basically, a chromagram shows which notes were played, and at what intensity, at each moment, regardless of their octave.
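A minimal sketch of computing a chromagram with librosa’s STFT-based chroma feature (the path is a placeholder; librosa also offers CQT-based variants):

```python
import librosa

y, sr = librosa.load("genres_original/classical/classical.00000.wav")  # placeholder path

# 12 chroma bins (one per pitch class) for each frame
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
print(chroma.shape)  # (12, n_frames)
```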

Chromagrams of 4 sounds of different genres

Time domain features

A waveform is a plot of audio data showing how amplitude changes with time. Let’s see what our data look like by plotting their waveforms with librosa.
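A minimal plotting sketch (the path is a placeholder):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("genres_original/metal/metal.00000.wav")  # placeholder path

librosa.display.waveshow(y, sr=sr)  # use waveplot in older librosa versions
plt.title("Waveform")
plt.show()
```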

Example of waveforms of four audio files of different genres

The first two features are very similar: the amplitude envelope and root mean square energy.

The amplitude envelope is a time domain feature made up of the maximum amplitude value across all samples in each frame. It gives a rough idea of sound loudness but is sensitive to outliers.

Root mean square energy. By energy we mean the overall magnitude of a signal, which in the case of audio signals corresponds to how loud the signal is. For the amplitude envelope we took the maximum amplitude in each frame; to get root mean square energy we calculate the RMS of the amplitudes in each frame. The resulting feature is still an indicator of loudness but is less sensitive to outliers.
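A sketch of both features. The amplitude envelope has no librosa built-in, so it is computed per frame with NumPy; the frame and hop lengths and the path are assumptions for illustration.

```python
import librosa
import numpy as np

y, sr = librosa.load("genres_original/pop/pop.00000.wav")  # placeholder path
frame_length, hop_length = 2048, 512  # assumed values

# Amplitude envelope: maximum absolute amplitude in each frame (custom, no built-in)
amplitude_envelope = np.array([
    np.max(np.abs(y[i:i + frame_length]))
    for i in range(0, len(y), hop_length)
])

# Root mean square energy per frame
rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
print(amplitude_envelope.shape, rms.shape)
```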

Amplitude envelope and RMS energy of four audio files of different genres

Zero crossing rate is another time domain feature, equal to the number of times the signal crosses the horizontal axis of the waveform within a frame.

Definition of zero crossings. Source: https://www.researchgate.net/figure/Definition-of-zero-crossings-rate_fig2_259823741

This feature helps to differentiate between percussive and pitched sounds, and between voiced and unvoiced audio fragments.
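A minimal sketch with librosa (path and frame parameters are placeholders):

```python
import librosa

y, sr = librosa.load("genres_original/hiphop/hiphop.00000.wav")  # placeholder path

# Fraction of zero crossings per frame
zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)[0]
print(zcr.mean())
```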

Waveforms with zero crossing rate of four audio files of different genres

Frequency domain features

We’ll analyze three spectral features: spectral centroid, spectral bandwidth and spectral roll-off:

Plots of spectral centroid, spectral bandwidth and spectral roll-off for a single sound

These are not all the spectral features: check out the documentation of the Two!Ears Auditory Model to find out about spectral crest, entropy, and other features used in signal processing.

The spectral centroid indicates where the center of mass of the spectrum is located. It is calculated as the amplitude-weighted mean of the frequencies present in the signal. Perceptually, it has a robust connection with the impression of the brightness of a sound and is widely used as a measure of tonal quality.

Spectral bandwidth is the spectral range of interest around the centroid, that is, the spread of the spectrum around the spectral centroid. Mathematically, it is the weighted mean of the distances of frequency bands from the spectral centroid. Spectral bandwidth correlates with perceived timbre.

Similar to the zero crossing rate, there is a rise in the spectral centroid at the beginning of the signal. That is because the silence at the beginning has such a small amplitude that high frequency components get a chance to dominate:

Spectral centroid with bandwidth plots of four sounds of different genres

Spectral roll-off is the third spectral feature that we’ll plot today. It is the frequency below which a specified percentage of the total spectral energy, e.g. 85%, lies. This feature can be useful to distinguish voiced from unvoiced signals and harmonic from noisy sounds.
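A sketch computing the three spectral features discussed above with librosa (the path and the 85% roll-off threshold are placeholders):

```python
import librosa

y, sr = librosa.load("genres_original/country/country.00000.wav")  # placeholder path

centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)[0]
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)[0]

print(centroid.mean(), bandwidth.mean(), rolloff.mean())  # values in Hz
```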

Spectral roll-off plots of four sounds of different genres

Harmonic-percussive source separation. Musical sounds include two broad categories: harmonic sounds and percussive sounds. A harmonic sound is what we perceive as pitched sound, what makes us hear melodies and chords. On the other hand, a percussive sound is what we perceive as a drum stroke. In a spectrogram, horizontal lines correspond to harmonic sounds and vertical lines to percussive sounds.
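A minimal sketch of separating the two components in the time domain with librosa (path is a placeholder; librosa.decompose.hpss does the same on a spectrogram):

```python
import librosa

y, sr = librosa.load("genres_original/reggae/reggae.00000.wav")  # placeholder path

# Split the signal into harmonic and percussive components
y_harmonic, y_percussive = librosa.effects.hpss(y)
```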

Spectrograms of the full sounds and their percussive and harmonic components. The harmonic spectrograms mostly keep the horizontal lines and the percussive ones the vertical lines

Modeling

Now that we know the different kinds of features, we can use them to build a quick genre classification model. Following the classic ML approach, I’ll generate the features for all audio pieces in the data sample and aggregate them at the song level by calculating the mean and variance of each feature.
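A sketch of such a feature table, assuming a subset of the features covered above; the function and variable names (song_features, audio_paths) are mine, not from the original pipeline.

```python
import librosa
import numpy as np
import pandas as pd

def song_features(path: str) -> dict:
    """Aggregate frame-level features to song level via mean and variance (sketch)."""
    y, sr = librosa.load(path)
    feats = {
        "zcr": librosa.feature.zero_crossing_rate(y)[0],
        "rms": librosa.feature.rms(y=y)[0],
        "centroid": librosa.feature.spectral_centroid(y=y, sr=sr)[0],
        "bandwidth": librosa.feature.spectral_bandwidth(y=y, sr=sr)[0],
        "rolloff": librosa.feature.spectral_rolloff(y=y, sr=sr)[0],
    }
    for i, row in enumerate(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)):
        feats[f"mfcc{i + 1}"] = row
    return {
        **{f"{name}_mean": np.mean(v) for name, v in feats.items()},
        **{f"{name}_var": np.var(v) for name, v in feats.items()},
    }

# audio_paths is a hypothetical list of GTZAN file paths
# df = pd.DataFrame([song_features(p) for p in audio_paths])
```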

Preprocessing the data is very easy: I’ll apply a standard scaler to all features and encode the target variable:
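A sketch of this step with scikit-learn; X and y_labels stand for the feature table and the genre column and are hypothetical names.

```python
from sklearn.preprocessing import StandardScaler, LabelEncoder

# X: feature DataFrame, y_labels: genre column (hypothetical names)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_labels)
```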

Finally, we can apply a model:
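The original does not specify which classifier was used, so the sketch below plugs in a random forest with default hyperparameters purely as an example; your exact accuracy will depend on the model and split.

```python
from sklearn.ensemble import RandomForestClassifier  # model choice is an assumption
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)

model = RandomForestClassifier(random_state=42)  # default hyperparameters
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```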

Without tuning, the accuracy is 0.79! Almost 80% of songs across 10 different genres were correctly labeled with default model hyperparameters. Obviously this could be further improved by hyperparameter tuning and model selection, but that is the topic of another article.

There are lots of features that can be generated for audio signal processing. Here I’ve covered the frequently used ones that you can generate for your own audio data set, or you can use the already prepared features for modeling the GTZAN data.

Different modeling options for music data

Besides the classic approach with feature generation that we just saw, there are other options, such as applying transformer models or computer vision algorithms to mel spectrograms. In the next article I’ll compare alternative modeling approaches.

Check out my GitHub to view the complete code.
