Music Information Retrieval: Introduction

Katsiaryna Ruksha
7 min read · Jan 19, 2024


This article will give you the knowledge required to start working on a Data Science project with music data. It covers the specifics of sound data, the mathematical relationships hidden in music, feature engineering and, finally, approaches to modeling. Let's dive in!

Dimensions of sound

Audio data is a mix of wave frequencies at different intensities. It represents analog sounds in a digital form, preserving the main properties of the original. Audio data has three dimensions:

  • Time period is how many seconds it takes to complete one cycle of vibration,
  • Amplitude is the sound intensity measured in decibels (dB) which humans perceive as loudness,
  • Frequency is measured in Hertz (Hz) and indicates how many sound vibrations happen per second. People interpret frequency as low or high pitch.

Not only is the loudness the human ear can perceive limited, but so is the pitch:

Differences in perceptible sound frequencies across animals. Source: https://theory.labster.com/hearing-range-dbs/

Frequency is used to distinguish between pure and complex tones. Pure tones are sounds that can be modelled using sine and cosine functions and consist of a single fundamental frequency. Standalone musical notes are often considered pure tones. Complex tones are combinations of pure tones, i.e., sums of several fundamental frequencies.

A complex tone is a sum of pure tones. Source: https://pressbooks.umn.edu/sensationandperception/chapter/timbre/

With this knowledge we can go to Python, create pure tones of musical notes and combine them to get the C chord, a complex tone. The C chord includes three notes, C, E and G; their frequencies, which you can find on Wikipedia, are 261.626 Hz, 329.628 Hz and 391.995 Hz respectively. Check out how the tones are created and combined into a chord in the sketch below:
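Here is a minimal sketch of that idea with NumPy and SciPy (the sample rate, duration, amplitude, and output file name are my own choices; the article's original gist may differ):

```python
import numpy as np
from scipy.io import wavfile

SAMPLE_RATE = 44100   # samples per second, the CD-quality standard
DURATION = 2.0        # length of the generated sound in seconds
C4, E4, G4 = 261.626, 329.628, 391.995  # note frequencies in Hz

def pure_tone(freq, duration=DURATION, sr=SAMPLE_RATE, amplitude=0.3):
    """Generate a sine wave (pure tone) at the given frequency."""
    t = np.linspace(0, duration, int(sr * duration), endpoint=False)
    return amplitude * np.sin(2 * np.pi * freq * t)

# A complex tone is the sum of pure tones: here, the C major chord
chord = pure_tone(C4) + pure_tone(E4) + pure_tone(G4)

# Scale to 16-bit integers and save as a WAV file for playback
scaled = (chord / np.max(np.abs(chord)) * 32767).astype(np.int16)
wavfile.write("c_chord.wav", SAMPLE_RATE, scaled)
```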

A waveform of generated C chord

Let’s listen to the result:

Complex tone — C chord

Once you read an audio file, you'll get a tuple containing (a loading example follows the list):

  • A sample rate. The standard is 44100 Hz, which means the signal was sampled 44100 times per second,
  • A floating-point time series. y(t) corresponds to the amplitude of the waveform at sample t. Its length equals the sample rate multiplied by the audio duration in seconds.
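For example, loading the chord generated above with librosa (note that librosa.load returns the time series first and the sample rate second; scipy.io.wavfile.read returns them in the order listed above, but as an integer array unless the file stores floats):

```python
import librosa

# sr=None keeps the file's native sample rate instead of resampling to 22050 Hz
y, sr = librosa.load("c_chord.wav", sr=None)

print(sr)            # 44100
print(y.dtype)       # float32
print(len(y) / sr)   # duration in seconds: len(y) == sr * duration
```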

As sound has three dimensions, there are different ways to visualize it.

  • A waveform reflects how the amplitude changes over time. This is the most standard type of plot, the one you have likely seen in any music player,
  • A spectrum is a graph whose X-axis shows the frequency of the sound wave and whose Y-axis represents its amplitude,
  • A spectrogram covers all three dimensions: the X-axis is time, the Y-axis is frequency (Hertz), and color represents amplitude,
  • Finally, a mel spectrogram is based on the mel scale, which represents the way humans perceive sound characteristics.

Human perception of pitch is logarithmic as well, so for a mel spectrogram the frequency axis is converted from Hertz to the mel scale.
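A commonly used conversion is the HTK formula (librosa's default is the slightly different Slaney variant):

$$m = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right)$$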

Let's create these plots for a single audio file using librosa, a Python library for audio data processing and feature engineering:
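A sketch of how these four plots could be produced for the generated c_chord.wav (assuming librosa and matplotlib are installed; the article's exact plotting code may differ):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("c_chord.wav", sr=None)
fig, ax = plt.subplots(2, 2, figsize=(12, 8))

# a) Waveform: amplitude over time
librosa.display.waveshow(y, sr=sr, ax=ax[0, 0])
ax[0, 0].set(title="Waveform")

# b) Spectrum: amplitude per frequency, via the FFT
magnitude = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1 / sr)
ax[0, 1].plot(freqs, magnitude)
ax[0, 1].set(title="Spectrum", xlabel="Frequency (Hz)", xlim=(0, 1000))

# c) Spectrogram: time vs. frequency, color encodes amplitude in dB
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
librosa.display.specshow(D, sr=sr, x_axis="time", y_axis="hz", ax=ax[1, 0])
ax[1, 0].set(title="Spectrogram")

# d) Mel spectrogram: the frequency axis converted to the mel scale
S = librosa.feature.melspectrogram(y=y, sr=sr)
librosa.display.specshow(librosa.power_to_db(S, ref=np.max),
                         sr=sr, x_axis="time", y_axis="mel", ax=ax[1, 1])
ax[1, 1].set(title="Mel spectrogram")

plt.tight_layout()
plt.show()
```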

Different types of plots of piano C chord: a) Waveform, b) Spectrum, c) Spectrogram, d) Mel spectrogram

To get frequency data from an audio file we applied the Fourier transform. The Fourier transform (FT) is a mathematical operation that decomposes a signal into its constituent frequencies, each with its own amplitude.

The Fourier transform formula is:

$$\hat{x}(f) = \int_{-\infty}^{\infty} x(t)\, e^{-i 2\pi f t}\, dt$$

where:

  • $\hat{x}(f)$ is the output of the Fourier transform in the frequency domain,
  • $x(t)$ is the input time-domain function,
  • $2\pi f$ is the frequency in radians per second.

It is used to convert waveforms into corresponding spectrum plots to look at the same signal from a different angle and perform frequency analysis. Fourier showed that any signal can be represented as a series of sine waves of different amplitude and phase.

Application of FFT to view the same signal from time and frequency perspectives. Source: https://www.nti-audio.com/en/support/know-how/fast-fourier-transform-fft

The Fast Fourier Transform (FFT) is an efficient algorithm for computing the discrete Fourier transform. The short-time Fourier transform (STFT) applies the Fourier transform to a sequence of short, overlapping windows, converting a waveform into a spectrogram.
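As an illustration, we can recover the chord's fundamentals directly from its spectrum (reusing chord and SAMPLE_RATE from the tone-generation sketch; the 50% peak threshold is my own choice):

```python
import numpy as np
from scipy.signal import find_peaks

# Magnitude spectrum of the generated chord
magnitude = np.abs(np.fft.rfft(chord))
freqs = np.fft.rfftfreq(len(chord), d=1 / SAMPLE_RATE)

# Keep only prominent local maxima: one main lobe per note of the chord
peaks, _ = find_peaks(magnitude, height=magnitude.max() * 0.5)
print(freqs[peaks].round(1))   # approximately [261.6, 329.6, 392.0]
```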

A spectrum of generated C chord with top frequencies corresponding to notes C4, E4 and G4

On the spectrum plot we see three main frequencies which correspond to the three notes used in this chord: C (261.626 Hz), E (329.628 Hz), and G (391.995 Hz).

Music and Math

The musical system that we are accustomed to is called 12 equal temperament. It divides an octave (the interval between two closest identically named notes) into 12 parts, all of which are equally tempered (equally spaced) on a logarithmic scale, with a ratio equal to the 12th root of 2 ($2^{1/12} \approx 1.05946$). Moving up one key (a semitone) multiplies the frequency by this ratio: for example, A#4 is 440 × 1.05946 ≈ 466.16 Hz.

Frequencies of notes on the piano. Source: https://pressbooks.pub/sound/chapter/pitch-and-frequency-in-music/

The frequency of any note can be calculated using the equation below, which gives the frequency f of the n-th key on an idealized standard piano whose 49th key is tuned to A4 at 440 Hz (highlighted on the plot):

$$f(n) = 2^{\frac{n-49}{12}} \times 440\ \text{Hz}$$

where n is the key number.

This post provides a handy piece of code to get frequencies of any note:
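Since the linked snippet isn't reproduced here, a minimal sketch of the same formula, rewritten relative to A4 = 440 Hz (the note-to-semitone mapping is my own helper):

```python
# Semitone distance from A within the same octave (scientific pitch notation)
NOTE_OFFSETS = {"C": -9, "C#": -8, "D": -7, "D#": -6, "E": -5, "F": -4,
                "F#": -3, "G": -2, "G#": -1, "A": 0, "A#": 1, "B": 2}

def note_frequency(name: str, octave: int) -> float:
    """Frequency in Hz of a note in 12 equal temperament, with A4 = 440 Hz."""
    semitones = NOTE_OFFSETS[name] + 12 * (octave - 4)  # distance from A4
    return 440.0 * 2 ** (semitones / 12)

print(round(note_frequency("C", 4), 3))  # 261.626
print(round(note_frequency("E", 4), 3))  # 329.628
print(round(note_frequency("G", 4), 3))  # 391.995
```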

Synthesized note C

You noticed, of course, how unnatural the generated sound is. That's because we generated a pure tone. Instruments and voices produce complex sounds that include overtones (or harmonics), additional frequencies above the fundamental. Different instruments produce different overtones and hence create timbre, which allows us to distinguish one instrument from another even when they play the same note.

Instruments and voices have different timbres. Source: https://byjus.com/physics/timbre/
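A rough sketch of how overtones can be layered onto a pure tone (the harmonic weights below are illustrative, not measured from a real piano):

```python
import numpy as np

def complex_tone(f0, harmonic_weights, duration=2.0, sr=44100):
    """Sum a fundamental frequency f0 and its integer-multiple overtones.

    harmonic_weights[k] is the amplitude of the (k+1)-th harmonic,
    i.e. the component at frequency (k+1) * f0.
    """
    t = np.linspace(0, duration, int(sr * duration), endpoint=False)
    tone = sum(w * np.sin(2 * np.pi * (k + 1) * f0 * t)
               for k, w in enumerate(harmonic_weights))
    return tone / np.max(np.abs(tone))   # normalize to [-1, 1]

# Middle C with a few decaying overtones sounds richer than a pure sine
c4_rich = complex_tone(261.626, [1.0, 0.5, 0.25, 0.12, 0.06])
```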

If we extract frequencies from a real piano middle C sound, we’ll get more than one value:

The spectrum of a piano C4 sound shows additional frequencies besides the fundamental frequency at 261.6 Hz

Additionally, the synthesized sound has a constant amplitude, while a piano key, when struck and held, produces a near-immediate initial sound that gradually decays to zero.

Waveforms of a) real piano C4 sound, b) synthesized C4 sound

The most common model of sound amplitude has four stages: attack, decay, sustain, and release (ADSR); a minimal envelope sketch follows the list:

  • Attack is the time taken for the rise of the level from nil to peak,
  • Decay is the time taken for the level to reduce from the attack level to the sustain level,
  • Sustain is the level maintained until the key is released,
  • Release is the time taken for the level to decay to nil.
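A minimal piecewise-linear sketch of such an envelope (the stage lengths and sustain level are illustrative defaults, not values from the article):

```python
import numpy as np

def adsr_envelope(n_samples, sr=44100, attack=0.02, decay=0.1,
                  sustain_level=0.6, release=0.3):
    """Piecewise-linear ADSR envelope of length n_samples."""
    a, d, r = int(attack * sr), int(decay * sr), int(release * sr)
    s = max(n_samples - a - d - r, 0)  # sustain fills whatever time remains
    env = np.concatenate([
        np.linspace(0.0, 1.0, a, endpoint=False),           # attack: nil -> peak
        np.linspace(1.0, sustain_level, d, endpoint=False),  # decay: peak -> sustain
        np.full(s, sustain_level),                           # sustain: held level
        np.linspace(sustain_level, 0.0, r),                  # release: level -> nil
    ])
    # Pad or trim so the envelope matches the signal length exactly
    return np.pad(env, (0, max(n_samples - len(env), 0)))[:n_samples]

# Shape the overtone-rich tone from the previous sketch like a struck key
c4_shaped = adsr_envelope(len(c4_rich)) * c4_rich
```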

Following Katie He's article, I extracted the overtones from a real piano sound and applied them to my synthesized C chord along with ADSR weights. The complete code is on my GitHub.

That’s the final sound:

Note C with overtones and ADSR effects

With this knowledge, we can generate a song played with two hands on a piano:
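A toy sketch of the idea, reusing pure_tone and adsr_envelope from the earlier snippets (the notes below are illustrative, not the actual score):

```python
import numpy as np

def track(steps, duration=0.4, sr=44100):
    """Concatenate enveloped tones; each step is a list of simultaneous frequencies."""
    return np.concatenate([
        adsr_envelope(int(sr * duration), sr=sr) *
        sum(pure_tone(f, duration, sr) for f in freqs)
        for freqs in steps
    ])

# Right hand plays a simple melody, left hand repeats a C major chord
right = track([[659.255], [587.330], [659.255], [587.330]])   # E5 D5 E5 D5
left = track([[261.626, 329.628, 391.995]] * 4)               # C-E-G chords
song = right + left   # mix the two "hands" into one signal
```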

Generated piano version of Dogs’ Waltz

It still sounds a bit unnatural, of course, but we got this far with just math and no ML at all! Using the full power of AI, we can of course achieve a more natural sound.


As we saw, the Fourier transform lets us decompose any complex sound into pure tones. With this motivation, it was applied to solve one of the greatest music mysteries: the opening chord of The Beatles' "A Hard Day's Night". According to the musicologist Jeremy Summerly, "the sound of this chord is the most discussed pop opening of all time". There were different theories of who played this chord and how: two of them considered different notes played on a guitar by George Harrison, while a third also included John Lennon on guitar and Paul McCartney on bass.

I recommend reading the article "Mathematics, Physics and A Hard Day's Night" by Jason I. Brown, who attempted to use math to solve the mystery. The difficulty in identifying the notes of the chord lies in the overtones added to the fundamental frequencies, since multiple instruments with their own timbres are involved. Later, several researchers disputed his conclusions and, unfortunately, the mystery of the chord remains unsolved: music is, after all, more than just math.

This post covered introductory information about sound data and the specifics of working with it. The next post is dedicated to the Data Science aspects: feature engineering (including MFCCs) for the classical modeling approach. See you there!

Don't forget to check out the complete code on my GitHub.
