Using AI to detect instrumental tracks

Francesco Bonzi
Published in Musixmatch Blog
Dec 20, 2021 · 7 min read

Since the time of Plato, harmony has been considered a fundamental branch of physics, now known as musical acoustics. The mathematical principles of sound have been studied intensely over the centuries and are today very well understood. By contrast, the feelings that a musical composition can arouse in a listener defy any rational theory. This duality is one of the things that strikes me most about music. Over the years my passion for music has evolved, but my admiration for this art form has always remained the same. When I discovered Musixmatch, I became interested in finding a common denominator that captures the heart of music, something a computer could learn for tagging and labelling purposes. Limiting the unlimited seemed an approachable task, thanks to the potential of the AI techniques I had learned at university. The opportunity to influence how users perceive music convinced me to accept this challenge.

Using AI to automate tasks and improve quality

Musixmatch is a community-based service that has aggregated and licensed musical metadata since 2010, with the mission of enhancing the listening experience. In this field, it is crucial to provide automatic systems that can monitor the quality of the work carried out by a heterogeneous community of users. Furthermore, in a large and growing musical landscape, keeping up with the continuous publication of new audio material is a complex process. In this context, automated systems can complement and accelerate the users' work, either by extracting information autonomously or by assisting users in their tasks.

Instrumentalness

Together with the Musixmatch AI Team, I developed an automatic system called Instrumentalness with a specific goal in mind: distinguishing between vocal and instrumental sections in a song.

The Instrumentalness task was addressed using deep learning, the AI technique that in recent years has reached state-of-the-art results in many fields, including music. The following sections discuss the architecture and methodologies applied to solve this task.

Architecture

In this section, the main techniques used to address the task are described in depth.

The Pipeline

The architecture involves three steps:

  1. Preprocessing
  2. Binary Regression
  3. Overall Score Computation

In the preprocessing step, the audio is normalized: the 16-bit samples are scaled to the range from -1.0 to 1.0 and the signal is resampled at 16 kHz.

Subsequently, the song is split into five-second windows, with a hop size of one second.
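As a rough illustration, here is a minimal preprocessing sketch in Python, assuming 16-bit PCM input and using librosa for resampling; the function and constant names are illustrative, not part of the production system.

```python
import numpy as np
import librosa

TARGET_SR = 16_000   # resampling rate used by the pipeline
WINDOW_S = 5.0       # window length in seconds
HOP_S = 1.0          # hop size in seconds

def preprocess(pcm_int16: np.ndarray, source_sr: int) -> np.ndarray:
    """Scale 16-bit samples to [-1.0, 1.0] and resample to 16 kHz."""
    waveform = pcm_int16.astype(np.float32) / 32768.0
    return librosa.resample(waveform, orig_sr=source_sr, target_sr=TARGET_SR)

def split_windows(waveform: np.ndarray, sr: int = TARGET_SR) -> np.ndarray:
    """Split the song into 5-second windows with a 1-second hop."""
    win, hop = int(WINDOW_S * sr), int(HOP_S * sr)
    starts = range(0, max(len(waveform) - win, 0) + 1, hop)
    return np.stack([waveform[s:s + win] for s in starts])
```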

In the binary regression step, YAMNet embeddings are extracted from every five-second window, and the binary regressor predicts the likelihood that the clip is fully instrumental.

In the last step, two overall scores, a continuous one and a discrete one, are computed with the help of the Viterbi algorithm.

Google’s YAMNet

Google's YAMNet is a popular, freely available sound event classifier that predicts 521 audio classes learned from the AudioSet-YouTube dataset. It is based on MobileNet v1, an architecture originally designed for computer vision and adapted here for audio processing.

Internally, YAMNet frames the waveform into sliding windows 0.96 seconds long with a hop of 0.48 seconds, then runs the core of the model on a batch of these frames. For each frame, the network predicts the likelihood that each audio class is present within the window.

Input: the waveform of a song. Output: presence of sound classes for each second
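For reference, this is roughly how the published YAMNet model can be loaded from TensorFlow Hub and run on a mono, 16 kHz waveform; it returns per-frame class scores and 1024-dimensional embeddings. This sketch is not the production code, just the standard usage of the public model.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load the published YAMNet model from TensorFlow Hub.
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def run_yamnet(waveform):
    """Run YAMNet on a mono, 16 kHz, float32 waveform in [-1.0, 1.0].

    Returns per-frame class scores (one row per 0.96 s window, 0.48 s hop)
    and the corresponding 1024-dimensional embeddings.
    """
    scores, embeddings, log_mel_spectrogram = yamnet(tf.constant(waveform, tf.float32))
    return scores.numpy(), embeddings.numpy()
```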

Since YAMNet is able to distinguish between so many audio classes, it must be able to extract and recognize the timbre of a sound. This knowledge can be exploited to distinguish the human voice from everything else, regardless of language, voice type, and so on.

Binary Regressor

A binary regressor is appended on top of the YAMNet backbone to adapt the network to the Instrumentalness task. Its goal is to predict the vocal presence in an audio sample. This network is trained on a custom, strongly labeled internal dataset containing about 1,500 samples with vocal/instrumental tags. The details are presented in the following table.

Finally, the model was tested on the Jamendo Corpus. The network predicts the likelihood that a 1-second frame is fully instrumental or vocal, and it requires a context of 2 seconds in both directions. This results in a 5-second window whose prediction refers to the third second, as presented in the figure below:

A likelihood of 0.9 is assigned only to second number 3 in the window

This allows the network to be more aware of the audio context.
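The post does not describe the exact head architecture, so the following is only a plausible minimal sketch: the YAMNet embeddings of one 5-second window are pooled and fed to a small dense head with a sigmoid output. All names, layer sizes, and the frame count per window are illustrative assumptions.

```python
import tensorflow as tf

EMBEDDING_DIM = 1024    # size of a YAMNet embedding
FRAMES_PER_WINDOW = 10  # ~number of 0.48 s-hop YAMNet frames in 5 s (illustrative)

def build_regressor() -> tf.keras.Model:
    """A hypothetical dense head on top of frozen YAMNet embeddings.

    Input: the stacked embeddings of one 5-second window.
    Output: the likelihood that the central second is fully instrumental.
    """
    inputs = tf.keras.Input(shape=(FRAMES_PER_WINDOW, EMBEDDING_DIM))
    x = tf.keras.layers.GlobalAveragePooling1D()(inputs)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Keeping the YAMNet backbone frozen and training only a small head like this is a common transfer-learning choice when the labeled dataset is small, as it is here.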

Performance

This network reaches excellent performance on the test set, as presented in the following table:

Sample embeddings extracted from the test set. Red = vocals, Blue = instrumental

We used the embeddings retrieved from the network to visualize the positions of the samples in the embedding space. In the figure, vocal and instrumental samples are clearly separated into two distinct clusters, which confirms the strong results shown previously.

Viterbi Algorithm

Discretization process with Viterbi

The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of discrete states. In this context, it is used to discretize the continuous scores predicted by the regressor, as presented in the figure. The hyper-parameters have been calibrated empirically.
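Since the post does not report the calibrated transition probabilities, the following self-contained two-state Viterbi sketch uses an illustrative "sticky" transition matrix (`p_stay`) that discourages rapid switching between vocal and instrumental states; it is a generic implementation of the technique, not the production code.

```python
import numpy as np

def viterbi_binary(likelihoods: np.ndarray, p_stay: float = 0.9) -> np.ndarray:
    """Discretize per-second instrumental likelihoods with a 2-state Viterbi.

    States: 0 = vocal, 1 = instrumental. `p_stay` (illustrative value) is the
    probability of remaining in the same state, which smooths out jitter.
    """
    eps = 1e-9
    # Emission log-probabilities for each state at each time step.
    emit = np.log(np.stack([1.0 - likelihoods, likelihoods], axis=1) + eps)
    trans = np.log(np.array([[p_stay, 1 - p_stay],
                             [1 - p_stay, p_stay]]) + eps)

    n = len(likelihoods)
    delta = np.zeros((n, 2))            # best log-score ending in each state
    back = np.zeros((n, 2), dtype=int)  # backpointers
    delta[0] = np.log(0.5) + emit[0]    # uniform prior over the two states

    for t in range(1, n):
        scores = delta[t - 1][:, None] + trans  # rows: previous state, cols: current state
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + emit[t]

    # Backtrack the best path.
    path = np.zeros(n, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```

Calling `viterbi_binary(clip_likelihoods)` on the per-second regressor outputs yields one 0/1 label per second of the song.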

At this point, a continuous score and a discrete one are computed. The continuous score is computed using the following formula:

The discrete score is obtained by thresholding the continuous one at 60%, a value chosen empirically by processing a large dataset of songs.
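The exact formula for the continuous score is not reproduced here, but one plausible reading is the fraction of seconds that Viterbi labels as instrumental, thresholded at 60% for the discrete score; the sketch below follows that assumption.

```python
import numpy as np

def overall_scores(discrete_path: np.ndarray, threshold: float = 0.6):
    """Hypothetical overall scores, assuming the continuous score is the
    fraction of seconds that Viterbi labels as instrumental (the actual
    formula in the post is not shown here)."""
    continuous = float(np.mean(discrete_path))  # in [0.0, 1.0]
    discrete = int(continuous >= threshold)     # 60% threshold from the post
    return continuous, discrete
```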

Results and Performance

This system has been tested over two different Musixmatch internal datasets.

Musixmatch Internal Dataset A

This dataset contains 6,280 songs, each associated with a discrete score assigned by trusted Musixmatch curators and a continuous score provided by Spotify.

All the songs were processed and initially compared against the discrete Musixmatch score. Nearly all the tracks were correctly labeled, as shown in the figure.

The distance between our continuous scores and the Spotify ones is then computed and grouped into ranges.

Musixmatch Internal Dataset B

This dataset contains 1,855 songs, labeled as fully instrumental by various types of Musixmatch users. The system processed all the songs and predicted the correct discrete label in 98.5% of the cases. I manually checked the 1.5% of incorrect predictions and discovered that approximately half of them are lyricless songs.

The term lyricless refers to a song without lyrics in which the human voice performs vocalizations or other non-verbal sounds. At Musixmatch, these songs should be treated as instrumental tracks, but the system is not able to recognize them as such, because it is, at its core, a detector of the human voice.

A system that continuously improves

As this article has shown, Musixmatch develops and maintains complex systems, precisely because its algorithms must scale to a very large catalog of songs.

For this reason, feedback is extremely important to us, as it contributes to building higher-quality products. Our colleagues on the Studio team provide periodic reports on the results of these systems. In this way, it is possible to comprehensively assess the quality of the AI's work while continuing to develop the product, reducing mistakes and improving performance.

Conversely, in most cases AI has proved useful for correcting or highlighting human errors, and this makes it a tool worth embedding within our products.

In conclusion, at Musixmatch humans and intelligent systems continually correct one another, leading to continuous overall improvement. As a consequence, Musixmatch asserts itself as an organization at the forefront of the challenges of this century.
