Exploiting time-synced lyrics and vocal features for music emotion detection

A new research paper from the Musixmatch AI Team

Musixmatch Blog · 4 min read · Jan 16, 2019


A recent study confirms that music-streaming listeners are especially attuned to singing. Of the several hundred users surveyed, respondents indicated that vocals (29.7%), lyrics (55.6%), or both (16.1%) are among the most salient attributes they notice in music. The four most important "broad" content categories were found to be emotion/mood, voice, lyrics, and beat/rhythm, while the seven most important vocal semantic categories were skill, "vocal fit" (to the music), lyricism, the meaning of lyrics, authenticity, uniqueness, and vocal emotion.

The research conducted by Musixmatch's AI team focuses primarily on emotion/mood in relation to vocals and lyrics. It builds mainly on the world's largest lyrics catalog, created by Musixmatch together with its community of more than 40 million active contributors who are passionate about lyrics.
Considering how passionate users are about song lyrics (among the most searched keywords on Google), and considering the evolution of digital music streaming services and recommendation systems for playlists, radio, and discovery, Musixmatch has focused on automatically detecting the mood/sentiment of any song from its lyrics and on building a dataset that will, in turn, be made available to the music industry.

Abstract — Research Paper from Musixmatch AI Team.

Recommender systems are a popular research topic, especially in the field of music streaming services. Presenting users with music collections organized according to their feelings and tastes engages them to listen to and discover new artists and genres, bringing the listening experience to a new level. Most music recommendation systems rely on machine learning algorithms to build a more personalized experience.

Music Emotion Recognition (MER) refers to the task of finding a relationship between music and human emotions.

Audio and lyrics are the two main sources of the low- and high-level features that describe human moods and the emotions perceived while listening to music. MER therefore draws on techniques from both Natural Language Processing (NLP) and Music Information Retrieval (MIR) to analyze text and audio and identify the emotions induced by a musical excerpt.

In this paper we present the basis of all our experiments:

The Synchronised Lyrics Emotion Dataset: created through the Musixmatch community of millions of passionate music lovers who actively synchronize lyrics using the advanced sync tools built by Musixmatch.

Russell’s Valence-Arousal Space: a two-dimensional space representing mood states as continuous numerical values. The combination of valence and arousal values represents different emotions depending on their coordinates in the 2D space.
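For intuition, here is a minimal sketch of how a (valence, arousal) pair maps to a coarse emotion quadrant; the quadrant labels below are illustrative and not necessarily the label set used in the paper.

```python
# Map a (valence, arousal) pair in [-1, 1] x [-1, 1] to one of the four
# quadrants of Russell's circumplex. The labels are illustrative only.

def quadrant(valence: float, arousal: float) -> str:
    """Return a coarse emotion quadrant for valence/arousal values in [-1, 1]."""
    if valence >= 0 and arousal >= 0:
        return "happy/excited"      # high valence, high arousal
    if valence < 0 and arousal >= 0:
        return "angry/tense"        # low valence, high arousal
    if valence < 0 and arousal < 0:
        return "sad/depressed"      # low valence, low arousal
    return "calm/relaxed"           # high valence, low arousal

print(quadrant(0.7, 0.4))    # -> happy/excited
print(quadrant(-0.3, -0.6))  # -> sad/depressed
```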

We use this data to perform text-based and audio-based emotion classification, exploiting different techniques and deep learning architectures. Thanks to this new dataset, we keep track of the time synchronization between lyrics and vocals during training, so we can analyze exactly the portion of audio in which a given lyrics line is sung.

Each lyrics line is associated with a specific emotion, reflecting the fact that emotions can change over the duration of a musical composition.
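As an illustration, a per-line record and the corresponding audio slice might look like the sketch below; the schema, field names, and use of librosa are assumptions for illustration, not the dataset's actual format.

```python
# A hypothetical time-synced lyric line with an emotion label, paired with the
# audio samples over which that line is sung.
from dataclasses import dataclass

import librosa  # assumed audio-loading library, not named in the paper


@dataclass
class SyncedLine:
    start_s: float   # line start time in seconds
    end_s: float     # line end time in seconds
    text: str        # the lyric line
    emotion: str     # per-line emotion label (e.g. a valence-arousal quadrant)


def audio_segment(path: str, line: SyncedLine):
    """Return only the waveform samples over which this lyric line is sung."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    return y[int(line.start_s * sr): int(line.end_s * sr)]


line = SyncedLine(start_s=42.1, end_s=46.8, text="...", emotion="sad/depressed")
segment = audio_segment("song.mp3", line)  # placeholder file path
```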

We rely on the Amazon SageMaker machine learning pipeline, part of the AWS cloud computing platform, for both training and testing. Using SageMaker allowed us to automate and speed up the training process, letting us focus on modeling while experimenting with deep architectures.
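For readers unfamiliar with SageMaker, a training job is typically launched along these lines with the SageMaker Python SDK; the script name, IAM role, instance type, S3 path, and hyperparameters below are placeholders, not the actual configuration used in the paper.

```python
# Minimal sketch of launching a SageMaker training job (SDK v2 style).
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train_emotion_model.py",                 # hypothetical training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder IAM role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",
    py_version="py39",
    hyperparameters={"epochs": 30, "batch-size": 64},
)

# Each channel becomes a directory visible to the training script on the instance.
estimator.fit({"train": "s3://my-bucket/synced-lyrics/train"})  # placeholder S3 path
```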

Audio-based classification pipeline. In (A) the original audio (a mixture of vocals and instruments) is the input to the CNN. In (B) the original audio is first separated into a vocal signal and an instrumental signal by means of a Wave-U-Net; only the vocal signal is then given as input to the CNN.
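A rough sketch of such an audio branch, assuming log-mel spectrogram inputs and an illustrative four-class (valence-arousal quadrant) output; the layer sizes are placeholders, not the architecture from the paper.

```python
# Convert an audio signal (e.g. the separated vocals) to a log-mel spectrogram
# and classify it with a small CNN.
import numpy as np
import librosa
import tensorflow as tf


def log_mel(y, sr, n_mels=128):
    """Log-scaled mel spectrogram of a mono waveform."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)


cnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, None, 1)),       # (mel bins, frames, channel)
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalMaxPooling2D(),               # handles variable-length clips
    tf.keras.layers.Dense(4, activation="softmax"),     # e.g. four V-A quadrants
])
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
```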

We tested several state-of-the-art text classification algorithms and models (Google’s BERT, ELMo, Facebook’s fastText) against audio-based neural networks (CNNs). Our results show that the text/lyrics classifiers outperform the audio-based models. An interesting result was achieved by separating the vocals from the audio track: emotion classification based only on the vocals outperformed the training pipeline using the whole audio (vocals + instruments).
This separation was made possible by novel WaveNet-like architectures such as the Wave-U-Net, which achieve good performance in audio source separation.
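As a concrete example of the text side, a fastText lyrics-line classifier (one of the baselines named above) can be trained roughly as follows; the training file, labels, and hyperparameters are placeholders.

```python
# fastText supervised classification: one example per line in the training file,
# prefixed with "__label__<class>", e.g.
#   __label__sad  tears keep falling on my pillow
import fasttext

model = fasttext.train_supervised(
    input="lyrics_train.txt",  # placeholder path to labelled, pre-processed lines
    epoch=25,
    lr=0.5,
    wordNgrams=2,
)

labels, probs = model.predict("tears keep falling on my pillow")
print(labels[0], probs[0])
```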

Lyrics prediction task pipeline: the inputs are rows of the time-synced lyrics; after a text pre-processing and normalization phase, an embedding is calculated and used as input to a deep neural network for the prediction task.
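A minimal sketch of that pipeline, assuming a learned token embedding and an illustrative four-class output; vocabulary size and layer widths are placeholders, not the model described in the paper.

```python
# Tokenize/normalize lyric lines, embed them, and classify with a small DNN.
import tensorflow as tf

vectorize = tf.keras.layers.TextVectorization(max_tokens=20000,
                                              output_sequence_length=32)
# vectorize.adapt(training_lines)  # fit the vocabulary on the lyric lines first

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,), dtype=tf.string),
    vectorize,                                     # text pre-processing / tokenization
    tf.keras.layers.Embedding(20000, 64),          # per-token embeddings
    tf.keras.layers.GlobalAveragePooling1D(),      # one vector per lyric line
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```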

Considering the promising results achieved with the Synchronised Lyrics Emotion Dataset, as future work we aim to combine the text-based and vocals-based architectures into a multi-modal solution in order to achieve even better results.
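One simple way to realize such a multi-modal combination is late fusion: concatenating the feature vectors from the two branches before a joint classifier. The sketch below assumes illustrative feature dimensions rather than the paper's architecture.

```python
# Late-fusion sketch: concatenate lyrics and vocals features, classify jointly.
import tensorflow as tf

text_feat = tf.keras.Input(shape=(64,), name="lyrics_embedding")   # from the text branch
audio_feat = tf.keras.Input(shape=(32,), name="vocals_embedding")  # from the audio CNN

fused = tf.keras.layers.Concatenate()([text_feat, audio_feat])
x = tf.keras.layers.Dense(64, activation="relu")(fused)
out = tf.keras.layers.Dense(4, activation="softmax")(x)            # e.g. V-A quadrants

fusion_model = tf.keras.Model([text_feat, audio_feat], out)
fusion_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```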

We are confident that this is the right direction for building reliable models for automatic music emotion recognition, which could be helpful for better recommendation systems, playlist management, and music discovery.

This paper is part of Musixmatch’s continuous R&D on machine learning and text classification, as Musixmatch manages the world’s largest catalog of lyrics and licenses data and content to companies like Amazon Music, Apple, Facebook, Google, Shazam, Vevo, Saavn, and others.

The full research paper is available on arXiv.org.

Research paper published on arXiv by Loreto Parisi, Simone Francia, Maria Stella Tavella, and Silvio Olivastri.
