“Oh MAMMAMIA”, I’ve used AI to guess the next Sanremo winner!

Arianna Severini Perla · Musixmatch Blog · Feb 4, 2022 · 16 min read

Looking back, I am struck by a vivid memory that never fades: a little girl taking pen to paper, feverishly writing down the lyrics of a song playing on the radio.

That girl was me, around 2004, when I was 8.

Since then, I cannot imagine a single day without music and lyrics.

My passion for Linguistics is more recent but no less important. I started studying Applied Linguistics two years ago, and when my academic journey entered the home stretch, my fascination with music became more powerful than ever.

This is when I discovered Musixmatch, the world’s leading music data company, whose mission is to provide data, tools and services that enrich the music experience across the world. It is the largest lyrics platform, with 80 million users and 8 million lyrics.

An undeniable strength of the company is its AI platform. Trained with the contribution of millions of passionate music lovers, it makes it possible to dig deep into the meaning of lyrics and song metadata. Musixmatch AI reveals the structure and meaning of lyrics by offering powerful machine learning models and natural language processing techniques built on the largest lyrics dataset.

I started my internship at Musixmatch six months ago with this in mind: analyzing the content of song lyrics using state-of-the-art topic modeling algorithms.

My Master’s thesis therefore represents the ideal meeting point between Music and Linguistics.

Introduction

The way people consume music has changed considerably over the last decade. Huge collections of songs make it extremely difficult for users to get an overview of the vast offer, but, on the other hand, these large collections lead to innovative ways of exploring and discovering music in accordance with users’ taste. Song lyrics play an important role in the perception of music; in some genres, lyrics even claim a central role, as those genres are defined by a specific use of words and thematic content. I am an avid fan of rock music, and one of the main reasons I love my favorite rock bands lies in the lyrics themselves. Although some research has been carried out, in commercial platforms the influence of lyrical content is minimal, and catalogues are rarely provided with functions that allow searching for songs on the basis of something other than title, artist or genre. To fill this gap, and with the intention of enhancing the music experience, the focus of my research is the thematic content of lyrics.

My research stems from the following question: can unsupervised machine learning algorithms recognize topics in song lyrics that are interpretable for humans?

Clearly, a system able to assign topics to songs could serve several innovative purposes, such as enhancing music recommendation systems by incorporating lyrical features, automatically generating content-based playlists, or filtering songs by topic rather than by mood, genre or artist.

Topic Modeling

While digital collections of songs keep growing, we, as humans, do not have the capacity to manually annotate all of this music. For this reason, researchers have developed algorithms for discovering the themes that run through large collections of documents. These algorithms do not require any prior labeling of the documents, since the topics emerge from the analysis itself. The goal of topic modeling is therefore to automatically discover the topics in a corpus.

BERTopic

In my research, I decided to use BERTopic, one of the most powerful state-of-the-art topic modeling algorithms. Written by Maarten Grootendorst in 2020 and continuously updated, BERTopic is a topic modeling technique that leverages BERT embeddings and c-TF-IDF to create dense clusters, allowing for easily interpretable topics while keeping important words in the topic descriptions.

First, BERTopic embeds the documents using Sentence Transformers and supports different languages by providing multilingual embedding models. The second step is clustering: BERTopic reduces the dimensionality of the embeddings with UMAP and clusters semantically similar documents with a density-based clustering algorithm, HDBSCAN. As a last step, BERTopic creates topic representations from the clusters: it extracts and reduces topics with c-TF-IDF, which makes it possible to compare the importance of words across clusters by combining how frequent a word is within a cluster with a measure of how prevalent that word is in the entire corpus. A minimal sketch of this pipeline follows.
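To make the pipeline concrete, here is a minimal sketch of how such a model can be built with the bertopic package. The parameter values and the load_italian_stanzas() loader are illustrative assumptions, not the exact setup used in the thesis (that is described in the Parameterization section below).

```python
# Minimal BERTopic sketch: multilingual embeddings in, topics out.
# UMAP dimensionality reduction, HDBSCAN clustering and c-TF-IDF topic
# extraction are all handled internally by BERTopic.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

topic_model = BERTopic(
    embedding_model=embedding_model,
    min_topic_size=15,             # illustrative value, tuned later
    calculate_probabilities=True,
    verbose=True,
)

docs = load_italian_stanzas()      # hypothetical loader returning a list of stanza strings
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())   # overview of the discovered topics
print(topic_model.get_topic(0)[:10])         # top 10 words of topic 0
```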

BERTopic LDAvis Topic Clusters Visualization

Dataset

The datasets used for this research are the following two:

  • A subset of the Internal Musixmatch Lyrics Dataset (filtered on Italian lyrics) for training: 12,226 lyrics
  • Sanremo dataset for testing: 1,234 lyrics

System Overview

Topic modeling pipeline: overview

With the potential applications of BERTopic clear, I focused on the following steps, depicted in the block diagram above:

Pre-processing

As suggested by the developer of BERTopic, it is not strictly necessary to preprocess the data, since keeping the original structure of the text is particularly important for transformer-based models to understand the context. Without any preprocessing, however, the output was noisy and not particularly clear, so we carried out the following pre-processing operations (a minimal sketch of this step follows the list):

  • stopwords removal: stopwords are the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text.
  • POS tagging: part-of-speech tagging (POS tagging) is the process of marking up words in a corpus as corresponding to a particular part of speech, such as nouns, verbs, adjectives and adverbs (prepositions, articles, etc. are not important for the topics).
  • lemmatization: the process of grouping together the inflected forms of a word so that they can be analyzed as a single item, identified by the word’s lemma or dictionary form. This makes it possible to treat, for instance, the singular and plural forms of the same word in the same way.
  • interjections removal: an interjection is a word or expression conveying a spontaneous feeling or reaction; interjections encompass many different kinds of expressions, such as exclamations (ouch!, wow!), curses (damn!), greetings (hey, bye), etc.
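The post does not name the NLP toolkit used for these steps, so the sketch below shows just one possible way to implement them, using spaCy’s Italian pipeline (it_core_news_sm); treat it as an assumption rather than the thesis code.

```python
# One possible implementation of the four pre-processing steps with spaCy's
# Italian pipeline (requires: pip install spacy && python -m spacy download it_core_news_sm).
import spacy

nlp = spacy.load("it_core_news_sm")

# Content-word POS tags to keep; everything else (including interjections, INTJ) is dropped.
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}

def preprocess(stanza: str) -> str:
    doc = nlp(stanza)
    lemmas = [
        token.lemma_.lower()
        for token in doc
        if token.pos_ in CONTENT_POS   # POS filtering (also removes interjections)
        and not token.is_stop          # stopword removal
        and token.is_alpha             # skip punctuation and numbers
    ]
    return " ".join(lemmas)            # lemmatized, content-only stanza

print(preprocess("Ehi, son rose rosse e parlano d'amor!"))
# roughly -> "rosa rosso parlare amore" (exact lemmas depend on the spaCy model)
```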

Embedding model

BERTopic supports different embedding models that can be used to embed the documents. Since the language of the corpus was Italian, we considered multilingual models that cover it. The power of multilingual models lies in their ability to embed words having the same meaning in different languages so that they are close in the N-dimensional space; this also allows the model to handle the occasional words from other languages that appear inside Italian lyrics. The three embedding models tested in this research are the following (a small sanity-check sketch follows the list):

  • distiluse-base-multilingual-cased-v2
  • paraphrase-multilingual-MiniLM-L12-v2
  • paraphrase-multilingual-mpnet-base-v2
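As a quick sanity check of the cross-lingual claim above, the sketch below (illustrative, not part of the thesis pipeline) encodes an Italian sentence, an English paraphrase and an unrelated Italian sentence with one of these models and compares cosine similarities.

```python
# Sentences with the same meaning in different languages should end up close
# in the shared embedding space of a multilingual model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "Balliamo tutta la notte sotto le stelle",   # Italian
    "We dance all night under the stars",        # English paraphrase
    "Ho perso le chiavi di casa",                # unrelated Italian sentence
]
emb = model.encode(sentences, convert_to_tensor=True)

print(util.cos_sim(emb[0], emb[1]).item())  # expected: high similarity
print(util.cos_sim(emb[0], emb[2]).item())  # expected: noticeably lower
```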

Full lyrics or dataset split into stanzas

We decided to train the model on the split dataset, which means that the original dataset was split into stanzas corresponding to the various sections of song lyrics, such as chorus, verse, etc. All lyrics were split into stanzas, while keeping in mind that it is rather common in song lyrics for one sentence to be split across two different verses. The decision to split the dataset was also related to a preliminary assumption: although a song may have one main topic, it is quite rare that its lyrics deal with just one topic. Training the model on the split dataset meant obtaining more than one topic per song in the subsequent prediction phase (a sketch of the splitting step follows the counts below).

  • Lyrics dataset: 12,226
  • Lyrics dataset split into stanzas: 82,954
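The internal dataset format is not described in the post; assuming each lyric is plain text with blank lines between sections, the splitting step could look like this sketch.

```python
# Split a lyric into stanzas, assuming stanzas are separated by blank lines.
import re

def split_into_stanzas(lyric: str) -> list[str]:
    """Return the non-empty stanzas of a lyric."""
    parts = re.split(r"\n\s*\n", lyric.strip())
    return [part.strip() for part in parts if part.strip()]

lyric = "Grazie dei fior\nFra tutti gli altri li ho riconosciuti\n\nSon rose rosse e parlano d'amor"
print(split_into_stanzas(lyric))
# -> ['Grazie dei fior\nFra tutti gli altri li ho riconosciuti',
#     "Son rose rosse e parlano d'amor"]
```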

Parameterization

After reading the documentation of UMAP and HDBSCAN, and thanks to the suggestions provided by the developer of BERTopic, during parameterization I iteratively selected different possible values for the following parameters:

  • embedding model (BERTopic)
  • min_topic_size (BERTopic)
  • nr_topics (BERTopic)
  • min_cluster_size (HDBSCAN)
  • min_samples (HDBSCAN)
  • cluster_selection_method (HDBSCAN)

HDBSCAN has a large number of parameters that can be set on initialization, but not all of them had a significant practical effect on clustering in the present research. The parameterization phase focused on the following three parameters (a configuration sketch follows the list):

  • min_cluster_size. This is a relatively intuitive parameter to select. Increasing min_cluster_size reduces the number of clusters, merging some of them together, as a result of HDBSCAN re-optimizing which flat clustering provides greater stability under a slightly different notion of what constitutes a cluster.
  • min_samples. The implementation defaults this value (if it is unspecified) to whatever min_cluster_size is set to. Since min_samples has a dramatic effect on clustering, the question becomes: how should this parameter be selected? The simplest intuition for what min_samples does is that it provides a measure of how conservative you want the clustering to be: the larger the value of min_samples, the more conservative the clustering, with more points declared as noise and clusters restricted to progressively denser areas.
  • cluster_selection_method. HDBSCAN supports an extra parameter, cluster_selection_method, to determine how it selects flat clusters from the cluster tree hierarchy. The default method is ‘eom’ (Excess of Mass), but this is not always the best approach: if you are more interested in having small homogeneous clusters, Excess of Mass tends to pick one or two large clusters plus a number of small extra clusters. Rather than re-clustering the data in the single large cluster, a better option is to select ‘leaf’ as the cluster selection method, which selects leaf nodes from the tree and produces many small homogeneous clusters.
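In practice, these HDBSCAN settings can be passed to BERTopic as a custom clustering model; the values in the sketch below are illustrative, since the actual ones were selected iteratively as described in the next section.

```python
# Plugging a custom HDBSCAN configuration into BERTopic (illustrative values).
from bertopic import BERTopic
from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(
    min_cluster_size=50,              # larger -> fewer, bigger clusters
    min_samples=10,                   # larger -> more conservative, more points labeled as noise
    cluster_selection_method="leaf",  # "eom" (default) or "leaf" for many small homogeneous clusters
    metric="euclidean",
    prediction_data=True,             # needed later to assign topics to unseen documents
)

topic_model = BERTopic(
    embedding_model="paraphrase-multilingual-MiniLM-L12-v2",
    hdbscan_model=hdbscan_model,
    nr_topics=None,                   # an int or "auto" can be used to reduce the number of topics
)
```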

Experiments

During the experiments I started from the default values of all the mentioned parameters (both BERTopic and HDBSCAN parameters) and iteratively adjusted them in accordance with the properties of the dataset and the practical goal of the research, testing all the possible combinations of parameters in order to find a good BERTopic model (a sketch of this sweep follows).
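A sketch of that sweep is shown below; the grid values are illustrative, stanzas stands for the list of training stanzas, and inspect_topics is a hypothetical placeholder for the manual inspection and scoring of each resulting model.

```python
# Exhaustive sweep over the parameter grid described above (illustrative values).
from itertools import product

from bertopic import BERTopic
from hdbscan import HDBSCAN

embedding_models = [
    "distiluse-base-multilingual-cased-v2",
    "paraphrase-multilingual-MiniLM-L12-v2",
    "paraphrase-multilingual-mpnet-base-v2",
]
min_cluster_sizes = [30, 50, 100]
min_samples_values = [5, 10, 20]
selection_methods = ["eom", "leaf"]

for emb, mcs, ms, method in product(
    embedding_models, min_cluster_sizes, min_samples_values, selection_methods
):
    hdbscan_model = HDBSCAN(
        min_cluster_size=mcs,
        min_samples=ms,
        cluster_selection_method=method,
        metric="euclidean",
        prediction_data=True,
    )
    topic_model = BERTopic(embedding_model=emb, hdbscan_model=hdbscan_model)
    topics, _ = topic_model.fit_transform(stanzas)       # stanzas: the 82,954 training stanzas
    inspect_topics(topic_model, emb, mcs, ms, method)    # hypothetical: inspect and score the topics
```

Since every combination re-embeds the whole corpus, it is worth pre-computing the embeddings once per embedding model and passing them to fit_transform through its embeddings argument, so that only the clustering step is repeated.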

Evaluation

To apply topic modeling to real-world problems, some form of evaluation is required. Since topic modeling is an unsupervised technique, research in this area has used a variety of methods to evaluate the quality of a topic model, depending on the application. It is possible to distinguish between quantitative and qualitative evaluations. The most common quantitative measures evaluate the model on unseen documents: a better model assigns a higher probability to the held-out documents.

A second category focuses on the semantic meaningfulness of the topics, which has been measured in several ways. One way is a manual evaluation of the semantic coherence of the words with high probability in each returned topic cluster; this can be done simply by scoring the topics with a grade or by detecting intruding words in the list of most significant words per topic.

Aware that good algorithmic performance does not necessarily correspond to good topic clusters, I decided to carry out two different, although closely related, evaluations.

In order to evaluate the goodness of the topic clusters returned by the model, I manually annotated each topic cluster with a topic label such as ❤️Love, 👨‍👩‍👧‍👦Family, 💊Drugs, etc., and UNK (Unknown) for uninterpretable topics. The manual labeling also allowed us to interpret the topic clusters, each of them consisting of the top 10 content words denoting that topic. Manual labeling is a crucial part of the qualitative analysis: BERTopic does not return a label for each topic cluster but only the top 10 content words extracted from the corpus. Generally, the first word is the best label for a cluster, but this is not always the case, so for several topic clusters I assigned as a label a more general word that was not included among the top ten. The second step of this qualitative evaluation consisted in assigning one of the labels “good”, “good with intruders”, “mixed topic” and “confused topic” to each interpretable topic cluster.

As a second evaluation, we tested the model on unseen documents in order to assess its performance (a minimal sketch follows).
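A minimal sketch of this second evaluation, assuming the fitted topic_model from the sketches above and a hypothetical sanremo_stanzas list containing the test lyrics split and pre-processed in the same way as the training data:

```python
# Assign topics to unseen documents with the trained model.
topics, probs = topic_model.transform(sanremo_stanzas)

# Inspect the predicted topics through their top words.
for topic_id in sorted(set(topics)):
    if topic_id != -1:                                  # -1 is the HDBSCAN outlier/"noise" bucket
        print(topic_id, topic_model.get_topic(topic_id)[:5])
```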

Dynamic Topic Modeling on Sanremo Lyrics

What if it were possible to guess the Sanremo winner on the basis of the lyrics topics?

Sanremo is the most popular Italian music festival; since 1951 it has hosted Italian artists presenting unreleased tracks. Looking at the participating songs of all the past Sanremo editions, our questions were: have Sanremo song topics changed over the years? Are there certain topics particularly covered during a certain decade? Can topics help guess a possible Sanremo winner? To answer these questions, we analyzed the lyrics of all the participating songs of all the past Sanremo editions, from 1951 to 2021.

The plot below gives a diachronic overview of how topics vary through the years: it shows the frequency of the most prevalent topics across the 71 Sanremo editions. Looking at the most frequent topics, we noticed several interesting aspects (a sketch of how this view is produced follows the plot).

Overall Sanremo Topics Analysis (1951–2021)

Dynamic Topic Modeling Representation
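A sketch of how such a view can be produced with BERTopic’s dynamic topic modeling is shown below. sanremo_docs and years are hypothetical parallel lists (one lyric and its edition year per entry), and the call matches the topics_over_time signature of BERTopic versions from around the time of this post; newer releases no longer take the topics argument explicitly.

```python
# Dynamic topic modeling: topic frequencies per Sanremo edition year.
topics, probs = topic_model.transform(sanremo_docs)     # assign a topic to each lyric

topics_over_time = topic_model.topics_over_time(
    sanremo_docs,
    topics,
    years,            # e.g. [1951, 1951, ..., 2021], one edition year per lyric
    nr_bins=71,       # one bin per edition, 1951-2021
)

fig = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)
fig.show()
```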

🎵Music and ❤️Love are the most frequent and evergreen topics, in accordance with our expectations, as we can see from the plot below. As far as 🎵Music is concerned (orange), it registered an increase in the 80s and a significant increase in the early 2000s, and from 2000 to 2021 it remained a prominently covered topic. Something similar can be said for ❤️Love (light blue): an evergreen topic which showed a significant increase in the mid-90s.

Music and Love Topics evolution over time (1951–2021)

One topic related to ❤️Love is 😘Kiss (plot below): its frequency is lower than that of ❤️Love throughout the period considered. Its lower frequency until the 1980s might be traced back to the fact that songs used to be less explicit, not only in terms of profanities but also in terms of explicit expressions of love such as kissing.

Kiss Topic evolution over time (1951–2021)

Some topics drew our attention more than others. One topic I find particularly interesting is 🙏Religion, which stays under the radar until the 1990s and then explodes around 2012. Something similar can be noticed for 💣War, which is not intended (or at least almost never) as armed conflict but in a figurative sense, as an inner struggle of the soul. 💣War, almost never dealt with until 2009, shows peaks between 2011 and 2019: we assumed that this curious increase may be attributed to the explosion of some alternative genres that make it a central topic in their lyrics. In the last few years, in fact, song lyrics that are not afraid of giving voice to difficulties, a sense of discomfort and inner struggles have appeared in the Italian music scene.

Religion and War Topics evolution over time (1951–2021)

The opposite can be said for the topic 💐Flowers: it was particularly addressed in the first editions of the festival but has slightly declined since the 80s. Just to give an example, the earliest lyrics in the dataset show the great attention paid to flowers in Italian songs between the 50s and 60s:

“Grazie dei fior

Fra tutti gli altri li ho riconosciuti

Mi han fatto male, eppure li ho graditi

Son rose rosse e parlano d’amor”

(“Thank you for the flowers / Among all the others I recognized them / They hurt me, and yet I welcomed them / They are red roses and they speak of love”)

(Grazie dei Fior — Nilla Pizzi)

Other frequent topics such as ❄️Winter, 🌊Sea, 🌃Night and 🌌Night Sky appear at a rather constant rate throughout the years. My assumption is that this can be explained by the typical use in Italian songs of metaphors and allegorical images in which allusions to daylight, to the night and to natural phenomena are ever-present: just think of how many times our feelings are described in lyrics through these elements.

2011–2021

In order to narrow our diachronic analysis, we carried out two further analyses: in the first, we focused on the latest editions of Sanremo, that is, songs historically closer to 2021; in the second, we focused only on the podium of the various editions, that is, all the songs that reached the podium from 1951 to 2021 (a filtering sketch follows).
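Both restrictions are simple filters over the Sanremo metadata; the sketch below assumes a hypothetical pandas DataFrame with year, rank and lyrics columns, which is not how the data is necessarily stored internally.

```python
# Restricting the diachronic analysis (hypothetical file name and column names).
import pandas as pd

sanremo_df = pd.read_csv("sanremo_lyrics.csv")   # hypothetical export of the Sanremo dataset

# 1) Most recent editions only (2011-2021).
recent_df = sanremo_df[sanremo_df["year"].between(2011, 2021)]

# 2) Podium only: songs that finished 1st, 2nd or 3rd in any edition since 1951.
podium_df = sanremo_df[sanremo_df["rank"] <= 3]
```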

Dynamic Topic Modeling Representation (2011–2021)

As far as the analysis of the latest editions is concerned, we would have expected more substantial changes in topics. However, 🎵Music and ❤️Love remain the evergreens of Italian music and the most frequent topics of the last decade, as we can see from the plot below.

Music and Love Topics evolution over time (2011–2021)

Besides these, we find ❄️Winter, 🌃Night, 🌌Night Sky and 🌊Sea, which are, as mentioned above, typically associated with feelings of the soul. One topic that caught our attention is, again, 💣War, exploding between 2017 and 2018 and then starting a slight decline: a reasonable explanation calls into play the musical genres that have imposed themselves on the Italian music scene in the last few years, namely Rap/Trap and other alternative genres.

War Topic evolution over time (2011–2021)

From 2017 we can also observe in the plot below an increase in the topic 💐Flowers, which encompasses references to nature and to the environment in general. My assumption is that this could be explained by the greater sensitivity to nature and its elements registered in the last few years.

Flowers Topic evolution over time (2011–2021)

The winners over the years

Since our intention was to predict, as a fun game, a possible winner of Sanremo 2022 based on how topics evolve over time, we further narrowed our research field by filtering the dataset down to the 3 finalists of each edition.

Dynamic Topic Modeling Representation (podium of all the past editions)

Even in this case, we noticed a certain originality in Italian songs: the most frequent topic is, as expected, ❤️Love. Surprisingly, as we can see from the plot above, and more closely in the plot below, 🎵Music is not one of the most relevant topics among songs that reached the podium.

Love and Music Topics evolution over time (podium of all the past editions)

Two relevant considerations concern the two topics that, as mentioned above, have shown an increase in frequency in the last few years: 💣War and 🙏Religion. The topic 💣War started imposing itself on the podium from 2012, having been absent up to that year. The topic 🙏Religion, practically absent until the 1970s, registered a relevant increase between 1970 and 1990 and then, although not steadily, between 2009 and 2021, as we can see in the plot below.

Religion Topic evolution over time (podium of all the past editions)

One possible explanation is that, over the years, many topics once considered taboo or simply easy to misunderstand on a stage like Sanremo have started to be legitimized in recent editions. We hope this trend continues and that, in future editions of Sanremo, topics such as ❤️Love and 🎵Music will give way to something different, such as… We’ll see!

2022

As a very last analysis: all the lyrics of the Sanremo 2022 songs are now available, so we decided to add them to the dataset. The idea was the same: use the “Topics over Time” plot to take a look at the most frequent topics in the lyrics and try to predict a possible winning song on the basis of its topics (a sketch of this update step follows).
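A sketch of this last step, under the same assumptions as before (hypothetical lyrics_2022 list, sanremo_df DataFrame and fitted topic_model, with the same caveat about the topics_over_time signature across BERTopic versions):

```python
# Append the Sanremo 2022 lyrics and recompute the topics-over-time view.
import pandas as pd

rows_2022 = pd.DataFrame({"year": 2022, "rank": None, "lyrics": lyrics_2022})
sanremo_df = pd.concat([sanremo_df, rows_2022], ignore_index=True)

docs = sanremo_df["lyrics"].tolist()
years = sanremo_df["year"].tolist()

topics, _ = topic_model.transform(docs)
topics_over_time = topic_model.topics_over_time(docs, topics, years, nr_bins=72)  # 72 editions, 1951-2022
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10).show()
```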

❤️Love and 🎵Music, as expected, are confirmed as particularly frequent topics in Italian songs, but with opposite trends: ❤️Love, as shown in the plot below, registers an increase, while 🎵Music shows a slight decrease, corroborating the trend described just above.

Love and Music Topics (2022)

In my opinion, three topics deserve particular attention: ❄️Winter, 🌧Rain and 🕺Dance. ❄️Winter and 🌧Rain are two common topics in Italian lyrics, as seen across all the past editions. However, as we can observe in the plot below, a curious increase has been registered since last year.

Winter and Rain Topics (2022)

🌧Rain is typically used in song lyrics with a metaphorical meaning, usually related to moods such as sadness and melancholy. The season of ❄️Winter is traditionally, and again metaphorically, associated with a time of waiting, a sort of lethargy.

Could the pandemic that has been keeping us company for two years now, and the mood that comes with it that we are all experiencing, be a reasonable explanation for this trend?

Another topic that deserves special consideration is, in my opinion, 🕺Dance. This topic registered a peak in the early 2000s, as expected, but then started decreasing from the mid-2000s. From 2021, 🕺Dance starts rising again.

Dance Topic (2022)

In connection with what has just been said, maybe this year more than ever we feel the need for dance music as a vaccine against sadness and isolation.

Well, then, my dear Dargen D’Amico: We really hope you make it to the podium!

“Quindi dove andiamo?

Dove si balla

Fottitene e balla

Tra i rottami

Balla per restare a galla

Negli incubi mediterranei”

(“So where are we going? / Where people dance / Screw it and dance / Among the wreckage / Dance to stay afloat / In Mediterranean nightmares”)

(Dove si balla — Dargen D’Amico)

Conclusions

The idea of guessing a potential winner on the basis of topics has to be taken for what it is: a sort of fun game that leverages an innovative and interesting topic modeling technique, in particular its dynamic topic modeling function. In addition, we are well aware that there are clearly other relevant factors, some of them subjective, which have not been modeled here and which significantly influence whether a song reaches the podium. No direct correlation between topics and the popularity of a song is claimed, and that specific analysis was not carried out in this work. Clearly, there is no direct correlation between topics and the beauty of a song either: topics cannot determine the beauty and profoundness of a song, as many other factors play an important role here.

For the time being, we just have to wait.

And the winner of Sanremo 2022 is …
