Understanding what’s behind the lyrics

Natural Language Processing of song lyrics using Deep Learning

Paolo Magnani
Musixmatch Blog
13 min read · Mar 12, 2020


I was at university that day when my professor announced: "Today, Musixmatch seminar at 11 am!"
I expected some engineers to come and talk about the usual boring stuff. Instead, when I heard the Musixmatch AI team speak about "Mood recognition in song lyrics", I suddenly woke up.

Musixmatch combines two of my passions: AI and music. I became interested in what they do, and I was lucky enough to do an internship with them for my Master's Thesis.

Part 1 — A dive into the Deep Learning-based NLP world

Natural Language Processing has made great progress in the last few years. The scope of my thesis was to study how to apply Deep Learning techniques to the analysis of song lyrics.

Statistical NLP systems are still widely used today. However, recent studies have shown that Deep Learning outperforms the previous approaches on NLP tasks.

Classical NLP schema — https://s3.amazonaws.com/aylien-main/misc/blog/images/nlp-language-dependence-small.png

A classical NLP pipeline starts with a language detection step, because the following steps differ depending on the language. After detection, the corresponding preprocessing pipeline is applied, which includes tokenization, Part-Of-Speech (POS) tagging and Named Entity Recognition (NER). Human-designed features are derived from the output of these preprocessing steps. Only then can a model be built and the inference for the desired task be executed.

Deep Learning-based NLP — https://s3.amazonaws.com/aylien-main/misc/blog/images/nlp-language-dependence-small.png

Deep Learning is based on a completely different approach. After an initial preprocessing of the raw data, the input is embedded into dense vectors, which can be generated by techniques such as word2vec, GloVe and doc2vec. These vectors become the input of the neural network and feed its hidden layers, through which the network learns how to accomplish the task.
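
As an aside, here is a minimal sketch of the embedding step with word2vec, using the gensim library (my choice for the example, not something used in the pipeline described here); the toy corpus and hyperparameters are made up.

```python
# Minimal word2vec sketch with gensim (4.x API); toy corpus for illustration only.
from gensim.models import Word2Vec

corpus = [
    ["we", "are", "the", "champions", "my", "friends"],
    ["no", "time", "for", "losers"],
    ["we", "are", "the", "champions", "of", "the", "world"],
]

# Train a tiny embedding model: each word becomes a 50-dimensional dense vector.
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

vector = model.wv["champions"]                     # dense vector for one token
print(vector.shape)                                # (50,)
print(model.wv.most_similar("champions", topn=3))  # nearest tokens in the embedding space
```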

Deep Learning exploits representation learning: the system automatically discovers feature representations from raw data, without human assistance, and uses them to make predictions. Note that such a model does not even know the language of the documents, which means its architecture is language-independent.

Musixmatch NLP pipeline

Musixmatch uses a customized version of Stanford CoreNLP on lyrics.

The raw text goes through a cascaded pipeline in which each piece of metadata is extracted by a specific Annotator. First of all the lyrics are tokenized, i.e. chopped into pieces; only after tokenization can steps such as POS tagging and NER be executed.
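
Musixmatch's pipeline is a customized Java CoreNLP and is not shown here; as a rough illustration of the same tokenize → POS → NER cascade, here is a sketch using the Stanza Python library (my own choice for the example, not the tool Musixmatch uses) with its default English models.

```python
# Sketch of a tokenize -> POS -> NER cascade with Stanza (not the Musixmatch pipeline).
import stanza

stanza.download("en")  # fetch the default English models on first run
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,ner")

doc = nlp("We are the champions, my friends, and we'll keep on fighting 'til the end.")

for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.upos)   # POS tags are available only after tokenization
for entity in doc.ents:
    print(entity.text, entity.type)   # named entities found by the last annotator
```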

CoreNLP is mostly a statistical NLP system that uses Annotators to process the input text. Musixmatch devised custom annotators that use look-up vocabularies and regular expressions to handle specific cases, like compound terms such as "Lambortini", which is understood as "Lamborghini + Martini".

The Slang Annotator is an extension that Musixmatch added to the original CoreNLP. Lyrics contain many words that aren't grammatically correct but belong to the slang of the singer's language. This happens especially among rappers, who tend to have their own vocabulary. Since slang makes up a large share of the words used in lyrics, it has to be understood like regular words. The Slang Annotator normalizes words like "Rolly" to the correct form "Rolex", or "goin'" to "going". This is done thanks to a vocabulary mapping slang terms to their normalized forms, which can then be annotated as usual by the default Annotators.
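
The real annotator is a CoreNLP component backed by a curated vocabulary; the snippet below is only a hypothetical sketch of the look-up idea, with a made-up mapping table.

```python
# Hypothetical sketch of slang normalization via a look-up vocabulary (made-up entries).
SLANG_MAP = {
    "rolly": "Rolex",
    "goin'": "going",
    "lambortini": "Lamborghini Martini",
}

def normalize_slang(line: str) -> str:
    tokens = line.split()
    # Look each token up in the vocabulary; keep it unchanged if no mapping exists.
    normalized = [SLANG_MAP.get(tok.lower().strip(".,!?"), tok) for tok in tokens]
    return " ".join(normalized)

print(normalize_slang("I'm goin' out with my Rolly on"))
# -> "I'm going out with my Rolex on"
```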

Since Musixmatch is a music company, the NER Annotator, which finds named entities in lyrics, was extended to distinguish a common person from an artist. To be tagged as an artist, a person first has to be tagged as PERSON; then the CRF classifier tags that person as ARTIST, a Musixmatch-specific NER category. Higher-level steps have to wait for the previous annotations to finish, so that they can use the metadata already produced to extract their own.

CoreNLP is very accurate thanks to robust statistical classifiers such as CRFs (widely used in industry and proven to be very reliable) and to its regexes and look-up tables, which let it tag many specific cases. Unfortunately, the pipeline strongly depends on the language of the lyrics: the Annotators differ from language to language because of the differences between grammars. So before starting the pipeline, the language has to be detected and the language-specific pipeline has to be run.

The aim of the thesis was to study state-of-the-art Language Models to understand whether they could replace the statistical CoreNLP, in order to generalize the pipeline and be more robust to unknown words and everyday speech.

Part 2 — BERT for NER

After some research, I realized that pretrained Language Models were a good starting point for NLP tasks, since they represent the state of the art in almost every one of them. Since NER (Named Entity Recognition) is one of the main NLP tasks, I started with a BERT model to build a NER classifier.

BERT (Bidirectional Encoder Representations from Transformers) is a language model created by Google AI Language researchers. BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context using Transformers. The novelty is that a sequence is no longer examined only from left to right (the main limitation of standard language models) but in both directions. This innovation seems to improve the comprehension of sentence context.

Many language models are trained to guess the next word by looking only at the left context, which limits the full understanding of a sentence. BERT overcomes this problem with two training techniques:

Masked Language Model: 15% of the tokens in each sequence are replaced with masks, and the model has to guess the original words behind the masks (a small sketch follows below).

Next Sentence Prediction: given two sentences A and B, the model has to guess whether B is the sentence that follows A or not.
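
To make the Masked Language Model idea concrete, here is a tiny self-contained sketch of how roughly 15% of the tokens in a sequence can be selected and replaced with a mask (a simplification: the real BERT recipe works on subword pieces and replaces only part of the selected tokens with [MASK]).

```python
# Simplified illustration of Masked Language Model input creation.
import random

random.seed(0)

tokens = "we are the champions my friends and we will keep on fighting".split()

# Pick ~15% of the positions and replace them with a [MASK] token.
n_masked = max(1, round(0.15 * len(tokens)))
masked_positions = set(random.sample(range(len(tokens)), n_masked))

masked = ["[MASK]" if i in masked_positions else tok for i, tok in enumerate(tokens)]
print(" ".join(masked))
# The model is trained to recover the original words at the masked positions.
```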

How to build a NER classifier for lyrics

Among the available BERT models, I selected BERT-Base, Multilingual Cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters. This model is pre-trained on Wikipedia in many languages, so it can understand texts from different linguistic sources.

Schema of my usage of BERT for NER on lyrics

Song lyrics have a different format, length and use of slang compared to common texts (newspaper articles, Wikipedia, websites, ...). Here is an example:

Classical text format:

The domestic dog (Canis familiaris when considered a distinct species or Canis lupus familiaris when considered a subspecies of the wolf) is a member of the genus Canis (canines), which forms part of the wolf-like canids, and is the most widely abundant terrestrial carnivore…

Lyrics text format:

We are the champions, my friends
And we’ll keep on fighting ’til the end
We are the champions
We are the champions
No time for losers
’Cause we are the champions of the world

I’ve taken my bows
And my curtain calls
You brought me fame and fortune and everything that goes with it
I thank you all

Intuitively, these two texts are not so different from one another that a multilingual BERT couldn't understand both. Still, if we want to build a NER classifier for lyrics, it is better to have a model that is more confident on lyrics-formatted texts.

To do that, we apply so-called Domain Adaptation on lyrics. This technique is used when the source distribution (regular texts) is different from, but related to, the target distribution (lyrics). Musixmatch gave me a dataset of 200K songs in different languages, and I continued the pre-training of the model on these unlabelled lyrics to make it more sensitive to them.
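
Concretely, domain adaptation here means continuing the masked-language-model pre-training on the unlabelled lyrics. I used the example scripts shipped with the Transformers library; the sketch below reproduces the same idea with the current Trainer API, and the file "lyrics.txt" is only a placeholder for the Musixmatch dataset.

```python
# Sketch of domain adaptation: continued MLM pre-training on unlabelled lyrics.
# "lyrics.txt" (one line of lyrics per row) is a placeholder for the Musixmatch dataset.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

dataset = load_dataset("text", data_files={"train": "lyrics.txt"})["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
                      batched=True, remove_columns=["text"])

# The collator masks 15% of the tokens on the fly, as in BERT pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-lyrics", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
trainer.save_model("bert-lyrics")  # the lyrics-adapted language model
```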

After this unsupervised step I had a language model adapted to lyrics, but a NER classifier was still missing. To train the model on a downstream task such as Named Entity Recognition, fine-tuning was necessary. To fine-tune the model on NER I took the most authoritative dataset, CoNLL-2003. It contains tagged newspaper articles in English and German; I used the English corpus. The dataset contains 5 types of labels: PER (Person), LOC (Location), ORG (Organisation), MISC (Miscellaneous entities) and O (Other).
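
The fine-tuning follows the standard token-classification recipe. The sketch below approximates it with the current Transformers and Datasets APIs (the only delicate part is aligning the word-level labels with the wordpieces); it is an illustration, not the exact script I ran.

```python
# Sketch of NER fine-tuning on CoNLL-2003 (token classification).
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"
dataset = load_dataset("conll2003")
labels = dataset["train"].features["ner_tags"].feature.names   # O, B-PER, I-PER, ...

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

def tokenize_and_align(example):
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    word_ids = enc.word_ids()
    # Label only the first wordpiece of each word; ignore the others with -100.
    enc["labels"] = [example["ner_tags"][w] if w is not None and w != prev else -100
                     for prev, w in zip([None] + word_ids[:-1], word_ids)]
    return enc

encoded = dataset.map(tokenize_and_align)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbert-conll-ner", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```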

Comparing the two models on the test set, the one adapted to lyrics performed slightly worse than the original. This was expected: since BERT is trained on Wikipedia and CoNLL-2003 contains newspaper articles (very similar to Wikipedia), domain adaptation on lyrics could not help there. In any case, both models scored over 90.0 F1.

If I had had a dataset of tagged lyrics, I could have evaluated whether the adapted model is better than the original one; this could be future work.

“And on Italian corpora…?”

Part 3 — UmBERTo: an Italian Language Model trained with Whole Word Masking

Since we had already performed some experiments on Italian datasets, we asked ourselves:

“Why not generate a new Italian Language Model from scratch on Italian corpora?”

Well, we did it.

What is UmBERTo

UmBERTo is a RoBERTa-based Language Model trained on large Italian Corpora.

Marco Lodola, Monument to Umberto Eco, Alessandria 2019

It inherits from RoBERTa, Facebook's optimized version of BERT, the key hyperparameters that lead to better results.

UmBERTo uses two innovative approaches: SentencePiece and Whole Word Masking.

SentencePiece is Google's language-independent subword tokenizer and detokenizer for neural network-based text processing systems. It is an end-to-end system, so no pre-tokenization step is required. The size of the vocabulary is fixed in advance, and the approach follows the trend toward language-agnostic systems, bypassing language detection and language-dependent algorithms. This is possible thanks to lossless tokenization: the encoded text keeps all the information needed to reconstruct the decoded one, with no need for language-specific grammar rules. In practice, whitespace is replaced with the meta symbol '▁' (U+2581), so that a sentence such as "Hello world." is encoded as pieces like '▁Hello', '▁world', '.'.

This makes it possible for the decoder to get back to the original text: concatenate the pieces and turn '▁' back into spaces. Without the '▁' marker this wouldn't be possible without knowing that the language is English, because in Chinese, for example, words are written without spaces. SentencePiece implements two subword segmentation algorithms: Byte Pair Encoding (BPE) and the unigram language model.
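
This lossless behaviour is easy to check with UmBERTo's own SentencePiece tokenizer through Hugging Face (checkpoint Musixmatch/umberto-commoncrawl-cased-v1 on the model hub); the exact pieces in the comments are indicative, since they depend on the learned vocabulary.

```python
# Lossless SentencePiece tokenization, illustrated with UmBERTo's tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")

pieces = tokenizer.tokenize("Umberto Eco è stato un grande scrittore")
print(pieces)
# Something like ['▁Umberto', '▁Eco', '▁è', '▁stato', '▁un', '▁grande', '▁scrittore'];
# '▁' marks where a whitespace was, so the original string can be rebuilt exactly.

print(tokenizer.convert_tokens_to_string(pieces))
# -> 'Umberto Eco è stato un grande scrittore'
```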

Whole Word Masking (WWM) masks an entire word whenever at least one of the subword tokens produced by the SentencePiece tokenizer for that word was selected for masking. It is harder for the model to guess a whole word than a single subword, so the model becomes more robust. Suppose, for instance, that a word such as "capolavoro" is split by the tokenizer into the pieces '▁capo', 'lavo' and 'ro': standard masking might replace only 'lavo' with a mask, while WWM would mask all three pieces.
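
Put into (simplified) code, the WWM selection rule looks like this; the subword split and the selected piece are purely illustrative.

```python
# Simplified Whole Word Masking: if any piece of a word is selected, mask the whole word.

# Hypothetical SentencePiece split of "un vero capolavoro"; '▁' marks a word boundary.
pieces = ["▁un", "▁vero", "▁capo", "lavo", "ro"]

# Group piece indices into words using the '▁' marker.
words, current = [], []
for i, piece in enumerate(pieces):
    if piece.startswith("▁") and current:
        words.append(current)
        current = []
    current.append(i)
words.append(current)                      # -> [[0], [1], [2, 3, 4]]

# Suppose the random 15% selection picked only the piece 'lavo' (index 3)...
picked = {3}

# ...WWM then extends the mask to every piece of the word 'capolavoro'.
masked = {i for word in words for i in word if picked & set(word)}
print(["<mask>" if i in masked else p for i, p in enumerate(pieces)])
# -> ['▁un', '▁vero', '<mask>', '<mask>', '<mask>']
```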

Two models

We created 2 models for UmBERTo:

  1. umberto-wikipedia-uncased
  2. umberto-commoncrawl-cased

umberto-wikipedia-uncased is based on a relatively small corpus (almost 7 GB) of high quality: Italian Wikipedia articles, written in a formal style and without grammatical errors. Another characteristic of this model is that it only considers lower-case characters, so the input text has to be lower-cased before processing.

umberto-commoncrawl-cased is instead trained on an Italian subcorpus of OSCAR (Open Super-large Crawled ALMAnaCH coRpus). OSCAR is a multilingual corpus covering 166 languages, obtained by language classification and filtering of the Common Crawl corpus with the goclassy architecture. We chose the deduplicated Italian version: 69 GB, 210M sentences and 11B words. To avoid copyright problems, the sentences were shuffled. Obviously the style of this corpus is less formal than Wikipedia, since it contains sentences written by ordinary users on crawlable sites. This model is case-sensitive.

Both models have 12-layer, 768-hidden, 12-heads, 110M parameters (like RoBERTa).

Screen from UmBERTo Github README.md, https://github.com/musixmatchresearch/umberto

Training UmBERTo from scratch

Since RoBERTa is implemented in Fairseq, we used the same library to train UmBERTo.

After training the UmBERTo tokenizer (with SentencePiece) and creating the datasets in the format required by Fairseq, all the elements were in place to begin training the actual language model from scratch.

The framework allows training both from scratch and from a pre-trained model (like BERT, RoBERTa and so on). This choice was very significant for our aim: we didn't want to take an already trained model and continue training it (our purpose was to create the first Italian one!), we wanted to build it from scratch.

The two models were both trained for about 125K steps with a batch size of 2048 (thanks to 8 GPUs on Amazon SageMaker). Below I report the perplexity of the two models.

Perplexity measures the uncertainty of a language model: the lower the perplexity, the more confident the model is that a generated sentence is valid in the given language. Conceptually, perplexity represents the number of choices the model is effectively choosing from when producing the next token.
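
In other words, perplexity is just the exponential of the average cross-entropy per token. A tiny sketch with made-up token probabilities:

```python
# Perplexity = exp(average negative log-likelihood per token); probabilities are made up.
import math

token_probs = [0.25, 0.10, 0.50, 0.05]        # model's probability of each true token
nll = [-math.log(p) for p in token_probs]     # per-token cross-entropy
perplexity = math.exp(sum(nll) / len(nll))
print(round(perplexity, 2))                   # ≈ 6.32: like choosing among ~6 tokens
```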

Perplexity of umberto-wikipedia-uncased (left) and umberto-commoncrawl-cased (right) trained from scratch

Both Hugging Face and Fairseq have integrated UmBERTo into their frameworks. The snippet below shows how well UmBERTo's masking works.

Masking of UmBERTo
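
The original snippet was embedded as an image; an equivalent sketch with the Hugging Face fill-mask pipeline and the released checkpoint looks like this (the example sentence follows the one in the UmBERTo README):

```python
# Fill-mask with UmBERTo through the Hugging Face pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="Musixmatch/umberto-commoncrawl-cased-v1")

for prediction in fill_mask("Umberto Eco è <mask> un grande scrittore"):
    print(f'{prediction["token_str"]:>12}  {prediction["score"]:.2f}')
# Top candidates include 'considerato' and 'stato', as reported below.
```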

<mask> can be replaced by ‘considerato’ (with a probability of 18%), ‘stato’ (17%), and so on.

Fine-Tuning on NER

Using the Hugging Face Transformers library, I fine-tuned UmBERTo on POS tagging and NER. The schema was the same as for BERT, but without the initial domain adaptation.

Schema of my usage of UmBERTo for POS and NER on lyrics

I will focus on the NER task. I took two Italian datasets for NER fine-tuning: Wiki-NER ITA and I-CAB Evalita07.

Here is a piece of I-CAB:

And here is a piece of Wiki-NER:

The fine-tuning was successful: UmBERTo (here I report the Common Crawl version) achieved a higher F1 score than multilingual BERT on both Wiki-NER ITA and I-CAB Evalita07.

F1 score of umberto-commoncrawl-cased* and m-BERT

The table shows that a language model trained from scratch on a single language works better than a multilingual model when evaluated on that language.

Below is a snippet of the code used to predict the entities of a sentence with umberto-commoncrawl-cased:
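
That snippet too was an image; a minimal equivalent with the token-classification pipeline is sketched below, where "./umberto-ner" is a placeholder for my fine-tuned checkpoint (which was not publicly released):

```python
# Sketch of NER inference with a fine-tuned UmBERTo checkpoint ("./umberto-ner" is a placeholder).
from transformers import pipeline

ner = pipeline("ner", model="./umberto-ner", tokenizer="./umberto-ner",
               aggregation_strategy="simple")   # group wordpieces back into whole entities

for entity in ner("Umberto Eco era un professore dell'Università di Bologna"):
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))
```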

All the tokens are correctly tagged.

F1 score of umberto-commoncrawl-cased and tint on I-CAB Evalita07

Another important result is that the F1 score of UmBERTo is also significantly higher than that of Tint, which is CoreNLP-based, like the Musixmatch framework.

This shows how far Deep Learning-based NLP has surpassed statistical NLP.

Domain Adaptation on lyrics

As I did with BERT, I wanted to compare the performance of UmBERTo with that of UmBERTo adapted to lyrics.

This is also an empirical experiment, because UmBERTo was born as a general language model; it was not designed only for lyrics.

I used a script from the Transformers library for this experiment, as I did with BERT in the previous section.

Musixmatch gave me a dataset of 97K Italian lyrics. I adapted umberto-commoncrawl-cased on them, with the following results.

F1 score of umberto-commoncrawl-cased* adapted and not

As with BERT, the scores are slightly lower, but considering that neither Wiki-NER nor I-CAB is a dataset of lyrics, this is entirely expected!

Conclusions and future work

As these experiments show, Deep Learning NLP reaches better performance than statistical NLP not only in theory but also empirically.

Another takeaway is the importance of training a model from scratch when we want to use it on a single language or on a specific kind of data.

An interesting future work would be to create a dataset of tagged lyrics, fine-tune language models like UmBERTo on it, and build a classifier for a specific task, like NER, dedicated to lyrics.

I thank the I-CAB (Italian Content Annotation Bank) and EvalITA authors for kindly providing me with the dataset for this research.

I'd also like to thank the whole Musixmatch team for the opportunity and the awesome welcome. In particular, I'm very grateful to the members of the AI team (Loreto Parisi, Simone Francia and Stella Tavella) for the support I received during my internship. The results I reached are also thanks to them.
