How can NLP help French learners find songs that match their current level?

Published in Geek Culture · 12 min read · Jun 20, 2022

I have taught French to many students over the years and have been in the shoes of a learner many times myself. This long experience of language learning has taught me one thing: when you’re a beginner, the best way to pick up vocabulary and a sense of the prosody of the language is through songs. However, whenever I explained this to my students, I could see that they expected me to give them a selection of songs. Yet my tastes aren’t theirs, and I would find myself offering them songs that matched neither their level nor their taste.

After reading the series of articles written by Frank Andrade, who used data science tools to classify more than 3000 films and TV shows according to their language level, I thought I would try to do the same with French songs.

The aim of this analysis is therefore to classify songs according to their CEFR level. But how do I evaluate the level of a text? There are many ways to approach this problem (grammar, length, readability, pronunciation, …). As I see this as a tool to develop language comprehension and improve vocabulary, I will focus on the lexicon used. In this article, I will show you how I worked on this challenge before revealing the most appropriate songs per level. If you’re only interested in the song list, you can head directly to the Results section.

What information do we need?

So I have to go looking for the data, but what information is essential? At the very least, I need the text, the artist and the title of each song. To this, I would also like to add the genre, as learners don’t realise the actual diversity of the French music scene, believing that it’s relatively small and limited to the artists that cross the borders of French-speaking countries. Generally, the learners are only presented with pop songs or older songs from famous singers like Edith Piaf. The fact that you’re learning French shouldn’t force you to listen to pop songs for hours if you like metal. In addition to all these elements, I also need a list of words that are classified according to their level (i.e. A1 to C2).

Getting the Data

For the first part, I collected all the songs categorised as ‘French’ on the Lyricstranslate website. I realise that this introduces a significant bias into my corpus: these songs have already been translated into other languages by the site’s users, which implies that they are already popular with non-French speakers. Despite this, I still favoured this source because it provided an essential piece of information: the popularity of each song. Indeed, when sorting the songs, you can choose to order the list by popularity. A song’s index in my corpus therefore serves as an indicator of its popularity.

After collecting the songs, I was still missing the genres. To fill this gap, I created several lists: rock, pop, chanson francaise, metal, electro, folk/children’s music, and film music. To each list, I added the artists corresponding to that genre.

Finally, all I needed was a list of words classified by level. However, unlike English, where it’s quite easy to find a vocabulary list classified by levels from A1 to C1, I couldn’t find one for French and had to create it. To do this, I used two lists found on the internet that give an overview of the vocabulary a learner of French should know. The first is from the French Ministry of National Education and Youth and contains the 1500 most frequent words in French. The other is from Memrise and contains the 5000 most used words (according to the creator of the list, so it has to be taken with caution). I will explain below how I used those two lists to classify the words by level.

So I started my work with a list of 5000 songs, several lists of artists classified according to their genre and a list of over 6000 French words ranked according to their frequency, each in its own CSV file.

Data wrangling

The first step is to load the song lyrics CSV file into a DataFrame, check its structure with df.info() and preview it with df.head(). Here are the results:

I notice that I’m missing some lyrics and many album titles. I also have a column ‘Unnamed: 0’ that I will get rid of. Looking more specifically at the text of the songs, I can see that some elements have to disappear before the text can be lemmatised correctly (HTML remnants, line breaks, punctuation, …). I used the replace method with regular expression patterns to do so. For example, <.*?> is a regex pattern that deletes leftover HTML elements like <div> or <em>; the non-greedy ? stops it from also swallowing the text between two tags on the same line. Here is the complete code:
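A minimal sketch of such a cleaning step could look like the following. The patterns and the function name clean_lyrics are assumptions made for illustration, not necessarily the ones used on the real corpus; apostrophes and hyphens are deliberately kept because French elisions (‘j’ai’) and compounds rely on them.

```python
import re

# Illustrative patterns (assumptions, not the article's exact ones).
HTML_TAG = re.compile(r"<.*?>")        # non-greedy: matches one tag at a time
PUNCTUATION = re.compile(r"[^\w\s'’-]")  # everything except letters, digits, spaces, ' ’ -
WHITESPACE = re.compile(r"\s+")        # line breaks, tabs and repeated spaces


def clean_lyrics(text: str) -> str:
    """Strip HTML remnants, punctuation and line breaks from one song text."""
    text = HTML_TAG.sub(" ", text)
    text = PUNCTUATION.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


raw = "<div>Quand il me prend dans ses bras,<br/>\nil me parle tout bas...</div>"
print(clean_lyrics(raw))
```

On a DataFrame, the same function could be applied with something like df["lyrics"] = df["lyrics"].map(clean_lyrics), where the column name lyrics is hypothetical, and the unwanted column removed with df.drop(columns="Unnamed: 0").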

I had to add the 178 missing songs manually. However, I limited myself to the songs that were in the top 300, since I was mainly interested in the popular ones. These were mostly songs by Stromae, Gims and Jacques Brel. I also added the genre corresponding to each artist.

Looking at some of the songs, I also quickly noticed that some of them weren’t in French. This isn’t a serious problem, since they would simply end up at the bottom of the ranking, but I still wanted to see if a Python library could tell me the language of a text. I was happy to discover the langdetect library, which does just this. Using langdetect, I created a function that returns True if the song is in French and False if it’s in another language. I store its result in a new column of my df, then remove all rows whose value in this French column is False.
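The language filter can be sketched roughly as below. This is an illustration under assumptions rather than the original function: the name is_french and the optional detect parameter are hypothetical (the parameter only exists so the helper can be tried with another detector), and the langdetect import is done lazily inside the function.

```python
from typing import Callable, Optional


def is_french(text: str, detect: Optional[Callable[[str], str]] = None) -> bool:
    """Return True when the detected language code of `text` is 'fr'."""
    if detect is None:
        # Lazy import: langdetect is only needed when no detector is supplied.
        from langdetect import DetectorFactory, detect

        DetectorFactory.seed = 0  # makes langdetect's results reproducible
    try:
        return detect(text) == "fr"
    except Exception:
        # langdetect raises an exception on empty or feature-less text;
        # such songs are treated as "not French" here (an assumption).
        return False
```

With a DataFrame, the flag could then be stored and used as a filter, for instance df["french"] = df["lyrics"].map(is_french) followed by df = df[df["french"]] (column names are hypothetical). Language detection on very short texts can be unreliable, so a handful of misclassified songs is to be expected.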

Before moving on to lemmatisation, I also wanted to create a final column that contained the number of words in each song. This would be useful to better understand my data and to get the percentage of words in each level.
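Counting words per song can be done in one line with pandas string methods. The sketch below assumes whitespace-separated words and uses hypothetical column names (lyrics, n_words) and toy data.

```python
import pandas as pd

# Toy data; the real corpus has thousands of songs.
df = pd.DataFrame({"lyrics": ["non rien de rien", "je ne regrette rien", ""]})

# Split each text on whitespace, then count the resulting tokens.
df["n_words"] = df["lyrics"].str.split().str.len()
print(df)
```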

The next step is lemmatising my text, but what is lemmatising?

The action […] of giving (a word) the canonical neutral form that it has, for example, in a dictionary (Le Robert)
Component for assigning base forms to tokens using rules based on part-of-speech tags, or lookup tables. (spaCy)

Indeed, to avoid our classifier counting ‘bon’ (good, masculine singular), ‘bons’ (good, masculine plural) and ‘bonnes’ (good, feminine plural) as different words, it seems more coherent to keep only the neutral form. Of course, this creates some problems: a student may know the infinitive ‘vouloir’ but not its subjunctive form ‘veuille’. However, I think the benefits outweigh the drawbacks.

To lemmatise each of the song lyrics, I used spaCy’s lemmatizer, which performs well on French, and I created a function to add a new column called ‘lemmatised’ to my data frame.
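As a rough sketch of what such a function might look like (the function names, the choice of the fr_core_news_sm model and the comma-separated output format are assumptions, not the original code):

```python
def lemmatise(text, nlp):
    """Return the lemmas of `text` as one comma-separated string.

    `nlp` is any callable that turns a string into spaCy-like tokens
    exposing `lemma_`, `is_punct` and `is_space`.
    """
    doc = nlp(text)
    return ", ".join(
        token.lemma_.lower()
        for token in doc
        if not (token.is_punct or token.is_space)
    )


def load_french_pipeline(model_name="fr_core_news_sm"):
    """Load a spaCy French pipeline (the model name is an assumption)."""
    import spacy  # imported lazily so `lemmatise` can be used without spaCy

    return spacy.load(model_name)
```

Typical usage would be nlp = load_french_pipeline() followed by df["lemmatised"] = df["lyrics"].map(lambda t: lemmatise(t, nlp)), after installing the model with python -m spacy download fr_core_news_sm. The exact lemmas depend on the model and its version, so results may differ slightly from the example in this article.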

To give a concrete example, the first words of the song ‘La vie en rose’ by Edith Piaf go from “Des yeux qui font baisser les miens un rire qui se perd sur sa bouche” to “un, oeil, qui, faire, baisser, le, mien, un, rire, qui, se, perdre, sur, son, bouche”. After all this preliminary work, my data frame is ready for analysis. Here is what it looks like after this first part. We now have 4632 songs (after dropping the rows with NaN values) described by these 7 columns:

Understanding the data

Before starting the analysis of the levels per song, let’s try to understand our corpus a little better. First, we can look at the most used words in the corpus. To do this, I used a word cloud and removed the so-called stop words, i.e. words that are used very frequently but carry little meaning on their own. I used the stop-word list provided by spaCy, to which I appended a few words that ranked high but didn’t add anything (‘j’, ‘c’est’, …). I ended up with this word cloud.

We can see that words such as ‘amour’ and ‘aime’ (‘love’), ‘vie’ (‘life’), ‘temps’ (‘time’) and ‘veux’ (‘want’) are very prominent in the songs.

I then created a bar chart ranking the artists by the number of songs they have in the corpus. (Technically, Jacques Brel is in the lead with more than 100 songs, but since many of his songs weren’t ranked in the top 300, I didn’t add them to the corpus and therefore removed them from the data frame before lemmatisation.)

Then I tried to find out which music genre had the most words per song on average. Rap is in the lead by a wide margin. Indeed, if we look at the mean number of words per artist, Fonky Family has an average of 1186 words per song, followed by Kery James (1105 words) and Mafia K’1 Fry (1089 words). All three are rappers or rap groups.
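The counting behind a word cloud can be sketched with the standard library. In this illustration, STOP_WORDS is a tiny stand-in for spaCy’s French stop-word list (available as spacy.lang.fr.stop_words.STOP_WORDS), shortened so that the example runs without spaCy; the function name top_words and the toy corpus are hypothetical.

```python
from collections import Counter

# Stand-in for spaCy's French stop-word list, shortened for the example.
STOP_WORDS = {"je", "tu", "le", "la", "de", "et", "que", "un"}
# Corpus-specific additions, in the spirit of the ones mentioned above.
EXTRA_STOP_WORDS = {"j", "c'est"}


def top_words(lyrics, n=3):
    """Return the n most frequent non-stop words across all song texts."""
    stop = STOP_WORDS | EXTRA_STOP_WORDS
    counts = Counter(
        word
        for text in lyrics
        for word in text.lower().split()
        if word not in stop
    )
    return counts.most_common(n)


corpus = ["je veux la vie et l'amour", "la vie la vie", "que tu veux"]
print(top_words(corpus, n=2))
```

The resulting frequencies are the kind of input a word-cloud library would then turn into an image.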
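Averages like these come from a simple groupby. The sketch below uses toy values and hypothetical column names, and assumes one genre per song, whereas the real corpus can attach several genres to an artist.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame(
    {
        "artist": ["Artist A", "Artist A", "Artist B", "Artist C"],
        "genre": ["rap", "rap", "pop", "pop"],
        "n_words": [1000, 1200, 300, 200],
    }
)

# Mean number of words per song, by genre and by artist, largest first.
mean_by_genre = df.groupby("genre")["n_words"].mean().sort_values(ascending=False)
mean_by_artist = df.groupby("artist")["n_words"].mean().sort_values(ascending=False)
print(mean_by_genre)
print(mean_by_artist)
```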

Classification

After this first exploration, which gives us a better understanding of our dataset, we can move on to the classification part. However, we’re still missing an important element: the list of words per level. As noted earlier, I already have a list of 1500 words ranked by frequency of use, as well as another list of the 5000 most used words in French made by a Memrise user. But how can I classify them by level?

After some research on the internet, I found this table which could help me:

Source: http://polydog.org/index.php?threads/the-cefr-scale-and-language-level.26/

This is a comprehension exercise, so I’m more interested in the ‘passive vocabulary’ part, and I will limit myself to levels A1, A2, B1 and B2 since I have around 5000 words. I think it’s logical to assume that for the A1 and A2 levels, it’s better to know words that are widely used. I will therefore populate these first lists with the word frequency list from the French Ministry of Education. Then, after removing the words that are already present in that frequency list, I take the Memrise list and fill in what is missing to reach the quotas of the different CEFR levels. As the Memrise list wasn’t made by a professional, the distinction between levels B1 and B2 should be taken with a pinch of salt.
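One possible reading of this procedure is sketched below: the two frequency-ordered lists are merged without duplicates, then cut into consecutive slices. The function name, the toy words and above all the quota values are hypothetical; the real cutoffs come from the table referenced above.

```python
def build_level_lists(primary, secondary, cutoffs):
    """Split two frequency-ordered word lists into consecutive CEFR levels.

    `primary` is used first; `secondary` only contributes words not already
    seen. `cutoffs` maps each level to a cumulative vocabulary size.
    """
    seen = set()
    merged = []
    for word in list(primary) + list(secondary):
        if word not in seen:
            seen.add(word)
            merged.append(word)

    levels, start = {}, 0
    for level, cumulative_size in cutoffs.items():
        levels[level] = merged[start:cumulative_size]
        start = cumulative_size
    return levels


# Toy lists and toy cutoffs, for illustration only.
ministry = ["être", "avoir", "faire", "dire"]
memrise = ["avoir", "pouvoir", "néanmoins", "dire", "ainsi"]
cutoffs = {"A1": 2, "A2": 4, "B1": 5, "B2": 6}
print(build_level_lists(ministry, memrise, cutoffs))
```

Any words beyond the last cutoff (here ‘ainsi’) are simply left out, which mirrors the decision to stop at B2.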

So now I have four lists of words which add up to 5000 words divided between all four levels as shown below.

I can finally start working on the classification.

To do this, I’m going to use scikit-learn’s CountVectorizer, which I discovered thanks to Frank Andrade’s posts. CountVectorizer is used to ‘convert a collection of text documents to a matrix of token counts’ (scikit-learn documentation).

Once I have this matrix (26551 rows by 4633 columns!), I add the level linked to each word and get the table below. We can see, for example, that the word ‘alors’ (‘then’) appears three times in the song at index 3 and twice in the one at index 5, and is ranked at level A1.

Then I create a table grouping the sum of words for each category using df_dtm.groupby(by='level').sum(), which gives me the number of words per category for each song.

You can see in this table that the song at index 4 has 193 words from the A1 list, 17 from the A2 list, 6 from the B1 list and 4 from the B2 list. The last part consists of associating this information with that of the songs. To do this, I create a dictionary which takes the index of the song as its key and a list of the four level counts for that song as its value. The index allows me to link those pieces of information to the song that has the same index in the original df. I also divide the word count for each level by the total number of words in the song to get a percentage for each level.
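A minimal sketch of this last step is shown below, including the cumulative ‘A’ and ‘A+B’ columns. The level counts for the song at index 4 are those quoted above; the second song, the titles and the total word counts are made up for the example, and the variable names are hypothetical.

```python
import pandas as pd

# Word counts per level (rows) for each song index (columns).
per_level = pd.DataFrame(
    {4: [193, 17, 6, 4], 7: [50, 20, 20, 10]},
    index=["A1", "A2", "B1", "B2"],
)
# Song table; titles and n_words are toy values.
df = pd.DataFrame(
    {"title": ["song four", "song seven"], "n_words": [250, 200]},
    index=[4, 7],
)

# Dictionary: song index -> list of the four level counts.
level_counts = {idx: per_level[idx].tolist() for idx in per_level.columns}
shares = pd.DataFrame.from_dict(
    level_counts, orient="index", columns=["A1", "A2", "B1", "B2"]
)

# Convert counts to percentages of each song's total word count.
shares = shares.div(df["n_words"], axis=0) * 100

# Cumulative coverage: A = A1 + A2, A+B = A1 through B2.
shares["A"] = shares["A1"] + shares["A2"]
shares["A+B"] = shares["A"] + shares["B1"] + shares["B2"]

df = df.join(shares)
print(df)
```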

You can see in the code above that I have also created two new columns: one called ‘A’ which covers the lexical range for A1 and A2 and a column called ‘A+B’ which covers the vocabulary from levels A1 to B2.

Results

Finally, we can analyse the results! Firstly, I have created a graph that allows us to see the distribution of the songs according to their level coverage, either A or A+B.

In the graph below you will find the average share of words covered by the A1 to B2 lists for each music genre. Not surprisingly, poetry and chanson francaise are ranked first, while rap and rai have fewer words covered by the vocabulary lists created. For rai (an Algerian musical genre), this can be explained by the many Arabic words used in the lyrics. As for rap, it can be attributed to the use of many slang words, neologisms and words of foreign origin.

Then I created three lists of songs: one with the highest percentage of known words for A1 level, then for A2 level and finally for B2 level. To do this, I collected the songs that ranked highest in terms of the level chosen, and from A2 level onwards I also removed the songs that were already in the previous lists. I thus obtained these three lists.
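The selection logic, picking the best songs for one level and then excluding them from the following levels, might be sketched as follows. The function name, the column names and the scores are hypothetical, and which coverage column is used to rank each level is an assumption.

```python
import pandas as pd


def top_songs_per_level(df, levels, n=10):
    """For each level in order, keep the n best-covered songs not yet picked."""
    picked = set()
    result = {}
    for level in levels:
        # Ignore songs already selected for an easier level.
        candidates = df[~df.index.isin(list(picked))]
        top = candidates.sort_values(level, ascending=False).head(n)
        result[level] = top.index.tolist()
        picked.update(top.index)
    return result


# Toy coverage percentages, for illustration only.
scores = pd.DataFrame(
    {"A1": [90, 85, 60, 40], "A2": [92, 91, 80, 55], "B2": [95, 96, 93, 90]},
    index=["s1", "s2", "s3", "s4"],
)
print(top_songs_per_level(scores, ["A1", "A2", "B2"], n=1))
```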

Top 10 best songs to learn French if you’re at A1 level

‘Il faut manger’ by Manu Chao (genre = pop)
‘Qu’avons-nous fait de vous?’ by Le Roi Soleil (genre = comedie musicale/film)
‘La poupée qui fait non’ by Mylène Farmer (genre = pop, chanson francaise)
‘Elle’ by Melissa M (genre = R&B)
‘Tu me manques’ by Mia Martina (genre = pop)
‘Reste-là’ by Keren Ann (genre = pop, chanson francaise)
‘Tu te reconnaîtras’ by Anne-Marie David (genre = pop)
‘Attendre’ by Céline Dion (genre = pop, chanson francaise)
‘Que sont devenues les fleurs’ by Dalida (genre = pop)

I also created a playlist on Spotify with a selection of A1 level songs.

Top 10 best songs to learn French if you’re at A2 level

‘Je te laisserai des mots’ by Patrick Watson (genre = pop, classique)
‘Pas avec toi’ by Plastiscines (genre = pop)
‘Pour une fois’ by Marie-Mai (genre = pop)
‘L’intranquilité’ by Louise Attaque (genre = pop, rock, chanson francaise)
‘Te passe pas de moi’ by Judith (genre = pop)
‘Tu me manques’ by Sheryfa Luna (genre = pop)
‘Le temps des fleurs’ by Vicky Leandros (genre = pop)
‘Tes yeux noirs’ by Indochine (genre = pop, rock, chanson francaise)
‘Fais ce que tu voudras’ by Céline Dion (genre = pop, chanson francaise)
‘Sous le vent’ by Garou (genre = pop, chanson francaise)

Here is the Spotify playlist for level A2.

Top 10 best songs to learn French if you’re at B2 level

‘Ne partez pas sans moi’ by Céline Dion (genre = pop, chanson francaise)
‘Si la vie est cadeau’ by Corinne Hermès (genre = pop, chanson francaise)
‘Mon Dieu’ by Édith Piaf (genre = chanson francaise)
‘La taille de mon âme’ by Daniel Darc (genre = rock)
‘Maman me dit’ by Angélina (genre = pop)
‘Un ange est tombé’ by Lara Fabian (genre = pop)
‘Ce que je sais’ by Johnny Hallyday (genre = pop, rock)
‘Pourquoi tu vis et où tu vas’ by Jeanette (genre = pop)
‘Ma référence’ by Jena Lee (genre = pop)
‘Le verger aux petits’ by Bastien Lallemant (genre = chanson francaise)

And the final playlist.

If you would rather search for yourself according to your preferences, you can use the table below to browse the whole corpus.

To view this table in full-page mode, follow this link: https://datawrapper.dwcdn.net/6Q7LS/5/

I’m quite happy with the result. However, a few things could improve this list. Firstly, rating the level of a song on its lexicon alone is rather reductive. It would be interesting to add grammar points and verb tenses to the criteria used to evaluate a song’s level. Secondly, pronunciation is an important element that has been overlooked. Finally, it would ideally be possible to classify the songs by theme: if we’re working on the theme of ecology, having a few songs that match both the learner’s level and the topic being addressed would be an invaluable tool.

You can find the complete code on my GitHub: https://github.com/StMaCre/french_song_level_analysis. Thank you for reading!